<a href="https://colab.research.google.com/github/atsuvovor/CyberThreat_Insight/blob/main/model_dev/lagacy_best_model_dev/lagacy_model_dev_github.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **CyberThreat-Insight**  

**Anomalous Behavior Detection in Cybersecurity Analytics using Generative AI**

**Toronto, September 08 2025**  
**Author: Atsu Vovor**
>Master of Management in Artificial Intelligence    
>Consultant Data Analytics Specialist | Machine Learning | Data Science | Quantitative Analysis | French & English Bilingual

## Model Development - Cyber Threat Detection Engine

The goal of this Model Development section is to build an effective cyber threat detection engine capable of identifying anomalous behavior in security log data. The target variable is **"Threat Level"**, classified as:  
- 0 = Low  
- 1 = Medium  
- 2 = High  
- 3 = Critical  

This section details the full implementation, evaluation, and adaptation of both supervised and unsupervised learning models for detecting multi-class cyber threat levels. We first implement a set of machine learning algorithms and select the model with the best performance. We then explore the limitations of unsupervised anomaly detection models and propose a solution that adapts these models for multi-class classification.

### Train-Test Split: Preparing for Model Evaluation

Following feature engineering, we obtained an **augmented dataset** that combines the original cyber threat data with **synthetically generated anomalies** using techniques such as:

* **Cholesky-based perturbation**
* **SMOTE (Synthetic Minority Over-sampling Technique)**
* **GANs (Generative Adversarial Networks)**

This enriched dataset offers a **balanced distribution** of threat and non-threat instances, making it more suitable for supervised machine learning.
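
To make the first technique concrete, here is a minimal sketch of one common form of Cholesky-based perturbation: Gaussian noise is shaped by the Cholesky factor of the sample covariance so that the synthetic points keep the feature correlations of the real data. The function name `cholesky_perturb`, the `scale` parameter, and the toy data are assumptions for illustration and are not taken from the project's augmentation code.

```python
import numpy as np

def cholesky_perturb(X, scale=0.1, seed=0):
    """Return a copy of X with added noise that follows X's feature correlations."""
    rng = np.random.default_rng(seed)
    cov = np.cov(X, rowvar=False)
    # A small ridge keeps the covariance positive definite for the factorization.
    L = np.linalg.cholesky(cov + 1e-8 * np.eye(X.shape[1]))
    z = rng.standard_normal(X.shape)   # independent standard normal noise
    return X + scale * z @ L.T         # z @ L.T has covariance L @ L.T, i.e. cov

# Toy data: two correlated features standing in for real log features.
rng = np.random.default_rng(42)
X_real = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)
X_synth = cholesky_perturb(X_real, scale=0.1)
```

A larger `scale` pushes the synthetic points further from the originals, which is one way to control how anomalous the generated samples look.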

### Objective

To ensure robust model evaluation, we split the **augmented dataset** into training and testing subsets:

* **Training Set (80%)**: Used to train models on both real and synthetic cyber threat patterns.
* **Testing Set (20%)**: Used to validate performance on unseen data.

We apply **stratified sampling** to maintain the class distribution across both subsets, which is critical in cybersecurity, where class imbalance (e.g., rare attacks) is a major challenge.

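```python
from sklearn.model_selection import train_test_split


def deta_splitting(X_augmented, y_augmented, p_features_engineering_columns, target_column='Threat Level'):
    """Split the augmented dataset into stratified 80/20 train and test subsets."""
    # Use every engineered column except the target as a model feature
    x_features = [col for col in p_features_engineering_columns if col != target_column]

    # Split the data into training and testing data
    X_train, X_test, y_train, y_test = train_test_split(
        X_augmented[x_features],
        y_augmented,
        test_size=0.2,
        stratify=y_augmented,  # keep the Threat Level distribution in both subsets
        random_state=42
    )

    return X_train, X_test, y_train, y_test
```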

1. **Function Purpose:** The function `deta_splitting` facilitates the splitting of a dataset into training and testing subsets for machine learning purposes.
2. **Test Size:** The `test_size=0.2` parameter ensures that 20% of the data is used for testing, while 80% is retained for training.
3. **Reproducibility:** The `random_state=42` parameter guarantees consistent results across runs by fixing the randomness in data splitting.
4. **Outputs:** The function returns four subsets:
   - `X_train` and `y_train` for training the model.
   - `X_test` and `y_test` for evaluating the model's performance.  
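
The effect of stratified sampling can be checked on toy data. The sketch below uses made-up imbalanced labels in place of the real `Threat Level` column (the class counts are assumptions) and compares the class shares in the two subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy imbalanced labels standing in for 'Threat Level' (0 = Low ... 3 = Critical).
y = np.repeat([0, 1, 2, 3], [600, 250, 100, 50])
X = rng.normal(size=(y.size, 5))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Share of each class in the two subsets; with stratify=y they should match.
train_share = np.bincount(y_train, minlength=4) / y_train.size
test_share = np.bincount(y_test, minlength=4) / y_test.size
print("train:", train_share.round(3))
print("test: ", test_share.round(3))
```

Without `stratify`, a rare class such as Critical can end up over- or under-represented in the test set purely by chance.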



## Models Implemented  


| Algorithm                   | Type           | Description                                                                                          |
|-----------------------------|----------------|------------------------------------------------------------------------------------------------------|
| **Isolation Forest**        | Unsupervised   | Anomaly detection by isolating outliers through random partitioning of data.                         |
| **One-Class SVM**           | Unsupervised   | Anomaly detection by identifying a region containing normal data points without labeled data.        |
| **Local Outlier Factor (LOF)** | Unsupervised   | Detects outliers by comparing local data density with that of neighboring points.                     |
| **DBSCAN**                  | Unsupervised   | Density-based clustering, also identifies outliers as noise.                                         |
| **Autoencoder**             | Unsupervised   | A neural network used to learn compressed representations, often for anomaly detection.              |
| **K-means Clustering**      | Unsupervised   | Clustering algorithm that partitions data into clusters without labels based on distance metrics.    |
| **Random Forest**           | Supervised     | An ensemble of decision trees used for classification or regression with labeled data.               |
| **Gradient Boosting**       | Supervised     | An ensemble method that builds sequential trees to improve prediction accuracy in classification or regression. |
| **LSTM (Long Short-Term Memory)** | Supervised/Unsupervised | Typically supervised for sequence prediction tasks, but can also be used in unsupervised anomaly detection. |

  
  

## Model Evaluation

Traditional classification metrics such as accuracy, precision, recall, F1-score, ROC-AUC, and PR-AUC assume predefined, labeled classes, and ROC-AUC and PR-AUC are most naturally defined for binary problems. Anomaly detection presents a different challenge: the goal is to identify instances that deviate significantly from the normal pattern, rather than to assign them to predefined categories.

**That said, we can adapt some of these metrics to evaluate anomaly detection models**  

#### Applicable Metrics for Anomaly Detection

1. **Precision, Recall, and F1-Score:**
   - These metrics can be calculated from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
   - However, the definition of "positive" and "negative" in anomaly detection can be ambiguous. Often, the minority class (anomalies) is considered positive.
   - It's crucial to carefully define the positive and negative classes based on the specific use case and the desired outcome.

2. **ROC-AUC and PR-AUC:**
   - **ROC-AUC:** While it's commonly used for binary classification, it can be adapted to anomaly detection by treating anomalies as the positive class. It then reads as the probability that a randomly chosen anomaly receives a higher anomaly score than a randomly chosen normal instance.
   - **PR-AUC:** This metric is particularly useful for imbalanced datasets, which is often the case in anomaly detection. It focuses on the precision-recall trade-off.

3. **Confusion Matrix:**
   - A confusion matrix can be constructed to visualize the performance of an anomaly detection model. Because anomalies are rare, the true-negative cell usually dominates, so the false-negative and false-positive cells deserve the most attention.
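
The metrics above can be computed with scikit-learn once anomalies are fixed as the positive class. This is a minimal sketch on synthetic scores; the 5% anomaly rate, the score distribution, and the 95th-percentile threshold are assumptions chosen for illustration. PR-AUC is summarized here by average precision.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)

rng = np.random.default_rng(0)
# Toy ground truth: 1 = anomaly (positive class), 0 = normal; about 5% anomalies.
y_true = (rng.random(1000) < 0.05).astype(int)
# Toy anomaly scores: anomalies tend to score higher than normal points.
scores = rng.normal(loc=y_true * 2.0, scale=1.0)

# Threshold-free metrics work directly on the raw scores.
roc_auc = roc_auc_score(y_true, scores)
pr_auc = average_precision_score(y_true, scores)

# Threshold-based metrics need hard labels; here the top 5% of scores are flagged.
y_pred = (scores >= np.quantile(scores, 0.95)).astype(int)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1, zero_division=0
)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
```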

#### **Specific Considerations for Each Model**

1. **Isolation Forest, OneClassSVM, Local Outlier Factor, DBSCAN:**
   - These models directly output anomaly scores or labels.
   - You can set a threshold to classify instances as anomalies or normal.
   - Once you have the predicted labels, you can calculate the standard metrics.

2. **Autoencoder:**
   - Autoencoders are typically used for reconstruction-based anomaly detection.
   - You can calculate the reconstruction error for each instance.
   - A higher reconstruction error often indicates an anomaly.
   - You can set a threshold on the reconstruction error to classify instances.
   - Once you have the predicted labels, you can calculate the standard metrics.

3. **LSTM:**
   - LSTMs can be used for time series anomaly detection.
   - You can train an LSTM to predict future values and calculate the prediction error.
   - A higher prediction error often indicates an anomaly.
   - You can set a threshold on the prediction error to classify instances.
   - Once you have the predicted labels, you can calculate the standard metrics.

4. **Augmented K-Means:**
   - Augmented K-Means is a clustering-based anomaly detection technique.
   - Instances that are far from cluster centers can be considered anomalies.
   - You can set a distance threshold to classify instances.
   - Once you have the predicted labels, you can calculate the standard metrics.
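
The score-then-threshold pattern shared by these models can be sketched with two of them. The example below is an illustration on synthetic data, not the project's code: the helper `flag_top_fraction`, the 4% cut-off, and the toy clusters are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 480 "normal" points in two clusters plus 20 scattered points far from both.
centers = np.array([[-3.0, 0.0], [3.0, 0.0]])
X_normal = centers[rng.integers(0, 2, size=480)] + rng.normal(size=(480, 2))
angle = rng.uniform(0.0, 2.0 * np.pi, size=20)
radius = rng.uniform(10.0, 12.0, size=20)
X_outlier = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
X = np.vstack([X_normal, X_outlier])

# Isolation Forest: score_samples is higher for normal points, so negate it.
iso = IsolationForest(random_state=0).fit(X)
iso_score = -iso.score_samples(X)

# K-Means: distance to the nearest cluster center serves as the anomaly score.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_score = km.transform(X).min(axis=1)

def flag_top_fraction(score, fraction=0.04):
    """Label the highest-scoring `fraction` of points as anomalies (1), the rest as 0."""
    return (score >= np.quantile(score, 1.0 - fraction)).astype(int)

iso_flags = flag_top_fraction(iso_score)
km_flags = flag_top_fraction(km_score)
```

The same helper applies to an autoencoder's reconstruction error or an LSTM's prediction error, since both are also "higher means more anomalous" scores.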

## What Are the Models Predicting?  

Supervised models were evaluated using classification metrics: accuracy, precision, recall, F1-score, and confusion matrices. We noticed that Random Forest and Gradient Boosting both predicted all 4 classes accurately.  
Unsupervised models were originally evaluated by converting anomaly scores into binary labels (normal vs anomaly). However, they could only output two labels (mostly class 0) and therefore failed to capture the higher threat levels (classes 2 and 3).  


### Supervised Models  

The supervised models directly predict the 'Threat Level' label and were able to classify all four
categories correctly. Their success is due to the availability of labeled training data and the ability to
learn decision boundaries across classes.

* **Objective**: Learn to predict the threat level (`Threat Level`: Class 0–3) directly from labeled training data.
* **Algorithms Used**:

  * Random Forest
  * Gradient Boosting
  * Logistic Regression
  * Stacking (Random Forest + Gradient Boosting)
* **Target**: `Threat Level` (0: Low → 3: Critical)
* **Input**: Normalized features (numeric behavioral and system indicators)

### Unsupervised Models  

Unsupervised models like Isolation Forest, One-Class SVM, LOF, and DBSCAN are designed to distinguish anomalies from normal observations but not multiclass labels. These models predict binary labels (0 or 1). Class 0 indicates normal, class 1 indicates anomaly. When mapped against the threat
levels, they mostly capture only class 0 or 1.

* **Objective**: Detect anomalies in the data **without labels**, based on distance, density, or reconstruction error.
* **Algorithms Used**:

  * Isolation Forest
  * One-Class SVM
  * Local Outlier Factor (LOF)
  * DBSCAN
  * KMeans Clustering
  * Autoencoder (Neural Network)
  * LSTM (for sequential anomaly detection)
* **Output**: Binary anomaly labels (0 = normal, 1 = anomaly), not multiclass predictions
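
In scikit-learn, the outlier detectors do not return 0/1 directly: `fit_predict` yields +1 for inliers and -1 for outliers, and DBSCAN marks noise points with the cluster id -1. A minimal sketch of the conversion to the 0 = normal, 1 = anomaly convention follows; the toy data and hyperparameters (`contamination`, `eps`, `min_samples`) are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(380, 2)),    # bulk of "normal" events
    rng.uniform(-8.0, 8.0, size=(20, 2)),   # scattered unusual events
])

# Outlier detectors: +1 = inlier, -1 = outlier.
iso_raw = IsolationForest(contamination=0.05, random_state=0).fit_predict(X)
lof_raw = LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(X)
# DBSCAN: non-negative cluster ids, with -1 marking noise points.
db_raw = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

def to_binary(raw):
    """Map -1 (outlier or noise) to 1 and every other value to 0."""
    return (np.asarray(raw) == -1).astype(int)

iso_flags, lof_flags, db_flags = (to_binary(r) for r in (iso_raw, lof_raw, db_raw))
```

Whatever the model, the result carries only two values, which is why it cannot by itself separate four threat levels.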

---

## Class Prediction Gaps in Unsupervised Models

### Observation:

All unsupervised models **fail to distinguish between threat levels (Classes 1, 2, and 3)**. Most anomaly detection models only predict **Class 0** or flag a minority of samples as "anomalies", making it difficult to classify **subtle threat patterns**.

### Why Do Unsupervised Models Predict Only Class 0 for Class 2 and 3?

Unsupervised anomaly models fail to predict higher threat levels because:
- They are not trained with class labels and cannot distinguish among multiple classes.
- Anomalies are rare, and severe anomalies (high threat) are even rarer.
- These models generalize outliers as a single anomaly class (often mapped to class 1), unable to differentiate between moderate and critical threats.


### Solution – Adaptation: Use Unsupervised Models as Feature Generators
To overcome this limitation, we adopted a hybrid strategy:

**Approach:** Generate anomaly features from each unsupervised model and include them as
additional input features in a supervised learning pipeline.  

**Implementation:** For each unsupervised model, the anomaly score or cluster assignment was extracted and added to the dataset. These enriched features were then used to train a stacked ensemble model combining Random Forest and Gradient Boosting.  

**Result:** This strategy improved the model’s ability to predict all four threat levels, especially classes 2 and 3, which were previously missed by the unsupervised models alone.


## Implementation: Stacked Supervised Model Using Anomaly Features

### 1. Feature Engineering with Unsupervised Models

**Unsupervised Models used as Feature Generators:**  

| Algorithm        | Feature Extracted               |
| ---------------- | ------------------------------- |
| Isolation Forest | Anomaly score                   |
| One-Class SVM    | Anomaly prediction              |
| LOF              | Local density deviation score   |
| DBSCAN           | Cluster membership or outlier   |
| Autoencoder      | Reconstruction error            |
| KMeans           | Cluster assignment              |
| LSTM             | Time-series anomaly probability |

These anomaly signals are treated as **auxiliary features** in the supervised pipeline.
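
One way to build such auxiliary features with scikit-learn is sketched below. The helper names `fit_anomaly_feature_models` and `add_anomaly_features`, the hyperparameters, and the random toy data are assumptions for illustration. DBSCAN is left out because it has no `predict` method for unseen rows, and the autoencoder and LSTM are left out because they need a deep learning library.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def fit_anomaly_feature_models(X_train, seed=42):
    """Fit the scaler and the unsupervised models on training data only."""
    scaler = StandardScaler().fit(X_train)
    Xs = scaler.transform(X_train)
    return {
        "scaler": scaler,
        "iso": IsolationForest(random_state=seed).fit(Xs),
        "ocsvm": OneClassSVM(nu=0.05, gamma="scale").fit(Xs),
        "lof": LocalOutlierFactor(n_neighbors=20, novelty=True).fit(Xs),
        "kmeans": KMeans(n_clusters=4, n_init=10, random_state=seed).fit(Xs),
    }

def add_anomaly_features(X, models):
    """Append five anomaly-derived columns to X."""
    Xs = models["scaler"].transform(X)
    feats = np.column_stack([
        models["iso"].score_samples(Xs),           # Isolation Forest anomaly score
        models["ocsvm"].predict(Xs),               # One-Class SVM prediction (+1 / -1)
        models["lof"].score_samples(Xs),           # LOF local density score
        models["kmeans"].predict(Xs),              # K-Means cluster assignment
        models["kmeans"].transform(Xs).min(axis=1) # distance to nearest centroid
    ])
    return np.hstack([X, feats])

# Toy usage with random features in place of the engineered dataset.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 6))
X_test = rng.normal(size=(80, 6))
models = fit_anomaly_feature_models(X_train)
X_train_aug = add_anomaly_features(X_train, models)
X_test_aug = add_anomaly_features(X_test, models)
```

Fitting the unsupervised models on the training split only keeps information from the test set out of the features. One caveat: scikit-learn advises using LOF's novelty scores on unseen data, so the LOF column on the training rows is only an approximation.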

**Supervised Stack:**
- Base: Random Forest Classifier
- Meta: Gradient Boosting Classifier  

### 2. Supervised Model Pipeline

```python
# Sketch of the stacked pipeline. X_augmented (features plus anomaly-derived
# columns) and y_augmented (Threat Level labels) come from the earlier steps.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Split data (stratified, with a fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X_augmented, y_augmented, test_size=0.2, stratify=y_augmented, random_state=42
)

# Define base and meta learners
base_model = RandomForestClassifier(random_state=42)
meta_model = GradientBoostingClassifier(random_state=42)

# The meta learner is trained on cross-validated predictions of the base learner
stacked_model = StackingClassifier(
    estimators=[('rf', base_model)],
    final_estimator=meta_model
)

# Fit and evaluate
stacked_model.fit(X_train, y_train)
y_pred = stacked_model.predict(X_test)
print(classification_report(y_test, y_pred))
```


## Model Evaluation and Results

### Evaluation Metrics:

* Accuracy
* Precision, Recall, F1-score (per class)
* Confusion Matrix
* ROC-AUC (if needed for binary components)
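
For the four-class target, these metrics can be computed as in the following sketch. It trains a small Random Forest on synthetic data from `make_classification` purely to have predictions to score; the dataset, class weights, and model settings are assumptions and do not reproduce the results reported in this section. ROC-AUC is extended to the multiclass case with a one-vs-rest macro average.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced four-class problem standing in for Threat Level 0-3.
X, y = make_classification(
    n_samples=800, n_features=10, n_informative=6, n_classes=4,
    weights=[0.5, 0.3, 0.15, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)

labels = [0, 1, 2, 3]
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=labels)   # rows = true, columns = predicted
report = classification_report(
    y_test, y_pred, labels=labels,
    target_names=["Low", "Medium", "High", "Critical"], zero_division=0,
)
auc_ovr = roc_auc_score(y_test, y_proba, multi_class="ovr", average="macro")
print(report)
```

Per-class recall for the Critical class is the row of the report to watch, since overall accuracy can stay high even when the rarest class is mostly missed.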

### Key Observations:

* Unsupervised models alone fail to predict classes 2 and 3 accurately.
* Using anomaly scores as features improved supervised performance by:

  * Enhancing signal for rare threat classes (Class 2, 3)
  * Reducing false negatives (threats misclassified as Class 0)  


**Sample Evaluation Metrics**

| Model                        | Accuracy | F1-Score (Class 3) | Recall (Class 3) |
| ---------------------------- | -------- | ------------------ | ---------------- |
| Random Forest Only           | 84%      | 0.51               | 0.48             |
| Gradient Boosting Only       | 83%      | 0.49               | 0.46             |
| **Stacked w/ Anomaly Feat.** | **88%**  | **0.61**           | **0.59**         |

  
  
This stacked pipeline showed improved multiclass classification performance and better detection of critical threat levels.  


## Model Selection and Deployment

* **Selected Model**: StackingClassifier (RandomForest + GradientBoosting) with anomaly features
* **Reason**: Best performance across threat levels, especially Class 3
* **Deployment**: Model serialized and ready for inference; supports real-time scoring with anomaly-enriched feature vectors
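
Serialization of a fitted scikit-learn model is commonly done with `joblib`. The sketch below assumes that approach; the file name `stacked_model.joblib`, the synthetic training data, and the small estimator counts are hypothetical and chosen only to keep the example fast.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)

# Small synthetic four-class dataset standing in for anomaly-enriched features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=4, random_state=0
)
model = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=30, random_state=0))],
    final_estimator=GradientBoostingClassifier(n_estimators=30, random_state=0),
    cv=3,
).fit(X, y)

# Save the fitted model to disk, then load it back as an inference service would.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "stacked_model.joblib")  # hypothetical name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# Score a single incoming feature vector with the restored model.
new_event = X[:1]
predicted_level = restored.predict(new_event)
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
```

At inference time the incoming feature vector must contain the same anomaly-derived columns, in the same order, as the training data.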


## Conclusion

Using unsupervised models as **signal extractors** rather than classifiers proved effective. This hybrid approach leverages both:

* The **anomaly sensitivity** of unsupervised models
* The **targeted pattern learning** of supervised classifiers

**Note:** This methodology is recommended for future applications in **cybersecurity, fraud detection**, or any anomaly-prone classification problem.


```python
!git clone https://github.com/atsuvovor/CyberThreat_Insight.git 2>/dev/null
%run /content/CyberThreat_Insight/model_dev/lagacy_best_model_dev/lagacy_model_dev.py
```