<a href="https://colab.research.google.com/github/atsuvovor/CyberThreat_Insight/blob/main/feature_engeneering/run_all_feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **CyberThreat-Insight**  
**Anomalous Behavior Detection in Cybersecurity Analytics using Generative AI**

**Toronto, September 08 2025**  
**Author: Atsu Vovor**
>Master of Management in Artificial Intelligence    
>Consultant Data Analytics Specialist | Machine Learning |  
>Data Science | Quantitative Analysis | French & English Bilingual   

###  **Feature Engineering**   
The feature engineering process in our *Cyber Threat Insight* project was strategically designed to simulate realistic cyber activity, enhance anomaly visibility, and prepare a high-quality dataset for training robust threat classification models. Given the natural rarity and imbalance of cybersecurity anomalies, we adopted a multi-step workflow combining statistical simulation, normalization, feature selection, explainability, and data augmentation.


## **Feature Analysis**

#### **1. Synthetic Data Loading**

We began with a synthetic dataset that simulates real-time user sessions and system behaviors, including attributes such as login attempts, session duration, data transfer, and system resource usage. This dataset serves as a safe and flexible baseline to emulate both normal and suspicious behaviors without exposing sensitive infrastructure data.

#### **2. Anomaly Injection - Cholesky-Based Perturbation**

To introduce statistically sound anomalies, we applied a Cholesky decomposition-based perturbation to the feature covariance matrix. This method creates subtle but realistic multivariate deviations in the dataset, reflecting how actual threats often manifest through combinations of unusual behaviors (e.g., high data transfer coupled with long session durations).
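As a rough illustration of the idea, the sketch below factors the feature covariance with a Cholesky decomposition and uses the factor to turn independent noise into correlated perturbations. The function name `inject_correlated_anomalies` and the `fraction`, `scale`, and `seed` parameters are assumptions made for this example; the project's own injection routine may differ.

```python
import numpy as np

def inject_correlated_anomalies(X, fraction=0.05, scale=3.0, seed=0):
    """Perturb a random subset of rows with noise that follows the feature covariance."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_rows, n_features = X.shape

    # Feature covariance, with a small ridge so the factorization succeeds
    cov = np.cov(X, rowvar=False) + 1e-8 * np.eye(n_features)
    L = np.linalg.cholesky(cov)  # cov = L @ L.T

    # Choose which rows become anomalies (hypothetical selection rule)
    n_anomalies = max(1, int(fraction * n_rows))
    idx = rng.choice(n_rows, size=n_anomalies, replace=False)

    # Independent standard-normal noise becomes correlated noise through L
    z = rng.standard_normal((n_anomalies, n_features))
    X_out = X.copy()
    X_out[idx] += scale * (z @ L.T)

    is_anomaly = np.zeros(n_rows, dtype=bool)
    is_anomaly[idx] = True
    return X_out, is_anomaly
```

Because the noise is shaped by the covariance factor, a perturbed row tends to deviate in several correlated features at once rather than in a single isolated column.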

#### **3. Feature Normalization**

All numerical features were normalized using a combination of **Min-Max Scaling** and **Z-score Standardization**. This step ensures that features with different units or scales (e.g., memory usage vs. login attempts) contribute equally during model training, especially for distance-based algorithms.
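The two scalings can be written in a few lines of NumPy. This is a minimal sketch with hypothetical helper names; the text does not say which columns receive which scaling, so that choice is left to the caller.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return (X - col_min) / col_range

def z_score(X):
    """Center each column at 0 with unit (population) standard deviation."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at 0
    return (X - X.mean(axis=0)) / std

# Toy values: the second column is 200x larger in scale than the first
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_minmax = min_max_scale(X)
X_standard = z_score(X)
```

After either transform the two toy columns become identical, which is the sense in which features with different units contribute equally.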

#### **4. Correlation Analysis**

Using **Pearson** and **Spearman correlation heatmaps**, we examined inter-feature relationships to detect multicollinearity. This analysis helped eliminate redundant features and highlighted meaningful operational linkages between system metrics, such as correlations between CPU and memory usage during suspicious sessions.
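A minimal sketch of the two correlation matrices and a redundancy check follows. The helper names and the 0.9 threshold are assumptions for illustration, and the rank computation shown here does not average tied values, so it is only exact for columns without ties.

```python
import numpy as np

def pearson_matrix(X):
    """Pearson correlation between columns."""
    return np.corrcoef(np.asarray(X, dtype=float), rowvar=False)

def spearman_matrix(X):
    """Spearman correlation: Pearson applied to column-wise ranks (no tie handling)."""
    X = np.asarray(X, dtype=float)
    ranks = X.argsort(axis=0).argsort(axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def highly_correlated_pairs(corr, names, threshold=0.9):
    """List feature pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs
```

Comparing the two matrices is useful because a monotonic but non-linear relationship scores a perfect Spearman correlation while its Pearson correlation stays below 1.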

#### **5. Feature Importance (Random Forest)**

We trained a Random Forest classifier to compute feature importance scores. These scores provided insights into which features had the most predictive power for classifying threat levels, enabling targeted refinement of the feature set.
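The snippet below sketches how impurity-based importance scores can be read from a scikit-learn Random Forest. The data is a toy example invented for illustration (only the first column determines the label), and the hyperparameters are assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: three features, but only feature 0 carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, one per feature, summing to 1
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
```

Impurity-based scores can favor high-cardinality features, so it may be worth cross-checking them against a second method before dropping a feature.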

#### **6. Model Explainability (SHAP Values)**

To ensure model transparency, we applied **SHAP (SHapley Additive exPlanations)** for both global and local interpretability. SHAP values quantify how each feature impacts the model's decisions for individual predictions, which is critical for cybersecurity analysts needing to validate threat classifications.
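To make the idea of per-prediction attribution concrete, the sketch below computes exact Shapley values by brute force for a tiny hand-written scoring function. It is not the SHAP library's API or algorithm (which relies on faster, model-specific approximations and a background dataset); `shapley_values`, `toy_score`, and the single zero baseline are assumptions for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x; absent features take their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Weight of a coalition of size k that does not contain feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_score(z):
    # Hypothetical threat score: an additive term plus an interaction term
    return 3 * z[0] + z[1] * z[2]

phi = shapley_values(toy_score, x=[2, 4, 5], baseline=[0, 0, 0])
```

The attributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is split evenly between the two features that produce it.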

#### **7. Dimensionality Reduction (PCA)**

We employed **Principal Component Analysis (PCA)** to reduce feature dimensionality while retaining maximum variance. A scree plot was used to identify the optimal number of components. This step improves computational efficiency and enhances model generalization.
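The numbers behind a scree plot can be sketched with a singular value decomposition of the centered data. The helper names and the 95% variance target are assumptions for this example, not values taken from the project.

```python
import numpy as np

def pca_explained_variance(X):
    """Return explained-variance ratios (descending) and the principal axes."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    _, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return variances / variances.sum(), components

def n_components_for(ratio, target=0.95):
    """Smallest number of components whose cumulative ratio reaches the target."""
    cumulative = np.cumsum(ratio)
    return int(min(np.searchsorted(cumulative, target) + 1, len(ratio)))

# Toy data: three independent features with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])
ratio, components = pca_explained_variance(X)
k = n_components_for(ratio, target=0.95)
```

Plotting `ratio` against the component index gives the scree plot; the "elbow" usually sits near the value returned by `n_components_for`. Note that features should be scaled before PCA, otherwise a large-unit feature dominates as it does in this toy data.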

#### **8. Data Augmentation (SMOTE and GANs)**

To address the significant class imbalance between benign and malicious sessions, we applied two augmentation strategies:

* **SMOTE (Synthetic Minority Over-sampling Technique)** to interpolate new synthetic samples for underrepresented classes.
* **Generative Adversarial Networks (GANs)** to produce high-fidelity, realistic threat scenarios that further enrich the minority class.

  

#### **Outcome**

Through this comprehensive workflow, we generated a clean, balanced, and interpretable feature set optimized for training machine learning models. This feature engineering pipeline enabled the system to detect nuanced threat patterns while maintaining explainability and performance across diverse threat levels.

#### **Feature Engineering - Advanced Data Augmentation using SMOTE and GANs**

To address severe class imbalance and enhance the quality of the synthetic training data in our cyber threat insight model, we implemented a hybrid augmentation strategy. This stage of feature engineering combines **SMOTE** (Synthetic Minority Over-sampling Technique) and **GANs** (Generative Adversarial Networks) to increase representation of rare threat levels and capture complex behavioral patterns often found in high-dimensional cybersecurity data.


#### **Literature Review: SMOTE vs GANs for Synthetic Data Generation**


SMOTE and GANs are both used to generate synthetic data to address class imbalance. However, they differ significantly in approach, complexity, application, and the types of data they can handle. Here's a breakdown:



**1. Methodology**

   - **SMOTE**: SMOTE is a straightforward oversampling technique for tabular data. It generates synthetic data by interpolating between samples of the minority class. Specifically, it selects a minority class sample, finds its nearest neighbors, and creates synthetic samples along the line segments joining the original sample with one or more of its neighbors. SMOTE is typically applied to structured, tabular data.

   - **GANs**: GANs are a class of deep learning models that involve two neural networks, a generator and a discriminator, competing against each other. The generator creates synthetic samples, while the discriminator evaluates how close these samples are to real data. Over time, the generator learns to produce increasingly realistic samples. GANs are versatile and can generate complex, high-dimensional data like images, text, and time-series data.
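The SMOTE interpolation step described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation under simplifying assumptions (Euclidean distance, brute-force neighbor search, numerical features only); it is neither the project's `balance_data_with_smote` helper nor the imbalanced-learn implementation, and the function name and parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new points on segments between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Brute-force pairwise distances; a sample is never its own neighbor
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample, one of its k nearest neighbors, and a position on the segment
    base = rng.integers(0, n, size=n_new)
    neigh = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[neigh] - X[base])
```

Every synthetic point lies on a segment between two existing minority samples, which is why SMOTE cannot produce values outside the range already covered by the minority class.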

**2. Complexity**

   - **SMOTE**: SMOTE is computationally simple and easier to implement because it doesn't require training a neural network. It's usually faster and works well for moderately complex datasets.

   - **GANs**: GANs are computationally intensive and require training a generator and discriminator, which are often deep neural networks. They require significant data, compute resources, and tuning. GANs are more complex but can capture intricate patterns and distributions in the data.

**3. Types of Data**

   - **SMOTE**: Works best for numerical tabular data, where generating synthetic samples by interpolation is feasible. It can struggle with categorical variables or complex data relationships.

   - **GANs**: Can handle a variety of data types, including high-dimensional and unstructured data like images, audio, and text. GANs are also better suited for generating more realistic and diverse samples for complex distributions.

**4. Application Scenarios**

   - **SMOTE**: Typically applied to class imbalance in binary classification problems, especially in structured data settings. For example, it's widely used in fraud detection, medical diagnostics, and credit scoring when the minority class samples are significantly fewer than the majority class.

   - **GANs**: GANs are applicable when complex, high-quality synthetic data is required. They are often used in fields like image processing, speech synthesis, and video generation. GANs can also be useful for cybersecurity, where generating realistic threat data may involve complex relationships and high-dimensional feature spaces.

**5. Synthetic Data Quality**

   - **SMOTE**: Produces synthetic samples that are close to the original samples but lack diversity, since it simply interpolates between existing points. This can lead to overfitting, as the generated data may not capture the full range of variability in minority class characteristics.

   - **GANs**: With careful tuning, GANs can generate highly realistic samples that capture complex patterns in the data, offering better generalization and diversity than SMOTE. However, they also come with risks like mode collapse (when the generator produces limited variations of data).


**Summary**
- **SMOTE** is a simpler, faster, and more accessible technique, suitable for lower-dimensional tabular data and basic class imbalance issues.
- **GANs** are more advanced, versatile, and powerful, capable of producing high-dimensional, complex data for applications that demand high-quality synthetic samples.

In cybersecurity, you might use **SMOTE** for imbalanced tabular data with relatively simple feature interactions, while **GANs** can be advantageous for generating more complex synthetic attack patterns or when working with high-dimensional activity logs and network data.  


| Criteria                      | SMOTE                                                               | GANs                                                                                            |
| ----------------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
| **Methodology**               | Interpolates new samples between existing minority class instances. | Uses a generator-discriminator adversarial setup to produce highly realistic synthetic samples. |
| **Complexity**                | Simple, rule-based; no training required.                           | Complex; requires training deep neural networks.                                                |
| **Best for**                  | Structured, tabular data with moderate feature interaction.         | High-dimensional, non-linear, or unstructured data (e.g., logs, behaviors).                     |
| **Synthetic Data Quality**    | Limited diversity; risk of overfitting due to linear interpolation. | Can generate diverse, realistic samples capturing underlying patterns.                          |
| **Cybersecurity Application** | Ideal for boosting minority class in structured event logs.         | Suitable for simulating diverse and realistic threat scenarios.                                 |

---  
     

### **SMOTE + GANs Implementation in Cyber Threat Insight**

To ensure our cyber threat insight model performs robustly across all threat levels including rare but critical cases, we implemented a two-fold data augmentation strategy using **SMOTE (Synthetic Minority Over-sampling Technique)** and **Generative Adversarial Networks (GANs)** as the final step in the feature engineering pipeline.

### Step 1: Handling Imbalanced Classes with SMOTE

In real-world cybersecurity datasets, high-risk threat events are typically underrepresented. To mitigate this class imbalance, we first applied **SMOTE**, a statistical technique that synthesizes new samples by interpolating between existing ones in the feature space. SMOTE oversamples underrepresented threat levels (e.g., High, Critical). This ensures the classifier doesn't overfit to the majority class, enabling better detection of rare threats.

* **Input**: Cleaned and preprocessed numerical dataset.
* **Process**: SMOTE was applied to oversample the minority class based on `Threat Level`.
* **Output**: A balanced dataset where minority threat classes (e.g., *Critical*, *High*) have increased representation.

```python
X_resampled, y_resampled = balance_data_with_smote(processed_num_df)
```

* **Purpose**: Create a balanced training dataset by synthetically adding interpolated samples from the minority class.
* **Impact**: Improved recall and F1-score for rare threat types.  

This step ensured that our model would not be biased toward majority class labels, improving its ability to generalize and detect less frequent, high-impact events.

  


### Step 2: Enhancing Diversity: Learning Complex Patterns with GAN-Based Threat Simulation

To further enrich the dataset beyond SMOTE's linear interpolations, we trained a custom GAN to generate more diverse, non-linear, high-fidelity cyber threat behavior data. Our GAN architecture consists of:

* A **Generator** that learns to create synthetic threat vectors from random noise.
* A **Discriminator** that learns to distinguish real data from synthetic data.

The adversarial training process was carefully monitored using early stopping based on generator loss to prevent overfitting and ensure sample quality.


```python
generator, discriminator = build_gan(latent_dim=100, n_outputs=X_resampled.shape[1])
generator, d_loss_real_list, d_loss_fake_list, g_loss_list = train_gan(
    generator, discriminator, X_resampled, latent_dim=100, epochs=1000
)
```
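The early stopping on generator loss mentioned above can be sketched as a simple patience rule. The `patience` and `min_delta` values and the function name are assumptions; the actual criterion inside `train_gan` is not shown in this document.

```python
def early_stopping_epoch(g_losses, patience=20, min_delta=1e-4):
    """Return the epoch index at which training would stop, or None if it never triggers."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(g_losses):
        if loss < best - min_delta:
            # Meaningful improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Toy generator-loss curve, invented for illustration
toy_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.69]
stop_at = early_stopping_epoch(toy_losses, patience=3)
```

GAN losses tend to oscillate, so a generous patience is usually safer than stopping on the first epoch without improvement.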

Once trained, the generator was used to create 1,000 synthetic threat vectors. Each one follows the statistical distribution of real threat behaviors while representing a previously unseen sample.


```python
synthetic_data = generate_synthetic_data(generator, n_samples=1000, latent_dim=100, columns=X_resampled.columns)
```



### Step 3:  Final Dataset Augmentation - Data Fusion and Export

The synthetic GAN-generated samples were combined with the SMOTE-resampled dataset to form a robust, high-quality augmented dataset, maximizing both statistical and generative diversity.

```python
X_augmented, y_augmented = augment_data(X_resampled, y_resampled, synthetic_data)
augmented_df = concatenate_data_along_columns(X_augmented, y_augmented)
```

The final augmented dataset was saved to cloud storage for traceability and reproducibility.

```python
save_dataframe_to_google_drive(augmented_df, "x_y_augmented_data_google_drive.csv")
```

  

### Outcomes and Benefits

By combining **SMOTE** and **GANs**, we created a rich, well-balanced dataset that allows our models to:

* Learn effectively from both observed and synthetic threat events.
* **Improve detection accuracy**: Detect rare but impactful security threat events with higher sensitivity.
* Generalize to novel behaviors not originally present in the training data.

This hybrid augmentation pipeline significantly improves the reliability and robustness of our cyber threat insight models.

### **SMOTE and GAN Augmentation Models: Performance Analysis**


### Impact Visualization

#### 1. Class Distribution Before vs After Augmentation

The leftmost panel below illustrates how SMOTE and GANs successfully **balanced the target variable (`Threat Level`)**, mitigating the original skew toward lower-risk classes:

> 🔷 **Blue**: Original data  
> 🔴 **Red**: Augmented data (SMOTE + GAN)

```python
plot_combined_analysis_2d_3d(...)
```
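Alongside the plot, the balance can be checked numerically. The sketch below uses made-up label lists purely for illustration (they are not the project's counts, and the `Low` and `Medium` level names are assumptions); the helper names are hypothetical.

```python
from collections import Counter

def class_shares(labels):
    """Fraction of samples in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in sorted(counts)}

def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy labels, invented for illustration only
before = ["Low"] * 70 + ["Medium"] * 20 + ["High"] * 8 + ["Critical"] * 2
after = ["Low"] * 70 + ["Medium"] * 70 + ["High"] * 70 + ["Critical"] * 70

ratio_before = imbalance_ratio(before)
ratio_after = imbalance_ratio(after)
```

An imbalance ratio close to 1 after augmentation indicates that every threat level is represented about equally.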



#### 2. 2D Projections: Real vs Synthetic Sample Distribution

To visually validate that synthetic threats from GANs approximate real feature space structure:

| Projection Method | Description                                                                                                      |
| ----------------- | ---------------------------------------------------------------------------------------------------------------- |
| **PCA**           | Linear projection of high-dimensional data showing real (blue) and generated (red) samples largely overlapping.  |
| **t-SNE**         | Nonlinear embedding preserving local structure; confirms synthetic threats follow the distribution of real ones. |
| **UMAP**          | Captures both local and global structure; reveals well-mixed clusters of real and synthetic samples.             |

These projections demonstrate that GAN-generated samples are **not outliers**, but learned valid manifolds of real threats.

---

#### 3. 3D Analysis: Density & Spatial Similarity

The 3D visualizations show:

* A **3D histogram** comparing class density before and after augmentation.
* **PCA**, **t-SNE**, and **UMAP** 3D scatter plots confirming **continuity** between real and synthetic samples in 3D space.

```python
# Rendered via plot_combined_analysis_2d_3d(...)
```

---

###  GAN Training Progress Monitoring

To ensure high-quality synthetic sample generation, we tracked GAN training loss across epochs:

| Loss Type       | Meaning                                       |
| --------------- | --------------------------------------------- |
| **D Loss Real** | Discriminator loss on real samples            |
| **D Loss Fake** | Discriminator loss on fake samples            |
| **G Loss**      | Generator's ability to fool the discriminator |

These metrics were plotted along with model accuracy during training and validation:

```python
plot_gan_training_metrics(...)
```
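For reference, the three losses in the table are commonly computed with binary cross-entropy on the discriminator's output probabilities. The sketch below assumes that standard formulation (with the non-saturating generator loss); the document does not state which loss `train_gan` actually uses, and the probabilities are toy values.

```python
import math

def bce(p, target, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and a 0/1 target."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def mean_bce(probs, target):
    return sum(bce(p, target) for p in probs) / len(probs)

# Toy discriminator outputs (probability that a sample is real)
d_out_real = [0.9, 0.8, 0.95]  # on real samples
d_out_fake = [0.2, 0.1, 0.3]   # on generated samples

d_loss_real = mean_bce(d_out_real, 1)  # real samples should be scored as 1
d_loss_fake = mean_bce(d_out_fake, 0)  # fake samples should be scored as 0
g_loss = mean_bce(d_out_fake, 1)       # generator wants fakes scored as 1
```

In this toy case the discriminator is winning: both discriminator losses are small while the generator loss is large.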

**Key Insights**:

* Generator loss steadily decreased, indicating it learned to produce more convincing threats.
* The validation accuracy increased alongside training, suggesting **generalization improved** rather than overfitting.

###  Summary

By integrating SMOTE and GANs in the final feature engineering phase, and validating their effectiveness through rich visualizations, we ensured that our cyber threat insight model is:

* **Class-balanced** (especially for rare threat levels)
* **Generalization-ready** through exposure to novel synthetic patterns
* **Interpretable**, thanks to transparent performance metrics and embeddings

This augmentation pipeline plays a **critical role** in enabling our models to detect both known and previously unseen cyber threats with high reliability.

### Run the cell below to execute the whole feature engineering module


```python
# Notebook cell (IPython/Colab): clone the repository, then run the feature engineering script
!git clone https://github.com/atsuvovor/CyberThreat_Insight.git
%run /content/CyberThreat_Insight/feature_engeneering/f_engineering.py
```
