<a href="https://colab.research.google.com/github/atsuvovor/CyberThreat_Insight/blob/main/feature_engeneering/feature_creation_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **CyberThreat-Insight**  
### Feature Engineering Using Cholesky-Based Perturbation, SHAP, PCA, SMOTE and GANs
**Anomalous Behavior Detection in Cybersecurity Analytics using Generative AI**

**Toronto, September 08 2025**  
**Author: Atsu Vovor**
>Master of Management in Artificial Intelligence    
>Consultant Data Analytics Specialist | Machine Learning |  
Data Science | Quantitative Analysis | French & English Bilingual   

---
###  **Overview**   
The feature engineering process in our *Cyber Threat Insight* project was strategically designed to simulate realistic cyber activity, enhance anomaly visibility, and prepare a high-quality dataset for training robust threat classification models. Given the natural rarity and imbalance of cybersecurity anomalies, we adopted a multi-step workflow combining statistical simulation, normalization, feature selection, explainability, and data augmentation.


## **Feature Creation & Analysis**

#### **1. Synthetic Data Loading**

We began with a synthetic dataset that simulates real-time user sessions and system behaviors, including attributes such as login attempts, session duration, data transfer, and system resource usage. This dataset serves as a safe and flexible baseline to emulate both normal and suspicious behaviors without exposing sensitive infrastructure data.

#### **2. Anomaly Injection – Cholesky-Based Perturbation**

To introduce statistically sound anomalies, we applied a Cholesky decomposition-based perturbation to the feature covariance matrix. This method creates subtle but realistic multivariate deviations in the dataset, reflecting how actual threats often manifest through combinations of unusual behaviors (e.g., high data transfer coupled with long session durations).
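One plausible reading of this step can be sketched with NumPy: factor the empirical covariance with a Cholesky decomposition, then add correlated noise to a small subset of rows. The feature columns, the 5% anomaly fraction, the `scale` factor, and the function name `inject_anomalies` are all illustrative assumptions, not values taken from the project code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "normal" sessions.
# Columns (assumed): data_transfer_mb, session_minutes, login_attempts
X = rng.multivariate_normal(
    mean=[50.0, 30.0, 2.0],
    cov=[[100.0, 40.0, 1.0],
         [40.0, 64.0, 0.5],
         [1.0, 0.5, 1.0]],
    size=1000,
)


def inject_anomalies(X, fraction=0.05, scale=3.0, rng=None):
    """Perturb a random subset of rows with noise correlated like the data."""
    rng = np.random.default_rng() if rng is None else rng
    cov = np.cov(X, rowvar=False)
    # A tiny ridge term keeps the matrix positive definite for the factorization.
    L = np.linalg.cholesky(cov + 1e-9 * np.eye(cov.shape[0]))

    n_anom = int(fraction * len(X))
    idx = rng.choice(len(X), size=n_anom, replace=False)
    z = rng.standard_normal((n_anom, X.shape[1]))

    X_out = X.copy()
    # z @ L.T has covariance L @ L.T = cov, so the deviations follow
    # the same cross-feature correlations as the original data.
    X_out[idx] += scale * (z @ L.T)

    labels = np.zeros(len(X), dtype=int)
    labels[idx] = 1
    return X_out, labels


X_perturbed, is_anomaly = inject_anomalies(X, rng=rng)
print(is_anomaly.sum(), "rows perturbed out of", len(X))
```

Because the noise is shaped by the Cholesky factor, a perturbed row tends to move along correlated directions (for example, data transfer and session duration rising together) rather than in one feature at a time.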

#### **3. Feature Normalization**

All numerical features were normalized using a combination of **Min-Max Scaling** and **Z-score Standardization**. This step puts features with different units or scales (e.g., memory usage vs. login attempts) on a comparable footing during model training, which matters most for distance-based algorithms.
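The two scalings can be sketched in a few lines of NumPy. How the project combines them is not specified here, so the sketch applies each one separately to a small made-up matrix; the column meanings and helper names are assumptions.

```python
import numpy as np


def z_score(X):
    """Standardize each column to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # avoid dividing by zero for constant columns
    return (X - mu) / sigma


def min_max(X):
    """Rescale each column to the [0, 1] range."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)  # constant columns map to 0
    return (X - lo) / span


# Hypothetical columns: memory_usage_mb, login_attempts
X = np.array([
    [2048.0, 1.0],
    [4096.0, 3.0],
    [8192.0, 2.0],
    [1024.0, 10.0],
])

X_std = z_score(X)
X_mm = min_max(X)
print(X_std.round(2))
print(X_mm.round(2))
```

After either transform, memory usage in megabytes no longer dominates a Euclidean distance simply because its raw numbers are larger.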

#### **4. Correlation Analysis**

Using **Pearson** and **Spearman correlation heatmaps**, we examined inter-feature relationships to detect multicollinearity. This analysis helped eliminate redundant features and highlighted meaningful operational linkages between system metrics, such as correlations between CPU and memory usage during suspicious sessions.
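As a rough illustration of the redundancy check, the sketch below computes Pearson and Spearman matrices on synthetic data and flags feature pairs above a threshold. The feature names, the 0.9 cutoff, and the generated data are assumptions; the rank shortcut used for Spearman is only exact when a column has no tied values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical features: memory usage is built to track CPU usage closely.
cpu = rng.uniform(0, 100, n)
memory = 0.8 * cpu + rng.normal(0, 5, n)
session = rng.exponential(30, n)
X = np.column_stack([cpu, memory, session])
names = ["cpu_usage", "memory_usage", "session_minutes"]

# Pearson: linear correlation between columns.
pearson = np.corrcoef(X, rowvar=False)


def ranks(col):
    """Rank values within a column (valid for columns without ties)."""
    return np.argsort(np.argsort(col))


# Spearman: Pearson correlation computed on the ranks.
spearman = np.corrcoef(np.apply_along_axis(ranks, 0, X), rowvar=False)

# Flag pairs whose absolute Pearson correlation exceeds the (assumed) cutoff.
threshold = 0.9
redundant = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(pearson[i, j]) > threshold
]
print(redundant)
```

With a pandas DataFrame, `df.corr(method="spearman")` produces the same kind of matrix and handles ties; a heatmap is then just a plot of `pearson` or `spearman`.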

#### **5. Feature Importance (Random Forest)**

We trained a Random Forest classifier to compute feature importance scores. These scores provided insights into which features had the most predictive power for classifying threat levels, enabling targeted refinement of the feature set.
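A minimal sketch of this ranking step with scikit-learn follows. The data are synthetic, the label rule is invented so that two features are informative, and the feature names and hyperparameters are placeholders rather than the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 600

# Hypothetical standardized features.
names = ["data_transfer", "session_duration", "cpu_usage", "memory_usage"]
X = rng.normal(size=(n, len(names)))

# Assumed labeling rule: only the first two features drive the threat label.
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
ranking = sorted(
    zip(names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name:18s} {score:.3f}")
```

Impurity-based importances can favor high-cardinality or correlated features, which is one reason to cross-check them against the correlation analysis and SHAP values.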

#### **6. Model Explainability (SHAP Values)**

To ensure model transparency, we applied **SHAP (SHapley Additive exPlanations)** for both global and local interpretability. SHAP values quantify how each feature impacts the model's decisions for individual predictions, which is critical for cybersecurity analysts needing to validate threat classifications.
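To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a toy scoring function. It does not use the `shap` library: the function `risk_score`, the input `x`, and the single `baseline` vector standing in for "absent" features are all assumptions, and the enumeration is only feasible for a handful of features.

```python
import itertools
import math

import numpy as np


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, with missing features set to baseline."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest stay at baseline.
        z = baseline.copy()
        idx = np.array(subset, dtype=int)
        z[idx] = x[idx]
        return f(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = (
                math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            )
            for subset in itertools.combinations(others, size):
                # Marginal contribution of feature i when added to this coalition.
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


def risk_score(z):
    """Hypothetical model: two additive terms plus one interaction."""
    return 0.6 * z[0] + 0.3 * z[1] + 0.5 * z[0] * z[2]


x = np.array([2.0, 1.0, 1.0])
baseline = np.array([0.5, 0.2, 0.1])

phi = shapley_values(risk_score, x, baseline)
print(phi, phi.sum(), risk_score(x) - risk_score(baseline))
```

The per-feature contributions add up to the difference between the prediction at `x` and the prediction at the baseline, which is the property that lets an analyst read a single prediction as a sum of feature effects.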

#### **7. Dimensionality Reduction (PCA)**

We employed **Principal Component Analysis (PCA)** to reduce feature dimensionality while retaining as much variance as possible. A scree plot was used to identify the optimal number of components. This step improves computational efficiency and can improve model generalization.
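The component-selection logic behind a scree plot can be sketched with a plain SVD. The synthetic data (two latent factors spread over six features) and the 95% cumulative-variance cutoff are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: 6 observed features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(400, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(400, 6))

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Share of total variance per component: the values a scree plot displays.
explained_ratio = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components reaching the (assumed) 95% variance target.
k = int(np.searchsorted(cumulative, 0.95) + 1)

# Project onto the first k principal directions.
X_reduced = Xc @ Vt[:k].T
print(explained_ratio.round(3), "-> keep", k, "components")
```

Plotting `explained_ratio` against the component index gives the scree plot; the "elbow" usually sits near the `k` picked by the cumulative threshold.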

#### **8. Data Augmentation (SMOTE and GANs)**

To address the significant class imbalance between benign and malicious sessions, we applied two augmentation strategies:

* **SMOTE (Synthetic Minority Over-sampling Technique)** to interpolate new synthetic samples for underrepresented classes.
* **Generative Adversarial Networks (GANs)** to produce high-fidelity, realistic threat scenarios that further enrich the minority class.
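The interpolation idea behind SMOTE can be sketched in NumPy as follows. This is a hand-rolled simplification, not the imbalanced-learn implementation, and it leaves the GAN path out entirely; the class sizes, feature count, and the function name `smote_like` are assumptions.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority rows by interpolating toward near neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample, one of its k neighbors, and a point on the segment between them.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


rng = np.random.default_rng(3)
X_benign = rng.normal(loc=0.0, size=(950, 4))    # hypothetical majority class
X_malicious = rng.normal(loc=3.0, size=(50, 4))  # hypothetical minority class

synthetic = smote_like(X_malicious, n_new=900, rng=rng)

X_balanced = np.vstack([X_benign, X_malicious, synthetic])
y_balanced = np.concatenate([np.zeros(950, dtype=int), np.ones(950, dtype=int)])
print(np.bincount(y_balanced))
```

In practice, imbalanced-learn's `SMOTE` class with `fit_resample` covers this step and handles multi-class targets. Oversampling should be applied to the training split only, so that synthetic rows never leak into evaluation data.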

  

#### **Outcome**

Through this comprehensive workflow, we generated a clean, balanced, and interpretable feature set optimized for training machine learning models. This feature engineering pipeline enabled the system to detect nuanced threat patterns while maintaining explainability and performance across diverse threat levels.

**Run the cell below 🏃**

```python
!git clone https://github.com/atsuvovor/CyberThreat_Insight.git 2>/dev/null
%run /content/CyberThreat_Insight/feature_engeneering/fe_flowchart.py
%run /content/CyberThreat_Insight/feature_engeneering/feature_creation.py
```

## 🤝 Connect With Me
I am always open to collaboration and discussion about new projects or technical roles.

Atsu Vovor  
Consultant, Data & Analytics    
Ph: 416-795-8246 | ✉️ atsu.vovor@bell.net    
🔗 <a href="https://www.linkedin.com/in/atsu-vovor-mmai-9188326/" target="_blank">LinkedIn</a> | <a href="https://atsuvovor.github.io/projects_portfolio.github.io/" target="_blank">GitHub</a> | <a href="https://public.tableau.com/app/profile/atsu.vovor8645/vizzes" target="_blank">Tableau Portfolio</a>    
📍 Mississauga ON      

### Thank you for visiting! 🙏
