<a href="https://colab.research.google.com/github/aminayusif/Intrudex/blob/main/Intrudex_Cybersecurity_Intrusion_Detection_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

This notebook explores a cybersecurity intrusion dataset and builds machine learning models to detect intrusions. The workflow covers data preprocessing, feature engineering, and class-imbalance handling, followed by training and evaluating four supervised models (Logistic Regression, Random Forest, Decision Tree, and XGBoost) and applying two unsupervised techniques (K-Means clustering and Isolation Forest) for anomaly detection. The goal is to identify patterns associated with malicious activity and to build models that predict potential intrusions.

### Table of Contents

- Data Loading and Initial Exploration
- Data Preprocessing and Feature Engineering
- Handling Class Imbalance
- Supervised Learning Models (Training and Evaluation)
- Unsupervised Learning (K-Means and Isolation Forest)
- Model Interpretation with SHAP
- Summary and Conclusion

#### Data Loading and Initial Exploration

This section covers loading the dataset and performing initial checks such as viewing the head of the dataframe, checking data types and non-null counts, and examining missing values.
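
The initial checks described above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual loading code: the in-memory CSV and its values are made up, and only the column names are taken from those mentioned elsewhere in this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file. Column names follow the ones
# mentioned in this notebook; the rows are invented for illustration.
csv_text = """session_id,login_attempts,failed_logins,encryption_used,attack_detected
SID_1,4,1,AES,0
SID_2,7,3,,1
SID_3,3,0,DES,0
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())              # first rows of the dataframe
df.info()                     # data types and non-null counts
missing = df.isnull().sum()   # missing values per column
print(missing)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the path to the dataset.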

#### Data Preprocessing and Feature Engineering

This section prepares the data for modeling. The steps are:
- Dropping irrelevant columns (`session_id`).
- Handling missing values in the `encryption_used` column by imputing with the mode.
- Creating new features such as `failed_login_ratio`.
- Generating polynomial features (`login_attempts_sq`, `failed_logins_sq`).
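
The four steps above might look roughly like this in pandas. The toy data is invented, and the definition of `failed_login_ratio` (failed logins divided by login attempts, set to 0 when there were no attempts) is an assumption made for this sketch; the notebook's own formula may differ.

```python
import numpy as np
import pandas as pd

# Toy data using the column names mentioned in this notebook (values invented).
df = pd.DataFrame({
    "session_id": ["SID_1", "SID_2", "SID_3", "SID_4"],
    "login_attempts": [4, 7, 3, 0],
    "failed_logins": [1, 3, 0, 0],
    "encryption_used": ["AES", None, "AES", "DES"],
    "attack_detected": [0, 1, 0, 0],
})

# 1. Drop the identifier column, which carries no predictive signal.
df = df.drop(columns=["session_id"])

# 2. Impute missing `encryption_used` values with the most frequent category.
mode_value = df["encryption_used"].mode()[0]
df["encryption_used"] = df["encryption_used"].fillna(mode_value)

# 3. Ratio feature. ASSUMPTION: failed_logins / login_attempts, with 0 used
#    when there were no login attempts (avoids division by zero).
attempts = df["login_attempts"].replace(0, np.nan)
df["failed_login_ratio"] = (df["failed_logins"] / attempts).fillna(0.0)

# 4. Squared (degree-2 polynomial) features.
df["login_attempts_sq"] = df["login_attempts"] ** 2
df["failed_logins_sq"] = df["failed_logins"] ** 2

print(df)
```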

#### Handling Class Imbalance

The notebook addresses the class imbalance in the target variable (`attack_detected`) using the Synthetic Minority Over-sampling Technique (SMOTE). Visualizations are included to show the class distribution before and after applying SMOTE.
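
To make the idea behind SMOTE concrete, here is a hand-rolled NumPy sketch of its core step: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration only, not the notebook's code; in practice a library implementation (for example the `SMOTE` class in imbalanced-learn) would normally be used. The function name `smote_sketch` and the toy data are hypothetical.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours. Illustrative only; not a full SMOTE implementation."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)  # starting samples
    picks = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picks] - X_min[base])


# Toy imbalanced data: 90 majority (class 0) and 10 minority (class 1) points.
rng = np.random.default_rng(42)
X_majority = rng.normal(0.0, 1.0, size=(90, 2))
X_minority = rng.normal(3.0, 1.0, size=(10, 2))

# Generate enough synthetic minority samples to balance the two classes.
X_new = smote_sketch(X_minority, n_new=len(X_majority) - len(X_minority))
X_balanced = np.vstack([X_majority, X_minority, X_new])
y_balanced = np.array([0] * len(X_majority) + [1] * (len(X_minority) + len(X_new)))

print(np.bincount(y_balanced))  # class counts after oversampling
```

Because every synthetic point lies between two existing minority samples, the oversampled class stays inside the region the original minority samples already occupy.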

#### Supervised Learning Models (Training and Evaluation)

This section focuses on building and evaluating supervised learning models:
- The preprocessed and resampled data is split into training and testing sets.
- Logistic Regression, Random Forest, Decision Tree, and XGBoost models are trained, with GridSearchCV used for hyperparameter tuning, and then evaluated on the test set.
- Performance metrics (accuracy, recall, precision, ROC AUC) are calculated and presented in a table, along with classification reports for each model.
- ROC curves for all models are plotted together, with their AUC values, for visual comparison.
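
The tuning and evaluation loop described above can be sketched with scikit-learn as follows. This is a minimal example under stated assumptions: the data is synthetic (`make_classification`), only two of the four model families are shown to keep it short, and the parameter grids, `cv=3`, and `scoring="roc_auc"` are illustrative choices rather than the notebook's actual settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the preprocessed, resampled data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# ASSUMPTION: small illustrative grids, not the notebook's actual grids.
candidates = {
    "Logistic Regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "Random Forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
}

rows = []
for name, (estimator, grid) in candidates.items():
    # Cross-validated grid search on the training split only.
    search = GridSearchCV(estimator, grid, cv=3, scoring="roc_auc")
    search.fit(X_train, y_train)

    # Evaluate the refitted best estimator on the held-out test split.
    y_pred = search.predict(X_test)
    y_score = search.predict_proba(X_test)[:, 1]
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    })

results = pd.DataFrame(rows).set_index("model")
print(results.round(3))
```

The same loop extends to the Decision Tree and XGBoost models by adding entries to `candidates`.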

#### Unsupervised Learning (K-Means and Isolation Forest)

This section explores unsupervised learning techniques for anomaly detection:
- K-Means clustering is applied to identify potential groupings in the data, and the elbow method is used to suggest a suitable number of clusters.
- The clustering results are visualized using scatter plots.
- Isolation Forest is applied to detect anomalies, and the distribution of anomaly scores is visualized.
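
A minimal sketch of these unsupervised steps, using scikit-learn on synthetic data. The blob data, the range of `k` values, the choice of three clusters, and `contamination=0.05` are all assumptions made for illustration, not values taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Elbow method: record the within-cluster sum of squares (inertia) for
# several k. The "elbow" is where adding clusters stops paying off.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
print(inertias)

# Cluster assignments for the chosen k (3 is assumed here).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Isolation Forest: lower decision_function scores are more anomalous;
# predict() returns -1 for anomalies and 1 for normal points.
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
scores = iso.decision_function(X)
flags = iso.predict(X)
print("flagged anomalies:", int((flags == -1).sum()))
```

Plotting `inertias` against `k` gives the elbow curve, and a histogram of `scores` gives the anomaly score distribution.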

#### Model Interpretation with SHAP

SHAP (SHapley Additive exPlanations) is used to interpret the Isolation Forest model, providing insights into which features are most influential in determining anomaly scores. A summary plot and a force plot are included to illustrate the SHAP analysis.

#### Summary and Conclusion

This section summarizes the key findings from both the supervised and unsupervised learning analyses, highlighting the performance of the models and insights gained from the data and model interpretation. Possible next steps are also suggested.