# <p style="background-color: #780000; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:5px 10px; padding: 20px"> Cybersecurity Intrusion and Anomaly detection</p>

<div style="border-radius:10px; padding: 15px; background-color: #fdf0d5; font-size:120%; text-align:left; ">  
    
## Objective 

 - Detect cyber intrusions based on network traffic and user behavior.  
 - Identify anomalies using autoencoders and classify them as potential threats.  
 - Optimize model performance using GridSearchCV for best hyperparameters.  

## Overview
 - Performed Exploratory Data Analysis (EDA) to understand network traffic patterns.
 - Balanced the dataset to address class imbalance for better model performance.
 - Detected anomalies using autoencoders and classified them using multiple models.
 - Tuned hyperparameters using GridSearchCV to improve detection accuracy.  



### <span style= color:#c1121f;> Importing libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_class_weight
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score,accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn import metrics
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



### <span style= color:#c1121f;> Style

In [None]:
colors = ['#780000', '#c1121f', '#fdf0d5', '#003049', '#669bbc']
sns.set_theme(style="darkgrid", rc={'axes.facecolor': colors[2]})

### <span style= color:#c1121f;> Import Data

In [None]:
df = pd.read_csv("/kaggle/input/cybersecurity-intrusion-detection-dataset/cybersecurity_intrusion_data.csv")

# <p style="background-color:#c1121f; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:5px 10px; padding: 20px">Data Overview</p>

In [None]:
df.info()

In [None]:
#percentage of missing data
no_of_values = df.shape[0]  
temp = df.isnull().sum()  
temp = (temp / no_of_values) * 100  
print(temp)


### <span style= color:#c1121f;> Handling Missing data

In [None]:
df["encryption_used"].unique()

In [None]:
df["encryption_used"] = df["encryption_used"].fillna(value="No_enc")
df.isnull().sum()

## <span style= color:#c1121f;> Exploratory Data Analysis

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df["network_packet_size"],kde=True,color=colors[0])
plt.xlabel("Network Packet Size")
plt.ylabel("Count")
plt.title("Distribution of network_packet_size")
plt.show()

In [None]:
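# Categorical columns, excluding the session identifier.
# drop() returns a new frame, which avoids modifying a slice of df in place.
obj_col = df.select_dtypes(include="object").drop(columns=["session_id"])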
for c in obj_col.columns:
    plt.figure(figsize=(8,4))
    sns.countplot(data=df,x=c,hue='attack_detected',palette=colors)
    plt.xlabel(c)
    plt.ylabel("number of values")
    plt.title(f"{c} distribution over attack_detected")
    plt.show()



In [None]:
num_col = df.select_dtypes(include=["float64","int64"])
# Skip the target itself; it is only used as the hue
for c in num_col.columns.drop("attack_detected"):
    plt.figure(figsize=(8,4))
    if num_col[c].nunique() >15:
        sns.histplot(data=num_col,x=c,hue='attack_detected',kde=True,palette=colors)
    else:
        sns.countplot(data=num_col,x=c,hue="attack_detected",palette=colors)
    plt.xlabel(c)
    plt.ylabel("number of values")
    plt.title(f"{c} distribution over attack_detected")
    plt.show()



In [None]:
df.info()

### <span style= color:#c1121f;> One-Hot Encoding

In [None]:
df = pd.get_dummies(df,columns=["protocol_type","encryption_used","browser_type"], drop_first=False,dtype=int)

In [None]:
df.head()

In [None]:
df.drop(columns=["session_id"],inplace=True)

In [None]:
df.info()

### <span style= color:#c1121f;> Correlation Matrix

In [None]:
plt.figure(figsize=(14, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

In [None]:
correlation_matrix = df.corr()

# Correlation of every feature with the target, strongest positive first
correlation_target = correlation_matrix['attack_detected'].sort_values(ascending=False).drop("attack_detected")

# Plot the heatmap for the correlation with 'attack_detected'
plt.figure(figsize=(8, 5))
sns.heatmap(correlation_target.to_frame(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation with attack_detected')
plt.show()

### <span style= color:#c1121f;> Feature selection using Random Forest

In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
X = df.drop(columns=['attack_detected'])
y = df['attack_detected']

In [None]:
print("Shape for X Dataframe: ", X.shape)
print("Columns for X Dataframe: ", X.columns)
print("-"*50)
print("Shape for y Dataframe: ", y.shape)

In [None]:
# Train the model
rf.fit(X, y)

In [None]:
# Get feature importances
feature_importances = pd.DataFrame(rf.feature_importances_, index=X.columns, columns=['importance'])
feature_importances = feature_importances.sort_values('importance', ascending=False)

In [None]:
# Plot feature importances (all features, sorted by importance).
# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed.
feature_importances.plot(kind='bar', figsize=(10, 6),color=colors)
plt.title("Feature Importances")
plt.show()

In [None]:
important_features = feature_importances.head(5).index

In [None]:
fliter_df = df[important_features ]
fliter_df.head()

In [None]:
fliter_df.info()

In [None]:
y.info()

### <span style= color:#c1121f;> Normalization

In [None]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(fliter_df)
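
Min-max scaling maps each feature to [0, 1] via `(x - min) / (max - min)`. The cell above fits the scaler on every row, so rows that later land in the test split have already influenced the minimum and maximum. A minimal sketch of the alternative, learning the range from the training rows only, is shown below with plain NumPy. The toy arrays and the `fit_min_max` / `apply_min_max` helper names are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def fit_min_max(train):
    """Learn the per-feature minimum and range from the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Guard against constant columns to avoid division by zero.
    col_range[col_range == 0] = 1.0
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Apply (x - min) / range using statistics learned elsewhere."""
    return (data - col_min) / col_range

# Toy data: two features, three training rows, one held-out row.
train = np.array([[10.0, 0.2], [20.0, 0.4], [30.0, 0.8]])
test = np.array([[40.0, 0.5]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
```

Held-out values can fall outside [0, 1] when they exceed the training range, which is expected. With scikit-learn the same effect comes from calling `fit` on the training split and `transform` on the test split.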

## <p style="background-color:#c1121f; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:5px 10px; padding: 20px">Autoencoder Anomaly detection</p>

In [None]:
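# Train on normal sessions only and score the full dataset.
# Note: the scoring set therefore contains the rows used for training.
X_train = X_scaled[(y == 0).to_numpy()]  # Normal sessions only
X_test = X_scaled  # Full dataset, including the rows in X_train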
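
Because `X_test` is the full dataset, the autoencoder is later scored on the same normal rows it was trained on. One possible way to avoid that overlap, sketched here under the assumption that a simple random hold-out is acceptable, is to reserve part of the normal rows for evaluation. The toy labels, the 25% hold-out fraction, and the index names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy labels: 8 normal sessions (0) and 4 attacks (1).
labels = np.array([0] * 8 + [1] * 4)

# Shuffle the indices of the normal rows and hold some back for evaluation.
normal_idx = np.flatnonzero(labels == 0)
rng.shuffle(normal_idx)
n_holdout = len(normal_idx) // 4

holdout_normal_idx = normal_idx[:n_holdout]
train_idx = normal_idx[n_holdout:]          # normal rows used to fit the autoencoder

# Evaluation set: held-out normal rows plus every attack row.
eval_idx = np.concatenate([holdout_normal_idx, np.flatnonzero(labels == 1)])
```

In the notebook, `X_scaled[train_idx]` would then play the role of `X_train` and `X_scaled[eval_idx]` the role of `X_test`.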

In [None]:
# Autoencoder model
input_dim = X_train.shape[1]
encoding_dim = 3 # Compressed representation

encoder = keras.Sequential([
    layers.Dense(10, activation='relu', input_shape=(input_dim,)),
    layers.Dense(encoding_dim, activation='relu')
])

decoder = keras.Sequential([
    layers.Dense(10, activation='relu', input_shape=(encoding_dim,)),
    layers.Dense(input_dim, activation='sigmoid')
])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer='adam', loss='mse')

In [None]:
# Train autoencoder
history = autoencoder.fit(X_train, X_train, epochs=50, batch_size=32, validation_data=(X_test, X_test), verbose=1)

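# Per-session reconstruction error: mean absolute difference across features
# (the training loss is MSE; MAE is used here only as the anomaly score)
X_test_pred = autoencoder.predict(X_test)
reconstruction_errors = np.mean(np.abs(X_test - X_test_pred), axis=1)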


In [None]:
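# Set the anomaly threshold at the 75th percentile of all errors,
# so the 25% of sessions with the largest error are flagged as attacks
threshold = np.percentile(reconstruction_errors, 75)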
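
A percentile taken over all sessions always flags the same share of traffic, whatever the real attack rate is. A common alternative is to take the percentile over the errors of normal sessions only, which fixes the false-positive rate instead. The sketch below compares the two on synthetic errors. The error distributions and the 95th percentile are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reconstruction errors: normal sessions reconstruct well, attacks poorly.
normal_errors = rng.normal(loc=0.05, scale=0.01, size=1000)
attack_errors = rng.normal(loc=0.15, scale=0.03, size=300)
all_errors = np.concatenate([normal_errors, attack_errors])

# Threshold over all rows: flags about 25% of sessions by construction.
threshold_all = np.percentile(all_errors, 75)
flagged_all = np.mean(all_errors > threshold_all)

# Threshold over normal rows only: about 5% of normal sessions are flagged.
threshold_normal = np.percentile(normal_errors, 95)
false_positive_rate = np.mean(normal_errors > threshold_normal)
detection_rate = np.mean(attack_errors > threshold_normal)
```

In the notebook, the normal-only errors would correspond to `reconstruction_errors[(y == 0).to_numpy()]`.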


In [None]:
# Detect anomalies
y_pred = reconstruction_errors > threshold

df_results = df.copy()
df_results["anomaly_score"] = reconstruction_errors
df_results["predicted_attack"] = y_pred.astype(int)

# Plot error distribution
plt.hist(reconstruction_errors, bins=50)
plt.axvline(threshold, color='r', linestyle='dashed', linewidth=2)
plt.xlabel("Reconstruction Error")
plt.ylabel("Frequency")
plt.title("Error Distribution with Anomaly Threshold")
plt.show()

In [None]:
# Confusion matrix of true labels vs. autoencoder predictions
conf_matrix = confusion_matrix(y, y_pred)
# Plot confusion matrix
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Reds", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

class_report = classification_report(y, y_pred)
print("Classification Report:\n", class_report)

In [None]:
# Overall accuracy of the autoencoder-based predictions
# (accuracy_score is already imported in the first cell)
accuracy_score(y,y_pred)

# <p style="background-color:#c1121f; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:5px 10px; padding: 20px">ML attack detection using GridSearchCV</p>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled , y, test_size=0.2, random_state=42)

In [None]:
# Define models
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'XGBoost': xgb.XGBClassifier(),
    'LightGBM': lgb.LGBMClassifier(),
    'CatBoost': cb.CatBoostClassifier(silent=True),
    'AdaBoost': AdaBoostClassifier(),
    'Bagging': BaggingClassifier(),
    'KNN': KNeighborsClassifier()
}

In [None]:
# Define reduced parameter grids
param_grids = {
    'Logistic Regression': {
        'C': [0.1, 1],
        'solver': ['liblinear'],
        'penalty': ['l2']
    },
    'Decision Tree': {
        'max_depth': [5, 10],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2]
    },
    'Random Forest': {
        'n_estimators': [50, 100],
        'max_depth': [10],
        'min_samples_split': [2],
        'min_samples_leaf': [1]
    },
    'Gradient Boosting': {
        'n_estimators': [100],
        'learning_rate': [0.1],
        'max_depth': [5]
    },
    'XGBoost': {
        'n_estimators': [100],
        'learning_rate': [0.1],
        'max_depth': [5],
        'subsample': [0.8, 1.0]
    },
    'SVM (RBF)': {
        'C': [1, 10],
        'gamma': ['scale', 'auto']
    },
    'SVM (Linear)': {
        'C': [1, 10],
    },
    'LightGBM': {
        'n_estimators': [100],
        'learning_rate': [0.1],
        'max_depth': [3, 5],
    },
    'CatBoost': {
        'iterations': [100],
        'learning_rate': [0.1],
        'depth': [3, 5]
    },
    'KNN': {
        'n_neighbors': [3],
        'weights': ['uniform', 'distance']
    },
    'AdaBoost': {
        'n_estimators': [100],
        'learning_rate': [0.01, 0.1]
    },
    'Bagging': {
        'n_estimators': [100],
        'max_samples': [0.8, 1.0]
    }
}

In [None]:
# Initialize an empty dictionary to store results
model_results = {}

# Handle class imbalance by computing class weights for each model that supports it
class_weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_train)
class_weight_dict = {0: class_weights[0], 1: class_weights[1]}
print("class_weight_dict: ", class_weight_dict)
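
The `'balanced'` mode weights each class by `n_samples / (n_classes * count)`, so the rarer class receives the larger weight. Note that `class_weight_dict` is only printed in this notebook; the models that use weighting are given `class_weight='balanced'` directly. The NumPy sketch below reproduces the formula on toy labels. The helper name `balanced_class_weights` and the label counts are illustrative assumptions.

```python
import numpy as np

def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Toy labels: 6 normal sessions (0) and 4 attacks (1).
toy_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_weights = balanced_class_weights(toy_labels)
```

Here the minority class (4 of 10 rows) gets weight 10 / (2 * 4) and the majority class gets 10 / (2 * 6).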

In [None]:
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
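
SMOTE balances the classes by creating synthetic minority rows rather than duplicating existing ones: each new row lies on the line segment between a minority sample and one of its nearest minority neighbours. It is applied to the training split only, so the test set keeps its original class distribution. The sketch below illustrates the interpolation step with NumPy. It is a simplified illustration, not the `imblearn` implementation, and the helper name `smote_like_sample` and the toy points are assumptions.

```python
import numpy as np

def smote_like_sample(minority, index, k, rng):
    """Create one synthetic point between minority[index] and one of its
    k nearest minority neighbours (assumes no duplicate rows)."""
    base = minority[index]
    distances = np.linalg.norm(minority - base, axis=1)
    # The closest point is the base itself (distance 0), so skip position 0.
    neighbour_ids = np.argsort(distances)[1:k + 1]
    neighbour = minority[rng.choice(neighbour_ids)]
    gap = rng.random()  # uniform in [0, 1)
    return base + gap * (neighbour - base)

rng = np.random.default_rng(42)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
synthetic = smote_like_sample(minority, index=0, k=2, rng=rng)
```

With `k=2` the distant point `[5.0, 5.0]` is never chosen, so the synthetic row stays close to the local cluster.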

In [None]:
# Evaluate models with GridSearchCV
for model_name, model in models.items():
    print(f"\nTraining model with GridSearchCV: {model_name}")

    param_grid = param_grids[model_name]
    
    # Use balanced class weights for the estimators that support them.
    # The 'SVM (RBF)' and 'SVM (Linear)' grids in param_grids have no matching
    # entry in models, so no SVM is fitted by this loop.
    if model_name == 'Logistic Regression':
        model = LogisticRegression(class_weight='balanced')
    elif model_name == 'Random Forest':
        model = RandomForestClassifier(class_weight='balanced')


    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
    
    grid_search.fit(X_train_smote, y_train_smote)
    
    # Get the best model and its parameters
    best_model = grid_search.best_estimator_
    print(f"Best parameters for {model_name}: {grid_search.best_params_}")
    
    # Predict on both train and test sets
    y_train_pred = best_model.predict(X_train_smote)
    y_test_pred = best_model.predict(X_test)
    
    # Store the results
    model_results[model_name] = {
        'train_accuracy': best_model.score(X_train_smote, y_train_smote),
        'test_accuracy': best_model.score(X_test, y_test),
        'classification_report': classification_report(y_test, y_test_pred),
        'roc_auc': roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
    }

    # Print the results for the current model
    print("\nModel Evaluation Results:")
    print(f"Train Accuracy: {model_results[model_name]['train_accuracy']:.4f}")
    print(f"Test Accuracy: {model_results[model_name]['test_accuracy']:.4f}")
    print(f"ROC AUC: {model_results[model_name]['roc_auc']:.4f} \n")
    print(f"Classification Report:\n{model_results[model_name]['classification_report']}")
    print("-" * 100)
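
The loop above calls `predict_proba` for every model when computing ROC AUC. Estimators such as `LinearSVC` expose `decision_function` but not `predict_proba`, so adding them to `models` would raise an error at that step. A small helper that falls back to `decision_function` is one way around this, since ROC AUC only needs scores that rank the positive class higher. The helper name `positive_class_scores` and the two stand-in model classes are hypothetical and exist only to make the sketch runnable without training anything.

```python
import numpy as np

def positive_class_scores(model, X):
    """Return a ranking score for the positive class.

    Uses predict_proba when available, otherwise decision_function.
    """
    if hasattr(model, "predict_proba"):
        return np.asarray(model.predict_proba(X))[:, 1]
    if hasattr(model, "decision_function"):
        return np.asarray(model.decision_function(X))
    raise AttributeError("model exposes neither predict_proba nor decision_function")

class ProbaModel:
    """Stand-in for an estimator with predict_proba (illustration only)."""
    def predict_proba(self, X):
        p = 1.0 / (1.0 + np.exp(-np.asarray(X)[:, 0]))
        return np.column_stack([1.0 - p, p])

class MarginModel:
    """Stand-in for an estimator with only decision_function (illustration only)."""
    def decision_function(self, X):
        return np.asarray(X)[:, 0]

X_demo = np.array([[-2.0], [0.0], [3.0]])
proba_scores = positive_class_scores(ProbaModel(), X_demo)
margin_scores = positive_class_scores(MarginModel(), X_demo)
```

In the loop, `roc_auc_score(y_test, positive_class_scores(best_model, X_test))` would then work for both kinds of estimator.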

<div style="border-radius:10px; padding: 15px; background-color: #fdf0d5; font-size:120%; text-align:left; ">  
    
### **Model Performance Summary:**

| Model               | Train Accuracy | Test Accuracy | ROC AUC |
|---------------------|----------------|---------------|---------|
| Logistic Regression | 0.7182         | 0.7332        | 0.7987  |
| Decision Tree       | 0.8620         | 0.8679        | 0.8773  |
| Random Forest       | 0.8653         | 0.8684        | 0.8728  |
| Gradient Boosting   | 0.8763         | 0.8669        | 0.8779  |
| XGBoost             | 0.8619         | 0.8658        | 0.8799  |
| LightGBM            | 0.8620         | 0.8664        | 0.8750  |

### **Observations:**
1. **Logistic Regression is the weakest performer**, with lower accuracy and ROC AUC than the tree-based models. This suggests that a linear decision boundary is not a good fit for this dataset.
   
2. **Decision Tree, Random Forest, Gradient Boosting, XGBoost, and LightGBM all perform similarly**, with test accuracy of approx. 86.6-86.8% and ROC AUC of approx. 0.873-0.880.

3. **Gradient Boosting and XGBoost have the highest ROC AUC (~0.878-0.880)**, which makes them marginally the best models for ranking predictions, although Decision Tree (0.8773) is very close behind.

4. **Overfitting seems minimal**, as the train and test accuracies are quite close for the tree-based models. Note that train accuracy is measured on the SMOTE-resampled training set.
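
To make the "ranking" reading of ROC AUC concrete: it equals the fraction of (attack, normal) pairs in which the attack receives the higher score, with ties counted as half. The sketch below computes it that way on toy data. The helper name `pairwise_auc` and the toy scores are illustrative assumptions, and the pairwise matrix is only practical for small inputs.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
```

Three of the four pairs are ordered correctly here, so `toy_auc` is 0.75.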

### **Next Steps:**
- **Feature Importance Analysis:** Identify the most important features using models like Random Forest or XGBoost.
- **Hyperparameter Tuning:** You might explore a broader range of hyperparameters to push performance further.
- **Ensemble Methods:** Combining models (e.g., stacking XGBoost and LightGBM) could further improve performance.
- **Threshold Tuning:** If recall/precision is more critical, fine-tune the decision threshold.
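
The threshold-tuning step can be sketched as a sweep over candidate cut-offs, recording precision and recall at each one. The example uses toy labels and probabilities; the helper name `precision_recall_at` and the candidate thresholds are assumptions. In the notebook the probabilities would come from `best_model.predict_proba(X_test)[:, 1]`, ideally evaluated on a separate validation split rather than the test set.

```python
import numpy as np

def precision_recall_at(y_true, proba, threshold):
    """Precision and recall when predicting 'attack' for proba >= threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(proba) >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

toy_labels = np.array([0, 0, 0, 1, 1, 1])
toy_proba = np.array([0.1, 0.3, 0.6, 0.4, 0.7, 0.9])

# A lower threshold catches more attacks (recall) at the cost of more false alarms.
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(toy_labels, toy_proba, t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

For intrusion detection a missed attack is often costlier than a false alarm, which would argue for a threshold below the default 0.5.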
