# Tree-Based Models for Intrusion Detection (CICIDS2017 Dataset)

This notebook prepares and cleans the dataset, splits it into training and test sets, and then applies feature selection and model training.

## 1. Data Preprocessing

### 1.1 Import Libraries

This block imports the `warnings` module and suppresses all warning messages so that the notebook output stays readable during execution.

In [None]:
import warnings
warnings.filterwarnings("ignore")

Here we import the required Python libraries:

- `numpy` and `pandas` for numerical operations and data manipulation  
- `seaborn` and `matplotlib` for visualization  
- `LabelEncoder` from scikit-learn to convert categorical labels into numeric form  
- `train_test_split` for dataset splitting  
- `classification_report`, `confusion_matrix`, `accuracy_score`, `precision_recall_fscore_support`, `f1_score` for model evaluation  
- `DecisionTreeClassifier`, `RandomForestClassifier`, `ExtraTreesClassifier`, and the `xgboost` package (imported as `xgb`, together with its `plot_importance` helper) for classification models


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_recall_fscore_support
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
from xgboost import plot_importance

### 1.2 Load Dataset

The sampled CICIDS2017 dataset is loaded from a CSV file into a pandas DataFrame (`df`).

In [None]:
df = pd.read_csv('C:/Users/hp/Downloads/final/Intelligent Intrusion Detection System/Intrusion Detection system/Intrusion-Detection/data/CICIDS2017_sa.csv')

### 1.3 Normalization and Missing Values

Min-max normalization is applied to every numeric feature, rescaling each column to the range [0, 1] so that features with large magnitudes do not dominate those with small ones. A constant column has `max - min = 0`, so the division yields NaN for every row; these NaNs, together with any values that were already missing, are then replaced with 0.

In [None]:
# Min-max normalization
numeric_features = df.dtypes[df.dtypes != 'object'].index
df[numeric_features] = df[numeric_features].apply(
    lambda x: (x - x.min()) / (x.max()-x.min()))
# Replace missing values (including NaNs produced by constant columns) with 0
df = df.fillna(0)
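The effect of the formula on a constant column is easiest to see on a toy example. The sketch below uses a small made-up DataFrame; the column names `a`, `b`, and `label` are illustrative and not taken from CICIDS2017.

```python
import pandas as pd

# Toy data: "b" is constant, "label" is non-numeric and must be left untouched
toy = pd.DataFrame({
    "a": [2.0, 4.0, 6.0],
    "b": [5.0, 5.0, 5.0],
    "label": ["x", "y", "x"],
})

numeric_cols = toy.select_dtypes(include="number").columns
scaled = toy.copy()
scaled[numeric_cols] = scaled[numeric_cols].apply(
    lambda col: (col - col.min()) / (col.max() - col.min()))

# The constant column has zero range, so every entry is 0/0 = NaN at this point
nan_before_fill = int(scaled["b"].isna().sum())

# Filling with 0 turns the constant column into an all-zero column
scaled[numeric_cols] = scaled[numeric_cols].fillna(0)
print(scaled)
```

After the fill, a constant column carries no information, so it could arguably be dropped instead; the notebook keeps it.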

### 1.4 Train/Test Split

The `Label` column (attack type) is encoded into numeric values using `LabelEncoder`.  

- `X` contains the independent variables (features).  
- `y` contains the target class labels (dependent variable).  

The dataset is split into **80% training** and **20% testing**, with `stratify=y` ensuring that class distribution remains consistent across both sets.

In [None]:
labelencoder = LabelEncoder()
# Assign by column name so the encoded labels are stored with an integer dtype
# (an iloc assignment can leave the column as object dtype)
df['Label'] = labelencoder.fit_transform(df['Label'])
X = df.drop(['Label'], axis=1).values
y = df['Label'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0, stratify=y)
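Because only the encoded integers appear in the outputs that follow, it can help to record which attack name each integer stands for. A minimal sketch, assuming string labels; the three names below are placeholders rather than the full CICIDS2017 label set, and the variable names are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Placeholder labels; the real values come from the dataset's Label column
labels = ["DoS", "BENIGN", "PortScan", "BENIGN", "DoS"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, and each class's position is its integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original names from integer codes
decoded = encoder.inverse_transform([2, 0])
print(decoded)
```

In the notebook, the same lookup could be built from `labelencoder.classes_` right after the encoder is fitted on the `Label` column.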

The next cell displays the dimensions (rows, columns) of the training set to confirm that preprocessing and splitting succeeded.

In [None]:
X_train.shape

(45328, 77)

Shows the dimensions of the test dataset, ensuring the split is correct.

In [None]:
X_test.shape

(11333, 77)

Counts and displays the frequency of each class in the training dataset. This helps identify class imbalance problems before applying resampling methods.

In [None]:
pd.Series(y_train).value_counts()

0    18184
3    15228
5     6357
2     2213
6     1744
1     1573
4       29
Name: count, dtype: int64
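To put a number on the imbalance, the counts above can be converted to shares of the training set and compared against the largest class. The snippet re-enters the printed counts by hand so that it runs on its own.

```python
import pandas as pd

# Training-set class counts copied from the output above
counts = pd.Series({0: 18184, 3: 15228, 5: 6357, 2: 2213, 6: 1744, 1: 1573, 4: 29})

share_pct = (counts / counts.sum() * 100).round(3)  # percentage of all training rows
imbalance_ratio = counts.max() / counts.min()       # largest class vs. smallest class

print(share_pct)
print(f"largest/smallest ratio: {imbalance_ratio:.0f}")
```

Class 4 accounts for well under 0.1% of the training rows, which makes it the obvious candidate for oversampling.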

### 1.5 Handling Imbalanced Data (SMOTE)

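SMOTE (Synthetic Minority Oversampling Technique) is imported and initialized.

It generates synthetic samples for under-represented classes by interpolating between existing minority samples and their nearest neighbors. With `sampling_strategy={4: 1500}`, the dictionary value is the target number of samples for the class after resampling, so class 4 grows from 29 to 1500 samples (1471 of them synthetic) while the other classes are left unchanged.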

In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(n_jobs=-1, sampling_strategy={4: 1500})  # Oversample minority class 4 to a total of 1500 samples
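The core idea behind SMOTE can be sketched in a few lines of NumPy: pick a minority sample, pick one of its nearest minority-class neighbors, and place a new point at a random position on the segment between them. This is a simplified illustration and not imbalanced-learn's implementation; the function name `smote_like_samples` and its parameters are hypothetical, and the data is random toy data.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbors per point

    base = rng.integers(0, len(X_min), size=n_new)             # random base samples
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]   # one neighbor per base
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# 29 toy minority points in 2-D, grown to 1500 in total as for class 4
toy_rng = np.random.default_rng(1)
X_minority = toy_rng.normal(size=(29, 2))
synthetic = smote_like_samples(X_minority, n_new=1500 - len(X_minority))
X_resampled = np.vstack([X_minority, synthetic])
print(X_resampled.shape)
```

Every synthetic point lies between two real minority points, so the new samples fill in the region the class already occupies rather than extending it.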

The next cell checks the type, shape, and unique values of `y_train` to confirm that the labels are numerically encoded and ready for resampling.

In [None]:
print(type(y_train))
print(y_train.shape)
print(np.unique(y_train)[:10])   # show first 10 unique labels


<class 'numpy.ndarray'>
(45328,)
[0 1 2 3 4 5 6]


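The training labels (`y_train`) are passed through the label encoder once more. Since they are already the integers 0 to 6, the values do not change; the step only guarantees an integer dtype. Refitting does replace the encoder's `classes_` with these integers, so the original attack names must be read from the encoder before this cell. SMOTE is then applied to the training set only, which keeps synthetic samples out of the test set.

Resampling raises class 4 to 1500 samples. The other classes keep their original counts, so the training set becomes less skewed rather than fully balanced.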

In [None]:
y_train =labelencoder.fit_transform(y_train)
X_train, y_train = smote.fit_resample(X_train, y_train)

Displays the new class distribution in the training dataset after SMOTE.

Class 4 should now have 1500 samples, with every other class count unchanged.

In [None]:
pd.Series(y_train).value_counts()

0    18184
3    15228
5     6357
2     2213
6     1744
1     1573
4     1500
Name: count, dtype: int64