<a href="https://colab.research.google.com/github/singh-azad/project/blob/main/ddos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Detection of DDoS Attacks Using Attention-Based Machine Learning**

## **Libraries imported**

In [1]:
import pandas as pd
import numpy as np
import time
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score



## **Dataset**

In [2]:
df = pd.read_csv('/kaggle/input/ddos-dataset/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv')

## **Data Preprocessing**

**Converting IP addresses into integers**

In [3]:
import ipaddress

In [4]:
df['NumericalSourceIP'] = df[' Source IP'].apply(lambda x: int(ipaddress.IPv4Address(x)))
df['NumericalDestinationIP'] = df[' Destination IP'].apply(lambda x: int(ipaddress.IPv4Address(x)))
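
As a quick sanity check of the conversion above, here is a minimal sketch using only the standard library. The sample addresses are made up for illustration and are not taken from the dataset.

```python
import ipaddress

# Hypothetical sample addresses, not taken from the dataset
samples = ["192.168.10.5", "10.0.0.1"]

for ip in samples:
    as_int = int(ipaddress.IPv4Address(ip))      # dotted quad -> 32-bit integer
    back = str(ipaddress.IPv4Address(as_int))    # integer -> dotted quad
    print(ip, as_int, back)

first = int(ipaddress.IPv4Address(samples[0]))
second = int(ipaddress.IPv4Address(samples[1]))
```

The integer is just the 32-bit address read as a number, so numerically close values share a prefix but do not otherwise carry a learned meaning. `ipaddress.IPv4Address` raises `AddressValueError` on anything that is not a valid IPv4 string, so malformed rows would need handling first.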

**Converting the timestamp to Unix epoch seconds (a float) and storing it in 'Timestamp'**

In [5]:
# Cast the parsed datetimes to nanoseconds since the epoch, then divide by 1e9 to get seconds
df['Timestamp'] = pd.to_datetime(df[' Timestamp']).astype('int64') / 10**9
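
The same conversion can be sketched without an integer cast, by subtracting the epoch and dividing by one second. The timestamp strings and the day-first `format` below are assumptions for illustration; the dataset's actual format may differ.

```python
import pandas as pd

# Hypothetical timestamps in a day-first format; the dataset's real format may differ
raw = pd.Series(["03/07/2017 15:56", "03/07/2017 15:57"])

# An explicit format avoids ambiguity between day-first and month-first dates
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

# Seconds since the Unix epoch, computed without relying on integer casts
epoch_seconds = (parsed - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
print(epoch_seconds.tolist())
```

Passing `format` explicitly is worth considering whenever the day and month could be confused, since a silent swap would shift some rows by months.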

**Columns in the dataframe before dropping**

In [6]:
df.columns

Index(['Flow ID', ' Source IP', ' Source Port', ' Destination IP',
       ' Destination Port', ' Protocol', ' Timestamp', ' Flow Duration',
       ' Total Fwd Packets', ' Total Backward Packets',
       'Total Length of Fwd Packets', ' Total Length of Bwd Packets',
       ' Fwd Packet Length Max', ' Fwd Packet Length Min',
       ' Fwd Packet Length Mean', ' Fwd Packet Length Std',
       'Bwd Packet Length Max', ' Bwd Packet Length Min',
       ' Bwd Packet Length Mean', ' Bwd Packet Length Std', 'Flow Bytes/s',
       ' Flow Packets/s', ' Flow IAT Mean', ' Flow IAT Std', ' Flow IAT Max',
       ' Flow IAT Min', 'Fwd IAT Total', ' Fwd IAT Mean', ' Fwd IAT Std',
       ' Fwd IAT Max', ' Fwd IAT Min', 'Bwd IAT Total', ' Bwd IAT Mean',
       ' Bwd IAT Std', ' Bwd IAT Max', ' Bwd IAT Min', 'Fwd PSH Flags',
       ' Bwd PSH Flags', ' Fwd URG Flags', ' Bwd URG Flags',
       ' Fwd Header Length', ' Bwd Header Length', 'Fwd Packets/s',
       ' Bwd Packets/s', ' Min Packet Length', ' Max Pa

**Dropping the columns Flow ID, Source IP, Destination IP, Timestamp**

In [7]:
columns_to_drop = ['Flow ID', ' Source IP', ' Destination IP', ' Timestamp']
df = df.drop(columns_to_drop, axis=1)

**After dropping the columns**

In [8]:
df.columns

Index([' Source Port', ' Destination Port', ' Protocol', ' Flow Duration',
       ' Total Fwd Packets', ' Total Backward Packets',
       'Total Length of Fwd Packets', ' Total Length of Bwd Packets',
       ' Fwd Packet Length Max', ' Fwd Packet Length Min',
       ' Fwd Packet Length Mean', ' Fwd Packet Length Std',
       'Bwd Packet Length Max', ' Bwd Packet Length Min',
       ' Bwd Packet Length Mean', ' Bwd Packet Length Std', 'Flow Bytes/s',
       ' Flow Packets/s', ' Flow IAT Mean', ' Flow IAT Std', ' Flow IAT Max',
       ' Flow IAT Min', 'Fwd IAT Total', ' Fwd IAT Mean', ' Fwd IAT Std',
       ' Fwd IAT Max', ' Fwd IAT Min', 'Bwd IAT Total', ' Bwd IAT Mean',
       ' Bwd IAT Std', ' Bwd IAT Max', ' Bwd IAT Min', 'Fwd PSH Flags',
       ' Bwd PSH Flags', ' Fwd URG Flags', ' Bwd URG Flags',
       ' Fwd Header Length', ' Bwd Header Length', 'Fwd Packets/s',
       ' Bwd Packets/s', ' Min Packet Length', ' Max Packet Length',
       ' Packet Length Mean', ' Packet Length Std',

**Deleting infinite values and null values**

In [9]:
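# Remove rows with an infinite 'Flow Bytes/s', then rows with any missing value
df = df[~np.isinf(df['Flow Bytes/s'])]
df = df.dropna()  # reassign instead of inplace=True to avoid modifying a filtered view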
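
The cell above checks a single column for infinities. If other rate columns can also contain them, one possible variant is to treat every infinity as missing and drop incomplete rows in one pass. This is a sketch on a tiny made-up frame, not the dataset itself.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the flow table
toy = pd.DataFrame({
    "Flow Bytes/s": [10.0, np.inf, 5.0, 7.0],
    " Flow Packets/s": [1.0, 2.0, -np.inf, 3.0],
    " Flow Duration": [100.0, 200.0, 300.0, np.nan],
})

# Treat +/-inf as missing in every column, then drop incomplete rows
cleaned = toy.replace([np.inf, -np.inf], np.nan).dropna()
print(cleaned)
```

Only the first row survives here, because each of the other rows has an infinity or a NaN in some column.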

In [10]:
X = df.drop(' Label', axis=1)
y = df[' Label']

**Splitting the dataset into *train* and *test* sets**

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
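
The split above is purely random. If keeping the same class ratio in both parts matters, `train_test_split` also accepts a `stratify` argument. The sketch below uses synthetic data with an assumed 60/40 class mix to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with an assumed 60/40 class mix
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array(["DDoS"] * 600 + ["BENIGN"] * 400)

# stratify=y_toy keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

ddos_share_train = np.mean(y_tr == "DDoS")
ddos_share_test = np.mean(y_te == "DDoS")
print(ddos_share_train, ddos_share_test)
```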

**Checking the number of feature columns in the dataset**

In [12]:
num_columns = X.shape[1]
print(num_columns)

83


### **Features selection**


>**The best 15 features are selected and stored in X_train_selected and X_test_selected**



In [13]:
k = 15  # Number of features to select
selector = SelectKBest(score_func=f_classif, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

  f = msb / msw
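
The `f = msb / msw` line above appears to be the tail of a scikit-learn warning from the F-test. It typically shows up when a feature is constant, so its within-class variance is zero. One possible way to avoid it is to drop zero-variance columns before scoring. The sketch below uses synthetic data; the pipeline is an illustration, not the notebook's method.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1], 100)
X_toy = np.column_stack([
    y_toy + rng.normal(scale=0.1, size=200),  # informative feature
    rng.normal(size=200),                     # noise feature
    np.zeros(200),                            # constant feature (zero variance)
])

# Drop zero-variance columns first, then keep the single best remaining feature
pipe = make_pipeline(VarianceThreshold(threshold=0.0), SelectKBest(f_classif, k=1))
X_sel = pipe.fit_transform(X_toy, y_toy)

kept_by_variance = pipe.named_steps["variancethreshold"].get_support()
print(kept_by_variance, X_sel.shape)
```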




> **Selected features**



In [14]:
selected_feature_names = X.columns[selector.get_support()]
selected_feature_names

Index([' Destination Port', ' Protocol', 'Bwd Packet Length Max',
       ' Bwd Packet Length Mean', ' Bwd Packet Length Std',
       ' Min Packet Length', ' Max Packet Length', ' Packet Length Mean',
       ' Packet Length Std', ' Packet Length Variance', ' URG Flag Count',
       ' Average Packet Size', ' Avg Bwd Segment Size',
       ' min_seg_size_forward', 'NumericalDestinationIP'],
      dtype='object')
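
To see how strongly each feature was scored, and not only which ones were kept, the fitted selector's `scores_` attribute can be paired with the column names. The sketch below uses a small synthetic frame; the column names `strong`, `weak`, and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
y_toy = np.repeat(["BENIGN", "DDoS"], 150)
is_ddos = (y_toy == "DDoS").astype(float)

# Hypothetical columns with decreasing dependence on the label
X_toy = pd.DataFrame({
    "strong": is_ddos * 3 + rng.normal(size=300),
    "weak": is_ddos * 0.5 + rng.normal(size=300),
    "noise": rng.normal(size=300),
})

sel = SelectKBest(score_func=f_classif, k=2).fit(X_toy, y_toy)

# ANOVA F-scores per feature, highest first
ranking = pd.Series(sel.scores_, index=X_toy.columns).sort_values(ascending=False)
print(ranking)
```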

## **KNN**

---



> **Model**


In [52]:
from sklearn.neighbors import KNeighborsClassifier

# Time only the fit step (prediction is timed separately, if at all)
res1 = time.time()

# Create a K-Nearest Neighbors classifier with default settings (n_neighbors=5)
knn = KNeighborsClassifier()
knn.fit(X_train_selected, y_train)

res2 = time.time()

print('KNN took ',res2-res1,'seconds')

KNN took  0.7072451114654541 seconds
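
`StandardScaler` is imported at the top of the notebook but is not applied before KNN. Because KNN relies on distances, features with large numeric ranges can dominate the result. The sketch below shows one possible way to add scaling with a pipeline; it uses synthetic data, so the accuracies say nothing about the DDoS dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1], 200)
X_toy = np.column_stack([
    y_toy + rng.normal(scale=0.2, size=400),  # informative, small range
    rng.normal(scale=1e6, size=400),          # uninformative, huge range
])

# Even rows for training, odd rows for testing (both classes in each half)
X_fit, y_fit = X_toy[::2], y_toy[::2]
X_eval, y_eval = X_toy[1::2], y_toy[1::2]

unscaled = KNeighborsClassifier().fit(X_fit, y_fit)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_fit, y_fit)

acc_unscaled = unscaled.score(X_eval, y_eval)
acc_scaled = scaled.score(X_eval, y_eval)
print(acc_unscaled, acc_scaled)
```

Wrapping the scaler in a pipeline means it is fitted on the training rows only, and the same transformation is then reused at prediction time.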


> **Validation**

In [16]:
#validating with kfold method
# Define the number of folds (K)
k = 5

# Create a K-Fold cross-validator
kf = KFold(n_splits=k)

# Perform K-fold cross-validation
scores = cross_val_score(knn, X_train_selected, y_train, cv=kf)

# Print the accuracy for each fold
for fold_idx, score in enumerate(scores):
    print(f"Fold {fold_idx + 1} accuracy: {score}")

# Compute the mean accuracy across all folds
mean_accuracy = np.mean(scores)

print(f"\nMean accuracy: {mean_accuracy}")

Fold 1 accuracy: 0.9999169297225453
Fold 2 accuracy: 0.9997784792601208
Fold 3 accuracy: 0.9998061693526057
Fold 4 accuracy: 0.9999169274222579
Fold 5 accuracy: 0.9997230914075264

Mean accuracy: 0.9998283194330112
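
`KFold(n_splits=k)` splits the rows in their existing order, without shuffling. A possible alternative is `StratifiedKFold` with shuffling, which keeps the class ratio in every fold. The sketch below also reports the standard deviation across folds. It runs on synthetic data from `make_classification`, so the scores are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the selected training features
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

# Shuffled, class-balanced folds with a fixed seed for repeatability
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_toy = cross_val_score(KNeighborsClassifier(), X_toy, y_toy, cv=skf)

print(f"Mean accuracy: {scores_toy.mean():.4f} (+/- {scores_toy.std():.4f})")
```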


> **Testing Model**

In [53]:
#testing 

y_pred1 = knn.predict(X_test_selected)

print('Accuracy score= {:.8f}'.format(knn.score(X_test_selected, y_test)))

Accuracy score= 0.99986709


> **Precision**

In [18]:
# Calculate the precision score
precision = precision_score(y_test, y_pred1, average='weighted')
print("Precision score:", precision)

Precision score: 0.9998671016843568


> **Recall**

In [19]:
# Calculate recall
recall = recall_score(y_test , y_pred1, pos_label='DDoS')

print("Recall:", recall)

Recall: 0.9999611257969212


> **F1 Score**

In [20]:
# Calculate the F1 score
f1 = f1_score(y_test, y_pred1, pos_label='DDoS')

print("F1 Score:", f1)

F1 Score: 0.9998833864572806



> **Confusion Matrix**



In [21]:
# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred1)

# Print the confusion matrix
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[19414     5]
 [    1 25723]]
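
The scores reported for KNN can be recomputed by hand from this confusion matrix. The sketch assumes scikit-learn's sorted label order, so index 0 is the non-attack class (assumed to be labelled `BENIGN`) and index 1 is `DDoS`. The recall it produces matches the value printed earlier, which supports that reading.

```python
import numpy as np

# KNN confusion matrix from the output above: rows = true labels, columns = predictions.
# Assumed order: index 0 = 'BENIGN' (assumed name), index 1 = 'DDoS'.
cm_knn = np.array([[19414, 5],
                   [1, 25723]])

tn, fp, fn, tp = cm_knn.ravel()

accuracy_knn = (tp + tn) / cm_knn.sum()   # all correct / all samples
precision_ddos = tp / (tp + fp)           # predicted DDoS that really were DDoS
recall_ddos = tp / (tp + fn)              # real DDoS flows that were caught
f1_ddos = 2 * precision_ddos * recall_ddos / (precision_ddos + recall_ddos)

print(accuracy_knn, precision_ddos, recall_ddos, f1_ddos)
```

The DDoS-only precision computed here is not the same number as the weighted precision printed in the Precision cell, because `average='weighted'` averages the per-class precisions by class frequency.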


## **Random Forest**



> **Model** 



In [22]:
res1 = time.time()

rf = RandomForestClassifier()
rf.fit(X_train_selected , y_train)
res2 = time.time()
print('RandomForest  took ',res2-res1,'seconds')

RandomForest  took  10.997257471084595 seconds
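
`RandomForestClassifier()` is created without a `random_state`, so repeated runs can give slightly different forests. The sketch below shows how fixing the seed makes two fits identical, and how `feature_importances_` can be read afterwards. The data is synthetic and the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"feature_{i}" for i in range(X_toy.shape[1])]  # hypothetical names

# Two forests with the same seed should be identical
rf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_toy, y_toy)
rf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_toy, y_toy)
same = np.allclose(rf_a.feature_importances_, rf_b.feature_importances_)

# Impurity-based importances, highest first
importances = pd.Series(rf_a.feature_importances_, index=names).sort_values(ascending=False)
print(same)
print(importances)
```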




> **Validation**



In [23]:
#validating with kfold method
# Define the number of folds (K)
k = 5

# Create a K-Fold cross-validator
kf = KFold(n_splits=k)

# Perform K-fold cross-validation
scores = cross_val_score(rf, X_train_selected, y_train, cv=kf)

# Print the accuracy for each fold
for fold_idx, score in enumerate(scores):
    print(f"Fold {fold_idx + 1} accuracy: {score}")

# Compute the mean accuracy across all folds
mean_accuracy = np.mean(scores)

print(f"\nMean accuracy: {mean_accuracy}")

Fold 1 accuracy: 0.9999446198150301
Fold 2 accuracy: 0.9999169297225453
Fold 3 accuracy: 1.0
Fold 4 accuracy: 0.9999169274222579
Fold 5 accuracy: 0.9999723091407526

Mean accuracy: 0.9999501572201173




> **Testing Model**



In [24]:
y_pred1 = rf.predict(X_test_selected)

#Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred1)
print("Accuracy score:", accuracy)


Accuracy score: 0.9999778481713666




> **Precision**



In [25]:
# Calculate the precision score
precision = precision_score(y_test, y_pred1, average='weighted')
print("Precision score:", precision)


Precision score: 0.9999778490324678




> **Recall**



In [26]:
# Calculate recall
recall = recall_score(y_test , y_pred1, pos_label='DDoS')

print("Recall:", recall)

Recall: 1.0




> **F1 Score**



In [27]:
# Calculate the F1 score
f1 = f1_score(y_test, y_pred1, pos_label='DDoS')

print("F1 Score: ", f1)

F1 Score:  0.9999805632762541




> **Confusion Matrix**



In [28]:
# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred1)

# Print the confusion matrix
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[19418     1]
 [    0 25724]]


## **Gaussian Naive Bayes**



> **Model** 



In [29]:
from sklearn.naive_bayes import GaussianNB

res1 = time.time()

# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()

gnb.fit(X_train_selected , y_train)

res2 = time.time()

print('GNB  took ',res2-res1,'seconds')

GNB  took  0.25122570991516113 seconds




> **Validation**



In [30]:
#validating with kfold method
# Define the number of folds (K)
k = 5

# Create a K-Fold cross-validator
kf = KFold(n_splits=k)

# Perform K-fold cross-validation
scores = cross_val_score(gnb, X_train_selected, y_train, cv=kf)

# Print the accuracy for each fold
for fold_idx, score in enumerate(scores):
    print(f"Fold {fold_idx + 1} accuracy: {score}")

# Compute the mean accuracy across all folds
mean_accuracy = np.mean(scores)

print(f"\nMean accuracy: {mean_accuracy}")

Fold 1 accuracy: 0.808218419449521
Fold 2 accuracy: 0.8068892950102453
Fold 3 accuracy: 0.8086060807443097
Fold 4 accuracy: 0.8079085094010467
Fold 5 accuracy: 0.8059978401129787

Mean accuracy: 0.8075240289436204




> **Testing Model**



In [31]:
y_pred1 = gnb.predict(X_test_selected)

#Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred1)
print("Accuracy score:", accuracy)


Accuracy score: 0.8099816139822342




> **Precision**



In [32]:
# Calculate the precision score
precision = precision_score(y_test, y_pred1, average='weighted')
print("Precision score:", precision)


Precision score: 0.857500059415748




> **Recall**



In [33]:
# Calculate recall
recall = recall_score(y_test , y_pred1, pos_label='DDoS')

print("Recall:", recall)

Recall: 1.0




> **F1 Score**



In [34]:
# Calculate the F1 score
f1 = f1_score(y_test, y_pred1, pos_label='DDoS')

print("F1 Score: ", f1)

F1 Score:  0.8570952587212207




> **Confusion Matrix**



In [35]:
# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred1)

# Print the confusion matrix
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[10841  8578]
 [    0 25724]]
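
This matrix shows 8578 non-attack flows predicted as DDoS. The weighted precision reported above (about 0.8575) therefore differs from the precision of the `DDoS` class alone. The sketch below rebuilds label arrays from the four counts to show both numbers. The name `BENIGN` for the non-attack class is an assumption.

```python
import numpy as np
from sklearn.metrics import precision_score

# (true label, predicted label) -> count, taken from the confusion matrix above.
# 'BENIGN' as the name of the non-attack class is an assumption.
counts = {
    ("BENIGN", "BENIGN"): 10841,
    ("BENIGN", "DDoS"): 8578,
    ("DDoS", "BENIGN"): 0,
    ("DDoS", "DDoS"): 25724,
}

# Expand the counts into one true/predicted label per sample
y_true = np.concatenate([np.repeat(t, n) for (t, p), n in counts.items()])
y_hat = np.concatenate([np.repeat(p, n) for (t, p), n in counts.items()])

weighted_precision = precision_score(y_true, y_hat, average="weighted")
ddos_precision = precision_score(y_true, y_hat, pos_label="DDoS")

print("Weighted precision:", weighted_precision)
print("DDoS precision:", ddos_precision)
```

Roughly one in four DDoS alarms from this model would be a false alarm, which the weighted figure hides because the non-attack class has a precision of 1.0.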


## **Ensemble Model**

In [54]:
from sklearn.ensemble import VotingClassifier
# Create the ensemble model using VotingClassifier

res1 = time.time()

ensemble_model = VotingClassifier(
    estimators=[('knnear', knn), ('random', rf), ('gauss', gnb)],
    voting='soft'  # 'soft' averages the predicted class probabilities; 'hard' takes a majority vote of the predicted labels
)

# Train the ensemble model
ensemble_model.fit(X_train_selected, y_train)

res2 = time.time()
print('Ensemble Model took ',res2-res1,'seconds')

# Make predictions using the ensemble model
ensemble_predictions = ensemble_model.predict(X_test_selected)

# Calculate evaluation metrics for the ensemble model
accuracy = accuracy_score(y_test, ensemble_predictions)
precision = precision_score(y_test, ensemble_predictions, pos_label='DDoS')
recall = recall_score(y_test, ensemble_predictions, pos_label='DDoS')

print("Ensemble Model Accuracy:", accuracy)
print("Ensemble Model Precision:", precision)
print("Ensemble Model Recall:", recall)


Ensemble Model took  11.276801347732544 seconds
Ensemble Model Accuracy: 0.9998892408568327
Ensemble Model Precision: 0.9998056667573555
Ensemble Model Recall: 1.0


In [44]:
f1 = f1_score(y_test,ensemble_predictions, pos_label='DDoS')

print("F1 Score: ", f1)

F1 Score:  0.9999028239364081
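
To make the soft-voting step concrete, the sketch below checks that a `VotingClassifier` with `voting='soft'` and no weights predicts the class with the highest average probability across its fitted members. It uses synthetic data and smaller models than the notebook, so it illustrates the mechanism only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("knnear", KNeighborsClassifier()),
        ("random", RandomForestClassifier(n_estimators=25, random_state=0)),
        ("gauss", GaussianNB()),
    ],
    voting="soft",
).fit(X_toy, y_toy)

# Average the fitted members' class probabilities by hand
manual = np.mean([est.predict_proba(X_toy) for est in vote.estimators_], axis=0)
matches = np.allclose(manual, vote.predict_proba(X_toy))

# The predicted label is the class with the highest averaged probability
manual_labels = vote.classes_[manual.argmax(axis=1)]
print(matches, (manual_labels == vote.predict(X_toy)).all())
```

`VotingClassifier.fit` trains fresh clones of the estimators it is given, which is consistent with the ensemble's fit time above being close to the Random Forest's own.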
