# Table of Contents
<hr>

[1. Introduction](#introduction)  
[2. Data Dictionary](#dictionary)  
[3. Question of Interest](#question)  
[4. Load Data](#load)  
[5. Data Cleaning](#cleaning)  
- Check Datatypes and Missing Data  
- Checking for Duplicate Data  


## 1. Introduction <a name="introduction"></a>
<hr>

Traditional methods of detecting distributed denial-of-service (DDoS) attacks often rely on predefined rules and signatures, which can struggle to keep pace with attackers' evolving tactics. As a result, there is a growing need for more adaptive approaches to identifying and mitigating these threats. This project applies machine learning techniques to detect DDoS attacks by analyzing network traffic data captured with Wireshark, a widely used network protocol analyzer.

## 2. Data Dictionary <a name="dictionary"></a>
<hr>

<table>
    <tr>
        <th>Field Name</th>
        <th>Description</th>
        <th>Data Type</th>
        <th>Data Source</th>
    </tr>
    <tr>
        <td>SRC_ADD</td>
        <td>Source (client) IP address</td>
        <td>string</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>DES_ADD</td>
        <td>Destination (server) IP address</td>
        <td>string</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_ID</td>
        <td>Unique identifier of a network packet</td>
        <td>string</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>FROM_NODE</td>
        <td>Source intermediate node</td>
        <td>string</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>TO_NODE</td>
        <td>Destination intermediate node</td>
        <td>string</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_TYPE</td>
        <td>Format or structure of a data packet used in network communication</td>
        <td>string</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_SIZE</td>
        <td>Size of the network packet in bytes</td>
        <td>numeric</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>FLAGS</td>
        <td>Specific bits within a packet's header that control or indicate how the packet should be processed</td>
        <td>string</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>FID</td>
        <td>Flow Identifier. A unique identifier associated with a flow of network traffic</td>
        <td>numeric</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>SEQ_NUMBER</td>
        <td>Sequence number of a TCP packet within a TCP connection</td>
        <td>numeric</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>NUMBER_OF_PKT</td>
        <td>The total number of packets transmitted within a flow or connection</td>
        <td>numeric</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>NUMBER_OF_BYTE</td>
        <td>The total number of bytes transmitted within a flow or connection</td>
        <td>numeric</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>NODE_NAME_FROM</td>
        <td>Name of the intermediary node sending packets</td>
        <td>string</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>NODE_NAME_TO</td>
        <td>Name of the intermediary node receiving packets</td>
        <td>string</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_IN</td>
        <td>Time when a packet was sent from a node</td>
        <td>datetime</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_OUT</td>
        <td>Time when a packet was received by a node</td>
        <td>datetime</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_R</td>
        <td>Unknown; values are similar to PKT_RESEVED_TIME</td>
        <td>datetime</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_DELAY_NODE</td>
        <td>Travel time to a specific node</td>
        <td>numeric</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_RATE</td>
        <td>Packet transmission rate, measured in packets per second</td>
        <td>numeric</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>BYTE_RATE</td>
        <td>The rate at which bytes are transmitted or received, measured in bytes per second</td>
        <td>numeric</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_AVG_SIZE</td>
        <td>Average size of data per packet</td>
        <td>numeric</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>UTILIZATION</td>
        <td>A degree to which the network link is being used</td>
        <td>numeric</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_DELAY</td>
        <td>Total travel time from source to destination</td>
        <td>numeric</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_SEND_TIME</td>
        <td>Time at which a packet was sent from source</td>
        <td>datetime</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_RESEVED_TIME</td>
        <td>Time at which a packet was received at destination</td>
        <td>datetime</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>FIRST_PKT_SENT</td>
        <td>Time when the first packet was sent</td>
        <td>datetime</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>LAST_PKT_RESEVED</td>
        <td>Time when the last packet was received</td>
        <td>datetime</td>
        <td>Wireshark</td>
    </tr>
    <tr>
        <td>PKT_CLASS</td>
        <td>Target value indicating class of the packet</td>
        <td>string</td>
        <td>Wireshark</td>
    </tr>
</table>

#### Field of interest
- `PKT_CLASS`: the target variable, indicating the class of each packet (for example, `Normal` or `UDP-Flood`).

## 3. Question of Interest <a name="question"></a>
<hr>
How can we identify malicious traffic that is intended to disrupt the normal operation of a network, service, or website? The goal of such an attack is to overwhelm the target with a flood of traffic, making it inaccessible to legitimate users.

## 4. Load Data <a name="load"></a>
<hr>

In [21]:
import pandas as pd

In [22]:
network_df = pd.read_csv("datasets/final-dataset.csv")

In [23]:
network_df.head()

Unnamed: 0,SRC_ADD,DES_ADD,PKT_ID,FROM_NODE,TO_NODE,PKT_TYPE,PKT_SIZE,FLAGS,FID,SEQ_NUMBER,...,PKT_RATE,BYTE_RATE,PKT_AVG_SIZE,UTILIZATION,PKT_DELAY,PKT_SEND_TIME,PKT_RESEVED_TIME,FIRST_PKT_SENT,LAST_PKT_RESEVED,PKT_CLASS
0,3.0,24.3,389693,21,23,tcp,1540,-------,4,11339,...,328.240918,505490.0,1540.0,0.236321,0.0,35.519662,35.550032,1.0,50.02192,Normal
1,15.0,24.15,201196,23,24,tcp,1540,-------,16,6274,...,328.205808,505437.0,1540.0,0.236337,0.0,20.156478,20.186848,1.0,50.030211,Normal
2,24.15,15.0,61905,23,22,ack,55,-------,16,1930,...,328.206042,18051.3,55.0,0.008441,0.0,7.039952,7.069962,1.030045,50.060221,UDP-Flood
3,24.9,9.0,443135,23,21,ack,55,-------,10,12670,...,328.064183,18043.5,55.0,0.008437,0.0,39.617967,39.647976,1.030058,50.060098,Normal
4,24.8,8.0,157335,23,21,ack,55,-------,9,4901,...,328.113525,18046.2,55.0,0.008438,0.0,16.029803,16.059813,1.030054,50.061864,Normal


## 5. Data Cleaning <a name="cleaning"></a>
<hr>

The dataset is a snapshot of network traffic captured with Wireshark. First, let's take a look at what we're working with and assess how much cleaning and preprocessing is needed.

In [24]:
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [26]:
# How much data are we working with here?
print(f'Our dataframe has {network_df.shape[0]} rows and {network_df.shape[1]} columns.')

Our dataframe has 2160668 rows and 28 columns.


There are `2,160,668` entries and `28` attributes. `PKT_ID` is not a unique row identifier: a packet travels across several network devices (nodes) to reach its destination, and each hop is recorded as a separate entry in the dataset. We will therefore proceed with caution and check for duplicates in the data.
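
The observation that `PKT_ID` repeats across hops can be checked directly. The sketch below uses a small made-up frame (the values are hypothetical, only the column names come from the dataset) and counts how many rows share each `PKT_ID`. On the real data, the same calls would be applied to `network_df`.

```python
import pandas as pd

# Toy stand-in for network_df: hypothetical values, real column names.
toy_df = pd.DataFrame({
    "PKT_ID": [101, 101, 101, 202, 303, 303],
    "FROM_NODE": [3, 21, 23, 15, 24, 23],
    "TO_NODE": [21, 23, 24, 23, 23, 22],
})

# Rows whose PKT_ID has already appeared earlier in the frame.
repeated_rows = int(toy_df.duplicated(subset=["PKT_ID"]).sum())

# Number of rows (hops) recorded for each packet.
hops_per_packet = toy_df.groupby("PKT_ID").size()

print(f"Rows with a repeated PKT_ID: {repeated_rows}")
print(hops_per_packet)
```

A non-zero count here would be expected rather than a data-quality problem, since each hop of a packet is a legitimate row.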

Next steps:

- Check datatypes and formats
- Check for missing data
- Check for duplicate data (is the data unique at the row and column level?)

### Check Datatypes and Missing Data

First we will investigate the structure and format of the data to make sure that nothing is missed.

The `info()` output below shows a mix of numeric (int/float) and non-numeric (object) columns:

In [27]:
network_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2160668 entries, 0 to 2160667
Data columns (total 28 columns):
 #   Column            Dtype  
---  ------            -----  
 0   SRC_ADD           float64
 1   DES_ADD           float64
 2   PKT_ID            int64  
 3   FROM_NODE         int64  
 4   TO_NODE           int64  
 5   PKT_TYPE          object 
 6   PKT_SIZE          int64  
 7   FLAGS             object 
 8   FID               int64  
 9   SEQ_NUMBER        int64  
 10  NUMBER_OF_PKT     int64  
 11  NUMBER_OF_BYTE    int64  
 12  NODE_NAME_FROM    object 
 13  NODE_NAME_TO      object 
 14  PKT_IN            float64
 15  PKT_OUT           float64
 16  PKT_R             float64
 17  PKT_DELAY_NODE    float64
 18  PKT_RATE          float64
 19  BYTE_RATE         float64
 20  PKT_AVG_SIZE      float64
 21  UTILIZATION       float64
 22  PKT_DELAY         float64
 23  PKT_SEND_TIME     float64
 24  PKT_RESEVED_TIME  float64
 25  FIRST_PKT_SENT    float64
 26  LAST_PKT_RESEV
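
To make the split between numeric and non-numeric columns explicit, the columns can be grouped by dtype with `select_dtypes`. This is a minimal sketch on a toy frame whose values are copied from the preview rows above; `toy_df` is a stand-in for `network_df`.

```python
import pandas as pd

# Toy stand-in for network_df, using a few values from the preview rows.
toy_df = pd.DataFrame({
    "SRC_ADD": [3.0, 15.0],
    "PKT_ID": [389693, 201196],
    "PKT_TYPE": ["tcp", "tcp"],
    "PKT_CLASS": ["Normal", "Normal"],
})

# Columns pandas parsed as int or float.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()

# Everything else (here: text columns).
non_numeric_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Non-numeric:", non_numeric_cols)
```

Note that `SRC_ADD` and `DES_ADD` are parsed as `float64`, even though the data dictionary describes them as addresses.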

In [28]:
# Count the number of null or empty values in each column
null_columns = network_df.isnull().sum()  # Count null values in each column
empty_columns = (network_df == '').sum()  # Count empty values in each column

# Total number of null or empty values in each column
total_null_empty_columns = null_columns + empty_columns

print("Number of null or empty values per column:")
print(total_null_empty_columns)

Number of null or empty values per column:
SRC_ADD             0
DES_ADD             0
PKT_ID              0
FROM_NODE           0
TO_NODE             0
PKT_TYPE            0
PKT_SIZE            0
FLAGS               0
FID                 0
SEQ_NUMBER          0
NUMBER_OF_PKT       0
NUMBER_OF_BYTE      0
NODE_NAME_FROM      0
NODE_NAME_TO        0
PKT_IN              0
PKT_OUT             0
PKT_R               0
PKT_DELAY_NODE      0
PKT_RATE            0
BYTE_RATE           0
PKT_AVG_SIZE        0
UTILIZATION         0
PKT_DELAY           0
PKT_SEND_TIME       0
PKT_RESEVED_TIME    0
FIRST_PKT_SENT      0
LAST_PKT_RESEVED    0
PKT_CLASS           0
dtype: int64


<font color='red'> No null or empty values. </font>
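
Comparing against `''` only catches exactly empty strings. As a complementary check, and assuming that whitespace-only strings should also count as missing (an assumption, the notebook does not state this), the sketch below counts nulls and blank strings per column. `count_missing` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

# Toy stand-in for network_df with one empty string, one blank string and one null.
toy_df = pd.DataFrame({
    "PKT_TYPE": ["tcp", "", "  ", "ack"],
    "PKT_SIZE": [1540.0, None, 55.0, 55.0],
})

def count_missing(df):
    """Count nulls plus empty or whitespace-only strings in each column."""
    null_counts = df.isnull().sum()
    blank_counts = pd.Series(0, index=df.columns)
    # Only non-numeric columns can hold strings.
    for col in df.select_dtypes(exclude="number").columns:
        # Strip whitespace so "" and "  " are both treated as blank.
        is_blank = df[col].map(lambda v: isinstance(v, str) and v.strip() == "")
        blank_counts[col] = int(is_blank.sum())
    return null_counts + blank_counts

missing = count_missing(toy_df)
print(missing)
```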

### Checking for Duplicate Data

Now that the data appears sufficiently clean, we will check for duplicate data, first at the row level.

In [29]:
# Checking for duplicates and counting
network_df.duplicated().sum()

0

**There are no duplicate rows in the dataset.**

To be thorough, let us also check that no two columns contain the same information.

In [36]:
def check_unique_columns(df):
    """Print every pair of columns in df whose contents are identical."""
    n = len(df.columns)
    identical_columns = []

    # Compare each pair of columns exactly once (i < j).
    for i in range(n):
        for j in range(i + 1, n):
            col1 = df.iloc[:, i]
            col2 = df.iloc[:, j]
            # Series.equals requires the same values and the same dtype.
            if col1.equals(col2):
                identical_columns.append((df.columns[i], df.columns[j]))

    if identical_columns:
        print("Identical columns found:")
        for col_pair in identical_columns:
            print(f"Columns {col_pair[0]} and {col_pair[1]} are identical.")
    else:
        print("No identical columns found.")

In [37]:
# Compare every pair of columns with Series.equals() and report identical pairs
check_unique_columns(network_df)

No identical columns found.
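
`Series.equals` treats columns of different dtypes as different even when their values match, so an `int64` column and a `float64` column holding the same numbers would not be reported by `check_unique_columns`. In the preview rows, for instance, `PKT_SIZE` (`int64`) and `PKT_AVG_SIZE` (`float64`) show the same values; this says nothing about whether they match over the full dataset. The sketch below is one possible complementary check that compares numeric columns by value. `find_value_identical_columns` is a hypothetical helper, and the toy values are taken from the preview rows.

```python
import numpy as np
import pandas as pd

def find_value_identical_columns(df):
    """Return pairs of columns with equal values, ignoring int/float dtype differences."""
    pairs = []

    # Numeric columns: compare as float64 so 1540 and 1540.0 count as equal.
    num_cols = list(df.select_dtypes(include="number").columns)
    for i, a in enumerate(num_cols):
        for b in num_cols[i + 1:]:
            left = df[a].to_numpy(dtype="float64")
            right = df[b].to_numpy(dtype="float64")
            if np.array_equal(left, right, equal_nan=True):
                pairs.append((a, b))

    # Non-numeric columns: fall back to Series.equals.
    other_cols = list(df.select_dtypes(exclude="number").columns)
    for i, a in enumerate(other_cols):
        for b in other_cols[i + 1:]:
            if df[a].equals(df[b]):
                pairs.append((a, b))

    return pairs

# Toy stand-in for network_df, using values from the preview rows.
toy_df = pd.DataFrame({
    "PKT_SIZE": [1540, 1540, 55],
    "PKT_AVG_SIZE": [1540.0, 1540.0, 55.0],
    "PKT_TYPE": ["tcp", "tcp", "ack"],
})

value_pairs = find_value_identical_columns(toy_df)
strict_equal = toy_df["PKT_SIZE"].equals(toy_df["PKT_AVG_SIZE"])
print(value_pairs, strict_equal)
```

Casting to `float64` can lose precision for very large integers, so any pair reported this way is best confirmed with a direct comparison of the two columns.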