### Summary Statistics

Summary statistics provide a concise overview of a dataset's distribution. Here's a breakdown of the commonly calculated metrics:

**Central Tendency**
- **Mean**: The average value of the dataset. It's calculated by summing all values and dividing by the number of values.
- **Median**: The middle value in a dataset when the values are arranged in ascending order. It's a robust measure that's less sensitive to outliers than the mean.
- **Mode**: The most frequent value in the dataset. A dataset can have one mode (unimodal), several modes (multimodal), or no mode when every value occurs equally often.

**Dispersion**
- **Standard Deviation**: Measures the spread of the data around the mean. A higher standard deviation indicates greater variability.
- **Variance**: The average squared deviation from the mean; equivalently, the square of the standard deviation.
- **Range**: The difference between the maximum and minimum values.
- **Interquartile Range (IQR)**: The difference between the 75th percentile (Q3) and the 25th percentile (Q1). It measures the spread of the middle 50% of the data.
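
As a minimal sketch, the metrics above can be computed with pandas as follows. The values are a toy series made up for illustration, not taken from the dataset. The single large value (40) pulls the mean well above the median, which is why the median is described as the more robust measure. Note that `Series.std()` and `Series.var()` use the sample definition (`ddof=1`) by default.

```python
import pandas as pd

# Toy values chosen for illustration only (not taken from the dataset).
values = pd.Series([2, 2, 3, 4, 5, 5, 5, 40])

# Central tendency
mean = values.mean()               # sum of values / number of values
median = values.median()           # middle value of the sorted data
modes = values.mode().tolist()     # most frequent value(s); may be several

# Dispersion
std = values.std()                 # sample standard deviation (ddof=1)
variance = values.var()            # sample variance, equal to std ** 2
value_range = values.max() - values.min()
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1                      # spread of the middle 50% of the data

print(f"mean={mean}, median={median}, modes={modes}")
print(f"std={std:.3f}, variance={variance:.3f}, range={value_range}, IQR={iqr}")
```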
---

**Category Name: TCP Flag Features**

These are the features we will process in this notebook:
- FIN Flag Count: The number of packets with the FIN flag set.
- SYN Flag Count: The number of packets with the SYN flag set.
- RST Flag Count: The number of packets with the RST flag set.
- PSH Flag Count: The number of packets with the PSH flag set.
- ACK Flag Count: The number of packets with the ACK flag set.
- URG Flag Count: The number of packets with the URG flag set.
- CWR Flag Count: The number of packets with the CWR flag set.
- ECE Flag Count: The number of packets with the ECE flag set.
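
To make the definitions above concrete, here is a hedged sketch of how per-flow flag counts could be derived from the flags byte of each packet's TCP header. The bit masks are the standard TCP header values; the helper `count_flags` and the example flow are hypothetical, and the notebook does not state how the tool that produced the dataset actually computes these columns.

```python
# Standard TCP header flag bit masks.
TCP_FLAGS = {
    "FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08,
    "ACK": 0x10, "URG": 0x20, "ECE": 0x40, "CWR": 0x80,
}

def count_flags(packet_flag_bytes):
    """Count, per flag, how many packets in a flow have that flag set."""
    counts = {name: 0 for name in TCP_FLAGS}
    for flags in packet_flag_bytes:
        for name, mask in TCP_FLAGS.items():
            if flags & mask:
                counts[name] += 1
    return counts

# Hypothetical flow: SYN, SYN+ACK, ACK, PSH+ACK, FIN+ACK, FIN+ACK
example_flow = [0x02, 0x12, 0x10, 0x18, 0x11, 0x11]
flag_counts = count_flags(example_flow)
print(flag_counts)
```

In this made-up flow, the handshake and teardown each contribute one flagged packet per direction, giving a SYN count of 2 and a FIN count of 2.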

**Category Name: Flow Direction and Rate**
- Down/Up Ratio: The ratio of bytes transferred in the downward direction to bytes transferred in the upward direction.
- Average Packet Size: The average size of packets in the flow.
- Fwd Segment Size Avg: The average size of forward-direction segments.
- Bwd Segment Size Avg: The average size of backward-direction segments.
- Fwd Bytes/Bulk Avg, Bwd Bytes/Bulk Avg: The average number of bytes transferred per bulk of packets in the forward and backward directions.
- Fwd Packet/Bulk Avg, Bwd Packet/Bulk Avg: The average number of packets per bulk in the forward and backward directions.
- Fwd Bulk Rate Avg, Bwd Bulk Rate Avg: The average rate of bulk transfers in the forward and backward directions.



In [1]:
# Importing important libraries 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os 
import json 

In [2]:
# Functions 
# Update this code block to add more functions as per need
def read_data(data_path: str, usecols: list) -> pd.DataFrame:
    """
    Read only the columns listed in `usecols` from the CSV file at `data_path`
    and return them as a DataFrame.
    """
    data = pd.read_csv(data_path, usecols=usecols)
    return data

**Data**

In [4]:
# Read data from filepath 
# This file path is for temporary usage (For EDA)
# It will be removed once the task is completed 

file_path = "dataset.csv"
# Columns in this category
columns_to_read = ['FIN Flag Count', 'SYN Flag Count',
       'RST Flag Count', 'PSH Flag Count', 'ACK Flag Count', 'URG Flag Count',
       'CWR Flag Count', 'ECE Flag Count', 'Down/Up Ratio',
       'Average Packet Size', 'Fwd Segment Size Avg', 'Bwd Segment Size Avg',
       'Fwd Bytes/Bulk Avg', 'Fwd Packet/Bulk Avg', 'Fwd Bulk Rate Avg',
       'Bwd Bytes/Bulk Avg', 'Bwd Packet/Bulk Avg', 'Bwd Bulk Rate Avg']

# Read the data
try:
    data = read_data(file_path, usecols=columns_to_read)
    print(f"Successfully read {len(data.columns)} features")
except FileNotFoundError:
    # Catch only the missing-file case so other errors (e.g. a wrong column name) are not masked
    print("File not found!")
    print("[INFO] Please place the dataset.csv in the directory for use!")

Successfully read 18 features


**Analysis**

In [5]:
# Pandas Describe 
data.describe()

Statistic,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWR Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg
count,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0
mean,1.674717,1.998901,0.02425827,154.864,233.1495,0.0,0.0,0.0,0.7773432,216.4166,189.3546,227.1624,8967.764,9.654493,844519.5,6100.337,1.420971,3842389.0
std,0.736063,0.8149936,0.2063659,876.6922,1262.182,0.0,0.0,0.0,0.281642,626.0783,359.7574,1211.658,41376.66,31.29977,17379150.0,91589.71,19.09298,70774320.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,2.0,0.0,2.0,9.0,0.0,0.0,0.0,0.6894977,79.0,58.33333,29.75,0.0,0.0,0.0,0.0,0.0,0.0
50%,2.0,2.0,0.0,2.0,10.0,0.0,0.0,0.0,0.8333333,82.09091,87.5,71.4,0.0,0.0,0.0,0.0,0.0,0.0
75%,2.0,2.0,0.0,3.0,12.0,0.0,0.0,0.0,1.0,126.0,147.2857,71.4,0.0,0.0,0.0,0.0,0.0,0.0
max,13.0,16.0,30.0,11612.0,32586.0,0.0,0.0,0.0,30.0,18144.9,17151.53,34930.59,1756019.0,735.0,2534442000.0,16063080.0,5986.0,3519127000.0


> The URG, CWR, and ECE Flag Count columns contain only zeros (their min, max, and std are all 0), so they carry no information and should be dropped.
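
One way to drop such columns without hard-coding their names is to remove every column that holds a single unique value. The sketch below is an assumption about how this step could be done, not the notebook's own preprocessing; `drop_constant_columns` is a hypothetical helper and the small frame is toy data.

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns that hold a single unique value."""
    constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    return df.drop(columns=constant_cols)

# Toy frame for illustration only (not the real dataset).
toy = pd.DataFrame({
    "SYN Flag Count": [2, 0, 2],
    "URG Flag Count": [0, 0, 0],
    "CWR Flag Count": [0, 0, 0],
})
cleaned = drop_constant_columns(toy)
print(list(cleaned.columns))
```

Applied to the notebook's frame, the equivalent call would be `drop_constant_columns(data)`.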

In [9]:
# unique 
data.nunique()

FIN Flag Count              14
SYN Flag Count              16
RST Flag Count              26
PSH Flag Count            6373
ACK Flag Count            9692
URG Flag Count               1
CWR Flag Count               1
ECE Flag Count               1
Down/Up Ratio            89549
Average Packet Size     467620
Fwd Segment Size Avg    409628
Bwd Segment Size Avg    139441
Fwd Bytes/Bulk Avg      107333
Fwd Packet/Bulk Avg        618
Fwd Bulk Rate Avg       277882
Bwd Bytes/Bulk Avg       44050
Bwd Packet/Bulk Avg        648
Bwd Bulk Rate Avg       116820
dtype: int64

**Value Counts**

In [10]:
data['FIN Flag Count'].value_counts()

FIN Flag Count
2     2689445
0      514068
1       26233
3        1158
4         209
5          99
6          95
7          76
8          61
9          27
13          1
12          1
10          1
11          1
Name: count, dtype: int64

In [11]:
data['SYN Flag Count'].value_counts()

SYN Flag Count
2     2763708
0      211755
4      183140
1       46428
8        9833
3        9637
7        2837
6        2223
5        1485
12        244
10         80
9          63
11         32
16          6
13          2
14          2
Name: count, dtype: int64

In [12]:
data['RST Flag Count'].value_counts()

RST Flag Count
0     3167203
1       53332
2       10336
3         151
7          64
4          62
5          60
6          55
8          49
10         24
12         21
9          21
11         17
18         12
14         12
15         11
16         11
13         10
19          6
17          5
20          4
21          3
22          2
24          2
30          1
28          1
Name: count, dtype: int64

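> Most flows contain exactly 2 packets with the FIN flag set (2,689,445 of 3,231,475, about 83%) and exactly 2 packets with the SYN flag set (2,763,708, about 86%).

> The RST flag is rarely set: 3,167,203 flows (about 98%) contain no RST packets.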
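
Shares like these can be read directly from `value_counts(normalize=True)`, which returns relative frequencies instead of raw counts. The series below is toy data for illustration; in the notebook the same call would be made on a column such as `data['FIN Flag Count']`.

```python
import pandas as pd

# Toy column for illustration only (not the real dataset).
fin_counts = pd.Series([2, 2, 2, 0, 2, 1, 2, 0, 2, 2], name="FIN Flag Count")

# Relative frequency of each distinct value, sorted from most to least common
shares = fin_counts.value_counts(normalize=True)

top_value = shares.idxmax()   # the most common value
top_share = shares.max()      # the fraction of rows holding that value
print(f"Most common value: {top_value} ({top_share:.0%} of rows)")
```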

In [13]:
data.describe()

Statistic,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWR Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg
count,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0,3231475.0
mean,1.674717,1.998901,0.02425827,154.864,233.1495,0.0,0.0,0.0,0.7773432,216.4166,189.3546,227.1624,8967.764,9.654493,844519.5,6100.337,1.420971,3842389.0
std,0.736063,0.8149936,0.2063659,876.6922,1262.182,0.0,0.0,0.0,0.281642,626.0783,359.7574,1211.658,41376.66,31.29977,17379150.0,91589.71,19.09298,70774320.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,2.0,0.0,2.0,9.0,0.0,0.0,0.0,0.6894977,79.0,58.33333,29.75,0.0,0.0,0.0,0.0,0.0,0.0
50%,2.0,2.0,0.0,2.0,10.0,0.0,0.0,0.0,0.8333333,82.09091,87.5,71.4,0.0,0.0,0.0,0.0,0.0,0.0
75%,2.0,2.0,0.0,3.0,12.0,0.0,0.0,0.0,1.0,126.0,147.2857,71.4,0.0,0.0,0.0,0.0,0.0,0.0
max,13.0,16.0,30.0,11612.0,32586.0,0.0,0.0,0.0,30.0,18144.9,17151.53,34930.59,1756019.0,735.0,2534442000.0,16063080.0,5986.0,3519127000.0


> The bulk features (Fwd/Bwd Bytes/Bulk Avg, Packet/Bulk Avg, and Bulk Rate Avg) are 0 up to and including the 75th percentile, so at least 75% of the flows have a value of 0 for each of them, while the maximum values are very large. This zero inflation may need to be handled during data preprocessing.
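
To quantify this before choosing a treatment, the share of zeros per column can be measured directly. The sketch below uses toy data; the binary indicator at the end (the hypothetical column `Fwd Has Bulk`) is only one possible treatment and is an assumption, not a step the notebook prescribes.

```python
import pandas as pd

bulk_cols = ["Fwd Bytes/Bulk Avg", "Bwd Bytes/Bulk Avg"]

# Toy frame for illustration only (not the real dataset).
toy = pd.DataFrame({
    "Fwd Bytes/Bulk Avg": [0.0, 0.0, 0.0, 1200.0],
    "Bwd Bytes/Bulk Avg": [0.0, 0.0, 0.0, 0.0],
})

# Fraction of rows equal to 0 in each bulk column
zero_share = (toy[bulk_cols] == 0).mean()
print(zero_share)

# One possible treatment (an assumption): flag whether any bulk transfer occurred
toy["Fwd Has Bulk"] = (toy["Fwd Bytes/Bulk Avg"] > 0).astype(int)
```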