### Summary Statistics

Summary statistics provide a concise overview of a dataset's distribution. Here's a breakdown of the commonly calculated metrics:

**Central Tendency**
- **Mean**: The average value of the dataset. It's calculated by summing all values and dividing by the number of values.
- **Median**: The middle value in a dataset when the values are arranged in ascending order. It's a robust measure that's less sensitive to outliers than the mean.
- **Mode**: The most frequent value in the dataset. There can be one mode (unimodal), multiple modes (multimodal), or no mode.
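
The three measures above can be sketched with pandas as follows. This is a minimal illustration on a hypothetical toy sample (the values are not taken from the dataset); the large value is included to show how an outlier pulls the mean but not the median.

```python
import pandas as pd

# Hypothetical toy sample; 100 is a deliberate outlier.
values = pd.Series([2, 3, 3, 5, 7, 100])

mean_value = values.mean()      # sum of all values divided by their count
median_value = values.median()  # middle of the sorted values
modes = values.mode().tolist()  # every most-frequent value (can be several)

print("mean:", mean_value)
print("median:", median_value)
print("mode(s):", modes)
```

Note that `Series.mode()` always returns a Series: when several values tie for the highest frequency, all of them are returned, and when every value is unique, all values are returned.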

**Dispersion**
- **Standard Deviation**: Measures the spread of the data around the mean. A higher standard deviation indicates greater variability.
- **Variance**: The average squared deviation from the mean, equal to the square of the standard deviation.
- **Range**: The difference between the maximum and minimum values.
- **Interquartile Range (IQR)**: The difference between the 75th percentile (Q3) and the 25th percentile (Q1). It measures the spread of the middle 50% of the data.
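
As a rough sketch, the dispersion measures above could be computed with pandas like this. The sample is hypothetical, and `ddof=0` (population formulas) is an assumption made here for readable numbers; pandas defaults to `ddof=1` (sample formulas), which is also what `describe()` reports.

```python
import pandas as pd

# Hypothetical toy sample, not taken from the dataset.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

std_value = values.std(ddof=0)             # population standard deviation
var_value = values.var(ddof=0)             # population variance (std squared)
value_range = values.max() - values.min()  # maximum minus minimum
q1, q3 = values.quantile([0.25, 0.75])     # 25th and 75th percentiles
iqr = q3 - q1                              # spread of the middle 50%

print("std:", std_value, "variance:", var_value)
print("range:", value_range, "IQR:", iqr)
```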
---

**Category Name: Header Lengths**

These are the features we will process in this notebook:
- Fwd Header Length, Bwd Header Length: The length of the IP header for forward and backward packets.


**Category Name: Packet Length Features**

These are the features we will process in this notebook:
- Packet Length Min, Max, Mean, Std: The minimum, maximum, mean, and standard deviation of packet lengths for the entire flow.
- Packet Length Variance: The variance of packet lengths, indicating the spread of packet sizes.

In [2]:
# Import the libraries used in this notebook
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os 
import json 

In [3]:
# Helper functions
# Add further helpers to this block as needed
def read_data(data_path: str, usecols: list) -> pd.DataFrame:
    """
    Read the CSV file at data_path and return only the columns listed in usecols.
    """
    data = pd.read_csv(data_path, usecols=usecols)
    return data

**Data**

In [5]:
# Read data from filepath 
# This file path is for temporary usage (For EDA)
# It will be removed once the task is completed 

file_path = "dataset.csv"
# Columns in this category
columns_to_read = ['Fwd Header Length', 'Bwd Header Length',
       'Fwd Packets/s', 'Bwd Packets/s', 'Packet Length Min',
       'Packet Length Max', 'Packet Length Mean', 'Packet Length Std',
       'Packet Length Variance']

# Read the data
try:
    data = read_data(file_path, usecols=columns_to_read)
    print(f"Successfully read {len(data.columns)} features")
except FileNotFoundError:
    # Catch only the missing-file case so other errors (e.g. a wrong column name) stay visible
    print("File not found!")
    print("[INFO] Please place the dataset.csv in the directory for use!")

Successfully read 9 features


**Analysis**

In [6]:
# pandas Describe 
data.describe()

| | Fwd Header Length | Bwd Header Length | Fwd Packets/s | Bwd Packets/s | Packet Length Min | Packet Length Max | Packet Length Mean | Packet Length Std | Packet Length Variance |
|---|---|---|---|---|---|---|---|---|---|
| count | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 |
| mean | 4124.924 | 3392.609 | 2274.623 | 1944.105 | 0.8355488 | 2165.426 | 216.4166 | 463.8876 | 1596198.0 |
| std | 21830.01 | 18768.78 | 5989.855 | 5127.95 | 6.574541 | 4968.92 | 626.0783 | 1175.162 | 22624540.0 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 168.0 | 168.0 | 47.12465 | 42.11342 | 0.0 | 510.0 | 79.0 | 171.2078 | 29312.1 |
| 50% | 200.0 | 168.0 | 154.6671 | 134.2102 | 0.0 | 520.0 | 82.09091 | 184.5085 | 34043.4 |
| 75% | 232.0 | 200.0 | 787.4016 | 644.0809 | 0.0 | 545.0 | 126.0 | 221.3949 | 49015.69 |
| max | 485772.0 | 574920.0 | 333333.3 | 250000.0 | 1348.0 | 64704.0 | 18144.9 | 27519.84 | 757341300.0 |


In [7]:
# Unique values
data.nunique()

Fwd Header Length           18193
Bwd Header Length           21811
Fwd Packets/s             1100889
Bwd Packets/s              904276
Packet Length Min              72
Packet Length Max           23070
Packet Length Mean         504807
Packet Length Std          616971
Packet Length Variance     622424
dtype: int64

> Packet Length Min is highly imbalanced: the minimum packet length is 0 for most flows (3,167,801 of 3,231,475 rows, about 98%).

> Fwd and Bwd Header Length are strongly right-skewed: the 75th percentiles are only 232 and 200, yet the maxima reach 485,772 and 574,920, so the top 25% of values extend into a very high range. These columns need to be normalized.
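
One possible way to approach the normalization mentioned above is a `log1p` transform followed by min-max scaling. This is only a sketch of one option, not the method chosen in this notebook. The toy frame below is an assumption built from the min, quartile, and max values in the `describe()` output, and the names `toy`, `logged`, and `scaled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame built from the five-number summary reported by describe();
# it stands in for the real columns purely for illustration.
toy = pd.DataFrame({
    "Fwd Header Length": [0.0, 168.0, 200.0, 232.0, 485772.0],
    "Bwd Header Length": [0.0, 168.0, 168.0, 200.0, 574920.0],
})

# log(1 + x) keeps zeros at zero and compresses the long right tail.
logged = np.log1p(toy)

# Min-max scaling maps each column onto [0, 1].
scaled = (logged - logged.min()) / (logged.max() - logged.min())

print(scaled.round(3))
```

Without the log step, min-max scaling alone would push almost every value close to 0, because the maximum is several orders of magnitude above the median.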

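> Fwd and Bwd Packets/s also have some values in a very high range (maxima of 333,333.3 and 250,000 against 75th percentiles of 787.4 and 644.1). This may point to a correlation with header lengths, which still needs to be checked.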
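
A correlation between packet rates and header lengths could be checked with a rank-based (Spearman) coefficient, which is less affected by extreme values than Pearson's. The sketch below uses a hypothetical toy frame, so its result says nothing about the real dataset; on the notebook's data the same call would be `data.corr(method="spearman")`.

```python
import pandas as pd

# Hypothetical toy values, chosen only to make the example runnable.
toy = pd.DataFrame({
    "Fwd Header Length": [168, 200, 232, 4000, 90000],
    "Fwd Packets/s": [900.0, 40.0, 150.0, 3.0, 12.0],
})

# Spearman correlation compares the ranks of the values rather than
# the raw values, so a few huge entries do not dominate the result.
corr = toy.corr(method="spearman")
rho = corr.loc["Fwd Header Length", "Fwd Packets/s"]

print(corr)
```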

In [9]:
data['Packet Length Min'].value_counts()

Packet Length Min
0.0      3167801
38.0       24571
55.0       15006
29.0        7548
37.0        4838
          ...   
115.0          1
58.0           1
138.0          1
199.0          1
478.0          1
Name: count, Length: 72, dtype: int64
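
To express the imbalance as a share rather than raw counts, `value_counts` accepts `normalize=True`. The snippet below is a small sketch on a hypothetical toy series; on the notebook's data the equivalent call would be `data['Packet Length Min'].value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical toy series: eight zeros and two non-zero minimum lengths.
toy = pd.Series([0.0] * 8 + [38.0, 55.0], name="Packet Length Min")

# normalize=True returns relative frequencies that sum to 1.
shares = toy.value_counts(normalize=True)
zero_share = shares.loc[0.0]

print(shares)
print("share of zeros:", zero_share)
```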