### Summary Statistics

Summary statistics provide a concise overview of a dataset's distribution. Here's a breakdown of the commonly calculated metrics:

**Central Tendency**
- **Mean**: The average value of the dataset. It's calculated by summing all values and dividing by the number of values.
- **Median**: The middle value in a dataset when the values are arranged in ascending order. It's a robust measure that's less sensitive to outliers than the mean.
- **Mode**: The most frequent value in the dataset. There can be one mode (unimodal), multiple modes (multimodal), or no mode.

**Dispersion**
- **Standard Deviation**: Measures the spread of the data around the mean. A higher standard deviation indicates greater variability.
- **Variance**: The average of the squared deviations from the mean. It equals the square of the standard deviation.
- **Range**: The difference between the maximum and minimum values.
- **Interquartile Range (IQR)**: The difference between the 75th percentile (Q3) and the 25th percentile (Q1). It measures the spread of the middle 50% of the data.
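
The metrics listed above can be computed directly with pandas. The following is a minimal sketch on a small made-up sample (the values are not from the dataset); note that pandas uses the sample standard deviation and variance (`ddof=1`) by default.

```python
import pandas as pd

# Small illustrative sample (made-up values); 100 acts as an outlier
values = pd.Series([2, 4, 4, 5, 7, 9, 100])

q1, q3 = values.quantile(0.25), values.quantile(0.75)

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().tolist(),   # a list, since there may be several modes
    "std": values.std(),              # pandas default: sample std (ddof=1)
    "variance": values.var(),         # sample variance (ddof=1)
    "range": values.max() - values.min(),
    "iqr": q3 - q1,
}

# The outlier pulls the mean well above the median
for name, value in summary.items():
    print(f"{name}: {value}")
```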
---

**Category Name: Packet Size Information**

These are the features we will process in this notebook:
- Total Length of Fwd Packet: The total length of all forward packets.
- Total Length of Bwd Packet: The total length of all backward packets.
- Fwd Packet Length Max, Min, Mean, Std: Maximum, minimum, mean, and standard deviation of forward packet lengths.
- Bwd Packet Length Max, Min, Mean, Std: Maximum, minimum, mean, and standard deviation of backward packet lengths.
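
To make the feature definitions concrete, here is a rough sketch of how the forward-direction features could be derived from the per-packet lengths of a single flow. The packet lengths are hypothetical, and the choice of population standard deviation is an assumption; the tool that produced the dataset may compute it differently.

```python
import numpy as np

# Hypothetical forward-direction packet lengths (bytes) for one flow
fwd_packet_lengths = np.array([60, 60, 1500, 400, 60])

fwd_features = {
    "Total Length of Fwd Packet": int(fwd_packet_lengths.sum()),
    "Fwd Packet Length Max": int(fwd_packet_lengths.max()),
    "Fwd Packet Length Min": int(fwd_packet_lengths.min()),
    "Fwd Packet Length Mean": float(fwd_packet_lengths.mean()),
    # Assumption: population std (ddof=0); the dataset's extractor may use ddof=1
    "Fwd Packet Length Std": float(fwd_packet_lengths.std()),
}

for name, value in fwd_features.items():
    print(f"{name}: {value}")
```

The backward-direction features would follow the same pattern, applied to the packets travelling in the opposite direction.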

In [1]:
# Importing important libraries 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os 
import json

In [2]:
# Functions 
# Update this code block to add more functions as per need
def read_data(data_path: str, usecols: list) -> pd.DataFrame:
    """
    Read only the columns listed in `usecols` from the CSV file at `data_path`
    and return them as a DataFrame.
    """
    data = pd.read_csv(data_path, usecols=usecols)
    return data

**Data**

In [3]:
# Read data from filepath 
# This file path is for temporary usage (For EDA)
# It will be removed once the task is completed 

file_path = "dataset.csv"
# Columns in this category
columns_to_read = ['Total Length of Fwd Packet', 'Total Length of Bwd Packet',
       'Fwd Packet Length Max', 'Fwd Packet Length Min',
       'Fwd Packet Length Mean', 'Fwd Packet Length Std',
       'Bwd Packet Length Max', 'Bwd Packet Length Min',
       'Bwd Packet Length Mean', 'Bwd Packet Length Std']

# Read the data
try:
    data = read_data(file_path, usecols=columns_to_read)
    print(f"Successfully read {len(data.columns)} features")
except FileNotFoundError:
    # Only a missing file is handled here; other errors (e.g. a missing column) still surface
    print("File not found!")
    print("[INFO] Please place the dataset.csv in the directory for use!")

Successfully read 10 features


**Analysis**

In [7]:
# Pandas Describe Function
data.describe()

| Statistic | Total Length of Fwd Packet | Total Length of Bwd Packet | Fwd Packet Length Max | Fwd Packet Length Min | Fwd Packet Length Mean | Fwd Packet Length Std | Bwd Packet Length Max | Bwd Packet Length Min | Bwd Packet Length Mean | Bwd Packet Length Std |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 |
| mean | 64246.91 | 184934.8 | 1539.499 | 0.9127386 | 189.3546 | 333.0028 | 1060.918 | 2.88644 | 227.1624 | 367.0041 |
| std | 623101.8 | 1269267.0 | 3683.048 | 9.55478 | 359.7574 | 549.1157 | 3728.56 | 26.42945 | 1211.658 | 1357.321 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 443.0 | 357.0 | 268.0 | 0.0 | 58.33333 | 97.55825 | 357.0 | 0.0 | 29.75 | 79.0 |
| 50% | 520.0 | 357.0 | 518.0 | 0.0 | 87.5 | 212.6974 | 357.0 | 0.0 | 71.4 | 159.6553 |
| 75% | 1040.0 | 357.0 | 534.0 | 0.0 | 147.2857 | 239.2593 | 357.0 | 0.0 | 71.4 | 159.6553 |
| max | 18927940.0 | 48194210.0 | 64704.0 | 5083.0 | 17151.53 | 12475.33 | 64704.0 | 3561.0 | 34930.59 | 31189.38 |


In [8]:
data.columns

Index(['Total Length of Fwd Packet', 'Total Length of Bwd Packet',
       'Fwd Packet Length Max', 'Fwd Packet Length Min',
       'Fwd Packet Length Mean', 'Fwd Packet Length Std',
       'Bwd Packet Length Max', 'Bwd Packet Length Min',
       'Bwd Packet Length Mean', 'Bwd Packet Length Std'],
      dtype='object')

In [9]:
# Unique Values 
data.nunique() 

Total Length of Fwd Packet    209948
Total Length of Bwd Packet    122551
Fwd Packet Length Max          22884
Fwd Packet Length Min            118
Fwd Packet Length Mean        467819
Fwd Packet Length Std         579325
Bwd Packet Length Max           8799
Bwd Packet Length Min             77
Bwd Packet Length Mean        144043
Bwd Packet Length Std         151296
dtype: int64

> Fwd Packet Length Min has very few unique values (118 across roughly 3.2 million rows), which means many flows share the same minimum packet length. This is plausible if the smallest packet in a flow tends to take one of a few common sizes. Fwd Packet Length Max also has comparatively few unique values (22,884) next to the Mean and Std features.

> Bwd Packet Length Min and Max likewise have few unique values (77 and 8,799, respectively).
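
One way to back up the observation about low cardinality is to look at how much of each column is taken up by its single most common value. The sketch below uses a small made-up frame as a stand-in for `data`; the column names match the dataset, but the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up)
toy = pd.DataFrame({
    "Fwd Packet Length Min": [0, 0, 0, 0, 0, 0, 20, 40],
    "Fwd Packet Length Mean": [58.3, 87.5, 147.2, 60.1, 91.0, 75.4, 300.8, 512.6],
})

cardinality = pd.DataFrame({
    "n_unique": toy.nunique(),
    # Fraction of rows taken by the single most common value in each column
    "top_value_share": toy.apply(
        lambda col: col.value_counts(normalize=True).iloc[0]
    ),
})
print(cardinality)
```

A column with few unique values and a high `top_value_share` is dominated by one value, which is the pattern the notes above describe for the Min features.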

In [11]:
data.describe()

| Statistic | Total Length of Fwd Packet | Total Length of Bwd Packet | Fwd Packet Length Max | Fwd Packet Length Min | Fwd Packet Length Mean | Fwd Packet Length Std | Bwd Packet Length Max | Bwd Packet Length Min | Bwd Packet Length Mean | Bwd Packet Length Std |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 |
| mean | 64246.91 | 184934.8 | 1539.499 | 0.9127386 | 189.3546 | 333.0028 | 1060.918 | 2.88644 | 227.1624 | 367.0041 |
| std | 623101.8 | 1269267.0 | 3683.048 | 9.55478 | 359.7574 | 549.1157 | 3728.56 | 26.42945 | 1211.658 | 1357.321 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 443.0 | 357.0 | 268.0 | 0.0 | 58.33333 | 97.55825 | 357.0 | 0.0 | 29.75 | 79.0 |
| 50% | 520.0 | 357.0 | 518.0 | 0.0 | 87.5 | 212.6974 | 357.0 | 0.0 | 71.4 | 159.6553 |
| 75% | 1040.0 | 357.0 | 534.0 | 0.0 | 147.2857 | 239.2593 | 357.0 | 0.0 | 71.4 | 159.6553 |
| max | 18927940.0 | 48194210.0 | 64704.0 | 5083.0 | 17151.53 | 12475.33 | 64704.0 | 3561.0 | 34930.59 | 31189.38 |


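> Most values of Total Length of Fwd Packet are around 1k or below (the 75th percentile is 1,040), yet the maximum is about 18.9 million, so this column likely contains outliers. The standard deviation (about 623,000) is very high relative to the median (520), showing a highly dispersed distribution.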
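
A common way to check for such outliers is the 1.5 × IQR rule (Tukey's fences). The following is a minimal sketch on made-up values shaped like this feature; it is one possible check, not necessarily the one used later in this analysis.

```python
import pandas as pd

# Made-up values: mostly small totals with two very large ones
total_fwd = pd.Series([443, 480, 520, 520, 600, 1040, 1040, 250_000, 18_000_000])

q1, q3 = total_fwd.quantile(0.25), total_fwd.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this fence are flagged

outliers = total_fwd[total_fwd > upper_fence]
print(f"upper fence: {upper_fence}, flagged: {len(outliers)} of {len(total_fwd)}")
```

On the real data the same lines could be applied to `data['Total Length of Fwd Packet']` in place of the toy series.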

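> The mean of Total Length of Fwd Packet (about 64,247) and of Total Length of Bwd Packet (about 184,935) is far higher than the corresponding median (520 and 357). This points to a strongly right-skewed distribution: most values lie at the low end, and a small number of very large values pull the mean up.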
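
The gap between mean and median can be quantified with a skewness coefficient. This is a small sketch on made-up values that mimic the shape of Total Length of Bwd Packet; a positive result from `Series.skew()` indicates a long right tail.

```python
import pandas as pd

# Made-up values: many identical small totals and one very large one
total_bwd = pd.Series([357, 357, 357, 357, 357, 400, 900, 48_000_000])

mean, median = total_bwd.mean(), total_bwd.median()
skewness = total_bwd.skew()  # positive for a right-skewed distribution

print(f"mean: {mean:.1f}, median: {median:.1f}, skewness: {skewness:.2f}")
```

For the full dataset, `data.skew()` would report the same coefficient for every column at once.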