### Summary Statistics

Summary statistics provide a concise overview of a dataset's distribution. Here's a breakdown of the commonly calculated metrics:

**Central Tendency**
- **Mean**: The average value of the dataset. It's calculated by summing all values and dividing by the number of values.
- **Median**: The middle value in a dataset when the values are arranged in ascending order. It's a robust measure that's less sensitive to outliers than the mean.
- **Mode**: The most frequent value in the dataset. There can be one mode (unimodal), multiple modes (multimodal), or no mode.

**Dispersion**
- **Standard Deviation**: Measures the spread of the data around the mean. A higher standard deviation indicates greater variability.
- **Variance**: The average squared deviation from the mean. It equals the square of the standard deviation.
- **Range**: The difference between the maximum and minimum values.
- **Interquartile Range (IQR)**: The difference between the 75th percentile (Q3) and the 25th percentile (Q1). It measures the spread of the middle 50% of the data.
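
The metrics listed above can be computed directly with pandas. The following is a minimal sketch on a made-up sample (the values and the helper name `summarize` are illustrative, not taken from the dataset). Note that pandas computes the sample standard deviation and variance (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical sample with one large outlier (not taken from the dataset)
values = pd.Series([2, 4, 4, 5, 7, 9, 100])

def summarize(s: pd.Series) -> dict:
    """Return the central-tendency and dispersion metrics for one column."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return {
        "mean": s.mean(),
        "median": s.median(),
        "mode": s.mode().tolist(),  # a list, since there may be several modes
        "std": s.std(),             # sample standard deviation (ddof=1)
        "var": s.var(),             # sample variance (ddof=1)
        "range": s.max() - s.min(),
        "iqr": q3 - q1,
    }

summary = summarize(values)
print(summary)
# The outlier pulls the mean far above the median,
# while the median and the IQR barely react to it.
```
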
---

**Category Name: Flow rate and Inter-arrival time**

These are the features we will process in this notebook:
- Flow Bytes/s: The average number of bytes transferred per second during the flow.
- Flow Packets/s: The average number of packets transferred per second during the flow.
- Flow IAT Mean, Std, Max, Min: The mean, standard deviation, maximum, and minimum interarrival times for the entire flow.
- Fwd IAT Total, Mean, Std, Max, Min: The total, mean, standard deviation, maximum, and minimum interarrival times for forward packets.
- Bwd IAT Total, Mean, Std, Max, Min: The total, mean, standard deviation, maximum, and minimum interarrival times for backward packets.
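
The IAT features are statistics of the gaps between consecutive packet timestamps. The sketch below illustrates the idea for one packet sequence (the whole flow, or only its forward or backward packets). The function name `iat_stats`, the microsecond unit, the use of the sample standard deviation, and the all-zero result for sequences with fewer than two packets are assumptions made for illustration; the tool that produced the dataset may differ on each point.

```python
import numpy as np

def iat_stats(timestamps_us):
    """Inter-arrival time statistics for one packet sequence.

    `timestamps_us` holds packet arrival times in microseconds (assumed unit).
    """
    ts = np.sort(np.asarray(timestamps_us, dtype=float))
    iat = np.diff(ts)  # gaps between consecutive packets
    if iat.size == 0:  # fewer than two packets: no gaps to measure (assumed convention)
        return {"total": 0.0, "mean": 0.0, "std": 0.0, "max": 0.0, "min": 0.0}
    return {
        "total": float(iat.sum()),
        "mean": float(iat.mean()),
        "std": float(iat.std(ddof=1)) if iat.size > 1 else 0.0,
        "max": float(iat.max()),
        "min": float(iat.min()),
    }

# Hypothetical sequence of four packets; the gaps are 100, 150 and 750
stats = iat_stats([0, 100, 250, 1000])
print(stats)
```
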

In [2]:
# Importing important libraries 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os 
import json 

In [3]:
# Functions 
# Update this code block to add more functions as per need
def read_data(data_path: str, usecols: list) -> pd.DataFrame:
    """
    Read the selected columns of a CSV file into a DataFrame.

    data_path: path to the CSV file
    usecols: names of the columns to load
    """
    data = pd.read_csv(data_path, usecols=usecols)
    return data

**Data**

In [4]:
# Read data from filepath 
# This file path is for temporary usage (For EDA)
# It will be removed once the task is completed 

file_path = "dataset.csv"
# Columns in this category
columns_to_read = ['Flow Bytes/s',
       'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max',
       'Flow IAT Min', 'Fwd IAT Total', 'Fwd IAT Mean', 'Fwd IAT Std',
       'Fwd IAT Max', 'Fwd IAT Min', 'Bwd IAT Total', 'Bwd IAT Mean',
       'Bwd IAT Std', 'Bwd IAT Max', 'Bwd IAT Min']

# Read the data
try:
    data = read_data(file_path, usecols=columns_to_read)
    print(f"Successfully read {len(data.columns)} features")
except FileNotFoundError:  # other errors (e.g. a missing column) are not hidden
    print("File not found!")
    print("[INFO] Please place the dataset.csv in the directory for use!")

Successfully read 16 features


**Analysis**

In [6]:
# Pandas Describe func
data.describe()

| | Flow Bytes/s | Flow Packets/s | Flow IAT Mean | Flow IAT Std | Flow IAT Max | Flow IAT Min | Fwd IAT Total | Fwd IAT Mean | Fwd IAT Std | Fwd IAT Max | Fwd IAT Min | Bwd IAT Total | Bwd IAT Mean | Bwd IAT Std | Bwd IAT Max | Bwd IAT Min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3190424.0 | 3231475.0 | 3189766.0 | 3189766.0 | 3189766.0 | 3189766.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 |
| mean | inf | inf | 273525.1 | 336059.6 | 1276584.0 | 83288.63 | 6945646.0 | 328559.2 | 341007.8 | 1256459.0 | 117901.9 | 6068053.0 | 127762.4 | 153385.6 | 716299.2 | 38282.08 |
| std | NaN | NaN | 1508600.0 | 2094611.0 | 6032778.0 | 910390.0 | 24529220.0 | 1975864.0 | 2020715.0 | 5977838.0 | 1475624.0 | 23874460.0 | 1472715.0 | 1031695.0 | 3916720.0 | 1147860.0 |
| min | 0.0 | 0.0167695 | 4.0 | 0.0 | 4.0 | -37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 |
| 25% | 13118.55 | 99.50051 | 796.8889 | 1691.476 | 6120.0 | 4.0 | 7436.0 | 1286.4 | 1905.74 | 5221.0 | 14.0 | 2382.0 | 444.5 | 462.5197 | 1258.0 | 12.0 |
| 50% | 29538.97 | 304.9976 | 3698.667 | 9634.148 | 30843.0 | 8.0 | 37481.0 | 7329.0 | 12708.76 | 29964.0 | 23.0 | 27068.0 | 6331.75 | 10303.39 | 22312.0 | 22.0 |
| 75% | 134163.2 | 1664.145 | 11210.73 | 26938.86 | 107329.0 | 11.0 | 128872.0 | 22046.0 | 35656.35 | 104379.5 | 32.0 | 80731.0 | 17693.25 | 29455.71 | 64334.0 | 34.0 |
| max | inf | inf | 119264100.0 | 83992940.0 | 119962100.0 | 119264100.0 | 120000000.0 | 119962200.0 | 83657140.0 | 119962200.0 | 119962200.0 | 120000000.0 | 119962100.0 | 83665320.0 | 119962100.0 | 119962100.0 |


In [8]:
# Unique 
data.nunique()

Flow Bytes/s      2260437
Flow Packets/s    1222559
Flow IAT Mean     1510578
Flow IAT Std      3053101
Flow IAT Max       498232
Flow IAT Min        40359
Fwd IAT Total      614738
Fwd IAT Mean      1191586
Fwd IAT Std       2976829
Fwd IAT Max        498469
Fwd IAT Min         58963
Bwd IAT Total      455963
Bwd IAT Mean      1005295
Bwd IAT Std       2706225
Bwd IAT Max        450557
Bwd IAT Min         30776
dtype: int64

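> Fwd IAT Min and Bwd IAT Min have far fewer unique values (58,963 and 30,776) than most other columns, as does Flow IAT Min (40,359). Since each row is one flow, this means that many flows share the same minimum inter-arrival time: the minima cluster on a small set of values.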
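
Raw unique counts are easier to compare once they are divided by the number of non-null rows in each column. The following is a small sketch of that normalisation; the helper name `unique_ratio` and the toy values are hypothetical, and in this notebook the function would be called on `data`.

```python
import pandas as pd

def unique_ratio(df: pd.DataFrame) -> pd.Series:
    """Fraction of non-null values in each column that are distinct."""
    return (df.nunique() / df.count()).sort_values()

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    "Fwd IAT Min": [4, 4, 4, 8, 8, 12],
    "Fwd IAT Std": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5],
})
ratios = unique_ratio(toy)
print(ratios)
```

A ratio close to 1 means almost every row has its own value; a ratio close to 0 means the column takes only a few distinct values.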


In [14]:
data[['Flow Bytes/s', 'Flow Packets/s', 'Flow IAT Mean', 'Fwd IAT Total', 'Bwd IAT Total']].describe()

| | Flow Bytes/s | Flow Packets/s | Flow IAT Mean | Fwd IAT Total | Bwd IAT Total |
|---|---|---|---|---|---|
| count | 3190424.0 | 3231475.0 | 3189766.0 | 3231475.0 | 3231475.0 |
| mean | inf | inf | 273525.1 | 6945646.0 | 6068053.0 |
| std | NaN | NaN | 1508600.0 | 24529220.0 | 23874460.0 |
| min | 0.0 | 0.0167695 | 4.0 | 0.0 | 0.0 |
| 25% | 13118.55 | 99.50051 | 796.8889 | 7436.0 | 2382.0 |
| 50% | 29538.97 | 304.9976 | 3698.667 | 37481.0 | 27068.0 |
| 75% | 134163.2 | 1664.145 | 11210.73 | 128872.0 | 80731.0 |
| max | inf | inf | 119264100.0 | 120000000.0 | 120000000.0 |


> Some instances contain infinite values (`inf`). We replace these with NaN and then drop the rows that contain NaN.
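
Before cleaning, it can help to see how many infinite and missing entries each column holds. The following is a minimal sketch; the helper name `count_non_finite` and the toy values are hypothetical, and in this notebook the function would be called on `data`.

```python
import numpy as np
import pandas as pd

def count_non_finite(df: pd.DataFrame) -> pd.DataFrame:
    """Count +/-inf and NaN entries in each numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "inf": np.isinf(numeric).sum(),
        "nan": numeric.isna().sum(),
    })

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    "Flow Bytes/s": [1200.0, np.inf, np.nan, 50.0],
    "Flow Packets/s": [10.0, np.inf, 3.0, 2.0],
})
counts = count_non_finite(toy)
print(counts)
```
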

In [34]:
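# Replace +/-inf with NaN, then drop every row that has a NaN in any column
# (this also drops rows that were already NaN before the replacement)
dt = data.replace([np.inf, -np.inf], np.nan)
dt = dt.dropna()

# Alternative: cap 'inf' at the largest finite value of each column.
# data.max() cannot be used directly because it is itself inf for those columns.
# finite_max = data.replace([np.inf, -np.inf], np.nan).max()
# df = data.clip(upper=finite_max, axis=1)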
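
The two options differ in what they keep. Dropping removes whole rows, while capping keeps the rows but needs the column maximum to be computed over finite values only, since the maximum of a column that contains `inf` is `inf` itself. The sketch below compares both on a hypothetical toy frame; it only handles `+inf` when capping, on the assumption that rate columns are non-negative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    "Flow Bytes/s": [100.0, np.inf, np.nan, 300.0],
    "Flow Packets/s": [1.0, 2.0, 3.0, np.inf],
})

# Option 1: turn inf into NaN and drop every row that contains a NaN
dropped = toy.replace([np.inf, -np.inf], np.nan).dropna()

# Option 2: cap +inf at the largest finite value of each column
finite_max = toy.replace([np.inf, -np.inf], np.nan).max()  # NaN is skipped
capped = toy.clip(upper=finite_max, axis=1)                # NaN stays NaN

print(len(dropped), "row(s) left after dropping")
print(capped)
```
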

In [35]:
dt[['Flow Bytes/s', 'Flow Packets/s', 'Flow IAT Mean', 'Fwd IAT Total', 'Bwd IAT Total']].describe()

| | Flow Bytes/s | Flow Packets/s | Flow IAT Mean | Fwd IAT Total | Bwd IAT Total |
|---|---|---|---|---|---|
| count | 3189766.0 | 3189766.0 | 3189766.0 | 3189766.0 | 3189766.0 |
| mean | 1555212.0 | 4273.891 | 273525.1 | 7036466.0 | 6147398.0 |
| std | 19428990.0 | 11055.74 | 1508600.0 | 24676120.0 | 24019890.0 |
| min | 0.0 | 0.0167695 | 4.0 | 0.0 | 0.0 |
| 25% | 13117.18 | 97.1121 | 796.8889 | 8271.0 | 2923.0 |
| 50% | 29527.46 | 296.9201 | 3698.667 | 38533.0 | 27943.0 |
| 75% | 133928.6 | 1485.8 | 11210.73 | 133531.0 | 81964.0 |
| max | 1489628000.0 | 500000.0 | 119264100.0 | 120000000.0 | 120000000.0 |


> Flow Bytes/s is spread very widely, from 0 to about 1.49 billion (roughly 149 crore), although 75% of instances fall below about 134,000 (1.34 lakh).

> Flow Packets/s has a mean (about 4,274) well above both its median (about 297) and its 75th percentile (about 1,486), yet far below its maximum (500,000). The distribution is strongly right-skewed: a small number of flows with very high packet rates pull the mean up.

> Flow IAT Mean is also spread widely, from 4 to about 119 million (roughly 11.9 crore), so the mean inter-arrival time varies by several orders of magnitude across flows.
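
A quick way to check observations of this kind is to place the mean next to the median and the skewness for each column. The following is a minimal sketch; the helper name `skew_report` and the toy values are hypothetical, and in this notebook the function would be called on `dt`.

```python
import pandas as pd

def skew_report(df: pd.DataFrame) -> pd.DataFrame:
    """Compare mean and median per column; a mean far above the median suggests right skew."""
    report = pd.DataFrame({
        "mean": df.mean(),
        "median": df.median(),
        "skew": df.skew(),  # positive values indicate a long right tail
    })
    report["mean/median"] = report["mean"] / report["median"]
    return report

# Hypothetical toy column with one very large value
toy = pd.DataFrame({"Flow Packets/s": [1.0, 2.0, 3.0, 4.0, 100.0]})
report = skew_report(toy)
print(report)
```
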
