### Summary Statistics

Summary statistics provide a concise overview of a dataset's distribution. Here's a breakdown of the commonly calculated metrics:

**Central Tendency**
- **Mean**: The average value of the dataset. It's calculated by summing all values and dividing by the number of values.
- **Median**: The middle value in a dataset when the values are arranged in ascending order. It's a robust measure that's less sensitive to outliers than the mean.
- **Mode**: The most frequent value in the dataset. There can be one mode (unimodal), multiple modes (multimodal), or no mode.

**Dispersion**
- **Standard Deviation**: Measures the spread of the data around the mean, in the same units as the data. A higher standard deviation indicates greater variability.
- **Variance**: The average squared deviation from the mean. The standard deviation is its square root.
- **Range**: The difference between the maximum and minimum values.
- **Interquartile Range (IQR)**: The difference between the 75th percentile (Q3) and the 25th percentile (Q1). It measures the spread of the middle 50% of the data.
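
The metrics above can be computed with pandas roughly as follows. This is a minimal sketch on a hypothetical toy series (the values are illustrative and not taken from the dataset); it includes one outlier so the gap between mean and median is visible. Note that pandas computes the sample standard deviation and variance (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical toy values with one outlier (100); not taken from the dataset.
values = pd.Series([2, 4, 4, 5, 7, 9, 100])

# Central tendency
mean = values.mean()
median = values.median()
modes = values.mode().tolist()  # mode() returns every tied mode

# Dispersion
std = values.std()        # sample standard deviation (ddof=1)
variance = values.var()   # sample variance (ddof=1)
value_range = values.max() - values.min()
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

print(f"mean={mean:.2f}, median={median}, modes={modes}")
print(f"std={std:.2f}, variance={variance:.2f}, range={value_range}, IQR={iqr}")
```
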
---

**Category Name: Subflow-Related Features, Window Size and TCP Flags, Flow Duration and Idle Time**

- Subflow Fwd Packets: The total number of forward packets within the subflow.
- Subflow Fwd Bytes: The total number of bytes transferred in the forward direction within the subflow.
- Subflow Bwd Packets: The total number of backward packets within the subflow.
- Subflow Bwd Bytes: The total number of bytes transferred in the backward direction within the subflow.
- FWD Init Win Bytes: The initial window size advertised by the forward flow.
- Bwd Init Win Bytes: The initial window size advertised by the backward flow.
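- Fwd Act Data Pkts: The number of forward packets that carry at least one byte of TCP data payload.
- Fwd Seg Size Min: The minimum segment size observed in the forward direction.
- Active Mean, Std, Max, Min: The mean, standard deviation, maximum, and minimum of the time the flow was active before becoming idle.
- Idle Mean, Std, Max, Min: The mean, standard deviation, maximum, and minimum of the time the flow was idle before becoming active.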
- Total TCP Flow Time: The total duration of the TCP flow.


In [1]:
# Importing important libraries 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os 
import json 

In [2]:
# Functions 
# Update this code block to add more functions as per need
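def read_data(data_path: str, usecols: list) -> pd.DataFrame:
    """
    Read the selected columns of a CSV file into a DataFrame.

    data_path: path to the CSV file.
    usecols: names of the columns to load; all other columns are skipped.
    """
    data = pd.read_csv(data_path, usecols=usecols)
    return data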

**Data**

In [3]:
# Read data from filepath 
# This file path is for temporary usage (For EDA)
# It will be removed once the task is completed 

file_path = "dataset.csv"
# Columns in this category
columns_to_read = ['Subflow Fwd Packets', 'Subflow Fwd Bytes', 'Subflow Bwd Packets',
       'Subflow Bwd Bytes', 'FWD Init Win Bytes', 'Bwd Init Win Bytes',
       'Fwd Act Data Pkts', 'Fwd Seg Size Min', 'Active Mean', 'Active Std',
       'Active Max', 'Active Min', 'Idle Mean', 'Idle Std', 'Idle Max',
       'Idle Min', 'Total TCP Flow Time']

# Read the data
try:
    data = read_data(file_path, usecols=columns_to_read)
    print(f"Successfully read {len(data.columns)} features")
except FileNotFoundError:
    # Catch only the missing-file case so other errors are not mislabelled
    print("File not found!")
    print("[INFO] Please place the dataset.csv in the directory for use!")

Successfully read 17 features
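
A missing file is not the only way this read can fail: `pd.read_csv` raises a `ValueError` when a name in `usecols` is absent from the file. One possible way to check the header first is sketched below. It uses a hypothetical in-memory CSV in place of `dataset.csv`, and the names `csv_text`, `wanted`, `missing`, and `present` are illustrative assumptions, not part of the notebook.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for dataset.csv
csv_text = "Active Mean,Idle Mean\n0.0,0.0\n31.0,1.0\n"
wanted = ["Active Mean", "Idle Mean", "Total TCP Flow Time"]

# nrows=0 parses only the header row, so the check stays cheap on large files
available = pd.read_csv(io.StringIO(csv_text), nrows=0).columns
missing = [col for col in wanted if col not in available]
present = [col for col in wanted if col in available]

# Read only the columns that actually exist in the file
sample = pd.read_csv(io.StringIO(csv_text), usecols=present)
print(f"missing: {missing}; read {len(sample.columns)} columns")
```

Whether to skip missing columns or stop with an error is a design choice; for EDA, reporting them explicitly is usually enough.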


**Analysis**

In [4]:
# Describe
data.describe()

| | Subflow Fwd Packets | Subflow Fwd Bytes | Subflow Bwd Packets | Subflow Bwd Bytes | FWD Init Win Bytes | Bwd Init Win Bytes | Fwd Act Data Pkts | Fwd Seg Size Min | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Total TCP Flow Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 |
| mean | 0.08419746 | 109.0405 | 4.518061e-05 | 106.485 | 60232.49 | 495.7609 | 75.50082 | 31.64003 | 243729.6 | 173407.3 | 479218.9 | 116818.7 | 671527.0 | 125960.3 | 824276.9 | 552952.4 | 71865530.0 |
| std | 0.2776838 | 214.7488 | 0.006721501 | 599.1156 | 15831.7 | 853.0959 | 418.2399 | 5.8041 | 2654499.0 | 1768455.0 | 4228964.0 | 2195276.0 | 5274227.0 | 1262020.0 | 5958127.0 | 4972078.0 | 1686803000.0 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 33.0 | 0.0 | 13.0 | 64240.0 | 502.0 | 1.0 | 32.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6991.0 |
| 50% | 0.0 | 47.0 | 0.0 | 32.0 | 64240.0 | 502.0 | 1.0 | 32.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 36714.0 |
| 75% | 0.0 | 79.0 | 0.0 | 35.0 | 64240.0 | 502.0 | 2.0 | 32.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 123284.5 |
| max | 1.0 | 10547.0 | 1.0 | 18137.0 | 65495.0 | 64704.0 | 7830.0 | 40.0 | 114827900.0 | 77181530.0 | 114827900.0 | 114827900.0 | 119962100.0 | 77115540.0 | 119962100.0 | 119962100.0 | 86389260000.0 |


In [5]:
# unique 
data.nunique()

Subflow Fwd Packets         2
Subflow Fwd Bytes        3036
Subflow Bwd Packets         2
Subflow Bwd Bytes        1796
FWD Init Win Bytes       1976
Bwd Init Win Bytes       4781
Fwd Act Data Pkts        5371
Fwd Seg Size Min            5
Active Mean            106109
Active Std              74852
Active Max             100615
Active Min              53851
Idle Mean              116004
Idle Std                78592
Idle Max               109211
Idle Min               102036
Total TCP Flow Time    629615
dtype: int64

In [6]:
data['Subflow Bwd Packets'].value_counts()

Subflow Bwd Packets
0    3231329
1        146
Name: count, dtype: int64

> Subflow Bwd Packets is 0 in all but 146 of 3,231,475 rows (under 0.01%), so almost no flows have backward packets within the subflow.

In [8]:
data['Subflow Fwd Packets'].value_counts()

Subflow Fwd Packets
0    2959393
1     272082
Name: count, dtype: int64

> Subflow Fwd Packets is 1 in 272,082 of 3,231,475 rows (about 8.4%) and 0 otherwise, so a small share of flows have forward packets within the subflow.

In [10]:
data['Fwd Seg Size Min'].value_counts()

Fwd Seg Size Min
32    2827826
40     274014
8       63671
0       51783
20      14181
Name: count, dtype: int64

In [13]:
data['FWD Init Win Bytes'].value_counts()

FWD Init Win Bytes
64240    2488326
65280     531312
0         127569
502        19938
501        14967
          ...   
11695          1
7746           1
7781           1
6500           1
9233           1
Name: count, Length: 1976, dtype: int64

> 64240 is the most common initial window size in the forward direction (2,488,326 rows, about 77%), followed by 65280 (531,312 rows).
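
To read such dominance as a share of rows rather than a raw count, `value_counts(normalize=True)` can be used. The following is a minimal sketch on a hypothetical five-row series standing in for the real column; the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for one column of the dataset; values are illustrative.
window = pd.Series([64240, 64240, 64240, 65280, 0], name="FWD Init Win Bytes")

# normalize=True returns each value's share of rows instead of its raw count;
# the result is sorted so the most frequent value comes first
shares = window.value_counts(normalize=True)
top_value = shares.index[0]
top_share = shares.iloc[0]
print(f"{top_value} accounts for {top_share:.1%} of rows")
```

On the notebook's `data`, the same call would be `data['FWD Init Win Bytes'].value_counts(normalize=True).head()`.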

In [15]:
data['Active Mean'].value_counts()

Active Mean
0.0           3114433
31.0              112
27.0              110
33.0              102
25.0               96
               ...   
14862866.0          1
13479777.6          1
16929474.0          1
12551961.0          1
22329.5             1
Name: count, Length: 106109, dtype: int64

In [16]:
data['Idle Mean'].value_counts()

Idle Mean
0.000000e+00    3113423
1.100796e+07          8
1.100796e+07          8
1.075196e+07          8
1.177596e+07          8
                 ...   
6.857778e+06          1
7.115686e+06          1
6.289454e+06          1
6.151582e+06          1
2.686503e+07          1
Name: count, Length: 116004, dtype: int64

> Active Mean is 0 in 3,114,433 rows and Idle Mean is 0 in 3,113,423 rows, about 96% of the dataset in each case.
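
Rather than inspecting `value_counts()` one column at a time, the share of zeros can be computed for every column at once. This is a sketch on a hypothetical miniature frame (`toy` is an assumption standing in for `data`, with made-up values).

```python
import pandas as pd

# Hypothetical miniature frame standing in for `data`; values are made up.
toy = pd.DataFrame({
    "Active Mean": [0.0, 0.0, 0.0, 31.0],
    "Idle Mean": [0.0, 0.0, 1.1e7, 2.6e7],
})

# (toy == 0) is a boolean frame; the mean of each column is its share of zeros
zero_share = (toy == 0).mean().sort_values(ascending=False)
print(zero_share)
```

Applied to the real frame as `(data == 0).mean()`, this would list the columns dominated by zeros in a single pass.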