### Summary Statistics

Summary statistics provide a concise overview of a dataset's distribution. Here's a breakdown of the commonly calculated metrics:

**Central Tendency**
- **Mean**: The average value of the dataset. It's calculated by summing all values and dividing by the number of values.
- **Median**: The middle value in a dataset when the values are arranged in ascending order. It's a robust measure that's less sensitive to outliers than the mean.
- **Mode**: The most frequent value in the dataset. There can be one mode (unimodal), multiple modes (multimodal), or no mode.

**Dispersion**
- **Standard Deviation**: Measures the spread of the data around the mean. A higher standard deviation indicates greater variability.
- **Variance**: The square of the standard deviation.
- **Range**: The difference between the maximum and minimum values.
- **Interquartile Range (IQR)**: The difference between the 75th percentile (Q3) and the 25th percentile (Q1). It measures the spread of the middle 50% of the data.
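
The metrics above can be computed directly with pandas. The following is a minimal sketch on a small toy series (the values are made up for illustration and are not taken from the dataset); it also shows how a single outlier pulls the mean far from the median.

```python
import pandas as pd

# Toy values chosen for illustration only; they are not taken from the dataset.
values = pd.Series([1, 2, 2, 3, 4, 5, 100])

# Central tendency
mean = values.mean()              # sum of values / number of values
median = values.median()          # middle value after sorting
modes = values.mode().tolist()    # most frequent value(s); may hold several

# Dispersion
std = values.std()                # sample standard deviation (ddof=1 by default)
variance = values.var()           # sample variance, equal to std ** 2
value_range = values.max() - values.min()
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1                     # spread of the middle 50% of the data

print(f"mean={mean:.2f}, median={median}, modes={modes}")
print(f"std={std:.2f}, var={variance:.2f}, range={value_range}, IQR={iqr}")
```
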
---

**Category Name: Flag-Based Features**

These are the features we will process in this notebook:
- Fwd PSH Flags, Bwd PSH Flags: Number of times the push flag is set in forward and backward packets, respectively.
- Fwd URG Flags, Bwd URG Flags: Number of times the urgent flag is set in forward and backward packets, respectively.
- Fwd RST Flags, Bwd RST Flags: Number of times the reset flag is set in forward and backward packets, respectively.


In [4]:
# Importing important libraries 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os 
import json 

In [5]:
# Functions 
# Update this code block to add more functions as per need
def read_data(data_path: str, usecols: list) -> pd.DataFrame:
    """
    This function reads the data from the data path
    """
    data = pd.read_csv(data_path, usecols=usecols)
    return data

**Data**

In [6]:
# Read data from filepath 
# This file path is for temporary usage (For EDA)
# It will be removed once the task is completed 

file_path = "dataset.csv"
# Columns in this category
columns_to_read = ['Fwd PSH Flags',
       'Bwd PSH Flags', 'Fwd URG Flags', 'Bwd URG Flags', 'Fwd RST Flags',
       'Bwd RST Flags']

# Read the data
try:
    data = read_data(file_path, usecols=columns_to_read)
    print(f"Successfully read {len(data.columns)} features")
except FileNotFoundError:
    # Only a missing file is handled here; other errors (e.g. a missing column) still surface
    print("File not found!")
    print("[INFO] Please place the dataset.csv in the directory for use!")

Successfully read 6 features


**Analysis**

In [7]:
# Pandas Describe
data.describe()

| | Fwd PSH Flags | Bwd PSH Flags | Fwd URG Flags | Bwd URG Flags | Fwd RST Flags | Bwd RST Flags |
|---|---|---|---|---|---|---|
| count | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 | 3231475.0 |
| mean | 75.33344 | 79.53056 | 0.0 | 0.0 | 0.01665648 | 0.00751731 |
| std | 418.3171 | 533.101 | 0.0 | 0.0 | 0.158137 | 0.1220052 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 7831.0 | 10200.0 | 0.0 | 0.0 | 10.0 | 28.0 |


In [8]:
# Unique Values 
data.nunique()

Fwd PSH Flags    5378
Bwd PSH Flags    5852
Fwd URG Flags       1
Bwd URG Flags       1
Fwd RST Flags      11
Bwd RST Flags      26
dtype: int64

> Fwd URG Flags and Bwd URG Flags each contain a single value (0), so they carry no information and should be removed.

**Note:** Remove these two columns (Fwd URG Flags and Bwd URG Flags) in the data processing stage.
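
One way the removal could look in the processing stage is sketched below. The helper name `drop_constant_columns` and the toy frame are hypothetical (they are not part of this notebook); the idea is simply to drop every column whose `nunique()` is at most 1 rather than hard-coding the column names.

```python
import pandas as pd

# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "Fwd PSH Flags": [1, 2, 1, 75],
    "Fwd URG Flags": [0, 0, 0, 0],
    "Bwd URG Flags": [0, 0, 0, 0],
    "Fwd RST Flags": [0, 1, 0, 2],
})


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns that hold at most one unique value."""
    unique_counts = df.nunique()  # NaN is ignored, so an all-NaN column counts as 0
    constant_cols = unique_counts[unique_counts <= 1].index.tolist()
    return df.drop(columns=constant_cols)


cleaned = drop_constant_columns(toy)
print(cleaned.columns.tolist())
```

Because `drop` returns a new frame, the original `toy` frame is left unchanged.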

In [10]:
data['Fwd RST Flags'].value_counts()

Fwd RST Flags
0     3188551
1       33085
2        9517
3         100
4          46
6          46
5          37
7          36
9          29
8          16
10         12
Name: count, dtype: int64

In [12]:
data['Bwd RST Flags'].value_counts()

Bwd RST Flags
0     3210013
1       20301
2         875
3          56
5          33
4          32
6          31
7          24
8          13
9          12
11         12
13         11
12         11
10         10
14         10
18          6
17          6
16          5
15          5
20          2
21          2
26          1
28          1
24          1
23          1
19          1
Name: count, dtype: int64

> Fwd RST Flags and Bwd RST Flags are highly imbalanced: about 98.7% and 99.3% of rows, respectively, have the value 0. Consider oversampling, undersampling, or class weighting to handle this.
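
As a rough illustration of the class-weighting option, the sketch below quantifies the imbalance and derives inverse-frequency weights. It rests on assumptions that are not part of this notebook: the toy series stands in for `data['Fwd RST Flags']`, and the counts are first collapsed into a binary "any reset seen" indicator (`has_rst`, a hypothetical name). The weight formula `n_samples / (n_classes * class_count)` is the common "balanced" heuristic.

```python
import pandas as pd

# Toy column standing in for data['Fwd RST Flags']; the values are illustrative only.
rst = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 2], name="Fwd RST Flags")

# Share of each value, to quantify the imbalance.
shares = rst.value_counts(normalize=True)

# Assumption: collapse the counts into a binary indicator (0 = no reset, 1 = any reset).
has_rst = (rst > 0).astype(int)

# "Balanced" weights: n_samples / (n_classes * class_count).
counts = has_rst.value_counts()
weights = len(has_rst) / (len(counts) * counts)

# Per-row weights, so that both groups contribute equally in total.
sample_weight = has_rst.map(weights)

print(shares)
print(weights)
```

With these weights the rare group and the dominant group each sum to the same total weight, which is the effect class weighting aims for.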