### Summary Statistics

Summary statistics provide a concise overview of a dataset's distribution. Here's a breakdown of the commonly calculated metrics:

**Central Tendency**
- **Mean**: The average value of the dataset. It's calculated by summing all values and dividing by the number of values.
- **Median**: The middle value in a dataset when the values are arranged in ascending order. It's a robust measure that's less sensitive to outliers than the mean.
- **Mode**: The most frequent value in the dataset. There can be one mode (unimodal), multiple modes (multimodal), or no mode.
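
The three measures above can be computed directly with pandas. The snippet below is a minimal sketch on a hypothetical toy sample (the values are not taken from the dataset); the single large value shows how an outlier pulls the mean away from the median. Note that `Series.mode()` returns every value tied for the highest frequency, so a sample with all-distinct values yields all of them rather than "no mode".

```python
import pandas as pd

# Hypothetical toy sample (not from the dataset); 100 acts as an outlier.
values = pd.Series([2, 3, 3, 5, 7, 100])

mean = values.mean()            # sum of the values divided by their count
median = values.median()        # middle of the sorted values
modes = values.mode().tolist()  # every value tied for the highest frequency

print(f"mean={mean}, median={median}, modes={modes}")
# prints: mean=20.0, median=4.0, modes=[3]
```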

**Dispersion**
- **Standard Deviation**: Measures the spread of the data around the mean. A higher standard deviation indicates greater variability.
- **Variance**: The square of the standard deviation.
- **Range**: The difference between the maximum and minimum values.
- **Interquartile Range (IQR)**: The difference between the 75th percentile (Q3) and the 25th percentile (Q1). It measures the spread of the middle 50% of the data.
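
As a rough illustration of the dispersion measures, the sketch below uses a hypothetical toy sample (not taken from the dataset). It passes `ddof=0` to get the population standard deviation and variance; pandas defaults to the sample versions (`ddof=1`), which is also what `describe()` reports. Quartiles use pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical toy sample (not from the dataset).
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

std = values.std(ddof=0)                    # population standard deviation
variance = values.var(ddof=0)               # population variance (std squared)
value_range = values.max() - values.min()   # maximum minus minimum

# Q1 and Q3 with pandas' default linear interpolation.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1                               # spread of the middle 50%

print(f"std={std}, variance={variance}, range={value_range}, IQR={iqr}")
```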
---

**Category Name: Flow Duration and Packet Counts**

These are the features we will process in this notebook:
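- Flow Duration: The total duration of the flow. The values appear to be in microseconds rather than seconds: the maximum of 120,000,000 in the summary statistics would otherwise be almost four years.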
- Total Fwd Packet: The total number of packets sent from the source to the destination.
- Total Bwd packets: The total number of packets sent from the destination to the source.

In [1]:
# Importing important libraries 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os 
import json 

In [2]:
# Functions 
# Update this code block to add more functions as per need
def read_data(data_path: str, usecols: list) -> pd.DataFrame:
    """
    Read the columns listed in `usecols` from the CSV file at `data_path`
    and return them as a DataFrame.
    """
    data = pd.read_csv(data_path, usecols=usecols)
    return data

**Data**

In [3]:
# Read data from filepath 
# This file path is for temporary usage (For EDA)
# It will be removed once the task is completed 

file_path = "dataset.csv"
# Columns in this category
columns_to_read = ['Flow Duration', 'Total Fwd Packet', 'Total Bwd packets']

# Read the data
try:
    data = read_data(file_path, usecols = columns_to_read)
    print(f"Successfully read {len(data.columns)} features")
except FileNotFoundError:
    print("File not found!")
    print("[INFO] Please place the dataset.csv in the directory for use!")

Successfully read 3 features


**Analysis**

In [4]:
# Pandas Describe function
data.describe()

| | Flow Duration | Total Fwd Packet | Total Bwd packets |
|---|---|---|---|
| count | 3231475.0 | 3231475.0 | 3231475.0 |
| mean | 6954488.0 | 128.7863 | 105.7871 |
| std | 24543060.0 | 682.2713 | 586.3471 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 7507.0 | 5.0 | 5.0 |
| 50% | 37606.0 | 6.0 | 5.0 |
| 75% | 129755.0 | 7.0 | 6.0 |
| max | 120000000.0 | 14621.0 | 17966.0 |


In [5]:
data.columns

Index(['Flow Duration', 'Total Fwd Packet', 'Total Bwd packets'], dtype='object')

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3231475 entries, 0 to 3231474
Data columns (total 3 columns):
 #   Column             Dtype
---  ------             -----
 0   Flow Duration      int64
 1   Total Fwd Packet   int64
 2   Total Bwd packets  int64
dtypes: int64(3)
memory usage: 74.0 MB


In [8]:
# Unique values 
data.nunique()

Flow Duration        616643
Total Fwd Packet       5819
Total Bwd packets      6137
dtype: int64

> Flow Duration has 616,643 unique values (more than 6 lakh), indicating that flow durations are highly diverse.

> Total Fwd Packet and Total Bwd packets have far fewer unique values (5,819 and 6,137), suggesting that many flows share the same packet counts.
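
One way to check the packet-count observation is to measure how much of the data the most frequent values cover. The sketch below uses a small hypothetical series standing in for `data['Total Fwd Packet']`; the values are made up for illustration and the name `fwd_packets` is an assumption.

```python
import pandas as pd

# Hypothetical packet counts standing in for data['Total Fwd Packet'].
fwd_packets = pd.Series([5, 5, 6, 6, 6, 7, 5, 6, 120, 6])

# Share of rows taken by each distinct value, most frequent first.
shares = fwd_packets.value_counts(normalize=True)

# Fraction of rows covered by the two most common values.
top2_coverage = shares.head(2).sum()

print(f"unique values: {fwd_packets.nunique()}")
print(f"top-2 coverage: {top2_coverage:.0%}")
```

On the real column, a high coverage for a handful of values would support the reading that most flows share the same few packet counts.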

In [9]:
data.describe()

| | Flow Duration | Total Fwd Packet | Total Bwd packets |
|---|---|---|---|
| count | 3231475.0 | 3231475.0 | 3231475.0 |
| mean | 6954488.0 | 128.7863 | 105.7871 |
| std | 24543060.0 | 682.2713 | 586.3471 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 7507.0 | 5.0 | 5.0 |
| 50% | 37606.0 | 6.0 | 5.0 |
| 75% | 129755.0 | 7.0 | 6.0 |
| max | 120000000.0 | 14621.0 | 17966.0 |


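> Flow Duration: Values span a very wide range, from 0 to 120,000,000 (12 crore). 75% of the values fall below roughly 130,000 (1.29 lakh), while the mean is about 6.95 million, so the distribution is heavily right-skewed. This column needs to be normalized before use.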
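
The notebook does not say which normalization is intended. For right-skewed, non-negative data, one common choice is a `log1p` transform followed by min-max scaling. The sketch below applies that choice to a hypothetical series built from the quartiles and maximum shown above; treat it as an assumption, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical durations mimicking the min, quartiles and max reported above.
durations = pd.Series([0, 7507, 37606, 129755, 120_000_000], dtype="int64")

# log1p compresses the long right tail and is defined at 0 (log1p(0) == 0).
log_durations = np.log1p(durations)

# Min-max scaling maps the transformed values onto [0, 1].
span = log_durations.max() - log_durations.min()
scaled = (log_durations - log_durations.min()) / span

print(scaled.round(3).tolist())
```

Without the log step, min-max scaling alone would squeeze the first four values into a tiny interval near 0, because the maximum is about three orders of magnitude larger than the 75th percentile.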

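> Total Fwd Packet and Total Bwd packets count the packets sent in the forward and backward directions, respectively. The two columns have similar summary statistics, which hints at a strong linear relationship, although similar distributions alone do not establish one; a correlation coefficient is needed to confirm it.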
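
A relationship between the two packet-count columns can be checked with `DataFrame.corr()`. The sketch below runs on a small hypothetical frame (the values are made up and the name `sample` is an assumption) and compares Pearson with the rank-based Spearman coefficient, since a few very large flows can dominate the Pearson value.

```python
import pandas as pd

# Hypothetical stand-in for the two packet-count columns; one flow is very large.
sample = pd.DataFrame({
    "Total Fwd Packet": [5, 6, 7, 5, 120, 6],
    "Total Bwd packets": [5, 5, 6, 4, 110, 5],
})

fwd, bwd = "Total Fwd Packet", "Total Bwd packets"

# Pearson measures linear association and is sensitive to extreme values.
pearson = sample.corr(method="pearson").loc[fwd, bwd]

# Spearman works on ranks, so a single large flow has less influence.
spearman = sample.corr(method="spearman").loc[fwd, bwd]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

On the full dataset the same two calls apply to `data[[fwd, bwd]]`; a large gap between the two coefficients would suggest that outliers drive the apparent linear relationship.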