### Summary Statistics

Summary statistics provide a concise overview of a dataset's distribution. Here's a breakdown of the commonly calculated metrics:

**Central Tendency**
- **Mean**: The average value of the dataset. It's calculated by summing all values and dividing by the number of values.
- **Median**: The middle value in a dataset when the values are arranged in ascending order. It's a robust measure that's less sensitive to outliers than the mean.
- **Mode**: The most frequent value in the dataset. There can be one mode (unimodal), multiple modes (multimodal), or no mode.

**Dispersion**
- **Standard Deviation**: Measures the spread of the data around the mean. A higher standard deviation indicates greater variability.
- **Variance**: The average squared deviation from the mean; equal to the square of the standard deviation.
- **Range**: The difference between the maximum and minimum values.
- **Interquartile Range (IQR)**: The difference between the 75th percentile (Q3) and the 25th percentile (Q1). It measures the spread of the middle 50% of the data.
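
As a minimal sketch, the metrics above can be computed with pandas as shown below. The values are made up for illustration and are not taken from the dataset. The single outlier (40) pulls the mean (about 10.14) well above the median (5), which is why the median is called robust. Note that pandas uses the sample versions of standard deviation and variance (`ddof=1`) by default.

```python
import pandas as pd

# Toy values for illustration only (not taken from the dataset).
values = pd.Series([2, 4, 4, 5, 7, 9, 40])

# Central tendency
mean = values.mean()            # sum of values / number of values
median = values.median()        # middle value after sorting
modes = values.mode().tolist()  # most frequent value(s); can be more than one

# Dispersion
std = values.std()              # sample standard deviation (ddof=1 by default)
variance = values.var()         # sample variance, equal to std ** 2
value_range = values.max() - values.min()
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1                   # spread of the middle 50% of the data

print(f"mean={mean:.2f}, median={median}, modes={modes}")
print(f"std={std:.2f}, var={variance:.2f}, range={value_range}, IQR={iqr}")
```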
---

**Category Name: Basic Flow Information**

These are the features we will process in this notebook:
- Flow ID: A unique identifier for a network flow.
- Src IP: Source IP address of the flow.
- Src Port: Source port number of the flow.
- Dst IP: Destination IP address of the flow.
- Dst Port: Destination port number of the flow.
- Protocol: The network protocol used (e.g., TCP, UDP, ICMP).
- Timestamp: The timestamp when the flow started.

In [1]:
# Importing important libraries 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os 
import json 

In [2]:
# Functions 
# Update this code block to add more functions as per need
def read_data(data_path: str, usecols: list) -> pd.DataFrame:
    """
    Read the CSV file at data_path and return only the columns listed in usecols.
    """
    data = pd.read_csv(data_path, usecols=usecols)
    return data

**Data**

In [3]:
# Read data from filepath 
# This file path is for temporary usage (For EDA)
# It will be removed once the task is completed 

file_path = "dataset.csv"
# Columns in this category
columns_to_read = ['Flow ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp']

# Read the data
try:
    data = read_data(file_path, usecols=columns_to_read)
    print(f"Successfully read {len(data.columns)} features")
except FileNotFoundError:
    # Catch only the missing-file case so other errors (e.g. a wrong column name) are not hidden
    print("File not found!")
    print("[INFO] Please place the dataset.csv in the directory for use!")

Successfully read 7 features


**Analysis**

In [4]:
# Pandas Describe function
data.describe()

| | Src Port | Dst Port | Protocol |
|---|---|---|---|
| count | 3231475.0 | 3231475.0 | 3231475.0 |
| mean | 37953.97 | 7856.463 | 6.202331 |
| std | 17312.81 | 4032.255 | 1.558721 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 25921.0 | 8080.0 | 6.0 |
| 50% | 40981.0 | 8080.0 | 6.0 |
| 75% | 52014.0 | 8080.0 | 6.0 |
| max | 65535.0 | 65387.0 | 17.0 |


In [5]:
data.columns

Index(['Flow ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol',
       'Timestamp'],
      dtype='object')

In [13]:
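# Let's look at the number of unique values in every column,
# including the non-numerical ones: Flow ID, Src IP, Dst IP, Timestamp

# Unique values (dropna=False counts NaN as a value, matching len(unique()))
for col in data.columns:
    print(f"There are {data[col].nunique(dropna=False)} unique values in the column {col}")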

There are 345418 unique values in the column Flow ID
There are 4253 unique values in the column Src IP
There are 64514 unique values in the column Src Port
There are 4282 unique values in the column Dst IP
There are 4755 unique values in the column Dst Port
There are 3 unique values in the column Protocol
There are 3206589 unique values in the column Timestamp


> Flow ID: Unique identifier of the network flow. There are about 345k (roughly 3.45 lakh) unique values across 3,231,475 rows, so many rows share the same Flow ID. This column is of object type and needs to be encoded before further use.
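
One possible way to check how many rows share a Flow ID, and to integer-encode the column, is sketched below. The Flow ID strings are hypothetical placeholders (the real format in `dataset.csv` may differ), and the `Flow ID Code` column name is an assumption introduced for this example. `pd.factorize` is only one encoding option among several.

```python
import pandas as pd

# Hypothetical Flow ID strings; the real format in dataset.csv may differ.
flows = pd.DataFrame(
    {"Flow ID": ["a-b-1", "a-b-1", "c-d-2", "a-b-1", "c-d-2", "e-f-3"]}
)

# Rows per flow: counts above 1 mean the flow appears in several records.
rows_per_flow = flows["Flow ID"].value_counts()

# Integer-encode the identifier: equal strings receive equal codes,
# numbered in order of first appearance.
codes, uniques = pd.factorize(flows["Flow ID"])
flows["Flow ID Code"] = codes  # hypothetical column name

print(rows_per_flow)
print(flows)
```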

> Src IP: IP address of the source. There are only about 4.3k unique values (4,253). The counts of unique source and destination IPs are expected to be close, and they are. This column is of object type and needs to be encoded.

> Dst IP: IP address of the destination. There are only about 4.3k unique values (4,282), similar to Src IP. This column is of object type and needs to be encoded.
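
As a hedged sketch of one way to treat the IP columns, the standard-library `ipaddress` module can turn each address into an integer and expose simple properties such as whether it is private. The addresses, the helper `ip_to_int`, and the new column names are illustrative assumptions, not part of the notebook.

```python
import ipaddress

import pandas as pd


def ip_to_int(ip: str) -> int:
    """Convert an IPv4 or IPv6 address string to its integer value."""
    return int(ipaddress.ip_address(ip))


# Made-up addresses for illustration only.
sample = pd.DataFrame({"Src IP": ["192.168.1.10", "10.0.0.1", "192.168.1.10"]})

# Numeric encoding: identical addresses map to identical integers.
sample["Src IP Int"] = sample["Src IP"].map(ip_to_int)

# Boolean feature: is the address in a private range?
sample["Src IP Private"] = sample["Src IP"].map(
    lambda ip: ipaddress.ip_address(ip).is_private
)

print(sample)
```

The integer form preserves subnet ordering but is not a meaningful magnitude, so whether it suits a given model is a separate decision.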

> Timestamp: The date and exact time at which the flow started. With 3,206,589 unique values across 3,231,475 rows, almost every row has its own timestamp. This column can be split into separate date and time parts for further use.
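
Splitting the timestamp could look roughly like the sketch below. The timestamp strings and the `format` pattern are assumptions, since the actual format used in `dataset.csv` is not shown here; the derived column names are also hypothetical.

```python
import pandas as pd

# Hypothetical timestamp strings; the actual format in dataset.csv is an assumption.
sample = pd.DataFrame({"Timestamp": ["2023-01-15 13:45:10", "2023-01-16 02:05:59"]})

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(
    sample["Timestamp"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

# Derived parts (column names are illustrative).
sample["Date"] = parsed.dt.date
sample["Time"] = parsed.dt.time
sample["Hour"] = parsed.dt.hour
sample["DayOfWeek"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6

print(sample)
```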

In [19]:
# Describe
data.describe()

| | Src Port | Dst Port | Protocol |
|---|---|---|---|
| count | 3231475.0 | 3231475.0 | 3231475.0 |
| mean | 37953.97 | 7856.463 | 6.202331 |
| std | 17312.81 | 4032.255 | 1.558721 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 25921.0 | 8080.0 | 6.0 |
| 50% | 40981.0 | 8080.0 | 6.0 |
| 75% | 52014.0 | 8080.0 | 6.0 |
| max | 65535.0 | 65387.0 | 17.0 |


> Src Port: Source port number of the flow. Ranges from 0 to 65535, with a median of 40981. May need to be scaled (to be decided later).

> Dst Port: Destination port number of the flow. Ranges from 0 to 65387, but the 25th, 50th, and 75th percentiles are all 8080, so at least half of the flows target port 8080. May need to be scaled.
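
Two possible treatments for the port columns are sketched below: scaling by the known maximum port number, or bucketing into the IANA port ranges (system 0-1023, user 1024-49151, dynamic 49152-65535) and treating the port as a category. The port values and new column names are made up for illustration, and neither option is the notebook's final choice.

```python
import pandas as pd

MAX_PORT = 65535  # largest valid TCP/UDP port number

# Made-up port values for illustration only.
ports = pd.DataFrame({"Src Port": [0, 443, 8080, 52014, 65535]})

# Option 1: min-max scaling to [0, 1] using the known port range.
ports["Src Port Scaled"] = ports["Src Port"] / MAX_PORT

# Option 2: bucket into the IANA ranges and treat the port as a category.
# Intervals are right-inclusive: (-1, 1023], (1023, 49151], (49151, 65535].
bins = [-1, 1023, 49151, MAX_PORT]
labels = ["system", "user", "dynamic"]
ports["Src Port Range"] = pd.cut(ports["Src Port"], bins=bins, labels=labels)

print(ports)
```

Because port numbers are identifiers rather than quantities, the categorical option may carry more meaning than the scaled value.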

> Protocol: IANA protocol number of the flow. Only three values occur in this column: 0, 6, and 17.

In [22]:
"""Protocol Column """
# The type of network protocol used 

# 0 means HOPOPT 
# 6 means TCP 
# 17 means UDP

# Analysis
data['Protocol'].value_counts()

Protocol
6     3160045
17      63671
0        7759
Name: count, dtype: int64

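> TCP is by far the most used protocol, with 3,160,045 of 3,231,475 flows (about 97.8%); UDP accounts for about 2.0% and protocol 0 (HOPOPT) for about 0.2%. This column is **highly imbalanced** and may need to be treated, for example by giving higher weights to the minority classes.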
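
To make the imbalance concrete, the sketch below recomputes each protocol's share from the counts shown above and derives inverse-frequency weights using the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. This is only one possible weighting scheme, and whether weights are appropriate depends on how the column is used later; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
protocol_counts = pd.Series({6: 3160045, 17: 63671, 0: 7759}, name="count")

# IANA protocol numbers seen in this column.
protocol_names = {0: "HOPOPT", 6: "TCP", 17: "UDP"}

# Fraction of all rows that use each protocol.
share = protocol_counts / protocol_counts.sum()

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count).
weights = protocol_counts.sum() / (len(protocol_counts) * protocol_counts)

# Combine into one table; pandas aligns the three Series on the protocol number.
summary = pd.DataFrame(
    {
        "name": pd.Series(protocol_names),
        "share": share.round(4),
        "weight": weights.round(3),
    }
)
print(summary.sort_values("share", ascending=False))
```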