# Data Inspection

A first look at the data to build a basic understanding of it:

1. Understand the **data structure**
2. Analyze data types, shape, and dimensions
3. Gather early insights from a basic inspection

In [8]:
"""Importing Important Libraries For Data Inspection"""
# Pandas plays the key role in this step

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os 
import json 

**Functions**

In [2]:
# Functions 
# Update this code block to add more functions as per need
def read_data(data_path: str) -> pd.DataFrame:
    """
    This function reads the data from the data path
    """
    data = pd.read_csv(data_path)
    return data

**Data**

In [9]:
# File path of the dataset 
# This file path is for temporary usage (For EDA)
# It will be removed once the task is completed 
file_path = "dataset.csv"

# Read the data
try:
    data = read_data(file_path)
    print("File Read Successfully!")
except FileNotFoundError:
    # Only a missing file is handled here; other errors (e.g. parsing) still surface
    print("File not found!")
    print("[INFO] Please place the dataset.csv in the directory for use!")

### **Data Inspection**

In [12]:
# Shape of the data
print(f"Number of Features: {data.shape[1]}")
print(f"Number of Instances: {data.shape[0]}")

# Overall shape
print(f"Overall shape: {data.shape}")

Number of Features: 87
Number of Instances: 3231475
Overall shape: (3231475, 87)


> The dataset consists of about 3.2 million instances and 87 columns, one of which is the target column called "Label"

In [13]:
# Check for columns 
print(f"Columns: {data.columns}")

Columns: Index(['Flow ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol',
       'Timestamp', 'Flow Duration', 'Total Fwd Packet', 'Total Bwd packets',
       'Total Length of Fwd Packet', 'Total Length of Bwd Packet',
       'Fwd Packet Length Max', 'Fwd Packet Length Min',
       'Fwd Packet Length Mean', 'Fwd Packet Length Std',
       'Bwd Packet Length Max', 'Bwd Packet Length Min',
       'Bwd Packet Length Mean', 'Bwd Packet Length Std', 'Flow Bytes/s',
       'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max',
       'Flow IAT Min', 'Fwd IAT Total', 'Fwd IAT Mean', 'Fwd IAT Std',
       'Fwd IAT Max', 'Fwd IAT Min', 'Bwd IAT Total', 'Bwd IAT Mean',
       'Bwd IAT Std', 'Bwd IAT Max', 'Bwd IAT Min', 'Fwd PSH Flags',
       'Bwd PSH Flags', 'Fwd URG Flags', 'Bwd URG Flags', 'Fwd RST Flags',
       'Bwd RST Flags', 'Fwd Header Length', 'Bwd Header Length',
       'Fwd Packets/s', 'Bwd Packets/s', 'Packet Length Min',
       'Packet Length Max', 'Packet Len

> The features mostly describe network packets, network protocols, time durations, lengths, and similar information

> The target variable is named 'Label' and takes 12 distinct values (Benign plus 11 attack labels, numbered 0 to 11), which makes this a multi-class classification task

> Each label and its class number are listed below:

- Benign - 0
- CVE-2020-13379 - 1
- Node-RED Reconnaissance - 2
- Node-RED RCE - 3
- Node-RED Container Escape - 4
- CVE-2021-43798 - 5
- CVE-2019-20933 - 6
- CVE-2021-30465 - 7
- CVE-2021-25741 - 8
- CVE-2022-23648 - 9
- CVE-2019-5736 - 10
- DSB Nuclei Scan - 11

> *Note:* *Detailed information on each label can be found in the label-info.txt file in the base directory of this project*
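
The label list above can be kept as a dictionary so that labels can be encoded and decoded consistently. The following is a minimal sketch, assuming the `Label` column stores the label names as strings; the names `LABEL_TO_CLASS`, `CLASS_TO_LABEL`, and the toy DataFrame are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Label names and class numbers copied from the list above
# (CVE IDs written with plain ASCII hyphens).
LABEL_TO_CLASS = {
    "Benign": 0,
    "CVE-2020-13379": 1,
    "Node-RED Reconnaissance": 2,
    "Node-RED RCE": 3,
    "Node-RED Container Escape": 4,
    "CVE-2021-43798": 5,
    "CVE-2019-20933": 6,
    "CVE-2021-30465": 7,
    "CVE-2021-25741": 8,
    "CVE-2022-23648": 9,
    "CVE-2019-5736": 10,
    "DSB Nuclei Scan": 11,
}

# Reverse lookup: class number -> label name.
CLASS_TO_LABEL = {number: name for name, number in LABEL_TO_CLASS.items()}

# Toy frame standing in for the real dataset (assumption: labels are strings).
toy = pd.DataFrame({"Label": ["Benign", "Node-RED RCE", "CVE-2019-5736"]})
toy["Class"] = toy["Label"].map(LABEL_TO_CLASS)
print(toy)
```

If the real `Label` column already holds the class numbers, `CLASS_TO_LABEL` can be used with `.map` in the same way to recover readable names.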

**Data Structure**

In [18]:
# Pandas info 
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3231475 entries, 0 to 3231474
Data columns (total 87 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   Flow ID                     object 
 1   Src IP                      object 
 2   Src Port                    int64  
 3   Dst IP                      object 
 4   Dst Port                    int64  
 5   Protocol                    int64  
 6   Timestamp                   object 
 7   Flow Duration               int64  
 8   Total Fwd Packet            int64  
 9   Total Bwd packets           int64  
 10  Total Length of Fwd Packet  float64
 11  Total Length of Bwd Packet  float64
 12  Fwd Packet Length Max       float64
 13  Fwd Packet Length Min       float64
 14  Fwd Packet Length Mean      float64
 15  Fwd Packet Length Std       float64
 16  Bwd Packet Length Max       float64
 17  Bwd Packet Length Min       float64
 18  Bwd Packet Length Mean      float64
 19  Bwd Packet Length Std

> Four columns are of **categorical** type (stored as `object`)

> 45 columns are **float** type (float64) and 38 are **integer** type (int64)
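
The per-dtype counts quoted above can be computed directly instead of being read off the `info()` output. A minimal sketch on a small stand-in DataFrame (the `toy` frame and its values are hypothetical; on the real data, `data` would be used in its place):

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "Src IP": pd.Series(["10.0.0.1", "10.0.0.2"], dtype="object"),
    "Src Port": pd.Series([443, 8080], dtype="int64"),
    "Flow Duration": pd.Series([1200, 56], dtype="int64"),
    "Flow Bytes/s": pd.Series([15.5, 0.0], dtype="float64"),
})

# Number of columns per dtype.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Column names grouped by dtype.
object_cols = toy.select_dtypes(include="object").columns.tolist()
int_cols = toy.select_dtypes(include="int64").columns.tolist()
float_cols = toy.select_dtypes(include="float64").columns.tolist()
print(object_cols, int_cols, float_cols)
```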

> **INSIGHT**: Convert integer columns to float where required
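
One possible way to carry out this conversion is to select the `int64` columns and cast them with `astype`. This is a sketch under the assumption that all integer columns should be converted; the `toy` frame is a hypothetical stand-in for `data`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "Src Port": pd.Series([443, 8080], dtype="int64"),
    "Flow Duration": pd.Series([1200, 56], dtype="int64"),
    "Flow Bytes/s": pd.Series([15.5, 0.0], dtype="float64"),
})

# Pick the int64 columns and cast only those to float64.
int_cols = toy.select_dtypes(include="int64").columns
converted = toy.astype({col: "float64" for col in int_cols})
print(converted.dtypes)
```

`astype` returns a new DataFrame, so the original frame keeps its integer columns. Whether identifier-like columns such as ports or `Protocol` should be converted at all depends on how they are used later.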

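> The total **memory usage is 2.1+ GB**; the `+` means pandas did not measure the full size of the `object` columns, so the actual usage is higher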
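
To get an exact figure rather than a lower bound, pandas can inspect the contents of `object` columns with `deep=True`. A minimal sketch on a hypothetical stand-in frame (on the real data this can take a while, since every string is measured):

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "Src IP": pd.Series(["10.0.0.1", "10.0.0.2"], dtype="object"),
    "Flow Duration": pd.Series([1200, 56], dtype="int64"),
})

# Shallow: object columns are counted as pointers only.
shallow = toy.memory_usage(deep=False).sum()
# Deep: the Python string objects themselves are measured as well.
deep = toy.memory_usage(deep=True).sum()
print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

The same deep measurement is available in the summary via `data.info(memory_usage="deep")`.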