## **Feature Engineering**

**Feature Interaction**
- Interaction features are new features built by combining two or more existing features, for example as a sum, product, or ratio. They capture relationships between features that are not evident when each feature is examined on its own.
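
As a rough illustration, the sketch below builds three interaction features from packet-count and duration columns. The toy values and the new column names (`total_packets`, `packets_per_duration`, `fwd_x_bwd`) are assumptions made for this example and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy flows; the values are made up for illustration.
flows = pd.DataFrame({
    "Total Fwd Packet": [7, 10, 4],
    "Total Bwd packets": [6, 0, 4],
    "Flow Duration": [100, 250, 50],
})

# Sum: total packets in both directions.
flows["total_packets"] = flows["Total Fwd Packet"] + flows["Total Bwd packets"]

# Ratio: packets per unit of flow duration.
flows["packets_per_duration"] = flows["total_packets"] / flows["Flow Duration"]

# Product: large only when both directions carry traffic.
flows["fwd_x_bwd"] = flows["Total Fwd Packet"] * flows["Total Bwd packets"]

print(flows)
```

Which combinations are worth keeping depends on the model and the data; these three are only examples of the pattern.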

**Frequency Encoding**
- Frequency encoding replaces each category in a categorical feature with how often that category occurs in the dataset, either as a raw count or as a relative frequency (the proportion of rows). This notebook uses the relative frequency, computed with `value_counts(normalize=True)`.
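
The difference between the two variants can be seen on a small made-up series; the category values below are placeholders chosen for this sketch.

```python
import pandas as pd

# Toy categorical column; the values are made up for illustration.
s = pd.Series(["a", "b", "a", "c", "a", "b"], name="category")

# Raw counts: 'a' appears 3 times, 'b' twice, 'c' once.
counts = s.map(s.value_counts())

# Relative frequencies: the same counts divided by the number of rows.
proportions = s.map(s.value_counts(normalize=True))

print(pd.DataFrame({"category": s, "count": counts, "proportion": proportions}))
```

Both variants preserve the ordering of categories by popularity; the relative form stays on a 0 to 1 scale regardless of dataset size.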

**Target Encoding** 
- Target encoding replaces each category in a categorical feature with the mean of the target variable over the rows that belong to that category.
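
A minimal sketch of the unsmoothed computation on a made-up table follows; `ip` and `label` are hypothetical column names chosen for this example.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "ip": ["x", "x", "y", "y", "y", "z"],
    "label": [1, 0, 1, 1, 0, 0],
})

# Mean of the target within each category.
means = toy.groupby("ip")["label"].mean()

# Replace each category with its mean target value.
toy["ip_encoded"] = toy["ip"].map(means)

print(toy)
```

Here `x` maps to 0.5, `y` to about 0.667, and `z` to 0.0. Categories with very few rows, such as `z`, get noisy estimates, which is the usual motivation for adding smoothing.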

**Handling Time Data**
- When dealing with timestamp data, feature extraction converts raw timestamps into numeric components such as year, month, day, hour, minute, and second, which a machine learning model can use directly.
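
The extraction can be sketched with the pandas `.dt` accessor on two sample timestamps. The `DayOfWeek` column is an extra illustration and an assumption of this sketch; the notebook itself does not create it.

```python
import pandas as pd

# Two sample timestamps in the same format as the dataset's 'Timestamp' column.
ts = pd.to_datetime(pd.Series([
    "2023-03-19 15:02:20.266440",
    "2024-04-26 08:43:54.066732",
]))

# Pull out calendar and clock components as separate numeric features.
parts = pd.DataFrame({
    "Year": ts.dt.year,
    "Month": ts.dt.month,
    "Day": ts.dt.day,
    "Hour": ts.dt.hour,
    "DayOfWeek": ts.dt.dayofweek,  # Monday=0 ... Sunday=6
})

print(parts)
```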

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import math 
from scipy import stats

**Data**

In [2]:
# Load the data
try:
    data = pd.read_csv('data_versions/02_outliers_removed.csv')
    print(f"Successfully loaded {len(data.columns)} features")
except Exception as e: 
    print(f"Error loading data: {e}")

Successfully loaded 82 features


**Analysis**

In [1]:
# analysis
# data.info()

There are 4 categorical columns

In [4]:
# get list of categorical columns
categorical_columns = data.select_dtypes(include='object').columns
print(f"Categorical columns: {categorical_columns}")

Categorical columns: Index(['Flow ID', 'Src IP', 'Dst IP', 'Timestamp'], dtype='object')


In [7]:
data[categorical_columns[0]].value_counts()

Flow ID
100.64.0.2-100.64.0.1-0-0-0              2603
10.16.0.6-114.114.114.114-0-0-0          2417
10.16.0.6-144.122.71.18-0-0-0            2346
10.16.0.6-144.122.71.18-53070-6443-6      809
144.122.71.18-10.16.0.4-6443-41270-6      762
                                         ... 
10.16.0.49-10.98.54.171-57976-11211-6       1
10.16.0.41-10.110.54.142-33270-9090-6       1
10.16.0.41-10.110.54.142-40764-9090-6       1
10.16.0.49-10.98.54.171-48284-11211-6       1
100.64.0.2-10.16.0.4-55310-8181-6           1
Name: count, Length: 330376, dtype: int64

**Functions**

In [19]:
def frequency_encoding(df, column_name):
    """
    Performs frequency encoding on a specified column in a pandas DataFrame.
    Parameters:
        df (pd.DataFrame): The input DataFrame.
        column_name (str): The name of the column to encode.
    Returns:
        pd.DataFrame: The DataFrame with the frequency encoded column.
    """
    # Calculate the frequency of each category
    frequency = df[column_name].value_counts(normalize=True)
    
    # Map the frequencies to the original column
    df[column_name] = df[column_name].map(frequency)    
    return df


import category_encoders as ce
def target_encoding(df, column_name, target_name, smoothing=1.0):
    """
    Performs target encoding on a specified column in a pandas DataFrame.
    Parameters:
        df (pd.DataFrame): The input DataFrame.
        column_name (str): The name of the column to encode.
        target_name (str): The name of the target column.
        smoothing (float): The smoothing parameter for the target encoder.
    Returns:
        pd.DataFrame: The DataFrame with the target encoded column.
        The fitted encoder is not returned, so it cannot be reused on test data.
    """
    # Initialize the target encoder
    encoder = ce.TargetEncoder(cols=[column_name], smoothing=smoothing)
    # Fit and transform the column based on the target
    df[column_name] = encoder.fit_transform(df[column_name], df[target_name])

    return df
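
Because a target encoding is derived from the label, computing it on all rows can leak label information into any later evaluation split. One possible way to avoid this is to learn the mapping on training rows only and then apply it elsewhere. The sketch below uses pandas alone with simple additive smoothing; this is a swapped-in technique and not the smoothing formula used by `ce.TargetEncoder`. The helper names `fit_target_means` and `apply_target_means` and the parameter `m` are hypothetical.

```python
import pandas as pd

def fit_target_means(train, column_name, target_name, m=1.0):
    """Learn smoothed per-category target means from training rows only."""
    global_mean = train[target_name].mean()
    grouped = train.groupby(column_name)[target_name].agg(["sum", "count"])
    # Additive smoothing: categories with few rows are pulled toward the global mean.
    smoothed = (grouped["sum"] + m * global_mean) / (grouped["count"] + m)
    return smoothed, global_mean

def apply_target_means(df, column_name, smoothed, global_mean):
    """Encode a column with the learned means; unseen categories get the global mean."""
    return df[column_name].map(smoothed).fillna(global_mean)

# Toy split; the values are made up for illustration.
train = pd.DataFrame({"Src IP": ["a", "a", "b"], "Label": [1, 0, 1]})
test = pd.DataFrame({"Src IP": ["a", "c"]})

smoothed, global_mean = fit_target_means(train, "Src IP", "Label", m=1.0)
train_encoded = apply_target_means(train, "Src IP", smoothed, global_mean)
test_encoded = apply_target_means(test, "Src IP", smoothed, global_mean)

print(test_encoded)
```

The category `c` never appears in the training rows, so it falls back to the global training mean instead of producing a missing value.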

In [20]:
# Perform frequency encoding
data = frequency_encoding(data, 'Flow ID')
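# frequency_encoding overwrites 'Flow ID' in place, so this helper column only
# exists if an earlier version of the function created it; ignore it if absent.
data.drop(columns=['Flow ID_frequency_encoded'], inplace=True, errors='ignore')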
data.head(3)

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,...,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Total TCP Flow Time,Label
0,0.000816,100.64.0.2,0,100.64.0.1,0,0,2023-03-19 15:02:20.266440,117265975,7,6,...,150.478261,21.023514,188.0,104.0,5098364.0,9600.548411,5102694.0,5058427.0,0,0
1,4e-05,10.16.0.6,34788,144.122.71.18,6443,6,2023-03-19 15:02:22.387673,116365340,7,6,...,252225.789474,37764.416542,330282.0,182906.0,5858473.0,27027.010586,5900021.0,5780088.0,116365340,0
2,0.000735,10.16.0.6,0,144.122.71.18,0,0,2023-03-19 15:02:22.901650,116311908,7,6,...,200688.0,115.236954,201019.0,200516.0,5910436.0,35670.903124,5985028.0,5829637.0,0,0


In [22]:
# Perform target encoding
data = target_encoding(data, 'Src IP', 'Label')
data.head(3)

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,...,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Total TCP Flow Time,Label
0,0.000816,0.167776,0,100.64.0.1,0,0,2023-03-19 15:02:20.266440,117265975,7,6,...,150.478261,21.023514,188.0,104.0,5098364.0,9600.548411,5102694.0,5058427.0,0,0
1,4e-05,0.166931,34788,144.122.71.18,6443,6,2023-03-19 15:02:22.387673,116365340,7,6,...,252225.789474,37764.416542,330282.0,182906.0,5858473.0,27027.010586,5900021.0,5780088.0,116365340,0
2,0.000735,0.166931,0,144.122.71.18,0,0,2023-03-19 15:02:22.901650,116311908,7,6,...,200688.0,115.236954,201019.0,200516.0,5910436.0,35670.903124,5985028.0,5829637.0,0,0


In [23]:
# Target encoding for Dst IP 
data = target_encoding(data, 'Dst IP', 'Label') 
data.head(3)

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,...,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Total TCP Flow Time,Label
0,0.000816,0.167776,0,0.398528,0,0,2023-03-19 15:02:20.266440,117265975,7,6,...,150.478261,21.023514,188.0,104.0,5098364.0,9600.548411,5102694.0,5058427.0,0,0
1,4e-05,0.166931,34788,0.193546,6443,6,2023-03-19 15:02:22.387673,116365340,7,6,...,252225.789474,37764.416542,330282.0,182906.0,5858473.0,27027.010586,5900021.0,5780088.0,116365340,0
2,0.000735,0.166931,0,0.193546,0,0,2023-03-19 15:02:22.901650,116311908,7,6,...,200688.0,115.236954,201019.0,200516.0,5910436.0,35670.903124,5985028.0,5829637.0,0,0


In [24]:
dataset = data  # alias only: both names refer to the same DataFrame (use data.copy() for an independent snapshot)

In [26]:
# HANDLE TIMESTAMP COLUMN

# Convert 'Timestamp' column to datetime format
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='ISO8601')

# Extract date components
data['Year'] = data['Timestamp'].dt.year
data['Month'] = data['Timestamp'].dt.month
data['Day'] = data['Timestamp'].dt.day
data['Hour'] = data['Timestamp'].dt.hour
data['Minute'] = data['Timestamp'].dt.minute
data['Second'] = data['Timestamp'].dt.second

# print(data)

          Flow ID    Src IP  Src Port    Dst IP  Dst Port  Protocol  \
0        0.000816  0.167776         0  0.398528         0         0   
1        0.000040  0.166931     34788  0.193546      6443         6   
2        0.000735  0.166931         0  0.193546         0         0   
3        0.000758  0.166931         0  0.414806         0         0   
4        0.000040  0.518839     56026  0.193546      6443         6   
...           ...       ...       ...       ...       ...       ...   
3189761  0.009762  0.167776     56896  2.000000      1880         6   
3189762  0.009762  0.167776     45918  2.000000      1880         6   
3189763  0.009177  0.167776     40106  0.413106      8080         6   
3189764  0.019665  0.167776     47972  0.413106      8181         6   
3189765  0.006014  0.167776     54818  2.000000      1880         6   

                         Timestamp  Flow Duration  Total Fwd Packet  \
0       2023-03-19 15:02:20.266440      117265975                 7   
...

In [30]:
data[['Timestamp', 'Year', 'Month', 'Day', 'Hour', 'Minute', 'Second']].sample(5)

Unnamed: 0,Timestamp,Year,Month,Day,Hour,Minute,Second
2326259,2023-09-21 03:57:36.707585,2023,9,21,3,57,36
381893,2023-12-06 17:26:37.306711,2023,12,6,17,26,37
2456552,2023-09-21 11:34:58.013778,2023,9,21,11,34,58
1333110,2023-12-06 21:31:13.373315,2023,12,6,21,31,13
2987627,2024-04-26 08:43:54.066732,2024,4,26,8,43,54


In [31]:
data = data.drop(columns='Timestamp') # drop the original column

In [33]:
# Rearrange columns so that 'Label' is the last column
# Current list of columns
columns = data.columns.tolist()
# Remove 'Label' from its current position
columns.remove('Label')
# Append 'Label' to the end
columns.append('Label')
# Reindex the DataFrame with the new order of columns
data = data[columns]

In [39]:
# print unique dtypes of the columns and their counts
print(data.dtypes.value_counts())

float64    48
int64      33
int32       6
Name: count, dtype: int64


In [40]:
def convert_int_to_float(df):
    """
    Converts all int32 and int64 columns in the DataFrame to float64.
    Parameters:
        df (pd.DataFrame): The input DataFrame.
    Returns:
        pd.DataFrame: The DataFrame with int32 and int64 columns converted to float64.
    """
    # Identify columns with int32 and int64 data types
    int_cols = df.select_dtypes(include=['int32', 'int64']).columns
    
    # Convert identified columns to float64
    df[int_cols] = df[int_cols].astype('float64')
    return df


# Convert 
data = convert_int_to_float(data)
print(data.dtypes.value_counts())

float64    87
Name: count, dtype: int64


In [47]:
# We'll merge features in later steps if required

**Save data to csv**

In [44]:
# Save to CSV; index=False avoids writing the row index as an extra 'Unnamed: 0' column
data.to_csv('data_versions/03_feature_engineered_data.csv', index=False)
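
The effect of the `index` argument on a later reload can be checked with a small in-memory round trip. The toy frame below is an assumption made for this sketch and stands in for the real data.

```python
import io
import pandas as pd

# Toy frame with a non-default index, standing in for the real data.
df = pd.DataFrame({"a": [1, 2]}, index=[10, 11])

# Default to_csv writes the index as an unnamed first column,
# which read_csv then loads as 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# With index=False only the real columns are written.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```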