# Data Preprocessing for IDS Models

This notebook demonstrates data preprocessing for Intrusion Detection System (IDS) models using the UNSW-NB15 dataset. The pipeline handles missing values, encodes categorical features, scales numeric features (with an optional log transformation for skewed ones), and encodes the target label consistently. Saving the fitted pipeline allows the same transformations to be applied during both training and inference.

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, FunctionTransformer, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib

# Load the UNSW-NB15 training and test datasets
# Replace these filenames with the correct paths to your dataset files
df_train = pd.read_csv('D:\\Optimization-Research\\Hybrid_Whale_RIME_IDS_Project\\UNSW_NB15\\data\\row\\UNSW_NB15_training-set.csv')
df_test = pd.read_csv('D:\\Optimization-Research\\Hybrid_Whale_RIME_IDS_Project\\UNSW_NB15\\data\\row\\UNSW_NB15_testing-set.csv')

# Display the first few rows of the training dataset
df_train.head()


## 1. Data Exploration and Handling Missing Values

Examine both datasets for missing values. In IDS datasets such as UNSW-NB15 they may be rare, but counting them per column shows which features the imputation steps in Section 3 will actually affect.


In [None]:
# Display missing values per column in the training dataset
print("Missing values in training dataset:")
print(df_train.isnull().sum())

# Optionally, check missing values in the test dataset
print("\nMissing values in test dataset:")
print(df_test.isnull().sum())
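
On a dataset with many columns, the raw per-column counts can be hard to scan. The following is a minimal sketch, assuming a small made-up DataFrame rather than the real data, of a summary that keeps only the columns with missing values and adds a percentage. The helper name `missing_summary` and the toy values are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a slice of the dataset (values are made up).
toy = pd.DataFrame({
    "dur": [0.1, np.nan, 0.3, 0.4],
    "proto": ["tcp", "udp", None, "tcp"],
    "sbytes": [100, 200, 300, 400],
})

def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(df),
    })
    # Keep only columns with at least one missing value, largest count first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

The same helper could be called on `df_train` and `df_test`. Note that `isnull()` only detects real NaN/None values, not placeholder strings that a dataset may use to mark unknown entries.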


## 2. Identify Numeric and Categorical Features

For effective preprocessing, we separate numeric and categorical features. We also drop high-cardinality identifiers (e.g., source/destination IP addresses) from both datasets to ensure consistency.


In [None]:
# Drop high-cardinality identifier columns (source/destination IP addresses).
# errors='ignore' skips any column that is absent, so each column is handled
# independently and no membership check is needed.
def drop_high_cardinality(df):
    return df.drop(columns=['srcip', 'dstip'], errors='ignore')

# Apply the function to both training and test datasets
df_train = drop_high_cardinality(df_train)
df_test = drop_high_cardinality(df_test)

# Define the target column (adjust if needed, e.g., 'attack' or 'label')
target_column = 'label'

# Columns that must not be used as features: the target itself, plus 'id'
# (a row index) and 'attack_cat' (the attack category, which would leak the
# label) if the partitioned UNSW-NB15 CSVs in use contain them.
# Adjust this list to match your files.
non_feature_cols = [target_column, 'id', 'attack_cat']

# Identify numeric columns from the training dataset
numeric_cols = [c for c in df_train.select_dtypes(include='number').columns
                if c not in non_feature_cols]

# Identify categorical columns from the training dataset
categorical_cols = [c for c in df_train.select_dtypes(include=['object']).columns
                    if c not in non_feature_cols]

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)
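
The decision about which identifier columns to drop can also be supported by measuring cardinality directly. The sketch below is one possible way to do that on a made-up frame; the helper `high_cardinality_columns` and its `max_unique_ratio` threshold are hypothetical choices, not something the notebook prescribes.

```python
import pandas as pd

# Toy frame with one identifier-like column and two ordinary categoricals.
toy = pd.DataFrame({
    "srcip": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"],
    "proto": ["tcp", "udp", "tcp", "tcp"],
    "state": ["FIN", "INT", "FIN", "CON"],
    "sbytes": [100, 200, 300, 400],
})

def high_cardinality_columns(df, max_unique_ratio=0.9):
    """List non-numeric columns whose unique-value ratio exceeds the threshold."""
    non_numeric = df.select_dtypes(exclude="number").columns
    ratios = df[non_numeric].nunique() / len(df)
    return ratios[ratios > max_unique_ratio].index.tolist()

flagged = high_cardinality_columns(toy)
print(flagged)
```

A column flagged this way would produce roughly one one-hot column per row, which is why such identifiers are usually removed before encoding.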


## 3. Creating Preprocessing Pipelines

### 3.1 Numeric Pipeline

The numeric pipeline includes:
- **Imputation:** Filling missing numeric values using the mean strategy.
- **Log Transformation:** An optional step to transform skewed features (e.g., `dbytes`) using a log transform to reduce the influence of outliers.
- **Scaling:** Normalizing features using Min-Max scaling to bring them into the [0, 1] range.

The log transformation is applied selectively. Adjust the transformation based on your domain knowledge and data distribution.


In [None]:
# Define a function for log transformation using log1p (to handle zero values)
def log_transform(values):
    return np.log1p(values)

# Function to selectively apply log transformation on the column corresponding to 'dbytes'
# Here, X is expected to be a NumPy array and feature_names is the list of column names in order.
def select_and_log_transform(X, feature_names):
    X_transformed = X.copy()
    if 'dbytes' in feature_names:
        idx = feature_names.index('dbytes')
        X_transformed[:, idx] = log_transform(X_transformed[:, idx])
    return X_transformed

# Top-level (named) wrapper so that the FunctionTransformer can be pickled with
# joblib; a lambda could not be. It relies on the global 'numeric_cols' list
# defined in Section 2, so that list must match the column order the
# ColumnTransformer passes to this pipeline.
def apply_log_transform(X):
    return select_and_log_transform(X, numeric_cols)

# Create the numeric pipeline using the named function 'apply_log_transform'
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('log_transform', FunctionTransformer(apply_log_transform, validate=False)),
    ('scaler', MinMaxScaler())
])
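
An alternative to looking up the column index through a global list is to route the skewed columns through their own sub-pipeline. The following is a sketch of that idea on made-up data, not the notebook's method; the names `skewed_cols`, `other_cols`, and `numeric_sketch` are assumptions introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

skewed_cols = ["dbytes"]  # assumed to be the heavy-tailed features
other_cols = ["dur"]      # remaining numeric features

# Skewed columns: impute, compress with log1p, then scale to [0, 1].
skewed_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("log1p", FunctionTransformer(np.log1p)),
    ("scaler", MinMaxScaler()),
])

# Other numeric columns: impute and scale only.
plain_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])

numeric_sketch = ColumnTransformer(transformers=[
    ("skewed", skewed_pipeline, skewed_cols),
    ("plain", plain_pipeline, other_cols),
])

toy = pd.DataFrame({
    "dbytes": [0, 10, 1000, 1_000_000],
    "dur": [0.0, 0.5, 1.0, 2.0],
})
out = numeric_sketch.fit_transform(toy)
print(out)
```

With this layout the output columns follow the transformer order (skewed columns first), which differs from the original column order. Because `np.log1p` is a NumPy function rather than a notebook-level helper, the fitted object does not depend on any global variable.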


### 3.2 Categorical Pipeline

For categorical features, the pipeline includes:
- **Imputation:** Filling missing values with a constant value (e.g., 'unknown').
- **Encoding:** Converting categorical variables to numeric using one-hot encoding. This ensures that the model does not assume any ordinal relationship among categories.

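For some models (such as tree-based algorithms), integer encoding of the categories (e.g., scikit-learn's `OrdinalEncoder`) may also be appropriate; `LabelEncoder` is intended for target labels rather than input features. In this example, we use one-hot encoding.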


In [None]:
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
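
The effect of `handle_unknown='ignore'` is easiest to see on a tiny example. In this sketch (toy protocol values, chosen only for illustration) the encoder is fitted on two categories and then asked to transform a third one it has never seen.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the categories present in the "training" data.
train = np.array([["tcp"], ["udp"], ["tcp"]], dtype=object)
# The "test" data contains 'icmp', which was not seen during fitting.
test = np.array([["udp"], ["icmp"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The encoder returns a sparse matrix by default; convert for display.
encoded = encoder.transform(test).toarray()
print(encoder.categories_)
print(encoded)
```

The unseen category is encoded as an all-zero row instead of raising an error, so a model receives no active indicator for that feature.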


## 4. Combining Pipelines with ColumnTransformer

We use a `ColumnTransformer` to combine the numeric and categorical pipelines so that each set of columns receives the appropriate transformations.


In [None]:
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_pipeline, numeric_cols),
    ('cat', categorical_pipeline, categorical_cols)
])


## 5. Preparing the Target Label

The target label in UNSW-NB15 can be binary (e.g., 0 for normal and 1 for attack) or multi-class. This cell checks whether the target is already numeric. If not, label encoding is applied to the training dataset, and the same transformation is used on the test dataset (if available). This ensures consistency in how the labels are represented.


In [None]:
# If the target label in the training dataset is not numeric, apply label encoding
if df_train[target_column].dtype == 'object':
    le = LabelEncoder()
    df_train[target_column] = le.fit_transform(df_train[target_column])
    
    # If the test dataset contains the target column and it is non-numeric, apply the same encoding
    if target_column in df_test.columns and df_test[target_column].dtype == 'object':
        df_test[target_column] = le.transform(df_test[target_column])
    
    # Save the label encoder for use during inference
    joblib.dump(le, 'label_encoder.joblib')
    print("Label encoding applied on training dataset.")
else:
    print("Target column is already numeric in training dataset.")
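
`LabelEncoder.transform` raises an error for any label that was not present during fitting, so it can help to compare label sets before encoding the test data. The following sketch uses made-up label values to show the integer mapping, the round trip back to strings, and a simple check for unseen labels.

```python
from sklearn.preprocessing import LabelEncoder

# Toy string labels standing in for a non-numeric target column.
train_labels = ["Normal", "DoS", "Exploits", "Normal"]
test_labels = ["Normal", "Worms"]

le = LabelEncoder()
encoded = le.fit_transform(train_labels)

# Classes are stored in sorted order; the integer code is the class index.
print(list(le.classes_))
print(list(encoded))

# inverse_transform maps the integer codes back to the original strings.
decoded = le.inverse_transform(encoded)

# Labels present in the test data but unknown to the fitted encoder.
unseen = sorted(set(test_labels) - set(le.classes_))
print(unseen)
```

How to treat unseen labels (dropping the rows, mapping them to a catch-all class, or refitting) depends on the experiment and is left open here.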


## 6. Applying the Preprocessing Pipeline

With the preprocessing pipeline defined, we now apply it to the training and test datasets. The pipeline is fitted on the training data, and the same transformations are applied to the test data to ensure consistency.


In [None]:
# Separate features and target label for training and test datasets
X_train = df_train.drop(columns=[target_column])
y_train = df_train[target_column]

# For the test dataset, if the target column exists, separate it; otherwise, work only with features
X_test = df_test.drop(columns=[target_column]) if target_column in df_test.columns else df_test.copy()
y_test = df_test[target_column] if target_column in df_test.columns else None

# Fit the preprocessor on the training data and transform the training features
X_train_transformed = preprocessor.fit_transform(X_train)

# Apply the same transformation to the test features
X_test_transformed = preprocessor.transform(X_test)

print("Preprocessing complete for both training and test datasets.")
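
Because the scaler learns its minimum and maximum from the training data only, transformed test values are not guaranteed to stay inside [0, 1]. This sketch, on made-up numbers, shows the effect and one optional way to bound the output with the scaler's `clip` parameter; whether clipping is desirable is a modeling decision rather than something the notebook specifies.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])  # 20.0 exceeds the training maximum

# Fit on training data only, then reuse the learned range on the test data.
scaler = MinMaxScaler().fit(train)
scaled_test = scaler.transform(test)

# With clip=True, transformed values are clipped to the feature range.
clipping_scaler = MinMaxScaler(clip=True).fit(train)
clipped_test = clipping_scaler.transform(test)

print(scaled_test.ravel())
print(clipped_test.ravel())
```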


## 7. Saving the Preprocessing Pipeline

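To ensure consistency during future inference, save the fitted preprocessing pipeline. This step preserves the exact transformations learned from the training data so that they can be reused on new incoming data. Because the pipeline contains a `FunctionTransformer` wrapping `apply_log_transform`, that function and the `numeric_cols` list it reads must also be defined in the environment that later loads the file.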


In [None]:
joblib.dump(preprocessor, 'preprocessing_pipeline.joblib')
print("Preprocessing pipeline saved.")
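
Loading the saved object back and confirming that it reproduces the original output is a cheap sanity check. The sketch below illustrates the round trip with a plain scaler on made-up data, written to a temporary directory, so it needs no custom functions; the file name `scaler_sketch.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
new_data = np.array([[5.0]])

scaler = MinMaxScaler().fit(train)

# Save to a temporary location and load the object back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler_sketch.joblib")
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# The restored transformer should reproduce the original output exactly.
original_out = scaler.transform(new_data)
restored_out = restored.transform(new_data)
print(np.allclose(original_out, restored_out))
```

The same pattern would apply to `preprocessing_pipeline.joblib`: load it with `joblib.load` and call `transform` on new feature rows that have the same columns as the training data.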


## Summary

This notebook provided a step-by-step guide for preprocessing the UNSW-NB15 dataset for IDS models using two separate datasets (training and test). The process included:

- Handling missing values via imputation for both numeric and categorical data.
- Encoding categorical features with one-hot encoding (with integer encoding as an alternative for tree-based models).
- Applying feature scaling (with an optional log transformation for skewed numeric features).
- Ensuring the target label is in the correct numeric format.
- Saving the preprocessing pipeline to avoid training-serving skew during future inference.

Fitting every transformation on the training set only and reusing the saved pipeline keeps the inputs to an IDS model consistent between training and inference.
