# Data Preprocessing for Intrusion Detection System

This notebook loads, preprocesses, and prepares the KDD Cup 1999 dataset for training machine learning models that detect network intrusions. The `KDDTrain+` and `KDDTest+` file names used below are those of NSL-KDD, a refined version of the KDD Cup 1999 data.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder


## Load the Dataset

We'll load the training and test datasets from CSV files.

In [2]:
# Load the datasets
train_df = pd.read_csv('KDDTrain+.csv')
test_df = pd.read_csv('KDDTest+.csv')

# Display the first few rows of the training dataset
train_df.head()
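
The preprocessing step refers to the `protocol_type`, `service`, `flag`, and `label` columns by name, so it assumes the CSV files carry a header row with those names. As a minimal sketch, a check like the following could fail early if that assumption does not hold. `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the toy frame stands in for `train_df`.

```python
import pandas as pd

# Columns the preprocessing step refers to by name
REQUIRED_COLUMNS = ['protocol_type', 'service', 'flag', 'label']

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Toy frame standing in for train_df (values are made up)
toy_df = pd.DataFrame({'protocol_type': ['tcp'], 'service': ['http'], 'label': ['normal']})
print(missing_columns(toy_df))
```

In the notebook, the same helper could be called as `missing_columns(train_df)` and `missing_columns(test_df)` right after loading; an empty list means every expected column is present.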

## Preprocessing the Data

We'll one-hot encode the categorical variables, integer-encode the labels, separate the features from the labels, standardize the features, and hold out 20% of the training data for validation.

In [3]:
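# Function to preprocess the data
def preprocess_data(df, label_encoder, feature_columns=None):
    # One-hot encode categorical features
    categorical_cols = ['protocol_type', 'service', 'flag']
    X = pd.get_dummies(df.drop(columns=['label']), columns=categorical_cols, dtype=float)

    # Reuse the training column layout; categories absent from df become zero columns
    if feature_columns is not None:
        X = X.reindex(columns=feature_columns, fill_value=0)

    # Encode labels with the shared encoder
    y = pd.Series(label_encoder.transform(df['label']), index=df.index, name='label')

    return X, y

# Fit one label encoder on the labels of both files so the integer codes agree
# and labels that appear in only one file can still be encoded
label_encoder = LabelEncoder()
label_encoder.fit(pd.concat([train_df['label'], test_df['label']]))

# Preprocess the training and test datasets with a shared feature layout
X_train, y_train = preprocess_data(train_df, label_encoder)
X_test, y_test = preprocess_data(test_df, label_encoder, feature_columns=X_train.columns)

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Standardize the features, fitting the scaler on the training split only
# so validation and test statistics do not leak into it
scaler = StandardScaler()
feature_names = X_train.columns
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=feature_names)
X_val = pd.DataFrame(scaler.transform(X_val), columns=feature_names)
X_test = pd.DataFrame(scaler.transform(X_test), columns=feature_names)

X_train.shape, X_val.shape, X_test.shape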
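
When two files are one-hot encoded separately, a category that appears in only one of them produces a column the other lacks, and a scaler fitted on one layout cannot transform the other. The sketch below illustrates the effect on made-up toy frames (not the real dataset) and shows one common way to align the columns with `DataFrame.reindex`.

```python
import pandas as pd

# Toy stand-ins for the training and test frames (values are made up)
toy_train = pd.DataFrame({'service': ['http', 'ftp', 'smtp'], 'duration': [0, 5, 2]})
toy_test = pd.DataFrame({'service': ['http', 'dns'], 'duration': [1, 3]})

train_X = pd.get_dummies(toy_train, columns=['service'], dtype=float)
test_X = pd.get_dummies(toy_test, columns=['service'], dtype=float)

# Encoded independently, the two frames end up with different columns
mismatched = list(train_X.columns) != list(test_X.columns)

# Reindexing to the training columns adds missing categories as zeros
# and drops categories the training data never saw
test_aligned = test_X.reindex(columns=train_X.columns, fill_value=0)
print(list(test_aligned.columns))
```

Dropping categories that the training data never saw (`dns` in this toy example) is a design choice: a model trained on the training columns has no weight for them anyway.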

## Save Preprocessed Data

We'll save the preprocessed data to new CSV files for use in model training.

In [4]:
# Save the preprocessed data
pd.DataFrame(X_train).to_csv('X_train.csv', index=False)
pd.DataFrame(X_val).to_csv('X_val.csv', index=False)
pd.DataFrame(X_test).to_csv('X_test.csv', index=False)
pd.DataFrame(y_train).to_csv('y_train.csv', index=False)
pd.DataFrame(y_val).to_csv('y_val.csv', index=False)
pd.DataFrame(y_test).to_csv('y_test.csv', index=False)
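
Because the files are written with `index=False`, features and labels are matched purely by row order when they are read back. A minimal sketch of a round-trip check, using made-up toy data and a temporary directory rather than the notebook's real output files:

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for a saved feature matrix and label vector (values are made up)
X_demo = pd.DataFrame({'duration': [0.5, -1.2], 'service_http': [1.0, 0.0]})
y_demo = pd.Series([0, 1], name='label')

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, 'X_demo.csv')
    y_path = os.path.join(tmp_dir, 'y_demo.csv')
    X_demo.to_csv(x_path, index=False)
    y_demo.to_frame().to_csv(y_path, index=False)

    # Read the files back the way a training notebook might
    X_loaded = pd.read_csv(x_path)
    y_loaded = pd.read_csv(y_path)['label']

# Features and labels should still have one row per sample
rows_match = len(X_loaded) == len(y_loaded)
```

A training notebook could apply the same row-count comparison to `X_train.csv` and `y_train.csv` before fitting a model.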

## Summary

In this notebook, we:
1. Loaded the KDD Cup 1999 training and test datasets.
2. Preprocessed the data by one-hot encoding categorical features, encoding labels as integers, and standardizing the features.
3. Split the training data into training and validation sets, keeping the provided test set separate.
4. Saved the preprocessed data for use in model training.