# Notebook 1: Data Preprocessing

## Objective

This notebook handles the preprocessing of the UNSW-NB15 dataset for our intrusion detection system. We will:

1. Load the raw dataset from CSV files
2. Inspect the data (structure, missing values, class distribution)
3. Select features and handle missing values
4. Encode categorical features
5. Normalize features to the range [0, 1]
6. Create sequences of network flows for temporal modeling
7. Split data into train, validation, and test sets
8. Extract normal traffic and save the preprocessed data for subsequent notebooks

---

## Background: UNSW-NB15 Dataset

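The UNSW-NB15 dataset is a modern network intrusion detection dataset that contains:
- **49 features** in the full dataset describing network flow characteristics (the partitioned training and testing CSV files loaded below contain 45 columns, including `id`, `attack_cat`, and `label`)
- **Normal traffic** and **9 attack categories**
- A mix of realistic normal traffic and synthetically generated attack traffic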

**Attack Categories:**
- Fuzzers: Attempts to discover vulnerabilities by sending random data
- DoS: Denial of Service attacks
- Exploits: Exploitation of known vulnerabilities
- Reconnaissance: Scanning and probing
- Shellcode: Code injection attacks
- Analysis, Backdoor, Generic, Worms

---

## Step 1: Import Required Libraries

We'll use:
- **pandas**: For data manipulation
- **numpy**: For numerical operations
- **sklearn**: For preprocessing and data splitting
- **pickle**: For saving processed data

In [1]:
import pandas as pd
import numpy as np
import pickle
import os
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")

Libraries imported successfully!


## Step 2: Load the Dataset

The UNSW-NB15 dataset typically comes in two files:
- Training set
- Testing set

We'll load both and combine them for our preprocessing pipeline.

In [2]:
# Define data paths
TRAIN_PATH = 'data/UNSW_NB15_training-set.csv'
TEST_PATH = 'data/UNSW_NB15_testing-set.csv'

# Load the datasets
print("Loading UNSW-NB15 dataset...")
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

# Combine both datasets for unified preprocessing
df = pd.concat([train_df, test_df], axis=0, ignore_index=True)

print(f"Dataset loaded successfully!")
print(f"   - Training set: {train_df.shape[0]} samples")
print(f"   - Testing set: {test_df.shape[0]} samples")
print(f"   - Combined dataset: {df.shape[0]} samples, {df.shape[1]} features")

Loading UNSW-NB15 dataset...
Dataset loaded successfully!
   - Training set: 82332 samples
   - Testing set: 175341 samples
   - Combined dataset: 257673 samples, 45 features
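
Combining the two files discards the information about which file each row came from. If that provenance is needed later, one option is to tag each frame before concatenating. The following is a minimal sketch on toy frames; the `source` column name and the `tag_and_concat` helper are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

def tag_and_concat(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    """Concatenate two frames, recording each row's origin in a 'source' column."""
    tagged = [
        train_df.assign(source="train"),
        test_df.assign(source="test"),
    ]
    return pd.concat(tagged, axis=0, ignore_index=True)

# Toy stand-ins for the two CSV files (hypothetical values)
toy_train = pd.DataFrame({"dur": [0.1, 0.2], "label": [0, 1]})
toy_test = pd.DataFrame({"dur": [0.3], "label": [1]})

combined = tag_and_concat(toy_train, toy_test)
print(combined["source"].value_counts().to_dict())
```

If this approach were used, the `source` column would need to be dropped alongside `id` before modeling, since it is not a flow feature.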


## Step 3: Initial Data Inspection

Let's examine the structure of our dataset to understand:
- Feature names and types
- Missing values
- Basic statistics

In [3]:
# Display first few rows
print("First 5 rows of the dataset:")
print(df.head())

First 5 rows of the dataset:
   id       dur proto service state  spkts  dpkts  sbytes  dbytes  \
0   1  0.000011   udp       -   INT      2      0     496       0   
1   2  0.000008   udp       -   INT      2      0    1762       0   
2   3  0.000005   udp       -   INT      2      0    1068       0   
3   4  0.000006   udp       -   INT      2      0     900       0   
4   5  0.000010   udp       -   INT      2      0    2126       0   

          rate  ...  ct_dst_sport_ltm  ct_dst_src_ltm  is_ftp_login  \
0   90909.0902  ...                 1               2             0   
1  125000.0003  ...                 1               2             0   
2  200000.0051  ...                 1               3             0   
3  166666.6608  ...                 1               3             0   
4  100000.0025  ...                 1               3             0   

   ct_ftp_cmd  ct_flw_http_mthd  ct_src_ltm  ct_srv_dst  is_sm_ips_ports  \
0           0                 0           1          

In [4]:
# Dataset information
print("\n📋 Dataset Information:")
# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()


📋 Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257673 entries, 0 to 257672
Data columns (total 45 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   id                 257673 non-null  int64  
 1   dur                257673 non-null  float64
 2   proto              257673 non-null  object 
 3   service            257673 non-null  object 
 4   state              257673 non-null  object 
 5   spkts              257673 non-null  int64  
 6   dpkts              257673 non-null  int64  
 7   sbytes             257673 non-null  int64  
 8   dbytes             257673 non-null  int64  
 9   rate               257673 non-null  float64
 10  sttl               257673 non-null  int64  
 11  dttl               257673 non-null  int64  
 12  sload              257673 non-null  float64
 13  dload              257673 non-null  float64
 14  sloss              257673 non-null  int64  
 15  dloss              25767

In [5]:
# Check for missing values
print("Missing values per column:")
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
if len(missing) > 0:
    print(missing)
else:
    print("No missing values found!")

Missing values per column:
No missing values found!
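
`isnull()` only counts true NaN values. The preview above shows `-` in the `service` column, which looks like a placeholder for an unknown service and is not counted as missing. A minimal sketch for counting such placeholders follows; the set of placeholder strings and the `count_placeholders` helper are assumptions and should be adjusted after inspecting the data.

```python
import pandas as pd

# Assumed placeholder tokens; adjust after inspecting the actual data
PLACEHOLDERS = {"-", "", "?"}

def count_placeholders(frame: pd.DataFrame) -> pd.Series:
    """Count placeholder strings in each non-numeric column."""
    text_cols = frame.select_dtypes(exclude="number").columns
    counts = {
        col: int(frame[col].astype(str).str.strip().isin(PLACEHOLDERS).sum())
        for col in text_cols
    }
    return pd.Series(counts, dtype="int64")

# Toy frame mimicking the columns shown in the preview
toy = pd.DataFrame({
    "proto": ["udp", "tcp", "udp"],
    "service": ["-", "http", "-"],
    "spkts": [2, 10, 2],
})
print(count_placeholders(toy))
```

Whether a placeholder should be kept as its own category or treated as missing is a modeling decision; this notebook keeps it as a category.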


In [6]:
# Check class distribution
print("Class Distribution:")
if 'label' in df.columns:
    print(df['label'].value_counts())
    print(f"\nNormal traffic: {(df['label'] == 0).sum()} samples")
    print(f"Attack traffic: {(df['label'] == 1).sum()} samples")
    print(f"Attack ratio: {(df['label'] == 1).sum() / len(df) * 100:.2f}%")

if 'attack_cat' in df.columns:
    print("Attack Categories:")
    print(df['attack_cat'].value_counts())

Class Distribution:
label
1    164673
0     93000
Name: count, dtype: int64

Normal traffic: 93000 samples
Attack traffic: 164673 samples
Attack ratio: 63.91%
Attack Categories:
attack_cat
Normal            93000
Generic           58871
Exploits          44525
Fuzzers           24246
DoS               16353
Reconnaissance    13987
Analysis           2677
Backdoor           2329
Shellcode          1511
Worms               174
Name: count, dtype: int64


## Step 4: Feature Selection and Cleaning

We'll:
1. Store the labels for later use
2. Remove non-predictive columns (the `id` identifier and the two label columns)
3. Identify and separate feature types (numerical vs. categorical)

Missing values are handled in Step 5.

In [7]:
# Store labels separately
labels = df['label'].values if 'label' in df.columns else None
attack_categories = df['attack_cat'].values if 'attack_cat' in df.columns else None

# Features to drop (non-predictive or identifier columns)
# Adjust these based on your specific dataset columns
columns_to_drop = ['id', 'label', 'attack_cat']
columns_to_drop = [col for col in columns_to_drop if col in df.columns]

# Create feature dataframe
features_df = df.drop(columns=columns_to_drop, errors='ignore')

print(f"Selected {features_df.shape[1]} features for modeling")
print(f"\nFeature columns: {list(features_df.columns)}")

Selected 42 features for modeling

Feature columns: ['dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes', 'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss', 'dloss', 'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin', 'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'trans_depth', 'response_body_len', 'ct_srv_src', 'ct_state_ttl', 'ct_dst_ltm', 'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm', 'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'ct_src_ltm', 'ct_srv_dst', 'is_sm_ips_ports']


In [8]:
# Identify categorical and numerical columns
categorical_cols = features_df.select_dtypes(include=['object']).columns.tolist()
numerical_cols = features_df.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Numerical features: {len(numerical_cols)}")
print(f"Categorical features: {len(categorical_cols)}")
print(f"\nCategorical columns: {categorical_cols}")

Numerical features: 39
Categorical features: 3

Categorical columns: ['proto', 'service', 'state']


## Step 5: Handle Missing Values

Step 3 found no missing values in this dataset, so the cell below acts as a safeguard for other versions of the data:
- **Numerical features**: Fill with the median
- **Categorical features**: Fill with `'unknown'`

In [9]:
# Handle missing values in numerical columns
for col in numerical_cols:
    if features_df[col].isnull().any():
        median_value = features_df[col].median()
        # Assign the result back; inplace=True on a column selection is unreliable in recent pandas
        features_df[col] = features_df[col].fillna(median_value)
        print(f"Filled {col} with median: {median_value}")

# Handle missing values in categorical columns
for col in categorical_cols:
    if features_df[col].isnull().any():
        features_df[col] = features_df[col].fillna('unknown')
        print(f"Filled {col} with 'unknown'")

print("\nMissing values handled!")


Missing values handled!


## Step 6: Encode Categorical Features

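We need to convert the categorical variables (`proto`, `service`, `state`) into numerical values. Here we use label encoding, which maps each category to an integer. These integers imply an arbitrary ordering between categories; one-hot encoding is a common alternative when that ordering is a concern.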

In [10]:
# Initialize label encoders dictionary to save for later use
label_encoders = {}

# Encode categorical columns
for col in categorical_cols:
    le = LabelEncoder()
    features_df[col] = le.fit_transform(features_df[col].astype(str))
    label_encoders[col] = le
    print(f"Encoded {col}: {len(le.classes_)} unique values")

print("\nCategorical features encoded!")

Encoded proto: 133 unique values
Encoded service: 13 unique values
Encoded state: 11 unique values

Categorical features encoded!
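
`LabelEncoder` assigns codes by the sorted order of the class strings, and its `transform` method raises a `ValueError` for categories it did not see during fitting. The sketch below shows how to inspect the mapping and one possible way to handle unseen values when the saved encoders are reused on new data. The `safe_transform` helper and the `-1` fallback code are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["udp", "tcp", "arp", "tcp"])

# Classes are stored in sorted order; the integer code is the position in classes_
mapping = {str(cls): code for code, cls in enumerate(le.classes_)}
print(mapping)

def safe_transform(encoder: LabelEncoder, values, unknown_code: int = -1) -> np.ndarray:
    """Encode values, mapping categories unseen during fit to unknown_code."""
    lookup = {str(cls): code for code, cls in enumerate(encoder.classes_)}
    return np.array([lookup.get(str(v), unknown_code) for v in values])

# 'icmp' was not seen during fit, so it receives the fallback code
print(safe_transform(le, ["tcp", "icmp"]).tolist())
```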


## Step 7: Normalize Numerical Features

Neural networks generally train more reliably when features share a comparable scale. We'll use MinMaxScaler to scale all 42 features, including the label-encoded categorical columns, to the range [0, 1].

In [11]:
# Create and fit the scaler
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features_df)

# Convert back to DataFrame for easier handling
features_scaled_df = pd.DataFrame(features_scaled, columns=features_df.columns)

print(f"Features normalized to range [0, 1]")
print(f"\nScaled features shape: {features_scaled_df.shape}")
print(f"\nSample statistics after scaling:")
print(features_scaled_df.describe())

Features normalized to range [0, 1]

Scaled features shape: (257673, 42)

Sample statistics after scaling:
                dur          proto        service          state  \
count  2.576730e+05  257673.000000  257673.000000  257673.000000   
mean   2.077859e-02       0.834364       0.129659       0.434147   
std    9.957177e-02       0.161742       0.187162       0.088742   
min    0.000000e+00       0.000000       0.000000       0.000000   
25%    1.333334e-07       0.856061       0.000000       0.400000   
50%    7.141668e-05       0.856061       0.000000       0.400000   
75%    1.142962e-02       0.901515       0.166667       0.500000   
max    1.000000e+00       1.000000       1.000000       1.000000   

               spkts          dpkts         sbytes         dbytes  \
count  257673.000000  257673.000000  257673.000000  257673.000000   
mean        0.001764       0.001680       0.000596       0.000982   
std         0.012771       0.010164       0.012105       0.009974   
min 
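
The cell above fits the scaler on the full combined dataset, so the minimum and maximum of rows that later land in the validation and test sets influence the scaling. A common alternative is to fit on training rows only and reuse the fitted scaler elsewhere. The sketch below uses random toy data; the 70/30 row split and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
toy_features = rng.normal(size=(100, 4))

# Assumed split point: first 70 rows for training, the rest held out
train_rows, held_out_rows = toy_features[:70], toy_features[70:]

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train_rows)      # learn min/max from training rows only
held_out_scaled = toy_scaler.transform(held_out_rows)    # reuse the training min/max

# Held-out values can fall outside [0, 1]; clip if the model requires that range
held_out_clipped = np.clip(held_out_scaled, 0.0, 1.0)
print(train_scaled.min(), train_scaled.max())
```

Applying this to sequence data would require deciding the split before windowing, which changes the order of the later steps.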

## Step 8: Create Sequences for Temporal Modeling

For our CNN+LSTM autoencoder, we need to create sequences of network flows. 

**Why sequences?**
- Network attacks often span multiple packets/flows
- Temporal patterns help identify anomalies
- LSTM can learn time dependencies

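We'll use a sliding window over consecutive rows to create fixed-length sequences. This treats row order as time order, an assumption worth verifying for the data files used here.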

In [12]:
def create_sequences(data, labels, sequence_length=10, stride=5):
    """
    Create sequences from the dataset using a sliding window.
    
    Parameters:
    -----------
    data : array-like
        Feature data (samples x features)
    labels : array-like
        Binary labels (0=normal, 1=attack)
    sequence_length : int
        Length of each sequence (number of flows)
    stride : int
        Step size for sliding window
    
    Returns:
    --------
    sequences : ndarray
        Sequences of shape (num_sequences, sequence_length, num_features)
    sequence_labels : ndarray
        Labels for each sequence (1 if any flow in sequence is attack)
    """
    sequences = []
    sequence_labels = []
    
    # Sliding window to create sequences
    for i in range(0, len(data) - sequence_length + 1, stride):
        # Extract sequence
        seq = data[i:i + sequence_length]
        
        # Label is 1 if any flow in the sequence is an attack
        label = labels[i:i + sequence_length]
        seq_label = 1 if np.any(label == 1) else 0
        
        sequences.append(seq)
        sequence_labels.append(seq_label)
    
    return np.array(sequences), np.array(sequence_labels)

# Set sequence parameters
SEQUENCE_LENGTH = 10  # Each sequence contains 10 network flows
STRIDE = 5            # Move 5 flows forward for next sequence

print(f"Creating sequences with length={SEQUENCE_LENGTH}, stride={STRIDE}...")

# Create sequences
X_sequences, y_sequences = create_sequences(
    features_scaled_df.values, 
    labels, 
    sequence_length=SEQUENCE_LENGTH,
    stride=STRIDE
)

print(f"\nSequences created!")
print(f"   - Number of sequences: {X_sequences.shape[0]}")
print(f"   - Sequence shape: {X_sequences.shape[1:]}")
print(f"   - Normal sequences: {np.sum(y_sequences == 0)}")
print(f"   - Attack sequences: {np.sum(y_sequences == 1)}")

Creating sequences with length=10, stride=5...

Sequences created!
   - Number of sequences: 51533
   - Sequence shape: (10, 42)
   - Normal sequences: 16981
   - Attack sequences: 34552
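
The loop produces floor((N - L) / stride) + 1 windows. With N = 257673 rows, L = 10, and stride = 5, this gives 51533, matching the output above. For larger datasets the Python loop can be replaced with NumPy's `sliding_window_view`. The sketch below is intended as an equivalent of `create_sequences`; the name `create_sequences_vectorized` is hypothetical, and the input must contain at least `sequence_length` rows.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def create_sequences_vectorized(data, labels, sequence_length=10, stride=5):
    """Sliding-window sequences without a Python loop.

    Returns arrays shaped (num_sequences, sequence_length, num_features) and
    (num_sequences,), where a sequence is labeled 1 if any flow in it is an attack.
    """
    data = np.asarray(data)
    labels = np.asarray(labels)
    # Windows over axis 0 come back as (num_windows, num_features, sequence_length)
    windows = sliding_window_view(data, sequence_length, axis=0)[::stride]
    sequences = np.ascontiguousarray(windows.transpose(0, 2, 1))
    label_windows = sliding_window_view(labels, sequence_length)[::stride]
    sequence_labels = (label_windows == 1).any(axis=1).astype(int)
    return sequences, sequence_labels

# Toy data: 23 flows with 2 features each, one attack at row 12
toy_data = np.arange(46, dtype=float).reshape(23, 2)
toy_labels = np.zeros(23, dtype=int)
toy_labels[12] = 1

X_seq, y_seq = create_sequences_vectorized(toy_data, toy_labels)
print(X_seq.shape, y_seq.tolist())
```

The call to `np.ascontiguousarray` copies the windows, so memory use matches the loop version; the gain is in speed rather than memory.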


## Step 9: Split Data into Train, Validation, and Test Sets

We'll split the data:
- **70%** Training set (for model training)
- **15%** Validation set (for hyperparameter tuning)
- **15%** Test set (for final evaluation)

**Important**: The autoencoder is trained only on the normal sequences from the training set, which are extracted in Step 10.

In [13]:
# First split: 70% train, 30% temp (for validation and test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X_sequences, y_sequences, 
    test_size=0.3, 
    random_state=42,
    stratify=y_sequences  # Maintain class distribution
)

# Second split: Split temp into 50% validation, 50% test (15% each of total)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, 
    test_size=0.5, 
    random_state=42,
    stratify=y_temp
)

print("Data split completed!")
print("\nTraining set:")
print(f"   - Shape: {X_train.shape}")
print(f"   - Normal: {np.sum(y_train == 0)}, Attack: {np.sum(y_train == 1)}")

print("\nValidation set:")
print(f"   - Shape: {X_val.shape}")
print(f"   - Normal: {np.sum(y_val == 0)}, Attack: {np.sum(y_val == 1)}")

print("\nTest set:")
print(f"   - Shape: {X_test.shape}")
print(f"   - Normal: {np.sum(y_test == 0)}, Attack: {np.sum(y_test == 1)}")

Data split completed!

Training set:
   - Shape: (36073, 10, 42)
   - Normal: 11887, Attack: 24186

Validation set:
   - Shape: (7730, 10, 42)
   - Normal: 2547, Attack: 5183

Test set:
   - Shape: (7730, 10, 42)
   - Normal: 2547, Attack: 5183
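
Because consecutive windows overlap (stride 5, length 10), a random split of the sequences can place windows that share the same flows in different sets. One way to avoid this, sketched below under the assumption that row order is meaningful, is to split the flows into contiguous blocks first and build windows inside each block. The helpers `block_split_indices` and `window_starts` are hypothetical, and this variant does not stratify by class.

```python
import numpy as np

def block_split_indices(n_rows, train_frac=0.70, val_frac=0.15):
    """Return (train, val, test) slices covering contiguous, non-overlapping row blocks."""
    train_end = round(n_rows * train_frac)
    val_end = round(n_rows * (train_frac + val_frac))
    return slice(0, train_end), slice(train_end, val_end), slice(val_end, n_rows)

def window_starts(block, sequence_length=10, stride=5):
    """Start indices of windows that lie entirely inside one block."""
    return np.arange(block.start, block.stop - sequence_length + 1, stride)

train_blk, val_blk, test_blk = block_split_indices(1000)
for name, blk in [("train", train_blk), ("val", val_blk), ("test", test_blk)]:
    starts = window_starts(blk)
    print(name, len(starts), "windows, rows", blk.start, "to", blk.stop - 1)
```

With this layout no flow appears in more than one set, at the cost of losing the few windows that would straddle a block boundary.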


## Step 10: Extract Normal Traffic for Autoencoder Training

Autoencoders learn to reconstruct normal patterns. We train only on normal traffic, then use reconstruction error to detect anomalies (attacks).

In [14]:
# Extract only normal traffic for training the autoencoder
X_train_normal = X_train[y_train == 0]

print(f"Extracted normal traffic for autoencoder training")
print(f"   - Normal training sequences: {X_train_normal.shape[0]}")
print(f"   - This represents {X_train_normal.shape[0] / X_train.shape[0] * 100:.1f}% of training data")

Extracted normal traffic for autoencoder training
   - Normal training sequences: 11887
   - This represents 33.0% of training data
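
The detection rule described above can be sketched independently of the model: compute a reconstruction error per sequence, choose a threshold from the errors on normal data, and flag anything above it. In this sketch the reconstructions are simulated by adding noise, and the mean-squared-error metric and 95th-percentile threshold are assumptions. The actual autoencoder and threshold choice belong to later notebooks.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Mean squared error per sequence, averaged over time steps and features."""
    return np.mean((x - x_hat) ** 2, axis=(1, 2))

rng = np.random.default_rng(0)
normal_seqs = rng.random((200, 10, 42))
attack_seqs = rng.random((50, 10, 42))

# Stand-in for model output: normal data is reconstructed well, attacks poorly
normal_recon = normal_seqs + rng.normal(0.0, 0.01, normal_seqs.shape)
attack_recon = attack_seqs + rng.normal(0.0, 0.20, attack_seqs.shape)

normal_err = reconstruction_error(normal_seqs, normal_recon)
attack_err = reconstruction_error(attack_seqs, attack_recon)

# Assumed rule: flag sequences whose error exceeds the 95th percentile of normal errors
threshold = np.percentile(normal_err, 95)
flagged_normal = np.mean(normal_err > threshold)
flagged_attack = np.mean(attack_err > threshold)
print(f"threshold={threshold:.6f}, normal flagged={flagged_normal:.2%}, attack flagged={flagged_attack:.2%}")
```

The percentile directly sets the false-positive rate on normal data, which is why it is usually tuned on the validation set rather than the test set.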


## Step 11: Save Preprocessed Data

We'll save all preprocessed data for use in subsequent notebooks.

In [15]:
# Create output directory
os.makedirs('preprocessed_data', exist_ok=True)

# Save data as numpy arrays
np.save('preprocessed_data/X_train.npy', X_train)
np.save('preprocessed_data/X_train_normal.npy', X_train_normal)
np.save('preprocessed_data/X_val.npy', X_val)
np.save('preprocessed_data/X_test.npy', X_test)
np.save('preprocessed_data/y_train.npy', y_train)
np.save('preprocessed_data/y_val.npy', y_val)
np.save('preprocessed_data/y_test.npy', y_test)

# Save preprocessing objects
with open('preprocessed_data/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

with open('preprocessed_data/label_encoders.pkl', 'wb') as f:
    pickle.dump(label_encoders, f)

# Save feature names
with open('preprocessed_data/feature_names.pkl', 'wb') as f:
    pickle.dump(list(features_df.columns), f)

print("All preprocessed data saved successfully!")
print("\nSaved files:")
print("   - X_train.npy, X_train_normal.npy, X_val.npy, X_test.npy")
print("   - y_train.npy, y_val.npy, y_test.npy")
print("   - scaler.pkl, label_encoders.pkl, feature_names.pkl")

All preprocessed data saved successfully!

Saved files:
   - X_train.npy, X_train_normal.npy, X_val.npy, X_test.npy
   - y_train.npy, y_val.npy, y_test.npy
   - scaler.pkl, label_encoders.pkl, feature_names.pkl
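
A later notebook can restore these artifacts with `np.load` and `pickle.load`. The following sketch writes a small toy array and object to a temporary directory and reads them back, mirroring the file layout above. The `load_artifacts` helper is hypothetical. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np

def load_artifacts(data_dir, array_names, pickle_names):
    """Load .npy arrays and .pkl objects from data_dir into one dictionary."""
    loaded = {}
    for name in array_names:
        loaded[name] = np.load(os.path.join(data_dir, f"{name}.npy"))
    for name in pickle_names:
        with open(os.path.join(data_dir, f"{name}.pkl"), "rb") as f:
            loaded[name] = pickle.load(f)
    return loaded

# Round-trip demonstration in a temporary directory (toy contents)
with tempfile.TemporaryDirectory() as tmp_dir:
    np.save(os.path.join(tmp_dir, "X_train.npy"), np.zeros((3, 10, 42)))
    with open(os.path.join(tmp_dir, "feature_names.pkl"), "wb") as f:
        pickle.dump(["dur", "proto"], f)

    artifacts = load_artifacts(tmp_dir, ["X_train"], ["feature_names"])

print(artifacts["X_train"].shape, artifacts["feature_names"])
```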


## Summary

In this notebook, we successfully:

1. Loaded the UNSW-NB15 dataset
2. Inspected and cleaned the data
3. Handled missing values
4. Encoded categorical features
5. Normalized all features to [0, 1] range
6. Created sequences for temporal modeling
7. Split data into train/validation/test sets
8. Extracted normal traffic for autoencoder training
9. Saved all preprocessed data

---

## Next Steps

Proceed to **Notebook 2: Visualization** to:
- Explore feature distributions
- Visualize attack patterns
- Understand correlations
- Gain insights into the dataset

---

**Note**: Make sure the `preprocessed_data/` directory contains all saved files before proceeding to the next notebook!