# Notebook 1: Data Preprocessing

## 🎯 Objective

This notebook handles the preprocessing of the UNSW-NB15 dataset for our intrusion detection system. We will:

1. Load the raw dataset from CSV files
2. Perform exploratory data inspection
3. Clean and handle missing values
4. Encode categorical features
5. Normalize all features to the range [0, 1]
6. Create sequences of network flows for temporal modeling
7. Split data into train, validation, and test sets
8. Save preprocessed data for subsequent notebooks

---

## 📚 Background: UNSW-NB15 Dataset

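The UNSW-NB15 dataset is a widely used benchmark for network intrusion detection. It contains:
- **49 features** in the full dataset, including the two label columns (the pre-partitioned training and testing CSV files used in this notebook typically ship 45 columns, including `id`, `attack_cat`, and `label`)
- **Normal traffic** and **9 attack categories**
- A hybrid of real normal traffic and synthetically generated attack traffic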

**Attack Categories:**
- Fuzzers: Attempts to discover vulnerabilities by sending random data
- DoS: Denial of Service attacks
- Exploits: Exploitation of known vulnerabilities
- Reconnaissance: Scanning and probing
- Shellcode: Code injection attacks
- Analysis, Backdoor, Generic, Worms

---

## Step 1: Import Required Libraries

We'll use:
- **pandas**: For data manipulation
- **numpy**: For numerical operations
- **sklearn**: For preprocessing and data splitting
- **pickle**: For saving processed data

In [None]:
import pandas as pd
import numpy as np
import pickle
import os
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("✅ Libraries imported successfully!")

## Step 2: Load the Dataset

The UNSW-NB15 dataset typically comes in two files:
- Training set
- Testing set

We'll load both and combine them for our preprocessing pipeline.
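
Combining the two files discards the information about which file each row came from. If that partition might be needed later, one option is to tag each row before concatenating. The sketch below uses two tiny made-up frames in place of the CSV files; the `source` column name is an assumption, not part of the dataset.

```python
import pandas as pd

# Tiny stand-ins for the two CSV files (values are made up for illustration)
train_df = pd.DataFrame({"dur": [0.1, 0.2, 0.3], "label": [0, 1, 0]})
test_df = pd.DataFrame({"dur": [0.4, 0.5], "label": [1, 0]})

# Tag each row with its origin; `source` is a hypothetical helper column
combined = pd.concat(
    [train_df.assign(source="train"), test_df.assign(source="test")],
    axis=0,
    ignore_index=True,
)

print(combined["source"].value_counts().to_dict())
```

If such a helper column is used, it would need to be dropped together with `id`, `label`, and `attack_cat` before encoding and scaling.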

In [None]:
# Define data paths
TRAIN_PATH = 'data/UNSW_NB15_training-set.csv'
TEST_PATH = 'data/UNSW_NB15_testing-set.csv'

# Load the datasets
print("📂 Loading UNSW-NB15 dataset...")
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

# Combine both datasets for unified preprocessing
df = pd.concat([train_df, test_df], axis=0, ignore_index=True)

print("\n✅ Dataset loaded successfully!")
print(f"   - Training set: {train_df.shape[0]} samples")
print(f"   - Testing set: {test_df.shape[0]} samples")
print(f"   - Combined dataset: {df.shape[0]} samples, {df.shape[1]} features")

## Step 3: Initial Data Inspection

Let's examine the structure of our dataset to understand:
- Feature names and types
- Missing values
- Basic statistics

In [None]:
# Display first few rows
print("📊 First 5 rows of the dataset:")
print(df.head())

In [None]:
# Dataset information
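print("\n📋 Dataset Information:")
# df.info() prints its report directly and returns None,
# so wrapping it in print() would add a stray "None" line
df.info()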

In [None]:
# Check for missing values
print("\n🔍 Missing values per column:")
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
if len(missing) > 0:
    print(missing)
else:
    print("No missing values found!")

In [None]:
# Check class distribution
print("\n📊 Class Distribution:")
if 'label' in df.columns:
    print(df['label'].value_counts())
    print(f"\nNormal traffic: {(df['label'] == 0).sum()} samples")
    print(f"Attack traffic: {(df['label'] == 1).sum()} samples")
    print(f"Attack ratio: {(df['label'] == 1).sum() / len(df) * 100:.2f}%")

if 'attack_cat' in df.columns:
    print("\n🎯 Attack Categories:")
    print(df['attack_cat'].value_counts())

## Step 4: Feature Selection and Cleaning

We'll:
1. Identify and separate feature types (numerical vs categorical)
2. Remove irrelevant features (IDs, timestamps that don't add value)
3. Handle missing values
4. Store labels for later use

In [None]:
# Store labels separately
labels = df['label'].values if 'label' in df.columns else None
attack_categories = df['attack_cat'].values if 'attack_cat' in df.columns else None

# Features to drop (non-predictive or identifier columns)
# Adjust these based on your specific dataset columns
columns_to_drop = ['id', 'label', 'attack_cat']
columns_to_drop = [col for col in columns_to_drop if col in df.columns]

# Create feature dataframe
features_df = df.drop(columns=columns_to_drop, errors='ignore')

print(f"✅ Selected {features_df.shape[1]} features for modeling")
print(f"\nFeature columns: {list(features_df.columns)}")

In [None]:
# Identify categorical and numerical columns
categorical_cols = features_df.select_dtypes(include=['object']).columns.tolist()
# 'number' matches every numeric dtype, not only int64 and float64
numerical_cols = features_df.select_dtypes(include='number').columns.tolist()

print(f"🔢 Numerical features: {len(numerical_cols)}")
print(f"📝 Categorical features: {len(categorical_cols)}")
print(f"\nCategorical columns: {categorical_cols}")

## Step 5: Handle Missing Values

For missing values:
- **Numerical features**: Fill with median
- **Categorical features**: Fill with the placeholder `'unknown'`
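
As a minimal illustration of these two rules, the sketch below applies them to a toy frame with made-up values. The column names are borrowed from the dataset only for readability. The result of `fillna` is assigned back to the column, which avoids chained-assignment pitfalls in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Toy frame with one numerical and one categorical gap (made-up values)
toy = pd.DataFrame({
    "sbytes": [100.0, np.nan, 300.0, 500.0],
    "service": ["http", None, "dns", "http"],
})

# The median of the observed values 100, 300, 500 is 300
toy["sbytes"] = toy["sbytes"].fillna(toy["sbytes"].median())

# Missing categories get an explicit placeholder
toy["service"] = toy["service"].fillna("unknown")

print(toy)
```

The median is less sensitive to extreme values than the mean, which matters for heavy-tailed features such as byte counts.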

In [None]:
# Handle missing values in numerical columns
for col in numerical_cols:
    if features_df[col].isnull().any():
        median_value = features_df[col].median()
        # Assign back instead of using inplace=True on a column selection,
        # which may not update the DataFrame in recent pandas versions
        features_df[col] = features_df[col].fillna(median_value)
        print(f"Filled {col} with median: {median_value}")

# Handle missing values in categorical columns
for col in categorical_cols:
    if features_df[col].isnull().any():
        features_df[col] = features_df[col].fillna('unknown')
        print(f"Filled {col} with 'unknown'")

print("\n✅ Missing values handled!")

## Step 6: Encode Categorical Features

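We need to convert categorical variables (such as protocol type, service, and state) into numerical values. We use label encoding, which maps each category to an integer code. The codes imply an ordering between categories that has no real meaning, which is a known limitation of this encoding.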
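
To make the mapping concrete, here is a small sketch on made-up protocol values. It shows that `LabelEncoder` stores the classes in sorted order and that a fitted encoder can translate codes back into the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up protocol values; the real column has many more categories
proto = ["tcp", "udp", "arp", "tcp", "udp"]

le = LabelEncoder()
codes = le.fit_transform(proto)

# Classes are stored in sorted order, so codes follow alphabetical rank
print(le.classes_.tolist())   # ['arp', 'tcp', 'udp']
print(codes.tolist())         # [1, 2, 0, 1, 2]

# The fitted encoder can map codes back to the original strings
print(le.inverse_transform(codes).tolist())
```

Keeping the fitted encoders is what allows later notebooks to decode predictions or encode new data consistently.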

In [None]:
# Initialize label encoders dictionary to save for later use
label_encoders = {}

# Encode categorical columns
for col in categorical_cols:
    le = LabelEncoder()
    features_df[col] = le.fit_transform(features_df[col].astype(str))
    label_encoders[col] = le
    print(f"Encoded {col}: {len(le.classes_)} unique values")

print("\n✅ Categorical features encoded!")

## Step 7: Normalize Numerical Features

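Neural networks generally train better on features with comparable scales. We'll use `MinMaxScaler` to scale all features, including the encoded categorical ones, to the range [0, 1]. The scaler is fitted on the full dataset here, before splitting, so the minimum and maximum also reflect rows that later land in the validation and test sets.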
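
Min-max scaling maps each feature with $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$. The sketch below checks this formula against `MinMaxScaler` on a single made-up column and shows what happens to a value outside the fitted range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One made-up feature column with values on a large scale
x = np.array([[10.0], [20.0], [40.0], [110.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation via scikit-learn
scaler = MinMaxScaler()
scaled = scaler.fit_transform(x)
print("Matches manual formula:", np.allclose(manual, scaled))

# A value outside the fitted range is not clipped by default,
# so it can fall outside [0, 1]
unseen = scaler.transform(np.array([[220.0]]))
print("Scaled unseen value:", unseen.ravel()[0])
```

This is why the fitted scaler is saved alongside the data: new traffic has to be scaled with the same minimum and maximum.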

In [None]:
# Create and fit the scaler
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features_df)

# Convert back to DataFrame for easier handling
features_scaled_df = pd.DataFrame(features_scaled, columns=features_df.columns)

print("✅ Features normalized to range [0, 1]")
print(f"\nScaled features shape: {features_scaled_df.shape}")
print(f"\nSample statistics after scaling:")
print(features_scaled_df.describe())

## Step 8: Create Sequences for Temporal Modeling

For our CNN+LSTM autoencoder, we need to create sequences of network flows. 

**Why sequences?**
- Network attacks often span multiple packets/flows
- Temporal patterns help identify anomalies
- LSTM can learn time dependencies

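We'll use a sliding window approach to create fixed-length sequences. This treats consecutive rows as consecutive flows in time, so the row order of the combined dataset matters.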
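
The sketch below illustrates the windowing arithmetic on 23 made-up flows with a sequence length of 10 and a stride of 5. It is a toy illustration of the indexing only, not a replacement for the function defined in the next cell.

```python
import numpy as np

# 23 made-up flows with 2 features each
flows = np.arange(23 * 2).reshape(23, 2)
sequence_length, stride = 10, 5

# Start index of each window, as in range(0, N - L + 1, stride)
starts = list(range(0, len(flows) - sequence_length + 1, stride))
windows = np.stack([flows[s:s + sequence_length] for s in starts])

print(starts)          # [0, 5, 10]
print(windows.shape)   # (3, 10, 2)

# Closed form for the number of windows: floor((N - L) / stride) + 1
expected = (len(flows) - sequence_length) // stride + 1
print(expected)        # 3
```

With a stride smaller than the sequence length, neighboring windows share flows (here, 5 of 10). Flows after the last full window (indices 20 to 22 in this example) are dropped.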

In [None]:
def create_sequences(data, labels, sequence_length=10, stride=5):
    """
    Create sequences from the dataset using a sliding window.
    
    Parameters:
    -----------
    data : array-like
        Feature data (samples x features)
    labels : array-like
        Binary labels (0=normal, 1=attack)
    sequence_length : int
        Length of each sequence (number of flows)
    stride : int
        Step size for sliding window
    
    Returns:
    --------
    sequences : ndarray
        Sequences of shape (num_sequences, sequence_length, num_features)
    sequence_labels : ndarray
        Labels for each sequence (1 if any flow in sequence is attack)
    """
    sequences = []
    sequence_labels = []
    
    # Sliding window to create sequences
    for i in range(0, len(data) - sequence_length + 1, stride):
        # Extract sequence
        seq = data[i:i + sequence_length]
        
        # Label is 1 if any flow in the sequence is an attack
        label = labels[i:i + sequence_length]
        seq_label = 1 if np.any(label == 1) else 0
        
        sequences.append(seq)
        sequence_labels.append(seq_label)
    
    return np.array(sequences), np.array(sequence_labels)

# Set sequence parameters
SEQUENCE_LENGTH = 10  # Each sequence contains 10 network flows
STRIDE = 5            # Move 5 flows forward for next sequence

print(f"Creating sequences with length={SEQUENCE_LENGTH}, stride={STRIDE}...")

# Create sequences
X_sequences, y_sequences = create_sequences(
    features_scaled_df.values, 
    labels, 
    sequence_length=SEQUENCE_LENGTH,
    stride=STRIDE
)

print("\n✅ Sequences created!")
print(f"   - Number of sequences: {X_sequences.shape[0]}")
print(f"   - Sequence shape: {X_sequences.shape[1:]}")
print(f"   - Normal sequences: {np.sum(y_sequences == 0)}")
print(f"   - Attack sequences: {np.sum(y_sequences == 1)}")

## Step 9: Split Data into Train, Validation, and Test Sets

We'll split the data:
- **70%** Training set (for model training)
- **15%** Validation set (for hyperparameter tuning)
- **15%** Test set (for final evaluation)

**Important**: The autoencoder itself is trained only on the normal sequences from the training set (see Step 10).
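
The 70/15/15 proportions come from two consecutive calls to `train_test_split`: first 30% is held out, then that held-out part is halved. The sketch below checks the arithmetic on synthetic stand-in data (1000 sequences, 20% labeled as attack); the sizes and class ratio are assumptions chosen for easy verification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 sequences, 20% labeled as attack
X = np.zeros((1000, 10, 4))
y = np.array([1] * 200 + [0] * 800)

# Stage 1: 70% train, 30% held out
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# Stage 2: split the held-out 30% evenly into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: {len(part)} sequences, attack share {part.mean():.2f}")
```

Stratifying both stages keeps the attack share the same in all three sets.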

In [None]:
# First split: 70% train, 30% temp (for validation and test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X_sequences, y_sequences, 
    test_size=0.3, 
    random_state=42,
    stratify=y_sequences  # Maintain class distribution
)

# Second split: Split temp into 50% validation, 50% test (15% each of total)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, 
    test_size=0.5, 
    random_state=42,
    stratify=y_temp
)

print("✅ Data split completed!")
print(f"\nTraining set:")
print(f"   - Shape: {X_train.shape}")
print(f"   - Normal: {np.sum(y_train == 0)}, Attack: {np.sum(y_train == 1)}")

print(f"\nValidation set:")
print(f"   - Shape: {X_val.shape}")
print(f"   - Normal: {np.sum(y_val == 0)}, Attack: {np.sum(y_val == 1)}")

print(f"\nTest set:")
print(f"   - Shape: {X_test.shape}")
print(f"   - Normal: {np.sum(y_test == 0)}, Attack: {np.sum(y_test == 1)}")

## Step 10: Extract Normal Traffic for Autoencoder Training

Autoencoders learn to reconstruct normal patterns. We train only on normal traffic, then use reconstruction error to detect anomalies (attacks).
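
To illustrate how reconstruction error can separate the two classes, the sketch below scores synthetic sequences. No model is trained here: the "reconstructions" are simulated by adding noise, and the 95th-percentile threshold is an illustrative assumption rather than the method used in later notebooks.

```python
import numpy as np

rng = np.random.default_rng(42)

def reconstruction_error(x, x_hat):
    """Mean squared error per sequence, averaged over time steps and features."""
    return np.mean((x - x_hat) ** 2, axis=(1, 2))

# Stand-in data: sequences of 10 flows x 4 features (values are synthetic)
x_normal = rng.random((200, 10, 4))
x_attack = rng.random((20, 10, 4))

# Simulated autoencoder output: small error on normal traffic,
# larger error on attacks. A trained model would produce these instead.
recon_normal = x_normal + rng.normal(0.0, 0.01, x_normal.shape)
recon_attack = x_attack + rng.normal(0.0, 0.20, x_attack.shape)

# Threshold: 95th percentile of errors on normal sequences (an assumption)
err_normal = reconstruction_error(x_normal, recon_normal)
threshold = np.percentile(err_normal, 95)

# Sequences whose error exceeds the threshold are flagged as anomalies
err_attack = reconstruction_error(x_attack, recon_attack)
flagged = err_attack > threshold
print(f"threshold={threshold:.5f}, attacks flagged: {flagged.sum()}/{len(flagged)}")
```

By construction, a percentile threshold on normal errors accepts a fixed share of false alarms (about 5% here) in exchange for sensitivity to attacks.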

In [None]:
# Extract only normal traffic for training the autoencoder
X_train_normal = X_train[y_train == 0]

print("✅ Extracted normal traffic for autoencoder training")
print(f"   - Normal training sequences: {X_train_normal.shape[0]}")
print(f"   - This represents {X_train_normal.shape[0] / X_train.shape[0] * 100:.1f}% of training data")

## Step 11: Save Preprocessed Data

We'll save all preprocessed data for use in subsequent notebooks.

In [None]:
# Create output directory
os.makedirs('preprocessed_data', exist_ok=True)

# Save data as numpy arrays
np.save('preprocessed_data/X_train.npy', X_train)
np.save('preprocessed_data/X_train_normal.npy', X_train_normal)
np.save('preprocessed_data/X_val.npy', X_val)
np.save('preprocessed_data/X_test.npy', X_test)
np.save('preprocessed_data/y_train.npy', y_train)
np.save('preprocessed_data/y_val.npy', y_val)
np.save('preprocessed_data/y_test.npy', y_test)

# Save preprocessing objects
with open('preprocessed_data/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

with open('preprocessed_data/label_encoders.pkl', 'wb') as f:
    pickle.dump(label_encoders, f)

# Save feature names
with open('preprocessed_data/feature_names.pkl', 'wb') as f:
    pickle.dump(list(features_df.columns), f)

print("✅ All preprocessed data saved successfully!")
print("\n📁 Saved files:")
print("   - X_train.npy, X_train_normal.npy, X_val.npy, X_test.npy")
print("   - y_train.npy, y_val.npy, y_test.npy")
print("   - scaler.pkl, label_encoders.pkl, feature_names.pkl")

## 📊 Summary

In this notebook, we successfully:

1. ✅ Loaded the UNSW-NB15 dataset
2. ✅ Inspected and cleaned the data
3. ✅ Handled missing values
4. ✅ Encoded categorical features
5. ✅ Normalized all features to [0, 1] range
6. ✅ Created sequences for temporal modeling
7. ✅ Split data into train/validation/test sets
8. ✅ Extracted normal traffic for autoencoder training
9. ✅ Saved all preprocessed data

---

## 🎯 Next Steps

Proceed to **Notebook 2: Visualization** to:
- Explore feature distributions
- Visualize attack patterns
- Understand correlations
- Gain insights into the dataset

---

**Note**: Make sure the `preprocessed_data/` directory contains all saved files before proceeding to the next notebook!
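
One way to run that check is a short script like the sketch below. The file list mirrors the files saved in Step 11; the helper name `missing_outputs` is hypothetical.

```python
from pathlib import Path

# File names saved in Step 11
EXPECTED_FILES = [
    "X_train.npy", "X_train_normal.npy", "X_val.npy", "X_test.npy",
    "y_train.npy", "y_val.npy", "y_test.npy",
    "scaler.pkl", "label_encoders.pkl", "feature_names.pkl",
]

def missing_outputs(directory="preprocessed_data"):
    """Return the expected file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]

missing = missing_outputs()
if missing:
    print(f"Missing files: {missing}")
else:
    print("All preprocessed files are present.")
```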