# Notebook 1: Data Preprocessing

This notebook handles preprocessing of the UNSW-NB15 dataset. We will:

1. Load raw dataset from CSV files
2. Perform data inspection
3. Clean and handle missing values
4. Normalize numerical features
5. Encode categorical features
6. Create sequences for temporal modeling
7. Split into train/validation/test sets
8. Save preprocessed data


## Step 1: Import Required Libraries

We use:

- **pandas**: for data manipulation
- **numpy**: for numerical operations
- **sklearn**: for preprocessing and data splitting
- **pickle**: for saving processed data


In [1]:
import os
import pickle
import warnings

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print(" Libraries imported complete")

Libraries imported complete

## Step 2: Load the Dataset

The UNSW-NB15 dataset typically comes in two files:

- Training set
- Testing set

We will load both and combine them for our preprocessing pipeline.


In [2]:
# Define data paths
# NOTE: TEST_PATH is identical to TRAIN_PATH, so the same file is loaded twice
# and every row appears twice in df. Point TEST_PATH at the testing-set CSV
# to load distinct data.
TRAIN_PATH = r'/home/mesbah7/Github/Repos/Intrusion-Detection-AI/dataset_kaggle/UNSW_NB15_training-set.csv'
TEST_PATH = r'/home/mesbah7/Github/Repos/Intrusion-Detection-AI/dataset_kaggle/UNSW_NB15_training-set.csv'

# Load the datasets
print(" Loading UNSW-NB15 dataset...")
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

# Combine both datasets for unified preprocessing
df = pd.concat([train_df, test_df], axis=0, ignore_index=True)

print(f"\n Dataset loaded complete")
print(f" - Training set: {train_df.shape[0]} samples")
print(f" - Testing set: {test_df.shape[0]} samples")
print(f" - Combined dataset: {df.shape[0]} samples, {df.shape[1]} features")

Loading UNSW-NB15 dataset...

Dataset loaded complete
 - Training set: 82332 samples
 - Testing set: 82332 samples
 - Combined dataset: 164664 samples, 45 features
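The recorded counts are identical for the training and testing sets, which is what happens when both paths point at the same file. A minimal sketch of a loader that guards against this is shown here. The function name `load_and_combine` and the `source` column are illustrative assumptions and are not part of the notebook.

```python
import os
import tempfile

import pandas as pd


def load_and_combine(train_path, test_path):
    """Load two CSV files and stack them, refusing identical paths."""
    if os.path.abspath(train_path) == os.path.abspath(test_path):
        raise ValueError("train_path and test_path point to the same file")
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    # Tag each row with its origin (hypothetical helper column).
    train_df["source"] = "train"
    test_df["source"] = "test"
    return pd.concat([train_df, test_df], axis=0, ignore_index=True)


# Tiny demonstration with two temporary CSV files.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.csv")
    path_b = os.path.join(tmp, "b.csv")
    pd.DataFrame({"dur": [0.1, 0.2], "label": [0, 1]}).to_csv(path_a, index=False)
    pd.DataFrame({"dur": [0.3], "label": [1]}).to_csv(path_b, index=False)

    combined = load_and_combine(path_a, path_b)

    try:
        load_and_combine(path_a, path_a)
        same_path_rejected = False
    except ValueError:
        same_path_rejected = True

print(combined.shape)
print("Same path rejected:", same_path_rejected)
```

Keeping a `source` column makes it possible to recover the original partition later, should the official train/test split be needed.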

## Step 3: Initial Data Inspection

Let us examine the structure of our dataset to understand:

- Feature names and types
- Missing values
- Basic statistics


In [3]:
# Display first few rows
print(" First 5 rows of the dataset:")
df.head()

First 5 rows of the dataset:

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,1.1e-05,udp,-,INT,2,0,496,0,90909.0902,...,1,2,0,0,0,1,2,0,Normal,0
1,2,8e-06,udp,-,INT,2,0,1762,0,125000.0003,...,1,2,0,0,0,1,2,0,Normal,0
2,3,5e-06,udp,-,INT,2,0,1068,0,200000.0051,...,1,3,0,0,0,1,3,0,Normal,0
3,4,6e-06,udp,-,INT,2,0,900,0,166666.6608,...,1,3,0,0,0,2,3,0,Normal,0
4,5,1e-05,udp,-,INT,2,0,2126,0,100000.0025,...,1,3,0,0,0,2,3,0,Normal,0


In [4]:
df.columns

Index(['id', 'dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes',
       'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss', 'dloss',
       'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin',
       'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'trans_depth',
       'response_body_len', 'ct_srv_src', 'ct_state_ttl', 'ct_dst_ltm',
       'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm',
       'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'ct_src_ltm',
       'ct_srv_dst', 'is_sm_ips_ports', 'attack_cat', 'label'],
      dtype='object')

In [5]:
df.head(10)

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,1.1e-05,udp,-,INT,2,0,496,0,90909.0902,...,1,2,0,0,0,1,2,0,Normal,0
1,2,8e-06,udp,-,INT,2,0,1762,0,125000.0003,...,1,2,0,0,0,1,2,0,Normal,0
2,3,5e-06,udp,-,INT,2,0,1068,0,200000.0051,...,1,3,0,0,0,1,3,0,Normal,0
3,4,6e-06,udp,-,INT,2,0,900,0,166666.6608,...,1,3,0,0,0,2,3,0,Normal,0
4,5,1e-05,udp,-,INT,2,0,2126,0,100000.0025,...,1,3,0,0,0,2,3,0,Normal,0
5,6,3e-06,udp,-,INT,2,0,784,0,333333.3215,...,1,2,0,0,0,2,2,0,Normal,0
6,7,6e-06,udp,-,INT,2,0,1960,0,166666.6608,...,1,2,0,0,0,2,2,0,Normal,0
7,8,2.8e-05,udp,-,INT,2,0,1384,0,35714.28522,...,1,3,0,0,0,1,3,0,Normal,0
8,9,0.0,arp,-,INT,1,0,46,0,0.0,...,2,2,0,0,0,2,2,1,Normal,0
9,10,0.0,arp,-,INT,1,0,46,0,0.0,...,2,2,0,0,0,2,2,1,Normal,0


In [6]:
df['service'].value_counts()

service
-           94306
dns         42734
http        16574
smtp         3702
ftp          3104
ftp-data     2792
pop3          846
ssh           408
ssl            60
snmp           58
dhcp           52
radius         18
irc            10
Name: count, dtype: int64

In [7]:
# Replace the '-' placeholder with the most frequent named service
most_frequent = df[df['service'] != '-']['service'].mode()[0]
df['service'] = df['service'].replace('-', most_frequent)

# Here this replaces every '-' with 'dns'
df['service'].value_counts()

service
dns         137040
http         16574
smtp          3702
ftp           3104
ftp-data      2792
pop3           846
ssh            408
ssl             60
snmp            58
dhcp            52
radius          18
irc             10
Name: count, dtype: int64
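The replacement above folds all 94306 flows marked `-` into `dns`, raising its count from 42734 to 137040. A small sketch on toy data compares that approach with an alternative that keeps the placeholder as its own category. The category name `none` is an assumption chosen for illustration, as is the reading of `-` as "no service identified".

```python
import pandas as pd

service = pd.Series(["-", "dns", "http", "-", "dns", "-"])

# Approach used in the notebook: replace '-' with the most frequent named service.
mode_value = service[service != "-"].mode()[0]
imputed = service.replace("-", mode_value)

# Alternative (assumption): keep the placeholder as an explicit category.
explicit = service.replace("-", "none")

print(imputed.value_counts().to_dict())
print(explicit.value_counts().to_dict())
```

With mode imputation, the downstream encoder cannot tell real `dns` flows apart from flows that had no service recorded; the explicit category preserves that distinction at the cost of one extra class.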

In [8]:
# Check for missing values
print("\n Missing values per column:")
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
if len(missing) > 0:
    print(missing)
else:
    print("No missing values found!")

Missing values per column:
No missing values found!

In [9]:
# Check class distribution
print("\n Class Distribution:")
if 'label' in df.columns:
    print(df['label'].value_counts())
    print(f"\nNormal traffic: {(df['label'] == 0).sum()} samples")
    print(f"Attack traffic: {(df['label'] == 1).sum()} samples")
    print(f"Attack ratio: {(df['label'] == 1).sum() / len(df) * 100:.2f}%")

if 'attack_cat' in df.columns:
    print("\n Attack Categories:")
    print(df['attack_cat'].value_counts())

Class Distribution:
label
1    90664
0    74000
Name: count, dtype: int64

Normal traffic: 74000 samples
Attack traffic: 90664 samples
Attack ratio: 55.06%

Attack Categories:
attack_cat
Normal            74000
Generic           37742
Exploits          22264
Fuzzers           12124
DoS                8178
Reconnaissance     6992
Analysis           1354
Backdoor           1166
Shellcode           756
Worms                88
Name: count, dtype: int64

## Step 4: Feature Selection and Cleaning

We will:

1. Identify and separate feature types (numerical vs. categorical)
2. Remove irrelevant features (IDs, timestamps that don't add value)
3. Handle missing values
4. Store labels for later use


In [10]:
# Store labels separately
labels = df['label'].values if 'label' in df.columns else None
attack_categories = df['attack_cat'].values if 'attack_cat' in df.columns else None

# Features to drop (non-predictive or identifier columns)
# Adjust these based on your specific dataset columns
columns_to_drop = ['id', 'label', 'attack_cat']
columns_to_drop = [col for col in columns_to_drop if col in df.columns]

# Create feature dataframe
features_df = df.drop(columns=columns_to_drop, errors='ignore')

print(f" Selected {features_df.shape[1]} features for modeling")
print(f"\nFeature columns: {list(features_df.columns)}")

Selected 42 features for modeling

Feature columns: ['dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes', 'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss', 'dloss', 'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin', 'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'trans_depth', 'response_body_len', 'ct_srv_src', 'ct_state_ttl', 'ct_dst_ltm', 'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm', 'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'ct_src_ltm', 'ct_srv_dst', 'is_sm_ips_ports']

In [None]:
# Identify categorical and numerical columns
categorical_cols = features_df.select_dtypes(include=['object']).columns.tolist()
numerical_cols = features_df.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f" Numerical features: {len(numerical_cols)}")
print(f" Categorical features: {len(categorical_cols)}")
print(f"\nCategorical columns: {categorical_cols}")

Numerical features: 39
Categorical features: 3

Categorical columns: ['proto', 'service', 'state']
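Listing `int64` and `float64` explicitly works for this dataset, but it would miss columns stored as other numeric widths such as `int32` or `float32`. As a hedged alternative, the sketch below uses the generic `"number"` selector of `select_dtypes` on a toy frame; the frame and its column values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "dur": [0.1, 0.2],
    "spkts": pd.Series([2, 4], dtype="int32"),
    "proto": ["udp", "tcp"],
})

# "number" matches every numeric dtype; everything else is treated as categorical.
numerical = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical)
print("Categorical:", categorical)
```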

## Step 5: Handle Missing Values

For missing values:

- **Numerical features**: Fill with median
- **Categorical features**: Fill with mode or 'unknown'


In [None]:
# Handle missing values in numerical columns
for col in numerical_cols:
    if features_df[col].isnull().sum() > 0:
        median_value = features_df[col].median()
        # Assign the result back; calling fillna(..., inplace=True) on a
        # column selection is deprecated in recent pandas versions
        features_df[col] = features_df[col].fillna(median_value)
        print(f"Filled {col} with median: {median_value}")

# Handle missing values in categorical columns
for col in categorical_cols:
    if features_df[col].isnull().sum() > 0:
        features_df[col] = features_df[col].fillna('unknown')
        print(f"Filled {col} with 'unknown'")

print("\n Missing values handled!")

Missing values handled!
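Because the inspection step found no missing values, neither loop above prints a fill message. To show what the same strategy does when gaps exist, here is a small sketch on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "dur": [1.0, np.nan, 3.0, 100.0],
    "proto": ["tcp", None, "udp", "tcp"],
})

# The median of the observed values [1, 3, 100] is 3, so the outlier 100
# does not drag the fill value upward the way a mean would.
toy["dur"] = toy["dur"].fillna(toy["dur"].median())

# Missing categories get an explicit placeholder.
toy["proto"] = toy["proto"].fillna("unknown")

print(toy)
```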

## Step 6: Encode Categorical Features

We need to convert categorical variables (like protocol type, service, state) into numerical values using Label Encoding.


In [None]:
# Initialize label encoders dictionary to save for later use
label_encoders = {}

# Encode categorical columns
for col in categorical_cols:
    le = LabelEncoder()
    features_df[col] = le.fit_transform(features_df[col].astype(str))
    label_encoders[col] = le
    print(f"Encoded {col}: {len(le.classes_)} unique values")

print("\n Categorical features encoded!")

Encoded proto: 131 unique values
Encoded service: 12 unique values
Encoded state: 7 unique values

Categorical features encoded!
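`LabelEncoder` assigns each category the index of its position in the sorted list of classes. The sketch below, on a made-up list of protocol names, shows the resulting mapping and what happens when a saved encoder later meets a value it never saw during fitting.

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(["udp", "tcp", "arp", "tcp"])

# classes_ is sorted, so each code is the index into that sorted array.
mapping = {str(cls): idx for idx, cls in enumerate(demo_encoder.classes_)}
print(mapping)
print(codes.tolist())

# transform() raises ValueError for a value that was absent during fit.
try:
    demo_encoder.transform(["icmp"])
    unseen_accepted = True
except ValueError:
    unseen_accepted = False
print("Unseen value accepted:", unseen_accepted)
```

The integer codes carry an arbitrary ordering (here `arp` < `tcp` < `udp`), which is worth keeping in mind when the encoded columns are later scaled and fed to a model.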

## Step 7: Normalize Numerical Features

Neural networks generally train better on normalized data. We will use MinMaxScaler to scale all 42 features, including the label-encoded categorical columns, to the range [0, 1].


In [None]:
# Create and fit the scaler
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features_df)

# Convert back to DataFrame for easier handling
features_scaled_df = pd.DataFrame(features_scaled, columns=features_df.columns)

print(f" Features normalized to range [0, 1]")
print(f"\nScaled features shape: {features_scaled_df.shape}")
print(f"\nSample statistics after scaling:")
print(features_scaled_df.describe())

Features normalized to range [0, 1]

Scaled features shape: (164664, 42)

Sample statistics after scaling:
                dur          proto        service          state  \
count  1.646640e+05  164664.000000  164664.000000  164664.000000
mean   1.677927e-02       0.841141       0.142466       0.562459
std    7.850718e-02       0.143363       0.133791       0.111728
min    0.000000e+00       0.000000       0.000000       0.000000
25%    1.333334e-07       0.853846       0.090909       0.500000
50%    2.356334e-04       0.853846       0.090909       0.500000
75%    1.198934e-02       0.900000       0.090909       0.666667
max    1.000000e+00       1.000000       1.000000       1.000000

               spkts          dpkts         sbytes         dbytes  \
count  164664.000000  164664.000000  164664.000000  164664.000000
mean        0.001660       0.001592       0.000555       0.000903
std         0.012580       0.010490       0.011956       0.010334
min         0.000000       0.000000       0.000000       0.000000
25%         0.000094       0.000000       0.000006       0.000000
50%         0.000470       0.000182       0.000036       0.000012
75%         0.001033       0.000908       0.000087       0.000065
max         1.000000       1.000000       1.000000       1.000000

                rate           sttl  ...     ct_dst_ltm  ct_src_dport_ltm  \
count  164664.000000  164664.000000  ...  164664.000000     164664.000000
mean        0.082411       0.709677  ...       0.081809          0.067
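Min-max scaling maps each column with x' = (x - min) / (max - min). The notebook fits the scaler on the full combined dataset; a common alternative is to fit on the training rows only and reuse those statistics elsewhere. The sketch below illustrates both the formula and that alternative on made-up numbers; `demo_scaler`, `train_rows`, and `held_out_rows` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_rows = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
held_out_rows = np.array([[20.0, 15.0]])

# Fit on the training rows only, so min and max come from training data.
demo_scaler = MinMaxScaler()
demo_scaler.fit(train_rows)

# The same transformation written out by hand.
col_min = train_rows.min(axis=0)
col_max = train_rows.max(axis=0)
manual = (train_rows - col_min) / (col_max - col_min)

scaled_train = demo_scaler.transform(train_rows)
scaled_held_out = demo_scaler.transform(held_out_rows)

print(scaled_train)
print(scaled_held_out)
```

Held-out values larger than the training maximum scale to values above 1, so a model consuming such data should not assume a strict [0, 1] range.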

## Step 8: Create Sequences for Temporal Modeling

For our CNN+LSTM autoencoder, we need to create sequences of network flows.

**Why sequences?**

- Network attacks often span multiple packets/flows
- Temporal patterns help identify anomalies
- LSTM can learn time dependencies

We will use a sliding window approach to create fixed-length sequences.


In [None]:
def create_sequences(data, labels, sequence_length=10, stride=5):
    """
    Create sequences from the dataset using a sliding window.

    Parameters:
    -----------
    data : array-like
        Feature data (samples x features)
    labels : array-like
        Binary labels (0=normal, 1=attack)
    sequence_length : int
        Length of each sequence (number of flows)
    stride : int
        Step size for sliding window

    Returns:
    --------
    sequences : ndarray
        Sequences of shape (num_sequences, sequence_length, num_features)
    sequence_labels : ndarray
        Labels for each sequence (1 if any flow in sequence is attack)
    """
    sequences = []
    sequence_labels = []

    # Sliding window to create sequences
    for i in range(0, len(data) - sequence_length + 1, stride):
        # Extract sequence
        seq = data[i:i + sequence_length]

        # Label is 1 if any flow in the sequence is an attack
        label = labels[i:i + sequence_length]
        seq_label = 1 if np.any(label == 1) else 0

        sequences.append(seq)
        sequence_labels.append(seq_label)

    return np.array(sequences), np.array(sequence_labels)


# Set sequence parameters
SEQUENCE_LENGTH = 10  # Each sequence contains 10 network flows
STRIDE = 5            # Move 5 flows forward for next sequence

print(f"Creating sequences with length={SEQUENCE_LENGTH}, stride={STRIDE}...")

# Create sequences
X_sequences, y_sequences = create_sequences(
    features_scaled_df.values,
    labels,
    sequence_length=SEQUENCE_LENGTH,
    stride=STRIDE
)

print(f"\n Sequences created!")
print(f" - Number of sequences: {X_sequences.shape[0]}")
print(f" - Sequence shape: {X_sequences.shape[1:]}")
print(f" - Normal sequences: {np.sum(y_sequences == 0)}")
print(f" - Attack sequences: {np.sum(y_sequences == 1)}")

Creating sequences with length=10, stride=5...

Sequences created!
 - Number of sequences: 32931
 - Sequence shape: (10, 42)
 - Normal sequences: 14791
 - Attack sequences: 18140
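The number of windows follows directly from the loop bounds: floor((N - L) / S) + 1 for N rows, window length L, and stride S. With N = 164664, L = 10, and S = 5 this gives 32931, matching the output above. The sketch below checks the formula and the "any attack" labelling rule on a toy array; `count_windows` is a hypothetical helper written for this illustration.

```python
import numpy as np


def count_windows(n_rows, sequence_length, stride):
    """Number of start indices in range(0, n_rows - sequence_length + 1, stride)."""
    if n_rows < sequence_length:
        return 0
    return (n_rows - sequence_length) // stride + 1


# Toy data: 12 flows with 2 features each, and a single attack at row 7.
toy_data = np.arange(24, dtype=float).reshape(12, 2)
toy_labels = np.zeros(12, dtype=int)
toy_labels[7] = 1

length, stride = 4, 3
starts = list(range(0, len(toy_data) - length + 1, stride))  # [0, 3, 6]

windows = np.array([toy_data[i:i + length] for i in starts])
# A window is labelled 1 as soon as one of its flows is an attack.
window_labels = np.array([int(toy_labels[i:i + length].any()) for i in starts])

print(windows.shape)
print(window_labels.tolist())
print(count_windows(164664, 10, 5))
```

Only the last toy window (rows 6 to 9) contains the attack row, so it is the only one labelled 1. Because the stride is smaller than the window length, consecutive windows share flows.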

## Step 9: Split Data into Train, Validation, and Test Sets

We will split the data:

- **70%** Training set (for model training)
- **15%** Validation set (for hyperparameter tuning)
- **15%** Test set (for final evaluation)

**Important**: For autoencoder training, we'll primarily use normal traffic in the training set!


In [None]:
# First split: 70% train, 30% temp (for validation and test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X_sequences, y_sequences,
    test_size=0.3,
    random_state=42,
    stratify=y_sequences  # Maintain class distribution
)

# Second split: Split temp into 50% validation, 50% test (15% each of total)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,
    random_state=42,
    stratify=y_temp
)

print(" Data split completed!")
print(f"\nTraining set:")
print(f" - Shape: {X_train.shape}")
print(f" - Normal: {np.sum(y_train == 0)}, Attack: {np.sum(y_train == 1)}")
print(f"\nValidation set:")
print(f" - Shape: {X_val.shape}")
print(f" - Normal: {np.sum(y_val == 0)}, Attack: {np.sum(y_val == 1)}")
print(f"\nTest set:")
print(f" - Shape: {X_test.shape}")
print(f" - Normal: {np.sum(y_test == 0)}, Attack: {np.sum(y_test == 1)}")

Data split completed!

Training set:
 - Shape: (23051, 10, 42)
 - Normal: 10353, Attack: 12698

Validation set:
 - Shape: (4940, 10, 42)
 - Normal: 2219, Attack: 2721

Test set:
 - Shape: (4940, 10, 42)
 - Normal: 2219, Attack: 2721
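The sizes above follow from the two-stage split: 30% of 32931 sequences rounds up to 9880 held-out sequences, leaving 23051 for training, and halving the 9880 gives 4940 each for validation and test. A small sketch on made-up data shows the same two-stage stratified split keeping the class ratio in every part; the array contents are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 40% class 0 and 60% class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 40 + [1] * 60)

# Stage 1: 70% train, 30% temporary hold-out.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Stage 2: split the hold-out in half for validation and test.
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

for name, y_part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: {len(y_part)} samples, attack share {y_part.mean():.2f}")
```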

## Step 10: Extract Normal Traffic for Autoencoder Training

Autoencoders learn to reconstruct normal patterns. We train only on normal traffic, then use reconstruction error to detect anomalies (attacks).


In [None]:
# Extract only normal traffic for training the autoencoder
X_train_normal = X_train[y_train == 0]

print(f" Extracted normal traffic for autoencoder training")
print(f" - Normal training sequences: {X_train_normal.shape[0]}")
print(f" - This represents {X_train_normal.shape[0] / X_train.shape[0] * 100:.1f}% of training data")

Extracted normal traffic for autoencoder training
 - Normal training sequences: 10353
 - This represents 44.9% of training data
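To make the reconstruction-error idea concrete, here is a rough NumPy-only sketch. It does not use a real autoencoder: `reconstruct` is a hypothetical stand-in that always returns the mean normal pattern, the data is synthetic, and the 99th-percentile threshold is an assumption rather than a choice made in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data with the notebook's sequence shape (n, 10, 42).
normal_seqs = rng.normal(0.5, 0.05, size=(100, 10, 42))
anomalous_seqs = rng.normal(0.9, 0.05, size=(5, 10, 42))

mean_pattern = normal_seqs.mean(axis=0)


def reconstruct(x):
    """Hypothetical stand-in for a trained autoencoder: it always outputs the
    mean normal pattern, so inputs far from that pattern reconstruct poorly."""
    return np.broadcast_to(mean_pattern, x.shape)


def reconstruction_error(x):
    # Mean squared error per sequence, averaged over time steps and features.
    return np.mean((x - reconstruct(x)) ** 2, axis=(1, 2))


# Threshold taken from the errors on normal data only (assumed percentile).
normal_errors = reconstruction_error(normal_seqs)
threshold = np.percentile(normal_errors, 99)

flags = reconstruction_error(anomalous_seqs) > threshold
print("Flagged anomalous sequences:", int(flags.sum()), "of", len(flags))
```

A trained model would replace `reconstruct`, and the threshold would normally be tuned on the validation set.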

## Step 11: Save Preprocessed Data

We will save all preprocessed data for use in subsequent notebooks.


In [None]:
# Create output directory
os.makedirs('preprocessed_data', exist_ok=True)

# Save data as numpy arrays
np.save('preprocessed_data/X_train.npy', X_train)
np.save('preprocessed_data/X_train_normal.npy', X_train_normal)
np.save('preprocessed_data/X_val.npy', X_val)
np.save('preprocessed_data/X_test.npy', X_test)
np.save('preprocessed_data/y_train.npy', y_train)
np.save('preprocessed_data/y_val.npy', y_val)
np.save('preprocessed_data/y_test.npy', y_test)

# Save preprocessing objects
with open('preprocessed_data/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

with open('preprocessed_data/label_encoders.pkl', 'wb') as f:
    pickle.dump(label_encoders, f)

# Save feature names
with open('preprocessed_data/feature_names.pkl', 'wb') as f:
    pickle.dump(list(features_df.columns), f)

print(" All preprocessed data saved complete")
print("\n Saved files:")
print(" - X_train.npy, X_train_normal.npy, X_val.npy, X_test.npy")
print(" - y_train.npy, y_val.npy, y_test.npy")
print(" - scaler.pkl, label_encoders.pkl, feature_names.pkl")

All preprocessed data saved complete

Saved files:
 - X_train.npy, X_train_normal.npy, X_val.npy, X_test.npy
 - y_train.npy, y_val.npy, y_test.npy
 - scaler.pkl, label_encoders.pkl, feature_names.pkl
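A later notebook would read these files back with `np.load` and `pickle.load`. The sketch below demonstrates the round trip in a temporary directory with small placeholder contents, so it runs without the real `preprocessed_data/` folder; the array and the feature list are stand-ins, and only the file names mirror the ones saved above.

```python
import os
import pickle
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as out_dir:
    # Save a small placeholder array and a pickled list.
    X_demo = np.zeros((3, 10, 42), dtype=np.float32)
    np.save(os.path.join(out_dir, "X_train.npy"), X_demo)
    with open(os.path.join(out_dir, "feature_names.pkl"), "wb") as f:
        pickle.dump(["dur", "proto"], f)

    # Reload both and confirm that shape and contents survived.
    X_loaded = np.load(os.path.join(out_dir, "X_train.npy"))
    with open(os.path.join(out_dir, "feature_names.pkl"), "rb") as f:
        names_loaded = pickle.load(f)

print(X_loaded.shape, X_loaded.dtype)
print(names_loaded)
```

Pickled scikit-learn objects such as the scaler and encoders are best loaded with the same library version that wrote them, and pickle files should only be loaded from trusted sources.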

## Summary

In this notebook, we successfully:

1. Loaded the UNSW-NB15 dataset
2. Inspected and cleaned the data
3. Handled missing values
4. Encoded categorical features
5. Normalized all features to [0, 1] range
6. Created sequences for temporal modeling
7. Split data into train/validation/test sets
8. Extracted normal traffic for autoencoder training
9. Saved all preprocessed data

---

## Next Steps

Proceed to **Notebook 2: Visualization** to:

- Explore feature distributions
- Visualize attack patterns
- Understand correlations
- Gain insights into the dataset

---

**Note**: Make sure the `preprocessed_data/` directory contains all saved files before proceeding to the next notebook!
