> Group Members:
- Muhammad Awais `BCS221093`
- Syed Umer Sarfaraz `BCS221083`


# Network Intrusion Detection System

## 1. System Description
Network security is a major concern for businesses and organizations as cyber threats become more advanced and harder to detect. A Network Intrusion Detection System (NIDS) is a tool designed to **monitor** and **analyze** network traffic, looking for **unauthorized access**, **suspicious activities**, or **harmful behavior**. It identifies threats in real time and generates alerts so that the security team can respond promptly. A NIDS helps *secure sensitive data, prevent system breaches, and keep the network reliable and protected*.

Nowadays almost every business and organization runs its own network, so securing it is a crucial need. A NIDS uses modern techniques to distinguish normal traffic from potentially harmful traffic. It continuously monitors incoming and outgoing traffic and creates logs in real time. *A NIDS plays a key role in protecting against various types of cyber attacks, such as Denial-of-Service (DoS), Land attacks, and unauthorized logins*.

## 2. Problem Description
Detecting network intrusions in real time is challenging because of the high volume of network traffic. Rule-based systems often struggle with the dynamic nature of cyber threats, *leading to **more false positives** or **missed detections***. The main problem is to build a mechanism that effectively classifies network traffic as **normal** or **suspicious**, minimizing threats without disrupting legitimate network activity.
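
To make "false positives" and "missed detections" concrete, here is a minimal sketch that computes both rates for a binary detector. The function name `detection_rates` and the toy labels are made up for illustration; only the label names `normal` and `anomaly` come from the dataset used later.

```python
def detection_rates(y_true, y_pred, positive="anomaly"):
    """Return (false_positive_rate, false_negative_rate) for a binary detector."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # attack correctly flagged
            else:
                fp += 1  # normal traffic flagged as attack (false alarm)
        else:
            if truth == positive:
                fn += 1  # attack that slipped through (missed detection)
            else:
                tn += 1  # normal traffic correctly passed
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Toy labels, not taken from the real dataset
y_true = ["normal", "normal", "anomaly", "anomaly", "normal", "anomaly"]
y_pred = ["normal", "anomaly", "anomaly", "normal", "normal", "anomaly"]

fpr, fnr = detection_rates(y_true, y_pred)
print(f"False positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

A good detector keeps both numbers low at the same time: lowering one by simply flagging more (or less) traffic usually raises the other.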

## 3. How ML Can Help Address the Problem
Machine Learning offers a powerful way to improve intrusion detection compared to *traditional rule-based systems*. Unlike rule-based approaches, which rely on pre-defined rules and signatures, ML can automatically learn patterns from network data. This means *ML systems can detect new and unknown threats that may not follow established patterns*. *They adapt over time, improving their ability to spot suspicious activity as new types of attacks emerge*.

Another advantage of ML is its ability to reduce false alarms. Traditional systems often flag normal activities as threats, leading to unnecessary alerts. Well-trained ML models can reduce these false alarms, so security teams spend less time on harmless events. By using ML, intrusion detection systems become more **intelligent**, **scalable**, and *capable of keeping up with modern cybersecurity challenges*.

## 4. Dataset Description
The dataset consists of 33 descriptive features that provide information about network activity. The features cover content information, TCP connection information, and traffic- and host-related statistics computed over a time window.

> The tables below give a brief description of each feature:

### Dataset Features

| **Feature Name**         | **Description**                                                               | **Type**       |
|---------------------------|-------------------------------------------------------------------------------|----------------|
| 1. `duration`               | Length (in seconds) of the connection.                                       | numeric     |
| 2. `protocol_type`          | Type of protocol used, e.g., `tcp`, `udp`, `icmp`.                           | categorical       |
| 3. `service`                | Network service on the destination, e.g., `http`, `ftp`, etc.                | categorical       |
| 4. `src_bytes`              | Number of data bytes sent from the source to the destination.                | numeric     |
| 5. `dst_bytes`              | Number of data bytes sent from the destination to the source.                | numeric     |
| 6. `flag`                   | Connection status (normal or error), e.g., `SF`, `REJ`, etc.                 | categorical       |
| 7. `land`                   | Indicates if the connection is from/to the same host and port (`0` or `1`).  | binary       |
| 8. `wrong_fragment`         | Number of malformed fragments.                                               | numeric     |
| 9. `urgent`                 | Number of urgent packets.                                                    | numeric     |

### Content Features

| **Feature Name**         | **Description**                                                               | **Type**       |
|---------------------------|-------------------------------------------------------------------------------|----------------|
| 10. `hot`                    | Number of "hot" indicators in the connection.                                | numeric     |
| 11. `num_failed_logins`      | Number of failed login attempts.                                             | numeric     |
| 12. `logged_in`              | Indicates successful login (`1` for yes, `0` for no).                        | binary       |
| 13. `num_compromised`        | Number of compromised conditions.                                            | numeric     |
| 14. `root_shell`             | Indicates if a root shell was obtained (`1` for yes, `0` for no).            | binary       |
| 15. `su_attempted`           | Indicates if an "su root" command was attempted (`1` for yes, `0` for no).   | binary       |
| 16. `num_file_creations`     | Number of file creation operations.                                          | numeric     |
| 17. `num_shells`             | Number of shell prompts initiated.                                           | numeric     |
| 18. `num_access_files`       | Number of operations on access control files.                                | numeric     |
| 19. `num_outbound_cmds`      | Number of outbound commands in an FTP session.                               | numeric     |
| 20. `is_host_login`          | Indicates if the logged-in user is a "host" login (`1` for yes, `0` for no).  | binary       |
| 21. `is_guest_login`         | Indicates if the logged-in user is a "guest" login (`1` for yes, `0` for no). | binary       |


### Traffic and Host Features

| **Feature Name**               | **Description**                                                                                   | **Type**       |
|---------------------------------|---------------------------------------------------------------------------------------------------|----------------|
| 22. `count`                        | Number of connections to the same host.         | Numeric        |
| 23. `srv_count`                    | Number of connections to the same service.      | Numeric        |
| 24. `serror_rate`                  | Percentage of connections with "SYN" errors.                                                     | Numeric        |
| 25. `rerror_rate`                  | Percentage of connections with "REJ" errors.                                                     | Numeric        |
| 26. `same_srv_rate`                | Percentage of connections to the same service.                                                   | Numeric        |
| 27. `diff_srv_rate`                | Percentage of connections to different services.                                                 | Numeric        |
| 28. `srv_diff_host_rate`           | Percentage of same-service connections to different hosts.                                       | Numeric        |
| 29. `dst_host_count`               | Number of connections to the same destination host.                                              | Numeric        |
| 30. `dst_host_srv_count`           | Number of connections to the same service on the destination host.                               | Numeric        |
| 31. `dst_host_diff_srv_rate`       | Percentage of connections to different services on the destination host.                         | Numeric        |
| 32. `dst_host_same_src_port_rate`  | Percentage of connections to the same source port on the destination host.                       | Numeric        |
| 33. `dst_host_srv_diff_host_rate`  | Percentage of connections to different hosts on the same service at the destination host.         | Numeric        |
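
To show what "computed over a time window" means for features like `count` and `same_srv_rate`, here is a rough sketch. The `Conn` record, the `window_features` function, the IP addresses, and the timestamps are all hypothetical. The 2-second window is the value usually quoted for the KDD Cup 99 time-based traffic features, so treat it as an assumption. The sketch also assumes connections are sorted by time and that the current connection counts toward its own window.

```python
from collections import namedtuple

# Hypothetical minimal connection record: timestamp (seconds), destination host, service
Conn = namedtuple("Conn", ["time", "dst_host", "service"])


def window_features(conns, window=2.0):
    """For each connection, compute `count` and `same_srv_rate` over the
    connections to the same destination host in the last `window` seconds.
    Assumes `conns` is sorted by time; the current connection is included."""
    features = []
    for i, cur in enumerate(conns):
        recent = [
            c for c in conns[: i + 1]
            if c.dst_host == cur.dst_host and cur.time - c.time <= window
        ]
        count = len(recent)  # always >= 1 because `cur` is in `recent`
        same_srv = sum(1 for c in recent if c.service == cur.service)
        features.append({"count": count, "same_srv_rate": same_srv / count})
    return features


conns = [
    Conn(0.0, "10.0.0.5", "http"),
    Conn(0.5, "10.0.0.5", "http"),
    Conn(1.0, "10.0.0.5", "ftp"),
    Conn(1.2, "10.0.0.9", "http"),
    Conn(3.5, "10.0.0.5", "http"),
]

feats = window_features(conns)
for conn, feat in zip(conns, feats):
    print(conn, feat)
```

Note that the rate columns in the dataset are fractions between 0 and 1 (for example `same_srv_rate` of `0.08`), which matches this kind of ratio.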

## 5. Dataset Collection

This is a publicly available dataset that was used in KDD Cup 99. It is widely used by beginners to practice ML techniques and to train various classifiers on it. KDD Cup is the Knowledge Discovery and Data Mining competition, which is well known worldwide in the fields of data science and machine learning.

The KDD Cup dataset is a large dataset with 42 features. **NSL-KDD** is a refined version of the original KDD Cup 1999 dataset that addresses some of its shortcomings and is popular with beginners learning machine learning. The version used here has 33 features; some features were removed after applying correlation techniques.

The dataset is sourced from Kaggle: [NSL-KDD Dataset](https://www.kaggle.com/datasets/kaggleprollc/nsl-kdd99-dataset?resource=download)


## 6. Data Preprocessing

> Let's perform the following preprocessing steps:
- Convert data to Pandas Dataframe
- Check columns
- Check data length (shape)
- Check for null values.
- Check for noisy data.
- Check for duplicate records and remove them
- Check for outliers
- Remove or replace outliers, if necessary

#### Convert dataset

> This code was used to convert the dataset from ARFF to CSV format:

```py
import pandas as pd
from scipy.io import arff

# Load the ARFF training set; loadarff returns a (data, metadata) tuple
data = arff.loadarff('./data/KDDTrain+.arff')
df = pd.DataFrame(data[0])

# Save the records as CSV without the index column
df.to_csv('kdd.csv', index=False)
print("Data saved to kdd.csv")
```
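
The `b'...'` values that show up in the CSV most likely come from `loadarff` returning nominal attributes as Python `bytes`. As an alternative to cleaning the strings after loading the CSV, the bytes could be decoded before saving. This is only a sketch: the small DataFrame below is a made-up stand-in for the real `loadarff` output.

```python
import pandas as pd

# Made-up stand-in for the DataFrame built from arff.loadarff
df = pd.DataFrame({
    "duration": [0.0, 0.0],
    "protocol_type": [b"tcp", b"udp"],
    "class": [b"normal", b"anomaly"],
})

# Decode every object column that holds bytes into plain strings
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].apply(
        lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
    )

csv_text = df.to_csv(index=False)
print(csv_text)
```

With this step, the saved CSV contains `tcp` rather than `b'tcp'`, so no string cleanup is needed after `read_csv`.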

#### Import libraries

In [54]:
import pandas as pd

#### Load dataset and handle exceptions (e.g. file not found)

In [None]:
try:
    # LOAD DATASET
    df = pd.read_csv('./kdd.csv')
    print(df.head(5))  
except FileNotFoundError:
    print("Error: The file 'kdd.csv' was not found.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,class
0,0.0,b'tcp',b'ftp_data',b'SF',491.0,0.0,b'0',0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,150.0,25.0,0.03,0.17,0.0,b'normal'
1,0.0,b'udp',b'other',b'SF',146.0,0.0,b'0',0.0,0.0,0.0,...,0.0,0.08,0.15,0.0,255.0,1.0,0.6,0.88,0.0,b'normal'
2,0.0,b'tcp',b'private',b'S0',0.0,0.0,b'0',0.0,0.0,0.0,...,0.0,0.05,0.07,0.0,255.0,26.0,0.05,0.0,0.0,b'anomaly'
3,0.0,b'tcp',b'http',b'SF',232.0,8153.0,b'0',0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,30.0,255.0,0.0,0.03,0.04,b'normal'
4,0.0,b'tcp',b'http',b'SF',199.0,420.0,b'0',0.0,0.0,0.0,...,0.0,1.0,0.0,0.09,255.0,255.0,0.0,0.0,0.0,b'normal'


The values of some features above are stored as byte-string literals (for example `b'tcp'`) instead of plain strings. Let's convert them to plain strings.

In [33]:
# byte strings are like this b'value'. To convert them, we can remove "b'" from start and "'" from end.
def clean_byte_string(value):
    if isinstance(value, str) and value.startswith("b'"):
        return value[2:-1]  # Remove the "b'" at the start and the "'" at the end
    return value

for col in df.columns:
    df[col] = df[col].apply(clean_byte_string)
    
    
df.head(5)    

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,class
0,0.0,tcp,ftp_data,SF,491.0,0.0,0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,150.0,25.0,0.03,0.17,0.0,normal
1,0.0,udp,other,SF,146.0,0.0,0,0.0,0.0,0.0,...,0.0,0.08,0.15,0.0,255.0,1.0,0.6,0.88,0.0,normal
2,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,0.0,0.05,0.07,0.0,255.0,26.0,0.05,0.0,0.0,anomaly
3,0.0,tcp,http,SF,232.0,8153.0,0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,30.0,255.0,0.0,0.03,0.04,normal
4,0.0,tcp,http,SF,199.0,420.0,0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.09,255.0,255.0,0.0,0.0,0.0,normal


#### Check Columns


In [38]:
cols = df.columns.to_list()
print(df.columns)

Index(['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
       'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
       'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
       'su_attempted', 'num_file_creations', 'num_shells', 'num_access_files',
       'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count',
       'srv_count', 'serror_rate', 'rerror_rate', 'same_srv_rate',
       'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
       'dst_host_srv_count', 'dst_host_diff_srv_rate',
       'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'class'],
      dtype='object')


#### Check Shape

In [46]:
df.shape

(124527, 34)

#### Count null values

In [35]:
df.isnull().sum()

duration                       0
protocol_type                  0
service                        0
flag                           0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_file_creations             0
num_shells                     0
num_access_files               0
num_outbound_cmds              0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
rerror_rate                    0
same_srv_rate                  0
diff_srv_rate                  0
srv_diff_host_rate             0
dst_host_count                 0
dst_host_srv_count             0
dst_host_d

#### Check for duplicate records

In [None]:
duplicates = df.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")

df = df.drop_duplicates()

Number of duplicate rows: 1446
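
A small sketch of how `duplicated()` and `drop_duplicates()` behave, using made-up rows. By default the first occurrence is kept and only later repeats are marked, and the original index labels are preserved unless the index is reset.

```python
import pandas as pd

# Made-up rows; the second row repeats the first one exactly
df = pd.DataFrame({
    "service": ["http", "http", "ftp", "http"],
    "src_bytes": [232.0, 232.0, 491.0, 199.0],
})

# duplicated() marks only the repeats (keep='first' is the default)
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the original index labels, so reset them if needed
deduped = df.drop_duplicates().reset_index(drop=True)

print(f"Number of duplicate rows: {n_duplicates}")
print(deduped)
```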


#### Separate columns by type

Numerical and categorical columns are separated so that feature scaling and feature encoding can be applied later, if required.

In [41]:
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.to_list()
print("Numerical Columns:", numerical_cols)

# Select both object and category types for categorical data
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.to_list()
print("Categorical Columns:", categorical_cols)

Numerical Columns: ['duration', 'src_bytes', 'dst_bytes', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'num_compromised', 'root_shell', 'su_attempted', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'count', 'srv_count', 'serror_rate', 'rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate']
Categorical Columns: ['protocol_type', 'service', 'flag', 'land', 'logged_in', 'is_host_login', 'is_guest_login', 'class']
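
As a rough idea of how these two lists could be used later, the sketch below min-max scales a numerical column and one-hot encodes a categorical one. This is an assumption about a possible next step, not the pipeline actually used in this project; the tiny DataFrame only reuses a few values from the preview above.

```python
import pandas as pd

# Tiny illustrative frame (values borrowed from the preview above)
df = pd.DataFrame({
    "src_bytes": [491.0, 146.0, 0.0, 232.0],
    "protocol_type": ["tcp", "udp", "tcp", "tcp"],
    "class": ["normal", "normal", "anomaly", "normal"],
})

numerical_cols = ["src_bytes"]
categorical_cols = ["protocol_type"]  # the target column `class` is handled separately

# Min-max scale numeric columns to [0, 1]; guard against constant columns
scaled = df[numerical_cols].copy()
for col in numerical_cols:
    col_min, col_max = scaled[col].min(), scaled[col].max()
    span = col_max - col_min
    scaled[col] = (scaled[col] - col_min) / span if span else 0.0

# One-hot encode categorical columns and binarize the target
encoded = pd.get_dummies(df[categorical_cols], dtype=int)
target = (df["class"] == "anomaly").astype(int)

features = pd.concat([scaled, encoded], axis=1)
print(features)
```

In a real train/test setup the minimum and maximum would be taken from the training split only, so that no information from the test data leaks into the scaling.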


### Check for class imbalance

In [42]:
# Check class distribution
print(df['class'].value_counts(normalize=True))

class
normal     0.539128
anomaly    0.460872
Name: proportion, dtype: float64


#### Checking for number of outliers in numerical features

In [43]:
for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Flag outliers
    outliers = ((df[col] < lower_bound) | (df[col] > upper_bound)).sum()
    print(f"Outliers in {col}: {outliers}")

Outliers in duration: 9998
Outliers in src_bytes: 13488
Outliers in dst_bytes: 23209
Outliers in wrong_fragment: 1072
Outliers in urgent: 9
Outliers in hot: 2475
Outliers in num_failed_logins: 122
Outliers in num_compromised: 1094
Outliers in root_shell: 169
Outliers in su_attempted: 80
Outliers in num_file_creations: 287
Outliers in num_shells: 47
Outliers in num_access_files: 371
Outliers in num_outbound_cmds: 0
Outliers in count: 2534
Outliers in srv_count: 11898
Outliers in serror_rate: 0
Outliers in rerror_rate: 15154
Outliers in same_srv_rate: 0
Outliers in diff_srv_rate: 7272
Outliers in srv_diff_host_rate: 28295
Outliers in dst_host_count: 0
Outliers in dst_host_srv_count: 0
Outliers in dst_host_diff_srv_rate: 9691
Outliers in dst_host_same_src_port_rate: 24666
Outliers in dst_host_srv_diff_host_rate: 11334
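
One thing to keep in mind when reading these counts: for a column that is zero in more than 75% of rows, both quartiles are zero, so the IQR is zero and every non-zero value is flagged as an outlier. The sketch below shows this on a made-up sparse column (the values are not from the dataset).

```python
import pandas as pd

# Made-up sparse column: 97 zeros and 3 non-zero values
urgent = pd.Series([0.0] * 97 + [1.0, 2.0, 3.0])

q1, q3 = urgent.quantile(0.25), urgent.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

n_outliers = int(((urgent < lower) | (urgent > upper)).sum())
n_nonzero = int((urgent != 0).sum())

print(f"IQR = {iqr}, bounds = [{lower}, {upper}]")
print(f"Outliers: {n_outliers}, non-zero values: {n_nonzero}")
```

So for columns such as `urgent` or `root_shell`, the "outlier" count may simply be the number of rows where the value is not zero.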


### Outliers removal

Records with outlier values are removed for features where dropping them does not significantly affect the data.
The remaining outliers are replaced with median values.

In [None]:
# REMOVING OUTLIERS
columns_to_remove_outliers = [
    'duration', 
    'wrong_fragment', 
    'urgent', 
    'hot', 
    'num_failed_logins', 
    'num_compromised', 
    'root_shell', 
    'su_attempted', 
    'num_file_creations', 
    'num_shells', 
    'num_access_files'
]

for col in columns_to_remove_outliers:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    
df.shape    

(111856, 34)
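
Since some of these columns (for example `num_failed_logins` or `root_shell`) can be related to attacks, it may be worth checking whether dropping rows changes the class balance. Below is a minimal sketch; the helper `class_proportions` and the toy data are hypothetical, and the `== 0` filter is just a stand-in for the IQR filter above.

```python
import pandas as pd


def class_proportions(before, after, target="class"):
    """Return a table of class proportions before and after filtering rows."""
    return pd.DataFrame({
        "before": before[target].value_counts(normalize=True),
        "after": after[target].value_counts(normalize=True),
    }).fillna(0.0)


# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    "num_failed_logins": [0, 0, 0, 0, 0, 0, 3, 5],
    "class": ["normal"] * 5 + ["anomaly"] * 3,
})

# Stand-in for the IQR filter: keep only rows with no failed logins
filtered = toy[toy["num_failed_logins"] == 0]

shift = class_proportions(toy, filtered)
print(shift)
```

If the `anomaly` share drops a lot after filtering, the removed "outliers" were probably attack records rather than noise.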

#### Replacing outliers with median values

For some features, removing records is not a good approach. For example, some features are counts, such as the number of connections, so removing those records would distort the original dataset.

In [49]:
# REPLACING OUTLIERS WITH MEDIAN VALUES
columns_to_replace_outliers = [
    'src_bytes',
    'dst_bytes',
    'count',
    'srv_count',
    'rerror_rate',
    'diff_srv_rate',
    'srv_diff_host_rate',
    'dst_host_diff_srv_rate',
    'dst_host_same_src_port_rate',
    'dst_host_srv_diff_host_rate'
]

for col in columns_to_replace_outliers:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    median_value = df[col].median()
    df[col] = df[col].apply(lambda x: median_value if x < lower_bound or x > upper_bound else x)

print("Outliers replaced with median values.")

Outliers replaced with median values.
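
The IQR bounds are computed the same way in several cells, so the logic could be pulled into small helpers. The sketch below is one possible way to do it: the function names are made up, the sample values are not from the dataset, and `Series.mask` is used as a vectorized replacement for the row-by-row `apply`.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds using the k * IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def replace_outliers_with_median(series, k=1.5):
    """Replace values outside the IQR bounds with the series median."""
    lower, upper = iqr_bounds(series, k)
    median = series.median()
    return series.mask((series < lower) | (series > upper), median)


# Made-up values with one obvious outlier
src_bytes = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
cleaned = replace_outliers_with_median(src_bytes)
print(cleaned.tolist())
```

On the real data this would be used as `df[col] = replace_outliers_with_median(df[col])` inside the loop over `columns_to_replace_outliers`.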


#### Shape after removal of some rows

In [52]:
print(f"SHAPE  = {df.shape}")
df.head(5)

SHAPE  = (111856, 34)


Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,class
0,0.0,tcp,ftp_data,SF,491.0,0.0,0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,150.0,25.0,0.03,0.0,0.0,normal
1,0.0,udp,other,SF,146.0,0.0,0,0.0,0.0,0.0,...,0.0,0.08,0.15,0.0,255.0,1.0,0.02,0.0,0.0,normal
2,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,...,0.0,0.05,0.07,0.0,255.0,26.0,0.05,0.0,0.0,anomaly
3,0.0,tcp,http,SF,232.0,0.0,0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,30.0,255.0,0.0,0.03,0.04,normal
4,0.0,tcp,http,SF,199.0,420.0,0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,255.0,255.0,0.0,0.0,0.0,normal


### Save cleaned data

Almost all data preprocessing steps are complete, so let's save the cleaned data to a file.

In [53]:
df.to_csv("./kdd-cleaned.csv",index=False)