<p align="center">
    <img src="JHU.png" width="200" alt="Johns Hopkins University logo">
</p>

# Hands-on Lab: Cyber Intrusion Detection Systems

Estimated time needed: **60** minutes

### Overview:

In this lab, we will develop a machine learning model for detecting cybersecurity intrusions using seven datasets from the CICIDS 2017 collection. The traffic was captured over five days, and each dataset covers different types of cyber-attack. We will focus on Dataset 1, confirm that its label has more than one class value, and apply a Random Forest classifier because it is robust and reports feature importances. We will convert the class feature into binary values, explore key network features, visualize the data, evaluate different classifiers, and adapt our approach for the remaining datasets. Finally, we will reflect on the challenges of developing a machine learning model in the field of cybersecurity.

### Learning Objectives:

In this lab, we aim to achieve the following objectives:

- Learn to preprocess and manipulate datasets effectively.
- Explore and engineer relevant features for network traffic analysis.
- Select and justify appropriate machine learning methodologies.
- Evaluate machine learning models using cross-validation and performance metrics.
- Utilize data visualization techniques to identify patterns in datasets.
- Adapt code for application across multiple datasets.

### Datasets Used:

The datasets consist of network traffic data captured during various timeframes. Each dataset includes the following columns:

- **Flow ID:** Unique identifier for each flow.
- **Source IP/Port:** IP address and port of the source.
- **Destination IP/Port:** IP address and port of the destination.
- **Protocol:** The transport protocol used (e.g., TCP, UDP).
- **Timestamp:** The time at which the flow was recorded.
- **Flow Duration:** Total time duration of the flow.
- **Total Fwd/Backward Packets:** Total number of packets sent in the forward and backward directions.
- **Total Length of Fwd/Bwd Packets:** Total bytes sent in the forward and backward directions.
- **Fwd/Bwd Packet Length Stats:** Various statistics about packet lengths (max, min, mean, standard deviation).
- **Flow Bytes/s and Packets/s:** Throughput metrics for the flow.
- **Flow IAT (Inter-Arrival Time):** Time intervals between packets, with various statistics.
- **Flags and Counts:** Information about TCP flags and their counts.
- **Packet Length Stats:** Additional statistics regarding packet lengths, including variance.
- **Window Sizes and Bulk Metrics:** Metrics related to the forward and backward window sizes and bulk data transfer rates.
- **Active/Idle Times:** Statistics for active and idle periods during the flow.
- **Label:** Class label indicating whether the flow is benign or malicious.

All seven datasets share a similar structure and the same types of information, allowing for consistent analysis across different timeframes and conditions. In this lab we work with four of them; the remaining three are left for you to practice on.

### Implementation:

#### Importing necessary libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.compose import ColumnTransformer

#### Problem 1: Pick one of the data files, call it Dataset 1, and examine its features. Make sure it has more than one class value for its label.

To begin, we have chosen "Tuesday-WorkingHours.pcap_ISCX.csv" as Dataset 1.

In [2]:
# Load the dataset
dataset_1 = pd.read_csv('Tuesday-WorkingHours.pcap_ISCX.csv')

In [None]:
# Remove leading and trailing whitespace from all column names in dataset_1
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
dataset_1.columns = dataset_1.columns.str.strip()
```
 
</details>
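
The strip step matters because CSV headers may carry padding (for example `' Label'`), which makes exact-match column lookups fail. A minimal sketch on a toy DataFrame, where the column names and values are made up for illustration:

```python
import pandas as pd

# Toy frame with padded headers (illustrative values, not real lab data)
toy = pd.DataFrame({' Flow Duration': [10, 20], ' Label ': ['BENIGN', 'FTP-Patator']})

# Exact-match lookup on the clean name fails before stripping
print('Label' in toy.columns)

# Same operation as in the solution above
toy.columns = toy.columns.str.strip()
print(toy.columns.tolist())
```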

In [None]:
# Analyze the features of the dataset and determine if there is a suitable label column with multiple classes.

# print the column names for dataset_1
# write your code here


# Check for unique values in each column to identify potential label columns
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
# print the column names for dataset_1
print(dataset_1.columns)

# Check for unique values in each column to identify potential label columns
for col in dataset_1.columns:
    print(f"{col}: {dataset_1[col].nunique()} unique values")    
```
 
</details>

In [None]:
# Define the label column name and print its unique values from dataset_1
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
label_column = 'Label'  
print(f"Unique values in {label_column}: {dataset_1[label_column].unique()}")
```
 
</details>

**For Dataset 1, Random Forest is an excellent choice for classification due to its ability to handle high-dimensional data without scaling and its robustness against overfitting through the use of multiple decision trees. It effectively manages class imbalances with techniques like class weighting and provides feature importance rankings, offering insights into key attributes. Additionally, its non-parametric nature allows it to accommodate the non-linear patterns often seen in network traffic.**    
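
The class weighting and feature importance rankings mentioned above can be sketched on synthetic data. This is only an illustration: `X_demo`, `y_demo`, and `rf_demo` are hypothetical names, and the data comes from `make_classification`, not from Dataset 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for flow features (roughly 10% positives)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, n_informative=3,
                                     n_redundant=0, weights=[0.9, 0.1], random_state=0)

# class_weight='balanced' reweights classes inversely to their frequency
rf_demo = RandomForestClassifier(n_estimators=50, class_weight='balanced', random_state=0)
rf_demo.fit(X_demo, y_demo)

# Importances are non-negative and sum to 1 across features; print them in descending order
for idx in np.argsort(rf_demo.feature_importances_)[::-1]:
    print(f"feature_{idx}: {rf_demo.feature_importances_[idx]:.3f}")
```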

#### Problem 2: Process the class feature/category as binary classes for supervised learning, assign BENIGN to value 0 and the rest to value 1. Check its balance for the Dataset 1.

To process the class feature of Dataset 1 as binary classes (assigning BENIGN to 0 and all other values to 1), follow these steps:

- Identify the target column: The class column is labeled as "Label."
- Assign binary values: Convert BENIGN to 0 and all other class values to 1.
- Check balance: After processing, count the number of instances in each class to assess for class imbalance.

In [None]:
# Assign binary values in the 'Label' column: 'BENIGN' as 0 and all other values as 1
dataset_1['Label_binary'] = dataset_1['Label'].apply(lambda x: 0 if x == 'BENIGN' else 1)

# Check the balance of the binary classes
# write your code here


# Display the class balance
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
# Check the balance of the binary classes
class_counts = dataset_1['Label_binary'].value_counts()

# Display the class balance
print("Class balance:")
print(class_counts)
```
 
</details>
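
As an alternative to `apply` with a lambda, the same binary mapping can be written as a vectorized comparison, and `value_counts(normalize=True)` reports class shares rather than raw counts. A small sketch on a toy frame (the labels are illustrative):

```python
import pandas as pd

# Toy labels for illustration only
toy = pd.DataFrame({'Label': ['BENIGN', 'BENIGN', 'FTP-Patator', 'BENIGN', 'SSH-Patator']})

# Boolean comparison cast to int: BENIGN -> 0, anything else -> 1
toy['Label_binary'] = (toy['Label'] != 'BENIGN').astype(int)

# normalize=True returns class proportions instead of counts
class_share = toy['Label_binary'].value_counts(normalize=True)
print(class_share)
```

Proportions make imbalance easier to compare across datasets of different sizes.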

#### Problem 3: Explore Dataset 1 features with respect to the class. 
**(Hint: features Source Port and Destination Port are very useful; research and find out important networking port numbers and one-hot-encode them. Unimportant port numbers or source port numbers can be assigned to a feature called 'other ports'.)**

In [None]:
# List of important ports (common port numbers for HTTP, HTTPS, FTP, SSH, etc.)
important_ports = [80, 443, 21, 22, 53, 25, 110, 143]  # Modify this list based on your research

# Function to label ports: important ports retain their number, others are labeled as 'Other Port'
def label_ports(port, important_ports):
    if port in important_ports:
        return port
    else:
        return 'Other Port'

# Apply the labeling function to both 'Source Port' and 'Destination Port'
# write your code here




# One-Hot Encoding the 'Source Port Label' and 'Destination Port Label' columns
# write your code here



# Drop the original label columns for source and destination ports
# write your code here



# Concatenate the one-hot encoded columns back to the dataset
# write your code here



# Display the first few rows of the updated dataset to check the result
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
# Apply the labeling function to both 'Source Port' and 'Destination Port'
dataset_1['Source Port Label'] = dataset_1['Source Port'].apply(lambda x: label_ports(x, important_ports))
dataset_1['Destination Port Label'] = dataset_1['Destination Port'].apply(lambda x: label_ports(x, important_ports))

# One-Hot Encoding the 'Source Port Label' and 'Destination Port Label' columns
source_port_encoded = pd.get_dummies(dataset_1['Source Port Label'], prefix='Source_Port')
destination_port_encoded = pd.get_dummies(dataset_1['Destination Port Label'], prefix='Destination_Port')

# Drop the original label columns for source and destination ports
dataset_1.drop(['Source Port Label', 'Destination Port Label'], axis=1, inplace=True)

# Concatenate the one-hot encoded columns back to the dataset
dataset_1 = pd.concat([dataset_1, source_port_encoded, destination_port_encoded], axis=1)

# Display the first few rows of the updated dataset to check the result
print(dataset_1.head())
```
 
</details>
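
To explore the port features with respect to the class, one option is a cross-tabulation of the port label against the binary label. The sketch below uses a toy frame and a shortened port list; it assumes a `Label_binary` column like the one created in Problem 2, and it stores port labels as strings so the table index has a single type.

```python
import pandas as pd

important_ports = [21, 22, 80, 443]  # shortened list for illustration

# Toy flows (made-up values)
toy = pd.DataFrame({
    'Destination Port': [80, 21, 21, 443, 51234, 22],
    'Label_binary':     [0,  1,  1,  0,   0,     1],
})

# Important ports keep their number (as text); everything else becomes 'Other Port'
toy['Destination Port Label'] = toy['Destination Port'].apply(
    lambda p: str(p) if p in important_ports else 'Other Port')

# Rows: port label, columns: class (0 = benign, 1 = attack)
port_vs_class = pd.crosstab(toy['Destination Port Label'], toy['Label_binary'])
print(port_vs_class)
```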

#### Problem 4: Display some histograms and anything you deem fit to pick independent Dataset 1 features.
**(Hint: source/destination bytes, packets, ports and the duration features.)**

#### Problem 4.1: Plot a boxplot for Total Backward Packets, excluding outliers

In [None]:
# Initialize a new figure with specified width and height
plt.figure(figsize=(8, 4))  

# Create a boxplot for 'Total Backward Packets', hiding outliers
# write your code here


# Set the title and label for the x-axis
# write your code here


# Display the plot
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
# Create a boxplot for 'Total Backward Packets', hiding outliers
sns.boxplot(x=dataset_1['Total Backward Packets'], showfliers=False)

# Set the title and label for the x-axis
plt.title('Box Plot of Total Backward Packets (Without Outliers)')
plt.xlabel('Total Backward Packets')

# Display the plot
plt.show()
```
 
</details>

#### Problem 4.2: Plot a histogram for Flow Duration with a kernel density estimate (KDE)

In [None]:
# Initialize a new figure with specified width and height for the histogram
plt.figure(figsize=(7, 5))  

# Create a histogram of Flow Duration, including a KDE to visualize the distribution shape
# write your code here


# Set the title of the plot
# write your code here


# Label the x-axis and y-axis
# write your code here
  

# Display the plot
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
# Create a histogram of Flow Duration, including a KDE to visualize the distribution shape
sns.histplot(dataset_1['Flow Duration'], bins=20, kde=True)

# Set the title of the plot
plt.title('Flow Duration Distribution') 

# Label the x-axis and y-axis
plt.xlabel('Flow Duration')  
plt.ylabel('Frequency')  

# Display the plot
plt.show()  
```
 
</details>

#### Problem 4.3: Plot Correlation Heatmap

In [None]:
# Create a figure for the heatmap
plt.figure(figsize=(7, 5))

# Calculate the correlation matrix for selected features
# write your code here



# Plot the heatmap with annotations, using a color map for visualization
# write your code here


# Set the title of the heatmap
# write your code here


# Display the heatmap
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
# Calculate the correlation matrix for selected features
correlation_matrix = dataset_1[['Total Length of Fwd Packets', 
                                  'Total Length of Bwd Packets', 
                                  'Total Fwd Packets', 
                                  'Total Backward Packets', 
                                  'Flow Duration']].corr()

# Plot the heatmap with annotations, using a color map for visualization
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1)

# Set the title of the heatmap
plt.title('Correlation Heatmap')

# Display the heatmap
plt.show()
```
 
</details>
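
One way to use the correlation matrix when picking independent features is to flag one feature from each highly correlated pair. This is a sketch on synthetic data, not the lab's prescribed method; the `threshold` value of 0.9 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
fwd_packets = rng.integers(1, 100, size=200)

# Synthetic stand-in: packet length is an exact multiple of packet count
toy = pd.DataFrame({
    'Total Fwd Packets': fwd_packets,
    'Total Length of Fwd Packets': fwd_packets * 60,
    'Flow Duration': rng.integers(1, 10_000, size=200),  # drawn independently
})

threshold = 0.9  # hypothetical cutoff
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag any column that is highly correlated with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```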

#### Problem 5: Attempt a few classifier models and report their 10-fold CV performances.

#### Problem 5.1: Prepare the Data for Classifier Model Training

In [None]:
# Select relevant features for model training
features = ['Total Length of Fwd Packets', 'Total Length of Bwd Packets', 
            'Total Fwd Packets', 'Total Backward Packets', 'Flow Duration']

# Define the feature set and target variable
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
X = dataset_1[features]  
y = dataset_1['Label']
```
 
</details>

In [None]:
# Encode target labels if they are categorical
label_encoder = LabelEncoder()

# Encode labels
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
y_encoded = label_encoder.fit_transform(y)  
```
 
</details>

In [None]:
# Handle missing values and standardize features using a pipeline
pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),  
    StandardScaler()  
)

# Apply the transformations to the feature set X.
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
X_processed = pipeline.fit_transform(X) 
```
 
</details>
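
Fitting the imputer and scaler on the full feature set before cross-validation lets each validation fold influence the preprocessing statistics. A possible alternative, sketched here on synthetic data with hypothetical names (`X_demo`, `y_demo`, `leak_free_model`), is to place the preprocessing steps and the classifier in one pipeline so they are refit on the training portion of every fold:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected flow features
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Imputation, scaling, and the classifier are fit together inside each fold
leak_free_model = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

fold_scores = cross_val_score(leak_free_model, X_demo, y_demo, cv=5, scoring='accuracy')
print(fold_scores.mean())
```

Problem 7 uses this pipeline-in-CV pattern with a `ColumnTransformer`.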

>  **Note: Please be patient as the execution may take some time.**

#### Problem 5.2: Evaluate Classifier Models and Report 10-Fold Cross-Validation Results

In [None]:
# Create a dictionary to hold results
results = {}

**1. Logistic Regression**

In [None]:
# Create a Logistic Regression model and perform 10-fold cross-validation.
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
log_reg = LogisticRegression(max_iter=1000)
cv_scores_log_reg = cross_val_score(log_reg, X_processed, y_encoded, cv=10, scoring='accuracy', n_jobs=-1)
results['Logistic Regression'] = cv_scores_log_reg.mean()
```
 
</details>

**2. Decision Tree**

In [None]:
# Create a Decision Tree model and perform 10-fold cross-validation.
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
decision_tree = DecisionTreeClassifier()
cv_scores_decision_tree = cross_val_score(decision_tree, X_processed, y_encoded, cv=10, scoring='accuracy', n_jobs=-1)
results['Decision Tree'] = cv_scores_decision_tree.mean()
```
 
</details>

**3. Random Forest**

In [None]:
# Create a Random Forest model and perform 10-fold cross-validation.
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
random_forest = RandomForestClassifier()
cv_scores_random_forest = cross_val_score(random_forest, X_processed, y_encoded, cv=10, scoring='accuracy', n_jobs=-1)
results['Random Forest'] = cv_scores_random_forest.mean()
```
 
</details>

#### Problem 5.3: Compile Results into a DataFrame

In [None]:
# Convert the results dictionary to a DataFrame for easier visualization
results_df = pd.DataFrame(list(results.items()), columns=['Classifier', '10-Fold CV Accuracy'])

# Display the results DataFrame
# write your code here



<details><summary>Click here for the solution</summary>
 
```python
print(results_df)
```
 
</details>
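
Accuracy alone can look high on imbalanced traffic even when the minority class is poorly detected. As a hedged sketch on synthetic data, `cross_validate` can report several metrics at once, and an explicit `StratifiedKFold` keeps the class ratio similar in every fold. The names `X_demo`, `y_demo`, and `scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data (roughly 10% positives), not the lab dataset
X_demo, y_demo = make_classification(n_samples=400, n_features=5,
                                     weights=[0.9, 0.1], random_state=0)

# Shuffled, stratified folds with a fixed seed for repeatability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo, cv=cv, scoring=['accuracy', 'f1_macro'],
)

print("accuracy:", scores['test_accuracy'].mean())
print("macro F1:", scores['test_f1_macro'].mean())
```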

#### Problem 6: Adapt your code to analyze any three datasets from the remaining 6 datasets.

For this task, you need to utilize the following three datasets:

- **dataset_2:** Wednesday-workingHours.pcap_ISCX.csv 
- **dataset_3:** Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv  
- **dataset_4:** Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv 

**Data Preparation**

To keep the datasets consistent, strip whitespace from the column names and load each file with a fallback for encoding errors.

In [None]:
# Function to clean column names by removing leading and trailing whitespace
def clean_column_names(df):
    df.columns = df.columns.str.strip()  # Strip whitespace from column names
    return df

# Function to load the dataset, handling potential encoding issues
# Attempt to load with UTF-8 encoding
# Fallback to ISO-8859-1 encoding
# write your code here




<details><summary>Click here for the solution</summary>
 
```python
def load_dataset(file):
    try:
        # Attempt to load with UTF-8 encoding
        return pd.read_csv(file, encoding='utf-8')
    except UnicodeDecodeError:
        # Fall back to ISO-8859-1 encoding
        return pd.read_csv(file, encoding='ISO-8859-1')
```
 
</details>
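
The fallback can be exercised on a small temporary file. In this sketch the file contents are made up, and the byte `0x96` stands in for any character that is not valid UTF-8; `load_dataset` is repeated so the block runs on its own.

```python
import os
import tempfile

import pandas as pd

def load_dataset(file):
    try:
        return pd.read_csv(file, encoding='utf-8')
    except UnicodeDecodeError:
        return pd.read_csv(file, encoding='ISO-8859-1')

# Hypothetical two-row CSV with a byte (0x96) that cannot be decoded as UTF-8
raw = b"Flow Duration,Label\n10,BENIGN\n20,Web Attack \x96 Brute Force\n"
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp.write(raw)
    tmp_path = tmp.name

demo_df = load_dataset(tmp_path)
os.remove(tmp_path)  # clean up the temporary file

print(demo_df.shape)
print(demo_df['Label'].tolist())
```

ISO-8859-1 maps every byte to a character, so the second read does not raise a decoding error, although unusual bytes may show up as odd characters in label strings.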

#### Problem 7: Pick a classifier algorithm and report its evaluation for the 3 datasets.

In [None]:
def evaluate_supervised_model(dataset):
    # Clean the dataset column names
    dataset = clean_column_names(dataset)
    
    # Define the required features for modeling
    required_features = ['Total Length of Fwd Packets', 'Total Length of Bwd Packets',
                         'Total Fwd Packets', 'Total Backward Packets', 'Flow Duration', 'Label']
    
    # Check for any missing features in the dataset
    missing_features = [feature for feature in required_features if feature not in dataset.columns]
    if missing_features:
        return f"Cannot evaluate: missing features {missing_features}"

    # Separate features and target variable
    X = dataset[required_features[:-1]]  # Features excluding the label
    y_encoded = LabelEncoder().fit_transform(dataset['Label'])  # Encode the target labels

    # Identify numeric and categorical features
    numeric_features = X.select_dtypes(include='number').columns.tolist()  # Any numeric dtype, not only 64-bit
    categorical_features = X.select_dtypes(include=['object']).columns.tolist()

    # Create a preprocessing pipeline for numeric and categorical features
    preprocessor = ColumnTransformer(transformers=[
        ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),  # Handle missing values with mean
                          ('scaler', StandardScaler())]), numeric_features),  # Standardize numeric features
        ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing categorical values
                          ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_features)  # One-hot encode categorical features
    ])

    
    # Create a modeling pipeline that includes preprocessing and classifier
    # Use Random Forest as the classifier
    # write your code here
    
    
    
                             
    # Evaluate the model using 10-fold cross-validation and return the mean accuracy
    return cross_val_score(model, X, y_encoded, cv=10, scoring='accuracy', n_jobs=-1).mean()

<details><summary>Click here for the solution</summary>
 
```python
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', RandomForestClassifier())])
```
 
</details>

**Evaluate Supervised Datasets and Compile Results**

In [None]:
# Supervised dataset paths for Datasets 2-4
supervised_files = [
    'Wednesday-workingHours.pcap_ISCX.csv',
    'Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv',
    'Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv',  
]

# Create new dataset names for supervised datasets
dataset_names = [f'dataset_{i + 2}' for i in range(len(supervised_files))]  # Generates ['dataset_2', 'dataset_3', 'dataset_4']

# Evaluate supervised datasets
supervised_results = {
    name: evaluate_supervised_model(load_dataset(file))
    for name, file in zip(dataset_names, supervised_files)
}

# Create a DataFrame for supervised results
# write your code here




<details><summary>Click here for the solution</summary>
 
```python
supervised_results_df = pd.DataFrame(list(supervised_results.items()), columns=['Dataset', '10-Fold CV Accuracy'])
print("Supervised Learning Results:")
print(supervised_results_df)
```
 
</details>
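
Because `evaluate_supervised_model` returns a message string when required features are missing, the results column can end up mixing floats and text. One way to separate the two, sketched with made-up placeholder values (not actual lab results), is `pd.to_numeric` with `errors='coerce'`:

```python
import pandas as pd

# Placeholder values for illustration only: two scores and one error message
demo_results = {
    'dataset_2': 0.97,
    'dataset_3': 0.95,
    'dataset_4': "Cannot evaluate: missing features ['Label']",
}

demo_df = pd.DataFrame(list(demo_results.items()), columns=['Dataset', 'Result'])

# Non-numeric entries become NaN, leaving a clean float column for sorting or plotting
demo_df['10-Fold CV Accuracy'] = pd.to_numeric(demo_df['Result'], errors='coerce')
print(demo_df)
```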

### Practice datasets:

#### Next Steps: 
To strengthen your skills in cybersecurity intrusion detection, we have provided three additional datasets for you to practice on. Apply the supervised learning techniques you have learned, such as data preprocessing, feature engineering, and model evaluation, to these new datasets.

**Datasets:**

- **Friday-WorkingHours-Morning.pcap_ISCX.csv**
- **Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv**
- **Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv**

Experiment with these datasets by implementing similar approaches and observing how different attack types and network characteristics affect model performance. This will help reinforce your understanding and ability to adapt your methods in real-world scenarios.

> **Note**: To practice, create a new notebook by navigating to File > New Notebook and open it in a new tab.
> The datasets are already available in the environment, so you can load them using the file names listed above.


### Summary:

In this lab, we developed a machine learning model for cybersecurity intrusion detection using CICIDS 2017 datasets. We focused on preprocessing the data, exploring key features, and implementing multiple classifiers, including a Random Forest. We evaluated their performance through cross-validation. This exercise underscored the challenges of developing effective models in cybersecurity and highlighted the importance of systematic data handling and model evaluation.