![cyber_photo](cyber_photo.jpg)

Cyber threats are a growing concern for organizations worldwide. These threats take many forms, including malware, phishing, and denial-of-service (DoS) attacks, which compromise sensitive information and disrupt operations. The increasing sophistication and frequency of these attacks make it imperative for organizations to adopt advanced security measures. Traditional threat detection methods often fall short because they cannot adapt to new and evolving threats. This is where deep learning models come into play.

Deep learning models can analyze vast amounts of data and identify patterns that may not be immediately obvious to human analysts. By leveraging these models, organizations can proactively detect and mitigate cyber threats, safeguarding their sensitive information and ensuring operational continuity.

As a cybersecurity analyst, you identify and mitigate these threats. In this project, you will design and implement a deep learning model to detect cyber threats. The BETH dataset simulates real-world logs, providing a rich source of information for training and testing your model. The data has already undergone preprocessing, and we have a target label, `sus_label`, indicating whether an event is malicious (1) or benign (0).

By successfully developing this model, you will contribute to enhancing cybersecurity measures and protecting organizations from potentially devastating cyber attacks.


### The Data

| Column     | Description              |
|------------|--------------------------|
|`processId`|The unique identifier for the process that generated the event - int64 |
|`threadId`|ID for the thread spawning the log - int64|
|`parentProcessId`|Label for the process spawning this log - int64|
|`userId`|ID of user spawning the log - int64|
|`mountNamespace`|Mounting restrictions the process log works within - int64|
|`argsNum`|Number of arguments passed to the event - int64|
|`returnValue`|Value returned from the event log (usually 0) - int64|
|`sus_label`|Binary label marking a suspicious event (1 is suspicious, 0 is not) - int64|

More information on the dataset: [BETH dataset](accreditation.md)

In [40]:
# Run this cell to install torchmetrics. If pip install is not available, use sklearn's accuracy_score instead.
!pip install torchmetrics

Defaulting to user installation because normal site-packages is not writeable


In [41]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.nn.functional as functional
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim
import torchmetrics
from torchmetrics import Accuracy
from sklearn.metrics import accuracy_score  # sklearn fallback if torchmetrics is unavailable

In [42]:
# Load preprocessed data
train_df = pd.read_csv('labelled_train.csv')
test_df = pd.read_csv('labelled_test.csv')
val_df = pd.read_csv('labelled_validation.csv')

# View the first 5 rows of training set
train_df.head()

Unnamed: 0,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
0,381,7337,1,100,4026532231,5,0,1
1,381,7337,1,100,4026532231,1,0,1
2,381,7337,1,100,4026532231,0,0,1
3,7347,7347,7341,0,4026531840,2,-2,1
4,7347,7347,7341,0,4026531840,4,0,1


In [43]:
# View the first 5 rows of the test set
test_df.head()

Unnamed: 0,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
0,382,382,1,101,4026532232,3,15,0
1,379,379,1,100,4026532231,3,15,0
2,1,1,0,0,4026531840,4,0,0
3,1,1,0,0,4026531840,4,17,0
4,1,1,0,0,4026531840,2,0,0


In [44]:
val_df.head()

Unnamed: 0,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
0,381,381,1,101,4026532232,3,15,0
1,378,378,1,100,4026532231,3,15,0
2,1,1,0,0,4026531840,4,0,0
3,1,1,0,0,4026531840,4,12,0
4,1,1,0,0,4026531840,2,0,0


1. Loading and Scaling Data

In [45]:
# Separate features and labels for each split.
# Drop the sus_label column to get the features, and keep sus_label as the labels.
X_train = train_df.drop('sus_label', axis=1)
y_train = train_df['sus_label']
X_test = test_df.drop('sus_label', axis=1)
y_test = test_df['sus_label']
X_val = val_df.drop('sus_label', axis=1)
y_val = val_df['sus_label']

In [46]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Use the fitted scaler to transform the test and validation data
X_test_scaled = scaler.transform(X_test)
X_val_scaled = scaler.transform(X_val)
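
To make the scaling step concrete, here is a minimal pure-Python sketch of what standardization does to a single column, assuming the default `StandardScaler` behavior (subtract the mean, divide by the population standard deviation). The helper names `fit_column` and `transform_column` and the sample values are hypothetical and only for illustration. The point is that the mean and standard deviation come from the training column alone and are then reused for the other splits.

```python
from statistics import fmean, pstdev

def fit_column(train_values):
    """Return (mean, std) computed from the training column only."""
    mean = fmean(train_values)
    std = pstdev(train_values)  # population std (ddof=0)
    if std == 0:
        std = 1.0  # constant column: avoid dividing by zero
    return mean, std

def transform_column(values, mean, std):
    """Standardize values using previously fitted statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical argsNum-like values, not taken from the real files
train_args = [5, 1, 0, 2, 4]
test_args = [3, 3, 4]

mean, std = fit_column(train_args)                      # fit on train only
train_scaled = transform_column(train_args, mean, std)
test_scaled = transform_column(test_args, mean, std)    # reuse train statistics

print(train_scaled)
print(test_scaled)
```

Because the test values are scaled with the training statistics, the transformed test column will generally not have exactly zero mean or unit variance, which is expected.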

In [47]:
# Convert scaled features and labels to PyTorch tensors
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)

X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

X_val_tensor = torch.tensor(X_val_scaled, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val.values, dtype=torch.long)

# Print results
print("Train Features Tensor:")
print(X_train_tensor)
print("\nTrain Labels Tensor:")
print(y_train_tensor)

print("\nTest Features Tensor:")
print(X_test_tensor)
print("\nTest Labels Tensor:")
print(y_test_tensor)

print("\nValidation Features Tensor:")
print(X_val_tensor)
print("\nValidation Labels Tensor:")
print(y_val_tensor)

Train Features Tensor:
tensor([[-3.3013,  0.2668, -0.8491,  ...,  1.7840,  1.7361, -0.0550],
        [-3.3013,  0.2668, -0.8491,  ...,  1.7840, -1.2470, -0.0550],
        [-3.3013,  0.2668, -0.8491,  ...,  1.7840, -1.9927, -0.0550],
        ...,
        [ 0.2356,  0.2342,  2.3587,  ..., -0.5871, -1.9927, -0.0550],
        [ 0.2362,  0.2348, -0.8491,  ..., -0.5871, -1.9927, -0.0550],
        [ 0.1505,  0.1485, -0.8491,  ..., -0.5871, -1.2470, -0.0550]])

Train Labels Tensor:
tensor([1, 1, 1,  ..., 0, 0, 0])

Test Features Tensor:
tensor([[-3.3008, -3.3237, -0.8491,  ...,  1.7901,  0.2446, -0.0079],
        [-3.3023, -3.3253, -0.8491,  ...,  1.7840,  0.2446, -0.0079],
        [-3.4963, -3.5204, -0.8495,  ..., -0.5871,  0.9903, -0.0550],
        ...,
        [ 0.3798,  0.3793,  2.5573,  ..., -0.5871,  0.2446, -0.4128],
        [ 0.3798,  0.3793,  2.5573,  ..., -0.5871,  0.2446, -0.4128],
        [ 0.3798,  0.3793,  2.5573,  ..., -0.5871,  0.2446, -0.4128]])

Test Labels Tensor:
tensor([0,

In [48]:
# Create DataLoaders
batch_size = 32
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

2. Defining the Neural Network Model

In [49]:
# Define the number of features in the dataset
num_features = X_train_tensor.shape[1]

# Define the neural network using nn.Sequential
model = nn.Sequential(
    nn.Linear(num_features, 128),  # Input features -> first hidden layer (128 units)
    nn.ReLU(),
    nn.Linear(128, 64),           # Second hidden layer (64 units)
    nn.ReLU(),
    nn.Linear(64, 2)              # Output layer: 2 logits, one per class (benign, suspicious)
)
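
As a rough sanity check on the size of this network, the sketch below counts the trainable parameters of a stack of fully connected layers by hand. The function name `count_linear_params` is hypothetical, and the input width of 7 is an assumption based on the seven non-label columns in the data table; the real value is whatever `num_features` evaluates to in the notebook.

```python
def count_linear_params(layer_sizes):
    """Count weights and biases for consecutive fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

# Assumption: 7 input features (the non-label columns in the data table)
assumed_num_features = 7
n_params = count_linear_params([assumed_num_features, 128, 64, 2])
print(n_params)
```

On the actual model, `sum(p.numel() for p in model.parameters())` gives the exact count without any assumption about the input width.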


3. Training the Model

In [50]:
# Define the loss function
criterion = nn.CrossEntropyLoss()

# Initialize the optimizer
optimizer = optim.SGD(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4
)

# Number of epochs
num_epochs = 20

# Training loop
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode

    running_loss = 0.0
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()  # Clear the gradients

        outputs = model(batch_X)  # Forward pass

        loss = criterion(outputs, batch_y)  # Compute the loss

        loss.backward()  # Backward pass to compute gradients

        optimizer.step()  # Update the model parameters

        running_loss += loss.item() * batch_X.size(0)

    # Average loss for the epoch
    epoch_loss = running_loss / len(train_loader.dataset)

    # Print epoch loss
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}")

print("Training complete.")


Epoch [1/20], Loss: 0.0273
Epoch [2/20], Loss: 0.0038
Epoch [3/20], Loss: 0.0034
Epoch [4/20], Loss: 0.0032
Epoch [5/20], Loss: 0.0031
Epoch [6/20], Loss: 0.0030
Epoch [7/20], Loss: 0.0029
Epoch [8/20], Loss: 0.0029
Epoch [9/20], Loss: 0.0028
Epoch [10/20], Loss: 0.0028
Epoch [11/20], Loss: 0.0027
Epoch [12/20], Loss: 0.0027
Epoch [13/20], Loss: 0.0026
Epoch [14/20], Loss: 0.0026
Epoch [15/20], Loss: 0.0026
Epoch [16/20], Loss: 0.0026
Epoch [17/20], Loss: 0.0025
Epoch [18/20], Loss: 0.0025
Epoch [19/20], Loss: 0.0025
Epoch [20/20], Loss: 0.0025
Training complete.
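
The loss values above come from `nn.CrossEntropyLoss`, which takes raw logits and integer class labels, applies a log-softmax, and by default averages the negative log-probability of the correct class over the batch. The pure-Python sketch below reproduces that calculation for a tiny batch. The function name `cross_entropy` and the logit values are hypothetical and are not outputs of the trained model.

```python
import math

def cross_entropy(logits_batch, targets):
    """Mean negative log-probability of the target class, from raw logits."""
    losses = []
    for logits, target in zip(logits_batch, targets):
        # Log-sum-exp with the max subtracted for numerical stability
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        # -log(softmax(logits)[target]) = log_sum_exp - logits[target]
        losses.append(log_sum_exp - logits[target])
    return sum(losses) / len(losses)

# Hypothetical logits for three events and their true labels
logits_batch = [[2.0, -1.0], [0.5, 0.5], [-3.0, 4.0]]
targets = [0, 1, 1]

loss = cross_entropy(logits_batch, targets)
print(f"Batch loss: {loss:.4f}")
```

A confident correct prediction contributes a loss close to zero, while equal logits for both classes contribute `log(2)`, which is why a loss far below that value indicates the model separates most training events confidently.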


4. Evaluating the Model

In [51]:
# Initialize the accuracy metric for binary classification
accuracy_metric = torchmetrics.Accuracy(task='binary')

# Function to evaluate the model on a DataLoader
def evaluate_model(loader):
    model.eval()  # Set the model to evaluation mode
    accuracy_metric.reset()  # Clear state left over from a previous call

    with torch.no_grad():
        for batch_X, batch_y in loader:
            outputs = model(batch_X)
            predicted = outputs.argmax(dim=1)  # Predicted class = index of the largest logit

            accuracy_metric.update(predicted, batch_y)  # Accumulate counts across batches

    return accuracy_metric.compute().item()

# Evaluate on training, testing, and validation datasets
train_accuracy = evaluate_model(train_loader)
test_accuracy = evaluate_model(test_loader)
val_accuracy = evaluate_model(val_loader)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Testing Accuracy: {test_accuracy:.4f}")
print(f"Validation Accuracy: {val_accuracy:.4f}")

Training Accuracy: 0.9996
Testing Accuracy: 0.9276
Validation Accuracy: 1.0000
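
Accuracy is the only metric reported above. When one class is much rarer than the other, accuracy can stay high even if many events of the rare class are missed, so it may be worth also checking precision and recall for the suspicious class. The following is a minimal sketch, assuming predictions and labels are available as plain lists of 0/1. The function name `precision_recall` and the sample values are hypothetical and are not results from this model.

```python
def precision_recall(predicted, actual, positive=1):
    """Precision and recall for the class treated as positive."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical predictions and labels for eight events
predicted = [0, 0, 1, 0, 1, 0, 0, 0]
actual = [0, 0, 1, 1, 0, 0, 0, 0]

accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
precision, recall = precision_recall(predicted, actual)
print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}")
```

In this toy example the accuracy is noticeably higher than the recall for the suspicious class, which is the kind of gap a single accuracy number would not reveal.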
