**INTRODUCTION**
Cyber threats are a growing concern for organizations worldwide. These threats take many forms, including malware, phishing, and denial-of-service (DOS) attacks, compromising sensitive information and disrupting operations. The increasing sophistication and frequency of these attacks make it imperative for organizations to adopt advanced security measures. Traditional threat detection methods often fall short due to their inability to adapt to new and evolving threats. This is where deep learning models come into play.

Deep learning models can analyze vast amounts of data and identify patterns that may not be immediately obvious to human analysts. By leveraging these models, organizations can proactively detect and mitigate cyber threats, safeguarding their sensitive information and ensuring operational continuity.

As a cybersecurity analyst, you identify and mitigate these threats. In this project, you will design and implement a deep learning model to detect cyber threats. The BETH dataset simulates real-world logs, providing a rich source of information for training and testing your model. The data has already undergone preprocessing, and we have a target label, `sus_label`, indicating whether an event is malicious (1) or benign (0).

By successfully developing this model, you will contribute to enhancing cybersecurity measures and protecting organizations from potentially devastating cyber attacks.


### The Data

| Column     | Description              |
|------------|--------------------------|
|`processId`|The unique identifier for the process that generated the event - int64 |
|`threadId`|ID for the thread spawning the log - int64|
|`parentProcessId`|Label for the process spawning this log - int64|
|`userId`|ID of user spawning the log|Numerical - int64|
|`mountNamespace`|Mounting restrictions the process log works within - int64|
|`argsNum`|Number of arguments passed to the event - int64|
|`returnValue`|Value returned from the event log (usually 0) - int64|
|`sus_label`|Binary label as suspicous event (1 is suspicious, 0 is not) - int64|

More information on the dataset: [BETH dataset](accreditation.md)

This is a binary classification challenge. I will solve it by using neural networks with BCELoss loss function and evaluate by Accuracy metric by Torchmetric.

**Import libraries and datasets**

In [35]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.nn.functional as functional
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim
from torchmetrics import Accuracy
# from sklearn.metrics import accuracy_score  # uncomment to use sklearn

In [36]:
# Load preprocessed data
train_df = pd.read_csv('data/labelled_train.csv')
test_df = pd.read_csv('data/labelled_test.csv')
val_df = pd.read_csv('data/labelled_validation.csv')

# View the first 5 rows of training set
train_df.head()

Unnamed: 0,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
0,381,7337,1,100,4026532231,5,0,1
1,381,7337,1,100,4026532231,1,0,1
2,381,7337,1,100,4026532231,0,0,1
3,7347,7347,7341,0,4026531840,2,-2,1
4,7347,7347,7341,0,4026531840,4,0,1


In [37]:
print(train_df.shape)
print(test_df.shape)
print(val_df.shape)

(763144, 8)
(188967, 8)
(188967, 8)


In [38]:
#Check hơw many classes are in the dataset
print(train_df['sus_label'].value_counts())

sus_label
0    761875
1      1269
Name: count, dtype: int64


In [39]:
train_df.describe()

Unnamed: 0,processId,threadId,parentProcessId,userId,mountNamespace,argsNum,returnValue,sus_label
count,763144.0,763144.0,763144.0,763144.0,763144.0,763144.0,763144.0,763144.0
mean,6814.763308,6820.265241,1882.216609,2.279034,4026532000.0,2.672082,17.520924,0.001663
std,1948.871187,1937.068333,2215.563094,37.416576,164.903,1.340906,318.596662,0.040744
min,1.0,1.0,0.0,0.0,4026532000.0,0.0,-115.0,0.0
25%,7313.0,7313.0,187.0,0.0,4026532000.0,1.0,0.0,0.0
50%,7365.0,7365.0,1385.0,0.0,4026532000.0,3.0,0.0,0.0
75%,7415.0,7415.0,1648.0,0.0,4026532000.0,4.0,4.0,0.0
max,8619.0,8619.0,7672.0,1000.0,4026532000.0,5.0,8289.0,1.0


# Load and Scale data

1/ Separating Features and Labels: 
- Drop the sus_label column from each DataFrame to separate features and assign the sus_label column to the labels

2/ Scaling features: 
- Initialize a StandardScaler() from sklearn.preprocessing to scale the features.
- Fit the scaler on the training data and transform the training data.
- Use the fitted scaler to transform the test and validation data.

3/ Converting to PyTorch Tensors
- Convert the numpy arrays for features and labels to PyTorch tensors using torch.tensor().
- Ensure the features are converted to float32 type and the labels are reshaped to have a dimension of (-1, 1).

In [40]:
#1. Separate features and labels

# Define input features
features = train_df.drop(columns=['sus_label'])

target = train_df['sus_label']

print(target.shape)

(763144,)


In [41]:
#2. Scaling features:
scaler = StandardScaler()

# Fit the scaler to the data and transform it
scaled_features = scaler.fit_transform(features)

Why we need to reshape the value to (-1,1)?
- With the scaled features, we have 7 features with n sample => the size is [7]
- With the target, we need to transform to match after scaling features.

In [42]:
#3. Converting to Pytorch Tensors - TensorDataset

#Instantiate dataset class
dataset = TensorDataset(torch.tensor(scaled_features, dtype=torch.float32), torch.tensor(target.values, dtype=torch.float32).reshape(-1,1))

#Access an individual sample
input_sample, label_sample = dataset[0]
print('input sample:', input_sample)   
print('label sample:', label_sample)

input sample: tensor([-3.3013,  0.2668, -0.8491,  2.6117,  1.7840,  1.7361, -0.0550])
label sample: tensor([1.])


In [43]:
#4. DataLoader
batch_size = 32
suffle = True

# Create a DataLoader
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=suffle)

In [44]:
#Test iterate over the dataloader
#for batch_inputs, batch_labels in dataloader:
#    print('batch_inputs:', batch_inputs.shape)
#    print('batch_labels:', batch_labels.shape)

In [45]:
#5. Transform data and convert tensors with test data
test_features = test_df.drop(columns=['sus_label'])

test_target = test_df['sus_label']

# Fit the scaler to the data and transform it
scaled_test_features = scaler.fit_transform(test_features)

#Instantiate dataset class
test_dataset = TensorDataset(torch.tensor(scaled_test_features, dtype=torch.float32), torch.tensor(test_target.values, dtype=torch.float32).reshape(-1,1))

#DataLoader
batch_size = 32
suffle = True

# Create a DataLoader
val_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=suffle)


#2.  Define the Neural Network Model
1. Create a model
2. Choose a loss function
3. Define a dataset
4. Set an optimizer
5. Run a training loop:
- Calculate loss (forward pass)

Define a neural network model using PyTorch's nn.Sequential() to create a feed-forward neural network with specified layers and activations.

1/ Creating the Model
- Use nn.Sequential() to define the model structure.
- Include three fully connected layers nn.Linear() with appropriate input and output dimensions.
- Add ReLU activation functions nn.ReLU() between the layers.
- Use a Sigmoid activation function nn.Sigmoid() at the output layer

2/ Initializing Loss Function and Optimizer
- Use nn.CrossEntropyLoss().
- Initialize the optimizer using optim.SGD with a learning rate of 1e-3 and weight_decay 1e-4.

In [46]:
#1. Create binary classification model
model = nn.Sequential(
    nn.Linear(7, 32),
    nn.ReLU(),
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid()
)

In [47]:
print(model)

Sequential(
  (0): Linear(in_features=7, out_features=32, bias=True)
  (1): ReLU()
  (2): Linear(in_features=32, out_features=16, bias=True)
  (3): ReLU()
  (4): Linear(in_features=16, out_features=1, bias=True)
  (5): Sigmoid()
)


In [48]:
#2.4 Choose loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)

In [49]:
#Test with dummy data
dummy_X = torch.randn(4, 7)

dummy_y = torch.tensor([[0.], [1.], [0.], [1.]])

#Forward
dummy_output = model(dummy_X)
dummy_loss = criterion(dummy_output, dummy_y)
print(dummy_loss.item())

0.6729817390441895


# Train and evaluate the model

Implement a training loop to train the neural network model using the training data. Update the model parameters based on the computed loss.

1/ Setting Up the Training Loop
- Iterate through the specified number of epochs.
- Set the model to training mode using model.train().
- Clear the gradients using optimizer.zero_grad().
- Perform a forward pass to compute the model outputs.
- Compute the loss using the loss function.
- Perform a backward pass to compute the gradients.
- Update the model parameters using optimizer.step().

2/ Evaluating the Model
- Evaluate the trained model on training, testing, and validation datasets. Calculate and print the accuracy for each dataset.
- Use torchmetrics.Accuracy() to calculate the accuracy for training, testing, and validation datasets.

In [50]:
#1. Train the model
num_epochs = 10  # Number of epochs

for epoch in range(num_epochs):
    model.train()  # Training mode
    training_loss = 0.0

    for feature, target in dataloader:
        # Zero gradients
        optimizer.zero_grad()

        # Forward pass
        pred = model(feature)

        # Compute loss
        loss = criterion(pred, target)

        # Backward pass
        loss.backward()

        # Update weights
        optimizer.step()

        # Accumulate loss
        training_loss += loss.item()

    # Average loss for the epoch
    epoch_loss = training_loss / len(dataloader)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.4f}")

        
        

Epoch 1/10, Loss: 0.0638
Epoch 2/10, Loss: 0.0068
Epoch 3/10, Loss: 0.0057
Epoch 4/10, Loss: 0.0054
Epoch 5/10, Loss: 0.0052
Epoch 6/10, Loss: 0.0050
Epoch 7/10, Loss: 0.0049
Epoch 8/10, Loss: 0.0048
Epoch 9/10, Loss: 0.0047
Epoch 10/10, Loss: 0.0047


In [51]:
#2. Calculating validation loss

validation_loss = 0.0
model.eval() #Put model to evaluation mode

with torch.no_grad():
    for inputs, labels in val_loader:
        #Run the forward pass
        outputs = model(inputs)
        #Calculate the loss
        loss = criterion(outputs, labels)
        validation_loss += loss.item()

epoch_loss = validation_loss / len(val_loader)
model.train()


Sequential(
  (0): Linear(in_features=7, out_features=32, bias=True)
  (1): ReLU()
  (2): Linear(in_features=32, out_features=16, bias=True)
  (3): ReLU()
  (4): Linear(in_features=16, out_features=1, bias=True)
  (5): Sigmoid()
)

In [52]:
print(f"Validation Loss: {epoch_loss:.4f}")

Validation Loss: 6.3214


In [53]:
#3. Evaluation Accuracy with TorchMetric
from torchmetrics import Accuracy

acc = Accuracy(task='binary')

model.eval()

with torch.no_grad():
    for inputs, labels in val_loader:
        outputs = model(inputs)
        predictions = (outputs > 0.5).float()  # Convert probabilities to binary predictions
        acc.update(predictions, labels.int())

accuracy = acc.compute()
print(f"Validation Accuracy: {accuracy:.4f}")

Validation Accuracy: 0.0927
