![cyber_photo](cyber_photo.jpg)

Cyber threats are a growing concern for organizations worldwide. These threats take many forms, including malware, phishing, and denial-of-service (DoS) attacks, and they can compromise sensitive information and disrupt operations. The increasing sophistication and frequency of these attacks make it imperative for organizations to adopt advanced security measures. Traditional threat detection methods often fall short because they cannot adapt to new and evolving threats. This is where deep learning models come into play.

Deep learning models can analyze vast amounts of data and identify patterns that may not be immediately obvious to human analysts. By leveraging these models, organizations can proactively detect and mitigate cyber threats, safeguarding their sensitive information and ensuring operational continuity.

As a cybersecurity analyst, you identify and mitigate these threats. In this project, you will design and implement a deep learning model to detect cyber threats. The BETH dataset simulates real-world logs, providing a rich source of information for training and testing your model. The data has already been preprocessed and includes a target label, `sus_label`, indicating whether an event is malicious (1) or benign (0).

By successfully developing this model, you will contribute to enhancing cybersecurity measures and protecting organizations from potentially devastating cyber attacks.


### The Data

| Column     | Description              |
|------------|--------------------------|
|`processId`|The unique identifier for the process that generated the event - int64 |
|`threadId`|ID for the thread spawning the log - int64|
|`parentProcessId`|Label for the process spawning this log - int64|
|`userId`|ID of the user spawning the log - int64|
|`mountNamespace`|Mounting restrictions the process log works within - int64|
|`argsNum`|Number of arguments passed to the event - int64|
|`returnValue`|Value returned from the event log (usually 0) - int64|
|`sus_label`|Binary label for a suspicious event (1 is suspicious, 0 is not) - int64|

More information on the dataset: [BETH dataset](accreditation.md)

In [None]:
# Import required libraries
import pandas as pd
import sklearn
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.nn.functional as functional
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim
from torchmetrics import Accuracy
from torch.optim.lr_scheduler import ReduceLROnPlateau
# from sklearn.metrics import accuracy_score  # uncomment to use sklearn

In [3]:
# Load preprocessed data
train_df = pd.read_csv('labelled_train.csv')
test_df = pd.read_csv('labelled_test.csv')
val_df = pd.read_csv('labelled_validation.csv')

# View the first 5 rows of training set
train_df.head()

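|   | processId | threadId | parentProcessId | userId | mountNamespace | argsNum | returnValue | sus_label |
|---|-----------|----------|-----------------|--------|----------------|---------|-------------|-----------|
| 0 | 381 | 7337 | 1 | 100 | 4026532231 | 5 | 0 | 1 |
| 1 | 381 | 7337 | 1 | 100 | 4026532231 | 1 | 0 | 1 |
| 2 | 381 | 7337 | 1 | 100 | 4026532231 | 0 | 0 | 1 |
| 3 | 7347 | 7347 | 7341 | 0 | 4026531840 | 2 | -2 | 1 |
| 4 | 7347 | 7347 | 7341 | 0 | 4026531840 | 4 | 0 | 1 |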


In [4]:
# Checking the shape of the training data
print( f' Number of rows and features for the training dataset: {train_df.shape}' )
print( f' Number of rows and features for the test dataset: {test_df.shape}' )
print( f' Number of rows and features for the validation dataset: {val_df.shape}')

 Number of rows and features for the training dataset: (763144, 8)
 Number of rows and features for the test dataset: (188967, 8)
 Number of rows and features for the validation dataset: (188967, 8)


In [5]:
print(train_df.isnull().sum())

processId          0
threadId           0
parentProcessId    0
userId             0
mountNamespace     0
argsNum            0
returnValue        0
sus_label          0
dtype: int64


In [6]:
print(test_df.isnull().sum())

processId          0
threadId           0
parentProcessId    0
userId             0
mountNamespace     0
argsNum            0
returnValue        0
sus_label          0
dtype: int64


In [7]:
print(val_df.isnull().sum())

processId          0
threadId           0
parentProcessId    0
userId             0
mountNamespace     0
argsNum            0
returnValue        0
sus_label          0
dtype: int64


In [8]:
y_train = train_df['sus_label']


In [9]:
X_train = train_df.drop(['sus_label'], axis= 1)

In [10]:
X_train.head(5)

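|   | processId | threadId | parentProcessId | userId | mountNamespace | argsNum | returnValue |
|---|-----------|----------|-----------------|--------|----------------|---------|-------------|
| 0 | 381 | 7337 | 1 | 100 | 4026532231 | 5 | 0 |
| 1 | 381 | 7337 | 1 | 100 | 4026532231 | 1 | 0 |
| 2 | 381 | 7337 | 1 | 100 | 4026532231 | 0 | 0 |
| 3 | 7347 | 7347 | 7341 | 0 | 4026531840 | 2 | -2 |
| 4 | 7347 | 7347 | 7341 | 0 | 4026531840 | 4 | 0 |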


In [11]:
X_train = X_train.to_numpy()

In [12]:
scaler = StandardScaler()

In [13]:
X_train_scaled = scaler.fit_transform(X_train)

In [14]:
y_train = y_train.to_numpy()

In [15]:
y_train.shape

(763144,)

In [16]:
y_train = torch.tensor(y_train)

In [17]:
y_val = val_df['sus_label']

In [18]:
X_val = val_df.drop(['sus_label'], axis= 1)

In [19]:
X_val = X_val.to_numpy()

In [20]:
X_val_scaled = scaler.transform(X_val)
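
The scaler is fitted on the training features only and then reused, unchanged, for the validation and test sets. As a rough illustration of what that means, the sketch below reimplements per-column standardization, `z = (x - mean) / std`, in plain Python. It assumes the population standard deviation, which is what `StandardScaler` uses by default. The function names and toy values are hypothetical and are not part of the notebook.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Return the mean and population standard deviation of one feature column."""
    mean = fmean(column)
    std = pstdev(column)
    # Guard against constant columns to avoid dividing by zero.
    return mean, (std if std > 0 else 1.0)

def apply_standardizer(column, mean, std):
    """Standardize a column with statistics learned elsewhere (e.g. on the training set)."""
    return [(x - mean) / std for x in column]

# Toy values standing in for one feature (hypothetical, not taken from the dataset).
train_col = [0.0, 2.0, 4.0, 6.0]
val_col = [3.0, 9.0]

# The statistics come from the training data only.
mean, std = fit_standardizer(train_col)
train_scaled = apply_standardizer(train_col, mean, std)
# The validation data reuses the training statistics instead of refitting.
val_scaled = apply_standardizer(val_col, mean, std)
```

Reusing the training statistics keeps information about the validation and test distributions out of the preprocessing step.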

In [21]:
y_val = y_val.to_numpy()

In [22]:
y_val = torch.tensor(y_val)

In [23]:
y_test = test_df['sus_label']

In [24]:
y_test = y_test.to_numpy()

In [25]:
y_test = torch.tensor(y_test)

In [26]:
y_test.shape

torch.Size([188967])

In [27]:
y_val.shape

torch.Size([188967])

In [28]:
X_test = test_df.drop(['sus_label'], axis= 1)

In [29]:
X_test = X_test.to_numpy()

In [30]:
X_test_scaled = scaler.transform(X_test)

In [31]:
X_test_scaled.shape

(188967, 7)

In [32]:
X_test_scaled[0]

array([-3.30076588, -3.3237183 , -0.84909243,  2.63843038,  1.79005541,
        0.24455003, -0.00791259])

In [33]:
y_test[0]

tensor(0)

In [31]:
train_dataset = TensorDataset(torch.tensor(X_train_scaled).float(),y_train)

In [32]:
y_train.shape

torch.Size([763144])

In [33]:
input_sample, label_sample = train_dataset[0]
print('input sample', input_sample.shape)
print('label_sample', label_sample.shape)

input sample torch.Size([7])
label_sample torch.Size([])


In [34]:
val_dataset = TensorDataset(torch.tensor(X_val_scaled).float(), y_val)

In [35]:
test_dataset = TensorDataset(torch.tensor(X_test_scaled).float(),y_test)

In [36]:
batch_size = 64
shuffle = True

trainloader = DataLoader(train_dataset, batch_size=batch_size, shuffle= shuffle)

In [37]:
# Shuffling is not needed for evaluation, so keep the validation order fixed
validationloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

In [38]:
# Shuffling is not needed for evaluation, so keep the test order fixed
testloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [39]:
X_train.shape[1]

7

In [40]:
class ThreatDetectionModel(nn.Module):
    def __init__(self):
        super(ThreatDetectionModel, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(7, 128),  # First fully connected layer
            nn.ReLU(),  # ReLU activation
            nn.Linear(128, 64),  # Second fully connected layer
            nn.ReLU(),  # ReLU activation
            nn.Linear(64, 1),  # Third fully connected layer
            nn.Sigmoid()  # Sigmoid activation for binary classification
        )

    def forward(self, x):
        return self.model(x)

In [41]:
model = ThreatDetectionModel()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model.to(device)  # Move model to GPU (if available)

Using device: cpu


ThreatDetectionModel(
  (model): Sequential(
    (0): Linear(in_features=7, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=64, bias=True)
    (3): ReLU()
    (4): Linear(in_features=64, out_features=1, bias=True)
    (5): Sigmoid()
  )
)

In [42]:
total = 0
for parameter in model.parameters():
    total += parameter.numel()
print(total)

9345
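
The total of 9345 can be checked by hand: each `nn.Linear(in, out)` layer holds `in * out` weights plus `out` biases, and the activation layers have no parameters. A small sketch of that arithmetic, using the layer sizes from the model above (the helper name is hypothetical):

```python
def linear_param_count(in_features, out_features):
    """Weights (in * out) plus one bias per output unit."""
    return in_features * out_features + out_features

# Sizes of the three fully connected layers: 7 -> 128 -> 64 -> 1.
layer_sizes = [(7, 128), (128, 64), (64, 1)]
per_layer = [linear_param_count(i, o) for i, o in layer_sizes]
total_params = sum(per_layer)

print(per_layer)      # [1024, 8256, 65]
print(total_params)   # 9345
```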


In [43]:
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)
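
With its default settings, `nn.BCELoss` averages the binary cross-entropy `-(y * log(p) + (1 - y) * log(1 - p))` over the batch, where `p` is the sigmoid output and `y` is the label. The plain-Python sketch below works through that formula on made-up probabilities to show how strongly a confident wrong prediction is penalized. The function name and the numbers are illustrative only.

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy for predicted probabilities strictly between 0 and 1."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)

# Confident and correct predictions give a small loss.
good = binary_cross_entropy([0.99, 0.01], [1, 0])
# A confident but wrong prediction on the first event gives a much larger loss.
bad = binary_cross_entropy([0.01, 0.01], [1, 0])

print(f"good: {good:.4f}, bad: {bad:.4f}")
```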


In [44]:
import torchmetrics

metric = torchmetrics.Accuracy(task= "binary", num_classes=2)

In [None]:
# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4, momentum=0.9)



# Train the model
train_loss = []
for epoch in range(10):
    training_loss = 0.0  # Reset loss at the start of each epoch
    model.train()  # Set the model to training mode

    for data in trainloader:
        # Get feature and target from the data loader and move them to the model's device
        feature, target = data
        feature = feature.to(device)
        target = target.unsqueeze(1).float().to(device)
        
        # Reset gradients
        optimizer.zero_grad()
        
        # Run a forward pass
        pred = model(feature)
        
        # Compute loss
        loss = criterion(pred, target)
        
        # Backpropagation
        loss.backward()
        
        # Update the parameters
        optimizer.step()
        
        # Accumulate training loss
        training_loss += loss.item()
        
        # Update metric with predictions and targets (on the CPU, where the metric lives)
        metric.update(pred.detach().cpu(), target.cpu())

    # Compute average loss for the epoch
    epoch_loss = training_loss / len(trainloader)
    train_loss.append(epoch_loss)
    
    # Update the scheduler based on the validation loss (or training loss)
    # scheduler.step(epoch_loss)  # You can change this to validation_loss if needed
    
    # Calculate accuracy
    accuracy = metric.compute()
    print(f"Epoch {epoch+1}, Training Loss: {epoch_loss:.4f}, Accuracy: {accuracy:.4f}")
    
    # Reset metric for the next epoch
    metric.reset()

Epoch 1, Training Loss: 0.0120, Accuracy: 0.9995
Epoch 2, Training Loss: 0.0033, Accuracy: 0.9996
Epoch 3, Training Loss: 0.0031, Accuracy: 0.9996
Epoch 4, Training Loss: 0.0030, Accuracy: 0.9996
Epoch 5, Training Loss: 0.0029, Accuracy: 0.9996
Epoch 6, Training Loss: 0.0029, Accuracy: 0.9996
Epoch 7, Training Loss: 0.0028, Accuracy: 0.9996
Epoch 8, Training Loss: 0.0028, Accuracy: 0.9996
Epoch 9, Training Loss: 0.0027, Accuracy: 0.9996
Epoch 10, Training Loss: 0.0027, Accuracy: 0.9996


In [46]:
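val_loss = []
model.eval()  # Set the model to evaluation mode
# Note: the weights are not updated in this loop, so every pass over the
# validation set yields essentially the same loss and accuracy.
for epoch in range(10):
    validation_loss = 0.0  # Reset loss for each epoch
    
    with torch.no_grad():  # Disable gradient computation
        for data in validationloader:
            # Get feature and target from the data loader and move them to the model's device
            feature, target = data
            feature = feature.to(device)
            target = target.unsqueeze(1).float().to(device)

            # Run a forward pass
            pred = model(feature)

            # Compute loss
            loss = criterion(pred, target)
            validation_loss += loss.item()

            # Update metric with predictions and targets (on the CPU, where the metric lives)
            metric.update(pred.cpu(), target.cpu())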

    # Compute average loss for the epoch
    epoch_loss = validation_loss / len(validationloader)
    val_loss.append(epoch_loss)
    accuracy = metric.compute()

    print(f"Epoch {epoch+1}, Validation Loss: {epoch_loss:.4f}, Accuracy: {accuracy:.4f}")

    # Reset metric after each epoch
    metric.reset()

Epoch 1, Validation Loss: 0.0005, Accuracy: 1.0000
Epoch 2, Validation Loss: 0.0005, Accuracy: 1.0000
Epoch 3, Validation Loss: 0.0005, Accuracy: 1.0000
Epoch 4, Validation Loss: 0.0005, Accuracy: 1.0000
Epoch 5, Validation Loss: 0.0005, Accuracy: 1.0000
Epoch 6, Validation Loss: 0.0005, Accuracy: 1.0000
Epoch 7, Validation Loss: 0.0005, Accuracy: 1.0000
Epoch 8, Validation Loss: 0.0005, Accuracy: 1.0000
Epoch 9, Validation Loss: 0.0005, Accuracy: 1.0000
Epoch 10, Validation Loss: 0.0005, Accuracy: 1.0000


In [48]:
model.eval()  # Keep the model in evaluation mode
test_loss = 0.0  # Accumulated loss over the test set

with torch.no_grad():  # Disable gradient computation
    for test_data in testloader:
        # Get feature and target from the data loader and move them to the model's device
        feature, target = test_data
        feature = feature.to(device)
        target = target.unsqueeze(1).float().to(device)
        # Run a forward pass
        pred = model(feature)

        # Compute loss
        loss = criterion(pred, target)
        test_loss += loss.item()

        # Update metric with predictions and targets (on the CPU, where the metric lives)
        metric.update(pred.cpu(), target.cpu())


accuracy = metric.compute()

print(f'Accuracy: {accuracy:.4f}')  

metric.reset()

Accuracy: 0.9448
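
Accuracy alone can hide how a detector behaves on the suspicious class, especially if the classes are imbalanced. As a possible complement, the sketch below computes precision, recall, and F1 from sigmoid outputs thresholded at 0.5. It is plain Python with made-up values; the function name and the example data are assumptions and are not part of the notebook.

```python
def precision_recall_f1(probs, labels, threshold=0.5):
    """Compute precision, recall, and F1 for the positive (suspicious) class."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    # Fall back to 0.0 when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical model outputs and true labels for six events.
probs = [0.9, 0.8, 0.3, 0.6, 0.1, 0.2]
labels = [1, 1, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(probs, labels)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
```

In the notebook itself, the same quantities could instead be tracked with `torchmetrics` classes such as `BinaryPrecision`, `BinaryRecall`, and `BinaryF1Score`, updated inside the test loop alongside the accuracy metric.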


In [None]:
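# Save the model weights (create the target directory if it does not exist)
import os
os.makedirs('models', exist_ok=True)
torch.save(model.state_dict(), 'models/model.pth')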