**Revised on 3/5/2024: Changed source files**

This is the skeleton code for Task 1 of the midterm project. The files downloaded in Step 4 are based on the [Ember 2018 dataset](https://arxiv.org/abs/1804.04637) and contain the features (and corresponding labels) extracted from 1 million PE files, split into 80% training and 20% test datasets. The code used for feature extraction is available [here](https://colab.research.google.com/drive/16q9bOlCmnTquPtVXVzxUj4ZY1ORp10R2?usp=sharing). However, preprocessing and featurization may take up to 3 hours on Google Colab, so I recommend using the processed datasets (Step 4) to speed up your development.

Also note the new optional Step 8.5: to speed up your experiments, you may want to sample the original dataset of 800k training samples and 200k test samples down to smaller datasets.

**Step 1:** Mount your Google Drive by clicking on "Mount Drive" in the Files section (the panel to the left of this text).

**Step 2:** Go to Runtime -> Change runtime type and select T4 GPU.

**Step 3:** Create a folder in your Google Drive and name it "vMalConv".

**Step 4:** Download the pre-processed training and test datasets.

In [4]:
# ~8GB
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/X_train.dat
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/X_test.dat
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/y_train.dat
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/y_test.dat
!wget https://dsci6015s24-midterm.s3.amazonaws.com/v2/metadata.csv

--2024-03-13 17:55:04--  https://dsci6015s24-midterm.s3.amazonaws.com/v2/X_train.dat
Resolving dsci6015s24-midterm.s3.amazonaws.com (dsci6015s24-midterm.s3.amazonaws.com)... 52.217.174.1, 54.231.131.185, 3.5.29.37, ...
Connecting to dsci6015s24-midterm.s3.amazonaws.com (dsci6015s24-midterm.s3.amazonaws.com)|52.217.174.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7619200000 (7.1G) [binary/octet-stream]
Saving to: ‘X_train.dat.1’


2024-03-13 17:59:04 (30.2 MB/s) - ‘X_train.dat.1’ saved [7619200000/7619200000]

--2024-03-13 17:59:05--  https://dsci6015s24-midterm.s3.amazonaws.com/v2/X_test.dat
Resolving dsci6015s24-midterm.s3.amazonaws.com (dsci6015s24-midterm.s3.amazonaws.com)... 3.5.21.107, 16.182.67.209, 54.231.227.137, ...
Connecting to dsci6015s24-midterm.s3.amazonaws.com (dsci6015s24-midterm.s3.amazonaws.com)|3.5.21.107|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1904800000 (1.8G) [binary/octet-stream]
Saving to: ‘X_test.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Step 5:** Copy the downloaded files to vMalConv

In [2]:
!cp /content/X_train.dat /content/drive/MyDrive/vMalConv/X_train.dat
!cp /content/X_test.dat /content/drive/MyDrive/vMalConv/X_test.dat
!cp /content/y_train.dat /content/drive/MyDrive/vMalConv/y_train.dat
!cp /content/y_test.dat /content/drive/MyDrive/vMalConv/y_test.dat
!cp /content/metadata.csv /content/drive/MyDrive/vMalConv/metadata.csv

cp: cannot stat '/content/X_train.dat': No such file or directory
cp: cannot stat '/content/X_test.dat': No such file or directory
cp: cannot stat '/content/y_train.dat': No such file or directory
cp: cannot stat '/content/y_test.dat': No such file or directory
cp: cannot stat '/content/metadata.csv': No such file or directory


**Step 6:** Download and install Ember:

In [3]:
!pip install git+https://github.com/PFGimenez/ember.git

Collecting git+https://github.com/PFGimenez/ember.git
  Cloning https://github.com/PFGimenez/ember.git to /tmp/pip-req-build-g5u99ku4
  Running command git clone --filter=blob:none --quiet https://github.com/PFGimenez/ember.git /tmp/pip-req-build-g5u99ku4
  Resolved https://github.com/PFGimenez/ember.git to commit 3b82fe63069884882e743af725d29cc2a67859f1
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ember
  Building wheel for ember (setup.py) ... done
  Created wheel for ember: filename=ember-0.1.0-py3-none-any.whl size=13050 sha256=a936bac1ae38c0df27a67af905709e1c57b1f1936989e3908e02a9b5743f961c
  Stored in directory: /tmp/pip-ephem-wheel-cache-xhy0s2vn/wheels/8f/69/f9/1917c8df03b25fe53e8e2f6cb2c9f61a43dec179b19b10ab9f
Successfully built ember
Installing collected packages: ember
Successfully installed ember-0.1.0


In [4]:
!pip install lief

Collecting lief
  Downloading lief-0.14.1-cp310-cp310-manylinux_2_28_x86_64.manylinux_2_27_x86_64.whl (2.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.7/2.7 MB 12.1 MB/s eta 0:00:00
Installing collected packages: lief
Successfully installed lief-0.14.1


**Step 7:** Read vectorized features from the data files.

In [5]:
import ember
X_train, y_train, X_test, y_test = ember.read_vectorized_features("drive/MyDrive/vMalConv/")
metadata_dataframe = ember.read_metadata("drive/MyDrive/vMalConv/")



**Step 8:** Get rid of rows with no labels.

In [6]:
labelrows = (y_train != -1)
X_train = X_train[labelrows]
y_train = y_train[labelrows]

In [7]:
import h5py
h5f = h5py.File('X_train.h5', 'w')
h5f.create_dataset('X_train', data=X_train)
h5f.close()
h5f = h5py.File('y_train.h5', 'w')
h5f.create_dataset('y_train', data=y_train)
h5f.close()

In [8]:
!cp /content/X_train.h5 /content/drive/MyDrive/vMalConv/X_train.h5
!cp /content/y_train.h5 /content/drive/MyDrive/vMalConv/y_train.h5

**Optional Step 8.5:** To speed up your experiments, you may want to sample the original dataset of 800k training samples and 200k test samples down to smaller datasets. You can use the [pandas DataFrame sample() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html), or come up with your own sampling methodology. Be mindful of the fact that the dataset is heavily imbalanced.
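One possible way to sample without distorting the label proportions is to draw the same fraction from each class. The sketch below assumes NumPy arrays shaped like `X_train` and `y_train`; the helper name `stratified_subsample` and the toy arrays are illustrative and not part of the assignment code.

```python
import numpy as np

def stratified_subsample(X, y, fraction, seed=0):
    """Return a random subset of (X, y) that keeps each label's share unchanged."""
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)                  # rows carrying this label
        n_keep = max(1, int(round(len(idx) * fraction)))  # same fraction per class
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))  # sorted indices keep the original row order
    return X[keep], y[keep]

# Toy stand-in for the real arrays: 1,000 rows, 10 features, 30% positives.
X_demo = np.arange(10_000, dtype=np.float32).reshape(1_000, 10)
y_demo = np.array([1] * 300 + [0] * 700)

X_small, y_small = stratified_subsample(X_demo, y_demo, fraction=0.1)
print(X_small.shape, y_small.mean())
```

The same call should work on the real arrays after the unlabeled rows have been dropped, for example `stratified_subsample(X_train, y_train, fraction=0.1)`.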

> **Task 1:** Complete the following code to build the architecture of MalConv in PyTorch:

In [9]:
import torch
import torch.nn as nn

class MalConv(nn.Module):
    def __init__(self, input_length=2000000, embedding_dim=8, window_size=8, output_dim=1):
        super(MalConv, self).__init__()
        self.embedding_dim = embedding_dim
        self.window_size = window_size
        self.input_length = input_length
        self.flatten = nn.Flatten()

        self.embed = nn.Embedding(256, embedding_dim)  # 256 unique bytes, embedding dimension
        self.conv1 = nn.Conv1d(in_channels=embedding_dim, out_channels=128, kernel_size=window_size, stride=window_size, bias=True)
        self.conv2 = nn.Conv1d(in_channels=128, out_channels=128, kernel_size=window_size, stride=window_size, bias=True)

        # Calculate the output size after the convolutional layers
        conv_output_length = self.calculate_conv_output_length()

        self.fc1 = nn.Linear(conv_output_length, 128)
        self.fc2 = nn.Linear(128, output_dim)
        self.sigmoid = nn.Sigmoid()

    def calculate_conv_output_length(self):
        # Calculate the output size after the convolutional layers
        # Formula: out_length = (in_length - kernel_size) / stride + 1
        conv1_output_length = (self.input_length - self.window_size) // self.window_size + 1
        conv2_output_length = (conv1_output_length - self.window_size) // self.window_size + 1
        conv_output_length = 128 * conv2_output_length
        return conv_output_length

    def forward(self, x):
        x = self.embed(x)
        x = x.transpose(1, 2)  # Conv1d expects (batch_size, channels, length)
        x = self.conv1(x)
        x = torch.relu(x)
        x = self.conv2(x)
        x = torch.relu(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x

input_length = 2000000   # The fixed length for each input file
model = MalConv(input_length=input_length)
print(model)

# Example input (a batch of byte sequences, padded or truncated to the fixed length)
example_input = torch.randint(0, 256, (4, input_length), dtype=torch.long)  # 4 examples, random data
output = model(example_input)
print(output)

MalConv(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (embed): Embedding(256, 8)
  (conv1): Conv1d(8, 128, kernel_size=(8,), stride=(8,))
  (conv2): Conv1d(128, 128, kernel_size=(8,), stride=(8,))
  (fc1): Linear(in_features=4000000, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
tensor([[0.5086],
        [0.5026],
        [0.5000],
        [0.5262]], grad_fn=<SigmoidBackward0>)
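The `in_features` value of `fc1` printed above follows directly from the formula used in `calculate_conv_output_length`. The short check below reproduces that arithmetic outside the model; the function names are illustrative only, and the formula assumes no padding and no dilation, as in the two `Conv1d` layers defined above.

```python
def conv1d_out_length(in_length, kernel_size, stride):
    """Output length of an unpadded, undilated 1-D convolution."""
    return (in_length - kernel_size) // stride + 1

def malconv_fc1_in_features(input_length, window_size=8, channels=128):
    """Flattened size after two stride-`window_size` convolutions with `channels` filters."""
    after_conv1 = conv1d_out_length(input_length, window_size, window_size)
    after_conv2 = conv1d_out_length(after_conv1, window_size, window_size)
    return channels * after_conv2

print(malconv_fc1_in_features(2_000_000))  # 4000000, as in the printout above
print(malconv_fc1_in_features(2381))       # 4736, the size for 2381-feature inputs
```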


**Step 9:** Partially fit the `StandardScaler` in chunks to avoid running out of memory:

In [10]:
from sklearn.preprocessing import StandardScaler
mms = StandardScaler()
# Fit on the first 100,000 rows only, 1,000 rows at a time
for x in range(0,100000,1000):
  mms.partial_fit(X_train[x:x+1000])

In [12]:
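import numpy as np
X_train = mms.transform(X_train)
# Map the scaled features to byte values (0-255) for the embedding layer.
# Note: StandardScaler output is not bounded to [-1, 1], so values outside
# that range do not fit in uint8 and are not preserved by this cast.
X_train = np.array((X_train+1)*127.5, dtype=np.uint8)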

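The cast above assumes the scaled values lie in [-1, 1], which standardization does not guarantee. A hedged alternative is to clip to a fixed range before casting. This is a sketch only: the helper `to_byte_range` and its default bounds are assumptions, it rounds rather than truncates, and it is not what the run in this notebook did, so using it would change the reported results.

```python
import numpy as np

def to_byte_range(scaled, low=-1.0, high=1.0):
    """Map values in [low, high] to integers 0..255, clipping anything outside first."""
    clipped = np.clip(scaled, low, high)
    return np.round((clipped - low) / (high - low) * 255).astype(np.uint8)

# Values below -1 and above 1 saturate at 0 and 255 instead of wrapping around.
demo = np.array([-3.0, -1.0, 0.5, 1.0, 4.2])
print(to_byte_range(demo))
```

Wider bounds (for example `low=-3.0, high=3.0`) would keep more of the tails at the cost of coarser resolution near the mean.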


In [13]:
## Ensure 2-D shapes: (n_samples, 2381) features and (n_samples, 1) labels ##
import numpy as np
X_train = np.reshape(X_train,(-1,2381))
y_train = np.reshape(y_train,(-1,1))

**Load, Tensorize, and Split** The following code takes care of converting the training data into Torch Tensors, and then splits it into 80% training and 20% validation datasets.

In [14]:
import numpy as np
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split

# Assuming MalConv class definition is already provided as above

# Convert your numpy arrays to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.long)
y_train = torch.tensor(y_train, dtype=torch.float32)

# Split the data into training and validation sets (80% training, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

# Create TensorDatasets and DataLoaders for training and validation sets
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)

batch_size = 2048  # Task 2 asks for 64; this run used 2048. Adjust based on your GPU memory
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

> **Task 2:** Complete the following code to train the model on the GPU for 15 epochs, with a batch size of 64. If you are on Google Colab, don't forget to change the kernel in Runtime -> Change runtime type -> T4 GPU.

In [15]:
# Initialize the MalConv model
model = MalConv(input_length=2381)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

MalConv(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (embed): Embedding(256, 8)
  (conv1): Conv1d(8, 128, kernel_size=(8,), stride=(8,))
  (conv2): Conv1d(128, 128, kernel_size=(8,), stride=(8,))
  (fc1): Linear(in_features=4736, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [17]:
import os

# Loss function and optimizer
criterion = nn.BCELoss()  # Binary Cross-Entropy Loss for binary classification
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adjust learning rate as needed

# Directory to save model checkpoints
save_dir = "drive/MyDrive/vMalConv/"

# Training Loop with Validation
num_epochs = 10  # Task 2 asks for 15; this run used 10. Adjust as needed

for epoch in range(num_epochs):
    model.train()  # Set model to training mode
    running_loss = 0.0

    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()  # Zero the gradients

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f'Epoch {epoch+1}, Training Loss: {running_loss/len(train_loader)}')

    # Validation step
    model.eval()  # Set model to evaluation mode
    val_loss = 0.0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            val_loss += loss.item()
    print(f'Validation Loss: {val_loss/len(val_loader)}')

    # Save checkpoint every 5 epochs
    if (epoch + 1) % 5 == 0:
        checkpoint_path = os.path.join(save_dir, f'model_epoch_{epoch+1}.pt')
        torch.save(model.state_dict(), checkpoint_path)
        print(f'Model checkpoint saved to {checkpoint_path}')


Epoch 1, Training Loss: 0.3857630789279938
Validation Loss: 0.25979248345908473
Epoch 2, Training Loss: 0.2401955082061443
Validation Loss: 0.21811937351348037
Epoch 3, Training Loss: 0.20602467586385442
Validation Loss: 0.20991830224707975
Epoch 4, Training Loss: 0.1923080824157025
Validation Loss: 0.22223901521351377
Epoch 5, Training Loss: 0.18205424141376578
Validation Loss: 0.20164139493037078
Model checkpoint saved to drive/MyDrive/vMalConv/model_epoch_5.pt
Epoch 6, Training Loss: 0.1723962587879059
Validation Loss: 0.18021204739303912
Epoch 7, Training Loss: 0.1617850717077864
Validation Loss: 0.17362682950698724
Epoch 8, Training Loss: 0.15728249372319972
Validation Loss: 0.17223025125972294
Epoch 9, Training Loss: 0.15235507798955797
Validation Loss: 0.1694031186023001
Epoch 10, Training Loss: 0.1471582829318148
Validation Loss: 0.188477234315064
Model checkpoint saved to drive/MyDrive/vMalConv/model_epoch_10.pt
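The validation loss in the log above does not decrease monotonically, so the last checkpoint is not necessarily the best one. As a small illustration, the snippet below copies the logged validation losses and finds the epoch with the lowest value; `best_epoch` is a hypothetical helper, not part of the skeleton code.

```python
# Validation losses copied from the training log above (epochs 1 to 10).
val_losses = [
    0.25979248345908473, 0.21811937351348037, 0.20991830224707975,
    0.22223901521351377, 0.20164139493037078, 0.18021204739303912,
    0.17362682950698724, 0.17223025125972294, 0.1694031186023001,
    0.188477234315064,
]

def best_epoch(losses):
    """Return (1-based epoch, loss) for the lowest validation loss."""
    index = min(range(len(losses)), key=losses.__getitem__)
    return index + 1, losses[index]

epoch, loss = best_epoch(val_losses)
print(f"Lowest validation loss {loss:.4f} at epoch {epoch}")
```

One way to act on this would be to save a checkpoint whenever the validation loss improves, rather than only every 5 epochs.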


> **Task 3:** Complete the following code to evaluate your trained model on the test data.

In [24]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

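# Scale the first 50,000 test rows with the scaler fitted on the training data,
# then map them to byte values the same way as the training set
X_test = mms.transform(X_test[:50000])
y_test = y_test[:50000]
X_test = np.array((X_test+1)*127.5, dtype=np.uint8)

# Convert test data to PyTorch tensors
X_test = torch.tensor(X_test, dtype=torch.long)
y_test = torch.tensor(y_test, dtype=torch.float32)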

# Create a TensorDataset and DataLoader for test data
test_dataset = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

model.eval()




MalConv(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (embed): Embedding(256, 8)
  (conv1): Conv1d(8, 128, kernel_size=(8,), stride=(8,))
  (conv2): Conv1d(128, 128, kernel_size=(8,), stride=(8,))
  (fc1): Linear(in_features=4736, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [29]:
predictions = []
labels = []

with torch.no_grad():
    for inputs, labels_batch in test_loader:

        inputs, labels_batch = inputs.to(device), labels_batch.to(device)
        outputs = model(inputs)
        predicted = (outputs > 0.5).float()
        # Store predictions and labels
        predictions.extend(predicted.cpu().numpy())
        labels.extend(labels_batch.cpu().numpy())

# Compute metrics
accuracy = accuracy_score(labels, predictions)
precision = precision_score(labels, predictions)
recall = recall_score(labels, predictions)

print(f'Test Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')

Test Accuracy: 0.9133
Precision: 0.8810
Recall: 0.9562
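Precision and recall can be combined into a single F1 score, and their complements give the share of positive predictions that are wrong and the share of positives that are missed. The sketch below only reuses the rounded values printed above; `f1_from` is an illustrative helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.8810, 0.9562  # rounded values from the output above

f1 = f1_from(precision, recall)
false_discovery_rate = 1 - precision  # share of positive predictions that are wrong
miss_rate = 1 - recall                # share of actual positives that are missed

print(f"F1: {f1:.4f}")
print(f"False discovery rate: {false_discovery_rate:.4f}")
print(f"Miss rate: {miss_rate:.4f}")
```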


The model performs well above random guessing, reaching a test accuracy of 91.33% on the 50,000 test samples evaluated. Its recall of 95.62% shows that it misses about 4% of the positive samples, while its precision of 88.10% means that roughly 12% of its positive predictions are false positives, so false positives are the main weakness. To improve the model's performance, consider the following strategies:

**Expand the Dataset:** Increase the size of the dataset to provide the model with more examples to learn from. A larger dataset can help improve the model's ability to generalize to new data.

**Adjust Hyperparameters:** Experiment with different hyperparameter settings to find the combination that yields the best performance. This could involve tuning parameters such as learning rate, batch size, and regularization strength.

**Enhance Model Design:** Consider using a more sophisticated model architecture or incorporating additional features to improve the model's ability to capture complex patterns in the data. This could involve using a deeper network or adding new input features.

**Improve Class Differentiation:** Address the imbalance in the dataset to improve class differentiation. This could involve using techniques such as oversampling, undersampling, or using class weights during training.
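As a rough illustration of the class-weight idea, a common choice is to weight the positive class by the ratio of negative to positive samples. The helper `positive_class_weight` and the toy labels below are assumptions for demonstration. In PyTorch, a value like this could be passed as `pos_weight` to `nn.BCEWithLogitsLoss`, which expects raw logits, so the final sigmoid in the model would have to be dropped for that loss.

```python
import numpy as np

def positive_class_weight(labels):
    """Ratio of negative to positive samples, usable as a positive-class weight."""
    labels = np.asarray(labels).ravel()
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples to weight")
    return n_neg / n_pos

# Toy labels: 900 negatives and 100 positives.
y_demo = np.array([0] * 900 + [1] * 100)
print(positive_class_weight(y_demo))  # 9.0
```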

By implementing these strategies, you can improve the model's recall, precision, and overall accuracy, making it more dependable in real-world situations.