# **CDS Project: Part 1**

*Institute of Software Security (E22)*  
*Hamburg University of Technology*  
*SoSe 2023*

## Learning objectives
---

- Use a basic Machine Learning (ML) pipeline with pre-trained models.
- Build your own data loader.
- Load and run a pre-trained ML model.
- Evaluate the performance of an ML model.
- Calculate and interpret performance metrics.

## Materials
---

- Lecture Slides 1, 2, and 3.
- PyTorch Documentation: [Datasets and Data Loaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) 


## Project Description
---

In this project, you are given an ML model that is pre-trained on a vulnerability dataset. The dataset consists of code samples labeled True or False, depending on the presence or absence of a vulnerability. Your goal is to use the pre-trained model to predict whether the code samples in the validation set contain vulnerabilities and to analyse the results. Please proceed to the tasks below.

### *Task 1*

Build a data loader for the validation dataset located at "*data_students/student_dataset.hdf5*". You will use this dataset to validate the performance of the ML model. The dataset is stored in the HDF5 binary data format, which is designed for large amounts of data. Make sure that you import and familiarise yourself with the right Python libraries to handle HDF5 files.


*Solution*

Initially, we import the torch and h5py libraries and view the structure of the dataset.

In [87]:
#Import Libraries
import numpy as np #Numerical Manipulation
import torch #Neural Network Training
import h5py #Load, Manipulation of .hdf5 files
from torch.utils.data import Dataset #Conversion into tensor format
from torch.utils.data import DataLoader #Loading the Data into the pre-trained Model
import pandas as pd #Exploratory Data analysis
from torch import nn #Task4
from sklearn.metrics import confusion_matrix #For TP, TN, FP, FN computation

#Check datasets at root
def print_structure(name, obj):
    print(name)

#Open read-only; the context manager closes the file after the traversal
with h5py.File('./data_students/student_dataset.hdf5', 'r') as f:
    f.visititems(print_structure)

labels
source
vectors


Next, we check the shape of the data for further information.

In [45]:
with h5py.File('./data_students/student_dataset.hdf5', 'r') as f:
    for name in f:
        print(f"{name} shape: {f[name].shape}")

labels shape: (1000,)
source shape: (1000,)
vectors shape: (1000, 1, 768)


We proceed by loading the data into variables.

In [37]:
with h5py.File('./data_students/student_dataset.hdf5', 'r') as f:
    vectors = np.squeeze(f["vectors"][:], axis=1)  #(1000, 1, 768) -> (1000, 768)
    labels = f["labels"][:]                        #Boolean vulnerability flags
    source = f["source"][:]                        #Source code as byte strings
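
The `np.squeeze` call above removes the singleton middle axis of the `vectors` array. As a minimal sketch on synthetic arrays (the zero-filled arrays below are stand-ins, not the course data), this shows how passing `axis` restricts the squeeze to one axis:

```python
import numpy as np

# Synthetic stand-in with the same layout as the "vectors" dataset: (samples, 1, features)
vectors_raw = np.zeros((1000, 1, 768), dtype=np.float32)

# Remove only the singleton middle axis; NumPy raises an error if that axis is not of size 1
vectors_2d = np.squeeze(vectors_raw, axis=1)
print(vectors_2d.shape)  # (1000, 768)

# Without axis, every size-1 axis is dropped, including the sample axis of a one-row array
single = np.zeros((1, 1, 768), dtype=np.float32)
print(np.squeeze(single).shape)          # (768,)
print(np.squeeze(single, axis=1).shape)  # (1, 768)
```

For the 1000-sample dataset both forms give the same result; the explicit axis only matters for edge cases such as a single-sample file.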

To use this data with PyTorch, we create a custom class that serves the data in tensor form.

In [46]:
#Custom Class to load data into Tensor format for Pytorch
class StudentDataset(Dataset):
    def __init__(self, vectors, labels, source):
        self.vectors = torch.tensor(vectors, dtype = torch.float32)
        self.labels = torch.tensor(labels, dtype = torch.float32)
        self.source = source
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return {
            'vector': self.vectors[idx],
            'label': self.labels[idx],
            'source': self.source[idx]
        }
        
#Create custom tensor dataset using StudentDataset class
dataset = StudentDataset(vectors, labels, source)
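
To illustrate how such a dataset behaves once wrapped in a `DataLoader`, here is a small sketch on synthetic data. The class `ToyVectorDataset` and all `toy_*` names are hypothetical stand-ins that mirror the structure of `StudentDataset`; the random values are not the course data.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ToyVectorDataset(Dataset):
    """Hypothetical stand-in that mirrors the structure of StudentDataset."""

    def __init__(self, vectors, labels, source):
        self.vectors = torch.tensor(vectors, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.float32)
        self.source = source

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "vector": self.vectors[idx],
            "label": self.labels[idx],
            "source": self.source[idx],
        }


# Synthetic data: 10 samples with 768 features each
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(10, 768)).astype(np.float32)
toy_labels = rng.integers(0, 2, size=10).astype(bool)
toy_source = [f"sample_{i}".encode() for i in range(10)]

toy_dataset = ToyVectorDataset(toy_vectors, toy_labels, toy_source)
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

# The default collate function stacks tensors and keeps byte strings in a list
first_batch = next(iter(toy_loader))
print(first_batch["vector"].shape)  # torch.Size([4, 768])
print(first_batch["label"].shape)   # torch.Size([4])
print(len(first_batch["source"]))   # 4
```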

### *Task 2*

Generate a table with 10 random samples from the dataset and show their corresponding labels.

*Solution*

We use pandas to build a table from the dataset object for exploratory data analysis.
Note:
- The vector embedding is truncated to its first 5 dimensions for easier viewing.
- The boolean label is shown as an integer (1 or 0).

In [65]:
#Creating a Data Dictionary for the dataset
data_dict = {
    "Text": [dataset[i]["source"] for i in range(len(dataset))],
    "Label": [int(dataset[i]["label"].item()) for i in range(len(dataset))],
    "Vector (truncated)": [dataset[i]["vector"][:5].tolist() for i in range(len(dataset))] 
}

#Constructing a Pandas dataframe
df = pd.DataFrame(data_dict)

We then display the first 10 rows. Note that `df.head(10)` returns rows in stored order rather than a random draw.

In [66]:
df.head(10)

| | Text | Label | Vector (truncated) |
|---|---|---|---|
| 0 | `b'get_charcode(VMG_ uint argc)\r\n{\r\n con...` | 0 | `[1.0354700088500977, -0.21910110116004944, -0....` |
| 1 | `b"find_open_file_info(char * id) {\n unsign...` | 0 | `[0.7009735703468323, -0.33198848366737366, -2....` |
| 2 | `b'_openipmi_read (ipmi_openipmi_ctx_t ctx,\n ...` | 1 | `[0.16170060634613037, 1.011047601699829, -0.54...` |
| 3 | `b'camel_store_get_inbox_folder_sync (CamelStor...` | 1 | `[1.2847437858581543, -0.02586905099451542, -0....` |
| 4 | `b"locate_var_of_level_walker(Node *node,\n\t\t...` | 0 | `[1.6463299989700317, 0.8318526148796082, -0.17...` |
| 5 | `b'apply(ast_sent* s) {\n if (s->get_nod...` | 0 | `[0.14187678694725037, -0.05534656345844269, -0...` |
| 6 | `b'addr_ston(const struct sockaddr *sa, struct ...` | 1 | `[1.5827864408493042, 0.11386679857969284, -1.0...` |
| 7 | `b'printStats(const RunSummary& sol, const Solv...` | 0 | `[-0.7642837762832642, 1.819999098777771, -0.90...` |
| 8 | `b'extendtimeline() {\n if (timeline.recording...` | 0 | `[-0.7471140623092651, -1.333551287651062, 0.72...` |
| 9 | `b'Document(Conf& conf, Encodings& encodings, i...` | 0 | `[1.4751206636428833, -1.2725929021835327, -0.0...` |
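
Since the task asks for random samples, one option is pandas' `DataFrame.sample`. The following is a minimal sketch on a hypothetical miniature table (`toy_df` and its contents are made up for illustration); the same call would apply to the `df` built above.

```python
import pandas as pd

# Hypothetical miniature of the table built above
toy_df = pd.DataFrame({
    "Text": [f"func_{i}()" for i in range(50)],
    "Label": [i % 2 for i in range(50)],
})

# Draw 10 distinct rows at random; random_state fixes the draw so it is repeatable
random_rows = toy_df.sample(n=10, random_state=42)
print(random_rows)
```

Omitting `random_state` yields a different selection on each run.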


### *Task 3*

Inspect the dataset and answer the following questions:
1. How many samples are in the dataset?
2. How many positive examples (vulnerability-labeled instances) are in the dataset?
3. What is the vulnerable/non-vulnerable ratio?

In [70]:
#Answer 1 - Sample Count (number of rows in the table)
sample_count = len(df)

#Answer 2 - Positive Example Count (Label - 1)
positive_examples = (df['Label'] == 1).sum()

#Answer 3 - Calculate Vulnerable/Non-Vulnerable Ratio
negative_examples = (df['Label'] == 0).sum()
vulnerability_ratio = positive_examples / negative_examples

#Consolidated Answer
print(f"Sample Count: {sample_count}")
print(f"Positive Example Count (Label - 1): {positive_examples}")
print(f"Vulnerable/Non-Vulnerable Ratio: {vulnerability_ratio}")

Sample Count: 1000
Positive Example Count (Label - 1): 283
Vulnerable/Non-Vulnerable Ratio: 0.3947001394700139
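
The ratio can be checked by hand from the two reported counts. This short sketch only redoes the arithmetic; the variable names are illustrative.

```python
total_samples = 1000                   # reported sample count
positives = 283                        # reported count of Label == 1
negatives = total_samples - positives  # 717

ratio = positives / negatives
positive_share = positives / total_samples
print(f"vulnerable : non-vulnerable = {ratio:.4f}")           # 0.3947
print(f"share of vulnerable samples = {positive_share:.1%}")  # 28.3%
```

In other words, there is roughly one vulnerable sample for every 2.5 non-vulnerable ones, so the classes are imbalanced.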


### *Task 4*

Load and run the following pre-trained neural network model called VulnPredictModel.

``` python 
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

```

``` python
from torch import nn

class VulnPredictModel(nn.Module):
    # initialize the model architecture
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_stack = nn.Sequential(
            nn.Linear(768, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    # forward propagation (defined at class level, not nested inside __init__)
    def forward(self, x):
        pred = self.linear_stack(x)
        return pred


# TODO: initialize and load the model

```

*Solution*

We begin by running the provided code.

In [83]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

class VulnPredictModel(nn.Module):
    # initialize the model architecture
    def __init__(self):
      super().__init__()
      self.flatten = nn.Flatten()
      self.linear_stack = nn.Sequential(
         nn.Linear(768, 64),
         nn.ReLU(),
         nn.Linear(64, 64),
         nn.ReLU(),
         nn.Linear(64, 1),
         nn.Sigmoid()
      )

    # forward propagation
    def forward(self, x):
      pred = self.linear_stack(x)
      return pred     


Using cpu device
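
Before loading weights, it can help to confirm the input and output shapes the architecture expects. The sketch below rebuilds the same layer stack with random weights (`toy_model` and `dummy_batch` are illustrative names, and no pre-trained weights are involved) and passes a random batch through it.

```python
import torch
from torch import nn

# Same layer sizes as the provided model; weights here are random, not the pre-trained ones
toy_model = nn.Sequential(
    nn.Linear(768, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
toy_model.eval()

# A batch of 4 random embeddings with 768 features each
dummy_batch = torch.randn(4, 768)
with torch.no_grad():
    dummy_out = toy_model(dummy_batch)

print(dummy_out.shape)  # torch.Size([4, 1])
# The final sigmoid keeps every output inside [0, 1]
print(bool(((dummy_out >= 0) & (dummy_out <= 1)).all()))  # True
```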


We create an instance of the model, load the pre-trained weights, and run inference over the dataset in batches.

In [90]:
#Instantiate the Model
model = VulnPredictModel()

#Load the Weights
try:
    model.load_state_dict(torch.load('model_2023-03-28_20-03.pth', map_location=device))
    print("Loaded pre-trained model weights.")
except FileNotFoundError:
    print("Model weights file not found. Proceeding with untrained model.")
    
#Move the model to the selected device and set it to evaluation mode
model.to(device)
model.eval()

#Running Inference on the provided data.

#Step 1 - Create DataLoader with appropriate batch_size
batch_size = 32
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

#Step 2 - Perform Inference using Model on provided data
predictions = []
labels = []
with torch.no_grad():
    for batch in data_loader:
        batch_vectors = batch['vector'].to(device)
        batch_labels = batch['label']

        #Perform forward pass (calling the module invokes forward)
        outputs = model(batch_vectors)

        #Apply threshold for Binary Classification (from Sigmoidal Output in final layer)
        #ravel() flattens the (batch, 1) output so each prediction is stored as a scalar
        prediction = (outputs > 0.5).int().cpu().numpy().ravel()
        predictions.extend(prediction)

        #Simultaneously store the actual labels as integer scalars
        labels.extend(batch_labels.int().numpy())

print("Model Inference Complete.")

Loaded pre-trained model weights.
Model Inference Complete.
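
The thresholding step in the inference loop can be looked at in isolation. This is a minimal sketch with made-up sigmoid scores (`probs` is hypothetical, not model output) showing how a 0.5 cut-off turns scores into class labels.

```python
import numpy as np

# Hypothetical sigmoid outputs for one batch, shape (batch, 1) as produced by the final layer
probs = np.array([[0.91], [0.50], [0.12], [0.73]], dtype=np.float32)

threshold = 0.5
# Strict ">" means a score of exactly 0.5 is classified as non-vulnerable
hard_preds = (probs > threshold).astype(int).ravel()
print(hard_preds)  # [1 0 0 1]
```

The threshold is a design choice: lowering it would flag more samples as vulnerable, trading precision for recall.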


### *Task 5*

Make a prediction on the provided dataset and compute the following values:
- True Positives
- True Negatives
- False Positives
- False Negatives

In [97]:
# Interpretation
# True Positive (TP)  - Model predicts vulnerable and the sample is labeled vulnerable.
# True Negative (TN)  - Model predicts non-vulnerable and the sample is labeled non-vulnerable.
# False Positive (FP) - Model predicts vulnerable but the sample is labeled non-vulnerable.
# False Negative (FN) - Model predicts non-vulnerable but the sample is labeled vulnerable.

# Compute confusion matrix
cm = confusion_matrix(labels, predictions)

# Unravel the confusion matrix (scikit-learn orders binary results as TN, FP, FN, TP)
TN, FP, FN, TP = cm.ravel()

# Present Results
print(f"True Positive Count : {TP}")
print(f"True Negative Count : {TN}")
print(f"False Positive Count : {FP}")
print(f"False Negative Count : {FN}")

True Positive Count : 20
True Negative Count : 716
False Positive Count : 1
False Negative Count : 263
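
The four counts can also be derived without scikit-learn, which makes the definitions explicit. The sketch below uses a small made-up label and prediction vector (`y_true` and `y_pred` are hypothetical, not the model's output).

```python
import numpy as np

# Hypothetical ground-truth labels and predictions (1 = vulnerable, 0 = non-vulnerable)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted vulnerable, is vulnerable
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # predicted safe, is safe
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted vulnerable, is safe
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted safe, is vulnerable
print(tp, tn, fp, fn)  # 2 3 1 2
```

The four counts always sum to the number of samples, which is a quick sanity check on any confusion matrix.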


### *Task 6*

Compute the corresponding performance metrics **manually** (do not use PyTorch's predefined metrics):
- Accuracy
- Precision
- Recall
- F1

In [98]:
# Calculate Metrics
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
specificity = TN / (TN + FP)
f1_score = 2 * (precision * recall) / (precision + recall)

# Display Metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"F1 Score: {f1_score:.4f}")

Accuracy: 0.7360
Precision: 0.9524
Recall: 0.0707
Specificity: 0.9986
F1 Score: 0.1316
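
The formulas above divide by zero when a model never predicts the positive class. As a hedged sketch, the helper below (`safe_div` and `binary_metrics` are hypothetical names) guards those cases. It assumes TP = 20, FP = 1, FN = 263 and TN = 716; the TN value is inferred from the reported accuracy of 0.7360 over 1000 samples rather than read from a printed count.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts consistent with the metrics reported above (TN = 716 is inferred, see lead-in)
metrics = binary_metrics(tp=20, tn=716, fp=1, fn=263)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the helper reproduces the reported accuracy, precision, recall, and F1 values to four decimal places.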


### *Task 7*

Based on your performance metrics, answer the following questions:

- Explain the impact of accuracy vs. F1 score.
- In this particular problem, which metric should one focus on more?
- Is there a better metric suitable for the use case of vulnerability prediction? Why?
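
One way to approach the first question is to compare the model against a trivial baseline. The sketch below is an illustration rather than a full answer: it assumes a hypothetical classifier that labels every sample as non-vulnerable and uses the class counts reported in Task 3.

```python
positives, negatives = 283, 717  # class counts reported for this dataset

# Hypothetical baseline that labels every sample as non-vulnerable
tp, fp = 0, 0
tn, fn = negatives, positives

baseline_accuracy = (tp + tn) / (positives + negatives)
baseline_recall = tp / (tp + fn)
# Precision is undefined here (0 / 0), so F1 is taken as 0 by convention
baseline_f1 = 0.0

print(f"baseline accuracy: {baseline_accuracy:.4f}")  # 0.7170
print(f"baseline recall:   {baseline_recall:.4f}")    # 0.0000
print(f"baseline F1:       {baseline_f1:.4f}")        # 0.0000
```

The baseline reaches an accuracy of 0.7170 without detecting a single vulnerability, close to the model's 0.7360. This suggests that accuracy alone says little on an imbalanced dataset, whereas F1 reflects the low recall.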


### Miscellaneous Code-Bits


In [None]:
# TODO: import the necessary libraries to load the data from the specified path.

# SOLUTION: 

#Libraries Import
import numpy as np
import h5py

#Load Dataset using h5py library (the context manager closes the file automatically)
with h5py.File('./data_students/student_dataset.hdf5', 'r') as f:
    labels = f['labels'][:]
    sources = f['source'][:]
    vectors = f['vectors'][:]

#Access and Print First index to confirm proper loading
label0, source0, vector0 = labels[0], sources[0], vectors[0]
print(f"Label: {label0}, Source: {source0}, Vector: {vector0}")
