Project 2: Model Engineering
===

___

Submitted by:

* <u>*Arthur Humblot*</u>
* <u>*Bekhzod Anvarov*</u>
* <u>*Ghita El Belghiti*</u>


University: **Politecnico di Torino**

Academic Year: **2025 - 2026**

## 1. Task 1: Frequency-based baseline

In machine learning, it is good practice to compare any model against a baseline. A baseline is a deliberately simple approach that shows whether basic choices and assumptions already solve the problem, before progressing to more complex architectures such as RNNs or GNNs.

In this context, a suitable baseline is a **frequency-based** approach.

Specifically:

In [29]:
#imports here
import pandas as pd

In [30]:
# read dataset
df_train = pd.read_json("../data/train.json")
df_test = pd.read_json("../data/test.json")

# instruction check
print(df_train.head())
print(df_test.head())

                                   api_call_sequence  is_malware
0  [LdrGetDllHandle, LdrGetProcedureAddress, LdrL...           1
1  [NtAllocateVirtualMemory, LdrLoadDll, LdrGetPr...           1
2  [FindResourceExW, LoadResource, FindResourceEx...           1
3  [FindResourceExW, LoadResource, FindResourceEx...           1
4  [LdrGetProcedureAddress, SetErrorMode, LdrLoad...           1
                                   api_call_sequence  is_malware
0  [NtQueryValueKey, NtClose, NtOpenKey, NtQueryV...           1
1  [LdrGetProcedureAddress, NtClose, NtOpenKey, N...           1
2  [NtOpenKey, NtQueryValueKey, NtClose, NtOpenKe...           1
3  [NtAllocateVirtualMemory, LdrLoadDll, LdrGetPr...           1
4  [NtOpenKey, NtQueryValueKey, NtClose, LdrGetPr...           1


* Extract the vocabulary from your input dataset - that is, the **set of all the API calls** appearing in it

In [31]:
# extract API call sequences and labels
train_seqs = df_train['api_call_sequence'].tolist()
test_seqs = df_test['api_call_sequence'].tolist()

train_labels = df_train['is_malware'].tolist()
test_labels = df_test['is_malware'].tolist()

# instruction check
print(train_seqs[0][:5])
print(f"Type of sequence: {type(train_seqs[0]).__name__}")

['LdrGetDllHandle', 'LdrGetProcedureAddress', 'LdrLoadDll', 'LdrGetProcedureAddress', 'LdrGetDllHandle']
Type of sequence: list


* **Q:** How many unique API calls does the training set contain?

In [32]:
# build the training vocabulary: the set of unique API calls
train_vocab = set()

for train_seq in train_seqs:
    for api_call in train_seq:
        train_vocab.add(api_call)

print(f"Number of unique API calls in the training set: {len(train_vocab)}")

Number of unique API calls in the training set: 258


And how many does the test set contain?

In [33]:
# build the test vocabulary: the set of unique API calls
test_vocab = set()

for test_seq in test_seqs:
    for api_call in test_seq:
        test_vocab.add(api_call)

print(f"Number of unique API calls in the test set: {len(test_vocab)}")

Number of unique API calls in the test set: 232


* **Q:** Are there any API calls that appear only in the test set (but not in the training set)? If so, how many, and which ones are they?

In [34]:
# API calls that appear only in the test set and never in the training set
only_in_test = test_vocab - train_vocab
print(f"Number of API calls that appear only in the test set: {len(only_in_test)}")
print(f"API calls that appear only in the test set:\n{only_in_test}")

Number of API calls that appear only in the test set: 3
API calls that appear only in the test set:
{'ControlService', 'WSASocketA', 'NtDeleteKey'}
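
The vocabulary and the set difference can also be written more compactly. The following is a minimal sketch on toy data; the `toy_*` names and the sequences are illustrative assumptions, not part of the dataset.

```python
from itertools import chain

# Toy sequences (assumption: stand in for train_seqs and test_seqs)
toy_train_seqs = [['NtClose', 'NtOpenKey'], ['LdrLoadDll', 'NtClose']]
toy_test_seqs = [['NtClose', 'WSASocketA'], ['NtOpenKey']]

# Flatten every sequence and keep the unique API calls
toy_train_vocab = set(chain.from_iterable(toy_train_seqs))
toy_test_vocab = set(chain.from_iterable(toy_test_seqs))

# Set difference: calls seen in test but never in train
toy_only_in_test = toy_test_vocab - toy_train_vocab
print(len(toy_train_vocab), len(toy_test_vocab), sorted(toy_only_in_test))
```

Sorting the difference before printing keeps the output order stable, since Python sets have no guaranteed iteration order.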


In [35]:
# sort both vocabularies to get a deterministic feature order
train_vocab_sorted = sorted(train_vocab)
test_vocab_sorted = sorted(test_vocab)

# instruction check
print(train_vocab_sorted[:5])
print(test_vocab_sorted[:5])

['CertOpenStore', 'CertOpenSystemStoreW', 'CoCreateInstance', 'CoCreateInstanceEx', 'CoGetClassObject']
['CoCreateInstance', 'CoCreateInstanceEx', 'CoGetClassObject', 'CoInitializeEx', 'CoInitializeSecurity']


* **Q:** Can you use the test vocabulary to build the new test dataframe? If not, how do you handle API calls in the test set that do not exist in the training vocabulary?

In [36]:
feature_names = train_vocab_sorted + ['<UNK>']

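No. The feature columns must be defined from the training data only, so the test vocabulary cannot be used to build them. We add a single **`<UNK>`** feature that collects every API call that appears in the test set but is unknown to the training vocabulary.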
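
As a minimal sketch of this fallback, the toy snippet below maps each call to a column index and routes anything outside the vocabulary to the last column. The toy vocabulary and the helper name `encode` are illustrative assumptions, not part of the lab code.

```python
# Toy vocabulary (assumption: stands in for train_vocab_sorted)
demo_vocab = ['LdrLoadDll', 'NtClose', 'NtOpenKey']
demo_features = demo_vocab + ['<UNK>']
demo_api_to_idx = {api: i for i, api in enumerate(demo_features)}
demo_unk_idx = demo_api_to_idx['<UNK>']


def encode(seq):
    """Map each API call to its column index; unknown calls go to <UNK>."""
    return [demo_api_to_idx.get(api, demo_unk_idx) for api in seq]


# 'WSASocketA' is not in the toy vocabulary, so it maps to the <UNK> index
encoded = encode(['NtOpenKey', 'WSASocketA', 'NtClose'])
print(encoded)
```

Using `dict.get` with a default avoids a separate membership test for each call.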

* Use this vocabulary as the **feature set**: for each row in the input dataset, count the **number of times** (frequency) each vocabulary term occurs

In [37]:
# map each API call to its column index in the frequency vector
api_to_idx = {api: i for i, api in enumerate(feature_names)}

# build the frequency vectors for the training set
X_train = list()

for seq in train_seqs:
    freq = [0] * len(feature_names)   # frequency vector for train features
    for api_call in seq:
        if api_call in api_to_idx:
            freq[api_to_idx[api_call]] += 1
    X_train.append(freq)

# build the frequency vectors for the test set
X_test = list()

for seq in test_seqs:
    freq = [0] * len(feature_names)   # frequency vector for test features
    for api_call in seq:
        if api_call in api_to_idx:
            freq[api_to_idx[api_call]] += 1
        else:
            freq[-1] += 1     # <UNK>: API calls that appear only in the test set
    X_test.append(freq)
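
The same counting can be written once and reused for both splits. The following is a sketch on toy data using `collections.Counter`; the helper `to_freq_vector` and the `sketch_*` names are hypothetical and not part of the lab code.

```python
from collections import Counter


def to_freq_vector(seq, api_to_idx, unk_token='<UNK>'):
    """Count how often each vocabulary term occurs in one sequence."""
    unk_idx = api_to_idx[unk_token]
    freq = [0] * len(api_to_idx)
    for api_call, count in Counter(seq).items():
        # Calls missing from the vocabulary are accumulated in the <UNK> column
        freq[api_to_idx.get(api_call, unk_idx)] += count
    return freq


# Toy feature set (assumption: mirrors feature_names, with <UNK> last)
sketch_features = ['LdrLoadDll', 'NtClose', 'NtOpenKey', '<UNK>']
sketch_api_to_idx = {api: i for i, api in enumerate(sketch_features)}

sketch_seq = ['NtOpenKey', 'NtClose', 'NtOpenKey', 'WSASocketA', 'ControlService']
sketch_vector = to_freq_vector(sketch_seq, sketch_api_to_idx)
print(sketch_vector)
```

Because the two unknown calls share one column, the vector length stays fixed no matter how many new API calls the test set introduces.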

* **Q:** One issue of this frequency-based approach is that it creates sparse vectors (i.e., vectors with many zeros per row):
    * How many non-zero elements per row do you have on average in the training set?
    * How many in the test set?
    * What is the ratio with respect to the number of elements per row?

In [38]:
# sparsity for the train set (non-zero per row)
nnz_train_per_row = list()  # num of non zeros

for freq in X_train:
    nnz_train_per_row.append(sum(1 for i in freq if i > 0))

avg_non_zero_train = sum(nnz_train_per_row) / len(X_train)
print(f"Average non-zero elements per row in training set: {avg_non_zero_train}")
ratio_train = avg_non_zero_train / len(feature_names)
print(f"Ratio with respect to the number of elements per row in training set: {ratio_train}")


# sparsity for the test set (non-zero per row)
nnz_test_per_row = list()  # num of non zeros

for freq in X_test:
    nnz_test_per_row.append(sum(1 for i in freq if i > 0))

avg_non_zero_test = sum(nnz_test_per_row) / len(X_test)
print(f"Average non-zero elements per row in test set: {avg_non_zero_test}")
ratio_test = avg_non_zero_test / len(feature_names)
print(f"Ratio with respect to the number of elements per row in test set: {ratio_test}")

Average non-zero elements per row in training set: 21.94707503828484
Ratio with respect to the number of elements per row in training set: 0.08473774146055923
Average non-zero elements per row in test set: 24.27870868562644
Ratio with respect to the number of elements per row in test set: 0.09374018797539166
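
Each row has 259 columns (the 258 training API calls plus `<UNK>`), so the reported ratios follow directly from the averages. A quick check of that arithmetic, using the values printed above; the variable names are illustrative:

```python
# Values copied from the notebook output above
n_features = 258 + 1            # training vocabulary plus the <UNK> column
avg_nnz_train = 21.94707503828484
avg_nnz_test = 24.27870868562644

# Density: fraction of non-zero entries per row
density_train = avg_nnz_train / n_features
density_test = avg_nnz_test / n_features

# Sparsity: fraction of zero entries per row
print(f"train: density={density_train:.4f}, sparsity={1 - density_train:.4f}")
print(f"test:  density={density_test:.4f}, sparsity={1 - density_test:.4f}")
```

On average, fewer than 10% of the entries in a row are non-zero in either split. Note that the `<UNK>` column is always zero in the training set, since every training call is in the vocabulary by construction.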


In [None]:
""" This is a temporary helper and is not part of the lab activity. """

# ---------- Reused across Task 1–4 (keep these) ----------

"""
    df_train: pandas DataFrame for training data loaded from train.json
    df_test: pandas DataFrame for test data loaded from test.json

    train_seqs: list of lists, each inner list is a sequence of API call strings from the training set
    test_seqs: list of lists, each inner list is a sequence of API call strings from the test set

    train_labels: list of labels (0 = goodware, 1 = malware) for training samples
    test_labels: list of labels (0 = goodware, 1 = malware) for test samples

    train_vocab: set of unique API call strings observed in the training set
    test_vocab: set of unique API call strings observed in the test set (used for analysis / OOV check)

    train_vocab_sorted: sorted list of API call strings from train_vocab
    feature_names: list of feature names for Task 1 (train_vocab_sorted + ['<UNK>'])
                  later tasks may use a similar list for building ID/embedding vocabularies

    api_to_idx: dictionary mapping API call string -> integer index
          (for Task 1: column index in the frequency vector;
          for Tasks 2–3 you will build a similar mapping for IDs/embeddings)

    only_in_test: set of API call strings that appear only in test (test_vocab - train_vocab)
        used to motivate the need for an <UNK> token / index
"""

## 2. Task 2: Feed Forward Neural Network (FFNN)

## 3. Task 3: Recurrent Neural Network (RNN)

## 4. Task 4: Graph Neural Network (GNN)