<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/exercises/ex1/ex1_nn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML4NLP1
## Starting Point for Exercise 1, part II

This notebook is supposed to serve as a starting point and/or inspiration when starting exercise 1, part II.

One of the goals of this exercise is to make you acquainted with **skorch**. You will probably need to consult the [documentation](https://skorch.readthedocs.io/en/stable/).

# Installing skorch and loading libraries

In [21]:
import subprocess

# Installation on Google Colab
try:
    import google.colab
    subprocess.run(['python', '-m', 'pip', 'install', 'skorch'])
except ImportError:
    pass

In [22]:
import torch
from torch import nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier

import pandas as pd
import numpy as np
import csv
import re
import string
from collections import defaultdict

# Set seed for reproducibility
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

## Training a classifier and making predictions

In [23]:
# Download dataset
!gdown 1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs # x_train
!gdown 1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6 # x_test
!gdown 1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl # y_train
!gdown 1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X # y_test

Downloading...
From: https://drive.google.com/uc?id=1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs
To: /content/x_train.txt
100% 64.1M/64.1M [00:02<00:00, 25.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6
To: /content/x_test.txt
100% 65.2M/65.2M [00:01<00:00, 60.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl
To: /content/y_train.txt
100% 480k/480k [00:00<00:00, 5.44MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X
To: /content/y_test.txt
100% 480k/480k [00:00<00:00, 6.03MB/s]


In [24]:
with open('x_train.txt') as f:
    x_train = f.read().splitlines()
with open('y_train.txt') as f:
    y_train = f.read().splitlines()
with open('x_test.txt') as f:
    x_test = f.read().splitlines()
with open('y_test.txt') as f:
    y_test = f.read().splitlines()

In [25]:
# Combine x_train and y_train into one dataframe
train_df = pd.DataFrame({'text': x_train, 'label': y_train})
# Write train_df to csv with tab as separator
train_df.to_csv('train_df.csv', index=False, sep='\t')
# Combine x_test and y_test into one dataframe
test_df = pd.DataFrame({'text': x_test, 'label': y_test})
# Inspect the first 5 items in the train split
train_df.head()

def print_label_info(df, dataset_name):
    print(f"\nUnique labels in {dataset_name}:")
    print(df['label'].unique())

    print(f"\nLabel counts in {dataset_name}:")
    print(df['label'].value_counts())

print_label_info(train_df, "train_df")
print_label_info(test_df, "test_df")


Unique labels in train_df:
['est' 'swe' 'mai' 'oci' 'tha' 'orm' 'lim' 'guj' 'pnb' 'zea' 'krc' 'hat'
 'pcd' 'tam' 'vie' 'pan' 'szl' 'ckb' 'fur' 'wuu' 'arz' 'ton' 'eus'
 'map-bms' 'glk' 'nld' 'bod' 'jpn' 'arg' 'srd' 'ext' 'sin' 'kur' 'che'
 'tuk' 'pag' 'tur' 'als' 'koi' 'lat' 'urd' 'tat' 'bxr' 'ind' 'kir'
 'zh-yue' 'dan' 'por' 'fra' 'ori' 'nob' 'jbo' 'kok' 'amh' 'khm' 'hbs'
 'slv' 'bos' 'tet' 'zho' 'kor' 'sah' 'rup' 'ast' 'wol' 'bul' 'gla' 'msa'
 'crh' 'lug' 'sun' 'bre' 'mon' 'nep' 'ibo' 'cdo' 'asm' 'grn' 'hin' 'mar'
 'lin' 'ile' 'lmo' 'mya' 'ilo' 'csb' 'tyv' 'gle' 'nan' 'jam' 'scn'
 'be-tarask' 'diq' 'cor' 'fao' 'mlg' 'yid' 'sme' 'spa' 'kbd' 'udm' 'isl'
 'ksh' 'san' 'aze' 'nap' 'dsb' 'pam' 'cym' 'srp' 'stq' 'tel' 'swa' 'vls'
 'mzn' 'bel' 'lad' 'ina' 'ava' 'lao' 'min' 'ita' 'nds-nl' 'oss' 'kab'
 'pus' 'fin' 'snd' 'kaa' 'fas' 'cbk' 'cat' 'nci' 'mhr' 'roa-tara' 'frp'
 'ron' 'new' 'bar' 'ltg' 'vro' 'lav' 'ces' 'yor' 'nso' 'bak' 'rus' 'ace'
 'mdf' 'vep' 'sgs' 'uig' 'lit' 'sqi' 'som' 'slk' '

### Data preparation

Prepare your dataset for this experiment using the same method as you did in part 1.

Get a subset of the train/test data that includes 20 languages. Include English, German, Dutch, Danish, Swedish, Norwegian, and Japanese, plus 13 additional languages of your choice based on the items in the list of labels.

Don't forget to encode your labels using the adjusted code snippet from part 1!


In [26]:
# TODO: Create your train/test subsets of languages
# Note, make sure these are the same as what you used in Part 1!

total_rows = len(train_df) + len(test_df)

train_rows = int(total_rows * .75)
test_rows = total_rows - train_rows

print(f"Train rows: {train_rows}")
print(f"Test rows: {test_rows}")
print("\n")

# Concatenating train and test dataframes
combined_df = pd.concat([train_df, test_df], ignore_index=True)

# Shuffle combined_df: if the original train_df and test_df have any inherent order
# (e.g., all samples of a particular language grouped together), not shuffling could
# result in train and test sets that are not representative of the overall data distribution.
combined_df = combined_df.sample(frac=1, random_state=42)

# Split the combined dataframe into train and test sets
train_df = combined_df[:train_rows]
test_df = combined_df[train_rows:]

print("Number of rows in train set:", train_df.shape[0])
print("Number of rows in test set:", test_df.shape[0])
print("\n")

subsets = ['eng', 'deu', 'nld', 'dan', 'swe', 'nno', 'jpn', 'ita', 'tel', 'hin', 'tam', 'kan', 'bul', 'ara', 'kor', 'rus', 'fra', 'pol', 'fin', 'tha']

# Filter the train and test dataframes to include only the selected languages
train_subset = train_df[train_df['label'].isin(subsets)]
test_subset = test_df[test_df['label'].isin(subsets)]

# Print the number of rows for each subset
print("Number of rows in train SUBset:", train_subset.shape[0])
print("Number of rows in test SUBset:", test_subset.shape[0])

# Convert to NumPy arrays for easier handling
x_train = train_subset.text.to_numpy()
y_train = train_subset.label.to_numpy()
x_test = test_subset.text.to_numpy()
y_test = test_subset.label.to_numpy()

Train rows: 176250
Test rows: 58750


Number of rows in train set: 176250
Number of rows in test set: 58750


Number of rows in train SUBset: 14967
Number of rows in test SUBset: 5033


In [27]:
# TODO: Use your adjusted code from part 1 to encode the labels again
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

#y_train = train_subset['label']
#y_test = test_subset['label']

print(f"Labels type before encoding: {type(y_train)}")
print(f"Labels example before encoding: {y_train[:4]}")
print("\n")

label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
print(f"Classes: {label_encoder.classes_}")
print("\n")
print(f"Labels type after encoding: {type(y_train)}")
print(f"Labels example after encoding: {y_train[:4]}")

Labels type before encoding: <class 'numpy.ndarray'>
Labels example before encoding: ['eng' 'tel' 'fra' 'rus']


Classes: ['ara' 'bul' 'dan' 'deu' 'eng' 'fin' 'fra' 'hin' 'ita' 'jpn' 'kan' 'kor'
 'nld' 'nno' 'pol' 'rus' 'swe' 'tam' 'tel' 'tha']


Labels type after encoding: <class 'numpy.ndarray'>
Labels example after encoding: [ 4 18  6 15]
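
Since the network predicts integer class ids, it can be handy to map them back to language codes. The following is a minimal sketch with a hypothetical five-label toy list (not the real dataset), using `LabelEncoder.inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, only for illustration
toy_labels = ['eng', 'tel', 'fra', 'rus', 'eng']

toy_encoder = LabelEncoder().fit(toy_labels)

# Classes are sorted alphabetically, so ids follow that order
toy_classes = list(toy_encoder.classes_)
encoded = toy_encoder.transform(toy_labels)

# Map the integer ids back to the original string labels
decoded = toy_encoder.inverse_transform(encoded)

print(toy_classes)
print(encoded.tolist())
print(decoded.tolist())
```

The same call on `label_encoder` should turn the output of `net.predict` back into language codes.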


### Feature Extraction

In [28]:
# First, we extract some simple features as input for the neural network
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2), max_features=100, binary=True)
X = vectorizer.fit_transform(x_train)

In [29]:
# We need to change the datatype to make it play nice with pytorch
X = X.astype(np.float32)
y = y_train.astype(np.int64)

In [30]:
# Double check X and y shapes
print(f'X is a {type(X)} with shape: {X.shape}')
print(f'y is a {type(y)} with shape: {y.shape}')

X is a <class 'scipy.sparse._csr.csr_matrix'> with shape: (14967, 100)
y is a <class 'numpy.ndarray'> with shape: (14967,)


In the following, we define a vanilla neural network with two hidden layers. The output layer should have as many outputs as there are classes. In addition, it should have a nonlinearity function.

In [31]:
# TODO: In the following, you can find a small (almost) working example of a neural network.
# Unfortunately, again, the cat messed up some of the code. Please fix the code such that it is executable. (Hint: the input and output sizes look a bit weird...)

class ClassifierModule(nn.Module):
    def __init__(
        self,
        input_size,
        num_classes,
        num_units=200,
        nonlin=F.relu,
    ):
        super(ClassifierModule, self).__init__()
        self.num_units = num_units
        self.nonlin = nonlin

        self.dense0 = nn.Linear(input_size, num_units)
        self.dense1 = nn.Linear(num_units, 50)
        self.output = nn.Linear(50, num_classes)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        # The template hard-codes F.relu here instead of self.nonlin; we kept it as is.
        # With nonlin=F.relu (as used below) the two are equivalent.
        X = F.relu(self.dense1(X))
        # Raw scores (logits), one per class
        X = self.output(X)
        return X.squeeze(dim=1)
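
As a quick sanity check of the layer sizes, the sketch below pushes a random batch through a stand-in network with the same dimensions (100 inputs, hidden layers of 200 and 50 units, 20 outputs). The `nn.Sequential` model and the random batch are assumptions made for illustration; they are not the `ClassifierModule` itself:

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in with the same layer sizes as the module above
toy_net = nn.Sequential(
    nn.Linear(100, 200), nn.ReLU(),
    nn.Linear(200, 50), nn.ReLU(),
    nn.Linear(50, 20),
)

# Hypothetical batch of 4 samples with 100 features each
toy_batch = torch.rand(4, 100)
logits = toy_net(toy_batch)

# Softmax turns the logits into one probability distribution per sample
probs = F.softmax(logits, dim=1)

print(tuple(logits.shape))
print(probs.sum(dim=1))
```

The output has one column per class, which is what `nn.CrossEntropyLoss` expects together with integer class labels.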


In [32]:
# Initialise the neural net classifier.
net = NeuralNetClassifier(
    ClassifierModule(
        input_size=X.shape[1],
        num_units=200,
        num_classes=len(label_encoder.classes_),
        nonlin=F.relu,
    ),
    max_epochs=20,
    criterion=nn.CrossEntropyLoss(),
    lr=0.1,
    device='cuda',  # comment this out to train on the CPU
)

In [33]:
X = X.toarray()  # convert the sparse matrix to a dense array, which trained faster for us
# Train the classifier
net.fit(X, y)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        2.7086       0.1814        2.5086  0.3839
      2        2.2076       0.2966        1.9486  0.3504
      3        1.7057       0.3507        1.6924  0.3036
      4        1.4901       0.4469        1.5446  0.3380
      5        1.3609       0.4883        1.4126  0.3635
      6        1.2594       0.5314        1.3049  0.3340
      7        1.1883       0.5564        1.2315  0.3686
      8        1.1425       0.5661        1.1852  0.3650
      9        1.1122       0.5818        1.1570  0.3955
     10        1.0900       0.5888        1.1390  0.3602
     11        1.0721       0.59

<class 'skorch.classifier.NeuralNetClassifier'>[initialized](
  module_=ClassifierModule(
    (dense0): Linear(in_features=100, out_features=200, bias=True)
    (dense1): Linear(in_features=200, out_features=50, bias=True)
    (output): Linear(in_features=50, out_features=20, bias=True)
  ),
)

Note, you can also use `GridSearchCV` with `skorch`, but be aware that training a neural network takes much more time.

Play around with 5 different sets of hyperparameters. For example, consider some of the following:

- layer sizes
- activation functions
- regularizers
- early stopping
- vectorizer parameters

Report your best hyperparameter combination.

📝❓ What is the effect of your modifications on validation performance? Discuss potential reasons.

☝ Note, during model development, if you run into the infamous CUDA out-of-memory (OOM) error, try clearing the GPU memory either with `torch.cuda.empty_cache()` or restarting the runtime.
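
Of the options listed above, early stopping is perhaps the least obvious. skorch ships an `EarlyStopping` callback with a `patience` parameter; the sketch below only illustrates the underlying idea in plain Python. The function `stopping_epoch` and the loss values are hypothetical, and the logic is simplified (no improvement threshold):

```python
def stopping_epoch(valid_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value is reached in epoch 4
hypothetical_losses = [2.5, 1.9, 1.6, 1.5, 1.52, 1.51, 1.55, 1.4]
stop = stopping_epoch(hypothetical_losses, patience=3)
print(stop)
```

With a larger `patience`, training would have continued long enough to see the improvement in the last epoch, which is the usual trade-off when choosing this value.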

In [34]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

In [35]:
to_dense = FunctionTransformer(lambda x: x.toarray().astype(np.float32), accept_sparse=True)

class ClassifierModule(nn.Module):
    def __init__(
        self,
        num_units=200,
        num_classes=20,  # You can set this based on your task
        nonlin=F.relu,
        dropout_rate=0.5,
    ):
        super(ClassifierModule, self).__init__()
        self.num_units = num_units
        self.nonlin = nonlin
        self.dropout = nn.Dropout(dropout_rate)

        # dense0 is created lazily in the first forward pass, because its input size
        # depends on the vectorizer's max_features
        self.dense0 = None
        self.dense1 = nn.Linear(num_units, 50)
        self.output = nn.Linear(50, num_classes)

    def forward(self, X):
        # Dynamically initialize self.dense0 based on the input size
        if self.dense0 is None:
            input_size = X.shape[1]
            self.dense0 = nn.Linear(input_size, self.num_units)

        X = self.nonlin(self.dense0(X))
        X = self.dropout(X)
        X = self.nonlin(self.dense1(X))
        X = self.output(X)
        return X




net = NeuralNetClassifier(
    module=ClassifierModule(
        num_units=200,
        nonlin=F.relu,
    ),
    max_epochs=10,
    criterion=nn.CrossEntropyLoss,
    optimizer=torch.optim.Adam,
    lr=0.01,
    batch_size=64,
    device='cpu',
    verbose=0,
)

pipe = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(2, 2), binary=True)),
    ('to_dense', to_dense),
    ('net', net),
])
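
A caveat that may be worth verifying: a layer created inside `forward` only exists after the first batch has passed through the network. If the optimizer was built from `module.parameters()` before that point (to our understanding, skorch sets up the optimizer when the net is initialized, before training starts), the weights of the late layer are never updated. The sketch below reproduces the effect in plain PyTorch with a hypothetical `LazyInForward` module and random toy data; it is an illustration under these assumptions, not a statement about the exact results in this notebook:

```python
import torch
from torch import nn


class LazyInForward(nn.Module):
    """Hypothetical module that creates its first layer inside forward()."""

    def __init__(self, num_units=8, num_classes=3):
        super().__init__()
        self.num_units = num_units
        self.dense0 = None
        self.output = nn.Linear(num_units, num_classes)

    def forward(self, X):
        if self.dense0 is None:
            self.dense0 = nn.Linear(X.shape[1], self.num_units)
        return self.output(torch.relu(self.dense0(X)))


torch.manual_seed(0)
model = LazyInForward()
# The optimizer is built before the first forward pass, so it never sees dense0
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

X_toy = torch.randn(16, 5)
y_toy = torch.randint(0, 3, (16,))

loss = nn.functional.cross_entropy(model(X_toy), y_toy)
dense0_before = model.dense0.weight.detach().clone()
output_before = model.output.weight.detach().clone()

loss.backward()
optimizer.step()

dense0_unchanged = torch.equal(dense0_before, model.dense0.weight)
output_changed = not torch.equal(output_before, model.output.weight)
print(dense0_unchanged, output_changed)
```

If this applies to the pipeline, one possible remedy is to give the module an explicit input-size argument and set it together with `vect__max_features`.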


In [36]:

params = {
    'vect__max_features': [100, 800],
    'net__module__nonlin': [F.relu, F.tanh],
    'net__module__dropout_rate': [0.0, 0.5],
    'net__optimizer__lr': [0.01, 0.001],
    'net__optimizer__weight_decay': [0.01, 0.001],
    'net__module__num_units': [100, 300],
}
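
Before launching the search, it helps to estimate how many models will be trained. The sketch below counts the combinations for the grid above with 5-fold cross-validation; the strings `'relu'` and `'tanh'` are stand-ins for the actual functions:

```python
from math import prod

# Same grid as above; strings stand in for F.relu and F.tanh
param_grid = {
    'vect__max_features': [100, 800],
    'net__module__nonlin': ['relu', 'tanh'],
    'net__module__dropout_rate': [0.0, 0.5],
    'net__optimizer__lr': [0.01, 0.001],
    'net__optimizer__weight_decay': [0.01, 0.001],
    'net__module__num_units': [100, 300],
}
cv_folds = 5

# Every value of every parameter is combined with every other one
n_candidates = prod(len(values) for values in param_grid.values())
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)
```

Each additional parameter with two values doubles the number of fits, and with `refit=True` one more fit on the full training data follows at the end.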


In [38]:
gs = GridSearchCV(pipe, params, refit=True, cv=5, scoring='accuracy', verbose=2)

# Fit the model
gs.fit(x_train, y)

Fitting 5 folds for each of 64 candidates, totalling 320 fits
[CV] END net__module__dropout_rate=0.0, net__module__nonlin=<function relu at 0x7844082332e0>, net__module__num_units=100, net__optimizer__lr=0.01, net__optimizer__weight_decay=0.01, vect__max_features=100; total time=   6.8s
[CV] END net__module__dropout_rate=0.0, net__module__nonlin=<function relu at 0x7844082332e0>, net__module__num_units=100, net__optimizer__lr=0.01, net__optimizer__weight_decay=0.01, vect__max_features=100; total time=   6.8s
[CV] END net__module__dropout_rate=0.0, net__module__nonlin=<function relu at 0x7844082332e0>, net__module__num_units=100, net__optimizer__lr=0.01, net__optimizer__weight_decay=0.01, vect__max_features=100; total time=   7.0s
[CV] END net__module__dropout_rate=0.0, net__module__nonlin=<function relu at 0x7844082332e0>, net__module__num_units=100, net__optimizer__lr=0.01, net__optimizer__weight_decay=0.01, vect__max_features=100; total time=   6.7s
[CV] END net__module__dropout_rate

In [39]:
# What was the best hyperparameter combination?
print(gs.best_params_)
# What was the best average score across all cross-validation runs?
print(gs.best_score_)

{'net__module__dropout_rate': 0.0, 'net__module__nonlin': <function tanh at 0x784408233d00>, 'net__module__num_units': 300, 'net__optimizer__lr': 0.001, 'net__optimizer__weight_decay': 0.001, 'vect__max_features': 800}
0.9721391106078958



---

📝❓ Write your lab report here addressing all questions in the notebook

We implemented a pipeline with a vectorizer, a transformation from sparse to dense matrices, and a neural network with two hidden layers. The transformation was crucial for training speed, as we observed that PyTorch was significantly slower with sparse matrices.

During hyperparameter tuning, we experimented with different values for the vectorizer’s max_features parameter. Although we initially thought that limiting the matrix size by focusing on the most frequent bigrams would improve performance, the results showed otherwise. A larger matrix with more features yielded higher accuracy. In a grid search comparing low (100) and high (800) feature counts, the higher count performed substantially better.

We also tuned the number of units in the first hidden layer (100 vs. 300); the larger value gave better results. For regularization, we tested both L2 regularization (weight decay) and dropout. The best configuration used a weight decay of 0.001 and no dropout: while L2 regularization improved performance, dropout surprisingly decreased accuracy. This may be because the model’s complexity wasn’t high enough for dropout to be effective.

Regarding activation functions, we tested ReLU and Tanh; Tanh slightly outperformed ReLU, possibly due to its smoother gradient. Finally, we compared two learning rates (0.01 and 0.001), and the lower value gave better stability and accuracy.
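
For reference, the sketch below shows what the dropout layer discussed above does to its input. It uses a hypothetical vector of ones rather than real activations: in training mode roughly half of the values are zeroed and the rest are scaled by 1 / (1 - p), while in evaluation mode the input passes through unchanged:

```python
import torch
from torch import nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)
# Hypothetical activations, all equal to 1 so the effect is easy to read
activations = torch.ones(1000)

# Training mode: values are randomly zeroed, survivors are scaled to 2.0
dropout.train()
train_out = dropout(activations)

# Evaluation mode: dropout is a no-op
dropout.eval()
eval_out = dropout(activations)

kept_fraction = (train_out != 0).float().mean().item()
print(sorted(train_out.unique().tolist()), kept_fraction)
```

With only 50 to 300 hidden units, removing half of the first hidden layer's activations in every step is a strong perturbation, which fits the explanation given above.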