# Natural Language Processing Project
## NLP Course @ Politecnico di Milano 2022/2023 - Prof. Mark Carman
### Topic 4: Autextification

### Group: Residual Sum of Students
- Raul Singh
- Davide Rigamonti
- Francesco Tosini
- Enrico Zuccolotto

## Introduction

The dataset consists of *short text passages* that have either been written by a *human* or generated automatically by a *language model*; more information can be found on the [official site](https://sites.google.com/view/autextification).

The original task will take place as part of [IberLEF 2023](http://sepln2023.sepln.org/en/iberlef-en/), the 5th Workshop on Iberian Languages Evaluation Forum at the SEPLN 2023 Conference, which will be held in Jaén, Spain on the 26th of September, 2023.

Given the scope of the original challenge, the dataset contains two separate sets of samples, one in **English** and the other in **Spanish**; our main focus will be on the **English** dataset.

We will treat the two tasks of the challenge **separately**, as their goals are different; however, since the two share some similarities, most of the considered approaches will be symmetrical and yield homogeneous results.

Each task opens with a brief **data exploration** section; we then apply models and approaches seen in the course (with the introduction of some novelties), starting from the most basic techniques based on the **Bag of Words representation**, moving on to approaches that use **Word Embeddings**, and finally reaching the *state-of-the-art* **Transformer** models.

## Preliminary initialization

This section contains all the library imports, helper function initialization calls and global variable definitions.

### Imports

#### Utilized libraries
The following dependencies are needed to run the notebook:
```
pip install scikit-learn~=1.2.2
pip install torch~=2.0.0
pip install matplotlib~=3.7.1
pip install plotly~=5.14.1
pip install nltk~=3.8.1
pip install spacy~=3.5.1
pip install textstat~=0.7.3
pip install numpy~=1.24.2
pip install pandas~=2.0.0
pip install python-terrier~=0.9.2
pip install scipy~=1.9.3
pip install gensim~=4.3.1
pip install lexicalrichness~=0.5.0
pip install sentence-transformers~=2.2.2
pip install transformers~=4.28.1
pip install datasets~=2.12.0
pip install evaluate~=0.4.0
```

#### Python standard library

In [None]:
import re
import os
import sys
import abc
import random
import string

import copy as cp
import array as arr

from collections import Counter

#### Scikit-learn

In [None]:
from sklearn import metrics

from sklearn.preprocessing import LabelBinarizer, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

#### Pytorch

In [None]:
import torch

import torch.nn as nn
import torch.optim as opt
import torch.utils.data as dt

#### Plotting

In [None]:
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.graph_objects as go

from plotly.subplots import make_subplots

#### Various

In [None]:
import nltk
import spacy
import textstat

import evaluate as hf_ev
import numpy as np
import pandas as pd
import pyterrier as pt
import scipy.sparse as sps

import gensim.downloader as api

from spacy import displacy
from nltk.corpus import stopwords
from pandas.core.common import flatten
from datasets import Dataset, DatasetDict
from gensim.models.word2vec import Word2Vec
from lexicalrichness import LexicalRichness
from sentence_transformers import SentenceTransformer, util
from transformers import TrainingArguments, Trainer, AutoTokenizer, AutoModelForSequenceClassification

from nlp_project.notebook_utils import (
    compact_split, evaluate, split, save_scikit_model, load_scikit_model
)
from nlp_project.nn_utils import init_gpu
from nlp_project.nn_classifier import Data, Classifier
from nlp_project.nn_extra import EarlyStopping, AdaptLR

### Helper functions

#### Pytorch models

In [None]:
history_metrics = {
    "epoch": {},
    "loss": {"order": -1},
    "val_loss": {"order": -1},
    "acc": {"order": +1},
    "val_acc": {"order": +1},
}

In [None]:
class StopNNLoop(BaseException):
    pass

def build_history_string(history_point):
    epoch = history_point["epoch"]
    metrics_string = " ".join(
        [f"{k}: {history_point[k]:.7f}" for k in history_point if not k == "epoch"]
    )
    return f"Epoch {epoch} -- " + metrics_string


def compare_equal_models(model_1, model_2):
    # Compare two models parameter by parameter through their state dicts
    models_differ = 0
    for key_item_1, key_item_2 in zip(
        model_1.state_dict().items(), model_2.state_dict().items()
    ):
        if not torch.equal(key_item_1[1], key_item_2[1]):
            models_differ += 1
            if key_item_1[0] == key_item_2[0]:
                print("Mismatch found at", key_item_1[0])
            else:
                raise ValueError("State dict keys do not match")
    return models_differ == 0


# Returns true if a is "better" than b following the metric
def compare_metric(metric, a, b, delta=0):
    if a == b:
        return False
    if history_metrics[metric]["order"] == +1:
        return a > b + delta
    return a < b - delta


# Initializes lowest possible value given a metric
def init_lowest(metric):
    return -np.inf if history_metrics[metric]["order"] == +1 else np.inf


def init_gpu(gpu="cuda:0"):
    return torch.device(gpu if torch.cuda.is_available() else "cpu")

In [None]:
class Data(dt.Dataset):
    def __init__(self, x, y, x_type=np.int32, y_type=torch.float):
        x_coo = x.tocoo()
        self.x = torch.sparse.FloatTensor(
            torch.LongTensor([x_coo.row, x_coo.col]),
            torch.FloatTensor(x_coo.data.astype(x_type)),
            x_coo.shape,
        )
        self.y = torch.tensor(y, dtype=y_type)
        self.shape = self.x.shape

    def __getitem__(self, index):
        return self.x[index].to_dense(), self.y[index]

    def __len__(self):
        return self.shape[0]

In [None]:
class Classifier(nn.Module):
    def __init__(self, binary_classifier=False, device=torch.device("cpu"), verbose=True):
        super().__init__()
        self.device = device
        self.is_binary = binary_classifier
        self.verbose = verbose
        self.is_compiled = False
        self.history = []

    def forward(self, x):
        return x

    def compile(self, loss, optimizer, binary_threshold=0.5):
        self.loss = loss
        self.optimizer = optimizer
        self.binary_threshold = binary_threshold
        self.to(self.device)
        self.is_compiled = True

    def parse_logits(self, outputs):
        if self.is_binary:
            predicted = (outputs > self.binary_threshold).float()
        else:
            _, predicted = torch.max(outputs.data, 1)
        return predicted

    def train_loop(self, data, epochs, data_val=None, callbacks=()):
        try:
            tot = len(data.dataset)
            # Iterate over all epochs
            for epoch in range(epochs):
                running_loss = 0.0
                correct = 0
                history_point = {}
                # Iterate over each dataset batch
                for i, datum in enumerate(data):
                    # Decompose batch in x and y
                    inputs, labels = datum
                    # Set gradients to zero
                    self.optimizer.zero_grad()
                    # Forward pass
                    outputs = self(inputs)
                    predictions = self.parse_logits(outputs)
                    current_loss = self.loss(outputs, labels)
                    # Backpropagation
                    current_loss.backward()
                    # Optimization
                    self.optimizer.step()
                    # Update metrics
                    running_loss += current_loss.item()
                    correct += (predictions == labels).float().sum()

                # Compute training metrics
                history_point["epoch"] = epoch + 1
                history_point["loss"] = running_loss / tot
                history_point["acc"] = correct / tot

                # Compute and save eventual validation metrics
                if data_val:
                    _, val_metrics = self.test_loop(data_val)
                    history_point["val_loss"] = val_metrics["loss"]
                    history_point["val_acc"] = val_metrics["acc"]

                # Save epoch in history
                self.history.append(history_point)

                # Perform callbacks
                for callback in callbacks:
                    callback.call(self, history_point)

                # Print epoch summary
                if self.verbose:
                    print(build_history_string(history_point))

        except StopNNLoop:
            # Raised by a callback (e.g. EarlyStopping) to end training early
            pass

    def test_loop(self, data):
        all_predictions = np.array([])
        tot = len(data.dataset)
        loss = 0.0
        correct = 0
        metrics = {}
        # Prevent model update
        with torch.no_grad():
            # Iterate over each dataset batch
            for datum in data:
                # Decompose batch in x and y
                inputs, labels = datum
                # Forward pass
                outputs = self(inputs)
                predictions = self.parse_logits(outputs)
                current_loss = self.loss(outputs, labels)
                # Update metrics
                loss += current_loss.item()
                correct += (predictions == labels).float().sum()
                # Append predictions
                all_predictions = np.append(all_predictions, predictions)

        # Compute metrics
        metrics["acc"] = correct / tot
        metrics["loss"] = loss / tot

        return all_predictions.flatten(), metrics

In [None]:
class Callback(metaclass=abc.ABCMeta):
    def __init__(self, inputs):
        if not isinstance(inputs, list):
            raise TypeError("Parameter 'inputs' must be a list")
        if not all(x in history_metrics for x in inputs):
            raise ValueError(
                "Unknown input value, not present in history_metrics"
            )
        self.inputs = inputs

    def inputs_check(self, inputs):
        if not all(x in inputs for x in self.inputs):
            # Report the requested metrics that are missing from the provided inputs
            raise ValueError(
                f"Requested inputs not provided: {[i for i in self.inputs if i not in inputs]}"
            )

    @abc.abstractmethod
    def call(self, model, inputs):
        self.inputs_check(inputs)
        pass

class EarlyStopping(Callback):
    def __init__(
        self,
        metric="loss",
        patience=10,
        baseline=None,
        delta=0,
        restore_best=True,
        verbose=True,
    ):
        super().__init__([metric])
        self.metric = metric
        self.patience = patience
        self.baseline = baseline
        self.delta = delta
        self.restore_best = restore_best
        self.verbose = verbose
        self.best_epoch = 0
        self.counter = 0
        self.saved_params = {}
        self.last_best = init_lowest(self.metric)

    def call(self, model, inputs):
        super().call(model, inputs)
        if self.early_stop(model, inputs):
            raise StopNNLoop()

    def early_stop(self, model, inputs):
        metric = inputs[self.metric]
        # Check if new metric is better than the current best
        if compare_metric(self.metric, metric, self.last_best):
            # Reset counter and update best value
            self.last_best = metric
            self.counter = 0
            self.best_epoch = inputs["epoch"]
            # Update model checkpoint (always, when no baseline is set)
            if self.baseline is None or compare_metric(self.metric, metric, self.baseline):
                self.saved_params = cp.deepcopy(model.state_dict())
        # Check if new metric is worse than the current best
        elif compare_metric(self.metric, self.last_best, metric, self.delta):
            # Increment counter
            self.counter += 1
            # Check if counter exceeds patience, if so interrupt training
            if self.counter >= self.patience:
                # Restore best model checkpoint if possible and wanted
                if not self.restore_best:
                    return True
                if self.saved_params:
                    model.load_state_dict(self.saved_params)
                if self.verbose:
                    if self.saved_params:
                        print(f"Model restored successfully @ epoch {self.best_epoch}")
                    else:
                        print(f"Couldn't restore model @ epoch {self.best_epoch}")
                return True
        return False


class AdaptLR(Callback):
    def __init__(self, metric="loss", patience=5, factor=0.1, delta=0, verbose=True):
        super().__init__([metric])
        self.metric = metric
        self.patience = patience
        self.factor = factor
        self.delta = delta
        self.verbose = verbose
        self.counter = 0
        self.last_best = init_lowest(self.metric)

    def call(self, model, inputs):
        super().call(model, inputs)
        if self.adaptlr(inputs):
            # Adapt learning rate
            out = []
            for g in model.optimizer.param_groups:
                g["lr"] *= self.factor
                out = g["lr"]
            if self.verbose:
                print(f"Reducing lr to {out:.4f}")

    def adaptlr(self, inputs):
        metric = inputs[self.metric]
        # Check if new metric is better than the current best
        if compare_metric(self.metric, metric, self.last_best):
            # Reset counter and update best value
            self.last_best = metric
            self.counter = 0
        # Check if new metric is worse than the current best
        elif compare_metric(self.metric, self.last_best, metric, self.delta):
            # Increment counter
            self.counter += 1
            # Check if counter exceeds patience, if so reduce the learning rate
            if self.counter >= self.patience:
                self.counter = 0
                return True
        return False

#### Generic utility

In [None]:
def split(x, y, test_size=0.2, val_size=0.0, seed=0):
    if val_size + test_size >= 1:
        return None
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size + val_size, stratify=y, random_state=seed
    )
    x_val, y_val = None, None
    if val_size > 0:
        x_test, x_val, y_test, y_val = train_test_split(
            x_test,
            y_test,
            test_size=val_size / (test_size + val_size),
            stratify=y_test,
            random_state=seed,
        )
    return x_train, x_val, x_test, y_train, y_val, y_test

def compact_split(dataset, test_size=0.2, val_size=0.0, seed=0):
    if val_size + test_size >= 1:
        return None
    train, test = train_test_split(
        dataset, test_size=test_size + val_size, random_state=seed
    )
    val = None
    if val_size > 0:
        val, test = train_test_split(
            test,
            test_size=test_size / (test_size + val_size),
            random_state=seed,
        )
    return train, val, test

def evaluate(y_true, y_pred, labels=None):
    print(classification_report(y_true, y_pred))
    cm = confusion_matrix(y_true, y_pred)
    cm_display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    cm_display.plot()
    plt.show()

In [None]:
def train_cv_models(models, x_train, y_train):
    for model in models:
        x_train_, y_train_ = x_train, y_train
        if "subsample" in model.keys():
            x_train_, _, y_train_, _ = train_test_split(
                x_train, 
                y_train, 
                test_size=model["subsample"], 
                stratify=y_train
            )
                
        print(f"Training {model['name']}")        
        model["model"].fit(x_train_, y_train_)
        
        print("Found best model")
        model["best"] = model["model"].best_estimator_
        model["best"].fit(x_train, y_train)
        print("Trained best model")

def test_cv_models(models, x_test, y_test):
    # Note: `labels` is the global list of class names defined for the current task
    for model in models:
        print(f"{model['name']}")
        if hasattr(model["model"], "cv_results_"):
            print(f"Best parameters: {model['model'].best_params_}")
            print(f"Best CV score: {model['model'].best_score_}")
        y_pred = model["best"].predict(x_test)
        evaluate(y_test, y_pred, labels=labels)

### Variable definitions

In [None]:
seed = 42
np.random.seed(seed)
random.seed(seed)

#### Library initialization calls

In [None]:
%%capture
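# Download and load the spaCy pipeline model
!{sys.executable} -m spacy download en_core_web_sm
nlp_model = spacy.load("en_core_web_sm")
# Download the NLTK stopword list used during data exploration
nltk.download("stopwords")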

## Task 1

The challenge is subdivided into two main tasks; the first is a **Binary Classification** task that aims at identifying whether a text passage was *written by a human* or *generated by a language model*.

In [None]:
# Classification labels for Task 1
labels = ["generated", "human"]

### Data Exploration

We load the English dataset and run the *en_core_web_sm* SpaCy pipeline on it to generate a vectorized representation enriched with POS tags and NER tags.

The [SpaCy pipeline](https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.5.0) contains the following components: *tok2vec*, *tagger*, *parser*, *senter*, *attribute_ruler*, *lemmatizer*, *ner*.

In addition, we define subsets of grouped POS and NER tags that may be of interest to our application.
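
As a minimal sketch of the grouping idea, the snippet below counts fine-grained tags under a coarser group key. The `TAG_GROUPS` subset, the `count_grouped_tags` helper and the hand-written tag list are hypothetical and only serve this illustration; in the notebook the tags come from the SpaCy pipeline.

```python
from collections import Counter

# Illustrative subset of grouped tags: each key collects several fine-grained tags
TAG_GROUPS = {
    "NN": ["NN", "NNP", "NNPS", "NNS"],
    "VB": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "JJ": ["JJ", "JJR", "JJS"],
}


def count_grouped_tags(token_tags, groups):
    """Count how many tokens fall into each group of fine-grained tags."""
    fine_counts = Counter(token_tags)
    return {
        group: sum(fine_counts[tag] for tag in fine_tags)
        for group, fine_tags in groups.items()
    }


# Hand-written tags for "Small dogs chased the bigger cat" (stand-in for tagger output)
tags = ["JJ", "NNS", "VBD", "DT", "JJR", "NN"]
grouped = count_grouped_tags(tags, TAG_GROUPS)
print(grouped)
```

Tags that belong to no group (here `DT`) are simply left out of the counts.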

In [None]:
# Loading the dataset
df = pd.read_csv("../AUTEXTIFICATION/subtask_1/train.tsv", sep="\t")
df = df.drop("id", axis=1)
df["tagged_text"] = df["text"].apply(lambda x: nlp_model(x))

df

In [None]:
# Interesting POS tags
sel_pos = {
    ",": [","], ".": ["."], ":": [":"], "ADD": ["ADD"], "AFX": ["AFX"], "CC": ["CC"],
    "CD": ["CD"], "DT": ["DT"], "EX": ["EX"], "HYPH": ["HYPH"], "IN": ["IN"],
    "JJ": ["JJ", "JJR", "JJS"], "NFP": ["NFP"], "NN": ["NN", "NNP", "NNPS", "NNS"],
    "PRP": ["PRP", "PRP$"], "RB": ["RB", "RBR", "RBS", "RP"], "SYM": ["SYM"], "TO": ["TO"],
    "UH": ["UH"], "VB": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"], "WDT": ["WDT"], "XX": ["XX"]
}

# Interesting NER tags
sel_ner = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC", "MONEY", "NORP",
    "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME", "WORK_OF_ART"
]

# Lower limit for the number of samples to consider
limit_n_samples = 40

# Print tag explanation
print("POS tags")
for tag in sel_pos:
    print(f"{tag} {sel_pos[tag]}: {spacy.explain(tag)}")
print("NER tags")
for tag in sel_ner:
    print(f"{tag}: {spacy.explain(tag)}")

We enrich the dataset by computing some custom metrics for each text, such as:
- text length
- number of uppercase letters
- number of different and cumulative stopwords
- POS and NER tag counts
  
Most of these metrics are computed both as absolute counts and relative to the text length.

Then, we drop the metrics that are non-zero in too few samples (at most `limit_n_samples`).
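
To make the absolute versus relative distinction concrete, here is a small sketch on a single sentence. The `STOPWORDS` set and the `text_metrics` helper are hypothetical and kept tiny on purpose; the notebook uses the NLTK English stopword list and pandas columns instead.

```python
# Hypothetical, tiny stopword set used only for this illustration
STOPWORDS = {"the", "a", "of", "and", "is"}


def text_metrics(text):
    """Compute a few absolute metrics and their length-relative counterparts."""
    words = text.split()
    length = len(text)
    n_upcase = sum(1 for ch in text if "A" <= ch <= "Z")
    n_stopword = len(set(words) & STOPWORDS)            # distinct stopwords
    ncum_stopword = sum(w in STOPWORDS for w in words)  # every occurrence
    return {
        "length": length,
        "n_upcase": n_upcase,
        "n_upcase_rel": n_upcase / length,
        "n_stopword": n_stopword,
        "n_stopword_rel": n_stopword / length,
        "ncum_stopword": ncum_stopword,
        "ncum_stopword_rel": ncum_stopword / length,
    }


sample_metrics = text_metrics("A dog and the cat of the NLP lab")
print(sample_metrics)
```

Note that the match is case-sensitive, so the leading "A" is not counted as a stopword in this sketch.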

In [None]:
# Text length
df["length"] = df["text"].str.len()

# Number of uppercase letters
df["n_upcase"] = df["text"].str.count(r"[A-Z]")
df["n_upcase_rel"] = df["n_upcase"] / df["length"]

# Number of stopwords (the stopword set is built once, outside the lambdas)
stop_words = set(stopwords.words("english"))
df["n_stopword"] = df["text"].str.split().apply(
    lambda x: len(set(x) & stop_words)
)
df["ncum_stopword"] = df["text"].str.split().apply(
    lambda x: len([w for w in x if w in stop_words])
)
df["n_stopword_rel"] = df["n_stopword"] / df["length"]
df["ncum_stopword_rel"] = df["ncum_stopword"] / df["length"]

# Number of POS and NER tags
for tag in sel_pos:
    column_tag = "n_pos_" + tag
    df[column_tag] = df["tagged_text"].apply(
        lambda x: len([tok for tok in x if tok.tag_ in sel_pos[tag]])
    )
    df[column_tag + "_rel"] = df[column_tag] / df["length"]
for tag in sel_ner:
    column_tag = "n_ner_" + tag
    df[column_tag] = df["tagged_text"].apply(
        lambda x: len([tok for tok in x if tok.ent_type_ == tag])
    )
    df[column_tag + "_rel"] = df[column_tag] / df["length"]

In [None]:
print("Deleting empty columns")
print(df.columns[(df == 0).all(axis=0)].tolist())
df = df.loc[:, (df != 0).any(axis=0)]

print(f"Deleting columns with less than {limit_n_samples} samples")
print(df.columns[df.astype(bool).sum(axis=0) <= limit_n_samples].tolist())
df = df.loc[:, df.astype(bool).sum(axis=0) > limit_n_samples]

# Defragment dataset
all_columns = df.columns.tolist()
df = df.copy()

We define a function that neatly plots comparisons between human and generated samples, visualizing the metrics as box plots and histograms.

We opt to ignore metrics that turn out to be too similar between the two classes, judged by comparing their first, second and third quartiles.
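
The similarity rule can be sketched in plain Python as follows. The `too_similar` helper and the sample values are hypothetical; the sketch assumes that `statistics.quantiles` with the inclusive method is an acceptable stand-in for the pandas quantile computation used in the notebook, and that the tolerance is 1% of the overall value range.

```python
from statistics import quantiles


def too_similar(values_a, values_b, tolerance=0.01):
    """Return True when all three quartiles of the two groups differ by at most
    `tolerance` times the overall value range."""
    all_values = list(values_a) + list(values_b)
    delta = (max(all_values) - min(all_values)) * tolerance
    q_a = quantiles(values_a, n=4, method="inclusive")  # [Q1, Q2, Q3]
    q_b = quantiles(values_b, n=4, method="inclusive")
    return all(abs(a - b) <= delta for a, b in zip(q_a, q_b))


generated = [10, 12, 14, 16, 18]
human_close = [10, 12, 14, 16, 18.1]
human_far = [20, 24, 28, 32, 36]

print(too_similar(generated, human_close))
print(too_similar(generated, human_far))
```

A metric flagged as too similar carries little visual information, which is why it is skipped in the plots.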

In [None]:
def plot_df_stats(
    df,
    sel_columns,
    labels,
    colors,
    ignore_similar=True,
    height=1000,
    width=800
):
    ignored = []
    vis_columns = []
    for column in sel_columns:
        if ignore_similar:
            col_name = column[0]["name"]
            x = df[col_name]
            x1 = df.loc[df['label']==labels[0]][col_name]
            x2 = df.loc[df['label']==labels[1]][col_name]
            delta = (x.max() - x.min()) / 100
            if (np.abs(x1.quantile(0.25) - x2.quantile(0.25)) <= delta and
                np.abs(x1.quantile(0.5) - x2.quantile(0.5)) <= delta and 
                np.abs(x1.quantile(0.75) - x2.quantile(0.75)) <= delta):
                ignored.append(col_name)
                continue
        vis_columns.append(column)
        
    titles = [c["name"] for c_arr in vis_columns for c in c_arr]
    specs = [
        [{"secondary_y": True} if h["type"] == "hist" else {} for h in s] 
        if len(s) > 1 else [{"colspan":2}, None]
        for s in vis_columns
    ]
    
    fig = make_subplots(
        horizontal_spacing = 0.005,
        vertical_spacing = 0.01,
        rows=len(vis_columns),
        cols=2,
        subplot_titles=titles,
        specs=specs
    )
    fig.update_layout(
        height=height,
        width=width,
        showlegend=False,
        template="plotly_white"
    )

    for i, column in enumerate(vis_columns):
        if len(column) > 1:
            for j, subcolumn in enumerate(column):
                x1 = df.loc[df['label']==labels[0]][subcolumn["name"]]
                x2 = df.loc[df['label']==labels[1]][subcolumn["name"]]
                if subcolumn["type"] == "box":
                    add_boxplot(fig, [x1, x2], i+1, j+1, labels, colors)
                elif subcolumn["type"] == "hist":
                    add_hist(fig, [x1, x2], i+1, j+1, labels, colors)
                else:
                    raise ValueError(f"Unknown plot type: {subcolumn['type']}")
        else:
            column = column[0]
            x1 = df.loc[df['label']==labels[0]][column["name"]]
            x2 = df.loc[df['label']==labels[1]][column["name"]]
            add_boxplot(fig, [x1, x2], i+1, 1, labels, colors)

    fig.show()
    print(f"{ignored} too similar, ignored")
    
def add_boxplot(fig, x, row, col, labels, colors):
    for i, el in enumerate(x):
        fig.add_trace(go.Box(
            y=el,
            name=labels[i],
            marker_color=colors[i]
        ),row=row, col=col)
        
def add_hist(fig, x, row, col, labels, colors):
    offset = len(x)
    temp = ff.create_distplot(x, labels, curve_type = 'kde')
    normal_x = []
    normal_y = []
    for n in range(offset):
        normal_x.append(temp.data[offset + n]['x'])
        normal_y.append(temp.data[offset + n]['y'])
    for i, el in enumerate(x):
        fig.add_trace(go.Histogram(
            x=el,
            orientation="v",
            xbins=go.histogram.XBins(size=(max(el) - min(el)) / 15),
            name=labels[i],
            opacity=0.4,
            marker_color=colors[i]
        ), row=row, col=col)
        fig.add_trace(go.Scatter(
            x=normal_x[i],
            y=normal_y[i],
            mode = 'lines',
            name=labels[i],
            marker_color=colors[i]
        ), row=row, col=col, secondary_y=True)

We select the metrics that we want to visualize and how we want to visualize them.

For the sake of brevity we only choose to visualize some of the POS/NER tags here.

In [None]:
sel_columns = [
    [{"name": "length", "type": "box"}], 
    [{"name": "n_upcase", "type": "box"}, {"name": "n_upcase_rel", "type": "hist"}], 
    [{"name": "n_stopword", "type": "box"}, {"name": "n_stopword_rel", "type": "hist"}],
    [{"name": "ncum_stopword", "type": "box"}, {"name": "ncum_stopword_rel", "type": "hist"}],
]

# sel_columns.extend([
#     [{"name": "n_pos_" + x, "type": "box"}, {"name": "n_pos_" + x + "_rel", "type": "hist"}]
#     for x in sel_pos if "n_pos_" + x in all_columns
# ])
sel_columns.extend([
    [{"name": "n_pos_,", "type": "box"}, {"name": "n_pos_,_rel", "type": "hist"}],
    [{"name": "n_pos_.", "type": "box"}, {"name": "n_pos_._rel", "type": "hist"}],
    [{"name": "n_pos_CC", "type": "box"}, {"name": "n_pos_CC_rel", "type": "hist"}],
    [{"name": "n_pos_CD", "type": "box"}, {"name": "n_pos_CD_rel", "type": "hist"}],
    [{"name": "n_pos_DT", "type": "box"}, {"name": "n_pos_DT_rel", "type": "hist"}],
    [{"name": "n_pos_JJ", "type": "box"}, {"name": "n_pos_JJ_rel", "type": "hist"}],
    [{"name": "n_pos_NN", "type": "box"}, {"name": "n_pos_NN_rel", "type": "hist"}],
    [{"name": "n_pos_VB", "type": "box"}, {"name": "n_pos_VB_rel", "type": "hist"}],
    [{"name": "n_pos_WDT", "type": "box"}, {"name": "n_pos_WDT_rel", "type": "hist"}],
])

# sel_columns.extend(
#     [{"name": "n_ner_" + x, "type": "box"}, {"name": "n_ner_" + x + "_rel", "type": "hist"}]
#     for x in sel_ner if "n_ner_" + x in all_columns
# )
sel_columns.extend([
    [{"name": "n_ner_DATE", "type": "box"}, {"name": "n_ner_DATE_rel", "type": "hist"}],
    [{"name": "n_ner_GPE", "type": "box"}, {"name": "n_ner_GPE_rel", "type": "hist"}],
    [{"name": "n_ner_LAW", "type": "box"}, {"name": "n_ner_LAW_rel", "type": "hist"}],
])

plot_df_stats(
    df,
    sel_columns,
    labels=labels,
    colors=["darkorchid", "forestgreen"],
    height=8000, width=1000
)

We repeat the process, this time adding interesting [Textstat](https://pypi.org/project/textstat/) and [LexicalRichness](https://lexicalrichness.readthedocs.io/) metrics.

Most of the visualized metrics turn out to be quite similar between the two classes.
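
As a rough illustration of what the type-token ratio family measures, the sketch below computes TTR, root TTR and corrected TTR by hand. The `lexical_richness` helper is hypothetical, and its tokenization (lowercased alphabetic runs) is a simplifying assumption that may differ from what the LexicalRichness library does.

```python
import math
import re


def lexical_richness(text):
    """Hand-computed type-token ratios on a naively tokenized text."""
    # Simplified tokenization (an assumption): lowercase alphabetic runs only
    words = re.findall(r"[a-z]+", text.lower())
    n_words, n_unique = len(words), len(set(words))
    return {
        "ttr": n_unique / n_words,                   # unique words / total words
        "rttr": n_unique / math.sqrt(n_words),       # root TTR
        "cttr": n_unique / math.sqrt(2 * n_words),   # corrected TTR
    }


scores = lexical_richness("The cat saw the dog")
print(scores)
```

With 4 unique words out of 5, the plain TTR is 0.8; the root and corrected variants dampen the dependence on text length.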

In [None]:
# Textstat metrics https://pypi.org/project/textstat/
textstat_metrics = [
    'flesch_reading_ease', 'flesch_kincaid_grade', 'smog_index', 'coleman_liau_index',
    'automated_readability_index', 'dale_chall_readability_score', 'difficult_words',
    'linsear_write_formula', 'gunning_fog', 'fernandez_huerta', 'szigriszt_pazos', 
    'gutierrez_polini', 'crawford', 'gulpease_index', 'osman'
]
# Each metric name matches the textstat function that computes it
for metric in textstat_metrics:
    df[metric] = df["text"].apply(getattr(textstat, metric))

# Lexicalrichness metrics https://lexicalrichness.readthedocs.io/
lexrich_metrics = [
    'ttr', 'rttr', 'cttr'
]
# Each metric name matches a LexicalRichness attribute
for metric in lexrich_metrics:
    df[metric] = df["text"].apply(lambda x: getattr(LexicalRichness(x), metric))

print("Deleting empty columns")
print(df.columns[(df == 0).all(axis=0)].tolist())
df = df.loc[:, (df != 0).any(axis=0)]

print(f"Deleting columns with less than {limit_n_samples} samples")
print(df.columns[df.astype(bool).sum(axis=0) <= limit_n_samples].tolist())
df = df.loc[:, df.astype(bool).sum(axis=0) > limit_n_samples]

# Defragment dataset
all_columns = df.columns.tolist()
df = df.copy()

In [None]:
sel_columns = []

sel_columns.extend([
    [{"name": x, "type": "box"}] for x in textstat_metrics if x in all_columns
])
sel_columns.extend([
    [{"name": x, "type": "box"}] for x in lexrich_metrics if x in all_columns
])

plot_df_stats(
    df,
    sel_columns,
    labels=labels,
    colors=["darkorchid", "forestgreen"],
    height=6000, width=1000
)

### Bag of Words-based models and other approaches

#### Only Bag of Words text vectorization

We load the English dataset for the first task.

In [None]:
# Import main dataset
df = pd.read_csv("../AUTEXTIFICATION/subtask_1/train.tsv", sep="\t")
df = df.drop("id", axis=1)

df

We define a preprocessing function that:
- converts all of the data to lowercase
- applies a regex to remove punctuation
- vectorizes all of the words using a given vectorizer

and we preprocess our data, splitting it into train and test sets utilizing the relevant helper function defined in [Preliminary initialization](#Preliminary-initialization).
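
The text cleaning part of these steps can be sketched on a single string as follows. The `clean` helper is hypothetical and leaves the vectorization step out; it only shows the effect of lowercasing, punctuation removal and whitespace collapsing.

```python
import re
import string

# Character class matching any ASCII punctuation mark (escaped to be regex-safe)
PUNCT_REGEX = re.compile("[" + re.escape(string.punctuation) + "]")
# Two or more consecutive spaces
MULTISPACE_REGEX = re.compile(r" {2,}")


def clean(text):
    """Lowercase, replace punctuation with spaces, collapse repeated spaces."""
    text = text.lower()
    text = PUNCT_REGEX.sub(" ", text)
    return MULTISPACE_REGEX.sub(" ", text)


cleaned = clean("Don't stop -- it's GREAT")
print(cleaned)
```

Replacing punctuation with a space rather than deleting it splits contractions such as "don't" into separate tokens.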

For most of the simple BoW approaches, we have found that utilizing a **TfidfVectorizer** leads to no particular improvement w.r.t. a **CountVectorizer**; in addition, setting a minimum document frequency of 4 is a good compromise between number of parameters and performance.

Interestingly enough, keeping stopwords instead of removing them leads to slightly better results; this could be due to the fact that the difference in their usage is statistically relevant for the two classes.
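
To illustrate how a minimum document frequency trims the vocabulary, here is a simplified pure-Python stand-in for a count-based vectorizer. The `build_vocabulary` and `count_vector` helpers are hypothetical, use whitespace tokenization and single words only, and are not the scikit-learn implementation.

```python
from collections import Counter


def build_vocabulary(documents, min_df=1):
    """Keep only the terms that appear in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


def count_vector(document, vocabulary):
    """Bag of Words vector: occurrences of each vocabulary term in the document."""
    counts = Counter(document.split())
    return [counts[term] for term in vocabulary]


docs = ["the cat sat", "the dog sat", "the bird flew"]
vocab = build_vocabulary(docs, min_df=2)
vector = count_vector("the dog and the cat sat", vocab)
print(vocab, vector)
```

Raising `min_df` shrinks the number of features, which is the trade-off between number of parameters and performance discussed above.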

In [None]:
def preprocess(data, lower=True, vectorizer=None, fit=True):
    # Convert all text to lowercase
    if lower:
        data = [x.lower() for x in data]

    # Remove punctuation and reset multiple spaces to one
    # (re.escape keeps characters such as "]", "^" and "\" literal inside the class)
    punct_regex = re.compile("[" + re.escape(string.punctuation + "’") + "]")
    whitespace_regex = re.compile(" {2,}")
    data = [whitespace_regex.sub(" ", punct_regex.sub(" ", x)) for x in data]
    
    # Vectorize
    if vectorizer:
        if fit:
            data = vectorizer.fit_transform(data)
        else:
            data = vectorizer.transform(data)
    
    return data

In [None]:
vectorizer = TfidfVectorizer(min_df=4, max_df=0.6, ngram_range=(2,2))

x, y = df["text"], df["label"]
x_train, x_val, x_test, y_train, y_val, y_test = split(
    x, y, test_size=0.2, val_size=0.0, seed=seed
)

x_train = preprocess(x_train, vectorizer=vectorizer)
x_test = preprocess(x_test, vectorizer=vectorizer, fit=False)

We define some simple models to try out the plain Bag of Words approach, without any additional data; the models that we are going to use are:
- Multinomial Naive Bayes
- Logistic Regression
- C-Support SVM
- Decision Tree
- Random Forest
- Extra Tree classifier

All of the previous models are run using 5-fold cross-validation on a small grid search around some of their default parameters.

To perform training and evaluation we have used the relevant helper functions defined in [Preliminary initialization](#Preliminary-initialization).
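
For reference, a fitted `GridSearchCV` object exposes the selected hyperparameters and the mean cross-validated score. The following is a minimal sketch on a toy corpus, independent of the notebook's helper functions; the texts and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Toy data (made up): 1 = "generated", 0 = "human"
texts = [
    "the model generates fluent text", "i wrote this myself yesterday",
    "the model generates more text", "i wrote a letter myself",
    "generated text is fluent", "yesterday i wrote a note",
    "the model is fluent", "myself and i wrote this",
    "fluent generated text again", "a note i wrote yesterday",
]
targets = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

features = CountVectorizer().fit_transform(texts)
grid = {"alpha": [0.01, 0.1, 1]}

# 5-fold cross-validated search over the smoothing parameter
search = GridSearchCV(MultinomialNB(), grid, cv=5, scoring="f1_micro")
search.fit(features, targets)

print(search.best_params_)   # hyperparameters with the best mean CV score
print(search.best_score_)    # mean micro-F1 across the 5 folds
```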

In [None]:
models = []
usecached = False

# Naive Bayes
nb = MultinomialNB()
nb_param = {"alpha":[0.001, 0.01, 0.1, 1, 10, 100]}
nb_clf = GridSearchCV(nb, nb_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Naive Bayes",
    "model": nb_clf,
})

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr_param = [{
    "solver": ["liblinear"], 
    "penalty": ["l1", "l2"],
    "C":[0.01, 0.1, 1, 10]
},{
    "solver": ("lbfgs", "sag", "saga"), 
    "penalty": ["l2"],
    "C":[0.01, 0.1, 1]
}]
lr_clf = GridSearchCV(lr, lr_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Logistic Regression",
    "model": lr_clf,
    "subsample": 0.7,
})

# SVC
svc = SVC()
svc_param = {"kernel": ["rbf"], "C": [0.1, 1, 10]}
svc_clf = GridSearchCV(svc, svc_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "SVC",
    "model": svc_clf,
    "subsample": 0.7,
})

# Decision Tree
dtree = DecisionTreeClassifier()
dtree_param = {
    "criterion": ["gini", "entropy"], 
    # "min_samples_split": [2, 4, 8],
    # "min_samples_leaf": [1, 2, 4],
    "max_features": [None, "sqrt", "log2"],
}
dtree_clf =  GridSearchCV(dtree, dtree_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Decision Tree",
    "model": dtree_clf,
})

# Random Forest
rf = RandomForestClassifier()
rf_param = {
    "criterion": ["gini", "entropy"],
}
rf_clf =  GridSearchCV(rf, rf_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Random Forest",
    "model": rf_clf,
    "subsample": 0.6,
})

# Extra Trees
et = ExtraTreesClassifier()
et_param = {
    "criterion": ["gini", "entropy"],
}
et_clf =  GridSearchCV(et, et_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Extra Trees",
    "model": et_clf,
    "subsample": 0.6,
})

In [None]:
train_cv_models(models, x_train, y_train)
test_cv_models(models, x_test, y_test)

#### Including additional information

We run the *en_core_web_sm* SpaCy pipeline that was used in the [Data Exploration section](#Data-Exploration) on the dataset.

We also define a custom function to extract useful features from the tokenized text and we apply it to the dataset.

In [None]:
def extract_features(tree):
    features = []
    for token in tree:
        lemma = token.lemma_
        pos_tag = token.pos_
        dep_lab = token.dep_
        head = token.head
        if token.i < head.i:
            direction = "l"
        else:
            direction = "r"
        dfr = len(list(token.ancestors))
        # if not token.is_stop:
        features.append({
            "lem": lemma ,
            "pos": pos_tag, 
            "dep": dep_lab, 
            "head": head, 
            "dir": direction, 
            "dfr": dfr
        })
    return features
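
The `dfr` feature counts the ancestors of a token, i.e. its depth in the dependency tree. The following sketch reproduces the same quantity on a hand-written head array, without SpaCy; the sentence, the head indices and the function name are illustrative assumptions, not parser output.

```python
def distance_from_root(heads, i):
    """Number of ancestors of token `i`; the root is its own head (as in SpaCy)."""
    depth = 0
    while heads[i] != i:
        i = heads[i]
        depth += 1
    return depth

# "The quick fox jumps": hand-annotated heads (illustrative only)
words = ["The", "quick", "fox", "jumps"]
heads = [2, 2, 3, 3]  # "jumps" is the root, so it points to itself

depths = [distance_from_root(heads, i) for i in range(len(words))]
avg_dfr = sum(depths) / len(depths)
print(depths, avg_dfr)
```

Averaging these depths over all tokens of a text gives a single number describing how deeply nested its syntax is.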

In [None]:
# Run SpaCy NLP pipeline on dataset
parsed_df = df.copy()
parsed_df["text"] = df["text"].apply(lambda x: nlp_model(x))

# Extract useful features
parsed_df["features"] = parsed_df["text"].apply(lambda x: extract_features(x))

parsed_df

Here is an example of a dependency parse tree extracted from a sentence.

In [None]:
displacy.render(parsed_df["text"][0], jupyter=True, style='dep')

First of all, we exploit the **lemmatization** provided by the SpaCy pipeline to check whether it yields better results than our rudimentary Bag of Words approach, which uses no form of stemming.

The same considerations mentioned previously about the vectorization of text apply.

In [None]:
vectorizer = TfidfVectorizer(min_df=4, max_df=0.6, ngram_range=(2,2))
parsed_df["text_lem"] = parsed_df["features"].apply(lambda x: " ".join([t["lem"] for t in x]))

x, y = parsed_df["text_lem"], parsed_df["label"]
x_train, x_val, x_test, y_train, y_val, y_test = split(
    x, y, test_size=0.2, val_size=0.0, seed=seed
)

x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)

This time we only try Naive Bayes, Logistic Regression and an SVM, following the previous cross-validation approach.

In [None]:
models = []

# Naive Bayes
nb = MultinomialNB()
nb_param = {"alpha":[0.001, 0.01, 0.1, 1, 10, 100]}
nb_clf = GridSearchCV(nb, nb_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Naive Bayes",
    "model": nb_clf,
})

# Logistic Regression
lr = LogisticRegression(max_iter=100000)
lr_param = [{
    "solver": ["liblinear"], 
    "penalty": ["l1", "l2"],
    "C":[0.01, 0.1, 1, 10]
},{
    "solver": ("lbfgs", "sag", "saga"), 
    "penalty": ["l2"],
    "C":[0.01, 0.1, 1]
}]
lr_clf = GridSearchCV(lr, lr_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Logistic Regression",
    "model": lr_clf,
    "subsample": 0.8,
})

# SVC
svc = SVC()
svc_param = {"kernel": ["rbf"], "C": [0.1, 1, 10]}
svc_clf = GridSearchCV(svc, svc_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "SVC", 
    "model": svc_clf, 
    "subsample": 0.7,
})

In [None]:
train_cv_models(models, x_train, y_train)
test_cv_models(models, x_test, y_test)

Comparing the accuracies, there is a slight improvement for the *SVM*, but overall the results are very similar to those obtained without lemmatization.
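
The grid searches are scored with `f1_micro`, while the discussion refers to accuracy. For single-label classification the two coincide, because every misclassified sample counts as exactly one false positive and one false negative. A small sketch with made-up labels illustrates this; the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes before computing F1."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        tp += sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp += sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn += sum(t == c and p != c for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn)

# Made-up labels for illustration
y_true_demo = ["human", "generated", "generated", "human", "generated"]
y_pred_demo = ["human", "generated", "human", "human", "human"]

acc = accuracy(y_true_demo, y_pred_demo)
f1 = micro_f1(y_true_demo, y_pred_demo)
print(acc, f1)
```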

Next, we integrate two additional features into our models: the POS tags and, for each text, the average distance of its words from the root of their dependency tree (ADFR). The POS tags are vectorized with a **TfidfVectorizer** (we noticed that a **CountVectorizer** produces slightly better results for tree-based methods), while the ADFR is normalized with a **MinMaxScaler**.

The resulting feature blocks are then concatenated into a single vector, and we apply the same models using our established cross-validation approach.
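
The vectorization cell relies on per-feature train/test splits (`x_train_lem`, `x_train_pos`, `x_train_avgdfr`). The following is a minimal sketch of how such inputs could be derived from the extracted features. It assumes scikit-learn's `train_test_split` as a stand-in for the notebook's `split` helper, and the feature rows are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for parsed_df["features"]: (lemma, POS tag, distance from root)
rows = [
    [("the", "DET", 2), ("cat", "NOUN", 1), ("sleep", "VERB", 0)],
    [("dog", "NOUN", 1), ("bark", "VERB", 0)],
    [("it", "PRON", 1), ("rain", "VERB", 0)],
    [("bird", "NOUN", 1), ("sing", "VERB", 0), ("loudly", "ADV", 1)],
    [("we", "PRON", 1), ("read", "VERB", 0)],
]
features = pd.Series(
    [[{"lem": lem, "pos": pos, "dfr": dfr} for lem, pos, dfr in row] for row in rows]
)
y = pd.Series([0, 1, 0, 1, 0])

# One string of lemmas, one string of POS tags and one average depth per text
x_lem = features.apply(lambda x: " ".join(t["lem"] for t in x))
x_pos = features.apply(lambda x: " ".join(t["pos"] for t in x))
x_avgdfr = features.apply(lambda x: sum(t["dfr"] for t in x) / len(x))

# Splitting all views in a single call keeps the rows aligned across views
(x_train_lem, x_test_lem, x_train_pos, x_test_pos,
 x_train_avgdfr, x_test_avgdfr, y_train, y_test) = train_test_split(
    x_lem, x_pos, x_avgdfr, y, test_size=0.2, random_state=42
)
```

Keeping the rows aligned matters because the three blocks are later stacked column-wise, so row `i` of each block must describe the same text.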

In [None]:
# Vectorize lemmas
vectorizer = TfidfVectorizer(min_df=4, ngram_range=(2,2))
x_train_lem = vectorizer.fit_transform(x_train_lem)
x_test_lem = vectorizer.transform(x_test_lem)

# Vectorize POS tags
vectorizer = TfidfVectorizer(min_df=4)
x_train_pos = vectorizer.fit_transform(x_train_pos)
x_test_pos = vectorizer.transform(x_test_pos)

# Normalize average distance from root
mms = MinMaxScaler()

x_train_avgdfr = mms.fit_transform(x_train_avgdfr.values.reshape(-1, 1))
x_test_avgdfr = mms.transform(x_test_avgdfr.values.reshape(-1, 1))

x_train_avgdfr = sps.csr_matrix(x_train_avgdfr)
x_test_avgdfr = sps.csr_matrix(x_test_avgdfr)

# Concatenate vectors
x_train = sps.hstack([x_train_lem, x_train_pos, x_train_avgdfr])
x_test = sps.hstack([x_test_lem, x_test_pos, x_test_avgdfr])

In [None]:
models = []

# Naive Bayes
nb = MultinomialNB()
nb_param = {"alpha":[0.001, 0.01, 0.1, 1, 10, 100]}
nb_clf = GridSearchCV(nb, nb_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Naive Bayes", 
    "model": nb_clf,
})

# Logistic Regression
lr = LogisticRegression(max_iter=100000)
lr_param = [{
    "solver": ["liblinear"], 
    "penalty": ["l1", "l2"],
    "C":[0.01, 0.1, 1, 10]
},{
    "solver": ("lbfgs", "sag", "saga"), 
    "penalty": ["l2"],
    "C":[0.01, 0.1, 1]
}]
lr_clf = GridSearchCV(lr, lr_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Logistic Regression",
    "model": lr_clf,
    "subsample": 0.8,
})

# SVC
svc = SVC()
svc_param = {"kernel": ["rbf"], "C": [0.1, 1, 10]}
svc_clf = GridSearchCV(svc, svc_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "SVC", 
    "model": svc_clf,
    "subsample": 0.7,
})

# Decision Tree
dtree = DecisionTreeClassifier()
dtree_param = {
    "criterion": ["gini", "entropy"], 
    "max_features": [None, "sqrt", "log2"],
}
dtree_clf =  GridSearchCV(dtree, dtree_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Decision Tree",
    "model": dtree_clf,
})

# Random Forest
rf = RandomForestClassifier()
rf_param = {
    "criterion": ["gini", "entropy"], 
}
rf_clf =  GridSearchCV(rf, rf_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Random Forest",
    "model": rf_clf,
    "subsample": 0.6,
    "usecached": usecached,
})

# Extra Trees
et = ExtraTreesClassifier()
et_param = {
    "criterion": ["gini", "entropy"], 
}
et_clf =  GridSearchCV(et, et_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Extra Trees",
    "model": et_clf,
    "subsample": 0.6,
    "usecached": usecached,
})

In [None]:
train_cv_models(models, x_train, y_train)
test_cv_models(models, x_test, y_test)

As we can see, the addition of these new features significantly improved the performance of the more complex models, while the simpler models either matched the accuracy of the previous experiments or performed worse.

#### Other approaches

##### Neural network classifier

Another approach we experimented with was a neural network classifier built from scratch in *PyTorch*, applied to the Bag of Words representation (both augmented and non-augmented).

The results were underwhelming given the training time and resources spent, since in every scenario it performed similarly to the SVM.

This approach is not presented here, but the source code used to implement the classifier can be found in the [corresponding section](#Preliminary-initialization).

##### SVM on extracted features

By training an SVM **only** on the features obtained in the [data exploration section](#Data-Exploration), it is possible to reach an accuracy of around 80%.

This result is particularly interesting since no direct text information is given to the classifier, only derived statistics and metrics.

### Text embedding approaches

In [None]:
# TODO

### Transformer-based models

In [None]:
# TODO

## Task 2

The challenge is subdivided into two main tasks; the second task is a **Multinomial Classification** task that aims at identifying the specific *language model* that generated a given text passage, choosing among 6 different models labeled A, B, C, D, E and F.

In [None]:
# Classification labels for Task 2
labels = ["A", "B", "C", "D", "E", "F"]

### Data Exploration

In [None]:
# TODO

### Bag of Words-based models and other approaches

#### Only Bag of Words text vectorization

We load the English dataset for the second task.

In [None]:
# Import main dataset
df = pd.read_csv("../AUTEXTIFICATION/subtask_2/train.tsv", sep="\t")
df = df.drop("id", axis=1)

df

We reuse the preprocessing function defined for Task 1 to preprocess our data, which we then split into train and test sets.

All of the previous considerations still apply.

In [None]:
def preprocess(data, lower=True, vectorizer=None, fit=True):
    # Convert all text to lowercase
    if lower:
        data = [x.lower() for x in data]

    # Remove punctuation and reset multiple spaces to one
    # (re.escape keeps characters such as "\", "]" and "-" literal inside the class)
    punct_regex = re.compile("[" + re.escape(string.punctuation) + "’]")
    whitespace_regex = re.compile(" ( )+")
    data = [whitespace_regex.sub(" ", punct_regex.sub(" ", x)) for x in data]
    
    # Vectorize
    if vectorizer:
        if fit:
            data = vectorizer.fit_transform(data)
        else:
            data = vectorizer.transform(data)
    
    return data

In [None]:
vectorizer = TfidfVectorizer(min_df=4, max_df=0.6, ngram_range=(2,2))

x, y = df["text"], df["label"]
x_train, x_val, x_test, y_train, y_val, y_test = split(
    x, y, test_size=0.2, val_size=0.0, seed=seed
)

x_train = preprocess(x_train, vectorizer=vectorizer)
x_test = preprocess(x_test, vectorizer=vectorizer, fit=False)

We define some simple models to try out the simple Bag of Words approach, without any additional data; the models that we are going to use are:
- Multinomial Naive Bayes
- Logistic Regression
- C-Support SVM
- Extra Tree classifier

All of the previous models are run using 5-fold cross-validation on a small grid search around some of their default parameters.

In [None]:
models = []

# Naive Bayes
nb = MultinomialNB()
nb_param = {"alpha":[0.001, 0.01, 0.1, 1, 10, 100]}
nb_clf = GridSearchCV(nb, nb_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Naive Bayes",
    "model": nb_clf,
})

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr_param = [{
    "solver": ["liblinear"], 
    "penalty": ["l1", "l2"],
    "C":[0.01, 0.1, 1, 10]
},{
    "solver": ("lbfgs", "sag", "saga"), 
    "penalty": ["l2"],
    "C":[0.01, 0.1, 1]
}]
lr_clf = GridSearchCV(lr, lr_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Logistic Regression",
    "model": lr_clf,
    "subsample": 0.9,
})

# SVC
svc = SVC()
svc_param = {"kernel": ["rbf"], "C": [0.1, 1, 10]}
svc_clf = GridSearchCV(svc, svc_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "SVC",
    "model": svc_clf,
    "subsample": 0.6,
})

# Extra Trees
et = ExtraTreesClassifier()
et_param = {
    "criterion": ["gini", "entropy"],
}
et_clf =  GridSearchCV(et, et_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Extra Trees",
    "model": et_clf,
    "subsample": 0.6,
})

In [None]:
train_cv_models(models, x_train, y_train)
test_cv_models(models, x_test, y_test)

The results show much worse performance than in Task 1. This is expected, as Task 2 is inherently more difficult due to the larger number of classes and the greater similarity between them.

Our simple Bag of Words classifiers still have an edge against a random classifier since all of them can abundantly surpass the 1/6 ≈ 17% baseline.
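
The 1/6 figure is the expected accuracy of a classifier that guesses uniformly at random among the six labels, and it holds regardless of how the classes are distributed. A short sketch of the arithmetic follows, using hypothetical class priors that are made up and do not reflect the real dataset; the function name is illustrative.

```python
def uniform_guess_accuracy(class_priors):
    """Expected accuracy when guessing uniformly at random among the classes."""
    k = len(class_priors)
    # A sample of class c is guessed correctly with probability 1/k
    return sum(p * (1 / k) for p in class_priors.values())

# Hypothetical, imbalanced class priors (NOT the dataset's real distribution)
priors = {"A": 0.10, "B": 0.15, "C": 0.15, "D": 0.20, "E": 0.20, "F": 0.20}

baseline = uniform_guess_accuracy(priors)
majority = max(priors.values())  # accuracy of always predicting the largest class
print(baseline, majority)
```

If the classes were noticeably imbalanced, the majority-class accuracy would be the stricter baseline to compare against.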

The confusion matrices confirm the similarities within the groups A, B, C and D, E, F; in addition, we observe that F is much easier to detect than the other classes.
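
For readers who want to reproduce this kind of analysis, the sketch below builds a confusion matrix and per-class recall in plain Python. The predictions are made up for illustration and do not reproduce the notebook's results; the function names are illustrative.

```python
from collections import Counter

class_names = ["A", "B", "C", "D", "E", "F"]

def confusion_matrix(y_true, y_pred, names):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in names] for t in names]

def per_class_recall(matrix):
    """Diagonal entry divided by the row total (0.0 for classes with no samples)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Made-up predictions for illustration only
y_true_demo = ["A", "A", "B", "C", "D", "E", "F", "F"]
y_pred_demo = ["A", "B", "B", "A", "E", "E", "F", "F"]

matrix = confusion_matrix(y_true_demo, y_pred_demo, class_names)
recall = per_class_recall(matrix)
for name, row in zip(class_names, matrix):
    print(name, row)
```

Off-diagonal mass concentrated inside a block of rows and columns is what indicates that a group of classes is being confused with one another.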

#### Including additional information

We run the *en_core_web_sm* SpaCy pipeline and define the custom feature extraction function, both of which were used in the previous sections.

This time we skip directly to the application of augmentation features, without dwelling on the benefits of lemmatization.

The techniques applied through this section are the same as those applied for Task 1.

In [None]:
def extract_features(tree):
    features = []
    for token in tree:
        lemma = token.lemma_
        pos_tag = token.pos_
        dep_lab = token.dep_
        head = token.head
        if token.i < head.i:
            direction = "l"
        else:
            direction = "r"
        dfr = len(list(token.ancestors))
        # if not token.is_stop:
        features.append({
            "lem": lemma ,
            "pos": pos_tag, 
            "dep": dep_lab, 
            "head": head, 
            "dir": direction, 
            "dfr": dfr
        })
    return features

In [None]:
# Run SpaCy NLP pipeline on dataset
parsed_df = df.copy()
parsed_df["text"] = df["text"].apply(lambda x: nlp_model(x))

# Extract useful features
parsed_df["features"] = parsed_df["text"].apply(lambda x: extract_features(x))

parsed_df

In [None]:
x_lem = parsed_df["features"].apply(lambda x: " ".join([t["lem"] for t in x]))
x_pos = parsed_df["features"].apply(lambda x: " ".join([t["pos"] for t in x]))
x_avgdfr = parsed_df["features"].apply(lambda x: sum(t["dfr"] for t in x) / len(x))
y = parsed_df["label"]

x_train_lem, x_val_lem, x_test_lem, y_train, y_val, y_test = split(
    x_lem, y, test_size=0.2, val_size=0.0, seed=seed
)
x_train_pos, x_val_pos, x_test_pos, y_train, y_val, y_test = split(
    x_pos, y, test_size=0.2, val_size=0.0, seed=seed
)
x_train_avgdfr, x_val_avgdfr, x_test_avgdfr, y_train, y_val, y_test = split(
    x_avgdfr, y, test_size=0.2, val_size=0.0, seed=seed
)

In [None]:
# Vectorize lemmas
vectorizer = TfidfVectorizer(min_df=4, ngram_range=(2,2))
x_train_lem = vectorizer.fit_transform(x_train_lem)
x_test_lem = vectorizer.transform(x_test_lem)

# Vectorize POS tags
vectorizer = TfidfVectorizer(min_df=4)
x_train_pos = vectorizer.fit_transform(x_train_pos)
x_test_pos = vectorizer.transform(x_test_pos)

# Normalize average distance from root
mms = MinMaxScaler()

x_train_avgdfr = mms.fit_transform(x_train_avgdfr.values.reshape(-1, 1))
x_test_avgdfr = mms.transform(x_test_avgdfr.values.reshape(-1, 1))

x_train_avgdfr = sps.csr_matrix(x_train_avgdfr)
x_test_avgdfr = sps.csr_matrix(x_test_avgdfr)

# Concatenate vectors
x_train = sps.hstack([x_train_lem, x_train_pos, x_train_avgdfr])
x_test = sps.hstack([x_test_lem, x_test_pos, x_test_avgdfr])

In [None]:
models = []

# Naive Bayes
nb = MultinomialNB()
nb_param = {"alpha":[0.001, 0.01, 0.1, 1, 10, 100]}
nb_clf = GridSearchCV(nb, nb_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Naive Bayes", 
    "model": nb_clf,
})

# Logistic Regression
lr = LogisticRegression(max_iter=100000)
lr_param = [{
    "solver": ["liblinear"], 
    "penalty": ["l1", "l2"],
    "C":[0.01, 0.1, 1, 10]
},{
    "solver": ("lbfgs", "sag", "saga"), 
    "penalty": ["l2"],
    "C":[0.01, 0.1, 1]
}]
lr_clf = GridSearchCV(lr, lr_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Logistic Regression",
    "model": lr_clf,
    "subsample": 0.8,
})

# SVC
svc = SVC()
svc_param = {"kernel": ["rbf"], "C": [0.1, 1, 10]}
svc_clf = GridSearchCV(svc, svc_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "SVC", 
    "model": svc_clf,
    "subsample": 0.7,
})

# Decision Tree
dtree = DecisionTreeClassifier()
dtree_param = {
    "criterion": ["gini", "entropy"], 
    "max_features": [None, "sqrt", "log2"],
}
dtree_clf =  GridSearchCV(dtree, dtree_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Decision Tree",
    "model": dtree_clf,
})

# Random Forest
rf = RandomForestClassifier()
rf_param = {
    "criterion": ["gini", "entropy"], 
}
rf_clf =  GridSearchCV(rf, rf_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Random Forest",
    "model": rf_clf,
    "subsample": 0.6,
})

# Extra Trees
et = ExtraTreesClassifier()
et_param = {
    "criterion": ["gini", "entropy"], 
}
et_clf =  GridSearchCV(et, et_param, cv=5, scoring="f1_micro", verbose=1)
models.append({
    "name": "Extra Trees",
    "model": et_clf,
    "subsample": 0.6,
})

In [None]:
train_cv_models(models, x_train, y_train)
test_cv_models(models, x_test, y_test)

We can observe that for Task 2 the improvements from adding this additional information are much less evident (although still present), even for complex models.

### Text embedding approaches

In [None]:
# TODO

### Transformer-based models

In [None]:
# TODO

## Conclusions

In [None]:
# TODO