# Natural Language Processing Project
## NLP Course @ Politecnico di Milano 2022/2023 - Prof. Mark Carman
### Topic 4: Autextification

### Group: Residual Sum of Students
- Raul Singh
- Davide Rigamonti
- Francesco Tosini
- Enrico Zuccolotto

## Introduction

The dataset consists of *short text passages* that have either been written by a *human* or have been generated automatically by a *language model*. More information can be found on the [official site](https://sites.google.com/view/autextification).

The original task will take place as part of [IberLEF 2023](http://sepln2023.sepln.org/en/iberlef-en/), the 5th Workshop on Iberian Languages Evaluation Forum at the SEPLN 2023 Conference, which will be held in Jaén, Spain on the 26th of September, 2023.

Given the scope of the original challenge, the dataset contains two separate sets of samples, one in **English** and the other in **Spanish**; our main focus will be on the **English** dataset.

We will treat the two tasks **separately** because their goals differ. Some similarities can still be traced between the two, so most of the approaches we consider will be symmetrical and yield homogeneous results.

Each task opens with a brief **data exploration** section. We then apply models and approaches seen in the course (with the introduction of some novelties), starting from the most basic techniques based on the **Bag of Words representation**, moving on to approaches that use **Word Embeddings**, and finally reaching the *state-of-the-art* **Transformer** models.
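As a rough illustration of the Bag of Words representation, the sketch below builds count vectors using only the standard library. The function name `bag_of_words`, the whitespace tokenization, and the toy passages are assumptions made for this example; they are not taken from the dataset or from the models used later in the notebook.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count vectors) for whitespace-tokenized texts."""
    tokenized = [text.lower().split() for text in texts]
    # The vocabulary is the sorted set of all tokens seen in the corpus
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One dimension per vocabulary entry, holding the token count
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors


# Hypothetical toy passages, used only to show the shape of the output
vocabulary, vectors = bag_of_words(["the cat sat", "the dog sat on the mat"])
print(vocabulary)
print(vectors)
```

Word order is discarded and only token frequencies survive, which is why this representation serves as the simplest baseline.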

## Preliminary initialization

This section contains all the library imports, helper function initialization calls and global variable definitions.

### Imports

#### Utilized libraries
The following dependencies are needed to run the notebook:
```
pip install scikit-learn~=1.2.2
pip install torch~=2.0.0
pip install matplotlib~=3.7.1
pip install plotly~=5.14.1
pip install nltk~=3.8.1
pip install spacy~=3.5.1
pip install textstat~=0.7.3
pip install numpy~=1.24.2
pip install pandas~=2.0.0
pip install python-terrier~=0.9.2
pip install scipy~=1.9.3
pip install gensim~=4.3.1
pip install lexicalrichness~=0.5.0
pip install sentence-transformers~=2.2.2
pip install transformers~=4.28.1
pip install datasets~=2.12.0
pip install evaluate~=0.4.0
```

#### Python standard library

In [None]:
import re
import os
import sys
import abc
import random
import string

import copy as cp
import array as arr

from collections import Counter

#### Scikit-learn

In [None]:
from sklearn import metrics

from sklearn.preprocessing import LabelBinarizer, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

#### PyTorch

In [None]:
import torch

import torch.nn as nn
import torch.optim as opt
import torch.utils.data as dt

#### Plotting

In [None]:
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.graph_objects as go

from plotly.subplots import make_subplots

#### Various

In [None]:
import nltk
import spacy
import textstat

import evaluate as hf_ev
import numpy as np
import pandas as pd
import pyterrier as pt
import scipy.sparse as sps

import gensim.downloader as api

from spacy import displacy
from nltk.corpus import stopwords
from pandas.core.common import flatten
from datasets import Dataset, DatasetDict
from gensim.models.word2vec import Word2Vec
from lexicalrichness import LexicalRichness
from sentence_transformers import SentenceTransformer, util
from transformers import TrainingArguments, Trainer, AutoTokenizer, AutoModelForSequenceClassification

from nlp_project.notebook_utils import (
    compact_split,
    evaluate,
    split,
    save_scikit_model,
    load_scikit_model,
)
from nlp_project.nn_utils import init_gpu
from nlp_project.nn_classifier import Data, Classifier
from nlp_project.nn_extra import EarlyStopping, AdaptLR

### Helper functions

#### PyTorch models

In [None]:
history_metrics = {
    "epoch": {},
    "loss": {"order": -1},
    "val_loss": {"order": -1},
    "acc": {"order": +1},
    "val_acc": {"order": +1},
}

In [None]:
class StopNNLoop(BaseException):
    pass

def build_history_string(history_point):
    epoch = history_point["epoch"]
    metrics_string = " ".join(
        [f"{k}: {history_point[k]:.7f}" for k in history_point if not k == "epoch"]
    )
    return f"Epoch {epoch} -- " + metrics_string


def compare_equal_models(model_1, model_2):
    models_differ = 0
    for key_item_1, key_item_2 in zip(
        model_1.state_dict().items(), model_2.state_dict().items()
    ):
        if torch.equal(key_item_1[1], key_item_2[1]):
            pass
        else:
            models_differ += 1
            if key_item_1[0] == key_item_2[0]:
                print("Mismatch found at", key_item_1[0])
            else:
                raise ValueError("Models have different state_dict keys")
    if models_differ == 0:
        return True
    return False


# Returns true if a is "better" than b following the metric
def compare_metric(metric, a, b, delta=0):
    if a == b:
        return False
    if history_metrics[metric]["order"] == +1:
        return a > b + delta
    return a < b - delta


# Initializes lowest possible value given a metric
def init_lowest(metric):
    return -np.inf if history_metrics[metric]["order"] == +1 else np.inf


def init_gpu(gpu="cuda:0"):
    return torch.device(gpu if torch.cuda.is_available() else "cpu")
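The `order` field in `history_metrics` decides which direction counts as an improvement. The sketch below repeats the two helpers in a self-contained form (with `float("inf")` in place of `np.inf`, and a reduced metric table) to show how they behave on a few hypothetical values.

```python
# Direction of improvement per metric: -1 means lower is better, +1 higher
history_metrics = {
    "loss": {"order": -1},
    "acc": {"order": +1},
}


def compare_metric(metric, a, b, delta=0):
    """Return True if a is better than b by more than delta."""
    if a == b:
        return False
    if history_metrics[metric]["order"] == +1:
        return a > b + delta
    return a < b - delta


def init_lowest(metric):
    """Return the worst possible starting value for a metric."""
    return float("-inf") if history_metrics[metric]["order"] == +1 else float("inf")


# Hypothetical values, chosen only to exercise each branch
loss_improved = compare_metric("loss", 0.30, 0.35)  # lower loss is better
acc_improved = compare_metric("acc", 0.80, 0.85)  # lower accuracy is worse
small_gain = compare_metric("acc", 0.86, 0.85, delta=0.05)  # gain below delta
first_epoch = compare_metric("loss", 0.9, init_lowest("loss"))  # beats +inf
print(loss_improved, acc_improved, small_gain, first_epoch)
```

Starting from `init_lowest` guarantees that the first recorded value always counts as an improvement, which is what the callbacks rely on.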

In [None]:
class Data(dt.Dataset):
    def __init__(self, x, y, x_type=np.int32, y_type=torch.float):
        x_coo = x.tocoo()
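        # Build a sparse COO tensor from the SciPy COO matrix
        # (torch.sparse.FloatTensor is a legacy constructor)
        self.x = torch.sparse_coo_tensor(
            torch.from_numpy(np.vstack((x_coo.row, x_coo.col))).long(),
            torch.from_numpy(x_coo.data.astype(x_type)).float(),
            x_coo.shape,
        )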
        self.y = torch.tensor(y, dtype=y_type)
        self.shape = self.x.shape

    def __getitem__(self, index):
        return self.x[index].to_dense(), self.y[index]

    def __len__(self):
        return self.shape[0]
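`Data` keeps the feature matrix in sparse COO form (row indices, column indices, values) and densifies only the row requested in `__getitem__`. The pure-Python sketch below illustrates that idea without PyTorch. The helper name `coo_row_to_dense` and the toy matrix are assumptions made for this example.

```python
def coo_row_to_dense(rows, cols, values, n_cols, index):
    """Build the dense version of one row from COO triplets."""
    dense = [0.0] * n_cols
    for r, c, v in zip(rows, cols, values):
        if r == index:
            # Entries that share the same coordinates are summed in COO
            dense[c] += v
    return dense


# Hypothetical 2x4 count matrix: [[0, 2, 0, 1], [3, 0, 0, 0]]
rows, cols, values = [0, 0, 1], [1, 3, 0], [2.0, 1.0, 3.0]
first = coo_row_to_dense(rows, cols, values, n_cols=4, index=0)
second = coo_row_to_dense(rows, cols, values, n_cols=4, index=1)
print(first, second)
```

Storing only the non-zero entries keeps memory usage low for Bag of Words matrices, at the cost of densifying each sample when it is read.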

In [None]:
class Classifier(nn.Module):
    def __init__(self, binary_classifier=False, device=torch.device("cpu"), verbose=True):
        super().__init__()
        self.device = device
        self.is_binary = binary_classifier
        self.verbose = verbose
        self.is_compiled = False
        self.history = []

    def forward(self, x):
        return x

    def compile(self, loss, optimizer, binary_threshold=0.5):
        self.loss = loss
        self.optimizer = optimizer
        self.binary_threshold = binary_threshold
        self.to(self.device)
        self.is_compiled = True

    def parse_logits(self, outputs):
        if self.is_binary:
            predicted = (outputs > self.binary_threshold).float()
        else:
            _, predicted = torch.max(outputs.data, 1)
        return predicted

    def train_loop(self, data, epochs, data_val=None, callbacks=()):
        try:
            tot = len(data.dataset)
            # Iterate over all epochs
            for epoch in range(epochs):
                running_loss = 0.0
                correct = 0
                history_point = {}
                # Iterate over each dataset batch
                for i, datum in enumerate(data):
                    # Decompose batch in x and y and move both to the model device
                    inputs, labels = datum
                    inputs, labels = inputs.to(self.device), labels.to(self.device)
                    # Set gradients to zero
                    self.optimizer.zero_grad()
                    # Forward pass
                    outputs = self(inputs)
                    predictions = self.parse_logits(outputs)
                    current_loss = self.loss(outputs, labels)
                    # Backpropagation
                    current_loss.backward()
                    # Optimization
                    self.optimizer.step()
                    # Update metrics
                    running_loss += current_loss.item()
                    correct += (predictions == labels).float().sum().item()

                # Compute training metrics
                history_point["epoch"] = epoch + 1
                history_point["loss"] = running_loss / tot
                history_point["acc"] = correct / tot

                # Compute and save eventual validation metrics
                if data_val:
                    _, val_metrics = self.test_loop(data_val)
                    history_point["val_loss"] = val_metrics["loss"]
                    history_point["val_acc"] = val_metrics["acc"]

                # Save epoch in history
                self.history.append(history_point)

                # Perform callbacks
                for callback in callbacks:
                    callback.call(self, history_point)

                # Print epoch summary
                if self.verbose:
                    print(build_history_string(history_point))

        except StopNNLoop as s:  # noqa
            pass

    def test_loop(self, data):
        all_predictions = np.array([])
        tot = len(data.dataset)
        loss = 0.0
        correct = 0
        metrics = {}
        # Prevent model update
        with torch.no_grad():
            # Iterate over each dataset batch
            for datum in data:
                # Decompose batch in x and y and move both to the model device
                inputs, labels = datum
                inputs, labels = inputs.to(self.device), labels.to(self.device)
                # Forward pass
                outputs = self(inputs)
                predictions = self.parse_logits(outputs)
                current_loss = self.loss(outputs, labels)
                # Update metrics
                loss += current_loss.item()
                correct += (predictions == labels).float().sum().item()
                # Append predictions (moved to CPU before the NumPy conversion)
                all_predictions = np.append(all_predictions, predictions.cpu().numpy())

        # Compute metrics
        metrics["acc"] = correct / tot
        metrics["loss"] = loss / tot

        return all_predictions.flatten(), metrics
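The `parse_logits` method turns raw model outputs into class predictions in two ways: thresholding for the binary case and arg-max for the multiclass case. The sketch below reproduces that logic on plain Python lists. The function name `parse_outputs` and the sample values are assumptions made for this example. The 0.5 threshold is only meaningful if the binary outputs are probabilities (for instance after a sigmoid), which is assumed here.

```python
def parse_outputs(outputs, is_binary, threshold=0.5):
    """Map model outputs to predicted labels."""
    if is_binary:
        # One score per sample: strictly above the threshold means class 1
        return [1.0 if score > threshold else 0.0 for score in outputs]
    # One score per class: pick the index of the largest score in each row
    return [max(range(len(row)), key=row.__getitem__) for row in outputs]


# Hypothetical outputs for three binary samples and two 3-class samples
binary_predictions = parse_outputs([0.2, 0.5, 0.9], is_binary=True)
multi_predictions = parse_outputs(
    [[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]], is_binary=False
)
print(binary_predictions, multi_predictions)
```

A score exactly equal to the threshold is assigned to class 0, because the comparison is strict.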

In [None]:
class Callback(metaclass=abc.ABCMeta):
    def __init__(self, inputs):
        if not isinstance(inputs, list):
            raise TypeError("Parameter 'inputs' must be a list")
        if not all(x in history_metrics for x in inputs):
            raise ValueError(
                "Unknown input value, not present in history_metrics"
            )
        self.inputs = inputs

    def inputs_check(self, inputs):
        if not all(x in inputs for x in self.inputs):
            raise ValueError(
                f"Requested inputs not provided: {[i for i in self.inputs if i not in inputs]}"
            )

    @abc.abstractmethod
    def call(self, model, inputs):
        self.inputs_check(inputs)
        pass

class EarlyStopping(Callback):
    def __init__(
        self,
        metric="loss",
        patience=10,
        baseline=None,
        delta=0,
        restore_best=True,
        verbose=True,
    ):
        super().__init__([metric])
        self.metric = metric
        self.patience = patience
        self.baseline = baseline
        self.delta = delta
        self.restore_best = restore_best
        self.verbose = verbose
        self.best_epoch = 0
        self.counter = 0
        self.saved_params = {}
        self.last_best = init_lowest(self.metric)

    def call(self, model, inputs):
        super().call(model, inputs)
        if self.early_stop(model, inputs):
            raise StopNNLoop()

    def early_stop(self, model, inputs):
        metric = inputs[self.metric]
        # Check if new metric is better than the current best
        if compare_metric(self.metric, metric, self.last_best):
            # Reset counter and update best value
            self.last_best = metric
            self.counter = 0
            self.best_epoch = inputs["epoch"]
            # Update model checkpoint (only once the baseline, if any, is beaten)
            if self.baseline is None or compare_metric(
                self.metric, metric, self.baseline
            ):
                self.saved_params = cp.deepcopy(model.state_dict())
        # Check if new metric is worse than the current best
        elif compare_metric(self.metric, self.last_best, metric, self.delta):
            # Increment counter
            self.counter += 1
            # Check if counter exceeds patience, if so interrupt training
            if self.counter >= self.patience:
                # Restore best model checkpoint if possible and wanted
                if not self.restore_best:
                    return True
                if self.saved_params:
                    model.load_state_dict(self.saved_params)
                if self.verbose:
                    if self.saved_params:
                        print(f"Model restored successfully @ epoch {self.best_epoch}")
                    else:
                        print(f"Couldn't restore model @ epoch {self.best_epoch}")
                return True
        return False


class AdaptLR(Callback):
    def __init__(self, metric="loss", patience=5, factor=0.1, delta=0, verbose=True):
        super().__init__([metric])
        self.metric = metric
        self.patience = patience
        self.factor = factor
        self.delta = delta
        self.verbose = verbose
        self.counter = 0
        self.last_best = init_lowest(self.metric)

    def call(self, model, inputs):
        super().call(model, inputs)
        if self.adaptlr(inputs):
            # Adapt learning rate of every parameter group
            lr = None
            for g in model.optimizer.param_groups:
                g["lr"] *= self.factor
                lr = g["lr"]
            if self.verbose and lr is not None:
                print(f"Reducing lr to {lr:.2e}")

    def adaptlr(self, inputs):
        metric = inputs[self.metric]
        # Check if new metric is better than the current best
        if compare_metric(self.metric, metric, self.last_best):
            # Reset counter and update best value
            self.last_best = metric
            self.counter = 0
        # Check if new metric is worse than the current best
        elif compare_metric(self.metric, self.last_best, metric, self.delta):
            # Increment counter
            self.counter += 1
            # Check if counter exceeds patience, if so reduce the learning rate
            if self.counter >= self.patience:
                self.counter = 0
                return True
        return False
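Both callbacks share the same patience mechanism: a counter grows whenever the monitored metric is worse than the best value by more than `delta`, and resets on every improvement. The sketch below simulates that mechanism for a lower-is-better metric such as `val_loss`. The function names and the sequences of values are assumptions made for this example; the actual classes operate on the model and its optimizer instead.

```python
def first_stop_epoch(values, patience, delta=0.0):
    """Return the 1-based epoch at which early stopping would trigger."""
    best, counter = float("inf"), 0
    for epoch, value in enumerate(values, start=1):
        if value < best:
            # Improvement: remember it and reset the counter
            best, counter = value, 0
        elif best < value - delta:
            # Worse than the best by more than delta: consume patience
            counter += 1
            if counter >= patience:
                return epoch
    return None


def lr_schedule(values, lr, patience, factor=0.1, delta=0.0):
    """Return the learning rate in effect after each epoch."""
    best, counter, schedule = float("inf"), 0, []
    for value in values:
        if value < best:
            best, counter = value, 0
        elif best < value - delta:
            counter += 1
            if counter >= patience:
                # Unlike early stopping, training continues with a smaller lr
                counter = 0
                lr *= factor
        schedule.append(lr)
    return schedule


# Hypothetical validation losses
losses = [1.0, 0.5, 0.9, 0.55, 0.9, 0.52, 0.9]
stop_strict = first_stop_epoch(losses, patience=3)
stop_tolerant = first_stop_epoch(losses, patience=3, delta=0.2)
schedule = lr_schedule([1.0, 0.5, 0.9, 0.9, 0.9, 0.9], lr=1.0, patience=2, factor=0.5)
print(stop_strict, stop_tolerant, schedule)
```

With `delta=0.2`, epochs whose loss is only slightly above the best value do not consume patience, so training runs longer before stopping.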

#### Generic utility

In [None]:
def split(x, y, test_size=0.2, val_size=0.0, seed=0):
    if val_size + test_size >= 1:
        return None
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size + val_size, stratify=y, random_state=seed
    )
    x_val, y_val = None, None
    if val_size > 0:
        x_test, x_val, y_test, y_val = train_test_split(
            x_test,
            y_test,
            test_size=val_size / (test_size + val_size),
            stratify=y_test,
            random_state=seed,
        )
    return x_train, x_val, x_test, y_train, y_val, y_test

def compact_split(dataset, test_size=0.2, val_size=0.0, seed=0):
    if val_size + test_size >= 1:
        return None
    train, test = train_test_split(
        dataset, test_size=test_size + val_size, random_state=seed
    )
    val = None
    if val_size > 0:
        val, test = train_test_split(
            test,
            test_size=test_size / (test_size + val_size),
            random_state=seed,
        )
    return train, val, test

def evaluate(y_true, y_pred, labels=None):
    print(classification_report(y_true, y_pred))
    cm = metrics.confusion_matrix(y_true, y_pred)
    cm_display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    cm_display.plot()
    plt.show()

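# dump/load are assumed to be joblib's (installed alongside scikit-learn)
from joblib import dump, load


def save_scikit_model(path, model, name):
    # Create the target directory if needed, then serialize the model
    os.makedirs(path, exist_ok=True)
    dump(model, os.path.join(path, name))

def load_scikit_model(path, name):
    # Return the deserialized model, or None if it is missing or unreadable
    model_path = os.path.join(path, name)
    if os.path.isfile(model_path):
        try:
            return load(model_path)
        except Exception:
            pass
    return None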
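The `split` and `compact_split` helpers work in two stages: they first hold out `test_size + val_size` of the data, then divide the held-out part between validation and test. The sketch below works through that arithmetic with exact fractions. The function name `split_proportions` is an assumption made for this example, and it raises an error on invalid sizes where the helpers return `None`. The sample counts produced by `train_test_split` are rounded, so they can differ slightly from these proportions.

```python
from fractions import Fraction


def split_proportions(test_size=0.2, val_size=0.0):
    """Return the (train, val, test) fractions of a two-stage split."""
    test, val = Fraction(str(test_size)), Fraction(str(val_size))
    held_out = test + val
    if held_out >= 1:
        raise ValueError("test_size + val_size must be below 1")
    train = 1 - held_out
    if val == 0:
        # Single-stage split: everything held out becomes the test set
        return train, Fraction(0), held_out
    # Second stage: share of the held-out part assigned to validation
    val_ratio = val / held_out
    return train, held_out * val_ratio, held_out * (1 - val_ratio)


train, val, test = split_proportions(test_size=0.2, val_size=0.1)
print(train, val, test)
```

With `test_size=0.2` and `val_size=0.1`, the first stage holds out 3/10 of the data and the second stage assigns one third of it to validation, which recovers the requested 1/10 and 1/5.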

### Variable definitions

In [None]:
seed = 42
np.random.seed(seed)
random.seed(seed)
# Seed PyTorch as well so that model initialization is reproducible
torch.manual_seed(seed)

#### Library initialization calls

In [None]:
%%capture
# Load spacy pipeline model
!{sys.executable} -m spacy download en_core_web_sm

## Task 1

The challenge is subdivided into two main tasks. The first is a **Binary Classification** task that aims to identify whether a text passage was *written by a human* or *generated by a language model*.

### Data Exploration

In [None]:
# TODO

### Bag of Words-based models and other approaches

In [None]:
# TODO

### Text embedding approaches

In [None]:
# TODO

### Transformer-based models

In [None]:
# TODO

## Task 2

The challenge is subdivided into two main tasks. The second is a **Multinomial Classification** task that aims to identify the specific *language model* that generated a given text passage, choosing from 6 different models labeled A, B, C, D, E and F.

### Data Exploration

In [None]:
# TODO

### Bag of Words-based models and other approaches

In [None]:
# TODO

### Text embedding approaches

In [None]:
# TODO

### Transformer-based models

In [None]:
# TODO

## Conclusions

In [None]:
# TODO