# Model

I will fine-tune a distilled version of GPT2, configured as a regression model (a single output trained with MSE loss), to predict log scores directly. Its predictions could later be combined with the other features identified in the data_analysis notebook, which may improve results further.

Multiple model arrangements could be attempted before deciding on a specific setup. The above is only a rationale for my starting point.

Specifically, I will use Hugging Face's distilled version of the GPT2 LLM (https://huggingface.co/distilgpt2). This is not a SOTA model, but my humble personal laptop can fine-tune it in reasonable time.

# tl;dr

- Distilled GPT2 achieves much better performance (validation R2 of about 0.5) than the regression on handcrafted features

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import torch

BASE_MODEL = 'distilgpt2'
LEARNING_RATE = 2e-5
MAX_LENGTH = 256 # could be larger with more resources, so that long posts are not truncated
BATCH_SIZE = 16 # could be larger with more resources, giving less noisy gradient updates
EPOCHS = 5 # kept small so training finishes in reasonable time

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL) 
tokenizer.pad_token = tokenizer.eos_token # distilgpt2 tokenizer has no default padding token, so reuse the end-of-text token (a string, not its id)

model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=1)
model.config.pad_token_id = model.config.eos_token_id # distilgpt2 model needs to conform to the tokenizer
model.resize_token_embeddings(len(tokenizer)) # distilgpt2 model needs to conform to the tokenizer

def compute_metrics_for_regression(eval_pred):
    logits, labels = eval_pred
    labels = labels.reshape(-1, 1)
    
    mse = mean_squared_error(labels, logits)
    mae = mean_absolute_error(labels, logits)
    r2 = r2_score(labels, logits)
    
    return {"mse": mse, "mae": mae, "r2": r2}


training_args = TrainingArguments(
    output_dir="./models/distilgpt2-fine-tuned-regression",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    metric_for_best_model="r2",
    load_best_model_at_end=True,
    weight_decay=0.01,
    optim="adamw_torch"
)


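class RegressionTrainer(Trainer):
    # **kwargs absorbs extra arguments that some newer transformers versions pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs[0][:, 0]  # shape (batch, 1) -> (batch,), matching the labels
        loss = torch.nn.functional.mse_loss(logits, labels)
        return (loss, outputs) if return_outputs else loss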

    
def preprocess_function(example):
    # Called by Dataset.map on one row at a time (a dict, not a DataFrame)
    encoded = tokenizer(example["full_text"], truncation=True, padding='max_length', max_length=MAX_LENGTH)

    # The regression target must be a float
    encoded["label"] = float(example["score"])
    return encoded

Some weights of the model checkpoint at distilgpt2 were not used when initializing GPT2ForSequenceClassification: ['lm_head.weight']
- This IS expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
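
To make the numbers reported by `compute_metrics_for_regression` easier to interpret, here is a minimal pure-Python sketch of the three metrics. The helper name `regression_metrics` and the toy values are made up for illustration; the notebook itself relies on the scikit-learn implementations.

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R2 for two equal-length lists of floats."""
    n = len(y_true)
    mean_true = sum(y_true) / n

    # Mean squared error: the same quantity RegressionTrainer uses as its loss.
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    # Mean absolute error: the average miss, in the same units as the labels.
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    # R2 compares the model with a baseline that always predicts the mean label.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "mae": mae, "r2": r2}


# Toy labels and predictions (hypothetical values, not taken from the dataset).
toy_labels = [1.0, 2.0, 3.0, 4.0]
toy_preds = [1.5, 2.0, 2.5, 4.0]
toy_metrics = regression_metrics(toy_labels, toy_preds)
print(toy_metrics)
```

An R2 of 0 means the model is no better than always predicting the mean label, and an R2 of 1 means perfect predictions, which is why R2 is a convenient metric for comparing this model with the handcrafted regression.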


In [2]:
train_ds = Dataset.from_json("./data/reddit_scores_train.jsonlines")
val_ds = Dataset.from_json("./data/reddit_scores_val.jsonlines")
test_ds = Dataset.from_json("./data/reddit_scores_test.jsonlines")

train_ds, val_ds, test_ds


(Dataset({
     features: ['title', 'full_text', 'score'],
     num_rows: 4804
 }),
 Dataset({
     features: ['title', 'full_text', 'score'],
     num_rows: 719
 }),
 Dataset({
     features: ['title', 'full_text', 'score'],
     num_rows: 1736
 }))

In [3]:
ds = {"train": train_ds, "validation": val_ds, "test": test_ds}

for split in ds:
    ds[split] = ds[split].map(preprocess_function, remove_columns=["title", "full_text", "score"])

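
As a rough illustration of what `truncation=True, padding='max_length'` does to each post in `preprocess_function`, the sketch below pads or truncates a list of token ids to a fixed length. It is a toy stand-in, not the tokenizer's implementation: the helper `pad_or_truncate` and the token ids are hypothetical, and right-side padding is assumed (the default for this tokenizer, as far as I know).

```python
GPT2_EOS_ID = 50256  # id of "<|endoftext|>" in the GPT-2 vocabulary, reused here as padding


def pad_or_truncate(token_ids, max_length, pad_id=GPT2_EOS_ID):
    """Mimic truncation=True with padding='max_length' for one sequence."""
    kept = token_ids[:max_length]                    # drop tokens beyond max_length
    n_pad = max_length - len(kept)                   # number of pad tokens needed
    input_ids = kept + [pad_id] * n_pad              # assumed: padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; a real post would come from tokenizer(...).
short_post = pad_or_truncate([10, 11, 12], max_length=5)
long_post = pad_or_truncate([10, 11, 12, 13, 14, 15, 16], max_length=5)
print(short_post)
print(long_post)
```

With `MAX_LENGTH = 256`, any post longer than 256 tokens loses its tail, which is the trade-off noted in the comment next to `MAX_LENGTH`.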

In [4]:
trainer = RegressionTrainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    compute_metrics=compute_metrics_for_regression,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | MSE | MAE | R2 |
|---|---|---|---|---|---|
| 1 | No log | 8.238085 | 8.238084 | 2.216275 | 0.272116 |
| 2 | 11.675300 | 8.399469 | 8.399471 | 2.068044 | 0.257857 |
| 3 | 11.675300 | 7.277102 | 7.277102 | 2.174073 | 0.357025 |
| 4 | 8.925200 | 6.091055 | 6.091055 | 1.866863 | 0.461819 |
| 5 | 7.477400 | 5.671689 | 5.671689 | 1.780632 | 0.498872 |


TrainOutput(global_step=1505, training_loss=9.346669202785556, metrics={'train_runtime': 5245.649, 'train_samples_per_second': 4.579, 'train_steps_per_second': 0.287, 'total_flos': 1569115322449920.0, 'train_loss': 9.346669202785556, 'epoch': 5.0})
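
The step count and the "No log" entry in the training output can be sanity-checked with a little arithmetic. This is a sketch under assumptions: a single device, no gradient accumulation, and the Trainer's default `logging_steps` of 500 (the notebook does not set it). The variable names are illustrative only.

```python
import math

TRAIN_ROWS = 4804       # size of the training split loaded earlier
BATCH_SIZE = 16
EPOCHS = 5
LOGGING_STEPS = 500     # assumed: Trainer default, since logging_steps is not set
TRAIN_RUNTIME_S = 5245.649  # train_runtime reported in TrainOutput

steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)  # the last batch is partial
total_steps = steps_per_epoch * EPOCHS

# Most recent step at which a training loss had been logged by the end of each epoch.
last_log_step_by_epoch = {}
for epoch in range(1, EPOCHS + 1):
    end_step = epoch * steps_per_epoch
    last_log = (end_step // LOGGING_STEPS) * LOGGING_STEPS
    last_log_step_by_epoch[epoch] = last_log if last_log > 0 else None

samples_per_second = TRAIN_ROWS * EPOCHS / TRAIN_RUNTIME_S

print(steps_per_epoch, total_steps)
print(last_log_step_by_epoch)
print(round(samples_per_second, 3))
```

Under these assumptions the total matches `global_step=1505`, the first epoch ends before any training loss has been logged, and epochs 2 and 3 share the same logged value, which is consistent with the repeated 11.675300 in the output. Passing a smaller `logging_steps` to `TrainingArguments` would give a per-epoch training loss.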

# Summary

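- Not surprisingly, even a non-SOTA LLM achieves over 3x better performance than handcrafted features alone
- The validation R2 of this initial model is almost 0.5 (0.499 after 5 epochs), and the validation loss was still decreasing at the last epoch, so longer training may help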
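
To put the final R2 in context, the relation R2 = 1 - MSE / Var(labels) can be inverted to recover the variance of the validation labels from the logged metrics, and the MSE can be converted to an RMSE in the same units as the score. The sketch below only re-uses numbers from the training log; the variable names are illustrative.

```python
import math

# Validation metrics copied from the training log (epochs 1 and 5).
val_mse = {1: 8.238084, 5: 5.671689}
val_r2 = {1: 0.272116, 5: 0.498872}

# R2 = 1 - MSE / Var(labels)  =>  Var(labels) = MSE / (1 - R2)
implied_variance = {
    epoch: val_mse[epoch] / (1.0 - val_r2[epoch]) for epoch in val_mse
}

# RMSE is in the same units as the score being predicted.
final_rmse = math.sqrt(val_mse[5])

print(implied_variance)
print(final_rmse)
```

Both epochs imply the same label variance of about 11.3, which is the MSE a constant mean prediction would obtain on the validation set. The final RMSE of about 2.38 is therefore the typical error that remains after the model explains roughly half of that variance.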