# Notes: Hugging Face Transformers (won't use fastai library for this lesson)

* following the lesson, we will fine-tune a pre-trained NLP model using a library called Hugging Face Transformers
* reason: 
    * really useful to get experience using other libraries (good for reinforcing knowledge)
    * Hugging Face is really good for NLP and well worth knowing
    * fastai will probably have integrated the Transformers library before long
    
* **Hugging Face Transformers** doesn't have the same architecture as fastai
    * lower level, will need to do a bit more work on our end

* **Pre-trained model** - a model whose parameters have already been fit on another task. For some of them we are already confident what the values should be; for others we have no idea what they should be at all. Hence the need for fine-tuning.

* **ULMFiT** - an architecture and transfer learning method that can be applied to NLP tasks
    * Started out on Wikipedia data to predict the next word (got up to ~30% accuracy)
    * Then applied to IMDB data: took the model pre-trained on Wikipedia, ran a few more epochs on IMDB reviews, then took those weights and fine-tuned them to classify a review as positive or negative

* ULMFiT used RNNs (recurrent neural networks).
* **Transformers** - appeared around the same time and took really good advantage of modern accelerators like Google TPUs
    * Threw away the idea of predicting the next word of a sentence
    * Instead, took chunks of Wikipedia, deleted a few words at random, and asked the model which words were deleted

* For this lesson, we'll focus on the **Transformers masked language model**
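
The masking idea above can be sketched in a few lines of plain Python. This is only a toy illustration, not how any real pre-training pipeline is implemented: the `[MASK]` placeholder, the `mask_words` helper, and the masking probability are all assumptions made for this example.

```python
import random

MASK = "[MASK]"  # placeholder string chosen for this sketch

def mask_words(text, mask_prob=0.15, seed=0):
    """Replace a random subset of words with MASK; return (masked_text, answers)."""
    rng = random.Random(seed)
    masked, answers = [], {}
    for i, word in enumerate(text.split()):
        if rng.random() < mask_prob:
            masked.append(MASK)
            answers[i] = word  # the word a model would be asked to recover
        else:
            masked.append(word)
    return " ".join(masked), answers

sentence = "the platypus is a semiaquatic egg laying mammal found in eastern Australia"
masked_sentence, answers = mask_words(sentence, mask_prob=0.3, seed=42)
print(masked_sentence)  # the input the model would see
print(answers)          # position -> deleted word (the training target)
```

The model never sees `answers`; it only sees the masked sentence and is scored on how well it fills in the blanks.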


## Reference: Kaggle Competition - Getting Started with NLP for absolute beginners (U.S. patent phrase to phrase matching)

* Data:
    * id, anchor, target, context, score (how similar target & anchor are)
    
* [Link to competition](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data?select=train.csv)

* In this exercise, we're tasked with comparing two words or phrases and scoring how similar they are, given the patent class they were used in.

* Goal of the competition: come up with a model that automatically determines which "anchor" and "target" pairs are talking about the same thing

* A score of 1 means the two inputs have identical meaning; 0 means they have totally different meanings. Scores can be anywhere between 0 and 1.

* This can be represented as a classification problem. Ex) For the following text: "TEXT1: abatement; TEXT2: eliminating process", choose a category of meaning similarity: "Different; Similar; Identical".

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('nlp_intro/train.csv')

In [3]:
eval_df = pd.read_csv('nlp_intro/test.csv')

In [4]:
df.head()

| | id | anchor | target | context | score |
|---|---|---|---|---|---|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.5 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.5 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.0 |


In [5]:
df.describe(include='object') #Not that much language data, lots of repeated data

| | id | anchor | target | context |
|---|---|---|---|---|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |


In [6]:
#Combine context, target and anchor into a single input string for the model,
#for example "TEXT1: A47; TEXT2: eliminating process; ANC1: abatement"
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

In [7]:
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object
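
Since the same template is needed again later for the test data, one option is to wrap it in a small helper. The sketch below is a suggestion rather than part of the lesson; `build_input` is a hypothetical name introduced here.

```python
def build_input(context, target, anchor):
    """Join the three fields using the same template as the cell above."""
    return 'TEXT1: ' + context + '; TEXT2: ' + target + '; ANC1: ' + anchor

# Works on plain strings...
example = build_input('A47', 'abatement of pollution', 'abatement')
print(example)  # TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement
```

Because the helper only uses `+`, it should also work element-wise on pandas columns, e.g. `df['input'] = build_input(df.context, df.target, df.anchor)`.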

### Tokenization
* Neural networks work with numbers

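* Step 1: Split the text into tokens (words or pieces of words)
    * Each unique token will get a number
    * Generally, don't want the vocabulary to be too big
    * Nowadays, people use subwords: rare words are broken into smaller pieces, which keeps the vocabulary small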
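
To make the subword idea concrete, here is a toy greedy longest-match splitter. It is a simplified sketch for intuition only and is not the algorithm the real tokenizer uses; `subword_tokenize` and the tiny `toy_vocab` are made up for this example.

```python
def subword_tokenize(word, vocab, unk="<unk>"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing matched, not even one character
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

toy_vocab = {"platypus", "is", "an", "or", "ni", "tho", "rhynch", "us"}
print(subword_tokenize("platypus", toy_vocab))         # ['platypus']
print(subword_tokenize("ornithorhynchus", toy_vocab))  # ['or', 'ni', 'tho', 'rhynch', 'us']
```

A common word stays whole, while a rare word is covered by several smaller pieces, so the vocabulary does not need an entry for every possible word.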


Hugging Face uses a **Dataset object** (from its `datasets` library) for storing a dataset.

In [8]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)

In [9]:
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

### Numericalization
* Step 2: Convert each token into a number: its unique ID in the vocabulary
* The details of how Steps 1 & 2 are done depend on the particular model we use. (Hugging Face has [250K+ models](https://huggingface.co/models) as of July 2023)
    * A reasonable starting point is "deberta-v3-small"
    * Start with a small model, then explore larger ones for slower but more accurate results!
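
As a rough illustration of numericalization, the sketch below builds a tiny vocabulary and maps tokens to IDs. The helper names (`build_vocab`, `numericalize`) and the `<unk>` token are assumptions for this example; a real pre-trained model ships with its own fixed vocabulary, so its IDs will differ.

```python
def build_vocab(token_lists, specials=("<unk>",)):
    """Give each distinct token an integer ID, in order of first appearance."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(tokens, vocab, unk="<unk>"):
    """Map tokens to IDs, falling back to the unknown-token ID."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]

corpus = [["abatement", "of", "pollution"], ["act", "of", "abating"]]
vocab = build_vocab(corpus)
ids = numericalize(["act", "of", "pollution", "forest"], vocab)
print(vocab)
print(ids)  # [4, 2, 3, 0] -> "forest" is not in the toy vocabulary, so it maps to <unk>
```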

In [10]:
#specify the model here
model_nm = 'microsoft/deberta-v3-small'

In [11]:
#AutoTokenizer creates a tokenizer appropriate for this model
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [12]:
#Try passing a string to this tokenizer
tokz.tokenize("Yo what up, this is Link from Hyrule")
#The '▁' character (U+2581, looks like an underscore) marks the START of a word

['▁Yo', '▁what', '▁up', ',', '▁this', '▁is', '▁Link', '▁from', '▁Hyrule']

In [13]:
#less common phrase
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")


['▁A',
 '▁platypus',
 '▁is',
 '▁an',
 '▁or',
 'ni',
 'tho',
 'rhynch',
 'us',
 '▁an',
 'at',
 'inus',
 '.']

#### Create a simple function to tokenize our inputs



In [14]:
def tok_func(x): return tokz(x["input"])

In [15]:
#Run this quickly in parallel using 'map'
tok_ds = ds.map(tok_func, batched=True)


In [16]:
#Take a look at the first row of the tokenized dataset
row = tok_ds[0]
row['input'], row['input_ids'] #Successfully turned our tokens into numbers

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

In [17]:
#Try looking up a token in the tokenizer's vocabulary (a dict) to get its number
tokz.vocab['▁of'] 
#SentencePiece handles whitespace as a basic token explicitly:
#it first escapes each space with the meta symbol “▁” (U+2581), so '▁of' is "of" at the start of a word.

265

* ULMFiT: probably the best choice for a reasonably quick and easy implementation on long documents
* Transformers: large documents are challenging, because a transformer has to process the whole document at once (larger GPU cost)
* Example: documents of over 2,000 words? Consider ULMFiT

#### Hugging Face Transformers expects your target column to be called 'labels'

In [18]:
#change the score column to labels
tok_ds = tok_ds.rename_columns({'score':'labels'})

* Create a validation set: it tells us whether our model is underfitting, overfitting, etc.
* If you use the fastai library, it automatically creates a validation set for you if you don't have one
* Transformers uses a `DatasetDict` for holding your training and validation sets.
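
Conceptually, a random train/validation split just shuffles the row indices and cuts them in two. The sketch below shows the idea in plain Python; `split_indices` is a hypothetical helper, and the specific rows it selects will not match what the library picks, although the split sizes work out the same for this dataset.

```python
import math
import random

def split_indices(n_rows, valid_frac=0.25, seed=42):
    """Shuffle row indices and split them into (train, valid) lists."""
    idxs = list(range(n_rows))
    random.Random(seed).shuffle(idxs)      # fixed seed -> reproducible split
    n_valid = math.ceil(n_rows * valid_frac)
    return idxs[n_valid:], idxs[:n_valid]

train_idxs, valid_idxs = split_indices(36473)
print(len(train_idxs), len(valid_idxs))  # 27354 9119
```

Fixing the seed matters: without it, every run would evaluate on a different validation set and results would not be comparable between experiments.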

In [19]:
#25% validation, 75% training
dds = tok_ds.train_test_split(0.25, seed=42)
dds #**notice, the validation set here is called 'test' dataset.

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

In [20]:
#Test dataset: accuracy of your model on test set is only checked after completing the entire training process.
#Use "eval" as the name for the test set, to avoid confusion with the 'test' dataset that was created above
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)


#### Transformers expects metrics to be returned as a dict. This way, the Trainer knows what name to show for each metric

In [21]:
#Create a function to do that
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}

In [22]:
#Will use this function later: returns the single number we need (the correlation coefficient) given a pair of variables
def corr(x,y): return np.corrcoef(x,y)[0][1]
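
For intuition about what `np.corrcoef(x,y)[0][1]` returns, the Pearson coefficient can be written out by hand: the covariance of the two variables divided by the product of their spreads. The pure-Python sketch below uses made-up predictions; `pearson` is a hypothetical helper written for this example, not part of the notebook.

```python
import math

def pearson(x, y):
    """Pearson r = sum((x-mx)(y-my)) / sqrt(sum((x-mx)^2) * sum((y-my)^2))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

labels = [0.5, 0.75, 0.25, 0.5, 0.0]      # the first five scores from df.head()
noisy = [0.4, 0.8, 0.3, 0.6, 0.1]         # made-up predictions
shifted = [0.5 * s + 0.1 for s in labels] # wrong values, but perfectly linear in the labels

print(pearson(noisy, labels))
print(pearson(shifted, labels))  # very close to 1, even though no prediction equals its label
```

The second case shows that Pearson rewards getting the ordering and relative spacing right; it does not penalize a constant offset or scale in the predictions.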

## Training the model

In [23]:
from transformers import TrainingArguments,Trainer

In [24]:
#Pick a batch size that fits our GPU and a small # of epochs to run the experiments quickly
bs = 128
epochs = 4

In [25]:
#Most important hyperparameter: learning rate. 
#Fastai provides a learning rate finder to help you figure it out, but Transformers doesn't.
lr = 8e-5

In [26]:
#Transformers uses "TrainingArguments" to set up arguments.
#These standard values generally work fine in most cases
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
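
To see what `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` imply, the sketch below approximates the resulting learning rate curve: a linear ramp from 0 up to `lr` over the first 10% of steps, then a cosine decay towards 0. It is a simplified re-implementation for intuition, assuming a single GPU and no gradient accumulation; `lr_at_step` is a hypothetical helper, and the library's own scheduler may differ in small details.

```python
import math

def lr_at_step(step, total_steps, base_lr=8e-5, warmup_ratio=0.1):
    """Linear warmup from 0 to base_lr, then cosine decay back towards 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

steps_per_epoch = math.ceil(27354 / 128)  # training rows / batch size (one GPU assumed)
total_steps = steps_per_epoch * 4         # 4 epochs
print("total steps:", total_steps)
for step in (0, total_steps // 20, total_steps // 10, total_steps // 2, total_steps):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

Starting with a small learning rate protects the pre-trained weights from large early updates, and the gradual decay lets training settle at the end.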

#### Create a model and "Trainer", which is a class that combines the data and model together (just like Learner in fastai)

In [27]:
#Transformers spits out lots of warnings, but you can ignore them.
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.dense.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.LayerNorm.weight', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.LayerNorm.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from 

In [28]:
#Train our model - again lots of warnings - can ignore them
trainer.train();

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Pearson |
|---|---|---|---|
| 1 | No log | 0.025546 | 0.799359 |
| 2 | No log | 0.024128 | 0.821532 |
| 3 | 0.032400 | 0.022523 | 0.833313 |
| 4 | 0.032400 | 0.022396 | 0.835018 |


#### Key thing to notice is the "Pearson" value in the table above.
* It increases every epoch and is above 0.8 from the second epoch onward (great news).
* On Kaggle, submissions are evaluated on the Pearson correlation coefficient between the predicted and actual similarity scores.