# Notes: Hugging Face Transformers (won't use fastai library for this lesson)

* Following the lesson, we will fine-tune a pre-trained NLP model using the Hugging Face Transformers library
* Reasons:
    * it is really useful to get experience with other libraries (good for reinforcing knowledge)
    * Hugging Face is really good for NLP and well worth knowing
    * fastai is expected to finish integrating the Transformers library at some point

* **Hugging Face Transformers** doesn't have the same architecture as fastai
    * it is lower level, so we will need to do a bit more work on our end

* **Pre-trained model** - a model whose parameters have already been fit. For some of them we are already confident what they should be; for others we have no idea at all. Hence the need for fine-tuning.

* **ULMFiT** - an architecture and transfer learning method that can be applied to NLP tasks
    * Started out by training on Wikipedia data to predict the next word (got up to ~30% accuracy)
    * Then moved to IMDB data: took the model pre-trained on Wikipedia and ran a few more epochs on IMDB reviews
    * Finally took those weights and fine-tuned them to classify a review as positive or negative

* ULMFiT used RNNs; Transformers appeared around the same time.
* **Transformers** - took really good advantage of modern accelerators like Google TPUs
    * Threw away the idea of predicting the next word of a sentence
    * Took chunks of Wikipedia, deleted a few words at random, and asked the model which words were deleted

* For this lesson, we'll focus on the **Transformers masked language model**
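
To make the "delete a few words and ask which ones were deleted" idea concrete, here is a minimal sketch of how a masked training example could be built. The helper name `mask_tokens`, the `[MASK]` placeholder string, and the masking probability are illustrative assumptions, not details taken from the lesson or from any particular model.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with mask_token (hypothetical helper).

    Returns the masked sequence and a dict {position: original_token},
    i.e. the answers a model would be trained to predict.
    """
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

sentence = "the platypus is a semiaquatic mammal native to eastern Australia".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3)
print(masked)   # input shown to the model
print(targets)  # hidden words the model has to recover
```

The model only ever sees `masked`; the loss would be computed on the positions stored in `targets`.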


## Reference: Kaggle Competition - Getting Started with NLP for absolute beginners (U.S. patent phrase to phrase matching)

* Data:
    * id, anchor, target, context, score (how similar target & anchor are)
    
* [Link to competition](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data?select=train.csv)

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import warnings
warnings.filterwarnings("ignore")

In [3]:
df = pd.read_csv('nlp_intro/train.csv')

In [4]:
df.head()

|   | id | anchor | target | context | score |
|---|---|---|---|---|---|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.5 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.5 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.0 |


In [5]:
df.describe(include='object') #Not that much language data, lots of repeated data

|   | id | anchor | target | context |
|---|---|---|---|---|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |


In [6]:
#Can represent the input to the model as, for example, "TEXT1: A47; TEXT2: eliminating process; ANC1: abatement"
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

In [7]:
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object

### Tokenization
* Neural networks work with numbers, not text

* Step 1: Split the text into tokens (words, or pieces of words)
    * Each unique token will get a number
    * Generally, we don't want the vocabulary to be too big
    * Nowadays, people use subwords rather than whole words
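
As a rough illustration of why subwords keep the vocabulary small, the sketch below splits words with a greedy longest-match rule against a tiny made-up vocabulary. This is a simplification for intuition only: `subword_tokenize` and `toy_vocab` are hypothetical, and this is not the algorithm used by the tokenizer of the model chosen later in these notes.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match split of one word into pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing in the vocabulary matches at this position
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical toy vocabulary: a handful of pieces covers many words.
toy_vocab = {"token", "iz", "ation", "un", "believ", "able", "a", "platypus"}
for w in ["tokenization", "unbelievable", "platypus", "xyz"]:
    print(w, "->", subword_tokenize(w, toy_vocab))
```

A common word such as "platypus" stays whole, while rarer words are assembled from shared pieces, so the vocabulary does not need one entry per word.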


Hugging Face's `datasets` library provides a **Dataset object** for storing a dataset, and Transformers works with it.

In [8]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)

In [9]:
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

### Numericalization
* Step 2: Convert each token into a number: its unique ID in the vocabulary
* The details of how Steps 1 & 2 are done depend on the particular model we use (Hugging Face has [250K+ models](https://huggingface.co/models) as of July 2023)
    * A reasonable starting point is "deberta-v3-small"
    * Start with a small model, then explore larger ones for slower but more accurate results
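
Numericalization itself is just a dictionary lookup. The sketch below builds a toy vocabulary and maps tokens to IDs and back; `build_vocab`, `numericalize`, and the special tokens `[PAD]` and `[UNK]` are illustrative assumptions, and a real pre-trained tokenizer ships with its own fixed vocabulary instead of building one from the data.

```python
def build_vocab(token_lists, specials=("[PAD]", "[UNK]")):
    """Assign each unique token an integer ID, special tokens first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def numericalize(tokens, vocab, unk="[UNK]"):
    """Map tokens to IDs, falling back to the unknown-token ID."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]

corpus = [["abatement", "of", "pollution"], ["act", "of", "abating"]]
vocab = build_vocab(corpus)
ids = numericalize(["abatement", "of", "noise"], vocab)  # "noise" is unseen
id_to_token = {i: t for t, i in vocab.items()}
print(ids)
print([id_to_token[i] for i in ids])
```

Any token that is not in the vocabulary collapses to the same unknown ID, which is one reason subword vocabularies are preferred over whole-word ones.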

In [10]:
#specify the model here
model_nm = 'microsoft/deberta-v3-small'

In [13]:
#AutoTokenizer creates a tokenizer appropriate for this model
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [14]:
#Try passing a string to this tokenizer
tokz.tokenize("Yo what up, this is Link from Hyrule")
#The '▁' prefix marks the START of a word

['▁Yo', '▁what', '▁up', ',', '▁this', '▁is', '▁Link', '▁from', '▁Hyrule']

In [15]:
#less common phrase
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")


['▁A',
 '▁platypus',
 '▁is',
 '▁an',
 '▁or',
 'ni',
 'tho',
 'rhynch',
 'us',
 '▁an',
 'at',
 'inus',
 '.']

#### Create a simple function to tokenize our inputs



In [16]:
def tok_func(x): return tokz(x["input"])

In [17]:
#Run this quickly over the whole dataset using 'map'; batched=True passes rows to tok_func in batches
tok_ds = ds.map(tok_func, batched=True)


In [18]:
#Take a look at the first row of the tokenized dataset
row = tok_ds[0]
row['input'], row['input_ids'] #Successfully turned our tokens into numbers

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

In [24]:
#Look up a token in the tokenizer's vocabulary to get its number
tokz.vocab['▁of']
#SentencePiece handles whitespace as a basic token explicitly:
#it escapes each space with the meta symbol "▁" (U+2581)

265

* ULMFiT: probably the best choice for a reasonably quick and easy implementation on long documents
* Transformers: large documents are challenging, since a transformer has to process the whole document at once (larger GPU cost)
* Example: for documents of over 2000 words, consider ULMFiT
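
One way to see where the GPU cost comes from, assuming standard full self-attention (an assumption; the notes do not spell out the mechanism): every token is compared with every other token, so the number of attention scores grows with the square of the sequence length. The function name `attention_score_entries` is hypothetical and the arithmetic ignores all other costs.

```python
def attention_score_entries(seq_len, num_heads=1):
    """Pairwise attention scores for one layer: heads * seq_len * seq_len."""
    return num_heads * seq_len * seq_len

# Quadrupling the length multiplies the number of scores by sixteen.
for n in (128, 512, 2048):
    print(f"{n:>5} tokens -> {attention_score_entries(n):>10,} scores per head")
```

Under this assumption, a 2048-token document needs 16 times as many scores as a 512-token one, which is why very long documents are harder to fit on a GPU.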

#### Hugging Face Transformers expects the target column to be called 'labels'

In [28]:
#change the score column to labels
tok_ds = tok_ds.rename_columns({'score':'labels'})

#### Test and Validation Sets