<a href="https://colab.research.google.com/github/bachaudhry/FastAI-22-23/blob/main/FastAI_2022_Getting_Started_With_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Getting Started with NLP - Using FastAI and Hugging Face**


In [1]:
import os
import numpy as np
import pandas as pd

iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

In [2]:
# Setting up Kaggle so that we can download datasets directly
!pip install kaggle



In [3]:
!pip install transformers[torch]

Collecting accelerate>=0.21.0 (from transformers[torch])
  Downloading accelerate-0.29.2-py3-none-any.whl (297 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m297.4/297.4 kB[0m [31m6.9 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch->transformers[torch])
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch->transformers[torch])
  Using cached nvidia_cublas_cu

In [4]:
# Using Kaggle API key
creds = '{"username":"bachaudhry","key":"f8e0ee26f34cb0d3e4eccd774de62f63"}'

In [5]:
# Check if file path exists or needs to be created
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
  cred_path.parent.mkdir(exist_ok=True)
  cred_path.write_text(creds)
  cred_path.chmod(0o600)

Now that we have Kaggle setup for this NB, let's download the **US Patent Phrase Matching** dataset.

In [6]:
path = Path('us-patent-phrase-to-phrase-matching')

if not iskaggle and not path.exists():
  import zipfile, kaggle
  kaggle.api.competition_download_cli(str(path))
  zipfile.ZipFile(f'{path}.zip').extractall(path)

Downloading us-patent-phrase-to-phrase-matching.zip to /content


100%|██████████| 682k/682k [00:00<00:00, 54.1MB/s]







## **Import Data and EDA**

Here's a [description](https://https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data) for the dataset that we'll be working on.

In short, we will be working on pairs of phrases - which consist of an `anchor` and a `target` phrase.

Additionally, similarity has been scored within a patent's `context`, which is a CPC classification and indicates the subject to which the patent relates.

In [7]:
# Checking local / GDrive path to verify files in the downloaded dataset.
!ls {path}

sample_submission.csv  test.csv  train.csv


In [8]:
# Loading training dataset in a DataFrame
df = pd.read_csv(path/'train.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36473 entries, 0 to 36472
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       36473 non-null  object 
 1   anchor   36473 non-null  object 
 2   target   36473 non-null  object 
 3   context  36473 non-null  object 
 4   score    36473 non-null  float64
dtypes: float64(1), object(4)
memory usage: 1.4+ MB


In [9]:
df.head(10)

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.5
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.5
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.0
5,067203128142739c,abatement,greenhouse gases,A47,0.25
6,061d17f04be2d1cf,abatement,increased rate,A47,0.25
7,e1f44e48399a2027,abatement,measurement level,A47,0.25
8,0a425937a3e86d10,abatement,minimising sounds,A47,0.5
9,ef2d4c2e6bbb208d,abatement,mixing core materials,A47,0.25


In [10]:
df.tail(10)

Unnamed: 0,id,anchor,target,context,score
36463,16a5c8551e534d1c,wood article,wood apple fruit,B44,0.0
36464,8ceaa2b5c2d56250,wood article,wood article,B44,1.0
36465,c4ac9d407fb427ab,wood article,wood logs,B44,0.5
36466,8a57100f6ee40ffc,wood article,wood material,B44,0.75
36467,f55e072f78d1fedb,wood article,wood substrate,B44,0.5
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.0
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.5
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.5
36471,756ec035e694722b,wood article,wooden material,B44,0.75
36472,8d135da0b55b8c88,wood article,wooden substrate,B44,0.5


In [11]:
df.describe(include='object')

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,8d135da0b55b8c88,component composite coating,composition,H01
freq,1,152,24,2186




We have 36473 rows, 733 unique anchors, 106 contexts and 29340 targets.





In [12]:
# Creating a column to concatenate the above so that we have a more convenient representation.
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object

## Tokenization Using HF Tokenizer

We will be working with HuggingFace's Tokenizer, which uses a `Dataset` object for storage.

In [13]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m510.5/510.5 kB[0m [31m9.5 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m15.0 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m12.9 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed dataset

In [14]:
from datasets import Dataset, DatasetDict

In [15]:
# This is how a dataset object works when we load in the training DF.
ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

It should be noted that Tokenization is dependent on the particular model that we will be using in this notebook. This detail has to be declared in the Tokenizer explicitly.

In [16]:
# Choosing a small NLP model for exploration
model_nm = 'microsoft/deberta-v3-small'

# Importing HuggingFace Tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/578 [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]



In [17]:
tokz.tokenize("In the midst of winter, I found there was, within me, an invincible summer.")

['▁In',
 '▁the',
 '▁midst',
 '▁of',
 '▁winter',
 ',',
 '▁I',
 '▁found',
 '▁there',
 '▁was',
 ',',
 '▁within',
 '▁me',
 ',',
 '▁an',
 '▁invincible',
 '▁summer',
 '.']

In [18]:
tokz.tokenize("amazon.com, is a super-handy website. It's the bee's knees when it comes to e-commerce.")

['▁amazon',
 '.',
 'com',
 ',',
 '▁is',
 '▁a',
 '▁super',
 '-',
 'hand',
 'y',
 '▁website',
 '.',
 '▁It',
 "'",
 's',
 '▁the',
 '▁bee',
 "'",
 's',
 '▁knees',
 '▁when',
 '▁it',
 '▁comes',
 '▁to',
 '▁e',
 '-',
 'commerce',
 '.']

Let's tokenize our inputs column which we created in the previous section.

In [19]:
# Function to tokenize our inputs.
def tok_func(x):
  return tokz(x["input"])

In [20]:
# Testing the difference b/w batched=True and the default.
%time tok_ds = ds.map(tok_func)

Map:   0%|          | 0/36473 [00:00<?, ? examples/s]

CPU times: user 9.17 s, sys: 59.2 ms, total: 9.23 s
Wall time: 9.44 s


In [21]:
%time tok_ds = ds.map(tok_func, batched=True)

Map:   0%|          | 0/36473 [00:00<?, ? examples/s]

CPU times: user 2.9 s, sys: 33.6 ms, total: 2.93 s
Wall time: 2.31 s


The performance gains with the usage of `batched` processing are quite pronounced...

In [22]:
round((2.3 - 9.4) / ((2.3 + 9.4)/2) * 100, 2)

-121.37

... in this case, we have an improvement which is close to 121%.

Looking at the dataset again, we can see that there is a new column called `input_ids`.

In [23]:
tok_ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})

These IDS come from a list called `vocab` in the tokenizer which contains unique integer for all token strings.

A random example of this can be:

In [24]:
row = tok_ds[0]
row['input'], row['input_ids']

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

In [25]:
# Checking vocab integers for random words.
tokz.vocab['▁needle'], tokz.vocab['▁the'], tokz.vocab['▁rain']

(9445, 262, 2894)

In [26]:
# NOTE: Transformers expects the labels column to be named labels by default.
# In our dataset this is refers to the `score` column.
tok_ds = tok_ds.rename_columns({'score': 'labels'})

## Test and Validation Sets

In [27]:
# Loading test set
eval_df = pd.read_csv(path/'test.csv')
eval_df.describe()

Unnamed: 0,id,anchor,target,context
count,36,36,36,36
unique,36,34,36,29
top,4112d61851461f60,hybrid bearing,inorganic photoconductor drum,G02
freq,1,2,1,3


The Transformers library uses a `Datasetdict` to hold our training and validation sets. It should be noted that here `validation` is labelled as `test`, so care must be taken.

In [28]:
dds = tok_ds.train_test_split(0.25, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

In [29]:
# We will use eval as our name for the test set.
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
# Now, creating a dataset object based on the changes above.
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

Map:   0%|          | 0/36 [00:00<?, ? examples/s]

## Training an Initial Model

In [30]:
from transformers import TrainingArguments, Trainer

In [31]:
# Selecting a batch size, number of epochs and learning rate
bs = 128
epochs = 5
lr = 8e-5

In [32]:
# Setting up our training arguments using the class provided by HF
args = TrainingArguments('outputs',
                         learning_rate=lr,
                         warmup_ratio=0.1,
                         lr_scheduler_type='cosine',
                         fp16=True,
                         evaluation_strategy="epoch",
                         per_device_train_batch_size=bs,
                         per_device_eval_batch_size=bs*2,
                         num_train_epochs=epochs,
                         weight_decay=0.01,
                         report_to='none')

In [33]:
# Function for our evaluation metric i.e. Pearson's Correlation
def corr(x, y):
  return np.corrcoef(x, y)[0][1]

def corr_d(eval_pred):
  return {'pearson': corr(*eval_pred)}

In [34]:
# Defining the model object and the trainer.
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args,
                  train_dataset=dds['train'],
                  eval_dataset=dds['test'],
                  tokenizer=tokz,
                  compute_metrics=corr_d)

pytorch_model.bin:   0%|          | 0.00/286M [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [35]:
# Training the model
trainer.train()

Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.032456,0.796482
2,No log,0.022213,0.822913
3,0.049000,0.022346,0.834993
4,0.049000,0.021834,0.838555
5,0.012400,0.021979,0.838472


TrainOutput(global_step=1070, training_loss=0.029367925073498877, metrics={'train_runtime': 264.8796, 'train_samples_per_second': 516.348, 'train_steps_per_second': 4.04, 'total_flos': 894508229346960.0, 'train_loss': 0.029367925073498877, 'epoch': 5.0})

Training for an increased number of epochs doesn't help the model at all.

In fact, the model appears to be overfitting to the training data since training loss decreases are occuring in conjunction with validation loss.

Based on the evaluation metric, Pearson's r doesn't improve beyond the 7th epoch.

## Improving the Model

There aren't many improvements that can be made using the baseline modeling and data augmentation approach.

We can experiment with:

1. Improving the current model / selecting a larger model.
2. Different data augmentation approaches.
3. Experimenting with trainer parameters.

In [39]:
inps = "anchor", "target", "context"

# Creating a function to quickly apply tokenization
def get_dds(df):
  ds = Dataset.from_pandas(df).rename_columns({'score': 'labels'})
  tok_ds = ds.map(tok_func,
                  batched=True,
                  remove_columns=inps+('inputs', 'id', 'section'))
  return DatasetDict({"train": tok_ds.select(trn_idxs),
                      "test" : tok_ds.select(val_idxs)})

In [40]:
# Now we will create a function to create a Trainer.
def get_model():
  return AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)

def get_trainer(dds, model):
  if model is None:
    model = get_model()
    args = TrainingArguments('ouputs', learning_rate=lr,
                             warmup_ratio=0.1,
                             lr_scheduler_type='cosine',
                             fp16=True,
                             evaluation_strategy="epoch",
                             per_device_train_batch_size=bs,
                             per_device_eval_batch_size=bs*2,
                             num_train_epochs=epochs,
                             weight_decay=wd,
                             report_to='none')
  return Trainer(model, args, train_dataset=dds['train'],
                 eval_dataset=dds['test'], tokenizer=tokz,
                 compute_metrics=corr)



Jeremy thought that the presence of special characters wasn't a good idea and having more control over these auto generated elements might yield better results.

In [41]:
sep = " [s] "
df['inputs'] = df.context + sep + df.anchor + sep + df.target
dds = get_dds(df)

ValueError: Column to remove ['section'] not in the dataset. Current columns in the dataset: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'inputs']