<a href="https://colab.research.google.com/github/bachaudhry/FastAI-22-23/blob/main/FastAI_2022_Getting_Started_With_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Getting Started with NLP - Using FastAI and Hugging Face**


In [5]:
import os
import numpy as np
import pandas as pd

iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

In [2]:
# Setting up Kaggle so that we can download datasets directly
!pip install kaggle



In [3]:
# Kaggle API credentials (placeholders: substitute your own and never commit a real key)
creds = '{"username":"<your-kaggle-username>","key":"<your-kaggle-api-key>"}'

In [4]:
# Write the credentials file only if it does not already exist
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
  cred_path.parent.mkdir(exist_ok=True)
  cred_path.write_text(creds)
  cred_path.chmod(0o600)
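
As an alternative to pasting the key into a cell, here is a minimal sketch that reads the credentials from environment variables and writes the same kind of file. The helper name `write_kaggle_creds`, the dummy fallback values and the temporary demo directory are assumptions made for illustration; `KAGGLE_USERNAME` and `KAGGLE_KEY` are assumed to be set in your environment.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def write_kaggle_creds(username: str, key: str, cred_path: Path) -> bool:
    """Write a kaggle.json-style file unless one already exists.

    Returns True if the file was written, False if it was left untouched.
    """
    if cred_path.exists():
        return False
    cred_path.parent.mkdir(parents=True, exist_ok=True)
    cred_path.write_text(json.dumps({"username": username, "key": key}))
    cred_path.chmod(0o600)  # owner read/write only
    return True


# Demo against a temporary directory so the real ~/.kaggle is not touched.
demo_path = Path(tempfile.mkdtemp()) / ".kaggle" / "kaggle.json"

# Dummy fallbacks keep the sketch runnable when the variables are not set.
user = os.environ.get("KAGGLE_USERNAME", "example-user")
key = os.environ.get("KAGGLE_KEY", "example-key")

first_write = write_kaggle_creds(user, key, demo_path)   # file is created
second_write = write_kaggle_creds(user, key, demo_path)  # existing file is kept
file_mode = stat.S_IMODE(demo_path.stat().st_mode)
```

The second call leaves the existing file alone, which mirrors the `if not cred_path.exists()` guard in the cell above.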

Now that Kaggle is set up for this notebook, let's download the **US Patent Phrase Matching** dataset.

In [8]:
path = Path('us-patent-phrase-to-phrase-matching')

if not iskaggle and not path.exists():
  import zipfile, kaggle
  kaggle.api.competition_download_cli(str(path))
  zipfile.ZipFile(f'{path}.zip').extractall(path)

us-patent-phrase-to-phrase-matching.zip: Skipping, found more recently modified local copy (use --force to force download)


## **Import Data and EDA**

Here's a [description](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data) of the dataset that we'll be working on.

In short, each row is a pair of phrases: an `anchor` and a `target`.

The similarity of each pair is scored within a patent `context`, a CPC (Cooperative Patent Classification) code that indicates the subject area the patent relates to.

In [11]:
# Checking local / GDrive path to verify files in the downloaded dataset.
!ls {path}

sample_submission.csv  test.csv  train.csv


In [12]:
# Loading training dataset in a DataFrame
df = pd.read_csv(path/'train.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36473 entries, 0 to 36472
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       36473 non-null  object 
 1   anchor   36473 non-null  object 
 2   target   36473 non-null  object 
 3   context  36473 non-null  object 
 4   score    36473 non-null  float64
dtypes: float64(1), object(4)
memory usage: 1.4+ MB


In [13]:
df.head(10)

|   | id | anchor | target | context | score |
|---|---|---|---|---|---|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.5 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.5 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.0 |
| 5 | 067203128142739c | abatement | greenhouse gases | A47 | 0.25 |
| 6 | 061d17f04be2d1cf | abatement | increased rate | A47 | 0.25 |
| 7 | e1f44e48399a2027 | abatement | measurement level | A47 | 0.25 |
| 8 | 0a425937a3e86d10 | abatement | minimising sounds | A47 | 0.5 |
| 9 | ef2d4c2e6bbb208d | abatement | mixing core materials | A47 | 0.25 |


In [14]:
df.tail(10)

|   | id | anchor | target | context | score |
|---|---|---|---|---|---|
| 36463 | 16a5c8551e534d1c | wood article | wood apple fruit | B44 | 0.0 |
| 36464 | 8ceaa2b5c2d56250 | wood article | wood article | B44 | 1.0 |
| 36465 | c4ac9d407fb427ab | wood article | wood logs | B44 | 0.5 |
| 36466 | 8a57100f6ee40ffc | wood article | wood material | B44 | 0.75 |
| 36467 | f55e072f78d1fedb | wood article | wood substrate | B44 | 0.5 |
| 36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.0 |
| 36469 | 42d9e032d1cd3242 | wood article | wooden box | B44 | 0.5 |
| 36470 | 208654ccb9e14fa3 | wood article | wooden handle | B44 | 0.5 |
| 36471 | 756ec035e694722b | wood article | wooden material | B44 | 0.75 |
| 36472 | 8d135da0b55b8c88 | wood article | wooden substrate | B44 | 0.5 |


In [16]:
df.describe(include='object')

|   | id | anchor | target | context |
|---|---|---|---|---|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |




The training set has 36473 rows, with 733 unique anchors, 29340 unique targets and 106 unique contexts. Every `id` is unique.





In [18]:
# Concatenate context, target and anchor into a single `input` string for the model.
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object
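
Since the output above is truncated, here is a small sketch of the same concatenation applied to a single row, so the full format is visible. `build_input` is a hypothetical helper that is not part of the notebook; the values come from the first row shown by `df.head(10)`.

```python
def build_input(context: str, target: str, anchor: str) -> str:
    """Mirror the notebook's column concatenation for a single row."""
    return "TEXT1: " + context + "; TEXT2: " + target + "; ANC1: " + anchor


# First training row, as shown by df.head(10)
example = build_input("A47", "abatement of pollution", "abatement")
print(example)
```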

## Tokenization Using HF Tokenizer

We will tokenize with a Hugging Face tokenizer. The data itself is held in a `Dataset` object from Hugging Face's `datasets` library.

In [20]:
!pip install datasets

from datasets import Dataset, DatasetDict

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed dataset

In [22]:
# This is how a dataset object works when we load in the training DF.
ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

Tokenization depends on the model we use, because each pretrained model comes with its own vocabulary. The model name therefore has to be passed to the tokenizer explicitly.
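
To see why the tokenizer has to match the model, here is a toy sketch with two made-up vocabularies standing in for two different pretrained models. The vocabularies, the `encode` helper and the IDs are all invented for illustration and have nothing to do with the real DeBERTa vocabulary.

```python
def encode(tokens, vocab, unk_id=0):
    """Map token strings to integer IDs, using unk_id for unknown tokens."""
    return [vocab.get(tok, unk_id) for tok in tokens]


# Two made-up vocabularies: same tokens, different integer assignments.
vocab_a = {"[UNK]": 0, "wood": 1, "article": 2, "wooden": 3}
vocab_b = {"[UNK]": 0, "article": 1, "wooden": 2, "wood": 3}

tokens = ["wood", "article"]
ids_a = encode(tokens, vocab_a)  # IDs as "model A" expects them
ids_b = encode(tokens, vocab_b)  # IDs as "model B" expects them

# Reading model A's IDs with model B's vocabulary gives different words.
id_to_token_b = {idx: tok for tok, idx in vocab_b.items()}
misread = [id_to_token_b[i] for i in ids_a]
```

Here `ids_a` and `ids_b` differ for the same text, and `misread` shows that IDs produced for one model would mean something else to the other.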

In [23]:
# Choosing a small NLP model for exploration
model_nm = 'microsoft/deberta-v3-small'

# Importing HuggingFace Tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/578 [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]



In [24]:
tokz.tokenize("In the midst of winter, I found there was, within me, an invincible summer.")

['▁In',
 '▁the',
 '▁midst',
 '▁of',
 '▁winter',
 ',',
 '▁I',
 '▁found',
 '▁there',
 '▁was',
 ',',
 '▁within',
 '▁me',
 ',',
 '▁an',
 '▁invincible',
 '▁summer',
 '.']

In [25]:
tokz.tokenize("amazon.com, is a super-handy website. It's the bee's knees when it comes to e-commerce.")

['▁amazon',
 '.',
 'com',
 ',',
 '▁is',
 '▁a',
 '▁super',
 '-',
 'hand',
 'y',
 '▁website',
 '.',
 '▁It',
 "'",
 's',
 '▁the',
 '▁bee',
 "'",
 's',
 '▁knees',
 '▁when',
 '▁it',
 '▁comes',
 '▁to',
 '▁e',
 '-',
 'commerce',
 '.']

Let's tokenize the `input` column that we created in the previous section.

In [26]:
# Function to tokenize our inputs.
def tok_func(x):
  return tokz(x["input"])

In [27]:
# Timing the default (one row per call) against batched=True.
%time tok_ds = ds.map(tok_func)

Map:   0%|          | 0/36473 [00:00<?, ? examples/s]

CPU times: user 14.2 s, sys: 94.1 ms, total: 14.3 s
Wall time: 18.2 s


In [28]:
%time tok_ds = ds.map(tok_func, batched=True)

Map:   0%|          | 0/36473 [00:00<?, ? examples/s]

CPU times: user 4.18 s, sys: 49.8 ms, total: 4.23 s
Wall time: 4.44 s


The performance gains with the usage of `batched` processing are quite pronounced...

In [35]:
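# Symmetric percent difference between the batched and unbatched wall times
round((4.44 - 18.2) / ((4.44 + 18.2)/2) * 100, 2)

-121.55

... in this case, the symmetric percent difference between the two wall times is close to 122%. Put more directly, the batched run took 4.44 s instead of 18.2 s, a speedup of roughly 4x.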
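
The 122% figure is a symmetric percent difference, which is only one way to express the gain. As a quick sketch, here are the same two wall times expressed as a speedup factor and as a share of time saved. The variable names are made up for illustration; the timings are the ones reported by `%time` above.

```python
unbatched_s = 18.2  # wall time reported for ds.map(tok_func)
batched_s = 4.44    # wall time reported for ds.map(tok_func, batched=True)

# How many times faster the batched run is.
speedup = unbatched_s / batched_s

# Share of the original wall time that was saved.
time_saved_pct = (unbatched_s - batched_s) / unbatched_s * 100

# The notebook's metric: difference relative to the mean of the two times.
symmetric_diff_pct = (batched_s - unbatched_s) / ((batched_s + unbatched_s) / 2) * 100

print(round(speedup, 2), round(time_saved_pct, 2), round(symmetric_diff_pct, 2))
```

The speedup factor (about 4.1x) and the time saved (about 76%) are usually easier to interpret than the symmetric difference.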

Looking at the dataset again, we can see three new columns: `input_ids`, `token_type_ids` and `attention_mask`.

In [31]:
tok_ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})

These IDs come from a dictionary called `vocab` in the tokenizer, which maps every token string to a unique integer.

Here is the first row as an example:

In [39]:
row = tok_ds[0]
row['input'], row['input_ids']

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

In [45]:
# Checking vocab integers for a few arbitrary words.
tokz.vocab['▁needle'], tokz.vocab['▁the'], tokz.vocab['▁rain']

(9445, 262, 2894)
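
The mapping also works in reverse. Here is a small sketch using only the three token/ID pairs looked up above; `vocab_sample` is a hand-copied subset of the vocabulary, not the real `tokz.vocab`.

```python
# Token -> ID pairs copied from the tokz.vocab lookups shown above.
vocab_sample = {"▁needle": 9445, "▁the": 262, "▁rain": 2894}

# Invert the mapping to go from IDs back to token strings.
id_to_token = {idx: tok for tok, idx in vocab_sample.items()}

ids = [262, 2894]
tokens = [id_to_token[i] for i in ids]

# "▁" marks the start of a word, so swapping it for a space restores the text.
text = "".join(tokens).replace("▁", " ").strip()
```

With the real tokenizer, `tokz.convert_ids_to_tokens(...)` and `tokz.decode(...)` do this for the full vocabulary.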

In [46]:
# NOTE: Transformers expects the target column to be named `labels` by default.
# In our dataset that is the `score` column.
tok_ds = tok_ds.rename_columns({'score': 'labels'})