## **Getting started with NLP**

Using NLP on US patent data to compare pairs of words or phrases and score how similar they are.

In [1]:
!pip install -q datasets
!pip install transformers==4.30
!pip install accelerate



In [35]:
import pandas as pd
import numpy as np

In [3]:
df=pd.read_csv("train.csv")
df.head()

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.5
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.5
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.0


In [4]:
df.describe(include="object")           # To understand the data

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,8d135da0b55b8c88,component composite coating,composition,H01
freq,1,152,24,2186


We can combine the fields of each row into a single input string, for example:
"TEXT1: A47; TEXT2: eliminating process; ANC1: abatement"

In [5]:
df["input"] = "TEXT1: "+df.context +"; TEXT2: "+df.target+"; ANC1: "+df.anchor
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object
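
The same concatenation can be sketched for a single row in plain Python, which makes the field order easy to check. The helper name `build_input` is hypothetical and not part of the notebook; the values are taken from the first row of `train.csv` shown above.

```python
def build_input(context: str, target: str, anchor: str) -> str:
    """Join the three fields in the same order as df["input"]."""
    return f"TEXT1: {context}; TEXT2: {target}; ANC1: {anchor}"

# First training row: context A47, target "abatement of pollution", anchor "abatement"
example = build_input("A47", "abatement of pollution", "abatement")
print(example)
```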

### **Tokenization**

In [6]:
# Changing our dataframe into a Huggingface dataset
from datasets import Dataset, DatasetDict

ds=Dataset.from_pandas(df)

In [7]:
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

A deep learning model expects numbers as input, not words/sentences. So we need to do 2 things:
1. **Tokenization:** Split each text up into words (or tokens)
2. **Numericalization:** Convert each word (or token) into a number
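
As a rough illustration of these two steps, here is a toy whitespace tokenizer and vocabulary in plain Python. This is only a sketch: real tokenizers split text into subword pieces rather than whole words, and the names `toy_tokenize` and `build_vocab` are made up for this example.

```python
def toy_tokenize(text: str) -> list[str]:
    """Tokenization: split on whitespace (real tokenizers use subword pieces)."""
    return text.lower().split()

def build_vocab(texts: list[str]) -> dict[str, int]:
    """Give each distinct token an integer ID, in order of first appearance."""
    vocab: dict[str, int] = {}
    for text in texts:
        for token in toy_tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

texts = ["abatement of pollution", "act of abating"]
vocab = build_vocab(texts)

# Numericalization: replace each token with its ID
ids = [vocab[token] for token in toy_tokenize(texts[1])]
print(vocab)
print(ids)
```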

Each pretrained model comes with its own tokenizer and vocabulary, so we first need to choose the model we are going to use.

In [8]:
model_nm="microsoft/deberta-v3-small"

`AutoTokenizer` will create a tokenizer appropriate for a given model

In [9]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz=AutoTokenizer.from_pretrained(model_nm)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


An example of how the tokenizer splits text into 'tokens'.

In [10]:
tokz.tokenize("Hello! I'm Vansh Kharidia. I love software engineering :)")

['▁Hello',
 '!',
 '▁I',
 "'",
 'm',
 '▁Van',
 'sh',
 '▁Khar',
 'idia',
 '.',
 '▁I',
 '▁love',
 '▁software',
 '▁engineering',
 '▁:',
 ')']

In [11]:
tokz.tokenize("A platypus is an ornithorhyncus anatinus.")

['▁A',
 '▁platypus',
 '▁is',
 '▁an',
 '▁or',
 'nit',
 'hor',
 'hy',
 'n',
 'cus',
 '▁an',
 'at',
 'inus',
 '.']

A simple function to tokenize our inputs

In [12]:
def tok_func(x):
    return tokz(x['input'])

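We can use the `.map()` method to apply this function to every row in our dataset. With `batched=True`, the rows are passed to the tokenizer in batches, which is much faster than tokenizing one row at a time.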

In [13]:
tok_ds=ds.map(tok_func, batched=True)

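
With `batched=True`, the mapped function receives a dictionary of columns, each holding a list of values for a whole batch, instead of a single row. The plain-Python sketch below imitates that calling convention. `toy_map_batched` and `toy_tok_func` are hypothetical stand-ins, not the `datasets` implementation.

```python
def toy_tok_func(batch: dict[str, list[str]]) -> dict[str, list[list[str]]]:
    """Stand-in for tok_func: returns one new column for the whole batch."""
    return {"tokens": [text.split() for text in batch["input"]]}

def toy_map_batched(rows: list[dict], func, batch_size: int = 2) -> list[dict]:
    """Apply func to column-wise batches and merge the new columns into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of rows into a dict of columns
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = func(batch)
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_cols.items()}})
    return out

rows = [
    {"input": "TEXT1: A47; TEXT2: act of abating"},
    {"input": "TEXT1: A47; TEXT2: forest region"},
    {"input": "TEXT1: A47; TEXT2: active catalyst"},
]
mapped = toy_map_batched(rows, toy_tok_func)
print(mapped[0]["tokens"])
```

Each output row keeps its original `input` field and gains the new `tokens` field, which mirrors how `.map()` adds columns to the dataset.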

A new column called `input_ids` has been added to our dataset.

In [14]:
row=tok_ds[0]
row['input'], row['input_ids']

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

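The tokenizer has a dictionary called `vocab` that maps every token string to a unique integer ID. Tokens that start a word carry a leading `▁`, so `'of'` and `'▁of'` are different entries; the `265` in the row above corresponds to `'▁of'`. To look up the ID of a specific token string: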

In [15]:
tokz.vocab['of']

1580

In [16]:
tok_ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})

Transformers always assumes that your labels have the column name `labels`, but in our dataset, it's `score`, so we will rename it!

In [17]:
tok_ds=tok_ds.rename_columns({'score':'labels'})

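We define a function that computes the Pearson correlation coefficient between predictions and labels, and a wrapper that returns it in the dictionary format the `Trainer` expects for metrics.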

In [36]:
def corr(x,y):
  return np.corrcoef(x,y)[0][1]

def corr_d(eval_pred):
  return {'pearson': corr(*eval_pred)}
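
For reference, the value that `np.corrcoef(x, y)[0][1]` returns for two 1-D inputs can be computed by hand. The sketch below does this in plain Python for two equal-length lists; `pearson` is a hypothetical helper used only for illustration.

```python
import math

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; the 1/n factors cancel, so plain sums are enough."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    co_dev = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sum((a - mean_x) ** 2 for a in x)
    dev_y = sum((b - mean_y) ** 2 for b in y)
    return co_dev / math.sqrt(dev_x * dev_y)

scores = [0.5, 0.75, 0.25, 0.5, 0.0]                    # labels of the first five training rows
perfect = pearson(scores, scores)                       # identical inputs
inverted = pearson(scores, [1 - s for s in scores])     # perfectly reversed ranking
print(perfect, inverted)
```

A value close to 1 means the predictions rise and fall with the labels, and a value close to -1 means they move in opposite directions.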

### **Testing and validation sets**

In [37]:
eval_df=pd.read_csv('test.csv')
eval_df.describe()

Unnamed: 0,id,anchor,target,context
count,36,36,36,36
unique,36,34,36,29
top,4112d61851461f60,hybrid bearing,inorganic photoconductor drum,G02
freq,1,2,1,3


In [38]:
dds = tok_ds.train_test_split(0.25, seed=42)        # 25% validation, 75% train
dds

# It is a validation set even though it says 'test'

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})
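
The split sizes follow directly from the row count. Here is a minimal sketch, assuming the validation size is rounded up and the remainder goes to training, which matches the numbers shown. `toy_split` is a hypothetical helper that shuffles indices with Python's `random`, so it will not pick the same rows as `train_test_split`.

```python
import math
import random

def toy_split(n_rows: int, test_size: float, seed: int) -> tuple[list[int], list[int]]:
    """Shuffle row indices reproducibly and cut off a validation share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_valid = math.ceil(n_rows * test_size)   # assumption: round up
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = toy_split(36473, 0.25, seed=42)
print(len(train_idx), len(valid_idx))   # 27354 9119
```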

Creating a test set called `eval_ds`

In [46]:
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)


### **Training our model**

In [39]:
from transformers import TrainingArguments, Trainer

We pick a batch size that fits our GPUs, and a small number of epochs so we can run experiments quickly.

In [40]:
bs=128
epochs=4
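
These two settings determine how many optimizer steps training takes. A quick check, assuming a single device and no gradient accumulation (one step per batch), using the 27,354 training rows from the split above:

```python
import math

train_rows = 27354   # size of dds['train']
bs = 128
epochs = 4

steps_per_epoch = math.ceil(train_rows / bs)   # the last batch is smaller than bs
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)   # 214 856
```

The total agrees with the `global_step=856` reported by `trainer.train()`.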

In [41]:
lr=8e-5         # lr is the learning rate

Transformers uses the `TrainingArguments` class to set up the training configuration.

In [42]:
args = TrainingArguments(
    'outputs',
    learning_rate=lr,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    fp16=True,
    evaluation_strategy='epoch',
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs,
    weight_decay=0.01,
    report_to='none',
)
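
To make `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` more concrete, here is a sketch of a learning-rate curve with linear warmup followed by cosine decay. It is written from the usual definition of this schedule rather than copied from the library, and `lr_at` is a hypothetical helper. The 856 total steps are taken from the training output further down.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    """Linear warmup up to peak_lr, then cosine decay towards zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 856
for step in (0, 43, 86, 471, 856):
    print(step, f"{lr_at(step, total, 8e-5, 0.1):.2e}")
```

The rate starts at zero, reaches the peak of `8e-5` once warmup ends, and then falls smoothly back towards zero by the last step.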

Now we can create our model and a `Trainer`, which combines the data and the model.

In [43]:
model=AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)

trainer=Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'], tokenizer=tokz, compute_metrics=corr_d)

Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['lm_predictions.lm_head.dense.weight', 'mask_predictions.classifier.bias', 'mask_predictions.classifier.weight', 'mask_predictions.LayerNorm.weight', 'mask_predictions.dense.bias', 'mask_predictions.dense.weight', 'mask_predictions.LayerNorm.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from 
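
Setting `num_labels=1` makes the model output a single number per input, so it is trained as a regression problem with a mean-squared-error loss. A minimal sketch of that loss in plain Python follows; `mse` is a hypothetical helper and the predictions are made up for illustration.

```python
def mse(preds: list[float], labels: list[float]) -> float:
    """Mean of the squared differences between predictions and labels."""
    return sum((p - t) ** 2 for p, t in zip(preds, labels)) / len(labels)

labels = [0.5, 0.75, 0.25, 0.5, 0.0]     # scores of the first five training rows
guesses = [0.4, 0.75, 0.35, 0.5, 0.1]    # illustrative values, not model output
loss = mse(guesses, labels)
print(round(loss, 4))
```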

Finally, let's train our model!

In [44]:
trainer.train()



Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.025983,0.798615
2,No log,0.023316,0.823594
3,0.030000,0.023581,0.829898
4,0.030000,0.022816,0.831775


TrainOutput(global_step=856, training_loss=0.022971898038810657, metrics={'train_runtime': 201.714, 'train_samples_per_second': 542.431, 'train_steps_per_second': 4.244, 'total_flos': 716605488222960.0, 'train_loss': 0.022971898038810657, 'epoch': 4.0})

Getting the predictions

In [49]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds[0:5]          # Seeing some sample predictions

array([[ 0.50292969],
       [ 0.65527344],
       [ 0.62304688],
       [ 0.30810547],
       [-0.0085144 ]])

Similarity scores must lie between 0 and 1, but some of our predictions fall outside that range, so we clip them.

In clipping, values below 0 become 0 and values above 1 become 1.

In [51]:
preds = np.clip(preds, 0, 1)
preds[0:5]

array([[0.50292969],
       [0.65527344],
       [0.62304688],
       [0.30810547],
       [0.        ]])
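
Clipping is simple enough to write out by hand. The sketch below applies the same rule as `np.clip(preds, 0, 1)` to the five sample predictions above using only built-ins; `clip01` is a hypothetical helper.

```python
def clip01(value: float) -> float:
    """Limit a value to the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))

sample_preds = [0.50292969, 0.65527344, 0.62304688, 0.30810547, -0.0085144]
clipped = [clip01(p) for p in sample_preds]
print(clipped)
```

Only the last value changes, because it is the only one outside the interval.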

Now creating a submission file, like we would in a Kaggle contest

In [52]:
import datasets

submission=datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds
})

submission.to_csv('submission.csv', index=False)

1016
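
For comparison, a file with the same two columns can be written with Python's built-in `csv` module. This is an alternative sketch, not what the cell above does: it first flattens the one-element prediction rows into plain numbers, and the IDs and scores are placeholders rather than real test-set values.

```python
import csv
import io

ids = ["id_0", "id_1", "id_2"]            # placeholder IDs
preds_2d = [[0.5], [0.75], [0.0]]         # shape (n, 1), like the model output
scores = [row[0] for row in preds_2d]     # flatten to one number per row

buffer = io.StringIO()                    # in-memory file, so nothing is written to disk
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "score"])
writer.writerows(zip(ids, scores))
csv_text = buffer.getvalue()
print(csv_text)
```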