## US patent phrase matching, from Jeremy Howard's Kaggle notebook

**Description and link:** The dataset comes from a competition that sought to rate two phrases as similar or different, depending on the category they belong to. The competition is [here](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/), and Jeremy Howard's notebook, from which I drew and practised all of the code below, is [here](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners/notebook).

I will download the dataset directly from Kaggle. Beforehand, I ran `pip install kaggle` and created an API credential (a username and key) that gives me access to the Kaggle API.
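
The credential step can be sketched as follows. This is a minimal sketch, assuming the Kaggle client looks for a `kaggle.json` file in the `~/.kaggle` directory by default. The helper name `write_kaggle_creds` and its `cred_dir` parameter are hypothetical, and `xxxx`/`yyyy` are placeholders, not real values.

```python
import json
from pathlib import Path

def write_kaggle_creds(creds: str, cred_dir: Path) -> Path:
    """Write the credential JSON to cred_dir/kaggle.json (hypothetical helper)."""
    json.loads(creds)  # fail early if the string is not valid JSON
    cred_dir.mkdir(parents=True, exist_ok=True)
    cred_path = cred_dir / "kaggle.json"
    cred_path.write_text(creds)
    cred_path.chmod(0o600)  # restrict access to the current user (limited effect on Windows)
    return cred_path

# Placeholder credentials; replace with the values from your own Kaggle account
creds = '{"username":"xxxx","key":"yyyy"}'

# Assumed default location; uncomment to actually write the file
# write_kaggle_creds(creds, Path.home() / ".kaggle")
```

The file has to exist before `import kaggle` runs, since the client authenticates when it is imported.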

In [1]:
#Install the datasets, transformers and kaggle packages (uncomment to run)
#!pip install transformers
#!pip install datasets
#!pip install kaggle

In [2]:
#Import pandas, NumPy, transformers and datasets
import pandas as pd
import transformers
import datasets
import numpy as np

In [3]:
#creds = '{"username":"xxxx","key":"yyyy"}' #Note: Import kaggle library for personal use

#Import important libraries for os and file manipulation
from pathlib import Path
import kaggle
import zipfile
#Make path to store the data to be downloaded
path = Path("us-patent-phrase-to-phrase-matching")

In [4]:
#Go ahead and download the dataset using the Kaggle API
kaggle.api.competition_download_cli(str(path))
zipfile.ZipFile(f"{path}.zip").extractall(path)

us-patent-phrase-to-phrase-matching.zip: Skipping, found more recently modified local copy (use --force to force download)


In [5]:
#Check the contents of directory and files downloaded
!dir {path}

 Volume in drive C has no label.
 Volume Serial Number is 9AC6-46B4

 Directory of C:\Users\Dell\Git-Projects\nlp_from_kaggle\us-patent-phrase-to-phrase-matching

08/11/2022  04:02 PM    <DIR>          .
08/11/2022  04:02 PM    <DIR>          ..
08/12/2022  02:28 AM               693 sample_submission.csv
08/12/2022  02:28 AM             1,965 test.csv
08/12/2022  02:28 AM         2,141,136 train.csv
               3 File(s)      2,143,794 bytes
               2 Dir(s)  149,297,606,656 bytes free


The directory contains three files of interest, all of them CSV files. The first thing to do is look at what each file holds, so the next step is to open the training file and do some exploratory data analysis (EDA).

In [6]:
df = pd.read_csv(path/"train.csv")
df.sample(7)

| | id | anchor | target | context | score |
|---|---|---|---|---|---|
| 2294 | e49be196400c90db | arrange in fashion | sorting | G03 | 0.25 |
| 18127 | 3d294efb3362eac9 | leveller | leveling system | F02 | 0.75 |
| 572 | a62cd63eb3259492 | acylate with acids | acylate | C07 | 0.5 |
| 34456 | e63fea436185f42e | upper series | top series | B66 | 0.75 |
| 20606 | c59a24740fece978 | morpholin | base | C07 | 0.25 |
| 2566 | f445b7e0d753ec2f | average pore size | particulate substance | C04 | 0.25 |
| 6434 | 1b5f4e9b1f12f812 | component composite coating | component curable coat | C09 | 0.25 |


In the competition this dataset comes from, the task is to predict the score column: how similar the anchor and target phrases in each row are to each other, given the context in which they are used.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36473 entries, 0 to 36472
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       36473 non-null  object 
 1   anchor   36473 non-null  object 
 2   target   36473 non-null  object 
 3   context  36473 non-null  object 
 4   score    36473 non-null  float64
dtypes: float64(1), object(4)
memory usage: 1.4+ MB


In [8]:
#Check the description of the table
df.describe(include="object")

| | id | anchor | target | context |
|---|---|---|---|---|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |


With 36K+ data points, there are only 733 unique anchor phrases, 29K+ unique target phrases and 106 unique contexts under which a patent can be categorized. The columns of interest, which serve as our independent variables for predicting the score, are the anchor phrase, target phrase and context of each row.

In [9]:
#Create an input column that concatenates these columns, with a label before each one acting as a separator
#I will use the same format Jeremy used for his concatenation

df["input"] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object
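
Because `head()` truncates long strings, it can help to see one full input built by hand. The sketch below reproduces the same concatenation for a single row in plain Python; `build_input` is a hypothetical helper, and the values are those of the first training row.

```python
def build_input(context: str, target: str, anchor: str) -> str:
    """Join the three fields using the same labels as the cell above."""
    return f"TEXT1: {context}; TEXT2: {target}; ANC1: {anchor}"

# Values of the first training row in this notebook
print(build_input("A47", "abatement of pollution", "abatement"))
# prints: TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement
```

The labels (`TEXT1`, `TEXT2`, `ANC1`) are arbitrary markers; what matters is that the training and test inputs use the same ones.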

The next step is to tokenize and numericalize the input column. Because we will be using a pretrained transformer, which comes with its own fixed vocabulary of tokens, the input must be tokenized in exactly the same way for the model to produce meaningful results. To perform these transformer-based tasks, we also need to convert the DataFrame into the format the Hugging Face libraries work with: a `Dataset` from the `datasets` library.
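
To make "tokenize and numericalize" concrete, here is a toy sketch. It assumes a simple whitespace tokenizer and a vocabulary built on the fly, which is a deliberate simplification and not the subword tokenizer used by the model; `build_vocab` and `numericalize` are hypothetical helpers.

```python
def build_vocab(texts):
    """Assign an integer ID to every distinct whitespace-separated token."""
    vocab = {}
    for text in texts:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab, unk_id=-1):
    """Map each token to its ID; tokens missing from the vocabulary get unk_id."""
    return [vocab.get(tok, unk_id) for tok in text.split()]

vocab = build_vocab(["abatement of pollution", "act of abating"])
ids = numericalize("abatement of noise", vocab)
print(ids)  # [0, 1, -1]: "noise" was never seen, so it has no ID
```

The unknown token illustrates why the tokenizer must match the model: an ID only means something relative to the vocabulary that produced it.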

In [10]:
from datasets import Dataset, DatasetDict

dset = Dataset.from_pandas(df)
dset

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

In [11]:
#In order to tokenize, we need to obtain the tokenizer from the model of choice

model_choice = 'microsoft/deberta-v3-small'

In [12]:
#Import AutoTokenizer and AutoModelForSequenceClassification from transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokz = AutoTokenizer.from_pretrained(model_choice)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [13]:
#Example of the tokenizer in action

tokz.tokenize("Hello World! This is my first underwhelmingly approach to using hugging-face transformers. Yours' in love")

['▁Hello',
 '▁World',
 '!',
 '▁This',
 '▁is',
 '▁my',
 '▁first',
 '▁underwhelming',
 'ly',
 '▁approach',
 '▁to',
 '▁using',
 '▁hugging',
 '-',
 'face',
 '▁transformers',
 '.',
 '▁Your',
 's',
 "'",
 '▁in',
 '▁love']
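
In the output above, a leading `▁` marks a token that starts a new word, while pieces without it (such as `ly`) continue the previous word. The sketch below reverses that convention on a few of the tokens shown. It is a simplification for illustration, and `detokenize` is a hypothetical helper rather than the tokenizer's own decoding method.

```python
def detokenize(tokens):
    """Glue subword pieces together, turning each '▁' marker back into a space."""
    return "".join(tokens).replace("▁", " ").strip()

# A slice of the token list printed above
pieces = ['▁my', '▁first', '▁underwhelming', 'ly', '▁approach']
print(detokenize(pieces))  # my first underwhelmingly approach
```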

In [14]:
#Let's define a function that takes a batch of rows and tokenizes the input column (or whichever column is used for NLP)

def token(x):
    """
    Takes a batch of rows from the dataset and
    returns the tokenized version of its input column
    --------------------------
    
    x: dict-like batch (or single row) with an "input" key
    """
    return tokz(x["input"])

In [15]:
#Apply function to dset

tok_ds = dset.map(token, batched=True)

  0%|          | 0/37 [00:00<?, ?ba/s]
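
The progress bar counts 37 batches rather than 36,473 rows because `batched=True` hands the function many rows at once. A minimal sketch of the arithmetic, assuming the default batch size of 1000 for `Dataset.map` (`num_batches` is a hypothetical helper):

```python
import math

def num_batches(num_rows: int, batch_size: int = 1000) -> int:
    """Number of batches needed to cover num_rows; the last batch may be partial."""
    return math.ceil(num_rows / batch_size)

print(num_batches(36473))  # 37 batches for the training data
print(num_batches(36))     # 1 batch for the small test file
```

Batching lets the tokenizer process whole lists of strings in one call, which is usually much faster than tokenizing row by row.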

In [16]:
tok_ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})

In [17]:
row0 = tok_ds[0]
row0["input"], row0["input_ids"]

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

In [18]:
tokz.vocab["▁of"] #The "▁" prefix marks a token that starts a new word; this is the ID of "▁of", which appears in the input_ids above

265

In [19]:
#We also need to rename the target column from score to labels, the column name the transformers Trainer expects
tok_ds = tok_ds.rename_columns({"score": "labels"})

The next step before model training is to split the data into training, validation and test sets. The labelled training data is split into a training set (75%) and a validation set (25%). Note that `train_test_split` names the held-out part `test`, although here it serves as the validation set.
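
The idea behind a seeded random split can be sketched in plain Python. This is not the `datasets` implementation: the shuffle order will differ, and rounding the held-out size up is an assumption. `split_indices` is a hypothetical helper.

```python
import math
import random

def split_indices(n_rows: int, test_size: float = 0.25, seed: int = 42):
    """Shuffle row indices with a fixed seed, then hold out a fraction of them."""
    n_test = math.ceil(n_rows * test_size)  # assumed rounding rule
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # same seed, same split every run
    return indices[n_test:], indices[:n_test]

train_idx, valid_idx = split_indices(36473)
print(len(train_idx), len(valid_idx))  # 27354 9119
```

Fixing the seed keeps the validation set the same between runs, so metrics from different experiments stay comparable.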

In [20]:
dds = tok_ds.train_test_split(0.25, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

In [21]:
#Our test set is already provided as a separate file
eval_df = pd.read_csv(path/"test.csv")
eval_df.head()

| | id | anchor | target | context |
|---|---|---|---|---|
| 0 | 4112d61851461f60 | opc drum | inorganic photoconductor drum | G02 |
| 1 | 09e418c93a776564 | adjust gas flow | altering gas flow | F23 |
| 2 | 36baf228038e314b | lower trunnion | lower locating | B60 |
| 3 | 1f37ead645e7f0c8 | cap component | upper portion | D06 |
| 4 | 71a5b6ad068d531f | neural stimulation | artificial neural network | H04 |


In [22]:
#We need to apply the same preprocessing as we did for the training dataset
eval_df["input"] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(token, batched=True)

  0%|          | 0/1 [00:00<?, ?ba/s]

In [23]:
eval_ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36
})

In [24]:
#Since the metric is the Pearson correlation coefficient, we define it and return it in a dictionary,
#following the transformers convention

def corr_d(eval_pred):
    corr = np.corrcoef(*eval_pred)[0][1]
    return {"pearson": corr}

#The dictionary key is the name under which the metric is reported during training
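
For reference, the Pearson coefficient that `np.corrcoef` returns can be written out by hand. The sketch below computes it from the definition and checks it against NumPy; the predictions and labels are made-up toy values, and `pearson` is a hypothetical helper.

```python
import numpy as np

def pearson(preds, labels) -> float:
    """Pearson r: covariance of the two series divided by the product of their spreads."""
    preds = np.asarray(preds, dtype=float).ravel()
    labels = np.asarray(labels, dtype=float).ravel()
    dp = preds - preds.mean()
    dl = labels - labels.mean()
    return float((dp * dl).sum() / np.sqrt((dp ** 2).sum() * (dl ** 2).sum()))

# Toy values for illustration only
preds = [0.2, 0.4, 0.7, 0.9]
labels = [0.25, 0.5, 0.75, 1.0]

# The hand-written version agrees with the NumPy call used in corr_d
assert np.isclose(pearson(preds, labels), np.corrcoef(preds, labels)[0][1])
```

A value of 1 means the predictions rise and fall perfectly in line with the labels, even if they are on a different scale.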

With most things in place, it is time to train our model

In [28]:
#Import TrainingArguments and Trainer from transformers

from transformers import TrainingArguments, Trainer

#Define some variables that will be used in our TrainingArguments boilerplate

bs = 128     #batch size
epochs = 4   #Few epochs, fast for experimentation purposes
lr = 8e-5    #Learning rate -- we experiment to find the highest possible lr that still gives a good result

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=False,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

PyTorch: setting up devices
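
The `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` arguments describe a learning rate that ramps up linearly and then decays along a cosine curve. The sketch below is a simplified version of that shape, not the library's own code. It assumes a single device, no gradient accumulation, and a warmup step count rounded up; `lr_at` is a hypothetical helper.

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 8e-5, warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

steps_per_epoch = math.ceil(27354 / 128)  # training rows / batch size
total_steps = steps_per_epoch * 4         # 4 epochs

for step in (0, 43, 86, 471, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

The warmup avoids large, unstable updates while the new regression head is still randomly initialised, and the decay lets training settle towards the end.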


Note that in the arguments constructed above, `fp16=False` is used instead of the `fp16=True` in the source notebook. The reason is that fp16 (mixed-precision) training requires a GPU, and this notebook was running on a CPU. [See this Stack Overflow question](https://stackoverflow.com/questions/68007097/getting-a-mixed-precison-cuda-error-while-running-a-cell-trying-to-fine-tuning-a)
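
Rather than hard-coding the flag, it could be derived from the hardware that is available. This is a minimal sketch, assuming PyTorch is the backend; `can_use_fp16` is a hypothetical helper, and it falls back to `False` when PyTorch is not installed.

```python
def can_use_fp16() -> bool:
    """Return True only when PyTorch is installed and can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())

use_fp16 = can_use_fp16()
print(use_fp16)
```

Passing `fp16=use_fp16` to `TrainingArguments` would then let the same notebook run unchanged on both a CPU-only laptop and a GPU workspace.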

In [26]:
#Create the model and pass it, together with the arguments, to the Trainer
model = AutoModelForSequenceClassification.from_pretrained(model_choice, num_labels=1)
train = Trainer(model,
               args,
               train_dataset=dds["train"],
               eval_dataset=dds["test"],
               compute_metrics=corr_d,
               tokenizer=tokz)

Downloading pytorch_model.bin:   0%|          | 0.00/273M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.bias', 'mask_predictions.classifier.weight', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.weight']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from 

In [None]:
#Train model
train.train()

Training on a CPU would take 14 hours or more. So at this point, I transferred my notebook to my [kaggle workspace](https://www.kaggle.com/davidakingbeni/us-patent) to achieve the result in a shorter time frame and with better parameters.