<a href="https://colab.research.google.com/github/frank-895/machine_learning_journey/blob/main/NLP_classification/NLP_phrase_to_phrase_matching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# U.S. Patent Phrase to Phrase Matching


## Introduction

**Natural Language Processing (NLP)** is a field of AI that focuses on the interaction between computers and human languages. The goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is both meaningful and useful.

This project fine-tunes a **pretrained** NLP model using the Hugging Face `transformers` library.

A pretrained model already has its parameters fitted on a large dataset. We can **fine-tune** it, meaning we continue training so that the parameters adapt to our specific task. Many of the features a model learns early on (such as detecting a 'corner' in an image) are relevant to a wide range of applications (essentially all image classification tasks, for example). In the same way, a pretrained NLP model already has a good grasp of how language (and the world) works, so starting from one drastically reduces the amount of training required.

We will use the NLP model to match key phrases in United States patent documents. A key term here is **document**: a piece of text of arbitrary length that is given to an NLP model as input.

We are tasked with comparing two phrases and scoring how similar they are in meaning, given the patent class they appear in.
- Score of `1` means identical meanings.
- Score of `0` means completely different meanings.

Scores in between are also possible: `0.5`, for example, indicates a somewhat similar meaning with differences.

This isn't strictly a classification problem yet, as the answer can fall anywhere between 0 and 1. So, to convert it into a classification problem we will represent each piece of data as a question:

Which category does the following text fall under?
`"TEXT1:{phrase1}; TEXT2:{phrase2}; CONTEXT:{context}"`
- Different
- Similar
- Identical
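
As a minimal sketch of this framing, the snippet below fills the template for one pair of phrases and maps a score to one of the three categories. The cut-offs (exactly `0` is Different, exactly `1` is Identical, anything in between is Similar) and the helper names `build_prompt` and `score_to_category` are assumptions made for illustration; the competition does not define them.

```python
def build_prompt(phrase1: str, phrase2: str, context: str) -> str:
    """Fill the question template with one pair of phrases and their context."""
    return f"TEXT1:{phrase1}; TEXT2:{phrase2}; CONTEXT:{context}"


def score_to_category(score: float) -> str:
    """Map a similarity score in [0, 1] to a label (cut-offs are an assumption)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score == 0.0:
        return "Different"
    if score == 1.0:
        return "Identical"
    return "Similar"


# One row taken from the training data shown later in this notebook.
prompt = build_prompt("abatement", "forest region", "A47")
print(prompt, "->", score_to_category(0.0))
```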

## Collecting Data

Because this data comes from a Kaggle competition, we will download it through the Kaggle API.

In [10]:
%%capture
!pip install kaggle
import os

# now the Kaggle API knows where to look for the kaggle.json file with credentials
os.environ['KAGGLE_CONFIG_DIR'] = "/content"

In [12]:
import zipfile, kaggle # to allow us to work with zip archives and interact with Kaggle services
from pathlib import Path # to create and manipulate file paths in an OO manner

path = Path('us-patent-phrase-to-phrase-matching')
kaggle.api.competition_download_cli(str(path)) # download all files for a specific competition
zipfile.ZipFile(f'{path}.zip', 'r').extractall(path) # open ZIP file and extracts data for use in notebook

us-patent-phrase-to-phrase-matching.zip: Skipping, found more recently modified local copy (use --force to force download)


We can now inspect the extracted files.

In [13]:
!ls {path}

sample_submission.csv  test.csv  train.csv


Because the data is stored in CSV files, pandas is a natural choice for loading it.

In [14]:
import pandas as pd

df = pd.read_csv(path/'train.csv')
df

|  | id | anchor | target | context | score |
| --- | --- | --- | --- | --- | --- |
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.50 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.50 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.00 |
| ... | ... | ... | ... | ... | ... |
| 36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.00 |
| 36469 | 42d9e032d1cd3242 | wood article | wooden box | B44 | 0.50 |
| 36470 | 208654ccb9e14fa3 | wood article | wooden handle | B44 | 0.50 |
| 36471 | 756ec035e694722b | wood article | wooden material | B44 | 0.75 |


We can also use the `describe()` method to summarize each column. Passing `include='object'` restricts the summary to the text columns.

In [16]:
df.describe(include='object')

|  | id | anchor | target | context |
| --- | --- | --- | --- | --- |
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |


The table above shows that the data contains a lot of repetition. There are only 733 unique anchors across 36,473 rows, and the most frequent one, 'component composite coating', appears 152 times.

## Feature Engineering

Now we can use the suggestion that was made in the introduction to create a **series** in our dataframe that will be used as input to the model. A series is the pandas name for a column. We read existing series using Python "dot" notation, but when creating or assigning a series we should use the dictionary-style bracket notation instead.

Each entry in the input column represents a **document** which we will use to fine-tune the NLP model.
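
To see why the bracket notation matters, here is a small sketch on a toy dataframe (the frame `toy` and the column name `combined` are made up for illustration). Assigning through dot notation to a name that is not yet a column only sets a Python attribute, so no column is created.

```python
import warnings

import pandas as pd

# Toy frame reusing two column names from the competition data.
toy = pd.DataFrame({"anchor": ["abatement"], "target": ["act of abating"]})

# Reading an existing column works with either notation.
assert toy.anchor.equals(toy["anchor"])

# Bracket notation creates a real column.
toy["input"] = "TEXT1: " + toy.anchor + " TEXT2: " + toy.target

# Dot notation on a new name only sets an attribute; pandas warns about this,
# so the warning is silenced here to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy.combined = toy.anchor + toy.target

print(list(toy.columns))  # 'input' is a column, 'combined' is not
```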

In [20]:
df['input'] = 'TEXT1: ' + df.anchor + ' TEXT2: ' + df.target + ' CONTEXT: ' + df.context
df.input

|  | input |
| --- | --- |
| 0 | TEXT1: abatement TEXT2: abatement of pollution... |
| 1 | TEXT1: abatement TEXT2: act of abating CONTEXT... |
| 2 | TEXT1: abatement TEXT2: active catalyst CONTEX... |
| 3 | TEXT1: abatement TEXT2: eliminating process CO... |
| 4 | TEXT1: abatement TEXT2: forest region CONTEXT:... |
| ... | ... |
| 36468 | TEXT1: wood article TEXT2: wooden article CONT... |
| 36469 | TEXT1: wood article TEXT2: wooden box CONTEXT:... |
| 36470 | TEXT1: wood article TEXT2: wooden handle CONTE... |
| 36471 | TEXT1: wood article TEXT2: wooden material CON... |


## Tokenization and Numericalization

Neural networks (NN) work with numbers, as we saw in the last notebook, available in the GitHub repository `frank-895/machine_learning_journey/tree/main/manual_creation_of_NN`. Each layer of the NN relies on matrix multiplication or activation functions, neither of which can be applied to strings.

As such, we have to perform two tasks:
- **Tokenization**, which is the process of breaking down text into smaller pieces, called **tokens**. Tokens are often words, subwords, or characters, depending on the level of tokenization. The list of all the unique tokens will define the **vocabulary**.
- **Numericalization**, which is converting each token into a number by assigning each token in the vocabulary a unique integer.

How this is done depends on the tokenizer we use, and there are thousands available. A tokenizer with a very large vocabulary can capture more distinctions, but it tends to make the model larger and slower. A small vocabulary has the opposite trade-off.
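
The two steps can be sketched with a deliberately simple word-level tokenizer. This is a toy illustration only: real tokenizers, including the one used below, split text into subwords and ship with a fixed vocabulary, and the helper names here are made up.

```python
def tokenize(text: str) -> list[str]:
    """Toy word-level tokenizer: lower-case the text and split on whitespace."""
    return text.lower().split()


def build_vocab(docs: list[str]) -> dict[str, int]:
    """Give each unique token an integer ID in order of first appearance."""
    vocab: dict[str, int] = {}
    for doc in docs:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def numericalize(text: str, vocab: dict[str, int]) -> list[int]:
    """Replace each token with its ID in the vocabulary."""
    return [vocab[token] for token in tokenize(text)]


docs = ["wood article", "wooden article", "wooden box"]
vocab = build_vocab(docs)
print(vocab)
print(numericalize("wooden box", vocab))
```

A token that is missing from this toy vocabulary would raise a `KeyError`; real tokenizers avoid that by falling back to smaller subword pieces or an "unknown" token.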


Since we will be using Transformers for tokenization, we need to convert our pandas dataframe into a Hugging Face `Dataset`.

In [25]:
%%capture
!pip install datasets
from datasets import Dataset, DatasetDict

In [26]:
ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

Because we are using a pretrained model and fine-tuning it, we need to use the same tokenization mechanism as the pretrained model to ensure the vocabularies are identical. This means we do not need to make all the little decisions involved with effective tokenization.

We will choose our NLP model, then use `AutoTokenizer` to create a tokenizer that is appropriate for that model. `AutoTokenizer` is essentially a lookup that maps each model to its tokenizer.

[Hugging Face Models](https://huggingface.co/models) hosts over 1 million pretrained models, many of them for NLP. They have a variety of different architectures trained on a variety of different **corpora** (collections of written texts).

We could choose one of many models pretrained for use on patents; however, for the purpose of learning we will opt for a more general model, `deberta-v3-small`.

In [28]:
%%capture
model_nm = 'microsoft/deberta-v3-small'

from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

Let's see how the tokenizer works...

The `'▁'` character marks the start of each word (as `'my'` at the start of a word can mean something different from `'my'` in the middle of a word).

In [29]:
tokz.tokenize("Hello, my name is Frank and I'm practicing NLP classification")

['▁Hello',
 ',',
 '▁my',
 '▁name',
 '▁is',
 '▁Frank',
 '▁and',
 '▁I',
 "'",
 'm',
 '▁practicing',
 '▁NLP',
 '▁classification']
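
Because `'▁'` marks word starts, the original sentence can be rebuilt from the tokens by joining them and turning each marker back into a space. The helper `detokenize` below is a hypothetical sketch for this one example, not the tokenizer's own decoding method.

```python
WORD_START = "▁"  # the marker seen at the start of words in the output above

tokens = ["▁Hello", ",", "▁my", "▁name", "▁is", "▁Frank", "▁and", "▁I",
          "'", "m", "▁practicing", "▁NLP", "▁classification"]


def detokenize(pieces: list[str]) -> str:
    """Join the pieces and turn every word-start marker back into a space."""
    return "".join(pieces).replace(WORD_START, " ").strip()


print(detokenize(tokens))
```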

Now we will tokenize and numericalize the `input` column using a simple function.

In [30]:
def tok_func(x): return tokz(x["input"])

Numericalization can be a costly process, so we will apply the function to every row in our dataset using `map` with `batched=True`, which passes the rows to the function in batches rather than one at a time. We will then inspect one row of our dataset, which now contains a column called `'input_ids'`: the tokenized and numericalized version of `input`.

Each number is the position in the vocabulary of the corresponding token in the string.
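
With `batched=True`, the function passed to `map` receives a dictionary that maps each column name to a list of values, which is why `tok_func` can hand `x["input"]` straight to the tokenizer. The sketch below imitates that behaviour in plain Python. `toy_tokenizer` and `batched_map` are hypothetical stand-ins written for this example, not the `transformers` or `datasets` implementations.

```python
def toy_tokenizer(texts: list[str]) -> dict[str, list[list[int]]]:
    """Stand-in for `tokz`: encodes each text as the lengths of its words."""
    return {"input_ids": [[len(word) for word in text.split()] for text in texts]}


def tok_func(batch: dict[str, list[str]]) -> dict[str, list[list[int]]]:
    # In batched mode, batch["input"] is a list of strings, not a single string.
    return toy_tokenizer(batch["input"])


def batched_map(rows: list[dict], func, batch_size: int = 2) -> list[dict]:
    """Apply func to column-wise batches and merge the new columns into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = func(batch)
        for i, row in enumerate(chunk):
            out.append({**row, **{key: values[i] for key, values in new_cols.items()}})
    return out


rows = [{"input": "wood article"}, {"input": "wooden box"}, {"input": "act of abating"}]
result = batched_map(rows, tok_func)
print(result[2])
```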

In [32]:
tok_ds = ds.map(tok_func, batched=True)
tok_ds[0]['input'], tok_ds[0]['input_ids']


('TEXT1: abatement TEXT2: abatement of pollution CONTEXT: A47',
 [1,
  54453,
  435,
  294,
  47284,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  20967,
  104917,
  294,
  336,
  5753,
  2])
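
One quick sanity check on this output: the input string contains `TEXT` twice, `abatement` twice, and a colon three times, so we would expect some IDs to repeat. The sketch below counts repeated IDs in the list printed above. Which ID belongs to which token is not confirmed by this output, so treat any pairing as an assumption.

```python
from collections import Counter

# The input_ids printed above for the first row of the dataset.
input_ids = [1, 54453, 435, 294, 47284, 54453, 445, 294, 47284,
             265, 6435, 20967, 104917, 294, 336, 5753, 2]

counts = Counter(input_ids)
# Keep only the IDs that occur more than once.
repeated = {token_id: n for token_id, n in counts.items() if n > 1}
print(repeated)
```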