# U.S. Patent Phrase to Phrase Matching With DeBERTaV3 and YouTube Walkthrough
---
## Table of Contents
* [Introduction](#introduction)
* [YouTube Walkthrough](#youtube)
* [Before You Begin](#begin)
* [Data and Files](#datafiles)
* [Implementation](#implementation)
    * [Imports](#imports)
    * [Hyperparameter Definitions](#hyperparam)
    * [EDA](#eda)
    * [Merge Datasets](#merge)
    * [Model and Tokenizer Definition](#modeltokenizer)
    * [Dataset and Dataloader Implementation](#dataobj)
    * [Training Loop](#train)
    * [Free-up Memory](#freememo)
    * [Prediction Loop](#predict)
    * [Submission File Preparation](#submit)
* [Submission](#submission)
* [References](#references)

## Introduction <a class="anchor" id="introduction"></a>
In this competition, our aim is to predict the similarity between two given phrases: the *anchor* and the *target*. In the competition's own words: *In this competition, you will train your models on a novel semantic similarity dataset to extract relevant information by matching key phrases in patent documents.*

We'll **treat this task as a sequence classification problem**. However, instead of assigning a class to each phrase pair, we'll predict a single similarity score per pair and train with a regression loss. Predictions are evaluated with the *Pearson correlation coefficient*.
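
To make the difference between the training loss and the evaluation metric concrete, here is a minimal sketch in plain Python. It assumes the regression loss is mean squared error (the loss the Hugging Face sequence-classification head applies when `num_labels=1`); the `gold` and `pred` values are invented for illustration and are not competition data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mse(xs, ys):
    """Mean squared error between predictions and labels."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical labels and model outputs, for illustration only.
gold = [0.0, 0.25, 0.5, 0.75, 1.0]
pred = [0.1, 0.2, 0.6, 0.7, 0.9]
# The same predictions after a positive rescale and a shift.
shifted = [2 * p + 3 for p in pred]

r = pearson(pred, gold)
r_shifted = pearson(shifted, gold)
loss = mse(pred, gold)
loss_shifted = mse(shifted, gold)

print(f"pearson: {r:.4f} vs {r_shifted:.4f}")
print(f"mse:     {loss:.4f} vs {loss_shifted:.4f}")
```

The correlation is the same for `pred` and `shifted`, while the loss is very different. Pearson only rewards getting the relative ordering and spacing right; the regression loss additionally pulls the outputs onto the scale of the labels.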

---
Questions and suggestions are welcome. You can leave them in the comments.

## YouTube Walkthrough <a class="anchor" id="youtube"></a>
If you want to see me implement and explain this notebook live, watch the YouTube video below.

#### Kaggle Competition Walkthrough [NLP] - U.S. Patent Phrase to Phrase Matching - CommonLit Readability Prize
[<img src="https://i.imgur.com/FQZRfBG.png" width="640" height="360"/>](https://www.youtube.com/watch?v=o0mkv52nazI)

## Before You Begin <a class="anchor" id="begin"></a>

* **Toggle the *internet off* button for this notebook.** This competition requires submissions to run offline, so you have to turn off internet access for this notebook before you submit. You can do it from the right panel.

* Add pretrained [DeBERTaV3](https://www.kaggle.com/datasets/debarshichanda/debertav3base). Since the internet is off for this notebook, we need a way to load our DeBERTaV3 model offline. To add it, go to the right panel, click *data*, click *add-data* and type *debertav3base*. Add the resulting dataset.

* Add [cpc-codes](https://www.kaggle.com/datasets/xhlulu/cpc-codes). We'll merge this additional data with the competition data. To add it, go to the right panel, click *data*, click *add-data* and type *cpc-codes*. Add the resulting dataset.

## Data and Files <a class="anchor" id="datafiles"></a>

We're given three input files: *train.csv*, which contains the training data, *test.csv*, which contains the test data, and *sample_submission.csv*, which shows the submission format. The columns in these files are:

* `id` - unique ID for a pair of phrases
* `anchor` - first phrase of the pair
* `target` - second phrase of the pair
* `context` - patent class identifier (CPC code)
* `score` - similarity score (we'll predict this)


1. **train.csv**: contains training data
    * `id`
    * `anchor`
    * `target`
    * `context`
    * `score` 
  
  
2. **test.csv**: contains the test data
    * `id`
    * `anchor`
    * `target`
    * `context`  
    
    
3. **sample_submission.csv**: contains the submission format
    * `id`
    * `score`
    
---
Additionally, we load *titles.csv* from the *cpc-codes* dataset. Its columns are:

* `code` - patent class identifier (corresponds to `context` in the main data)
* `title` - title of the patent class
* `section` - section identifier of the patent
* `class` - class of the patent
* `subclass` - subclass of the patent
* `group` - group of the patent
* `main_group` - main group of the patent


1. **titles.csv**: only file
    * `code`
    * `title`
    * `section`
    * `class`
    * `subclass`
    * `group`
    * `main_group`

## Implementation <a class="anchor" id="implementation"></a>

### Imports <a class="anchor" id="imports"></a>

In [None]:
import pandas as pd
import transformers
from transformers import DebertaV2TokenizerFast, DebertaV2ForSequenceClassification
import torch
from torch import optim
from torch.utils.data import DataLoader, Dataset, random_split
from torchmetrics.regression import PearsonCorrCoef
import numpy as np
import random
import timeit
from tqdm import tqdm

### Hyperparameter Definitions <a class="anchor" id="hyperparam"></a>

In [None]:
RANDOM_SEED = 42
MODEL_PATH = "/kaggle/input/debertav3base"
MAX_LENGTH = 256
BATCH_SIZE = 64
LEARNING_RATE = 2e-5
EPOCHS = 2

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
torch.cuda.manual_seed(RANDOM_SEED)
torch.cuda.manual_seed_all(RANDOM_SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = "cuda" if torch.cuda.is_available() else "cpu"
transformers.utils.logging.set_verbosity_error() 

### EDA <a class="anchor" id="eda"></a>

In [None]:
train_df = pd.read_csv("/kaggle/input/us-patent-phrase-to-phrase-matching/train.csv")
test_df = pd.read_csv("/kaggle/input/us-patent-phrase-to-phrase-matching/test.csv")
submission_df = pd.read_csv("/kaggle/input/us-patent-phrase-to-phrase-matching/sample_submission.csv")
patents_df = pd.read_csv("/kaggle/input/cpc-codes/titles.csv")

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
submission_df.head()

In [None]:
patents_df.head()

### Merge Datasets <a class="anchor" id="merge"></a>

In [None]:
updated_train_df = train_df.merge(patents_df, left_on='context', right_on='code')
updated_train_df["input"] = updated_train_df["title"] + " " + updated_train_df["anchor"]
updated_train_df.tail()

In [None]:
updated_test_df = test_df.merge(patents_df, left_on='context', right_on='code')
updated_test_df["input"] = updated_test_df["title"] + " " + updated_test_df["anchor"]
updated_test_df.tail()
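
The merges above use pandas' default inner join, which silently drops any row whose `context` has no matching `code`. For the test set that would mean missing rows in the submission. The sketch below shows one possible safeguard, not the notebook's own method: a left join with `validate="many_to_one"`, on toy frames whose values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for test.csv and titles.csv; all values are invented.
pairs = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "anchor": ["abatement", "abatement", "wood article"],
    "target": ["act of abating", "forest region", "wooden box"],
    "context": ["A47", "A47", "ZZ9"],  # "ZZ9" has no title on purpose
})
titles = pd.DataFrame({
    "code": ["A47", "B27"],
    "title": ["FURNITURE", "WORKING WOOD"],
})

# how="left" keeps every phrase pair, matched or not;
# validate raises if a code appears more than once in titles.
merged = pairs.merge(titles, how="left", left_on="context", right_on="code",
                     validate="many_to_one")

# Fall back to an empty title so the concatenation never produces NaN.
merged["input"] = (merged["title"].fillna("") + " " + merged["anchor"]).str.strip()

unmatched = merged["title"].isna().sum()
print(f"{unmatched} of {len(merged)} rows had no matching title")
```

If `unmatched` is zero on the real data, the inner and left joins give the same rows and the distinction does not matter.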

### Model and Tokenizer Definition <a class="anchor" id="modeltokenizer"></a>

In [None]:
tokenizer = DebertaV2TokenizerFast.from_pretrained(MODEL_PATH)
model = DebertaV2ForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=1).to(device)

### Dataset and Dataloader Implementation <a class="anchor" id="dataobj"></a>

Note that the dataset classes below pass two inputs to the `tokenizer`. Given a pair of texts, the tokenizer joins them into one sequence as `[CLS] text1 [SEP] text2 [SEP]`, so the model sees the title-plus-anchor text and the target together.
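
As a rough picture of what a pair encoding looks like, here is a toy sketch in plain Python. It is not the real tokenizer: `layout_pair` is a hypothetical helper that splits on whitespace instead of using subword pieces, and it pads to a fixed length. It only illustrates the layout, including the leading `[CLS]` and trailing `[SEP]` markers, and how the attention mask flags padding.

```python
def layout_pair(text_a, text_b, pad_to):
    """Toy layout of a text pair: [CLS] a [SEP] b [SEP], followed by padding."""
    tokens = ["[CLS]", *text_a.split(), "[SEP]", *text_b.split(), "[SEP]"]
    if len(tokens) > pad_to:
        raise ValueError("pad_to is shorter than the encoded pair")
    n_pad = pad_to - len(tokens)
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(tokens) + [0] * n_pad
    return tokens + ["[PAD]"] * n_pad, attention_mask


tokens, mask = layout_pair("FURNITURE abatement", "act of abating", pad_to=10)
print(tokens)
print(mask)
```

The real tokenizer returns integer `input_ids` rather than strings, but the `attention_mask` it produces plays the same role as `mask` here.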

In [None]:
class PhraseTrainDataset(Dataset):
    def __init__(self, inputs, targets, scores, tokenizer):
        self.scores = scores
        self.encodings = tokenizer(inputs, targets, padding=True, truncation=True, max_length=MAX_LENGTH)

    def __getitem__(self, idx):
        out_dic = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        out_dic["scores"] = self.scores[idx]
        return out_dic
    
    def __len__(self):
        return len(self.scores)

In [None]:
class PhraseSubmitDataset(Dataset):
    def __init__(self, inputs, targets, ids, tokenizer):
        self.ids = ids
        self.encodings = tokenizer(inputs, targets, padding=True, truncation=True, max_length=MAX_LENGTH)

    def __getitem__(self, idx):
        out_dic = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        out_dic["ids"] = self.ids[idx]
        return out_dic
    
    def __len__(self):
        return len(self.ids)

In [None]:
dataset = PhraseTrainDataset(updated_train_df["input"].to_list(), updated_train_df["target"].to_list(), updated_train_df["score"].to_list(), tokenizer)
test_dataset = PhraseSubmitDataset(updated_test_df["input"].to_list(), updated_test_df["target"].to_list(), updated_test_df["id"].to_list(), tokenizer)
print("-"*30)
print(len(dataset))
print(dataset[0])
print("-"*30)
print(len(test_dataset))
print(test_dataset[0])

In [None]:
generator = torch.Generator().manual_seed(RANDOM_SEED)
train_dataset, val_dataset = random_split(dataset, [0.9, 0.1], generator=generator)
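
The seeded generator is what makes this 90/10 split reproducible. The principle can be sketched in plain Python; `split_indices` is a hypothetical helper and does not reproduce the exact permutation `random_split` draws, only the idea that a fixed seed yields the same disjoint split on every run.

```python
import random


def split_indices(n, val_fraction=0.1, seed=42):
    """Shuffle range(n) with a fixed seed and cut it into train/validation parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = int(n * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(1000)
print(len(train_idx), len(val_idx))
```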

In [None]:
train_dataloader = DataLoader(dataset=train_dataset,
                              batch_size=BATCH_SIZE,
                              shuffle=True)

val_dataloader = DataLoader(dataset=val_dataset,
                            batch_size=BATCH_SIZE,
                            shuffle=False)  # evaluation does not need shuffling

test_dataloader = DataLoader(dataset=test_dataset,
                             batch_size=BATCH_SIZE,
                             shuffle=False)

### Training Loop <a class="anchor" id="train"></a>

In [None]:
optimizer = optim.AdamW(model.parameters(), lr=LEARNING_RATE)
pearson = PearsonCorrCoef()

start = timeit.default_timer() 
for epoch in tqdm(range(EPOCHS), position=0, leave=True):
    model.train()
    train_running_loss = 0 
    for idx, sample in enumerate(tqdm(train_dataloader, position=0, leave=True)):
        input_ids = sample['input_ids'].to(device)
        attention_mask = sample['attention_mask'].to(device)
        targets = sample["scores"].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=targets)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_running_loss += loss.item()
    train_loss = train_running_loss / (idx + 1)

    model.eval()
    val_running_loss = 0 
    preds = []
    golds = []
    with torch.no_grad():
        for idx, sample in enumerate(tqdm(val_dataloader, position=0, leave=True)):
            input_ids = sample['input_ids'].to(device)
            attention_mask = sample['attention_mask'].to(device)
            targets = sample["scores"].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=targets)
            
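            # reshape(-1) stays 1-D even if the last batch holds a single sample,
            # where squeeze() would return a 0-d tensor that cannot be iterated
            preds.extend(outputs["logits"].reshape(-1).tolist())
            golds.extend(targets.tolist())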
            
            val_running_loss += outputs.loss.item()
        val_loss = val_running_loss / (idx + 1)

    print("-"*30)
    print(f"Pearson Score: {float(pearson(torch.tensor(preds), torch.tensor(golds))):.4f}")
    print(f"Train Loss EPOCH {epoch+1}: {train_loss:.4f}")
    print(f"Valid Loss EPOCH {epoch+1}: {val_loss:.4f}")
    print("-"*30)
stop = timeit.default_timer()
print(f"Training Time: {stop-start:.2f}s")

### Free-up Memory <a class="anchor" id="freememo"></a>

In [None]:
torch.cuda.empty_cache()

### Prediction Loop <a class="anchor" id="predict"></a>

In [None]:
preds = []
ids = []
model.eval()
with torch.no_grad():
    for idx, sample in enumerate(tqdm(test_dataloader, position=0, leave=True)):
        input_ids = sample['input_ids'].to(device)
        attention_mask = sample['attention_mask'].to(device)
        ids.extend(sample["ids"])
        
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # reshape(-1) stays 1-D even if the last batch holds a single sample
        preds.extend(outputs["logits"].reshape(-1).tolist())

### Submission File Preparation <a class="anchor" id="submit"></a>

In [None]:
submission_df = pd.DataFrame(list(zip(ids, preds)),
               columns =["id", "score"])
submission_df.to_csv("submission.csv", index=False)
submission_df.head()
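
Before submitting, it can be worth checking that every test `id` appears exactly once in the predictions. The sketch below is one way to do that in plain Python; `check_submission_ids` is a hypothetical helper, and the ids are invented for illustration.

```python
from collections import Counter


def check_submission_ids(expected_ids, submitted_ids):
    """Return sorted lists of missing, duplicated and unexpected ids."""
    counts = Counter(submitted_ids)
    expected = set(expected_ids)
    submitted = set(counts)
    missing = sorted(expected - submitted)
    duplicated = sorted(i for i, c in counts.items() if c > 1)
    unexpected = sorted(submitted - expected)
    return missing, duplicated, unexpected


# Invented ids: "b" and "c" are missing, "a" is repeated, "d" is unknown.
missing, duplicated, unexpected = check_submission_ids(
    ["a", "b", "c"], ["a", "a", "d"]
)
print(missing, duplicated, unexpected)
```

In the notebook this could be called with `test_df["id"].to_list()` and the collected `ids` list; three empty lists mean the submission covers the test set exactly.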

## Submission <a class="anchor" id="submission"></a>

If your notebook runs smoothly, you can go to the right panel and click *submit*. Congratulations :)     
Questions and suggestions are welcome. You can leave them in the comments.

## References <a class="anchor" id="references"></a>

* https://www.kaggle.com/code/ksork6s4/uspppm-bert-for-patents-baseline-train

<img src="https://i.imgur.com/8gNhPcv.jpg"/>