In [1]:
# This notebook is by Anastasia Ruzmaikina, for Kaggle Competition US Patent Phrase to Phrase Matching

Can you extract meaning from a large, text-based dataset derived from inventions? Here's your chance to do so.

The U.S. Patent and Trademark Office (USPTO) offers one of the largest repositories of scientific, technical, and commercial information in the world through its Open Data Portal. Patents are a form of intellectual property granted in exchange for the public disclosure of new and useful inventions. Because patents undergo an intensive vetting process prior to grant, and because the history of U.S. innovation spans over two centuries and 11 million patents, the U.S. patent archives stand as a rare combination of data volume, quality, and diversity.

In this competition, you will train your models on a novel semantic similarity dataset to extract relevant information by matching key phrases in patent documents. Determining the semantic similarity between phrases is critically important during the patent search and examination process to determine if an invention has been described before. For example, if one invention claims "television set" and a prior publication describes "TV set", a model would ideally recognize these are the same and assist a patent attorney or examiner in retrieving relevant documents. This extends beyond paraphrase identification; if one invention claims a "strong material" and another uses "steel", that may also be a match. What counts as a "strong material" varies per domain (it may be steel in one domain and ripstop fabric in another, but you wouldn't want your parachute made of steel). We have included the Cooperative Patent Classification as the technical domain context as an additional feature to help you disambiguate these situations.

Can you build a model to match phrases in order to extract contextual information, thereby helping the patent community connect the dots between millions of patent documents?



In this notebook, I use Deberta-V3-Small to classify key phrases in patents by similarity score. The best accuracy score of this notebook on the competition test dataset is 57%.

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
import keras_core as keras
import keras
import keras_nlp
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
import csv
import requests
from keras.activations import softmax
import torch
torch.cuda.empty_cache()

print("TensorFlow version:", tf.__version__)
print("KerasNLP version:", keras_nlp.__version__)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Using TensorFlow backend
TensorFlow version: 2.16.1
KerasNLP version: 0.14.4
/kaggle/input/huggingface-bert-variants/bert-large-uncased/bert-large-uncased/config.json
/kaggle/input/huggingface-bert-variants/bert-large-uncased/bert-large-uncased/tokenizer.json
/kaggle/input/huggingface-bert-variants/bert-large-uncased/bert-large-uncased/tf_model.h5
/kaggle/input/huggingface-bert-variants/bert-large-uncased/bert-large-uncased/tokenizer_config.json
/kaggle/input/huggingface-bert-variants/bert-large-uncased/bert-large-uncased/pytorch_model.bin
/kaggle/input/huggingface-bert-variants/bert-large-uncased/bert-large-uncased/vocab.txt
/kaggle/input/huggingface-bert-variants/bert-large-uncased/bert-large-uncased/flax_model.msgpack
/kaggle/input/huggingface-bert-variants/bert-large-uncased/bert-large-uncased/whole-word-masking/._bert_config.json
/kaggle/input/huggingface-bert-variants/bert-large-uncased/bert-large-uncased/whole-word-masking/bert_config.json
/kaggle/input/huggingface-bert-variants

In [3]:
df1 = pd.read_csv("/kaggle/input/us-patent-phrase-to-phrase-matching/train.csv")
df2 = pd.read_csv("/kaggle/input/us-patent-phrase-to-phrase-matching/train.csv")
df = df1 #pd.concat([df1, df2], axis=0)
df= df.sample(frac =0.2)
df_train = df.reset_index()
df_train.drop(columns=['index'], inplace=True)
print(df_train.head())
df_test = pd.read_csv("/kaggle/input/us-patent-phrase-to-phrase-matching/test.csv")
df_test.head()

from datasets import load_dataset
#ds = load_dataset('csv', data_files=['/kaggle/input/llm-detect-ai-generated-text/train_essays.csv', '/kaggle/input/generated2/generated.csv', '/kaggle/input/generated1/generated1.csv'])
#print(ds)
print(len(df_train))
print(len(df_test))

df_train['anchor'] = df_train['anchor'].str.lower()
df_train['anchor'] = df_train['anchor'].str.replace("#", "")
#df_train = df_train.sample(frac=0.8)
df_test['anchor'] = df_test['anchor'].str.lower()#replace("#", "" )
df_train['target'] = df_train['target'].str.lower()
df_train['target'] = df_train['target'].str.replace("#", "")
#df_train = df_train.sample(frac=0.8)
df_test['target'] = df_test['target'].str.lower() #replace("#", "" )


                 id                   anchor                   target context  \
0  c39320ae0d02d6da         composite slurry     automotive composite     C04   
1  d37da07e6d7a914d               close gate               closed box     F03   
2  95c59b9bfb137d58   block selection signal  output of the capacitor     B41   
3  8da7c718467da78a  expandable intraluminal      expandable elongate     A61   
4  f1a472b505a640b0           receiver shaft              drive shaft     F41   

   score  
0   0.00  
1   0.00  
2   0.25  
3   0.50  
4   0.50  
7295
36


In [4]:
from sklearn.model_selection import train_test_split
from datasets import Dataset,DatasetDict
df_train['score'] = df_train['score']*4
df_train['score']= df_train['score'].round().astype(int)
#df_train['score']= df_train['score'].astype(str)
df_train['input'] = 'TEXT1: '+ df_train.anchor + ';  TEXT2: '+ df_train.target + ';' # ' ANC: ' + df_train.context
df_train.insert(1, 'input', df_train.pop('input'))
df_train

Unnamed: 0,id,input,anchor,target,context,score
0,c39320ae0d02d6da,TEXT1: composite slurry; TEXT2: automotive co...,composite slurry,automotive composite,C04,0
1,d37da07e6d7a914d,TEXT1: close gate; TEXT2: closed box;,close gate,closed box,F03,0
2,95c59b9bfb137d58,TEXT1: block selection signal; TEXT2: output ...,block selection signal,output of the capacitor,B41,1
3,8da7c718467da78a,TEXT1: expandable intraluminal; TEXT2: expand...,expandable intraluminal,expandable elongate,A61,2
4,f1a472b505a640b0,TEXT1: receiver shaft; TEXT2: drive shaft;,receiver shaft,drive shaft,F41,2
...,...,...,...,...,...,...
7290,fccaa89d213e3d6c,TEXT1: multiplexed data; TEXT2: multiplex dig...,multiplexed data,multiplex digital data,H04,3
7291,0117ce8d8f8abe79,TEXT1: compression loss; TEXT2: discharge loss;,compression loss,discharge loss,F04,2
7292,dfe314b5d6848738,TEXT1: moisture proof film; TEXT2: film video;,moisture proof film,film video,H05,0
7293,2b558317abd901d4,TEXT1: metal phase; TEXT2: novel phase;,metal phase,novel phase,B22,1


In [5]:
df_train1 = df_train.drop(['id', 'target', 'anchor'], axis=1)#, 'language'  , 'context' , 'target', 'anchor'
ds = Dataset.from_pandas(df_train1)
print(ds)

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
#from transformers import DebertaV3Model
from transformers import AutoModelForSequenceClassification,AutoTokenizer
from transformers import TextClassificationPipeline, AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments,Trainer
model_nm = '/kaggle/input/huggingface-bert-variants/distilbert-base-uncased/distilbert-base-uncased'#'/kaggle/input/huggingface-bert-variants/distilbert-base-uncased-distilled-squad/distilbert-base-uncased-distilled-squad'
#'/kaggle/input/huggingface-bert-variants/bert-base-uncased/bert-base-uncased'
#'/kaggle/input/huggingface-bert-variants/bert-large-uncased/bert-large-uncased'
# #'/kaggle/input/debertav3small' #'/kaggle/input/huggingface-deberta-variants/deberta-base-mnli/deberta-base-mnli'  #'/kaggle/input/debertav3small'   
tokz = AutoTokenizer.from_pretrained(model_nm)
def tok_func(x): return tokz(x["input"])
tok_ds = ds.map(tok_func, batched=True)
tok_ds = tok_ds.rename_columns({'score':'labels'})
dds = tok_ds.train_test_split(0.15, seed=420)
print(dds)
df_test['input'] = 'TEXT1: ' + df_test.anchor + '; TEXT2: ' +df_test.target + ';' #' ANC: ' + df_test.context
df_test.insert(1, 'input', df_test.pop('input'))
df_test1 = df_test.drop(['id', 'target', 'anchor'], axis=1)#, 'context' , 'target', 'anchor'
eval_ds = Dataset.from_pandas(df_test1).map(tok_func, batched=True)
bs = 1
epochs = 2
lr = 4.15e-6

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=5)

trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz)#, compute_metrics=compute_metrics
trainer.train();

Dataset({
    features: ['input', 'context', 'score'],
    num_rows: 7295
})




Map:   0%|          | 0/7295 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input', 'context', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 6200
    })
    test: Dataset({
        features: ['input', 'context', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 1095
    })
})


Map:   0%|          | 0/36 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at /kaggle/input/huggingface-bert-variants/distilbert-base-uncased/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss
1,1.5629,1.716851
2,1.5724,1.717243


In [6]:
preds = trainer.predict(eval_ds).predictions.astype(float)
print(preds)
preds = np.clip(preds, 0, 1)

# Make predictions
#predictions = classifier.predict(X_test)

# Evaluate the model (optional)
#classifier.evaluate(X_test)
submission = df_test.id.copy().to_frame()
submission["score"] = np.argmax(preds, axis=1) * 0.25  #classifier.predict(X_test)
#submission = df_test.id.copy().to_frame()
#submission["prediction"] = np.argmax(predictions, axis=1)

#submission["generated"] = submission["generated"].round(1)
submission.to_csv("/kaggle/working/submission.csv", index=False)

[[-1.04101562 -0.14489746  3.68164062 -0.73730469 -2.609375  ]
 [-1.63671875 -1.12011719  1.21191406 -0.04940796  0.18212891]
 [ 1.0546875  -0.92871094  1.93261719 -0.31152344 -2.37109375]
 [-0.79248047  4.87890625 -0.30053711 -1.85449219 -3.08398438]
 [ 1.29882812 -0.33544922  1.78027344 -0.52001953 -2.85742188]
 [-1.41210938 -1.33300781  3.0703125  -0.2722168  -0.93896484]
 [-0.97363281 -0.09460449  3.60351562 -0.73779297 -2.66601562]
 [ 3.38085938 -0.16271973 -0.61376953 -0.671875   -2.13867188]
 [ 2.23046875  0.34716797  0.60888672 -0.59716797 -2.9140625 ]
 [-1.72558594 -1.12207031  1.48730469 -0.13476562  0.11553955]
 [-0.04074097  4.21875     0.13342285 -1.74707031 -3.47460938]
 [ 2.44726562  0.76611328  0.07653809 -0.77099609 -3.0625    ]
 [ 1.88964844 -0.47998047  1.24316406 -0.34838867 -2.6875    ]
 [-1.86816406 -1.21289062  1.82519531 -0.18115234 -0.06304932]
 [-1.44140625 -0.99169922  0.73974609 -0.00794983  0.29858398]
 [-0.33081055  4.7578125  -0.54199219 -1.93359375 -3.22

In [7]:
sub = pd.read_csv('submission.csv')
sub

Unnamed: 0,id,score
0,4112d61851461f60,0.5
1,09e418c93a776564,0.5
2,36baf228038e314b,0.0
3,1f37ead645e7f0c8,0.25
4,71a5b6ad068d531f,0.0
5,474c874d0c07bd21,0.5
6,442c114ed5c4e3c9,0.5
7,b8ae62ea5e1d8bdb,0.0
8,faaddaf8fcba8a3f,0.0
9,ae0262c02566d2ce,0.5


In [8]:
import os
def remove_folder_contents(folder):
    for the_file in os.listdir(folder):
        file_path = os.path.join(folder, the_file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
            elif os.path.isdir(file_path):
                remove_folder_contents(file_path)
                os.rmdir(file_path)
        except Exception as e:
            print(e)

folder_path = '/kaggle/working'
#remove_folder_contents(folder_path)
#os.rmdir(folder_path)