# Data Description
In this dataset, you are presented with pairs of phrases (an anchor and a target phrase) and asked to rate how similar they are on a scale from 0 (not at all similar) to 1 (identical in meaning). This challenge differs from a standard semantic similarity task in that similarity has been scored here within a patent's context, specifically its CPC classification (version 2021.05), which indicates the subject to which the patent relates. For example, while the phrases "bird" and "Cape Cod" may have low semantic similarity in normal language, their meanings are much closer when considered in the context of "house".

This is a code competition, in which you will submit code that will be run against an unseen test set. The unseen test set contains approximately 12k pairs of phrases. A small public test set has been provided for testing purposes, but is not used in scoring.

Information on the meaning of CPC codes may be found on the USPTO website. The CPC version 2021.05 can be found on the CPC archive website.
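As a rough illustration of how a CPC code can be read, the sketch below splits a short code into its section letter and two-digit class. It assumes that `context` values have the form "letter plus two digits" (for example `A47`), which this section does not show, so the example codes and the helper name `parse_context` are hypothetical. The section titles are abbreviated from the CPC scheme.

```python
import re

# Abbreviated CPC section titles, keyed by section letter.
CPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
    "Y": "General Tagging of New Technological Developments",
}


def parse_context(code: str) -> dict:
    """Split a code such as 'A47' into section and class.

    Assumption: the code is one section letter followed by a two-digit class.
    """
    match = re.fullmatch(r"([A-HY])(\d{2})", code.strip().upper())
    if match is None:
        raise ValueError(f"unexpected CPC code format: {code!r}")
    section, cpc_class = match.groups()
    return {
        "section": section,
        "class": cpc_class,
        "section_title": CPC_SECTIONS[section],
    }


# Illustrative codes only; they are not taken from the dataset preview.
for example in ["A47", "h04"]:
    print(example, "->", parse_context(example))
```

A lookup like this is only useful for inspecting the data. Turning the code into model input is a separate design decision.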

## Score meanings
The scores are in the 0-1 range with increments of 0.25 with the following meanings:

- 1.0 - Very close match. This is typically an exact match except possibly for differences in conjugation, quantity (e.g. singular vs. plural), and addition or removal of stopwords (e.g. “the”, “and”, “or”).
- 0.75 - Close synonym, e.g. “mobile phone” vs. “cellphone”. This also includes abbreviations, e.g. "TCP" -> "transmission control protocol".
- 0.5 - Synonyms which don’t have the same meaning (same function, same properties). This includes broad-narrow (hyponym) and narrow-broad (hypernym) matches.
- 0.25 - Somewhat related, e.g. the two phrases are in the same high level domain but are not synonyms. This also includes antonyms.
- 0.0 - Unrelated.
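Because the labels only take five values, it can be handy to map a continuous model output back onto that grid when inspecting predictions. The following is a minimal sketch. The names `SCORE_MEANINGS` and `snap_to_grid` are made up for illustration, and whether snapping helps depends on the evaluation metric, which this section does not state.

```python
# Short labels paraphrasing the score meanings listed above.
SCORE_MEANINGS = {
    1.0: "very close match",
    0.75: "close synonym",
    0.5: "synonyms which don't have the same meaning",
    0.25: "somewhat related",
    0.0: "unrelated",
}


def snap_to_grid(value: float) -> float:
    """Clip a raw prediction to [0, 1] and round to the nearest 0.25 step.

    Note: Python's round() sends exact ties (e.g. 0.625) to the even step.
    """
    clipped = min(max(value, 0.0), 1.0)
    return round(clipped * 4) / 4


for raw in [0.8, 1.3, -0.2, 0.1]:
    snapped = snap_to_grid(raw)
    print(f"{raw:+.2f} -> {snapped:.2f} ({SCORE_MEANINGS[snapped]})")
```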

## Files
- train.csv - the training set, containing phrases, contexts, and their similarity scores
- test.csv - the test set, identical in structure to the training set but without the score
- sample_submission.csv - a sample submission file in the correct format

## Columns
- id - a unique identifier for a pair of phrases
- anchor - the first phrase
- target - the second phrase
- context - the CPC classification (version 2021.05), which indicates the subject within which the similarity is to be scored
- score - the similarity. This is sourced from a combination of one or more manual expert ratings.

> "Google Patent Phrase Similarity Dataset" by Google is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0)

In [1]:
%matplotlib inline

import numpy as np
import warnings
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')
np.random.seed(2022)

In [2]:
import pandas as pd

In [3]:
sts_train_data = pd.read_csv('./data/train.csv', index_col='id')
sts_test_data = pd.read_csv('./data/test.csv', index_col='id')

In [4]:
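# Drop the CPC context so that only the phrase pair and its score remain.
# Note: after this step the model no longer sees the context.
sts_train_data.drop(columns=['context'], inplace=True)
display(sts_train_data)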

id,anchor,target,score
37d61fd2272659b1,abatement,abatement of pollution,0.50
7b9652b17b68b7a4,abatement,act of abating,0.75
36d72442aefd8232,abatement,active catalyst,0.25
5296b0c19e1ce60e,abatement,eliminating process,0.50
54c1e3b9184cb5b6,abatement,forest region,0.00
...,...,...,...
8e1386cbefd7f245,wood article,wooden article,1.00
42d9e032d1cd3242,wood article,wooden box,0.50
208654ccb9e14fa3,wood article,wooden handle,0.50
756ec035e694722b,wood article,wooden material,0.75


In [5]:
# display(sts_train_data.sample(2000))

In [6]:
sts_test_data.drop(columns=['context'], inplace=True)
display(sts_test_data.head())

id,anchor,target
4112d61851461f60,opc drum,inorganic photoconductor drum
09e418c93a776564,adjust gas flow,altering gas flow
36baf228038e314b,lower trunnion,lower locating
1f37ead645e7f0c8,cap component,upper portion
71a5b6ad068d531f,neural stimulation,artificial neural network


In [7]:
sts_train_data.rename(columns={'anchor': 'sentence1', 'target':'sentence2'}, inplace=True)
sts_test_data.rename(columns={'anchor': 'sentence1', 'target':'sentence2'}, inplace=True)

In this data, the `score` column holds the human-annotated similarity for each pair of phrases (now named `sentence1` and `sentence2`). This is the value we want to predict.
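Before training, it can help to confirm that the labels really sit on the five-step grid described earlier. The sketch below does this on a handful of rows copied from the training preview. `ALLOWED_SCORES` and `scores_on_grid` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Five rows copied from the training preview, using the renamed columns.
sample = pd.DataFrame(
    {
        "sentence1": ["abatement"] * 5,
        "sentence2": [
            "abatement of pollution",
            "act of abating",
            "active catalyst",
            "eliminating process",
            "forest region",
        ],
        "score": [0.50, 0.75, 0.25, 0.50, 0.00],
    },
    index=pd.Index(
        [
            "37d61fd2272659b1",
            "7b9652b17b68b7a4",
            "36d72442aefd8232",
            "5296b0c19e1ce60e",
            "54c1e3b9184cb5b6",
        ],
        name="id",
    ),
)

ALLOWED_SCORES = [0.0, 0.25, 0.5, 0.75, 1.0]


def scores_on_grid(scores: pd.Series) -> bool:
    """Return True if every score is one of the five allowed values."""
    return bool(scores.isin(ALLOWED_SCORES).all())


print("all scores on grid:", scores_on_grid(sample["score"]))
# Count how often each score level occurs, ordered by score.
counts = sample["score"].value_counts().sort_index()
print(counts)
```

On the full training frame the same two calls would be applied to `sts_train_data['score']`.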

In [8]:
print('Min score=', min(sts_train_data['score']), ', Max score=', max(sts_train_data['score']))


Min score= 0.0 , Max score= 1.0


In [9]:
from autogluon.text import TextPredictor
predictor_sts = TextPredictor.load(path='./ag_sts')
# predictor_sts.fit(sts_train_data, time_limit=60*60)

Load pretrained checkpoint: ./ag_sts/model.ckpt


In [12]:
predictor_sts

<autogluon.text.automm.predictor.AutoMMPredictor at 0x7ff130ffe100>

In [13]:
train_results = predictor_sts.predict(sts_train_data.head(10))

Predicting: 100%|██████████| 1/1 [00:09<00:00,  9.18s/it]


In [14]:
train_results

id
37d61fd2272659b1    2
7b9652b17b68b7a4    2
36d72442aefd8232    1
5296b0c19e1ce60e    2
54c1e3b9184cb5b6    1
067203128142739c    1
061d17f04be2d1cf    1
e1f44e48399a2027    1
0a425937a3e86d10    2
ef2d4c2e6bbb208d    1
Name: score, dtype: int64

In [23]:
train_score = predictor_sts.evaluate(sts_train_data, metrics=['rmse', 'pearsonr', 'spearmanr'])

Predicting: 100%|██████████| 570/570 [13:36<00:00,  1.43s/it]



In [24]:
print('RMSE = {:.2f}'.format(train_score['rmse']))
print('PEARSONR = {:.4f}'.format(train_score['pearsonr']))
print('SPEARMANR = {:.4f}'.format(train_score['spearmanr']))

RMSE = 0.75
PEARSONR = 0.7136
SPEARMANR = 0.6854


In [25]:
train_score

{'rmse': 0.7519338706484906,
 'pearsonr': 0.7135913565211445,
 'spearmanr': 0.6853648284504438}
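An RMSE of about 0.75 is large for labels that lie between 0 and 1. One possible explanation, which the notebook does not confirm, is a scale mismatch: the predictions shown earlier are integers such as 1 and 2, and the submission step divides by 4, while `score` in `sts_train_data` stays in the 0-1 range. The sketch below uses the first five training rows and their predicted labels from the outputs above to show how such a mismatch would inflate RMSE while leaving Pearson correlation unchanged. The helper names are illustrative.

```python
import numpy as np

# First five training scores and the predicted labels shown in the outputs above.
scores = np.array([0.50, 0.75, 0.25, 0.50, 0.00])
labels = np.array([2, 2, 1, 2, 1])


def rmse(pred: np.ndarray, true: np.ndarray) -> float:
    """Root mean squared error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - true) ** 2)))


def pearson(pred: np.ndarray, true: np.ndarray) -> float:
    """Pearson correlation coefficient between predictions and targets."""
    return float(np.corrcoef(pred, true)[0, 1])


# Assumption: labels are on a 0-4 scale, so dividing by 4 restores 0-1.
rmse_raw = rmse(labels, scores)
rmse_scaled = rmse(labels / 4, scores)

print(f"RMSE, raw labels vs. 0-1 scores: {rmse_raw:.4f}")
print(f"RMSE, labels / 4 vs. 0-1 scores: {rmse_scaled:.4f}")
print(f"Pearson, raw:    {pearson(labels, scores):.4f}")
print(f"Pearson, scaled: {pearson(labels / 4, scores):.4f}")
```

If this is what happened, the reported RMSE is not comparable to the 0-1 score scale, whereas the Pearson and Spearman values are unaffected by the rescaling.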

In [15]:
sts_test_data.head()

id,sentence1,sentence2
4112d61851461f60,opc drum,inorganic photoconductor drum
09e418c93a776564,adjust gas flow,altering gas flow
36baf228038e314b,lower trunnion,lower locating
1f37ead645e7f0c8,cap component,upper portion
71a5b6ad068d531f,neural stimulation,artificial neural network


In [16]:
score1 = predictor_sts.predict({'sentence1': ['opc drum'],
                                'sentence2': ['inorganic photoconductor drum']}, as_pandas=False)
print(score1)

Predicting: 100%|██████████| 1/1 [00:08<00:00,  8.98s/it]
[2]


In [18]:
results = predictor_sts.predict(sts_test_data) 

Predicting: 100%|██████████| 1/1 [00:09<00:00,  9.76s/it]


In [21]:
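# Assumption: the predictor returns integer labels 0-4 (score * 4),
# so dividing by 4 maps them back onto the 0-1 score scale.
submission = pd.DataFrame({'score': results / 4})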
submission.to_csv('submission.csv', index=True)

In [22]:
submission

id,score
4112d61851461f60,0.5
09e418c93a776564,0.75
36baf228038e314b,0.5
1f37ead645e7f0c8,0.25
71a5b6ad068d531f,0.5
474c874d0c07bd21,0.5
442c114ed5c4e3c9,0.5
b8ae62ea5e1d8bdb,0.0
faaddaf8fcba8a3f,0.5
ae0262c02566d2ce,1.0
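A quick sanity check before uploading can catch an out-of-range or missing score. The sketch below assumes the expected format is an `id` index plus a single `score` column with values in the 0-1 range, as the output above suggests. `check_submission` is a hypothetical helper, and `preview` holds the first five rows shown above.

```python
import pandas as pd

# First five rows of the submission output shown above.
preview = pd.DataFrame(
    {"score": [0.5, 0.75, 0.5, 0.25, 0.5]},
    index=pd.Index(
        [
            "4112d61851461f60",
            "09e418c93a776564",
            "36baf228038e314b",
            "1f37ead645e7f0c8",
            "71a5b6ad068d531f",
        ],
        name="id",
    ),
)


def check_submission(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if frame.index.name != "id":
        problems.append("index is not named 'id'")
    if list(frame.columns) != ["score"]:
        problems.append("expected exactly one column named 'score'")
        return problems  # the remaining checks need the score column
    if frame.index.has_duplicates:
        problems.append("duplicate ids")
    if frame["score"].isna().any():
        problems.append("missing scores")
    if not frame["score"].dropna().between(0.0, 1.0).all():
        problems.append("scores outside the 0-1 range")
    return problems


print(check_submission(preview) or "no problems found")
```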
