# BERT-based fine-tuned model for paraphraser evaluation

Authors: Fatma Ben Ayed

Copyright (C) 2021 Fatma Ben Ayed and DynaGroup i.T. GmbH

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install transformers > /dev/null

In [None]:
import pandas as pd
df = pd.read_csv('/content/drive/My Drive/Paraphrasing API/datasets/MSRP/msrp-train.csv')
df = df.drop(0)
df.head()

Unnamed: 0,Quality,s1,s2
1,1,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi..."
2,0,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...
3,1,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an..."
4,0,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ..."
5,1,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...


In [None]:
df_test = pd.read_csv('/content/drive/My Drive/Paraphrasing API/datasets/MSRP/msrp-test.csv')
df_test = df_test.drop(0)
df_test.head()

Unnamed: 0,Quality,s1,s2
1,1,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi..."
2,0,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...
3,1,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an..."
4,0,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ..."
5,1,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...


In [None]:
df["Quality"] = df['Quality'].astype('int')

In [None]:
# Sample sentence pairs labeled as paraphrases (Quality == 1)
for row in df[df['Quality'] == 1].head(5).itertuples():
    # row[0] is the index, row[1] is Quality, row[2] is s1 and row[3] is s2
    print(f"1. {row[2]}\n2. {row[3]}")
    print("=" * 80)

1. Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.
2. Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.
1. They had published an advertisement on the Internet on June 10, offering the cargo for sale, he added.
2. On June 10, the ship's owners had published an advertisement on the Internet, offering the explosives for sale.
1. The stock rose $2.11, or about 11 percent, to close Friday at $21.51 on the New York Stock Exchange.
2. PG&E Corp. shares jumped $1.63 or 8 percent to $21.03 on the New York Stock Exchange on Friday.
1. Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier.
2. With the scandal hanging over Stewart's company, revenue the first quarter of the year dropped 15 percent from the same period a year earlier.
1. The DVD-CCA then appealed to the state Supreme Court.
2. The DVD CCA appealed that decision to the U.S. Supreme Co

In [None]:
# Sample sentence pairs labeled as non-paraphrases (Quality == 0): the meaning is not the same
for row in df[df['Quality'] == 0].head(5).itertuples():
    # row[0] is the index, row[1] is Quality, row[2] is s1 and row[3] is s2
    print(f"1. {row[2]}\n2. {row[3]}")
    print("=" * 80)

1. Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.
2. Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.
1. Around 0335 GMT, Tab shares were up 19 cents, or 4.4%, at A$4.56, having earlier set a record high of A$4.57.
2. Tab shares jumped 20 cents, or 4.6%, to set a record closing high at A$4.57.
1. The Nasdaq had a weekly gain of 17.27, or 1.2 percent, closing at 1,520.15 on Friday.
2. The tech-laced Nasdaq Composite .IXIC rallied 30.46 points, or 2.04 percent, to 1,520.15.
1. That compared with $35.18 million, or 24 cents per share, in the year-ago period.
2. Earnings were affected by a non-recurring $8 million tax benefit in the year-ago period.
1. Shares of Genentech, a much larger company with several products on the market, rose more than 2 percent.
2. Shares of Xoma fell 16 percent in early trade, while shares of Genentech, a much larger company with several products on the market, were up 2 

# Determine whether two sequences are paraphrases of each other with the Hugging Face bert-base-cased-finetuned-mrpc model

* Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded with the weights stored in the checkpoint.

* Build a sequence from the two sentences, with the correct model-specific separators, token type ids and attention masks.

* Pass this sequence through the model so that it is classified into one of the two available classes: 0 (not a paraphrase) and 1 (is a paraphrase).

* Compute the softmax of the result to get probabilities over the classes.

* See the results below.
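
The input layout described in the steps above can be sketched without any library. This is only an illustration: it assumes whitespace splitting in place of BERT's WordPiece tokenization, and the helper name `build_pair_input` is hypothetical. In the notebook itself the tokenizer builds these fields.

```python
def build_pair_input(sentence_a, sentence_b):
    """Lay out a sentence pair in BERT's pair format: [CLS] A [SEP] B [SEP].

    Assumption: whitespace splitting stands in for WordPiece tokenization.
    """
    tokens_a = sentence_a.split()
    tokens_b = sentence_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 covers [CLS], sentence A and its [SEP];
    # 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # A single unpadded example attends to every position.
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


example = build_pair_input("The stock rose", "Shares jumped")
print(example["tokens"])
print(example["token_type_ids"])
print(example["attention_mask"])
```

The token type ids are what let the model tell the first sentence from the second, while the attention mask only matters once several examples are padded to a common length.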



In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

1. Test three different sentences (neither pair is a paraphrase, so both results should favor "not paraphrase")

In [None]:
sequence_0 = df.iloc[1,1]
sequence_1 = df.iloc[3,2]
sequence_2 = df.iloc[2,1]

In [None]:
df.iloc[1,1]

"Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion."

In [None]:
df.iloc[2,1]

'They had published an advertisement on the Internet on June 10, offering the cargo for sale, he added.'

In [None]:
df.iloc[3,1]

'Around 0335 GMT, Tab shares were up 19 cents, or 4.4%, at A$4.56, having earlier set a record high of A$4.57.'

In [None]:
# Encode each sentence pair; the tokenizer adds the separators, token type ids and attention mask
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

# Inference only, so gradient tracking is disabled; element [0] of the output holds the logits
with torch.no_grad():
    paraphrase_classification_logits = model(**paraphrase)[0]
    not_paraphrase_classification_logits = model(**not_paraphrase)[0]

# Softmax over the class dimension turns the logits into probabilities
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

print("Should be paraphrase")
for i in range(len(classes)):
    print(f"{classes[i]}: {round(paraphrase_results[i] * 100)}%")

print("\nShould not be paraphrase")
for i in range(len(classes)):
    print(f"{classes[i]}: {round(not_paraphrase_results[i] * 100)}%")

Should be paraphrase
not paraphrase: 95%
is paraphrase: 5%

Should not be paraphrase
not paraphrase: 95%
is paraphrase: 5%
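
The percentages above are a softmax over the two logits returned by the model. As a minimal sketch of that step in plain Python, the example below uses hypothetical logits chosen only for illustration; they are not the values the model produced.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum before exponentiating avoids overflow.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["not paraphrase", "is paraphrase"]
hypothetical_logits = [1.5, -1.5]  # assumed values, not taken from the model

for name, probability in zip(classes, softmax(hypothetical_logits)):
    print(f"{name}: {round(probability * 100)}%")
```

Only the difference between the two logits matters: a gap of 3 already yields roughly a 95% to 5% split.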


2. Test similar sentences: a sentence paired with itself, and a pair labeled as a non-paraphrase in the dataset

In [None]:
sequence_0 = df.iloc[1,1]
sequence_1 = df.iloc[1,2]
sequence_2 = df.iloc[1,1]

In [None]:
df.iloc[1,1]

"Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion."

In [None]:
df.iloc[1,2]

"Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998."

In [None]:
# Encode each sentence pair; the tokenizer adds the separators, token type ids and attention mask
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

# Inference only, so gradient tracking is disabled; element [0] of the output holds the logits
with torch.no_grad():
    paraphrase_classification_logits = model(**paraphrase)[0]
    not_paraphrase_classification_logits = model(**not_paraphrase)[0]

# Softmax over the class dimension turns the logits into probabilities
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

print("Should be paraphrase")
for i in range(len(classes)):
    print(f"{classes[i]}: {round(paraphrase_results[i] * 100)}%")

print("\nShould not be paraphrase")
for i in range(len(classes)):
    print(f"{classes[i]}: {round(not_paraphrase_results[i] * 100)}%")

Should be paraphrase
not paraphrase: 6%
is paraphrase: 94%

Should not be paraphrase
not paraphrase: 64%
is paraphrase: 36%
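
To turn these probabilities into a decision, one common choice is to pick the class with the highest probability. The sketch below applies that rule to the rounded percentages printed above; the helper `predict_label` is hypothetical and not part of the notebook.

```python
classes = ["not paraphrase", "is paraphrase"]


def predict_label(probabilities):
    """Return the index of the most probable class (argmax)."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


# Rounded probabilities copied from the output above.
identical_pair = [0.06, 0.94]
labeled_non_paraphrase_pair = [0.64, 0.36]

for probabilities in (identical_pair, labeled_non_paraphrase_pair):
    print(classes[predict_label(probabilities)])
```

For the second pair the predicted class agrees with the dataset, where that row has `Quality` 0, although the 64% to 36% margin is much narrower than for the unrelated sentences in the first test.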
