## MRPC: Microsoft Research Paraphrase Corpus

The Microsoft Research Paraphrase Corpus (MRPC) is a sentence-pair classification task. It consists of sentence pairs collected from news sources, each labeled to indicate whether the two sentences are semantically equivalent.

See [website](https://www.microsoft.com/en-us/download/details.aspx?id=52398) and [paper](https://www.aclweb.org/anthology/I/I05/I05-5002.pdf) for more info.

In [1]:
import numpy as np
import pandas as pd
import os
import sys
import csv
from sklearn import metrics
from sklearn.metrics import classification_report

sys.path.append("../") 
from bert_sklearn import BertClassifier

DATADIR = os.getcwd() + '/glue_data'

The original GLUE link for downloading the data no longer works, so we use the suggested alternative link:

In [2]:
%%bash
git clone https://github.com/wasiahmad/paraphrase_identification.git
python download_glue_data.py --data_dir glue_data --tasks MRPC --path_to_mrpc=paraphrase_identification/dataset/msr-paraphrase-corpus

Processing MRPC...
	Completed!


Cloning into 'paraphrase_identification'...
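
Before loading the data, it can help to confirm that the download step produced the files the loading cell reads. The following is a minimal sketch, assuming the script writes `train.tsv` and `dev.tsv` under `glue_data/MRPC` (the paths the loading cell uses); `missing_mrpc_files` is a hypothetical helper, not part of `download_glue_data.py`.

```python
import os

def missing_mrpc_files(data_dir, names=("train.tsv", "dev.tsv")):
    """Return the expected MRPC files that are absent from data_dir/MRPC."""
    mrpc_dir = os.path.join(data_dir, "MRPC")
    return [n for n in names if not os.path.isfile(os.path.join(mrpc_dir, n))]

# Assumption: the data lives in ./glue_data, as in the notebook's DATADIR.
missing = missing_mrpc_files(os.path.join(os.getcwd(), "glue_data"))
if missing:
    print("Missing MRPC files:", ", ".join(missing))
else:
    print("All expected MRPC files are present.")
```

If anything is reported missing, rerunning the download cell is likely the first thing to try.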


In [5]:
"""
MRPC train data size: 3668 
MRPC dev data size: 408 
"""
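def read_tsv(filename, quotechar=None):
    """Read a tab-separated file into a list of rows (lists of strings)."""
    with open(filename, "r", encoding='utf-8') as f:
        return list(csv.reader(f, delimiter="\t", quotechar=quotechar))

def get_mrpc_df(filename, cols=(3, 4, 0)):
    """Load an MRPC tsv file as a DataFrame with text_a, text_b and label."""
    rows = read_tsv(filename)
    df = pd.DataFrame(rows[1:])      # skip the header row
    df = df[list(cols)]              # keep the two sentences and the label
    df.columns = ['text_a', 'text_b', 'label']
    df = df[pd.notnull(df['label'])] # drop rows without a label
    return df

def get_mrpc_data(train_file=os.path.join(DATADIR, 'MRPC', 'train.tsv'),
                  dev_file=os.path.join(DATADIR, 'MRPC', 'dev.tsv')):
    """Return the train and dev DataFrames plus the sorted unique labels."""
    train = get_mrpc_df(train_file)
    print("MRPC train data size: %d " % (len(train)))
    dev = get_mrpc_df(dev_file)
    print("MRPC dev data size: %d " % (len(dev)))

    label_list = np.unique(train['label'].values)
    return train, dev, label_list

train, dev, label_list = get_mrpc_data()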

MRPC train data size: 3668 
MRPC dev data size: 408 
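
The loader above passes `quotechar=None`, which matters because MRPC sentences contain literal, often unbalanced, double quotes. The sketch below illustrates the same idea on an in-memory sample using `csv.QUOTE_NONE`. The sample rows and the `parse_pairs` helper are hypothetical, and the header layout is an assumption based on the column indices `[3, 4, 0]` used by the loader.

```python
import csv
import io

# Hypothetical two-row sample in the layout the loader expects:
# column 0 holds the label, columns 3 and 4 hold the two sentences.
SAMPLE_TSV = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    '1\t101\t102\t"He said no.\tHe declined.\n'
    "0\t201\t202\tShares rose 2%.\tShares fell 2%.\n"
)

def parse_pairs(text):
    """Return (text_a, text_b, label) tuples, treating quotes as plain text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    return [(row[3], row[4], row[0]) for row in rows[1:]]  # skip the header

for text_a, text_b, label in parse_pairs(SAMPLE_TSV):
    print(label, "|", text_a, "|", text_b)
```

With quote handling disabled, the stray opening quote in the first sample row stays part of the sentence instead of swallowing the following tab and newline.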


In [3]:
print(label_list)

['0' '1']
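
Note that the labels are strings rather than integers, because the tsv rows are read as text. A small sketch of mapping them to readable names and counting examples per class follows. The mapping assumes `'0'` means "not equivalent" and `'1'` means "equivalent", matching the `target_names` order used in the evaluation cell; `label_counts` and the toy labels are hypothetical.

```python
from collections import Counter

# Assumption: '0' = not equivalent, '1' = equivalent.
LABEL_NAMES = {"0": "not equivalent", "1": "equivalent"}

def label_counts(labels):
    """Count examples per class, keyed by a readable class name."""
    return Counter(LABEL_NAMES[str(label)] for label in labels)

# Toy labels for illustration; with the notebook's data, pass train['label'].
print(label_counts(["1", "0", "1", "1"]))
```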


In [6]:
train.head()

| | text_a | text_b | label |
|---|---|---|---|
| 0 | Amrozi accused his brother, whom he called "th... | Referring to him as only "the witness", Amrozi... | 1 |
| 1 | Yucaipa owned Dominick's before selling the ch... | Yucaipa bought Dominick's in 1995 for $693 mil... | 0 |
| 2 | They had published an advertisement on the Int... | On June 10, the ship's owners had published an... | 1 |
| 3 | Around 0335 GMT, Tab shares were up 19 cents, ... | Tab shares jumped 20 cents, or 4.6%, to set a ... | 0 |
| 4 | The stock rose $2.11, or about 11 percent, to ... | PG&E Corp. shares jumped $1.63 or 8 percent to... | 1 |


In [3]:
%%time
X_train = train[['text_a','text_b']] # text pair data
y_train = train['label']            # labels

# define model
model = BertClassifier()
model.epochs = 4
model.learning_rate = 2e-05
model.max_seq_length = 128
model.gradient_accumulation_steps = 2
model.validation_fraction = 0

# fit model
model.fit(X_train, y_train)

# test model on dev
test = dev
X_test = test[['text_a','text_b']]
y_test = test['label']

# make predictions
y_pred = model.predict(X_test)
print("Accuracy: %0.2f%%\n"%(metrics.accuracy_score(y_test, y_pred) * 100))

target_names = ['not equivalent', 'equivalent']
print(classification_report(y_test, y_pred, target_names=target_names))

Building sklearn classifier...
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 3668, validation data size: 0


Training: 100%|██████████| 230/230 [01:30<00:00,  2.98it/s, loss=0.642]
Training: 100%|██████████| 230/230 [01:31<00:00,  2.94it/s, loss=0.314]
Training: 100%|██████████| 230/230 [01:34<00:00,  2.70it/s, loss=0.168]
Training: 100%|██████████| 230/230 [01:37<00:00,  2.83it/s, loss=0.125]
                                                           

Accuracy: 86.76%

                precision    recall  f1-score   support

not equivalent       0.87      0.68      0.77       129
    equivalent       0.87      0.95      0.91       279

     micro avg       0.87      0.87      0.87       408
     macro avg       0.87      0.82      0.84       408
  weighted avg       0.87      0.87      0.86       408

CPU times: user 3min 59s, sys: 2min 26s, total: 6min 25s
Wall time: 6min 26s
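
The dev set is imbalanced (279 "equivalent" pairs versus 129 "not equivalent"), so the accuracy is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent class, using only the support counts from the classification report above; `majority_baseline` is a hypothetical helper.

```python
# Support counts copied from the classification report above.
support = {"not equivalent": 129, "equivalent": 279}

def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

baseline = majority_baseline(support)
print("Majority-class baseline: %0.2f%%" % (baseline * 100))  # 68.38%
```

Always predicting "equivalent" would score about 68.4% on this dev set, so the fine-tuned model's 86.76% is roughly 18 points above that baseline.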


