# QQP: Quora Question Pairs

The Quora Question Pairs (QQP) task is a sentence-pair classification task. The dataset consists of question pairs from the Quora website, each labeled as duplicate or not duplicate.

See the [original release post](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) for more information.
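To make "sentence-pair classification" concrete, here is a minimal sketch of how a question pair is usually packed into a single BERT input: `[CLS] text_a [SEP] text_b [SEP]`, with segment ids marking which question each token belongs to. The helper `encode_pair` is hypothetical and is not part of `bert_sklearn`; whitespace splitting stands in for WordPiece tokenization so that the sketch has no dependencies.

```python
def encode_pair(text_a, text_b, max_seq_length=128):
    """Build BERT-style tokens and segment ids for a question pair.

    Hypothetical helper for illustration only. Whitespace splitting is an
    assumption that stands in for WordPiece tokenization.
    """
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()

    # Reserve three positions for [CLS], [SEP], [SEP], then trim the longer
    # side one token at a time until the pair fits.
    while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], text_a and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = encode_pair("What can one do after MBBS?",
                                  "What do i do after my MBBS ?")
print(tokens)
print(segment_ids)
```

The classifier then predicts a single label for the packed sequence, which is why both questions are passed together as `text_a` and `text_b` in the cells below.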


In [1]:
import numpy as np
import pandas as pd
import os
import sys
import csv
from sklearn import metrics
from sklearn.metrics import classification_report

sys.path.append("../") 
from bert_sklearn import BertClassifier

DATADIR = os.getcwd() + '/glue_data'

In [2]:
%%bash
python3 download_glue_data.py --data_dir glue_data --tasks QQP 

Downloading and extracting QQP...
	Completed!


In [4]:
"""
QQP train data size: 363849 
QQP dev data size: 40430 
"""

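def read_tsv(filename, quotechar=None):
    """Read a tab-separated file into a list of rows."""
    with open(filename, "r", encoding='utf-8') as f:
        return list(csv.reader(f, delimiter="\t", quotechar=quotechar))

def get_quora_df(filename):
    """Load a QQP split as a DataFrame with text_a, text_b, and label columns."""
    rows = read_tsv(filename)
    # the first row holds the column names
    df = pd.DataFrame(rows[1:], columns=rows[0])
    df = df[['question1', 'question2', 'is_duplicate']]
    # drop rows with a missing label
    df = df[pd.notnull(df['is_duplicate'])]
    df.columns = ['text_a', 'text_b', 'label']
    return df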

def get_quora_data(train_file = DATADIR+'/QQP/train.tsv', 
                   dev_file =  DATADIR+'/QQP/dev.tsv'):
    
    train = get_quora_df(train_file)
    print("QQP train data size: %d "%(len(train)))
    dev = get_quora_df(dev_file)
    print("QQP dev data size: %d "%(len(dev)))

    label_list = np.unique(train['label'].values)
    return train,dev,label_list 

train,dev,label_list = get_quora_data()

QQP train data size: 363849 
QQP dev data size: 40430 


In [4]:
print(label_list)

['0' '1']


In [8]:
train.head()

|   | text_a | text_b | label |
|---|--------|--------|-------|
| 0 | How is the life of a math student? Could you d... | Which level of prepration is enough for the ex... | 0 |
| 1 | How do I control my horny emotions? | How do you control your horniness? | 1 |
| 2 | What causes stool color to change to yellow? | What can cause stool to come out as little balls? | 0 |
| 3 | What can one do after MBBS? | What do i do after my MBBS ? | 1 |
| 4 | Where can I find a power outlet for my laptop ... | Would a second airport in Sydney, Australia be... | 0 |


In [4]:
%%time

X_train = train[['text_a','text_b']]
y_train = train['label']

# define model
model = BertClassifier()
model.epochs = 4
model.learning_rate = 2e-5
model.max_seq_length = 128
model.validation_fraction = 0.1

print('\n',model,'\n')

# fit model
model.fit(X_train, y_train)

# test model on dev
test = dev
X_test = test[['text_a','text_b']]
y_test = test['label']

# make predictions
y_pred = model.predict(X_test)
print("Accuracy: %0.2f%%"%(metrics.accuracy_score(y_test,y_pred) * 100))

target_names = ['not duplicate', 'duplicate']
print(classification_report(y_test, y_pred, target_names=target_names))

Building sklearn classifier...

 BertClassifier(bert_model='bert-base-uncased', epochs=4, eval_batch_size=8,
        fp16=False, gradient_accumulation_steps=1, label_list=None,
        learning_rate=2e-05, local_rank=-1, logfile='bert_sklearn.log',
        loss_scale=0, max_seq_length=128, num_mlp_hiddens=500,
        num_mlp_layers=0, random_state=42, restore_file=None,
        train_batch_size=32, use_cuda=True, validation_fraction=0.1,
        warmup_proportion=0.1) 

Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 327465, validation data size: 36384


Training: 100%|██████████| 10234/10234 [2:11:06<00:00,  1.45it/s, loss=0.371] 
                                                               

Epoch 1, Train loss : 0.3710, Val loss: 0.2673, Val accy = 88.14%


Training: 100%|██████████| 10234/10234 [2:10:53<00:00,  1.45it/s, loss=0.226] 
                                                               

Epoch 2, Train loss : 0.2261, Val loss: 0.2370, Val accy = 89.94%


Training: 100%|██████████| 10234/10234 [2:10:54<00:00,  1.45it/s, loss=0.162] 
                                                               

Epoch 3, Train loss : 0.1616, Val loss: 0.2475, Val accy = 90.46%


Training: 100%|██████████| 10234/10234 [2:11:35<00:00,  1.44it/s, loss=0.133] 
                                                               

Epoch 4, Train loss : 0.1330, Val loss: 0.2646, Val accy = 90.36%


                                                               

Accuracy: 90.22%
               precision    recall  f1-score   support

not duplicate       0.92      0.93      0.92     25545
    duplicate       0.87      0.86      0.87     14885

  avg / total       0.90      0.90      0.90     40430

CPU times: user 8h 4min 10s, sys: 4h 10min 59s, total: 12h 15min 10s
Wall time: 9h 35min 18s
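The split sizes and the step count in the progress bars above can be reproduced with a little arithmetic. This is a sketch under two assumptions that the log does not confirm: the validation size is the floor of `n_total * validation_fraction`, and the final partial batch is kept, so the step count is rounded up.

```python
import math

n_total = 363849           # QQP train rows, from the loader output
validation_fraction = 0.1  # model.validation_fraction
train_batch_size = 32      # from the printed BertClassifier config
epochs = 4

# Assumption: validation size is the floor of n_total * validation_fraction.
n_val = int(n_total * validation_fraction)
n_train = n_total - n_val

# Assumption: the last partial batch is kept, so steps are rounded up.
steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * epochs

print("train: %d, validation: %d" % (n_train, n_val))
print("steps per epoch: %d, total steps: %d" % (steps_per_epoch, total_steps))
```

Under these assumptions the numbers agree with the log for this run: 327465 training rows, 36384 validation rows, and 10234 steps per epoch.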


## With an MLP

In [5]:
%%time

X_train = train[['text_a','text_b']]
y_train = train['label']

# define model
model = BertClassifier()
model.epochs = 5
model.learning_rate = 2e-5
model.max_seq_length = 128
model.validation_fraction = 0.1
model.num_mlp_layers = 4

print('\n',model,'\n')

# fit model
model.fit(X_train, y_train)

# test model on dev
test = dev
X_test = test[['text_a','text_b']]
y_test = test['label']

# make predictions
y_pred = model.predict(X_test)
print("Accuracy: %0.2f%%"%(metrics.accuracy_score(y_test,y_pred) * 100))

target_names = ['not duplicate', 'duplicate']
print(classification_report(y_test, y_pred, target_names=target_names))

Building sklearn classifier...

 BertClassifier(bert_model='bert-base-uncased', epochs=5, eval_batch_size=8,
        fp16=False, gradient_accumulation_steps=1, label_list=None,
        learning_rate=2e-05, local_rank=-1, logfile='bert_sklearn.log',
        loss_scale=0, max_seq_length=128, num_mlp_hiddens=500,
        num_mlp_layers=4, random_state=42, restore_file=None,
        train_batch_size=32, use_cuda=True, validation_fraction=0.1,
        warmup_proportion=0.1) 

Loading bert-base-uncased model...
Using mlp with D=768,H=500,K=2,n=4
train data size: 327465, validation data size: 36384


Training: 100%|██████████| 10233/10233 [2:12:41<00:00,  1.29it/s, loss=0.401] 
                                                               

Epoch 1, Train loss : 0.4011, Val loss: 0.2642, Val accy = 88.65%


Training: 100%|██████████| 10233/10233 [2:12:40<00:00,  1.29it/s, loss=0.228] 
                                                               

Epoch 2, Train loss : 0.2278, Val loss: 0.2312, Val accy = 90.34%


Training: 100%|██████████| 10233/10233 [2:12:40<00:00,  1.28it/s, loss=0.158] 
                                                               

Epoch 3, Train loss : 0.1575, Val loss: 0.2316, Val accy = 90.78%


Training: 100%|██████████| 10233/10233 [2:12:41<00:00,  1.29it/s, loss=0.117] 
                                                               

Epoch 4, Train loss : 0.1171, Val loss: 0.2439, Val accy = 90.89%


Training: 100%|██████████| 10233/10233 [2:12:42<00:00,  1.29it/s, loss=0.1]   
                                                               

Epoch 5, Train loss : 0.1004, Val loss: 0.2503, Val accy = 90.84%


                                                               

Accuracy: 90.75%
               precision    recall  f1-score   support

not duplicate       0.93      0.92      0.93     25545
    duplicate       0.87      0.88      0.88     14885

  avg / total       0.91      0.91      0.91     40430

CPU times: user 10h 17min 43s, sys: 5h 5min 10s, total: 15h 22min 53s
Wall time: 12h 5min 3s
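To put the two dev accuracies in context, the sketch below compares them with a majority-class baseline computed from the class supports in the classification reports. The numbers are copied by hand from the outputs above; the names `linear` and `mlp` are just labels for the two runs.

```python
# Dev-set support per class, copied from the classification reports above.
support = {"not duplicate": 25545, "duplicate": 14885}
n_dev = sum(support.values())

# Accuracy of always predicting the most frequent class.
majority_baseline = max(support.values()) / n_dev

# Dev accuracies reported by the two runs.
accuracy = {"linear": 0.9022, "mlp": 0.9075}

print("majority baseline: %0.2f%%" % (majority_baseline * 100))
for name, acc in accuracy.items():
    print("%s: %0.2f%% (+%0.2f points over baseline)"
          % (name, acc * 100, (acc - majority_baseline) * 100))

# Approximate number of extra dev pairs the MLP run classifies correctly.
extra_correct = round((accuracy["mlp"] - accuracy["linear"]) * n_dev)
print("extra correct pairs with the MLP run:", extra_correct)
```

Note that the two runs also differ in the number of epochs (4 versus 5), so the gap cannot be attributed to the MLP layers alone.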
