## STS-B: Semantic Textual Similarity Benchmark

The Semantic Textual Similarity Benchmark (STS-B) is a sentence-pair regression task. It consists of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair carries a human-annotated similarity score ranging from 0 (unrelated) to 5 (equivalent in meaning).

See the [website](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) and the [paper](http://www.aclweb.org/anthology/S/S17/S17-2001.pdf) for more information.

In [1]:
import numpy as np
import pandas as pd
import os
import sys
import csv
from sklearn import metrics

sys.path.append("../") 
from bert_sklearn import BertRegressor
from bert_sklearn import load_model

DATADIR = os.getcwd() + '/glue_data'

In [2]:
%%bash
python3 download_glue_data.py --data_dir glue_data --tasks STS 

Downloading and extracting STS...
	Completed!


In [3]:
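# Load the STS-B splits from the GLUE TSV files.
# Expected sizes: 5749 train pairs, 1500 dev pairs.
def read_tsv(filename, quotechar=None):
    """Read a tab-separated file into a list of rows."""
    with open(filename, "r", encoding='utf-8') as f:
        # With quotechar=None, quote characters in sentences are kept as plain text.
        return list(csv.reader(f, delimiter="\t", quotechar=quotechar))

def get_sts_b_df(filename):
    """Return a DataFrame with columns text_a, text_b, label."""
    rows = read_tsv(filename)
    df = pd.DataFrame(rows[1:], columns=rows[0])
    # Keep only the sentence pair and its similarity score.
    df = df[['sentence1', 'sentence2', 'score']]
    df.columns = ['text_a', 'text_b', 'label']
    df['label'] = pd.to_numeric(df['label'])
    # Drop rows whose score is missing.
    df = df[pd.notnull(df['label'])]
    return df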

def get_sts_b_data(train_file=DATADIR + '/STS-B/train.tsv',
                   dev_file=DATADIR + '/STS-B/dev.tsv',
                   nrows=None):
    # Both splits are read from DATADIR, where the download cell put them.
    train = get_sts_b_df(train_file)
    print("STS-B train data size: %d "%(len(train)))
    dev = get_sts_b_df(dev_file)
    print("STS-B dev data size: %d "%(len(dev)))
    return train, dev

train,dev = get_sts_b_data()


STS-B train data size: 5749 
STS-B dev data size: 1500 
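The loader passes `quotechar=None` to `csv.reader`, which matters because sentences may contain unbalanced quote characters. The following is a minimal standard-library sketch of that behaviour, not the notebook's own loader. The two-row `sample` and the `parse_pairs` helper are hypothetical, and the sample uses a simplified column layout rather than the real file.

```python
import csv
import io

# Hypothetical sample in a simplified STS-B-like layout (not real data).
# The first sentence starts with an unbalanced double quote.
sample = (
    'index\tsentence1\tsentence2\tscore\n'
    '0\t"Hello there\tHello there\t4.2\n'
    '1\tA dog runs.\tA cat sleeps.\t0.5\n'
)

def parse_pairs(text):
    """Parse TSV text into (text_a, text_b, label) tuples."""
    # quotechar=None makes the reader treat '"' as an ordinary character,
    # so a stray quote cannot swallow the following tabs and newlines.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar=None)
    header = next(reader)
    a, b, s = (header.index(c) for c in ("sentence1", "sentence2", "score"))
    return [(row[a], row[b], float(row[s])) for row in reader]

pairs = parse_pairs(sample)
for pair in pairs:
    print(pair)
```

Both rows survive as separate pairs, and the leading quote stays part of the first sentence.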


In [5]:
train.head()

| | text_a | text_b | label |
|---|---|---|---|
| 0 | A plane is taking off. | An air plane is taking off. | 5.0 |
| 1 | A man is playing a large flute. | A man is playing a flute. | 3.8 |
| 2 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... | 3.8 |
| 3 | Three men are playing chess. | Two men are playing chess. | 2.6 |
| 4 | A man is playing the cello. | A man seated is playing the cello. | 4.25 |


In [4]:
%%time

X_train = train[['text_a','text_b']]
y_train = train['label']

# define model
model = BertRegressor()
model.epochs = 4
model.learning_rate = 3e-5
model.max_seq_length = 128
model.validation_fraction = 0.1

print('\n',model,'\n')

# fit model
model.fit(X_train, y_train)

# test model on dev
test = dev
X_test = test[['text_a','text_b']]
y_test = test['label']

model.score(X_test,y_test)

Building sklearn regressor...

 BertRegressor(bert_model='bert-base-uncased', epochs=4, eval_batch_size=8,
       fp16=False, gradient_accumulation_steps=1, label_list=None,
       learning_rate=3e-05, local_rank=-1, logfile='bert_sklearn.log',
       loss_scale=0, max_seq_length=128, num_mlp_hiddens=500,
       num_mlp_layers=0, random_state=42, restore_file=None,
       train_batch_size=32, use_cuda=True, validation_fraction=0.1,
       warmup_proportion=0.1) 

Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 5175, validation data size: 574


Training: 100%|██████████| 162/162 [02:03<00:00,  1.35it/s, loss=1.86]

Epoch 1, Train loss : 1.8593, Val loss: 0.5730, Val accy = 88.42%

Training: 100%|██████████| 162/162 [02:05<00:00,  1.35it/s, loss=0.375]

Epoch 2, Train loss : 0.3746, Val loss: 0.4373, Val accy = 89.81%

Training: 100%|██████████| 162/162 [02:05<00:00,  1.35it/s, loss=0.221]

Epoch 3, Train loss : 0.2206, Val loss: 0.4139, Val accy = 90.04%

Training: 100%|██████████| 162/162 [02:05<00:00,  1.35it/s, loss=0.184]

Epoch 4, Train loss : 0.1842, Val loss: 0.4209, Val accy = 90.05%

Test loss: 0.4538, Test accuracy = 89.72%
CPU times: user 8min 1s, sys: 4min 2s, total: 12min 3s
Wall time: 9min 34s
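The split sizes and the 162 steps per epoch in the log can be reproduced with a little arithmetic. The sketch below uses only the values printed above. How the library actually rounds the split, counts the last partial batch, and schedules warmup is an assumption here, not something the log confirms.

```python
import math

# Values copied from the notebook's log and model settings.
n_examples = 5749            # STS-B train pairs
validation_fraction = 0.1
train_batch_size = 32
epochs = 4
warmup_proportion = 0.1

# Assumption: the validation split is the floor of the fraction.
n_val = int(n_examples * validation_fraction)
n_train = n_examples - n_val

# Assumption: the last, smaller batch still counts as one step.
steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: warmup covers the first warmup_proportion of all steps.
warmup_steps = int(total_steps * warmup_proportion)

print(n_train, n_val, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the numbers agree with the log: 5175 training pairs, 574 validation pairs, and 162 steps per epoch.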




In [5]:
from scipy.stats import pearsonr
y_pred = model.predict(X_test)
pearson_accy = pearsonr(y_pred,y_test)[0] * 100
print("Pearson : %0.2f"%(pearson_accy))

                                                             

Pearson : 89.72
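The Pearson value matches the "Test accuracy" printed by `model.score`, which suggests that the regressor's reported accuracy is the Pearson correlation scaled to a percentage. That is an inference from the two outputs, not a documented guarantee. GLUE reports both Pearson and Spearman correlation for STS-B, so the sketch below computes both from their definitions using only the standard library. It is an illustration, not the `scipy` implementation; the helper names and the toy scores are hypothetical.

```python
import math

def pearson_percent(y_true, y_pred):
    """Pearson correlation between two equal-length sequences, times 100."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return 100.0 * cov / math.sqrt(var_t * var_p)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_percent(y_true, y_pred):
    """Spearman correlation: Pearson correlation of the ranks, times 100."""
    return pearson_percent(average_ranks(y_true), average_ranks(y_pred))

# Toy scores (made up): predictions keep the gold ordering but not the spacing.
gold = [0.0, 1.5, 3.0, 5.0]
pred = [0.2, 1.0, 2.0, 9.0]
print("Pearson : %0.2f" % pearson_percent(gold, pred))
print("Spearman: %0.2f" % spearman_percent(gold, pred))
```

In the toy example the ordering is preserved exactly, so the rank-based Spearman score is perfect while Pearson is lower because the relationship is not linear.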


