## CoLA: Corpus of Linguistic Acceptability

The Corpus of Linguistic Acceptability (CoLA) is a single-sentence classification task.
It consists of sentences drawn from linguistic publications, each annotated as acceptable or unacceptable English.

See the [website](https://nyu-mll.github.io/CoLA/) and the [paper](https://arxiv.org/abs/1805.12471) for more information.

In [1]:
import numpy as np
import pandas as pd
import os
import sys
import csv
from sklearn import metrics
from sklearn.metrics import classification_report

sys.path.append("../") 
from bert_sklearn import BertClassifier
from bert_sklearn import load_model

DATADIR = os.path.join(os.getcwd(), 'glue_data')

In [2]:
%%bash
python3 download_glue_data.py --data_dir glue_data --tasks CoLA 

Downloading and extracting CoLA...
	Completed!


In [2]:
"""
CoLA train data size: 8551 
CoLA dev data size: 1043 
"""

def get_cola_df(filename, cols=(3, 1)):
    # The files have no header row; column 3 is the sentence, column 1 the 0/1 label.
    df = pd.read_csv(filename, sep='\t', encoding='utf8', keep_default_na=False, header=None)
    df = df[list(cols)]
    df.columns = ['text', 'label']
    return df

def get_cola_data(train_file=DATADIR + '/CoLA/train.tsv',
                  dev_file=DATADIR + '/CoLA/dev.tsv'):
    train = get_cola_df(train_file)
    print("CoLA train data size: %d " % (len(train)))
    dev = get_cola_df(dev_file)
    print("CoLA dev data size: %d " % (len(dev)))

    # Sorted unique label values found in the training set
    label_list = np.unique(train['label'].values)
    return train, dev, label_list

train, dev, label_list = get_cola_data()

CoLA train data size: 8551 
CoLA dev data size: 1043 
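
The loader above keeps only columns 3 and 1 of each tab-separated file. As a rough illustration of what that selection does, here is a minimal sketch using only the standard library. The two rows are hypothetical: the `src` source code, the `*` annotation, and the second (scrambled) sentence are made up for illustration, and the four-column layout (source, label, original annotation, sentence) is an assumption about the CoLA files rather than something this notebook states.

```python
import csv
import io

# Two hypothetical rows in the assumed CoLA layout:
# source code, 0/1 label, original annotation, sentence (no header row).
sample_tsv = (
    "src\t1\t\tOne more pseudo generalization and I'm giving up.\n"
    "src\t0\t*\tMore one pseudo generalization giving and up I'm.\n"
)

def read_cola_rows(handle, text_col=3, label_col=1):
    """Return (text, label) pairs from a tab-separated file-like object."""
    # QUOTE_NONE keeps apostrophes and quote characters as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[text_col], int(row[label_col])) for row in reader]

rows = read_cola_rows(io.StringIO(sample_tsv))
for text, label in rows:
    print(label, text)
```

This mirrors the `cols = [3,1]` selection in `get_cola_df` without pandas, which can be handy for spot-checking a few lines of the raw file.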


In [8]:
print(label_list)

[0 1]


In [9]:
train.head()

                                                text  label
0  Our friends won't buy this analysis, let alone...      1
1  One more pseudo generalization and I'm giving up.      1
2   One more pseudo generalization or I'm giving up.      1
3     The more we study verbs, the crazier they get.      1
4          Day by day the facts are getting murkier.      1


In [4]:
%%time
from sklearn.metrics import matthews_corrcoef

X_train = train['text']
y_train = train['label']

# define model
model = BertClassifier()
model.epochs = 3
model.validation_fraction = 0
model.learning_rate = 2e-5
model.max_seq_length = 128
model.gradient_accumulation_steps = 2

print('\n',model,'\n')

# fit model
model.fit(X_train, y_train)

# test model on dev
test = dev
X_test = test['text']
y_test = test['label']

# make predictions
y_pred = model.predict(X_test)
print("Accuracy: %0.2f%%"%(metrics.accuracy_score(y_test, y_pred) * 100))
# label 0 (unacceptable) is reported as 'negative', label 1 (acceptable) as 'positive'
print(classification_report(y_test, y_pred, target_names=['negative','positive']))

# Matthews correlation coefficient
print("\nMathews Correlation: %0.2f"%(matthews_corrcoef(y_test, y_pred) * 100))

Building sklearn classifier...

 BertClassifier(bert_model='bert-base-uncased', epochs=3, eval_batch_size=8,
        fp16=False, gradient_accumulation_steps=2, label_list=None,
        learning_rate=2e-05, local_rank=-1, logfile='bert_sklearn.log',
        loss_scale=0, max_seq_length=128, num_mlp_hiddens=500,
        num_mlp_layers=0, random_state=42, restore_file=None,
        train_batch_size=32, use_cuda=True, validation_fraction=0,
        warmup_proportion=0.1) 

Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 8551, validation data size: 0


Training: 100%|██████████| 535/535 [04:49<00:00,  1.98it/s, loss=0.515]
Training: 100%|██████████| 535/535 [04:51<00:00,  1.98it/s, loss=0.247]
Training: 100%|██████████| 535/535 [04:51<00:00,  1.98it/s, loss=0.155]
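
The 535 iterations per epoch shown in the progress bars are consistent with the configured batch of 32 being split into 2 accumulation steps of 16 examples each. The arithmetic below is a small sketch of that reading. The split rule (`train_batch_size // gradient_accumulation_steps`) is an assumption inferred from the logged count, not something confirmed from the `bert_sklearn` source, and `iterations_per_epoch` is a hypothetical helper name.

```python
import math

def iterations_per_epoch(n_examples, train_batch_size, gradient_accumulation_steps):
    """Return (micro-batch size, forward/backward passes per epoch).

    Assumes the configured batch is divided evenly across accumulation steps.
    """
    micro_batch = train_batch_size // gradient_accumulation_steps
    iterations = math.ceil(n_examples / micro_batch)
    return micro_batch, iterations

# Values taken from the model configuration and training log above.
micro_batch, iterations = iterations_per_epoch(8551, 32, 2)
print(micro_batch, iterations)
```

Under this assumption, raising `gradient_accumulation_steps` lowers the memory needed per pass while keeping the effective batch size at 32.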
                                                             

Accuracy: 82.84%
             precision    recall  f1-score   support

   negative       0.82      0.57      0.67       322
   positive       0.83      0.94      0.88       721

avg / total       0.83      0.83      0.82      1043


Mathews Correlation: 57.79
CPU times: user 12min 39s, sys: 6min 8s, total: 18min 48s
Wall time: 14min 57s
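
The dev set is imbalanced (721 positive versus 322 negative examples in the report above), so accuracy alone can flatter a model that leans toward the majority class. The Matthews correlation coefficient accounts for all four cells of the confusion matrix. The following is a minimal sketch of the binary formula in plain Python; the labels are hypothetical toy data, not CoLA predictions, and `matthews_corrcoef_binary` is an illustrative helper rather than the scikit-learn function used above.

```python
import math

def matthews_corrcoef_binary(y_true, y_pred):
    """MCC for binary 0/1 labels, computed from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when any marginal count is zero and the ratio is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
score = matthews_corrcoef_binary(y_true, y_pred)
print("MCC: %0.2f" % (score * 100))

# A classifier that always predicts the majority class has zero correlation,
# even though its accuracy on this toy data is 5/8.
baseline = matthews_corrcoef_binary(y_true, [1] * len(y_true))
print("Always-positive MCC: %0.2f" % (baseline * 100))
```

As in the notebook cell above, the coefficient is multiplied by 100 before printing; the unscaled value lies between -1 and 1.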


