# demo 

We will go through 3 examples:

* **[text classification](#text_classification)** - the goal is to classify a single sentence or short text.


* **[text pair classification](#text_pair_classification)** - the goal is to classify a pair of sentences or short texts.


* **[text pair regression](#text_pair_regression)** - the goal is to predict a numerical value for a pair of sentences or short texts.


#### A note on GPU cards...

While it's possible, running the examples without a GPU card of some sort would be slow. In addition, the `BERT` models (especially the large one) are pretty big, so it helps to have more memory.

The two biggest parameters you can change which will reduce the memory requirements significantly are:

* `max_seq_length` - the default is 128, but setting it to a smaller value like 96 or even 64 still gets good results on a lot of tasks.


* `train_batch_size` - this is set to a default of 32. Cutting it in half should also still give good results.
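
To see what `max_seq_length` controls, here is a minimal sketch of how input is cut down to the limit, modeled on the truncation logic in the huggingface `run_classifier.py` example. It is an assumption that `bert-sklearn` follows the same steps. `truncate_single` and `truncate_pair` are hypothetical helper names, and plain whitespace splitting stands in for BERT's WordPiece tokenizer.

```python
def truncate_single(tokens, max_seq_length):
    """Keep room for the [CLS] and [SEP] special tokens."""
    return tokens[: max_seq_length - 2]


def truncate_pair(tokens_a, tokens_b, max_seq_length):
    """Trim the longer sequence one token at a time.

    A pair needs three special tokens: [CLS] a [SEP] b [SEP].
    """
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    budget = max_seq_length - 3
    while len(tokens_a) + len(tokens_b) > budget:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()
    return tokens_a, tokens_b


# Whitespace splitting is a stand-in for WordPiece tokenization.
long_text = ("word " * 200).split()
single = truncate_single(long_text, max_seq_length=64)
pair_a, pair_b = truncate_pair(long_text, long_text[:10], max_seq_length=64)

print(len(single))               # 62 tokens + [CLS] + [SEP] = 64
print(len(pair_a), len(pair_b))  # 51 and 10: 61 tokens + 3 special tokens = 64
```

Anything past the limit is dropped, so a shorter `max_seq_length` only costs accuracy when the texts are routinely longer than the limit.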


In addition to these two parameters, [huggingface/pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_classifier.py) has several options to reduce the GPU memory requirements, which are passed through in `bert-sklearn`:

* `gradient_accumulation_steps` - the default is 1. Setting it to 4, 8, or 16 will trade memory for compute time. I use this a lot when I train BERT models on my laptop GPU.


* `fp16` - the default is `False`. To enable half precision, you must install [Nvidia apex](https://github.com/NVIDIA/apex). Setting this option to `True` will then cut the model memory load in half. I use this when I train on my laptop GPU as well.


* `multiple gpus` - for a single machine with multiple GPUs, following the huggingface port, the GPUs should be detected and the load split across the cards.


* `distributed training` - the huggingface port allows you to train across distributed GPUs. The parameters, i.e. `local_rank`, are exposed in `bert-sklearn`, but this option has not been tested yet.
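
The idea behind `gradient_accumulation_steps` can be illustrated without BERT at all. The sketch below is a toy one-parameter least-squares model in plain Python, not the `bert-sklearn` or huggingface implementation. It shows that summing scaled gradients over several micro-batches reproduces the gradient of one full batch, so only a micro-batch has to fit in memory at a time. All names here are hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# One big batch: all 8 examples are processed at once.
full_grad = grad_mse(w, xs, ys)

# Accumulation: 4 micro-batches of 2 examples each.
accumulation_steps = 4
micro = len(xs) // accumulation_steps
accumulated = 0.0
for i in range(accumulation_steps):
    lo, hi = i * micro, (i + 1) * micro
    # Scale each micro-batch gradient so the sum matches the full-batch mean.
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(abs(full_grad - accumulated) < 1e-9)  # prints True
```

The price is more forward and backward passes per weight update, which is the "memory for compute time" trade mentioned above.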


In [1]:
import numpy as np
import pandas as pd
import os
import math
import random
import csv
import sys
from sklearn import metrics
from sklearn.metrics import classification_report

from bert_sklearn import BertClassifier
from bert_sklearn import BertRegressor
from bert_sklearn import load_model

DATADIR = os.getcwd() + '/glue_examples/glue_data'

def read_tsv(filename,quotechar=None):
    with open(filename, "r", encoding='utf-8') as f:
        return list(csv.reader(f,delimiter="\t",quotechar=quotechar))   


<a id='text_classification'></a>
## text classification 

For single text/sentence classification we have the input data `X`, and target data `y` where:

* `X` is a list, pandas Series, or numpy array of text data.


* `y` is a list, pandas Series, or numpy array of text labels.

For this example, we will use the **`Stanford Sentiment Treebank (SST-2)`** data set from the [GLUE benchmarks](https://gluebenchmark.com/). The `SST-2` task consists of sentences drawn from movie reviews and annotated for their sentiment.

See [website](https://nlp.stanford.edu/sentiment/code.html) and [paper](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) for more info.

The input features are short sentences and the labels are the standard sentiment polarity of:
*    0 for negative 
*    1 for positive.


First download the data using the GLUE downloader:

In [2]:
%%bash
python3 ./glue_examples/download_glue_data.py --data_dir ./glue_examples/glue_data --tasks SST

Downloading and extracting SST...
	Completed!


In [2]:
"""
SST-2 train data size: 67349 
SST-2 dev data size: 872 
"""
def get_sst_data(train_file = DATADIR + '/SST-2/train.tsv',
                dev_file  = DATADIR + '/SST-2/dev.tsv'):
    
    train = pd.read_csv(train_file, sep='\t',  encoding = 'utf8',keep_default_na=False)
    train.columns=['text','label']
    print("SST-2 train data size: %d "%(len(train)))
    
    dev = pd.read_csv(dev_file, sep='\t',  encoding = 'utf8',keep_default_na=False)
    dev.columns=['text','label']
    print("SST-2 dev data size: %d "%(len(dev)))
    label_list = np.unique(train['label'])
    
    return train,dev,label_list

train,dev,label_list = get_sst_data()
train.head()

SST-2 train data size: 67349 
SST-2 dev data size: 872 


| | text | label |
|---|---|---|
| 0 | hide new secretions from the parental units | 0 |
| 1 | contains no wit , only labored gags | 0 |
| 2 | that loves its characters and communicates som... | 1 |
| 3 | remains utterly satisfied to remain the same t... | 0 |
| 4 | on the worst revenge-of-the-nerds clichés the ... | 0 |


#### setup data

In [3]:
# let's subsample the data for the demo
train = train.sample(1000,random_state=42)

X_train = train['text']
y_train = train['label']

# use the dev set for testing
test = dev
X_test = test['text']
y_test = test['label']

#### define model

We will set up a classifier with the default settings, but let's reduce `max_seq_length` from the default of 128 so it can run on a smaller GPU. This config should use about 5 GB of memory:

In [4]:
model = BertClassifier(max_seq_length=64)
model

Building sklearn classifier...


BertClassifier(bert_model='bert-base-uncased', epochs=3, eval_batch_size=8,
        fp16=False, gradient_accumulation_steps=1, label_list=None,
        learning_rate=2e-05, local_rank=-1, logfile='bert_sklearn.log',
        loss_scale=0, max_seq_length=64, num_mlp_hiddens=500,
        num_mlp_layers=0, random_state=42, restore_file=None,
        train_batch_size=32, use_cuda=True, validation_fraction=0.1,
        warmup_proportion=0.1)

#### fit  model on train data
The fit routine:
* loads the pretrained BERT model
* uses 10% of the data for validation and finetunes BERT on the remainder for 3 epochs
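
As a rough illustration of the hold-out step, the sketch below splits 1000 examples with a validation fraction of 0.1, which gives the 900/100 split reported in the training log. The `holdout_split` helper is hypothetical and uses the standard library only. How `bert-sklearn` actually shuffles and splits internally is not shown here.

```python
import random


def holdout_split(n_examples, validation_fraction=0.1, seed=42):
    """Return (train_indices, val_indices) for a random hold-out split."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_examples * validation_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = holdout_split(1000, validation_fraction=0.1)
print(len(train_idx), len(val_idx))  # 900 100
```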

In [5]:
%%time
model.fit(X_train, y_train)

Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 900, validation data size: 100


Training: 100%|██████████| 29/29 [00:12<00:00,  2.78it/s, loss=0.623]
                                                           

Epoch 1, Train loss : 0.6230, Val loss: 0.6812, Val accy = 65.00%


Training: 100%|██████████| 29/29 [00:12<00:00,  2.87it/s, loss=0.237]
                                                           

Epoch 2, Train loss : 0.2366, Val loss: 0.4520, Val accy = 83.00%


Training: 100%|██████████| 29/29 [00:12<00:00,  2.86it/s, loss=0.121]
                                                           

Epoch 3, Train loss : 0.1209, Val loss: 0.4515, Val accy = 83.00%
CPU times: user 30.7 s, sys: 15.2 s, total: 46 s
Wall time: 47 s




BertClassifier(bert_model='bert-base-uncased', epochs=3, eval_batch_size=8,
        fp16=False, gradient_accumulation_steps=1,
        label_list=array([0, 1]), learning_rate=2e-05, local_rank=-1,
        logfile='bert_sklearn.log', loss_scale=0, max_seq_length=64,
        num_mlp_hiddens=500, num_mlp_layers=0, random_state=42,
        restore_file=None, train_batch_size=32, use_cuda=True,
        validation_fraction=0.1, warmup_proportion=0.1)

#### score and make predictions on test data

In [6]:
# score model
accy = model.score(X_test, y_test,verbose=True)

# predict class probabilities
y_prob = model.predict_proba(X_test)
print("class prob estimates:\n", y_prob)

# make predictions
y_pred = model.predict(X_test)
print("Accuracy: %0.2f%%"%(metrics.accuracy_score(y_test, y_pred) * 100))

target_names = ['negative', 'positive']
print(classification_report(y_test, y_pred, target_names=target_names))

Predicting:   0%|          | 0/109 [00:00<?, ?it/s]       


Test loss: 0.3214, Test accuracy = 87.39%


Predicting:   0%|          | 0/109 [00:00<?, ?it/s]          

class prob estimates:
 [[0.01246537 0.98753464]
 [0.9561755  0.04382455]
 [0.02255277 0.9774473 ]
 ...
 [0.85820186 0.14179818]
 [0.78386265 0.21613738]
 [0.10148279 0.89851725]]


                                                             

Accuracy: 87.39%
              precision    recall  f1-score   support

    negative       0.85      0.91      0.88       428
    positive       0.91      0.84      0.87       444

   micro avg       0.87      0.87      0.87       872
   macro avg       0.88      0.87      0.87       872
weighted avg       0.88      0.87      0.87       872
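
The relationship between `predict_proba` and `predict` can be sketched in plain Python: for each row of class probabilities, the predicted label is the class with the highest probability. This assumes that `predict` takes the arg-max over the probabilities, which is the usual scikit-learn convention but is not verified against the `bert-sklearn` source. The rows are the first three printed above, and `labels` stands in for the fitted `label_list`.

```python
# First three rows of the class probability estimates printed above.
y_prob = [
    [0.01246537, 0.98753464],
    [0.9561755, 0.04382455],
    [0.02255277, 0.9774473],
]
labels = [0, 1]  # 0 = negative, 1 = positive

# Pick the label whose probability is largest in each row.
y_pred = [labels[max(range(len(row)), key=row.__getitem__)] for row in y_prob]
print(y_pred)  # [1, 0, 1]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true labels, only to show the calculation.
print(accuracy([1, 0, 0], y_pred))  # 2 of 3 correct
```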





#### save/load model from disk

In [7]:
# save model to disk
savefile='/data/test.bin'
model.save(savefile)

# load model from disk
new_model = load_model(savefile)

# predict with new model
accy = new_model.score(X_test, y_test,verbose=True)

Loading model from /data/test.bin...


100%|██████████| 407873900/407873900 [00:28<00:00, 14080940.67B/s]


Defaulting to linear classifier/regressor


Testing:   0%|          | 0/109 [00:00<?, ?it/s]

Building sklearn classifier...


                                                          


Test loss: 0.3214, Test accuracy = 87.39%




### options



In [7]:
from bert_sklearn import SUPPORTED_MODELS
SUPPORTED_MODELS

('bert-base-uncased',
 'bert-large-uncased',
 'bert-base-cased',
 'bert-large-cased',
 'bert-base-multilingual-uncased',
 'bert-base-multilingual-cased',
 'bert-base-chinese')

Let's try the larger BERT model, `'bert-large-uncased'`, and also accumulate gradients to avoid running out of memory.

In [8]:
%%time

# try large bert model with different options
model = BertClassifier(bert_model='bert-large-uncased',
                       max_seq_length=64,
                       epochs=3,
                       learning_rate=2e-5,
                       validation_fraction=0,
                       gradient_accumulation_steps=16)

# fit model
model.fit(X_train,y_train)

# score model
model.score(X_test,y_test)

Building sklearn classifier...
Loading bert-large-uncased model...
Defaulting to linear classifier/regressor
train data size: 1000, validation data size: 0


Training: 100%|██████████| 500/500 [01:02<00:00,  7.75it/s, loss=0.718]
Training: 100%|██████████| 500/500 [01:03<00:00,  7.72it/s, loss=0.386]
Training: 100%|██████████| 500/500 [01:07<00:00,  7.39it/s, loss=0.211]
                                                          


Test loss: 0.2742, Test accuracy = 90.37%
CPU times: user 2min 57s, sys: 50.8 s, total: 3min 48s
Wall time: 3min 48s
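
The 500 steps per epoch in the progress bars above are consistent with how the huggingface `run_classifier.py` example handles accumulation: it divides `train_batch_size` by `gradient_accumulation_steps` and feeds the model that smaller micro-batch. The sketch below reproduces the arithmetic, assuming `bert-sklearn` does the same. `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(n_train, train_batch_size=32, gradient_accumulation_steps=1):
    """Number of forward/backward passes per epoch."""
    micro_batch_size = train_batch_size // gradient_accumulation_steps
    return math.ceil(n_train / micro_batch_size)


# Defaults on 900 training examples (1000 minus 10% validation).
print(steps_per_epoch(900))  # 29

# bert-large run: 1000 examples, micro-batches of 32 // 16 = 2.
print(steps_per_epoch(1000, gradient_accumulation_steps=16))  # 500
```

Under this assumption the weights are still updated once per 32 examples, so only the peak memory per step shrinks.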




<a id='text_pair_classification'></a>

## text pair classification

For text pair classification, we have input data `X` and target data `y` where:

* `X` is a list, pandas dataframe, or numpy array of text pairs (`text_a`, `text_b`).


* `y` is a list, pandas Series, or numpy array of text labels
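
As a small illustration of the expected shape, the sketch below builds `X` as a plain list of (`text_a`, `text_b`) tuples with a matching list of labels. The two pairs are taken from the QQP sample shown in this section. Passing a list of tuples to `fit` follows the description above and is not separately verified here.

```python
# Each element of X is one (text_a, text_b) pair; y holds one label per pair.
X = [
    ("What can one do after MBBS?", "What do i do after my MBBS ?"),
    ("What causes stool color to change to yellow?",
     "What can cause stool to come out as little balls?"),
]
y = [1, 0]  # 1 = duplicate, 0 = not a duplicate

for (text_a, text_b), label in zip(X, y):
    print(label, "|", text_a, "|", text_b)

# Every row must have exactly two texts and exactly one label.
assert all(len(pair) == 2 for pair in X) and len(X) == len(y)
```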

For this example, we will use the **`Quora Question Pair (QQP)`** data set from the [GLUE benchmarks](https://gluebenchmark.com/). This data consists of sentence pairs from the Quora website labeled as duplicate or not. See [original release post](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) for more info.

The input features are pairs of questions (`text_a`, `text_b`) along with the labels:
*    0 if `text_a` and `text_b` are not duplicates

*    1 if `text_a` and `text_b` are duplicates

Let's download the data:

In [10]:
%%bash
python3 ./glue_examples/download_glue_data.py --data_dir ./glue_examples/glue_data --tasks QQP

Downloading and extracting QQP...
	Completed!


In [11]:
"""
QQP train data size: 363849 
QQP dev data size: 40430 
"""
   
def get_quora_df(filename):
    rows = read_tsv(filename)
    df=pd.DataFrame(rows[1:],columns=rows[0])
    df=df[['question1','question2','is_duplicate']]
    df = df[pd.notnull(df['is_duplicate'])]
    df.columns=['text_a','text_b','label']
    return df

def get_quora_data(train_file = DATADIR+'/QQP/train.tsv', 
                   dev_file =  DATADIR+'/QQP/dev.tsv'):
    train = get_quora_df(train_file)
    print("QQP train data size: %d "%(len(train)))
    dev = get_quora_df(dev_file)
    print("QQP dev data size: %d "%(len(dev)))

    label_list = np.unique(train['label'].values)
    return train,dev,label_list

train,dev,label_list = get_quora_data()

train.head()

QQP train data size: 363849 
QQP dev data size: 40430 


| | text_a | text_b | label |
|---|---|---|---|
| 0 | How is the life of a math student? Could you d... | Which level of prepration is enough for the ex... | 0 |
| 1 | How do I control my horny emotions? | How do you control your horniness? | 1 |
| 2 | What causes stool color to change to yellow? | What can cause stool to come out as little balls? | 0 |
| 3 | What can one do after MBBS? | What do i do after my MBBS ? | 1 |
| 4 | Where can I find a power outlet for my laptop ... | Would a second airport in Sydney, Australia be... | 0 |


In [13]:
%%time

# subsample data for demo
n = 10000
train = train.sample(n=n,random_state=42)
dev = dev.sample(n=n,random_state=42)

X_train = train[['text_a','text_b']]
y_train = train['label']

# use the dev set for testing...
test = dev
X_test = test[['text_a','text_b']]
y_test = test['label']

# define model
model = BertClassifier(max_seq_length=64)

# fit model
model.fit(X_train, y_train)

# score model
model.score(X_test, y_test)

Building sklearn classifier...
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 9000, validation data size: 1000


Training: 100%|██████████| 282/282 [02:02<00:00,  2.65it/s, loss=0.517]
                                                             

Epoch 1, Train loss : 0.5173, Val loss: 0.4309, Val accy = 78.70%


Training: 100%|██████████| 282/282 [02:15<00:00,  2.09it/s, loss=0.311]
                                                             

Epoch 2, Train loss : 0.3113, Val loss: 0.4210, Val accy = 80.80%


Training: 100%|██████████| 282/282 [02:29<00:00,  2.36it/s, loss=0.235]
                                                             

Epoch 3, Train loss : 0.2346, Val loss: 0.4475, Val accy = 80.60%


                                                            


Test loss: 0.4171, Test accuracy = 81.91%
CPU times: user 5min 25s, sys: 2min 46s, total: 8min 11s
Wall time: 8min 12s




<a id='text_pair_regression'></a>

## text pair regression  

For text pair regression, we have input data `X` and target data `y` where:

* `X` is a list, pandas dataframe, or numpy array of text pairs (`text_a`, `text_b`).


* `y` is a list, pandas Series, or numpy array of floats.


For this example, we will use the **`STS-B`** data set from the [GLUE benchmarks](https://gluebenchmark.com/). The data consists of sentence pairs drawn from news headlines and image captions with annotated similarity scores ranging from 0 to 5.

See [website](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) and [paper](http://www.aclweb.org/anthology/S/S17/S17-2001.pdf) for more info.


### STS-B

In [14]:
%%bash
python3 ./glue_examples/download_glue_data.py --data_dir ./glue_examples/glue_data --tasks STS

Downloading and extracting STS...
	Completed!


In [15]:
"""
STS-B train data size: 5749 
STS-B dev data size: 1500 
"""
def get_sts_b_df(filename):
    rows = read_tsv(filename)
    df=pd.DataFrame(rows[1:],columns=rows[0])
    df=df[['sentence1','sentence2','score']]    
    df.columns=['text_a','text_b','label']
    df.label = pd.to_numeric(df.label)
    df = df[pd.notnull(df['label'])]                
    return df

def get_sts_b_data(train_file = DATADIR + '/STS-B/train.tsv',
                   dev_file = DATADIR + '/STS-B/dev.tsv',
                   nrows=None):
    train = get_sts_b_df(train_file)
    print("STS-B train data size: %d "%(len(train)))    
    dev   = get_sts_b_df(dev_file)
    print("STS-B dev data size: %d "%(len(dev)))  
    return train,dev

train,dev = get_sts_b_data()
train.head()

STS-B train data size: 5749 
STS-B dev data size: 1500 


| | text_a | text_b | label |
|---|---|---|---|
| 0 | A plane is taking off. | An air plane is taking off. | 5.0 |
| 1 | A man is playing a large flute. | A man is playing a flute. | 3.8 |
| 2 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... | 3.8 |
| 3 | Three men are playing chess. | Two men are playing chess. | 2.6 |
| 4 | A man is playing the cello. | A man seated is playing the cello. | 4.25 |


In [16]:
%%time

from scipy.stats import pearsonr

train = train.sample(n=1000,random_state=42)
dev = dev.sample(n=1000,random_state=42)


X_train = train[['text_a','text_b']]
y_train = train['label']

# use the dev set for testing...
test = dev
X_test = test[['text_a','text_b']]
y_test = test['label']

# define model
model = BertRegressor()
model.max_seq_length = 64

# fit
model.fit(X_train, y_train)

# score on test data
model.score(X_test, y_test)

# predict on test data 
y_pred = model.predict(X_test)
pearson_accy = pearsonr(y_pred,y_test)[0] * 100
print("Pearson : %0.2f"%(pearson_accy))

Building sklearn regressor...
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 900, validation data size: 100


Training: 100%|██████████| 29/29 [00:12<00:00,  2.92it/s, loss=3.53]
                                                           

Epoch 1, Train loss : 3.5257, Val loss: 1.7551, Val accy = 63.68%


Training: 100%|██████████| 29/29 [00:12<00:00,  2.89it/s, loss=0.766]
                                                           

Epoch 2, Train loss : 0.7659, Val loss: 0.7042, Val accy = 80.97%


Training: 100%|██████████| 29/29 [00:12<00:00,  2.90it/s, loss=0.511]
                                                           

Epoch 3, Train loss : 0.5107, Val loss: 0.7286, Val accy = 81.32%


Predicting:   0%|          | 0/125 [00:00<?, ?it/s]       


Test loss: 0.6663, Test accuracy = 84.17%


                                                             

Pearson : 84.17
CPU times: user 36.2 s, sys: 16.8 s, total: 53 s
Wall time: 54.1 s
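
For reference, the Pearson correlation reported above can be computed by hand. The sketch below is a plain-Python version of the statistic that `scipy.stats.pearsonr` returns as its first value. `pearson_r` is a hypothetical helper, and the sample scores are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up gold scores and predictions, only to show the calculation.
y_true = [5.0, 3.8, 2.6, 4.25, 1.5]
y_pred = [4.6, 3.5, 3.0, 4.0, 2.0]

print("Pearson : %0.2f" % (pearson_r(y_true, y_pred) * 100))
```

Because the statistic only measures linear agreement, predictions that are consistently too high or too low can still score well, which is why the test loss is worth checking alongside it.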


