# TODOs

## Short
- Implement baselines
    - Word2Vec
    - Bag of Words
- Training - .py file so we can do multi-gpu
- Inference
    - Make inference as small as possible (separate notebook?)
- ~~Input your OWN reviews!!~~
    - Showcase strengths and weaknesses of BERT embeddings
    - Sarcasm
    - Negation
    - apple vs banana
    - https://spectrum.ieee.org/tech-talk/robotics/artificial-intelligence/the-next-frontier-in-ai-nothing
- Multiple Datasets
    - multi-class datasets - tags from stack overflow
    - imdb
- Visualize embeddings (cuML, ipyVolume, UMAP?, t-SNE, PCA)
    - Embed vocab then apply t-SNE
    - Visualize end result, trajectory of negative/positive sentences
- Create our own Neural Modules
    - Create our data layer from processed data (ie tokens)
    - Our own MultiLayerPerceptron (nn.linear) with varying hidden size
    - More models: DistilBert, T5
    - Interactive inference ?
- Add figure showing scaling results

## Long / Optional
- Compare multi-class vs binary labels
- Compare frozen BERT performance vs fine-tuned BERT

## Optimizations
- Implement preprocessed data layer (tokenize once)
- Multi-CPU tokenization
- RAPIDS GPU tokenization?
    
## Questions for NVIDIA Engineering Teams
- Can memory be cleared interactively?
    - https://forums.fast.ai/t/clearing-gpu-memory-pytorch/14637/2
- Multi-gpu INSIDE Jupyter
- TensorCore Utilization Dashboard (Tensorboard has it for TensorFlow?)
    - Can we get it from pynvml/nvml?
- Closing/cleaning up factories safely (zombie protection)

In [1]:
import nemo
from nemo.utils.lr_policies import get_lr_policy
import nemo_nlp
from nemo_nlp.utils.callbacks.sentence_classification import \
    eval_iter_callback, eval_epochs_done_callback
import preproc_data_layer
from pytorch_transformers import BertTokenizer
import torch.nn.functional as f

import math
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = -1

import json

from preproc_data_layer import BertSentenceClassificationDataset, PreprocBertSentenceClassificationDataLayer



## Data Explore

The SST-2 dataset (https://nlp.stanford.edu/sentiment/index.html) is a standard benchmark for sentence classification and is part of the GLUE benchmark (https://gluebenchmark.com/tasks).

In [2]:
data_dir = '/raid/data/nlp/SST-2'
df = pd.read_csv(data_dir + '/train.tsv', sep='\t')
test_df = pd.read_csv(data_dir + '/test.tsv', sep='\t')

In [3]:
df.head()

Unnamed: 0,sentence,label
0,hide new secretions from the parental units,0
1,"contains no wit , only labored gags",0
2,that loves its characters and communicates something rather beautiful about human nature,1
3,remains utterly satisfied to remain the same throughout,0
4,on the worst revenge-of-the-nerds clichés the filmmakers could dredge up,0


In [4]:
test_df

Unnamed: 0,index,sentence
0,0,uneasy mishmash of styles and genres .
1,1,this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .
2,2,"by the end of no such thing the audience , like beatrice , has a watchful affection for the monster ."
3,3,director rob marshall went out gunning to make a great one .
4,4,"lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new ."
...,...,...
1816,1816,"it risks seeming slow and pretentious , because it thinks the gamble is worth the promise ."
1817,1817,take care of my cat offers a refreshingly different slice of asian cinema .
1818,1818,davis has filled out his cast with appealing fresh faces .
1819,1819,"it represents better-than-average movie-making that does n't demand a dumb , distracted audience ."


The dataset comes with a labeled train file and an unlabeled test file. We will hold out part of the train file for model validation.

In [5]:
# Split train into train (~80%) and validation (~20%) using a random mask
np.random.seed(123)
train_mask = np.random.rand(len(df)) < 0.8
train_df = df[train_mask]
val_df = df[~train_mask]

In order to take advantage of NeMo's prebuilt sentence classification data layer, the data must be formatted as "sentence\tlabel" (sentence, tab, label).

In [6]:
# We will add a label column with all 0's (but they will not be used for anything).
test_df['label'] = 0

In [7]:
test_df = test_df[['sentence', 'label']]

In [8]:
test_df.head()

Unnamed: 0,sentence,label
0,uneasy mishmash of styles and genres .,0
1,this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .,0
2,"by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .",0
3,director rob marshall went out gunning to make a great one .,0
4,"lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .",0


In [9]:
# Save new train, val, and test to disk
train_df.to_csv('/raid/data/nlp/SST-2/my_train.tsv', sep='\t', index=False)
val_df.to_csv('/raid/data/nlp/SST-2/my_val.tsv', sep='\t', index=False)
test_df.to_csv('/raid/data/nlp/SST-2/my_test.tsv', sep='\t', index=False)
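
The "sentence\tlabel" layout can be checked without pandas. The sketch below is only an illustration using the standard library: it writes two rows taken from the `df.head()` output above into an in-memory buffer (no real file is touched) and reads them back to confirm that every line has exactly two tab-separated fields.

```python
import csv
import io

# Two rows copied from the df.head() output above, standing in for the full data.
rows = [
    ("contains no wit , only labored gags", 0),
    ("that loves its characters and communicates something rather beautiful about human nature", 1),
]

# Write a header line plus one "sentence<TAB>label" line per example.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["sentence", "label"])
writer.writerows(rows)
tsv_text = buffer.getvalue()

# Read the text back and split it into header and examples.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
header, examples = parsed[0], parsed[1:]

print(header)
for sentence, label in examples:
    print(label, sentence)
```

Commas inside a sentence are harmless here because the delimiter is a tab, which is one reason TSV suits free text.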

# Data Preprocessing

In order to use BERT or any other deep NLP model, we must first tokenize the data. The tokenizer splits each sentence into word or subword tokens and maps each token to an integer ID.

In [10]:
# Each pretrained BERT model comes with a Tokenizer
#pretrained_bert_model = 'bert-base-uncased'
pretrained_bert_model = 'bert-large-uncased'
tokenizer = BertTokenizer.from_pretrained(pretrained_bert_model)
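
To make the word-to-integer mapping concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the general idea behind WordPiece tokenization. It is not the `BertTokenizer` implementation. The toy vocabulary holds only the pieces and IDs that the data layer log prints for the first test sentence later in this notebook; the `[UNK]` entry and the helper names `wordpiece` and `encode` are assumptions added for illustration.

```python
# Toy vocabulary: pieces and ids as printed in the data layer log for
# "uneasy mishmash of styles and genres ." ([UNK] is an added assumption).
vocab = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "uneasy": 15491, "mis": 28616, "##hma": 22444, "##sh": 4095,
    "of": 1997, "styles": 6782, "and": 1998, "genres": 11541, ".": 1012,
}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(sentence, vocab):
    """Wrap the subword pieces in [CLS] ... [SEP] and look up their ids."""
    tokens = ["[CLS]"]
    for word in sentence.split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[token] for token in tokens]

tokens, ids = encode("uneasy mishmash of styles and genres .", vocab)
print(tokens)
print(ids)
```

With this toy vocabulary, "mishmash" is split into `mis`, `##hma`, `##sh`, which is the same split the log shows.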

In [11]:
max_seq_length = 64  # shorter sentences are padded with 0s, longer ones are truncated
sample_dataset = BertSentenceClassificationDataset(
    '/raid/data/nlp/SST-2/my_train.tsv',
    max_seq_length,
    tokenizer,
    num_samples=100,
    shuffle=False)

In [12]:
# Our dataset consists of the tokens, token_types, attn_mask, and labels
sample_idx = np.random.randint(0, len(sample_dataset))
print(train_df.iloc[sample_idx])
print(f'Tokens: {sample_dataset[sample_idx][0]}')

sentence    the corporate circus that is the recording industry in the current climate of mergers and downsizing 
label       0                                                                                                    
Name: 68, dtype: object
Tokens: [  101  4089  1996  2087 16465  7214  7749   102     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]
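
The trailing zeros in the output above are padding. The following sketch shows one way to pad or truncate a token list to `max_seq_length` and to build the matching attention mask. It is an illustration, not the data layer's code: the helper name `pad_or_truncate` is hypothetical, and keeping `[SEP]` (id 102) as the last token after truncation is an assumption.

```python
def pad_or_truncate(token_ids, max_seq_length, pad_id=0, sep_id=102):
    """Return (input_ids, attention_mask), both of length max_seq_length."""
    if len(token_ids) > max_seq_length:
        # Assumption: cut the sentence but keep [SEP] as the final token.
        token_ids = token_ids[:max_seq_length - 1] + [sep_id]
    attention_mask = [1] * len(token_ids)  # 1 marks a real token
    padding = max_seq_length - len(token_ids)
    return token_ids + [pad_id] * padding, attention_mask + [0] * padding

# The eight non-zero ids from the sample output above.
sample_ids = [101, 4089, 1996, 2087, 16465, 7214, 7749, 102]
input_ids, attention_mask = pad_or_truncate(sample_ids, max_seq_length=64)

print(len(input_ids), sum(attention_mask))

# A sentence longer than the limit is cut down to the limit.
long_ids = [101] + [1996] * 100 + [102]
long_input_ids, long_mask = pad_or_truncate(long_ids, max_seq_length=64)
```

The attention mask lets the model ignore the padded positions, so padding changes the tensor shape but should not change the prediction.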


Tokenizing the same text on every run is wasted work, so we tokenize once and save the result to disk. Later model runs can then load the preprocessed data instead of tokenizing again.

In [13]:
!python preproc_data.py \
--input_file "/raid/data/nlp/SST-2/my_train.tsv" \
--output_dir "/raid/data/nlp/SST-2/preproc" \
--dataset_name "train-sst-2" \
--max_seq_length $max_seq_length \
--pretrained_bert_model $pretrained_bert_model



In [14]:
!python preproc_data.py \
--input_file "/raid/data/nlp/SST-2/my_test.tsv" \
--output_dir "/raid/data/nlp/SST-2/preproc" \
--dataset_name "test-sst-2" \
--max_seq_length $max_seq_length \
--pretrained_bert_model $pretrained_bert_model



In [15]:
!python preproc_data.py \
--input_file "/raid/data/nlp/SST-2/my_val.tsv" \
--output_dir "/raid/data/nlp/SST-2/preproc" \
--dataset_name "val-sst-2" \
--max_seq_length $max_seq_length \
--pretrained_bert_model $pretrained_bert_model
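
The tokenize-once idea can be sketched in a few lines. This is not how `preproc_data.py` works (the preprocessed files used later in this notebook are HDF5): it is a minimal stand-in that uses `pickle`, a temporary directory, and `str.split` as a placeholder tokenizer. The function name `load_or_tokenize` is hypothetical.

```python
import os
import pickle
import tempfile

def load_or_tokenize(sentences, cache_path, tokenize):
    """Return (tokenized, loaded_from_cache); tokenize only on a cache miss."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as cache_file:
            return pickle.load(cache_file), True
    tokenized = [tokenize(sentence) for sentence in sentences]
    with open(cache_path, "wb") as cache_file:
        pickle.dump(tokenized, cache_file)
    return tokenized, False

sentences = [
    "hide new secretions from the parental units",
    "contains no wit , only labored gags",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "train_tokens.pkl")
    first, first_from_cache = load_or_tokenize(sentences, cache_path, str.split)
    second, second_from_cache = load_or_tokenize(sentences, cache_path, str.split)

print(first_from_cache, second_from_cache)
```

A cache like this is only valid for the tokenizer and sequence length it was built with, so it is worth encoding both in the file name.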



## Neural Modules

In NeMo, everything is a Neural Module. Neural modules abstract data and neural network layers. This allows us to combine many different datasets and neural network architectures.

The Neural Module Factory manages the neural modules and is responsible for training (including mixed-precision and distributed training), logging, and inference.

In [16]:
# instantiate the neural module factory
log_dir = '/home/eharper/github/ericharper/nlp/logs_' + pretrained_bert_model
checkpoint_dir = '/home/eharper/github/ericharper/nlp/checkpoints_' + pretrained_bert_model
tensorboard_dir = '/home/eharper/github/ericharper/nlp/tensorboard_' + pretrained_bert_model
nf = nemo.core.NeuralModuleFactory(log_dir=log_dir,
                                   checkpoint_dir=checkpoint_dir,
                                   tensorboard_dir=tensorboard_dir,
                                   create_tb_writer=True,
                                   add_time_to_log_dir=False,
                                   optimization_level='O1')

Pre-trained models will be automatically downloaded and cached.

In [17]:
# Pre-trained BERT
bert = nemo_nlp.BERT(pretrained_model_name=pretrained_bert_model)

2020-01-14 18:53:07,525 - INFO - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-config.json from cache at /root/.cache/torch/pytorch_transformers/6dfaed860471b03ab5b9acb6153bea82b6632fb9bbe514d3fff050fe1319ee6d.4c88e2dec8f8b017f319f6db2b157fee632c0860d9422e4851bd0d6999f9ce38
2020-01-14 18:53:07,527 - INFO - Model config {
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "torchscript": false,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

2020-01-14 18:53:08,110 - INFO - loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-pytorch_model.bin from cach

In [18]:
# save bert model config for later use
bert_config_dict = bert.config.to_dict()

bert_model_config_path = pretrained_bert_model + '_config.json'
with open(bert_model_config_path, 'w+') as json_file:
    json.dump(bert_config_dict, json_file)

Note that the BERT models we are working with are massive: bert-large-uncased has roughly 335 million weights. This gives the model the large capacity needed to capture the nuance and complexity of natural language.

In [19]:
print(f'{pretrained_bert_model} has {bert.num_weights} weights')

bert-large-uncased has 335141888 weights
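
The weight count can be reproduced by hand from the model config printed above. The sketch below assumes the standard BERT layout: word, position and token-type embeddings followed by a LayerNorm; per layer, four attention projections and a two-layer feed-forward block, each followed by a LayerNorm; and a final pooler. The helper names are illustrative and not part of NeMo.

```python
# Values copied from the "Model config" log output above.
config = {
    "hidden_size": 1024,
    "intermediate_size": 4096,
    "max_position_embeddings": 512,
    "num_hidden_layers": 24,
    "type_vocab_size": 2,
    "vocab_size": 30522,
}

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def layer_norm(size):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * size

def count_bert_weights(cfg):
    hidden = cfg["hidden_size"]
    inner = cfg["intermediate_size"]
    table_rows = cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    embeddings = table_rows * hidden + layer_norm(hidden)
    # Query, key, value and output projections, then a LayerNorm.
    attention = 4 * linear(hidden, hidden) + layer_norm(hidden)
    # Expand to the intermediate size and project back, then a LayerNorm.
    feed_forward = linear(hidden, inner) + linear(inner, hidden) + layer_norm(hidden)
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    pooler = linear(hidden, hidden)
    return embeddings + encoder + pooler

total_weights = count_bert_weights(config)
print(total_weights)  # 335141888, matching bert.num_weights above
```

About 302 million of these weights sit in the 24 encoder layers; the embedding tables account for most of the rest.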


Here we define and instantiate the feed-forward network that takes our BERT embeddings as input and outputs the sentence classification.

In [20]:
# mlp classifier
bert_hidden_size = bert.local_parameters['hidden_size']

mlp = nemo_nlp.SequenceClassifier(hidden_size=bert_hidden_size, 
                                  num_classes=2,
                                  num_layers=2,
                                  log_softmax=False,
                                  dropout=0.1)

loss = nemo.backends.pytorch.common.CrossEntropyLoss()

In [21]:
# Compared to the BERT model, the MLP is tiny.
print(f'MLP has {mlp.num_weights} weights')

MLP has 1051650 weights
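
The MLP count is consistent with two fully connected layers, one from the hidden size to the hidden size and one from the hidden size to the two classes. That layout is an assumption inferred from `num_layers=2` and the printed total, not something read from the `SequenceClassifier` source.

```python
hidden_size = 1024   # "hidden_size" in the bert-large-uncased config above
num_classes = 2

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Assumed layout: hidden -> hidden, then hidden -> classes.
mlp_weights = linear(hidden_size, hidden_size) + linear(hidden_size, num_classes)

bert_weights = 335141888  # printed by bert.num_weights above
total_weights = bert_weights + mlp_weights

print(mlp_weights)    # 1051650
print(total_weights)  # 336193538
print(f"MLP share of all weights: {mlp_weights / total_weights:.2%}")
```

The combined figure matches the "Total model parameters" line that the training log reports.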


# Pipelines

Pipelines define how data flows through the different neural networks. In this case, our data will flow through the BERT network and then the MLP network.

We also have different pipelines for training, validation, and inference data.

Training data is used for optimization, so it must be shuffled and we need to compute the loss.

Validation data is not used for optimization, but we still want to know the loss.

For inference data, we only want the final predictions from the model.

## Data

In [22]:
use_preproc = True

if pretrained_bert_model == 'bert-base-uncased':
    batch_size = 256
elif pretrained_bert_model == 'bert-large-uncased':
    batch_size = 64
else:
    raise ValueError(f'No batch size configured for {pretrained_bert_model}')

if use_preproc:
    train_data = preproc_data_layer.PreprocBertSentenceClassificationDataLayer(
        input_file='/raid/data/nlp/SST-2/preprocessed/sst2_train_bert-large-uncased_128.hdf5',
        shuffle=True,
        num_samples=-1, # lower for dev, -1 for all dataset
        batch_size=batch_size
    )
    val_data = preproc_data_layer.PreprocBertSentenceClassificationDataLayer(
        input_file='/raid/data/nlp/SST-2/preprocessed/sst2_val_bert-large-uncased_128.hdf5',
        shuffle=False,
        num_samples=-1, # lower for dev, -1 for all dataset
        batch_size=batch_size
    )
else:
    train_data = nemo_nlp.BertSentenceClassificationDataLayer(
        input_file='/raid/data/nlp/SST-2/my_train.tsv',
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        shuffle=True,
        num_samples=-1, # lower for dev, -1 for all dataset
        batch_size=batch_size
    )
    val_data = nemo_nlp.BertSentenceClassificationDataLayer(
        input_file='/raid/data/nlp/SST-2/my_val.tsv',
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        shuffle=False,
        num_samples=-1, # lower for dev, -1 for all dataset
        batch_size=batch_size
    )

In [23]:
train_input, train_token_types, train_attn_mask, train_labels = train_data()
val_input, val_token_types, val_attn_mask, val_labels = val_data()

## BERT Embeddings

In [24]:
train_embeddings = bert(input_ids=train_input,
                        token_type_ids=train_token_types,
                        attention_mask=train_attn_mask)
val_embeddings = bert(input_ids=val_input,
                        token_type_ids=val_token_types,
                        attention_mask=val_attn_mask)

## Visualize BERT Embeddings

If we want to inspect the data as it flows through our pipeline, we can use the neural module factory's .infer method. This method returns the tensors without performing any optimization.

In [25]:
%%time
#val_embeddings_tensors = nf.infer(tensors=[val_embeddings])

CPU times: user 4 µs, sys: 2 µs, total: 6 µs
Wall time: 8.82 µs


In [26]:
# each word is embedded into bert_hidden_size space
# max_seq_len words are embedded
#print(val_embeddings_tensors[0][0][0].shape)

In [27]:
#print(val_embeddings_tensors[0][0][0])

## Training Pipeline 

In order to optimize our network, we need to pass the embeddings through the MLP network and then compute the loss.

In [28]:
train_logits = mlp(hidden_states=train_embeddings)
val_logits = mlp(hidden_states=val_embeddings)

train_loss = loss(logits=train_logits, labels=train_labels)
val_loss = loss(logits=val_logits, labels=val_labels)
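
As a reference for the loss values in the training log, here is a small pure-Python sketch of cross-entropy computed from raw logits for a single example. It shows the formula only and is not NeMo's `CrossEntropyLoss`; the logit values are made up.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class, computed stably from logits."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return log_sum_exp - logits[label]

# An untrained binary classifier that is indifferent between the classes.
chance_loss = cross_entropy([0.0, 0.0], label=1)

# A confident prediction scored against the right and the wrong label.
confident_right = cross_entropy([-4.0, 4.0], label=1)
confident_wrong = cross_entropy([-4.0, 4.0], label=0)

print(round(chance_loss, 3), round(confident_right, 4), round(confident_wrong, 3))
```

For two classes the chance-level loss is ln 2 ≈ 0.693, which is why the first logged training loss (0.706) sits right next to that value before any learning has happened.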

## Callbacks

Callbacks record and log metrics and save checkpoints during training and evaluation. We use them to print to the screen and to write to TensorBoard.




In [29]:
num_epochs = 3

num_gpus = 1

train_data_size = len(train_data)

steps_per_epoch = math.ceil(train_data_size / (batch_size * num_gpus))

train_callback = nemo.core.SimpleLossLoggerCallback(
    tensors=[train_loss, train_logits],
    print_func=lambda x: nf.logger.info(f'Train loss: {str(np.round(x[0].item(), 3))}'),
    tb_writer=nf.tb_writer,
    get_tb_values=lambda x: [["train_loss", x[0]]],
    step_freq=1)

eval_callback = nemo.core.EvaluatorCallback(
    eval_tensors=[val_logits, val_labels],
    user_iter_callback=lambda x, y: eval_iter_callback(
        x, y, val_data),
    user_epochs_done_callback=lambda x: eval_epochs_done_callback(
        x, f'{nf.work_dir}/graphs'),
    tb_writer=nf.tb_writer,
    eval_epoch=1,
    eval_step=steps_per_epoch)

# Create callback to save checkpoints
ckpt_callback = nemo.core.CheckpointCallback(
    folder=nf.checkpoint_dir,
    epoch_freq=1,
    step_freq=-1,
    checkpoints_to_keep=num_epochs)

In [30]:
lr_policy_fn = get_lr_policy('WarmupAnnealing',
                             total_steps=num_epochs * steps_per_epoch,
                             warmup_ratio=0.1)
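
The shape of a warmup-then-anneal schedule can be sketched as follows. This is an assumed form (linear warmup over the first `warmup_ratio` of the steps, then linear decay to zero) and not necessarily the exact formula behind NeMo's `WarmupAnnealing`. The value `total_steps = 1000` is a made-up example; the base learning rate of 5e-5 is the one passed to `nf.train` in this notebook.

```python
def warmup_then_linear_decay(step, total_steps, warmup_ratio, base_lr):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)

total_steps = 1000   # hypothetical; the notebook uses num_epochs * steps_per_epoch
base_lr = 5e-5

schedule = [
    warmup_then_linear_decay(step, total_steps, warmup_ratio=0.1, base_lr=base_lr)
    for step in range(total_steps + 1)
]

for step in (0, 50, 100, 550, 1000):
    print(step, f"{schedule[step]:.2e}")
```

Warmup keeps the first updates small while the freshly initialized classifier head is still producing noisy gradients, and the decay lets training settle toward the end.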

In [None]:
%%time
nf.train(tensors_to_optimize=[train_loss],
         callbacks=[train_callback, eval_callback, ckpt_callback],
         lr_policy=lr_policy_fn,
         optimizer='adam',
         optimization_params={'num_epochs': num_epochs, 'lr': 5e-5})

2020-01-14 18:53:19,944 - INFO - Starting .....
2020-01-14 18:53:19,950 - INFO - Found 2 modules with weights:
2020-01-14 18:53:19,951 - INFO - BERT
2020-01-14 18:53:19,952 - INFO - SequenceClassifier
2020-01-14 18:53:19,953 - INFO - Total model parameters: 336193538
2020-01-14 18:53:19,954 - INFO - Restoring checkpoint from folder /home/eharper/github/ericharper/nlp/checkpoints_bert-large-uncased ...
2020-01-14 18:53:19,961 - INFO - Starting epoch 0


Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic


2020-01-14 18:53:22,994 - INFO - Step: 0
2020-01-14 18:53:22,997 - INFO - Train loss: 0.706
2020-01-14 18:53:23,000 - INFO - Step time: 1.535071611404419 seconds
2020-01-14 18:53:23,001 - INFO - Doing Evaluation ..............................
2020-01-14 18:53:44,747 - INFO - Accuracy: 0.5239369610466845
2020-01-14 18:53:44,749 - INFO - Sampled preds: [0 0 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1 1]
2020-01-14 18:53:44,751 - INFO - Sampled labels: [1 1 1 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 1 0]
2020-01-14 18:53:45,003 - INFO -               precision    recall  f1-score   support

           0       0.45      0.34      0.39      5950
           1       0.56      0.67      0.61      7502

    accuracy                           0.52     13452
   macro avg       0.51      0.50      0.50     13452
weighted avg       0.51      0.52      0.51     13452

2020-01-14 18:53:45,004 - INFO - Evaluation time: 22.001027822494507 seconds
2020-01-14 18:53:45,575 - INFO - Step: 1
2020-01-14 18:53:45,577 - INFO - Tra

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0


2020-01-14 18:53:52,091 - INFO - Step time: 0.24991726875305176 seconds
2020-01-14 18:53:52,434 - INFO - Step: 20
2020-01-14 18:53:52,436 - INFO - Train loss: 0.678
2020-01-14 18:53:52,438 - INFO - Step time: 0.3397190570831299 seconds
2020-01-14 18:53:52,780 - INFO - Step: 21
2020-01-14 18:53:52,782 - INFO - Train loss: 0.661
2020-01-14 18:53:52,784 - INFO - Step time: 0.3394813537597656 seconds
2020-01-14 18:53:53,166 - INFO - Step: 22
2020-01-14 18:53:53,168 - INFO - Train loss: 0.665
2020-01-14 18:53:53,170 - INFO - Step time: 0.3768351078033447 seconds
2020-01-14 18:53:53,438 - INFO - Step: 23
2020-01-14 18:53:53,440 - INFO - Train loss: 0.673


Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0


2020-01-14 18:53:53,443 - INFO - Step time: 0.264603853225708 seconds
2020-01-14 18:53:53,790 - INFO - Step: 24
2020-01-14 18:53:53,792 - INFO - Train loss: 0.662
2020-01-14 18:53:53,795 - INFO - Step time: 0.3442692756652832 seconds
2020-01-14 18:53:54,141 - INFO - Step: 25
2020-01-14 18:53:54,143 - INFO - Train loss: 0.651
2020-01-14 18:53:54,145 - INFO - Step time: 0.3448491096496582 seconds
2020-01-14 18:53:54,496 - INFO - Step: 26
2020-01-14 18:53:54,498 - INFO - Train loss: 0.67
2020-01-14 18:53:54,500 - INFO - Step time: 0.34807848930358887 seconds
2020-01-14 18:53:54,876 - INFO - Step: 27
2020-01-14 18:53:54,878 - INFO - Train loss: 0.675
2020-01-14 18:53:54,881 - INFO - Step time: 0.37221503257751465 seconds
2020-01-14 18:53:55,236 - INFO - Step: 28
2020-01-14 18:53:55,239 - INFO - Train loss: 0.67
2020-01-14 18:53:55,240 - INFO - Step time: 0.3533971309661865 seconds
2020-01-14 18:53:55,586 - INFO - Step: 29
2020-01-14 18:53:55,588 - INFO - Train loss: 0.636
2020-01-14 18:53:

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0


2020-01-14 18:54:09,175 - INFO - Step time: 0.2646448612213135 seconds
2020-01-14 18:54:09,552 - INFO - Step: 67
2020-01-14 18:54:09,554 - INFO - Train loss: 0.27
2020-01-14 18:54:09,556 - INFO - Step time: 0.3712656497955322 seconds
2020-01-14 18:54:09,927 - INFO - Step: 68
2020-01-14 18:54:09,929 - INFO - Train loss: 0.245
2020-01-14 18:54:09,931 - INFO - Step time: 0.3677978515625 seconds
2020-01-14 18:54:10,310 - INFO - Step: 69
2020-01-14 18:54:10,312 - INFO - Train loss: 0.325
2020-01-14 18:54:10,315 - INFO - Step time: 0.37429070472717285 seconds
2020-01-14 18:54:10,665 - INFO - Step: 70
2020-01-14 18:54:10,667 - INFO - Train loss: 0.343
2020-01-14 18:54:10,670 - INFO - Step time: 0.3489408493041992 seconds
2020-01-14 18:54:11,010 - INFO - Step: 71
2020-01-14 18:54:11,012 - INFO - Train loss: 0.37
2020-01-14 18:54:11,014 - INFO - Step time: 0.3391423225402832 seconds
2020-01-14 18:54:11,355 - INFO - Step: 72
2020-01-14 18:54:11,357 - INFO - Train loss: 0.311
2020-01-14 18:54:11,

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0


2020-01-14 18:54:14,257 - INFO - Step time: 0.2828042507171631 seconds
2020-01-14 18:54:14,626 - INFO - Step: 81
2020-01-14 18:54:14,628 - INFO - Train loss: 0.367
2020-01-14 18:54:14,630 - INFO - Step time: 0.36292505264282227 seconds
2020-01-14 18:54:14,978 - INFO - Step: 82
2020-01-14 18:54:14,980 - INFO - Train loss: 0.245
2020-01-14 18:54:14,983 - INFO - Step time: 0.3449366092681885 seconds
2020-01-14 18:54:15,339 - INFO - Step: 83
2020-01-14 18:54:15,341 - INFO - Train loss: 0.459
2020-01-14 18:54:15,343 - INFO - Step time: 0.3529181480407715 seconds
2020-01-14 18:54:15,728 - INFO - Step: 84
2020-01-14 18:54:15,730 - INFO - Train loss: 0.308
2020-01-14 18:54:15,732 - INFO - Step time: 0.3782074451446533 seconds
2020-01-14 18:54:16,112 - INFO - Step: 85
2020-01-14 18:54:16,114 - INFO - Train loss: 0.158
2020-01-14 18:54:16,117 - INFO - Step time: 0.3767280578613281 seconds
2020-01-14 18:54:16,476 - INFO - Step: 86
2020-01-14 18:54:16,478 - INFO - Train loss: 0.337
2020-01-14 18:5
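
The "Gradient overflow. Skipping step" messages in the log above come from dynamic loss scaling, which is enabled by optimization level O1 (`loss_scale : dynamic`). The sketch below illustrates the general mechanism and is not the library's implementation. It assumes an initial scale of 2^16, which is consistent with the first reduction to 32768.0 in the log; the growth interval is an illustrative parameter.

```python
class DynamicLossScaler:
    """Halve the loss scale on overflow; grow it again after a run of good steps."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, growth_interval=2000):
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow):
        """Return True if the optimizer step should be applied."""
        if found_overflow:
            # Gradients contained inf or NaN: skip the step and shrink the scale.
            self.scale /= self.factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps == self.growth_interval:
            # A long run without overflow: try a larger scale again.
            self.scale *= self.factor
            self.good_steps = 0
        return True

scaler = DynamicLossScaler()
scales_after_overflow = []
for _ in range(4):
    applied = scaler.update(found_overflow=True)
    scales_after_overflow.append(scaler.scale)

print(scales_after_overflow)  # [32768.0, 16384.0, 8192.0, 4096.0]
```

A skipped step is therefore expected behavior early in mixed-precision training rather than an error: the scaler is searching for the largest scale that keeps float16 gradients finite.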

In [36]:
print(tensorboard_dir)

/home/eharper/github/ericharper/nlp/tensorboard_bert-large-uncased


In [None]:
!ls $checkpoint_dir

In [None]:
# GET TRAIN COMMAND

In [None]:
#time python -m torch.distributed.launch --nproc_per_node=2 sentence_classification.py --train_file /raid/data/nlp/SST-2/preprocessed/sst2_train_bert-large-uncased_128.hdf5 --eval_file /raid/data/nlp/SST-2/preprocessed/sst2_val_bert-large-uncased_128.hdf5 --num_gpus 2 --batch_size 64 --amp_opt_level O1 --work_dir workdir/ --mode train --num_classes 2 --num_samples -1 --num_epochs 1 --preproc

## Inference Pipeline

Restart the kernel before running this section, or shut it down and continue in a separate inference notebook.

In [1]:
import nemo
import nemo_nlp
from pytorch_transformers import BertTokenizer
import torch
import torch.nn.functional as f


import math
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = -1

import json



In [2]:
#pretrained_bert_model = 'bert-base-uncased'
pretrained_bert_model = 'bert-large-uncased'
bert_model_config_path = pretrained_bert_model + '_config.json'
checkpoint_dir = '/home/eharper/github/ericharper/nlp/checkpoints_' + pretrained_bert_model
inference_log_dir = 'inference_' + pretrained_bert_model

In [3]:
!ls -lh $checkpoint_dir

total 12G
-rw-r--r-- 1 root root 1.3G Jan 14 19:04 BERT-EPOCH-1.pt
-rw-r--r-- 1 root root 1.3G Jan 14 19:10 BERT-EPOCH-2.pt
-rw-r--r-- 1 root root 1.3G Jan 14 19:11 BERT-EPOCH-3.pt
-rw-r--r-- 1 root root 4.1M Jan 14 19:04 SequenceClassifier-EPOCH-1.pt
-rw-r--r-- 1 root root 4.1M Jan 14 19:10 SequenceClassifier-EPOCH-2.pt
-rw-r--r-- 1 root root 4.1M Jan 14 19:11 SequenceClassifier-EPOCH-3.pt
-rw-r--r-- 1 root root 2.5G Jan 14 19:05 trainer-EPOCH-1.pt
-rw-r--r-- 1 root root 2.5G Jan 14 19:10 trainer-EPOCH-2.pt
-rw-r--r-- 1 root root 2.5G Jan 14 19:11 trainer-EPOCH-3.pt


In [4]:
nf = nemo.core.NeuralModuleFactory(log_dir=inference_log_dir,
                                   optimization_level='O1')

tokenizer = BertTokenizer.from_pretrained(pretrained_bert_model)

if bert_model_config_path is not None:
    with open(bert_model_config_path) as json_file:
        config = json.load(json_file)
        bert = nemo_nlp.huggingface.BERT(**config)

# mlp classifier
bert_hidden_size = bert.local_parameters['hidden_size']

mlp = nemo_nlp.SequenceClassifier(hidden_size=bert_hidden_size, 
                                  num_classes=2,
                                  num_layers=2,
                                  log_softmax=False,
                                  dropout=0.0)

2020-01-14 19:41:26,870 - INFO - loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /root/.cache/torch/pytorch_transformers/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


In [5]:
bert.restore_from(checkpoint_dir + '/BERT-EPOCH-3.pt')
mlp.restore_from(checkpoint_dir + '/SequenceClassifier-EPOCH-3.pt')

In [6]:
max_seq_length = 64
if pretrained_bert_model == 'bert-base-uncased':
    batch_size = 256
elif pretrained_bert_model == 'bert-large-uncased':
    batch_size = 64
else:
    raise ValueError(f'No batch size configured for {pretrained_bert_model}')
test_data = nemo_nlp.BertSentenceClassificationDataLayer(
    input_file='/raid/data/nlp/SST-2/my_test.tsv',
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    shuffle=False,
    num_samples=-1, # lower for dev, -1 for all dataset
    batch_size=batch_size
)

2020-01-14 19:42:44,599 - INFO - /raid/data/nlp/SST-2/my_test.tsv: 1821
2020-01-14 19:42:47,265 - INFO - Min: 4 |                  Max: 64 |                  Mean: 24.85722130697419 |                  Median: 24.0
2020-01-14 19:42:47,268 - INFO - 75 percentile: 32.0
2020-01-14 19:42:47,269 - INFO - 99 percentile: 53.799999999999955
2020-01-14 19:42:47,271 - INFO - 0 out of 1821                        sentencess with more than 64 subtokens.
2020-01-14 19:42:47,272 - INFO - *** Example ***
2020-01-14 19:42:47,273 - INFO - example_index: 0
2020-01-14 19:42:47,274 - INFO - subtokens: [CLS] uneasy mis ##hma ##sh of styles and genres . [SEP]
2020-01-14 19:42:47,275 - INFO - sent_label: 0
2020-01-14 19:42:47,276 - INFO - input_ids: 101 15491 28616 22444 4095 1997 6782 1998 11541 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2020-01-14 19:42:47,278 - INFO - input_mask: 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In [7]:
test_input, test_token_types, test_attn_mask, _ = test_data()

In [8]:
test_embeddings = bert(input_ids=test_input,
                        token_type_ids=test_token_types,
                        attention_mask=test_attn_mask)

In [9]:
test_logits = mlp(hidden_states=test_embeddings)

In [10]:
%%time
test_logits_tensors = nf.infer(tensors=[test_logits])

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic


2020-01-14 19:43:12,409 - INFO - Evaluating batch 0 out of 29
2020-01-14 19:43:13,187 - INFO - Evaluating batch 2 out of 29
2020-01-14 19:43:13,354 - INFO - Evaluating batch 4 out of 29
2020-01-14 19:43:13,524 - INFO - Evaluating batch 6 out of 29
2020-01-14 19:43:13,887 - INFO - Evaluating batch 8 out of 29
2020-01-14 19:43:14,056 - INFO - Evaluating batch 10 out of 29
2020-01-14 19:43:14,219 - INFO - Evaluating batch 12 out of 29
2020-01-14 19:43:14,390 - INFO - Evaluating batch 14 out of 29
2020-01-14 19:43:14,551 - INFO - Evaluating batch 16 out of 29
2020-01-14 19:43:14,718 - INFO - Evaluating batch 18 out of 29
2020-01-14 19:43:14,880 - INFO - Evaluating batch 20 out of 29
2020-01-14 19:43:15,045 - INFO - Evaluating batch 22 out of 29
2020-01-14 19:43:15,206 - INFO - Evaluating batch 24 out of 29
2020-01-14 19:43:15,376 - INFO - Evaluating batch 26 out of 29
2020-01-14 19:43:15,536 - INFO - Evaluating batch 28 out of 29


CPU times: user 2.82 s, sys: 993 ms, total: 3.81 s
Wall time: 4.21 s


In [11]:
# Softmax over the class dimension; column 1 is the probability of the positive class
test_probs = f.softmax(torch.cat(test_logits_tensors[0]), dim=1).numpy()[:, 1]


In [12]:
data_dir = '/raid/data/nlp/SST-2'
test_df = pd.read_csv(data_dir + '/test.tsv', sep='\t')

In [13]:
test_df['prob'] = test_probs 
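
The `prob` column holds the softmax probability of class 1 (positive sentiment). For two classes this equals a sigmoid applied to the difference between the two logits, which the pure-Python sketch below checks on a few made-up logit pairs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical [negative, positive] logits for three sentences.
logit_rows = [[3.6, -3.6], [-4.7, 4.7], [0.1, 0.3]]

positive_probs = [softmax(row)[1] for row in logit_rows]
via_sigmoid = [sigmoid(row[1] - row[0]) for row in logit_rows]

for prob, check in zip(positive_probs, via_sigmoid):
    print(f"{prob:.6f} {check:.6f}")
```

Only the gap between the two logits matters, so a probability near 0.5 means the classifier scored both classes almost equally.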

In [14]:
test_df

Unnamed: 0,index,sentence,prob
0,0,uneasy mishmash of styles and genres .,0.000731
1,1,this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .,0.001504
2,2,"by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .",0.987900
3,3,director rob marshall went out gunning to make a great one .,0.999914
4,4,"lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .",0.999918
...,...,...,...
1816,1816,"it risks seeming slow and pretentious , because it thinks the gamble is worth the promise .",0.098339
1817,1817,take care of my cat offers a refreshingly different slice of asian cinema .,0.999911
1818,1818,davis has filled out his cast with appealing fresh faces .,0.999904
1819,1819,"it represents better-than-average movie-making that does n't demand a dumb , distracted audience .",0.998947


In [15]:
test_df.to_csv(inference_log_dir + '/test_inference.tsv', sep='\t', index=False)

In [16]:
def sample_classification(data_path):
    df = pd.read_csv(data_path, sep='\t')
    sample = df.sample()
    sentence = sample.sentence.values[0]
    prob = sample.prob.values[0]
    result = f'{sentence} | {prob}'
    return result

In [17]:
sample_classification(inference_log_dir + '/test_inference.tsv')

'lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new . | 0.99991846'

In [19]:
num_samples = 10
for _ in range(num_samples):
    print(sample_classification(inference_log_dir + '/test_inference.tsv'))

it could change america , not only because it is full of necessary discussion points , but because it is so accessible that it makes complex politics understandable to viewers looking for nothing but energetic entertainment . | 0.9994081
a wildly funny prison caper . | 0.9991358
zany , exuberantly irreverent animated space adventure . | 0.99895155
majidi 's poetic love story is a ravishing consciousness-raiser , if a bit draggy at times . | 0.99937856
unlike the nauseating fictions peddled by such ` have-yourself-a-happy-little-holocaust ' movies as life is beautiful and jakob the liar , the grey zone is honest enough to deny the possibility of hope in auschwitz . | 0.9753666
when it really counts ... bloody sunday connects on a visceral level that transcends language . | 0.99928623
offers enough playful fun to entertain the preschool set while embracing a wholesome attitude . | 0.9995290000000001
somehow we 're meant to buy that this doting mother would shun her kids , travel to one o

## BERT nails it:
the film is just a big , gorgeous , mind-blowing , breath-taking mess . | 0.2738656

a sensual performance from abbass buoys the flimsy story , but her inner journey is largely unexplored and we 're left wondering about this exotic-looking woman whose emotional depths are only hinted at . | 0.48260054

## Classify my sentence

In [20]:
def classify_sentence(nf, tokenizer, bert, mlp, sentence):
    sentence = sentence.lower()
    with open("tmp_sentence.tsv", 'w+') as tmp_tsv:
        header = 'sentence\tlabel\n'
        line = sentence + '\t0\n'
        tmp_tsv.writelines([header, line])

    tmp_data = nemo_nlp.BertSentenceClassificationDataLayer(
        input_file='tmp_sentence.tsv',
        tokenizer=tokenizer,
        max_seq_length=128,
        shuffle=False,
        num_samples=-1, # lower for dev, -1 for all dataset
        batch_size=1
    )
    tmp_input, tmp_token_types, tmp_attn_mask, _ = tmp_data()
    tmp_embeddings = bert(input_ids=tmp_input,
                            token_type_ids=tmp_token_types,
                            attention_mask=tmp_attn_mask)
    tmp_logits = mlp(hidden_states=tmp_embeddings)
    tmp_logits_tensors = nf.infer(tensors=[tmp_logits, tmp_embeddings])
    tmp_probs = f.softmax(torch.cat(tmp_logits_tensors[0]), dim=1).numpy()[:, 1]
    print(f'{sentence} | {tmp_probs[0]}')
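
`classify_sentence` writes to a fixed file name in the working directory, so two notebooks running at once would overwrite each other's input. A possible alternative, sketched here with the standard library only, is to write the one-row TSV to a uniquely named temporary file and delete it afterwards. The helper name `write_sentence_tsv` is hypothetical and the data layer itself is not involved in this sketch.

```python
import csv
import os
import tempfile

def write_sentence_tsv(sentence):
    """Write a one-row sentence<TAB>label file and return its path."""
    tmp = tempfile.NamedTemporaryFile(
        mode="w", suffix=".tsv", delete=False, newline="", encoding="utf-8")
    with tmp:
        writer = csv.writer(tmp, delimiter="\t", lineterminator="\n")
        writer.writerow(["sentence", "label"])
        writer.writerow([sentence.lower(), 0])  # placeholder label, never used
    return tmp.name

path = write_sentence_tsv("Matt Damon is the only good thing about this film")
try:
    # In the notebook, this path would be passed as input_file to the data layer.
    with open(path, encoding="utf-8") as tsv_file:
        lines = tsv_file.read().splitlines()
finally:
    os.remove(path)  # clean up even if reading fails

print(lines)
```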

In [21]:
#sentence = 'point break is the best movie of all time'
#sentence = 'the movie was a wonderful exercise in understanding the struggles of native americans'
#sentence = 'the performance of diego luna had me excited and annoyed at the same time'
sentence = 'matt damon is the only good thing about this film'
classify_sentence(nf, tokenizer, bert, mlp, sentence)

2020-01-14 19:45:14,636 - INFO - tmp_sentence.tsv: 1
2020-01-14 19:45:14,638 - INFO - Min: 12 |                  Max: 12 |                  Mean: 12.0 |                  Median: 12.0
2020-01-14 19:45:14,640 - INFO - 75 percentile: 12.0
2020-01-14 19:45:14,643 - INFO - 99 percentile: 12.0
2020-01-14 19:45:14,644 - INFO - 0 out of 1                        sentencess with more than 128 subtokens.
2020-01-14 19:45:14,645 - INFO - *** Example ***
2020-01-14 19:45:14,647 - INFO - example_index: 0
2020-01-14 19:45:14,647 - INFO - subtokens: [CLS] matt damon is the only good thing about this film [SEP]
2020-01-14 19:45:14,649 - INFO - sent_label: 0
2020-01-14 19:45:14,650 - INFO - input_ids: 101 4717 11317 2003 1996 2069 2204 2518 2055 2023 2143 102
2020-01-14 19:45:14,651 - INFO - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1


Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic


2020-01-14 19:45:15,910 - INFO - Evaluating batch 0 out of 1


matt damon is the only good thing about this film | 0.6561076641082764


