# TODOs

## Short
- Implement baselines
    - Word2Vec
    - Bag of Words
- Training - .py file so we can do multi-gpu
- Inference
    - Make inference as small as possible (separate notebook?)
- ~~Input your OWN reviews!!~~
    - Showcase strengths and weaknesses of BERT embeddings
    - Sarcasm
    - Negation
    - apple vs banana
    - https://spectrum.ieee.org/tech-talk/robotics/artificial-intelligence/the-next-frontier-in-ai-nothing
- Multiple Datasets
    - multi-class datasets - tags from stack overflow
    - imdb
- Visualize embeddings (cuML, ipyVolume, UMAP?, t-SNE, PCA)
    - Embed vocab then apply t-SNE
    - Visualize end result, trajectory of negative/positive sentences
- Create our own Neural Modules
    - Create our data layer from processed data (ie tokens)
    - Our own MultiLayerPerceptron (nn.linear) with varying hidden size
    - More models: DistilBert, T5
    - Interactive inference ?
- Add figure showing scaling results

## Long / Optional
- Compare multi-class vs binary labels
- Compare frozen BERT performance vs fine-tuned BERT

## Optimizations
- Implement preprocessed data layer (tokenize once)
- Multi-CPU tokenization
- RAPIDS GPU tokenization?
    
## Questions for NVIDIA Engineering Teams
- Can memory be cleared interactively?
https://forums.fast.ai/t/clearing-gpu-memory-pytorch/14637/2
- Multi-gpu INSIDE Jupyter
- TensorCore Utilization Dashboard (Tensorboard has it for TensorFlow?)
    - Can we get it from pynvml/nvml?
- Closing/cleaning up factories safely (zombie protection)

In [1]:
import nemo
from nemo.utils.lr_policies import get_lr_policy
import nemo_nlp
from nemo_nlp.utils.callbacks.sentence_classification import \
    eval_iter_callback, eval_epochs_done_callback
import preproc_data_layer
from pytorch_transformers import BertTokenizer
import torch.nn.functional as f

import math
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = -1

import json

from preproc_data_layer import BertSentenceClassificationDataset, PreprocBertSentenceClassificationDataLayer



## Data Explore

The SST-2 dataset (https://nlp.stanford.edu/sentiment/index.html) is a standard benchmark for sentence classification and is part of the GLUE benchmark (https://gluebenchmark.com/tasks).

In [2]:
!sh get_data.sh

data directory already exists
SST-2.zip already exists


In [3]:
data_dir = 'data/SST-2'
df = pd.read_csv(data_dir + '/train.tsv', sep='\t')
test_df = pd.read_csv(data_dir + '/test.tsv', sep='\t')

In [4]:
df.head()

Unnamed: 0,sentence,label
0,hide new secretions from the parental units,0
1,"contains no wit , only labored gags",0
2,that loves its characters and communicates something rather beautiful about human nature,1
3,remains utterly satisfied to remain the same throughout,0
4,on the worst revenge-of-the-nerds clichés the filmmakers could dredge up,0


In [5]:
test_df

Unnamed: 0,index,sentence
0,0,uneasy mishmash of styles and genres .
1,1,this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .
2,2,"by the end of no such thing the audience , like beatrice , has a watchful affection for the monster ."
3,3,director rob marshall went out gunning to make a great one .
4,4,"lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new ."
...,...,...
1816,1816,"it risks seeming slow and pretentious , because it thinks the gamble is worth the promise ."
1817,1817,take care of my cat offers a refreshingly different slice of asian cinema .
1818,1818,davis has filled out his cast with appealing fresh faces .
1819,1819,"it represents better-than-average movie-making that does n't demand a dumb , distracted audience ."


The dataset comes with a labeled train file and an unlabeled test file. We will hold out part of the train file for model validation.

In [6]:
# Split train into train and val (roughly 80/20); the files are saved to disk further down
np.random.seed(123)
train_mask = np.random.rand((len(df))) < .8
train_df = df[train_mask]
val_df = df[~train_mask]
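
The mask above assigns each row to the train set independently with probability 0.8, so the split comes out close to, but not exactly, 80/20. A minimal standard-library sketch of the same idea follows; the function name and the row count of 10,000 are made up for illustration and are not the size of SST-2.

```python
import random

def bernoulli_split(n_rows, train_fraction=0.8, seed=123):
    """Assign each row index to train or val independently at random."""
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for i in range(n_rows):
        # Each row lands in train with probability train_fraction.
        if rng.random() < train_fraction:
            train_idx.append(i)
        else:
            val_idx.append(i)
    return train_idx, val_idx

train_idx, val_idx = bernoulli_split(n_rows=10_000)
train_share = len(train_idx) / 10_000
print(f"train share: {train_share:.3f}")  # close to 0.8, rarely exactly 0.8
```

If an exact 80/20 split matters, shuffling the indices once and slicing at 80% is the usual alternative.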

In order to take advantage of NeMo's pre-built sentence classification data layer, the data must be formatted as "sentence\tlabel" (sentence, tab, label).

In [7]:
# We will add a label column with all 0's (but they will not be used for anything).
test_df['label'] = 0

In [8]:
test_df = test_df[['sentence', 'label']]

In [9]:
test_df.head()

Unnamed: 0,sentence,label
0,uneasy mishmash of styles and genres .,0
1,this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .,0
2,"by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .",0
3,director rob marshall went out gunning to make a great one .,0
4,"lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .",0


In [10]:
# Save new train, val, and test to disk
train_df.to_csv(data_dir + '/my_train.tsv', sep='\t', index=False)
val_df.to_csv(data_dir + '/my_val.tsv', sep='\t', index=False)
test_df.to_csv(data_dir + '/my_test.tsv', sep='\t', index=False)
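
The files written above use a header row followed by one `sentence<TAB>label` pair per line. As a rough sketch of that layout using only the standard library (the helper names are hypothetical, and the two rows are taken from `df.head()` earlier):

```python
import csv
import io

def write_sentence_label_tsv(rows, handle):
    """Write (sentence, label) pairs as tab-separated lines with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sentence", "label"])
    writer.writerows(rows)

def read_sentence_label_tsv(handle):
    """Read the file back, checking the header and converting labels to int."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header != ["sentence", "label"]:
        raise ValueError(f"unexpected header: {header}")
    return [(sentence, int(label)) for sentence, label in reader]

rows = [
    ("contains no wit , only labored gags", 0),
    ("that loves its characters and communicates something rather "
     "beautiful about human nature", 1),
]
buffer = io.StringIO()          # in-memory stand-in for a .tsv file
write_sentence_label_tsv(rows, buffer)
buffer.seek(0)
parsed = read_sentence_label_tsv(buffer)
print(parsed[0])
```

A round trip like this is a cheap way to catch a stray index column or a missing header before handing the file to a data layer.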

# Data Preprocessing

To use BERT, or any other deep-learning NLP model, we must first tokenize the data. The tokenizer splits each sentence into words and subword pieces and maps each piece to an integer ID from the model's vocabulary.
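
To make the word-to-integer mapping concrete, here is a toy greedy longest-match tokenizer. It is a simplified sketch and not the `BertTokenizer` implementation: it only splits on whitespace, and its tiny vocabulary is an assumption built from the IDs that appear in the data-layer log later in this notebook, plus `[UNK]` = 100.

```python
# Tiny vocabulary; a real bert-base-uncased vocabulary is far larger.
VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "uneasy": 15491, "mis": 28616, "##hma": 22444, "##sh": 4095,
    "of": 1997, "styles": 6782, "and": 1998, "genres": 11541, ".": 1012,
}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry ##
            if candidate in vocab:
                pieces.append(candidate)
                break
            end -= 1
        else:
            return ["[UNK]"]  # nothing matched: treat the whole word as unknown
        start = end
    return pieces

def encode(sentence, vocab):
    """Whitespace-split, apply wordpiece, and wrap with [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("uneasy mishmash of styles and genres .", VOCAB)
print(tokens)
print(ids)
```

Note how the rare word "mishmash" is broken into three pieces, so one word can occupy several positions in the sequence.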

In [11]:
# Each pretrained BERT model comes with a Tokenizer
pretrained_bert_model = 'bert-base-uncased'
#pretrained_bert_model = 'bert-large-uncased'
tokenizer = BertTokenizer.from_pretrained(pretrained_bert_model)

In [12]:
max_seq_length = 64 # we will pad 0's to shorter sentences and truncate longer
sample_dataset = BertSentenceClassificationDataset(
    data_dir + '/my_train.tsv',
    max_seq_length,
    tokenizer,
    num_samples=100,
    shuffle=False)

In [13]:
# Our dataset consists of the tokens, token_types, attn_mask, and labels
sample_idx = np.random.randint(0, len(sample_dataset))
print(train_df.iloc[sample_idx])
print(f'Tokens: {sample_dataset[sample_idx][0]}')

sentence    the corporate circus that is the recording industry in the current climate of mergers and downsizing 
label       0                                                                                                    
Name: 68, dtype: object
Tokens: [  101  4089  1996  2087 16465  7214  7749   102     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]
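
The long run of zeros is padding up to `max_seq_length`. The sketch below shows one way padding, truncation, and the matching attention mask could be produced. Keeping the closing `[SEP]` (ID 102) when truncating is an assumption of this sketch, not necessarily what the dataset class does.

```python
def pad_or_truncate(input_ids, max_seq_length, pad_id=0, sep_id=102):
    """Return (ids, attention_mask), both exactly max_seq_length long."""
    ids = list(input_ids)
    if len(ids) > max_seq_length:
        # Keep a closing [SEP] so the truncated sequence is still well-formed.
        ids = ids[:max_seq_length - 1] + [sep_id]
    mask = [1] * len(ids)               # 1 marks real tokens
    padding = max_seq_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

ids = [101, 4089, 1996, 2087, 16465, 7214, 7749, 102]  # tokens printed above
padded, mask = pad_or_truncate(ids, max_seq_length=64)
print(len(padded), sum(mask))  # 64 8
```

The attention mask lets the model ignore the padded positions, so padding changes the tensor shape but not what the model attends to.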


We can gain a lot of efficiency by saving the tokenized data to disk. For future model runs we then don't need to tokenize every time.

In [38]:
!python preproc_data.py \
--input_file {data_dir}/my_train.tsv \
--output_dir {data_dir}/preproc \
--dataset_name 'train-sst-2' \
--max_seq_length $max_seq_length \
--pretrained_bert_model $pretrained_bert_model



In [40]:
!python preproc_data.py \
--input_file {data_dir}/my_test.tsv \
--output_dir {data_dir}/preproc \
--dataset_name "test-sst-2" \
--max_seq_length $max_seq_length \
--pretrained_bert_model $pretrained_bert_model



In [41]:
!python preproc_data.py \
--input_file {data_dir}/my_val.tsv \
--output_dir {data_dir}/preproc \
--dataset_name "val-sst-2" \
--max_seq_length $max_seq_length \
--pretrained_bert_model $pretrained_bert_model
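
The tokenize-once idea can be sketched independently of `preproc_data.py`. The version below caches to JSON rather than the HDF5 files used in this notebook, and every name in it (`load_or_tokenize`, `toy_tokenize`, the cache file name) is hypothetical.

```python
import json
import os
import tempfile

def load_or_tokenize(sentences, cache_path, tokenize_fn):
    """Return token IDs from cache_path if present, else tokenize and save."""
    if os.path.exists(cache_path):
        with open(cache_path) as handle:
            return json.load(handle)
    token_ids = [tokenize_fn(sentence) for sentence in sentences]
    with open(cache_path, "w") as handle:
        json.dump(token_ids, handle)
    return token_ids

calls = []

def toy_tokenize(sentence):
    """Stand-in tokenizer: records the call and maps each word to its length."""
    calls.append(sentence)
    return [len(word) for word in sentence.split()]

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "train_tokens.json")
    first = load_or_tokenize(["a fine film", "dull"], cache_path, toy_tokenize)
    second = load_or_tokenize(["a fine film", "dull"], cache_path, toy_tokenize)

print(first == second, len(calls))  # True 2
```

The second call reads the cache and never touches the tokenizer. A real cache should also encode the model name and `max_seq_length` in the file name, as the preprocessed file names in this notebook do.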



## Neural Modules

In NeMo, everything is a Neural Module. Neural modules abstract data and neural network layers. This allows us to combine many different datasets and neural network architectures.

The Neural Module Factory will then manage the neural modules, and is responsible for training (including mixed precision and distributed), logging, and inference.

In [42]:
# instantiate the neural module factory
log_dir = 'logs_' + pretrained_bert_model
checkpoint_dir = 'checkpoints_' + pretrained_bert_model
tensorboard_dir = 'tensorboard_' + pretrained_bert_model
nf = nemo.core.NeuralModuleFactory(log_dir=log_dir,
                                   checkpoint_dir=checkpoint_dir,
                                   tensorboard_dir=tensorboard_dir,
                                   create_tb_writer=True,
                                   add_time_to_log_dir=False,
                                   optimization_level='O1')

Pre-trained models will be automatically downloaded and cached.

In [43]:
# Pre-trained BERT
bert = nemo_nlp.BERT(pretrained_model_name=pretrained_bert_model)

2020-01-14 23:48:17,549 - INFO - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at /root/.cache/torch/pytorch_transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.bf3b9ea126d8c0001ee8a1e8b92229871d06d36d8808208cc2449280da87785c
2020-01-14 23:48:17,554 - INFO - Model config {
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hid

In [44]:
# save the BERT model config for later use
bert_config_dict = bert.config.to_dict()

bert_model_config_path = pretrained_bert_model + '_config.json'
with open(bert_model_config_path, 'w+') as json_file:
    json.dump(bert_config_dict, json_file)

Note that the BERT models we are working with are very large: bert-base-uncased alone has roughly 110 million weights. This capacity is what lets the model capture the nuance and complexity of natural language.

In [45]:
print(f'{pretrained_bert_model} has {bert.num_weights} weights')

bert-base-uncased has 109482240 weights
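
The weight count can be reproduced by hand from the model config. In the sketch below, `hidden`, `intermediate`, and `max_positions` come from the config printed above. The vocabulary size (30522), the number of layers (12), and the two token types are the published bert-base-uncased values and are assumptions here, because the config output is cut off before it shows them.

```python
def bert_parameter_count(vocab_size=30522, hidden=768, layers=12,
                         intermediate=3072, max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT encoder with a pooler head."""
    layer_norm = 2 * hidden                                   # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # Q, K, V, output
    feed_forward = (hidden * intermediate + intermediate      # expand
                    + intermediate * hidden + hidden          # project back
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

print(bert_parameter_count())  # 109482240
```

About 85 million of the weights sit in the 12 transformer layers and about 24 million in the embedding tables.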


Here we define and instantiate the feed forward network that takes as input our BERT embeddings. This network will be used to output the sentence classifications.

In [46]:
# mlp classifier
bert_hidden_size = bert.local_parameters['hidden_size']

mlp = nemo_nlp.SequenceClassifier(hidden_size=bert_hidden_size, 
                                  num_classes=2,
                                  num_layers=2,
                                  log_softmax=False,
                                  dropout=0.1)

loss = nemo.backends.pytorch.common.CrossEntropyLoss()

In [47]:
# Compared to the BERT model, the MLP is tiny.
print(f'MLP has {mlp.num_weights} weights')

MLP has 592130 weights
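
The same bookkeeping works for the classifier. This sketch assumes that `num_layers=2` means one hidden layer of width `hidden_size` followed by an output layer. That is an assumption about `SequenceClassifier`, though it does reproduce the printed count.

```python
def mlp_parameter_count(hidden_size=768, num_classes=2, num_layers=2):
    """Weights plus biases for (num_layers - 1) hidden layers and an output layer."""
    hidden_layers = (num_layers - 1) * (hidden_size * hidden_size + hidden_size)
    output_layer = hidden_size * num_classes + num_classes
    return hidden_layers + output_layer

mlp_weights = mlp_parameter_count()
print(mlp_weights)               # 592130
print(109482240 + mlp_weights)   # 110074370
```

The sum matches the "Total model parameters" line in the training log later in this notebook.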


# Pipelines

Pipelines define how data flows through the different neural networks. In this case, our data will flow through the BERT network and then the MLP network.

We also have different pipelines for training, validation, and inference data.  

For training data, we want it to be used for optimization so it must be shuffled and we need to compute the loss.

For validation data, we won't use it for optimization but we want to know the loss.

And for inference data, we only want the final predictions coming from the model.

## Data

In [54]:
use_preproc = True

if pretrained_bert_model == 'bert-base-uncased':
    batch_size = 256
if pretrained_bert_model == 'bert-large-uncased':
    batch_size = 64

if use_preproc:
    train_data = preproc_data_layer.PreprocBertSentenceClassificationDataLayer(
        input_file=f'{data_dir}/preproc/train-sst-2_{pretrained_bert_model}_{max_seq_length}.hdf5',
        shuffle=True,
        num_samples=-1, # lower for dev, -1 for all dataset
        batch_size=batch_size
    )
    val_data = preproc_data_layer.PreprocBertSentenceClassificationDataLayer(
        input_file=f'{data_dir}/preproc/val-sst-2_{pretrained_bert_model}_{max_seq_length}.hdf5',
        shuffle=False,
        num_samples=-1, # lower for dev, -1 for all dataset
        batch_size=batch_size
    )
else:
    train_data = nemo_nlp.BertSentenceClassificationDataLayer(
        input_file=data_dir + '/my_train.tsv',
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        shuffle=True,
        num_samples=-1, # lower for dev, -1 for all dataset
        batch_size=batch_size
    )
    val_data = nemo_nlp.BertSentenceClassificationDataLayer(
        input_file=data_dir + '/my_val.tsv',
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        shuffle=False,
        num_samples=-1, # lower for dev, -1 for all dataset
        batch_size=batch_size
    )

In [55]:
train_input, train_token_types, train_attn_mask, train_labels = train_data()
val_input, val_token_types, val_attn_mask, val_labels = val_data()

## BERT Embeddings

In [56]:
train_embeddings = bert(input_ids=train_input,
                        token_type_ids=train_token_types,
                        attention_mask=train_attn_mask)
val_embeddings = bert(input_ids=val_input,
                        token_type_ids=val_token_types,
                        attention_mask=val_attn_mask)

## Visualize BERT Embeddings

If we want to inspect the data as it flows through our neural factory we can use the .infer method.  This method will give us the tensors without performing any optimization.

In [57]:
%%time
val_embeddings_tensors = nf.infer(tensors=[val_embeddings])

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic


2020-01-15 00:08:41,132 - INFO - Evaluating batch 0 out of 53
2020-01-15 00:08:42,426 - INFO - Evaluating batch 5 out of 53
2020-01-15 00:08:43,194 - INFO - Evaluating batch 10 out of 53
2020-01-15 00:08:43,807 - INFO - Evaluating batch 15 out of 53
2020-01-15 00:08:44,418 - INFO - Evaluating batch 20 out of 53
2020-01-15 00:08:45,031 - INFO - Evaluating batch 25 out of 53
2020-01-15 00:08:45,643 - INFO - Evaluating batch 30 out of 53
2020-01-15 00:08:46,253 - INFO - Evaluating batch 35 out of 53

CPU times: user 4.79 s, sys: 3.77 s, total: 8.56 s
Wall time: 9.08 s


In [58]:
# each token is embedded into a bert_hidden_size-dimensional space
# max_seq_length tokens are embedded per sentence
print(val_embeddings_tensors[0][0][0].shape)

torch.Size([64, 768])


In [59]:
print(val_embeddings_tensors[0][0][0])

tensor([[-0.4182, -0.2171, -0.3546,  ..., -0.4809,  0.5344,  0.3483],
        [-0.4274,  0.6009, -0.4455,  ..., -0.3183,  0.5212, -0.3771],
        [-0.3586,  0.3772, -0.2083,  ..., -0.4166, -0.0050,  0.3189],
        ...,
        [-0.2747, -0.3725,  0.1221,  ..., -0.1801,  0.1489, -0.0565],
        [-0.2247, -0.3183,  0.1365,  ..., -0.2435,  0.2664, -0.0509],
        [-0.0195,  0.3641, -0.0066,  ..., -0.1118,  0.3956, -0.1401]])


## Training Pipeline 

In order to optimize our network, we need to pass the embeddings through the MLP network and then compute the loss.

In [60]:
train_logits = mlp(hidden_states=train_embeddings)
val_logits = mlp(hidden_states=val_embeddings)

train_loss = loss(logits=train_logits, labels=train_labels)
val_loss = loss(logits=val_logits, labels=val_labels)
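
For intuition about the loss values that show up in the training log, here is a plain-Python sketch of cross-entropy computed directly from logits. It illustrates the formula only and is not the NeMo `CrossEntropyLoss` module; the example logits are made up.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class, computed stably from logits."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

def mean_cross_entropy(batch_logits, labels):
    """Average the per-example losses over a batch."""
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# An untrained binary classifier emits near-equal logits, so the loss is ~ln 2.
print(round(cross_entropy([0.0, 0.0], 1), 3))  # 0.693
print(round(mean_cross_entropy([[2.0, -1.0], [0.5, 1.5]], [0, 1]), 3))
```

This is why a first-step train loss near 0.69 on a two-class problem is expected rather than alarming.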

## Callbacks

Callbacks record and log metrics and save checkpoints during training and evaluation. We use them to print to the screen and to write to TensorBoard.




In [61]:
num_epochs = 3

num_gpus = 1

train_data_size = len(train_data)

steps_per_epoch = math.ceil(train_data_size / (batch_size * num_gpus))

train_callback = nemo.core.SimpleLossLoggerCallback(
    tensors=[train_loss, train_logits],
    print_func=lambda x: nf.logger.info(f'Train loss: {str(np.round(x[0].item(), 3))}'),
    tb_writer=nf.tb_writer,
    get_tb_values=lambda x: [["train_loss", x[0]]],
    step_freq=1)

eval_callback = nemo.core.EvaluatorCallback(
    eval_tensors=[val_logits, val_labels],
    user_iter_callback=lambda x, y: eval_iter_callback(
        x, y, val_data),
    user_epochs_done_callback=lambda x: eval_epochs_done_callback(
        x, f'{nf.work_dir}/graphs'),
    tb_writer=nf.tb_writer,
    eval_epoch=1,
    eval_step=steps_per_epoch)

# Create callback to save checkpoints
ckpt_callback = nemo.core.CheckpointCallback(
    folder=nf.checkpoint_dir,
    epoch_freq=1,
    step_freq=-1,
    checkpoints_to_keep=num_epochs)

In [63]:
lr_policy_fn = get_lr_policy('WarmupAnnealing',
                             total_steps=num_epochs * steps_per_epoch,
                             warmup_ratio=0.1)
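
The schedule needs the total number of optimizer steps, which is why `steps_per_epoch` is computed above. The sketch below restates that ceiling division and pairs it with a linear warmup followed by linear decay. The shape of the schedule is an assumption made for illustration and is not taken from NeMo's `WarmupAnnealing` implementation; the 100 steps per epoch are hypothetical.

```python
import math

def steps_per_epoch(num_examples, batch_size, num_gpus=1):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / (batch_size * num_gpus))

def warmup_then_linear_decay(step, total_steps, base_lr, warmup_ratio=0.1):
    """Linear ramp from 0 to base_lr, then linear decay back to 0."""
    warmup_steps = max(1, int(warmup_ratio * total_steps))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

print(steps_per_epoch(1821, 64))  # 29
total = 3 * 100                   # hypothetical: 3 epochs of 100 steps each
for step in (0, 15, 30, 165, 300):
    print(step, warmup_then_linear_decay(step, total, base_lr=5e-5))
```

The first print uses the 1,821 test sentences and batch size 64 from the inference section, where the log also reports 29 batches.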

In [64]:
%%time
nf.train(tensors_to_optimize=[train_loss],
         callbacks=[train_callback, eval_callback, ckpt_callback],
         lr_policy=lr_policy_fn,
         optimizer='adam',
         optimization_params={'num_epochs': num_epochs, 'lr': 5e-5})

2020-01-15 00:09:53,030 - INFO - Starting .....
2020-01-15 00:09:53,035 - INFO - Found 2 modules with weights:
2020-01-15 00:09:53,037 - INFO - SequenceClassifier
2020-01-15 00:09:53,038 - INFO - BERT
2020-01-15 00:09:53,040 - INFO - Total model parameters: 110074370
2020-01-15 00:09:53,044 - INFO - Restoring checkpoint from folder checkpoints_bert-base-uncased ...
2020-01-15 00:09:53,053 - INFO - Starting epoch 0


Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic


2020-01-15 00:09:56,161 - INFO - Step: 0
2020-01-15 00:09:56,165 - INFO - Train loss: 0.695
2020-01-15 00:09:56,168 - INFO - Step time: 0.44701504707336426 seconds
2020-01-15 00:09:56,170 - INFO - Doing Evaluation ..............................
2020-01-15 00:10:06,265 - INFO - Accuracy: 0.4843146000594707
2020-01-15 00:10:06,267 - INFO - Sampled preds: [0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 1 1 1 1 1]
2020-01-15 00:10:06,270 - INFO - Sampled labels: [1 1 1 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 1 0]
2020-01-15 00:10:06,544 

CPU times: user 3min 18s, sys: 1min 10s, total: 4min 29s
Wall time: 4min 39s


In [65]:
print(tensorboard_dir)

tensorboard_bert-base-uncased


In [67]:
!ls -lh $checkpoint_dir

total 3.7G
-rw-r--r-- 1 root root 418M Jan 15 00:12 BERT-EPOCH-1.pt
-rw-r--r-- 1 root root 418M Jan 15 00:14 BERT-EPOCH-2.pt
-rw-r--r-- 1 root root 418M Jan 15 00:14 BERT-EPOCH-3.pt
-rw-r--r-- 1 root root 2.3M Jan 15 00:12 SequenceClassifier-EPOCH-1.pt
-rw-r--r-- 1 root root 2.3M Jan 15 00:14 SequenceClassifier-EPOCH-2.pt
-rw-r--r-- 1 root root 2.3M Jan 15 00:14 SequenceClassifier-EPOCH-3.pt
-rw-r--r-- 1 root root 836M Jan 15 00:12 trainer-EPOCH-1.pt
-rw-r--r-- 1 root root 836M Jan 15 00:14 trainer-EPOCH-2.pt
-rw-r--r-- 1 root root 836M Jan 15 00:14 trainer-EPOCH-3.pt
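
The checkpoint sizes line up with the weight counts printed earlier if each weight is stored as a 32-bit float, which is an assumption of this back-of-the-envelope sketch rather than something the notebook states.

```python
def fp32_size_mib(num_weights, bytes_per_weight=4):
    """Size in MiB of num_weights values stored as 32-bit floats."""
    return num_weights * bytes_per_weight / 2**20

bert_mib = fp32_size_mib(109482240)   # BERT weight count printed earlier
mlp_mib = fp32_size_mib(592130)       # MLP weight count printed earlier
print(f"BERT: {bert_mib:.1f} MiB")                # BERT: 417.6 MiB
print(f"SequenceClassifier: {mlp_mib:.2f} MiB")   # SequenceClassifier: 2.26 MiB
```

These are close to the 418M and 2.3M shown by `ls -lh`. The trainer files are about twice the size of the BERT files, which would be consistent with an optimizer keeping two extra values per weight, but that reading is a guess.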


In [69]:
# # command for distributed training
# time python -m torch.distributed.launch --nproc_per_node=2 sentence_classification.py \
# --train_file $f'{data_dir}/preproc/train-sst-2_{pretrained_bert_model}_{max_seq_length}.hdf5' \
# --eval_file $f'{data_dir}/preproc/train-sst-2_{pretrained_bert_model}_{max_seq_length}.hdf5' \
# --num_gpus 2 \
# --batch_size $batch_size \
# --amp_opt_level O1 \
# --work_dir $f'distributed_logs_{pretrained_bert_model}' \
# --mode train \
# --num_classes 2 \
# --num_samples -1 \
# --num_epochs 1 \
# --preproc

## Inference Pipeline

Restart the kernel before running the cells below.

In [1]:
import nemo
import nemo_nlp
from pytorch_transformers import BertTokenizer
import torch
import torch.nn.functional as f


import math
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = -1

import json



In [7]:
pretrained_bert_model = 'bert-base-uncased'
#pretrained_bert_model = 'bert-large-uncased'
bert_model_config_path = pretrained_bert_model + '_config.json'
checkpoint_dir = 'checkpoints_' + pretrained_bert_model
inference_log_dir = 'inference_' + pretrained_bert_model
data_dir = 'data/SST-2'

In [8]:
!ls -lh $checkpoint_dir

total 3.7G
-rw-r--r-- 1 root root 418M Jan 15 00:12 BERT-EPOCH-1.pt
-rw-r--r-- 1 root root 418M Jan 15 00:14 BERT-EPOCH-2.pt
-rw-r--r-- 1 root root 418M Jan 15 00:14 BERT-EPOCH-3.pt
-rw-r--r-- 1 root root 2.3M Jan 15 00:12 SequenceClassifier-EPOCH-1.pt
-rw-r--r-- 1 root root 2.3M Jan 15 00:14 SequenceClassifier-EPOCH-2.pt
-rw-r--r-- 1 root root 2.3M Jan 15 00:14 SequenceClassifier-EPOCH-3.pt
-rw-r--r-- 1 root root 836M Jan 15 00:12 trainer-EPOCH-1.pt
-rw-r--r-- 1 root root 836M Jan 15 00:14 trainer-EPOCH-2.pt
-rw-r--r-- 1 root root 836M Jan 15 00:14 trainer-EPOCH-3.pt


In [4]:
nf = nemo.core.NeuralModuleFactory(log_dir=inference_log_dir,
                                   optimization_level='O1')

tokenizer = BertTokenizer.from_pretrained(pretrained_bert_model)

if bert_model_config_path is not None:
    with open(bert_model_config_path) as json_file:
        config = json.load(json_file)
        bert = nemo_nlp.huggingface.BERT(**config)

# mlp classifier
bert_hidden_size = bert.local_parameters['hidden_size']

mlp = nemo_nlp.SequenceClassifier(hidden_size=bert_hidden_size, 
                                  num_classes=2,
                                  num_layers=2,
                                  log_softmax=False,
                                  dropout=0.0)

2020-01-15 00:50:27,777 - INFO - loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/pytorch_transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


In [5]:
bert.restore_from(checkpoint_dir + '/BERT-EPOCH-3.pt')
mlp.restore_from(checkpoint_dir + '/SequenceClassifier-EPOCH-3.pt')

In [9]:
max_seq_length = 64

if pretrained_bert_model == 'bert-base-uncased':
    batch_size = 256
if pretrained_bert_model == 'bert-large-uncased':
    batch_size = 64
    
test_data = nemo_nlp.BertSentenceClassificationDataLayer(
    input_file=data_dir + '/my_test.tsv',
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    shuffle=False,
    num_samples=-1, # lower for dev, -1 for all dataset
    batch_size=64
)

2020-01-15 00:54:26,649 - INFO - data/SST-2/my_test.tsv: 1821
2020-01-15 00:54:29,408 - INFO - Min: 4 |                  Max: 64 |                  Mean: 24.85722130697419 |                  Median: 24.0
2020-01-15 00:54:29,409 - INFO - 75 percentile: 32.0
2020-01-15 00:54:29,412 - INFO - 99 percentile: 53.799999999999955
2020-01-15 00:54:29,414 - INFO - 0 out of 1821                        sentencess with more than 64 subtokens.
2020-01-15 00:54:29,415 - INFO - *** Example ***
2020-01-15 00:54:29,416 - INFO - example_index: 0
2020-01-15 00:54:29,417 - INFO - subtokens: [CLS] uneasy mis ##hma ##sh of styles and genres . [SEP]
2020-01-15 00:54:29,417 - INFO - sent_label: 0
2020-01-15 00:54:29,418 - INFO - input_ids: 101 15491 28616 22444 4095 1997 6782 1998 11541 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2020-01-15 00:54:29,418 - INFO - input_mask: 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In [10]:
test_input, test_token_types, test_attn_mask, _ = test_data()

In [11]:
test_embeddings = bert(input_ids=test_input,
                        token_type_ids=test_token_types,
                        attention_mask=test_attn_mask)

In [12]:
test_logits = mlp(hidden_states=test_embeddings)

In [13]:
%%time
test_logits_tensors = nf.infer(tensors=[test_logits])

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic


2020-01-15 00:55:18,638 - INFO - Evaluating batch 0 out of 29
2020-01-15 00:55:19,310 - INFO - Evaluating batch 2 out of 29
2020-01-15 00:55:19,402 - INFO - Evaluating batch 4 out of 29
2020-01-15 00:55:19,484 - INFO - Evaluating batch 6 out of 29
2020-01-15 00:55:19,636 - INFO - Evaluating batch 8 out of 29
2020-01-15 00:55:19,713 - INFO - Evaluating batch 10 out of 29
2020-01-15 00:55:19,788 - INFO - Evaluating batch 12 out of 29
2020-01-15 00:55:19,863 - INFO - Evaluating batch 14 out of 29
2020-01-15 00:55:19,939 - INFO - Evaluating batch 16 out of 29
2020-01-15 00:55:20,016 - INFO - Evaluating batch 18 out of 29
2020-01-15 00:55:20,091 - INFO - Evaluating batch 20 out of 29
2020-01-15 00:55:20,166 - INFO - Evaluating batch 22 out of 29
2020-01-15 00:55:20,242 - INFO - Evaluating batch 24 out of 29
2020-01-15 00:55:20,317 - INFO - Evaluating batch 26 out of 29
2020-01-15 00:55:20,393 - INFO - Evaluating batch 28 out of 29


CPU times: user 1.54 s, sys: 1.01 s, total: 2.56 s
Wall time: 2.91 s


In [14]:
# softmax over the class axis (dim=1); column 1 is the probability of the positive class
test_probs = f.softmax(torch.cat(test_logits_tensors[0]), dim=1).numpy()[:, 1]

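
What the softmax step does to each pair of logits can be shown without torch. The logits below are made up for illustration; only the arithmetic is the point.

```python
import math

def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    peak = max(logits)                        # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def positive_probability(batch_logits):
    """Softmax across the class axis of each row, keeping class 1."""
    return [softmax(row)[1] for row in batch_logits]

batch_logits = [[3.0, -3.0], [-2.5, 3.5], [0.1, 0.0]]   # made-up logits
probs = positive_probability(batch_logits)
print([round(p, 4) for p in probs])
```

Normalizing within each row (the class axis) is what makes the two columns sum to 1 per sentence. Normalizing down the batch axis instead would mix probabilities across sentences.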


In [15]:
test_df = pd.read_csv(data_dir + '/test.tsv', sep='\t')

In [16]:
test_df['prob'] = test_probs 

In [17]:
test_df

Unnamed: 0,index,sentence,prob
0,0,uneasy mishmash of styles and genres .,0.002884
1,1,this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .,0.002166
2,2,"by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .",0.998256
3,3,director rob marshall went out gunning to make a great one .,0.998232
4,4,"lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .",0.999278
...,...,...,...
1816,1816,"it risks seeming slow and pretentious , because it thinks the gamble is worth the promise .",0.988558
1817,1817,take care of my cat offers a refreshingly different slice of asian cinema .,0.999533
1818,1818,davis has filled out his cast with appealing fresh faces .,0.999630
1819,1819,"it represents better-than-average movie-making that does n't demand a dumb , distracted audience .",0.998245


In [18]:
test_df.to_csv(inference_log_dir + '/test_inference.tsv', sep='\t', index=False)

In [19]:
def sample_classification(data_path):
    df = pd.read_csv(data_path, sep='\t')
    sample = df.sample()
    sentence = sample.sentence.values[0]
    prob = sample.prob.values[0]
    result = f'{sentence} | {prob}'
    return result

In [20]:
sample_classification(inference_log_dir + '/test_inference.tsv')

'all this turns out to be neither funny nor provocative - only dull . | 0.0017007225'

In [21]:
num_samples = 10
for _ in range(num_samples):
    print(sample_classification(inference_log_dir + '/test_inference.tsv'))

( it 's ) a prison soccer movie starring charismatic tough guy vinnie jones , but it had too much spitting for me to enjoy . | 0.0048579657
as elegantly crafted as it often is , anderson 's movie is essentially a one-trick pony that , hampered by an undeveloped script , ultimately pulls up lame . | 0.02967562
despite the premise of a good story ... it wastes all its star power on cliched or meaningless roles . | 0.0027683857
what full frontal lacks in thematic coherence it largely makes up for as loosey-goosey , experimental entertainment . | 0.97737074
a tough go , but leigh 's depth and rigor , and his skill at inspiring accomplished portrayals that are all the more impressive for their lack of showiness , offsets to a notable degree the film 's often-mined and despairing milieu . | 0.9195341999999999
an incendiary , deeply thought-provoking look at one of the most peculiar ( and peculiarly venomous ) bigotries in our increasingly frightening theocracy | 0.99882597
boring we did n't 

## BERT nails it:
the film is just a big , gorgeous , mind-blowing , breath-taking mess . | 0.2738656

a sensual performance from abbass buoys the flimsy story , but her inner journey is largely unexplored and we 're left wondering about this exotic-looking woman whose emotional depths are only hinted at . | 0.48260054

## Classify my sentence

In [22]:
def classify_sentence(nf, tokenizer, bert, mlp, sentence):
    sentence = sentence.lower()
    with open("tmp_sentence.tsv", 'w+') as tmp_tsv:
        header = 'sentence\tlabel\n'
        line = sentence + '\t0\n'
        tmp_tsv.writelines([header, line])

    tmp_data = nemo_nlp.BertSentenceClassificationDataLayer(
        input_file='tmp_sentence.tsv',
        tokenizer=tokenizer,
        max_seq_length=128,
        shuffle=False,
        num_samples=-1, # lower for dev, -1 for all dataset
        batch_size=1
    )
    tmp_input, tmp_token_types, tmp_attn_mask, _ = tmp_data()
    tmp_embeddings = bert(input_ids=tmp_input,
                            token_type_ids=tmp_token_types,
                            attention_mask=tmp_attn_mask)
    tmp_logits = mlp(hidden_states=tmp_embeddings)
    tmp_logits_tensors = nf.infer(tensors=[tmp_logits, tmp_embeddings])
    # softmax over the class axis (dim=1); column 1 is the positive class
    tmp_probs = f.softmax(torch.cat(tmp_logits_tensors[0]), dim=1).numpy()[:, 1]
    print(f'{sentence} | {tmp_probs[0]}')

In [23]:
#sentence = 'point break is the best movie of all time'
#sentence = 'the movie was a wonderful exercise in understanding the struggles of native americans'
#sentence = 'the performance of diego luna had me excited and annoyed at the same time'
sentence = 'matt damon is the only good thing about this film'
classify_sentence(nf, tokenizer, bert, mlp, sentence)

2020-01-15 00:56:44,104 - INFO - tmp_sentence.tsv: 1
2020-01-15 00:56:44,106 - INFO - Min: 12 |                  Max: 12 |                  Mean: 12.0 |                  Median: 12.0
2020-01-15 00:56:44,108 - INFO - 75 percentile: 12.0
2020-01-15 00:56:44,109 - INFO - 99 percentile: 12.0
2020-01-15 00:56:44,111 - INFO - 0 out of 1                        sentencess with more than 128 subtokens.
2020-01-15 00:56:44,112 - INFO - *** Example ***
2020-01-15 00:56:44,113 - INFO - example_index: 0
2020-01-15 00:56:44,114 - INFO - subtokens: [CLS] matt damon is the only good thing about this film [SEP]
2020-01-15 00:56:44,115 - INFO - sent_label: 0
2020-01-15 00:56:44,117 - INFO - input_ids: 101 4717 11317 2003 1996 2069 2204 2518 2055 2023 2143 102
2020-01-15 00:56:44,118 - INFO - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1


Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic


2020-01-15 00:56:45,239 - INFO - Evaluating batch 0 out of 1


matt damon is the only good thing about this film | 0.026708465069532394


