In [None]:
!pip install simpletransformers
# in "Runtime" please select "Change runtime type"



# Transformers

(From my colleague and co-author Laura Mitchell: https://badootech.badoo.com/achieving-state-of-the-art-results-in-natural-language-processing-d6fd25954a90)

## ELMo and BERT

An introduction to ELMo can be found in this paper: https://arxiv.org/pdf/1802.05365.pdf. ELMo aims to provide an improved word representation for NLP tasks in different contexts by producing multiple word embeddings per single word, across different scenarios. 


In the example below, the word “minute” has multiple meanings (it is a homonym), so ELMo represents it with a different embedding in each context. With other models such as GloVe, every instance would have the same representation regardless of its context.


![alt text](https://miro.medium.com/max/700/0*ItUMFXLvJOUzTZg9)
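
To make the contrast concrete, here is a minimal sketch. It is not ELMo or GloVe: the static table is random, and `toy_contextual_embed` is a hypothetical stand-in that simply mixes each token with its neighbours. It only illustrates that a lookup table returns one vector per word, while a context-dependent function does not.

```python
import numpy as np

rng = np.random.default_rng(0)

sentences = [
    "the meeting lasted one minute".split(),
    "a minute detail was missed".split(),
]
vocab = sorted({word for sentence in sentences for word in sentence})

# Static table: one fixed vector per word type (a GloVe-style lookup).
static_table = {word: rng.normal(size=4) for word in vocab}

def static_embed(tokens):
    """Look each token up in the table; the context is ignored."""
    return np.stack([static_table[token] for token in tokens])

def toy_contextual_embed(tokens):
    """Toy stand-in for a contextual model: blend each token with its neighbours."""
    base = static_embed(tokens)
    padded = np.vstack([np.zeros(4), base, np.zeros(4)])
    return (padded[:-2] + 2 * padded[1:-1] + padded[2:]) / 4.0

i0 = sentences[0].index("minute")
i1 = sentences[1].index("minute")

static_same = np.allclose(static_embed(sentences[0])[i0],
                          static_embed(sentences[1])[i1])
contextual_same = np.allclose(toy_contextual_embed(sentences[0])[i0],
                              toy_contextual_embed(sentences[1])[i1])
print(static_same, contextual_same)  # True False
```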

ELMo uses a bidirectional language model (biLM) to learn both word and linguistic context. At each word, the internal states from both the forward and backward pass are concatenated to produce an intermediate word vector. As such, it is the model’s bidirectional nature that gives it a hint not only as to the next word in a sentence but also the words that came before.

![alt text](https://miro.medium.com/max/700/0*nbgjbNyQltrAd4Dd)
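
The concatenation step can be sketched in a few lines of NumPy. This is a simplification under stated assumptions: ELMo uses LSTMs, whereas the sketch uses a plain tanh RNN with random weights, and all names and sizes are illustrative.

```python
import numpy as np

def run_rnn(inputs, w_in, w_rec):
    """Plain tanh RNN; returns the hidden state at every position."""
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for x in inputs:
        hidden = np.tanh(w_in @ x + w_rec @ hidden)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, emb_dim, hid_dim = 5, 8, 6
tokens = rng.normal(size=(seq_len, emb_dim))  # stand-in word vectors

fwd_in = rng.normal(size=(hid_dim, emb_dim))
fwd_rec = rng.normal(size=(hid_dim, hid_dim))
bwd_in = rng.normal(size=(hid_dim, emb_dim))
bwd_rec = rng.normal(size=(hid_dim, hid_dim))

# Forward pass reads left to right; backward pass reads the reversed sequence
# and is flipped back so that row i again corresponds to token i.
forward_states = run_rnn(tokens, fwd_in, fwd_rec)
backward_states = run_rnn(tokens[::-1], bwd_in, bwd_rec)[::-1]

# Intermediate word vector: both directions side by side.
bilm_states = np.concatenate([forward_states, backward_states], axis=-1)
print(bilm_states.shape)  # (5, 12)
```

Each row of `bilm_states` therefore carries information from the words before it (first half) and the words after it (second half).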

Another feature of ELMo is that it uses language models comprised of multiple layers, forming a multilayer RNN. The intermediate word vector produced by layer 1 is fed up to layer 2. The more layers that are present in the model, the more the internal states get processed and as such represent more abstract semantics such as topics and sentiment. By contrast, lower layers represent less abstract semantics such as short phrases or parts of speech.


![alt text](https://miro.medium.com/max/700/0*L7y2XGKTmWokhZEE)


In order to compute the word embeddings that get fed into the first layer of the biLM, ELMo uses a character-based CNN. The input is computed purely from combinations of characters within a word. This has two key benefits:
- It is able to form representations of words external to the vocabulary it was trained on. For example, the model could determine that “Mum” and “Mummy” are somewhat related before even considering the context in which they are used. This is particularly useful for us at Badoo as it can help detect misspelled words through context.
- It continues to perform well when it encounters a word that was absent from the training dataset.
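
A rough sketch of the idea, assuming random (untrained) character vectors and filters: each word is embedded by sliding filters over windows of characters and max-pooling the result. The sizes and the helper name `char_cnn_embed` are illustrative and not taken from ELMo itself.

```python
import numpy as np

rng = np.random.default_rng(0)
CHAR_DIM, NUM_FILTERS, WIDTH = 8, 16, 3

# One random vector per byte value, so any string can be encoded.
char_table = rng.normal(size=(256, CHAR_DIM))
filters = rng.normal(size=(NUM_FILTERS, WIDTH * CHAR_DIM))

def char_cnn_embed(word):
    """Embed a word from its characters: convolve over windows, then max-pool."""
    codes = list(word.encode("utf-8"))
    if len(codes) < WIDTH:
        codes = codes + [0] * (WIDTH - len(codes))  # pad very short words
    chars = char_table[codes]
    windows = [chars[i:i + WIDTH].ravel()
               for i in range(len(codes) - WIDTH + 1)]
    activations = np.stack([filters @ window for window in windows])
    return activations.max(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

mum, mummy, table = (char_cnn_embed(w) for w in ["Mum", "Mummy", "table"])
print(cosine(mum, mummy), cosine(mum, table))
```

No vocabulary lookup is involved, so a word never seen before still receives a vector. “Mum” and “Mummy” share the character window “Mum”, which is why their vectors overlap even before any training.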



## Loading the ELMo Model
The model trained on the One Billion Word Language Model Benchmark (http://www.statmt.org/lm-benchmark/) has been exposed on TensorFlow Hub.

# BERT

Bidirectional Encoder Representations from Transformers (BERT) incorporates an attention mechanism (the transformer) that learns contextual relations between words in a text. Unlike bidirectional models such as ELMo, which read the text input sequentially (left-to-right and right-to-left), BERT reads the entire sequence of words at once, so one could describe it as non-directional.
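
One schematic way to picture the difference is as a visibility matrix, where entry `[i, j]` says whether position `i` may use information from position `j`. This is only an illustration of the reading order, not code from either model.

```python
import numpy as np

seq_len = 4

# Sequential readers: each position sees only one side of the sentence.
left_to_right = np.tril(np.ones((seq_len, seq_len), dtype=bool))
right_to_left = np.triu(np.ones((seq_len, seq_len), dtype=bool))

# Reading the whole sequence at once: every position sees every other position.
all_at_once = np.ones((seq_len, seq_len), dtype=bool)

print(left_to_right.astype(int))
print(right_to_left.astype(int))
print(all_at_once.astype(int))
```

A bidirectional model like ELMo combines the first two views after computing them separately, whereas the third matrix corresponds to a single pass in which all positions are visible together.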


Essentially, BERT is a trained transformer encoder stack where results are passed up from one encoder to the next.


![alt text](https://miro.medium.com/max/700/0*jjmGBjYsAtHYUws5)

At each encoder, self-attention is applied and this helps the encoder to look at other words in the input sentence as it encodes each specific word, so helping it to learn correlations between the words. These results then pass through a feed-forward network.

![alt text](https://miro.medium.com/max/700/0*dzRZ740T1XfHMo1K)
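
A single encoder step can be sketched as follows. This is a stripped-down version under stated assumptions: one attention head, random weights, a ReLU feed-forward network, and no residual connections or layer normalisation, all of which a real BERT encoder adds.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores)  # row i: how much word i attends to each word
    return weights @ v, weights

def feed_forward(x, w1, w2):
    """Position-wise feed-forward network applied to every word vector."""
    return np.maximum(0.0, x @ w1) @ w2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 16
x = rng.normal(size=(seq_len, d_model))  # stand-in word vectors

w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
w1 = rng.normal(size=(d_model, d_ff))
w2 = rng.normal(size=(d_ff, d_model))

attended, attn_weights = self_attention(x, w_q, w_k, w_v)
encoded = feed_forward(attended, w1, w2)
print(attn_weights.shape, encoded.shape)  # (5, 5) (5, 8)
```

Because `encoded` has the same shape as `x`, it can be passed straight into the next encoder in the stack.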


BERT was pre-trained on BooksCorpus and English Wikipedia text and uses masked language modelling rather than sequential modelling during training. It masks 15% of the tokens in each sequence and tries to predict their original values based on the context. This involves the following:
- Adding a classification layer on top of the encoder output.
- Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
- Calculating the probability of each word in the vocabulary with softmax.

![alt text](https://miro.medium.com/max/700/0*qY_xM6JcrkJILRIH)
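
The steps above can be sketched with NumPy. Everything here is a toy under stated assumptions: the vocabulary is random, `MASK_ID` is a hypothetical id for the mask token, the encoder stack is replaced by a plain embedding lookup, the classification layer is a single tanh layer, and every selected position is replaced by the mask token.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, seq_len = 50, 16, 20
MASK_ID = 0  # hypothetical id reserved for the mask token

embedding_matrix = rng.normal(size=(vocab_size, hidden_dim))
token_ids = rng.integers(1, vocab_size, size=seq_len)

# Mask 15% of the positions in the sequence.
num_masked = max(1, round(0.15 * seq_len))
masked_positions = rng.choice(seq_len, size=num_masked, replace=False)
masked_ids = token_ids.copy()
masked_ids[masked_positions] = MASK_ID

# Stand-in for the encoder stack: here just the input embeddings.
encoder_output = embedding_matrix[masked_ids]

# Step 1: classification layer on top of the encoder output.
w_cls = rng.normal(size=(hidden_dim, hidden_dim)) / np.sqrt(hidden_dim)
transformed = np.tanh(encoder_output @ w_cls)

# Step 2: multiply by the embedding matrix to reach the vocabulary dimension.
logits = transformed @ embedding_matrix.T

# Step 3: softmax gives a probability for every word in the vocabulary.
logits -= logits.max(axis=-1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

predicted_ids = probs[masked_positions].argmax(axis=-1)
print(probs.shape, len(masked_positions))  # (20, 50) 3
```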


Because the BERT loss function only takes into consideration the prediction of the masked values, the model converges more slowly than directional models. This drawback, however, is offset by its increased awareness of context.
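
A minimal sketch of such a loss, assuming random logits and an arbitrary choice of masked positions: the cross-entropy is averaged over the masked positions only, so the remaining tokens in the sequence contribute no training signal.

```python
import numpy as np

def masked_lm_loss(logits, target_ids, is_masked):
    """Mean cross-entropy over the masked positions only."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_log_probs = log_probs[np.arange(len(target_ids)), target_ids]
    return float(-(token_log_probs * is_masked).sum() / is_masked.sum())

rng = np.random.default_rng(0)
seq_len, vocab_size = 20, 50
logits = rng.normal(size=(seq_len, vocab_size))
target_ids = rng.integers(0, vocab_size, size=seq_len)

is_masked = np.zeros(seq_len, dtype=bool)
is_masked[[2, 9, 15]] = True  # 3 of 20 positions, i.e. 15%

loss = masked_lm_loss(logits, target_ids, is_masked)

# Changing the predictions at unmasked positions leaves the loss untouched.
perturbed = logits.copy()
perturbed[~is_masked] += rng.normal(size=(seq_len - 3, vocab_size))
unchanged = np.isclose(loss, masked_lm_loss(perturbed, target_ids, is_masked))
print(unchanged)  # True
```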

In [None]:
%%writefile setup.sh

git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex


Writing setup.sh


In [None]:
!sh setup.sh

Cloning into 'apex'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 7416 (delta 1), reused 1 (delta 0), pack-reused 7400
Receiving objects: 100% (7416/7416), 13.91 MiB | 11.29 MiB/s, done.
Resolving deltas: 100% (4999/4999), done.
  cmdoptions.check_install_build_global(options)
Created temporary directory: /tmp/pip-ephem-wheel-cache-bshef9gz
Created temporary directory: /tmp/pip-req-tracker-7f66u_gz
Created requirements tracker '/tmp/pip-req-tracker-7f66u_gz'
Created temporary directory: /tmp/pip-install-llbg92ml
Processing ./apex
  Created temporary directory: /tmp/pip-req-build-sax9aqqp
  Added file:///content/apex to build tracker '/tmp/pip-req-tracker-7f66u_gz'
    Running setup.py (path:/tmp/pip-req-build-sax9aqqp/setup.py) egg_info for package from file:///content/apex
    Running command python setup.py egg_info


    torch.__version__  = 1.6.0+cu101


    running 

In [None]:
from simpletransformers.classification import ClassificationModel
import pandas as pd
import logging


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Train and Evaluation data needs to be in a Pandas Dataframe of two columns. The first column is the text with type str, and the second column is the label with type int.
train_data = [['Example sentence belonging to class 1', 1], ['Example sentence belonging to class 0', 0]]
train_df = pd.DataFrame(train_data)

eval_data = [['Example eval sentence belonging to class 1', 1], ['Example eval sentence belonging to class 0', 0]]
eval_df = pd.DataFrame(eval_data)

# Create a ClassificationModel
model = ClassificationModel('roberta', 'roberta-base') # You can set class weights by using the optional weight argument

# Train the model
model.train_model(train_df)

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(eval_df)

- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."
INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.
INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.
  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."
INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
INFO:simpletransformers.classification.classification_model:{'mcc': 0.0, 'tp': 1, 'tn': 0, 'fp': 1, 'fn': 0, 'eval_loss': 0.6954442858695984}
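
The reported `mcc` of 0.0 follows directly from the confusion counts in the log line above. With `tp=1` and `fp=1`, both evaluation sentences were assigned to class 1, so the denominator of the Matthews correlation coefficient is zero. The helper below is a small illustrative function (not part of simpletransformers) that returns 0.0 in that degenerate case, which is a common convention.

```python
import math

def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Zero denominator: one class was never predicted or never present.
    return numerator / denominator if denominator else 0.0

print(matthews_corrcoef_from_counts(tp=1, tn=0, fp=1, fn=0))  # 0.0
```

With only two training sentences and a single epoch this result is expected; it says nothing about the model on real data.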






When creating the ELMo module, we set the `trainable` parameter to True so that the 4 scalar weights (as described in the paper) and all LSTM cell variables can be trained. In this setting, the module still keeps all other parameters fixed. This helps to obtain an embedding for a word the model has not seen, given its context.
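
As a rough illustration of what those scalar weights do, here is a sketch of the layer-mixing step, assuming the four scalars are three softmax-normalised layer weights plus one overall scale `gamma`. The layer outputs are random stand-ins and the function name is hypothetical.

```python
import numpy as np

def elmo_scalar_mix(layer_outputs, layer_weights, gamma):
    """Collapse per-layer vectors into one: gamma * sum_j softmax(w)_j * h_j."""
    w = np.exp(layer_weights - np.max(layer_weights))
    s = w / w.sum()
    return gamma * np.tensordot(s, layer_outputs, axes=1)

rng = np.random.default_rng(0)
num_layers, seq_len, dim = 3, 5, 12
layer_outputs = rng.normal(size=(num_layers, seq_len, dim))  # stand-in biLM layers

layer_weights = np.zeros(num_layers)  # trainable; zeros give a plain average
gamma = 1.0                           # trainable overall scale

mixed = elmo_scalar_mix(layer_outputs, layer_weights, gamma)
print(mixed.shape)  # (5, 12)
```

Training these scalars lets a downstream task decide how much to rely on lower layers versus higher layers.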

## Structure
The ELMo model consists of two files:

options.json: the parameters/options with which the language model was trained

weights.hdf5: the weights file for the best model

The input to the pre-trained model (elmo) above can be fed in two different ways:

In [None]:
elmo()