A.S. Lundervold, v.011122

# Introduction

This notebook gives a quick tour of some techniques and ideas from natural language processing (NLP), including approaches based on deep learning. The goal is to introduce some of what is going on in this field and to help you better understand some recent ideas and developments in deep learning.

> NLP is an exciting area these days. Breakthroughs in deep learning for language processing recently initiated a revolution in NLP, and we're still in it. The best place to start exploring this is perhaps the HuggingFace community and library (at least if you want to get started right away playing around with using state-of-the-art NLP models): https://huggingface.co/. <br> <a href="https://huggingface.co/"><img width=20% src="https://luxcapital-website-media.s3.amazonaws.com/wp-content/uploads/2019/12/23115642/Logo-600x554.png"></a>

# Setup

In [1]:
# This is a quick check of whether the notebook is currently running on Google Colaboratory
# or on Kaggle, as that makes some difference for the code below.
# We'll do this in every notebook of the course.
try:
    import google.colab
    colab = True
except ImportError:
    colab = False

import os
kaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

In [2]:
import numpy as np

We'll use the excellent HuggingFace Transformers library, which covers all our natural language processing needs:

<img src="https://camo.githubusercontent.com/b253a30b83a0724f3f74f3f58236fb49ced8d7b27cb15835c9978b54e444ab08/68747470733a2f2f68756767696e67666163652e636f2f64617461736574732f68756767696e67666163652f646f63756d656e746174696f6e2d696d616765732f7265736f6c76652f6d61696e2f7472616e73666f726d6572735f6c6f676f5f6e616d652e706e67">


We will not cover the library in any detail. If you're interested, take a look at the [HuggingFace course](https://huggingface.co/course/chapter1/1) and its excellent documentation over at https://huggingface.co/transformers.

# Load data

We'll use the [IMDB dataset](https://huggingface.co/datasets/imdb) containing 50,000 labeled movie reviews from IMDB, each marked as either negative (0) or positive (1). It is split into 25,000 reviews for training and 25,000 reviews for testing. The dataset also ships an `unsupervised` split of 50,000 reviews, which we won't use here.

The dataset is available via HuggingFace `datasets`:

In [3]:
from datasets import load_dataset

In [4]:
dataset = load_dataset("imdb")

Found cached dataset imdb (/home/ubuntu/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)



In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## Make a sample dataset

As the training process takes a long time, we create a small sample dataset:

In [6]:
sample = True

In [7]:
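if sample:
    # Keep a random 20% of the 25,000 training reviews (5,000 reviews) ...
    dataset = dataset['train']
    dataset = dataset.train_test_split(train_size=0.2, shuffle=True, seed=42)['train']
    # ... and split those 80/20 into new train and test sets (seeded for reproducibility)
    dataset = dataset.train_test_split(test_size=0.2, seed=42)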

Loading cached split indices for dataset at /home/ubuntu/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-c676b5ef844040e4.arrow and /home/ubuntu/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-2711e3ecf04e8063.arrow


In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
})
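To see where the 4000 and 1000 rows come from, the two-stage split can be mimicked on plain integer indices. This is only a sketch using the standard library: `split_indices` is a hypothetical helper, not part of `datasets`, and it does not reproduce the library's actual shuffling.

```python
import random

def split_indices(indices, first_fraction, seed=42):
    """Shuffle a copy of `indices` and cut it into two parts (hypothetical helper)."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * first_fraction))
    return shuffled[:cut], shuffled[cut:]

all_train = range(25000)                          # the full IMDB training split
subset, _ = split_indices(all_train, 0.2)         # keep 20% of the reviews
train_idx, test_idx = split_indices(subset, 0.8)  # 80/20 split of that subset

print(len(subset), len(train_idx), len(test_idx))
```

The two resulting index sets are disjoint, so no review ends up in both the sample's training set and its test set.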

# Explore the data

The training data is stored under `train` and the test data under `test`. Here are two training instances:

In [9]:
dataset['train'][10:12]

{'text': ['Like some of the other reviewers have alluded to previously, I\'d like to know what moron actually read the script and went\', "Yea!!! This is it. This is the next film we are going to green light!!" And whoever that person is, should have his or her head examined for actual brain activity. Because whoever is responsible for actually dishing out money to have this made after reading the script, well, I\'d love to give you my email address and maybe you\'d like to just give away some more money. This film is atrocious in every way.<br /><br />The Wayans are funny, at least they can be. They have made some good films and had some incredibly funny performances along the way. But in here, not only does the premise defy all logic, not only is the acting terrible, not only is the entire movie offensive from start to finish, not only is the direction as amateurish as you can find, but they actually want you to pay to see this film. Maybe if it was free...naaah, it would still be a 

We can print one of them in a more readable form:

In [10]:
from pprint import pprint

In [11]:
pprint(dataset['train'][10])

{'label': 0,
 'text': "Like some of the other reviewers have alluded to previously, I'd "
         'like to know what moron actually read the script and went\', "Yea!!! '
         'This is it. This is the next film we are going to green light!!" And '
         'whoever that person is, should have his or her head examined for '
         'actual brain activity. Because whoever is responsible for actually '
         'dishing out money to have this made after reading the script, well, '
         "I'd love to give you my email address and maybe you'd like to just "
         'give away some more money. This film is atrocious in every way.<br '
         '/><br />The Wayans are funny, at least they can be. They have made '
         'some good films and had some incredibly funny performances along the '
         'way. But in here, not only does the premise defy all logic, not only '
         'is the acting terrible, not only is the entire movie offensive from '
         'start to finish, not on

> **How do we represent the text for consumption by a machine learning model?**

> **How can a computer read??**

<img src="https://camo.githubusercontent.com/7d5ed540c87d660cae46ca0d2055d760f786bea36513bb1a0b0784d47cef45b1/687474703a2f2f322e62702e626c6f6773706f742e636f6d2f5f2d2d75564865746b5549512f54446165356a476e6138492f4141414141414141414b302f734253704c7564576d63772f73313630302f72656164696e672e676966">

# Prepare the data: tokenization and numericalization

For a computer, everything is numbers. We have to convert the text to a series of numbers and then feed those to the computer.

This can be done in two widely used steps in natural language processing: **tokenization** and **numericalization**.

## Tokenization

In tokenization, the text is split into smaller units called tokens, which can be words, parts of words, or punctuation marks. A simple way to achieve this is to split on the spaces in the text. But then, among other things, punctuation stays glued to the neighboring words, and we ignore the fact that some words are contractions of multiple words (for example, "isn't" and "don't").

<img src="https://camo.githubusercontent.com/6c79dd15098f840a49149649832fa0efd7252d71d03257b5fc96379f7456d889/68747470733a2f2f73706163792e696f2f746f6b656e697a6174696f6e2d35376536313862643739643933336334636364333038623537333930363264362e737667">
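The limitation of splitting on spaces is easy to see on a short sentence. The following sketch uses only the standard library. The example sentence and the regular expression are our own illustration, not the rules spaCy applies.

```python
import re

text = "This isn't bad, is it?"

# Naive approach: split on whitespace. Punctuation stays glued to the words.
whitespace_tokens = text.split()

# Slightly better: treat runs of word characters and single punctuation marks
# as separate tokens. Contractions are still split crudely at the apostrophe.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

Neither version handles "isn't" sensibly, which is one reason to reach for a proper tokenizer with language-specific rules.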

Several tokenization strategies tackle these and other issues, for example, **rule-based splitting of sentences into words** (used by ULMFiT, Transformer XL, and others), **Byte-Pair Encoding** (used by GPT-2 and others), **WordPiece** (used by BERT and others), and **SentencePiece** (used by XLNet, T5, and others).

### Rule-based splitting of sentences into words

The NLP library `spaCy` can help us with this kind of tokenization. We install spaCy and download a set of rules for tokenizing English text:

In [12]:
%%capture
import sys
%pip install spacy
!{sys.executable} -m spacy download en_core_web_sm

In [13]:
import spacy

In [14]:
nlp = spacy.load("en_core_web_sm")

In [15]:
example_sentence = "Here's a sentence to be tokenized by a tokenizer, and it includes the non-existent word graffalacticus"

In [16]:
doc = nlp(example_sentence)
for token in doc:
    print(token.text)

Here
's
a
sentence
to
be
tokenized
by
a
tokenizer
,
and
it
includes
the
non
-
existent
word
graffalacticus


### Subword tokenization

In [17]:
from transformers import BertTokenizer

In [18]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [19]:
tokenizer.tokenize(example_sentence)

['here',
 "'",
 's',
 'a',
 'sentence',
 'to',
 'be',
 'token',
 '##ized',
 'by',
 'a',
 'token',
 '##izer',
 ',',
 'and',
 'it',
 'includes',
 'the',
 'non',
 '-',
 'existent',
 'word',
 'graf',
 '##fa',
 '##la',
 '##ctic',
 '##us']
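In the output above, the `##` prefix marks a piece that continues a word rather than starting one. A rough way to picture how a WordPiece-style tokenizer arrives at such a split is greedy longest-match-first lookup in a fixed vocabulary. The sketch below is a simplified illustration and not the library's implementation. The function name `wordpiece_tokenize` and the tiny vocabulary are made up for this example, while the real tokenizer also lowercases, splits on punctuation, and uses a large learned vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first and shrink until a vocab hit
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Tiny made-up vocabulary, chosen to mimic the output above
toy_vocab = {"token", "##ized", "##izer", "graf", "##fa", "##la", "##ctic", "##us"}

print(wordpiece_tokenize("tokenized", toy_vocab))
print(wordpiece_tokenize("graffalacticus", toy_vocab))
print(wordpiece_tokenize("movie", toy_vocab))
```

Because unknown words are broken into known pieces, even an invented word like "graffalacticus" gets a representation, as long as the vocabulary contains suitable fragments.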

### Byte-Pair encoding: an example of training an encoder

In [20]:
from tokenizers import Tokenizer
from tokenizers.models import BPE

In [21]:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]"])

In [22]:
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

In [23]:
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

In [24]:
tokenizer.train_from_iterator(dataset['train']['text'], trainer=trainer)

In [25]:
example_sentence_bpe = tokenizer.encode(example_sentence)

In [26]:
example_sentence_bpe

Encoding(num_tokens=27, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [27]:
example_sentence_bpe.tokens

['Here',
 "'",
 's',
 'a',
 'sentence',
 'to',
 'be',
 'token',
 'ized',
 'by',
 'a',
 'token',
 'izer',
 ',',
 'and',
 'it',
 'includes',
 'the',
 'non',
 '-',
 'existent',
 'word',
 'gra',
 'ff',
 'al',
 'actic',
 'us']

In [28]:
example_sentence_bpe.ids[:15]

[2011, 8, 84, 66, 11053, 157, 172, 15599, 1418, 269, 66, 15599, 15855, 13, 155]
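To get a feeling for what training a Byte-Pair encoder involves, here is a toy version of the textbook algorithm: start from single characters and repeatedly merge the most frequent adjacent pair of symbols. This is a sketch under simplifying assumptions, not what `BpeTrainer` does internally. The function `learn_bpe_merges` and the word list are hypothetical, and details such as vocabulary size limits and minimum frequencies are left out.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, starting from single characters
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen wins
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

toy_words = ["low", "low", "lower", "lowest", "newer", "newest"]
merges, segmented = learn_bpe_merges(toy_words, num_merges=2)
print(merges)
print(list(segmented))
```

On this toy corpus, the two merges build up the symbol `low`, so "lower" is segmented as `low`, `e`, `r`. Frequent character sequences become single tokens, while rare words remain split into smaller pieces.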

## Numericalization

We convert tokens to numbers by making a list of all the distinct tokens that have been used (the vocabulary) and assigning each one a number. The tokenizer has already taken care of this for us:

In [29]:
example_sentence_bpe.ids[:15]

[2011, 8, 84, 66, 11053, 157, 172, 15599, 1418, 269, 66, 15599, 15855, 13, 155]
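The idea behind numericalization can be sketched in a few lines of plain Python. The helpers `build_vocab` and `numericalize` and the two-sentence corpus are made up for illustration. A trained tokenizer stores its vocabulary in much the same spirit, though the ids it assigns will differ.

```python
def build_vocab(token_lists, unk_token="[UNK]"):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {unk_token: 0}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def numericalize(tokens, vocab, unk_token="[UNK]"):
    """Replace tokens by their ids; unseen tokens map to the unknown id."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]

corpus = [["a", "good", "movie"], ["a", "bad", "movie"]]
vocab = build_vocab(corpus)
print(vocab)
print(numericalize(["a", "great", "movie"], vocab))
```

The word "great" never appeared in the toy corpus, so it is mapped to the id reserved for unknown tokens.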

# Embeddings and using pre-trained text encoders

http://projector.tensorflow.org/

# Fine-tuning pre-trained models

The advent of **Transformer models** (see [here]() for a quick intro) has revolutionized the field of natural language processing. As a result, when faced with an NLP task for which deep learning is applicable, most practitioners turn to Transformer models. Furthermore, one typically uses _pre-trained models_, that is, models that have already been trained on large-scale NLP tasks and thus contain representations that typically provide useful starting points for new tasks.

## Text representation for pre-trained models

When using pre-trained models, we must pre-process the text exactly as the model expects. In other words, we must use the tokenization, numericalization, padding, and truncation strategies that the model expects.
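Padding and truncation make every sequence the same length so that sequences can be stacked into a batch. A minimal sketch of the idea follows. The helper `pad_and_truncate`, the token ids, and the choice of `0` as padding id are assumptions made for illustration. A real tokenizer additionally takes care of special tokens when truncating.

```python
def pad_and_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    ids = ids[:max_length]                          # truncation: drop the overflow
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask

short_ids = [7, 8, 9]                    # made-up token ids
long_ids = [7, 8, 9, 10, 11, 12, 13]

print(pad_and_truncate(short_ids, max_length=5))
print(pad_and_truncate(long_ids, max_length=5))
```

The attention mask tells the model which positions hold real tokens and which are padding to be ignored.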

In [30]:
from transformers import AutoTokenizer

In [31]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [32]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [33]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [34]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
})

## Fine-tune a model

We'll fine-tune a BERT model on our IMDB dataset. (Note that this is where it's best to use a sample of the dataset. Otherwise the training process will take a long time.)

In [35]:
from transformers import AutoModelForSequenceClassification

**Define the model and its preprocessing steps**

In [36]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [37]:
#trainer.model

**Set up our evaluation metric**

In [38]:
import evaluate
metric = evaluate.load("accuracy")

In [39]:
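def compute_metrics(eval_pred):
    # eval_pred holds the model's raw outputs (logits) and the true labels
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)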
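To make the metric concrete, here is the same computation written out by hand on a few made-up logits, using plain Python instead of NumPy and `evaluate`. The helpers `argmax` and `accuracy` and the toy values are our own illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# One row of two logits (negative, positive) per review; values are made up
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.1]]
toy_labels = [0, 1, 0, 0]

print(accuracy(toy_logits, toy_labels))
```

Three of the four toy predictions match their labels, giving an accuracy of 0.75.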

**Configure the training process**

In [40]:
from transformers import TrainingArguments, Trainer

In [41]:
#?TrainingArguments

In [42]:
training_args = TrainingArguments(output_dir=".", num_train_epochs=1, evaluation_strategy="epoch", report_to='all')

In [43]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,
)

**Train and evaluate the model**

In [44]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
    There is an imbalance between your GPUs. You may want to exclude GPU 1 which
    has less than 75% of the memory or cores of GPU 0. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
***** Running training *****
  Num examples = 4000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 250


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.256287        | 0.898    |


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=250, training_loss=0.4057184753417969, metrics={'train_runtime': 113.0898, 'train_samples_per_second': 35.37, 'train_steps_per_second': 2.211, 'total_flos': 1052444221440000.0, 'train_loss': 0.4057184753417969, 'epoch': 1.0})
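The number of optimization steps reported in the log above follows directly from the dataset size and the batch size. The short calculation below reproduces it. The number of devices is not printed in the log and is inferred here from the total batch size, so treat it as an assumption.

```python
import math

num_examples = 4000          # "Num examples" in the log
per_device_batch_size = 8    # "Instantaneous batch size per device"
num_devices = 2              # assumption: inferred from total batch size 16 / 8
grad_accum_steps = 1         # "Gradient Accumulation steps"
num_epochs = 1               # "Num Epochs"

total_batch_size = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_examples / total_batch_size)
total_steps = steps_per_epoch * num_epochs

print(total_batch_size, total_steps)
```

With only 250 steps in total, the run is probably shorter than the trainer's logging interval, which would explain why the training loss column shows "No log".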

### Use the model on new data

In [51]:
test_data = ["This movie was pretty good.", "Not my cup of tea"]

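In [53]:
# Keep the attention mask together with the input ids, so that padded positions
# are ignored, and move the tensors to the same device as the model
test_data = tokenizer(test_data, return_tensors="pt", padding=True).to(model.device)

In [54]:
outputs = model(**test_data)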

In [55]:
# Predictions
outputs.logits.argmax(-1)

tensor([1, 0], device='cuda:0')
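The `argmax` only tells us which class scored highest. To get a sense of how confident the model is, the logits can be passed through a softmax to obtain probabilities. The sketch below shows the computation in plain Python on made-up logits. The values are not the model's actual outputs.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits, one row of (negative, positive) scores per review
toy_logits = [[-1.2, 1.5], [0.8, -0.4]]

for row in toy_logits:
    probs = softmax(row)
    predicted_class = probs.index(max(probs))
    print(predicted_class, [round(p, 3) for p in probs])
```

A prediction backed by a probability close to 0.5 deserves less trust than one close to 1, even though both yield the same class label.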