A.S. Lundervold, v.011122

# Introduction

This is a quick example of some techniques and ideas from natural language processing (NLP) and some approaches to NLP based on deep learning. The goal is to introduce some of the things going on in this field and for you to better understand some recent ideas and developments in deep learning.

> NLP is an exciting area these days. Breakthroughs in deep learning for language processing recently initiated a revolution in NLP, and we're still in it. The best place to start exploring this is perhaps the HuggingFace community and library (at least if you want to get started right away playing around with using state-of-the-art NLP models): https://huggingface.co/. <br> <a href="https://huggingface.co/"><img width=20% src="https://luxcapital-website-media.s3.amazonaws.com/wp-content/uploads/2019/12/23115642/Logo-600x554.png"></a>

# Setup

In [None]:
# This is a quick check of whether the notebook is currently running on Google Colaboratory
# or on Kaggle, as that makes some difference for the code below.
# We'll do this in every notebook of the course.
try:
    # The Colab runtime ships the `google.colab` package
    import google.colab
    colab = True
except ImportError:
    colab = False

import os
kaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

In [None]:
import numpy as np

We'll use the excellent HuggingFace Transformers library, which covers all our natural language processing needs:

<img src="https://camo.githubusercontent.com/b253a30b83a0724f3f74f3f58236fb49ced8d7b27cb15835c9978b54e444ab08/68747470733a2f2f68756767696e67666163652e636f2f64617461736574732f68756767696e67666163652f646f63756d656e746174696f6e2d696d616765732f7265736f6c76652f6d61696e2f7472616e73666f726d6572735f6c6f676f5f6e616d652e706e67">


We will not cover the library in any detail. If you're interested, take a look at the [HuggingFace course](https://huggingface.co/course/chapter1/1) and its excellent documentation over at https://huggingface.co/transformers.

# Load data

We'll use the [IMDB dataset](https://huggingface.co/datasets/imdb) containing 50,000 movie reviews from IMDB, each labeled as either negative (0) or positive (1). It is split into 25,000 reviews for training and 25,000 reviews for testing.

The dataset is available via HuggingFace `datasets`:

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("imdb")

In [None]:
dataset

## Make a sample dataset

As the training process takes a long time, we create a small sample dataset:

In [None]:
sample = True

In [None]:
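if sample:
    # Keep a random 20% of the original training set...
    dataset = dataset['train']
    dataset = dataset.train_test_split(train_size=0.2, shuffle=True, seed=42)['train']
    # ...and split that sample 80/20 into new train and test sets
    dataset = dataset.train_test_split(test_size=0.2, seed=42)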

In [None]:
dataset
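
To make the sizes concrete, here is a small sketch of the same two-stage split on plain index lists. It does not use the `datasets` library: the helper `split_indices` is a hypothetical stand-in for `train_test_split`, so the selected reviews will differ from the ones above, but the sizes should match.

```python
import random


def split_indices(n, train_fraction, seed):
    """Shuffle the indices 0..n-1 and cut them into a train and a test part."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_fraction)
    return indices[:n_train], indices[n_train:]


n_original_train = 25_000

# Stage 1: keep 20% of the original training set
sample_idx, _ = split_indices(n_original_train, train_fraction=0.2, seed=42)

# Stage 2: split the sample 80/20 into new train and test sets
train_pos, test_pos = split_indices(len(sample_idx), train_fraction=0.8, seed=42)
train_idx = [sample_idx[i] for i in train_pos]
test_idx = [sample_idx[i] for i in test_pos]

print(len(sample_idx), len(train_idx), len(test_idx))
```

With `sample = True` we therefore train on a few thousand reviews rather than on all 25,000.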

# Explore the data

The training data is stored under `train`, the test data under `test`.

Here are two training instances:

In [None]:
dataset['train'][10:12]

We can print one of them in a more readable form:

In [None]:
from pprint import pprint

In [None]:
pprint(dataset['train'][10])

> **How do we represent the text for consumption by a machine learning model?**

> **How can a computer read??**

<img src="https://camo.githubusercontent.com/7d5ed540c87d660cae46ca0d2055d760f786bea36513bb1a0b0784d47cef45b1/687474703a2f2f322e62702e626c6f6773706f742e636f6d2f5f2d2d75564865746b5549512f54446165356a476e6138492f4141414141414141414b302f734253704c7564576d63772f73313630302f72656164696e672e676966">

# Prepare the data: tokenization and numericalization

For a computer, everything is numbers. We have to convert the text to a series of numbers and then feed those to the computer.

This can be done in two widely used steps in natural language processing: **tokenization** and **numericalization**.

## Tokenization

In tokenization, the text is split into smaller units called tokens, such as words, parts of words, or punctuation marks. A simple way to achieve this is to split on the spaces in the text. But then, among other things, punctuation stays glued to the neighboring words, and we miss the fact that some words are contractions of multiple words (for example, "isn't" and "don't").
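
A small sketch of the problem, using only the Python standard library. The example sentence and the regular expression are our own illustration, not what any of the libraries below actually do:

```python
import re

sentence = "This isn't bad, is it?"

# Naive approach: split on whitespace only.
# Punctuation stays attached to the neighboring word ("bad," and "it?").
whitespace_tokens = sentence.split()

# A slightly better rule: a token is either a run of word characters
# (optionally with an internal apostrophe part, as in "isn't"),
# or a single punctuation character.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

Even this tiny rule set involves design decisions, such as whether "isn't" should be one token or two. Real tokenizers encode many more of them.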

<img src="https://camo.githubusercontent.com/6c79dd15098f840a49149649832fa0efd7252d71d03257b5fc96379f7456d889/68747470733a2f2f73706163792e696f2f746f6b656e697a6174696f6e2d35376536313862643739643933336334636364333038623537333930363264362e737667">

Multiple tokenization strategies can tackle these and other issues, for example, **rule-based splitting of sentences** (used by ULMFiT, Transformer XL, and others), **Byte-Pair Encoding** (used by GPT-2 and others), **WordPiece** (used by BERT and others), and **SentencePiece** (used by XLNet, T5, and others).

### Rule-based splitting of sentences into words

The NLP library `spaCy` can help us with this kind of tokenization. We install spaCy and download a set of rules for tokenizing English text:

In [None]:
%%capture
import sys
%pip install spacy
!{sys.executable} -m spacy download en_core_web_sm

In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
example_sentence = "Here's a sentence to be tokenized by a tokenizer, and it includes the non-existent word graffalacticus"

In [None]:
doc = nlp(example_sentence)
for token in doc:
    print(token.text)

### Subword tokenization

In [None]:
from transformers import BertTokenizer

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [None]:
tokenizer.tokenize(example_sentence)
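
BERT's WordPiece tokenizer handles words it has not seen by splitting them into known pieces, greedily taking the longest match first and marking continuation pieces with `##`. Here is a minimal sketch of that lookup step. The function `wordpiece_tokenize` and the tiny vocabulary are made up for illustration; the real vocabulary has roughly 30,000 entries, and the real tokenizer also normalizes and pre-splits the text, which we skip here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary
toy_vocab = {"token", "##izer", "##ize", "##r", "graf", "##fa", "##la"}

print(wordpiece_tokenize("tokenizer", toy_vocab))
print(wordpiece_tokenize("graffala", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

This is why a made-up word can still be represented: it is broken into pieces that do exist in the vocabulary.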

### Byte-Pair encoding: an example of training an encoder

In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE

In [None]:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]"])

In [None]:
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

In [None]:
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

In [None]:
tokenizer.train_from_iterator(dataset['train']['text'],trainer=trainer)

In [None]:
example_sentence_bpe = tokenizer.encode(example_sentence)

In [None]:
example_sentence_bpe

In [None]:
example_sentence_bpe.tokens

In [None]:
example_sentence_bpe.ids[:15]
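
To get a feeling for what the trainer does, here is a toy sketch of the core Byte-Pair Encoding training loop in plain Python: start from single characters, repeatedly find the most frequent pair of adjacent symbols, and merge it into a new symbol. The word counts and the helper names (`get_pair_counts`, `merge_pair`) are made up for illustration. The `tokenizers` library is far more efficient and handles many details that are ignored here.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs in a corpus given as {symbol tuple: frequency}."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical toy corpus: word -> number of occurrences
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

# Start with every word split into single characters
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):
    pair_counts = get_pair_counts(words)
    best_pair = max(pair_counts, key=pair_counts.get)  # most frequent pair
    merges.append(best_pair)
    words = merge_pair(words, best_pair)

print(merges)
print(list(words))
```

The learned list of merges is what gets applied, in order, when new text is encoded. The number of merges controls the vocabulary size.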

## Numericalization

We convert tokens to numbers by building a list of all the tokens that occur (the vocabulary) and assigning each token its own number. This has already been taken care of for us:

In [None]:
example_sentence_bpe.ids[:15]
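
Done by hand, numericalization could look like the sketch below. The token list and the function `numericalize` are made up for illustration. Tokens that are not in the vocabulary are mapped to the ID of a special unknown token.

```python
training_tokens = ["the", "movie", "was", "good", "the", "end"]

# Build the vocabulary: each new token gets the next free ID.
# ID 0 is reserved for unknown tokens.
vocab = {"[UNK]": 0}
for token in training_tokens:
    if token not in vocab:
        vocab[token] = len(vocab)


def numericalize(tokens, vocab):
    """Map tokens to their IDs, falling back to the [UNK] ID."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


ids = numericalize(["the", "movie", "was", "bad"], vocab)
print(vocab)
print(ids)
```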

# Embeddings and using pre-trained text encoders

These concepts were introduced in the lecture using the TensorFlow Embedding Projector: http://projector.tensorflow.org/

<a href="http://projector.tensorflow.org/"><img src="assets/TensorFlowProjector.png"></a>

<img src="assets/TensorFlowProjector.gif">

## Some key concepts that were mentioned

In the lecture, I told a short story about the following key concepts, widely used in modern deep learning:

* Embeddings and representations
* Word2Vec
* Language Models
* Training language models
* Reusing such text representations (and similar representations of other kinds of data, including images)
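
As a rough illustration of the first concept: an embedding is a lookup table that maps each token ID to a vector, and in a good embedding, tokens with similar meanings end up with similar vectors. The three-dimensional vectors below are invented by hand for this sketch. Real embeddings are learned from data and have hundreds of dimensions.

```python
import numpy as np

vocab = {"good": 0, "great": 1, "terrible": 2}

# Hypothetical embedding matrix: one row (vector) per token ID
embeddings = np.array([
    [0.9, 0.1, 0.0],   # good
    [0.8, 0.2, 0.1],   # great
    [-0.7, 0.1, 0.3],  # terrible
])


def embed(token):
    """Look up the vector for a token via its ID."""
    return embeddings[vocab[token]]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_good_great = cosine_similarity(embed("good"), embed("great"))
sim_good_terrible = cosine_similarity(embed("good"), embed("terrible"))
print(sim_good_great, sim_good_terrible)
```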

# Fine-tuning pre-trained models

The advent of **Transformer models** has revolutionized the field of natural language processing. When faced with an NLP task for which deep learning is applicable, most practitioners now turn to Transformer models. Furthermore, one typically uses _pre-trained models_, that is, models that have already been trained on large-scale NLP tasks and thus contain representations that typically provide useful starting points for new tasks.

## Text representation for pre-trained models

When using pre-trained models, we must pre-process the text exactly as the model expects. In other words, we must use the tokenization, numericalization, padding, and truncation strategies the model was designed for.

In [None]:
from transformers import AutoTokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [None]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [None]:
tokenized_datasets
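
Padding and truncation are needed because the model processes batches of sequences that all have the same length. Here is a minimal sketch of the idea on plain lists of token IDs. The function `pad_and_truncate` and the default `pad_id=0` are assumptions made for this illustration, and special tokens such as `[CLS]` and `[SEP]`, which the real tokenizer adds, are ignored.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, or fill it up with padding IDs."""
    token_ids = token_ids[:max_length]
    n_padding = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_padding
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(token_ids) + [0] * n_padding
    return input_ids, attention_mask


short_ids, short_mask = pad_and_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_and_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what tells the model which positions are padding. It appears as the `attention_mask` column in the tokenized dataset.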

## Fine-tune a model

We'll fine-tune a BERT model on our IMDB dataset. (Note that this is where it's best to use a sample of the dataset. Otherwise the training process will take a long time.)

In [None]:
from transformers import AutoModelForSequenceClassification

**Define the model and its preprocessing steps**

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

In [None]:
#trainer.model

**Set up our evaluation metric**

In [None]:
import evaluate
metric = evaluate.load("accuracy")

In [None]:
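def compute_metrics(eval_pred):
    # eval_pred holds the model's raw outputs (logits) and the true labels
    logits, labels = eval_pred
    # The predicted class is the one with the highest logit
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)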
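
To see what happens inside such a metric function, here is the same computation on a few made-up logits, using only NumPy. The numbers are invented for illustration.

```python
import numpy as np

# Hypothetical model outputs for four reviews: one logit per class
logits = np.array([
    [2.0, -1.0],
    [0.1, 0.3],
    [-0.5, 1.5],
    [1.2, 0.4],
])
labels = np.array([0, 1, 0, 0])

# Pick the class with the highest logit in each row
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels
accuracy = float((predictions == labels).mean())

print(predictions, accuracy)
```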

**Configure the training process**

In [None]:
from transformers import TrainingArguments, Trainer

In [None]:
#?TrainingArguments

In [None]:
training_args = TrainingArguments(output_dir=".", num_train_epochs=1, evaluation_strategy="epoch", report_to='all')

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,
)

**Train and evaluate the model**

In [None]:
trainer.train()

### Use the model on new data

In [None]:
test_data = ["This movie was pretty good.", "Not my cup of tea"]

In [None]:
# Keep the attention mask and use the model's device (also works without a GPU)
inputs = tokenizer(test_data, return_tensors="pt", padding=True).to(model.device)

In [None]:
outputs = model(**inputs)

In [None]:
# Predictions
outputs.logits.argmax(-1)
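
The logits are unnormalized scores. If we want something that reads like a probability for each class, we can apply a softmax. Here is a minimal NumPy sketch on made-up logits. The values are invented, and `softmax` is our own helper, not part of the Transformers API.

```python
import numpy as np


def softmax(logits):
    """Turn each row of logits into probabilities that sum to 1."""
    # Subtracting the row maximum avoids overflow and does not change the result
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical logits for two texts and two classes (negative, positive)
example_logits = np.array([
    [-1.0, 2.0],
    [0.5, 0.0],
])

probabilities = softmax(example_logits)
predicted_classes = probabilities.argmax(axis=-1)

print(probabilities)
print(predicted_classes)
```

Since softmax preserves the ordering within each row, the predicted class is the same as when taking the argmax of the logits directly.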