# **A Visual Notebook for Using BERT.**

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-sentence-classification.png" />

In this notebook, we will use a pre-trained deep learning model to process text, then use that model's output to classify it. The text is a list of sentences from film reviews, and we will classify each sentence as speaking either "**positively**" or "**negatively**" about its subject.

## **Models: Sentence Sentiment Classification.**

Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either **1** (indicating the sentence has a positive sentiment) or **0** (indicating the sentence has a negative sentiment). We can think of it as looking like this:

<img src="https://jalammar.github.io/images/distilBERT/sentiment-classifier-1.png" />

Under the hood, the classifier is made up of two models.

* **DistilBERT** processes the sentence and passes some extracted information to the next model. DistilBERT is a smaller version of BERT developed and open-sourced by the team at Hugging Face. It's a lighter and faster version of BERT that roughly matches its performance.

* In the next step, a basic **Logistic Regression** model from the scikit-learn library takes the result of DistilBERT's processing and classifies the sentence as either positive or negative (1 or 0, respectively).

The data passed between the two models is a vector of size 768 for each sentence. We can think of this vector as an embedding of the sentence that we can use for classification.

<img src="https://jalammar.github.io/images/distilBERT/distilbert-bert-sentiment-classifier.png" />
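To make the hand-off concrete, here is a minimal sketch of how a logistic regression turns one 768-dimensional vector into a 0 or a 1. The embedding and the classifier parameters below are random placeholders (assumptions for illustration only), not values produced by DistilBERT or by a trained model.

```python
import numpy as np

EMBEDDING_SIZE = 768  # size of the vector DistilBERT hands to the classifier

rng = np.random.default_rng(0)

# Stand-ins for real values: a random "sentence embedding" and random,
# untrained classifier parameters (both hypothetical).
sentence_embedding = rng.normal(size=EMBEDDING_SIZE)
weights = rng.normal(size=EMBEDDING_SIZE)
bias = 0.0

# Logistic regression: weighted sum -> sigmoid -> threshold at 0.5.
score = sentence_embedding @ weights + bias
probability_positive = 1.0 / (1.0 + np.exp(-score))
label = int(probability_positive >= 0.5)

print(sentence_embedding.shape, label)
```

Training the logistic regression amounts to finding values of `weights` and `bias` that make these labels agree with the dataset.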

## **Dataset.**

The dataset we will use in this example is [**SST2**](https://nlp.stanford.edu/sentiment/index.html), which contains sentences from movie reviews, each labeled as either positive or negative.


<table class="features-table">
  <tr>
    <th class="mdc-text-light-green-600">
    sentence
    </th>
    <th class="mdc-text-purple-600">
    label
    </th>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      apparently reassembled from the cutting room floor of any given daytime soap
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      they presume their audience won't sit still for a sociology lesson
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      this is a visually stunning rumination on love , memory , history and the war between art and commerce
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      jonathan parker 's bartleby should have been the be all end all of the modern office anomie films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
</table>

### **Install the Transformers Library.**

Let's start by installing the Hugging Face transformers library so we can load our deep learning NLP model.

In [None]:
!pip install transformers

In [2]:
# Import Library.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import torch
import transformers
import warnings

warnings.filterwarnings("ignore")

In [3]:
# Import Dataset.
data = pd.read_csv(
    "https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv",
    delimiter="\t",
    header=None,
)

# For performance reasons, we'll only use 2,000 sentences from the dataset.
data = data[:2000]

print(data)

                                                      0  1
0     a stirring , funny and finally transporting re...  1
1     apparently reassembled from the cutting room f...  0
2     they presume their audience wo n't sit still f...  0
3     this is a visually stunning rumination on love...  1
4     jonathan parker 's bartleby should have been t...  1
...                                                 ... ..
1995  too bland and fustily tasteful to be truly pru...  0
1996                         it does n't work as either  0
1997  this one aims for the toilet and scores a dire...  0
1998  in the name of an allegedly inspiring and easi...  0
1999  the movie is undone by a filmmaking methodolog...  0

[2000 rows x 2 columns]


In [4]:
# Class Frequency.
data[1].value_counts()

1    1041
0     959
Name: 1, dtype: int64
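The two classes are roughly balanced. As a rough reference point, the sketch below computes the accuracy of a hypothetical baseline that always predicts the more frequent label. This baseline is an addition for illustration and not part of the pipeline in this notebook; the counts are copied from the output above.

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 1041, 0: 959}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the majority label.
baseline_accuracy = class_counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 4))
```

Any useful classifier on these 2,000 sentences should score clearly above this value.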

## **Load the Pre-trained BERT Model.**

In [5]:
# Load DistilBERT.
model_class, tokenizer_class, pretrained_weights = (
    transformers.DistilBertModel,
    transformers.DistilBertTokenizer,
    "distilbert-base-uncased",
)

# Load pre-trained model/tokenizer.
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The variable `model` now holds the pre-trained DistilBERT, a version of BERT that is smaller, much faster, and requires much less memory.

## **Prepare the Dataset.**

Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the required format.

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />

In [6]:
"""
Step 1: Tokenization.
The first step is to tokenize the sentences, breaking them up into words and subwords in the format BERT expects.
"""

tokenized = data[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

"""
Step 2: Padding.
After tokenization, each sentence is represented as a list of token ids. We want BERT to process our examples all at once (as one batch).
Therefore, we need to pad all lists to the same size to represent the input as one 2D array rather than a list of lists (of different lengths).
"""

# Length of the longest tokenized sentence.
max_len = max(len(ids) for ids in tokenized.values)

# Right-pad every list with 0 so that all rows have length max_len.
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])

"""
Step 3: Masking.
If we sent `padded` directly to BERT, the padding would slightly confuse it. We need to create another variable that tells
it to ignore (mask) the padding we've added when it processes its input. That's what attention_mask is:
"""

attention_mask = np.where(padded != 0, 1, 0)
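The padding and masking steps are easier to follow on a tiny input. The sketch below applies the same two operations to three made-up lists of token ids; the ids are hypothetical, with 101 and 102 standing in for the special tokens added at the start and end of each sentence.

```python
import numpy as np

# Hypothetical token ids for three short sentences. 101 and 102 stand in
# for the special start and end tokens; the other ids are made up.
tokenized = [
    [101, 2023, 102],
    [101, 2023, 2003, 2307, 102],
    [101, 102],
]

max_len = max(len(ids) for ids in tokenized)

# Right-pad with 0 so every row has the same length.
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized])

# 1 marks a real token, 0 marks padding.
attention_mask = np.where(padded != 0, 1, 0)

print(padded)
print(attention_mask)
```

Every row of `padded` now has five entries, and `attention_mask` has a 0 exactly where a padding 0 was added.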

## **Step #1: Sentence Embeddings.**

Now that we have our model and inputs ready, let's run the model!

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />

Calling `model()` runs our sentences through DistilBERT. The results of the processing are stored in `last_hidden_states`.

In [7]:
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Let's slice out only the part of the output that we need: the output corresponding to the first token of each sentence. For sentence classification, BERT adds a special token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

We'll save those in the `features` variable, as they'll serve as the features of our logistic regression model.

In [8]:
features = last_hidden_states[0][:, 0, :].numpy()
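The slice `[:, 0, :]` reads as "all sentences, first token position, all hidden units". A minimal sketch on a stand-in array shows the effect on the shape; the array below is filled by hand and uses NumPy rather than a real model output, and the batch size and sequence length are arbitrary assumptions.

```python
import numpy as np

# Stand-in for the model output: 2 sentences, 5 token positions, hidden size 768.
batch_size, seq_len, hidden_size = 2, 5, 768
hidden_states = np.zeros((batch_size, seq_len, hidden_size))

# Mark position 0 of each sentence so we can see what the slice picks up.
hidden_states[:, 0, :] = 1.0

# All sentences, first token position only, every hidden unit.
cls_embeddings = hidden_states[:, 0, :]

print(hidden_states.shape)
print(cls_embeddings.shape)
```

The token dimension disappears, leaving one 768-dimensional vector per sentence.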

The labels, which indicate whether each sentence is positive or negative, go into the `labels` variable.

In [9]:
labels = data[1]

## **Step #2: Split Dataset into Training and Test Set.**

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />

In [10]:
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, train_size=0.75, stratify=labels, random_state=1
)
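To see what `train_size=0.75` and `stratify=labels` do, here is a small sketch that runs the same call on toy data. The eight samples and their labels are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 samples with 2 features each, and balanced labels (all made up).
toy_features = np.arange(16).reshape(8, 2)
toy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_features, toy_labels, train_size=0.75, stratify=toy_labels, random_state=1
)

# Sizes of the two parts, then the per-class counts in each part.
print(len(toy_train_y), len(toy_test_y))
print(np.bincount(toy_train_y), np.bincount(toy_test_y))
```

Six samples go to training and two to testing, and stratification keeps the two classes equally represented in both parts.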

## **Step #3: Model Training.**

<img src="https://jalammar.github.io/images/distilBERT/bert-training-logistic-regression.png" />

In [11]:
lr_clf = LogisticRegression().fit(train_features, train_labels)

## **Step #4: Model Evaluation.**

In [12]:
lr_clf.score(test_features, test_labels)

0.842
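For a classifier, `score` returns the mean accuracy: the fraction of test sentences whose predicted label matches the true label. The sketch below spells that out in plain Python on made-up predictions, then relates the reported score to the size of the test set. The helper `accuracy` is a hypothetical name used only for this illustration.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that equal the true labels."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)


# Toy predictions and labels (made up for illustration): 3 of 4 match.
toy_predictions = [1, 0, 1, 1]
toy_labels = [1, 0, 0, 1]
print(accuracy(toy_predictions, toy_labels))

# With train_size=0.75, the 2,000 sentences leave 500 for testing,
# so a score of 0.842 corresponds to 421 correct predictions.
test_size = 2000 - int(0.75 * 2000)
print(test_size, round(0.842 * test_size))
```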

### **Proper SST2 Scores.**

For reference, the [**highest accuracy score**](http://nlpprogress.com/english/sentiment_analysis.html) for this dataset is currently **96.8**. DistilBERT can be trained to improve its score on this task through a process called **fine-tuning**, which updates the model's weights so that it performs better on this sentence classification task (which we can call the downstream task). The fine-tuned DistilBERT achieves an accuracy score of **90.7**, and the full-size BERT model achieves **94.9**.


# **References.**

> [**A Visual Guide to Using BERT for the First Time - Jay Alammar**](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)