# Natural Language Processing using BERT

Please study AMA Lecture 12 "Natural Language Processing Using BERT" before practicing this code.

In addition to the `tensorflow` and `keras` packages, this code also requires two new packages:
+ `tensorflow_hub` -- "a repository of trained machine learning models ready for fine-tuning and deployable anywhere" (https://www.tensorflow.org/hub)
  + install `tensorflow-hub` via Anaconda Navigator
  + after that, downgrade package `tensorflow-estimator` to version 2.3.0 (because the newer versions are buggy)
+ `bert`, which we'll install via "pip install" (see later in this code)

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Need tf version >=2.0
import tensorflow as tf
print("TF version: ", tf.__version__)

TF version:  2.6.0


In [4]:
# Need hub version >=0.7
import tensorflow_hub as hub
print("Hub version: ", hub.__version__)

2021-11-30 20:20:22.761897: E tensorflow/core/lib/monitoring/collection_registry.cc:77] Cannot register 2 metrics with the same name: /tensorflow/api/estimator


AlreadyExistsError: Another metric with the same name already exists.

In [None]:
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import optimizers

## Case study: the IMDB dataset

This is a widely used large dataset for text mining from a [2011 ACL meeting paper](https://ai.stanford.edu/~amaas/data/sentiment/) by Maas et al. I processed the data so it fits in a single CSV file 'IMDB_small.csv'.

The original dataset has 50,000 balanced records, and the data file takes too long to upload. For our course, I randomly sampled 10,000 records and saved them in file 'IMDB_small.csv'. This is still a balanced sample, where the first 5000 are negative reviews and the rest are positive reviews.

In [None]:
# load the IMDB dataset
df = pd.read_csv('IMDB_small.csv')
df.head()

In [None]:
df.sentiment.value_counts()

In [None]:
# one negative example:
import textwrap
print(textwrap.fill(df.review[2], 80))

In [None]:
# one positive example:
print(textwrap.fill(df.review[5000], 80))

In [None]:
# The following constants make it easier for you to adapt
# this file to other text mining datasets.
DATA_COLUMN = 'review'
LABEL_COLUMN = 'sentiment'
label_list = [0, 1] #0-negative, 1-positive

## Introducing BERT

**BERT (Bidirectional Encoder Representations from Transformers)** is a pre-trained Transformer encoder that is widely used as a feature extraction model for natural language.

Some resources on BERT:
- See BERT on paper: https://arxiv.org/pdf/1810.04805.pdf
- See BERT on GitHub: https://github.com/google-research/bert
- See BERT on TensorHub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1
- See 'old' use of BERT for comparison: https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb

Next, we will use BERT in four steps:
* Import and build the BERT model
* Tokenization
* Convert tokens to BERT input format
* Sentence/word embedding

## Importing and building the BERT model

This part of the code may be confusing at first. We will come back to it and explain it in more detail.

In [None]:
# !pip install sentencepiece
!pip install bert-for-tf2
import bert

In [None]:
# BERT requires a MAX_SEQ_LENGTH that can be any integer<=512.
# Here we pick a smaller number to cut down computation cost.
max_seq_length = 256

In [None]:
# BERT requires the following three types of inputs (more on them later)
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")

In [None]:
# Now we load the already pre-trained BERT layers
# Ignore the warning message, which won't affect our usage of bert
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

In [None]:
model = models.Model(inputs=[input_word_ids, input_mask, segment_ids], 
                     outputs=[pooled_output, sequence_output])

In [None]:
model.summary()

## BERT for tokenization

Import tokenizer using the original vocab file:

In [None]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = bert.bert_tokenization.FullTokenizer(vocab_file, do_lower_case)

In [None]:
# The tokenizer converts a sentence to a sequence of tokens. Here's an example:
text = "Here is an example sentence that I want to tokenize."
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)
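
For intuition about what `tokenizer.tokenize` does to each word, here is a minimal sketch of greedy longest-match-first WordPiece splitting. The tiny vocabulary `toy_vocab` and the helper `wordpiece` are hypothetical stand-ins, not the tokenizer shipped with BERT, which also handles lowercasing and punctuation and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched at this position: treat the word as unknown
        pieces.append(piece)
        start = end
    return pieces

# Made-up vocabulary, for illustration only
toy_vocab = {"token", "##ize", "##s", "sentence", "example", "play", "##ing"}

print(wordpiece("tokenize", toy_vocab))  # ['token', '##ize']
print(wordpiece("playing", toy_vocab))   # ['play', '##ing']
print(wordpiece("zebra", toy_vocab))     # ['[UNK]']
```

Rare words are broken into smaller known pieces, which is why a review usually yields more tokens than it has words.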

Now we tokenize every review in the IMDB dataset. This may take a minute to finish.

In [None]:
df['tokens'] = df[DATA_COLUMN].apply(lambda x : tokenizer.tokenize(x))

In [None]:
# An example of what the tokens for a review look like:
print(df['tokens'][2])

In [None]:
# Some reviews are long. For example:
len(df['tokens'][0])

In [None]:
# We now truncate any review with more than (max_seq_length - 2) tokens
# and add the special tokens [CLS] and [SEP].

def truncate_and_add(x, max_seq_length):
  a = ["[CLS]"] + x
  if len(a)>max_seq_length-1:
    a[max_seq_length-1] = "[SEP]"
    return a[:max_seq_length]
  else:
    return a + ["[SEP]"]

df['tokens'] = df['tokens'].apply(lambda x : truncate_and_add(x, max_seq_length))
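
To see how `truncate_and_add` behaves at the boundary, the sketch below repeats the function so it runs on its own and applies it to made-up token lists with a small `max_seq_length` of 6 (hypothetical values chosen for readability).

```python
def truncate_and_add(x, max_seq_length):
    # Same logic as the cell above, repeated so this example is self-contained
    a = ["[CLS]"] + x
    if len(a) > max_seq_length - 1:
        a[max_seq_length - 1] = "[SEP]"
        return a[:max_seq_length]
    return a + ["[SEP]"]

short = ["a", "good", "film"]                          # 3 tokens: fits with room to spare
exact = ["a", "very", "good", "film"]                  # 4 tokens = 6 - 2: fits exactly
long_ = ["a", "very", "good", "film", "indeed", "!"]   # 6 tokens: gets truncated

for toks in (short, exact, long_):
    out = truncate_and_add(list(toks), 6)
    print(len(out), out)
# 5 ['[CLS]', 'a', 'good', 'film', '[SEP]']
# 6 ['[CLS]', 'a', 'very', 'good', 'film', '[SEP]']
# 6 ['[CLS]', 'a', 'very', 'good', 'film', '[SEP]']
```

In every case the result starts with `[CLS]`, ends with `[SEP]`, and has at most `max_seq_length` tokens.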

## Converting tokens to BERT input format

We'll need to transform our data into a format BERT understands. This involves two steps. First, we create `InputExamples` using the constructor provided in the BERT library.

- `text_a` is the text we want to classify, which in this case is the `review` field in our DataFrame.
- `text_b` is used if we're training a model to understand the relationship between sentences (e.g., is `text_b` a translation of `text_a`? Is `text_b` an answer to the question asked by `text_a`?). This doesn't apply to our task, so we can leave `text_b` blank.
- `label` is the target in supervised learning, which is `sentiment` in our example.

To use BERT embeddings, we need to convert the tokens of each text input into the following format:
 - input token ids (the tokenizer converts tokens to ids using the vocab file)
 - input masks (1 for real tokens, 0 for padding)
 - segment ids (used when the input contains two texts: 0 for tokens of the first text, 1 for tokens of the second; all 0 for a single text)
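
As a small illustration of these three arrays, the sketch below encodes one short token list and pads it to a length of 8. The vocabulary `toy_vocab` is made up for this example; the real ids come from BERT's vocab file.

```python
# Hypothetical vocabulary; in BERT's real vocab file [PAD] also has id 0
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "great": 3, "movie": 4}
max_len = 8
tokens = ["[CLS]", "great", "movie", "[SEP]"]

pad = max_len - len(tokens)
input_ids = [toy_vocab[t] for t in tokens] + [0] * pad   # pad with id 0
input_mask = [1] * len(tokens) + [0] * pad               # 1 = real token, 0 = padding
segment_ids = [0] * max_len                              # single text: everything is segment 0

print(input_ids)    # [1, 3, 4, 2, 0, 0, 0, 0]
print(input_mask)   # [1, 1, 1, 1, 0, 0, 0, 0]
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 0, 0]
```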


Define some functions for ease of preprocessing:

In [None]:
def get_ids(tokens, tokenizer, max_seq_length):
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    token_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return np.array(token_ids, dtype=np.int32)
    
def get_masks(tokens, max_seq_length):
    token_masks = [1]*len(tokens) + [0] * (max_seq_length - len(tokens))
    return np.array(token_masks, dtype=np.int32)

def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    segments = segments + [0] * (max_seq_length - len(tokens))
    return np.array(segments, dtype=np.int32)


In [None]:
df['ids'] = df['tokens'].apply(lambda x : get_ids(x, tokenizer, max_seq_length))
df['masks'] = df['tokens'].apply(lambda x : get_masks(x, max_seq_length))
df['segments'] = df['tokens'].apply(lambda x : get_segments(x, max_seq_length))

In [None]:
# Let's see what the first movie review is now converted to:
df.iloc[0]

In [None]:
# Now assemble the data as required by the definition of BERT inputs.
# The arrays use int32 to match the dtype declared for the input layers.
n = df.shape[0]
all_ids = np.zeros(shape=(n, max_seq_length), dtype=np.int32)
all_masks = np.zeros(shape=(n, max_seq_length), dtype=np.int32)
all_segments = np.zeros(shape=(n, max_seq_length), dtype=np.int32)
for i, (index, row) in enumerate(df.iterrows()):
  all_ids[i] = row.ids
  all_masks[i] = row.masks
  all_segments[i] = row.segments


## Using the pre-trained BERT model for sentence embedding

BERT converts each text input (in our example, a tokenized movie review) into the following.
* **pooled output** (also called pooled embedding, sentence embedding): this is a vector of size `768`, which represents the whole sentence.
* **sequence outputs** (also called sequence embeddings, word embeddings): this is a matrix of size `[max_seq_length, 768]`, where each token is now represented by a vector of size `768`.

**For sentiment analysis, we only need the pooled output.**

Similar to other deep learning models, BERT doesn't transform text one record at a time. Instead, BERT takes a batch of texts (e.g., a batch of movie reviews in our case) and converts them all at once. Thus the output shapes are:
 - pooled output of shape `[batch_size, 768]` with representations for the entire input sequences
 - sequence output of shape `[batch_size, max_seq_length, 768]`
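
To make the relation between the two outputs concrete: in BERT, the pooled output is computed from the final hidden vector of the first token (`[CLS]`) by a dense layer with a `tanh` activation. The NumPy sketch below mimics only the shapes and that final step. It uses random numbers in place of real BERT activations and weights, so `W`, `b`, and the batch of 4 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_seq_length, hidden = 4, 256, 768

# Stand-in for BERT's sequence output: one 768-vector per token position
sequence_output = rng.normal(size=(batch_size, max_seq_length, hidden)).astype(np.float32)

# Hypothetical pooler weights (the real ones are learned during pre-training)
W = rng.normal(scale=0.02, size=(hidden, hidden)).astype(np.float32)
b = np.zeros(hidden, dtype=np.float32)

cls_vectors = sequence_output[:, 0, :]         # hidden state at the [CLS] position
pooled_output = np.tanh(cls_vectors @ W + b)   # one 768-vector per review

print(sequence_output.shape)  # (4, 256, 768)
print(pooled_output.shape)    # (4, 768)
```

Because of the `tanh`, every entry of the pooled output lies between -1 and 1.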

### A big data problem

The output size from BERT can be huge. For example, in our dataset of 10000 movie reviews, where each review has a (truncated) length of 256 tokens and each value is a 4-byte float, the total size of the sequence embeddings is: `10000 * 256 * 768 * 4 bytes ~= 8 Gigabytes`. This is too large to fit in the memory of most personal computers. So the following single line of code will likely trigger a "ResourceExhaustedError".

`pool_embs, seq_embs = model.predict([all_ids,all_masks,all_segments])`
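
The size estimate can be reproduced with a few lines of arithmetic, assuming each embedding value is stored as a 4-byte float32. The variable names below are chosen for this example only.

```python
n_reviews = 10_000
max_seq_length = 256
hidden_size = 768
bytes_per_float32 = 4  # assumption: embeddings are stored as float32

seq_bytes = n_reviews * max_seq_length * hidden_size * bytes_per_float32
pool_bytes = n_reviews * hidden_size * bytes_per_float32

print(f"sequence embeddings: {seq_bytes / 1e9:.2f} GB")  # sequence embeddings: 7.86 GB
print(f"pooled embeddings:   {pool_bytes / 1e6:.2f} MB")  # pooled embeddings:   30.72 MB
```

The pooled embeddings are 256 times smaller, which is why keeping only them is feasible.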

Below is a workaround to avoid this big data problem. We process our data 1000 records at a time, i.e., we set the batch size to 1000. After each batch is processed, we discard the sequence embeddings because we don't need them, and only save the pooled embeddings.

In [None]:
pool_embs = np.zeros(shape=(n,768))
for i in np.arange(10):
  j = i*1000
  pool_embs[j:j+1000], seq_embs = model.predict([all_ids[j:j+1000],
                                                 all_masks[j:j+1000],
                                                 all_segments[j:j+1000]])
  print(f'{i+1}/10 of the data processed.')
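
The loop above relies on the dataset having exactly 10 batches of 1000 records. As a hedged sketch of how the same idea generalizes, the hypothetical helper `batch_slices` below yields index ranges that also cover a dataset whose size is not a multiple of the batch size.

```python
def batch_slices(n, batch_size):
    """Yield (start, stop) index pairs that cover range(n) in order."""
    for start in range(0, n, batch_size):
        yield start, min(start + batch_size, n)

# Example: 2,500 records in batches of 1,000 -> the last batch is smaller
for start, stop in batch_slices(2500, 1000):
    print(start, stop)
# 0 1000
# 1000 2000
# 2000 2500
```

Each `(start, stop)` pair could then be used to slice `all_ids`, `all_masks`, and `all_segments` before calling `model.predict`, in the same way `j` and `j+1000` are used above.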

In [None]:
pool_embs.shape

In [None]:
pool_embs[0]

## Assembling a new dataset with features extracted by BERT

For each text, the corresponding pooled output is a vector of 768 numbers that summarizes the whole text. We can now treat these 768 numbers as features extracted by BERT. Let's assemble a new DataFrame with these features and the sentiment data.

In [None]:
feature_df = pd.DataFrame(pool_embs)
feature_df.head()

In [None]:
feature_df['sentiment'] = df['sentiment']
feature_df.head()

In [None]:
# Warning: this file will be large, about 150MB
feature_df.to_csv("IMDB_small_BERT.csv", index=False)

## Building and evaluating the prediction model

The rest is similar to what we did with the business loan dataset earlier this semester. I'll use a simple neural network with one hidden layer and a sigmoid output.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = feature_df.drop(columns=['sentiment'])
y = feature_df['sentiment']

# reserve 30% dataset as testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=1,
                                                    stratify=y)

In [None]:
model2 = models.Sequential()
model2.add(layers.Dense(128, activation='relu', input_dim=768))
# model2.add(layers.Dropout(0.5))
model2.add(layers.Dense(1, activation='sigmoid'))

In [None]:
model2.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

In [None]:
model2.summary()
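
As a cross-check on the summary, the trainable parameter count of the two `Dense` layers can be worked out by hand. `dense_params` is a hypothetical helper written for this example, not part of Keras.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

hidden = dense_params(768, 128)   # Dense(128) on the 768 BERT features
output = dense_params(128, 1)     # Dense(1) sigmoid output
print(hidden, output, hidden + output)  # 98432 129 98561
```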

In [None]:
model2.fit(X_train, y_train, epochs=30)

In [None]:
test_loss, test_acc = model2.evaluate(X_test,  y_test, verbose=2)

In [None]:
# prediction
model2.predict(X_test.iloc[[0]])

In [None]:
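# Use positional indexing so this matches X_test.iloc[[0]] above;
# y_test keeps the shuffled row labels from the original DataFrame.
print(y_test.iloc[0])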
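
The model's sigmoid output is a probability rather than a class label. A common convention, assumed here and not stated in this notebook, is to predict the positive class when the probability is at least 0.5. The helper `to_label` and the probabilities below are made up for illustration.

```python
def to_label(prob, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a class label (0 = negative, 1 = positive)."""
    return int(prob >= threshold)

example_probs = [0.03, 0.49, 0.50, 0.97]   # made-up model outputs
print([to_label(p) for p in example_probs])  # [0, 0, 1, 1]
```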