<a href="https://colab.research.google.com/github/cwmarris/pull-request-monitor/blob/master/HBAP_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a id='top'></a>

------



*Rashmi Banthia*

--------


### BERT - Bidirectional Encoder Representations from Transformers 



------

BERT - J. Devlin, et al.  - https://arxiv.org/abs/1810.04805 
(Released in Oct 2018; 12,752 citations as of 12.01.2020)

Attention is all you need - A. Vaswani, et al. - https://arxiv.org/abs/1706.03762 

Tutorials - 

(1) BERT Research series by Chris Mccormick https://www.youtube.com/playlist?list=PLam9sigHPGwOBuH4_4fr-XvDbe5uneaf6 

(2) Blog by Jay Alammar 
- http://jalammar.github.io/illustrated-bert/
- http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
- http://jalammar.github.io/illustrated-transformer/


- BERT is pretrained on two tasks: MLM (masked language modeling) and NSP (next sentence prediction). 
- Pretraining data - BooksCorpus (800M words) (Zhu et al.,
2015) and English Wikipedia (2,500M words).

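- MLM (Masked Language Model) - Randomly select 15% of the input tokens for prediction. In the paper, 80% of the selected tokens are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged. The model predicts the original tokens from the context provided by the surrounding tokens. 

- NSP (Next Sentence Prediction) - Given a pair of sentences, predict whether sentence 2 actually follows sentence 1 in the original text. 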

- Many variants exist; the most popular are BERT base and BERT large. 
  - BERT base - 12 layers, hidden size 768, 12 attention heads, 110M parameters
  - BERT large - 24 layers, hidden size 1024, 16 attention heads, 340M parameters
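
The parameter counts above can be roughly reproduced from the layer sizes. The sketch below is a back-of-the-envelope count, not code from the BERT release: the function name is hypothetical, and it assumes the 30,522-token vocabulary and 512 positions of the released uncased checkpoints, plus a feed-forward inner size of 4x the hidden size.

```python
def bert_encoder_params(layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT encoder (hypothetical helper)."""
    ffn = 4 * hidden          # assumed feed-forward inner size (4x hidden)
    layer_norm = 2 * hidden   # scale and bias of one LayerNorm

    # Token, position and segment embedding tables, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value and output projections, then a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward block: two dense layers, then a LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm

    # Pooler: one dense layer applied to the [CLS] position.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_encoder_params(layers=12, hidden=768)
large = bert_encoder_params(layers=24, hidden=1024)
print(f"BERT base:  {base:,}")   # 109,482,240
print(f"BERT large: {large:,}")  # 335,141,888
```

Under these assumptions the totals come out at about 109.5M and 335.1M, in line with the rounded sizes usually quoted for the two models. The number of attention heads does not appear in the count because the heads share the same projection matrices and only split the hidden size between them.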


![picture](https://drive.google.com/uc?export=view&id=15LyRQL3hhkblxVmsPQPWGSBdVUCJhlcO)
 
(Image source - BERT Tokenization - https://arxiv.org/pdf/1810.04805.pdf) 
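
To make the MLM objective described above more concrete, here is a minimal sketch of the masking step in plain Python. It is an illustration under stated assumptions, not the original pretraining code: the function name, the toy vocabulary and the word-level tokens are made up, and the 80/10/10 split follows the description in the BERT paper.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for one sequence (hypothetical helper)."""
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]", "[PAD]"}

    # Special tokens are never selected for prediction.
    candidates = [i for i, tok in enumerate(tokens) if tok not in special]
    n_select = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    selected = rng.sample(candidates, n_select)

    masked = list(tokens)
    labels = [None] * len(tokens)  # None means "no prediction at this position"
    for i in selected:
        labels[i] = tokens[i]      # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"            # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels


toy_vocab = ["a", "visually", "stunning", "rumination", "on", "love", "movie"]
tokens = ["[CLS]", "a", "visually", "stunning", "rumination", "on", "love", "[SEP]"]
masked, labels = mask_tokens(tokens, toy_vocab)
print(masked)
print(labels)
```

Only the selected positions carry a label, so the loss is computed on roughly 15% of the tokens in each sequence.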

# Demo

In [1]:
# Ref: https://www.youtube.com/watch?v=z6Kl52nh04U&t=773s
import tensorflow as tf

We will be using the [Huggingface](https://huggingface.co/transformers/) `transformers` library, which can be installed with `pip install transformers`.

In [2]:
!pip install -q transformers==4.5.0 tensorflow_datasets > /dev/null

### Loading IMDB dataset

In [3]:
import tensorflow_datasets as tfds 
import transformers
from transformers import BertTokenizer

from transformers import TFBertForSequenceClassification
import tensorflow as tf 
print(transformers.__version__)

4.5.0


In [4]:
tfds.disable_progress_bar()
(ds_train, ds_test), ds_info = tfds.load('imdb_reviews', 
                                         split =(tfds.Split.TRAIN, tfds.Split.TEST),
                                         as_supervised=True,
                                         with_info=True)
print('\n\n',ds_info)

#List of all datasets provided by TFDS  - tfds.list_builders() - https://www.tensorflow.org/datasets/catalog/overview

INFO:absl:No config specified, defaulting to first: imdb_reviews/plain_text
INFO:absl:Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imdb_reviews/plain_text/1.0.0
INFO:absl:Load dataset info from /tmp/tmpwpaawegytfds
INFO:absl:Field info.config_name from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.config_description from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.citation from disk and from code do not match. Keeping the one from code.
INFO:absl:Generating dataset imdb_reviews (/root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0)


Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


INFO:absl:Downloading http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz into /root/tensorflow_datasets/downloads/ai.stanfor.edu_amaas_sentime_aclImdb_v1PaujRp-TxjBWz59jHXsMDm5WiexbxzaFQkEnXc3Tvo8.tar.gz.tmp.fd42cd6114d04a808a506470533147ba...
INFO:absl:Generating split train
INFO:absl:Done writing /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete9GJDWH/imdb_reviews-train.tfrecord. Shard lengths: [25000]
INFO:absl:Generating split test


Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete9GJDWH/imdb_reviews-train.tfrecord


INFO:absl:Done writing /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete9GJDWH/imdb_reviews-test.tfrecord. Shard lengths: [25000]
INFO:absl:Generating split unsupervised


Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete9GJDWH/imdb_reviews-test.tfrecord


INFO:absl:Done writing /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete9GJDWH/imdb_reviews-unsupervised.tfrecord. Shard lengths: [50000]


Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete9GJDWH/imdb_reviews-unsupervised.tfrecord


INFO:absl:Skipping computing stats for mode ComputeStatsMode.SKIP.
INFO:absl:Constructing tf.data.Dataset for split (Split('train'), Split('test')), from /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


 tfds.core.DatasetInfo(
    name='imdb_reviews',
    version=1.0.0,
    description='Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.',
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string),
    }),
    total_num_examples=100000,
    splits={
        'test': 25000,
        'train': 25000,
        'unsupervised': 50000,
    },
    supervised_keys=('text', 'label'),
    citation="""@InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = {M

Now let's look at a few examples from the fine-tuning data. `ds_train.take(5)` returns only the first 5 reviews and labels, so we can inspect the dataset without iterating over all 25,000 examples. 

In [5]:
for review, label in tfds.as_numpy(ds_train.take(5)):
  print('review:', review.decode()[:100], label)

review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside.  0
review: I have been known to fall asleep during films, but this is usually due to a combination of things in 0
review: Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brenn 0
review: This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with i 1
review: As others have mentioned, all the women that go nude in this film are mostly absolutely gorgeous. Th 1


### Tokenization 


![picture](https://drive.google.com/uc?export=view&id=18l47wVRZ2I1kP8DpVK160FRKuyUt3TG7)
 
(Image source - BERT Tokenization - http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/) 
 

We will be using Huggingface Tokenizer - https://huggingface.co/transformers/tokenizer_summary.html 

In the `bert-base-uncased` vocabulary, 101 is the token id of `[CLS]` and 102 is the token id of `[SEP]`.  

In [7]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)



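The BERT tokenizer uses a WordPiece vocabulary of about 30,000 tokens (whole words and sub-word pieces), and the pretrained model holds an embedding for each of them. Every token has its own id, so the tokens have to be mapped to those ids before they are passed to the model. A word that is not in the vocabulary is split into smaller pieces that are; continuation pieces are prefixed with `##`. (You can also customize the vocabulary if you don't want certain words to be split.)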

In [8]:
vocabulary = tokenizer.get_vocab()
print(list(vocabulary.keys())[5000:5020])

['knight', 'lap', 'survey', 'ma', '##ow', 'noise', 'billy', '##ium', 'shooting', 'guide', 'bedroom', 'priest', 'resistance', 'motor', 'homes', 'sounded', 'giant', '##mer', '150', 'scenes']
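
The `##` entries in the output above are continuation pieces. WordPiece splits an unknown word greedily, always taking the longest vocabulary entry that matches at the current position. The sketch below illustrates that idea in plain Python; the function name and the toy vocabulary are hypothetical, and the real tokenizer also handles lowercasing, punctuation and whitespace splitting before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, just large enough for the examples below.
toy_vocab = {"h", "##ba", "##p", "te", "##jas", "thomas", "great"}

print(wordpiece_tokenize("hbap", toy_vocab))    # ['h', '##ba', '##p']
print(wordpiece_tokenize("tejas", toy_vocab))   # ['te', '##jas']
print(wordpiece_tokenize("thomas", toy_vocab))  # ['thomas']
```

A word that is in the vocabulary stays whole, while a rare word is broken into pieces that the model already has embeddings for.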



![picture](https://drive.google.com/uc?export=view&id=1qqOgQNVYN7ACqUNZtNEYg06VVS4nT1sX)
 
(Image source - BERT Tokenization - http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/) 

In [10]:
max_length_test=20
# test_sentence = "a visually stunning rumination on love"  # alternative example
test_sentence = "HBAP has great students, for e.g: Thomas Tejas James"

# add special tokens (the spaces keep them visibly separate from the sentence)
test_sentence_with_special_tokens = '[CLS] ' + test_sentence + ' [SEP]'
tokenized = tokenizer.tokenize(test_sentence_with_special_tokens)
print('tokenized:', tokenized)

# convert tokens to idx in WordPiece 
input_ids = tokenizer.convert_tokens_to_ids(tokenized)
attention_mask = [1] * len(input_ids)
padding_length = max_length_test - len(input_ids) 
input_ids = input_ids + ([0] * padding_length)
token_type_ids = ([0] * max_length_test)
attention_mask = attention_mask + ([0] * padding_length)

bert_input = {
    'input_ids': input_ids, 
    'token_type_ids':token_type_ids,
    'attention_mask':attention_mask 
}

print(bert_input)

tokenized: ['[CLS]', 'h', '##ba', '##p', 'has', 'great', 'students', ',', 'for', 'e', '.', 'g', ':', 'thomas', 'te', '##jas', 'james', '[SEP]']
{'input_ids': [101, 1044, 3676, 2361, 2038, 2307, 2493, 1010, 2005, 1041, 1012, 1043, 1024, 2726, 8915, 17386, 2508, 102, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]}
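
The zeros at the end of `attention_mask` tell the model to ignore the padded positions. One common way to apply such a mask, sketched below in plain Python, is to add a large negative number to the attention scores of masked positions before the softmax. The function name and the scores are made up for illustration, and the penalty of -10000 is an assumption modelled on the original BERT implementation.

```python
import math


def masked_softmax(scores, attention_mask, penalty=-10000.0):
    """Softmax over attention scores that ignores positions where the mask is 0."""
    # Masked positions get a large negative score, so exp() maps them to ~0.
    shifted = [s + (1 - m) * penalty for s, m in zip(scores, attention_mask)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 0.0, 0.0]   # hypothetical raw attention scores
attention_mask = [1, 1, 1, 0, 0]     # last two positions are padding
weights = masked_softmax(scores, attention_mask)
print([round(w, 3) for w in weights])
```

The padded positions receive a weight of zero, and the weights of the real tokens still sum to one.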


In [None]:
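# encode_plus performs the manual steps above in a single call
bert_input = tokenizer.encode_plus(
    test_sentence, 
    add_special_tokens = True, 
    max_length = max_length_test,
    padding = 'max_length',  # pad up to max_length; padding=True would not pad a single sentence
    truncation = True,       # cut inputs that are longer than max_length
    return_attention_mask = True,
)
print(bert_input)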

From [Huggingface glossary.](https://huggingface.co/transformers/glossary.html#general-terms)

**Input ids:** The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.

**Attention Mask:** This argument indicates to the model which tokens should be attended to, and which should not. 

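**Token type ids:** Some models’ purpose is to do sequence classification or question answering. These require two different sequences to be joined in a single “input_ids” entry, which usually is performed with the help of special tokens, such as the classifier ([CLS]) and separator ([SEP]) tokens. For example, the BERT model builds its two-sequence input as such: `[CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]`. The token type ids are 0 for every token of the first sequence (including [CLS] and the first [SEP]) and 1 for every token of the second sequence. 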
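
As a rough illustration of how the three inputs fit together for a sentence pair, the sketch below builds them by hand. The helper name is hypothetical; the special-token ids (101, 102, 0) and the word ids are the ones printed earlier in this notebook for `bert-base-uncased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids seen in the tokenizer output above


def encode_pair(ids_a, ids_b, max_length):
    """Build input_ids, token_type_ids and attention_mask for two sequences."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sequence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)

    padding = max_length - len(input_ids)
    if padding < 0:
        raise ValueError("pair is longer than max_length; truncate it first")
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


# "has great students" / "thomas james", using ids from the earlier output
pair = encode_pair([2038, 2307, 2493], [2726, 2508], max_length=10)
print(pair)
```

For the single-sentence classification task in this notebook there is no second sequence, which is why `token_type_ids` is all zeros in the earlier output.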




In [11]:
max_length = 512 
batch_size = 4
epochs = 1

In [12]:
def convert_example_to_feature(review):
    # Encode one review: add [CLS]/[SEP], truncate to max_length tokens,
    # and pad shorter reviews up to max_length.
    return tokenizer.encode_plus(
            review, 
            add_special_tokens = True, 
            truncation = True,
            max_length = max_length,
            padding = 'max_length',  # replaces the deprecated pad_to_max_length flag
            return_attention_mask = True,
        )
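
BERT accepts at most 512 positions, so long reviews have to be cut. The sketch below shows what truncation plus padding amounts to for a single sequence. It is a simplified stand-in with a hypothetical name, assuming that tokens are dropped from the end of the review and that two positions are reserved for `[CLS]` and `[SEP]`.

```python
def truncate_and_pad(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Fit one tokenized review into exactly max_length positions."""
    body = token_ids[: max_length - 2]   # keep room for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return input_ids + [pad_id] * padding, attention_mask + [0] * padding


long_review = list(range(1000, 1600))    # 600 made-up token ids
ids, mask = truncate_and_pad(long_review, max_length=512)
print(len(ids), ids[-1], sum(mask))      # 512 102 512

short_review = [2023, 2001]              # 2 made-up token ids
ids_short, mask_short = truncate_and_pad(short_review, max_length=8)
print(ids_short, mask_short)
```

Everything after the first 510 tokens of a long review is discarded, so the model only sees the beginning of such a review.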

In [13]:
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    return { 'input_ids': input_ids, 
    'token_type_ids':token_type_ids,
    'attention_mask':attention_masks },label


def encode_examples(ds, limit=-1):
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = [] 

    if limit > 0:
        ds = ds.take(limit)

    for review, label in tfds.as_numpy(ds):
        # print(review, label, len(review))
        bert_input = convert_example_to_feature(review.decode())
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([label])

    # print(len(input_ids_list), len(label_list))

    return tf.data.Dataset.from_tensor_slices((input_ids_list, token_type_ids_list, attention_mask_list,label_list)).map(map_example_to_dict)


In [17]:

ds_train_encoded = encode_examples(ds_train).shuffle(25000).batch(batch_size).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

ds_test_encoded = encode_examples(ds_test).batch(batch_size).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
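
A quick back-of-the-envelope calculation shows what this pipeline implies for training. The numbers come from the settings above (25,000 examples, batch size 4, sequence length 512); the 4 bytes per value is an assumption that the encoded lists end up as 32-bit integers.

```python
import math

num_examples = 25000
batch_size = 4
max_length = 512

# One optimizer step per batch.
steps_per_epoch = math.ceil(num_examples / batch_size)

# Each example stores input_ids, token_type_ids and attention_mask.
values_per_example = 3 * max_length
total_values = num_examples * values_per_example
approx_mib = total_values * 4 / 2**20   # assuming 4 bytes per value

print(f"steps per epoch: {steps_per_epoch}")                 # 6250
print(f"encoded values:  {total_values:,}")                  # 38,400,000
print(f"approx. size:    {approx_mib:.1f} MiB per split")
```

With a batch size of 4, a single epoch is already 6,250 steps, which is why one epoch of fine-tuning takes a while even on a GPU.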



In [18]:
learning_rate = 2e-5

number_of_epochs=1

model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
loss  = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
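
The loss above is created with `from_logits=True` because the classification head returns raw scores (logits), not probabilities. The sketch below shows, in plain Python, the per-example quantity such a loss computes: a softmax over the logits followed by the negative log-probability of the true class. The function name and the example logits are made up for illustration.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example, given raw logits and an integer label."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    # -log(softmax(logits)[label]) == log_sum_exp - logits[label]
    return log_sum_exp - logits[label]


undecided = sparse_ce_from_logits([0.0, 0.0], label=0)   # log(2), about 0.693
confident = sparse_ce_from_logits([2.0, -1.0], label=0)  # small loss
wrong = sparse_ce_from_logits([2.0, -1.0], label=1)      # large loss
print(round(undecided, 4), round(confident, 4), round(wrong, 4))
```

A freshly initialized classifier is close to the undecided case, so a starting loss near 0.69 is what one would expect for this two-class problem.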


In [None]:
history = model.fit(ds_train_encoded, epochs=number_of_epochs, validation_data = ds_test_encoded, verbose=1)

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method


Cause: while/else statement not yet supported


Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.


KeyboardInterrupt: ignored

In [19]:
#Final validation accuracy: 0.9265 #0.8846