
In [1]:
%%html
<div style="font-weight: bold; font-size: 36px;">
    BERT fine-tuning in Keras
</div>

# Installation

In [2]:
import tensorflow as tf

device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == "/device:GPU:0":
    print("Found GPU at: {}".format(device_name))
else:
    raise SystemError("GPU device not found")

Found GPU at: /device:GPU:0


In [3]:
!pip install -q transformers tensorflow_datasets


# Loading IMDB dataset

In [4]:
import tensorflow_datasets as tfds

(ds_train, ds_test), ds_info = tfds.load(
    "imdb_reviews",
    split=(tfds.Split.TRAIN, tfds.Split.TEST),
    as_supervised=True,
    with_info=True    
)

print("info", ds_info)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.
info tfds.core.DatasetInfo(
    name='imdb_reviews',
    version=1.0.0,
    description='Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.',
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string),
    }),
    total_num_examples=100000,
    splits={
        'test': 25000,
        'train': 25000,
        'unsupervised': 50000,
    },
    supervised_keys=('text', 'label'),
    citation="""@InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = 

In [5]:
for review, label in tfds.as_numpy(ds_train.take(5)):
    print("review", review.decode()[0:50], label)

review This was an absolutely terrible movie. Don't be lu 0
review I have been known to fall asleep during films, but 0
review Mann photographs the Alberta Rocky Mountains in a  0
review This is the kind of film for a snowy Sunday aftern 1
review As others have mentioned, all the women that go nu 1


# Tokenization

Now we apply the BERT tokenizer. The tokenizer must match the pretrained model that we use for training and prediction, because the model's embeddings are tied to that tokenizer's vocabulary.

In [6]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)


The BERT tokenizer uses a WordPiece vocabulary of more than 30,000 tokens, and the pretrained model has an embedding for each of them. Each token has an integer id.
We need to map the tokens of our text to those ids.
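
To make the token-to-id mapping concrete, here is a minimal sketch of greedy longest-match WordPiece splitting in plain Python. It is an illustration only, not the `transformers` implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, and the ids in `toy_vocab` are copied from the outputs in this notebook where available.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece tokens by greedy longest match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched, so the whole word is treated as unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary (assumption: a tiny subset, for illustration only).
toy_vocab = {"[UNK]": 100, "test": 3231, "token": 19204, "##ization": 3989, "sentence": 6251}

pieces = wordpiece_tokenize("tokenization", toy_vocab)
piece_ids = [toy_vocab[p] for p in pieces]
print(pieces, piece_ids)  # ['token', '##ization'] [19204, 3989]
```

A word with no matching pieces, such as `"xyz"` with this toy vocabulary, maps to `[UNK]`.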

In [7]:
vocab = tokenizer.get_vocab()
print("Vocab size: ", len(vocab))

Vocab size:  30522


In [8]:
vocab_tokens = list(vocab.keys())
print("Sample entries from the vocabulary:\n",
      vocab_tokens[5000:5010], "\n",
      vocab_tokens[5010:5020], "\n")

Sample entries from the vocabulary:
 ['knight', 'lap', 'survey', 'ma', '##ow', 'noise', 'billy', '##ium', 'shooting', 'guide'] 
 ['bedroom', 'priest', 'resistance', 'motor', 'homes', 'sounded', 'giant', '##mer', '150', 'scenes'] 



In [9]:
max_length_test = 20
test_sentence = "Test tokenization sentence. Followed by another sentence"

# add special tokens.
test_sentence_with_special_tokens = "[CLS]" + test_sentence + "[SEP]"

tokenized = tokenizer.tokenize(test_sentence_with_special_tokens)
print("tokenized: ", tokenized)

# Convert tokens to ids in WordPiece.
input_ids = tokenizer.convert_tokens_to_ids(tokenized)

# Precalculate the pad length so that we can reuse it below.
padding_length = max_length_test - len(input_ids)

# Attention should focus only on the real (non-padded) tokens, so build the
# mask from the unpadded length: ones for real tokens, zeros for padding.
attention_mask = [1] * len(input_ids) + [0] * padding_length

# Pad sequences shorter than the max length with the [PAD] token id (0).
input_ids = input_ids + [0] * padding_length

# Token types distinguish the two segments of a sentence pair, e.g. in
# question-answer sequences. Here we use only one type.
token_type_ids = [0] * max_length_test

bert_input = {
    "input_ids": input_ids,
    "token_type_ids": token_type_ids,
    "attention_mask": attention_mask,
}
print(bert_input)

tokenized:  ['[CLS]', 'test', 'token', '##ization', 'sentence', '.', 'followed', 'by', 'another', 'sentence', '[SEP]']
{'input_ids': [101, 3231, 19204, 3989, 6251, 1012, 2628, 2011, 2178, 6251, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


In [10]:
len(input_ids)

20

In [11]:
len(attention_mask)

20
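
The manual steps in the cells above (pad the ids, then mask out the padding) can be wrapped in one helper. The sketch below is an illustration, not part of the `transformers` API; `pad_and_mask` is a hypothetical name. It builds the mask from the unpadded length, so the ids and the mask always end up with the same length.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad token_ids to max_length and build the matching attention mask."""
    # Naive truncation for illustration: a real tokenizer keeps the final [SEP].
    token_ids = list(token_ids[:max_length])
    padding_length = max_length - len(token_ids)
    # Ones for real tokens, zeros for padding.
    attention_mask = [1] * len(token_ids) + [0] * padding_length
    padded_ids = token_ids + [pad_id] * padding_length
    return padded_ids, attention_mask


# Token ids of the test sentence, copied from the output above.
ids = [101, 3231, 19204, 3989, 6251, 1012, 2628, 2011, 2178, 6251, 102]
padded_ids, mask = pad_and_mask(ids, max_length=20)
print(len(padded_ids), len(mask), sum(mask))  # 20 20 11
```

Both lists have length 20, and the mask has exactly 11 ones, one per real token.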

In [12]:
bert_input_easy = tokenizer.encode_plus(
    test_sentence,
    add_special_tokens=True,  # Add [CLS], [SEP]
    max_length=max_length_test,  # Max length of the text that can go to BERT.
    padding="max_length",  # Add [PAD] tokens up to max_length.
    truncation=True,  # Truncate inputs longer than max_length.
    return_attention_mask=True,  # Add attention mask to not focus on pad tokens.
)


In [13]:
bert_input_easy

{'input_ids': [101, 3231, 19204, 3989, 6251, 1012, 2628, 2011, 2178, 6251, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

# Encode train and test datasets

In [14]:
# Max length is up to 512 for BERT.
max_length = 512
batch_size = 6

In [15]:
def convert_example_to_feature(review):
    # Combined step: tokenization, mapping to WordPiece ids, adding special
    # tokens, truncation, and padding.
    return tokenizer.encode_plus(
        review,
        add_special_tokens=True,  # Add [CLS], [SEP] tokens
        max_length=max_length,  # Max text length
        truncation=True,  # Truncate reviews longer than max_length
        padding="max_length",  # Add [PAD] tokens
        return_attention_mask=True,  # Add attention mask to not focus on pad tokens
    )

In [16]:
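def map_example_to_dict(input_ids, attention_mask, token_type_ids, label):
    """Prepare input for TFBertForSequenceClassification."""
    # The parameter must be named `attention_mask`; otherwise the dictionary
    # below would silently pick up the global variable from the earlier cell.
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }, label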

In [17]:
def encode_examples(ds, limit=-1):
    # Prepare lists, so that we can build final TensorFlow dataset from slices.
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []

    if limit > 0:
        ds = ds.take(limit)

    for i, (review, label) in enumerate(tfds.as_numpy(ds)):
        bert_input = convert_example_to_feature(review.decode())

        input_ids_list.append(bert_input["input_ids"])
        token_type_ids_list.append(bert_input["token_type_ids"])
        attention_mask_list.append(bert_input["attention_mask"])
        label_list.append([label])

        if i % 1000 == 0:
            print("{:05d}th example".format(i))

    return tf.data.Dataset.from_tensor_slices(
        (input_ids_list, attention_mask_list, token_type_ids_list, label_list)
    ).map(map_example_to_dict)
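
To show what the final `from_tensor_slices(...).map(...)` step produces, here is a pure-Python analogue. It is only a sketch of the data layout: `slice_and_map` is a hypothetical helper, the toy ids are shortened to four tokens, and the real pipeline yields tensors rather than lists.

```python
def slice_and_map(input_ids_list, attention_mask_list, token_type_ids_list, label_list):
    """Pure-Python analogue of from_tensor_slices(...).map(map_example_to_dict)."""
    examples = []
    # zip walks the parallel lists in lockstep and yields one example per step,
    # in the same order in which the lists are passed to from_tensor_slices.
    for input_ids, attention_mask, token_type_ids, label in zip(
        input_ids_list, attention_mask_list, token_type_ids_list, label_list
    ):
        features = {
            "input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask,
        }
        examples.append((features, label))
    return examples


# Toy data (assumption: two examples padded to length 4 for readability).
toy_examples = slice_and_map(
    input_ids_list=[[101, 3231, 102, 0], [101, 6251, 102, 0]],
    attention_mask_list=[[1, 1, 1, 0], [1, 1, 1, 0]],
    token_type_ids_list=[[0, 0, 0, 0], [0, 0, 0, 0]],
    label_list=[[0], [1]],
)
for features, label in toy_examples:
    print(features["input_ids"], features["attention_mask"], label)
```

Each element is a `(features, label)` pair, which is the structure Keras expects when a dataset is passed to `fit`.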

In [None]:
# Train dataset.
ds_train_encoded = encode_examples(ds_train).shuffle(10000).batch(batch_size)

# Test dataset.
ds_test_encoded = encode_examples(ds_test).batch(batch_size)

00000th example
01000th example
02000th example
03000th example
04000th example
05000th example
06000th example
07000th example
08000th example
09000th example
10000th example
11000th example
12000th example
13000th example
14000th example
15000th example
16000th example
17000th example
18000th example
19000th example
20000th example
21000th example
22000th example
23000th example
24000th example


# Model initialization

In [22]:
from transformers import TFBertForSequenceClassification
import tensorflow as tf

# Recommended learning rates for Adam when fine-tuning BERT: 5e-5, 3e-5, 2e-5.
learning_rate = 2e-5

# More epochs usually give better results but take longer to train.
epochs = 1

# Use pretrained model from `transformers` library.
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")

# Use the Adam optimizer.
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
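
The model returns raw logits, one per class, which is why the loss is created with `from_logits=True`. As a rough sketch of what that loss computes for a single example, the plain-Python function below applies a softmax to the logits and takes the negative log-probability of the true class. The function names are hypothetical, and the Keras loss additionally averages over the batch.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


def predicted_class(logits):
    """Index of the largest logit, which is what the accuracy metric compares to the label."""
    return max(range(len(logits)), key=lambda i: logits[i])


# With equal logits both classes get probability 0.5, so the loss is ln(2).
uniform_loss = sparse_categorical_crossentropy_from_logits([0.0, 0.0], 0)
print(round(uniform_loss, 4))  # 0.6931
print(predicted_class([-1.0, 2.0]))  # 1
```

A loss near 0.69 at the start of training is therefore expected for a two-class problem with a freshly initialized classifier head.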


In [25]:
len(ds_train_encoded)

NameError: ignored

In [26]:
len(ds_test_encoded)

NameError: ignored
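
Both calls fail with `NameError`, presumably because the encoding cell (`In [None]`) did not finish in this session. Assuming that cell has been run on the full 25,000-example splits with `batch_size = 6`, the expected number of batches can be estimated without TensorFlow. The helper `num_batches` below is a hypothetical name used only for this sketch.

```python
import math


def num_batches(num_examples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)


examples_per_split = 25000  # size of the train and test splits reported by ds_info
batch_size = 6

expected_batches = num_batches(examples_per_split, batch_size)
last_batch_size = examples_per_split - (expected_batches - 1) * batch_size
print(expected_batches, last_batch_size)  # 4167 4
```

Under these assumptions, `len(ds_train_encoded)` and `len(ds_test_encoded)` should both be 4167, with the final batch holding 4 examples.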