# Transformers architecture and BERT
<sup>This notebook is a part of the Natural Language Processing class at the University of Ljubljana, Faculty of Computer and Information Science. Please contact [slavko.zitnik@fri.uni-lj.si](mailto:slavko.zitnik@fri.uni-lj.si) with any comments.</sup>

The [Transformers](https://huggingface.co/transformers/quicktour.html) library offers a variety of implemented architectures (TensorFlow and PyTorch) along with [pre-trained models](https://huggingface.co/models) for different tasks, such as sequence classification, sequence tagging and machine translation. Some Slovene models can also be found there. Other Slovene models are available at:
   
* [CroSloEn BERT](https://www.clarin.si/repository/xmlui/handle/11356/1330)
* [SloBERTa 1.0](https://www.clarin.si/repository/xmlui/handle/11356/1387)
* [SloBERTa 2.0](https://www.clarin.si/repository/xmlui/handle/11356/1397)

[A nice introduction to BERT](https://huggingface.co/blog/bert-101) (recommended reading).


The examples here require a GPU with at least 4 GB of memory (adapt batch sizes for smaller cards) and the TensorFlow 2.x library.

In [1]:
import tensorflow as tf
import os
print(f"Tensorflow version: {tf.__version__}")

# Restrict TensorFlow to only allocate 4GBs of memory on the first GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
    #tf.config.experimental.set_memory_growth(gpus[0], True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(f"The system contains '{len(gpus)}' Physical GPUs and '{len(logical_gpus)}' Logical GPUs")
  except RuntimeError as e:
    print(e)
else:
    print(f"Your system does not contain a GPU that could be used by Tensorflow!")

Tensorflow version: 2.16.1
Your system does not contain a GPU that could be used by Tensorflow!


We import general libraries that will be used.

In [2]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

from transformers import (TFBertForSequenceClassification, 
                          BertTokenizer)

from tqdm import tqdm

We read the dataset and convert the sentiment labels to numbers.

In [3]:
data = pd.read_csv('IMDB Dataset.csv')

# Transform positive/negative values to 1/0s
label_encoder = preprocessing.LabelEncoder()
data['sentiment'] = label_encoder.fit_transform(data['sentiment'])

data.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...          1
1  A wonderful little production. <br /><br />The...          1
2  I thought this was a wonderful way to spend ti...          1
3  Basically there's a family where a little boy ...          0
4  Petter Mattei's "Love in the Time of Money" is...          1
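The label encoding step can be illustrated without scikit-learn. The sketch below assumes, as `LabelEncoder` does, that class ids follow the sorted order of the distinct labels, so `negative` becomes 0 and `positive` becomes 1. The helper name `fit_label_mapping` and the toy labels are hypothetical.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer id, in sorted label order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

toy_labels = ["positive", "negative", "positive"]  # made-up example labels
mapping = fit_label_mapping(toy_labels)
encoded = [mapping[label] for label in toy_labels]

print(mapping)
print(encoded)
```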


Split the data into train, development and test sets.

In [4]:
X = (np.array(data['review']))
y = (np.array(data['sentiment']))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
print("Train dataset shape: {0}, \nTest dataset shape: {1} \nValidation dataset shape: {2}".format(X_train.shape, X_test.shape, X_val.shape))

Train dataset shape: (40000,), 
Test dataset shape: (5000,) 
Validation dataset shape: (5000,)
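The two calls to `train_test_split` first hold out 20% of the data and then cut the held-out part in half. The sketch below reproduces only the resulting set sizes with the standard library. It does not reproduce scikit-learn's exact shuffling, and the helper name `two_stage_split` is hypothetical.

```python
import random

def two_stage_split(num_examples, holdout_fraction=0.2, seed=13):
    """Return (train, validation, test) index lists from one shuffled index list."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)

    # Stage 1: hold out a fraction of the data
    num_holdout = round(num_examples * holdout_fraction)
    holdout, train = indices[:num_holdout], indices[num_holdout:]

    # Stage 2: split the held-out part in half (validation and test)
    half = len(holdout) // 2
    return train, holdout[:half], holdout[half:]

train_idx, val_idx, test_idx = two_stage_split(50000)
print(len(train_idx), len(val_idx), len(test_idx))
```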


Load the models from the public Transformers repository. For each model we generally load two objects: the model itself (here, pre-trained BERT with a sequence classification head whose weights are newly initialized and still need to be fine-tuned) and its tokenizer. The tokenizer splits the input into tokens and word pieces and maps each of them to its id in the vocabulary, which is what the model takes as input.

In [5]:
bert_model = TFBertForSequenceClassification.from_pretrained("bert-base-cased")
#bert_model = TFBertForSequenceClassification.from_pretrained('./')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [7]:
bert_tokenizer.tokenize("don't be so judgmental")

['don', "'", 't', 'be', 'so', 'judgment', '##al']
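The `##` prefix in the output marks a word piece that continues the previous token. A minimal sketch of greedy longest-match-first word-piece splitting is shown below, assuming a tiny made-up vocabulary. The function `wordpiece` and `toy_vocab` are hypothetical and far simpler than the real `BertTokenizer`, which also handles punctuation, casing and a much larger vocabulary.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shorten it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"be", "so", "judge", "judgment", "##al", "##ment"}
print(wordpiece("judgmental", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```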

We prepare the input for the classifier (semi-manually, using the *encode_plus* method):

* *input_ids* contain the id of each token in the tokenizer vocabulary
* *attention_masks* mark which positions hold real tokens (1) and which hold padding (0), so that the attention mechanism ignores padded tokens
* *token_type_ids* identify which segment of the input each token belongs to (used during pre-training for next sentence prediction)
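The relation between the three inputs can be seen on a toy example. The sketch below pads a short list of token ids to a fixed length and builds the matching attention mask. The ids are made up and the helper `pad_to_length` is hypothetical.

```python
def pad_to_length(token_ids, max_length, pad_token=0):
    """Return (input_ids, attention_mask, token_type_ids) of length max_length."""
    # Naive truncation for illustration only; the real tokenizer makes room
    # for the special tokens before truncating
    token_ids = token_ids[:max_length]
    padding_length = max_length - len(token_ids)

    input_ids = token_ids + [pad_token] * padding_length
    attention_mask = [1] * len(token_ids) + [0] * padding_length
    token_type_ids = [0] * max_length  # a single segment, so all zeros
    return input_ids, attention_mask, token_type_ids

example_ids = [11, 7, 8, 9, 12]  # made-up token ids
input_ids, attention_mask, token_type_ids = pad_to_length(example_ids, max_length=8)
print(input_ids)
print(attention_mask)
print(token_type_ids)
```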

In [8]:
pad_token=0
pad_token_segment_id=0
max_length=128

def convert_to_input(reviews):
  input_ids,attention_masks,token_type_ids=[],[],[]
  
  for x in tqdm(reviews,position=0, leave=True):
    inputs = bert_tokenizer.encode_plus(x, add_special_tokens=True, max_length=max_length, truncation=True)
    
    i, t = inputs["input_ids"], inputs["token_type_ids"]
    m = [1] * len(i)

    padding_length = max_length - len(i)

    i = i + ([pad_token] * padding_length)
    m = m + ([0] * padding_length)
    t = t + ([pad_token_segment_id] * padding_length)
    
    input_ids.append(i)
    attention_masks.append(m)
    token_type_ids.append(t)
  
  return [np.asarray(input_ids), 
            np.asarray(attention_masks), 
            np.asarray(token_type_ids)]

In [9]:
X_test_input=convert_to_input(X_test)
X_train_input=convert_to_input(X_train)
X_val_input=convert_to_input(X_val)

  0%|          | 0/5000 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
100%|██████████| 5000/5000 [00:20<00:00, 246.71it/s]
100%|██████████| 40000/40000 [02:41<00:00, 247.97it/s]
100%|██████████| 5000/5000 [00:19<00:00, 252.81it/s]


TensorFlow models by default take an object of type [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) as input for training or prediction. It supports shuffling, splitting and automatic batch creation.

In [10]:
def example_to_features(input_ids,attention_masks,token_type_ids,y):
  return {"input_ids": input_ids,
          "attention_mask": attention_masks,
          "token_type_ids": token_type_ids},y

train_ds = tf.data.Dataset.from_tensor_slices((X_train_input[0],X_train_input[1],X_train_input[2],y_train)).map(example_to_features).shuffle(100).batch(12).repeat(5)
val_ds=tf.data.Dataset.from_tensor_slices((X_val_input[0],X_val_input[1],X_val_input[2],y_val)).map(example_to_features).batch(12)
test_ds=tf.data.Dataset.from_tensor_slices((X_test_input[0],X_test_input[1],X_test_input[2],y_test)).map(example_to_features).batch(12)

We set the parameters for training and train the model. As our model is already pretrained and contains a specific head for sequence classification, we can use it directly.

In [11]:
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

bert_model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

In [12]:
print("Fine-tuning BERT on IMDB dataset")
bert_history = bert_model.fit(train_ds, epochs=3, validation_data=val_ds)

Fine-tuning BERT on IMDB dataset
Epoch 1/3
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7fcca6fed750> is not a module, class, method, function, traceback, frame, or code object
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7fcca6fed750> is not a module, class, method, function, traceback, frame, or code object

Epoch 2/3
Epoch 3/3


The fine-tuning will output something similar to the following:

```
Fine-tuning BERT on IMDB dataset
Train for 16670 steps, validate for 417 steps
Epoch 1/3
16670/16670 [==] - 5116s 307ms/step - loss: 0.1515 - accuracy: 0.9392 - val_loss: 0.5599 - val_accuracy: 0.8676
Epoch 2/3
16670/16670 [==] - 5123s 307ms/step - loss: 0.0347 - accuracy: 0.9884 - val_loss: 0.4681 - val_accuracy: 0.8742
Epoch 3/3
16670/16670 [==] - 5136s 308ms/step - loss: 0.0254 - accuracy: 0.9920 - val_loss: 0.6523 - val_accuracy: 0.8668
```

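We can observe that the training loss decreases and the training accuracy increases with each epoch. The validation accuracy, however, stays at around 0.87, and the validation loss rises in the third epoch, which suggests that the model is starting to overfit the training data.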
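The step counts in the log follow from the dataset pipeline: the data is first cut into batches of 12 and the batched training set is then repeated 5 times. A small sketch of this arithmetic, with a hypothetical helper `steps_per_epoch`:

```python
import math

def steps_per_epoch(num_examples, batch_size, repeats=1):
    """Number of batches in one pass over dataset.batch(batch_size).repeat(repeats)."""
    # The last batch may be smaller than batch_size, hence the ceiling
    return math.ceil(num_examples / batch_size) * repeats

train_steps = steps_per_epoch(40000, 12, repeats=5)
val_steps = steps_per_epoch(5000, 12)
print(train_steps, val_steps)
```

Because of `repeat(5)`, one Keras epoch here covers the training set five times, so `epochs=3` amounts to 15 passes over the training data.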

After training for a few epochs we evaluate the model against the test data. First we prepare true values as a numpy array:

In [13]:
results_true = test_ds.unbatch()
results_true = np.asarray([element[1].numpy() for element in results_true])
print(results_true)

[0 0 1 ... 1 0 0]


Then we get predictions from the model. As each prediction is a vector of two logits (one per class), we select the predicted class as the index of the maximum value.

In [14]:
results = bert_model.predict(test_ds)
print(f"Model predictions:\n {results.logits}")

results_predicted = np.argmax(results.logits, axis=1)

Model predictions:
 [[ 5.638682  -4.606999 ]
 [ 6.5738583 -5.4174294]
 [-6.3349085  7.2059326]
 ...
 [-5.756212   6.591787 ]
 [ 5.4666157 -4.4481173]
 [ 5.9342785 -4.881941 ]]
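The printed values are raw logits, not probabilities. A minimal sketch of how one row can be turned into class probabilities with a softmax and into a class with an argmax, using two rows from the output above. The helper names are hypothetical; the notebook itself uses `np.argmax`.

```python
import math

def softmax(logits):
    """Convert a list of logits to probabilities that sum to 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def predict_class(logits):
    """Index of the largest logit (same class as the largest probability)."""
    return max(range(len(logits)), key=lambda index: logits[index])

rows = [[5.638682, -4.606999], [-6.3349085, 7.2059326]]
for row in rows:
    print(predict_class(row), softmax(row))
```

Since the softmax preserves the order of the values, taking the argmax of the logits gives the same class as taking the argmax of the probabilities.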


Lastly, we evaluate the model using standard scores such as F score and accuracy:

In [15]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

print(f"F1 score: {f1_score(results_true, results_predicted)}")
print(f"Accuracy score: {accuracy_score(results_true, results_predicted)}")

F1 score: 0.8744437995743858
Accuracy score: 0.8702


The results achieved with this approach should be similar to the following:

```
F1 score: 0.88
Accuracy score: 0.8788
```
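For reference, both scores can be computed by hand from the counts of true positives, false positives and false negatives. The sketch below uses made-up toy labels and a hypothetical helper `binary_scores`; it is meant to show the definitions, not to replace `sklearn.metrics`.

```python
def binary_scores(y_true, y_pred):
    """Return (f1, accuracy) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs)
    return f1, accuracy

toy_true = [1, 0, 1, 1, 0, 0]  # made-up labels
toy_pred = [1, 0, 0, 1, 1, 0]  # made-up predictions
f1, accuracy = binary_scores(toy_true, toy_pred)
print(f"F1 score: {f1:.4f}, accuracy: {accuracy:.4f}")
```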

We can save the fine-tuned model and then load it using the same approach as above:

In [16]:
# SAVING YOUR MODEL
bert_model.save_pretrained('./')

# LOADING YOUR MODEL
bert_model = TFBertForSequenceClassification.from_pretrained('./')

Some layers from the model checkpoint at ./ were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at ./.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


# Custom neural model for IMDB reviews sentiment prediction

Now we use additional parameters of the *batch_encode_plus* method to obtain the same *input_ids* as above in a single call. The input for our model will be just token indices and class values.

In [17]:
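def get_token_ids(texts):
    # Pad or truncate every review to exactly 128 token ids
    # (padding="max_length" replaces the deprecated pad_to_max_length=True)
    return bert_tokenizer.batch_encode_plus(texts,
                                            add_special_tokens=True,
                                            max_length=128,
                                            padding="max_length",
                                            truncation=True)["input_ids"]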

train_token_ids = get_token_ids(X_train)
test_token_ids = get_token_ids(X_test)




In [18]:
train_data = tf.data.Dataset.from_tensor_slices((tf.constant(train_token_ids), tf.constant(y_train))).batch(12)
test_data = tf.data.Dataset.from_tensor_slices((tf.constant(test_token_ids), tf.constant(y_test))).batch(12)

The Tensorflow API allows for [custom model creation](https://www.tensorflow.org/api_docs/python/tf/keras/Model) and [inclusion of existing models/layers](https://www.tensorflow.org/guide/keras/custom_layers_and_models) into a model. We create a custom model using several layers as follows:

In [19]:
import tensorflow as tf
from tensorflow.keras import layers

class CustomIMDBModel(tf.keras.Model):
    
    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="custom_imdb_model"):
        super(CustomIMDBModel, self).__init__(name=name)
        
        self.embedding = layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes,
                                           activation="softmax")
    
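    def call(self, inputs, training=False):
        # (batch_size, seq_len) token ids -> (batch_size, seq_len, embedding_dimensions)
        l = self.embedding(inputs)
        # Three parallel convolutions over windows of 2, 3 and 4 tokens,
        # each reduced to (batch_size, cnn_filters) by global max pooling
        l_1 = self.cnn_layer1(l)
        l_1 = self.pool(l_1)
        l_2 = self.cnn_layer2(l)
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3)

        concatenated = tf.concat([l_1, l_2, l_3], axis=-1) # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        # Dropout is only active when training=True
        concatenated = self.dropout(concatenated, training=training)
        model_output = self.last_dense(concatenated)

        return model_output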
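The shapes flowing through the three convolution branches can be checked by hand. With `padding="valid"` and stride 1, a convolution over a sequence of length `n` with kernel size `k` produces `n - k + 1` positions, and global max pooling then keeps one value per filter. The sketch below assumes the sequence length of 128 used by the tokenizer in this notebook and 100 filters; the helper `conv1d_valid_length` is hypothetical.

```python
def conv1d_valid_length(sequence_length, kernel_size, stride=1):
    """Number of output positions of a 1D convolution with 'valid' padding."""
    return (sequence_length - kernel_size) // stride + 1

SEQUENCE_LENGTH = 128  # assumed: max_length used when tokenizing the reviews
CNN_FILTERS = 100      # assumed: number of filters per convolution

for kernel_size in (2, 3, 4):
    positions = conv1d_valid_length(SEQUENCE_LENGTH, kernel_size)
    print(f"kernel_size={kernel_size}: ({positions}, {CNN_FILTERS}) -> pooled to ({CNN_FILTERS},)")

# The three pooled vectors are concatenated along the last axis
concatenated_features = 3 * CNN_FILTERS
print(f"Concatenated feature vector length: {concatenated_features}")
```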

In [20]:
VOCAB_LENGTH = len(bert_tokenizer.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2
DROPOUT_RATE = 0.2
NB_EPOCHS = 5

custom_model = CustomIMDBModel(vocabulary_size=VOCAB_LENGTH,
                        embedding_dimensions=EMB_DIM,
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=OUTPUT_CLASSES,
                        dropout_rate=DROPOUT_RATE)

In [21]:
if OUTPUT_CLASSES == 2:
    custom_model.compile(loss="binary_crossentropy",
                       optimizer="adam",
                       metrics=["accuracy"])
else:
    custom_model.compile(loss="sparse_categorical_crossentropy",
                       optimizer="adam",
                       metrics=["sparse_categorical_accuracy"])

In [22]:
custom_model.fit(train_data, epochs=NB_EPOCHS)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fcaa4feb1d0>

Training will output something similar to the following:

```
Train for 3334 steps
Epoch 1/5
3334/3334 [==] - 123s 37ms/step - loss: 0.4285 - accuracy: 0.7985
Epoch 2/5
3334/3334 [==] - 121s 36ms/step - loss: 0.2344 - accuracy: 0.9065
Epoch 3/5
3334/3334 [==] - 121s 36ms/step - loss: 0.0862 - accuracy: 0.9693
Epoch 4/5
3334/3334 [==] - 121s 36ms/step - loss: 0.0537 - accuracy: 0.9805
Epoch 5/5
3334/3334 [==] - 123s 37ms/step - loss: 0.0355 - accuracy: 0.9874
```

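We can observe that the training loss is decreasing and the training accuracy is increasing (no validation data is passed to *fit* here).

After training for a few epochs we evaluate the model against the test data. The model outputs the probability of the positive class, so we threshold it at 0.5 to get the predicted class and take the true values as a numpy array: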

In [23]:
results_predicted = [1 if x>=0.5 else 0 for x in custom_model.predict(test_data) ]
results_true = np.array(y_test)
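With two output classes the model ends in a single sigmoid unit, so each prediction is one number between 0 and 1 that can be read as the probability of class 1. The thresholding step can be sketched on made-up probabilities as follows; the helper `threshold_probabilities` is hypothetical.

```python
def threshold_probabilities(probabilities, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return [1 if probability >= threshold else 0 for probability in probabilities]

example_probabilities = [0.03, 0.5, 0.97, 0.49]  # made-up model outputs
print(threshold_probabilities(example_probabilities))
```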

In [24]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

print(f"F1 score: {f1_score(results_true, results_predicted)}")
print(f"Accuracy score: {accuracy_score(results_true, results_predicted)}")

F1 score: 0.8379888268156425
Accuracy score: 0.826


The results achieved with this approach should be similar to the following:

```
F1 score: 0.84
Accuracy score: 0.83
```

The model achieves decent performance (an F1 score of about 0.84), although it stays below the fine-tuned BERT model above (about 0.87).

# Custom neural model for IMDB reviews sentiment prediction using BERT Embeddings

It has been shown that BERT embeddings can be used to improve your models (see lecture materials). In this scenario we change the custom model above and use BERT embeddings as input for further models.

We load the plain [TFBert model](https://huggingface.co/docs/transformers/model_doc/bert#transformers.TFBertModel) without a specific head and use the output of its last hidden layer as input for further layers.

In [25]:
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

token_ids = bert_tokenizer.encode("Hello, my dog is cute", add_special_tokens=True, max_length=55, padding="max_length", truncation=True)
input_ids = tf.constant(token_ids)[None, :]  # Batch size 1
outputs = bert_model(input_ids)
last_hidden_states = outputs[0]
print(last_hidden_states)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. 

tf.Tensor(
[[[-0.53795666  0.02915638  0.4568426  ... -0.4206041   0.3360427
   -0.68869185]
  [-0.02613155  0.10147773  0.38023055 ...  0.16423033  0.6685274
   -0.07883158]
  [-0.8667504   0.8550649   0.72283715 ... -0.56866455  0.08689685
    0.24533442]
  ...
  [ 0.3817766  -0.4433983   0.7945325  ... -0.62353337 -0.00921943
   -1.3112266 ]
  [ 0.33190864 -0.4060634   0.8477452  ... -0.6768912  -0.03242041
   -1.2069647 ]
  [ 0.3906283  -0.44234878  0.72817904 ... -0.66963124 -0.02545467
   -1.3906384 ]]], shape=(1, 55, 768), dtype=float32)




In [26]:
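def get_token_ids(texts):
    # Pad or truncate every review to exactly 128 token ids
    # (padding="max_length" replaces the deprecated pad_to_max_length=True)
    return bert_tokenizer.batch_encode_plus(texts,
                                            add_special_tokens=True,
                                            max_length=128,
                                            padding="max_length",
                                            truncation=True)["input_ids"]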

train_token_ids = get_token_ids(X_train)
test_token_ids = get_token_ids(X_test)

In [27]:
train_data = tf.data.Dataset.from_tensor_slices((tf.constant(train_token_ids), tf.constant(y_train))).batch(12)
test_data = tf.data.Dataset.from_tensor_slices((tf.constant(test_token_ids), tf.constant(y_test))).batch(12)

We improve the custom model by replacing its trainable embedding layer with a frozen BERT layer:

In [28]:
from transformers import BertTokenizer, TFBertModel, TFBertPreTrainedModel, TFBertMainLayer
from tensorflow.keras import layers

import tensorflow as tf
class BertIMDBEmbeddingModel(TFBertPreTrainedModel):
    def __init__(self, config,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model",
                 *inputs, **kwargs):
        super().__init__(config, *inputs, **kwargs)
        self.bert = TFBertMainLayer(config, name="bert", trainable = False)
        
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes,
                                           activation="softmax")

    def call(self, inputs, training = False, **kwargs):        
        bert_outputs = self.bert(inputs, training = training, **kwargs)
        
        l_1 = self.cnn_layer1(bert_outputs[0]) 
        l_1 = self.pool(l_1) 
        l_2 = self.cnn_layer2(bert_outputs[0]) 
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(bert_outputs[0])
        l_3 = self.pool(l_3) 
        
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1) # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        
        return model_output

Loading and training our model:

In [29]:
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2
DROPOUT_RATE = 0.2
NB_EPOCHS = 5

text_model = BertIMDBEmbeddingModel.from_pretrained('bert-base-uncased',
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=OUTPUT_CLASSES,
                        dropout_rate=DROPOUT_RATE)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing BertIMDBEmbeddingModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing BertIMDBEmbeddingModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertIMDBEmbeddingModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of BertIMDBEmbeddingModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['conv1d_5', 'dense_3', 'global_max_pooling1d_1', 'dropout_151', 'conv1d_4', 'conv1d_3', 'dense_2']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [30]:
if OUTPUT_CLASSES == 2:
    text_model.compile(loss="binary_crossentropy",
                       optimizer="adam",
                       metrics=["accuracy"])
else:
    text_model.compile(loss="sparse_categorical_crossentropy",
                       optimizer="adam",
                       metrics=["sparse_categorical_accuracy"])

In [31]:
text_model.fit(train_data, epochs=NB_EPOCHS)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fcb3c22c748>

In [32]:
results_predicted = [1 if x>=0.5 else 0 for x in text_model.predict(test_data) ]
results_true = np.array(y_test)

In [33]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

print(f"F1 score: {f1_score(results_true, results_predicted)}")
print(f"Accuracy score: {accuracy_score(results_true, results_predicted)}")

F1 score: 0.8747278844250941
Accuracy score: 0.8734


The results achieved with this approach should be similar to the following:

```
F1 score: 0.88
Accuracy score: 0.88
```

We achieve better performance than the plain custom model with simple embeddings and performance roughly on par with the fine-tuned model (F1 score of about 0.87 to 0.88 in both cases; any difference is probably not significant given the different hyperparameters).

By setting the parameter *trainable = False* on a model layer, we disable updates of its weights (i.e. fine-tuning). If we enable weight updates for the whole BERT model in the example above, we get much worse performance as there are many more parameters to update. To check which parameters are being updated, use the *model.summary* function to print the architecture and the parameter counts of your model.
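To get a feeling for the numbers, the parameters of the layers on top of the frozen BERT layer can be counted by hand. The sketch below assumes the hidden size of 768 seen in the output shape above, 100 filters per convolution and 256 dense units; the helper functions are hypothetical and only apply the usual weight-plus-bias formulas.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel_size * in_channels per filter) plus one bias per filter."""
    return (kernel_size * in_channels + 1) * filters

def dense_params(in_features, units):
    """Weights (in_features per unit) plus one bias per unit."""
    return (in_features + 1) * units

HIDDEN_SIZE = 768   # assumed: size of each BERT output vector
CNN_FILTERS = 100   # assumed: filters per convolution
DNN_UNITS = 256     # assumed: units of the first dense layer

head_params = (
    sum(conv1d_params(k, HIDDEN_SIZE, CNN_FILTERS) for k in (2, 3, 4))
    + dense_params(3 * CNN_FILTERS, DNN_UNITS)  # after concatenating 3 pooled vectors
    + dense_params(DNN_UNITS, 1)                # single sigmoid output unit
)
print(f"Trainable parameters on top of BERT: {head_params:,}")
```

With the BERT layer frozen, only these roughly 0.77 million parameters are updated, while BERT-base itself has on the order of 110 million parameters that would all be updated if the layer were trainable.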

## References
    
* [BERT scientific paper explained](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
* [High-level explanation of BERT](https://towardsdatascience.com/lost-in-translation-found-by-transformer-46a16bf6418f)
* [Sequence models vs. BERT](https://medium.com/saarthi-ai/transformers-attention-based-seq2seq-machine-translation-a28940aaa4fe) and an [opinion](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/) of why BERT achieves better performance.
* [Movie reviews using TF 2.0](https://androidkt.com/state-of-the-art-text-classification-using-bert-in-ten-lines-of-tensorflow-2-0) 
* [Movie reviews using BERT embeddings](https://stackabuse.com/text-classification-with-bert-tokenizer-and-tf-2-0-in-python)
* [BERT movie reviews notebook (TF 1x)](https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb#scrollTo=LL5W8gEGRTAf)
* [BERT embeddings using TF 2.0](https://colab.research.google.com/drive/1hMLd5-r82FrnFnBub-B-fVW78Px4KPX1)
* [Sentence-level embeddings](https://towardsdatascience.com/simple-bert-using-tensorflow-2-0-132cb19e9b22)
* [FineTuning BERT, Named entity recognition using BERT](https://medium.com/swlh/named-entity-recognition-using-bert-2fb924864d47)
* [DeepPavlov BERT models](http://docs.deeppavlov.ai/en/master/features/models/bert.html): BERT models for Slavic languages.
* [LSTM and BERT example in PyTorch for classification](https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03)
* [LSTM in Keras for intent classification (NER-like)](https://towardsdatascience.com/natural-language-understanding-with-sequence-to-sequence-models-e87d41ad258b)
* [GPT-2 for sequence classification](https://gmihaila.medium.com/gpt2-for-text-classification-using-hugging-face-transformers-574555451832)
    
   