# Assignment 2: Text Classification with Various Neural Networks

**Description:** This assignment covers various neural network architectures and components, largely used in the context of classification. You will compare Deep Averaging Networks, Deep Weighted Averaging Networks using Attention, and BERT-based models. You should also be able to develop an intuition for:


*   The effects of fine-tuning word vectors or starting with random word vectors
*   How various networks behave when the training set size changes
*   The effect of shuffling your training data
*   The benefits of Attention calculations
*   Working with BERT


The assignment notebook closely follows the lesson notebooks. We will use the IMDB dataset and will leverage some of the models, or part of the code, for our current investigation.

The initial part of the notebook is purely setup. We will then evaluate how Attention can make Deep Averaging networks better.

Do not try to run this entire notebook on your GCP instance, as training the models requires a GPU to finish in a reasonable time. Run this notebook in Google Colab with a GPU runtime. By default, when you open the notebook in Colab it will try to use a GPU. Total runtime of the entire notebook (with solutions and a Colab GPU) should be about 1 hour.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2024-spring-main/blob/master/assignment/a2/Text_classification.ipynb)

The overall assignment structure is as follows:


1. Setup
  
  1.1 Libraries, Embeddings, & Helper Functions

  1.2 Data Acquisition

  1.3. Data Preparation

      1.3.1 Training/Test Sets using Word2Vec

      1.3.2 Training/Test Sets for BERT-based models


2. Classification with various Word2Vec-based Models

  2.1 The Role of Shuffling of the Training Set

  2.2 DAN vs Weighted Averaging Models using Attention

    2.2.1 Warm-Up
    
    2.2.2 The WAN Model
    
  2.3 Approaches for Training of Embeddings


3. Classification with BERT

  3.1. BERT Basics

  3.2 CLS-Token-based Classification

  3.3 Averaging of BERT Outputs

  3.4. Adding a CNN on top of BERT



**INSTRUCTIONS:**

* Questions are always indicated as **QUESTION**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1.  Please do **not** remove the output from your notebooks when you submit them as we'll look at the output as well as your code for grading purposes.

* **### YOUR CODE HERE** indicates that you are supposed to write code.

* If you want to, you can run all of the cells in section 1 in bulk. This is setup work and no questions are in there. At the end of section 1 we will state all of the relevant variables that were defined and created in section 1.

* Finally, unless otherwise indicated your validation accuracy will be 0.65 or higher if you have correctly implemented the model.



## 1. Setup

### 1.1. Libraries and Helper Functions

This notebook requires the TensorFlow Datasets library and other prerequisites that you must install.

In [1]:
#@title Installs

!pip install pydot --quiet
!pip install gensim --quiet
!pip install tensorflow-datasets --quiet
!pip install -U tensorflow-text --quiet
!pip install transformers --quiet

Now we are ready to do the imports.

In [2]:
#@title Imports

import numpy as np
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.layers import Embedding, Input, Dense, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K
import tensorflow_datasets as tfds
import tensorflow_text as tf_text

from transformers import BertTokenizer, TFBertModel
from transformers import logging
logging.set_verbosity_error()

import sklearn as sk
import os
import nltk
from nltk.data import find

import matplotlib.pyplot as plt

import re

import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

Below is a helper function to plot histories.

In [3]:
#@title Plotting Function

# 4-window plot. Small modification from matplotlib examples.

def make_plot(axs,
              model_history1,
              model_history2,
              model_1_name='model 1',
              model_2_name='model 2',
              ):
    box = dict(facecolor='yellow', pad=5, alpha=0.2)

    for i, metric in enumerate(['loss', 'accuracy']):
        # small adjustment to account for the 2 accuracy measures in the Weighted Averaging Model with Attention
        if 'classification_%s' % metric in model_history2.history:
            metric2 = 'classification_%s' % metric
        else:
            metric2 = metric

        y_lim_lower1 = np.min(model_history1.history[metric])
        y_lim_lower2 = np.min(model_history2.history[metric2])
        y_lim_lower = min(y_lim_lower1, y_lim_lower2) * 0.9

        y_lim_upper1 = np.max(model_history1.history[metric])
        y_lim_upper2 = np.max(model_history2.history[metric2])
        y_lim_upper = max(y_lim_upper1, y_lim_upper2) * 1.1

        for j, model_history in enumerate([model_history1, model_history2]):
            model_name = [model_1_name, model_2_name][j]
            model_metric = [metric, metric2][j]
            ax1 = axs[i, j]
            ax1.plot(model_history.history[model_metric])
            ax1.plot(model_history.history['val_%s' % model_metric])
            ax1.set_title('%s - %s' % (metric, model_name))
            ax1.set_ylabel(metric, bbox=box)
            ax1.set_ylim(y_lim_lower, y_lim_upper)

Next, we get the word2vec model from nltk.

In [4]:
#@title NLTK & Word2Vec

nltk.download('word2vec_sample')

word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))

wvmodel = KeyedVectors.load_word2vec_format(datapath(word2vec_sample), binary=False)

Now that the embedding **model** is defined, let's see how many words are in the vocabulary:

In [5]:
len(wvmodel)

What do the word vectors look like? As expected:

In [6]:
wvmodel['great'][:20]

We can now build the embedding matrix and a vocabulary dictionary:

In [7]:
EMBEDDING_DIM = len(wvmodel['university'])      # we know... it's 300

# initialize embedding matrix and word-to-id map:
embedding_matrix = np.zeros((len(wvmodel) + 1, EMBEDDING_DIM))
vocab_dict = {}

# build the embedding matrix and the word-to-id map:
for i, word in enumerate(wvmodel.index_to_key):
    embedding_vector = wvmodel[word]

    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
        vocab_dict[word] = i

# we can use the last index at the end of the vocab for unknown tokens
vocab_dict['[UNK]'] = len(vocab_dict)

In [8]:
embedding_matrix.shape

In [9]:
embedding_matrix[:5, :5]

The last row consists of all zeros. We will use that for the UNK token, the placeholder token for unknown words.
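
To make the layout of the matrix concrete, here is a minimal sketch of the same construction on a toy vocabulary. The three words and their 2-dimensional vectors are made up for illustration and have nothing to do with the real word2vec sample; only the pattern (one row per word, one extra all-zero row for `[UNK]`) mirrors the cell above.

```python
# Toy stand-in for the word2vec model: three words with 2-dimensional
# vectors. These values are hypothetical and only illustrate the layout.
toy_vectors = {"good": [0.5, 0.1], "bad": [-0.4, 0.2], "film": [0.0, 0.9]}
toy_dim = 2

# One row per word, plus one extra all-zero row reserved for unknown tokens.
toy_matrix = [[0.0] * toy_dim for _ in range(len(toy_vectors) + 1)]
toy_vocab = {}
for i, (word, vector) in enumerate(toy_vectors.items()):
    toy_matrix[i] = list(vector)
    toy_vocab[word] = i

# The unknown-token id is the index of the last (all-zero) row.
toy_vocab["[UNK]"] = len(toy_vocab)

print(toy_vocab)       # {'good': 0, 'bad': 1, 'film': 2, '[UNK]': 3}
print(toy_matrix[-1])  # [0.0, 0.0]
```

Because the `[UNK]` id is assigned after all real words, it always points at the final row, which was never overwritten and therefore stays zero.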

### 1.2 Data Acquisition


We will use the IMDB dataset delivered as part of the TensorFlow-datasets library, and split into training and test sets. For expedience, we will limit ourselves in terms of train and test examples.

In [10]:
train_data, test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:80%]', 'test[80%:]'),
    as_supervised=True)

train_examples, train_labels = next(iter(train_data.batch(20000)))
test_examples, test_labels = next(iter(test_data.batch(5000)))

It is always highly recommended to look at the data. What do the records look like? Are they clean or do they contain a lot of cruft (potential noise)?

In [11]:
train_examples[:4]

In [12]:
train_labels[:4]

For convenience, in this assignment we will define a sequence length and truncate all records at that length. For records that are shorter than our defined sequence length we will add padding tokens to ensure that our input shapes are consistent across all records.

In [13]:
MAX_SEQUENCE_LENGTH = 100
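
As a rough illustration of the truncate-then-pad idea, the sketch below uses a hypothetical helper `pad_or_truncate` and a tiny maximum length of 4 so the effect is easy to see. The helper name and the pad id of 0 are assumptions for this example only.

```python
def pad_or_truncate(ids, max_len, pad_id):
    """Return exactly max_len ids: cut long inputs, right-pad short ones."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

# A record that is too long is cut off at max_len.
print(pad_or_truncate([7, 8, 9, 10, 11], max_len=4, pad_id=0))  # [7, 8, 9, 10]

# A record that is too short is filled up with the pad id.
print(pad_or_truncate([7, 8], max_len=4, pad_id=0))             # [7, 8, 0, 0]
```

Either way, every record ends up with the same length, which is what allows the records to be stacked into one rectangular array.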

### 1.3. Data Preparation

#### 1.3.1. Training/Test Sets for Word2Vec-based Models

First, we tokenize the data:

In [14]:
tokenizer = tf_text.WhitespaceTokenizer()
train_tokens = tokenizer.tokenize(train_examples)
test_tokens = tokenizer.tokenize(test_examples)

Let's look at some tokens. Do they look acceptable?

In [15]:
train_tokens[0]

Yup... looks right. Of course we will need to take care of the encoding later.
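
One thing worth keeping in mind when inspecting the tokens: splitting on whitespace leaves punctuation attached and does not change case. The sketch below uses Python's `str.split()` as a stand-in for the whitespace tokenizer (an assumption; the two are not guaranteed to behave identically on every input) and a made-up four-entry vocabulary to show how such tokens can end up mapped to `[UNK]`.

```python
review = "Great movie, truly great!"

# Split on runs of whitespace only; punctuation stays glued to the words.
tokens = review.split()
print(tokens)  # ['Great', 'movie,', 'truly', 'great!']

# Hypothetical lower-case vocabulary with an unknown-token entry.
toy_vocab = {"great": 0, "movie": 1, "truly": 2, "[UNK]": 3}
ids = [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]
print(ids)  # [3, 3, 2, 3]
```

Only `truly` matches a vocabulary entry exactly; the other three tokens differ by case or trailing punctuation and fall back to the unknown id.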

Next, we define a simple function that converts the tokens above into the appropriate word2vec index values.   

In [16]:
def docs_to_vocab_ids(tokenized_texts_list):
    """
    Convert tokenized texts into a 2-D array of word2vec vocab ids,
    truncated or padded to MAX_SEQUENCE_LENGTH.
    """
    texts_vocab_ids = []
    for i, token_list in enumerate(tokenized_texts_list):

        # Get the vocab id for each token in this doc ([UNK] if not in vocab)
        vocab_ids = []
        for token in list(token_list.numpy()):
            decoded = token.decode('utf-8', errors='ignore')
            if decoded in vocab_dict:
                vocab_ids.append(vocab_dict[decoded])
            else:
                vocab_ids.append(vocab_dict['[UNK]'])

        # Truncate text to max length, add padding up to max length
        vocab_ids = vocab_ids[:MAX_SEQUENCE_LENGTH]
        n_padding = (MAX_SEQUENCE_LENGTH - len(vocab_ids))
        # For simplicity in this model, we'll just pad with unknown tokens
        vocab_ids += [vocab_dict['[UNK]']] * n_padding
        # Add this example to the list of converted docs
        texts_vocab_ids.append(vocab_ids)

        if i % 5000 == 0:
            print('Examples processed: ', i)

    print('Total examples: ', len(texts_vocab_ids))
    return np.array(texts_vocab_ids)

Now we can create training and test data that can be fed into the models of interest. We need to convert all of the tokens into their respective input ids.

In [17]:
train_input_ids = docs_to_vocab_ids(train_tokens)
test_input_ids = docs_to_vocab_ids(test_tokens)

train_input_labels = np.array(train_labels)
test_input_labels = np.array(test_labels)

Let's convince ourselves that the data looks correct:

In [18]:
train_input_ids[:2]

#### 1.3.2. Training/Test Sets for BERT-based models

We already imported the BERT model and the Tokenizer libraries. Now, let's load the pretrained BERT model and tokenizer. Always make sure to load the tokenizer that goes with the model you're going to use.

In [19]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bert_model = TFBertModel.from_pretrained('bert-base-cased')

Next, we will preprocess our train and test data for use in the BERT model. We need to convert our documents into vocab IDs, like we did above with the Word2Vec vocabulary. But this time we'll use the BERT tokenizer, which has a different vocabulary specific to the BERT model we're going to use.

In [20]:
#@title BERT Tokenization of training and test data

train_examples_str = [x.decode('utf-8') for x in train_examples.numpy()]
test_examples_str = [x.decode('utf-8') for x in test_examples.numpy()]

bert_train_tokenized = bert_tokenizer(train_examples_str,
              max_length=MAX_SEQUENCE_LENGTH,
              truncation=True,
              padding='max_length',
              return_tensors='tf')
bert_train_inputs = [bert_train_tokenized.input_ids,
                     bert_train_tokenized.token_type_ids,
                     bert_train_tokenized.attention_mask]
bert_train_labels = np.array(train_labels)

bert_test_tokenized = bert_tokenizer(test_examples_str,
              max_length=MAX_SEQUENCE_LENGTH,
              truncation=True,
              padding='max_length',
              return_tensors='tf')
bert_test_inputs = [bert_test_tokenized.input_ids,
                     bert_test_tokenized.token_type_ids,
                     bert_test_tokenized.attention_mask]
bert_test_labels = np.array(test_labels)

Overall, here are the key variables and sets that we encoded for word2vec and BERT and that may be used moving forward. If the variable naming does not make it obvious, we also state the purpose:

#### Parameters:

* MAX_SEQUENCE_LENGTH (100)


#### Word2vec-based models:

* train(/test)_input_ids: input ids for the training(/test) sets for word2vec models
* train(/test)_input_labels: the corresponding labels

#### BERT:


* bert_train(/test)_inputs: list of input_ids, token_type_ids and attention_mask for the training(/test) sets for BERT models
* bert_train(/test)_labels: the corresponding labels for BERT

**NOTE:** We recommend you inspect these variables if you have not gone through the code.

### 1.4 Keras Functional API Warm-Up

Shown below is the output of a call to model summary.  It shows a network with specific named layers.  You are to reproduce the model that generated this summary.

**QUESTION:**

1.a Create a model using the Keras functional API so that the model.summary() call of your model identically reproduces the model summary shown here:

**Model Summary Output To Reproduce**
```
Model: "a2_question1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
_________________________________________________________________
 input_words (InputLayer)    [(None, 100)]             0         
                                                                 
 embedding (Embedding)       (None, 100, 300)          13194600  
                                                                 
 lambda (Lambda)             (None, 300)               0         
                                                                 
 hidden1 (Dense)             (None, 300)               90300     
                                                                 
 hidden2 (Dense)             (None, 200)               60200     
                                                                 
 output (Dense)              (None, 5)                 1005      
                                                                 
__________________________________________________________________
Total params: 13346105 (50.91 MB)
Trainable params: 151505 (591.82 KB)
Non-trainable params: 13194600 (50.33 MB)
_________________________________________________________________
```
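
The parameter counts in the summary can be checked by hand, which is a useful way to read off layer sizes before writing any model code. The sketch below is plain arithmetic, assuming each Dense layer has a weight matrix plus one bias per output unit (the Keras default); `dense_params` is a hypothetical helper, not part of the assignment code.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

embedding = 13194600               # embedding table size, as listed in the summary
hidden1 = dense_params(300, 300)   # averaged 300-dim vector -> 300 units
hidden2 = dense_params(300, 200)
output = dense_params(200, 5)

trainable = hidden1 + hidden2 + output
print(hidden1, hidden2, output)    # 90300 60200 1005
print(trainable)                   # 151505
print(trainable + embedding)       # 13346105
print(embedding // 300)            # 43982
```

The three Dense layers add up to the reported trainable total, so the embedding table must account for all of the non-trainable parameters. Dividing the embedding size by the vector dimension gives the number of rows in the table.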

In [21]:
input_x = Input(shape = (MAX_SEQUENCE_LENGTH,), name="input_words")

### YOUR CODE HERE





### END YOUR CODE

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics = ['accuracy'])

In [22]:
# Run this cell to generate your summary and check it against the summary output above
model.summary()

## 2. Classification with various Word2Vec-based Models

**QUESTION:**

2.a. Revisit the dataset. Is it balanced? Find the percentage of positive examples in the training set. (Copy and paste the decimal value for your calculation, e.g. a number like 0.5678 or 0.8765)

In [23]:
### YOUR CODE HERE
### END YOUR CODE

**QUESTION:**

2.b. Now find the percentage of positive examples in the test set.  (Copy and paste the decimal value for your calculation, e.g. a number like 0.5678 or 0.8765)

In [24]:
### YOUR CODE HERE
### END YOUR CODE
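
As a reminder of the underlying arithmetic: with 0/1 labels, the share of positive examples is simply the mean of the labels. The sketch below uses a made-up list of eight labels, not the IMDB data.

```python
# Hypothetical 0/1 sentiment labels (1 = positive, 0 = negative).
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0]

# The mean of 0/1 labels equals the fraction of positive examples.
positive_share = sum(toy_labels) / len(toy_labels)
print(positive_share)  # 0.5
```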

### 2.1 The Role of Shuffling of the Training Set


We will first revisit the DAN model.

Reuse the code from the class notebook to build a DAN network with one hidden layer of dimension 100. The optimizer should be Adam. Wrap the model creation in a function according to this API:

In [25]:
def create_dan_model(retrain_embeddings=False,
                     max_sequence_length=MAX_SEQUENCE_LENGTH,
                     hidden_dim=100,
                     dropout=0.3,
                     embedding_initializer='word2vec',
                     learning_rate=0.001):
  """
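  Construct the DAN model including the compilation and return it. Parametrize it using the arguments.
  :param retrain_embeddings: boolean, indicating whether the word embeddings are trainable
  :param max_sequence_length: number of token ids per input example
  :param hidden_dim: dimension of the hidden layer
  :param dropout: dropout rate applied to the hidden layer
  :param embedding_initializer: 'word2vec' to start from the pretrained matrix; any other value uses a uniform random initializer
  :param learning_rate: learning rate for the Adam optimizer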

  :returns: the compiled model
  """

  if embedding_initializer == 'word2vec':
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)
  else:
    embeddings_initializer='uniform'


  ### YOUR CODE HERE

  # start by creating the dan_embedding_layer. Use the embeddings_initializer variable defined above.







  ### END YOUR CODE
  return dan_model


Let us create a sorted version of the training dataset to run some simulations:

In [26]:
sorted_train_input_data = [(x, y) for (x, y) in zip(list(train_input_ids), list(train_input_labels))]
sorted_train_input_data.sort(key = lambda x: x[1])
sorted_training_input_ids = np.array([x[0] for x in sorted_train_input_data])
sorted_training_labels = np.array([x[1] for x in sorted_train_input_data])
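
To see what sorting by label does to the mini-batches when nothing is shuffled, consider this small sketch. The eight labels and the batch size of 4 are made up, and `batches` is a hypothetical helper that just slices a list into consecutive chunks.

```python
def batches(labels, size):
    """Slice a list into consecutive chunks of the given size."""
    return [labels[i:i + size] for i in range(0, len(labels), size)]

toy_labels = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical, already mixed
sorted_labels = sorted(toy_labels)      # all 0s first, then all 1s

print(batches(toy_labels, 4))     # [[1, 0, 1, 0], [0, 1, 1, 0]]
print(batches(sorted_labels, 4))  # [[0, 0, 0, 0], [1, 1, 1, 1]]
```

With sorted data and no shuffling, each batch in this toy case contains a single class, so every gradient step sees only one kind of example.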

Next, create your DAN model using the default parameters and train it by:

1.  Using the sorted dataset
2.  Using 'shuffle=False' as one of the model.fit parameters.
3.  Train for 10 epochs with a batch size of 32

Make sure you store the history (name it 'dan_sorted_history') as we did in the lesson notebooks.



In [27]:
### YOUR CODE HERE

dan_model_sorted = create_dan_model()

#use dan_sorted_history = ... below
### END YOUR CODE

**QUESTION:**

2.1.a What is the final validation accuracy that you observed after you completed the 10 epochs? (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.5678 or 0.8765)

Hint: You should have an accuracy number above 0.30.



Next, recreate the same model and train it with **'shuffle=True'**. (Note that this is also the default.) Use 'dan_shuffled_history' for the history.

In [28]:
### YOUR CODE HERE

dan_model_shuffled = create_dan_model()

#use dan_shuffled_history = ... below

### END YOUR CODE

**QUESTION:**

2.1.b What is the final validation accuracy that you observed for the shuffled run after completing 10 epochs? (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.5678 or 0.8765)


Compare the 2 histories in a plot.

In [29]:
fig, axs = plt.subplots(2, 2)
fig.subplots_adjust(left=0.2, wspace=0.6)
make_plot(axs,
          dan_sorted_history,
          dan_shuffled_history,
          model_1_name='sorted',
          model_2_name='shuffled')

fig.align_ylabels(axs[:, 1])
fig.set_size_inches(18.5, 10.5)
plt.show()

### 2.2 DAN vs Weighted Averaging Models using Attention

#### 2.2.1. Warm-Up: Manual Attention Calculation

**QUESTION:**

2.2.1.a Calculate the context vector for the following query and key/value vectors. You can do this manually, or you can use


```
tf.keras.layers.Attention()
```

2.2.1.b What are the weights for the key/value vectors?


In [30]:
q = [1, 2., 1]

k1 = v1 = [-1, -1, 3.]
k2 = v2 = [1, 2, -5.]

In [31]:
### YOUR CODE HERE
### END YOUR CODE
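
For reference, here is a minimal sketch of the calculation on a different, deliberately simple set of vectors (not the ones from the question). It assumes unscaled dot-product attention: scores are query-key dot products, weights are the softmax of the scores, and the context vector is the weighted sum of the values. To our understanding this matches the default behavior of `tf.keras.layers.Attention()`, but treat that as an assumption and verify it against your own results. The helper names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the maximum for numerical stability; the result is unchanged.
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Unscaled dot-product attention for a single query vector."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Toy vectors, chosen so the scores are simply 1 and 0.
toy_q = [1.0, 0.0]
toy_keys = toy_values = [[1.0, 0.0], [0.0, 1.0]]

weights, context = attend(toy_q, toy_keys, toy_values)
print([round(w, 4) for w in weights])  # [0.7311, 0.2689]
print([round(c, 4) for c in context])  # [0.7311, 0.2689]
```

The key that points in the same direction as the query receives the larger weight, and the context vector is pulled toward its value accordingly.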

#### 2.2.2 The 'WAN' Model


Next, we would like to improve our DAN by attempting to train a neural net that learns to put more weight on some words than others. How could we do that? **Attention** is the answer!

Here, you will build what we call a "Weighted Averaging Model using Attention" (WAN): a network that uses attention to weight the input tokens of a given example.

The core structure is the same as for the DAN network, so remember to re-use the embedding matrix you initialized earlier with word2vec embedding weights.

However, there are obviously some critical changes from the DAN:

1) How do I create a learnable query vector for the attention calculation that is supposed to generate the suitable token probabilities? And what is its size?

2) What are the key vectors for the attention calculation?

3) How does the averaging change?


First, the key vectors should be the incoming word vectors.

The query vector needs to have the same size as the word vectors, as it needs to attend to them. An easy way to create such an embedding-like vector is to take a single row of trained weights from a Dense layer: if we pass in the value one, multiplying by the weight matrix in the usual way yields exactly that row (plus the layer's bias term):


```
wan_query_layer = tf.keras.layers.Dense(embedding_matrix.shape[1])
```

That sounds great... but how do I use this to have a vector available in my calculation? And... make this vector available to all examples in the batch?

What you can use is a 'fake input-like layer' that creates a '1' for each incoming batch example, to which the query layer can then be applied.
Assuming that the input layer for your network is **wan_input_layer**, this could be done with

```
wan_one_vector = tf.Variable(tf.ones((1, 1, 1)))
wan_batch_of_ones = tf.tile(wan_one_vector, (tf.shape(wan_input_layer)[0], 1, 1))
```

You could then have the query vector available for each example through:

```
wan_query_vector = wan_query_layer(wan_batch_of_ones)

```

You will see that this structure is essentially the same as what we did for word vectors, except that we had to replace the input layer with our fake layer, as there is no actual input. We will also have **2 outputs** (discussed in a bit.)
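
Why does feeding a constant one into a Dense layer give a learnable vector? The sketch below spells out the arithmetic in plain Python for a hypothetical layer with 1 input unit and 3 output units. The weight values are made up; the zero bias reflects the Keras default initializer, and during training both weights and bias would be updated.

```python
# Hypothetical Dense layer: 1 input unit, 3 output units, no activation.
kernel = [[0.2, -0.5, 0.7]]   # shape (1, 3): a single row of weights
bias = [0.0, 0.0, 0.0]        # Dense biases start at zero by default

def dense(inputs):
    """Compute inputs @ kernel + bias for one example."""
    return [
        sum(x * kernel[i][j] for i, x in enumerate(inputs)) + bias[j]
        for j in range(len(bias))
    ]

# One constant '1' per example in a batch of four.
batch_of_ones = [[1.0] for _ in range(4)]
queries = [dense(example) for example in batch_of_ones]

print(queries[0])  # [0.2, -0.5, 0.7]
```

Every example in the batch gets the same vector, and that vector is just the layer's parameters, so learning the Dense weights amounts to learning one shared query.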

How does the averaging change? You should use:

```
tf.keras.layers.Attention()
```

and make sure you consider the proper inputs and outputs for that calculation.

So why 2 outputs, and how do we do that? First off, we need the output that makes the classification, as always. What is the second output? We also would like our model to provide us with the attention weights it calculated. This will tell us which words were considered how much for the context creation.

Can we implement 2 outputs? You need to pass a list of the two outputs. But note that you may also want a list of 2 cost functions and 2 metrics. You can use 'None' both times to account for the new second output, and you can ignore the corresponding values that the model reports. (In general, the total loss will be a sum of the individual losses. So one would rather construct a loss that always returns zero for the second loss, but as it is very small we can ignore this here.)

Finally, you may want to reshape the output after the Attention layer, because the Attention layer will still give a sequence of vectors for each example. It will just be a sequence of one weighted average vector for each example. You may want to remove that middle dimension of size one so you just have a single vector for each example. You can do that with layers.Reshape():

```
wan_attention_output = tf.keras.layers.Reshape((wan_attention_output.shape[-1],))(wan_attention_output)
```

In [32]:
def create_wan_model(retrain_embeddings=False,
                     max_sequence_length=MAX_SEQUENCE_LENGTH,
                     hidden_dim=100,
                     dropout=0.3,
                     embedding_initializer='word2vec',
                     learning_rate=0.001):
  """
  Construct the WAN model including the compilation and return it. Parametrize it using the arguments.
  :param retrain_embeddings: boolean, indicating whether the word embeddings are trainable
  :param hidden_dim: dimension of the hidden layer
  :param dropout: dropout applied to the hidden layer

  :returns: the compiled model
  """

  if embedding_initializer == 'word2vec':
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)
  else:
    embeddings_initializer='uniform'

  ### YOUR CODE HERE

  wan_one_vector = tf.Variable(tf.ones((1, 1, 1)))

  wan_batch_of_ones = tf.tile(wan_one_vector, (tf.shape(wan_input_layer)[0], 1, 1))

  wan_query_layer = tf.keras.layers.Dense(embedding_matrix.shape[1])

  wan_query_vector = wan_query_layer(wan_batch_of_ones)





  ### END YOUR CODE

  return wan_model


Now train the model for the same dataset as we did for the DAN model (shuffled data) and save its history in a variable named 'wan_history'.

In [33]:
### YOUR CODE HERE

wan_model = create_wan_model()

# use wan_history = ... below


### END YOUR CODE

**QUESTION:**

2.2.2.a What is the final validation accuracy that you observed for the WAN training after 10 epochs? (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.5678 or 0.8765)


Now compare the results of the initial dan_model training and the wan_model training:

In [34]:
fig, axs = plt.subplots(2, 2)
fig.subplots_adjust(left=0.2, wspace=0.6)
make_plot(axs,
          dan_shuffled_history,
          wan_history,
          model_1_name='dan',
          model_2_name='wan')

fig.align_ylabels(axs[:, 1])
fig.set_size_inches(18.5, 10.5)
plt.show()

Next, let us see which words mattered most for the wan_model's classification prediction and which mattered least. How can we tell? We can look at the attention weights!

Let's look at the first training example.  We'll need to convert the input_ids back into the associated strings.

In [35]:
train_examples[0].numpy().decode('utf-8')

The corresponding list of input ids that are suitably formatted, i.e. with sequence length 100, are these:

In [36]:
probe_input_ids = train_input_ids[:1]
probe_input_ids

and the first 10 corresponding tokens are:

In [37]:
probe_tokens = [x.decode('utf-8') for x in train_tokens[0].numpy()][:100]
probe_tokens[:10]

Using only the first record in the training set, identify the **5 words** with the highest impact and the **5 words** with the lowest impact on the score, i.e., identify the 5 words with the largest and smallest weights, respectively. (Note that multiple occurrences of the same word count separately for the exercise.)

HINT: You should create a list of (word/weight) pairs, and then sort by the second argument. Python's '.sort()' function may come in handy.  And make sure you decode the integer ids.
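
The sorting pattern from the hint can be sketched on made-up data as follows. The four tokens and their weights are hypothetical; the point is only the use of `sort()` with a key on the second element and `reverse=True`, so the largest weight comes first.

```python
# Hypothetical tokens and attention weights for a four-token example.
toy_tokens = ["the", "plot", "was", "brilliant"]
toy_weights = [0.05, 0.20, 0.05, 0.70]

toy_pairs = list(zip(toy_tokens, toy_weights))

# Sort by weight (second element of each pair), largest first.
toy_pairs.sort(key=lambda pair: pair[1], reverse=True)

print(toy_pairs[:2])   # [('brilliant', 0.7), ('plot', 0.2)]
print(toy_pairs[-2:])  # [('the', 0.05), ('was', 0.05)]
```

Python's sort is stable, so tokens with equal weights keep their original relative order.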

In [38]:
### YOUR CODE HERE

# 'pairs' should be the variable that holds the token/weight pairs.



### END YOUR CODE

print('most important tokens:')
print('\t', pairs[:10])
print('\nleast important tokens:')
print('\t', pairs[-10:])



 **QUESTION:**

 2.2.2.b List the 5 most important words, with the most important first. (Again, if a word appears twice, you can include it twice.)

 2.2.2.c List the 5 least important words in descending order. (Again, if a word appears twice, note it twice in the answers file.)

### 2.3 Approaches for Training of Embeddings

Rerun the DAN Model in 3 separate configurations:


1.   embedding_initializer = 'word2vec' and retrain_embeddings=False
2.   embedding_initializer = 'word2vec' and retrain_embeddings=True
3.   embedding_initializer = 'uniform' and retrain_embeddings=True


**NOTE:** Train the model with static embeddings for 10 epochs and the ones with trainable embeddings for 3 epochs each.

What do you observe about the effects of initializing and retraining the embedding matrix?



In [39]:
### YOUR CODE HERE


### END YOUR CODE

**QUESTION:**

2.3.a First, what is the final validation accuracy that you just observed for the static model initialized with word2vec after 10 epochs? (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.5678 or 0.8765)

In [40]:
### YOUR CODE HERE


### END YOUR CODE

**QUESTION:**


2.3.b What is the final validation accuracy that you observed for the model where you initialized with word2vec vectors but allow them to retrain for 3 epochs? (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.5678 or 0.8765)



In [41]:
### YOUR CODE HERE


### END YOUR CODE

**QUESTION:**

2.3.c What is the final validation accuracy that you observed for the model where you initialized randomly and then trained?  (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.5678 or 0.8765)





## 3. BERT-based Classification Models

Now we turn to classification with BERT. We will perform classifications with various models that are based on pre-trained BERT models.


### 3.1. Basics

Let us first explore some basics of BERT.

We've already loaded the pretrained BERT model and tokenizer that we'll use ('bert-base-cased').

Now, consider this input:

In [42]:
test_input = ['this bank is closed on Sunday', 'the steepest bank of the river is dangerous']

Now apply the BERT tokenizer to tokenize it:

In [43]:
tokenized_input = bert_tokenizer(test_input,
                                 max_length=12,
                                 truncation=True,
                                 padding='max_length',
                                 return_tensors='tf')

tokenized_input

 **QUESTION:**

 3.1.a  Why do the attention_masks have 4 and 1 zeros, respectively?  Choose the correct one and enter it in the answers file.

  *  For the first example the last four tokens belong to a different segment. For the second one it is only the last token.

  *  For the first example 4 positions are padded while for the second one it is only one.

------


Next, let us look at the BERT outputs for these 2 sentences:

In [44]:
### YOUR CODE HERE

# bert_output = ...


### END YOUR CODE

 **QUESTION:**

 3.1.b How many outputs are there?

 Enter your code below.

In [45]:
### YOUR CODE HERE

#b. -> print it out



### END YOUR CODE

**QUESTION:**

 3.1.c Which output do we need to use to get token-level embeddings?

 * the first

 * the second

 Put your answer in the answers file.

**QUESTION:**

 3.1.d In the tokenized input, which input_id number (i.e. the vocabulary id) corresponds to 'bank' in the two sentences? ('bert_tokenizer.tokenize()' may come in handy, and don't forget the CLS token!)


**QUESTION:**

 3.1.e In the array of tokens, which position index number corresponds to 'bank' in the first sentence? ('bert_tokenizer.tokenize()' may come in handy, and don't forget the CLS token!)

In [46]:
### YOUR CODE HERE

#d/e. -> Look at tokens generated by the bert tokenizer for the first example


### END YOUR CODE

**QUESTION:**

3.1.f Which array position index number corresponds to 'bank' in the second sentence?

In [47]:
### YOUR CODE HERE

#f. -> Look at tokenization for the second example


### END YOUR CODE

**QUESTION:**

 3.1.g What is the cosine similarity between the BERT embeddings for the two occurrences of 'bank' in the two sentences?

In [48]:
### YOUR CODE HERE

#g.  -> get the vectors and calculate cosine similarity between the two 'bank' BERT embeddings




### END YOUR CODE
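
As a refresher, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below implements it in plain Python on toy 2-dimensional vectors; `cosine_similarity` is a hypothetical helper, and the same formula can be applied to the BERT vectors with whatever array library you prefer.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity 1.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))            # 1.0

# Orthogonal vectors: similarity 0.
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))            # 0.0

# 45 degrees apart.
print(round(cosine_similarity([1.0, 1.0], [1.0, 0.0]), 4))  # 0.7071
```

The measure depends only on direction, not on vector length, which is why it is a common choice for comparing embeddings.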

**QUESTION:**

3.1.h How does this relate to the cosine similarity of 'this' (in sentence 1) and the first 'the' (in sentence 2)? Compute their cosine similarity.


In [49]:
### YOUR CODE HERE

#h.  -> get the vectors and calculate cosine similarity


### END YOUR CODE

### 3.2 CLS-Token-based Classification

In the live session we discussed classification with BERT using the pooled token. We now will do the same but extract the [CLS] token output for each example and use that for classification purposes.

Consult the model from the live session and change accordingly. Make sure the BERT model is fully trainable.

**HINT:**
You will want to extract the output of the [CLS] token from the BERT output similarly to what we did above to get the output for 'bank', etc.


In [50]:
def create_bert_cls_model(bert_base_model,
                          max_sequence_length=MAX_SEQUENCE_LENGTH,
                          hidden_size = 100,
                          dropout=0.3,
                          learning_rate=0.00005):
    """
    Build a simple classification model with BERT. Use the CLS Token output for classification purposes.
    """

    ### YOUR CODE HERE











    ### END YOUR CODE

    return classification_model

Now create the model and train for 2 epochs. Use batch size 8 and the appropriate validation/test set. (We don't make a distinction here between validation and test although we might in other contexts.)


In [51]:
### YOUR CODE HERE


### END YOUR CODE

 **QUESTION:**

 3.2.a What is the final validation accuracy that you observed for the [CLS]-classification model after training for 2 epochs? (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.5678 or 0.8765)




### 3.3 Classification by Averaging the BERT outputs

Instead of using only the output vector for the [CLS] token, we will now average the output vectors from BERT for all of the tokens in the full sequence.

**HINT:**
You will want to get the full sequence of token output vectors from the BERT model and then apply an average across the tokens. You may want to use:

```
tf.math.reduce_mean()
```
but you can also do it in other ways.
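
What averaging across tokens means can be sketched for a single example in plain Python. The three token vectors below are made up, and `mean_over_tokens` is a hypothetical helper; in the model, the same reduction runs over the token axis of a batched tensor.

```python
# Hypothetical output for one example: 3 tokens, hidden size 2.
toy_outputs = [[1.0, 2.0],
               [3.0, 4.0],
               [5.0, 6.0]]

def mean_over_tokens(token_vectors):
    """Average each hidden dimension across all token positions."""
    n_tokens = len(token_vectors)
    return [sum(column) / n_tokens for column in zip(*token_vectors)]

print(mean_over_tokens(toy_outputs))  # [3.0, 4.0]
```

The sequence of token vectors collapses into one vector of the hidden size. Note that a plain mean like this one also counts padded positions.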



In [52]:
def create_bert_avg_model(bert_a_model,
                          max_sequence_length=MAX_SEQUENCE_LENGTH,
                          hidden_size = 100,
                          dropout=0.3,
                          learning_rate=0.00005):
    """
    Build a simple classification model with BERT. Use the average of the BERT output tokens
    """

    ### YOUR CODE HERE











    ### END YOUR CODE

    return classification_model

Now create the model and train for 2 epochs. Use batch size 8 and the appropriate validation/test set. (We again don't make a distinction here.)  Remember that all layers of the BERT model should be trainable.

In [53]:
### YOUR CODE HERE


### END YOUR CODE

 **QUESTION:**

 3.3.a What is the final validation accuracy that you observed for the BERT-averaging-classification model after training for 2 epochs? (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.5678 or 0.8765)




### 3.4 Adding a CNN on top of BERT

Can we also combine advanced architectures? Absolutely! In the end we are dealing with tensors and it does not matter whether they are coming from static word2vec embeddings or context-based BERT embeddings. (Whether we want to is another question, but let's try it here.)


**HINT:**
You should appropriately stitch together the BERT-based components and the CNN components from the lesson notebook. Remember that BERT provides a sequence of contextualized token embeddings as its main output, and a CNN takes a sequence of vectors as input.

Use the provided hyperparameters for CNN filter sizes and numbers of filters. Keep the same hyperparameters for the rest of the model, including a dropout layer and dense layer after the CNN, with the provided dropout rate and hidden_size. Again make sure the BERT model is trainable.
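
It can help to work out the tensor shapes before stitching the pieces together. The sketch below is shape arithmetic only, under assumptions that may not match the lesson notebook exactly: each kernel size gets its own 1-D convolution with stride 1 and no padding, each branch is reduced with a global max over positions, and the branch outputs are concatenated.

```python
sequence_length = 100                  # MAX_SEQUENCE_LENGTH
num_filters = [131, 127, 51, 23, 17]   # provided hyperparameters
kernel_sizes = [2, 3, 4, 5, 7]

# With stride 1 and no padding, a kernel of width k fits in L - k + 1 positions.
conv_lengths = [sequence_length - k + 1 for k in kernel_sizes]
print(conv_lengths)  # [99, 98, 97, 96, 94]

# Global max pooling keeps one value per filter, so after concatenation
# the feature vector has one entry per filter across all branches.
concat_features = sum(num_filters)
print(concat_features)  # 349
```

Under these assumptions, the dropout and dense layers that follow the CNN would receive a vector of 349 features per example, regardless of the sequence length.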

In [54]:
def create_bert_cnn_model(bert_cnn_model,
                          max_sequence_length=MAX_SEQUENCE_LENGTH,
                          num_filters = [131, 127, 51, 23, 17],
                          kernel_sizes = [2, 3, 4, 5, 7],
                          dropout = 0.3,
                          hidden_size = 275, #100
                          learning_rate=0.00005):
    """
    Build a classification model with BERT, where you apply CNN layers to the BERT output
    """

    ### YOUR CODE HERE














    ### END YOUR CODE

    return classification_model

Train this model for 2 epochs as well, with a mini-batch size of 8:

In [55]:
### YOUR CODE HERE


### END YOUR CODE

 **QUESTION:**

3.4.a What is the final validation accuracy that you observed for the BERT-CNN-classification model after 2 epochs?  (Copy and paste the decimal value for the final validation accuracy, e.g. a number like 0.5678 or 0.8765)


# That's It!
## Congratulations... You are done!
## We hope you learned a ton!