# Named Entity Recognition(NER) with RNNs

- [RNNs - StatQuest](https://youtu.be/AsNTP8Kwu80)
- [LSTM - StatQuest](https://youtu.be/YCzL96nL7j0)

## Imports

In [1]:
%matplotlib inline
import collections
import math
import numpy as np
import pandas as pd
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from itertools import chain
from six.moves import range
from six.moves.urllib.request import urlretrieve

seed = 54321

# %env TF_FORCE_GPU_ALLOW_GROWTH=true

## Download Data

In [12]:
import os
import requests

def download_data(url, local_directory):
    try:
        # Create the local directory if it doesn't exist
        if not os.path.exists(local_directory):
            os.makedirs(local_directory)

        response = requests.get(url)

        if response.status_code == 200:
            data = response.json()

            for item in data:
                if 'download_url' in item:
                    file_name = item['name']
                    download_url = item['download_url']
                    local_path = os.path.join(local_directory, file_name)

                    file_response = requests.get(download_url)
                    if file_response.status_code == 200:
                        with open(local_path, 'wb') as file:
                            file.write(file_response.content)
                        print(f"Downloaded {file_name}")
                    else:
                        print(f"Failed to download {file_name}")
        else:
            print("Failed to fetch file list from GitHub API")

        print("All files downloaded and saved.")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

# Replace with your GitHub URL and local directory
github_url = 'https://api.github.com/repos/ZihanWangKi/CrossWeigh/contents/data'
local_directory = 'data'

download_data(github_url, local_directory)

Downloaded conllpp_dev.txt
Downloaded conllpp_test.txt
Downloaded conllpp_train.txt
All files downloaded and saved.


In [13]:
# url = 'https://github.com/ZihanWangKi/CrossWeigh/raw/master/data/'
# dir_name = 'data'

# def download_data(url, download_dir):
#     # Create directories if doesn't exist
#     os.makedirs(download_dir, exist_ok=True)
    
#     file_names = ['conllpp_train.txt', 'conllpp_dev.txt', 'conllpp_test.txt']
    
#     for filename in file_names:
#         if not os.path.exists(os.path.join(download_dir,filename)):
#             filepath, _ = urlretrieve(url + filename, os.path.join(download_dir,filename))
#         else:
#             filepath = os.path.join(download_dir, filename)
#         print(filepath)
        
#     return

# download_data(url=url, download_dir=dir_name)

## Read the Data

### Understanding the data

The document has a single word in each line along with the associated tags of that
word. These tags are in the following order:

1. The Part-of-speech (POS) tag (e.g. noun - `NN`, verb - `VB`, determinant - `DT`, etc.)
2. Chunk tag – A chunk is a segment of text made of one or more tokens (for example, `NP` represents a noun phrase such as “The European Commission”)
3. Named entity tag (e.g. Location, Organization, Person, etc.)

Both chunk tags and named entity tags have a B- and I- prefix (e.g. B-ORG or I-ORG). These prefixes are there to differentiate the starting token of an entity/chunk from the continuing token of an entity/chunk. 

There are also five types of entities in the dataset:
- Location-based entities (`LOC`)
- Person-based entities (`PER`)
- Organization-based entities (`ORG`)
- Miscellaneous entities (`MISC`)
- Non-entities (`O`)

Finally, there’s an empty line between separate sentences. 

In [2]:
def read_data(filename):
    '''
    Read data from a file with given filename
    Returns a list of sentences (each sentence a string), 
    and list of ner labels for each string
    '''

    # print("Reading data ...")
    # master lists - Holds sentences (list of tokens), ner_labels (for each token an NER label)
    sentences, ner_labels = [], [] 
    
    # Open the file
    with open(filename,'r',encoding='latin-1') as f:        
        # Read each line
        is_sos = True # We record at each line if we are seeing the beginning of a sentence
        
        # Tokens and labels of a single sentence, flushed when encountered a new one
        sentence_tokens = []
        sentence_labels = []
        i = 0
        for row in f:
            # If we are seeing an empty line or -DOCSTART- that's a new line
            if len(row.strip()) == 0 or row.split(' ')[0] == '-DOCSTART-':
                is_sos = False
            # Otherwise keep capturing tokens and labels
            else:
                is_sos = True
                token, _, _, ner_label = row.split(' ')
                sentence_tokens.append(token.strip())
                sentence_labels.append(ner_label.strip())
            
            # When we reach the end / or reach the beginning of next
            # add the data to the master lists, flush the temporary one
            if not is_sos and len(sentence_tokens)>0:
                sentences.append(' '.join(sentence_tokens))
                ner_labels.append(sentence_labels)
                sentence_tokens, sentence_labels = [], []
    
    # print('\tDone')
    return sentences, ner_labels

In [3]:
# Train data
train_filepath = 'data\conllpp_train.txt'
train_sentences, train_labels = read_data(train_filepath) 
# Validation data
dev_filepath = 'data\conllpp_dev.txt'
valid_sentences, valid_labels = read_data(dev_filepath) 
# Test data
test_filepath = 'data\conllpp_test.txt'
test_sentences, test_labels = read_data(test_filepath) 

# Print some stats
print(f"Train size: {len(train_labels)}")
print(f"Valid size: {len(valid_labels)}")
print(f"Test size: {len(test_labels)}")

# Print some data
print('\nSample data\n')
for v_sent, v_labels in zip(train_sentences[:5], train_labels[:5]):
    print(f"Sentence: {v_sent}")
    print(f"Labels: {v_labels}\n")

Train size: 14041
Valid size: 3250
Test size: 3452

Sample data

Sentence: EU rejects German call to boycott British lamb .
Labels: ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

Sentence: Peter Blackburn
Labels: ['B-PER', 'I-PER']

Sentence: BRUSSELS 1996-08-22
Labels: ['B-LOC', 'O']

Sentence: The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .
Labels: ['O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Sentence: Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .
Labels: ['B-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'O', 

In [20]:
import pandas as pd

def read_data_with_pandas(filename):
    sentences, ner_labels = [], []
    sentence_tokens, sentence_labels = [], []
    
    with open(filename, 'r', encoding='latin-1') as f:
        for row in f:
            if len(row.strip()) == 0 or row.split(' ')[0] == '-DOCSTART-':
                if sentence_tokens:  # Ensure we don't append empty sentences
                    sentences.append(' '.join(sentence_tokens))
                    ner_labels.append(sentence_labels)
                    sentence_tokens, sentence_labels = [], []
            else:
                token, _, _, ner_label = row.split(' ')
                sentence_tokens.append(token)
                sentence_labels.append(ner_label.strip())
    
    # Create a DataFrame
    df = pd.DataFrame({'Sentence': sentences, 'NER_Labels': ner_labels})
    return df

# df = read_data_with_pandas(train_filepath)
# df

## Checking the balance of labels

One of the unique characteristics of NER tasks is the class imbalance. That is, not all classes will have a roughly equal number of samples. As you can probably guess, in a corpus, there are more non-named entities than named entities. This leads to a significant class imbalance among labels. Therefore, let’s have a look at the distribution of samples among different classes:

In [4]:
from itertools import chain

# As train_labels are list of lists, so 
# To create a flat list, we can use the chain() function 

# Print the value count for each label
print("Training data label counts")
print(pd.Series(chain(*train_labels)).value_counts())

print("\nValidation data label counts")
print(pd.Series(chain(*valid_labels)).value_counts())

print("\nTest data label counts")
print(pd.Series(chain(*test_labels)).value_counts())

Training data label counts
O         169578
B-LOC       7140
B-PER       6600
B-ORG       6321
I-PER       4528
I-ORG       3704
B-MISC      3438
I-LOC       1157
I-MISC      1155
Name: count, dtype: int64

Validation data label counts
O         42759
B-PER      1842
B-LOC      1837
B-ORG      1341
I-PER      1307
B-MISC      922
I-ORG       751
I-MISC      346
I-LOC       257
Name: count, dtype: int64

Test data label counts
O         38143
B-ORG      1714
B-LOC      1645
B-PER      1617
I-PER      1161
I-ORG       881
B-MISC      722
I-LOC       259
I-MISC      252
Name: count, dtype: int64


As we can see, `O` labels are several magnitudes higher than the volume of other labels. 

We need to keep this in mind when training the model. 

## Analysing the sequence length

Let's analyze the sequence length(i.e. number of tokens) of each sentence. We need this information later to pad our sentences to a fixed length

In [5]:
pd.Series(train_sentences).str.split().str.len().describe(percentiles=[0.05,0.95])

count    14041.000000
mean        14.501887
std         11.602756
min          1.000000
5%           2.000000
50%         10.000000
95%         37.000000
max        113.000000
dtype: float64

We can see that $95\%$ of our sentences have $37$ tokens or less.

## Processing data

**Padding/Truncating sentences to create arrays**

Now it’s time to process the data. 

- We will keep the sentences in the same format, i.e. a list of strings where each string represents a sentence. This is because we will integrate text processing right into our model (as opposed to doing it externally). 

- For labels, we have to do several changes. Remember labels are a list of lists, where the inner lists(of diff. lengths) represent labels for all the tokens in each sentence. Specifically we will do the following:
    - Convert the class labels to class IDs
    - Pad the sequences of labels to a specified maximum length
    - Generate a mask that indicates the padded labels, so that we can use this information to disregard the padded labels during model training
    
    
> *"Generate a mask that indicates the padded labels, so that we can use this information to disregard the padded labels during model training"* ***is referring to this mask creation process. It's about creating a mask that allows you to tell your model which parts of the input sequences are actual tokens and which parts are padding tokens, so that the padding tokens don't influence the learning process.***

- In the context of Named Entity Recognition (NER), when you're processing sequences of words to label them with named entity tags, you often have sentences of varying lengths. To process these sentences in batches for training, you need to ensure that all sequences in a batch have the same length. This is where padding comes in - you add special padding tokens to sequences that are shorter than the maximum length in the batch.

- However, you don't want the padding tokens to affect the learning of your model. They're not actual tokens, so you'd want your model to ignore them during the training process. This is where the concept of a "mask" comes in.

- A mask is a binary sequence (0s and 1s) that has the same length as your sequences. It's used to indicate which positions in a sequence are actual tokens and which are padding tokens. A value of 1 in the mask indicates an actual token, while a value of 0 indicates a padding token.

- So, when you're training your model, you'll use this mask to "mask out" the effects of the padding tokens. For example, during the calculation of the loss function, you can use the mask to set the loss to 0 for the positions that are padded, effectively disregarding them in the learning process. This is particularly useful when sequences have variable lengths, and you want to prevent padding tokens from affecting the gradients and learned representations.



In [6]:
def get_lable_id_map(train_labels):
    # Get the unique list of labels
    unique_train_labels = pd.Series(chain(*train_labels)).unique()
    # Create a class_label --> class_ID mapping
    label_map = dict(zip(unique_train_labels, np.arange(unique_train_labels.shape[0])))
    
    print("label_map: {}".format(label_map))
    return label_map

In [7]:
labels_map = get_lable_id_map(train_labels)

label_map: {'B-ORG': 0, 'O': 1, 'B-MISC': 2, 'B-PER': 3, 'I-PER': 4, 'B-LOC': 5, 'I-ORG': 6, 'I-MISC': 7, 'I-LOC': 8}


In [8]:
def get_padded_int_labels(labels:list[list[str]], labels_map:dict[str, int], 
                          max_seq_length:int, return_mask:bool = True):
    """
    This function takes sequences of class labels and return sequences of padded 
    class IDs, with the option to return a mask indicating padded labels.
    
    This function takes the following arguments:
        * labels (List[List[str]]) – A list of lists of strings, where each string is 
                                     a class label of the string type
        
        * labels_map (Dict[str, int]) – A dictionary mapping a string label to a 
                                        class ID of type integer
        
        * max_seq_length (int) – A maximum length to be padded to (longer sequences 
                                 will be truncated at this length)
        
        * return_mask (bool) – Whether to return the mask showing padded labels or not
    """
    
    # Convert string labels to integers
    int_labels = [[labels_map[x] for x in one_seq] for one_seq in labels]
    
    # Pad sequences
    if return_mask:
        # If we return mask, we first pad with a special value (-1) and
        # use that to create the mask and later replace -1 with 'O'
        padded_labels = np.array(
                           tf.keras.preprocessing.sequence.pad_sequences(
                                int_labels, maxlen=max_seq_length, padding='post',
                                truncating='post', value=-1
                           )
                        )
        # mask filter
        mask_filter = (padded_labels != -1)
        
        # replace -1 with 'O' s ID
        padded_labels[~mask_filter] = labels_map['O']
        
        return padded_labels, mask_filter.astype('int')
    else:
        # padded_labels = np.array(ner_pad_sequence_func(int_labels, 
        #                                                value=labels_map['O'])
        #                         )
        # return padded_labels
        return

**Explaining what happens after generating the padded_labels in the above code:**

After getting the `padded_labels`, we can simply generate the mask as a boolean filter where padded_labels is not equal to -1. 

Thus, the positions where original labels exist will have a value of 1 and the rest will have 0 in the mask. 

However, we have to convert the -1 values to a class ID found in the `labels_map`. We will give them the class ID of the label `O` (i.e. others).

### Processing the labels

Remember that the 95% percentile fell at the length of 37 words. So, let's set `max_seq_length = 40`

In [9]:
max_seq_length = 40

In [10]:
# Convert string labels to integers for all train/validation/test data
# Pad train/validation/test data
padded_train_labels, train_mask = get_padded_int_labels(train_labels, labels_map, 
                                                        max_seq_length, return_mask=True)

padded_valid_labels, valid_mask = get_padded_int_labels(valid_labels, labels_map, 
                                                        max_seq_length, return_mask=True)

padded_test_labels, test_mask  = get_padded_int_labels(test_labels, labels_map, 
                                                       max_seq_length, return_mask=True)

print(padded_train_labels.shape, train_mask.shape)
print("\nLable Map:", labels_map)
print("\nPadded Label:",padded_train_labels[0])
print("Mask:\t", train_mask[0])

(14041, 40) (14041, 40)

Lable Map: {'B-ORG': 0, 'O': 1, 'B-MISC': 2, 'B-PER': 3, 'I-PER': 4, 'B-LOC': 5, 'I-ORG': 6, 'I-MISC': 7, 'I-LOC': 8}

Padded Label: [0 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1]
Mask:	 [1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0]


You can see that the mask is indicating the true labels and padded ones clearly. `1` in the mask represents that in that position we have actual labels, while `0` in the mask represents that in those postions have been padded.

## Defining Hyperparameters

- `max_seq_length` -  Denotes the maximum length for a sequence. We infer this from our training data during data exploration. It is important to have a reasonable length for sequences, as otherwise, memory can explode, due to the unrolling of the RNN.

- `emedding_size` -  The dimensionality of token embeddings. Since we have a small corpus, a value < 100 will suffice.

- `rnn_hidden_size` - The dimensionality of hidden layers in the RNN. Increasing dimensionality of the hidden layer usually leads to better performance. However, note that increasing the size of the hidden layer causes all three sets of internal weights (that is, U, W, and V) to increase as well, thus resulting in a high computational footprint.

- `n_classes` -  Number of unique output classes present.

- `batch_size` -  The batch size for training data, validation data, and test data. A higher batch size often leads to better results as we are seeing more data during each optimization step, but just like unrolling, this causes a higher memory requirement.

- `epochs` -  The number of epochs to train the model for.

In [11]:
# The maximum length of sequences
max_seq_length = 40

# Size of token embeddings
embedding_size = 64

# Number of hidden units in the RNN layer
rnn_hidden_size = 64

# Number of output nodes in the last layer
n_classes = len(labels_map)

# Number of samples in a batch
batch_size = 64

# Number of epochs to train
epochs = 3

## Defining a Simple RNN Model

Our model will have an embedding layer, followed by a simple RNN layer, and finally a dense prediction layer

<div align='center'>
    <img src='images/model_architecture.png' title='Model Architecture'/>
</div>

### Introduction to the [`TextVectorization` layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization)

A preprocessing layer which maps text features to integer sequences.

#### Why not use an external `Tokenizer` !

- One thing to note in the work we have done so far is that, unlike in previous chapters, we haven’t yet defined a Tokenizer object. 

- Although the `Tokenizer` has been an important part of our NLP pipeline to convert each token (or word) into an ID, there's a big downside to using an external tokenizer. 

- After training the model, if you forget to save the tokenizer along with the model, your machine learning model becomes useless: to combat this, during inference, you would need to map each word to the exact ID it was mapped to during training. 

- This is a significant risk the tokenizer poses. 

The `TextVectorization` layer can be thought of as a modernized tokenizer that can be plugged into the model

In [37]:
import tensorflow.keras.backend as K
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

K.clear_session()

toy_corpus = ["I went to the market on Sunday", "The Market was empty."]
toy_vectorization_layer = TextVectorization()

# Fit it on a corpus of data
toy_vectorization_layer.adapt(toy_corpus)

toy_vectorized_output = toy_vectorization_layer(toy_corpus)
toy_vocabulary = toy_vectorization_layer.get_vocabulary()

print("With default arguments\n")
print(f"Data: \n{toy_vectorized_output}")
print(f"Vocabulary: {toy_vocabulary}")
print('-'*50)

# limit the size of the vocabulary
toy_vectorization_layer = TextVectorization(max_tokens=5)
toy_vectorization_layer.adapt(toy_corpus)

print("\nWith limited vocabulary\n")
print(f"Data: \n{toy_vectorization_layer(toy_corpus)}")
print(f"Vocabulary: {toy_vectorization_layer.get_vocabulary()}")
print('-'*50)

#  skip the text pre-processing that happens within the layer, standardize=None
toy_vectorization_layer = TextVectorization(standardize=None)
toy_vectorization_layer.adapt(toy_corpus)

print("\nWith preprocessing disabled\n")
print(f"Data: \n{toy_vectorization_layer(toy_corpus)}")
print(f"Vocabulary: {toy_vectorization_layer.get_vocabulary()}")
print('-'*50)


# we can also control the padding/truncation of sequences with the output_sequence_length
toy_vectorization_layer = TextVectorization(output_sequence_length=4) # pad/truncate sequences at length 4
toy_vectorization_layer.adapt(toy_corpus)

print("\nWith a maximum sequence length\n")
print(f"Data: \n{toy_vectorization_layer(toy_corpus)}")
print(f"Vocabulary: {toy_vectorization_layer.get_vocabulary()}")
print('-'*50)

With default arguments

Data: 
[[ 9  4  6  2  3  8  7]
 [ 2  3  5 10  0  0  0]]
Vocabulary: ['', '[UNK]', 'the', 'market', 'went', 'was', 'to', 'sunday', 'on', 'i', 'empty']
--------------------------------------------------

With limited vocabulary

Data: 
[[1 4 1 2 3 1 1]
 [2 3 1 1 0 0 0]]
Vocabulary: ['', '[UNK]', 'the', 'market', 'went']
--------------------------------------------------

With preprocessing disabled

Data: 
[[12  2  4  5  7  6 10]
 [ 9 11  3  8  0  0  0]]
Vocabulary: ['', '[UNK]', 'went', 'was', 'to', 'the', 'on', 'market', 'empty.', 'The', 'Sunday', 'Market', 'I']
--------------------------------------------------

With a maximum sequence length

Data: 
[[ 9  4  6  2]
 [ 2  3  5 10]]
Vocabulary: ['', '[UNK]', 'the', 'market', 'went', 'was', 'to', 'sunday', 'on', 'i', 'empty']
--------------------------------------------------


### Defining the Model

In [12]:
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
import tensorflow.keras.backend as K

K.clear_session()

def get_fitted_token_vectorization_layer(corpus, max_seq_length, vocabulary_size=None):
    """ Fit a TextVectorization layer on given data """
    
    # Define the layer
    vectorization_layer = TextVectorization(max_tokens=vocabulary_size, standardize=None,
                                           output_sequence_length=max_seq_length)
    
    # Fit on the text corpus
    vectorization_layer.adapt(corpus)
    
    # Get the vocabulary_size
    n_vocab = len(vectorization_layer.get_vocabulary())
    
    return vectorization_layer, n_vocab

>✒️**NOTE:** Pay attention to the various arguments we have set for the vectorization layer. We are passing the vocabulary size as `max_tokens`; we are setting the `standardize` to None. **This is an important setting.** 

**When performing NER, keeping the case of characters is very important. Typically, an entity starts with an uppercase letter (e.g. the name of a person or organization). Therefore, we should preserve the case in the text.**

Finally, we also set the `output_sequence_length` to the sequence length we found during the analysis.

In [41]:
# Input layer - that has a single column (i.e. each sentence represented as a single unit), thus shape=(1,)
word_input = layers.Input(shape=(1,), dtype=tf.string)

# Text Vectorization layer
vectorize_layer, n_vocab = get_fitted_token_vectorization_layer(corpus=train_sentences, 
                                                                max_seq_length=max_seq_length)

# Vectorized output (each word mapped to an int ID)
vectorized_out = vectorize_layer(word_input)

# Embedding layer
embedding_layer = layers.Embedding(input_dim=n_vocab, 
                                   output_dim=embedding_size,
                                   mask_zero=True)(vectorized_out)



- `mask_true` argument in `layers.Embedding`: **Masking is used to mask uninformative words added to sequences (e.g. the padding token added to make sentences a fixed length), as they do not contribute to the final outcome.** 

- In the embedding layer `mask_true=True` is set, to ignore padded values (which will be zeros).

* **
From documentation:
- `mask_zero` Boolean: 
    - Whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is `True`, then all subsequent layers in the model need to support masking or an exception will be raised. 
    
    - If `mask_zero` is set to `True`, as a consequence, index `0` cannot be used in the vocabulary (`input_dim` should equal size of `vocabulary + 1`).
    

        - In TensorFlow, most layers support masking.
        
* **

- When you enable masking in a layer, it will propagate the mask to the downstream layers, flowing down until the loss computations.

    - **In other words, you only need to enable masking at the start of the model (as we have done at the embedding layer) and the rest is taken care of by TensorFlow.**
    
* **

#### Time-Dimension in Sequences

Until now, we dealt with feed-forward networks. Outputs of feed-forward networks did not have a time dimension. But if you look at the output from the `TextVectorization` layer, it will be a `[batch_size, sequence length]` - sized output. When this output goes through an `embedding_layer`, the output would be a `[batch size, sequence length, embedding size]`- shaped tensor. In other words, there is an additional time dimension included in the output of the embedding layer.

* **

#### Masking in Sequence Learning

Masking is a commonly used technique in sequence learning.

- Naturally, text has arbitrary lengths. For example, sentences in a corpus would have a wide variety of token lengths. But neural networks process tensors with fixed dimensions. 

- To bring arbitrary-length sentences to constant length, we pad these sequences with some special value (e.g. 0). 

- However, these padded values are synthetic, and only serve as a way to ensure the correct input shape. They should not contribute to the final loss or evaluation metrics. To ignore them during loss calculation and evaluation, "masking" is used. 

- The idea is to multiply the loss resulting from padded timesteps with a zero, essentially cutting them off from the final loss.

* **

In [42]:
#  Define a simple RNN layer
rnn_layer = layers.SimpleRNN(units=rnn_hidden_size, 
                             return_sequences=True)

rnn_out = rnn_layer(embedding_layer)

Here we pass two important arguments:
- `units` (int) – This defines the hidden output size of the RNN model. The larger this is, the more representational power the model will have.

- `return_sequences` (bool) – Whether to return outputs from all the timesteps, or to return only the last output. **For NER tasks, we need to label every single token. Therefore we need to return outputs for all the time steps.**

The rnn_layer takes a `[batch size, sequence length, embedding size]` - sized tensor and returns a `[batch size, sequence length, rnn hidden size]` - sized tensor.

Finally, the time-distributed output from the RNN will go to a Dense layer with `n_classes` output nodes and a `softmax` activation

In [43]:
dense_layer = layers.Dense(n_classes, activation='softmax')
dense_out = dense_layer(rnn_out)

# Next define the model
model = tf.keras.Model(inputs=word_input, outputs=dense_out)

## Evaluation metrics & Loss function

**Dealing with high class-imbalance**

- As we saw in the following cell [Checking the balance of labels](#Checking-the-balance-of-labels), that **NER tasks carry high class imbalance.**

- It is quite normal for text to have more non-entity-related tokens than entity-related tokens. This leads to large amounts of other (`O`) type labels and fewer of the remaining types. 

- We need to take this into consideration when training the model and evaluating the model. We can address the class imbalance in two ways:
     1. We can create a new evaluation metric that is resilient to class imbalance.
     
     2. We can use sample weights to penalize more frequent classes and boost the importance of rare classes.


- Let's look at the first one, a new evaluation metric

    - We will define a modified version of the accuracy. This is called a **macro-averaged accuracy**. 

    - **In macro averaging, we compute accuracies for each class separately, and then average it. Therefore, the class imbalance is ignored when computing the accuracy.** 
    

Below we define the function to compute `macro_accuracy` using a batch of true targets (y_true) and predictions (y_pred). `y_true.shape = [batch_size, sequence length]` and `y_pred.shape = [batch size, sequence length, n_classes]`.

In [13]:
def macro_accuracy(y_true, y_pred):
    
    #  [batch size, time] => [batch size * time]
    y_true = tf.cast(tf.reshape(y_true, [-1]), 'int32')
    
    # [batch size, sequence length, n_classes] => [batch size * time]
    y_pred = tf.cast(tf.reshape(tf.argmax(y_pred, axis=-1), [-1]), 'int32')
    
    sorted_y_true = tf.sort(y_true)
    sorted_inds = tf.argsort(y_true)
    
    sorted_y_pred = tf.gather(y_pred, sorted_inds)
    
    sorted_correct = tf.cast(tf.math.equal(sorted_y_true, sorted_y_pred), 'int32')
    
    # We are adding one to make sure there are no division by zero
    correct_for_each_label = tf.cast(tf.math.segment_sum(sorted_correct, sorted_y_true), 'float32') + 1
    all_for_each_label = tf.cast(tf.math.segment_sum(tf.ones_like(sorted_y_true), sorted_y_true), 'float32') + 1
    
    mean_accuracy = tf.reduce_mean(correct_for_each_label/all_for_each_label)
    
    return mean_accuracy


# mean_accuracy_metric = tf.keras.metrics.MeanMetricWrapper(fn=macro_accuracy,
#                                                           name='macro_accuracy')

In [52]:
tf.cast(tf.math.equal([1,1,0,0,2], [3,1,0,0,3]), 'int32')

<tf.Tensor: shape=(5,), dtype=int32, numpy=array([0, 1, 1, 1, 0])>

In [57]:
tf.math.segment_sum(data=[0, 1, 2, 3, 4, 5, 6, 7],
            segment_ids= [0, 0, 0, 1, 1, 2, 3, 3]).numpy()
# then the segment sum would be [0+1+2, 3+4, 5, 6+7] = [3, 7, 5, 13]

array([ 3,  7,  5, 13])

<div align='center'><b>Explanation of macro_accuracy code from the book</b></div>

<div align='center'>
    <img src='images/macro_acc.png'/>
</div>

## Compile the Model & Model Summary

**COMPLETE CODE**

In [14]:
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
import tensorflow.keras.backend as K

K.clear_session()

# Input layer - that has a single column (i.e. each sentence represented as a single unit), thus shape=(1,)
word_input = layers.Input(shape=(1,), dtype=tf.string)

# Text Vectorization layer
vectorize_layer, n_vocab = get_fitted_token_vectorization_layer(corpus=train_sentences, 
                                                                max_seq_length=max_seq_length)

# Vectorized output (each word mapped to an int ID)
vectorized_out = vectorize_layer(word_input)

# Embedding layer
embedding_layer = layers.Embedding(input_dim=n_vocab, 
                                   output_dim=embedding_size,
                                   mask_zero=True)(vectorized_out)

#  Define a simple RNN layer
rnn_layer = layers.SimpleRNN(units=rnn_hidden_size, 
                             return_sequences=True)

rnn_out = rnn_layer(embedding_layer)

dense_layer = layers.Dense(n_classes, activation='softmax')
dense_out = dense_layer(rnn_out)

# Define the model
model = tf.keras.Model(inputs=word_input, outputs=dense_out)

# Defining the custom metric
mean_accuracy_metric = tf.keras.metrics.MeanMetricWrapper(fn=macro_accuracy, 
                                                          name='macro_accuracy')

# Complie the model
model.compile(loss=tf.keras.losses.sparse_categorical_crossentropy,
              optimizer='adam', metrics=[mean_accuracy_metric])

model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 40)               0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 40, 64)            1512000   
                                                                 
 simple_rnn (SimpleRNN)      (None, 40, 64)            8256      
                                                                 
 dense (Dense)               (None, 40, 9)             585       
                                                                 
Total params: 1,520,841
Trainable params: 1,520,841
Non-trainable params: 0
___________________________________________________

## Training and evaluating RNN on NER task

### Tackling Class Imbalance
When training the model we will use `sample_weight` to counteract class-imbalance.

To compute sample weights, we will first define a function called `get_class_weights()` that computes class_weights for each class. Next we will pass the class weights to another function, `get_sample_weights_from_class_weights()`, which will generate sample weights:

In [36]:
# print("Count of train labels")
# temp = pd.Series(chain(*train_labels)).value_counts()
# print(temp)
# print('-'*30, '\n')

# print('Getting class weights by dividing the\nmin. label count with respective label count')
# temp = temp.min()/temp
# print(temp)
# print('-'*30, '\n')

# print(f"Lable Map: {labels_map}")
# print('-'*30, '\n')

# print("Mapping the label names using above label map")
# temp.index = temp.index.map(labels_map)
# print(temp)

# del temp

In [37]:
def get_class_weights(train_labels):
    """
    Class weight is calculated by min_label_count/label_count_of_that_class.
    This way the minority class gets more weightage, thus tackling class imabalance
    """
    label_count = pd.Series(chain(*train_labels)).value_counts()
    label_count = label_count.min()/label_count
    
    label_id_map = get_lable_id_map(train_labels)
    label_count.index = label_count.index.map(label_id_map)
    return label_count.to_dict()

def get_sample_weights_from_class_weights(labels, class_weights):
    """ 
    - From the class weights generate sample weights.
    
    - This is simply mapping the class_weights with each training data sample.
    
    - The sample_weights will be the same shape as the train_labels as there’s 
      one weight for each sample.
    """
    return np.vectorize(class_weights.get)(labels)


train_class_weights = get_class_weights(train_labels)
print(f"Class weights: {train_class_weights}")

label_map: {'B-ORG': 0, 'O': 1, 'B-MISC': 2, 'B-PER': 3, 'I-PER': 4, 'B-LOC': 5, 'I-ORG': 6, 'I-MISC': 7, 'I-LOC': 8}
Class weights: {1: 0.006811025015037328, 5: 0.16176470588235295, 3: 0.175, 0: 0.18272425249169436, 4: 0.25507950530035334, 6: 0.31182505399568033, 2: 0.33595113438045376, 8: 0.9982713915298185, 7: 1.0}


You can see the class `Other(O)` has the lowest weight (because it’s the most frequent), and the class `I-MISC` has the highest as it’s the least frequent.

In [42]:
# Make train_sequences an array
train_sentences = np.array(train_sentences)

# Get sample weights (we cannot use class_weight with TextVectorization layer)
train_sample_weights = get_sample_weights_from_class_weights(padded_train_labels, 
                                                             train_class_weights)

# Training the model
history = model.fit(train_sentences, padded_train_labels,
                    sample_weight=train_sample_weights,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_data=(np.array(valid_sentences), padded_valid_labels))

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [43]:
# Evaluate on test data
model.evaluate(np.array(test_sentences), padded_test_labels)



[0.14868271350860596, 0.7812392115592957]

We get an accuracy of around $96\%$ on training data, $81\%$ on validation data, $78\%$ on testing data. Since the validation accuracy and test accuracy are on par, we can say that the model has generalized well.

## Visually analysing outputs

In [53]:
n_samples = 5
visual_test_sentences = test_sentences[:n_samples]
visual_test_labels = padded_test_labels[:n_samples]

visual_test_predictions = model.predict(np.array(visual_test_sentences))
visual_test_pred_labels = np.argmax(visual_test_predictions, axis=-1)

rev_labels_map = dict(zip(labels_map.values(), labels_map.keys()))

for i, (sentence, sent_labels, sent_preds) in enumerate(zip(visual_test_sentences, visual_test_labels, visual_test_pred_labels)):    
    n_tokens = len(sentence.split())
    print("Sample:\t", " ".join(sentence.split()))
    print("True:\t", " ".join([rev_labels_map[i] for i in sent_labels[:n_tokens]]))
    print("Pred:\t", " ".join([rev_labels_map[i] for i in sent_preds[:n_tokens]]), '\n')

Sample:	 SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT .
True:	 O O B-LOC O O O O B-LOC O O O O
Pred:	 O O B-MISC I-MISC I-MISC O O B-ORG O B-MISC I-LOC O 

Sample:	 Nadim Ladki
True:	 B-PER I-PER
Pred:	 B-ORG I-ORG 

Sample:	 AL-AIN , United Arab Emirates 1996-12-06
True:	 B-LOC O B-LOC I-LOC I-LOC O
Pred:	 B-ORG O B-LOC I-LOC I-LOC I-LOC 

Sample:	 Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
True:	 B-LOC O O O O O B-MISC I-MISC O O O O O O O B-LOC O O O O O O O O O
Pred:	 B-LOC I-LOC O O O O B-MISC I-MISC I-MISC O O O O O O B-LOC O O B-MISC O O O O O O 

Sample:	 But China saw their luck desert them in the second match of the group , crashing to a surprise 2-0 defeat to newcomers Uzbekistan .
True:	 O B-LOC O O O O O O O O O O O O O O O O O O O O O B-LOC O
Pred:	 O B-LOC O O O B-MISC O O O O O O O O O O O O O O O O O B-ORG O 



**It can be seen that our model is doing a decent job. It is good at identifying locations but is struggling at identifying the names of people.**

* **