# Named Entity Recognition (NER)
In this assignment we will perform NER using RNNs.
For this task, we will use the provided dataset, which is already split into train/val/test sets. The dataset is tagged using the BIO tagging scheme, with 17 distinct tags in total.
You need to perform the following:
- Read the dataset
- Encode the data as needed
- Create a model, train it on the train set, and plot the loss and accuracy on the validation set
- Select the model that performs best on the validation set and evaluate it on the test set
- For this assignment you can report performance using the accuracy metric (computed after dealing with padding, if padding is used) and the micro and macro F1-scores
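The metrics in the last step can be sketched as follows. This is a minimal NumPy sketch, not a required implementation: the helper names `masked_accuracy` and `f1_scores` and the boolean `pad_mask` argument are assumptions made for illustration. Padded positions are dropped before any metric is computed, and the macro F1 here averages only over tags that occur in the true or predicted labels.

```python
import numpy as np

def masked_accuracy(y_true, y_pred, pad_mask):
    """Token accuracy over real tokens only (pad_mask is True at padded positions)."""
    keep = ~pad_mask
    return float((y_true[keep] == y_pred[keep]).mean())

def f1_scores(y_true, y_pred, pad_mask, num_tags):
    """Return (micro_f1, macro_f1) computed over non-padded tokens."""
    keep = ~pad_mask
    t, p = y_true[keep], y_pred[keep]
    tp = np.zeros(num_tags)
    fp = np.zeros(num_tags)
    fn = np.zeros(num_tags)
    for k in range(num_tags):
        tp[k] = np.sum((p == k) & (t == k))
        fp[k] = np.sum((p == k) & (t != k))
        fn[k] = np.sum((p != k) & (t == k))
    # micro: pool the counts of all tags, then compute a single F1
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # macro: compute F1 per tag, then average over the tags that occur
    denom = 2 * tp + fp + fn
    seen = denom > 0
    macro = (2 * tp[seen] / denom[seen]).mean()
    return float(micro), float(macro)

# toy batch: 2 sentences padded to length 3; the last position of the second is padding
y_true = np.array([[0, 1, 0], [2, 0, 0]])
y_pred = np.array([[0, 1, 1], [0, 0, 0]])
pad_mask = np.array([[False, False, False], [False, False, True]])

acc = masked_accuracy(y_true, y_pred, pad_mask)
micro_f1, macro_f1 = f1_scores(y_true, y_pred, pad_mask, num_tags=17)
print(acc, micro_f1, macro_f1)
```

Because every token receives exactly one tag, the micro F1 over all tags equals the masked accuracy, so the macro F1 is the more informative number when the `O` tag dominates.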

## Read the dataset

In [1]:
import os 
import numpy as np
import pandas as pd
import random as rnd

In [2]:
def get_vocab(vocab_path, tags_path):
    # map each word to its line index in the vocab file (indices start at 0)
    vocab = {}
    with open(vocab_path) as f:
        for i, l in enumerate(f.read().splitlines()):
            vocab[l] = i
    # <PAD> takes the next free index, so it becomes the last entry in the vocab
    vocab['<PAD>'] = len(vocab)
    # load the tags (needed to map each tag to its index)
    tag_map = {}
    with open(tags_path) as f:
        for i, t in enumerate(f.read().splitlines()):
            tag_map[t] = i

    return vocab, tag_map

In [3]:
def get_params(vocab, tag_map, sentences_file, labels_file):
    sentences = []
    labels = []

    with open(sentences_file) as f:
        for sentence in f.read().splitlines():
            # replace each token by its index if it is in vocab,
            # otherwise fall back to the index of the 'UNK' token
            s = [vocab[token] if token in vocab 
                 else vocab['UNK']
                 for token in sentence.split(' ')]
            sentences.append(s)

    with open(labels_file) as f:
        for sentence in f.read().splitlines():
            # replace each label by its index
            l = [tag_map[label] for label in sentence.split(' ')]
            labels.append(l) 
    return sentences, labels, len(sentences)

In [4]:
vocab, tag_map = get_vocab('NER/words.txt', 'NER/tags.txt')
t_sentences, t_labels, t_size = get_params(vocab, tag_map, 'NER/train/sentences.txt', 'NER/train/labels.txt')
v_sentences, v_labels, v_size = get_params(vocab, tag_map, 'NER/validate/sentences.txt', 'NER/validate/labels.txt')
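
The encoded sentences have different lengths, so batching them for an RNN usually requires padding. The following is a minimal sketch under stated assumptions: the helper `pad_batch` and the label padding value `-1` are hypothetical and not part of the provided code. The word padding index would be `vocab['<PAD>']`, which the toy example replaces with a literal.

```python
import numpy as np

def pad_batch(sentences, labels, pad_word_id, pad_label_id=-1):
    """Pad index lists to the longest sentence in the batch.

    Returns (x, y, pad_mask); pad_mask is True at padded positions.
    pad_label_id=-1 is an assumed placeholder, not a value from the dataset.
    """
    max_len = max(len(s) for s in sentences)
    x = np.full((len(sentences), max_len), pad_word_id, dtype=np.int64)
    y = np.full((len(sentences), max_len), pad_label_id, dtype=np.int64)
    pad_mask = np.ones((len(sentences), max_len), dtype=bool)
    for i, (s, l) in enumerate(zip(sentences, labels)):
        x[i, :len(s)] = s
        y[i, :len(l)] = l
        pad_mask[i, :len(s)] = False
    return x, y, pad_mask

# toy data standing in for t_sentences / t_labels
toy_sentences = [[0, 1, 2], [3, 4]]
toy_labels = [[0, 1, 0], [2, 0]]
toy_pad_id = 35179  # stands in for vocab['<PAD>']

x, y, pad_mask = pad_batch(toy_sentences, toy_labels, toy_pad_id)
print(x)
print(y)
print(pad_mask)
```

Keeping a separate `pad_mask` makes it straightforward to exclude padded positions from the loss and from the evaluation metrics later on.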

In [8]:
from pprint import pprint
pprint(tag_map)

{'B-art': 8,
 'B-eve': 14,
 'B-geo': 1,
 'B-gpe': 2,
 'B-nat': 13,
 'B-org': 5,
 'B-per': 3,
 'B-tim': 7,
 'I-art': 9,
 'I-eve': 15,
 'I-geo': 4,
 'I-gpe': 11,
 'I-nat': 16,
 'I-org': 6,
 'I-per': 10,
 'I-tim': 12,
 'O': 0}
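
To inspect predictions it helps to map tag indices back to their BIO strings and to group them into entity spans. The sketch below copies the `tag_map` printed above; the helpers `decode_tags` and `extract_spans` are hypothetical names introduced for illustration. As a simplifying assumption, an `I-` tag that does not continue an open entity of the same type is treated as outside any entity.

```python
# tag_map copied from the output above
tag_map = {'O': 0, 'B-geo': 1, 'B-gpe': 2, 'B-per': 3, 'I-geo': 4, 'B-org': 5,
           'I-org': 6, 'B-tim': 7, 'B-art': 8, 'I-art': 9, 'I-per': 10,
           'I-gpe': 11, 'I-tim': 12, 'B-nat': 13, 'B-eve': 14, 'I-eve': 15,
           'I-nat': 16}
idx_to_tag = {i: t for t, i in tag_map.items()}

def decode_tags(label_ids):
    """Map a list of tag indices back to BIO tag strings."""
    return [idx_to_tag[i] for i in label_ids]

def extract_spans(tags):
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, ent = None, None
    for i, tag in enumerate(list(tags) + ['O']):  # trailing 'O' closes an open span
        continues = tag.startswith('I-') and ent == tag[2:]
        if ent is not None and not continues:
            spans.append((ent, start, i))
            start, ent = None, None
        if tag.startswith('B-'):
            start, ent = i, tag[2:]
    return spans

toy_tags = decode_tags([0, 1, 4, 0, 3])
toy_spans = extract_spans(toy_tags)
print(toy_tags)
print(toy_spans)
```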


### Vocab mapping

In [6]:
# vocab translates from a word to a unique number
print('vocab["the"]:', vocab["the"])
# Pad token
print('padded token:', vocab['<PAD>'])

vocab["the"]: 9
padded token: 35179


## Exploring information about the data

In [7]:
# Summary statistics for the loaded data
print('The number of output tags is', len(tag_map))
# The number of vocabulary tokens (including <PAD>)
g_vocab_size = len(vocab)
print(f"Num of vocabulary words: {g_vocab_size}")
print('The training size is', t_size)
print('The validation size is', v_size)
print('An example of the first sentence is', t_sentences[0])
print('An example of its corresponding label is', t_labels[0])

The number of output tags is 17
Num of vocabulary words: 35180
The training size is 33570
The validation size is 7194
An example of the first sentence is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 9, 15, 1, 16, 17, 18, 19, 20, 21]
An example of its corresponding label is [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0]
