<img src="data/images/lecture-notebook-header.png" />

# NER - GMB Dataset (Groningen Meaning Bank)

The Groningen Meaning Bank (GMB) dataset is a corpus of annotated text for natural language processing (NLP) tasks, including named entity recognition (NER). The corpus contains over 1.5 million words of text from multiple sources, including news articles, Wikipedia articles, and legal documents, and is annotated with various types of linguistic information, including part-of-speech tags, dependency parses, and named entities.

The GMB dataset is notable for its high-quality, multi-layered annotations. Note that the GMB covers English text only; multilingual annotation is the focus of the related Parallel Meaning Bank. The corpus contains over 500,000 words of annotated text, with named entities annotated for categories such as person, organization, and location.

The GMB dataset is freely available for research and academic purposes, and has been used in various NLP research projects, including the development of NER models using machine learning and deep learning techniques. The dataset is also used as a benchmark for evaluating the performance of NER models in research papers and competitions. Overall, the GMB dataset is a valuable resource for anyone interested in NLP, and has contributed significantly to the development of NER models and other NLP tasks.

In this notebook, we prepare the dataset for training RNN-based models for NER later.

## Setting up the Notebook

### Import Required Packages

In [None]:
import pandas as pd

import torch
import torchtext
from torchtext.vocab import vocab
from collections import Counter, OrderedDict
from tqdm import tqdm

Lastly, `src/utils.py` provides a utility method to decompress files.

In [None]:
from src.utils import decompress_file

---

## Load Data from File

The dataset we use in this notebook is taken from [Kaggle](https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus). We provide this dataset here in the repository as a `zip` file, so we first need to extract the file.


In [None]:
print('Decompress file...')
decompress_file('data/datasets/gmb-ner/gmb-ner.zip', 'data/datasets/gmb-ner/')
print('DONE.')

Now we can read the extracted `csv` file as usual using `pandas` and have a look at the data.

In [None]:
df = pd.read_csv("data/datasets/gmb-ner/gmb-ner.csv", sep=",", encoding='Latin-1')

df.head(30)

As you can see from the output above, each line in the file represents a word together with its POS tag and NER label. Only the first word of each sentence carries a sentence marker in the first column. So let's first loop over all lines and create sentences as lists of (word, pos, label) tuples.

In [None]:
sent, sentences = [], []

for row in df.itertuples():
    # row[0] is the DataFrame index; the CSV columns follow
    nr, word, pos, label = str(row[1]), row[2], row[3], row[4]

    # Check if we have reached the next sentence
    if 'Sentence' in nr:
        # If the current sentence is not empty (just a fail safe), add it to the list of all sentences
        if len(sent) > 0:
            sentences.append(sent)
        sent = []

    # Add current word, POS tag, and NER label to the current sentence
    # (rows whose word was parsed as a missing value are not strings and get skipped)
    if isinstance(word, str):
        sent.append((word, pos, label))

# Add the final sentence, which is not followed by another 'Sentence' marker
if len(sent) > 0:
    sentences.append(sent)

# Print the number of sentences
print("Number of sentences: {}".format(len(sentences)))
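
To make the grouping logic concrete, here is a minimal, self-contained sketch on a handful of hand-made rows. The rows, the label values, and the helper name `group_rows` are made up for illustration and are not part of the dataset or the repository. Note that the final sentence has to be added after the loop, since no further sentence marker follows it.

```python
# Hand-made rows in the same shape as the CSV: (sentence marker, word, POS, label).
# The marker is only set on the first row of each sentence.
rows = [
    ("Sentence: 1", "Alice", "NNP", "B-per"),
    (None, "sleeps", "VBZ", "O"),
    (None, ".", ".", "O"),
    ("Sentence: 2", "Bob", "NNP", "B-per"),
    (None, "runs", "VBZ", "O"),
]

def group_rows(rows):
    """Group flat rows into sentences of (word, pos, label) tuples."""
    grouped, current = [], []
    for marker, word, pos, label in rows:
        # A marker signals the start of a new sentence
        if marker is not None and "Sentence" in marker:
            if current:
                grouped.append(current)
            current = []
        current.append((word, pos, label))
    # The final sentence is not followed by a marker, so add it here
    if current:
        grouped.append(current)
    return grouped

toy_sentences = group_rows(rows)
print(len(toy_sentences))
print(toy_sentences[1])
```

The five toy rows are grouped into two sentences, with three and two tuples respectively.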

## Create Vocabularies

Now we perform the well-known steps of creating the vocabularies and vectorizing each sentence so that it can be used for training neural networks. You can check out the previous lecture notebooks, which provide more details on the following steps.

In [None]:
token_counter = Counter()
pos_counter = Counter()
label_counter = Counter()

for sent in sentences:
    for token, pos, tag in sent:
        token_counter[token] += 1
        pos_counter[pos] += 1
        label_counter[tag] += 1

print(len(token_counter))
print(len(pos_counter))
print(len(label_counter))

Let's have a quick look at an example:

In [None]:
print(sentences[0])

Each element of a sentence/sequence is a 3-tuple containing the word/token, the POS tag and the NER label.
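
As a small illustration of this structure, the sketch below splits a sentence into three parallel sequences, one for the words, one for the POS tags, and one for the NER labels. The sentence and its tags are made up for the example.

```python
# A made-up sentence in the same (word, pos, label) format
example = [("Alice", "NNP", "B-per"), ("sleeps", "VBZ", "O"), (".", ".", "O")]

# zip(*...) transposes the list of 3-tuples into three parallel tuples
words, pos_tags, labels = zip(*example)

print(words)
print(pos_tags)
print(labels)
```

All three sequences have the same length, with position `i` in each describing the same word.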

The code cell below performs all the steps to create the required vocabularies. We have seen these steps in multiple other notebooks, so we skip a detailed discussion of each one. However, note that we have to create a vocabulary for all three components: the words/tokens, the POS tags, and the NER labels. Including `SPECIALS` (and thus `UNK_TOKEN`) for the POS tags and the NER labels is probably not needed -- particularly for sufficiently large datasets -- but it does no harm and keeps us on the safe side in case we do encounter an unknown POS tag or NER label.


In [None]:
PAD_TOKEN = "<PAD>"
UNK_TOKEN = "<UNK>"
SOS_TOKEN = "<SOS>"
EOS_TOKEN = "<EOS>"

SPECIALS = [PAD_TOKEN, UNK_TOKEN, SOS_TOKEN, EOS_TOKEN]

# Sort word frequencies and convert to an OrderedDict
token_counter_sorted = sorted(token_counter.items(), key=lambda x: x[1], reverse=True)
pos_counter_sorted = sorted(pos_counter.items(), key=lambda x: x[1], reverse=True)
label_counter_sorted = sorted(label_counter.items(), key=lambda x: x[1], reverse=True)

max_words = None # maximum number of words to keep; None keeps all words
token_ordered_dict = OrderedDict(token_counter_sorted[:max_words])
pos_ordered_dict = OrderedDict(pos_counter_sorted)
label_ordered_dict = OrderedDict(label_counter_sorted)

# Sanity check: report any token that is not a string
for t in token_ordered_dict:
    if not isinstance(t, str):
        print(t, type(t))

# Create vocabularies
vocab_token = vocab(token_ordered_dict, specials=SPECIALS)
vocab_pos = vocab(pos_ordered_dict, specials=SPECIALS)
vocab_label = vocab(label_ordered_dict, specials=SPECIALS)

vocab_token.set_default_index(vocab_token[UNK_TOKEN])
vocab_pos.set_default_index(vocab_pos[UNK_TOKEN])
vocab_label.set_default_index(vocab_label[UNK_TOKEN])

print("Size of token vocabulary: {}".format(len(vocab_token)))
print("Size of POS vocabulary: {}".format(len(vocab_pos)))
print("Size of label vocabulary: {}".format(len(vocab_label)))
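
To illustrate what such a vocabulary does conceptually, here is a plain-Python sketch. It is not torchtext's implementation: the helpers `build_index` and `lookup` and the toy label counts are assumptions made for this example. The special tokens receive the first indices, the remaining items follow in order of descending frequency, and unknown items fall back to the index of `<UNK>`.

```python
from collections import Counter

toy_specials = ["<PAD>", "<UNK>", "<SOS>", "<EOS>"]

def build_index(counter, specials):
    """Map specials to the first indices, then items by descending frequency."""
    # most_common() sorts by count (descending); ties keep first-seen order
    ordered = [item for item, _ in counter.most_common()]
    return {item: idx for idx, item in enumerate(list(specials) + ordered)}

def lookup(index, items, unk="<UNK>"):
    """Convert items to indices, falling back to the index of the unknown token."""
    return [index.get(item, index[unk]) for item in items]

# Made-up label counts: "O" is the most frequent label
toy_counter = Counter(["O", "O", "O", "B-per", "B-geo", "B-per"])
toy_index = build_index(toy_counter, toy_specials)

print(toy_index)
print(lookup(toy_index, ["O", "B-per", "B-xyz"]))
```

In this toy example, the most frequent label `O` receives index 4, the first index after the four special tokens, and the unseen label `B-xyz` maps to index 1, the index of `<UNK>`.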

We need to save all vocabularies for later use when training our models.

In [None]:
torch.save(vocab_token, "data/datasets/gmb-ner/gmb-ner-token.vocab")
torch.save(vocab_pos, "data/datasets/gmb-ner/gmb-ner-pos.vocab")
torch.save(vocab_label, "data/datasets/gmb-ner/gmb-ner-label.vocab")

## Vectorize Data

In the last step, we vectorize our sentences. Note that the code cell below considers only sentences with a length of 5 to 50 words, which covers by far the majority of sentences. This is just for convenience when we train our models. Note also that we simply concatenate the token indices and the POS tag indices into a single sequence, so the first half of each sequence holds the tokens and the second half holds the POS tags.


In [None]:
min_sent_len, max_sent_len = 5, 50

with open("data/datasets/gmb-ner/gmb-ner-data-vectorized.txt", "w") as output_file:
    for sent in tqdm(sentences):
        # Skip sentences that are too short or too long
        if not (min_sent_len <= len(sent) <= max_sent_len):
            continue

        # Split the (word, pos, label) tuples into three parallel sequences
        seq_token = [ tup[0] for tup in sent ]
        seq_pos = [ tup[1] for tup in sent ]
        seq_label = [ tup[2] for tup in sent ]

        # Convert each sequence into its vocabulary indices
        vec_token = vocab_token.lookup_indices(seq_token)
        vec_pos = vocab_pos.lookup_indices(seq_pos)
        vec_label = vocab_label.lookup_indices(seq_label)

        # Write token and POS indices as one sequence, followed by the label indices
        str_token_pos = " ".join([str(idx) for idx in vec_token+vec_pos])
        str_label = " ".join([str(idx) for idx in vec_label])
        output_file.write("{},{}\n".format(str_token_pos, str_label))

To show an example, the code cell below prints the last vectorized data sample. Keep in mind that the first half of the sequence holds the indices of the words of the sentence (e.g., `207 27 42 163 7 4 1446 756 1510 2057 5`), and the second half holds the indices of the corresponding POS tags (e.g., `22 21 19 7 6 7 9 11 6 4 10`).

In [None]:
print(str_token_pos)

We can also look at the NER labels for each word. In a sentence without any named entities, all words are labeled with `O` (outside of any named entity), which is represented by the index `4`.

In [None]:
print(str_label)
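
Since each line stores the token and POS tag indices as one concatenated sequence, any code that reads the file has to split that sequence in half again. The following is a minimal sketch of such a reader; the helper name `parse_line` is hypothetical and not part of the repository, and the example line simply combines index values of the kind shown above.

```python
def parse_line(line):
    """Split one line of the vectorized file into token, POS, and label indices."""
    features, labels = line.strip().split(",")
    feature_ids = [int(idx) for idx in features.split()]
    label_ids = [int(idx) for idx in labels.split()]

    # The first half holds the token indices, the second half the POS tag indices
    half = len(feature_ids) // 2
    token_ids, pos_ids = feature_ids[:half], feature_ids[half:]

    # All three sequences must describe the same number of words
    assert len(token_ids) == len(pos_ids) == len(label_ids)
    return token_ids, pos_ids, label_ids

# Example line in the format written above: "<tokens + POS tags>,<labels>"
line = "207 27 42 163 7 4 1446 756 1510 2057 5 22 21 19 7 6 7 9 11 6 4 10,4 4 4 4 4 4 4 4 4 4 4\n"
token_ids, pos_ids, label_ids = parse_line(line)

print(token_ids)
print(pos_ids)
print(label_ids)
```

The length check inside `parse_line` is a cheap safeguard against malformed lines, because a line with an odd number of feature indices would otherwise be split incorrectly without any warning.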

---

## Summary

The file `gmb-ner-data-vectorized.txt` now contains the sentences together with their POS tags and NER labels in vectorized form -- that is, each word, each POS tag, and each NER label is represented by its unique numerical index (i.e., integer value). Together with the three saved vocabularies, this representation of the dataset can now serve as input for the RNN-based architectures for training an NER tagger in the notebook "Named Entity Recognition (NER)".
