# Named Entity Recognition
The theory behind this project is explained in this blog post [link](). The project uses a bidirectional LSTM (BiLSTM) with a Conditional Random Field (CRF) layer and static word embeddings to perform Named Entity Recognition on the CoNLL-2003 dataset. The notebook proceeds step by step: downloading and preparing the dataset and embeddings, then training the model and validating its performance.

## 1. Dataset downloading and preprocessing
The first step is to download the dataset, which contains the tagged sentences, and the static GloVe embeddings, which provide one fixed vector per word. A BiLSTM does not come with pretrained contextualised vectors the way BERT and other transformer models do (using those would be a natural improvement), so we fall back to GloVe as the input representation.

### 1.1 Downloading GloVe Embeddings
We will download the 100-dimensional variant of GloVe (a 300-dimensional variant is also available; it is heavier but more accurate). The files are available for your exploration at the [Stanford NLP GitHub](https://github.com/stanfordnlp/GloVe/releases). The following cells use a utility function to download the embeddings automatically and unzip them into the `src/data` folder.

In [1]:
from util.download import download_glove_embeddings, download_conll2003_dataset

In [2]:
download_glove_embeddings()

GloVe embeddings already exist. Skipping download.
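
The body of `download_glove_embeddings` is not shown in this notebook. As a rough illustration of the "skip if already present, otherwise download and unzip" behaviour seen in the output above, here is a minimal sketch. The function name, the `marker_name` parameter, and the injectable `fetch` argument are all assumptions made for this example, not the project's actual API.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve


def download_and_unzip(url, target_dir, marker_name, fetch=urlretrieve):
    """Download a zip archive into target_dir and extract it.

    Hypothetical helper: the download is skipped when `marker_name`
    (a file expected inside the archive) already exists in target_dir.
    `fetch` is injectable so the logic can be tried without network access.
    """
    target_dir = Path(target_dir)
    marker = target_dir / marker_name
    if marker.exists():
        print(f"{marker_name} already exists. Skipping download.")
        return False

    target_dir.mkdir(parents=True, exist_ok=True)
    archive = target_dir / "download.zip"
    fetch(url, str(archive))          # same call shape as urlretrieve(url, filename)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)     # unpack everything next to the archive
    archive.unlink()                  # the archive is no longer needed
    return True
```

Checking for an extracted file rather than the archive itself means a re-run stays cheap even after the zip has been deleted.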


### 1.2 Downloading the CoNLL-2003 Dataset
The following function downloads the CoNLL-2003 NER dataset to the same location as before.

In [3]:
download_conll2003_dataset()

CoNLL-2003 dataset already exists. Skipping download.


### 1.3 Loading the GloVe embeddings and CoNLL data, and creating the data loaders

In [4]:
from util.data import load_glove_embeddings, load_conll_dataset
from util.dataset import create_data_loader
from util.adapters import bio_tag_dictionary

import numpy as np

In [5]:
# Load the GloVe embeddings
word2idx, embeddings, dimensions = load_glove_embeddings()

# Load the CoNLL-2003 splits (note the return order: train, test, validation)
train_data, test_data, val_data = load_conll_dataset()
# Build the tag dictionary from the training split
tag2label, label2tag = bio_tag_dictionary(train_data)

Loading GloVe embeddings from file.
Loaded 400002 words into vocabulary from GloVe.
Loading CoNLL data from data/conll2003/train.txt
Dataset loaded.
Loading CoNLL data from data/conll2003/test.txt
Dataset loaded.
Loading CoNLL data from data/conll2003/valid.txt
Dataset loaded.


In [6]:
assert dimensions == 100
assert word2idx["the"] == 2
np.testing.assert_allclose(
    embeddings[word2idx["the"]],
    np.array([-0.038194, -0.24487, 0.72812, -0.39961, 0.083172, 0.043953, -0.39141, 0.3344, -0.57545, 0.087459, 0.28787, -0.06731, 0.30906, -0.26384, -0.13231, -0.20757, 0.33395, -0.33848, -0.31743, -0.48336, 0.1464, -0.37304, 0.34577, 0.052041, 0.44946, -0.46971, 0.02628, -0.54155, -0.15518, -0.14107, -0.039722, 0.28277, 0.14393, 0.23464, -0.31021, 0.086173, 0.20397, 0.52624, 0.17164, -0.082378, -0.71787, -0.41531, 0.20335, -0.12763, 0.41367, 0.55187, 0.57908, -0.33477,
              -0.36559, -0.54857, -0.062892, 0.26584, 0.30205, 0.99775, -0.80481, -3.0243, 0.01254, -0.36942, 2.2167, 0.72201, -0.24978, 0.92136, 0.034514, 0.46745, 1.1079, -0.19358, -0.074575, 0.23353, -0.052062, -0.22044, 0.057162, -0.15806, -0.30798, -0.41625, 0.37972, 0.15006, -0.53212, -0.2055, -1.2526, 0.071624, 0.70565, 0.49744, -0.42063, 0.26148, -1.538, -0.30223, -0.073438, -0.28312, 0.37104, -0.25217, 0.016215, -0.017099, -0.38984, 0.87424, -0.72569, -0.51058, -0.52028, -0.1459, 0.8278, 0.27062]),
    rtol=1e-5,
    atol=1e-8
)
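
The assertions above say that `"the"`, the first word in the GloVe file, sits at index 2, and the log reports 400002 entries for a 400000-word GloVe vocabulary. Both facts are consistent with two reserved entries placed before the GloVe words. The sketch below shows one way such a loader could work; the token names `<pad>` and `<unk>`, the zero vectors used for them, and the function name `parse_glove` are assumptions for illustration, since the real `load_glove_embeddings` is not shown here.

```python
import io

PAD_TOKEN, UNK_TOKEN = "<pad>", "<unk>"  # hypothetical names for the two reserved entries


def parse_glove(lines):
    """Parse GloVe text lines ("word v1 v2 ...") into (word2idx, vectors, dim)."""
    word2idx = {PAD_TOKEN: 0, UNK_TOKEN: 1}
    vectors = [None, None]  # placeholders, filled once the dimension is known
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word2idx[parts[0]] = len(vectors)
        vectors.append([float(v) for v in parts[1:]])
    dim = len(vectors[2]) if len(vectors) > 2 else 0
    vectors[0] = [0.0] * dim  # padding vector (assumed to be zeros)
    vectors[1] = [0.0] * dim  # unknown-word vector (assumed to be zeros)
    return word2idx, vectors, dim


# Tiny in-memory stand-in for a GloVe file with 3-dimensional vectors.
demo_file = io.StringIO("the 0.1 0.2 0.3\n, 0.4 0.5 0.6\n")
demo_word2idx, demo_vectors, demo_dim = parse_glove(demo_file)
```

With this layout, the first word in the file always receives index 2, which matches the `word2idx["the"] == 2` check.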

In [7]:
train_data[0]

[['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.'],
 ['b-org', 'o', 'b-misc', 'o', 'o', 'o', 'b-misc', 'o', 'o']]
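
Each sample is a pair of parallel lists: lowercased tokens and lowercased BIO tags, where `b-` opens an entity, `i-` continues it, and `o` marks tokens outside any entity. The following sketch shows how a tag dictionary could be built from such data and how the tags group into entity spans. Both function names are hypothetical, and the real `bio_tag_dictionary` may order the labels differently or reserve an extra padding label.

```python
def build_tag_dictionary(dataset):
    """Map every tag seen in the dataset to an integer label, and back."""
    tags = sorted({tag for _, sentence_tags in dataset for tag in sentence_tags})
    tag2label = {tag: i for i, tag in enumerate(tags)}
    label2tag = {i: tag for tag, i in tag2label.items()}
    return tag2label, label2tag


def extract_entities(tokens, tags):
    """Group BIO tags into (entity_type, text) spans."""
    entities, current_type, current_tokens = [], None, []

    def flush():
        # Close the span in progress, if there is one.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("i-") and tag[2:] == current_type:
            current_tokens.append(token)          # continue the open span
        elif tag.startswith(("b-", "i-")):
            flush()                               # a new span begins
            current_type, current_tokens = tag[2:], [token]
        else:                                     # "o": outside any entity
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


demo_sample = [
    ["eu", "rejects", "german", "call", "to", "boycott", "british", "lamb", "."],
    ["b-org", "o", "b-misc", "o", "o", "o", "b-misc", "o", "o"],
]
demo_tag2label, demo_label2tag = build_tag_dictionary([demo_sample])
demo_entities = extract_entities(*demo_sample)
```

For the sample above, `demo_entities` contains one `org` span (`eu`) and two `misc` spans (`german`, `british`).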

In [None]:
BATCH_SIZE = 16
NUM_WORKERS = 0

# Create the data loaders for the train, validation, and test splits
train_data_loader = create_data_loader(
    train_data, word2idx, tag2label, embeddings, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS)
val_data_loader = create_data_loader(
    val_data, word2idx, tag2label, embeddings, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, 
    is_train=False)
test_data_loader = create_data_loader(
    test_data, word2idx, tag2label, embeddings, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, 
    is_train=False)
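
Sentences in a batch have different lengths, so a data loader has to encode tokens and tags as integers and pad them to a common length. The sketch below illustrates that step in plain Python. It is an assumption about what happens inside `create_data_loader`, which is not shown here; the function name, the reserved indices 0 and 1, and the padding label `-1` are all hypothetical.

```python
def encode_and_pad(batch, word2idx, tag2label, pad_idx=0, unk_idx=1, pad_label=-1):
    """Turn a list of (tokens, tags) pairs into padded id lists plus a mask."""
    max_len = max(len(tokens) for tokens, _ in batch)
    token_ids, labels, mask = [], [], []
    for tokens, tags in batch:
        n_pad = max_len - len(tokens)
        # Words missing from the vocabulary fall back to the unknown index.
        token_ids.append([word2idx.get(t, unk_idx) for t in tokens] + [pad_idx] * n_pad)
        labels.append([tag2label[t] for t in tags] + [pad_label] * n_pad)
        # 1 marks a real token, 0 marks padding to be ignored by the loss.
        mask.append([1] * len(tokens) + [0] * n_pad)
    return token_ids, labels, mask
```

The mask is what lets a sequence model such as a CRF ignore padded positions when computing the loss.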

## 2. Creating the Runner instance and running the training loop

In [9]:
from runner import Runner

In [10]:
# Defining constants here
LEARNING_RATE = 1e-5

In [11]:
runner = Runner(LEARNING_RATE, train_data_loader, val_data_loader,
                test_data_loader, dimensions, len(tag2label.keys()), 30, None, label2tag)
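
The internals of `Runner` are not shown in this notebook. As background for the CRF layer named in the introduction, a CRF typically picks the best tag sequence with Viterbi decoding over per-token emission scores (here assumed to come from the BiLSTM) and tag-to-tag transition scores. The following is a minimal pure-Python sketch of that algorithm under those assumptions; it omits start and end transitions and is not the project's implementation.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    scores = list(emissions[0])   # best score of any path ending in each tag
    backpointers = []             # best previous tag for each position and tag
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(num_tags):
            best_prev = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_prev] + transitions[best_prev][j] + emission[j])
            step_back.append(best_prev)
        scores = step_scores
        backpointers.append(step_back)

    # Walk the backpointers from the best final tag to recover the path.
    best_last = max(range(num_tags), key=lambda j: scores[j])
    path = [best_last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path, scores[best_last]
```

Transition scores are what let the model discourage invalid sequences, such as an `i-org` tag directly after `o`, which independent per-token predictions cannot rule out.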

### 2.1. Training Loop

In [12]:
runner.train()

Training progress:   0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 1/30, Training loss: 494.40, Validation loss: 408.99


libc++abi: terminating due to uncaught exception of type std::__1::system_error: Broken pipe
libc++abi: terminating due to uncaught exception of type std::__1::system_error: Broken pipe
libc++abi: terminating due to uncaught exception of type std::__1::system_error: Broken pipe
libc++abi: terminating due to uncaught exception of type std::__1::system_error: Broken pipe
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x11a67ea20>
Traceback (most recent call last):
  File "/Users/sriharikarthickn/Developer/Tools/miniconda3/envs/project/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1663, in __del__
    self._shutdown_workers()
  File "/Users/sriharikarthickn/Developer/Tools/miniconda3/envs/project/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1627, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/Users/sriharikarthickn/Developer/Tools/miniconda3/envs/project/lib/python3.13/multiprocessing/

KeyboardInterrupt: 

### 2.2. Testing on the test dataset

In [None]:
filename = "ENTER PATH TO FILE HERE"

In [None]:
runner = Runner(LEARNING_RATE, train_data_loader, val_data_loader,
                test_data_loader, dimensions, len(tag2label.keys()), 30, filename, label2tag)

In [None]:
runner.test()