# Named Entity Recognition using LSTM

In this lab, we will look at building a NER model. 

Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP) that focuses on identifying and classifying entities within a text into predefined categories. These entities include names of persons, organizations, locations, dates, numerical expressions, products, and other significant terms.

## How NER Works:

- Tokenization: Splitting the text into smaller units (tokens) such as words or phrases.
- Entity Detection: Identifying segments of text that potentially represent entities.
- Entity Classification: Assigning each identified segment to a category, e.g., “John Smith” → Person, “New York” → Location.

## Applications of NER:

- Information Extraction: Pulling key information from documents or web pages.
- Search Optimization: Improving search engines by tagging entities for more relevant results.
- Customer Support Automation: Recognizing names, locations, or products to automate queries.
- Healthcare: Extracting disease names, medications, and patient data from medical records.
- Finance: Identifying companies, dates, and financial figures in reports.

**Example**:
For the sentence:
**"Apple Inc. was founded in Cupertino by Steve Jobs in 1976."**
NER identifies:

Organization: Apple Inc.

Location: Cupertino


Person: Steve Jobs

Date: 1976

In [35]:
pip install seqeval

Note: you may need to restart the kernel to use updated packages.


## Imports and libraries

In [36]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Embedding, LSTM, Bidirectional, TimeDistributed, Dropout, Input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from seqeval.metrics import classification_report as seqeval_report

## Reading data

In [37]:
# Load Dataset
data = pd.read_csv("ner_dataset.csv", encoding="latin1").fillna(method="ffill")

SentenceGetter is designed to aggregate words and their tags into sentences from a structured pandas DataFrame. It's typically used in Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER), where words and their corresponding labels (tags) need to be grouped by sentence.

It produces output like so:
[
    [('John', 'PERSON'), ('lives', 'O'), ('in', 'O'), ('Paris', 'LOCATION')],
    [('He', 'O'), ('works', 'O'), ('at', 'O'), ('Google', 'ORGANIZATION')]
]




In [38]:
# Sentence Aggregator
class SentenceGetter:
    def __init__(self, data):
        self.sentences = self.aggregate_sentences(data)

    def aggregate_sentences(self, data):
        agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values, s["Tag"].values)]
        sentences = data.groupby("Sentence #").apply(agg_func).tolist()
        return [s for s in sentences if len(s) > 0]  # Exclude empty sentences

In [39]:
getter = SentenceGetter(data)
sentences = getter.sentences

In [40]:
## Printing first sentence and corresponding tags
print(sentences[0])

[('Thousands', 'O'), ('of', 'O'), ('demonstrators', 'O'), ('have', 'O'), ('marched', 'O'), ('through', 'O'), ('London', 'B-geo'), ('to', 'O'), ('protest', 'O'), ('the', 'O'), ('war', 'O'), ('in', 'O'), ('Iraq', 'B-geo'), ('and', 'O'), ('demand', 'O'), ('the', 'O'), ('withdrawal', 'O'), ('of', 'O'), ('British', 'B-gpe'), ('troops', 'O'), ('from', 'O'), ('that', 'O'), ('country', 'O'), ('.', 'O')]


Below code prepares vocabulary and tag mappings for a Named Entity Recognition (NER) task. It creates three main mappings to convert words and tags into numerical indices and vice versa, which is essential for training machine learning models.

Extract Unique Words and Tags:

Collects all unique words and tags from the data.
Adds a special padding token "PAD" to handle sentence length variations.
Generate Mappings:

word2idx: Maps each unique word to a unique index.

tag2idx: Maps each unique tag (e.g., PERSON, LOCATION) to a unique index.

idx2tag: A reverse mapping to retrieve the tags from their indices.

Example:

words = ['John', 'lives', 'in', 'Paris', 'PAD']
tags = ['O', 'PERSON', 'LOCATION']

End result is:
{0: 'O', 1: 'PERSON', 2: 'LOCATION'}



In [41]:
# Vocabulary and Tag Mappings
words = list(set(data["Word"].values))
words.append("PAD")
tags = list(set(data["Tag"].values))
word2idx = {w: i for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}
idx2tag = {i: t for t, i in tag2idx.items()}

## Modeling and Training

In [42]:
# Hyperparameters
MAX_LEN = 50
EMBEDDING_DIM = 100
LSTM_UNITS = 64
BATCH_SIZE = 32
EPOCHS = 1

This code prepares the input sequences (X) and the output label sequences (y) for a Named Entity Recognition (NER) task. It transforms textual data into numerical form, pads the sequences to a uniform length, and converts the labels into a one-hot encoded format to make them suitable for training a machine learning model.

In [43]:
# Prepare Sequences
X = [[word2idx.get(w[0], word2idx["PAD"]) for w in s] for s in sentences]
y = [[tag2idx.get(w[1], tag2idx["O"]) for w in s] for s in sentences]
X = pad_sequences(X, maxlen=MAX_LEN, padding="post")
y = pad_sequences(y, maxlen=MAX_LEN, padding="post", value=tag2idx["O"])

# Convert Labels to One-Hot Encoding
y = tf.keras.utils.to_categorical(y, num_classes=len(tags))

## Train-test split

In [44]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Handle Class Imbalance
flat_y_train = np.argmax(y_train, axis=-1).flatten()
class_weights = compute_class_weight('balanced', classes=np.unique(flat_y_train), y=flat_y_train)
class_weights_dict = {i: class_weights[i] for i in range(len(class_weights))}

In [45]:


# Define Model
input_layer = Input(shape=(MAX_LEN,))
embedding = Embedding(input_dim=len(words), output_dim=EMBEDDING_DIM, input_length=MAX_LEN)(input_layer)
dropout1 = Dropout(0.3)(embedding)
lstm1 = Bidirectional(LSTM(LSTM_UNITS, return_sequences=True, dropout=0.3, recurrent_dropout=0.3))(dropout1)
dropout2 = Dropout(0.3)(lstm1)
output = TimeDistributed(Dense(len(tags), activation="softmax"))(dropout2)

model = Model(input_layer, output)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()

# Train Model
history = model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_split=0.1, verbose=1)


# Predict on Test Data
y_pred = model.predict(X_test)
y_pred_tags = np.argmax(y_pred, axis=-1)
y_true_tags = np.argmax(y_test, axis=-1)

# Flatten Predictions and Ground Truth for Evaluation
y_pred_flat = [[idx2tag[i] for i in row] for row in y_pred_tags]
y_true_flat = [[idx2tag[i] for i in row] for row in y_true_tags]


Model: "model_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_8 (InputLayer)        [(None, 50)]              0         
                                                                 
 embedding_7 (Embedding)     (None, 50, 100)           3517900   
                                                                 
 dropout_14 (Dropout)        (None, 50, 100)           0         
                                                                 
 bidirectional_7 (Bidirecti  (None, 50, 128)           84480     
 onal)                                                           
                                                                 
 dropout_15 (Dropout)        (None, 50, 128)           0         
                                                                 
 time_distributed_7 (TimeDi  (None, 50, 17)            2193      
 stributed)                                                

## Evaluation

In [46]:

# Show Sample Predictions
print("\nSample Predictions vs Actual Tags:\n")
for i in range(5):  # Show 5 sample sentences
    words_example = [words[idx] for idx in X_test[i] if idx != word2idx["PAD"]]
    true_tags_example = [tag for tag in y_true_flat[i] if tag != "O"]
    pred_tags_example = [tag for tag in y_pred_flat[i] if tag != "O"]
    
    print("Sentence: ", " ".join(words_example))
    print("Actual Tags: ", " ".join(true_tags_example))
    print("Predicted Tags: ", " ".join(pred_tags_example))
    print("-" * 60)



Sample Predictions vs Actual Tags:

Sentence:  The report calls on President Bush and Congress to urge Chinese officials not to use the global war against terrorism as a pretext to suppress minorities ' rights . Moya Moya Moya Moya Moya Moya Moya Moya Moya Moya Moya Moya Moya Moya Moya Moya Moya Moya Moya Moya Moya
Actual Tags:  B-per I-per B-org B-gpe
Predicted Tags:  B-per I-per B-org B-gpe
------------------------------------------------------------
Sentence:  The construction on the Baku-T'bilisi-Ceyhan oil pipeline , the Baku-T'bilisi-Erzerum gas pipeline , and the Kars-Akhalkalaki Railroad are part of a strategy to capitalize on Georgia 's strategic location between Europe and Asia and develop its role as a transit point for gas , oil and other goods . Moya
Actual Tags:  B-org I-org B-geo B-geo B-geo
Predicted Tags:  I-org B-geo B-geo B-geo
------------------------------------------------------------
Sentence:  The pact was initially approved after discussions between President 