<a href="https://colab.research.google.com/github/aparnaashok2125/Elevvo-Pathways-NLP-Internship/blob/main/Elevvo_Pathways_Task_4_Named_Entity_Recognition_(NER)_from_News_Articles_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition (NER) on News Articles

This notebook applies Named Entity Recognition (NER) to the CoNLL-2003 dataset to identify people, organizations, and locations, combining a rule-based component with spaCy's pretrained models. It compares two spaCy models (`en_core_web_sm` and `en_core_web_lg`) on a sample sentence, visualizes their output with displaCy, and trains a BiLSTM tagger on the dataset's token-level tags as an additional model-based approach.

**Task Requirements**:
- Dataset: CoNLL-2003 (Kaggle)
- Identify named entities (PERSON, ORG, GPE)
- Use rule-based and model-based NER
- Highlight and categorize entities
- Tools: Python, spaCy, Pandas
- Bonus: Visualize with displaCy, compare two spaCy models

**Steps**:
1. Load and preprocess CoNLL-2003 dataset
2. Implement rule-based NER with spaCy
3. Apply model-based NER with two spaCy models
4. Train a BiLSTM model for NER
5. Visualize and compare results

In [None]:
!pip install spacy pandas tensorflow
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 63.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 400.7/400.7 MB 4.9 MB/s eta 0:00:00
Installing collected packages: 
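
The installer output advises restarting the runtime before loading the models. One quick way to confirm that both model packages are visible to the current kernel is to look them up by name. This is a minimal sketch: the helper name `missing_packages` is hypothetical, and it relies on spaCy models being installed as ordinary Python packages.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the names that cannot be found as top-level importable packages."""
    return [name for name in names if find_spec(name) is None]

# Package names taken from the install commands in this cell.
required = ["en_core_web_sm", "en_core_web_lg"]
missing = missing_packages(required)
if missing:
    print("Not importable yet (restart the runtime or reinstall):", missing)
else:
    print("Both spaCy models are importable.")
```
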

In [None]:
import spacy
from spacy import displacy
import pandas as pd
import numpy as np
from spacy.language import Language
from spacy.tokens import Span
from IPython.display import display, HTML
import os
from itertools import chain
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential, Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from tensorflow.keras.utils import plot_model
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from numpy.random import seed
import tensorflow

seed(1)
tensorflow.random.set_seed(2)

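from spacy.util import filter_spans

# Custom rule-based NER component: labels a company-suffix token plus the token before it as ORG
@Language.component("custom_rule_based_ner")
def custom_rule_based_ner(doc):
    org_suffixes = {'company', 'corporation', 'inc.', 'ltd.', 'group'}
    custom_ents = []
    for token in doc:
        # token.i > 0 guarantees there is a preceding token to include in the span
        if token.i > 0 and token.text.lower() in org_suffixes:
            custom_ents.append(Span(doc, token.i - 1, token.i + 1, label="ORG"))
    # doc.ents rejects overlapping spans, so resolve overlaps with the model's
    # entities first (filter_spans keeps the longest span in each overlap)
    doc.ents = filter_spans(list(doc.ents) + custom_ents)
    return doc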
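
The rule in this component labels a company-suffix token together with the token before it as an ORG span. The same idea can be sketched on a plain token list without spaCy, which makes the span arithmetic easy to check. `suffix_org_spans` is a hypothetical helper, the example tokens are illustrative, and the half-open `(start, end)` indices follow the convention of the `Span(doc, start, end)` call.

```python
ORG_SUFFIXES = {"company", "corporation", "inc.", "ltd.", "group"}

def suffix_org_spans(tokens):
    """Return (start, end, label) triples covering a suffix and the token before it."""
    spans = []
    for i, tok in enumerate(tokens):
        if i > 0 and tok.lower() in ORG_SUFFIXES:
            spans.append((i - 1, i + 1, "ORG"))
    return spans

tokens = ["Shares", "of", "Acme", "Corporation", "rose", "on", "Monday"]
for start, end, label in suffix_org_spans(tokens):
    print(" ".join(tokens[start:end]), label)  # Acme Corporation ORG
```

Because only one preceding token is captured, a multi-word name followed by a suffix would be labeled only partially; widening the span would need an extra rule.
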

In [None]:
from google.colab import files
import pandas as pd
import os

print("Upload the CoNLL-2003 dataset (ner_dataset.csv)")
uploaded = files.upload()

def load_conll_data(file_path=None):
    if file_path and os.path.exists(file_path):
        data = pd.read_csv(file_path, encoding='unicode_escape')
    else:
        print("No file uploaded, using sample data.")
        sample_text = "Apple Inc. announced a new product launch in New York on January 15, 2025. Tim Cook, the CEO, will present at the United Nations headquarters. The event will feature collaborations with Microsoft and Tesla Motors."
        words = sample_text.split()
        # Every column needs one value per word; mismatched lengths raise a ValueError
        data = pd.DataFrame({
            'Sentence #': ['Sentence: 1'] * len(words),
            'Word': words,
            'POS': ['NNP'] * len(words),  # placeholder POS tags
            'Tag': ['O'] * len(words)     # placeholder entity tags
        })

    # Forward-fill sentence IDs so every row carries its sentence label,
    # then group the rows into one record per sentence
    data['Sentence #'] = data['Sentence #'].ffill()
    data_group = data.groupby('Sentence #').agg({
        'Word': lambda x: ' '.join(str(word) for word in x if pd.notnull(word)),
        'Tag': list,
        'POS': list
    }).reset_index()

    return data, data_group

# Load data
file_path = list(uploaded.keys())[0] if uploaded else None
data, data_group = load_conll_data(file_path)

print("\nRaw Data Preview:")
display(data.head())

print("\nGrouped Data Preview:")
display(data_group.head())


Upload the CoNLL-2003 dataset (ner_dataset.csv)


Saving ner_dataset.csv to ner_dataset (1).csv

Raw Data Preview:


| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | Sentence: 1 | of | IN | O |
| 2 | Sentence: 1 | demonstrators | NNS | O |
| 3 | Sentence: 1 | have | VBP | O |
| 4 | Sentence: 1 | marched | VBN | O |



Grouped Data Preview:


| | Sentence # | Word | Tag | POS |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands of demonstrators have marched throug... | [O, O, O, O, O, O, B-geo, O, O, O, O, O, B-geo... | [NNS, IN, NNS, VBP, VBN, IN, NNP, TO, VB, DT, ... |
| 1 | Sentence: 10 | Iranian officials say they expect to get acces... | [B-gpe, O, O, O, O, O, O, O, O, O, O, O, O, O,... | [JJ, NNS, VBP, PRP, VBP, TO, VB, NN, TO, JJ, J... |
| 2 | Sentence: 100 | Helicopter gunships Saturday pounded militant ... | [O, O, B-tim, O, O, O, O, O, B-geo, O, O, O, O... | [NN, NNS, NNP, VBD, JJ, NNS, IN, DT, NNP, JJ, ... |
| 3 | Sentence: 1000 | They left after a tense hour-long standoff wit... | [O, O, O, O, O, O, O, O, O, O, O] | [PRP, VBD, IN, DT, NN, JJ, NN, IN, NN, NNS, .] |
| 4 | Sentence: 10000 | U.N. relief coordinator Jan Egeland said Sunda... | [B-geo, O, O, B-per, I-per, O, B-tim, O, B-geo... | [NNP, NN, NN, NNP, NNP, VBD, NNP, ,, NNP, ,, J... |
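
The `Tag` column in the grouped preview uses a BIO-style scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. A minimal sketch of collapsing such tags into entity spans follows. The function name `bio_to_entities` is an assumption, and the example uses the first six tokens and tags visible in the `Sentence: 10000` row.

```python
def bio_to_entities(words, tags):
    """Collapse BIO tags into (entity_text, entity_type) pairs."""
    entities, current_words, current_type = [], [], None
    for word, tag in zip(words, tags):
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and current_type == etype
        # Close the open entity unless this token continues it.
        if current_words and not continues:
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if prefix == "B" or (prefix == "I" and not continues):
            # A stray I- tag with no matching B- is treated as a new entity.
            current_words, current_type = [word], etype
        elif continues:
            current_words.append(word)
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

words = ["U.N.", "relief", "coordinator", "Jan", "Egeland", "said"]
tags = ["B-geo", "O", "O", "B-per", "I-per", "O"]
print(bio_to_entities(words, tags))  # [('U.N.', 'geo'), ('Jan Egeland', 'per')]
```
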


In [None]:
# Prepare data for BiLSTM model
def get_dict_map(data, token_or_tag):
    tok2idx = {}
    idx2tok = {}
    if token_or_tag == 'token':
        vocab = list(set(data['Word'].to_list()))
    else:
        vocab = list(set(data['Tag'].to_list()))
    idx2tok = {idx: tok for idx, tok in enumerate(vocab)}
    tok2idx = {tok: idx for idx, tok in enumerate(vocab)}
    return tok2idx, idx2tok

token2idx, idx2token = get_dict_map(data, 'token')
tag2idx, idx2tag = get_dict_map(data, 'tag')

data['Word_idx'] = data['Word'].map(token2idx)
data['Tag_idx'] = data['Tag'].map(tag2idx)
# DataFrame.ffill() replaces the deprecated fillna(method='ffill')
data_fillna = data.ffill()
data_group_bilstm = data_fillna.groupby(['Sentence #'], as_index=False)[['Word', 'POS', 'Tag', 'Word_idx', 'Tag_idx']].agg(lambda x: list(x))

def get_pad_train_test_val(data_group, data):
    n_token = len(list(set(data['Word'].to_list())))
    n_tag = len(list(set(data['Tag'].to_list())))
    tokens = data_group['Word_idx'].tolist()
    maxlen = max([len(s) for s in tokens])
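    # Pad with n_token, an index no real word uses (word indices run 0..n_token-1);
    # the Embedding layer is sized n_token + 1, so this index is valid
    pad_tokens = pad_sequences(tokens, maxlen=maxlen, dtype='int32', padding='post', value=n_token)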
    tags = data_group['Tag_idx'].tolist()
    pad_tags = pad_sequences(tags, maxlen=maxlen, dtype='int32', padding='post', value=tag2idx["O"])
    pad_tags = [to_categorical(i, num_classes=n_tag) for i in pad_tags]
    tokens_, test_tokens, tags_, test_tags = train_test_split(pad_tokens, pad_tags, test_size=0.1, train_size=0.9, random_state=2020)
    train_tokens, val_tokens, train_tags, val_tags = train_test_split(tokens_, tags_, test_size=0.25, train_size=0.75, random_state=2020)
    print(
        'train_tokens length:', len(train_tokens),
        '\ntest_tokens length:', len(test_tokens),
        '\nval_tokens:', len(val_tokens)
    )
    return train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags

train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags = get_pad_train_test_val(data_group_bilstm, data)



train_tokens length: 32372 
test_tokens length: 4796 
val_tokens: 10791
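
The two chained `train_test_split` calls first hold out 10% of the sentences for testing and then take 25% of the remainder for validation, so the effective proportions are roughly 67.5% train, 22.5% validation, and 10% test. The sketch below reproduces the printed sizes under the assumption that scikit-learn rounds the test share up and the train share down (my understanding of its behaviour, not something the notebook states). The total is simply the sum of the three printed lengths, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_frac):
    """Return (n_train, n_test), rounding the test share up and the train share down."""
    n_test = math.ceil(test_frac * n_samples)
    n_train = math.floor((1 - test_frac) * n_samples)
    return n_train, n_test

n_total = 32372 + 4796 + 10791          # sum of the printed split sizes
n_rest, n_test = split_sizes(n_total, 0.10)
n_train, n_val = split_sizes(n_rest, 0.25)
print(n_train, n_val, n_test)  # 32372 10791 4796
```
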


In [None]:
# Define and train BiLSTM model
input_dim = len(list(set(data['Word'].to_list())))+1
output_dim = 64
input_length = max([len(s) for s in data_group_bilstm['Word_idx'].tolist()])
n_tags = len(tag2idx)

def get_bilstm_lstm_model():
    model = Sequential()
    model.add(Embedding(input_dim=input_dim, output_dim=output_dim, input_length=input_length))
    model.add(Bidirectional(LSTM(units=output_dim, return_sequences=True, dropout=0.2, recurrent_dropout=0.2), merge_mode='concat'))
    model.add(LSTM(units=output_dim, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))
    # softmax produces a probability distribution over tags, which categorical_crossentropy expects
    model.add(TimeDistributed(Dense(n_tags, activation="softmax")))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

def train_model(X, y, model, epochs=5):
    # One fit call over `epochs` epochs replaces the loop of one-epoch calls;
    # the epoch count is kept small for faster execution
    hist = model.fit(X, np.array(y), batch_size=1000, verbose=1, epochs=epochs, validation_split=0.2)
    return hist.history['loss']  # one training-loss value per epoch

model_bilstm_lstm = get_bilstm_lstm_model()
_ = model_bilstm_lstm(train_tokens[:1])
plot_model(model_bilstm_lstm, show_shapes=True)
results = pd.DataFrame()
results['bilstm_loss'] = train_model(train_tokens, train_tags, model_bilstm_lstm)



26/26 ━━━━━━━━━━━━━━━━━━━━ 155s 5s/step - accuracy: 0.8129 - loss: 2.3360 - val_accuracy: 0.9681 - val_loss: 0.3514
26/26 ━━━━━━━━━━━━━━━━━━━━ 131s 5s/step - accuracy: 0.9676 - loss: 0.3318 - val_accuracy: 0.9681 - val_loss: 0.2357
26/26 ━━━━━━━━━━━━━━━━━━━━ 131s 5s/step - accuracy: 0.9676 - loss: 0.2704 - val_accuracy: 0.9681 - val_loss: 0.2205
26/26 ━━━━━━━━━━━━━━━━━━━━ 126s 5s/step - accuracy: 0.9676 - loss: 0.2539 - val_accuracy: 0.9682 - val_loss: 0.1993
26/26 ━━━━━━━━━━━━━━━━━━━━ 133s 5s/step - accuracy: 0.9677 - loss: 0.2320 - val_accuracy: 0.9682 - val_loss: 0.1753
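
The accuracy reported in this log is computed over every position of every padded sequence, so padding positions (which the preparation code labels `O`) and genuine `O` tokens can dominate the score. A more informative number scores only real tokens, or only tokens whose gold tag is an entity. The sketch below uses toy values: `PAD`, the tag sequences, and the helper name are illustrative assumptions, not outputs of the model, and the toy data marks padding explicitly so that it can be masked out.

```python
def token_accuracy(gold, pred, keep):
    """Accuracy over positions where keep(gold_tag) is true; None if no position qualifies."""
    pairs = [(g, p) for g, p in zip(gold, pred) if keep(g)]
    if not pairs:
        return None
    return sum(g == p for g, p in pairs) / len(pairs)

PAD = "<pad>"
gold = ["B-per", "I-per", "O", "B-geo", PAD, PAD, PAD, PAD]
pred = ["O", "O", "O", "O", PAD, PAD, PAD, PAD]  # a tagger that never predicts an entity

print(token_accuracy(gold, pred, lambda g: True))                 # 0.625
print(token_accuracy(gold, pred, lambda g: g != PAD))             # 0.25
print(token_accuracy(gold, pred, lambda g: g not in (PAD, "O")))  # 0.0
```

In this toy case the unmasked score looks respectable even though no entity token is tagged correctly, which is why entity-focused metrics are usually reported alongside plain accuracy.
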


In [None]:
# SpaCy NER with two models
def perform_spacy_ner(text, model_name):
    nlp = spacy.load(model_name)
    if model_name == "en_core_web_sm":
        nlp.add_pipe("custom_rule_based_ner", after="ner")
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    html = displacy.render(doc, style="ent", jupyter=False)
    return entities, html

# Process a sample sentence from CoNLL-2003
sample_text = data_group['Word'].iloc[0]
models = ["en_core_web_sm", "en_core_web_lg"]
spacy_results = {}

for model in models:
    print(f"\nProcessing with {model}...")
    entities, html_viz = perform_spacy_ner(sample_text, model)
    spacy_results[model] = {'entities': entities, 'visualization': html_viz}

    print(f"\nEntities found by {model}:")
    entities_df = pd.DataFrame(entities, columns=['Entity', 'Type'])
    display(entities_df)

    print(f"\nVisualization for {model}:")
    display(HTML(html_viz))

    output_path = f"/content/ner_viz_{model}.html"
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(html_viz)
    print(f"Visualization saved as {output_path}")
    files.download(output_path)


Processing with en_core_web_sm...

Entities found by en_core_web_sm:


| | Entity | Type |
|---|---|---|
| 0 | Thousands | CARDINAL |
| 1 | London | GPE |
| 2 | Iraq | GPE |
| 3 | British | NORP |



Visualization for en_core_web_sm:


Visualization saved as /content/ner_viz_en_core_web_sm.html




Processing with en_core_web_lg...

Entities found by en_core_web_lg:


| | Entity | Type |
|---|---|---|
| 0 | Thousands | CARDINAL |
| 1 | London | GPE |
| 2 | Iraq | GPE |
| 3 | British | NORP |



Visualization for en_core_web_lg:


Visualization saved as /content/ner_viz_en_core_web_lg.html



In [None]:
# Compare SpaCy models
print("\nComparison of SpaCy models:")
for entity_type in ['PERSON', 'ORG', 'GPE']:
    print(f"\n{entity_type} entities:")
    for model in models:
        entities = [e[0] for e in spacy_results[model]['entities'] if e[1] == entity_type]
        print(f"{model}: {entities}")


Comparison of SpaCy models:

PERSON entities:
en_core_web_sm: []
en_core_web_lg: []

ORG entities:
en_core_web_sm: []
en_core_web_lg: []

GPE entities:
en_core_web_sm: ['London', 'Iraq']
en_core_web_lg: ['London', 'Iraq']
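
With the entities stored as `(text, label)` pairs, agreement between the two models can also be summarised with set operations across all labels, not only PERSON, ORG, and GPE. The sketch below hard-codes the pairs that both models returned for the sample sentence; in the notebook they would come from `spacy_results[model]['entities']`. The helper name `compare_entities` is hypothetical.

```python
def compare_entities(ents_a, ents_b):
    """Return (shared, only_in_a, only_in_b) as sorted lists of (text, label) pairs."""
    a, b = set(ents_a), set(ents_b)
    return sorted(a & b), sorted(a - b), sorted(b - a)

sm = [("Thousands", "CARDINAL"), ("London", "GPE"), ("Iraq", "GPE"), ("British", "NORP")]
lg = [("Thousands", "CARDINAL"), ("London", "GPE"), ("Iraq", "GPE"), ("British", "NORP")]

shared, only_sm, only_lg = compare_entities(sm, lg)
print(len(shared), only_sm, only_lg)  # 4 [] []
```

Sets ignore repeated mentions and token positions, so this is a coarse check; comparing character offsets would be stricter.
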



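* Neither model detected a PERSON or ORG entity in this sentence, and the rule-based component added to `en_core_web_sm` contributed no extra ORG spans.
* Both models labeled London and Iraq as GPE.
* The two models returned identical entities overall (including `Thousands` as CARDINAL and `British` as NORP), so this single sentence does not distinguish them.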
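
A single sentence gives only an anecdotal comparison. To score predictions against the dataset's annotations, one option is entity-level precision and recall over `(text, type)` pairs. The sketch below is a rough illustration under loud assumptions: the mapping from spaCy labels to the dataset's tag types is my own and lossy, and the gold and predicted entities are invented for the example rather than taken from the dataset or from either model.

```python
# Assumed, lossy mapping from spaCy labels to the dataset's entity types.
SPACY_TO_DATASET = {"PERSON": "per", "ORG": "org", "GPE": "geo", "LOC": "geo"}

def precision_recall(gold, predicted):
    """Entity-level precision and recall over sets of (text, type) pairs."""
    gold_set = set(gold)
    # Predictions with labels outside the mapping are dropped before scoring.
    pred_set = {(text, SPACY_TO_DATASET[label])
                for text, label in predicted if label in SPACY_TO_DATASET}
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    return precision, recall

# Illustrative gold annotations and predictions, not taken from the dataset.
gold = [("Paris", "geo"), ("Marie Curie", "per"), ("Sorbonne", "org")]
predicted = [("Paris", "GPE"), ("Marie Curie", "PERSON"), ("Monday", "DATE")]
print(precision_recall(gold, predicted))  # (1.0, 0.6666666666666666)
```

Dropping unmapped labels keeps the comparison to the shared entity types, at the cost of ignoring whatever the model predicts outside them.
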


