Title: Named Entity Recognition

Description:
In this project, you will design and develop a custom Named Entity Recognition (NER) system for text analysis. NER identifies and classifies entities such as names, dates, and locations in unstructured text. The goal is an NER model that performs well on the provided dataset.

Dataset: Named Entity Recognition (NER) Corpus (kaggle.com)

https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus

In [1]:
!date

Mon Apr  7 15:11:48 UTC 2025


In [2]:
import warnings,gc,sys
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import tensorflow as tf
from IPython.display import display, HTML
import matplotlib.pyplot as plt
# from sklearn.model_selection import train_test_split, GridSearchCV
import ast
from tqdm import tqdm

2025-04-07 15:11:50.427157: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-04-07 15:11:50.436347: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-04-07 15:11:50.479703: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-04-07 15:11:50.520147: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1744038710.571760   35786 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744038710.58

In [3]:
# !pip install kagglehub

In [4]:
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
file_path = "ner.csv"

# Load the latest version
NER_df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "naseralqaydeh/named-entity-recognition-ner-corpus",
  file_path,
)

# print("First 5 records:", NER_df.head())

In [5]:
display(NER_df.head(2).T)

            0                                                   1
Sentence #  Sentence: 1                                         Sentence: 2
Sentence    Thousands of demonstrators have marched throug...   Families of soldiers killed in the conflict jo...
POS         ['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'...   ['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', ...
Tag         ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', '...   ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ...


Essential info about entities:

- geo = Geographical Entity
- org = Organization
- per = Person
- gpe = Geopolitical Entity
- tim = Time indicator
- art = Artifact
- eve = Event
- nat = Natural Phenomenon
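
In the `Tag` column these entity types appear with a `B-` (beginning) or `I-` (inside) prefix, and `O` marks tokens outside any entity. As a rough illustration of how such tags translate into entities, the sketch below groups consecutive tags into spans. The helper name `bio_to_spans` and the example tokens are made up for illustration and are not part of the notebook or the dataset.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if it is still open.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence, already split into tokens.
tokens = ["Families", "in", "Hyde", "Park", "met", "British", "officials"]
tags = ["O", "O", "B-geo", "I-geo", "O", "B-gpe", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Here the two-token span tagged `B-geo`, `I-geo` is merged into one geographical entity, while the single `B-gpe` token forms an entity on its own.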

### 1. Preprocess

In [6]:
print(NER_df.shape)

(47959, 4)


In [7]:
print(NER_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47959 entries, 0 to 47958
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Sentence #  47959 non-null  object
 1   Sentence    47959 non-null  object
 2   POS         47959 non-null  object
 3   Tag         47959 non-null  object
dtypes: object(4)
memory usage: 1.5+ MB
None


In [8]:
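# Drop rows with missing values (the info() output above shows there are none)
NER_df.dropna(inplace=True)
# Keep only the sentence text and its tag sequence
NER_df.drop(columns=["Sentence #", "POS"], inplace=True)
# Tags are stored as stringified lists; parse them into real Python lists
NER_df["Tag"] = NER_df["Tag"].apply(ast.literal_eval)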
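
The `Tag` column is read from CSV as text, so each cell is a string that merely looks like a list. A small sketch of what `ast.literal_eval` does to one such cell (the value of `raw_tag` is a shortened, made-up example):

```python
import ast

# How a Tag cell looks straight after loading: one long string.
raw_tag = "['O', 'O', 'B-geo']"

# literal_eval safely parses Python literals without executing arbitrary code.
parsed_tag = ast.literal_eval(raw_tag)

print(type(raw_tag).__name__, "->", type(parsed_tag).__name__)
print(parsed_tag)
```

After parsing, each row holds one tag per token, which is what the later tag tokenizer and padding steps expect.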

In [9]:
print(NER_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47959 entries, 0 to 47958
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Sentence  47959 non-null  object
 1   Tag       47959 non-null  object
dtypes: object(2)
memory usage: 749.5+ KB
None


In [10]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the data, then split that portion evenly into validation and test sets
X_train, X_test, y_train, y_test = train_test_split(
    NER_df["Sentence"], NER_df["Tag"], shuffle=True, test_size=0.20, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_test, y_test, shuffle=True, test_size=0.50, random_state=42)

In [17]:
from tensorflow.keras.preprocessing.text import Tokenizer
X_tokenizer = Tokenizer(lower=False,oov_token="UNK")
X_tokenizer.fit_on_texts(X_train)
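
One thing worth checking at this step: the Keras `Tokenizer` strips most punctuation through its default `filters` argument, while the corpus sentences appear to be space-separated tokens with punctuation as separate tokens (for example `However , poverty , illiteracy ,`). If the tags are one per whitespace token, the id sequences and tag sequences could end up with different lengths. The sketch below is a plain-Python alternative, not the notebook's method: it builds a vocabulary by whitespace splitting and checks alignment. The helper names `build_word_index` and `encode` and the toy sentences are assumptions for illustration.

```python
def build_word_index(sentences, oov_token="UNK"):
    """Map each whitespace-separated token to an id; 0 is left free for padding."""
    word_index = {oov_token: 1}
    for sentence in sentences:
        for token in sentence.split():
            word_index.setdefault(token, len(word_index) + 1)
    return word_index


def encode(sentence, word_index, oov_token="UNK"):
    """Turn a sentence into ids, one id per whitespace token."""
    return [word_index.get(tok, word_index[oov_token]) for tok in sentence.split()]


# Toy data: punctuation is its own token and carries its own tag.
sentences = ["However , poverty remains high .", "Poverty remains ."]
tags = [["O", "O", "O", "O", "O", "O"], ["O", "O", "O"]]

word_index = build_word_index(sentences)
encoded = [encode(s, word_index) for s in sentences]

# Every id sequence should have exactly as many entries as its tag sequence.
aligned = all(len(ids) == len(t) for ids, t in zip(encoded, tags))
print(encoded, aligned)
```

With the Keras `Tokenizer` itself, passing `filters=""` is one way to keep punctuation tokens. The same length comparison can be run on `train_inputs` and `y_train` to see whether it matters for this dataset.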

In [18]:
X_train.reset_index(drop=True)

0        The 58-year-old former analyst says he provide...
1        But he said he will not accept any Pakistani p...
2        The Swiss star was upset Wednesday by German T...
3        After taking office in 2004 , the SPENCER gove...
4        Israeli soldiers have killed two Palestinians ...
                               ...                        
38362    However , poverty , illiteracy , and unemploym...
38363    The report was published two days after human ...
38364    In a separate incident , kidnappers released a...
38365    Those groups were shut down by U.S. officials ...
38366    A statement on the Web site of the Health Depa...
Name: Sentence, Length: 38367, dtype: object

In [19]:
train_inputs = X_tokenizer.texts_to_sequences(X_train)
test_inputs = X_tokenizer.texts_to_sequences(X_test)
val_inputs = X_tokenizer.texts_to_sequences(X_val)

In [29]:
vocab_len = len(X_tokenizer.word_index)
print(f"Number of unique tokens:\t{vocab_len}")

Number of unique tokens:	28761


In [16]:
train_tags = set([val for sublist in y_train for val in sublist])
test_tags = set([val for sublist in y_test for val in sublist])
val_tags = set([val for sublist in y_val for val in sublist])

print("Unique NER tags in train set: ",train_tags)
print("Unique NER tags in test set: ",test_tags)
print("Unique NER tags in val set: ",val_tags)

Unique NER tags in train set:  {'I-tim', 'I-org', 'I-art', 'B-geo', 'B-gpe', 'I-geo', 'B-per', 'B-eve', 'B-art', 'B-tim', 'I-eve', 'I-per', 'B-nat', 'B-org', 'O', 'I-gpe', 'I-nat'}
Unique NER tags in test set:  {'I-tim', 'I-org', 'I-art', 'B-geo', 'B-gpe', 'I-geo', 'B-per', 'B-eve', 'B-art', 'B-tim', 'I-eve', 'I-per', 'B-nat', 'B-org', 'O', 'I-gpe', 'I-nat'}
Unique NER tags in val set:  {'I-tim', 'I-org', 'I-art', 'B-org', 'B-gpe', 'I-geo', 'B-per', 'B-eve', 'B-art', 'B-tim', 'I-eve', 'I-per', 'B-nat', 'B-geo', 'O', 'I-gpe', 'I-nat'}


In [21]:
y_tokenizer = Tokenizer(lower=False,oov_token="UNK")
y_tokenizer.fit_on_texts(y_train)

In [22]:
train_target = y_tokenizer.texts_to_sequences(y_train)
test_target = y_tokenizer.texts_to_sequences(y_test)
val_target = y_tokenizer.texts_to_sequences(y_val)

In [23]:
# Length of the longest tokenized sentence across all three splits
max_len = max(len(seq) for dataset in (train_inputs, test_inputs, val_inputs) for seq in dataset)

In [37]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_train_inputs = pad_sequences(train_inputs, maxlen=max_len, padding='post', value=0)
padded_test_inputs = pad_sequences(test_inputs, maxlen=max_len, padding='post', value=0)
padded_val_inputs = pad_sequences(val_inputs, maxlen=max_len, padding='post', value=0)

padded_train_target = pad_sequences(train_target, maxlen=max_len, padding='post', value=0)
padded_test_target = pad_sequences(test_target, maxlen=max_len, padding='post', value=0)
padded_val_target = pad_sequences(val_target, maxlen=max_len, padding='post', value=0)
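
To make the effect of `padding='post'` concrete, here is a small pure-Python sketch of the same idea. It is an illustration rather than the Keras implementation, and `pad_post` is a hypothetical helper name. It appends the padding value at the end of short sequences and, like the `pad_sequences` default `truncating='pre'`, drops items from the front of sequences longer than `maxlen`.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence to maxlen; overlong ones lose their first items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items (maxlen > 0 assumed)
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


example = pad_post([[5, 3], [7, 8, 9, 4]], maxlen=3)
print(example)
```

Because truncation drops items from the front, a target sequence longer than `max_len` would lose its first tags, so it can be worth confirming that no tag sequence exceeds `max_len`.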

In [38]:
for dataset in [padded_train_inputs,padded_test_inputs,padded_val_inputs,padded_train_target,padded_test_target,padded_val_target]:
    print(dataset.shape)

(38367, 89)
(4796, 89)
(4796, 89)
(38367, 89)
(4796, 89)
(4796, 89)


In [39]:
Number_of_classes_K = len(y_tokenizer.word_index) + 1

In [40]:
from keras.layers import Dense,Input,Bidirectional,LSTM,Embedding
from keras.models import Model
from keras.losses import SparseCategoricalCrossentropy
from keras.callbacks import EarlyStopping

In [41]:
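vector_size = 16  # embedding dimension

i = Input(shape=(max_len,))
# mask_zero=True marks index 0 (padding) as a masked timestep
x = Embedding(input_dim=vocab_len + 1, output_dim=vector_size, mask_zero=True)(i)
x = Bidirectional(LSTM(32, return_sequences=True))(x)
# One logit per tag class at every timestep (no softmax; the loss uses from_logits=True)
x = Dense(Number_of_classes_K)(x)

model = Model(i, x)
model.summary()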
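
The summary output is not shown here, so as a rough cross-check the parameter count of this architecture can be worked out by hand. The sketch below uses the standard formulas for embedding, LSTM (four gates, each with input weights, recurrent weights, and a bias), and dense layers. The class count of 19 is an assumption (17 tags, plus the `UNK` token, plus padding index 0), and the helper function names are hypothetical.

```python
def embedding_params(input_dim, output_dim):
    return input_dim * output_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    return input_dim * units + units


vocab_len = 28761      # from the tokenizer output above
vector_size = 16
lstm_units = 32
num_classes = 19       # assumption: 17 tags + "UNK" + padding index 0

emb = embedding_params(vocab_len + 1, vector_size)
bilstm = 2 * lstm_params(vector_size, lstm_units)   # forward + backward LSTM
dense = dense_params(2 * lstm_units, num_classes)   # BiLSTM concatenates both directions
total = emb + bilstm + dense
print(emb, bilstm, dense, total)
```

Under these assumptions nearly all parameters sit in the embedding table, which suggests the vocabulary size, not the recurrent layer, dominates the model size.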

In [None]:
model.compile(optimizer="adam",loss=SparseCategoricalCrossentropy(from_logits=True),metrics=["accuracy"])
model.fit(padded_train_inputs,
          padded_train_target,
          epochs=6,
          validation_data=(padded_val_inputs,padded_val_target))

Epoch 1/6
1199/1199 ━━━━━━━━━━━━━━━━━━━━ 64s 51ms/step - accuracy: 0.2113 - loss: 0.9437 - val_accuracy: 0.2123 - val_loss: 0.5026
Epoch 2/6
1199/1199 ━━━━━━━━━━━━━━━━━━━━ 63s 52ms/step - accuracy: 0.2155 - loss: 0.4495 - val_accuracy: 0.2182 - val_loss: 0.3777
Epoch 3/6
1199/1199 ━━━━━━━━━━━━━━━━━━━━ 66s 55ms/step - accuracy: 0.2203 - loss: 0.3374 - val_accuracy: 0.2201 - val_loss: 0.3379
Epoch 4/6
1199/1199 ━━━━━━━━━━━━━━━━━━━━ 68s 57ms/step - accuracy: 0.2234 - loss: 0.2859 - val_accuracy: 0.2215 - val_loss: 0.3179
Epoch 5/6
1199/1199 ━━━━━━━━━━━━━━━━━━━━ 64s 53ms/step - accuracy: 0.2106 - loss: 0.2502 - val_accuracy: 0.2070 - val_loss: 0.3114
Epoch 6/6
 428/1199 ━━━━━━━━━━━━━━━━━━━━ 43s 57ms/step - accuracy: 0.2134 - loss: 0.2230

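The training log shows accuracy staying near 0.21 even as the loss falls steadily. One possible explanation, which is an assumption and not verified here, is that padded positions are counted in the metric's denominator. A sketch of a token-level accuracy that skips padded positions is shown below. `masked_accuracy` is a hypothetical helper, and the toy label ids are made up.

```python
def masked_accuracy(y_true, y_pred, pad_id=0):
    """Fraction of correctly predicted tags, ignoring positions where y_true is padding."""
    correct = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == pad_id:
                continue  # padded position: not a real token
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0


# Toy example: 5 real tokens in total, 4 of them predicted correctly.
y_true = [[2, 2, 7, 0, 0], [2, 5, 0, 0, 0]]
y_pred = [[2, 2, 2, 7, 7], [2, 5, 7, 7, 7]]
acc = masked_accuracy(y_true, y_pred)
print(acc)
```

For the model above, `y_pred` could be obtained as `np.argmax(model.predict(padded_val_inputs), axis=-1)` and compared against `padded_val_target`.
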
In [43]:
sentence = """Is this the real life? Is this just fantasy? Caught in a landslide, no escape from reality"""
predictions = model.predict(pad_sequences(X_tokenizer.texts_to_sequences([sentence]),
                                          maxlen=max_len,
                                         padding="post"))
print(predictions)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 306ms/step
[[[ -5.304609    -4.743709     1.8781298  ...  -1.0697396   -1.0944893
    -2.5503826 ]
  [ -6.229029    -7.766292     8.0266     ...  -4.0541368   -2.6854007
    -4.7166605 ]
  [ -6.1748357  -10.034082    11.106297   ...  -5.695456    -2.8237836
    -6.305474  ]
  ...
  [ -0.05605015  -0.13771345   0.19956562 ...  -0.17713553  -0.14530168
    -0.12879112]
  [ -0.05605015  -0.13771345   0.19956562 ...  -0.17713553  -0.14530168
    -0.12879112]
  [ -0.05605015  -0.13771345   0.19956562 ...  -0.17713553  -0.14530168
    -0.12879112]]]


In [44]:
prediction_ner = np.argmax(predictions,axis=-1)
print(prediction_ner)

[[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
  7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
  7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7]]


In [46]:
# Index 0 is reserved for padding and has no entry in index_word, so fall back to "PAD"
NER_tags = [y_tokenizer.index_word.get(num, "PAD") for num in prediction_ner.flatten()]
print(NER_tags)

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per']
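
The decoded list has one label for each of the 89 padded positions: the first 17 are `O`, and the `B-per` labels after that belong to padding, not to words. A minimal sketch of decoding only the real tokens is shown below. `decode_tags` is a hypothetical helper, and the small `index_word` dict is a toy subset of the mapping seen in the output above.

```python
def decode_tags(pred_ids, n_tokens, index_word, pad_label="PAD"):
    """Decode the first n_tokens predictions; unknown ids (such as 0) map to pad_label."""
    return [index_word.get(int(i), pad_label) for i in pred_ids[:n_tokens]]


index_word = {1: "UNK", 2: "O", 7: "B-per"}   # toy subset of the tag index
pred_ids = [2, 2, 7, 7, 7, 0]                 # predictions for a padded sequence
decoded = decode_tags(pred_ids, n_tokens=3, index_word=index_word)
print(decoded)
```

In the notebook, `n_tokens` would be the length of the tokenized sentence before padding, for example `len(X_tokenizer.texts_to_sequences([sentence])[0])`.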


In [9]:
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    print("Name:", gpu.name, "  Type:", gpu.device_type)

2025-04-07 13:12:49.924812: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


In [10]:
from tensorflow.python.client import device_lib

device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 18137355800593050408
 xla_global_id: -1]

In [11]:
tf.test.is_gpu_available()

Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


False

In [12]:
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [13]:
x_train, x_test = x_train / 255.0, x_test / 255.0

In [14]:
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

In [15]:
predictions = model(x_train[:1]).numpy()
predictions

array([[ 0.7193052 , -0.47969562,  0.32077536,  0.70289576, -0.04301793,
        -0.67874795, -0.23602428,  0.31352717, -0.5399128 ,  0.35995692]],
      dtype=float32)

In [16]:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

In [17]:
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

In [18]:
tf.device('/device:GPU:0')

<tensorflow.python.eager.context._EagerDeviceContext at 0x7fdf7c3d5c80>

In [19]:
with tf.device('/device:GPU:0'):
    model.fit(x_train, y_train, epochs=500, batch_size=32)

Epoch 1/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 1ms/step - accuracy: 0.8551 - loss: 0.4949
Epoch 2/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9535 - loss: 0.1564
Epoch 3/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9668 - loss: 0.1071
Epoch 4/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9737 - loss: 0.0829
Epoch 5/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9762 - loss: 0.0744
Epoch 6/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9804 - loss: 0.0638
Epoch 7/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9813 - loss: 0.0562
Epoch 8/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9837 - loss: 0.0493
Epoch 9/500

KeyboardInterrupt: 

In [None]:
model.evaluate(x_test,  y_test, verbose=2)

313/313 - 1s - 4ms/step - accuracy: 0.9763 - loss: 0.0764


[0.07639684528112411, 0.9763000011444092]

In [None]:
predictions = model(x_test[:1]).numpy()

In [None]:
predictions