Title: Named Entity Recognition

Description:
In this project, you will design and build a custom Named Entity Recognition (NER) system for text analysis. NER identifies and classifies entities such as person names, dates, and locations in unstructured text. The goal is a versatile NER solution that performs well on the provided dataset.

Dataset: Named Entity Recognition (NER) Corpus (kaggle.com)

https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus

In [1]:
!date

Sun Apr  6 02:52:11 UTC 2025


In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
from IPython.display import display, HTML
import matplotlib.pyplot as plt
# from sklearn.model_selection import train_test_split, GridSearchCV
import gc
import sys
import warnings
warnings.filterwarnings('ignore')
# from sklearn.preprocessing import StandardScaler
# from sklearn.svm import LinearSVC
# from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay, classification_report
# from sklearn.pipeline import Pipeline
from tqdm import tqdm

ModuleNotFoundError: No module named 'tensorflow'

In [None]:
!pip install kagglehub



In [4]:
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
file_path = "ner.csv"

# Load the latest version
NER_df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "naseralqaydeh/named-entity-recognition-ner-corpus",
    file_path,
)

# print("First 5 records:", NER_df.head())

In [5]:
# NER_df = pd.read_csv("ner.csv")
display(NER_df.head(2).T)

| | 0 | 1 |
|---|---|---|
| Sentence # | Sentence: 1 | Sentence: 2 |
| Sentence | Thousands of demonstrators have marched throug... | Families of soldiers killed in the conflict jo... |
| POS | ['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'... | ['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', ... |
| Tag | ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', '... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |


Essential info about entities:

- geo = Geographical Entity
- org = Organization
- per = Person
- gpe = Geopolitical Entity
- tim = Time indicator
- art = Artifact
- eve = Event
- nat = Natural Phenomenon
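
To make the tag scheme concrete, the sketch below groups tagged tokens into entity spans. It assumes the `Tag` column stores each tag list as a string (as the preview above suggests), that tags follow the usual `B-`/`I-`/`O` convention, and that splitting the sentence on whitespace yields one token per tag. The helper `extract_entities` and the example sentence are hypothetical and not part of the dataset.

```python
import ast

# Tag suffixes and their meanings, as listed above.
ENTITY_NAMES = {
    "geo": "Geographical Entity",
    "org": "Organization",
    "per": "Person",
    "gpe": "Geopolitical Entity",
    "tim": "Time indicator",
    "art": "Artifact",
    "eve": "Event",
    "nat": "Natural Phenomenon",
}


def extract_entities(sentence, tag_string):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    tokens = sentence.split()            # assumption: whitespace tokens align with tags
    tags = ast.literal_eval(tag_string)  # assumption: the tag list is stored as a string
    if len(tokens) != len(tags):
        raise ValueError("token/tag length mismatch")

    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity.
            current.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


# Toy example, not taken from the dataset.
sentence = "Alice Smith visited Paris on Monday"
tags = "['B-per', 'I-per', 'O', 'B-geo', 'O', 'B-tim']"
entities = extract_entities(sentence, tags)
for text, entity_type in entities:
    print(f"{text}: {ENTITY_NAMES[entity_type]}")
```

In the notebook, the same helper could be applied row by row to `NER_df["Sentence"]` and `NER_df["Tag"]`, provided the alignment assumption holds for the row.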

### 1. Preprocess

In [6]:
print(NER_df.shape)

(47959, 4)


In [7]:
print(NER_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47959 entries, 0 to 47958
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Sentence #  47959 non-null  object
 1   Sentence    47959 non-null  object
 2   POS         47959 non-null  object
 3   Tag         47959 non-null  object
dtypes: object(4)
memory usage: 1.5+ MB
None
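
All four columns have dtype `object`, so the `POS` and `Tag` values are most likely lists serialized as strings rather than real Python lists. A minimal sketch of one possible next preprocessing step follows: parse the strings, build a tag-to-index vocabulary, and pad the encoded sequences to a common length. The `raw_tags` values are toy stand-ins for `NER_df["Tag"]`, and the `<PAD>` label and index layout are assumptions made for illustration.

```python
import ast

# Toy stand-ins for values of NER_df["Tag"] (not real dataset rows).
raw_tags = [
    "['O', 'B-geo', 'O']",
    "['B-per', 'I-per', 'O', 'O', 'B-tim']",
]

# Parse each serialized list into a real Python list of tag strings.
tag_lists = [ast.literal_eval(s) for s in raw_tags]

# Build the vocabulary in order of first appearance; index 0 is reserved for padding.
PAD = "<PAD>"  # hypothetical padding label
tag2idx = {PAD: 0}
for tag_list in tag_lists:
    for tag in tag_list:
        tag2idx.setdefault(tag, len(tag2idx))

# Encode every sequence and right-pad it to the longest sequence length.
max_len = max(len(tag_list) for tag_list in tag_lists)
encoded = [
    [tag2idx[tag] for tag in tag_list] + [tag2idx[PAD]] * (max_len - len(tag_list))
    for tag_list in tag_lists
]

print(tag2idx)
print(encoded)
```

Reserving index 0 for padding is a common convention because it lets a model mask padded positions, but any consistent choice would work.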


In [8]:
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    print("Name:", gpu.name, "  Type:", gpu.device_type)

Name: /physical_device:GPU:0   Type: GPU


In [10]:
from tensorflow.python.client import device_lib

device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 4942869101077391933
 xla_global_id: -1,
 name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 16266690560
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 14831051814124344203
 physical_device_desc: "device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0"
 xla_global_id: 416903419]

In [11]:
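# tf.test.is_gpu_available() is deprecated; check the physical device list instead.
bool(tf.config.list_physical_devices('GPU'))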

True

In [12]:
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [13]:
x_train, x_test = x_train / 255.0, x_test / 255.0

In [14]:
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

In [15]:
predictions = model(x_train[:1]).numpy()
predictions

array([[ 0.7193052 , -0.47969562,  0.32077536,  0.70289576, -0.04301793,
        -0.67874795, -0.23602428,  0.31352717, -0.5399128 ,  0.35995692]],
      dtype=float32)

In [16]:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

In [17]:
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

In [18]:
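# Note: outside a `with` block this only creates a context manager; it does not change device placement.
tf.device('/device:GPU:0')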

<tensorflow.python.eager.context._EagerDeviceContext at 0x7fdf7c3d5c80>

In [19]:
with tf.device('/device:GPU:0'):
    model.fit(x_train, y_train, epochs=500, batch_size=32)

Epoch 1/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 1ms/step - accuracy: 0.8551 - loss: 0.4949
Epoch 2/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9535 - loss: 0.1564
Epoch 3/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9668 - loss: 0.1071
Epoch 4/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9737 - loss: 0.0829
Epoch 5/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9762 - loss: 0.0744
Epoch 6/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9804 - loss: 0.0638
Epoch 7/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9813 - loss: 0.0562
Epoch 8/500
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9837 - loss: 0.0493
Epoch 9/500

KeyboardInterrupt: 

In [None]:
model.evaluate(x_test,  y_test, verbose=2)

313/313 - 1s - 4ms/step - accuracy: 0.9763 - loss: 0.0764


[0.07639684528112411, 0.9763000011444092]

In [None]:
predictions = model(x_test[:1]).numpy()

In [None]:
predictions