# NLP-Empowered-Named-Entity-Recognition
EntitySense uses deep learning and semantic analysis to automatically identify and categorize entities in text data. Entity recognition of this kind supports applications such as information extraction and sentiment analysis across different domains.

## Problem Statement
Twitter is a microblogging and social networking service on which users post and interact with messages known as "tweets". On average, around 6,000 tweets are posted every second, which corresponds to over 350,000 tweets per minute and about 500 million tweets per day.

Twitter wants to tag and analyze tweets automatically to better understand trends and topics without depending on the hashtags that users choose. Many users omit hashtags or use wrong or misspelled ones, so the goal is a system that recognizes the important content of a tweet directly from its text.

Named Entity Recognition (NER) is an important subtask of information extraction that seeks to locate and recognise named entities.

You need to train models that will be able to identify the various named entities.

## Data Description

The dataset is annotated with 10 fine-grained NER categories: person, geo-location, company, facility, product, music artist, movie, sports team, TV show, and other. It was extracted from English-language tweets and is stored as a plain-text file in CoNLL format.

A CoNLL file is a text file with one word per line, with sentences separated by an empty line. The first field on each line is the word and the last field is its label.
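
The format described above can be sketched as a small parser. This is a minimal illustration, not the loader used later in this notebook: the function name `parse_conll` and the in-memory `sample` string are hypothetical, and it assumes fields are separated by whitespace.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (word, label) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # The first field is the word, the last field is the label.
        current.append((fields[0], fields[-1]))
    if current:
        # The file may not end with a blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample: two short sentences separated by a blank line.
sample = "New\tB-other\nOrleans\tI-other\nshooting\tO\n\nthey\tO\nwill\tO\n"
sentences = parse_conll(sample)
print(len(sentences))
print(sentences[0])
```

Keeping sentences as separate lists preserves the boundaries that sequence models usually rely on.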

In [14]:
import pandas as pd

In [15]:
def read_file(filename):
    with open(filename, "r", encoding="utf-8") as file:
        # Read the lines of the file
        lines = file.readlines()
    return lines

In [16]:
filename = "Datasets/wnut 16.txt.conll"
lines = read_file(filename)
for i in range(10):
    print(lines[i], end="")  # each line already ends with a newline

print("\nTotal number of lines:", len(lines))

@SammieLynnsMom	O
@tg10781	O
they	O
will	O
be	O
all	O
done	O
by	O
Sunday	O
trust	O

Total number of lines: 48862


In [17]:
filename = "Datasets/wnut 16test.txt.conll"
lines = read_file(filename)
for i in range(10):
    print(lines[i], end="")  # each line already ends with a newline

print("\nTotal number of lines:", len(lines))

New	B-other
Orleans	I-other
Mother	I-other
's	I-other
Day	I-other
Parade	I-other
shooting	O
.	O
One	O
of	O

Total number of lines: 65757
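
The labels in this output follow what looks like a BIO scheme: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. As a rough sketch of how such tags turn into entity spans, the hypothetical helper `extract_entities` below merges consecutive tokens; it assumes tags of the form `B-type`, `I-type`, or `O`.

```python
from typing import List, Optional, Tuple


def extract_entities(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from BIO-tagged tokens."""
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type: Optional[str] = None
    for word, tag in zip(words, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and current_words and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_words.append(word)
            continue
        # Any other tag closes the open entity, if there is one.
        if current_words:
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        # "B" (or a stray "I" with no open entity) starts a new entity.
        if prefix in ("B", "I"):
            current_words, current_type = [word], entity_type
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# The first ten tokens of the test file shown above.
words = ["New", "Orleans", "Mother", "'s", "Day", "Parade", "shooting", ".", "One", "of"]
tags = ["B-other"] + ["I-other"] * 5 + ["O"] * 4
entities = extract_entities(words, tags)
print(entities)
```

For these ten tokens the helper yields a single entity of type `other` covering "New Orleans Mother 's Day Parade".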


In [18]:
def read_file(filename):
    # Define column names based on the CoNLL format
    column_names = ["Word", "NER"]

    # Read the data from the file
    with open(filename, "r", encoding="utf-8") as file:
        lines = file.readlines()

    # Initialize an empty list to store formatted data
    formatted_data = []

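    # Parse each line and append to formatted_data
    for line in lines:
        # Remove leading/trailing whitespace and split on tabs
        parts = line.strip().split("\t")
        # Blank separator lines yield a single empty field, so their NER
        # value is missing; those rows are dropped later with dropna()
        formatted_data.append(parts)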

    # Convert the list of lists into a DataFrame
    df = pd.DataFrame(formatted_data, columns=column_names)
    return df

In [19]:
df_train = read_file("Datasets/wnut 16.txt.conll")
df_test = read_file("Datasets/wnut 16test.txt.conll")

In [20]:
df_train.head()

|   | Word | NER |
|---|------|-----|
| 0 | @SammieLynnsMom | O |
| 1 | @tg10781 | O |
| 2 | they | O |
| 3 | will | O |
| 4 | be | O |


In [21]:
df_test.head()

|   | Word | NER |
|---|------|-----|
| 0 | New | B-other |
| 1 | Orleans | I-other |
| 2 | Mother | I-other |
| 3 | 's | I-other |
| 4 | Day | I-other |


In [22]:
print("\nData shape:", df_train.shape)
print("\nData shape:", df_test.shape)


Data shape: (48862, 2)

Data shape: (65757, 2)


In [23]:
print("\nMissing values:")
print("-"*20)
print(df_train.isnull().sum())
print("-"*20)
print(df_test.isnull().sum())


Missing values:
--------------------
Word       0
NER     2393
dtype: int64
--------------------
Word       0
NER     3849
dtype: int64


In [24]:
df_train.dropna(inplace=True)

In [25]:
df_test.dropna(inplace=True)
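
The rows with a missing NER value appear to come from the blank lines that separate sentences, so dropping them also discards the sentence boundaries. The sketch below is a plain-Python illustration of one way to keep that information, by tagging each token with a sentence id before the separator rows are removed. The function `add_sentence_ids` and the sample `rows` are hypothetical and are not part of the notebook's pipeline.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, Optional[str]]


def add_sentence_ids(rows: List[Row]) -> List[Tuple[int, str, str]]:
    """Drop separator rows (label is None) and tag each token with a sentence id."""
    tagged: List[Tuple[int, str, str]] = []
    sentence_id = 0
    seen_token = False
    for word, label in rows:
        if label is None:
            # Separator row: the next token starts a new sentence.
            if seen_token:
                sentence_id += 1
                seen_token = False
            continue
        tagged.append((sentence_id, word, label))
        seen_token = True
    return tagged


# Hypothetical rows shaped like the parsed data: ("", None) marks a blank line.
rows = [
    ("they", "O"), ("will", "O"), ("", None),
    ("New", "B-other"), ("Orleans", "I-other"), ("", None),
]
tagged = add_sentence_ids(rows)
print(tagged)
```

With a sentence id attached to every token, the tokens can later be regrouped into sentences for training.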

In [26]:
print("\nMissing values:")
print("-"*20)
print(df_train.isnull().sum())
print("-"*20)
print(df_test.isnull().sum())


Missing values:
--------------------
Word    0
NER     0
dtype: int64
--------------------
Word    0
NER     0
dtype: int64


In [27]:
from sklearn.preprocessing import LabelEncoder
from transformers import BertTokenizer

# Small sample: the first five tokens of the test set
data = {
    "Word": ["New", "Orleans", "Mother", "'s", "Day"],
    "NER": ["B-other", "I-other", "I-other", "I-other", "I-other"]
}

# Encode the string labels as integers
label_encoder = LabelEncoder()
data["NER_encoded"] = label_encoder.fit_transform(data["NER"])

# Tokenize the sample; each word is passed as its own sequence,
# padded to the length of the longest one
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer(data["Word"], padding=True, truncation=True, return_tensors="pt")

print("Encoded labels:", data["NER_encoded"])
print("Tokenized inputs:", inputs)


Encoded labels: [0 1 1 1 1]
Tokenized inputs: {'input_ids': tensor([[ 101, 2047,  102,    0],
        [ 101, 5979,  102,    0],
        [ 101, 2388,  102,    0],
        [ 101, 1005, 1055,  102],
        [ 101, 2154,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 1],
        [1, 1, 1, 0]])}
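
In the output above, `'s` maps to two ids (1005 and 1055), so word-level labels no longer line up one-to-one with the model inputs. A common convention is to label only the first piece of each word and mask the rest with -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The sketch below illustrates that alignment with a hypothetical stand-in splitter, `toy_split`, in place of the real tokenizer; with a Hugging Face fast tokenizer, the `word_ids()` mapping is the usual source of this information.

```python
from typing import Callable, Dict, List, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def toy_split(word: str) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: splits off a leading apostrophe."""
    if word.startswith("'") and len(word) > 1:
        return ["'", word[1:]]
    return [word]


def align_labels(
    words: List[str],
    labels: List[str],
    label2id: Dict[str, int],
    split: Callable[[str], List[str]] = toy_split,
) -> Tuple[List[str], List[int]]:
    """Expand word-level labels to piece level, masking non-initial pieces."""
    pieces: List[str] = []
    piece_labels: List[int] = []
    for word, label in zip(words, labels):
        sub = split(word)
        pieces.extend(sub)
        # Label the first piece; mask the rest so the loss skips them.
        piece_labels.extend([label2id[label]] + [IGNORE_INDEX] * (len(sub) - 1))
    return pieces, piece_labels


words = ["New", "Orleans", "Mother", "'s", "Day"]
labels = ["B-other", "I-other", "I-other", "I-other", "I-other"]
# Sorted order gives the same integer mapping as the encoded labels printed above.
label2id = {label: i for i, label in enumerate(sorted(set(labels)))}
pieces, piece_labels = align_labels(words, labels, label2id)
print(pieces)
print(piece_labels)
```

Masking non-initial pieces keeps one prediction per original word, which makes evaluation against the word-level annotations straightforward.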


In [31]:
df_test.sample(20)

|       | Word | NER |
|-------|------|-----|
| 13924 | before | O |
| 13542 | : | O |
| 28890 | in | O |
| 45037 | following | O |
| 15133 | not | O |
| 20980 | , | O |
| 20799 | ? | O |
| 64137 | to | O |
| 64742 | French | B-company |
| 7323 | https://t.co/MwgDe5UB08 | O |
