# NLP-Empowered-Named-Entity-Recognition
EntitySense utilizes advanced NLP techniques to automatically identify and categorize entities in text data. With deep learning and semantic analysis, it offers accurate entity recognition, enabling applications like information extraction and sentiment analysis across different domains.


## Problem Statement
Twitter is a microblogging and social networking service on which users post and interact with messages known as "tweets". On average, around 6,000 tweets are posted every second, which corresponds to over 350,000 tweets per minute and roughly 500 million tweets per day.

Twitter wants to automatically tag and analyze tweets to better understand trends and topics without depending on the hashtags that users add. Many users do not use hashtags, or use wrong or misspelled ones, so the goal is a system that recognizes the important content of a tweet directly from its text.

Named Entity Recognition (NER) is an important subtask of information extraction that seeks to locate and recognise named entities.

You need to train models that will be able to identify the various named entities.

## Data Description

The dataset is annotated with 10 fine-grained NER categories: person, geo-location, company, facility, product, music artist, movie, sports team, tv show and other. It was extracted from English-language tweets and is stored as a text file in CoNLL format.

The CoNLL format is a text file with one word per line, with sentences separated by an empty line. The first field on a line is the word and the last field is the label.
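
As a minimal illustration of this layout, the sketch below parses a small hand-written sample into sentences of (word, label) pairs. The function name `parse_conll` and the sample text are assumptions made for this example; they are not part of the notebook.

```python
def parse_conll(text):
    """Split CoNLL-style text into sentences of (word, label) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # First field is the word, last field is the label.
        current.append((fields[0], fields[-1]))
    # Keep the last sentence when the text has no trailing empty line.
    if current:
        sentences.append(current)
    return sentences


sample = "made\tO\nit\tO\nto\tO\nga\tB-geo-loc\n\nback\tO\nhome\tO\n"
sentences = parse_conll(sample)
print(len(sentences))    # 2
print(sentences[0][-1])  # ('ga', 'B-geo-loc')
```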

#### Step 1: Import Library

In [14]:
import pandas as pd

#### Step 2: Load the Twitter Named Entity Recognition corpus
The corpus used here contains tweets with NE tags. Every line of a file contains a pair of a token (word or punctuation symbol) and a tag, separated by whitespace. Different tweets are separated by an empty line.

In [1]:
import re

def read_data(file_path):
    tokens = []
    tags = []

    tweet_tokens = []
    tweet_tags = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                # An empty line marks the end of a tweet
                if tweet_tokens:
                    tokens.append(tweet_tokens)
                    tags.append(tweet_tags)
                tweet_tokens = []
                tweet_tags = []
            else:
                token, tag = line.split()
                # Replace all urls with <URL> token
                token = re.sub('http[s]*://.*', '<URL>', token.lower())
                # Replace all users with <USR> token
                token = re.sub('@.*', '<USR>', token)
                tweet_tokens.append(token)
                tweet_tags.append(tag)

    # Keep the last tweet if the file does not end with an empty line
    if tweet_tokens:
        tokens.append(tweet_tokens)
        tags.append(tweet_tags)

    return tokens, tags

In [2]:
train_tokens, train_tags = read_data('Datasets/wnut 16.txt.conll')
test_tokens, test_tags = read_data('Datasets/wnut 16test.txt.conll')

In [5]:
for i in range(3):
    for token, tag in zip(train_tokens[i], train_tags[i]):
        print('%s\t%s' % (token, tag))
    print() 

<USR>	O
<USR>	O
they	O
will	O
be	O
all	O
done	O
by	O
sunday	O
trust	O
me	O
*wink*	O

made	O
it	O
back	O
home	O
to	O
ga	B-geo-loc
.	O
it	O
sucks	O
not	O
to	O
be	O
at	O
disney	B-facility
world	I-facility
,	O
but	O
its	O
good	O
to	O
be	O
home	O
.	O
time	O
to	O
start	O
planning	O
the	O
next	O
disney	B-facility
world	I-facility
trip	O
.	O

'	O
breaking	B-movie
dawn	I-movie
'	O
returns	O
to	O
vancouver	B-geo-loc
on	O
january	O
11th	O
<URL>	O
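
The tags in this output follow the BIO scheme: `B-` marks the first token of an entity, `I-` marks a continuation of it, and `O` marks a token outside any entity. A minimal sketch of grouping such tags back into entity spans follows. `extract_entities` is a hypothetical helper written for this example, not a function from the notebook.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A new entity starts; flush the previous one if there is one.
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith('I-') and current_tokens and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # 'O' (or an inconsistent I- tag) closes the current entity.
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities


tokens = ["'", 'breaking', 'dawn', "'", 'returns', 'to', 'vancouver']
tags = ['O', 'B-movie', 'I-movie', 'O', 'O', 'O', 'B-geo-loc']
entities = extract_entities(tokens, tags)
print(entities)  # [('breaking dawn', 'movie'), ('vancouver', 'geo-loc')]
```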



In [6]:
from collections import defaultdict
def build_dict(tokens_or_tags, special_tokens):
    """
        tokens_or_tags: a list of lists of tokens or tags
        special_tokens: some special tokens
    """
    # Unknown keys fall back to index 0, the index of the first special token
    tok2idx = defaultdict(lambda: 0)
    idx2tok = []

    # Special tokens (or tags) are added first, so the first one gets index 0
    idx = 0
    for tok in special_tokens:
        tok2idx[tok] = idx
        idx2tok.append(tok)
        idx += 1

    # Every remaining unique token (or tag) gets the next free index.
    # Special tokens are excluded here because they are already indexed.
    unique_items = set(item for sublist in tokens_or_tags for item in sublist) - set(special_tokens)
    for tok in unique_items:
        tok2idx[tok] = idx
        idx2tok.append(tok)
        idx += 1
    return tok2idx, idx2tok

In [7]:
special_tokens = ['<UNK>', '<PAD>']
special_tags = ['O']
 
# Create dictionaries
token2idx, idx2token = build_dict(train_tokens, special_tokens)
tag2idx, idx2tag = build_dict(train_tags, special_tags)
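
To see what these mappings look like, the sketch below rebuilds them on a two-tweet toy corpus. It uses a compact stand-in for `build_dict` (`build_vocab`, a hypothetical name) that sorts the vocabulary so the indices are reproducible across runs, and it assumes `<UNK>` and `<PAD>` as the special tokens.

```python
from collections import defaultdict


def build_vocab(sequences, special_tokens):
    """Map each distinct item to an index; unseen items fall back to index 0."""
    idx2tok = list(special_tokens)
    # Sorting makes the index assignment deterministic.
    idx2tok += sorted({item for seq in sequences for item in seq} - set(special_tokens))
    tok2idx = defaultdict(int, {tok: idx for idx, tok in enumerate(idx2tok)})
    return tok2idx, idx2tok


toy_tokens = [['made', 'it', 'home'], ['back', 'home']]
tok2idx, idx2tok = build_vocab(toy_tokens, ['<UNK>', '<PAD>'])

print(idx2tok)  # ['<UNK>', '<PAD>', 'back', 'home', 'it', 'made']

# 'to' never occurs in the toy corpus, so it maps to index 0 (<UNK>).
indices = [tok2idx[t] for t in ['back', 'to', 'home']]
print(indices)                        # [2, 0, 3]
print([idx2tok[i] for i in indices])  # ['back', '<UNK>', 'home']
```

Note that looking up an unseen token in a `defaultdict` also stores that token with the default value, so the dictionary grows as unknown words are queried.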

In [9]:
def words2idxs(tokens_list):
    return [token2idx[word] for word in tokens_list]

def idxs2words(idxs):
    return [idx2token[idx] for idx in idxs]

def tags2idxs(tags_list):
    return [tag2idx[tag] for tag in tags_list]
 
def idxs2tags(idxs):
    return [idx2tag[idx] for idx in idxs]

defaultdict(<function __main__.build_dict.<locals>.<lambda>()>,
            {'<UNK>': 0,
             '<PAD>': 1,
             'bat': 2,
             'honey': 3,
             'window': 4,
             'fest': 5,
             '#moviedialoguewithpopesubstitute': 6,
             'merci': 7,
             'billy': 8,
             'guitar': 9,
             'due': 10,
             'freekey': 11,
             '85': 12,
             'sfw': 13,
             'debuted': 14,
             'jermaine': 15,
             'mouth': 16,
             'followd': 17,
             'dit': 18,
             'crisis': 19,
             'metallica': 20,
             'human': 21,
             'shane': 22,
             'loos': 23,
             'harii': 24,
             'wale': 25,
             'following': 26,
             'partiesss': 27,
             'theory': 28,
             'doomsday': 29,
             'pd': 30,
             'willingham': 31,
             'yea': 32,
             'costume': 33,
             'loader': 34,
             'urgh': 3

In [18]:
def read_file(filename):
    # Define column names based on the CoNLL format
    column_names = ["Word", "NER"]

    # Read the data from the file
    with open(filename, "r", encoding="utf-8") as file:
        lines = file.readlines()

    # Initialize an empty list to store formatted data
    formatted_data = []

    # Parse each line and append to formatted_data
    for line in lines:
        # Remove leading/trailing whitespace
        line = line.strip()
        # Ignore empty lines (they only separate tweets)
        if not line:
            continue
        # Split the word and its tag on the tab separator
        formatted_data.append(line.split("\t"))

    # Convert the list of lists into a DataFrame
    df = pd.DataFrame(formatted_data, columns=column_names)
    return df

In [19]:
df_train = read_file("Datasets/wnut 16.txt.conll")
df_test = read_file("Datasets/wnut 16test.txt.conll")
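
Once the corpus is in a DataFrame, the label distribution is easy to inspect. The following is a minimal sketch on a hand-built frame with the same `Word` and `NER` columns; the rows are illustrative and the variable names are assumptions for this example.

```python
import pandas as pd

# Illustrative rows with the same columns that read_file produces.
df = pd.DataFrame(
    [['made', 'O'], ['it', 'O'], ['to', 'O'], ['ga', 'B-geo-loc'],
     ['disney', 'B-facility'], ['world', 'I-facility']],
    columns=['Word', 'NER'],
)

# Count how often each tag occurs.
tag_counts = df['NER'].value_counts()
print(tag_counts['O'])  # 3

# Keep only the tokens that belong to an entity.
entity_rows = df[df['NER'] != 'O']
print(entity_rows['Word'].tolist())  # ['ga', 'disney', 'world']
```

The same two calls can be applied to `df_train` and `df_test` to compare how the tags are distributed across the two splits.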