# Named Entity Recognition

Brief introduction:
- https://en.wikipedia.org/wiki/Named-entity_recognition
- https://towardsdatascience.com/contextual-embeddings-for-nlp-sequence-labeling-9a92ba5a6cf0
- https://cs230.stanford.edu/blog/namedentity/


### 1) Importing the libraries

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import nltk
import spacy
import seaborn
import warnings
import math



### 2) Reading input file

In [0]:
df = pd.read_csv('drive/My Drive/Pytorch_DataSet/Named Entity Recognition/ner_dataset.csv',encoding='utf-8')

UnicodeDecodeError: ignored

Fix for the error above: https://stackoverflow.com/questions/21504319/python-3-csv-file-giving-unicodedecodeerror-utf-8-codec-cant-decode-byte-err

- The file is encoded as `windows-1252`, not UTF-8, so we pass that encoding explicitly.
- Source for the file's encoding: https://github.com/cs230-stanford/cs230-code-examples/blob/master/pytorch/nlp/build_kaggle_dataset.py

In [0]:
df = pd.read_csv('drive/My Drive/Pytorch_DataSet/Named Entity Recognition/ner_dataset.csv',encoding='windows-1252')
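
The encoding fallback can also be automated. The sketch below is not the notebook's code: it uses a hypothetical byte string (`raw`) and a hypothetical helper (`decode_with_fallback`) to show why a UTF-8 attempt fails on such data while `windows-1252` succeeds. The same `try`/`except UnicodeDecodeError` pattern could be wrapped around `pd.read_csv`.

```python
# Hypothetical sample row. The curly apostrophe is stored as the single byte
# 0x92 in windows-1252, and a lone 0x92 is not valid UTF-8.
raw = "Bush\u2019s,NNP,O\n".encode("windows-1252")


def decode_with_fallback(data, encodings=("utf-8", "windows-1252")):
    """Return (text, encoding) for the first candidate encoding that works."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


text, used_encoding = decode_with_fallback(raw)
print(used_encoding, repr(text))
```

Trying UTF-8 first is a design choice: it is the stricter format, so a successful UTF-8 decode is rarely a false positive, whereas `windows-1252` accepts almost any byte sequence.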

In [0]:
df.head(10)

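| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | | of | IN | O |
| 2 | | demonstrators | NNS | O |
| 3 | | have | VBP | O |
| 4 | | marched | VBN | O |
| 5 | | through | IN | O |
| 6 | | London | NNP | B-geo |
| 7 | | to | TO | O |
| 8 | | protest | VB | O |
| 9 | | the | DT | O |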


In [0]:
len(df)

1048575

In [0]:
df.describe()

| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| count | 47959 | 1048575 | 1048575 | 1048575 |
| unique | 47959 | 35178 | 42 | 17 |
| top | Sentence: 13830 | the | NN | O |
| freq | 1 | 52573 | 145807 | 887908 |


In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Sentence #  47959 non-null    object
 1   Word        1048575 non-null  object
 2   POS         1048575 non-null  object
 3   Tag         1048575 non-null  object
dtypes: object(4)
memory usage: 32.0+ MB


In [0]:
a = df['Word'].tolist()
sentences = ' '.join(a)

In [0]:
#sentences



Looking at this dataset, we need to convert it into two text files: one containing the sentences and the other containing the labels, with one sentence per line.

Example :

      sentences.txt

      John lives in New York
      Where is John ?

      labels.txt
      
      B-PER O O B-LOC I-LOC
      O O B-PER O
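
A minimal sketch of producing these two files, assuming the words and tags have already been grouped into one list per sentence. The variable names, the helper `write_lines`, and the temporary output directory are illustrative assumptions, not part of the notebook.

```python
import os
import tempfile

# Hypothetical, already-grouped data matching the example above.
sents = [["John", "lives", "in", "New", "York"], ["Where", "is", "John", "?"]]
tags = [["B-PER", "O", "O", "B-LOC", "I-LOC"], ["O", "O", "B-PER", "O"]]


def write_lines(path, token_lists):
    """Write one space-separated line per inner list."""
    with open(path, "w", encoding="utf-8") as fh:
        for tokens in token_lists:
            fh.write(" ".join(tokens) + "\n")


out_dir = tempfile.mkdtemp()  # throwaway directory for the demo
sent_path = os.path.join(out_dir, "sentences.txt")
label_path = os.path.join(out_dir, "labels.txt")
write_lines(sent_path, sents)
write_lines(label_path, tags)

# Read the files back and check that every sentence has one label per token.
with open(sent_path, encoding="utf-8") as fh:
    sentence_lines = fh.read().splitlines()
with open(label_path, encoding="utf-8") as fh:
    label_lines = fh.read().splitlines()

aligned = all(
    len(s.split()) == len(l.split()) for s, l in zip(sentence_lines, label_lines)
)
print(sentence_lines, label_lines, aligned)
```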

On a closer look, the first column holds the sentence number, and it is filled in only on the first word of each sentence. We can use it to find sentence boundaries. The alternative is to work on the joined `sentences` string: if we wrote our own splitter, we would have to treat every punctuation mark as a possible sentence ending. NLTK or spaCy could split the sentences for us, but then we would face another problem: aligning the tags with the re-split sentences.


In [0]:
df.columns

Index(['Sentence #', 'Word', 'POS', 'Tag'], dtype='object')

In [0]:
#sentences

In [0]:
tag = ' '.join(df['Tag'].tolist())
#tag

'O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O O O O O O O O O O O O O O O O O O O B-per O O O O O O O O O O O O O O O O O O O O O O B-geo I-geo O O O O O O O O O O O O O O O O O O O O O O O O O O O B-geo O O B-org I-org O O O B-gpe O O O B-geo O O O O O O B-gpe O O O O B-geo O O O O O O O B-gpe O O O O O O B-geo O O O O O O O O O O O O B-geo O B-geo O O B-geo O O B-org I-org I-org I-org O O O O O O O O B-geo B-tim O O O O O B-gpe O O O O O O O B-gpe O O O O O O O O O O B-geo O O O B-gpe O O O O O O O O O O O O O O B-tim O O O B-org O O O O O O O O O O O O O O O O O O B-org I-org O O B-gpe O O O O O O B-gpe O O B-org I-org I-org O O O O O O O O B-gpe O O O B-art I-art O O B-gpe O O B-per I-per I-per O B-tim O B-gpe O O O O B-gpe O O O O O O O O O O O B-gpe O O O B-gpe O O B-gpe O O O O O O O O O O O O O O B-geo O O O B-geo O O O O O O B-gpe O B-org I-org O B-per I-per O O O O O O O O B-tim O O O O B-geo I-geo O B-geo I-geo O O O O O O O O B-org I-org O O B-gpe O O O O O O

In [0]:
nlp = spacy.load('en_core_web_sm')  # spaCy 3 removed the 'en' shortcut; use the full model name

In [0]:
doc = nlp(sentences)

ValueError: ignored

- The single `sentences` string is longer than spaCy's default maximum text length. Even if we raise the limit, we would still have to split the text back into sentences, and that step can introduce errors.
- Let's think about working on the DataFrame directly instead.

Fix for the error above: https://stackoverflow.com/questions/57231616/valueerror-e088-text-of-length-1027203-exceeds-maximum-of-1000000-spacy

In [0]:
#nlp.max_length = 6053799

In [0]:
#doc = nlp(sentences)

- With the limit raised, parsing takes a very long time and can exhaust the available RAM, crashing the session.

- Let's work on the DataFrame only.
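
For reference, the length limit could also be sidestepped by feeding the text in smaller pieces rather than raising `nlp.max_length`. The notebook does not take this route. The sketch below uses a hypothetical helper, `chunk_words`, and a tiny limit for readability; it cuts on word boundaries only, so it would not solve the tag-alignment problem described earlier.

```python
def chunk_words(words, max_chars):
    """Greedily pack words into space-joined chunks of at most max_chars.

    A single word longer than max_chars still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for word in words:
        extra = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and size + extra > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
            extra = len(word)
        current.append(word)
        size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical input: the first few words of the dataset.
words = ["Thousands", "of", "demonstrators", "have", "marched"]
chunks = chunk_words(words, max_chars=20)
print(chunks)
```

Each chunk could then be passed to the pipeline separately, at the cost of losing any context that spans a chunk boundary.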


Before building lists from the dataset:
- Let's review the list `append()` method and concatenation with `+` / `+=`, and
- the difference between `list()` and `[]` for initializing a list.
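
A small sketch of these points, using throwaway names (`a`, `alias`, `b` are illustrative only). It also shows that two names can refer to the same list object, which matters when a list is stored somewhere and later modified.

```python
a = [1, 2]
alias = a            # a second name for the same list object

a.append([3, 4])     # append adds ONE element (here a nested list), in place
a += [5, 6]          # += extends the same object with each element
b = a + [7]          # + builds a brand-new list and leaves a untouched

print(a)             # the nested list stays nested
print(alias is a)    # both names still point at one object
print(b is a)        # b is a separate object

# list() and [] both create an empty list, but list() can also convert
# any iterable, whereas a literal wraps its argument as a single element.
print(list("ab"), ["ab"])
```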

In [0]:
list1 = [].append([2])  # list1 is None: append() modifies the list in place and returns None
print(list1)

list2 = []     # correct way
list2.append([1,2])
list2.append([3,4])
print(list2)  

list3 = [] + [1,2,3] + [2,3,4]  # '+' concatenates the operands into one new flat list
print(list3)

"""
list() is a function call, and [] a literal:

import dis
def f1(): return list()
 
def f2(): return []
 
dis.dis(f1)
  1           0 LOAD_GLOBAL              0 (list)
              3 CALL_FUNCTION            0
              6 RETURN_VALUE        
dis.dis(f2)
  1           0 BUILD_LIST               0
              3 RETURN_VALUE        
Use the second form. It's more Pythonic, and it's probably faster (since it doesn't involve loading and calling a separate function).
"""

None
[[1, 2], [3, 4]]
[1, 2, 3, 2, 3, 4]

In [0]:
sents = []
tags = []
s = []
t = []
tracker = 0
# The first 'Sentence #' marker arrives while s and t are still empty.
# `first` lets us skip appending that empty pair.
first = True

for index, row in df.iterrows():
  sent = row['Sentence #']
  word = row['Word']
  tag = row['Tag']

  # 'Sentence #' is a string only on the first word of a sentence (NaN otherwise).
  if isinstance(sent, str):
    if not first:
      # NOTE: this appends references to s and t, not copies, so the
      # clear() calls below also empty the lists already stored in sents/tags.
      sents.append(s)
      tags.append(t)
      print(f'{type(sent)} {type(s)} {type(t)} sent : {s}    and  tag : {t}')
      s.clear()
      t.clear()
    else:
      first = False

  s.append(word)
  t.append(tag)
  tracker += 1
  if tracker > 100:  # only process the first rows while experimenting
    break

sents.append(s)
tags.append(t)
s.clear()
t.clear()



<class 'str'> <class 'list'> <class 'list'> sent : []    and  tag : []
<class 'str'> <class 'list'> <class 'list'> sent : ['Thousands', 'of', 'demonstrators', 'have', 'marched', 'through', 'London', 'to', 'protest', 'the', 'war', 'in', 'Iraq', 'and', 'demand', 'the', 'withdrawal', 'of', 'British', 'troops', 'from', 'that', 'country', '.']    and  tag : ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
<class 'str'> <class 'list'> <class 'list'> sent : ['Families', 'of', 'soldiers', 'killed', 'in', 'the', 'conflict', 'joined', 'the', 'protesters', 'who', 'carried', 'banners', 'with', 'such', 'slogans', 'as', '"', 'Bush', 'Number', 'One', 'Terrorist', '"', 'and', '"', 'Stop', 'the', 'Bombings', '.', '"']    and  tag : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-per', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
<class 'str'> <class 'list'> <class 

In [0]:
sents

[[], [], [], [], [], []]
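
The all-empty result is consistent with list aliasing: `sents.append(s)` stores a reference to `s`, and the later `s.clear()` empties that same object, so every entry in `sents` ends up pointing at one empty list. Below is a minimal sketch of one way around it, rebinding `s` and `t` to fresh lists instead of clearing them. It runs on hypothetical `(Sentence #, Word, Tag)` tuples rather than the real DataFrame, with `None` standing in for the missing sentence markers; `group_rows` is an illustrative helper name.

```python
# Reproduce the problem in isolation.
buggy = []
buf = ["John"]
buggy.append(buf)   # stores a reference to buf, not a copy
buf.clear()         # empties the list that buggy holds as well
print(buggy)

# Hypothetical rows mimicking the dataset layout.
rows = [
    ("Sentence: 1", "John", "B-per"),
    (None, "lives", "O"),
    (None, "in", "O"),
    (None, "London", "B-geo"),
    ("Sentence: 2", "Where", "O"),
    (None, "is", "O"),
    (None, "John", "B-per"),
    (None, "?", "O"),
]


def group_rows(rows):
    """Group words and tags into one list per sentence."""
    sents, tags = [], []
    s, t = [], []
    for marker, word, tag in rows:
        # A string marker starts a new sentence; flush the previous one.
        if isinstance(marker, str) and s:
            sents.append(s)
            tags.append(t)
            s, t = [], []  # rebind to NEW lists; the stored ones stay intact
        s.append(word)
        t.append(tag)
    if s:  # flush the final sentence
        sents.append(s)
        tags.append(t)
    return sents, tags


sents, tags = group_rows(rows)
print(sents)
print(tags)
```

Checking `s` for emptiness before flushing also removes the need for a separate `first` flag. Appending a copy (`sents.append(s.copy())`) before clearing would be another option.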