# Named Entity Recognition - Data Preprocessing

Brief introduction:
- https://en.wikipedia.org/wiki/Named-entity_recognition
- https://towardsdatascience.com/contextual-embeddings-for-nlp-sequence-labeling-9a92ba5a6cf0
- https://cs230.stanford.edu/blog/namedentity/


## 1) Importing the libraries

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import nltk
import spacy
import math

## 2) Reading input file

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# ner_dataset.csv is not UTF-8 encoded, so reading it with encoding='utf-8' raises a UnicodeDecodeError.
# Uncomment the line below to see the error.
# df = pd.read_csv('drive/My Drive/Datasets/ner_dataset.csv',encoding='utf-8') 


Fix for the above error: https://stackoverflow.com/questions/21504319/python-3-csv-file-giving-unicodedecodeerror-utf-8-codec-cant-decode-byte-err

- The file is encoded as `windows-1252`, so we pass that encoding to `read_csv`.
- Source for this file's encoding: https://github.com/cs230-stanford/cs230-code-examples/blob/master/pytorch/nlp/build_kaggle_dataset.py

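Why `utf-8` fails while `windows-1252` works can be seen on a single byte. The sketch below is illustrative only: the helper `decode_with_fallback` and the sample bytes are assumptions made for the example, not part of the dataset or of pandas.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'windows-1252')):
    """Try each encoding in order; return (text, encoding_used). Hypothetical helper."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding, try the next one
    raise ValueError('none of the encodings could decode the input')

# 0xE9 is 'é' in windows-1252 but an incomplete multi-byte sequence in UTF-8
sample = b'caf\xe9'
text, used = decode_with_fallback(sample)
print(text, used)
```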
In [0]:
df = pd.read_csv('drive/My Drive/Pytorch_DataSet/Named Entity Recognition/ner_dataset.csv',encoding='windows-1252')

In [5]:
df.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


In [6]:
# printing rows 90 to 100
df.loc[90:100]

Unnamed: 0,Sentence #,Word,POS,Tag
90,,the,DT,O
91,,annual,JJ,O
92,,conference,NN,O
93,,of,IN,O
94,,Britain,NNP,B-geo
95,,'s,POS,O
96,,ruling,VBG,O
97,,Labor,NNP,B-org
98,,Party,NNP,I-org
99,,in,IN,O


In [7]:
len(df)

1048575

In [8]:
df.describe()

Unnamed: 0,Sentence #,Word,POS,Tag
count,47959,1048575,1048575,1048575
unique,47959,35178,42,17
top,Sentence: 27660,the,NN,O
freq,1,52573,145807,887908


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Sentence #  47959 non-null    object
 1   Word        1048575 non-null  object
 2   POS         1048575 non-null  object
 3   Tag         1048575 non-null  object
dtypes: object(4)
memory usage: 32.0+ MB


## 3) Forming sentences and labels

In [0]:
a = df['Word'].tolist()
sentences = ' '.join(a)

In [0]:
#sentences

Looking at this dataset, we need to convert it into 2 text files: one containing the sentences and the other containing the labels.

Example :

      sentences.txt

      John lives in New York
      Where is John ?

      labels.txt
      
      B-PER O O B-LOC I-LOC
      O O B-PER O

Looking more closely at the dataset, the first column holds the sentence number on the first word of each sentence, so we can use it to find sentence boundaries. If we worked on the `sentences` string instead, our own code would have to treat every punctuation mark as a possible sentence end. NLTK or spaCy sentence splitting could do that for us, but then we would face another problem: matching the tags to the resulting sentences.


In [12]:
df.columns

Index(['Sentence #', 'Word', 'POS', 'Tag'], dtype='object')

In [0]:
tag = ' '.join(df['Tag'].tolist())
#tag

In [0]:
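# The 'en' shortcut works in spaCy v2.x (used here); spaCy v3 removed shortcuts,
# so there a full package name such as 'en_core_web_sm' is needed instead.
nlp = spacy.load('en')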

In [0]:
import warnings
warnings.filterwarnings('ignore')


The line below raises a text-length error:

      ValueError: [E088] Text of length 6053799 exceeds maximum of 1000000. 
      The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input.
      This means long texts may cause memory allocation errors. 
      If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit.
      The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

Uncomment it to see the error.


In [0]:
#doc = nlp(sentences)

- Running spaCy on the whole `sentences` string does not work out of the box. Raising the limit could make it run, but we would still have to split the text back into sentences, and that split can introduce errors.
- Let's try working on the dataframe instead.

Solution to the above error: https://stackoverflow.com/questions/57231616/valueerror-e088-text-of-length-1027203-exceeds-maximum-of-1000000-spacy

In [0]:
#nlp.max_length = 6053799

In [0]:
#doc = nlp(sentences)

- This takes a very long time and is likely to exhaust the available RAM and crash the runtime.

- Let's work on the dataframe only.

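If spaCy had to be used anyway, one option is to feed it smaller pieces rather than raising `nlp.max_length`. The sketch below only shows the chunking step in plain Python; `chunk_words` and its `max_chars` parameter are hypothetical names, and the chunk boundaries it picks can still fall in the middle of a sentence, which is the alignment problem described above.

```python
def chunk_words(words, max_chars=1_000_000):
    """Yield space-joined chunks of `words`, each at most `max_chars` characters.

    A single word longer than `max_chars` is still emitted as its own chunk.
    """
    chunk, length = [], 0
    for word in words:
        # +1 accounts for the joining space when the chunk already has words
        extra = len(word) + (1 if chunk else 0)
        if chunk and length + extra > max_chars:
            yield ' '.join(chunk)
            chunk, length = [], 0
            extra = len(word)
        chunk.append(word)
        length += extra
    if chunk:
        yield ' '.join(chunk)

toy_words = ['Thousands', 'of', 'demonstrators', 'have', 'marched']
chunks = list(chunk_words(toy_words, max_chars=20))
print(chunks)  # ['Thousands of', 'demonstrators have', 'marched']
```

Each chunk could then be passed to `nlp` separately, at the cost of possibly cutting a sentence in two at a chunk boundary.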

Before building the sentence lists:
- Let's review how the list `append()` method and the `+=` operator behave, and
- the difference between `list()` and `[]` for initializing a list.

In [25]:
list1 = [].append([2]) # append() modifies the list in place and returns None, so list1 is None, not a list
print(list1)

list2 = []     # correct way: create the list first, then append to it
list2.append([1,2])
list2.append([3,4])
print(list2)  

list3 = [] + [1,2,3] + [2,3,4]  # + concatenates the lists into one new flat list
print(list3)


# list() is a function call, and [] is a literal.
# Prefer the literal: it is more Pythonic and usually faster
# (since it doesn't involve looking up and calling a separate function).

None
[[1, 2], [3, 4]]
[1, 2, 3, 2, 3, 4]

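The `+=` operator mentioned above is not exercised in that cell, so here is a minimal sketch of how it differs from `append()` and from `+`. All variable names are made up for the illustration.

```python
# += extends the existing list in place, so other names bound to it see the change
a = [1, 2]
alias_a = a
a += [3, 4]

# + builds a new list and rebinds the name; the old list is left untouched
b = [1, 2]
alias_b = b
b = b + [3, 4]

# append() adds its argument as ONE element, += adds each element of the iterable
nested = []
nested.append([5, 6])
flat = []
flat += [5, 6]

print(alias_a)  # [1, 2, 3, 4]
print(alias_b)  # [1, 2]
print(nested)   # [[5, 6]]
print(flat)     # [5, 6]
```

For building a list of sentences, where each sentence must stay a separate inner list, `append()` is the one that fits.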

In [0]:
sents = []
tags = []
s = []   # words of the sentence currently being built
t = []   # tags of the sentence currently being built
first = True  
# The first row also starts a sentence, but at that point s and t are still empty.
# The flag skips that first flush so that no empty list is added to sents and tags.

for index,row in df.iterrows():
  sent = row['Sentence #']
  word = row['Word']
  tag = row['Tag']

  # 'Sentence #' holds a string only on the first word of a sentence; other rows hold NaN
  if isinstance(sent, str):
    if not first:
      sents.append(s.copy())  # https://stackoverflow.com/questions/2612802/how-to-clone-or-copy-a-list
      tags.append(t.copy())
      #print(f'{type(sent)} {type(s)} {type(t)} sent : {s}    and  tag : {t}')
      s.clear()
      t.clear()
    else:
      first = False  

  s.append(word)
  t.append(tag)

# Flush the last sentence: no later 'Sentence #' row exists to trigger it
sents.append(s.copy())
tags.append(t.copy())
s.clear()
t.clear()

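The `.copy()` calls in the loop above matter because `append()` stores a reference to the list, not a snapshot of it. A small sketch with made-up variable names shows what would happen without the copy:

```python
s = ['a', 'b']

without_copy = []
with_copy = []

without_copy.append(s)       # stores a reference to the same list object as s
with_copy.append(s.copy())   # stores an independent shallow copy

s.clear()                    # empties s, and therefore the list inside without_copy

print(without_copy)  # [[]]
print(with_copy)     # [['a', 'b']]
```

Without the copy, every entry in `sents` would point at the same list and end up empty after the final `clear()`.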


In [27]:
print(len(sents), len(tags))

47959 47959

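Row-by-row iteration with `iterrows()` may be slow on a million rows. As an alternative sketch (not the approach used in this notebook), the same grouping can be expressed with pandas operations: a running count of non-null `Sentence #` values gives every row a sentence id. The toy dataframe and the `toy_` names are assumptions made for the example.

```python
import pandas as pd

# Toy frame with the same layout as the dataset: the marker is set only on
# the first word of each sentence
toy = pd.DataFrame({
    'Sentence #': ['Sentence: 1', None, None, 'Sentence: 2', None],
    'Word': ['John', 'lives', 'here', 'Where', '?'],
    'Tag': ['B-per', 'O', 'O', 'O', 'O'],
})

# notna() is True on sentence starts, so the cumulative sum is a sentence id per row
sent_id = toy['Sentence #'].notna().cumsum()

toy_sents = toy.groupby(sent_id)['Word'].apply(list).tolist()
toy_tags = toy.groupby(sent_id)['Tag'].apply(list).tolist()

print(toy_sents)  # [['John', 'lives', 'here'], ['Where', '?']]
print(toy_tags)   # [['B-per', 'O', 'O'], ['O', 'O']]
```

On the real dataframe the same lines would apply with `df` in place of `toy`.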

In [28]:
sents[1]

['Families',
 'of',
 'soldiers',
 'killed',
 'in',
 'the',
 'conflict',
 'joined',
 'the',
 'protesters',
 'who',
 'carried',
 'banners',
 'with',
 'such',
 'slogans',
 'as',
 '"',
 'Bush',
 'Number',
 'One',
 'Terrorist',
 '"',
 'and',
 '"',
 'Stop',
 'the',
 'Bombings',
 '.',
 '"']

In [29]:
tags[0]

['O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-geo',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-geo',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-gpe',
 'O',
 'O',
 'O',
 'O',
 'O']

## 4) Making sentences.txt and labels.txt

In [30]:
# Sample for printing

for sent in sents:
  a = " ".join(sent)
  print(a)
  break

Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .


In [0]:
# Writing sents to sentences.txt

with open('sentences.txt','w') as f:
  for sent in sents:
    line = " ".join(sent)
    f.write(line + '\n')

In [31]:
# Sample for printing

for tag in tags:
  a = " ".join(tag)
  print(a)
  break

O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O


In [0]:
# Writing tags to labels.txt

with open('labels.txt','w') as f:
  for tag in tags:
    line = " ".join(tag)
    f.write(line + '\n')

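Since the two files are only useful if line i of `labels.txt` has exactly one label per token on line i of `sentences.txt`, a quick sanity check can be worth running. This is a sketch under assumptions: `count_aligned_lines` is a hypothetical helper, and the demo writes toy files to a temporary folder instead of touching the real ones.

```python
import os
import tempfile
from itertools import zip_longest

def count_aligned_lines(sentences_path, labels_path):
    """Return the number of lines, raising ValueError on any token/label mismatch."""
    n_lines = 0
    with open(sentences_path) as f_s, open(labels_path) as f_l:
        for s_line, l_line in zip_longest(f_s, f_l):
            if s_line is None or l_line is None:
                raise ValueError('the two files have different numbers of lines')
            n_tokens = len(s_line.strip().split(' '))
            n_labels = len(l_line.strip().split(' '))
            if n_tokens != n_labels:
                raise ValueError(f'line {n_lines + 1}: {n_tokens} tokens but {n_labels} labels')
            n_lines += 1
    return n_lines

# Toy files in a temporary folder, reusing the two-sentence example from section 3
demo_dir = tempfile.mkdtemp()
s_path = os.path.join(demo_dir, 'sentences.txt')
l_path = os.path.join(demo_dir, 'labels.txt')
with open(s_path, 'w') as f:
    f.write('John lives in New York\nWhere is John ?\n')
with open(l_path, 'w') as f:
    f.write('B-PER O O B-LOC I-LOC\nO O B-PER O\n')

n_checked = count_aligned_lines(s_path, l_path)
print(n_checked)  # 2
```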
Now we have created 3 small splits (train, val and test) from the sentences and labels. We will work only on these, since small files are fast to process and errors (if any) can be corrected quickly.

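The notebook does not show how the small splits were produced. One possible way, given purely as an assumption, is to take consecutive slices of the sentence and tag lists and write them into `train/`, `val/` and `test/` subfolders, matching the paths read in the next section. `write_small_splits` and `n_per_split` are hypothetical names.

```python
import os
import tempfile

def write_small_splits(sents, tags, out_dir, n_per_split=10):
    """Write consecutive slices of sents/tags into train, val and test subfolders."""
    for i, split in enumerate(['train', 'val', 'test']):
        start, stop = i * n_per_split, (i + 1) * n_per_split
        split_dir = os.path.join(out_dir, split)
        os.makedirs(split_dir, exist_ok=True)
        with open(os.path.join(split_dir, 'sentences.txt'), 'w') as f:
            for sent in sents[start:stop]:
                f.write(' '.join(sent) + '\n')
        with open(os.path.join(split_dir, 'labels.txt'), 'w') as f:
            for tag in tags[start:stop]:
                f.write(' '.join(tag) + '\n')

# Toy data: 6 two-token sentences, 2 per split, written to a temporary folder
toy_sents = [[f'w{i}', '.'] for i in range(6)]
toy_tags = [['O', 'O'] for _ in range(6)]
demo_root = tempfile.mkdtemp()
write_small_splits(toy_sents, toy_tags, demo_root, n_per_split=2)
print(sorted(os.listdir(demo_root)))  # ['test', 'train', 'val']
```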
## 5) Creating dictionary of words and labels

Let's first create a vocabulary (dictionary) of words and of labels from the sentences and labels in the train, val and test files.

In [0]:
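# Let's create a function that adds the tokens of one file to a vocab

def update_vocab(file_path,vocab):
  """Update `vocab` (a Counter) with the tokens in the file and return the number of lines read."""

  n_lines = 0  # stays 0 if the file is empty
  with open(file_path,'r') as f:

    for n_lines,line in enumerate(f, start=1):
      vocab.update(line.strip().split(' '))

  return n_lines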


In [0]:
import os
from collections import Counter # https://www.journaldev.com/20806/python-counter-python-collections-counter

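As a small illustration of what `Counter.update` does with the split lines, here is a sketch on two in-memory lines instead of a file. It also shows a pitfall: passing a bare string counts characters, which is why the line is split into a list of tokens first.

```python
from collections import Counter

vocab = Counter()
for line in ['John lives in New York\n', 'Where is John ?\n']:
    vocab.update(line.strip().split(' '))  # a list of tokens: each token is counted

print(vocab['John'])          # 2
print(vocab.most_common(1))   # [('John', 2)]

chars = Counter()
chars.update('John')          # a bare string is iterated character by character
print(sorted(chars))          # ['J', 'h', 'n', 'o']
```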
### 5.1) Creating `Words` Vocab

In [34]:
print("Words Vocab Building starts :-D ")

words = Counter()  # Counter is a dict subclass that maps each token to its count

path = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/small/'

train_size = update_vocab(os.path.join(path, 'train/sentences.txt'), words)
val_size = update_vocab(os.path.join(path, 'val/sentences.txt'), words)
test_size = update_vocab(os.path.join(path, 'test/sentences.txt'), words)

print(train_size, val_size, test_size)
#print(words)

print("Words Vocab Building successful!!!")

Words Vocab Building starts :-D 
10 10 10
Words Vocab Building successful!!!


### 5.2) Creating `Tags` Vocab

In [35]:
print("Tags Vocab Building starts :-D ")

tags = Counter()  # Counter is a dict subclass that maps each token to its count

path = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/small/'

train_size = update_vocab(os.path.join(path, 'train/labels.txt'), tags)
val_size = update_vocab(os.path.join(path, 'val/labels.txt'), tags)
test_size = update_vocab(os.path.join(path, 'test/labels.txt'), tags)

print(train_size, val_size, test_size)
#print(tags)

print("Tags Vocab Building successful!!!")

Tags Vocab Building starts :-D 
10 10 10
Tags Vocab Building successful!!!


### 5.3) Keeping Most Frequent Tokens

In [0]:
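# Keep tokens that occur at least `count` times.
# With a threshold of 1 nothing is dropped; raise it to keep only more frequent tokens.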
words = [token for token,count in words.items() if count>=1]
tags =  [token for token,count in tags.items() if count>=1]

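To see what a higher threshold would do, here is a minimal sketch on a toy counter. The helper `keep_frequent` and its `min_count` parameter are hypothetical names introduced for the illustration.

```python
from collections import Counter

def keep_frequent(counter, min_count=1):
    """Return the tokens seen at least `min_count` times, in first-seen order."""
    return [token for token, count in counter.items() if count >= min_count]

toy_counts = Counter('the cat saw the dog near the cat'.split(' '))

all_tokens = keep_frequent(toy_counts, min_count=1)
frequent_tokens = keep_frequent(toy_counts, min_count=2)

print(all_tokens)       # ['the', 'cat', 'saw', 'dog', 'near']
print(frequent_tokens)  # ['the', 'cat']
```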
In [0]:
#words

In [33]:
tags

['O',
 'B-geo',
 'B-gpe',
 'B-per',
 'I-geo',
 'B-org',
 'I-org',
 'B-tim',
 'B-art',
 'I-art',
 'I-per']

In [0]:
# Special tokens for the vocab

PAD_WORD = '<pad>'
PAD_TAG = 'O'
UNK_WORD = 'UNK'


In [0]:
# Add pad tokens
if PAD_WORD not in words: words.append(PAD_WORD)
if PAD_TAG not in tags: tags.append(PAD_TAG)

# add word for unknown words 
words.append(UNK_WORD)

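The reason for adding these two tokens becomes clearer when sentences are turned into index lists. The sketch below assumes the vocab will later be used this way (the training notebook is not shown here): unseen words fall back to the `UNK` index and shorter sentences are padded with the `<pad>` index. The toy vocab and the `encode` helper are made up for the example.

```python
PAD_WORD = '<pad>'
UNK_WORD = 'UNK'

# Toy vocab: three real words followed by the two special tokens
toy_words = ['John', 'lives', 'in', PAD_WORD, UNK_WORD]
word_to_idx = {word: idx for idx, word in enumerate(toy_words)}

def encode(sentence, max_len):
    """Map tokens to indices, using UNK for unseen words and padding up to max_len."""
    ids = [word_to_idx.get(tok, word_to_idx[UNK_WORD]) for tok in sentence]
    return ids + [word_to_idx[PAD_WORD]] * (max_len - len(ids))

batch = [['John', 'lives', 'in', 'Paris'], ['John', 'lives']]
max_len = max(len(sent) for sent in batch)
encoded = [encode(sent, max_len) for sent in batch]
print(encoded)  # [[0, 1, 2, 4], [0, 1, 3, 3]]
```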
In [0]:
#words

In [0]:
#tags

### 5.4) Saving words and tags to text files

In [0]:
def save_to_file(file_path,vocab):

  with open(file_path,'w') as f:

    for tok in vocab:
      f.write(tok + '\n')

In [0]:
path = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/small/'

In [44]:
# For words.txt 

print("Making words.txt file \n")
save_to_file(os.path.join(path,'words.txt'),words)
print('done')

Making words.txt file 

done


In [43]:
# For tags.txt 

print("Making tags.txt file \n")
save_to_file(os.path.join(path,'tags.txt'),tags)
print('done')

Making tags.txt file 

done


### 5.5) Creating json file

In [0]:
import json

In [0]:
def save_to_json(file_path,d):
  """Write the dictionary `d` to `file_path` as indented JSON."""

  with open(file_path, 'w') as f:
    json.dump(d, f, indent=4)

In [0]:
# Save dataset properties in a json file

path = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/small/'

sizes = {
    'train_size': train_size,
    'dev_size': val_size,
    'test_size': test_size,
    'vocab_size': len(words),
    'number_of_tags': len(tags),
    'pad_word': PAD_WORD,
    'pad_tag': PAD_TAG,
    'unk_word': UNK_WORD
}

save_to_json(os.path.join(path, 'dataset_params.json'),sizes)

In [50]:
# Logging sizes

to_print = "\n".join("- {}: {}".format(k, v) for k, v in sizes.items())
print("Characteristics of the dataset:\n{}".format(to_print))

Characteristics of the dataset:
- train_size: 10
- dev_size: 10
- test_size: 10
- vocab_size: 369
- number_of_tags: 11
- pad_word: <pad>
- pad_tag: O
- unk_word: UNK

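A later step will presumably read these properties back. As a sketch under that assumption, the round trip through `json.dump` and `json.load` looks like this; it uses a temporary folder and a shortened dictionary rather than the real file.

```python
import json
import os
import tempfile

# Shortened stand-in for the `sizes` dictionary
params = {'train_size': 10, 'vocab_size': 369, 'pad_word': '<pad>'}

json_path = os.path.join(tempfile.mkdtemp(), 'dataset_params.json')
with open(json_path, 'w') as f:
    json.dump(params, f, indent=4)

with open(json_path) as f:
    loaded = json.load(f)

print(loaded == params)  # True
```

Integers and strings survive the round trip unchanged, so values such as `vocab_size` can be used directly after loading.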

Next in Series : [Model training](https://github.com/akash1309/Named-Entity-Recognition/blob/master/Model_Training.ipynb)