# About
Hi! Welcome back! This is the complete tutorial for [NER or Named Entity Recognition](https://www.youtube.com/watch?v=MmgjhvOSd-E) done in a few different ways. For those who do not know what it is, let me explain it clearly.
**It is whatever you want it to be**. Confused? I know. Named Entity Recognition is a language-specific task which extracts the `USEFUL` entities from a sentence. And how do you decide what is useful? Well! You decide.

Let us take an example:

**Eminem has the highest number of Top charts ever in Billboard's top-100 but Kanye has more than 25 Grammys**.

What do you propose is useful? Well! To me, (Eminem, person) and (Billboard, organisation), but for some it can be (Kanye, person) and (25, numerical). There are lots of commonly used entity types that can be pulled out of a sentence, but we can also extract our own. So in this tutorial, we will be training our own NER system using [SpaCy](https://spacy.io/), [Keras](https://keras.io/) and [BERT](https://www.youtube.com/watch?v=TQQlZhbC5ps).

For those who are wondering what `SpaCy` and `BERT` are: in simplest terms, `SpaCy` is an open source package dedicated to language-specific tasks and aimed at a broad audience.

On the other hand, **BERT** is a type of [Transformer](https://www.youtube.com/watch?v=EXNBy8G43MM) architecture proposed by researchers at Google. It is the base architecture for lots of new, very smart NLP models out there.

# Problem
We have a unique problem on our hands. Suppose we have thousands of legal documents, say rent agreements, and we have to get some specific details from each and every document like: Buyer, Seller, Date of Agreement, Amount etc. Doing this manually? Nah! You wouldn't be here.

What we can do is write a regular expression for each of the entities we want to extract. But hey!! Those of us who have used `re` know the pain of building all those expressions for hours, only to see them fail on many cases. And most of all, what if we have 10 names, 20 dates and 30 payments in a single document? How would you know which one is the actual Seller, Buyer and so on?
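
To make that pain concrete, here is a small sketch. The sample sentence and the pattern are made up for illustration and are not taken from the dataset. A date regex happily finds every date in the text, but it has no way of telling which one is the agreement start date:

```python
import re

# Hypothetical snippet of a rent agreement (not from the real data set)
sample_text = (
    "This agreement is made on 01.02.2020 between Mr. A and Mr. B. "
    "The tenancy starts on 15.02.2020 and ends on 14.01.2021."
)

# Matches dates written as dd.mm.yyyy only; '27th Oct., 2020' would need yet another pattern
date_pattern = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")

found_dates = date_pattern.findall(sample_text)
print(found_dates)  # three candidates, and the regex cannot say which one is the start date
```

Every new way of writing a date needs another pattern, and none of the patterns carries the meaning of the date. That is the gap a trained NER model is supposed to fill.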

# NER to the Rescue:
So basically, we have models which can learn in the specific way we want them to learn. The basic models, like the [Stanford NER Model](https://nlp.stanford.edu/software/CRF-NER.shtml) and SpaCy's NER model, give us very default findings: they can tell date, person, org etc. BUT! The problem of telling who is the buyer and who is the seller still remains. One benefit of these pretrained models is that they already know what a language, its grammar and other things about it look like (Neural Networks man! they just KNOW). So we just need to teach them how to find the specific things in a sentence. It is just like paying [Attention](https://ai.stackexchange.com/questions/21389/what-is-the-intuition-behind-the-attention-mechanism) and [Self Attention](https://www.youtube.com/watch?v=yGTUuEx3GkA) for learning new things.

# Solution
We'll be looking at a few solutions to do this kind of work. Because we have very scarce data, we won't be able to produce good results if we try to train a model from scratch, but we'll look at the different methodologies anyway. This tutorial is just to give you a brief idea about NER and how you can fine-tune an existing model or build your own.

1. Retraining SpaCy on top of existing
2. Retraining BERT 
3. Building a model from scratch using Keras

# Imports and install
Some of the imports are essential and some are for your own good

In [1]:
# !pip install --upgrade tensorflow_hub # ELMO Embeddings
# !pip install tensorflow-addons # Add-ons apart from tensorflow core modules
# !pip install git+https://www.github.com/keras-team/keras-contrib.git # CRF Layer Implementation
# !pip install tf2crf # for CRF layer (you'll know)

# !pip install python-docx # to load doc file
# !pip install textract # to extract text from doc
# !pip install num2words # change number to words: 10: ten
# !pip install inflect # numbers to words (used by num_to_text below)

In [2]:
import pandas as pd
import numpy as np
import os
import sys
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score,accuracy_score,precision_score,recall_score
import re
import docx
import textract

import plac
import random
import warnings
from pathlib import Path
import spacy
from spacy.scorer import Scorer
from spacy.util import minibatch, compounding
from spacy import displacy

import calendar
import datetime
from num2words import num2words
import inflect

from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow_addons as tfa
import tensorflow_hub as hub
import tensorflow.keras.backend as K
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import tensorflow as tf
from tensorflow.keras.layers import Layer, Input, Embedding, Dense, LSTM, Bidirectional, TimeDistributed
from tensorflow.keras.models import Model
from tf2crf import ModelWithCRFLoss, CRF, ModelWithCRFLossDSCLoss

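from gensim.models import KeyedVectors # needed by PreprocessEmbedding.load_pretrained_embeddings for .bin files (pip install gensim)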

# Defaults
Set default paths and seeds so that you get the same results every time you train.

In [25]:
INPUT_DIR = './data/Training_data/'
train_labels=pd.read_csv('./data/TrainingTestSet.csv')
val_labels=pd.read_csv('./data/ValidationSet.csv')
train_text_dir='./data/Training_data/'
val_text_dir='./data/Validation_data/'

SEED = 13
ver_37 = sys.version_info >= (3, 7) # True on Python 3.7+, where str.isascii() exists (also correct for 3.10 and later)

engine = inflect.engine()

batch_size = 32
epochs = 10

warnings.filterwarnings("ignore",category=UserWarning) # Suppress some noisy warnings

# Preprocessing - Tagging Helpers
What is more important than building the model itself? Preprocessing the data. A model with bad data is worse than no model at all, just like no money is still better than debt. Every AI task, and even a simple **DATA** related task, needs some preprocessing, and the preprocessing changes from task to task.

So in our task, we have to extract the related entities from the document. Always pay attention to your enemies and your data. If you look at the data, you'll see that the text uses different formats than the training CSV: the dates in the CSV are given as `day.month.year`, e.g. `08.09.2020`, but in the text the same kind of date can be `27th Oct., 2020` or `seventh of jan two thousand ten`. I cannot explain the whole process here, but the comments and docstrings of the methods should give you an idea of what I have tried to do and why.
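
As a rough sketch of the idea behind the helpers in the next cell (the names `ordinal_suffix` and `date_variants` are hypothetical and not part of the notebook), a CSV date can be expanded into a few spellings that might show up in a document:

```python
import calendar

def ordinal_suffix(day: int) -> str:
    """Return the English ordinal suffix for a day of the month (1 -> 'st', 2 -> 'nd', ...)."""
    if 11 <= day % 100 <= 13:  # 11th, 12th, 13th are irregular
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")

def date_variants(csv_date: str) -> list:
    """Turn '27.10.2020' into a few spellings that might appear in a document."""
    day, month, year = (int(part) for part in csv_date.split("."))
    month_name = calendar.month_name[month].lower()  # 'october'
    month_abbr = calendar.month_abbr[month].lower()  # 'oct'
    day_ordinal = f"{day}{ordinal_suffix(day)}"      # '27th'
    return [
        f"{day:02d} {month:02d} {year}",
        f"{day_ordinal} {month_abbr} {year}",
        f"{day_ordinal} {month_name} {year}",
    ]

print(date_variants("27.10.2020"))
```

The real documents contain more variation than this (spelled-out days and years, for example), which is why the cell below keeps several lookup dictionaries.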

In [4]:
num_tex =  {1:'one',2:'two',3:'three',4:'four',5:'five',6:'six',7:'seven',8:'eight',9:'nine',10:'ten',11:'eleven',12:'twelve',
             13:'thirteen',14:'fourteen',15:'fifteen',16:'sixteen',17:'seventeen',18:'eighteen',19:'nineteen',20:'twenty'}

al_day_alnum_day = {'first':'1st','second':'2nd','third':'3rd','fourth':'4th','fifth':'5th','sixth':'6th','seventh':'7th','eighth':'8th',
                    'ninth':'9th','tenth':'10th','eleventh':'11th','twelfth':'12th','thirteenth':'13th','fourteenth':'14th',
                    'fifteenth':'15th','sixteenth':'16th','seventeenth':'17th','eighteenth':'18th','nineteenth':'19th','twentieth':'20th',
                    'twenty first': '21st','twenty second': '22nd','twenty third': '23rd','twenty fourth': '24th', 'twenty fifth': '25th',
                    'twenty sixth': '26th','twenty seventh': '27th','twenty eighth': '28th','twenty ninth': '29th','thirtieth':'30th',
                    'thirty first':'31st'}

alnum_day_al_day =  dict((v,k) for k,v in al_day_alnum_day.items())

alnum_day_num_day = {'first':'01','second':'02','third':'03','fourth':'04','fifth':'05','sixth':'06','seventh':'07','eighth':'08',
                     'ninth':'09','tenth':'10','eleventh':'11','twelfth':'12','thirteenth':'13','fourteenth':'14','fifteenth':'15',
                     'sixteenth':'16','seventeenth':'17','eighteenth':'18','nineteenth':'19','twentieth':'20','twenty first': '21',
                     'twenty second': '22','twenty third': '23','twenty fourth': '24', 'twenty fifth': '25', 'twenty sixth': '26',
                     'twenty seventh': '27','twenty eighth': '28','twenty ninth': '29','thirtieth':'30','thirty first':'31'}
        
num_day_alnum_day = dict((v,k) for k,v in alnum_day_num_day.items())

month_abb = {}
month_to_num = {}
for i in range(1,13):
    month_abb[calendar.month_name[i].lower()] = calendar.month_abbr[i].lower()
    month_to_num[calendar.month_name[i].lower()] = i
    

def num_to_text(num:int,remove_sw:bool=True)->str:
    '''
    Change a number to words
    '''
    result = engine.number_to_words(num)
    if remove_sw:
        result = result.replace(' and ',' ')
    result = re.sub(r"[^a-z]",' ',result)
    return re.sub(r"\s+",' ',result)
    

def text2num(textnum:str, numwords:dict={})->int:
    '''
    Method to convert the TEXT format number to proper integer
    '''
    if not numwords:
      units = [
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen",
      ]

      tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

      scales = ["hundred", "thousand", "million", "billion", "trillion"]

      numwords["and"] = (1, 0)
      for idx, word in enumerate(units):    numwords[word] = (1, idx)
      for idx, word in enumerate(tens):     numwords[word] = (1, idx * 10)
      for idx, word in enumerate(scales):   numwords[word] = (10 ** (idx * 3 or 2), 0)

    current = result = 0
    for word in textnum.split():
        if word not in numwords:
          raise Exception("Illegal word: " + word)

        scale, increment = numwords[word]
        current = current * scale + increment
        if scale > 100:
            result += current
            current = 0

    return result + current



def is_ascii(s:str)->bool:
    '''
    Method to return if a given token is ASCII or not
    '''
    return all(ord(c) < 128 for c in s)


def is_date(s:str)->bool:
    '''
    Method to return if a given string is in date 12.08.2020 format or not
    '''
    parts = s.strip().split('.')
    return len(parts)==3 and all(p.strip().isdigit() for p in parts) # exactly day, month and year, all numeric

def clean(text):
    '''
    Normalise separators so that dates look like 12.08.2020 and commas are dropped
    '''
    text=text.replace('/','.')
    text=text.replace('\\','.')
    text=text.replace('. ','.')
    text=text.replace(', ','')
    text=text.replace(',','')
    text=text.replace('-','.')
    text=text.replace('. ','.') # '- ' has just become '. ', so collapse it again
    #text=text.lower()
    return text

    
PROBLEM = 'restore original index value  and length after preprocessing'
def preprocess(s:[bytes,str],lowercase:bool=False,remove_all_spec:bool=True)->str:
    if isinstance(s, (bytes)):
        s = s.decode('utf-8')
        
    if lowercase:
        s = s.lower()
    
    s = s.replace('thousands','thousand')
    s = s.replace('hundreds','hundred')
    
    if remove_all_spec:
        s = re.sub(r"[^a-zA-Z0-9]",' ',s)
    else:
        s = re.sub(r"[^/.,a-zA-Z0-9]",' ',s) # as there are  only 3 special signs used in the training csv
        
    s = re.sub(r"\s+"," ",s)
    
    dummy_list = []
    for token in s.split(' '):
        if (token.strip().isascii() if ver_37 else is_ascii(token.strip())):
            dummy_list.append(token.strip())
    return ' '.join(dummy_list)


def find_year_end(text:str,y:str)->int:
    '''
    Return the index in text where the year y ends, or -1 if the year can not be found
    '''
    if text.find(y)!=-1: # the year is written directly as digits, e.g. 2020
        return text.find(y)+len(y)
    else: # the year may be spelled out, e.g. two thousand twenty: check the full phrase first
        y_text = num_to_text(y)
        end = text.find(y_text)
        if end !=-1:
            return end+len(y_text)
        else:
            y_text_last = y_text.split(' ')[-1] # get last token
            end = text.find(y_text_last)
            if end!=-1:
                return end+len(y_text_last)
            else:
                return -1
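
The `PROBLEM` note in the cell above points at a real limitation: once `preprocess` has removed characters, the offsets found in the cleaned text no longer refer to the original document. One possible way around this is to record, for every cleaned character, the index it came from. This sketch assumes that only the "replace non-alphanumerics with a space and collapse whitespace" step is applied. `preprocess_with_offsets` and `to_original_span` are hypothetical helpers, not part of the notebook.

```python
def preprocess_with_offsets(s: str):
    """Replace non-alphanumeric characters with a single space and remember
    where each kept character sat in the original string."""
    cleaned_chars = []  # characters of the cleaned text
    origin = []         # origin[i] = index in `s` of cleaned_chars[i]
    for idx, ch in enumerate(s):
        if ch.isascii() and ch.isalnum():
            cleaned_chars.append(ch)
            origin.append(idx)
        elif cleaned_chars and cleaned_chars[-1] != " ":
            # collapse any run of special characters / whitespace into one space
            cleaned_chars.append(" ")
            origin.append(idx)
    return "".join(cleaned_chars), origin

def to_original_span(origin: list, start: int, end: int):
    """Map a [start, end) span in the cleaned text back to the original text."""
    return origin[start], origin[end - 1] + 1

original = "Rent:  Rs. 12,000/- per month"  # made-up sample text
cleaned, origin = preprocess_with_offsets(original)
start = cleaned.find("12 000")
orig_start, orig_end = to_original_span(origin, start, start + len("12 000"))
print(cleaned)
print(original[orig_start:orig_end])
```

With such a mapping, entities predicted on the cleaned text could be highlighted in the untouched document.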

# Input: Create Training Data Format
The train set is used for training the model. The model looks at the data points in the train set and, depending on the task, data type and cost function, adjusts its weight matrices to minimize the loss. Gradients flow back into the network according to the loss generated on the train data. The validation set is used to check the performance of the model after each epoch during training, but validation data does not change the weights, so there is no gradient flow during the validation phase. The test set is a real-world depiction of the data we will get during deployment. We do not use it until the model is fully trained; it is there to check how well the model does on the metric we defined.
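
A minimal sketch of such a three-way split, using only the standard library. The 80/10/10 ratio, the `three_way_split` name and the placeholder document ids are assumptions for illustration; this notebook already ships separate training and validation folders.

```python
import random

def three_way_split(items: list, val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 13):
    """Shuffle a copy of `items` and cut it into train / validation / test lists."""
    shuffled = items[:]                    # do not modify the caller's list
    random.Random(seed).shuffle(shuffled)  # fixed seed -> same split on every run
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

documents = [f"doc_{i}" for i in range(50)]  # placeholder document ids
train_docs, val_docs, test_docs = three_way_split(documents)
print(len(train_docs), len(val_docs), len(test_docs))  # 40 5 5
```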

The data format is different for different models. spaCy expects its own `SPACY DATA FORMAT`, and BERT-type models expect a different one.
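
As a rough illustration of what `SPACY DATA FORMAT` means here: the functions in the next cell return a list of `(text, {"entities": [...]})` tuples, where every entity is a `(start, end, label)` character span. The sentence, the label names and the `make_entity` helper in this sketch are invented for illustration.

```python
# One training example: the raw text plus character offsets of every labelled span
text = "This agreement is between John Doe and Jane Roe for 12000 rupees"

def make_entity(text: str, value: str, label: str):
    """Locate `value` in `text` and return a (start, end, label) tuple, or None if absent."""
    start = text.find(value)
    if start == -1:
        return None
    return (start, start + len(value), label)

# Label names are hypothetical placeholders
wanted = [("John Doe", "Party One"), ("Jane Roe", "Party Two"), ("12000", "Agreement Value")]
entities = [ent for ent in (make_entity(text, v, l) for v, l in wanted) if ent is not None]
example = (text, {"entities": entities})

for start, end, label in example[1]["entities"]:
    print(label, "->", text[start:end])
```

The offsets must slice out exactly the entity text, so a quick `text[start:end]` check like the loop above is a cheap sanity test before training.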

In [5]:
def change_data_format(doc_files_path:str='./data/Training_data/',csv_path:str='./data/TrainingTestSet.csv')->list:
    walk= list(os.walk(doc_files_path))
    dir_files = walk[0][2]

    df = pd.read_csv(csv_path)

    train_data = []
    ent_names = df.columns.tolist()[1:]


    count = 0
    for i,index in enumerate(df.index.tolist()):
        row_data = df.iloc[i,:].values.tolist()
        entities = {"entities":[]}

        for file in dir_files: # get each and every file name
            if file.split('.')[0] == row_data[0]:
                text = preprocess(textract.process(doc_files_path+file))

                for j,entry in enumerate(row_data[1:]):
                    if not pd.isna(entry): # if it is not Null value

                        if isinstance(entry,str) and not is_date(entry): # if it is NAME only party1 party2
                            entry = entry.strip()
                            start = text.find(entry)
                            if start!=-1: # if it a straight match
                                entities['entities'].append((start,start+len(entry),ent_names[j]))

                            else: # remove all the spaces and special characters from the csv data and check string again
                                entry = re.sub(r"[^a-zA-Z0-9]",' ',entry)
                                entry = re.sub(r"\s+",' ',entry) # because csv can have Mr. mrs. etc
                                start = text.find(entry)
                                if start!=-1:
                                    entities['entities'].append((start,start+len(entry),ent_names[j]))

                                else: # split the name. check first and last token and match
                                    e = entry.split(' ')
                                    if len(e)>1: # if there are at least 2 tokens
                                        start = text.find(e[0]) # starts where first token of name starts
                                        last = text.find(e[-1]) # position of the last token of the name
                                        end = last+len(e[-1]) # ends where last token of name starts + length of last token
                                        if start!=-1 and last!=-1 and end>start: # keep the span only if both tokens were found in order
                                            entities['entities'].append((start,end,ent_names[j]))



                        elif isinstance(entry,np.float64): # If it is Aggrement Value or Renewal Notice
                            entry = str(int(entry))
                            start = text.find(entry)
                            if start!= -1: # if found directly 12345 etc
                                entities['entities'].append((start,start+len(entry),ent_names[j]))
                            else: # if it is in words like two thousand three
                                n2w = num2words(int(entry)).replace(',','')
                                start = text.find(n2w)
                                if start!=-1:
                                    entities['entities'].append((start,start+len(n2w),ent_names[j]))


                        elif is_date(entry): #  it can either be 09, 9th or nineth
                            entry = re.sub(r"[^a-zA-Z0-9]",' ',entry)
                            entry = re.sub(r"\s+",' ',entry) # because in the training data, we have removed . so there is no chance of exact match
                            start = text.find(entry)  # if exactly 02 02 2020
                            if start!=-1:
                                entities['entities'].append((start,start+len(entry),ent_names[j]))
                            else: # the day may be written as 09, as a word (ninth) or as 9th
                                e = entry.split(' ')
                                d,m,y = e[0],e[1],e[2] # split day month year
                                day_word = num_day_alnum_day.get(d) # '09' -> word form of the day
                                candidates = [d, day_word, al_day_alnum_day.get(day_word)] # digits, word form, 9th style
                                end = find_year_end(text,y)
                                for cand in candidates: # try each spelling until one gives a plausible span
                                    if not cand:
                                        continue
                                    start = text.find(cand)
                                    if start!=-1 and end!=-1 and end>start+8 and end-start<50:
                                        entities['entities'].append((start,end,ent_names[j]))
                                        break

                train_data.append((text,entities))
                break
    return train_data



def spaCy_train_data(doc_files_path:str='./data/Training_data/',csv_path:str='./data/TrainingTestSet.csv')->list:
    data=[]
    train_labels = pd.read_csv(csv_path)
    data_index=list(train_labels.columns)
    count=0
    new_data=[]
    for i in os.listdir(doc_files_path):
        doc = docx.Document(doc_files_path+i)  # Creating word reader object.
        paragraph=doc.paragraphs
        st=""
        for j in paragraph:
            st+=clean(str(j.text))+" "
        data.append(st)
        temp=train_labels.loc[train_labels['File Name'] == i[:-9]]

        entities=[]
        for col in data_index[1:]:
            query=str(tuple(temp[col])[0])

            if(col=='Aggrement Start Date' or col=='Aggrement End Date'):
                if(st.find(query)!=-1):
                    entities.append([st.find(query),st.find(query)+len(query),col])
                    continue



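            value = tuple(temp[col])[0]
            if isinstance(value, float) and value>=0: # numeric columns are read as floats: 12000.0 -> '12000'
                query=str(int(value))
            start = st.find(query)
            if start==-1: # fall back to a case-insensitive match
                start = st.lower().find(query.lower())
            if start!=-1:
                entities.append([start,start+len(query),col])
            else:
                count+=1 # entity not found in the document text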
        new_data.append((st,{'entities':entities}))
    return new_data
   

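# SpaCy Model Training
Training for just 10 epochs. Increase to 100 to see the loss going ALMOST to 0. The cell uses the spaCy v2 training API (`nlp.create_pipe`, `nlp.begin_training`); spaCy v3 changed these calls.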

In [26]:
TRAIN_DATA = spaCy_train_data()

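nlp = spacy.blank('en') # start from a blank English pipeline
if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner') # instantiate the NER pipeline component
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe('ner') # reuse the existing component so that `ner` is always defined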

    
for _, labels in TRAIN_DATA:
     for ent in labels.get('entities'): # get the entities 
        ner.add_label(ent[2]) # add label

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner'] # Get all the pipe names Except NER
count=0
loss_val=[]
with nlp.disable_pipes(*other_pipes):  # keep only the NER model
    optimizer = nlp.begin_training()
    for i in range(epochs):
        random.shuffle(TRAIN_DATA) # shuffle the examples every epoch
        losses = {}
        
        for text, annotations in TRAIN_DATA:
            try:                
                nlp.update(
                    [text], 
                    [annotations], 
                    drop=0.35, 
                    sgd=optimizer,
                    losses=losses)
            except Exception: # skip documents that spaCy can not process and count them
                count=count+1
        loss_val.append(list(losses.values())[0])
        print("End of Epoch :",i,"Training Loss : ",list(losses.values())[0])

End of Epoch : 0 Training Loss :  6572.982368014429
End of Epoch : 1 Training Loss :  45.579438706202104
End of Epoch : 2 Training Loss :  160.9364908095085
End of Epoch : 3 Training Loss :  328.88992184182916
End of Epoch : 4 Training Loss :  226.82325800525746
End of Epoch : 5 Training Loss :  36.75989285803055
End of Epoch : 6 Training Loss :  28.35338048104185
End of Epoch : 7 Training Loss :  30.24545299475947
End of Epoch : 8 Training Loss :  103.38894767309104
End of Epoch : 9 Training Loss :  40.662100484302925
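
The `try`/`except` in the training loop silently skips documents that spaCy rejects. A likely cause is overlapping or invalid entity spans, which can happen when two CSV values match the same stretch of text or when a `find` returns -1. This is an assumption, since the notebook does not show the error. A small sketch of a filter, with the hypothetical name `drop_overlaps`, that could be applied to each document's entity list before training:

```python
def drop_overlaps(entities: list) -> list:
    """Keep the longest spans first and discard any span that overlaps one already kept."""
    kept = []
    # Longer spans win; ties are broken by the earlier start offset
    for start, end, label in sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0])):
        if start < 0 or end <= start:
            continue  # skip spans produced by a failed str.find (start == -1)
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# Made-up spans: two overlap, one comes from a failed lookup
spans = [(10, 18, "Party One"), (10, 14, "Party Two"), (-1, 4, "Agreement Value"), (30, 35, "Agreement Value")]
print(drop_overlaps(spans))
```

Counting how many spans are dropped per document also tells you how much of the labelled data actually reaches the model.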


# Different Methods
1. [Bi-LSTM WITHOUT CRF](https://www.depends-on-the-definition.com/interpretable-named-entity-recognition/)
2. [LSTM + CNN without CRF](https://github.com/kamalkraj/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs/blob/master/nn.py)
3. [Bi-LSTM + CRF](https://confusedcoders.com/data-science/deep-learning/how-to-build-deep-neural-network-for-custom-ner-with-keras)
4. [Bi-LSTM WITHOUT CRF WITHOUT Embeddings](https://blog.codecentric.de/en/2020/11/take-control-of-named-entity-recognition-with-you-own-keras-model/)
5. [CRF based: sklearn and Keras](https://blog.codecentric.de/en/2020/11/take-control-of-named-entity-recognition-with-you-own-keras-model/)
6. [ELMO Embeddings without CRF](https://towardsdatascience.com/named-entity-recognition-ner-meeting-industrys-requirement-by-applying-state-of-the-art-deep-698d2b3b4ede)
7. [Using Fast Text without Gensim](https://www.kaggle.com/vsmolyakov/keras-cnn-with-fasttext-embeddings)
8. [GloVe, ParaGram, FastText](https://www.kaggle.com/theoviel/improve-your-score-with-some-text-preprocessing)
9. [GoogleNews, WikiNews Embeddings](https://www.kaggle.com/sudalairajkumar/a-look-at-different-embeddings)
10. [Gensim Word2Vec](https://www.kaggle.com/lystdo/lstm-with-word2vec-embeddings)
11. [Different Embeddings using Different libraries](https://towardsdatascience.com/word-embeddings-in-2020-review-with-code-examples-11eb39a1ee6d)
12. [Create and load own FastText, Word2Vec using Gensim](https://sturzamihai.com/how-to-use-pre-trained-word-vectors-with-keras/)

# NER from Scratch using Keras
We'll be using different ways to do this work in Keras. Please read all the comments and docstrings so that you know what each function is doing. I have compiled many different methods inside one function so that it looks like the Swiss knife of Keras-based NER. We'll be looking at:
1. Basic NER like a classification task
2. NER using CRF
3. Using above methods with Different Embeddings 
4. Preprocess data for NER
5. How to create and load different types of Embeddings in Keras

## Import Data

In [7]:
df = pd.read_csv('./data/ner_dataset.csv',encoding="ISO-8859-1") # the encoding param is a must, otherwise reading the file gives errors
print(df['Tag'].value_counts().index.tolist(),'\n')
df.head(3)

['O', 'B-geo', 'B-tim', 'B-org', 'I-per', 'B-per', 'I-org', 'B-gpe', 'I-geo', 'I-tim', 'B-art', 'B-eve', 'I-art', 'I-eve', 'B-nat', 'I-gpe', 'I-nat'] 



| Unnamed: 0 | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | NaN | of | IN | O |
| 2 | NaN | demonstrators | NNS | O |


`Sentence #` is the sentence that each word belongs to. Each sentence has different words. `Word` is an individual word which belongs to a particular sentence, like `Thousands, of` belong to `Sentence: 1`. `POS` tells us the part of speech of the word: whether it is a verb, a noun or something else. `Tag` gives us information about what kind of entity the word is: is it a `Date`, `Place`, `Name` etc. `O` means `Outside`: the word is not part of any entity. `B-org` means Beginning of an Organisation (e.g. United) and `I-org` means Inside of an Organisation (Health and Group). So `United Health Group` can be seen as `B-org,I-org,I-org`.
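
To see how the `B-`/`I-`/`O` tags turn back into whole entities, here is a minimal sketch. The helper name `bio_to_spans` and the example sentence are made up for illustration.

```python
def bio_to_spans(words: list, tags: list) -> list:
    """Group B-/I- tagged words into (entity_text, entity_type) pairs."""
    spans, current_words, current_type = [], [], None
    for word, tag in zip(words, tags):
        # An I- tag whose type differs from the open entity is treated as a new entity
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type)
        if tag == "O" or starts_new:
            if current_words:  # close the entity that was open
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if tag != "O":
            if not current_words:
                current_type = tag[2:]
            current_words.append(word)
    if current_words:  # entity that runs until the end of the sentence
        spans.append((" ".join(current_words), current_type))
    return spans

words = ["United", "Health", "Group", "is", "based", "in", "Minnesota"]
tags = ["B-org", "I-org", "I-org", "O", "O", "O", "B-geo"]
print(bio_to_spans(words, tags))
```

The same function can be run on model predictions later to read off the entities a trained tagger has found.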

## Data Cleaning and Preprocessing (Changing Structure)

In most NLP tasks, we have to follow certain steps:
1. Get data in `['text 1 here', 'here comes text 2', 'these can be short or long', 'these can be whole documents too']` format.
2. Clean and tokenize each sentence. Tokenization means breaking sentences into tokens (mostly words) based on some criterion (mostly blank space).
3. Create a vocabulary from the sentences. You can make the vocab from all words or you can choose a maximum number of words to avoid huge dimensions.
4. Give each word in vocab a unique number because machines know numbers only.
5. Set a maximum length to use so that you can represent variable length sentences by using truncating or padding. If sentences are shorter than that number, add a padding token and if they are larger, truncate them.
6. After doing this, you'll have your data ready to be processed by NN and this data will be something like:

```
['hello my name is', 
'what?',
'what is slim shady anyway?']
```

will become:

```
[[1,2,3,4], # max_len = 4 
 [5,0,0,0], # padding = 'post'
 [5,4,6,7]] # 'what' has been given number 5. 'anyway' has been truncated because length > 4
```
Again, these numbers will be treated as `one_hot(number)` when they go into the `Embedding` layer, which simply looks up one vector per number. To know what embeddings are and how they can be used, [Check this Notebook out](my_NLP_Kernel)
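
The steps above can be sketched in plain Python. This is not how Keras does it internally: the `Tokenizer` numbers words by frequency and `pad_sequences` truncates from the front unless you pass `truncating='post'`. Here words are numbered in order of first appearance so that the output matches the toy example. `build_vocab` and `encode_and_pad` are hypothetical helper names.

```python
import re

def build_vocab(texts: list) -> dict:
    """Give every word a unique id starting at 1 (0 is kept for padding)."""
    vocab = {}
    for text in texts:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode_and_pad(texts: list, vocab: dict, max_len: int) -> list:
    """Map words to ids, cut sequences longer than max_len and right-pad shorter ones with 0."""
    padded = []
    for text in texts:
        ids = [vocab[w] for w in re.findall(r"[a-z0-9]+", text.lower())]
        ids = ids[:max_len]                              # truncating = 'post'
        padded.append(ids + [0] * (max_len - len(ids)))  # padding = 'post'
    return padded

texts = ["hello my name is", "what?", "what is slim shady anyway?"]
vocab = build_vocab(texts)
print(encode_and_pad(texts, vocab, max_len=4))  # [[1, 2, 3, 4], [5, 0, 0, 0], [5, 4, 6, 7]]
```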

## Cleaning & Changing Structure

In [8]:
df.ffill(inplace=True) # Fill all the NaN values with the previously seen value. DO NOT USE THIS WITHOUT LOOKING AT THE DATA. I did it for the Sentence # column
df.head(3)

| Unnamed: 0 | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | Sentence: 1 | of | IN | O |
| 2 | Sentence: 1 | demonstrators | NNS | O |


In [9]:
words = list(set(df["Word"].values)) # Get each UNIQUE word (Our vocab length basically)
words.append("ENDPAD") # Add an ENDPAD token as extra so that you can append it to an existing sentence. Will be used with max_len
n_words = len(words) 
print(f"We have {n_words} unique words in our data\n") # vocab size

tags = list(set(df["Tag"].values))
n_tags = len(tags)
print(f"We have {n_tags} unique Tags in our data\n") # Number of Unique Tags

We have 35179 unique words in our data

We have 17 unique Tags in our data



# Helper Classes

In [10]:
class SentenceGetter(object):
    '''
    A very famous and most used class to change the shape of data for NER tasks. Check any NEW post from
    https://www.depends-on-the-definition.com/tags/nlp
    '''
    
    def __init__(self, data):
        '''
        args:
            data: DataFrame which has the columns Sentence #, Word, POS and Tag
        '''
        self.n_sent = 1 # Number of sentence
        self.data = data # your dataframe
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())] # DataFrame Specific grouping
        
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
        
    
    def get_next(self): # Works like a generator You can't access anything else
        '''
        Method to get entities per Sentence. Results will be in [(word1,pos1,tag1),(word2,pos2,tag2)....]    
        '''
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError: # no sentence with this number left
            return None
        
    def __len__(self):
        '''
        Method to return how many unique Sentences we have in total.Use it with len(object) 
        '''
        return len(self.sentences)
    
    def __repr__(self):
        '''
        When you use print(object), this line will be printed. Just another "DUNDER" method
        '''
        return f"Object of type {type(self)}. Can't print the whole data. Make your own Function"
    
    
    
class PreprocessEmbedding():
    '''
    Default Preprocessing method for most of the NLP task for Text.
    Support any .vec, .txt or .bin files from popular Embeddings like GloVe, Word2Vec, Fasttext, Paragram 
    '''
    # check: https://stackabuse.com/pythons-classmethod-and-staticmethod-explained/ for @staticmethod use
    
    @staticmethod # You don't have to create an object of this class in order to access this method. PreprocessEmbedding.preprocess_data()
    def preprocess_data(data:list,max_length:int):
        '''
        Method to parse, tokenize, build vocab and padding the text data
        args:
            data: List of all the texts as: ['this is text 1','this is text 2 of different length']
            max_length: maximum length to consider for an individual text entry in data
        out:
            vocab size, fitted tokenizer object, encoded input text and padded input text
        '''
        tokenizer = Tokenizer() # set num_words, oov_token arguments depending on your usecase
        tokenizer.fit_on_texts(data)
        vocab_size = len(tokenizer.word_index) + 1 # extra 1 because Tokenizer ids start at 1 and index 0 is reserved for padding; that row stays all 0s when loading pre trained embeddings
        encoded_docs = tokenizer.texts_to_sequences(data)
        padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post') 
        # padding = 'post' means that append 0s at the end if sentence length is less than max_length. check: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences
        return vocab_size,tokenizer,encoded_docs,padded_docs
    
    
    @staticmethod
    def load_pretrained_embeddings(fitted_tokenizer,vocab_size:int,emb_file:str,emb_dim:int=300,):
        '''
        All 300D Embeddings: https://www.kaggle.com/reppy4620/embeddings 
        '''
        if '.bin' in emb_file: # if it is binary file, it is not embeddings but the MODEL itself. It could be fasttext or word2vec model
            model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
            # emb_file = emb_file.replace('.bin','.txt') # general purpose path
            emb_file = './new_emb_file.txt' # for Kaggle because you have to save data in out dir only
            model.save_word2vec_format(emb_file, binary=False) # only needed (and only possible) when a .bin model was loaded
    
        # open and read the contents of the .txt / .vec file (.vec is same as .txt file)
        embeddings_index = dict() 
        with open(emb_file,encoding="utf8",errors='ignore') as f:
            for line in f: # each line is as: hello 0.9 0.3 0.5 0.01 0.001 ...
                values = line.rstrip().split(' ')
                if len(values) <= 2: # Word2Vec / FastText style files start with a "vocab_size dim" header line, GloVe style files do not
                    # check this link: https://radimrehurek.com/gensim/scripts/glove2word2vec.html
                    continue
                word = values[0] # first value is "hello"
                coefs = np.asarray(values[1:], dtype='float32') # everything else is vector of "hello"
                embeddings_index[word] = coefs

        # create the embedding matrix or Embedding weights based on your vocab
        embedding_matrix = np.zeros((vocab_size, emb_dim)) # build embeddings based on our vocab size
        for word, i in fitted_tokenizer.word_index.items(): # get each vocab token one by one
            embedding_vector = embeddings_index.get(word) # get from loaded embeddings
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector # if it is present, just replace the corresponding vectors

        return embedding_matrix
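
To see what the function above does without downloading anything, here is a minimal sketch on a made-up, 3-dimensional embedding file held in memory. The names `W2V_TEXT`, `read_w2v_text` and `build_matrix` are hypothetical helpers for this illustration only, and the plain dict stands in for `fitted_tokenizer.word_index`.

```python
import io
import numpy as np

# Hypothetical 3-dimensional embeddings in Word2Vec text format.
# The first line is the header "<number of words> <dimension>".
W2V_TEXT = """3 3
london 0.1 0.2 0.3
iraq 0.4 0.5 0.6
war 0.7 0.8 0.9
"""

def read_w2v_text(handle):
    """Parse Word2Vec-style text into {word: vector}, skipping the header line."""
    index = {}
    for i, line in enumerate(handle):
        if i == 0:  # header line, not a word vector
            continue
        values = line.rstrip().split(' ')
        index[values[0]] = np.asarray(values[1:], dtype='float32')
    return index

def build_matrix(word_index, embeddings_index, vocab_size, emb_dim):
    """Row i holds the vector of the word with id i; unknown words stay all-zero."""
    matrix = np.zeros((vocab_size, emb_dim))
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix

# Stand-in for fitted_tokenizer.word_index (ids start at 1, 0 is padding).
word_index = {"london": 1, "troops": 2, "war": 3}
emb_index = read_w2v_text(io.StringIO(W2V_TEXT))
emb_matrix = build_matrix(word_index, emb_index,
                          vocab_size=len(word_index) + 1, emb_dim=3)
print(emb_matrix)
```

Notice that "troops" is not in the toy embedding file, so its row stays at zeros, and "iraq" is not in our vocab, so it is simply ignored.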

In [11]:
getter = SentenceGetter(df) # It'll be a generator so you have to use get_next() method
print(getter.get_next(),'\n\n') # See a sample

[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')] 




In [17]:
max_len = 75 # Maximum Length of a Sentence

word2idx = {w: i + 1 for i, w in enumerate(words)} # Make a dict {word1:1, word2:2, ....}. Id 0 is kept free for padding
tag2idx = {t: i for i, t in enumerate(tags)} # Dict of {tag1:0, tag2:1, .....}
print(f'''London has id: {word2idx["London"]} and B-org has id: {tag2idx['B-org']}\n''')

sentences = getter.sentences
X = [[word2idx[w[0]] for w in s] for s in sentences] # Words converted as Numbers
y = [[tag2idx[w[2]] for w in s] for s in sentences] # Tags converted as numbers


X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=0) # Pad with '0' at the end of each sentence
y_sparse = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"]) # Pad with the id of 'O' (the "outside", i.e. no-entity tag)

y_categorical = [to_categorical(i, num_classes=n_tags) for i in y_sparse] # Change it to One Hot Encoded Values (Categorical Cross Entropy)

print(f"Sample X: {X[0]}\n\nSample y SPARSE: {y[0]}") # y[0] is the tag-id list BEFORE padding; the padded version is y_sparse[0]

London has id: 15055 and B-org has id: 9

Sample X: [ 5697 29980 33667 34099 11026 13928 15055 26265  3264  7645 18779  5350
  4648 14165 10656  7645 29617 29980  3317  2598 32058 27817 22849 35153
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0]

Sample y SPARSE: [16, 16, 16, 16, 16, 16, 5, 16, 16, 16, 16, 16, 5, 16, 16, 16, 16, 16, 7, 16, 16, 16, 16, 16]
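
If the indexing and padding above feel like a lot at once, here is a minimal sketch of the same steps on a made-up two-sentence corpus. It uses plain NumPy stand-ins instead of Keras, so `pad_post` is a hypothetical helper (not `pad_sequences` itself) and `np.eye(...)[...]` plays the role of `to_categorical`.

```python
import numpy as np

# Hypothetical toy corpus in the same (word, POS, tag) format as above.
sentences = [
    [("London", "NNP", "B-geo"), ("protests", "NNS", "O")],
    [("British", "JJ", "B-gpe"), ("troops", "NNS", "O"), ("left", "VBD", "O")],
]
words = sorted({w for s in sentences for w, _, _ in s})
tags = sorted({t for s in sentences for _, _, t in s})

word2idx = {w: i + 1 for i, w in enumerate(words)}  # ids start at 1; 0 is reserved for padding
tag2idx = {t: i for i, t in enumerate(tags)}        # ids start at 0

def pad_post(seqs, maxlen, value):
    """NumPy stand-in for pad_sequences(padding='post').

    Sequences longer than maxlen keep their LAST maxlen items,
    which mirrors the Keras default truncating='pre'.
    """
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        trimmed = seq[-maxlen:]
        out[row, :len(trimmed)] = trimmed
    return out

max_len = 4
X = pad_post([[word2idx[w] for w, _, _ in s] for s in sentences], max_len, 0)
y_sparse = pad_post([[tag2idx[t] for _, _, t in s] for s in sentences],
                    max_len, tag2idx["O"])
y_onehot = np.eye(len(tags), dtype="float32")[y_sparse]  # same idea as to_categorical

print(X)
print(y_sparse)
print(y_onehot.shape)
```

So `X` is padded with 0, the tags are padded with the id of `O`, and the one-hot version just adds one extra axis of size `n_tags`.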


## Some Custom Layers
The model we are building below is (almost) everything for the price of one. Some of the pieces it needs are not available by default, so we have to build them ourselves. We'll be building [ELMO](https://stackoverflow.com/questions/53798582/is-elmo-a-word-embedding-or-a-sentence-embedding) and [Attention](https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39) layers. TensorFlow also ships ready-made attention layers in its [official API docs](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention), but we'll use a very simple one of our own.

<font color="red">**NOTE: ELMO embeddings do not work with this code yet. I have asked the community for help integrating them with `tf 2` and will update this section once they are usable.**</font>

In [13]:
# elmo_model = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True)
batch_size = 32

# Another Implementation@: https://sujayskumar.com/2018/10/02/elmo-embeddings-in-keras/
# About ELMO @: https://stackoverflow.com/questions/53798582/is-elmo-a-word-embedding-or-a-sentence-embedding

def ElmoEmbedding(x):
    return elmo_model(inputs={"tokens": tf.squeeze(tf.cast(x,tf.string)),
                              "sequence_len": tf.constant(batch_size*[max_len])},
                      signature="tokens",
                      as_dict=True)["elmo"]



class Attention(Layer):
    '''
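    Custom Attention Layer. It is a simple learned attention (loosely dot product or Luong Style): every timestep gets a score tanh(x.W + b), the scores are softmax-normalised over the time axis and then used to reweight x.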
    Code from: @ https://stackoverflow.com/questions/62948332/how-to-add-attention-layer-to-a-bi-lstm/62949137#62949137
    '''
    
    def __init__(self, return_sequences=True):
        self.return_sequences = return_sequences
        super(Attention,self).__init__()
        
    def build(self, input_shape):
        
        self.W=self.add_weight(name="att_weight", shape=(input_shape[-1],1),
                               initializer="normal")
        self.b=self.add_weight(name="att_bias", shape=(input_shape[1],1),
                               initializer="zeros")
        
        super(Attention,self).build(input_shape)
        
    def call(self, x):
        
        e = K.tanh(K.dot(x,self.W)+self.b)
        a = K.softmax(e, axis=1)
        output = x*a
        
        if self.return_sequences:
            return output
        
        return K.sum(output, axis=1)
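
To make the shapes inside `call()` easier to follow, here is a minimal NumPy sketch of the same computation. It is not the Keras layer itself: `attention_forward` is a hypothetical helper, and the weights are random instead of learned.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_forward(x, W, b, return_sequences=True):
    """x: (batch, timesteps, features), W: (features, 1), b: (timesteps, 1)."""
    e = np.tanh(x @ W + b)   # one score per timestep -> (batch, timesteps, 1)
    a = softmax(e, axis=1)   # weights sum to 1 over the timestep axis
    output = x * a           # broadcast the weights over the feature axis
    return (output if return_sequences else output.sum(axis=1)), a

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 4))           # hypothetical batch: 2 sentences, 5 timesteps, 4 features
W = rng.normal(scale=0.05, size=(4, 1))  # stands in for "att_weight"
b = np.zeros((5, 1))                     # stands in for "att_bias"

seq_out, weights = attention_forward(x, W, b, return_sequences=True)
pooled, _ = attention_forward(x, W, b, return_sequences=False)
print(seq_out.shape, pooled.shape, weights.sum(axis=1).ravel())
```

With `return_sequences=True` the output keeps one vector per timestep, which is what a per-token tagger like ours needs. With `return_sequences=False` the timesteps are summed into one vector per sentence, which suits sentence-level classification instead.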

# Build Model

In [14]:
def build_model(vocab_size:int,n_tags:int,max_len:int,emb_dim:int=300,emb_weights=False,use_attention:bool=True,
                use_elmo:bool=False,use_crf:bool=False,train_embedding:bool=False):
    '''
    Build and return a Keras model based on the given inputs
    args:
        vocab_size: No of unique words + 1 (index 0 is reserved for padding)
        n_tags: No of unique 'y' tags present in the data
        max_len: Maximum length of sentence to use
        emb_dim: Size of embedding dimension
        emb_weights: pretrained Embedding Weights (np.ndarray) for Embedding Layer. if False, use default
        use_attention: Whether to use the Attention Layer or not
        use_elmo: Whether to use Elmo Embeddings
        use_crf: Whether to use the CRF layer
        train_embedding: Whether to train the pretrained embedding weights
    out:
        Keras model. See comments for each type of loss function and metric to use
    '''
    assert not(isinstance(emb_weights,np.ndarray) and  use_elmo), "Either provide embedding weights or use ELMO. Not both"
    
    inputs = Input(shape=(max_len,))
    
    if isinstance(emb_weights,np.ndarray):
        x = Embedding(trainable=train_embedding,input_dim=vocab_size, output_dim=emb_dim, input_length=max_len, mask_zero=True, embeddings_initializer=keras.initializers.Constant(emb_weights))(inputs)
    elif use_elmo:
        x = Lambda(ElmoEmbedding, output_shape=(max_len, 1024))(inputs) # Lambda will create a layer based on the function defined  
    else: # use default Embeddings
        x = Embedding(input_dim=vocab_size, output_dim=emb_dim, input_length=max_len, mask_zero=True,)(inputs) # n_words = vocab_size
    
    x = Bidirectional(LSTM(units=50, return_sequences=True,recurrent_dropout=0.1))(x)
    
    if use_attention:
        x = Attention(return_sequences=True)(x) # receives 3-D and outputs 3-D, because the Dense layer below needs one vector per timestep to predict one tag per word
    
    if use_crf: 
        output = Dense(n_tags, activation=None)(x)
        crf = CRF(dtype='float32') # it does not take any n_tags. See the documentation.
        output = crf(output)
        base_model = Model(inputs, output)
        model = ModelWithCRFLoss(base_model) # It has Loss and Metric already. Change the model if you want to use DiceLoss.
        return model # Do not use any metric or loss with this model.compile(). Just use Optimizer and run training

    else:
        out = Dense(n_tags, activation="softmax")(x) # Wrap it in TimeDistributed(Dense()) if you have old versions
        model = Model(inputs, out)
        return model # use "categorical_crossentropy" with one-hot y or "sparse_categorical_crossentropy" with integer y, plus "accuracy"

## Train Model
### Dense + Categorical Cross Entropy + Attention + `NO CRF`

In [22]:
epochs = 1 # just for testing

# Use this for simple use case. It is on CATEGORICAL Y
X_tr, X_te, y_tr, y_te = train_test_split(X, y_categorical, test_size=0.2,random_state=SEED) 

model = build_model(vocab_size=n_words+1,n_tags=n_tags,max_len=max_len,use_attention=True,use_crf=False)
model.compile('adam','categorical_crossentropy',metrics=['accuracy'])

history = model.fit(X_tr, np.array(y_tr), batch_size=batch_size, epochs=epochs, validation_split=0.2, verbose=1)
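
Wondering why the one-hot `y_categorical` goes with `categorical_crossentropy` here? The minimal NumPy sketch below (not Keras's own implementation, and with made-up random predictions) suggests that the one-hot version and the integer version of the loss give the same number. So the choice is mostly about which label format you have at hand.

```python
import numpy as np

def categorical_crossentropy(y_onehot, probs, eps=1e-7):
    """Mean cross-entropy for one-hot targets of shape (..., n_tags)."""
    probs = np.clip(probs, eps, 1.0)
    return float(np.mean(-np.sum(y_onehot * np.log(probs), axis=-1)))

def sparse_categorical_crossentropy(y_ids, probs, eps=1e-7):
    """Mean cross-entropy for integer targets of shape (...,)."""
    probs = np.clip(probs, eps, 1.0)
    picked = np.take_along_axis(probs, y_ids[..., None], axis=-1)[..., 0]
    return float(np.mean(-np.log(picked)))

rng = np.random.default_rng(0)
n_tags = 3
y_ids = rng.integers(0, n_tags, size=(2, 4))   # (batch, timesteps) integer tags
logits = rng.normal(size=(2, 4, n_tags))       # hypothetical model outputs
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax

y_onehot = np.eye(n_tags)[y_ids]               # same idea as to_categorical
loss_onehot = categorical_crossentropy(y_onehot, probs)
loss_sparse = sparse_categorical_crossentropy(y_ids, probs)
print(loss_onehot, loss_sparse)
```

The integer labels take far less memory than the one-hot ones, which starts to matter when you have many tags and long sentences.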



### CRF + Attention + CRF Loss

In [24]:
# CRF type Y labels. It is on SPARSE Y
X_tr, X_te, y_tr, y_te = train_test_split(X, y_sparse, test_size=0.2,random_state=SEED) # Use this for use_crf = True

model = build_model(vocab_size=n_words+1,n_tags=n_tags,max_len=max_len,use_attention=True,use_crf=True)
model.compile('adam') # No loss or metric here: ModelWithCRFLoss already has them. See the `if use_crf:` block in build_model()

history = model.fit(X_tr, np.array(y_tr), batch_size=batch_size, epochs=epochs, validation_split=0.2, verbose=1)
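
So what does the CRF buy us over a plain softmax? Roughly, it scores whole tag sequences using tag-to-tag transition scores, instead of picking the best tag for each word on its own. Below is a minimal sketch of Viterbi decoding in NumPy to illustrate the idea. It is not the code of the CRF layer used above: the transition matrix here is hand-set and hypothetical, while a real CRF learns it during training.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the best tag path and its score.

    emissions: (timesteps, n_tags) per-word tag scores.
    transitions[i, j]: score of moving from tag i to tag j.
    """
    n_steps, n_tags = emissions.shape
    score = emissions[0].copy()
    backpointers = np.zeros((n_steps, n_tags), dtype=int)
    for t in range(1, n_steps):
        candidates = score[:, None] + transitions   # rows: previous tag, cols: next tag
        backpointers[t] = candidates.argmax(axis=0)
        score = candidates.max(axis=0) + emissions[t]
    best = [int(score.argmax())]
    for t in range(n_steps - 1, 0, -1):             # walk the backpointers in reverse
        best.append(int(backpointers[t, best[-1]]))
    return best[::-1], float(score.max())

tag_names = ["O", "B-geo", "I-geo"]
transitions = np.zeros((3, 3))
transitions[0, 2] = -10.0                           # heavily penalise O -> I-geo

emissions = np.array([[2.0, 1.5, 0.0],              # word 1: slightly prefers O
                      [0.0, 0.0, 1.0],              # word 2: prefers I-geo
                      [1.0, 0.0, 0.0]])             # word 3: prefers O

greedy = emissions.argmax(axis=1).tolist()          # per-word argmax, ignores transitions
path, best_score = viterbi_decode(emissions, transitions)
print([tag_names[i] for i in greedy])
print([tag_names[i] for i in path], best_score)
```

The per-word argmax picks `O, I-geo, O`, which starts an entity with an `I-` tag. Viterbi trades a little emission score on the first word to get the consistent `B-geo, I-geo, O`.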

