# Named Entity Recognition (NER) using Optical Character Recognition (OCR) Data (Draft)



OCR data provides (somewhat) structured information about the words in a digitized document and their positions.

This is a collection of ideas for extracting features from that data and classifying each word as an entity. As a running example, suppose we want to suggest the following named entities from the OCR output of an invoice:

- (none)
- VAT number
- Company name
- Company address
- Total value

Sample input:

- words
- word position (X, Y), normalized to the range [0, 1] relative to the image

Desired output:

For each word, we would like to predict its category.
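
As a concrete illustration, the input and output described above could be represented as follows. This is a minimal sketch: the `OcrWord` class, the label names and the sample invoice fragment are made up for illustration, and it assumes Y grows from the top of the page to the bottom.

```python
from dataclasses import dataclass
from typing import List

# Label set taken from the entity list above; "none" marks words outside any entity.
LABELS = ["none", "vat_number", "company_name", "company_address", "total_value"]


@dataclass
class OcrWord:
    text: str            # recognized token
    x: float             # horizontal position in [0, 1], 0 = left edge
    y: float             # vertical position in [0, 1], 0 = top edge (assumed)
    label: str = "none"  # desired output for this word


# A tiny, invented invoice fragment
document: List[OcrWord] = [
    OcrWord("ACME", 0.10, 0.05, "company_name"),
    OcrWord("Srl", 0.18, 0.05, "company_name"),
    OcrWord("Total", 0.70, 0.92),
    OcrWord("123.45", 0.85, 0.92, "total_value"),
    OcrWord("EUR", 0.93, 0.92),
]

# Integer class per word, which is what a classifier would be trained to predict
y_true = [LABELS.index(word.label) for word in document]
```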

## Classic approach

A classic approach, suited to a small number of data points, is to hand-engineer features for each word.

First, apply stemming or lemmatization (and possibly lowercasing): https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

Then add hand-engineered features (this is where the creativity, and most of the time, goes):

- One-hot encoding of the word;
- Binary: Does it belong to the list ['eur', '€', 'euro', 'euros']?
- Binary: Does it belong to the list ['total', 'tot', 'totale', 'tutto']?
- Binary: Is this word among the 10% of words closest to the right edge of the page?
- Binary: Is this word among the 10% of words closest to the bottom edge of the page?
- Numeric: How many digits does this word contain? (or what percentage of its characters are digits)
- etc.
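
The list above could be turned into a feature function along these lines. This is a sketch, not a definitive implementation: the function and feature names are hypothetical, the one-hot encoding is left out, and the two position checks are approximated with a fixed threshold (`x >= 0.9`, `y >= 0.9`) instead of ranking all words on the page.

```python
from typing import Dict

CURRENCY_WORDS = {"eur", "€", "euro", "euros"}
TOTAL_WORDS = {"total", "tot", "totale", "tutto"}


def word_features(text: str, x: float, y: float) -> Dict[str, float]:
    """Hand-engineered features for one OCR word at normalized position (x, y)."""
    token = text.lower()
    n_digits = sum(ch.isdigit() for ch in token)
    return {
        "is_currency": float(token in CURRENCY_WORDS),
        "is_total_keyword": float(token in TOTAL_WORDS),
        # Assumption: "closest to the edge" approximated by a fixed 10% margin
        "near_right_edge": float(x >= 0.9),
        "near_bottom_edge": float(y >= 0.9),
        "n_digits": float(n_digits),
        "digit_fraction": n_digits / len(token) if token else 0.0,
    }


features = word_features("123.45", x=0.85, y=0.92)
```

Returning a dictionary keeps the feature names visible while experimenting; the values can be converted to a fixed-order vector once the feature set settles.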

On top of that, append the features of the N closest neighbors (to the left and right, and possibly above and below) as extra features.
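
One way to gather neighbor features is sketched below. It assumes "closest" means smallest Euclidean distance between word positions, which ignores direction; a left/right or up/down variant would filter candidates by the sign of the offset first. The function name and the zero-padding for pages with too few words are illustrative choices.

```python
import math
from typing import List, Sequence, Tuple


def add_neighbor_features(
    positions: Sequence[Tuple[float, float]],
    features: Sequence[Sequence[float]],
    n_neighbors: int = 2,
) -> List[List[float]]:
    """Append to each word's features those of its n nearest words."""
    n_feats = len(features[0]) if features else 0
    augmented = []
    for i, (xi, yi) in enumerate(positions):
        # Indexes of all other words, sorted by distance to word i
        others = sorted(
            (j for j in range(len(positions)) if j != i),
            key=lambda j: math.hypot(positions[j][0] - xi, positions[j][1] - yi),
        )
        row = list(features[i])
        for rank in range(n_neighbors):
            if rank < len(others):
                row.extend(features[others[rank]])
            else:
                row.extend([0.0] * n_feats)  # zero-pad when neighbors run out
        augmented.append(row)
    return augmented


positions = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.9)]
features = [[1.0], [2.0], [3.0]]
augmented = add_neighbor_features(positions, features, n_neighbors=2)
```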

This yields `x_train` and `y_train`, where `x_train` holds the features of each word (including information from its neighbors) and `y_train` holds the corresponding NER labels.


## Iterating the solution

Once an initial NER estimate is obtained, it can itself be added to the feature set to try to improve the estimate. So, model 1 (say, a linear model such as logistic regression) is trained on x_train, y_train, and then model 2 (say, an SVM) is trained on x_train + (model 1 output), y_train.
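
A minimal sketch of this two-stage idea with scikit-learn follows. The data is synthetic (`make_classification` stands in for real word features), and using out-of-fold predictions from `cross_val_predict` for the first stage is an added assumption, chosen so that model 2 is not trained on predictions model 1 made for rows it had already seen.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in for per-word features and 5 entity classes
x_train, y_train = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_classes=5, random_state=0
)

# Model 1: out-of-fold class probabilities for every training row
model_1 = LogisticRegression(max_iter=1000)
stage_1_proba = cross_val_predict(
    model_1, x_train, y_train, cv=5, method="predict_proba"
)

# Model 2: original features plus the first-stage estimates
x_stacked = np.hstack([x_train, stage_1_proba])
model_2 = SVC().fit(x_stacked, y_train)

# For inference, model 1 is refit on the full training set
model_1.fit(x_train, y_train)


def predict(x_new: np.ndarray) -> np.ndarray:
    """Run both stages on unseen feature rows."""
    stacked = np.hstack([x_new, model_1.predict_proba(x_new)])
    return model_2.predict(stacked)


predictions = predict(x_train[:10])
```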

Scikit-learn's RandomizedSearchCV https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html is a convenient way to find the hyperparameters and model combinations that work best (search over linear models, SVMs, boosted trees).
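
A small example of such a search over an SVM is sketched below. The data is synthetic, and the parameter ranges, the number of iterations and the macro-F1 scoring are illustrative assumptions, not tuned recommendations.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for per-word features and 5 entity classes
x_train, y_train = make_classification(
    n_samples=200, n_features=12, n_informative=6, n_classes=5, random_state=0
)

search = RandomizedSearchCV(
    SVC(),
    param_distributions={
        "C": loguniform(1e-2, 1e2),      # sampled on a log scale
        "gamma": loguniform(1e-4, 1e0),  # ignored by the linear kernel
        "kernel": ["rbf", "linear"],
    },
    n_iter=10,           # number of random parameter settings tried
    cv=3,                # 3-fold cross-validation per setting
    scoring="f1_macro",  # assumption: macro F1 because NER classes are usually imbalanced
    random_state=0,
)
search.fit(x_train, y_train)

print(search.best_params_)
print(search.best_score_)
```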


## A note on generating the dataset

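If all you have are OCR outputs and the desired label values, a labeled string may not appear verbatim in the OCR text (for example, because of recognition errors). The closest OCR word can then be located approximately with the Levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance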
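
The matching step could look like the sketch below, which uses a plain dynamic-programming Levenshtein distance and picks the OCR word with the smallest distance to the target. The `best_match` helper and the sample strings are hypothetical; in practice a distance threshold would also be needed to reject poor matches.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def best_match(target: str, ocr_words: List[str]) -> Tuple[int, int]:
    """Return (index, distance) of the OCR word closest to target."""
    distances = [levenshtein(target.lower(), w.lower()) for w in ocr_words]
    index = min(range(len(ocr_words)), key=distances.__getitem__)
    return index, distances[index]


# The OCR engine misread the leading "I" as "1"
match = best_match("IT12345678901", ["Invoice", "1T12345678901", "Total"])
```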

# A Neural Approach

If there is enough data (this is always a big if, of course), it is reasonable to try to learn features from data. Following the same rationale as before, each word has its own features (and of course there is no problem including the hand-engineered features in a neural model as well). The X and Y coordinates of the word should be included along with the hand-engineered features.

One possibility is to use attention to fetch information from other words that helps classify the current word. One decision to be made is whether to use word-level or character-level embeddings.

For simplicity, this neural model assumes word embeddings computed after stemming/lemmatization, keeping the N most frequent words in the dictionary (and replacing the others with the `<UNK>` token).
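
Building such a dictionary could be sketched as follows. The helper names are hypothetical, and reserving index 0 for `<UNK>` is an assumption; here `dict_size` counts the `<UNK>` entry.

```python
from collections import Counter
from typing import Dict, Iterable, List

UNK = "<UNK>"


def build_vocab(documents: Iterable[List[str]], dict_size: int) -> Dict[str, int]:
    """Map the most frequent words to indexes; index 0 is reserved for <UNK>."""
    counts = Counter(word for doc in documents for word in doc)
    vocab = {UNK: 0}
    for word, _ in counts.most_common(dict_size - 1):
        vocab[word] = len(vocab)
    return vocab


def encode(words: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace each word by its index, falling back to <UNK> for unknown words."""
    return [vocab.get(word, vocab[UNK]) for word in words]


docs = [["total", "eur", "total"], ["total", "eur", "acme"]]
vocab = build_vocab(docs, dict_size=3)
encoded = encode(["total", "acme", "eur"], vocab)
```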

This is only a first suggestion and would need refinement before being applied to a real problem.

## Inputs

For each document, the neural net has the following inputs:

- Word indexes in dictionary (ints)
- Hand-engineered features for each word

The shape is `(batch_size, sequence_length)` for the word indexes (one index per word in the sequence) and `(batch_size, sequence_length, nFeats)` for the features (nFeats features per word in the sequence).
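
Since documents differ in length, a batch has to be padded to a common `sequence_length`. The NumPy sketch below shows one way to produce arrays of the two shapes; the `pad_batch` helper and zero-padding are assumptions. Note that the model in this draft applies no masking, so padded positions would take part in the attention; that is a simplification.

```python
import numpy as np


def pad_batch(word_ids, feats, n_feats):
    """Pad variable-length documents into fixed-shape batch arrays.

    word_ids: list of lists of ints, one list per document
    feats:    list of (len_i, n_feats) arrays or nested lists
    """
    batch_size = len(word_ids)
    seq_len = max(len(ids) for ids in word_ids)
    words = np.zeros((batch_size, seq_len), dtype="int32")
    features = np.zeros((batch_size, seq_len, n_feats), dtype="float32")
    for i, (ids, f) in enumerate(zip(word_ids, feats)):
        words[i, :len(ids)] = ids
        features[i, :len(ids), :] = f
    return words, features


words, features = pad_batch(
    word_ids=[[1, 2, 3], [4]],
    feats=[np.ones((3, 2)), np.ones((1, 2))],
    n_feats=2,
)
```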

## Outputs

For each word, the output is a softmax over the classes, so the output shape is `(batch_size, sequence_length, nClasses)`.
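
Turning that output into one label per word is an argmax over the last axis. A minimal sketch with NumPy follows; the label names, the `decode_predictions` helper and the `lengths` argument (true document lengths before padding) are assumptions for illustration.

```python
import numpy as np

LABELS = ["none", "vat_number", "company_name", "company_address", "total_value"]


def decode_predictions(probs, lengths):
    """Convert (batch, seq, n_classes) probabilities to label names per word."""
    class_ids = probs.argmax(axis=-1)  # (batch, seq)
    return [
        [LABELS[c] for c in class_ids[i, :n]]  # drop padded positions
        for i, n in enumerate(lengths)
    ]


# One document padded to 3 positions, of which only the first 2 are real words
probs = np.array([[
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.20, 0.20, 0.20, 0.20, 0.20],
]])
decoded = decode_predictions(probs, lengths=[2])
```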

## Model

The model is purely attentional. Its arguments are:

- nClasses: number of named entities to predict
- nFeats: number of hand-engineered features plus the X,Y coordinates
- projection_dim: projection dimension to use when computing attention (to learn proper metrics)
- interactions: number of stacked attention blocks. In the classic approach the output of one model can be fed to another; in a neural net, since gradients propagate through all stages, the interactions can simply be nested.
- dict_size: number of words kept in dictionary (including the `<UNK>` token)
- word_emb_dim: dimension of the embeddings
- n_attention_heads: number of attention heads per block, i.e. how many separate weighted summaries of the other words each word fetches

Note: no regularization has been added. This will depend on the real problem.
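
To make the attention step concrete, the NumPy sketch below mirrors what a single head computes in the Keras model that follows, for one document: a learned projection, pairwise scores, a softmax over the other words, and a weighted average. The weights here are random placeholders, not trained values, and the function names are illustrative.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_head(x, w_proj, b_proj):
    """One attention head for a single document.

    x:      (seq_len, d) word representations
    w_proj: (d, d) projection weights, b_proj: (d,) bias
    """
    proj = x @ w_proj + b_proj          # projected copy of every word
    scores = x @ proj.T                 # scores[i, j] = <x_i, proj_j>
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ x                  # weighted average of word vectors


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = attention_head(x, rng.normal(size=(8, 8)), np.zeros(8))
```

With all-zero projection weights every score is zero, the softmax is uniform, and each word simply receives the mean of all word vectors; training moves the weights away from that baseline.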

from keras import Model
from keras import layers as L

def build_OCR_NER_model(nClasses = 5, nFeats = 6, projection_dim = 64, interactions=3, 
                        dict_size=256, word_emb_dim = 16, n_attention_heads=4):
    inpWords = L.Input( (None, ) ) #these are the indexes of the words in the dictionary, including <UNK>
    inpFeats = L.Input( (None, nFeats) ) #X and Y coordinates are among these
    
    #first we learn word embeddings. Usual rule of thumb for the dimension: sqrt(len(dictionary))
    emb_words = L.Embedding(dict_size, word_emb_dim)(inpWords)
    
    #combine the features
    x = L.Concatenate()([emb_words,inpFeats])
    
    for interac in range(interactions):
        feat_list = [x] #we include the original features here hoping to learn their interactions as well
        
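        #what is important from other words? attention could be used to fetch this information
        x = L.TimeDistributed(L.Dense( projection_dim ))(x) #shared projection: (batch, seq, projection_dim)
        for k in range(n_attention_heads):
            proj = L.TimeDistributed(L.Dense( projection_dim ))(x) #per-head learned projection used to score word pairs
            attw = L.Dot([2,2])([x,proj]) #pairwise scores between all words: (batch, seq, seq)
            attw = L.Softmax()(attw) #normalize over the last axis so each word's weights sum to 1
            weighted_avg = L.Dot([2,1])([attw, x]) #attention-weighted average: (batch, seq, projection_dim)
            feat_list.append(weighted_avg)
        x = L.Concatenate()(feat_list) #block input plus one summary per head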
    
    #now do classification with a simple FF net
    x = L.TimeDistributed(L.Dense( projection_dim, activation='relu'))(x)
    x = L.TimeDistributed(L.Dense( projection_dim, activation='relu'))(x)
    x = L.TimeDistributed(L.Dense( nClasses, activation='softmax'))(x)
    
    m = Model(inputs = [inpWords, inpFeats], outputs=x)
    
    return m

model = build_OCR_NER_model()
model.summary(line_length=110)

Using TensorFlow backend.


______________________________________________________________________________________________________________
Layer (type)                        Output Shape            Param #      Connected to                         
input_1 (InputLayer)                (None, None)            0                                                 
______________________________________________________________________________________________________________
embedding_1 (Embedding)             (None, None, 16)        4096         input_1[0][0]                        
______________________________________________________________________________________________________________
input_2 (InputLayer)                (None, None, 6)         0                                                 
______________________________________________________________________________________________________________
concatenate_1 (Concatenate)         (None, None, 22)        0            embedding_1[0][0]                    
 