# The aim of this notebook is to clean and process the data needed to train a model on the named entity recognition task.

In this notebook, we prepare a dataset that can be used to train a named entity recognition (NER) model for the ICDAR 2019 Robust Reading Challenge. The challenge is divided into three tasks.

Task 1 : Scanned Receipt Text Localisation

Task 2 : Scanned Receipt OCR

`Task 3 : NER`  


More about task 3 from the website. 

Task 3 - Key Information Extraction from Scanned Receipts
Task Description

This task aims to extract texts of several key fields from given receipts and save the texts for each receipt image in a JSON file with the format shown in Figure 3. Participants will be asked to submit a zip file containing results for all test invoice images. 



## Sections 

### 1. Downloading the data and getting the training and testing files

The dataset is downloaded from this link: https://rrc.cvc.uab.es/?ch=13&com=downloads.
Note that the authors have not provided a separate dataset for task 3 as they did for task 1 and task 2. The task 3 dataset has to be built from the data provided for task 1 and task 2.

Once the data has been downloaded there will be four folders: two containing training data and two containing test data.

1. 0325updated.task1train(626p) : input training data.
2. 0325updated.task2train(626p) : ground truth data.
3. task3-test（347p) : ids of test data.
4. text.task1_2-test（361p) : test data input files.


The test data for task 3 is present in the task3-test（347p) folder, which contains 347 images. However, the input of an NER model is text, not an image, so the **real test data (text) is present in the text.task1_2-test（361p) folder.** The extra samples there have to be filtered out. Each file has a unique name, which can be used as a unique id.


Preparing train and test data

1. `Training data:`
   The training dataset is present in the 0325updated.task1train(626p) and 0325updated.task2train(626p) folders. The first thing to do is ***extract only the files common to the two folders, since there are many duplicate .txt files.***
2. `Testing data:`
    The testing dataset is present in the task3-test（347p) and text.task1_2-test（361p) folders. From the task3-test（347p) folder we only take the names of the 347 files in the test set; the text data comes from the text.task1_2-test（361p) folder.
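
The folder matching described above can be sketched as a set intersection on file names without their extensions. This is only a minimal sketch under stated assumptions: `file_stems` and `common_ids` are hypothetical helpers (not part of the notebook), and the demo runs on a temporary directory with toy file names rather than on the real dataset folders.

```python
import os
import tempfile

def file_stems(folder, ext):
    """Return the names of files in `folder` ending with `ext`, without the extension."""
    return {name[: -len(ext)] for name in os.listdir(folder) if name.endswith(ext)}

def common_ids(ids_folder, text_folder, ids_ext=".jpg", text_ext=".txt"):
    """Keep only the ids present in both folders (hypothetical helper)."""
    return sorted(file_stems(ids_folder, ids_ext) & file_stems(text_folder, text_ext))

# Toy demo: two ids in the "ids" folder, three text files in the "text" folder.
with tempfile.TemporaryDirectory() as root:
    ids_dir = os.path.join(root, "ids")
    txt_dir = os.path.join(root, "txt")
    os.makedirs(ids_dir)
    os.makedirs(txt_dir)
    for name in ["A.jpg", "B.jpg"]:
        open(os.path.join(ids_dir, name), "w").close()
    for name in ["A.txt", "B.txt", "C.txt"]:
        open(os.path.join(txt_dir, name), "w").close()
    demo_ids = common_ids(ids_dir, txt_dir)

print(demo_ids)  # ['A', 'B']
```

The extra text file `C.txt` is dropped because it has no matching id, which mirrors how the extra test samples are filtered out.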



### 2. Building our vocab
The ground truth data only has `uppercase letters`, hence our output will always be uppercase. On checking the data, only one file in the train set and 2 files in the test set contain lowercase letters. When processing the data we can upper-case these files. A few files also contain special characters [Â,·,£,¬]; we will treat them as out-of-vocab characters.


**Hence our vocab will consist of uppercase letters, punctuation, digits, plus space.**



### 3. Preparing model input and output

This step is very important and must be performed correctly as the model's performance depends on the data it trains on. 


**Input data.** The input data is present in the 0325updated.task1train(626p) folder in the form of .txt files. The text is mixed with coordinates, as these files are the ground truths for the task 1 localization problem. We need to filter out these coordinates. This is done in the `getText()` function.

**Target data.**
In the ground truth .txt files we can see JSON (or Python dict-like) structures containing 4 fields [company, address, date, total]. These are the 4 classes we are trying to predict. We also need an extra "other" class (encoded as 0) for characters that belong to none of them.



The output of our model will be numeric, hence we need to encode these classes.

`labels dict : {'company':1,'address':2,'date':3,'total':4}`

`inv_labels_dict :  {1:'company',2:'address',3:'date',4:'total'}`




Finally, the most important step is creating the labels. The ground truth provided to us cannot be used directly for named entity recognition. The model has to predict an entity for each token (word) or, in our case, each character.


***I have decided to go for character-level classification because of the vocab. With character-level classification the model can handle new text better than with word-level classification, and it becomes more robust to out-of-vocab words, as the vocab is limited to the 26 uppercase letters, punctuation, digits and space.***


The labels are prepared by the `get_labels()` function.


### 4. Correction
This is the most important part of this notebook, and it took me almost a day.
For the `get_labels()` function to work correctly, the field values from the ground truth files must appear in the input text exactly, word for word; otherwise our model would learn the wrong things. There are some annotation mistakes in the dataset, which is also mentioned on the ICDAR website. To make the correction process easier, the corrections were made separately per field. Before the corrections, let's look at the mistakes:

1. Total: 1 error. One ground truth file has 7838.80 instead of 7,838.80. [click here](#total)
2. Date: 4 errors. The ground truth had the wrong date format; more details can be seen in the date correction section. [click here](#date)
3. Company: 18 errors. All of these were corrected manually; they were mostly spelling errors and extra spaces. [click here](#company)
4. Address: 134 errors. To solve this, a simple correction algorithm was developed. [More details can be found here.](#address)


### 5. Creating pickle files

In the end, all the data was saved as pickle files in the data `dict folder`.
**Structure** :  `{'file_name' : ('textdata',labels), 'file_name2' : ('textdata2',labels2), ...... 'file_name_n' : ('textdata_N',labels_N)}`

The file names are used as dict keys as they are unique.


File Names: 
1. training data : train_corrected_data_dict. 
2. testing data : test_data_dict
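
As an illustration of the structure above, saving and reloading such a dictionary with `pickle` could look like the following. It is a sketch only: `save_dict` and `load_dict` are hypothetical helper names, the `.pkl` extension is an assumption, and the single toy entry stands in for the real data.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real dictionary: {file_name: (text, labels)}
toy_data = {
    "X0001": ("ABC 1.00", [1, 1, 1, 0, 4, 4, 4, 4]),
}

def save_dict(data, path):
    """Write the dictionary to `path` in binary pickle format."""
    with open(path, "wb") as f:
        pickle.dump(data, f)

def load_dict(path):
    """Read a dictionary back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    # The file name follows the list above; the extension is assumed.
    path = os.path.join(tmp, "train_corrected_data_dict.pkl")
    save_dict(toy_data, path)
    restored = load_dict(path)

print(restored == toy_data)  # True
```

Note that pickle stores the tuple and list types as they are, so each value comes back as a `(text, labels)` tuple.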







# 1. Getting the Training and Testing files

In [8]:
import torch 
import os 
import pandas as pd 
import numpy as np
import json
from tqdm import tqdm
import pickle as pk 
from torch.utils.data import Dataset
import random
from string import punctuation,digits,ascii_uppercase
import re



VOCAB = punctuation + digits + ascii_uppercase + " " 

labels = {'company':1,'address':2,'date':3,'total':4}
inv_labels = {1:'company',2:'address',3:'date',4:'total'}

In [2]:
VOCAB

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ '

In [9]:
!ls ICDAR_2019_task3_data

0325updated.task1train(626p)
0325updated.task2train(626p)
task3-test（347p)
text.task1_2-test（361p)


# Training Data

The input data comes from the .txt files in `'ICDAR_2019_task3_data/0325updated.task1train(626p)'` and the ground truths from `'ICDAR_2019_task3_data/0325updated.task2train(626p)'`.

But first we need to remove the redundant entries.

### Image files

In [10]:
## all jpg images. 
train_jpg = [file.split('.')[0] for file in os.listdir('ICDAR_2019_task3_data/0325updated.task1train(626p)') if '.jpg' in file]

In [11]:
len(train_jpg)

712

We have 712 images, but the problem description mentions 600 trainval images, hence some of them must be duplicates.

### Number of input text files 

In [12]:
## all input text files
train_raw_text = [file.split('.')[0] for file in os.listdir('ICDAR_2019_task3_data/0325updated.task1train(626p)') if '.txt' in file]

In [13]:
len(train_raw_text)

835

### No. of  ground truth .txt files.

In [14]:
train_gt_json = [file.split('.')[0] for file in os.listdir('ICDAR_2019_task3_data/0325updated.task2train(626p)') if '.txt' in file]
len(train_gt_json)

874

### Only Consider common files 

In [15]:
#training size 
len(set(train_raw_text).intersection(set(train_jpg)).intersection(set(train_gt_json)))

666

In [22]:
train_files = list(set(train_raw_text).intersection(set(train_jpg)).intersection(set(train_gt_json)))
len(train_files)

666

### Dealing With Duplicate Files


In [23]:
# drop files whose name contains '(' ; these are treated as duplicate copies
train_files = [f for f in train_files if '(' not in f]
len(train_files)

625

## 2 . Building the Vocab

In [20]:
#helper function to get text.
def getText(file_path):
    
    '''
    A method to clean the input text (getting rid of the coordinates used for localization in task 1)
    
    Args:
    file_path (str): The file location of the file eg 
  

    Returns:
    str: cleaned string.
    
    '''
    
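    # read all lines and make sure the file handle is closed afterwards
    with open(file_path) as f:
        raw_text = f.readlines()

    cleaned_text = ''
    for line in raw_text:
        # drop the 8 leading coordinates; keep everything after the 8th comma
        clean_text = line.split(',',maxsplit = 8)[-1]
        # replace tabs and newlines with a space so that lines stay separated
        cleaned_text += re.sub(r"[\t\n]"," ",clean_text)

    return cleaned_text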

## Defining the Vocab

In [7]:
from string import punctuation,digits,ascii_uppercase
VOCAB = punctuation + digits + ascii_uppercase + " " 
VOCAB

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ '

### Out Of Vocab Words


The below cell is going to read every file in the 0325updated.task1train(626p) folder and print any character which is not present in our vocab

In [231]:
# check for invalid files 
# containing chars not in our vocab 


counter = 0
root_folder =  'ICDAR_2019_task3_data/0325updated.task1train(626p)/'

for file in train_files:
   
    path = root_folder + file +'.txt'

    data = getText(path)    
    for char in data:
        
        if char not in VOCAB:
            print(path,char)
            counter += 1
            
            

print('files to remove' , counter)    

ICDAR_2019_task3_data/0325updated.task1train(626p)/X51006466778.txt Â
ICDAR_2019_task3_data/0325updated.task1train(626p)/X51006466778.txt ·
ICDAR_2019_task3_data/0325updated.task1train(626p)/X51006466055.txt r
ICDAR_2019_task3_data/0325updated.task1train(626p)/X51008142068.txt l
files to remove 4


#### We will need to upper-case the input and add an extra token for unknown characters
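
One possible way to combine both ideas is sketched below. It is not the notebook's own encoding: reserving index 0 for unknown characters and the names `UNK_INDEX`, `char_to_index` and `encode` are assumptions made for illustration.

```python
from string import punctuation, digits, ascii_uppercase

VOCAB = punctuation + digits + ascii_uppercase + " "

# Assumption: index 0 is reserved for unknown characters, vocab indices start at 1.
UNK_INDEX = 0
char_to_index = {ch: i + 1 for i, ch in enumerate(VOCAB)}

def encode(text):
    """Upper-case the text and map each character to an index (UNK_INDEX if out of vocab)."""
    return [char_to_index.get(ch, UNK_INDEX) for ch in text.upper()]

encoded = encode("Rm 5£")
print(encoded)
```

Here the lowercase `m` is upper-cased before lookup, and `£` falls back to the unknown index because it is not in `VOCAB`.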

### Are there any lowercase characters in the ground truths?
If the ground truths contain lowercase characters, we will need to include them in our VOCAB. The cell below scans all the files in the ICDAR_2019_task3_data/0325updated.task2train(626p) folder and checks whether any file contains OOV characters. Finally, it prints the files we might need to modify or remove.

In [285]:

root_folder =  'ICDAR_2019_task3_data/0325updated.task2train(626p)/'

counter = 0
for file in train_files:
   
    path = root_folder + file +'.txt'

    
    data = ''
    
    with open(path,'r') as f:
            
        data_dict = json.load(f)

        data  = ' '.join(list(data_dict.values()))

    for char in data:
        
        if char not in VOCAB:
            print(path,char)
            counter += 1
            
            

print('files to remove' , counter)    



files to remove 0


#### OK, only uppercase characters are present in the ground truths.

# 3. Preparing Model input and output.

## Preparing Input text data 

### 1. Loading a sample from train set

In [250]:
input_root_dir = 'ICDAR_2019_task3_data/0325updated.task1train(626p)/'
target_root_dir = 'ICDAR_2019_task3_data/0325updated.task2train(626p)/'

In [249]:
train_files[0]

'X51005447860'

In [251]:
path = input_root_dir + train_files[0] + ".txt"
path

'ICDAR_2019_task3_data/0325updated.task1train(626p)/X51005447860.txt'

In [253]:
open(path).readlines()[0:5]

['256,168,730,168,730,213,256,213,AEON CO. (M) BHD (126926-H)\n',
 '236,217,753,217,753,260,236,260,3RD FLR, AEON TAMAN MALURI SC\n',
 '272,271,699,271,699,311,272,311,JLN JEJAKA, TAMAN MALURI\n',
 '254,321,719,321,719,363,254,363,CHERAS, 55100 KUALA LUMPUR\n',
 '294,372,662,372,662,409,294,409,GST ID : 002017394688\n']

The input text files have 8 coordinates mixed with the text; we are only interested in the text. We filter out the coordinates using split(',',maxsplit = 8), then replace \t and \n with a space. This is exactly the same code as the `getText()` function.
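
To see why `maxsplit = 8` matters, the sketch below parses one of the sample lines shown above. `parse_line` is a hypothetical helper, and it assumes every line has exactly 8 integer coordinates followed by the transcript.

```python
def parse_line(line):
    """Split one annotation line into (coords, text).

    Assumes 8 integer coordinates followed by the transcript.
    """
    # maxsplit=8 stops splitting after the 8th comma, so commas
    # inside the transcript are left untouched.
    parts = line.rstrip("\n").split(",", maxsplit=8)
    coords = [int(p) for p in parts[:8]]
    text = parts[8]
    return coords, text

coords, text = parse_line("236,217,753,217,753,260,236,260,3RD FLR, AEON TAMAN MALURI SC\n")
print(coords)  # [236, 217, 753, 217, 753, 260, 236, 260]
print(text)    # 3RD FLR, AEON TAMAN MALURI SC
```

Without the `maxsplit` limit, the comma in `3RD FLR, AEON ...` would split the transcript into two pieces.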

In [254]:
raw_text = open(path).readlines()

cleaned_text = ''

for idx,i in enumerate(raw_text):
    
    clean_text = i.split(',',maxsplit = 8)[-1]
    
    cleaned_text += re.sub(r"[\t\n]"," ",clean_text)
    
print(cleaned_text)

AEON CO. (M) BHD (126926-H) 3RD FLR, AEON TAMAN MALURI SC JLN JEJAKA, TAMAN MALURI CHERAS, 55100 KUALA LUMPUR GST ID : 002017394688 SHOPPING HOURS MON-SUN:1000 HRS - 2200 HRS 1X 000004089728 5.90SR SAKUMASHIKIDROP 1X 000007572029 5.90SR BINDER CLIP -BL 1X 000006731878 5.90SR 150YEN CAR NECK SUB-TOTAL 17.70 TOTAL SALES INCL GST 17.70 TOTAL AFTER ADJ INCL GST 17.70 CASH 50.70 ITEM COUNT 3 CHANGE AMT 33.00 INVOICE NO: 2018021951320026157 GST SUMMARY AMOUNT TAX SR @ 6% 16.71 0.99 TOTAL 16.71 0.99 19/02/2018 16:45 5132 002 0026157 0293605 SHIVANESWARY A/P MANY DAISO SUNWAY VELOCITY TEL 1-300-80-AEON (2366) THANK YOU FOR YOUR PATRONAGE PLEASE COME AGAIN 


## Preparing Target Data

### Generating Ground Truths  

We will need to create the ground truth by
1. extracting data from the input text files ('ICDAR_2019_task3_data/0325updated.task1train(626p)'); the fields to extract come from 'ICDAR_2019_task3_data/0325updated.task2train(626p)'.
2. encoding the extracted text.


## Now we need to extract the [company,address,date,total] fields from the cleaned text

In [255]:
#loading ground truth.

path = target_root_dir + train_files[0] + '.txt' 

with open(path,'r') as f:
    json_dict = json.load(f)

json_dict

{'company': 'AEON CO. (M) BHD',
 'date': '19/02/2018',
 'address': '3RD FLR, AEON TAMAN MALURI SC JLN JEJAKA, TAMAN MALURI CHERAS, 55100 KUALA LUMPUR',
 'total': '17.70'}

In [257]:
for k, val in json_dict.items():
    
    # if there are no errors in the input text then val will be in cleaned_text. 
    if val in cleaned_text:
        start_pos = cleaned_text.find(val)
        end_pos = start_pos + len(val)        
        print(f'{k} : {cleaned_text[start_pos : end_pos]}')


company : AEON CO. (M) BHD
date : 19/02/2018
address : 3RD FLR, AEON TAMAN MALURI SC JLN JEJAKA, TAMAN MALURI CHERAS, 55100 KUALA LUMPUR
total : 17.70


## Encoding the targets 

The encoded ground truth will be a list of ints, one per character of the cleaned text.

In [335]:
labels

{'company': 1, 'address': 2, 'date': 3, 'total': 4}

In [19]:
def get_labels(json_dict,cleaned_text):
    '''
    
    This function is responsible for creating the encoded training targets.
    
    Args:
        json_dict (dict) : A dict containing ground truth data. 
        cleaned_text (str) : cleaned text which has been stripped of coordinates.
    Returns:
        text_class (list) : A list containing char level encodings.
        k (str) : key of ground truth [company,address,date,total]; this is returned instead of the list only when no match is found.
    
    '''
    
    labels = {'company':1,'address':2,'date':3,'total':4}
    
    # every character starts as class 0 ("other")
    text_class = [0] * len(cleaned_text)
    
    for k, val in json_dict.items():
        
        # only the first occurrence of the value is labelled
        start_pos = cleaned_text.find(val)
        if start_pos != -1:
            end_pos = start_pos + len(val)
            text_class[start_pos:end_pos] = [labels[k]] * (end_pos - start_pos) 
        else:
            # if val is not present then return the key.
            return k
      
    return text_class


In [266]:
encoded_labels  = get_labels(json_dict,cleaned_text)

len(encoded_labels) == len(cleaned_text)

True

In [268]:
print(encoded_labels)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
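
A quick sanity check on such an encoding is to decode it back into fields. The sketch below does this on a toy string; `decode_labels` is a hypothetical helper that is not part of the notebook, and the toy text and labels are made up for illustration.

```python
from itertools import groupby

inv_labels = {1: 'company', 2: 'address', 3: 'date', 4: 'total'}

def decode_labels(text, text_class):
    """Group consecutive characters with the same non-zero label into fields."""
    fields = {}
    pos = 0
    for label, group in groupby(text_class):
        length = len(list(group))
        if label != 0:
            # collect every labelled run under its field name
            fields.setdefault(inv_labels[label], []).append(text[pos:pos + length])
        pos += length
    return fields

toy_text = "ACME BHD 01/02/2018 TOTAL 5.00"
toy_labels = [1] * 8 + [0] + [3] * 10 + [0] * 7 + [4] * 4
decoded = decode_labels(toy_text, toy_labels)
print(decoded)  # {'company': ['ACME BHD'], 'date': ['01/02/2018'], 'total': ['5.00']}
```

If the decoded fields differ from the ground truth dict, the labels and the text are out of alignment.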

# 4. Corrections: 

The ground truths must be present in the input text exactly, matching character for character. But a few samples don't contain the exact ground truths. This is mostly because of the following reasons.

1. extra spaces. 
2. improper use of . or ,. 
3. incorrect spellings. 
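
The three causes above can be told apart with a rough diagnostic like the one below. This is only a sketch: `diagnose` and its two helpers are hypothetical names, and the checks are deliberately crude (plain substring tests after normalisation).

```python
def squeeze_spaces(s):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(s.split())

def drop_dots_commas(s):
    """Replace '.' and ',' with spaces, then collapse whitespace."""
    return squeeze_spaces(s.replace(".", " ").replace(",", " "))

def diagnose(value, text):
    """Guess why a ground truth value is (not) found in the input text."""
    if value in text:
        return "exact match"
    if squeeze_spaces(value) in squeeze_spaces(text):
        return "extra spaces"
    if drop_dots_commas(value) in drop_dots_commas(text):
        return "punctuation"
    return "other (possibly spelling)"

print(diagnose("KUALA  LUMPUR", "55100 KUALA LUMPUR TEL"))  # extra spaces
print(diagnose("LOT 485, TMN", "LOT 485,TMN LEMBAH"))       # punctuation
print(diagnose("SELANGIR", "SHAH ALAM, SELANGOR."))         # other (possibly spelling)
```

Running something like this over the missed files gives a rough idea of which fields can be fixed automatically and which need manual correction.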

## Let's check the missing cases 


In [25]:
def check(train_files):
    
    '''
    This function checks whether all ground truth keys (company,address,date,total) were extracted properly or not. 

    Args:
        train_files (list) : List containing the file names of all train files.
    
    Returns : 
            
            data_dict (dict) : A dict containing the training data. 
            missed (list) : all the keys which were missed 
            missed_dict (dict) : file names per key  eg . {address : [file_name1,file_name2] , 'company' : [file_name1]..}
            
    '''
    
    data_dict = {}
    missed_dict = {'address':[],'company':[],'date':[],'total':[]}
    missed = []
    
    for idx  in tqdm(train_files):
        
        txt_file = 'ICDAR_2019_task3_data/0325updated.task1train(626p)/' + idx + '.txt'
        json_file = 'ICDAR_2019_task3_data/0325updated.task2train(626p)/' + idx + '.txt'
        
        with open(json_file,'r') as jfile:
            json_dict = json.load(jfile)
        
        
        x = getText(txt_file).upper()
        y = get_labels(json_dict,x)
        
        if  isinstance(y,list):
            data_dict[idx] = (x,y)
        else:
            missed_dict[y].append(idx)
            missed.append(y)
     
    return data_dict,missed,missed_dict


In [27]:
train_dict,missed,missed_dict = check(train_files)

100%|██████████| 625/625 [00:25<00:00, 24.60it/s]


In [38]:
len(missed)

157

## We missed 157 samples. 
## Let's see which class was missed the most

In [39]:
from collections import Counter 
Counter(missed)

Counter({'address': 134, 'date': 4, 'company': 18, 'total': 1})

In [40]:
## let's get the ids of the missed samples
missed_ids = []

for i in missed_dict:
    
    missed_ids.extend(missed_dict[i])

In [69]:
## helper function to print input_text, ground truths 
def print_data(keys):
    
    '''
    
    Helper function to print the text data and its corresponding ground truth label dict. 
    
    Args:
        keys (list) : list containing the file names. (since file names are unique they are called keys.)
        
        
    '''
    
    for idx  in  keys:
        print('For id = ' , idx)
        
        txt_file = 'ICDAR_2019_task3_data/0325updated.task1train(626p)/' + idx + '.txt'
        json_file = 'ICDAR_2019_task3_data/0325updated.task2train(626p)/' + idx + '.txt'
        
        with open(json_file,'r') as jfile:
            json_dict = json.load(jfile)        
        
        
        print(getText(txt_file))
        print('********')
        print(json_dict)
        
        print()
        print()


# Corrections : total field  <a id = 'total'></a>

We have only one missing file for `total`: file id = `X51005806678`. Error: the ground truth doesn't contain the "," in the price.
Fix: replace '7838.80' with '7,838.80'.



In [53]:
#total 
missed_dict['total']

['X51005806678']

In [175]:
with open('ICDAR_2019_task3_data/0325updated.task2train(626p)/X51005806678.txt') as f:
    data_dict = json.load(f)
    print('before correction gt',data_dict)
    #correction : replace '7838.80' with '7,838.80'
    data_dict['total'] = '7,838.80'
    print('after correction gt',data_dict)
    
# print(data_dict)
with open('ICDAR_2019_task3_data/0325updated.task2train(626p)/X51005806678.txt','w') as f:
    f.write(json.dumps(data_dict))
        
    

before correction gt {'company': 'KAISON FURNISHING SDN BHD', 'date': '29-01-18', 'address': 'L4-17 (B), LEVEL 4, UP2-01, MELAWATI MALL, 355, JALAN BANDAR MELAWATI, PUSAT BANDAR MELAWATI, 53100 KUALA LUMPUR.', 'total': '7838.80'}
after correction gt {'company': 'KAISON FURNISHING SDN BHD', 'date': '29-01-18', 'address': 'L4-17 (B), LEVEL 4, UP2-01, MELAWATI MALL, 355, JALAN BANDAR MELAWATI, PUSAT BANDAR MELAWATI, 53100 KUALA LUMPUR.', 'total': '7,838.80'}


# Correcting Dates <a id = 'date'></a>

There are 4 errors in the date fields of the following files. The fields are printed using the `print_data()` function,
in the following order.

id = 'XX24141..' 

text data 

ground truths 


## Errors found in ground truths 

`corrected_dates = {'X51006414519':'2018-04-06','X51005447850':'04/03/2018','X51008142038':'28-11-18','X51005749912':'28-03-18'}`

1. In file  `X51006414519`  date = '06/04/2018' instead of '2018-04-06'.
2. In file  `X51005447850`  date = '20180304' instead of '04/03/2018'.
3. In file  `X51008142038`  date = '28-01-18' instead of '28-11-18'.
4. In file  `X51005749912`  date = '28/03/18' instead of '28-03-18'.

In [78]:
print_data(missed_dict['date'])

For id =  X51006414519
HASHA PETROKIOSK COMPANY NO: LOT PTD 198718 TMN SIERRA P 81750 MASAI JOHOR SITE: 2591 TELEPHONE: GST NO: RECEIPT INVOICE NUMBER 22.73 LITRE PUMP # 02 FUELSAVE 95 RM 50.00 C 2.200 RM / LITRE TOTAL RM 50.00 CASH RM 50.00 RELIEF GST C RM 0.00 TOTAL GROSS C RM 50.00 2018-04-06 17:36:35 CUSTOMER COPY TERMINAL ID: 84259112 ID/STAN: 274961/000000 ENTRY MODE: MSR/1 CARD: ATTENDANT TAG CARD XXXXXXXXX0053 RESPONSE:- 000 APPROVED ATTENDANT: MRIFAN DATE TIME NUM OPT 06/04/18 DIESEL & PETROL RON95 GIVEN RELIEF UNDER SECTION 56 (3) (B) GST ACT 2014 THANK YOU PLEASE COME AGAIN JM0032980-V 073000190 000908345344 60000437782 17:36 41766 02 
********
{'company': 'HASHA PETROKIOSK', 'date': '06/04/2018', 'address': 'LOT PTD 198718 TMN SIERRA P 81750 MASAI JOHOR', 'total': '50.00'}


For id =  X51005447850
PASARAYA BORONG PINTAR SDN BHD BR NO.: (124525-H) NO 19-G& 19-1& 19-2 JALAN TASIK UTAMA 4, MEDAN NIAGA TASIK DAMAI 016-5498845. GST NO.: 04/03/2018 15:41:52 TAX INVOICE TRN: CR000

## The following cell corrects the errors 

In [79]:
corrected_dates = {'X51006414519':'2018-04-06','X51005447850':'04/03/2018','X51008142038':'28-11-18','X51005749912':'28-03-18'}


for i in corrected_dates:
    
    with open('ICDAR_2019_task3_data/0325updated.task2train(626p)/'+i+'.txt') as f:
        data_dict = json.load(f)
        print('before correction date',data_dict['date'])
        #correction : replace the date with the value that appears in the input text
        data_dict['date'] = corrected_dates[i]
        print('after correction gt',data_dict['date'])
    
    # print(data_dict)
    with open('ICDAR_2019_task3_data/0325updated.task2train(626p)/'+i+'.txt','w') as f:
        f.write(json.dumps(data_dict))


before correction date 06/04/2018
after correction gt 2018-04-06
before correction date 20180304
after correction gt 04/03/2018
before correction date 28-01-18
after correction gt 28-11-18
before correction date 28/03/18
after correction gt 28-03-18


In [80]:
train_dict,missed,missed_dict = check(train_files)
len(missed)

100%|██████████| 626/626 [00:01<00:00, 437.77it/s]


154

In [81]:
missed_dict['date']

[]

Date and total fields are corrected.

# Correcting Company Name <a id = 'company'></a>

Errors are:

1. spelling, e.g. ENTERORISE
2. spacing 




In [83]:
# view error files
# print_data(missed_dict['company'])

For id =  X51005712038
SWC ENTERORISE SDN BHD (1125830-V) 5-7, JALAN MAHAGONI 7/1 SEKSYEN 4, BANDAR UTAMA, 44300 BATANG KALI, SELANGOR 03-60571377 GST 002017808384 TAX INVOICE NO : 00518087100028 005002(BATANGKALI-2) 8 002 28/03/2018 10:36:13 OPEN CODE-SR ITEM 0025679 U 3X8.00 24.00 S 0025679 U 1X7.00 7.00 S PD 3.20 -3.80 AUTHORIZE : BATANGKALI-S STAR 12X13 (1X180) 0020329 PKT 1X1.00 1.00 S 0020324 PKT 1X1.80 1.80 S ITEM 4 SUBTOTAL INCL GST 30.00 QTY 6 SPEC.DISC 0.00 SAVING 3.80 ROUNDING 0.00 TOTAL 30.00 CASH 50.00 CHANGE 20.00 GST SUMMARY AMOUNT(RM) TAX(RM) S(6%) 28.30 1.70 28/03/2018 10:36:13 GOODS SOLD ONLY EXCHANGEABLE WITHIN 3 DAYS GOODS SOLD ARE NOT REFUNDABLE THANK YOU FOR KIND SUPPORT PLEASE COME AGAIN 
********
{'company': 'SWC ENTERPRISE SDN BHD', 'date': '28/03/2018', 'address': '5-7, JALAN MAHAGONI 7/1 SEKSYEN 4, BANDAR UTAMA, 44300 BATANG KALI, SELANGOR', 'total': '30.00'}


For id =  X51005749905
ENW HARDWARE CENTRE (M) SDN. BHD. CO. REG. NO. : 795225-A GST REG. NO. : 000

# Company names will be corrected manually

In [85]:
missed_dict['company']

['X51005712038',
 'X51005749905',
 'X51005715451',
 'X51005711451',
 'X51005441398',
 'X51007846326',
 'X51005749895',
 'X51005442361',
 'X51005717526',
 'X51005361907',
 'X51006554833',
 'X51006328913',
 'X00016469612',
 'X51005605333',
 'X51005712021',
 'X00016469620',
 'X51006414700',
 'X51006555570']

## After Corrections 

In [86]:
train_dict,missed,missed_dict = check(train_files)
len(missed)

100%|██████████| 626/626 [00:00<00:00, 1086.56it/s]


142

In [87]:
missed_dict['company']

[]

# Correcting Address <a id = 'address'></a>

In [88]:
print_data(missed_dict['address'])

For id =  X00016469672
TAN CHAY YEE SOON HUAT MACHINERY ENTERPRISE (JM0352019-K) NO.53 JALAN PUTRA 1, TAMAN SRIPUTRA, 81200 JOHOR BAHRU JOHOR TEL : 07-5547360 / 016-7993391 FAX : 07-5624059 SOONHUAT2000@HOTMAIL.COM GST ID : 002116837376 CASH SALES DOC NO. : CS00004040 DATE: 11/01/2019 CASHIER : USER TIME: 09:44:00 SALESPERSON : REF.: GOODS SOLD ARE NOT RETURNABLE, THANK YOU. ITEM QTY S/PRICE S/PRICE AMOUNT TAX 1072 1 80.00 80.00 80.00 REPAIR ENGINE POWER SPRAYER (1UNIT) WORKMANSHIP & SERVICE 70549 1 160.00 160.00 160.00 GIANT 606 OVERFLOW ASSY 1071 1 17.00 17.00 17.00 ENGINE OIL 70791 1 10.00 10.00 10.00 GREASE FOR TOOLS 40ML (AKODA) 70637 1 6.00 6.00 6.00 EY20 PLUG CHAMPION 1643 1 8.00 8.00 8.00 STARTER TALI 70197 1 10.00 10.00 10.00 EY20 STARTER HANDLE 70561 2 18.00 18.00 36.00 HD40 1L COTIN TOTAL QTY: 9 327.00 TOTAL SALES : 327.00 DISCOUNT : 0.00 TOTAL : 0.00 ROUNDING : 0.00 TOTAL SALES : 327.00 CASH : 327.00 CHANGE : 0.00 
********
{'company': 'SOON HUAT MACHINERY ENTERPRISE', 'dat

PETRODELI ENTERPRISE COMPANY NO: SA0127959-D SITE: 2395 LOT 485,TMN LEMBAH KERAMAT JLN ULU KELANG . 54200 KUALA LUMPUR TELEPHONE: 03-41056485 GST NO: 000145047552 INVOICE NUMBER: 60000152273 39.42 LITRE PUNP # 09 FUELSAVE 95 RM 85.54 C 2.170 RM / LITRE TOTAL RM 85.54 VISA RM 85.54 RELIEF GST C RM 0.00 TOTAL GROSS C RM 85.54 SHELL LOYALTY CARD 6018840059147765 POINTS AWARDED: 39 DATE TIME NUM OPT 26/02/18 08:28 10886 09 DIESEL & PETROL RON95 GIVEN RELIEF UNDER SECTION 56 (3) (B) GST ACT 2014 THANK YOU PLEASE COME AGAIN 
********
{'company': 'PETRODELI ENTERPRISE', 'date': '26/02/18', 'address': 'LOT 485, TMN LEMBAH KERAMAT JLN ULU KELANG,  54200 KUALA LUMPUR', 'total': '85.54'}


For id =  X51006401723
S&Y STATIONERY (002050590-H) NO.36G JALAN BULAN BM U5/BM, BANDAR PINGGIRAN SUBANG, SEKSYEN U5, 40150 SHAH ALAM, SELANGOR. TEL / FAX : 0163307491 / 0378317491 EMAIL: SNYSTATIONERY@HOTMAIL.COM TEL: 0163307491 / 0378317491 FAX: 0378317491 E-MAIL: SNYSTATIONERY@HOTMAIL,.COM (GST REG NO : 0009

## We will need some sort of algorithm for this. Let's note down all the error cases 
1. missing punctuation or words. 
2. wrong spelling: extra words, wrong words, or missing words. 
3. spaces.

**We will try to match the ground truth to the text.**

## Simple Algorithm.


1. Extract the first and last words from the ground truth address.
2. Get the positions of the first and last words in the input text and `store them in start and stop.` 
3. Extract the text from the input text spanning `start : stop`.
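
The steps above can be sketched on one of the printed examples as follows. This is a simplified illustration, not the notebook's implementation: `span_from_first_last` is a hypothetical helper, it uses `str.find` and `str.rfind` for the first and last occurrences, and it leaves out the extra handling of commas and periods.

```python
def span_from_first_last(address, input_text):
    """Return the slice of input_text running from the first occurrence of the
    address's first word to the last occurrence of its last word, or None."""
    words = address.split()
    if not words:
        return None
    first, last = words[0], words[-1]
    start = input_text.find(first)        # first occurrence of the first word
    last_start = input_text.rfind(last)   # last occurrence of the last word
    if start == -1 or last_start == -1 or last_start < start:
        return None
    return input_text[start:last_start + len(last)]

text = ("PETRODELI ENTERPRISE LOT 485,TMN LEMBAH KERAMAT JLN ULU KELANG . "
        "54200 KUALA LUMPUR TELEPHONE: 03-41056485")
gt = "LOT 485, TMN LEMBAH KERAMAT JLN ULU KELANG,  54200 KUALA LUMPUR"

corrected = span_from_first_last(gt, text)
print(corrected)  # LOT 485,TMN LEMBAH KERAMAT JLN ULU KELANG . 54200 KUALA LUMPUR
```

The returned span is copied from the input text, so it is guaranteed to match character for character, which is what the label creation needs.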




In [177]:
train_dict,missed,missed_dict = check(train_files)
print(len(missed_dict['address']))

100%|██████████| 667/667 [00:00<00:00, 1174.24it/s]

137





137 missed files 

### Applying Simple Algorithm

In [270]:

keys  = missed_dict['address']

counter = 0

algo_missed = []

for idx  in  keys:
    
    print('For id = ' , idx)
    
    #paths
    txt_file = 'ICDAR_2019_task3_data/0325updated.task1train(626p)/' + idx + '.txt'
    json_file = 'ICDAR_2019_task3_data/0325updated.task2train(626p)/' + idx + '.txt'
    
    
    #load the ground truth file
    with open(json_file,'r') as jfile:
        json_dict = json.load(jfile)        

    
    #get the text.
    input_text = getText(txt_file)
    
    #get the address 
    address = json_dict['address']
    words = address.split(' ')
    
    
    #get the first and last words  of address (also works for a one-word address).
    first,last = words[0],words[-1]
    
    
    ## if end word is word,word2 keep word2 
    ## if end word is word, don't do anything.
    if ',' in last:
        if not (last[-1] == ','):
            last = last.split(',')[-1]

        
        
    if first in input_text and last in input_text:
        
        start = input_text.find(first)
        end = [m.end() for m in re.finditer(re.escape(last),input_text)][-1]  # escape so '.' or '(' are matched literally
        
        print(idx)
        print('gt : ',address)
        print('corrected gt : ',input_text[start:end]) 
    else:
        
        # if start has No. 7  keep only No 
        if '.' in first:
            first = first.split('.')[0]

        
        last  = last.replace('.','')

        if first in input_text and last in input_text:
            start = input_text.find(first)

            end = [m.end() for m in re.finditer(re.escape(last),input_text)][-1]  # escape so '.' or '(' are matched literally
            print(idx)
            print('gt : ',address)
            print('corrected gt : ',input_text[start:end]) 


        else:

            #spelling errors are taken care of
            #during preprocessing.

            algo_missed.append(idx)

            counter += 1

    
counter    

12

#### We missed 12 out of 137 due to spelling errors in the input text; these will need to be fixed manually


Common spelling errors in input text

1. LOP instead of LOT 
2. MALAYSLA instead MALAYSIA 
3. SETANGIR instead of SELANGOR
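
Near-miss spellings like these can be flagged with the standard library's `difflib`. The sketch below is an optional aid, not something the notebook does: `suggest_spelling` is a hypothetical helper and the `cutoff` of 0.8 is an arbitrary assumption.

```python
import difflib

def suggest_spelling(word, input_text, cutoff=0.8):
    """Return the closest word in input_text to `word`, as a list of at most one item."""
    # strip trailing/leading '.' and ',' so that 'SELANGOR.' and 'SELANGOR' compare equal
    candidates = sorted({w.strip(".,") for w in input_text.split()})
    return difflib.get_close_matches(word.strip(".,"), candidates, n=1, cutoff=cutoff)

text = "40400 SHAH ALAM, SELANGOR. TEL:03-51221701"
suggestion = suggest_spelling("SELANGIR.", text)
print(suggestion)  # ['SELANGOR']
```

A suggestion like this still needs a manual check before the ground truth or the input text is edited.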


In [160]:
#printing missed address values 
print_data(algo_missed[0:5])


For id =  X51005705722
ASO ELECTRICAL TRADING SDN BHD 1000131-K NO 31G, JALAN SEPADU C 25/C, SECTION 25, TAMAN INDUSTRIES, AXIS 40400 SHAH ALAM, SELANGOR. TEL:03-51221701, 51313091 GST NO : 000683900928 TAX INVOICE BILL TO: RECEIPT #: DATE: 27/09/2017 SALESPERSON: TIME: 10:51:00 CASHIER: USER (GST) (GST) ITEM QTY RSP RSP AMOUNT 107636 3 78.00 82.68 248.04 SR: HAGER TIMER, 24HRS POWER RESERVE TOT QTY: 3 248.04 (EXCLUDED GST) SUB TOTAL : 234.00 DISCOUNT : 0.00 TOTAL GST : 14.04 ROUNDING : 0.01 TOTAL : CASH : CHANGE : 0.00 GST SUMMARY TAX CODE % AMOUNT GST SR 6 234.00 14.04 TOTAL : 234.00 14.04 GOODS SOLD ARE NOT RETURNABLE, THANK YOU. 248.05 248.05 FAX: 03-51215716 CS00087400 
********
{'company': 'ASO ELECTRICAL TRADING SDN BHD', 'date': '27/09/2017', 'address': 'NO 31G, JALAN SEPADU C 25/C, SECTION 25, TAMAN INDUSTRIES, AXIS 40400 SHAH ALAM, SELANGIR.', 'total': '248.05'}


For id =  X51005719856
BASKIN BR ROBBINS DESA PARK CITY GOLDEN SCOOP SDN BHD (169609-A) A--16-1, TOWER A, NORTHPO

In [208]:
#helper function. 
def save(file,json_dict):
    
    '''
    helper function to save the corrected files. 
    
    Args:
        file (str) : path of the corrected file.
        json_dict (dict) : dict containing the corrected data.
    
    '''
    
    with open(file,'w') as f:
        print(f'correcting file {file}')
        f.write(json.dumps(json_dict))


def correct_addresses(keys):
    
    '''
    This function applies a simple algorithm to correct the missed address files. It also prints how many files 
    were successfully corrected (found) and how many were missed by our algorithm.
    
    Args:
        keys (list) : file names.
    
    
    '''
    
    missed_cnt = 0
    found_cnt = 0
    
    algo_missed = []

    for idx  in  keys:

    #     print('For id = ' , idx)
        txt_file = 'ICDAR_2019_task3_data/0325updated.task1train(626p)/' + idx + '.txt'
        json_file = 'ICDAR_2019_task3_data/0325updated.task2train(626p)/' + idx + '.txt'

        with open(json_file,'r') as jfile:
            json_dict = json.load(jfile)        


        input_text = getText(txt_file)

        address = json_dict['address']
        # first and last whitespace-separated tokens of the labelled address
        words = address.split(' ')
        first,last = words[0],words[-1]

        if ',' in last:
            if not (last[-1] == ','):
                last = last.split(',')[-1]


        if first in input_text and last in input_text:
            start = input_text.find(first)

            # re.escape so that chars such as '.' or '(' in the token are matched literally
            end = [m.end() for m in re.finditer(re.escape(last),input_text)][-1]
            json_dict['address'] = input_text[start:end] 
            save(json_file,json_dict)
            found_cnt += 1    
        else:
            
            if '.' in first:
                first = first.split('.')[0]
            
            last  = last.replace('.','')

            if first in input_text and last in input_text:
                
                start = input_text.find(first)
                end = [m.end() for m in re.finditer(re.escape(last),input_text)][-1]
                json_dict['address'] = input_text[start:end] 
                save(json_file,json_dict)
                found_cnt +=1 
                
            else:

                # spelling errors are taken care of
                # while preprocessing.

                algo_missed.append(idx)

                missed_cnt += 1

    print(f'found {found_cnt} out of {len(keys)}')
    print(f'missed {missed_cnt} out of {len(keys)}')
        
    return algo_missed
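
The span-matching idea behind `correct_addresses` can be illustrated on a toy string. The sketch below is a simplified stand-in, not the notebook's function: the helper name `recover_span` and the sample strings are made up for illustration. It takes the first and last token of the noisy label, locates the first occurrence of the first token and the last occurrence of the last token in the OCR text, and returns everything in between.

```python
import re

def recover_span(label, text):
    """Return the substring of `text` that runs from the first token of
    `label` to the last occurrence of its last token, or None."""
    words = label.split(' ')
    first, last = words[0], words[-1]
    if first not in text or last not in text:
        return None
    start = text.find(first)
    # last occurrence of the final token, matched literally
    end = [m.end() for m in re.finditer(re.escape(last), text)][-1]
    if end <= start:
        return None
    return text[start:end]

# Toy example (made up): the label misspells a word in the middle.
text = "ACME SDN BHD NO 31G, JALAN SEPADU, 40400 SHAH ALAM, SELANGOR. TEL:03-1234"
label = "NO 31G, JALAN SEPADO, 40400 SHAH ALAM, SELANGOR."
print(recover_span(label, text))
```

Because only the two end tokens are compared, errors in the middle of the label are repaired for free. A short first token such as `NO` may also occur earlier in the receipt (for example in `REG NO`), in which case the recovered span would start too early, so the result is worth inspecting.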

In [209]:

keys  = missed_dict['address']

correct_addresses(keys)

correcting file ICDAR_2019_task3_data/0325updated.task2train(626p)/X00016469672.txt
correcting file ICDAR_2019_task3_data/0325updated.task2train(626p)/X51006387812.txt
correcting file ICDAR_2019_task3_data/0325updated.task2train(626p)/X51005442333.txt
correcting file ICDAR_2019_task3_data/0325updated.task2train(626p)/X51007846398.txt
correcting file ICDAR_2019_task3_data/0325updated.task2train(626p)/X51005361898.txt
correcting file ICDAR_2019_task3_data/0325updated.task2train(626p)/X51006387813.txt
correcting file ICDAR_2019_task3_data/0325updated.task2train(626p)/X00016469619.txt
correcting file ICDAR_2019_task3_data/0325updated.task2train(626p)/X51005749905.txt
correcting file ICDAR_2019_task3_data/0325updated.task2train(626p)/X51005745188.txt
correcting file ICDAR_2019_task3_data/0325updated.task2train(626p)/X51005719864.txt
correcting file ICDAR_2019_task3_data/0325updated.task2train(626p)/X51005715451.txt
correcting file ICDAR_2019_task3_data/0325updated.task2train(626p)/X51006620

[]

# Rare Cases: Manual Correction


These are some of the files our algorithm couldn't correct.


1. X51005745183
2. X51006392122
3. X51005757243
4. X51007846451
5. X51005757199
6. X51005605284(2)
7. X51006913054

The errors in these files are due to spacing and to extra `,` and `.` characters.
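
To see why an exact substring test fails on such files, the label and the OCR text can be compared after removing everything except letters and digits. This is only a diagnostic sketch, not the correction method used in this notebook; `normalize` is a hypothetical helper, and the strings are taken from the output for `X51005745183` (the OCR text is shortened).

```python
import re

def normalize(s):
    """Keep only letters and digits so that spacing and stray
    punctuation do not affect the comparison."""
    return re.sub(r'[^A-Z0-9]', '', s.upper())

ocr_text = ("BILLION SIX ENTERPRISE NO 3, JALAN TAMAN JAS'A 2; SECTION U6, "
            "40150 SHAH ALAM. TEL : 603-58856749")
label = "NO 3, JALAN TAMAN JAS'A 2; . SECTION U6, 40150 SHAH ALAM."

print(label in ocr_text)                        # exact match on the raw strings
print(normalize(label) in normalize(ocr_text))  # match after normalization
```

The raw comparison fails because of the stray ` . ` in the label, while the normalized comparison succeeds, which confirms that the address itself is present and only the punctuation differs.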


In [200]:
print_data(['X51005745183','X51006392122','X51005757243','X51007846451'])

For id =  X51005745183
BILLION SIX ENTERPRISE NO 3, JALAN TAMAN JAS'A 2; SECTION U6, 40150 SHAH ALAM. TEL : 603-58856749 GST REG NO: 000944312320 TAX INVOICE INVOICE NO : 0-170891 DATE : 22/02/2018 7:41:46 AM CASHIER : 123 DESCRIPTION QTY PRICE AMOUNT 1 SR' 9556405112205 11 5.80 63.80 TG7 7" HIPS PLATE (50PCS) JP TOTAL : 63.80 DISCOUNT: 0.00 TOTAL SALES INCLUSIVE GST @6.00%: 63.80 MASTER CARD 63.80 5148826201779121 GST SUMMARY % AMOUNT(RM) TAX(RM) SR 6.00 60.19 3.61 BARANG YANG SUDAH DI BELI, WANG TIDAK DAPAT DI KEMBALIKAN. PERTUKARAN BARANG HANYA BOLEH DIBUAT DALAM 3 HARI SAHAJA DENGAN RESIT. 
********
{'company': 'BILLION SIX ENTERPRISE', 'date': '22/02/2018', 'address': "NO 3, JALAN TAMAN JAS'A 2; . SECTION U6, 40150 SHAH ALAM.", 'total': '63.80'}


For id =  X51006392122
Z = 0% 4.00 0.00 THANK YOU AND DO VISIT US AGAIN CASHIER [P.NISANTHI] MACHINE [003] [19/06/17] [11:37] FIVE STAR CASH & CARRY (1365663-P) G.23 & G.22, PLAZA SERI SETIA, NO.1 JALAN SS 9/1, 47300 PJ, SELANGOR, TEL/FA

In [204]:
print_data(['X51006913054','X51005605284(2)','X51006913054'])

For id =  X51006913054
DE LUXE CIRCLE FRESH MART SDN BHD (MUTIARA RINI 16) CO REG NO:797887-W NO, 89&91, JALAN UTAMA, TAMAN MUTIA RINI, 81300 SKUDAI, JOHOR. TEL:016-7780546 MT161201805120055 CASHIER: 12/05/18 12:12:57 PM 12/05/18 12:13:17 PM HEAVEN & EARTH AYATAKA GREEN TEA 1.5L 8888002119454 KONNYAKU 10G 8888338001119 5.00*1 3.50*2 ITEM: QTY: TOTAL SAVING: TOTAL WITH GST @ 6% ROUNDING TOTAL 12.00 0.00 12.00 TENDER CASH 50.00 CHANGE 38.00 GST ANALYSIS S = 6% Z = 0% MEMBER 0000036581 MEMBER: WONG SHOO YUEN *THANK YOU. SEE YOU AGAIN !! *CUSTOMER CARE LINE : 012-7092889 *CUSTOMERSERVICE@DELUXEGROUPS.COM GST NO:001507647488 CHU PECK 5.00 S 7.00 S 2 3 0.00 GOODS TAX AMOUNT 11.32 0.68 0.00 0.00 POINTS EARNED: 11 
********
{'company': 'DE LUXE CIRCLE FRESH MART SDN BHD', 'date': '12/05/18', 'address': 'NO. 89&91, JALAN UTAMA, TAMAN MUTIA RINI, 81300 SKUDAI, JOHOR.', 'total': '12.00'}


For id =  X51005605284(2)
ADVANCO COMPANY COMPANY REG. NO.: 725186-V NO 1&3, JALAN ANGSA DELIMA 12, WANGSA L

In [337]:
train_dict,missed,missed_dict = check(train_files)
missed_dict

100%|██████████| 666/666 [00:00<00:00, 1068.30it/s]


{'address': [], 'company': [], 'date': [], 'total': []}

### All the samples have been corrected; no missing samples remain

# Creating Training Data Dict


Now we are ready to create the training data.


In [29]:

def create_tran_data(train_files):
    
    '''
    Method to create the training data. The output of this method is a dict mapping each file name to (text, labels).
    
    Args:
        train_files (list) : list containing names of train files.
    
    '''
    
    data_dict = {}
   
    for idx  in tqdm(train_files):
    
        txt_file = 'ICDAR_2019_task3_data/0325updated.task1train(626p)/' + idx + '.txt'
        json_file = 'ICDAR_2019_task3_data/0325updated.task2train(626p)/' + idx + '.txt'
        
        with open(json_file,'r') as jfile:
            json_dict = json.load(jfile)
        
        
        x = getText(txt_file).upper()
        y = get_labels(json_dict,x)
        
      
        data_dict[idx] = (x,y)
       

    return data_dict


In [30]:
train_dict = create_tran_data(train_files)

100%|██████████| 625/625 [00:00<00:00, 1303.40it/s]


In [31]:
len(train_dict.keys())

625

## Let's check whether the rare cases discussed above are corrected

In [298]:
# check the rare samples.

for i in ['X51005745183','X51006392122','X51005757243','X51007846451','X51006913054','X51005605284(2)','X51006913054']:
    
    key = i 
    text = train_dict[i][0]
    pred = train_dict[i][1]
    print(key)
    get_output(key,text,pred,inf = True)



X51005745183
*******input text*******
BILLION SIX ENTERPRISE NO 3, JALAN TAMAN JAS'A 2; SECTION U6, 40150 SHAH ALAM. TEL : 603-58856749 GST REG NO: 000944312320 TAX INVOICE INVOICE NO : 0-170891 DATE : 22/02/2018 7:41:46 AM CASHIER : 123 DESCRIPTION QTY PRICE AMOUNT 1 SR' 9556405112205 11 5.80 63.80 TG7 7" HIPS PLATE (50PCS) JP TOTAL : 63.80 DISCOUNT: 0.00 TOTAL SALES INCLUSIVE GST @6.00%: 63.80 MASTER CARD 63.80 5148826201779121 GST SUMMARY % AMOUNT(RM) TAX(RM) SR 6.00 60.19 3.61 BARANG YANG SUDAH DI BELI, WANG TIDAK DAPAT DI KEMBALIKAN. PERTUKARAN BARANG HANYA BOLEH DIBUAT DALAM 3 HARI SAHAJA DENGAN RESIT. 

****** Ground Truth *******
{'company': 'BILLION SIX ENTERPRISE', 'date': '22/02/2018', 'address': "NO 3, JALAN TAMAN JAS'A 2; SECTION U6, 40150 SHAH ALAM.", 'total': '63.80'}
*******Prediction*********
company  :  BILLION SIX ENTERPRISE

address  :  NO 3, JALAN TAMAN JAS'A 2; SECTION U6, 40150 SHAH ALAM.

date  :  22/02/2018

total  :  63.80

X51006392122
*******input text******

### A final check that confirms each sample has all 5 labels

The code below iterates through all the samples and checks that each label sequence contains 5 unique values (0 for characters outside any field, plus 1 to 4 for the four fields).

In [301]:
 
count = 0

for key in train_dict.keys():
    
    sample = train_dict[key][1]
    
    if len(np.unique(sample)) != 5:
        print(key)
        print(np.unique(sample))
        count += 1
    
count

0
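
The check above relies on character-level labels: every character of the input text carries one integer. The sketch below shows how such a sequence might be built and why a field that cannot be found in the text makes the check fail. `label_chars` is a simplified, hypothetical stand-in for `get_labels`, whose actual implementation may differ, and the receipt text is made up.

```python
# Label ids follow the notebook; 0 is assumed to mean "no field".
FIELD_IDS = {'company': 1, 'address': 2, 'date': 3, 'total': 4}

def label_chars(text, fields):
    """Hypothetical stand-in for get_labels: one integer per character."""
    out = [0] * len(text)
    for name, value in fields.items():
        start = text.find(value)
        if start == -1:
            continue  # field not found in the text: its id stays absent
        for i in range(start, start + len(value)):
            out[i] = FIELD_IDS[name]
    return out

text = "ACME SDN BHD NO 1, JALAN A DATE: 01/02/2018 TOTAL 9.90"
fields = {'company': 'ACME SDN BHD', 'address': 'NO 1, JALAN A',
          'date': '01/02/2018', 'total': '9.90'}

complete = sorted(set(label_chars(text, fields)))
fields['total'] = '9.95'  # a value that does not occur in the text
incomplete = sorted(set(label_chars(text, fields)))

print(complete)
print(incomplete)
```

A sample whose label set lacks one of the ids 1 to 4 signals that the ground-truth value could not be located in the OCR text, which is exactly the situation the correction steps above were meant to eliminate.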

## Saving Train dict

In [344]:
# pk.dump(train_dict,open('train_corrected_data_dict','wb'))

In [303]:
# pk.dump(train_files,open('train_corrected_ids','wb'))

## Test Data

All the above steps are also applied to the test data, except the correction part.


1. The test data input is present in `ICDAR_2019_task3_data/text.task1_2-test（361p)`, which contains 361 samples. 
2. The data to submit is present in `ICDAR_2019_task3_data/task3-test（347p)`, which contains 347 samples.

We will need to get rid of the 14 extra samples (361 - 347).


In [87]:
!ls ICDAR_2019_task3_data/

0325updated.task1train(626p)
0325updated.task2train(626p)
task3-test（347p)
text.task1_2-test（361p)


In [10]:
test_jpg = [i.split('.')[0] for i in  os.listdir('ICDAR_2019_task3_data/task3-test（347p)')]
len(test_jpg)

347

In [11]:
# getting test data 
test_txt = [i.split('.')[0] for i in os.listdir('ICDAR_2019_task3_data/text.task1_2-test（361p)')]
len(test_txt)


361

In [12]:
test_files = set(test_jpg).intersection(set(test_txt))
len(test_files)

347

## Checking Out-of-Vocabulary Characters

In [13]:
# check for invalid files, i.e. files
# containing chars not in our vocab
# (note: counter counts offending chars, not files)


counter = 0
root_folder =  'ICDAR_2019_task3_data/text.task1_2-test（361p)/'

for file in test_files:
   
    path = root_folder + file +'.txt'

    data = getText(path)    
    for char in data:
        
        if char not in VOCAB:
            print(path,char)
            counter += 1
            
            

print('files to remove' , counter)    

ICDAR_2019_task3_data/text.task1_2-test（361p)/X51005447844.txt c
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51005447844.txt a
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51005447844.txt s
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51005447844.txt h
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51005447844.txt i
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51005447844.txt e
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51005447844.txt r
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51005447844.txt a
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51005447844.txt d
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51005447844.txt m
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51005447844.txt i
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51005447844.txt n
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51006619503.txt £
ICDAR_2019_task3_data/text.task1_2-test（361p)/X51006619503.txt ¬
files to remove 14


### The lower-case chars will be converted to upper case, and the remaining unknown chars will be encoded with id = len(VOCAB)
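
As a small worked example of this encoding rule, the sketch below upper-cases a string and maps each character to its index in `VOCAB`, falling back to `len(VOCAB)` for anything outside the vocabulary. `VOCAB` is built the same way as in the dataset cell of this notebook; the names `UNK_ID` and `encode` are introduced here only for illustration.

```python
from string import punctuation, digits, ascii_uppercase

VOCAB = punctuation + digits + ascii_uppercase + " "
UNK_ID = len(VOCAB)  # id reserved for characters outside the vocab

def encode(text):
    """Upper-case the text and map each character to its VOCAB index."""
    return [VOCAB.find(c) if c in VOCAB else UNK_ID for c in text.upper()]

print(len(VOCAB))
print(encode("cashier"))    # lower case is handled by upper-casing
print(encode("\u00a3"))     # the pound sign is not in VOCAB
```

With this rule the lower-case characters reported for `X51005447844` need no special handling, and only symbols such as `£` and `¬` end up with the unknown id.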

In [98]:
def create_test_data(root,test_files):
    
    data_dict = {}
   
    for idx  in tqdm(test_files):
        
        txt_file = f'{root}' + idx + '.txt'
      
        
        x_test = getText(txt_file).upper()
   
        data_dict[idx] = x_test
    
    return data_dict
    

In [99]:
print('total no. of test files : ',len(test_files))
test_dict = create_test_data(root = 'ICDAR_2019_task3_data/text.task1_2-test（361p)/',test_files = test_files)

total no. of test files :  347


100%|██████████| 347/347 [00:00<00:00, 1468.43it/s]


In [100]:
len(test_dict.keys())

347

In [101]:

# pk.dump(test_dict,open('test_data_dict','wb'))
# pk.dump(test_files,open('test_ids','wb'))

# Dataset

For PyTorch we need to create a dataset class. Our class is called CustomDataset. One thing to note here is the type of batching performed: instead of computing a MAX_LEN for the entire dataset and padding all the samples to that length, we compute MAX_LEN per batch.  
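
The effect of per-batch padding can be seen on a toy example. The sketch below uses plain Python lists rather than tensors to keep it self-contained; `pad_batch` is a hypothetical helper and the encoded samples are made up. It counts how many cells have to be stored when padding to the global maximum versus the maximum of each batch.

```python
def pad_batch(seqs, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    maxlen = max(len(s) for s in seqs)
    return [s + [pad_id] * (maxlen - len(s)) for s in seqs]

# Toy encoded samples of very different lengths (made-up ids).
samples = [[5, 6], [7, 8, 9], [1] * 10, [2, 3, 4, 5]]
batches = [samples[:2], samples[2:]]

# Padding everything to the longest sample in the dataset.
global_max = max(len(s) for s in samples)
cells_global = global_max * len(samples)

# Padding each batch only to its own longest sample.
cells_per_batch = sum(len(row) for b in batches for row in pad_batch(b))

print(cells_global, cells_per_batch)
```

Short receipts batched together are padded far less, which reduces wasted computation on padding positions. The saving depends on how the samples are grouped, so batches of similar length benefit the most.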

In [32]:
import torch 
from torch.utils.data import Dataset

import os 
import pickle as pk 
import random
from string import punctuation,digits,ascii_uppercase

VOCAB = punctuation + digits + ascii_uppercase + " " 
labels = {'company':1,'address':2,'date':3,'total':4}
# inverse mapping, used by get_output to turn label ids back into field names
inv_labels = {1:'company',2:'address',3:'date',4:'total'}


class CustomDataset(Dataset):
    
    '''
    A class to act as a data pipeline. 
    
    Attributes:
        device (str) : device being used, either cpu or cuda. default = cpu
        data_type (str) : string used to identify the type of dataset: train, val or test. default = train 
        train_dict (dict) : dict containing all the training samples. 
        test_ds (dict) : dict containing only the test data. 
        n (int) : size of the train dataset (train + val set combined)
        val_size (int) : size of the val dataset 
        train_ds (dict) : dict containing only the training samples 
        val_ds (dict) : dict containing only the validation data.
    
    
    Methods: 
           __len__ : Returns the length of the dataset.  
           
           get_batch(batch) : method responsible for returning the encoded + padded train, val and test samples. 
                            
                             raises : ValueError 
                                     if no valid key (or list of keys) is specified.
                
    
    '''
    
    
    
    
    def __init__(self,train_dict_path,test_dict_path,data_type = 'train',val_size = 0.2,device = 'cpu'):
        
        '''
                    
            Args:
                train_dict_path (str) : path to train dict
                test_dict_path (str) : path to test dict. 
                data_type (str) : type of data, one of (train,val,test). default = train 
                val_size (float) : fraction of the training data used for validation. default = 0.2
                device (str) : either cpu or cuda. 
        
        
        '''

        super(CustomDataset,self).__init__()
        
        self.data_type = data_type
        # context managers make sure the pickle files are closed after loading
        with open(train_dict_path,'rb') as f:
            self.train_dict = pk.load(f)
        with open(test_dict_path,'rb') as f:
            self.test_ds = pk.load(f)
        
        self.n = len(self.train_dict.keys())
        self.val_size = int(self.n * val_size)
        
        data = list(self.train_dict.items())
        random.shuffle(data)

        self.train_ds =  dict(data[:(self.n - self.val_size)])
        self.val_ds = dict(data[(self.n-self.val_size):])
        self.device = device   

        print('using device : ',self.device)
        print('train size : ',len(self.train_ds))
        print('val size : ',len(self.val_ds))
        print('test size ',len(self.test_ds))
        
        
    def __len__(self):
        
        '''
            Returns the length (size) of the dataset.
        '''
        
        return self.n 
    
    
    def get_batch(self,batch):
        
        '''
        
        Args:
            batch (str or list) : parameter containing the names of files (without extensions). These serve as keys to our 
                                  train and test dicts. For the test dict the batch is a str (a single value); for training and 
                                  validation the batch is a list of strings. 
        
        Returns:
                (while training and validation)
                samples (list) : list of keys. 
                text_tensor (tensor) : tensor containing the encoded input text. 
                truth_tensor (tensor) : tensor containing the encoded target data.
                
                (while testing)
                text_tensor (tensor) : tensor containing the encoded test sample.
        
        
        
        '''
        
 
        # Testing phase 
        if self.data_type == 'test':
                # print('testing.')
                text = self.test_ds[batch]
                text_tensor = torch.zeros(len(text), 1, dtype=torch.long)
                text_tensor[:, 0] = torch.LongTensor([VOCAB.find(c) if VOCAB.find(c) != -1 else len(VOCAB) for c in text])
                return text_tensor.to(self.device)
            
        elif isinstance(batch,list) and (self.data_type == 'train' or self.data_type == 'val'):
            
            samples = batch

            if self.data_type == 'train':        
                texts = [self.train_ds[k][0] for k in samples]
                labels = [self.train_ds[k][1] for k in samples]
            else:
                texts = [self.val_ds[k][0] for k in samples]
                labels = [self.val_ds[k][1] for k in samples]

        else: 
             raise ValueError('No key specified.')
        # padding and encoding.

        maxlen = max(len(t) for t in texts)
    
        text_tensor = torch.zeros(maxlen, len(batch), dtype=torch.long)
        for i, text in enumerate(texts):            
            text_tensor[:, i] = torch.cat([torch.LongTensor([VOCAB.find(c) if VOCAB.find(c) != -1 else len(VOCAB) for c in text]),torch.zeros(maxlen-len(text),dtype = torch.long)])
            
        truth_tensor = torch.zeros(maxlen, len(batch), dtype=torch.long)
        for i, label in enumerate(labels):
            truth_tensor[:, i] = torch.cat([torch.LongTensor(label), torch.zeros(maxlen-len(label),dtype = torch.long)])

        return samples,text_tensor.to(self.device), truth_tensor.to(self.device)




if __name__ == "__main__":

    dataset = CustomDataset(train_dict_path = 'Data_Dicts/train_corrected_data_dict',test_dict_path ='Data_Dicts/test_data_dict')   
    
    print('total  samples : ',len(dataset))

    ### dataset during eval 

    # Training 
    train_keys = list(dataset.train_ds.keys())
    print('total training samples : ',len(train_keys))
    dataset.data_type = 'train'
    for i in train_keys:
        # named `truth` so the module-level `labels` dict is not overwritten
        key,text,truth = dataset.get_batch(batch=[i])
        print(i,key) 
        break 
    

    #Validation 
    val_keys = list(dataset.val_ds.keys())
    print('total val samples : ',len(val_keys))
    dataset.data_type = 'val'
    for i in val_keys:
        key,text,truth = dataset.get_batch(batch=[i])
        print(i,text.shape) 
        break 

    # Testing     
    test_keys = list(dataset.test_ds.keys())
    print('total test samples : ',len(test_keys))
    # test sample from the test set.
    dataset.data_type = 'test'
    for i in test_keys:
        text = dataset.get_batch(batch = i)
        # note: `key` still holds the last validation key here, not a test key
        print(i,key) 
        break 


using device :  cpu
train size :  500
val size :  125
test size  347
total  samples :  625
total training samples :  500
X51005433552 ['X51005433552']
total val samples :  125
X51007846391 torch.Size([591, 1])
total test samples :  347
X51005684949 ['X51007846391']


In [33]:
#helper function to convert the pred to dict.
def get_output(key,text,pred,inf = True):
    
    '''
    
    A helper method to convert the model's output (pred) to a dict. The competition accepts submissions as dicts stored in 
    .txt files.
    
    Args:
        key (str) : file name without the .txt extension
        text (str) : cleaned text data. 
        pred (list) : list of labels, one per character of text. 
        inf (boolean) : if True, print the input text and the ground truth before the predictions. default = True
        
        Returns: 
               output_dict (dict) : A dict containing the predicted output in dict format. 
    
    '''
    
    if inf:
    
        print('*******input text*******')    
        print(text + "\n")
        print('****** Ground Truth *******')
        with open('ICDAR_2019_task3_data/0325updated.task2train(626p)/'+key+".txt",'r') as f:
            data_dict = json.load(f)
        print(dict(data_dict))    
        print('*******Prediction*********')
    
    output_dict = {}
    
    print(pred)
    
    for label in range(1,5):
        word = ''
        print(inv_labels[label], " : ",end = " ")
        
        for char,idx in zip(text,pred):
            
            if idx == label:
                word += char
                
        print(word,end = "")
        print("\n")
        
        output_dict[inv_labels[label]] = word
        
        
    return output_dict
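
The decoding step inside `get_output` can be shown in isolation on a tiny example. The sketch below is a simplified variant that makes a single pass over the characters and does no printing; `decode_fields` and `INV_LABELS` are names introduced here for illustration, and the text and label sequence are made up.

```python
INV_LABELS = {1: 'company', 2: 'address', 3: 'date', 4: 'total'}

def decode_fields(text, pred):
    """Collect, for each label id, the characters tagged with that id."""
    fields = {name: '' for name in INV_LABELS.values()}
    for char, idx in zip(text, pred):
        if idx in INV_LABELS:  # id 0 (no field) is skipped
            fields[INV_LABELS[idx]] += char
    return fields

text = "ACME 01/02/18 9.90"
# one label per character: company, gap, date, gap, total
pred = [1, 1, 1, 1, 0] + [3] * 8 + [0] + [4] * 4
result = decode_fields(text, pred)
print(result)
```

Characters that share a label are simply concatenated in reading order, so a prediction that tags two separate regions with the same id would produce one merged string. A field that receives no characters comes back as an empty string.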
    

In [34]:
keys_to_check = list(dataset.train_ds.keys())[0:5]
item = dataset.train_ds[keys_to_check[0]]

In [35]:
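# Note: the hard-coded key differs from the key of `item` (keys_to_check[0]), so the
# ground truth printed for this call belongs to a different receipt than the input text.
get_output('X51006679216',*item)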

*******input text*******
WARAKUYA PERMAS CITY SDN BHD REG NO: 1203194-W JALAN PERMAS UTARA 1. PERMAS JAYA 81750 MASAI JOHOR TEL : 0111-558 0000 GST ID: 0016 6993 5104 TAX INVOICE NO 58244 42 DATE: 10/03/2018 5:41:06 PAX NO:4 CASHIER: CASHIER3 WAITER: HANYIN QTY CODE/DESC THANK YOU ! PLEASE COME AGAIN ! TOTAL RM 3 SABA SHIO YAKI SEY 53.70 1 SALMON SHIO SET 21.90 4 ICED GREEN TEA 4.00 SUBTOTAL 79.60 DISCOUNT -30.00 TOTAL AMOUNT : 49.60 SERV CHARGE 10% 4.96 GST @ 6% 3.27 ROUNDING ADJ : -0.03 TOTAL AMOUNT: 57.80 TOTAL: RM 57.80 TYPE 3 GST SUMMARY AMOUNT RM TAX RM SR 6% 54.56 3.27 

****** Ground Truth *******
{'company': 'FTOF NOODLE HOUSE', 'date': '21/05/2018', 'address': 'NO.25, JALAN METRO PERDANA BARAT 2, TAMAN USAHAWAN KEPONG, KEPONG UTARA, 52100 KUALA LUMPUR.', 'total': 'RM31.60'}
*******Prediction*********
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 

{'company': 'WARAKUYA PERMAS CITY SDN BHD',
 'address': 'JALAN PERMAS UTARA 1. PERMAS JAYA 81750 MASAI JOHOR',
 'date': '10/03/2018',
 'total': '57.80'}

In [36]:
for key in keys_to_check:
    print(key)
    item = dataset.train_ds[key]
    get_output(key,*item)

X51005433552
*******input text*******
WARAKUYA PERMAS CITY SDN BHD REG NO: 1203194-W JALAN PERMAS UTARA 1. PERMAS JAYA 81750 MASAI JOHOR TEL : 0111-558 0000 GST ID: 0016 6993 5104 TAX INVOICE NO 58244 42 DATE: 10/03/2018 5:41:06 PAX NO:4 CASHIER: CASHIER3 WAITER: HANYIN QTY CODE/DESC THANK YOU ! PLEASE COME AGAIN ! TOTAL RM 3 SABA SHIO YAKI SEY 53.70 1 SALMON SHIO SET 21.90 4 ICED GREEN TEA 4.00 SUBTOTAL 79.60 DISCOUNT -30.00 TOTAL AMOUNT : 49.60 SERV CHARGE 10% 4.96 GST @ 6% 3.27 ROUNDING ADJ : -0.03 TOTAL AMOUNT: 57.80 TOTAL: RM 57.80 TYPE 3 GST SUMMARY AMOUNT RM TAX RM SR 6% 54.56 3.27 

****** Ground Truth *******
{'company': 'WARAKUYA PERMAS CITY SDN BHD', 'date': '10/03/2018', 'address': 'JALAN PERMAS UTARA 1. PERMAS JAYA 81750 MASAI JOHOR', 'total': '57.80'}
*******Prediction*********
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2

## Observations 

1. Getting the total (price) correct is going to be difficult, as there are many numeric values which can be confused for the price. 
   Also, the format is not consistent: some of the totals include a currency, some don't. 
2. Getting the date will not be as difficult as the other attributes. 
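
Both observations can be made concrete with two regular expressions on an abridged version of the receipt text shown above. The patterns and the `strip_currency` helper are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

# Abridged from the WARAKUYA receipt printed earlier in this notebook.
text = ("DATE: 10/03/2018 5:41:06 SUBTOTAL 79.60 DISCOUNT -30.00 "
        "TOTAL AMOUNT : 49.60 SERV CHARGE 10% 4.96 GST @ 6% 3.27 "
        "TOTAL: RM 57.80")

# Every number with two decimals is a candidate for the total.
amounts = re.findall(r'\d+\.\d{2}', text)
# Dates in dd/mm/yyyy form are far less ambiguous.
dates = re.findall(r'\b\d{2}/\d{2}/\d{4}\b', text)

def strip_currency(total):
    """Drop a leading 'RM' so 'RM31.60' and '31.60' compare equal."""
    return re.sub(r'^RM\s*', '', total.strip())

print(amounts)
print(dates)
print(strip_currency('RM31.60'))
```

Several price-like candidates compete for the total while the date pattern yields a single match, which is why the total is expected to be the harder field. The date pattern here only covers four-digit years; receipts such as `X51006913054` use `12/05/18`, so a real rule would need more formats.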
