Receipt-Text-Extraction

Code to train your own Keras model on labelled receipt images collected from the wild.

Also releasing a pre-trained model bestext.hd5 which can be finetuned on your data.

How did you go about Pre-processing the data ? A: The common operations were to binarize and remove the colour channel in both training and test images. In the final test images given by tonetag, I detected the text right out of the image using OpenCV functions descriptions for which are given in the respective cell heading.
Model and explanation. A: Explained the model in the model cell heading and summary heading cells.
Performance measuring of the model. A: Since the images given were not labelled, I have no way measuring the performance for these ~900 images apart from manually checking each prediction file. But the validation accuracy from labelled images (which I collected) was arrived at using tthe CTC loss function the description for which has been updated in the relevant cell above.

Is there something I could have done better ? A: There are a lots of things that can be further done to improve this whole pipeline. This notebook and the predictions submitted is the first approach I thought to try. While doing this project, there were multiple approaches I could think of, but given the time constraint, and already being half way through, trying other approaches wasn't feasible :

Here are the alternative approaches in decreasing order of preference :

 a) Two different DL models for TEXT DETECTION , TEXT RECOGNITION. 
 PROS : Easier to tune.
 CONS : might be prone to overfitting to training data.

 b) An single model for end to end TEXT DETECTION + TEXT RECOGNITION
 PROS : Deep Learning would obviously be more robust to finding text boxes and predicting the text inside. 
 CONS : This approach would need 2-3X more data and 2-3X more training time and designing time (hyperparameter tuning) as the learning would be harder here.

 OTHERS :
     1. More labelled training data. 
     a) by generating textual data of words from a dictionary. 
     b) Collecting data apart from receipt images (I only used around 700 labelled receipt images data here)

     2. Data Augmentataion :
         a) Using standard keras datagen pre-processing steps like rotation, sheer etc.
     2. More num of epochs 
     Wouldve done it, but got GPU access pretty late. And Google Colab is very painful for Data I/O

     3. Hyperparameter tuning of the DL model
          a)more layers, more neurons per layer, trying GRUs instead of LSTMs. 
          b)Using a Global average pooling layer instead of flattening the CNN output.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
OCR		OCR
.DS_Store		.DS_Store
.gitignore		.gitignore
OCR.ipynb		OCR.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OCR

OCR

.DS_Store

.DS_Store

.gitignore

.gitignore

OCR.ipynb

OCR.ipynb

README.md

README.md

Repository files navigation

Receipt-Text-Extraction

About

Releases

Packages

Languages

dhruvagupta2014/Receipt-Text-Extraction_LSTM-CNN-CTC

Folders and files

Latest commit

History

Repository files navigation

Receipt-Text-Extraction

About

Resources

Stars

Watchers

Forks

Languages