E2E_OCR_Transformer

YAI 9 x Lomin: End-to-End OCR model (Detection & Recognition) based on Vision Transformer

2022 Winter Project (JAN-FEB)

YAI (Yonsei University Artificial Intelligence) with Lomin

TEAM: Subin Kang (강수빈), Juyeon Kim (김주연), JiHyuk Byun (변지혁), MinJae Lee (이민재), Sangmin Lee (이상민)


Introduction

Many OCR systems consist of either a detection model or a recognition model; in other words, the two tasks are handled by separate models, so detection and recognition cannot always be optimized jointly during training. Our goal is to detect and recognize text in images in a single step with a Transformer, and to recognize Korean text as well as English. Our main idea, inspired by FOTS, is to share the features extracted by the backbone (ResNet-50) between the detection branch and the recognition branch, and to train with four losses (label loss, bbox loss, text loss, recognition loss).
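
As a rough illustration of this multi-task objective, the four losses could be combined as a weighted sum. The sketch below is illustrative only; the loss names, weight values, and function name are placeholders, not the repository's actual configuration.

```python
import torch

def combine_losses(losses: dict) -> torch.Tensor:
    """Weighted sum of the detection and recognition losses (placeholder weights)."""
    weights = {"label": 1.0, "bbox": 5.0, "text": 2.0, "recog": 1.0}  # illustrative coefficients
    return sum(weights[name] * value for name, value in losses.items())

# Example with dummy scalar losses from the detection and recognition branches
losses = {"label": torch.tensor(0.7), "bbox": torch.tensor(0.4),
          "text": torch.tensor(0.3), "recog": torch.tensor(1.2)}
total = combine_losses(losses)
```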


Model

(Model architecture diagram)


Model Pipeline

(1) Shared Convolution from FOTS

  • Backbone: Resnet-50
  • Extracting the features
  • Output: Feature map
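
A minimal sketch of this step, assuming a torchvision ResNet-50 truncated before its classification head; the actual backbone.py may add FOTS-style feature merging and upsampling on top of this.

```python
import torch
import torchvision

# ResNet-50 without the average-pool and fully connected layers
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

images = torch.randn(2, 3, 512, 512)   # a batch of input images
features = backbone(images)            # shared feature map for both branches
print(features.shape)                  # torch.Size([2, 2048, 16, 16])
```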

(2) Transformer Detection from DETR

  • Detecting the texts in images
  • Output: a binary prediction of text presence (1) or absence (0) for each object query, plus bounding-box coordinates
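
A hedged sketch of a DETR-style detection branch: learned object queries attend to the shared feature map, and two small heads predict text presence and a normalized box per query. Layer sizes and names are illustrative assumptions (positional encodings are omitted for brevity), not the repository's exact settings.

```python
import torch
import torch.nn as nn

class TextDetectionHead(nn.Module):
    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)   # project backbone channels
        self.query_embed = nn.Embedding(num_queries, d_model)       # learned object queries
        self.transformer = nn.Transformer(d_model, nhead=8, num_encoder_layers=3,
                                          num_decoder_layers=3, batch_first=True)
        self.class_head = nn.Linear(d_model, 2)   # text present (1) vs. absent (0)
        self.bbox_head = nn.Linear(d_model, 4)    # (cx, cy, w, h), normalized to [0, 1]

    def forward(self, feature_map):
        b = feature_map.size(0)
        src = self.input_proj(feature_map).flatten(2).transpose(1, 2)   # (B, H*W, d_model)
        tgt = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)    # (B, Q, d_model)
        hs = self.transformer(src, tgt)                                  # one embedding per query
        return self.class_head(hs), self.bbox_head(hs).sigmoid()
```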

(3) RoIRotate from FOTS

  • Cropping the images using the predicted bounding-box coordinates
  • Rotating each region of interest so that its text can be recognized more easily
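
A simplified sketch of the RoIRotate idea using PyTorch's affine_grid/grid_sample, assuming it warps regions of the shared feature map; the actual util/roi_rotate.py follows the FOTS implementation and computes one affine matrix per detected box.

```python
import torch
import torch.nn.functional as F

def roi_rotate(feature_map, theta, out_h=8, out_w=64):
    """Warp a (possibly rotated) text region to a fixed, axis-aligned crop.

    feature_map: (B, C, H, W) shared features; theta: (B, 2, 3) affine matrices.
    """
    grid = F.affine_grid(theta, size=(theta.size(0), feature_map.size(1), out_h, out_w),
                         align_corners=False)
    return F.grid_sample(feature_map, grid, align_corners=False)

features = torch.randn(1, 256, 64, 64)
identity = torch.tensor([[[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]]])   # no rotation, whole image
crop = roi_rotate(features, identity)          # shape: (1, 256, 8, 64)
```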

(4) Transformer Recognition from ViTSTR

  • Recognizing the texts in the images
  • Output: Text
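
A rough sketch of a ViTSTR-style recognition branch: the rotated crop is split into patch tokens, encoded by a Transformer encoder, and each output position is classified into a character. Patch size, depth, and the vocabulary size (Korean + English characters) are illustrative assumptions, not the repository's settings.

```python
import torch
import torch.nn as nn

class TextRecognitionHead(nn.Module):
    def __init__(self, in_ch=256, d_model=256, max_chars=16, vocab_size=2400):
        super().__init__()
        # Turn the (8 x W) crop into a sequence of patch embeddings
        self.patch_embed = nn.Conv2d(in_ch, d_model, kernel_size=(8, 4), stride=(8, 4))
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.char_head = nn.Linear(d_model, vocab_size)   # per-position character logits
        self.max_chars = max_chars

    def forward(self, crop):                                          # crop: (B, in_ch, 8, W)
        tokens = self.patch_embed(crop).flatten(2).transpose(1, 2)    # (B, num_patches, d_model)
        encoded = self.encoder(tokens)
        return self.char_head(encoded[:, :self.max_chars])            # (B, max_chars, vocab_size)

crop = torch.randn(1, 256, 8, 64)        # e.g. an RoIRotate crop from the previous step
logits = TextRecognitionHead()(crop)     # torch.Size([1, 16, 2400])
```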

Structure

datasets
   __init__.py
   coco_eval.py
   text.py
   transforms.py

util           # roi_rotate.py is heavily based on FOTS
               # Repo: https://github.com/jiangxiluning/FOTS.PyTorch/tree/master/FOTS/model/modules (FOTS)
   __init__.py      
   box_ops.py
   misc.py
   plot_utils.py
   upsampling.py
   visualize_results.py
   convert.py
   matcher.py
   metric.py
   position_encoding.py
   roi_rotate.py

models         # Heavily based on DETR and ViTSTR
               # Repos: https://github.com/facebookresearch/detr (DETR)
               #        https://github.com/roatienza/deep-text-recognition-benchmark (ViTSTR)
   model.py      
   backbone.py         
   loss.py                      
   transformer.py
   vitstr.py

_main_.py
_train_.py
config.py



Data


OCR Transformer Train

Train the model: python _main_.py


Discussion

  • Problem

    • DETR model – DETR is not well suited to detecting small objects such as text
    • Resolution – Memory constraints force us to lower the image resolution
    • Number of predicted bounding boxes – the model predicts more text boxes than the ground truth contains
  • Trial and errors

    (1) Omit pretraining DETR

    • Reason: The pretrained DETR model performs poorly on small objects
    • Result: Improved performance; the number of predicted text boxes matched the number of ground-truth text boxes

    (2) Finetuning (Increase weight coefficient of “recog” loss)

    • Weight coefficients – class (true/false) : bounding box : GIoU : recognition
    • Reason: The recognized text deviated from the ground-truth text
    • Result: No effect

    (3) Detection performance test (with increased resolution and GIoU loss weight coefficient)

    • Reason: To identify whether the detection task or the recognition task was the problem
    • Result: Both the number of predicted text boxes and the GIoU loss decreased
  • Feedback by Lomin

    The architectural complexity was the most significant problem pointed out.

    • Each module should have been validated before validating the entire model. Our model consists of many components; when so many large modules are combined, it is more productive to validate each one step by step.

    • Because our model is so deep, it underfits as a whole. Lomin advised us to intensively train the encoder and decoder separately before fine-tuning the full model: for instance, first overfit each encoder and decoder on a single sample, progressively increase the dataset size, and finally train on the entire dataset.

For details of our project, go to YAI & LOMIN CORPORATE PROJECT
