📜 (OCR) Recognizing LaTeX format text in the equation image

bsm8734/formula-recognition-OCR


Formula Image Latex Recognition

➗ Latex Recognition Task

Competition Overview

Formula recognition (Latex Recognition) is the task of recognizing LaTeX-format text in a formula image. Unlike character recognition, formula recognition must learn not only the left → right reading order but also the top → bottom order patterns of multi-line formulas.


📁 File Structure

Code Folder

ocr_teamcode/
│
├── config/                   # train argument config file
│   ├── Attention.yaml
│   └── SATRN.yaml
│
├── data_tools/               # utils for dataset
│   ├── download.sh           # dataset download script
│   ├── extract_tokens.py     # extract tokens from token.txt
│   ├── make_dataset.py       # sample dataset
│   ├── parse_upstage.py      # convert JSON ground truth file to ICDAR15 format
│   └── train_test_split.py   # split dataset into train and test dataset
│
├── networks/                 # network, loss
│   ├── Attention.py
│   ├── SATRN.py
│   ├── loss.py
│   └── spatial_transformation.py
│
├── checkpoint.py             # save, load checkpoints
├── pre_processing.py         # preprocess images with OpenCV
├── custom_augment.py         # image augmentations
├── transform.py
├── dataset.py
├── flags.py                  # parse yaml to FLAG format
├── inference.py              # inference
├── metrics.py                # calculate evaluation metrics
├── scheduler.py              # learning rate scheduler
├── train.py                  # train
└── utils.py                  # utils for training

Dataset Folder

input/data/train_dataset
│
├── images/                 # input image folder
│   ├── train_00000.jpg
│   ├── train_00001.jpg
│   ├── train_00002.jpg
│   └── ...
│
├── gt.txt                  # input data
├── level.txt               # formula difficulty feature
├── source.txt              # printed output / hand written feature
└── tokens.txt              # vocabulary for training

✨ Getting Started

Installation

pip install -r requirements.txt
  • scikit_image==0.14.1
  • opencv_python==3.4.4.19
  • tqdm==4.28.1
  • torch==1.7.1+cu101
  • torchvision==0.8.2+cu101
  • scipy==1.2.0
  • numpy==1.15.4
  • pillow==8.2.0
  • tensorboardX==1.5
  • editdistance==0.5.3
  • python-dotenv==0.17.1
  • wandb==0.10.30
  • adamp==0.3.0

Download Dataset

sh filename.sh

Dataset Setting

📌 Place the training data as laid out in Dataset Folder.

📌 A single-column txt file separates records with \n; a txt file with two or more columns separates columns with \t and records with \n.

The training data consists of four files (tokens.txt, gt.txt, level.txt, source.txt) and an image folder.

Of these, tokens.txt and gt.txt are input files required for model training, while level.txt and source.txt hold per-image metadata used when splitting the dataset.

  • tokens.txt is the vocabulary file used for training; it defines the tokens the model needs.

    O
    \prod
    \downarrow
    ...
    
  • gt.txt is the file actually used for training; its columns are the image path and the LaTeX ground truth.

    train_00000.jpg	4 \times 7 = 2 8
    train_00001.jpg	a ^ { x } > q
    train_00002.jpg	8 \times 9
    ...
    
  • level.txt holds the difficulty of each formula; its columns are the image path and the difficulty level. The levels are 1 (elementary school), 2 (middle school), 3 (high school), 4 (university), and 5 (beyond university).

    train_00000.jpg	1
    train_00001.jpg	2
    train_00002.jpg	2
    ...
    
  • source.txt holds the output type of each image; its columns are the image path and the source. The values are 0 (printed) and 1 (handwritten).

    train_00000.jpg	1
    train_00001.jpg	0
    train_00002.jpg	0
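
The file formats above can be read with a few lines of Python. The following is a minimal sketch, not the repository's own loader: the helper names (`read_tokens`, `read_pairs`, `encode`) and the in-memory file contents are assumptions made for illustration, and any special tokens (start, end, padding) that the training code may add are left out.

```python
import io

def read_tokens(stream):
    """Return a token -> index mapping from a one-token-per-line file."""
    tokens = [line.rstrip("\n") for line in stream if line.strip()]
    return {tok: idx for idx, tok in enumerate(tokens)}

def read_pairs(stream):
    """Yield (image_name, value) from a two-column, tab-separated file."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only; the LaTeX column contains spaces, not tabs.
        name, value = line.split("\t", 1)
        yield name, value

def encode(latex, token_to_id):
    """Map a space-separated LaTeX string to a list of token indices."""
    return [token_to_id[tok] for tok in latex.split()]

# In-memory stand-ins for tokens.txt and gt.txt (hypothetical contents).
tokens_txt = io.StringIO("4\n7\n2\n8\n=\n\\times\n")
gt_txt = io.StringIO("train_00000.jpg\t4 \\times 7 = 2 8\n")

token_to_id = read_tokens(tokens_txt)
samples = [(name, encode(latex, token_to_id)) for name, latex in read_pairs(gt_txt)]
print(samples)
```

The same `read_pairs` helper also works for level.txt and source.txt, since both use the path-then-value, tab-separated layout.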
    

Create .env for wandb

When wandb logging is enabled, define the arguments that must be passed to wandb in a .env file.

PROJECT="[wandb project name]"
ENTITY="[wandb nickname]"
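
As a rough illustration of how these two entries might reach wandb, the sketch below turns them into keyword arguments for `wandb.init`. The helper `wandb_init_kwargs` and the example values are hypothetical; the repository's actual wiring may differ. The calls shown in the comments (`load_dotenv` from python-dotenv, `wandb.init` with `project`, `entity`, `name`) are left commented out so the snippet runs without either package installed.

```python
import os

def wandb_init_kwargs(environ=os.environ, run_name=None):
    """Collect wandb.init keyword arguments from environment variables."""
    missing = [key for key in ("PROJECT", "ENTITY") if not environ.get(key)]
    if missing:
        raise KeyError(f"missing .env entries: {', '.join(missing)}")
    kwargs = {"project": environ["PROJECT"], "entity": environ["ENTITY"]}
    if run_name:
        kwargs["name"] = run_name
    return kwargs

# Typical use (requires python-dotenv and wandb, both listed in requirements.txt):
#   from dotenv import load_dotenv
#   import wandb
#   load_dotenv()  # copies the .env entries into os.environ
#   wandb.init(**wandb_init_kwargs(run_name="sample_run"))

# Hypothetical values standing in for a real .env file.
example = wandb_init_kwargs({"PROJECT": "formula-ocr", "ENTITY": "my-team"}, "sample_run")
print(example)
```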

Config Setting

The config file used for training is a YAML file. Set it up as follows according to your training goal.

network: SATRN
input_size: # resize image
  height: 48
  width: 192
SATRN:
  encoder:
    hidden_dim: 300
    filter_dim: 1200
    layer_num: 6
    head_num: 8

    shallower_cnn: True # shallow CNN
    adaptive_gate: True # A2DPE
    conv_ff: True # locality-aware feedforward
    separable_ff: True # only if conv_ff is True
  decoder:
    src_dim: 300
    hidden_dim: 300
    filter_dim: 1200
    layer_num: 3
    head_num: 8

checkpoint: "" # load checkpoint
prefix: "./log/satrn" # log folder name

data:
  train: # train dataset file path
    - "/opt/ml/input/data/train_dataset/gt.txt"
  test: # validation dataset file path
    -
  token_paths: # token file path
    - "/opt/ml/input/data/train_dataset/tokens.txt" # 241 tokens
  dataset_proportions: # proportion of data to take from train (not test)
    - 1.0
  random_split: True # if True, random split from train files
  test_proportions: 0.2 # only if random_split is True, create validation set
  crop: True # center crop image
  rgb: 1 # 3 for color, 1 for greyscale

batch_size: 16
num_workers: 8
num_epochs: 200
print_epochs: 1 # print interval
dropout_rate: 0.1
teacher_forcing_ratio: 0.5 # teacher forcing ratio
teacher_forcing_damp: 5e-3 # teacher forcing decay (0 to turn off)
max_grad_norm: 2.0 # gradient clipping
seed: 1234
optimizer:
  optimizer: AdamP
  lr: 5e-4
  weight_decay: 1e-4
  selective_weight_decay: True # no decay in norm and bias
  is_cycle: True # cyclic learning rate scheduler
label_smoothing: 0.2 # label smoothing factor (0 to off)

patience: 30 # stop train after waiting (-1 for off)
save_best_only: True # save best model only

fp16: True # mixed precision

wandb:
  wandb: True # wandb logging
  run_name: "sample_run" # wandb project run name
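
To make the `random_split`, `test_proportions`, and `seed` settings concrete, here is a minimal sketch of a seeded random train/validation split. It is an assumption about the general technique, not the repository's implementation; the function name `random_split` and the sample list are invented for the example.

```python
import random

def random_split(samples, test_proportion=0.2, seed=1234):
    """Shuffle a copy of samples and split off a validation share."""
    if not 0.0 <= test_proportion <= 1.0:
        raise ValueError("test_proportion must be in [0, 1]")
    shuffled = list(samples)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * test_proportion)
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical sample names in the style of gt.txt.
samples = [f"train_{i:05d}.jpg" for i in range(10)]
train, val = random_split(samples)
print(len(train), len(val))
```

Fixing the seed means repeated runs with the same config see the same validation set, which keeps validation scores comparable across experiments.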

⏩ Usage

Train

python train.py [--config_file]
  • --config_file: path to the config file

Inference

python inference.py [--checkpoint] [--max_sequence] [--batch_size] [--file_path] [--output_dir]
  • --checkpoint: path to the checkpoint file
  • --max_sequence: maximum sequence length during inference
  • --batch_size: batch size
  • --file_path: path to the test dataset
  • --output_dir: directory where inference results are saved
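
For readers who want to see how such flags are usually declared, the sketch below builds an `argparse` parser with the five options listed above. Only the flag names come from this README; the defaults, the `required` settings, and the example paths are hypothetical and may not match inference.py.

```python
import argparse

def build_parser():
    """Build a parser exposing the five inference flags listed above."""
    parser = argparse.ArgumentParser(description="Formula recognition inference (sketch)")
    parser.add_argument("--checkpoint", required=True, help="path to the checkpoint file")
    parser.add_argument("--max_sequence", type=int, default=230,
                        help="maximum sequence length during inference (hypothetical default)")
    parser.add_argument("--batch_size", type=int, default=8,
                        help="batch size (hypothetical default)")
    parser.add_argument("--file_path", required=True, help="path to the test dataset")
    parser.add_argument("--output_dir", default="./result",
                        help="directory where inference results are saved (hypothetical default)")
    return parser

# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args([
    "--checkpoint", "./log/satrn/best.pth",                 # hypothetical path
    "--file_path", "./input/data/eval_dataset/input.txt",   # hypothetical path
    "--max_sequence", "200",
])
print(args.checkpoint, args.max_sequence, args.batch_size, args.output_dir)
```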

🚀 Demo

[Image: demo]

📖 References

  • On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention, Lee et al., 2019
  • Bag of Tricks for Image Classification with Convolutional Neural Networks, He et al., 2018
  • Averaging Weights Leads to Wider Optima and Better Generalization, Izmailov et al., 2018
  • CSTR: Revisiting Classification Perspective on Scene Text Recognition, Cai et al., 2021
  • Improvement of End-to-End Offline Handwritten Mathematical Expression Recognition by Weakly Supervised Learning, Truong et al., 2020
  • ELECTRA: Pre-training Text Encoders As Discriminators Rather Than Generators, Clark et al., 2020
  • SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition, Qiao et al., 2020
  • Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition, Fang et al., 2021
  • Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Wu et al., 2016

👩‍💻 Contributors

김종영 민지원 박소현 배수민 오세민 최재혁

✅ License

Distributed under the MIT License. See LICENSE for more information.
