Skip to content

Part II Tesseract OCR Tools for reading Thai National Document using TH Sarabun National Font for training and finetune. Read README.md to see my step to developing.

Notifications You must be signed in to change notification settings

copninixh/TH-National-Document-OCR-Part-II

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TH National Document OCR (THND OCR)

DOI

Tesseract OCR Tools for read Thai National Document using TH Sarabun National Font for training and finetune. Read README.md to see my step to developing.

Part I : https://github.com/copninich/TH-National-Document-OCR-Part-I

Part II: https://github.com/copninich/TH-National-Document-OCR-Part-II

Medium : https://copninich.medium.com/thnd-ocr-the-series-version-%E0%B8%A1%E0%B9%89%E0%B8%A7%E0%B8%99%E0%B9%80%E0%B8%94%E0%B8%B5%E0%B8%A2%E0%B8%A7%E0%B8%88%E0%B8%9A%E0%B8%9A%E0%B8%9A%E0%B8%9A-%E0%B8%AD%E0%B8%B0%E0%B9%80%E0%B8%AB%E0%B9%89%E0%B8%A2%E0%B8%A2-5a673cc8a686

0. Information

0.1 Tools

0.2 Dataset

0.3 Project information

  • Ai Builders
  • Kampanart Chaimooltan

01.Performance testing

Using Character Errorate and leght string (OCR & Correct Text) and output result testing (.csv file)

02.Generate dataset

Using PIL library and using TH Sarabun formart font 72 px to create dataset.

03.Testing train and fine tune tesserct (default langdata_lstm)

Requirements langdata_lstm , tesseract 4 , tessdata_best

Load file to your folder and extract : https://drive.google.com/drive/folders/1ABo7ooO62Tb03RR_VvkdshRVG9vz23sl?usp=sharing

04.Train and fine tune

Run scrript script_basic.ipynb or script_config_error.ipynb Requirements langdata_lstm , tesseract 4 , tessdata_best

Custom tha.training_text with my own dataset more than 60k sentences

Load file to your folder and extract : https://drive.google.com/drive/folders/1ABo7ooO62Tb03RR_VvkdshRVG9vz23sl?usp=sharing

05.Performance testing

report_performace_final.csv

About

Part II Tesseract OCR Tools for reading Thai National Document using TH Sarabun National Font for training and finetune. Read README.md to see my step to developing.

Resources

Stars

Watchers

Forks

Packages