copninixh/TH-National-Document-OCR

Thai National Document Optical Character Recognition (THND OCR)

Tesseract OCR tools for reading Thai national documents, trained and fine-tuned on the TH Sarabun national font. Read README.md for details of my process.

0. Information

0.1 Tool

0.2 Datasets

0.3 Project information

  • Ai Builders 2021
  • Kampanart Chaimooltan

01.Performance tested

I measured performance with the character error rate (CER) and the string lengths of the OCR output and the correct text, and saved the test results to a .csv file.
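The measurement described above can be sketched as follows. This is a minimal illustration, assuming CER is the Levenshtein edit distance divided by the length of the correct text; the function names and the CSV column names are hypothetical and are not taken from the repository.

```python
import csv


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def cer(ocr_text, correct_text):
    """Character error rate: edit distance / length of the correct text."""
    if not correct_text:
        raise ValueError("correct_text must not be empty")
    return levenshtein(ocr_text, correct_text) / len(correct_text)


def report_row(ocr_text, correct_text):
    """One report row; the column names are assumptions for illustration."""
    return {
        "ocr_text": ocr_text,
        "correct_text": correct_text,
        "ocr_length": len(ocr_text),
        "correct_length": len(correct_text),
        "cer": cer(ocr_text, correct_text),
    }


def write_report(rows, path):
    """Write report rows to a UTF-8 .csv file."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Python measures string length in Unicode code points, so Thai vowel signs and tone marks each count as one character in this sketch.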

02.Generated datasets

I used the PIL library to create the datasets, rendering text in the TH Sarabun font at 72 px.
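A minimal sketch of rendering one text sample with PIL is shown below. It assumes the Pillow package is installed and that a TH Sarabun `.ttf` file is available locally; the function names, the margin, and the font file name in the usage note are assumptions, not the repository's actual generation script.

```python
from PIL import Image, ImageDraw, ImageFont, ImageOps

FONT_PX = 72  # font size stated in this README


def load_font(font_path, size=FONT_PX):
    """Load a TrueType font; font_path is a hypothetical local path."""
    return ImageFont.truetype(font_path, size)


def render_text(text, font, font_px=FONT_PX, margin=8):
    """Render text as a black-on-white grayscale image cropped to the ink."""
    # Draw white text on an oversized black canvas so the ink can be located.
    canvas = Image.new("L", (font_px * (len(text) + 2), font_px * 3), 0)
    ImageDraw.Draw(canvas).text((font_px, font_px), text, font=font, fill=255)
    box = canvas.getbbox()  # bounding box of all non-black pixels
    if box is None:
        raise ValueError("text produced no visible pixels")
    left, top, right, bottom = box
    # Pad the ink box by the margin, staying inside the canvas.
    padded = (max(left - margin, 0), max(top - margin, 0),
              min(right + margin, canvas.width),
              min(bottom + margin, canvas.height))
    # Invert so the result is black text on a white background.
    return ImageOps.invert(canvas.crop(padded))
```

For example, `render_text("สวัสดี", load_font("THSarabunNew.ttf")).save("sample.png")` would write one sample image, where `THSarabunNew.ttf` stands in for wherever the font file actually lives.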

Link : https://www.kaggle.com/copninich/thaienglish-character-in-th-sarabun-font

03.Tested trained and fine-tuned Tesseract (default langdata_lstm)

Requirements

  1. langdata_lstm
  2. tesseract v.4
  3. tessdata_best

Download the files to your folder and extract them: https://drive.google.com/drive/folders/1ABo7ooO62Tb03RR_VvkdshRVG9vz23sl?usp=sharing
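Once the files are extracted, a test run could look like the sketch below. It assumes the `tesseract` command-line tool is on the PATH and that the extracted folder contains a `tha.traineddata` file; both the folder layout and the helper names are assumptions. The page segmentation mode 6 (a single uniform block of text) is only an illustrative default.

```python
import subprocess


def tesseract_command(image_path, tessdata_dir, lang="tha", psm=6):
    """Build the tesseract CLI call that prints recognized text to stdout."""
    return [
        "tesseract", str(image_path), "stdout",
        "--tessdata-dir", str(tessdata_dir),  # folder holding <lang>.traineddata
        "-l", lang,
        "--psm", str(psm),
    ]


def run_ocr(image_path, tessdata_dir, lang="tha", psm=6):
    """Run tesseract and return the recognized text as a string."""
    result = subprocess.run(
        tesseract_command(image_path, tessdata_dir, lang, psm),
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout
```

Calling `run_ocr("page.png", "extracted_folder")` would then return the recognized text, with both arguments being placeholder paths.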

04.Trained and fine-tuned

Run the notebook script_basic.ipynb or script_config_error.ipynb.

Requirements

  1. langdata_lstm
  2. tesseract v.4
  3. tessdata_best

I customized tha.training_text with my own dataset of more than 1.9 M sentences.
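As a rough illustration of preparing such a file, the sketch below assembles a training text with one sentence per line. The cleaning steps (collapsing whitespace, dropping blank lines and duplicates) are assumptions for illustration and are not confirmed as the process used to build the actual tha.training_text.

```python
def build_training_text(sentences):
    """Normalize whitespace, drop blanks and duplicates, keep first-seen order."""
    seen = set()
    lines = []
    for sentence in sentences:
        line = " ".join(sentence.split())  # collapse runs of whitespace
        if line and line not in seen:
            seen.add(line)
            lines.append(line)
    return lines


def write_training_text(lines, path="tha.training_text"):
    """Write one sentence per line as UTF-8 with Unix line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(lines) + "\n")
```

Keeping the first-seen order makes the output reproducible for a given input list; shuffling, if wanted, can be applied separately before writing.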

05.Performance tested

report_performace_final.csv
