Thai National Document Optical Character Recognition (THND OCR)

Tesseract OCR models, trained and fine-tuned to read Thai national documents set in the TH Sarabun national font. See README.md for a description of my process.

Part I : https://github.com/copninich/TH-National-Document-OCR-Part-I
Part II : https://github.com/copninich/TH-National-Document-OCR-Part-II
Medium : https://medium.com/@copninich/5a673cc8a686

0. Information

0.1 Tool

Tesseract : https://github.com/tesseract-ocr/tesseract

0.2 Datasets

PyThaiNLP (Prachathai) : https://github.com/PyThaiNLP/prachathai-67k
PyThaiNLP (ThaiGov V2 Corpus) : https://github.com/PyThaiNLP/thaigov-v2-corpus
PyThaiNLP (ThaiGov Archive corpus) : https://github.com/PyThaiNLP/thaigov-archive-corpus
Thaisum : https://github.com/nakhunchumpolsathien/ThaiSum
TR-TPBS : https://github.com/nakhunchumpolsathien/TR-TPBS

0.3 Project information

Ai Builders 2021
Kampanart Chaimooltan

01.Performance tested

I measured the character error rate (CER) and the string lengths of the OCR output and the correct text, and wrote the test results to a .csv file.
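As a rough illustration of this measurement, the sketch below computes CER as the Levenshtein edit distance divided by the length of the correct text, then writes one row per sample to CSV. It is not the script used in this project. The function names and the column names (`ocr_length`, `correct_length`, `cer`) are assumptions made for the example, and each Unicode code point counts as one character, so a Thai vowel or tone mark is counted separately from its base consonant.

```python
import csv
import io


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def cer(ocr_text, correct_text):
    """Character error rate relative to the correct text."""
    if not correct_text:
        return 0.0 if not ocr_text else 1.0
    return levenshtein(ocr_text, correct_text) / len(correct_text)


def write_report(pairs, handle):
    """Write (ocr_text, correct_text) pairs as CSV rows (hypothetical columns)."""
    writer = csv.writer(handle)
    writer.writerow(["ocr_length", "correct_length", "cer"])
    for ocr_text, correct_text in pairs:
        writer.writerow([len(ocr_text), len(correct_text),
                         f"{cer(ocr_text, correct_text):.4f}"])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_report([("abxd", "abcd")], buffer)
    print(buffer.getvalue())
```

Passing an open file instead of the `StringIO` buffer produces a report file on disk.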

02.Generated datasets

I used the PIL library to generate the datasets. In addition, I rendered the text in the TH Sarabun font at 72 px.

Link : https://www.kaggle.com/copninich/thaienglish-character-in-th-sarabun-font

03.Tested trained and fine-tuned Tesseract (default langdata_lstm)

Requirements

langdata_lstm
tesseract v.4
tessdata_best

Download the files to your folder and extract them : https://drive.google.com/drive/folders/1ABo7ooO62Tb03RR_VvkdshRVG9vz23sl?usp=sharing
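Once the files are extracted, a model can be tried from Python by calling the `tesseract` command-line tool. The sketch below assumes the tool is on your PATH and that the extracted folder contains a `tha.traineddata` file. The folder name `tessdata_best`, the image name, and the choice of page segmentation mode 7 (single text line) are assumptions for illustration.

```python
import subprocess


def build_command(image_path, tessdata_dir, lang="tha", psm=7):
    """Build the tesseract CLI call; 'stdout' sends the text to standard output."""
    return ["tesseract", str(image_path), "stdout",
            "--tessdata-dir", str(tessdata_dir),
            "-l", lang,
            "--psm", str(psm)]


def run_ocr(image_path, tessdata_dir, lang="tha", psm=7):
    """Run tesseract on one image and return the recognized text."""
    result = subprocess.run(build_command(image_path, tessdata_dir, lang, psm),
                            capture_output=True, encoding="utf-8", check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Assumed paths; replace with your image and extracted model folder.
    print(run_ocr("sample_0001.png", "tessdata_best"))
```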

04.Trained and fine-tuned

Run the notebook script_basic.ipynb or script_config_error.ipynb.

Requirements

langdata_lstm
tesseract v.4
tessdata_best

A custom tha.training_text built from my own datasets (more than 1.9 M sentences)
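One way such a training text could be assembled from corpus sentences is sketched below: whitespace is normalized, empty and duplicate lines are dropped, and one sentence is written per line. This is an assumed preprocessing step, not the project's actual pipeline, and the function names are hypothetical.

```python
def build_training_text(sentences):
    """Normalize whitespace and drop empty or repeated sentences, keeping order."""
    seen = set()
    lines = []
    for sentence in sentences:
        line = " ".join(sentence.split())
        if line and line not in seen:
            seen.add(line)
            lines.append(line)
    return lines


def write_training_text(sentences, path="tha.training_text"):
    """Write one cleaned sentence per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in build_training_text(sentences):
            handle.write(line + "\n")


if __name__ == "__main__":
    # Tiny stand-in for sentences gathered from the corpora listed in 0.2.
    write_training_text(["  ก ข  ", "ก ข", "", "abc"])
```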

05.Performance tested

report_performace_final.csv
