GitHub - ayushbits/pe-ocr-sanskrit: Source and Data of our EMNLP Paper 'A Benchmark and Dataset for Post-OCR text correction in Sanskrit'

A Benchmark and Dataset for Post-OCR text correction in Sanskrit

Ayush Maheshwari, Nikhil Singh, Amrith Krishna and Ganesh Ramakrishnan
Findings of EMNLP 2022

Post-edited data

*_devnagari.csv (train_devnagari.csv, val_devnagari.csv and test_devnagari.csv) are the train, validation and test splits of the manually post-edited OCR data.
ood-test.csv is the out-of-domain test set of 500 sentences described in Section 4.1 of the paper.
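Before training, it can help to check what each split actually contains. The following is a minimal sketch that reports the header row and the number of data rows of a CSV file. It assumes the files are UTF-8 encoded and that the first row is a header; neither is stated in this README, and summarize_csv is a hypothetical helper, not part of the repository.

```python
import csv


def summarize_csv(path):
    """Return (header, number_of_data_rows) for a CSV file.

    Assumption: the file is UTF-8 and its first row is a header.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])        # empty list if the file is empty
        n_rows = sum(1 for _ in reader)  # count the remaining rows
    return header, n_rows


# Example usage (file names taken from the repository root):
# for name in ["train_devnagari.csv", "val_devnagari.csv",
#              "test_devnagari.csv", "ood-test.csv"]:
#     print(name, summarize_csv(name))
```

If the first row turns out to be data rather than a header, the reported row count is one lower than the true number of sentences.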

dev-transliterate-scripts/

Contains scripts to transliterate words from SLP1 to Devanagari and vice versa.

OCR images and their annotation

The OCR-Images-Annotation/ folder contains the books that make up the test set of 500 images, together with their corresponding ground truth.
BHS refers to Brahmasutra Bhashyam
GG refers to Grahalaghava of Ganesh Daivajna
GOS refers to Goladhyaya

Training Scripts

Training scripts are in the train-scripts/ directory.

Calculate CER, WER

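The preds/ folder contains the predictions and ground truth (GT) for the 500 sentences in the out-of-domain test set.
To calculate the metrics, install fastwer and run the script with either cer or wer as the argument:
pip install fastwer
python word_count_cer.py <cer/wer>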
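For readers who want to see what the two metrics measure, here is a minimal pure-Python sketch of corpus-level CER and WER based on Levenshtein distance. It is not the repository's word_count_cer.py and does not use fastwer; edit_distance and error_rate are hypothetical names, and the choice to report percentages and to split words on whitespace is an assumption of this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]                    # cost of deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, char_level=True):
    """Corpus-level CER (char_level=True) or WER (char_level=False), in percent.

    Assumption: words are whitespace-separated; characters are Unicode code points.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r = list(ref) if char_level else ref.split()
        h = list(hyp) if char_level else hyp.split()
        total_edits += edit_distance(r, h)
        total_len += len(r)
    return 100.0 * total_edits / total_len if total_len else 0.0


if __name__ == "__main__":
    refs = ["abc def"]
    hyps = ["abd def"]
    print("CER:", error_rate(refs, hyps, char_level=True))
    print("WER:", error_rate(refs, hyps, char_level=False))
```

Because this sketch counts Unicode code points, a Devanagari vowel sign or virama counts as its own character; scores may therefore differ from tools that count bytes or grapheme clusters.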

Citation:

@inproceedings{maheshwari2022benchmark,
  title={A Benchmark and Dataset for Post-OCR text correction in Sanskrit},
  author={Maheshwari, Ayush and Singh, Nikhil and Krishna, Amrith and Ramakrishnan, Ganesh},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2022},
  pages={6258--6265},
  year={2022}
}
