rmgarbage: automatic removal of garbage strings in OCR text

The goal of rmgarbage is to identify and remove garbage strings produced by OCR engines. It contains functions that implement the methods described by:

Taghva et al. (2001) “Automatic Removal of Garbage Strings in OCR Text: An implementation” http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.81.8901
Yang Cai (2008) “OCR Output Enhancement” https://ladyissy.github.io/OCR/

The code was inspired by Python code at https://github.com/foodoh/rmgarbage and JavaScript code at https://github.com/Amoki/rmgarbage.
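To give a feel for what a rule-based garbage detector looks like, here is a minimal sketch with three heuristics in the spirit of the papers cited above. The function name `is_garbage_sketch`, the choice of rules, and the thresholds are assumptions made for illustration; this is not the package's implementation and will not reproduce its results on every token.

```r
# Illustrative sketch only: a few heuristic rules in the spirit of
# Taghva et al. (2001). The function name and thresholds are assumptions,
# not the code used by the rmgarbage package.
is_garbage_sketch <- function(x, max_length = 40) {
  chars <- strsplit(x, "")[[1]]
  n_alnum <- sum(grepl("[[:alnum:]]", chars))
  n_punct <- sum(grepl("[[:punct:]]", chars))

  # Rule A: the string is unusually long
  too_long <- length(chars) > max_length

  # Rule B: more punctuation than alphanumeric characters
  mostly_punct <- n_punct > n_alnum

  # Rule C: the same character four or more times in a row
  long_repeat <- grepl("(.)\\1{3,}", x, perl = TRUE)

  too_long || mostly_punct || long_repeat
}

# FALSE for "document", TRUE for "scnnnnd" and "&;!a"
sapply(c("document", "scnnnnd", "&;!a"), is_garbage_sketch)
```

A real detector combines more rules than this (for example, checks on case patterns and vowel/consonant balance), so treat the sketch as a reading aid rather than a substitute for the package functions.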

Installation

You can install rmgarbage from GitHub with:

remotes::install_github("benmarwick/rmgarbage")

Example

This is a basic example that shows how to identify bad OCR output.

library(rmgarbage)

Here is an example of the output on good OCR text:

good_ocr <- "This document was scanned perfectly"
good_ocr_split <- strsplit(good_ocr, " ")[[1]]
sapply(good_ocr_split, rmgarbage)
#>      This  document       was   scanned perfectly 
#>     FALSE     FALSE     FALSE     FALSE     FALSE

And here is an example of the output on bad OCR text:

bad_ocr <- "This 3ccm@nt w&s scnnnnd not pe&;c1!y"
bad_ocr_split <- strsplit(bad_ocr, " ")[[1]]
sapply(bad_ocr_split, rmgarbage)
#>     This  3ccm@nt      w&s  scnnnnd      not pe&;c1!y 
#>    FALSE     TRUE     TRUE     TRUE    FALSE     TRUE
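Once each token has been flagged, the garbage tokens can be dropped and the remaining text reassembled. The helper below is a small sketch of that step; `drop_garbage` is a hypothetical name, not a function exported by the package, and it assumes the predicate returns a single `TRUE` or `FALSE` per token, as the `sapply()` output above suggests `rmgarbage()` does.

```r
# Hypothetical helper (not part of rmgarbage): keep only the tokens
# that a predicate does not flag as garbage, then rebuild the text.
drop_garbage <- function(text, is_garbage) {
  tokens <- strsplit(text, " ")[[1]]
  # Assumes is_garbage() returns exactly one logical value per token
  flagged <- vapply(tokens, is_garbage, logical(1), USE.NAMES = FALSE)
  paste(tokens[!flagged], collapse = " ")
}

# With the package loaded, pass rmgarbage() as the predicate:
# drop_garbage("This 3ccm@nt w&s scnnnnd not pe&;c1!y", rmgarbage)
```

Given the flags shown in the output above, that call would keep only "This" and "not". Passing the predicate as an argument keeps the helper independent of any particular detector.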

Contributing

If you would like to contribute to this project, please start by reading our Guide to Contributing. Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
