Automatic garbage extraction from OCR'd text

License

Unknown (LICENSE), MIT (LICENSE.md)
benmarwick/rmgarbage

rmgarbage: automatic removal of garbage strings in OCR text

R build status Lifecycle: experimental

The goal of rmgarbage is to remove garbage strings produced by OCR engines. It contains functions that implement the methods described by:

The code was inspired by Python code at https://github.com/foodoh/rmgarbage and JavaScript code at https://github.com/Amoki/rmgarbage.

Installation

You can install rmgarbage from GitHub with:

# install.packages("remotes")  # run first if remotes is not yet installed
remotes::install_github("benmarwick/rmgarbage")

Example

This is a basic example that shows how to identify bad OCR output.

library(rmgarbage)

Here is an example of the output for good OCR:

good_ocr <- "This document was scanned perfectly"
good_ocr_split <- strsplit(good_ocr, " ")[[1]]
sapply(good_ocr_split, rmgarbage)
#>      This  document       was   scanned perfectly 
#>     FALSE     FALSE     FALSE     FALSE     FALSE

And here is an example of the output for bad OCR:

bad_ocr <- "This 3ccm@nt w&s scnnnnd not pe&;c1!y"
bad_ocr_split <- strsplit(bad_ocr, " ")[[1]]
sapply(bad_ocr_split, rmgarbage)
#>     This  3ccm@nt      w&s  scnnnnd      not pe&;c1!y 
#>    FALSE     TRUE     TRUE     TRUE    FALSE     TRUE
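
The flags above only mark which tokens look like garbage. To actually drop those tokens, one possible approach is sketched below. Note that `drop_garbage()` is a hypothetical helper written for this illustration, not a function exported by rmgarbage, and it assumes the predicate returns a single `TRUE`/`FALSE` per token, as `rmgarbage()` does in the examples above.

```r
# Hypothetical helper (not part of rmgarbage): split text on spaces, flag each
# token with a garbage predicate, and rebuild the text from unflagged tokens.
drop_garbage <- function(text, is_garbage) {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  # vapply() enforces that the predicate returns exactly one logical per token
  flagged <- vapply(tokens, is_garbage, logical(1))
  paste(tokens[!flagged], collapse = " ")
}

# With the package installed, pass its predicate (assumed usage):
# drop_garbage("This 3ccm@nt w&s scnnnnd not pe&;c1!y", rmgarbage::rmgarbage)
```

Given the flags shown for the bad OCR example, this call would keep only "This" and "not". Passing the predicate as an argument keeps the helper independent of the package, so a different rule can be swapped in for testing.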

Contributing

If you would like to contribute to this project, please start by reading our Guide to Contributing. Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
