Parsr

This project is a JVM client for AXA Group's Parsr project.

Download

Grab via Maven:

<dependency>
    <groupId>com.github.devcsrj</groupId>
    <artifactId>docparsr</artifactId>
    <version>(version)</version>
</dependency>

or Gradle:

implementation("com.github.devcsrj:docparsr:$version")

Usage
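
If you don't have a server yet, Parsr is typically started with Docker (the image name and port below follow the upstream Parsr documentation; verify against it):

$ docker run -p 3001:3001 axarev/parsr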

Assuming you have the API server running, you can communicate with it using:

val uri = URI.create("http://localhost:3001")
val parser = DocParsr.create(uri)

Then, submit your file with:

val file = File("hello.pdf")    // your pdf or image file
val config = Configuration()    // or, 'parser.getDefaultConfig()`
val job = parser.newParsingJob(file, config)

At this point, the job object offers either a synchronous method:

val result = job.execute()

...or an asynchronous one:

val callback = object : Callback {
    override fun onFailure(job: ParsingJob, e: Exception) {}
    override fun onProgress(job: ParsingJob, progress: Progress) {}
    override fun onSuccess(job: ParsingJob, result: ParsingResult) {}
}
job.enqueue(callback)
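
For example, to block until the asynchronous job settles, you can pair the callback with a latch. This is a minimal sketch; only Callback, ParsingJob, Progress, ParsingResult, and enqueue come from this library, the rest is plain JDK:

import java.util.concurrent.CountDownLatch

val latch = CountDownLatch(1)
var parsed: ParsingResult? = null

job.enqueue(object : Callback {
    override fun onFailure(job: ParsingJob, e: Exception) {
        e.printStackTrace()             // inspect why parsing failed
        latch.countDown()
    }
    override fun onProgress(job: ParsingJob, progress: Progress) {
        println("Progress: $progress")  // e.g. report to a progress bar
    }
    override fun onSuccess(job: ParsingJob, result: ParsingResult) {
        parsed = result                 // stash the result for the waiting thread
        latch.countDown()
    }
})
latch.await()                           // blocks until onSuccess or onFailure fires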

Regardless of the approach you choose, you end up with a ParsingResult. You can then access the various outputs generated by the server with:

result.source(Text).use {
   // copy the InputStream
}
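
For instance, here is a minimal sketch that persists the plain-text output to disk (the target file name is arbitrary):

result.source(Text).use { input ->
    File("hello.txt").outputStream().use { output ->
        input.copyTo(output)    // stream the server's output straight to disk
    }
}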

If you are instead interested in the JSON schema, this library provides a Visitor-based API:

val visitor = object: DocumentVisitor {
   // override methods
}
val document = Document.from(result)
document.accept(visitor) 

Building

Like any other Gradle-based project, you can build the artifacts with:

$ ./gradlew build

This project also includes functional tests, which run against an actual Parsr server. Assuming you have Docker installed, run the tests with:

$ ./gradlew functionalTest

Future work

Motivation

When I was working on the Klerk project, I realized how difficult and time-consuming it is to scrape data from PDF documents. My approach at the time also involved heavy witchcraft with Tesseract, because typical PDF-to-text libraries just don't cut it (especially on skewed or garbled sections).

The Parsr project seems to tackle these same challenges, and more. To keep the data extraction out of my Beam pipeline, I wrote this library.