Parsr

This project is a JVM client for AXA Group's Parsr project.

Download

Grab via Maven:

<dependency>
    <groupId>com.github.devcsrj</groupId>
    <artifactId>docparsr</artifactId>
    <version>(version)</version>
</dependency>

or Gradle:

implementation("com.github.devcsrj:docparsr:$version")

Usage
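
If you don't have a server yet, Parsr is typically started with Docker (the image name and port below follow the upstream Parsr documentation; verify against it):

$ docker run -p 3001:3001 axarev/parsr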

Assuming you have the API server running, you can communicate with it using:

val uri = URI.create("http://localhost:3001")
val parser = DocParsr.create(uri)

Then, submit your file with:

val file = File("hello.pdf")    // your pdf or image file
val config = Configuration()    // or, 'parser.getDefaultConfig()`
val job = parser.newParsingJob(file, config)

At this point, the job object offers either a synchronous method:

val result = job.execute()

...or an asynchronous one:

val callback = object : Callback {
    override fun onFailure(job: ParsingJob, e: Exception) {}
    override fun onProgress(job: ParsingJob, progress: Progress) {}
    override fun onSuccess(job: ParsingJob, result: ParsingResult) {}
}
job.enqueue(callback)
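
For example, to block until the asynchronous job settles, you can pair the callback with a latch. This is a minimal sketch; only Callback, ParsingJob, Progress, ParsingResult, and enqueue come from this library, the rest is plain JDK:

import java.util.concurrent.CountDownLatch

val latch = CountDownLatch(1)
var parsed: ParsingResult? = null

job.enqueue(object : Callback {
    override fun onFailure(job: ParsingJob, e: Exception) {
        e.printStackTrace()             // inspect why parsing failed
        latch.countDown()
    }
    override fun onProgress(job: ParsingJob, progress: Progress) {
        println("Progress: $progress")  // e.g. report to a progress bar
    }
    override fun onSuccess(job: ParsingJob, result: ParsingResult) {
        parsed = result                 // stash the result for the waiting thread
        latch.countDown()
    }
})
latch.await()                           // blocks until onSuccess or onFailure fires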

Regardless of the approach you choose, you end up with a ParsingResult. You can then access the various outputs generated by the server with:

result.source(Text).use {
   // copy the InputStream
}
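
For instance, here is a minimal sketch that persists the plain-text output to disk (the target file name is arbitrary):

result.source(Text).use { input ->
    File("hello.txt").outputStream().use { output ->
        input.copyTo(output)    // stream the server's output straight to disk
    }
}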

If you are instead interested in the JSON schema, this library provides a Visitor-based API:

val visitor = object: DocumentVisitor {
   // override methods
}
val document = Document.from(result)
document.accept(visitor) 

Building

Like any other Gradle-based project, you can build the artifacts with:

$ ./gradlew build

This project also includes functional tests, which run against an actual Parsr server. Assuming you have Docker installed, run the tests with:

$ ./gradlew functionalTest

Future work

Motivation

When I was working on the Klerk project, I realized how difficult and time-consuming it is to scrape data from PDF documents. My approach at the time also involved heavy witchcraft with Tesseract, because typical PDF-to-text libraries just don't cut it (especially on skewed or garbled sections).

The Parsr project seems to tackle these same challenges, and more. To keep the data extraction out of my Beam pipeline, I wrote this library.