This project is a JVM client for AXA Group's Parsr project.
Grab via Maven:
<dependency>
<groupId>com.github.devcsrj</groupId>
<artifactId>docparsr</artifactId>
<version>(version)</version>
</dependency>
or Gradle:
implementation("com.github.devcsrj:docparsr:$version")
Assuming you have the API server running, you can communicate with it using:
val uri = URI.create("http://localhost:3001")
val parser = DocParsr.create(uri)
Then, submit your file with:
val file = File("hello.pdf") // your pdf or image file
val config = Configuration() // or 'parser.getDefaultConfig()'
val job = parser.newParsingJob(file, config)
At this point, the job object offers either a synchronous method:
val result = job.execute()
...or an asynchronous method:
val callback = object : Callback {
    // Kotlin requires 'override' on members that implement a supertype
    override fun onFailure(job: ParsingJob, e: Exception) {}
    override fun onProgress(job: ParsingJob, progress: Progress) {}
    override fun onSuccess(job: ParsingJob, result: ParsingResult) {}
}
job.enqueue(callback)
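If you would rather wait on the asynchronous call or chain it with other work, the callback can be adapted to a CompletableFuture. The sketch below is self-contained and does not use the library: Progress, ParsingResult, Callback and ParsingJob are hypothetical stand-ins that only mirror the shapes shown above, and toFuture and FakeJob are names made up for this illustration.

```kotlin
import java.util.concurrent.CompletableFuture

// Hypothetical stand-ins mirroring the shapes used in this README.
// In a real project these types come from the library, not from this file.
class Progress(val percent: Int)
class ParsingResult(val id: String)

interface Callback {
    fun onFailure(job: ParsingJob, e: Exception)
    fun onProgress(job: ParsingJob, progress: Progress)
    fun onSuccess(job: ParsingJob, result: ParsingResult)
}

interface ParsingJob {
    fun enqueue(callback: Callback)
}

// Adapts the callback style to a future that callers can block on or chain.
fun ParsingJob.toFuture(): CompletableFuture<ParsingResult> {
    val future = CompletableFuture<ParsingResult>()
    enqueue(object : Callback {
        override fun onFailure(job: ParsingJob, e: Exception) {
            future.completeExceptionally(e)
        }

        override fun onProgress(job: ParsingJob, progress: Progress) {
            // Progress updates do not complete the future.
        }

        override fun onSuccess(job: ParsingJob, result: ParsingResult) {
            future.complete(result)
        }
    })
    return future
}

// A fake job that succeeds immediately, used only to exercise the adapter.
class FakeJob : ParsingJob {
    override fun enqueue(callback: Callback) {
        callback.onProgress(this, Progress(50))
        callback.onSuccess(this, ParsingResult("demo"))
    }
}

fun main() {
    val result = FakeJob().toFuture().get()
    println(result.id)
}
```

With the real library, the same adapter body should carry over as long as the callback methods match the signatures shown in this README.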
Regardless of the approach you choose, you end up with a ParsingResult. You can then access the various outputs generated by the server with:
result.source(Text).use {
    // copy the InputStream
}
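As one way to "copy the InputStream", the sketch below writes a stream to a file using only the JDK. It assumes that result.source(...) hands back a java.io.InputStream, as the comment above suggests; saveTo is a hypothetical helper, and a ByteArrayInputStream stands in for the server output so the example runs on its own.

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStream
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption

// Copies the stream to `target` and closes the stream afterwards.
// Returns the number of bytes written.
fun saveTo(source: InputStream, target: Path): Long =
    source.use { input ->
        Files.copy(input, target, StandardCopyOption.REPLACE_EXISTING)
    }

fun main() {
    // Stand-in for the stream returned by result.source(Text).
    val stream = ByteArrayInputStream("hello".toByteArray())
    val target = Files.createTempFile("parsr-output", ".txt")
    val written = saveTo(stream, target)
    println("$written bytes written to $target")
}
```

REPLACE_EXISTING is needed here because createTempFile already creates the target file; drop it if overwriting should be an error.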
If you are instead interested in the JSON schema, this library provides a visitor-based API:
val visitor = object : DocumentVisitor {
    // override methods
}
val document = Document.from(result)
document.accept(visitor)
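For readers new to the pattern, here is a minimal, self-contained sketch of how a visitor walks a document tree. It is not the library's API: Node, NodeVisitor, Page, Paragraph and TextCollector are hypothetical types invented for this illustration, and the methods on the actual DocumentVisitor may look different.

```kotlin
// Hypothetical element types; the library's actual document model may differ.
interface Node {
    fun accept(visitor: NodeVisitor)
}

// Default no-op methods let a visitor override only what it cares about.
interface NodeVisitor {
    fun visitPage(page: Page) {}
    fun visitParagraph(paragraph: Paragraph) {}
}

class Paragraph(val text: String) : Node {
    override fun accept(visitor: NodeVisitor) = visitor.visitParagraph(this)
}

class Page(val number: Int, val children: List<Node>) : Node {
    override fun accept(visitor: NodeVisitor) {
        visitor.visitPage(this)
        // Walk the children in document order.
        children.forEach { it.accept(visitor) }
    }
}

// Collects paragraph text in document order and ignores everything else.
class TextCollector : NodeVisitor {
    val lines = mutableListOf<String>()

    override fun visitParagraph(paragraph: Paragraph) {
        lines += paragraph.text
    }
}

fun main() {
    val page = Page(1, listOf(Paragraph("Hello"), Paragraph("World")))
    val collector = TextCollector()
    page.accept(collector)
    println(collector.lines.joinToString(" "))
}
```

The traversal lives in the element classes, so a visitor only contains the logic for the elements it needs.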
Like any other Gradle-based project, you can build the artifacts with:
$ ./gradlew build
This project also includes functional tests, which run against an actual Parsr server. Assuming you have Docker installed, run the tests with:
$ ./gradlew functionalTest
When I was working on the Klerk project, I realized how difficult and time-consuming it is to scrape data from PDF documents. My approach then also involved heavy witchcraft with Tesseract, because typical PDF-to-text libraries just don't cut it (especially on skewed or garbled sections).
The Parsr project seems to also tackle the challenges I faced, and more. To keep the data extraction out of my Beam pipeline, I wrote this library.