This project provides some tools that help the user to extract structured information from PDF documents. Currently, the program is able to export them to HTML.
PDFJuice depends on Apache PDFBox to read PDF documents.
There are two functionalities available so far:
- Extract tables.
- Extract slides.
This project is a spin-off of Courseminer.
Compile with dependencies:
mvn compile package assembly:single
Output files are already available in the repository. They will be overwritten.
java -cp target/PDFJuice-1.3-SNAPSHOT-jar-with-dependencies.jar org.sj.tools.pdfjuice.ExampleGenerator
See this tutorial.
java -cp target/PDFJuice-1.3-SNAPSHOT-jar-with-dependencies.jar org.sj.tools.pdfjuice.PDFJuice -m [mode] -i [input-filename] -o [output-filename]
The mode option may be slide or table, depending on which kind of information you want to extract (text and poster modes are under development).
More information (command line help):
Missing required options: i, o, m
usage: utility-name
-c,--clip <arg> format: x,y,width,height
-g,--gui Launches graphic user interface.
-h,--help Shows this help message.
-i,--input <arg> input file
-l,--lines <arg> line filtering: <color_name> | 0x<rrggbb> | all
-m,--mode <arg> extraction mode: slide|table|text
-o,--output <arg> output file
-p,--proximity <arg> minimum distance between tables
-t,--thickness <arg> máximum line thickness
Set java.util.logging.config.file property to ./logging.properties.
java -cp target/PDFJuice-1.3-SNAPSHOT-jar-with-dependencies.jar -Djava.util.logging.config.file=./logging.properties org.sj.tools.pdfjuice.PDFJuice -m [mode] -i [input-filename] -o [output-filename]
You can see some examples of what can be done with PDFJuice so far.