Skip to content

andrescg2sj/PDFJuice

master
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
res
 
 
src
 
 
 
 
 
 
 
 

Presentation

This project provides some tools that help the user to extract structured information from PDF documents. Currently, the program is able to export them to HTML.

PDFJuice depends on Apache PDFBox to read PDF documents.

There are two functionalities available so far:

  • Extract tables.
  • Extract slides.

This project is a spin-off of Courseminer.

Compile

Compile with dependencies:

mvn compile package assembly:single

Usage

Generate examples:

Output files are already available in the repository. They will be overwritten.

java -cp target/PDFJuice-1.3-SNAPSHOT-jar-with-dependencies.jar org.sj.tools.pdfjuice.ExampleGenerator

Use graphic user interface

See this tutorial.

Run from command line

java -cp target/PDFJuice-1.3-SNAPSHOT-jar-with-dependencies.jar org.sj.tools.pdfjuice.PDFJuice -m [mode] -i [input-filename] -o [output-filename]

The mode option may be slide or table, depending on which kind of information you want to extract (text and poster modes are under development).

More information (command line help):

Missing required options: i, o, m
usage: utility-name
 -c,--clip <arg>        format: x,y,width,height
 -g,--gui               Launches graphic user interface.
 -h,--help              Shows this help message.
 -i,--input <arg>       input file
 -l,--lines <arg>       line filtering: <color_name> | 0x<rrggbb> | all
 -m,--mode <arg>        extraction mode: slide|table|text
 -o,--output <arg>      output file
 -p,--proximity <arg>   minimum distance between tables
 -t,--thickness <arg>   máximum line thickness

Logging

Set java.util.logging.config.file property to ./logging.properties.

java -cp target/PDFJuice-1.3-SNAPSHOT-jar-with-dependencies.jar -Djava.util.logging.config.file=./logging.properties org.sj.tools.pdfjuice.PDFJuice -m [mode] -i [input-filename] -o [output-filename]

Examples

You can see some examples of what can be done with PDFJuice so far.

About

Extract formatted information from PDF documents.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages