tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF.
You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, TSV, or JSON file.
- Java 8+
- Python 3.5+
I confirmed that it works on macOS and Ubuntu, and some users have confirmed that it also works on Windows 10. See the documentation for detailed installation instructions for Windows 10.
Ensure you have a Java runtime installed and that it is on your PATH.
pip install tabula-py
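The Java requirement above can be checked from Python before you first call tabula-py. The following is a minimal sketch, not part of tabula-py itself: the helper names `find_java` and `java_version_banner` are hypothetical, and the sketch only reports what `java -version` prints rather than parsing the version number.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the first line printed by `java -version`, or None on failure."""
    try:
        result = subprocess.run(
            [java_path, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The executable is missing, not runnable, or did not answer in time.
        return None
    # `java -version` conventionally writes to stderr, so look at both streams.
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    java = find_java()
    if java is None:
        print("java was not found on PATH; install a Java 8+ runtime first")
    else:
        print("found {}: {}".format(java, java_version_banner(java)))
```

If the script reports that `java` was not found, fix the Java installation or the PATH setting before troubleshooting tabula-py itself.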
tabula-py enables you to extract tables from a PDF into DataFrames or JSON. It can also extract tables from a PDF and save them as a CSV, TSV, or JSON file.
import tabula

# Read pdf into list of DataFrame
df = tabula.read_pdf("test.pdf", pages='all')

# Read remote pdf into list of DataFrame
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# convert PDF into CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')

# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')
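Since `convert_into` takes the output format as an argument, one convenient pattern is to derive it from the output file name. The sketch below assumes only the three formats named above (CSV, TSV, JSON). The helpers `guess_output_format` and `convert_by_extension` are hypothetical names introduced for illustration and are not part of tabula-py.

```python
import os

# Output formats named in the text above, keyed by file extension.
_FORMATS = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def guess_output_format(output_path):
    """Map an output file name to a tabula-py output_format string."""
    ext = os.path.splitext(output_path)[1].lower()
    if ext not in _FORMATS:
        raise ValueError("unsupported output extension: {!r}".format(ext))
    return _FORMATS[ext]


def convert_by_extension(pdf_path, output_path, pages="all"):
    """Convert a PDF, choosing the format from the output file extension."""
    # Imported here so guess_output_format works without tabula-py installed.
    import tabula

    tabula.convert_into(
        pdf_path,
        output_path,
        output_format=guess_output_format(output_path),
        pages=pages,
    )
```

With this in place, `convert_by_extension("test.pdf", "output.tsv")` would write a TSV file, while an unexpected extension such as `.xlsx` fails early with a `ValueError` instead of reaching tabula-java.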
Interested in helping out? I'd love to have your help!
You can help by:
- Reporting a bug.
- Adding or editing documentation.
- Contributing code via a Pull Request. See also the contribution documentation.
- Writing a blog post or spreading the word about tabula-py to people who might benefit from using it.
You can also support our continued work on tabula-py with a donation on Patreon.