updated version? #43
Comments
The latest version of tabula-java is v0.9.2. The release you are pointing to is a tabula release, not a tabula-java one.
OK, should I report a new issue for the warnings I am getting in the output I have shown? It is stripping off some text when extracting tables from PDFs. I confirmed the version of the jar I have in my tabula-py: tabula-0.9.2-jar-with-dependencies.jar
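For anyone who wants to repeat that check, here is a minimal sketch that lists the tabula jars found under a directory and reads the version from the file name. It assumes the jar ships inside the installed tabula-py package directory and follows the `tabula-<version>-jar-with-dependencies.jar` naming seen in this thread. `find_bundled_jars` is a hypothetical helper, not part of tabula-py.

```python
import importlib.util
import re
from pathlib import Path

# Assumption: jars are named like the ones quoted in this thread,
# e.g. tabula-0.9.2-jar-with-dependencies.jar
JAR_NAME = re.compile(r"^tabula-(\d+(?:\.\d+)*)-jar-with-dependencies\.jar$")


def find_bundled_jars(package_dir):
    """Return (file name, version) pairs for tabula jars found under package_dir."""
    found = []
    for jar in sorted(Path(package_dir).rglob("*.jar")):
        match = JAR_NAME.match(jar.name)
        if match:
            found.append((jar.name, match.group(1)))
    return found


if __name__ == "__main__":
    # Locate the installed package without importing it.
    spec = importlib.util.find_spec("tabula")
    if spec is None or spec.origin is None:
        print("tabula-py does not appear to be installed in this environment")
    else:
        for name, version in find_bundled_jars(Path(spec.origin).parent):
            print(name, "->", version)
```

If nothing is printed, the jar may live somewhere else in your installation, so treat an empty result as inconclusive.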
That has already been reported in #31.
OK, thanks. So just to confirm, is a re-install of tabula-py (with the pdfbox2.0-based tabula-java) all I need? I can't find the release steps for tabula-py that would let me use the upgraded version of tabula-java, which is based on pdfbox2.0. Apologies if I am missing something obvious here.
No. If you want to replace tabula-java v0.9.2 in the master code, you need to replace the jar and modify some code. My PR would be helpful.
I upgraded tabula.py to use the latest jar (tabula-1.0.1-jar-with-dependencies.jar), and while it reduced these warnings, I still get some:
Aug 31, 2017 11:42:02 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont
The main issue is that common cell headers do not get read in, and I am not sure if the warnings are related. Please find the PDF file here: ufile.io/5xuti
Summary of your issue
It looks like the current version of tabula-py is not compatible with the latest tabula-java. What needs to be done to upgrade tabula-py so that it includes the updates from https://github.com/tabulapdf/tabula/releases/tag/v1.1.1 ?
Environment
Write and check your environment.
python --version: 3.6
java -version: 1.8.0_45-b14
What did you do when you faced the problem?
It fails to parse tables fully and throws these warnings:
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Bold
It appears that tabula-java v1.1.1 has the fix for this issue.
Example code:
read_pdf_table(in_file, pages=i)
Output:
Jul 19, 2017 11:05:35 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
INFO: Can't find the specified font Times New Roman,Bold
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Bold
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
INFO: Can't find the specified font Times New Roman,Italic
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Italic
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
INFO: Can't find the specified font Times New Roman
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman
Jul 19, 2017 11:05:37 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Jul 19, 2017 11:05:37 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
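Since the same font warnings repeat many times in output like this, a small sketch such as the one below can reduce a captured log to the distinct fonts reported as missing. It assumes the two-line log layout shown above (a timestamp line followed by a `LEVEL: message` line). `missing_fonts` is a hypothetical helper, and the sample is an excerpt of the log lines above.

```python
import re

# Matches only the message lines of the form "WARNING: Font not found: <name>".
FONT_WARNING = re.compile(r"^WARNING: Font not found: (?P<font>.+)$")


def missing_fonts(log_text):
    """Return the distinct font names reported as not found, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = FONT_WARNING.match(line.strip())
        if match and match.group("font") not in seen:
            seen.append(match.group("font"))
    return seen


# Excerpt of the log shown above.
sample = """\
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
INFO: Can't find the specified font Times New Roman,Bold
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Bold
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Italic
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Bold
"""

for font in missing_fonts(sample):
    print(font)
```

This only summarizes the warnings. It does not show whether the missing fonts are what causes the stripped text.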
What did you intend to be?