updated version? #43

Closed
4 tasks
dr333 opened this issue Jul 19, 2017 · 6 comments

Comments

@dr333

dr333 commented Jul 19, 2017

Summary of your issue

It looks like the current version of tabula-py is not compatible with the latest tabula-java. What needs to be done to upgrade tabula-py so that it includes the updates from https://github.com/tabulapdf/tabula/releases/tag/v1.1.1 ?

Environment

Write and check your environment.

  • python --version: 3.6
  • java -version: 1.8.0_45-b14
  • OS and its version: CentOS Linux release 7.2.1511
  • Your PDF URL: any

What did you do when you faced the problem?

It fails to parse tables fully and throws these warnings:
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Bold

It appears that tabula-java v1.1.1 has the fix for this issue.

Example code:

read_pdf_table(in_file, pages=i)

Output:

Jul 19, 2017 11:05:35 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
INFO: Can't find the specified font Times New Roman,Bold
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Bold
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
INFO: Can't find the specified font Times New Roman,Italic
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Italic
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont getawtFont
INFO: Can't find the specified font Times New Roman
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman
Jul 19, 2017 11:05:37 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Jul 19, 2017 11:05:37 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
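
A log like the one above is easier to read once the missing fonts are summarized. The following is a minimal sketch, assuming the Java output has been captured as a string; the helper name `missing_fonts` is hypothetical and is not part of tabula-py.

```python
import re
from collections import Counter

# Matches java.util.logging message lines such as
# "WARNING: Font not found: Times New Roman,Bold".
FONT_NOT_FOUND = re.compile(r"^WARNING: Font not found: (.+)$")


def missing_fonts(log_text):
    """Count how often each font is reported as not found (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FONT_NOT_FOUND.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# A short excerpt of the output shown above.
sample = """\
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Bold
Jul 19, 2017 11:05:36 AM org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Times New Roman,Italic
Jul 19, 2017 11:05:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
"""

for font, count in sorted(missing_fonts(sample).items()):
    print(font, count)
```

The INFO lines are ignored on purpose; only the "Font not found" warnings are counted.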

What did you intend to be?

@chezou
Owner

chezou commented Jul 19, 2017

The latest version of tabula-java is v0.9.2. You're pointing to the version of tabula, not tabula-java.
https://github.com/tabulapdf/tabula-java/releases/tag/0.9.2

@chezou chezou closed this as completed Jul 19, 2017
@dr333
Author

dr333 commented Jul 19, 2017

ok, should I report a new issue for the warnings I am getting in the output I have shown? It is stripping off some text when extracting tables from pdfs. I confirmed the version of the jar I have in my tabula-py : tabula-0.9.2-jar-with-dependencies.jar
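
The jar check described above can also be scripted. This is a minimal sketch that reads the version out of a file name of the form shown in this comment; the helper `jar_version` is hypothetical, and the naming pattern is an assumption based only on that one file name.

```python
import re

# Assumed naming pattern, taken from the file name quoted above:
# tabula-<version>-jar-with-dependencies.jar
JAR_NAME = re.compile(r"^tabula-(\d+(?:\.\d+)*)-jar-with-dependencies\.jar$")


def jar_version(filename):
    """Return the version embedded in a tabula jar file name, or None (hypothetical helper)."""
    match = JAR_NAME.match(filename)
    return match.group(1) if match else None


print(jar_version("tabula-0.9.2-jar-with-dependencies.jar"))
```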

@chezou
Owner

chezou commented Jul 19, 2017

That has already been reported in #31

@dr333
Author

dr333 commented Jul 19, 2017

ok, thanks. So just to confirm, a re-install of tabula-py (with pdfbox2.0 of tabula-java) is all I need? I can't find the release steps of tabula-py in order to use the upgraded version of tabula-java which is based on pdfbox2.0. Apologies if I am missing something obvious here.

@chezou
Owner

chezou commented Jul 19, 2017

No. If you want to replace tabula-java v0.9.2 with a build of the master code, you need to replace the jar and modify some code.

My PR would be helpful.
#2
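
To make "replace the jar and modify some code" more concrete: a Python wrapper around tabula-java essentially builds a `java -jar` command and reads its output. The sketch below is not tabula-py's own code. The helpers `build_tabula_command` and `run_tabula` and the paths are hypothetical, and it assumes the jar accepts the `--pages` and `--format` options of the tabula-java command line.

```python
import subprocess


def build_tabula_command(jar_path, pdf_path, pages="1", output_format="CSV"):
    """Build the argument list for running a tabula-java jar on one PDF (hypothetical helper)."""
    return [
        "java", "-jar", jar_path,
        "--pages", str(pages),
        "--format", output_format,
        pdf_path,
    ]


def run_tabula(jar_path, pdf_path, **options):
    """Run the jar and return its standard output as text; requires Java on the PATH."""
    cmd = build_tabula_command(jar_path, pdf_path, **options)
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True
    )
    return result.stdout.decode("utf-8")


# Only print the command here; the jar and PDF paths are placeholders.
print(build_tabula_command("path/to/tabula.jar", "input.pdf", pages=3))
```

Swapping in a different jar then means changing `jar_path` and adjusting any options whose names or behavior differ between jar versions.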

@dr333
Author

dr333 commented Aug 31, 2017

I upgraded tabula.py to use the latest jar (tabula-1.0.1-jar-with-dependencies.jar) and while it reduced these warnings, I still get some.

Aug 31, 2017 11:42:02 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont
WARNING: Using fallback font 'LiberationSans' for 'TimesNewRomanPS-ItalicMT'
Aug 31, 2017 11:42:03 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont
WARNING: Using fallback font 'LiberationSans' for 'TimesNewRomanPS-ItalicMT'
Aug 31, 2017 11:42:03 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont
WARNING: Using fallback font 'LiberationSans' for 'TimesNewRomanPS-ItalicMT'

The main issue is that common cell headers do not get read in, and I am not sure whether the warnings are related. Please find the PDF file here: ufile.io/5xuti
For instance, you will see that the common cell header in the table on page 1 ('Three Months Ended March 31') gets dropped.
