ingest-attachment plugin Font not found: TimesNewRomanPS-BoldMT #27198

Closed
TomonoriSoejima opened this issue Nov 1, 2017 · 7 comments
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP feedback_needed

Comments

@TomonoriSoejima
Contributor

Describe the feature:

Elasticsearch version (bin/elasticsearch --version):
ES 5.x
Plugins installed: []
ingest-attachment
JVM version (java -version):
1.8.x
OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make for
us to reproduce it, the more likely that somebody will take the time to look at it.

  1. Ingest a PDF document that contains the TimesNewRomanPS-BoldMT font.
  2. The ingest pipeline throws the error below.
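
For anyone trying to recreate these steps, here is a minimal sketch of the two request bodies involved. The pipeline id `attachment` and the field name `data` are assumptions chosen for illustration, not details from the report, and the bytes are a placeholder because the failing PDF was never shared.

```python
import base64
import json

# Assumed names: the pipeline id "attachment" and the field "data" are
# illustrative choices, not taken from the report.
PIPELINE_ID = "attachment"


def build_pipeline_body(field="data"):
    """Body for PUT _ingest/pipeline/<PIPELINE_ID>."""
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }


def build_document_body(pdf_bytes, field="data"):
    """Body for indexing a document through the pipeline.

    The attachment processor expects the file content as base64 text.
    """
    return {field: base64.b64encode(pdf_bytes).decode("ascii")}


if __name__ == "__main__":
    # Placeholder bytes; a real reproduction needs a PDF that uses the
    # TimesNewRomanPS-BoldMT font.
    fake_pdf = b"%PDF-1.4 placeholder"
    print(json.dumps(build_pipeline_body(), indent=2))
    print(json.dumps(build_document_body(fake_pdf), indent=2))
```

The first body would be sent with `PUT _ingest/pipeline/attachment`, and the second with an index request that passes `?pipeline=attachment`.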

I have created an issue here.
https://issues.apache.org/jira/browse/PDFBOX-3985

Provide logs (if relevant):

2017/10/31 00:01:13.348 [WARN ] [elasticsearch[test][bulk][T#3]] [FontManager] Font not found: TimesNewRomanPS-BoldMT
2017/10/31 00:01:13.413 [ERROR] [elasticsearch[test][bulk][T#3]] [TrueTypeFont] An error occured when reading table cmap
java.io.IOException: CMap subtype 14 not yet implemented
        at org.apache.fontbox.ttf.CMAPEncodingEntry.processSubtype14(CMAPEncodingEntry.java:304)
        at org.apache.fontbox.ttf.CMAPEncodingEntry.initSubtable(CMAPEncodingEntry.java:114)
        at org.apache.fontbox.ttf.CMAPTable.initData(CMAPTable.java:100)
        at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
        at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
        at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
        at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
        at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
        at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
        at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:533)
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
        at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:458)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.Tika.parseToString(Tika.java:537)
@talevy
Contributor

talevy commented Nov 1, 2017

Thank you for creating the upstream issue against PDFBox!

@talevy
Contributor

talevy commented Nov 1, 2017

This is not the first issue we've seen with parsing specific fonts. I think we can do better with the latest version of PDFBox, which, if I am not mistaken, logs these exceptions instead of throwing them. That way we can still extract what we can from the PDF.
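
To illustrate the "log instead of throw" idea in general terms, here is a small sketch. It is not how PDFBox or Tika implement it: `extract_lenient` and `extract_page` are hypothetical names, and `OSError` merely stands in for Java's `IOException`.

```python
import logging

logger = logging.getLogger("lenient_extract")


def extract_lenient(pages, extract_page):
    """Extract text page by page, logging failures instead of aborting.

    `extract_page` is a hypothetical callable that returns the text of
    one page and may raise OSError for content it cannot parse.
    """
    texts = []
    for number, page in enumerate(pages, start=1):
        try:
            texts.append(extract_page(page))
        except OSError as exc:  # stands in for java.io.IOException
            # Record the problem and keep going with the remaining pages.
            logger.warning("page %d skipped: %s", number, exc)
    return "\n".join(texts)
```

With this shape, one unparseable page costs only that page's text rather than failing the whole document.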

@clintongormley clintongormley added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP and removed :Plugin Ingest Attachment labels Feb 13, 2018
@dadoonet dadoonet removed the v5.6.3 label Feb 21, 2018
@dadoonet
Member

I looked at this, and it seems that Apache Tika 1.17 depends on PDFBox 2.0.8:

[INFO] +- org.apache.tika:tika-parsers:jar:1.17:compile
[INFO] |  +- org.apache.tika:tika-core:jar:1.17:compile
[INFO] |  +- org.gagravarr:vorbis-java-tika:jar:0.8:compile
[INFO] |  +- com.healthmarketscience.jackcess:jackcess:jar:2.1.8:compile
[INFO] |  |  \- commons-lang:commons-lang:jar:2.6:compile
[INFO] |  +- com.healthmarketscience.jackcess:jackcess-encrypt:jar:2.1.2:compile
[INFO] |  +- org.tallison:jmatio:jar:1.2:compile
[INFO] |  +- org.apache.james:apache-mime4j-core:jar:0.8.1:compile
[INFO] |  +- org.apache.james:apache-mime4j-dom:jar:0.8.1:compile
[INFO] |  +- org.apache.commons:commons-compress:jar:1.14:compile
[INFO] |  +- org.tukaani:xz:jar:1.6:compile
[INFO] |  +- commons-codec:commons-codec:jar:1.10:compile
[INFO] |  +- org.apache.pdfbox:pdfbox:jar:2.0.8:compile
[INFO] |  |  \- org.apache.pdfbox:fontbox:jar:2.0.8:compile
[INFO] |  +- org.apache.pdfbox:pdfbox-tools:jar:2.0.8:compile
[INFO] |  +- org.apache.pdfbox:jempbox:jar:1.8.13:compile
[INFO] |  +- org.bouncycastle:bcmail-jdk15on:jar:1.54:compile
[INFO] |  |  \- org.bouncycastle:bcpkix-jdk15on:jar:1.54:compile
[INFO] |  +- org.bouncycastle:bcprov-jdk15on:jar:1.54:compile
[INFO] |  +- org.apache.poi:poi:jar:3.17:compile
[INFO] |  |  \- org.apache.commons:commons-collections4:jar:4.1:compile
[INFO] |  +- org.apache.poi:poi-scratchpad:jar:3.17:compile
[INFO] |  +- org.apache.poi:poi-ooxml:jar:3.17:compile
[INFO] |  |  +- org.apache.poi:poi-ooxml-schemas:jar:3.17:compile
[INFO] |  |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.6.0:compile
[INFO] |  |  \- com.github.virtuald:curvesapi:jar:1.04:compile
[INFO] |  +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
[INFO] |  +- org.ow2.asm:asm:jar:5.0.4:compile
[INFO] |  +- com.googlecode.mp4parser:isoparser:jar:1.1.18:compile
[INFO] |  +- com.drewnoakes:metadata-extractor:jar:2.10.1:compile
[INFO] |  |  \- com.adobe.xmp:xmpcore:jar:5.1.3:compile
[INFO] |  +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
[INFO] |  +- com.rometools:rome:jar:1.5.1:compile
[INFO] |  |  \- com.rometools:rome-utils:jar:1.5.1:compile
[INFO] |  +- org.gagravarr:vorbis-java-core:jar:0.8:compile
[INFO] |  +- com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
[INFO] |  +- org.codelibs:jhighlight:jar:1.0.2:compile
[INFO] |  +- com.pff:java-libpst:jar:0.8.1:compile
[INFO] |  +- com.github.junrar:junrar:jar:0.7:compile
[INFO] |  +- org.apache.commons:commons-exec:jar:1.3:compile
[INFO] |  +- org.apache.opennlp:opennlp-tools:jar:1.8.3:compile
[INFO] |  +- com.googlecode.json-simple:json-simple:jar:1.1.1:compile
[INFO] |  +- com.tdunning:json:jar:1.8:compile
[INFO] |  +- com.google.code.gson:gson:jar:2.8.1:compile
[INFO] |  +- org.slf4j:slf4j-api:jar:1.7.24:compile
[INFO] |  +- org.slf4j:jul-to-slf4j:jar:1.7.24:compile
[INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.24:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.5.4:compile
[INFO] |  |  \- org.apache.httpcomponents:httpcore:jar:4.4.7:compile
[INFO] |  +- org.apache.httpcomponents:httpmime:jar:4.5.4:compile
[INFO] |  +- org.apache.commons:commons-csv:jar:1.0:compile
[INFO] |  +- org.apache.sis.core:sis-utility:jar:0.6:compile
[INFO] |  +- org.apache.sis.storage:sis-netcdf:jar:0.6:compile
[INFO] |  |  +- org.apache.sis.storage:sis-storage:jar:0.6:compile
[INFO] |  |  \- org.apache.sis.core:sis-referencing:jar:0.6:compile
[INFO] |  +- org.apache.sis.core:sis-metadata:jar:0.6:compile
[INFO] |  +- org.opengis:geoapi:jar:3.0.0:compile
[INFO] |  |  \- javax.measure:jsr-275:jar:0.9.3:compile
[INFO] |  \- edu.usc.ir:sentiment-analysis-parser:jar:0.1:compile
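
As a convenience, the version of a given artifact can be pulled out of a tree like the one above with a short script. This is a sketch that assumes the `group:artifact:jar:version:scope` line format shown here; `artifact_versions` is a hypothetical helper name, and `SAMPLE` is a three-line excerpt of the output.

```python
import re

# Matches the "group:artifact:jar:version:scope" coordinates that end a line.
COORD = re.compile(r"([\w.\-]+):([\w.\-]+):jar:([\w.\-]+):(\w+)\s*$")


def artifact_versions(tree_text):
    """Map "group:artifact" to its version for each dependency line."""
    versions = {}
    for line in tree_text.splitlines():
        match = COORD.search(line)
        if match:
            group, artifact, version, _scope = match.groups()
            versions[f"{group}:{artifact}"] = version
    return versions


# Excerpt of the dependency tree output above.
SAMPLE = """\
[INFO] +- org.apache.tika:tika-parsers:jar:1.17:compile
[INFO] |  +- org.apache.pdfbox:pdfbox:jar:2.0.8:compile
[INFO] |  |  \\- org.apache.pdfbox:fontbox:jar:2.0.8:compile
"""

if __name__ == "__main__":
    for name, version in artifact_versions(SAMPLE).items():
        print(name, version)
```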

I can see that Tika will be updated to a new PDFBox version with https://issues.apache.org/jira/browse/TIKA-2178 (for other reasons).
I opened https://issues.apache.org/jira/browse/TIKA-2579 to track this, BTW.

I'm unsure whether that will really fix the problem, though. As the PDFBox team asked, @TomonoriSoejima, could you share the failing PDF document so they can reproduce the problem, and so we can also add it to make sure that the next Tika version fixes it?

Thanks!

@dadoonet
Member

dadoonet commented Mar 9, 2018

Ping @TomonoriSoejima. Could you please share a document?

@TomonoriSoejima
Contributor Author

Unfortunately, the user I was working with on the support case declined to share the reproducing file with us for privacy reasons, and I don't have the file.

@dadoonet
Member

https://issues.apache.org/jira/browse/TIKA-2579 has been fixed. \o/
Let's wait for a release now.

@colings86
Contributor

No further feedback, so closing. If this can be reproduced, we can reopen the issue.


5 participants