This repository was archived by the owner on Jul 3, 2023. It is now read-only.

Conversation

@HansBrende (Member)

Improves TikaEncodingDetector by:

  1. Not second-guessing UTF-8 if there is any indication that a stream is UTF-8-encoded (see the sketch after this list). We can't afford false positives from obscure, obsolete charsets such as IBM500 (see TIKA-2771).
  2. Taking the entire stream into account rather than just a prefix. This shouldn't be a huge memory issue: we already hold the entire stream in memory to pass to each extractor, and extractors such as RDFa parse the entire content into a DOM before generating triples anyway. If we want to make Any23 "streaming"-capable in the future to reduce memory requirements, we can look into that then; but since we're not streaming now, we may as well use the full content to our advantage and be more accurate in charset detection.
  3. Taking TIKA-2771, TIKA-2038, and TIKA-539 into account.
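
To illustrate the idea behind item 1 (a hedged sketch, not the code in this PR; the `Utf8Check` class and `looksLikeUtf8` method are made-up names, using only JDK APIs): if the bytes decode cleanly as UTF-8 and contain at least one multi-byte sequence, we can commit to UTF-8 instead of letting a statistical detector guess an obscure charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class Utf8Check {

    /**
     * Returns true if the bytes decode cleanly as UTF-8 AND contain at
     * least one multi-byte sequence, which is strong evidence of UTF-8.
     */
    public static boolean looksLikeUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
        } catch (CharacterCodingException e) {
            return false; // invalid UTF-8 somewhere in the stream
        }
        for (byte b : bytes) {
            if ((b & 0x80) != 0) {
                return true; // a multi-byte sequence decoded cleanly
            }
        }
        return false; // pure ASCII: UTF-8-compatible, but inconclusive
    }
}
```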

@HansBrende (Member, Author)

@lewismc any thoughts about this?

@lewismc (Member) commented Nov 8, 2018

There is a fair bit of code here, but I am not really sure how to test it. Unfortunately, I am going to have to ask: please provide unit tests. I've been aware of some encoding-detection issues with Any23 previously but never got around to logging them here.

@HansBrende (Member, Author)

@lewismc I've added some additional unit tests covering the main issues we've been having with encoding detection.

Unfortunately, the only way to test this comprehensively is to compare against millions of webpages "in the wild". Still, I am confident it represents a huge improvement over what we have now, based on our past problems with encoding detection and on the Tika discussions about the various issues they've been having with it.

Compare to the original version of this file here.

Since that time, I've made a couple of changes to the algorithm to fix problems we encountered along the way, but those tweaks weren't as comprehensive as this overhaul is.

Ideally, I'd like to compare this more comprehensive solution against our original solution across millions of webpages, but I'm not yet sure how to proceed in that regard.

@lewismc (Member) commented Nov 8, 2018

You've brought up an excellent topic for conversation. Tika currently has a batch regression job which essentially enables them to run over loads of documents and analyze the output; as a result, they know how changes to the source affect, over time, Tika's ability to do what it claims to be doing. We do not have that in Any23, but I think we should make an effort to build bridges with the Tika community in this regard, with the aim of sharing resources (both computing available to run large batch parse jobs and datasets we can use to run Any23 over).

I have been thinking for the longest time now about implementing a tika.triplify API which would encapsulate Any23 and run it on the Tika data streams, but I just never got around to it. Maybe now is a good time to bring that idea back to life (a rough sketch is below).
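
Not part of this PR, but to make the idea concrete: a rough sketch of what such a wrapper might look like. The `Triplify` class and `triplify` method are hypothetical; the Any23 calls themselves follow the standard programmatic API from the developers guide.

```java
import java.io.ByteArrayOutputStream;
import org.apache.any23.Any23;
import org.apache.any23.source.DocumentSource;
import org.apache.any23.source.StringDocumentSource;
import org.apache.any23.writer.NTriplesWriter;
import org.apache.any23.writer.TripleHandler;

public final class Triplify {

    /**
     * Hypothetical wrapper: run Any23 over a document's content and
     * return the extracted triples serialized as N-Triples.
     */
    public static String triplify(String content, String documentIri) throws Exception {
        Any23 runner = new Any23();
        DocumentSource source = new StringDocumentSource(content, documentIri);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TripleHandler handler = new NTriplesWriter(out);
        try {
            runner.extract(source, handler); // runs all registered extractors
        } finally {
            handler.close();
        }
        return out.toString("UTF-8");
    }
}
```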

I was thinking we could possibly use Common Crawl, but AFAIK they do not publish the raw data; it is the Nutch segments or some alternative, e.g. the WebArchive files.

@HansBrende (Member, Author)

@lewismc I've simplified the code a lot, so it should be much easier to see what's going on now.

Also, I improved the UTF-8 detector by reverse-engineering jchardet's methodology for UTF-8 detection and creating a UTF-8 state machine that does the same thing as jchardet (in a much more human-readable manner). Along the way, I also fixed two bugs in jchardet's UTF-8 detector (possibly attributable to the lack of human-readability in the original source code).
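
For reviewers, here's the shape of the idea (a hedged illustration, not the merged code; the class and method names are made up). The machine consumes one byte at a time and tracks how many continuation bytes are still expected:

```java
/**
 * Simplified UTF-8 validity state machine. Note: this simplified lead-byte
 * table still admits some overlong and surrogate encodings that a full
 * validator would reject; it's here to show the structure of the approach.
 */
final class Utf8StateMachine {
    private int pending = 0;      // continuation bytes still expected
    private boolean valid = true; // false once an illegal byte is seen

    void feed(byte b) {
        if (!valid) return;
        int v = b & 0xFF;
        if (pending == 0) {
            if (v < 0x80) {
                // ASCII: stay in the start state
            } else if (v >= 0xC2 && v <= 0xDF) {
                pending = 1;      // lead byte of a 2-byte sequence
            } else if (v >= 0xE0 && v <= 0xEF) {
                pending = 2;      // lead byte of a 3-byte sequence
            } else if (v >= 0xF0 && v <= 0xF4) {
                pending = 3;      // lead byte of a 4-byte sequence
            } else {
                valid = false;    // 0x80-0xC1 and 0xF5-0xFF can't start a sequence
            }
        } else if (v >= 0x80 && v <= 0xBF) {
            pending--;            // valid continuation byte
        } else {
            valid = false;        // expected a continuation byte, got something else
        }
    }

    boolean isValidSoFar() { return valid; }

    boolean isComplete()   { return valid && pending == 0; }
}
```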

I started looking into jchardet because, according to TIKA-2038, using it to detect UTF-8 before anything else increased the accuracy of charset detection from ~72% to ~96%.

Our encoding detector should now be at least as accurate.

Any thoughts on the methodology, as compared to what we had before?

@lewismc (Member) left a comment

@HansBrende do you have any pending issues with this PR? I've tested it locally and your tests pass. I've reviewed the code and, combined with your comments, made a best effort at analyzing what is going on.

@HansBrende (Member, Author)

@lewismc The only item I have left is to update the f8 dependency. Ideally, I'd like to depend on the 1.1 version (unreleased) rather than the RC version, so I'll work on that tonight and hopefully have this PR finished up by this weekend. Thanks!

@lewismc (Member) commented Feb 1, 2019

cool

@HansBrende (Member, Author)

@lewismc Also, I never added license notices for the biweekly dependency (used to parse iCal) and jsoup dependency (used to parse HTML). I guess I should add those too, before we push a release?

Some clarity on which NOTICE and/or LICENSE files I need to append to would be appreciated. Thanks!

@lewismc (Member) commented Feb 4, 2019

> Is there a wiki on how to do this correctly?

No, but there is some documentation at https://apache.org/legal/release-policy.html#notice-file

If I were you, I would just add the relevant license to the NOTICE file at the top level of the project.
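
For illustration only (the wording is hypothetical, not taken from the actual Any23 files): an attribution entry appended to the top-level NOTICE per the suggestion above might look something like:

```
This product includes jsoup (https://jsoup.org/),
distributed under the MIT License.
```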

@asfgit merged commit e9c001f into apache:master on Feb 7, 2019.
@HansBrende deleted the ANY23-418 branch on Feb 7, 2019.