This repository was archived by the owner on Jul 3, 2023. It is now read-only.

Conversation

@HansBrende (Member)

Improves TikaEncodingDetector by:

  1. Not second-guessing UTF-8 if there is any indication that a stream is UTF-8-encoded (see the sketch after this list). We can't afford false positives from obscure, obsolete charsets such as IBM500 (see TIKA-2771).
  2. Taking the entire stream into account rather than just a prefix. This shouldn't be a huge memory issue: we already hold the entire stream in memory to pass to each extractor, and extractors such as RDFa parse the entire content into a DOM before generating triples anyway. If we want to make Any23 "streaming"-capable in the future to reduce memory requirements, we can look into that then; but since we're not streaming now, we may as well use the full content to our advantage and be more accurate in charset detection.
  3. Taking TIKA-2771, TIKA-2038, and TIKA-539 into account.
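
To illustrate the idea behind item 1 (a hedged sketch, not the code in this PR; the `Utf8Check` class and `looksLikeUtf8` method are made-up names, using only JDK APIs): if the bytes decode cleanly as UTF-8 and contain at least one multi-byte sequence, we can commit to UTF-8 instead of letting a statistical detector guess an obscure charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class Utf8Check {

    /**
     * Returns true if the bytes decode cleanly as UTF-8 AND contain at
     * least one multi-byte sequence, which is strong evidence of UTF-8.
     */
    public static boolean looksLikeUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
        } catch (CharacterCodingException e) {
            return false; // invalid UTF-8 somewhere in the stream
        }
        for (byte b : bytes) {
            if ((b & 0x80) != 0) {
                return true; // a multi-byte sequence decoded cleanly
            }
        }
        return false; // pure ASCII: UTF-8-compatible, but inconclusive
    }
}
```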

@HansBrende (Member, Author)

@lewismc any thoughts about this?

@lewismc (Member) commented Nov 8, 2018

There is a fair bit of code here, but I am not really sure how to test it. Unfortunately, I am going to have to ask: please provide unit tests. I've been aware of some encoding-detection issues with Any23 previously but never got around to logging them here.

@HansBrende (Member, Author)

@lewismc I've added some additional unit tests covering the main issues we've been having with encoding detection.

Unfortunately, the only way to test this comprehensively is to compare against millions of webpages "in the wild". Still, I am confident it represents a huge improvement over what we have now, based on our past problems with encoding detection and on the Tika discussions about the various issues they've been having with it.

Compare to the original version of this file here.

Since that time, I've made a couple of changes to the algorithm to fix problems we encountered along the way, but those tweaks weren't as comprehensive as this overhaul is.

Ideally, I'd like to compare this more comprehensive solution against our original solution across millions of webpages, but I'm not yet sure how to proceed in that regard.

@lewismc (Member) commented Nov 8, 2018

You've brought up an excellent topic for conversation. Tika currently has a batch regression job which essentially enables them to run over loads of documents and analyze the output; as a result, they know how changes to the source affect, over time, Tika's ability to do what it claims to be doing. We do not have that in Any23, but I think we should make an effort to build bridges with the Tika community in this regard, with the aim of sharing resources (both computing available to run large batch parse jobs and datasets we can use to run Any23 over).

I have been thinking for the longest time now about implementing a tika.triplify API which would encapsulate Any23 and run it on the Tika data streams, but I just never got around to it. Maybe now is a good time to bring that idea back to life (a rough sketch is below).
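
Not part of this PR, but to make the idea concrete: a rough sketch of what such a wrapper might look like. The `Triplify` class and `triplify` method are hypothetical; the Any23 calls themselves follow the standard programmatic API from the developers guide.

```java
import java.io.ByteArrayOutputStream;
import org.apache.any23.Any23;
import org.apache.any23.source.DocumentSource;
import org.apache.any23.source.StringDocumentSource;
import org.apache.any23.writer.NTriplesWriter;
import org.apache.any23.writer.TripleHandler;

public final class Triplify {

    /**
     * Hypothetical wrapper: run Any23 over a document's content and
     * return the extracted triples serialized as N-Triples.
     */
    public static String triplify(String content, String documentIri) throws Exception {
        Any23 runner = new Any23();
        DocumentSource source = new StringDocumentSource(content, documentIri);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TripleHandler handler = new NTriplesWriter(out);
        try {
            runner.extract(source, handler); // runs all registered extractors
        } finally {
            handler.close();
        }
        return out.toString("UTF-8");
    }
}
```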

I was thinking we could possibly use Common Crawl, but AFAIK they do not publish the raw data; it is the Nutch segments or some alternative, e.g. the WebArchive files.

@HansBrende (Member, Author)

@lewismc I've simplified the code a lot, so it should be much easier to see what's going on now.

Also, I improved the UTF-8 detector by reverse-engineering jchardet's methodology for UTF-8 detection and creating a UTF-8 state machine that does the same thing as jchardet (in a much more human-readable manner). Along the way, I also fixed two bugs in jchardet's UTF-8 detector (possibly attributable to the lack of human-readability in the original source code).
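
For reviewers, here's the shape of the idea (a hedged illustration, not the merged code; the class and method names are made up). The machine consumes one byte at a time and tracks how many continuation bytes are still expected:

```java
/**
 * Simplified UTF-8 validity state machine. Note: this simplified lead-byte
 * table still admits some overlong and surrogate encodings that a full
 * validator would reject; it's here to show the structure of the approach.
 */
final class Utf8StateMachine {
    private int pending = 0;      // continuation bytes still expected
    private boolean valid = true; // false once an illegal byte is seen

    void feed(byte b) {
        if (!valid) return;
        int v = b & 0xFF;
        if (pending == 0) {
            if (v < 0x80) {
                // ASCII: stay in the start state
            } else if (v >= 0xC2 && v <= 0xDF) {
                pending = 1;      // lead byte of a 2-byte sequence
            } else if (v >= 0xE0 && v <= 0xEF) {
                pending = 2;      // lead byte of a 3-byte sequence
            } else if (v >= 0xF0 && v <= 0xF4) {
                pending = 3;      // lead byte of a 4-byte sequence
            } else {
                valid = false;    // 0x80-0xC1 and 0xF5-0xFF can't start a sequence
            }
        } else if (v >= 0x80 && v <= 0xBF) {
            pending--;            // valid continuation byte
        } else {
            valid = false;        // expected a continuation byte, got something else
        }
    }

    boolean isValidSoFar() { return valid; }

    boolean isComplete()   { return valid && pending == 0; }
}
```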

I started looking into jchardet because, according to TIKA-2038, using it to detect UTF-8 before anything else increased the accuracy of charset detection from ~72% to ~96%.

Our encoding detector should now be at least as accurate.

Any thoughts on the methodology, as compared to what we had before?

@lewismc (Member) left a comment

@HansBrende do you have any pending issues with this PR? I've tested it locally and your tests pass. I've reviewed the code and, combined with your comments, made a best effort at analyzing what is going on.

@HansBrende (Member, Author)

@lewismc The only item I have left is to update the f8 dependency. Ideally, I'd like to depend on the 1.1 version (unreleased) rather than the RC version, so I'll work on that tonight and hopefully have this PR finished up by this weekend. Thanks!

@lewismc (Member) commented Feb 1, 2019

cool

@HansBrende (Member, Author)

@lewismc Also, I never added license notices for the biweekly dependency (used to parse iCal) and jsoup dependency (used to parse HTML). I guess I should add those too, before we push a release?

Some clarity on which NOTICE and/or LICENSE files I need to append to would be appreciated. Thanks!

@lewismc (Member) commented Feb 4, 2019

> Is there a wiki on how to do this correctly?

No, but there is some documentation at https://apache.org/legal/release-policy.html#notice-file

If I were you, I would just add the relevant license to the NOTICE file at the top level of the project.
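
For illustration only (the wording is hypothetical, not taken from the actual Any23 files): an attribution entry appended to the top-level NOTICE per the suggestion above might look something like:

```
This product includes jsoup (https://jsoup.org/),
distributed under the MIT License.
```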

@asfgit merged commit e9c001f into apache:master on Feb 7, 2019.
@HansBrende deleted the ANY23-418 branch on Feb 7, 2019.