ANY23-418 improve TikaEncodingDetector #131
Conversation
@lewismc any thoughts about this?
There is a fair bit of code here, but I am not really sure how to test it. Unfortunately, I am going to say: please provide unit tests. I've been aware of some encoding detection issues with Any23 previously but never got around to logging them here.
@lewismc I've added some additional unit tests that cover the main issues we've been having with encoding detection. Unfortunately, the only way to test this comprehensively is to compare against millions of webpages "in the wild", but I am confident that it represents a huge improvement over what we have now, based on our past problems with encoding detection, plus discussions over in Tika regarding the various issues they've been having with encoding detection. Compare to the original version of this file here. Since that time, I've made a couple of changes to the algorithm to fix problems we've encountered along the way, but those tweaks weren't as comprehensive as this one is. Ideally, I'd like to compare this more comprehensive solution against our original solution across millions of webpages, but I'm not yet sure how to proceed in that regard.
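To make the shape of these tests concrete, here is a minimal sketch of the kind of unit test being described, assuming Any23's `TikaEncodingDetector` lives in `org.apache.any23.encoding` and exposes `guessEncoding(InputStream)` from its `EncodingDetector` interface (class, package, and method names are per my reading of the codebase and may differ from the actual tests in this PR):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.junit.Assert;
import org.junit.Test;

import org.apache.any23.encoding.TikaEncodingDetector;

public class EncodingDetectionSketchTest {

    /**
     * UTF-8 multibyte content with no declared charset should still be
     * detected as UTF-8 rather than falling back to a Latin-1 guess.
     */
    @Test
    public void detectsUndeclaredUtf8() throws Exception {
        byte[] html = "<html><body>Svenskt smörgåsbord, 日本語</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(html)) {
            String charset = new TikaEncodingDetector().guessEncoding(in);
            Assert.assertEquals("UTF-8", charset);
        }
    }
}
```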
You've brought up an excellent topic for conversation. Tika currently has a batch regression job which essentially enables them to run over loads of documents and analyze the output, so they know how changes to the source are affecting Tika's ability to do what it claims to do over time. We do not have that in Any23, but I think we should make an effort to build bridges with the Tika community in this regard, with the aim of sharing resources (both available computing to run large batch parse jobs, and dataset(s) we can use to run Any23 over). I have been thinking for the longest time now about implementing something similar. I was thinking we could possibly use Common Crawl, but AFAIK they do not publish the raw data, only the Nutch segments or some alternative, e.g. the WebArchive files.
(force-pushed 907b736 to e355fb9)
@lewismc I've simplified the code a lot, so it should be much easier to see what's going on now. I also improved the UTF-8 detector by reverse engineering jchardet's methodology for UTF-8 detection and creating a UTF-8 state machine which does the same thing as jchardet (in a much more human-readable manner), and fixed two bugs in jchardet's UTF-8 detector along the way (possibly due to the lack of human readability in the original source code). I started looking into jchardet because, according to TIKA-2038, using it to detect UTF-8 before anything else increased the accuracy of charset detection from ~72% to ~96%. Our encoding detector should now be at least as accurate. Any thoughts on the methodology, as compared to what we had before?
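For readers unfamiliar with the approach, here is a minimal sketch of a human-readable UTF-8 state machine of the kind described above. This is not the actual Any23 or jchardet code, just an illustration of the technique: consume one byte at a time and report as soon as the stream can no longer be valid UTF-8.

```java
/**
 * Illustrative sketch only (not Any23's implementation): a byte-at-a-time
 * UTF-8 validity state machine. START expects a lead byte or ASCII;
 * EXPECT_n means n continuation bytes are still required.
 */
public final class Utf8StateMachine {

    private enum State { START, EXPECT_1, EXPECT_2, EXPECT_3, ERROR }

    private State state = State.START;

    /** Feed one byte; returns false once the input is definitely not UTF-8. */
    public boolean accept(byte b) {
        int ub = b & 0xFF;
        switch (state) {
            case START:
                if (ub <= 0x7F) {
                    // single-byte ASCII: stay in START
                } else if (ub >= 0xC2 && ub <= 0xDF) {
                    state = State.EXPECT_1;  // lead byte of a 2-byte sequence
                } else if (ub >= 0xE0 && ub <= 0xEF) {
                    state = State.EXPECT_2;  // lead byte of a 3-byte sequence
                } else if (ub >= 0xF0 && ub <= 0xF4) {
                    state = State.EXPECT_3;  // lead byte of a 4-byte sequence
                } else {
                    state = State.ERROR;     // 0x80..0xC1 and 0xF5..0xFF never start a sequence
                }
                break;
            case EXPECT_1:
            case EXPECT_2:
            case EXPECT_3:
                if (ub >= 0x80 && ub <= 0xBF) {
                    // valid continuation byte: one fewer still expected
                    state = (state == State.EXPECT_1) ? State.START
                          : (state == State.EXPECT_2) ? State.EXPECT_1
                          : State.EXPECT_2;
                } else {
                    state = State.ERROR;
                }
                break;
            default:
                break;                       // ERROR is absorbing
        }
        return state != State.ERROR;
    }
}
```

A production-grade machine would additionally constrain the first continuation byte after 0xE0, 0xED, 0xF0, and 0xF4 to reject overlong and surrogate encodings; the sketch omits those edge cases for brevity.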
(force-pushed 5839a92 to 74c0af9)
lewismc left a comment:
@HansBrende do you have any pending issues with this PR? I've tested it locally and your tests pass. I've reviewed the code and, combined with your comments, made my best effort at analyzing what is going on.
@lewismc The only item I have left is to update the f8 dependency. Ideally, I'd like to depend on the 1.1 version (unreleased) rather than the RC version, so I'll work on that tonight and hopefully have this PR finished up by this weekend. Thanks!
cool
@lewismc Also, I never added license notices for the [...]. Some clarity on which NOTICE and/or LICENSE files I need to append to would be appreciated. Thanks!
No, but there is some documentation at https://apache.org/legal/release-policy.html#notice-file. If I were you, I would just add the relevant license to the NOTICE file at the top level of the project.
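For illustration only, an entry of the kind being discussed might look like the following in the top-level NOTICE file. The jchardet attribution wording here is hypothetical; the exact text depends on what the relevant license actually requires:

```
Apache Any23
Copyright The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

This product includes code derived from jchardet
(http://jchardet.sourceforge.net/).
```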
Improves TikaEncodingDetector by: