WARC writer language detection: ensure proper charset detection #6

Closed
sebastian-nagel opened this Issue Oct 16, 2018 · 2 comments

sebastian-nagel commented Oct 16, 2018

Character set detection was not fully working for the first 13 segments of the October 2018 crawl (CC-MAIN-2018-43) because a library was missing, caused by a bug in the dependency management configuration. About 15% of the captures do not have a charset assigned. As a consequence, no language detectors were run on these captures either.

To fix this issue:

  • ensure that the tika-parsers library is reliably provided in core; otherwise, character set detection is less reliable because it is based on metadata only.
  • add a unit test to verify that detection works. This needs to be checked at runtime, since missing charset detectors do not cause a failure during the build.
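
The runtime check described above could be sketched roughly as follows. This is a minimal illustration, not the unit test that was actually added: the class `DetectorClasspathCheck` and its methods are hypothetical, and the listed detector class names are assumptions about what tika-parsers 1.x ships, so they should be verified against the Tika version in use. The sketch only tests whether the classes can be loaded; it does not run any detection.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks at runtime that detector classes are loadable. */
final class DetectorClasspathCheck {

    // ASSUMPTION: encoding detector class names as found in tika-parsers 1.x.
    // Verify these against the Tika version actually on the classpath.
    static final String[] ASSUMED_DETECTORS = {
        "org.apache.tika.parser.html.HtmlEncodingDetector",
        "org.apache.tika.parser.txt.UniversalEncodingDetector",
        "org.apache.tika.parser.txt.Icu4jEncodingDetector"
    };

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isClassAvailable(String className) {
        try {
            // 'false' = locate the class without running its static initializers.
            Class.forName(className, false,
                    DetectorClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the subset of the given class names that cannot be loaded. */
    static List<String> missingClasses(String... classNames) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            if (!isClassAvailable(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Fail loudly at runtime, because the build itself would not fail.
        List<String> missing = missingClasses(ASSUMED_DETECTORS);
        if (!missing.isEmpty()) {
            throw new IllegalStateException(
                    "Charset detectors missing at runtime: " + missing);
        }
        System.out.println("All expected charset detectors are on the classpath.");
    }
}
```

A check like this catches a broken dependency exclusion early, but it is weaker than a test that feeds sample documents with known encodings through the detection code and compares the detected charset.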

sebastian-nagel added a commit that referenced this issue Oct 16, 2018

WARC writer charset detection #6
- fix exclusion of transitive dependencies of tika-parsers
- add unit test to ensure charset and language detection are working

sebastian-nagel commented Oct 29, 2018

Fixed the dependency configuration and added a unit test for the charset and language detection called by the WARC writer.

sebastian-nagel commented Nov 6, 2018

Just to list the first 13 segments explicitly:

s3://commoncrawl/crawl-data/CC-MAIN-2018-43/segments/
 1539583508988.18/
 1539583509170.2/
 1539583509196.33/
 1539583509326.21/
 1539583509336.11/
 1539583509690.35/
 1539583509845.17/
 1539583509958.44/
 1539583509960.34/
 1539583509996.54/
 1539583510019.12/
 1539583510415.29/
 1539583510749.37/
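
For anyone who needs to treat these captures differently when processing the crawl, the list above can be turned into a simple lookup. The following is a sketch only: the class `AffectedSegments` and the method `isAffected` are hypothetical names, and it assumes that file paths contain the `crawl-data/CC-MAIN-2018-43/segments/<segment>/` prefix shown above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical lookup for the 13 segments listed in this issue. */
final class AffectedSegments {

    static final String CRAWL = "CC-MAIN-2018-43";

    // Segment IDs copied from the list in this issue.
    static final Set<String> SEGMENTS = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList(
                "1539583508988.18", "1539583509170.2", "1539583509196.33",
                "1539583509326.21", "1539583509336.11", "1539583509690.35",
                "1539583509845.17", "1539583509958.44", "1539583509960.34",
                "1539583509996.54", "1539583510019.12", "1539583510415.29",
                "1539583510749.37")));

    /**
     * Returns true if the path lies under one of the affected segments.
     * ASSUMPTION: the path contains "crawl-data/<crawl>/segments/<id>/".
     */
    static boolean isAffected(String path) {
        String marker = "crawl-data/" + CRAWL + "/segments/";
        int start = path.indexOf(marker);
        if (start < 0) {
            return false;
        }
        int idStart = start + marker.length();
        int idEnd = path.indexOf('/', idStart);
        if (idEnd < 0) {
            return false;
        }
        return SEGMENTS.contains(path.substring(idStart, idEnd));
    }
}
```

Captures under these segments may lack charset and language information, so a consumer could use such a check to skip those fields or to run its own detection for the affected files.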