Mine metrics about truncated content in WARC files #9

Status: Closed. Wants to merge 4 commits.

Conversation

sebastian-nagel (Contributor)

Add a script that collects metrics about records with truncated content in WARC files:

  • look for records marked by a `WARC-Truncated: ...` header
  • count records whose payload length equals the content limit as potentially truncated (although not marked as such)

See the discussion in the Common Crawl user group for why this script was written.
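The two detection rules above can be sketched as a small pure function. This is a hedged illustration, not the PR's actual code; the 1 MiB content limit and the function name are assumptions of mine, and the crawler's real limit may differ.

```python
# Hedged sketch of the two detection rules, reduced to a pure function.
# CONTENT_LIMIT is an assumption (1 MiB); the crawler's actual content
# limit may differ.
CONTENT_LIMIT = 2 ** 20

def truncation_status(warc_headers, payload_length):
    """Return the truncation reason from the WARC-Truncated header,
    'maybe-length' if the payload exactly hits the assumed content
    limit (potentially truncated but not marked), or None otherwise."""
    reason = warc_headers.get('WARC-Truncated')
    if reason is not None:
        return reason.strip() or 'unspecified'
    if payload_length == CONTENT_LIMIT:
        return 'maybe-length'
    return None
```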

@sebastian-nagel (Contributor, Author)

To get a CSV file listing all (presumably) truncated records in a list of WARC files, run:

$SPARK_HOME/bin/spark-submit ./warc_truncation_stats.py \
     --output_format csv --output_option header=true \
     <file_listing_warc_files_to_analyze> <output_in_spark_warehouse>

…C files

- look for records marked by `WARC-Truncated: ...` header
- add records with payload length equal to content limit as
  potentially truncated records (although not marked as such)
- if no `X-Crawler-Content-Encoding` and `X-Crawler-Transfer-Encoding`
  HTTP header fields are found, look for the unmasked `Content-Encoding`
  and `Transfer-Encoding` headers instead (required for WARC files
  before CC-MAIN-2018-34)
- beware of invalid numbers in the `Content-Length` header
- change output type from int32 to long/int64 to avoid overflows
  if lengths exceed 2 GiB
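The last two bullets, and the encoding fallback, can be illustrated with small helpers. This is a hedged sketch: the header names follow the commit message, but the function names are mine and this is not the PR's actual code.

```python
def parse_content_length(value):
    """Parse a Content-Length header defensively: crawled data often
    contains invalid or negative values, so return None instead of
    raising. Python ints do not overflow, but a consumer writing the
    value into a typed column should use long/int64, not int32, since
    payload lengths can exceed 2 GiB."""
    try:
        n = int(value.strip())
    except (AttributeError, ValueError):
        return None
    return n if n >= 0 else None

def content_encoding(http_headers):
    """Prefer the crawler-masked header; for WARC files before
    CC-MAIN-2018-34 it is absent, so fall back to the unmasked one."""
    return (http_headers.get('X-Crawler-Content-Encoding')
            or http_headers.get('Content-Encoding'))
```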