Mine metrics about truncated content in WARC files #9

Status: Closed. Wants to merge 4 commits.

Conversation

sebastian-nagel (Contributor)

Add a script that collects metrics about records with truncated content in WARC files:

  • look for records marked by a `WARC-Truncated: ...` header
  • count records whose payload length equals the content limit as potentially truncated (although not marked as such)

See the discussion in the Common Crawl user group for why this script was written.
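The two detection rules above can be sketched as a small pure function. This is a hedged illustration, not the PR's actual code; the 1 MiB content limit and the function name are assumptions of mine, and the crawler's real limit may differ.

```python
# Hedged sketch of the two detection rules, reduced to a pure function.
# CONTENT_LIMIT is an assumption (1 MiB); the crawler's actual content
# limit may differ.
CONTENT_LIMIT = 2 ** 20

def truncation_status(warc_headers, payload_length):
    """Return the truncation reason from the WARC-Truncated header,
    'maybe-length' if the payload exactly hits the assumed content
    limit (potentially truncated but not marked), or None otherwise."""
    reason = warc_headers.get('WARC-Truncated')
    if reason is not None:
        return reason.strip() or 'unspecified'
    if payload_length == CONTENT_LIMIT:
        return 'maybe-length'
    return None
```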

@sebastian-nagel (Contributor, Author)

To get a CSV file listing all (presumably) truncated records in a list of WARC files, run:

$SPARK_HOME/bin/spark-submit ./warc_truncation_stats.py \
     --output_format csv --output_option header=true \
     <file_listing_warc_files_to_analyze> <output_in_spark_warehouse>

…C files

- look for records marked by `WARC-Truncated: ...` header
- add records with payload length equal to content limit as
  potentially truncated records (although not marked as such)
- if no `X-Crawler-Content-Encoding` and `X-Crawler-Transfer-Encoding`
  HTTP header fields are found, look for the unmasked `Content-Encoding`
  and `Transfer-Encoding` headers instead (required for WARC files
  before CC-MAIN-2018-34)
- beware of invalid numbers in the `Content-Length` header
- change output type from int32 to long/int64 to avoid overflows
  if lengths exceed 2 GiB
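The last two bullets, and the encoding fallback, can be illustrated with small helpers. This is a hedged sketch: the header names follow the commit message, but the function names are mine and this is not the PR's actual code.

```python
def parse_content_length(value):
    """Parse a Content-Length header defensively: crawled data often
    contains invalid or negative values, so return None instead of
    raising. Python ints do not overflow, but a consumer writing the
    value into a typed column should use long/int64, not int32, since
    payload lengths can exceed 2 GiB."""
    try:
        n = int(value.strip())
    except (AttributeError, ValueError):
        return None
    return n if n >= 0 else None

def content_encoding(http_headers):
    """Prefer the crawler-masked header; for WARC files before
    CC-MAIN-2018-34 it is absent, so fall back to the unmasked one."""
    return (http_headers.get('X-Crawler-Content-Encoding')
            or http_headers.get('Content-Encoding'))
```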