# Truncated Records in WARC Files (November 2019)

Since November 2019 (CC-MAIN-2019-47), truncated records are marked in the URL indexes. This makes it possible to analyze the distribution of truncated records over an entire monthly crawl, here the November 2019 crawl (CC-MAIN-2019-47).

Counts of truncated records are aggregated per MIME type using [AWS Athena](https://aws.amazon.com/athena/) and the SQL query [average-warc-record-length-by-mime-type.sql](/commoncrawl/cc-index-table/blob/525ab8b16fee54a88706c0aa5aa453b4d7253d7b/src/sql/examples/cc-index/average-warc-record-length-by-mime-type.sql). The aggregations also include WARC record sizes.
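
To make the shape of this aggregation concrete, here is a minimal pandas sketch of the same idea on a tiny, made-up sample. It is not the linked SQL query: the per-record rows are hypothetical, and the column names `content_truncated` and `warc_record_length` are assumed to match the URL index columns.

```python
import pandas

# Hypothetical per-record sample (assumption for illustration only);
# the real aggregation runs in Athena over the full URL index.
records = pandas.DataFrame(
    {
        "content_mime_detected": [
            "text/html", "text/html", "text/html",
            "application/pdf", "application/pdf",
        ],
        # None means the record was not truncated
        "content_truncated": [None, "length", None, "length", "disconnect"],
        "warc_record_length": [20480, 1048576, 10240, 1048576, 524288],
    }
)

grouped = records.groupby("content_mime_detected")
summary = pandas.DataFrame(
    {
        "n_pages": grouped.size(),
        # count() skips missing values, so only truncated records are counted
        "n_truncated": grouped["content_truncated"].count(),
        "avg_warc_record_length": grouped["warc_record_length"].mean(),
    }
)
summary["perc_truncated"] = 100.0 * summary["n_truncated"] / summary["n_pages"]
print(summary)
```

In this sample, one of three `text/html` records and both `application/pdf` records are truncated, which is the kind of per-MIME-type contrast the real data shows at scale.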

In [1]:
import json
import pandas

data = pandas.read_csv('data/warc-record-size-truncation-by-mime-type-CC-MAIN-2019-47.csv')

data[['content_mime_detected', 'n_pages', 'perc_truncated', 'reasons_truncation']].head(20)

| | content_mime_detected | n_pages | perc_truncated | reasons_truncation |
|---|---|---|---|---|
| 0 | text/html | 2002557402 | 0.757578 | {disconnect=1265692, length=13904449, time=791} |
| 1 | application/xhtml+xml | 555659037 | 0.719266 | {disconnect=817826, length=3178707, time=131} |
| 2 | application/pdf | 12206558 | 24.028182 | {disconnect=11271, length=2921479, time=264} |
| 3 | image/jpeg | 3932068 | 9.78096 | {disconnect=2065, length=382529} |
| 4 | application/rss+xml | 3424865 | 0.627499 | {disconnect=2245, length=19236, time=10} |
| 5 | application/atom+xml | 3261205 | 0.128388 | {disconnect=24, length=4163} |
| 6 | text/plain | 2068745 | 2.289891 | {disconnect=3805, length=43556, time=11} |
| 7 | application/xml | 1713357 | 5.201309 | {disconnect=1894, length=87221, time=2} |
| 8 | text/calendar | 959303 | 0.18503 | {disconnect=773, length=1002} |
| 9 | application/json | 633507 | 0.39747 | {disconnect=13, length=2505} |


The aggregations show which MIME types are most affected by truncation.
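
As a quick sanity check, the following sketch recomputes `perc_truncated` for the `text/html` row from its histogram. It assumes that `perc_truncated` is the sum of all reason counts divided by `n_pages`, which the numbers in the table appear to support.

```python
import re

# Values copied from the text/html row of the table above.
n_pages = 2002557402
reasons = "{disconnect=1265692, length=13904449, time=791}"

# Sum the per-reason counts embedded in the histogram string.
n_truncated = sum(int(count) for count in re.findall(r"=(\d+)", reasons))

# Share of truncated records among all records of this MIME type, in percent.
perc_truncated = 100.0 * n_truncated / n_pages
print(n_truncated, round(perc_truncated, 6))
```

The result agrees with the `perc_truncated` value reported for `text/html` to the precision shown in the table.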

Now let's look into the reasons for the truncation and load the histograms with reason counts into columns:

In [2]:
# expand embedded Presto/Athena histogram as columns into data frame
# - transform to valid JSON
data['reasons_truncation'] = data['reasons_truncation'].str.replace('(\\w+)=', '"\\1":', regex=True)
# - load columns in data frame
truncation_reason = data['reasons_truncation'].apply(lambda x: json.loads(x) if isinstance(x, str) else {}).apply(pandas.Series).add_prefix('trunc.')
# - join with original data
data = data.join(truncation_reason)

data['trunc.length.perc'] = 100.0 * data['trunc.length'] / data['n_pages']
data[['content_mime_detected', 'n_pages', 'perc_truncated', 'trunc.length', 'trunc.length.perc']].sort_values(by=['trunc.length'], ascending=False).head(20)

| | content_mime_detected | n_pages | perc_truncated | trunc.length | trunc.length.perc |
|---|---|---|---|---|---|
| 0 | text/html | 2002557402 | 0.757578 | 13904449.0 | 0.694335 |
| 1 | application/xhtml+xml | 555659037 | 0.719266 | 3178707.0 | 0.572061 |
| 2 | application/pdf | 12206558 | 24.028182 | 2921479.0 | 23.933684 |
| 3 | image/jpeg | 3932068 | 9.78096 | 382529.0 | 9.728443 |
| 14 | audio/mpeg | 147410 | 85.80829 | 126384.0 | 85.736382 |
| 7 | application/xml | 1713357 | 5.201309 | 87221.0 | 5.09065 |
| 13 | application/zip | 222256 | 33.397974 | 73971.0 | 33.281891 |
| 10 | image/png | 584080 | 9.197884 | 53457.0 | 9.152342 |
| 6 | text/plain | 2068745 | 2.289891 | 43556.0 | 2.105431 |
| 36 | video/mp4 | 42645 | 92.723649 | 39509.0 | 92.646266 |
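
For most MIME types `trunc.length.perc` is close to `perc_truncated`, which suggests that the length limit is the dominant truncation reason. The sketch below makes this explicit for three MIME types by computing the share of each reason among the truncated records. The counts are copied from the first table; the `reason_share` name is only illustrative.

```python
import pandas

# Reason counts copied from the first table for three MIME types.
counts = pandas.DataFrame(
    {
        "content_mime_detected": ["text/html", "application/pdf", "image/jpeg"],
        "trunc.disconnect": [1265692, 11271, 2065],
        "trunc.length": [13904449, 2921479, 382529],
        # image/jpeg has no "time" entry in its histogram
        "trunc.time": [791, 264, float("nan")],
    }
).set_index("content_mime_detected")

# A missing reason means no record was truncated for that reason.
counts = counts.fillna(0)

# Divide each row by its total to get the share of every reason in percent.
reason_share = 100.0 * counts.div(counts.sum(axis=1), axis=0)
print(reason_share.round(2))
```

In these three rows the `length` reason accounts for more than 90% of all truncations, while `disconnect` plays a visible role only for `text/html`.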
