# Truncated Records in WARC Files (August 2019)

These metrics are based on WARC files from the August 2019 crawl (CC-MAIN-2019-35), written after fixes and improvements in the WARC writer; cf. [cc/nutch#10](//github.com/commoncrawl/nutch/issues/10) and the [metrics based on October 2018 WARC files](./cc-main-2018-43-single-warc-file.ipynb).

Truncated records were extracted using cc-pyspark (see [cc-pyspark#9](//github.com/commoncrawl/cc-pyspark/pull/9)) from 100 randomly selected WARC files of the August 2019 crawl, listed in [CC-MAIN-2019-35-warc.paths](./data/CC-MAIN-2019-35-warc.paths).

In [1]:
import pandas

data = pandas.read_csv('data/truncated-records-CC-MAIN-2019-35-warc-100.csv')

Verify the fix for [NUTCH-2729](https://issues.apache.org/jira/browse/NUTCH-2729): records whose payload is exactly as large as the content limit (1 MiB) are marked as truncated, because it is unlikely that an untruncated payload is exactly 1 MiB long.

There are no unmarked records in the analyzed set (the counts contain no `NaN` entry):

In [2]:
data['truncated_reason'].value_counts(dropna=False)

length        26643
disconnect     4223
time              2
Name: truncated_reason, dtype: int64
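The same check can be phrased as an explicit filter: select records whose payload hits the limit exactly and that carry no truncation reason. The following is a minimal sketch on a tiny made-up frame; the rows are hypothetical and not taken from the CSV, only the column names `payload_length` and `truncated_reason` are those used in this notebook.

```python
import pandas as pd

CONTENT_LIMIT = 2**20  # 1 MiB content limit

# Hypothetical sample rows, for illustration only (not taken from the CSV)
sample = pd.DataFrame({
    'payload_length': [1048576, 1048576, 51200, 1048576],
    'truncated_reason': ['length', 'length', 'disconnect', None],
})

# Records that hit the limit exactly but carry no truncation marker
at_limit = sample['payload_length'] == CONTENT_LIMIT
unmarked = sample[at_limit & sample['truncated_reason'].isna()]
print(len(unmarked))
```

On the real data, an empty `unmarked` frame would correspond to the absence of a `NaN` row in the counts above.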

Let's only look at pages/documents truncated because of the length limit

In [3]:
tr = data[data['truncated_reason'] == 'length']

and check, first, how long the records are (all are exactly 1 MiB):

In [4]:
tr['payload_length'].value_counts(dropna=False)

1048576    26643
Name: payload_length, dtype: int64

and, second, which content types are truncated:

In [5]:
tr['identified_payload_type'].value_counts(dropna=False)

text/html                                                                    17989
application/xhtml+xml                                                         5234
application/pdf                                                               1496
image/jpeg                                                                     629
audio/mpeg                                                                     188
application/zip                                                                124
text/plain                                                                      83
image/png                                                                       82
application/xml                                                                 69
application/octet-stream                                                        67
video/mp4                                                                       56
audio/mp4                                                                       54
appl

Now let's check records truncated for a reason other than the payload length:

In [6]:
tr = data[data['truncated_reason'] != 'length']
tr['payload_length'].mean()/1024

69.48091808431953

The average size after truncation (about 69 KiB) is significantly lower than 1 MiB, and the share of HTML pages among the truncated document types is larger:

In [7]:
tr['identified_payload_type'].value_counts(dropna=False)

text/html                                  2571
application/xhtml+xml                      1615
application/pdf                              12
image/jpeg                                    7
application/xml                               4
application/vnd.android.package-archive       4
video/x-matroska                              3
text/plain                                    2
text/x-vcard                                  2
text/calendar                                 1
application/x-tika-msoffice                   1
application/gzip                              1
application/octet-stream                      1
text/x-matlab                                 1
Name: identified_payload_type, dtype: int64
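To put a number on the share of HTML pages, the counts printed above can be turned into percentages. This is a small sketch using only figures copied from the outputs in this notebook; counting `application/xhtml+xml` as HTML is an assumption made here for the comparison.

```python
# Counts copied from the outputs above
length_total = 26643          # records truncated because of the length limit
length_html = 17989 + 5234    # text/html + application/xhtml+xml
other_total = 4223 + 2        # disconnect + time
other_html = 2571 + 1615      # text/html + application/xhtml+xml

share_length = 100.0 * length_html / length_total
share_other = 100.0 * other_html / other_total
print(f'{share_length:.1f}% vs. {share_other:.1f}%')
```

This gives roughly 87% HTML among length-truncated records versus roughly 99% among records truncated for other reasons.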

Now let's look at the WARC record sizes of all records truncated because of the length limit.

Because Common Crawl uses gzipped WARC files as its primary archive format, what matters in practice is the size of the gzip-compressed WARC records rather than the size of the uncompressed content payload. Let's relate the compressed and uncompressed lengths for the various content types:

In [8]:
data['count'] = 1
tr = data[data['truncated_reason'] == 'length']
tr = tr[['warc_record_length', 'warc_content_length', 'payload_length', 'identified_payload_type', 'count']]
r = tr.groupby(['identified_payload_type']).sum()
r = r[r['count'] > 3] # drop rare document types
r['% ratio'] = 100.0*r['warc_record_length']/r['warc_content_length']
for col in ['warc_record_length', 'warc_content_length', 'payload_length']:
    r[col] = r[col]/2**20 # show aggregated length in MiB
r

identified_payload_type,warc_record_length,warc_content_length,payload_length,count,% ratio
application/atom+xml,0.86954,7.003787,7.0,7,12.415286
application/epub+zip,32.613303,33.015307,33.0,33,98.78237
application/fits,5.050832,8.002682,8.0,8,63.11424
application/gzip,9.990907,10.003711,10.0,10,99.872007
application/json,0.743587,6.002691,6.0,6,12.387553
application/msword,4.433574,7.003987,7.0,7,63.30071
application/octet-stream,54.735172,67.029173,67.0,67,81.658731
application/pdf,1342.943089,1496.863754,1496.0,1496,89.717123
application/rdf+xml,0.356032,4.001985,4.0,4,8.896395
application/rss+xml,7.564011,29.014924,29.0,29,26.069379


Note that `warc_record_length` is the compressed size of the entire record, including the WARC headers, while `warc_content_length` is the uncompressed size without them. That's why the ratio can exceed 100% for content that is already compressed.
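The effect can be reproduced with the standard library alone. The sketch below is a simplification, not the WARC writer's actual procedure: random bytes stand in for an already compressed payload (for example a zip or JPEG), and the header block is an abbreviated, hypothetical placeholder.

```python
import gzip
import os

# Incompressible stand-in for an already compressed payload
payload = os.urandom(2**20)
# Abbreviated, hypothetical WARC header block (a real one is longer)
warc_headers = b'WARC/1.0\r\nWARC-Type: response\r\n\r\n'

content_length = len(payload)                               # analogous to warc_content_length
record_length = len(gzip.compress(warc_headers + payload))  # analogous to warc_record_length

ratio = 100.0 * record_length / content_length
print(ratio > 100.0)
```

Since gzip cannot shrink random data, the compressed record ends up slightly larger than the content it wraps.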

Finally, a brief look at very large captures whose original `Content-Length` HTTP header exceeds 1 GiB:

In [9]:
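# numeric_only=True keeps the aggregation to numeric columns on recent pandas versions
data[data['http_orig_content_length'] > 2**30].groupby(['identified_payload_type']).sum(numeric_only=True)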

identified_payload_type,warc_record_offset,warc_record_length,warc_content_length,http_content_length,payload_length,http_orig_content_length,count
application/octet-stream,7497503,742809,1049026,1048576,1048576,2073100288,1
application/x-tar,126888947,950904,1048870,1048576,1048576,6430743040,1
application/zip,474064537,3119285,3147708,3145728,3145728,4086519668,3
model/vnd.mts,146964408,1035285,1048917,1048576,1048576,2045650944,1
video/mp4,1095804704,1145435,2098097,2097152,2097152,3006448624,2
video/mpeg,195450414,1023648,1048880,1048576,1048576,1785134452,1
video/x-matroska,199548452,3132836,3177048,3175878,3175878,6035122172,4
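To get a feeling for how little of such captures is actually stored, the stored payload can be related to the size announced in the original `Content-Length` header. This is a small sketch using two rows copied from the table above; the helper `captured_percent` is a hypothetical name introduced here for illustration.

```python
def captured_percent(payload_length, orig_content_length):
    """Percentage of the announced content size that was stored."""
    return 100.0 * payload_length / orig_content_length

# (payload_length, http_orig_content_length) sums copied from the table above
captures = {
    'application/x-tar': (1048576, 6430743040),
    'video/mp4': (2097152, 3006448624),
}

for mime, (kept, original) in captures.items():
    print(f'{mime}: {captured_percent(kept, original):.3f}% of the announced size stored')
```

For both content types well below 0.1% of the announced size ends up in the archive.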
