# Truncated Records in WARC Files (October 2018)

The content payload in Common Crawl archives is truncated if it exceeds a limit of:
- [500 kiB in 2008 – 2012 ARC files](https://groups.google.com/d/topic/common-crawl/hQTRnWahcHA/discussion)
- 1 MiB in WARC files (since 2013)

The truncation is required to keep the crawl archives at a limited size and to ensure that a broad sample of web pages is covered. It also prevents the archives from being filled with accidentally captured video or audio streams. In addition, the crawler needs to buffer the content temporarily, and a limit ensures that this is possible with a limited amount of RAM across many parallel connections.
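
To illustrate how a content limit bounds the buffer needed per connection, here is a minimal sketch in Python. It is not the crawler's actual implementation; the function name `read_limited`, the chunk size, and the returned reason string are assumptions made for this example.

```python
import io

CONTENT_LIMIT = 2**20  # 1 MiB, the limit for WARC files since 2013


def read_limited(stream, limit=CONTENT_LIMIT, chunk_size=8192):
    """Read at most `limit` bytes from `stream` (hypothetical helper).

    Returns (payload, truncated_reason). The reason is 'length' if the
    stream held more than `limit` bytes, otherwise None.
    """
    buf = bytearray()
    while len(buf) <= limit:
        # Never request more than one byte beyond the limit, so the
        # buffer stays bounded no matter how large the response is.
        chunk = stream.read(min(chunk_size, limit + 1 - len(buf)))
        if not chunk:
            return bytes(buf), None  # stream ended within the limit
        buf.extend(chunk)
    return bytes(buf[:limit]), 'length'


payload, reason = read_limited(io.BytesIO(b"x" * (CONTENT_LIMIT + 100)))
print(len(payload), reason)
```

Reading one byte past the limit lets the sketch distinguish a payload of exactly 1 MiB from one that was cut off.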


While the mission of Common Crawl has always been to provide a broad sample of HTML pages, there have always been users interested in other document formats: PDF, office documents, etc. Recently, a discussion started in the [Common Crawl forum](https://groups.google.com/forum/#!forum/common-crawl) about [increasing the 1 MiB content limit](https://groups.google.com/d/topic/common-crawl/JJW6fv1rUQw/discussion). It also turned out that truncated payloads are not reliably marked using the [WARC-Truncated header](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-truncated), see [cc/nutch#10](//github.com/commoncrawl/nutch/issues/10).


The metrics shown here are based on a single WARC file ([CC-MAIN-20181016221847-20181017003347-00114.warc.gz](https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-43/segments/1539583510893.26/warc/CC-MAIN-20181016221847-20181017003347-00114.warc.gz)) of the October 2018 crawl. Truncated records were extracted using cc-pyspark, see [cc-pyspark#9](//github.com/commoncrawl/cc-pyspark/pull/9).

The length limit is applied to the binary content after the HTTP protocol-level content and transfer encodings have been removed (decoded). Every WARC record consists of WARC headers, HTTP headers and the content payload, and is compressed as a whole using gzip. This leads to several length metrics (in bytes):
- `warc_record_length`: length of the compressed WARC record (WARC headers, HTTP headers and content payload)
- `warc_content_length`: uncompressed length of HTTP headers and content payload (given by the mandatory `Content-Length` WARC header)
- `http_content_length` or `payload_length`: the actual size of the payload
- `http_orig_content_length`: the original value of the HTTP `Content-Length` header (if present), which indicates the entire length with protocol-level compression applied
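
The relationship between the first three metrics can be sketched with a synthetic record. This is only an illustration under simplifying assumptions: the headers are hypothetical and heavily shortened, the separators between the WARC headers and the record block are ignored, and `length_metrics` is a helper name introduced here.

```python
import gzip


def length_metrics(warc_headers: bytes, http_headers: bytes, payload: bytes):
    """Compute length metrics for one synthetic record (hypothetical helper)."""
    block = http_headers + payload   # what the WARC Content-Length header counts
    record = warc_headers + block    # the whole record is gzip-compressed
    return {
        'warc_record_length': len(gzip.compress(record)),
        'warc_content_length': len(block),
        'payload_length': len(payload),
    }


# Hypothetical, heavily simplified headers for illustration only
warc_headers = b"WARC/1.0\r\nWARC-Type: response\r\n\r\n"
http_headers = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
payload = b"<p>hello</p>" * 1000

m = length_metrics(warc_headers, http_headers, payload)
# compressed record size as a percentage of the uncompressed content size
ratio = 100.0 * m['warc_record_length'] / m['warc_content_length']
print(m, ratio)
```

Highly repetitive markup like this compresses well, so the ratio is far below 100%. Payloads in formats that are already compressed would give a ratio close to 100%.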

But let's now inspect the set of truncated records.

In [1]:
import pandas

data = pandas.read_csv('data/truncated-records-CC-MAIN-20181015080248-20181015101748-00033.csv')

Because of a bug, not all truncated records are marked as such. The table therefore also includes records with a payload length exactly equal to the content limit (1 MiB). These are presumably truncated, as it is unlikely that a page or document is exactly 1 MiB long. It appears that the majority of truncated records are not marked appropriately:
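
The selection rule described above can be sketched as a small function. The name `classify` and the returned labels are assumptions made for this illustration and are not part of the extraction job.

```python
CONTENT_LIMIT = 2**20  # 1 MiB


def classify(truncated_reason, payload_length, limit=CONTENT_LIMIT):
    """Classify a record by its truncation marker and payload length.

    `truncated_reason` is the value of the WARC-Truncated header,
    or None if the header is absent (hypothetical helper).
    """
    if truncated_reason is not None:
        return 'marked: ' + truncated_reason
    if payload_length == limit:
        # No marker, but the payload stops exactly at the limit
        return 'presumably truncated (unmarked)'
    return 'not truncated'


print(classify('length', 2**20))
print(classify(None, 2**20))
print(classify(None, 5000))
```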

In [2]:
data['truncated_reason'].value_counts(dropna=False)

NaN           262
length        105
disconnect     46
Name: truncated_reason, dtype: int64

Let's look only at pages/documents truncated because of the length limit, i.e. those with a payload length of exactly 1 MiB,

In [3]:
tr = data[data['payload_length'] == 2**20]

and see which content types are truncated:

In [4]:
tr['identified_payload_type'].value_counts(dropna=False)

text/html                                  210
application/pdf                             88
application/xhtml+xml                       37
image/jpeg                                   9
application/xml                              6
audio/mpeg                                   4
application/x-dosexec                        2
application/vnd.android.package-archive      1
image/png                                    1
application/octet-stream                     1
audio/mp4                                    1
application/x-rar-compressed                 1
video/webm                                   1
application/x-msdownload; format=pe32        1
image/bmp                                    1
application/gzip                             1
application/zip                              1
video/x-msvideo                              1
Name: identified_payload_type, dtype: int64

Because Common Crawl uses gzipped WARC files as its primary archive format, what matters for storage is the size of the gzip-compressed WARC records rather than the size of the uncompressed content payload. Let's relate the compressed and uncompressed lengths for the various content types:

In [5]:
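# select the length columns; copy to avoid modifying a view of `data`
tr = tr[['warc_record_length', 'warc_content_length', 'payload_length', 'identified_payload_type']].copy()
tr['count'] = 1  # every row counts as one record
r = tr.groupby('identified_payload_type').sum()
# compressed record size as a percentage of the uncompressed content size
r['% ratio'] = 100.0 * r['warc_record_length'] / r['warc_content_length']
for col in ['warc_record_length', 'warc_content_length', 'payload_length']:
    r[col] = r[col] / 2**20  # show aggregated length in MiB
r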

| identified_payload_type | warc_record_length (MiB) | warc_content_length (MiB) | payload_length (MiB) | count | % ratio |
|---|---:|---:|---:|---:|---:|
| application/gzip | 1.000004 | 1.000328 | 1.0 | 1 | 99.967586 |
| application/octet-stream | 1.001059 | 1.000244 | 1.0 | 1 | 100.081424 |
| application/pdf | 80.557208 | 88.033373 | 88.0 | 88 | 91.507579 |
| application/vnd.android.package-archive | 0.97899 | 1.000364 | 1.0 | 1 | 97.863308 |
| application/x-dosexec | 1.955378 | 2.000984 | 2.0 | 2 | 97.720791 |
| application/x-msdownload; format=pe32 | 0.940312 | 1.000352 | 1.0 | 1 | 93.99816 |
| application/x-rar-compressed | 0.994845 | 1.000288 | 1.0 | 1 | 99.455895 |
| application/xhtml+xml | 3.651897 | 37.018148 | 37.0 | 37 | 9.865154 |
| application/xml | 0.261678 | 6.001943 | 6.0 | 6 | 4.359884 |
| application/zip | 1.001128 | 1.000469 | 1.0 | 1 | 100.065868 |


Note that `warc_record_length` also includes the WARC headers while `warc_content_length` does not. The ratio is therefore slightly too large, and it can even exceed 100% for content that barely compresses (e.g. `application/zip`).