# Accessing the Common Crawl indexes

## Notes

Common Crawl is stored in the `s3://commoncrawl` bucket, so you can start exploring it with the AWS command-line tools:

In [1]:
%%bash
aws s3 ls s3://commoncrawl

                           PRE cc-index/
                           PRE contrib/
                           PRE crawl-001/
                           PRE crawl-002/
                           PRE crawl-analysis/
                           PRE crawl-data/
                           PRE hive_analysis/
                           PRE index2012/
                           PRE mapred-temp/
                           PRE meanpath/
                           PRE parse-output-test/
                           PRE parse-output/
                           PRE projects/
                           PRE stats-output/
                           PRE wikipedia/
2016-05-21 14:17:43         25 robots.txt


We can then drill down through these prefixes until we reach the index files for a single crawl:

In [2]:
%%bash
aws s3 ls s3://commoncrawl/cc-index/collections/CC-MAIN-2017-47/indexes/

2017-11-25 14:35:56  864874660 cdx-00000.gz
2017-11-25 14:32:56  825651663 cdx-00001.gz
2017-11-25 14:37:17  906565380 cdx-00002.gz
2017-11-25 14:35:32  880110856 cdx-00003.gz
2017-11-25 14:37:10  831190551 cdx-00004.gz
2017-11-25 14:36:41  875623695 cdx-00005.gz
2017-11-25 14:33:34  668722507 cdx-00006.gz
2017-11-25 14:37:16  908400359 cdx-00007.gz
2017-11-25 14:34:57  790358317 cdx-00008.gz
2017-11-25 14:35:48  769691459 cdx-00009.gz
2017-11-25 14:36:17  864471407 cdx-00010.gz
2017-11-25 14:33:08  897341346 cdx-00011.gz
2017-11-25 14:33:38  748446905 cdx-00012.gz
2017-11-25 14:38:49 1120286215 cdx-00013.gz
2017-11-25 14:35:01  761875442 cdx-00014.gz
2017-11-25 14:32:57  604735303 cdx-00015.gz
2017-11-25 14:33:20  636916760 cdx-00016.gz
2017-11-25 14:36:02  860977107 cdx-00017.gz
2017-11-25 14:33:56  778215512 cdx-00018.gz
2017-11-25 14:37:28 1201585040 cdx-00019.gz
2017-11-25 14:34:59  899674194 cdx-00020.gz
2017-11-25 14:38:45 1018483141 cdx-00021.gz
2017-11-25

We can now process these files in parallel, EMR-style, across however many executors we have, reading each gzip file line by line. A single line has the following format:

 `reversed-split-url` ")" `path` `datetime` `jsonobject`


This JSON object has the following fields:

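* url - the URL
* mime - the Content-Type
* status - the HTTP status code
* digest - a digest of the payload; how it is calculated still needs documenting here, but in the sample record further down it matches the SHA-1 value in the WARC-Payload-Digest header
* length - length in bytes of the compressed record within the WARC file
* offset - a byte offset into the WARC file
* filename - the relative path within the data where the WARC data will be found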

This means that if we can decide what we are interested in from the URL alone, we can filter the index first and greatly reduce the amount of WARC data we need to process.
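As a minimal sketch of reading one index line, the function below splits a line into its key, timestamp, and JSON record. It assumes the three parts are separated by single spaces, and `parse_cdx_line` is a name introduced here for illustration. The sample line is assembled from the record shown later in this notebook rather than copied from a real CDX file.

```python
import json

def parse_cdx_line(line):
    """Split one CDX index line into (urlkey, timestamp, record dict)."""
    # The JSON object may itself contain spaces, so split on the first two only.
    urlkey, timestamp, raw_json = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(raw_json)

# Illustrative line assembled from the sample record shown later in this notebook.
sample_line = (
    'gov,nih,cit)/robots.txt 20171118214035 '
    '{"url": "http://www.cit.nih.gov/robots.txt", "mime": "text/html", '
    '"status": "301", "digest": "EY6RM7AWLMSJR77DD35456OSZPRNQIZK", '
    '"length": "639", "offset": "1184043", '
    '"filename": "crawl-data/CC-MAIN-2017-47/segments/1510934805049.34/'
    'robotstxt/CC-MAIN-20171118210145-20171118230145-00597.warc.gz"}'
)

urlkey, timestamp, record = parse_cdx_line(sample_line)
print(urlkey, timestamp, record["status"])
```

Filtering on `urlkey` or `record["url"]` at this stage is what lets us skip most of the WARC data later.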

### Processing Steps

1. Map the CDX files you want to process out to AWS Batch.
2. In each Batch job, pick out the URLs matching the pattern you are interested in and store the results in DynamoDB. You need not store every field, but you should partition by the WARC filename and sort by the offset.
3. In another Batch step, process each WARC file once, in offset order, and pull out the data to index from the WARC.

   * Build a new WARC file, storing only the data you actually are reading, and update DynamoDB with the new offset.
   * Do indexing of the WARC data - this can be a separate traversal.
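The partition-and-sort idea in step 2 can be sketched in plain Python. This is an in-memory stand-in, not DynamoDB code: a dict keyed by WARC filename plays the role of the partition key, and sorting by offset plays the role of the sort key. The function name and the sample records are hypothetical.

```python
from collections import defaultdict

def group_by_warc(records):
    """Group index records by WARC filename, each group sorted by byte offset."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["filename"]].append(rec)
    for recs in groups.values():
        # Offsets arrive as strings in the index JSON, so compare them as integers.
        recs.sort(key=lambda r: int(r["offset"]))
    return dict(groups)

# Hypothetical records; the filenames and offsets are made up for illustration.
records = [
    {"filename": "a.warc.gz", "offset": "1000", "length": "10"},
    {"filename": "b.warc.gz", "offset": "5", "length": "20"},
    {"filename": "a.warc.gz", "offset": "900", "length": "30"},
]

plan = group_by_warc(records)
for filename, recs in plan.items():
    print(filename, [r["offset"] for r in recs])
```

Sorting numerically matters here: as strings, `"1000"` would sort before `"900"`, which would break the single forward pass through each WARC file in step 3.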

### Scope of Data

This is still big data:
 * There are 301 index files in s3://commoncrawl/cc-index/collections/CC-MAIN-2017-47/indexes/
 * One of these holds about 3.75 million URLs.
 * There are therefore about 1.1 billion URLs in total.

In [4]:
toturls = 301 * 3750324      # Number_of_index_files * Number_of_lines_in_cdx-00035.gz
toturls

1128847524

### External filtering technique

We can also use a new [index of the common crawl](http://index.commoncrawl.org/) to query for all pages under the `nih.gov` domain, for example:

http://index.commoncrawl.org/CC-MAIN-2017-47-index?url=*.nih.gov&output=json&showNumPages=true

In [5]:
import requests

r = requests.get('http://index.commoncrawl.org/CC-MAIN-2017-47-index',
             params={
                 'url': '*.nih.gov',
                 'output': 'json',
                 'pageSize': 5,
                 'showNumPages': True})
r.json()


{'blocks': 341, 'pageSize': 5, 'pages': 69}

Now, we can page through those results page by page:

In [6]:
import requests
r = requests.get('http://index.commoncrawl.org/CC-MAIN-2017-47-index',
                params={
                    'url': '*.nih.gov',
                    'output': 'json',
                    'pageSize': 5,
                    'page': 0
                })
r

<Response [200]>

Note that a "block" as returned by this is not a line. If we count the lines in the page we just retrieved (page 0):

In [7]:
len(r.content.decode('utf-8').split('\n'))

13685

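So that is far more than 5 URLs: the page holds 5 blocks, and each block holds thousands of lines. A quick estimate of the total, assuming every page is about this size: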

In [8]:
tot_nih_urls = 69 * 13685
tot_nih_urls

944265
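Counting with `split('\n')` also counts an empty final piece when the response ends with a newline, which appears to be the case here. A small sketch of a parser that skips blank lines and returns one dict per record follows; `parse_index_response` is a name introduced for illustration, and the sample text is made up.

```python
import json

def parse_index_response(text):
    """Parse a newline-delimited JSON response, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]

# Made-up two-record response that ends with a trailing newline.
sample = '{"url": "http://a.example/"}\n{"url": "http://b.example/"}\n'

naive_count = len(sample.split("\n"))        # includes the empty trailing piece
records = parse_index_response(sample)       # only the real records
print(naive_count, len(records))
```

For an order-of-magnitude estimate the off-by-one does not matter, but it does matter when indexing into the list of lines.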

We can now farm out the work using this dynamic query:

1. Query for the number of pages.
2. Produce that range as the input to a map operation, e.g. `range(0, 69)` using the numbers above.
3. Each mapper retrieves the data for its own page and indexes just those records into DynamoDB, keyed by WARC file.
4. The next step queries DynamoDB by WARC file and takes each WARC file apart.
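Steps 1 and 2 can be sketched as a function that turns a page count into one set of query parameters per page. It makes no network calls; `page_queries` and its default arguments are assumptions made for illustration, mirroring the parameters used in the cells above.

```python
def page_queries(num_pages, url_pattern="*.nih.gov", page_size=5):
    """Build one set of query parameters per result page."""
    return [
        {"url": url_pattern, "output": "json", "pageSize": page_size, "page": page}
        for page in range(num_pages)
    ]

tasks = page_queries(69)  # 69 is the 'pages' value reported by the index above
print(len(tasks), tasks[0]["page"], tasks[-1]["page"])
```

Each dict can be handed to a mapper, which passes it as `params` to `requests.get` in the same way as the cells above.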

In [9]:
import json
page = json.loads(r.content.decode('utf-8').split('\n')[13683])
page

{'digest': 'EY6RM7AWLMSJR77DD35456OSZPRNQIZK',
 'filename': 'crawl-data/CC-MAIN-2017-47/segments/1510934805049.34/robotstxt/CC-MAIN-20171118210145-20171118230145-00597.warc.gz',
 'length': '639',
 'mime': 'text/html',
 'mime-detected': 'text/html',
 'offset': '1184043',
 'status': '301',
 'timestamp': '20171118214035',
 'url': 'http://www.cit.nih.gov/robots.txt',
 'urlkey': 'gov,nih,cit)/robots.txt'}

We can use the HTTP Range header when retrieving this record so that we download only its bytes rather than the whole WARC file. We'll use the boto3 library to fetch it.

In [10]:
import boto3
client = boto3.client('s3')

In [28]:
offset = int(page['offset'])
end_offset = offset + int(page['length']) - 1
offset, end_offset

(1184043, 1184681)
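The `- 1` is there because HTTP byte ranges are inclusive at both ends. A small helper makes that explicit; `byte_range` is a hypothetical name, not part of boto3.

```python
def byte_range(offset, length):
    """Return an HTTP Range header value covering `length` bytes from `offset`."""
    # Both ends of an HTTP byte range are inclusive, hence the -1.
    return "bytes={}-{}".format(offset, offset + length - 1)

range_header = byte_range(1184043, 639)  # offset and length from the record above
print(range_header)
```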

In [29]:
obj = client.get_object(Bucket='commoncrawl', Key=page['filename'], Range='bytes={}-{}'.format(offset, end_offset))
obj

{'AcceptRanges': 'bytes',
 'Body': <botocore.response.StreamingBody at 0x7351400>,
 'ContentLength': 639,
 'ContentRange': 'bytes 1184043-1184681/1972251',
 'ContentType': 'application/octet-stream',
 'ETag': '"42fdb0aa528931b6ce96c7a8c403579f"',
 'LastModified': datetime.datetime(2017, 11, 19, 0, 7, 30, tzinfo=tzutc()),
 'Metadata': {},
 'ResponseMetadata': {'HTTPHeaders': {'accept-ranges': 'bytes',
   'content-length': '639',
   'content-range': 'bytes 1184043-1184681/1972251',
   'content-type': 'application/octet-stream',
   'date': 'Mon, 04 Dec 2017 23:05:16 GMT',
   'etag': '"42fdb0aa528931b6ce96c7a8c403579f"',
   'last-modified': 'Sun, 19 Nov 2017 00:07:30 GMT',
   'server': 'AmazonS3',
   'x-amz-id-2': '5WKf2SC8P6iFVCdoK9Zke9TH6Ta/H8kVDEybxV+ldPzqWtCFzIH7Joe9vB9t1op7DBG6zkxUK8Y=',
   'x-amz-request-id': '8395F6CACD28D0E1'},
  'HTTPStatusCode': 206,
  'HostId': '5WKf2SC8P6iFVCdoK9Zke9TH6Ta/H8kVDEybxV+ldPzqWtCFzIH7Joe9vB9t1op7DBG6zkxUK8Y=',
  'RequestId': '8395F6CACD28D0E1',
  'R

In [30]:
compressed_data = obj['Body'].read()
compressed_data

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00\x85R\xcbr\x9b0\x14\xdd3\xe3\x7fP\xbd\xae\x00\xf1\x86\x10f\x88q\x13\xbb\xb6\xe3`2n\xbd\x13\x920Lm\xc4\x80\xec\xc4\x7f_\xf9\x95\xb6if\xb2\x93\xce}\x9d{\xeeY\xc6\xe9@C\xaa\xdeS\x96\xf2\x05\xb3C\xc3\x02\xd0\xb2\xae\xe1u\xc7.`\x82\x85\x04\r\x1d\xb9\x10!\x88\xbc\xcc@\x81\xa5\x07\xa6\xbd\xba$\xa4\x8c\xf0\x96\xc2Q\x12\x80p\xd7\xd6\xc1nW\xd1\xc0\xb2|\xea[\x04C\xdbC\x05\xb4,J\xa0\x8fM\x17\x12\x87z\xbeg\xfb\xb6\x8b\x9c\xa8\xa7\x0cx-X-\xe0\x84\xd5kQ\x06\xc0r\xf4?\xe0\x99\x0en\x9aME\xb0\xa8x\xad\x95B47`\xdb\xad\x85\x0c\xdd\xbe#\xba\xc4-\xa9\xea\x82\xbfcB\x88\x933\xdf\xf2\xa0Q\xd8:\xb4\x9c\xdc\x87y\xe1 h\x12\x8aL\xdbD\xbe\x8fHti!\'\x93]\xdb\x9e\x86\xf3\xbf\x9b\xd0\xdc\xb4p\xc1\xe4\xfe\x8e\xeeB\xcbu\x0b\xe8a\'\x87\x9e\x81\xfd\xdc(P\xe1\xe9\xee\xb5\xc9h\x0ecJ%\xb9.\x00\xb6\xa5z\xbej\xba\xaam_5\xc6\xed\x9a\t\xf8\x9c\x8e\x02p\\\'\xd0\xb4\x97\x97\x17\x95TB\xad\xabR]\xf3\xbd\xd6\xf2\x9c\x8bN\x15\xaf\xe2R3\xc7\x87\r\xc7\x14&\xd5\x9au"\x00]\x89Q0\xfc\xe9\xa4S7^N\xa

In [38]:
from io import BytesIO
import gzip

warc_excerpt = BytesIO(compressed_data)
warc_reader = gzip.GzipFile(fileobj=warc_excerpt, mode='r')
warc_data = warc_reader.read().decode('utf-8')
warc_data

'WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2017-11-18T21:40:35Z\r\nWARC-Record-ID: <urn:uuid:449d94ca-581f-44dc-9a37-c6d898595716>\r\nContent-Length: 460\r\nContent-Type: application/http; msgtype=response\r\nWARC-Warcinfo-ID: <urn:uuid:cc6be948-2f50-46b9-bf61-3cd13531991c>\r\nWARC-Concurrent-To: <urn:uuid:db34afe1-1607-477f-8a6b-82a9b2f1f807>\r\nWARC-IP-Address: 54.89.37.55\r\nWARC-Target-URI: http://www.cit.nih.gov/robots.txt\r\nWARC-Payload-Digest: sha1:EY6RM7AWLMSJR77DD35456OSZPRNQIZK\r\nWARC-Block-Digest: sha1:R3MAZAOGGLR4NWNTMZJGTAA3XJUHQQBJ\r\nWARC-Identified-Payload-Type: text/html\r\n\r\nHTTP/1.1 301 Moved Permanently\r\nDate: Sat, 18 Nov 2017 21:40:35 GMT\r\nContent-Type: text/html; charset=iso-8859-1\r\nContent-Length: 242\r\nConnection: close\r\nServer: Apache\r\nLocation: https://www.cit.nih.gov/robots.txt\r\n\r\n<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>301 Moved Permanently</title>\n</head><body>\n<h1>Moved Permanently</h1>\n<p>The docu

In [40]:
warc, headers, response = warc_data.strip().split('\r\n\r\n', 2)
warc

'WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2017-11-18T21:40:35Z\r\nWARC-Record-ID: <urn:uuid:449d94ca-581f-44dc-9a37-c6d898595716>\r\nContent-Length: 460\r\nContent-Type: application/http; msgtype=response\r\nWARC-Warcinfo-ID: <urn:uuid:cc6be948-2f50-46b9-bf61-3cd13531991c>\r\nWARC-Concurrent-To: <urn:uuid:db34afe1-1607-477f-8a6b-82a9b2f1f807>\r\nWARC-IP-Address: 54.89.37.55\r\nWARC-Target-URI: http://www.cit.nih.gov/robots.txt\r\nWARC-Payload-Digest: sha1:EY6RM7AWLMSJR77DD35456OSZPRNQIZK\r\nWARC-Block-Digest: sha1:R3MAZAOGGLR4NWNTMZJGTAA3XJUHQQBJ\r\nWARC-Identified-Payload-Type: text/html'

In [41]:
headers

'HTTP/1.1 301 Moved Permanently\r\nDate: Sat, 18 Nov 2017 21:40:35 GMT\r\nContent-Type: text/html; charset=iso-8859-1\r\nContent-Length: 242\r\nConnection: close\r\nServer: Apache\r\nLocation: https://www.cit.nih.gov/robots.txt'

In [42]:
response

'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>301 Moved Permanently</title>\n</head><body>\n<h1>Moved Permanently</h1>\n<p>The document has moved <a href="https://www.cit.nih.gov/robots.txt">here</a>.</p>\n</body></html>'
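The decompress-and-split steps above can be wrapped into one function, sketched below. `split_warc_record` is a name introduced here, and decoding with `errors="replace"` is an assumption added because not every page is UTF-8 (this record declares iso-8859-1). The record used to exercise it is synthetic.

```python
import gzip

def split_warc_record(compressed):
    """Gunzip one WARC record and split it into WARC headers, HTTP headers, body."""
    text = gzip.decompress(compressed).decode("utf-8", errors="replace")
    # maxsplit=2 keeps any blank lines inside the body intact.
    parts = text.strip().split("\r\n\r\n", 2)
    # Pad so callers always get three values, even if the record has no body.
    parts += [""] * (3 - len(parts))
    return tuple(parts)

# Synthetic record, built here only to exercise the function.
raw = (
    "WARC/1.0\r\nWARC-Type: response\r\n\r\n"
    "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n"
    "hello\r\n\r\nworld"
)
warc_part, http_part, body_part = split_warc_record(gzip.compress(raw.encode("utf-8")))
print(http_part.split("\r\n")[0])
```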

### Not without work

Although we have successfully decoded this record, getting there clearly takes some work. This particular record is also a dud: it is a 301 redirect, so we have to assume that Common Crawl indexed the redirect target separately. We may want to store graph triples in DynamoDB so that we can see what links to what.
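One possible shape for such a triple, sketched for the redirect case only: read the status line and the Location header from the HTTP header block and emit `(source, predicate, target)`. The function name and the `redirects_to` predicate are hypothetical choices, not an established schema.

```python
def redirect_triple(source_url, http_headers):
    """Return (source, 'redirects_to', target) for a 3xx response, else None."""
    lines = http_headers.split("\r\n")
    status = lines[0].split(" ", 2)[1]  # e.g. '301' from 'HTTP/1.1 301 Moved ...'
    if not status.startswith("3"):
        return None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            return (source_url, "redirects_to", value.strip())
    return None

# Header block abbreviated from the record decoded above.
sample_headers = (
    "HTTP/1.1 301 Moved Permanently\r\n"
    "Content-Type: text/html; charset=iso-8859-1\r\n"
    "Location: https://www.cit.nih.gov/robots.txt"
)
triple = redirect_triple("http://www.cit.nih.gov/robots.txt", sample_headers)
print(triple)
```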