# Robots.txt WARC Captures of Top-K Websites - Data Preparation

Objective: extract the robots.txt WARC records of top-k ranked websites from [Common Crawl's robots.txt dataset](https://commoncrawl.org/2016/09/robotstxt-and-404-redirect-data-sets/).

See also: the general procedure for extracting records for a large list of domains is described in more detail in the [notebook "Bulk URL Lookups By Table Joins"](https://github.com/commoncrawl/cc-notebooks/blob/main/cc-index-table/bulk-url-lookups-by-table-joins.ipynb).


## Selection of Top-K Websites

See the README in [data/top-k-sites](../../data/top-k-sites/README.md) for how the list is compiled from Tranco lists.

We now convert the list into [Parquet](https://parquet.apache.org/) file format...

In [1]:
import pandas as pd

df = pd.read_csv('../../data/top-k-sites/tranco/tranco_combined.txt.gz', sep='\t', names=['rank', 'host'])
df.head()

Unnamed: 0,rank,host
0,1,google.com
1,2,facebook.com
2,3,microsoft.com
3,4,youtube.com
4,5,akamaiedge.net


In [2]:
# rows and columns of the list
df.shape

(2042066, 2)

In [3]:
# add the registered (aka. pay-level) domain to top-k list
import tldextract

df['domain'] = df['host'].apply(lambda host: tldextract.extract(host).registered_domain)

In [4]:
df[df['domain'] != df['host']].head(10)

Unnamed: 0,rank,host,domain
10,11,www.google.com,google.com
15,16,data.microsoft.com,microsoft.com
25,26,ctldl.windowsupdate.com,windowsupdate.com
26,27,events.data.microsoft.com,microsoft.com
31,32,ftl.netflix.com,netflix.com
35,36,cloud.netflix.com,netflix.com
37,38,play.google.com,google.com
38,39,en.wikipedia.org,wikipedia.org
40,41,safebrowsing.googleapis.com,googleapis.com
42,43,prod.cloud.netflix.com,netflix.com


In [5]:
# save as Parquet
df.to_parquet('../../data/top-k-sites/tranco/tranco_combined.zstd.parquet', compression='zstd', index=False)

## Bulk Lookup by Table Join

Looking up all top-k websites in the [columnar index](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/) requires us to

- upload the Parquet domain list to S3 (`mybucket` is a placeholder for a bucket in `us-east-1`)
  ```bash
  aws s3 cp ../../data/top-k-sites/tranco/tranco_combined.zstd.parquet s3://mybucket/robotstxt-experiments/domain-top-k/
  ```
- register the domain list as a table in [Amazon Athena](https://aws.amazon.com/athena/)
  - navigate to the [Athena query editor](https://console.aws.amazon.com/athena/home?region=us-east-1#/query-editor) and
  - create a database "robotsexperiments" by executing the following statement:
    ```sql
    CREATE DATABASE robotsexperiments;
    ```
  - register the table "topdomains":
    ```sql
    CREATE EXTERNAL TABLE IF NOT EXISTS robotsexperiments.topdomains (
      `rank`   int,
      `host`   string,
      `domain` string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    WITH SERDEPROPERTIES (
      'serialization.format' = '1'
    ) LOCATION 's3://mybucket/robotstxt-experiments/domain-top-k/'
    TBLPROPERTIES ('has_encrypted_data'='false');
    ```
  - and verify whether the table is imported properly and contains the expected number of rows
    ```sql
    SELECT * FROM robotsexperiments.topdomains limit 10;
  
    SELECT COUNT(*) FROM robotsexperiments.topdomains;
    ```

Finally, the bulk lookup is done by a table join with [Common Crawl's columnar index](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/). In the query, we
- select only the most recent record per robots.txt URL (the crawler might fetch the same robots.txt repeatedly during a monthly crawl, which runs for almost two weeks)
  - Note: this might still lead to multiple URLs per host because of the URL scheme/protocol (`http://` and `https://`)
- extract the WARC record locations for later processing of the robots.txt records
- extract the MIME types (from the HTTP Content-Type header and as identified from the content)
- extract the fetch time and status
- and extract the redirect locations (available since CC-MAIN-2019-47) in order to "follow" redirects

```sql
WITH allrobots AS (
  SELECT topdomains.host as host,
         topdomains.domain as domain,
         topdomains.rank as rank,
         cc.url as orig_url, -- track original URL when following redirects
         cc.url_host_tld,
         cc.url_host_registered_domain,
         cc.url_host_name,
         cc.url,
         cc.url_protocol,
         cc.url_path,
         cc.url_query,
         cc.fetch_time,
         cc.fetch_status,
         cc.warc_filename,
         cc.warc_record_offset,
         cc.warc_record_length,
         cc.fetch_redirect,
         cc.content_mime_type,
         cc.content_mime_detected,
         -- enumerate records of same URL, most recent first
         ROW_NUMBER() OVER(PARTITION BY cc.url ORDER BY cc.fetch_time DESC) AS n
  FROM "ccindex"."ccindex" AS cc
    RIGHT OUTER JOIN "robotsexperiments"."topdomains" AS topdomains
    ON topdomains.host = cc.url_host_name
  WHERE cc.crawl = 'CC-MAIN-2025-05'
    AND cc.subset = 'robotstxt'
    AND cc.url_path = '/robots.txt'
    AND cc.url_query IS NULL)
SELECT *
 FROM allrobots
-- select only the first (most recent) record of the same URL
WHERE allrobots.n = 1;
```
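The effect of `ROW_NUMBER() ... WHERE n = 1` can be illustrated on a few in-memory records. The following is only a minimal sketch of the deduplication logic, not part of the actual pipeline; the function name `most_recent_per_url` and the sample records are made up for illustration.

```python
from operator import itemgetter


def most_recent_per_url(records):
    """Keep only the most recent capture per URL (the rows with n = 1 in the query)."""
    latest = {}
    for rec in records:
        seen = latest.get(rec["url"])
        # ISO 8601 timestamps in the same format and time zone sort chronologically as strings
        if seen is None or rec["fetch_time"] > seen["fetch_time"]:
            latest[rec["url"]] = rec
    return sorted(latest.values(), key=itemgetter("url"))


# hypothetical captures: the http:// URL was fetched twice during the crawl
captures = [
    {"url": "http://example.com/robots.txt", "fetch_time": "2025-01-13T08:00:00Z", "fetch_status": 301},
    {"url": "http://example.com/robots.txt", "fetch_time": "2025-01-20T09:30:00Z", "fetch_status": 200},
    {"url": "https://example.com/robots.txt", "fetch_time": "2025-01-15T10:00:00Z", "fetch_status": 200},
]
deduplicated = most_recent_per_url(captures)
```

As noted above, the host `example.com` still has two records after deduplication because the `http://` and `https://` URLs are distinct.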

The query extracts the robots.txt records for a single monthly crawl (CC-MAIN-2025-05). We repeat this
- for the crawls run in August or February of each year since August 2016 (one data point every six months)
- for all crawls of the years 2020 - 2025, to get the maximum coverage.

Based on this selection, we need to run the query on 48 crawls from August 2016 until February 2025. See [top-k-sample/crawls.txt](../../data/top-k-sample/crawls.txt) for the list of crawl identifiers.

Running the queries is done by a Python script [get_robotstxt_captures_athena.py](../script/get_robotstxt_captures_athena.py) based on [PyAthena](https://pypi.org/project/PyAthena/). The results are stored on S3 in Parquet format.
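Repeating the query per crawl amounts to substituting the crawl identifier into the SQL statement. The sketch below shows one way this could look; it is not the code of `get_robotstxt_captures_athena.py`. The names `build_query` and `read_crawl_ids`, the shortened stand-in query, and the assumption that `crawls.txt` holds one identifier per line are all hypothetical.

```python
import re

# crawl identifiers look like CC-MAIN-2025-05
CRAWL_ID = re.compile(r"CC-MAIN-\d{4}-\d{2}")

# shortened stand-in for the full query shown above
QUERY_TEMPLATE = """
SELECT COUNT(*) AS n_captures
FROM "ccindex"."ccindex" AS cc
WHERE cc.crawl = '{crawl}'
  AND cc.subset = 'robotstxt'
  AND cc.url_path = '/robots.txt'
  AND cc.url_query IS NULL
"""


def build_query(crawl):
    """Insert a validated crawl identifier into the query template."""
    if not CRAWL_ID.fullmatch(crawl):
        raise ValueError(f"not a crawl identifier: {crawl!r}")
    return QUERY_TEMPLATE.format(crawl=crawl)


def read_crawl_ids(lines):
    """Read crawl identifiers, one per line, skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]


crawls = read_crawl_ids(["CC-MAIN-2016-36\n", "\n", "CC-MAIN-2025-05\n"])
queries = {crawl: build_query(crawl) for crawl in crawls}
```

Validating the identifier before formatting keeps malformed input out of the SQL string; each resulting query would then be submitted to Athena by the client library in use.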

### Result Verification

We register the results for all crawls as an Athena table:
```sql
-- Note: the last three columns (from_*) are used when redirects are followed
CREATE EXTERNAL TABLE IF NOT EXISTS `robotsexperiments`.`domain_top_k_sample` (
  `host` string,
  `domain` string,
  `rank` int,
  `orig_url` string,
  `url_host_tld` string,
  `url_host_registered_domain` string,
  `url_host_name` string,
  `url` string,
  `url_protocol` string,
  `url_path` string,
  `url_query` string,
  `fetch_time` timestamp,
  `fetch_status` int,
  `warc_filename` string,
  `warc_record_offset` int,
  `warc_record_length` int,
  `fetch_redirect` string,
  `content_mime_type` string,
  `content_mime_detected` string,
  `from_url` string,
  `from_fetch_status` int,
  `from_to_is_same_host` boolean
)
PARTITIONED BY (`crawl` string, `redirects` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://mybucket/robotstxt-experiments/domain-top-k-sample/'
TBLPROPERTIES ('classification' = 'parquet');
```

After loading the partitions we can count the number of robots.txt captures found per crawl:
```sql
select
  count(*) as n_captures,
  sum(case when fetch_status = 200 then 1 else 0 end) as n_fetch_success,
  sum(case when fetch_redirect is not null then 1 else 0 end) as n_redirects,
  crawl
from domain_top_k_sample
where redirects = 0
group by crawl order by crawl;
```

In [6]:
import pandas as pd

df = pd.read_csv('../../data/top-k-sample/robotstxt-captures-counts.csv')
df

Unnamed: 0,n_captures,n_fetch_success,n_redirects,crawl
0,604440,361009,0,CC-MAIN-2016-36
1,717829,417576,0,CC-MAIN-2017-09
2,615175,296869,0,CC-MAIN-2017-34
3,807732,356548,0,CC-MAIN-2018-09
4,980026,401300,0,CC-MAIN-2018-34
5,1033590,428322,0,CC-MAIN-2019-09
6,864256,415165,0,CC-MAIN-2019-35
7,779791,401925,233976,CC-MAIN-2020-05
8,838228,418257,267003,CC-MAIN-2020-10
9,815695,414499,250201,CC-MAIN-2020-16


The top-k sites were selected in 2020 - 2024. For these years the number of robots.txt captures is close to 1 million, which corresponds to a coverage of roughly 50% of the about 2 million hosts in the list. For 2016 and 2017 the coverage is lower. This is not unexpected, since the list was compiled from later years.
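The coverage figure can be reproduced from the counts above with simple arithmetic. This is a rough sketch: it divides captures by the number of hosts in the list (2,042,066 rows, see `df.shape`), which is an upper bound on host coverage because one host can contribute both an `http://` and an `https://` capture. The helper name `coverage` is made up.

```python
N_HOSTS = 2_042_066  # rows in the top-k list

# n_captures for a few crawls, copied from the table above
captures_per_crawl = {
    "CC-MAIN-2016-36": 604_440,
    "CC-MAIN-2019-09": 1_033_590,
    "CC-MAIN-2020-05": 779_791,
}


def coverage(n_captures, n_hosts=N_HOSTS):
    """Share of list entries with a capture (upper bound, see the note above)."""
    return n_captures / n_hosts


for crawl, n_captures in captures_per_crawl.items():
    print(f"{crawl}: {coverage(n_captures):.1%}")
```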

There are several reasons why there is no robots.txt capture for a site (host name):
- the crawler didn't visit the site in this month. Consequently, there is also no robots.txt capture.
- the robots.txt fetch failed before an HTTP connection could be established. This usually means that no content at all was crawled from the given site.

In addition, the robots.txt is not always successfully fetched. The response can also be
- an HTTP 404 "Not Found": the site has no robots.txt
- any other non-success HTTP status code
- a redirect, which we need to follow, see below.
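The outcomes listed above can be turned into a small classifier over the `fetch_status` column. This is an illustrative sketch only: the category names and the function `classify_fetch` are made up, and the queries in this notebook detect redirects via `fetch_redirect is not null` rather than via the status code.

```python
from collections import Counter

REDIRECT_STATUS = {301, 302, 303, 307, 308}


def classify_fetch(fetch_status):
    """Map an HTTP status code to one of the outcome categories described above."""
    if fetch_status == 200:
        return "success"
    if fetch_status == 404:
        return "not_found"
    if fetch_status in REDIRECT_STATUS:
        return "redirect"
    return "other_failure"


# hypothetical status codes of a handful of captures
outcomes = Counter(classify_fetch(status) for status in [200, 200, 404, 301, 503, 308])
```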

The crawler follows redirects as specified per [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html#name-redirects). However, the redirect target is only archived if the following conditions are met:
- it's a text file (not HTML or anything else)
- the URL path is "/robots.txt" and the URL query is empty. If not, the redirect location is required to be
  - allowed by the robots.txt (maybe the robots.txt of a different site)
  - not excluded per URL filter

The strict robots.txt archiving policies are necessary to prevent an attacker from putting secret or sensitive content into the robots.txt archives by pointing a robots.txt redirect to a path on a different site.

Following the WARC standard, non-successful fetches are recorded as WARC records with HTTP header and (optional) payload. Since November 2019 (CC-MAIN-2019-47) the redirect target location is also stored in the URL index. This allows us to easily follow the redirects.

### Following Robots.txt Redirects in the URL Index Table

Following redirects based on the URL index table isn't a trivial task. It requires "recursively" looking up the redirect targets in the URL index table until all redirects are resolved or the maximum number of redirects to follow is reached. [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html#name-redirects) states that "crawlers SHOULD follow at least five consecutive redirects, even across authorities (for example, hosts in the case of HTTP)." However, RFC 9309 was published in September 2022 and it took some time to implement it completely in Common Crawl's crawler. Before Nov 2023 (CC-MAIN-2023-50) the crawler followed only one level of redirects.

The following points are important to consider:
- redirect locations might need to be transformed from a relative URL into an absolute one before the lookup.
- in several cases the redirect target might already have been retrieved because
  - it points to the `/robots.txt` on a different host
  - it points to the same host but uses a different protocol (`http://` vs. `https://`)
  - it was already observed as a target in a preceding iteration
- two or more source URLs can point to the same target
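The first point, resolving a relative redirect location against the source URL, can be done with the Python standard library. The sketch below returns values that correspond to the columns `to_url` and `from_to_is_same_host`; the function name `resolve_redirect` is made up, and comparing the bare host names is an assumption about how "same host" is defined.

```python
from urllib.parse import urljoin, urlsplit


def resolve_redirect(from_url, location):
    """Resolve a (possibly relative) redirect location against the source URL."""
    to_url = urljoin(from_url, location)
    # assumption: "same host" means identical host names, ignoring scheme and port
    same_host = urlsplit(from_url).hostname == urlsplit(to_url).hostname
    return to_url, same_host


relative = resolve_redirect("http://example.com/robots.txt", "/en/robots.txt")
protocol_switch = resolve_redirect("http://example.com/robots.txt", "https://example.com/robots.txt")
other_host = resolve_redirect("http://example.com/robots.txt", "//www.example.com/robots.txt")
```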

The Python script [get_robotstxt_captures_athena.py](../script/get_robotstxt_captures_athena.py) implements the logic to follow the redirects. It loops until the maximum redirect depth is reached. In each iteration
1. redirect targets are extracted from the results of the previous lookup (the initial one or the preceding redirect "step"), that is from a specific partition of the table `domain_top_k_sample`
2. the redirect targets are appended as a partition to a Parquet table on S3 named `redirects_to_follow`
3. after loading the newly added partitions, the redirect target URLs in `redirects_to_follow` are joined with the main URL index table
4. the results are appended as a partition to the result table `domain_top_k_sample`
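The loop can be sketched in plain Python with an in-memory dictionary standing in for the URL index table. This is a simplified illustration, not the logic of the actual script: there are no Athena tables or partitions, redirect locations are assumed to be absolute already, and targets seen before are skipped, which may differ from how the script treats duplicates. All names and sample URLs are hypothetical.

```python
MAX_REDIRECTS = 5


def follow_redirects(initial, index, max_redirects=MAX_REDIRECTS):
    """Return a dict mapping redirect depth to the captures found at that depth.

    initial -- captures of the first lookup (depth 0)
    index   -- dict mapping URL to capture, a stand-in for the URL index table
    """
    results = {0: list(initial)}
    seen = {rec["url"] for rec in initial}
    for depth in range(1, max_redirects + 1):
        # step 1: extract redirect targets from the previous step
        targets = []
        for rec in results[depth - 1]:
            to_url = rec.get("fetch_redirect")
            if to_url and to_url not in seen:
                seen.add(to_url)
                targets.append(to_url)
        if not targets:
            break
        # steps 2-4: look up the targets and store the results for this depth
        results[depth] = [index[url] for url in targets if url in index]
    return results


initial = [{"url": "http://a.example/robots.txt", "fetch_status": 301,
            "fetch_redirect": "https://a.example/robots.txt"}]
index = {
    "https://a.example/robots.txt": {
        "url": "https://a.example/robots.txt", "fetch_status": 301,
        "fetch_redirect": "https://www.a.example/robots.txt"},
    "https://www.a.example/robots.txt": {
        "url": "https://www.a.example/robots.txt", "fetch_status": 200,
        "fetch_redirect": None},
}
results = follow_redirects(initial, index)
```

In this example the chain is resolved at depth 2, so the loop stops early instead of running through all five iterations.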

The table to hold the redirect targets to be looked up is defined by:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS `robotsexperiments`.`redirects_to_follow` (
  `host` string,
  `domain` string,
  `rank` int,
  `orig_url` string,
  `from_url` string,
  `from_fetch_status` int,
  `from_to_is_same_host` boolean,
  `to_url` string)
PARTITIONED BY (`crawl` string, `redirects` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://mybucket/robotstxt-experiments/domain-top-k-sample.redirects_to_follow/'
TBLPROPERTIES ('classification' = 'parquet');
```

Once the table that temporarily holds the redirect locations is created, we run the Python script to do the lookups in the columnar index:
```sh
python ./src/script/get_robotstxt_captures_athena.py \
  s3://mybucket/robotstxt-experiments/domain-top-k-sample/ \
  s3://mybucket/robotstxt-experiments/domain-top-k-sample.redirects_to_follow/ \
  s3://mybucket/robotstxt-experiments/tmp/ \
  CC-MAIN-2025-05
```

We repeat this for all crawls we want to look into. Once done, we can query the result table for how many redirects were found:
```sql
select
  count(*) as n_captures,
  sum(case when fetch_status = 200 then 1 else 0 end) as n_fetch_success,
  sum(case when fetch_redirect is not null then 1 else 0 end) as n_redirects,
  crawl,
  redirects
from domain_top_k_sample
group by crawl, redirects order by crawl, redirects;
```

In [7]:
import pandas as pd

df = pd.read_csv('../../data/top-k-sample/robotstxt-captures-counts-with-redirects.csv')
df[df['crawl'] == 'CC-MAIN-2025-05']

Unnamed: 0,n_captures,n_fetch_success,n_redirects,crawl,redirects
241,925873,433646,287704,CC-MAIN-2025-05,0
242,795734,501956,158621,CC-MAIN-2025-05,1
243,289261,184066,63211,CC-MAIN-2025-05,2
244,94603,49462,31781,CC-MAIN-2025-05,3
245,47581,19975,21619,CC-MAIN-2025-05,4
246,32803,11423,17681,CC-MAIN-2025-05,5


At the first level of redirection (depth=1), about 796k additional robots.txt captures are found for CC-MAIN-2025-05, 502k of them successfully fetched. But this number drops with every iteration. After five redirects, fewer than 18k redirect targets are left, and these are not followed. Note that these numbers include duplicates because two or more sources can link to the same redirect target.

In [8]:
# comparison of crawls before/after the crawler changed from following 1 to 5 levels of redirects
df[df['crawl'].isin(['CC-MAIN-2023-40', 'CC-MAIN-2023-50'])]

Unnamed: 0,n_captures,n_fetch_success,n_redirects,crawl,redirects
169,891105,423845,276311,CC-MAIN-2023-40,0
170,725382,475230,141054,CC-MAIN-2023-40,1
171,230908,156658,44946,CC-MAIN-2023-40,2
172,63415,36972,18947,CC-MAIN-2023-40,3
173,28077,14373,10584,CC-MAIN-2023-40,4
174,17264,7332,8041,CC-MAIN-2023-40,5
175,1018455,475464,334009,CC-MAIN-2023-50,0
176,797105,514012,162815,CC-MAIN-2023-50,1
177,258153,172028,52807,CC-MAIN-2023-50,2
178,72952,41324,22441,CC-MAIN-2023-50,3


### Extract Lean Table of Robots.txt Captures

To make it easier to generate metrics (in a separate notebook), we also extract a lean table of robots.txt captures and store it locally:

```sh
cat data/top-k-sample/crawls.txt \
  | xargs python src/script/get_robotstxt_ranked_list.py \
     s3://mybucket/robotstxt-experiments/domain-top-k-sample/ \
     data/top-k-sample/captures/
```

## Fetching Robots.txt Captures

First, we extract the lists of "real" robots.txt captures from the index lookup results. The script [get_robotstxt_download_list.py](../script/get_robotstxt_download_list.py) reads the result table from S3 crawl by crawl and writes CSV files which contain the URL and the WARC record location only for
- successfully fetched robots.txt captures
- excluding MIME types which cannot be a robots.txt capture (e.g., `text/html`)
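The two selection criteria can be expressed as a simple filter over the lookup results. The sketch below works on an inline CSV sample and is not the code of `get_robotstxt_download_list.py`: the set of excluded MIME types, the use of `content_mime_detected` rather than `content_mime_type`, the function name `select_downloads` and the sample rows (including the WARC file names) are assumptions made for illustration.

```python
import csv
import io

# assumption: the real script may exclude a different set of MIME types
EXCLUDED_MIME_TYPES = {"text/html", "application/xhtml+xml"}
OUTPUT_COLUMNS = ("url", "warc_filename", "warc_record_offset", "warc_record_length")

SAMPLE = """\
url,fetch_status,content_mime_detected,warc_filename,warc_record_offset,warc_record_length
https://example.com/robots.txt,200,text/plain,example-00000.warc.gz,1000,512
https://example.org/robots.txt,200,text/html,example-00000.warc.gz,2000,4096
https://example.net/robots.txt,404,text/html,example-00001.warc.gz,300,700
"""


def select_downloads(rows):
    """Yield URL and WARC record location of captures worth downloading."""
    for row in rows:
        if int(row["fetch_status"]) != 200:
            continue  # not successfully fetched
        if row["content_mime_detected"] in EXCLUDED_MIME_TYPES:
            continue  # cannot be a robots.txt file
        yield {column: row[column] for column in OUTPUT_COLUMNS}


downloads = list(select_downloads(csv.DictReader(io.StringIO(SAMPLE))))
```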

We call the script with input and output location, followed by crawl identifier(s):
```sh
python ./src/script/get_robotstxt_download_list.py \
  s3://mybucket/robotstxt-experiments/domain-top-k-sample/ \
  data/top-k-sample-warc-records/input/ \
  CC-MAIN-2025-05
```

Second, we fetch the WARC records given the record locations. This is done by a tool (Java class) from [cc-index-table](https://github.com/commoncrawl/cc-index-table#export-subsets-of-the-common-crawl-archives). Here is a sample call:
```sh
crawl="CC-MAIN-2025-05"
APPJAR=.../cc-index-table/target/cc-index-table-0.3-SNAPSHOT-jar-with-dependencies.jar
$SPARK_HOME/bin/spark-submit \
   --conf spark.log.level=INFO \
   --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \
   --csv file:$PWD/data/top-k-sample-warc-records/input/crawl=$crawl/robotstxt-captures-$crawl.csv \
           --numOutputPartitions 1 \
           --numRecordsPerWarcFile 125000 \
           --warcPrefix robotstxt-captures-$crawl \
           --warcDescription "Robots.txt captures of top-k domains from $crawl" \
           "" \
           file:$PWD/data/top-k-sample-warc-records/warc/crawl=$crawl/
```

It's a [Spark](https://spark.apache.org/) job, run locally. If necessary it can run in a cluster to scale up.
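For spot checks of individual records, a single WARC record can also be fetched without Spark by an HTTP range request built from `warc_filename`, `warc_record_offset` and `warc_record_length`. The following sketch is not part of this project's tooling. It assumes the public download endpoint `https://data.commoncrawl.org/` and that each record is stored as a separate gzip member; the function names are made up.

```python
import gzip
from urllib.request import Request, urlopen

BASE_URL = "https://data.commoncrawl.org/"


def warc_record_request(warc_filename, offset, length, base_url=BASE_URL):
    """Build URL and Range header addressing exactly one WARC record."""
    url = base_url + warc_filename
    # byte ranges are inclusive on both ends
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers


def fetch_warc_record(warc_filename, offset, length):
    """Download and decompress a single WARC record (requires network access)."""
    url, headers = warc_record_request(warc_filename, offset, length)
    with urlopen(Request(url, headers=headers)) as response:
        return gzip.decompress(response.read())


# hypothetical record location, only the request is built here
url, headers = warc_record_request("crawl-data/example.warc.gz", 100, 50)
```

For the hundreds of thousands of records per crawl handled here, the Spark job remains the better fit.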

## Parsing Robots.txt Captures

Parsing the robots.txt captures downloaded in the previous step is done by the script [robotstxt_statistics.py](../cc-pyspark/robotstxt_statistics.py) based on [cc-pyspark](https://github.com/commoncrawl/cc-pyspark). As a precondition, you need to copy `sparkcc.py` from `cc-pyspark` into this project folder.

```sh
crawl="CC-MAIN-2025-05"
# write the list of input WARC files
ls $PWD/data/top-k-sample-warc-records/warc/crawl=$crawl/*.warc.gz \
  | sed 's/^/file:/' >data/top-k-sample-cc-pyspark/input/$crawl.txt
$SPARK_HOME/bin/spark-submit \
  --num-executors 1 --executor-cores 1 \
  --conf spark.sql.warehouse.dir=data/top-k-sample-cc-pyspark/tmp \
  ./src/cc-pyspark/robotstxt_statistics.py \
  --num_input_partitions 1 \
  --num_output_partitions 1 \
  --output_format json \
  --output_compression gzip \
  --log_level INFO \
  --extract_rulesets \
  data/top-k-sample-cc-pyspark/input/$crawl.txt \
  robotstxt_statistics
# move the data in place (cc-pyspark cannot write Hive-partitioned data)
# and split the job output into two parts - the counts and the rulesets
mv -v data/top-k-sample-cc-pyspark/tmp/robotstxt_statistics \
  data/top-k-sample-cc-pyspark/robotstxt_statistics/crawl=$crawl/
zcat data/top-k-sample-cc-pyspark/robotstxt_statistics/crawl=$crawl/*.json.gz \
  | jq -r '[.key.directive, .key.value, .cnt] | join("\t")' \
  | grep -a '^(ruleset)' \
  | cut -f2 | zstd -19 >data/top-k-sample/rulesets/$crawl-rulesets.jsonl.zst
mkdir -p data/top-k-sample/counts/crawl=$crawl
zcat data/top-k-sample-cc-pyspark/robotstxt_statistics/crawl=$crawl/*.json.gz \
  | jq -r '[.key.directive, .key.value, .cnt] | join("\t")' \
  | grep -va '^(ruleset)' \
  | zstd -19 \
  > data/top-k-sample/counts/crawl=$crawl/$crawl.txt.zst
```
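The `jq`/`grep` split at the end of the script can be mirrored in Python, which may be handy when inspecting the job output interactively. This sketch assumes the record layout implied by the `jq` expression above (`key.directive`, `key.value`, `cnt`); the function name `split_statistics` and the sample records, including the directive names, are made up.

```python
import json


def split_statistics(json_lines):
    """Separate ruleset records from directive counts."""
    rulesets, counts = [], []
    for line in json_lines:
        record = json.loads(line)
        key = record["key"]
        # same prefix test as grep '^(ruleset)' on the tab-separated line
        if key["directive"].startswith("(ruleset)"):
            rulesets.append(key["value"])
        else:
            counts.append((key["directive"], key["value"], record["cnt"]))
    return rulesets, counts


# hypothetical job output, one JSON record per line
sample = [
    json.dumps({"key": {"directive": "(ruleset)", "value": "<serialized ruleset>"}, "cnt": 3}),
    json.dumps({"key": {"directive": "user-agent", "value": "*"}, "cnt": 120}),
    json.dumps({"key": {"directive": "disallow", "value": "/"}, "cnt": 45}),
]
rulesets, counts = split_statistics(sample)
```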