
Creative Commons Catalog

Mapping the commons towards an open ledger and cc search.


This repository contains the methods used to identify over 1.4 billion Creative Commons-licensed works. These works are dispersed throughout the web, so identifying them requires a combination of techniques. Two approaches are currently explored:

  1. Web crawl data
  2. Application Programming Interfaces (APIs)

Web Crawl Data

The Common Crawl Foundation provides an open repository of petabyte-scale web crawl data. A new dataset, comprising over 200 TiB of uncompressed data, is published at the end of each month.

The data is available in three formats:

  • WARC: the entire raw data, including HTTP response metadata, WARC metadata, etc.
  • WET: extracted plaintext from each webpage.
  • WAT: extracted HTML metadata, e.g. HTTP headers and hyperlinks.
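
The three formats are published side by side for each crawl, so the path of a WAT or WET file can usually be derived from the path of its WARC file. The sketch below assumes the usual Common Crawl layout, in which the `/warc/` directory becomes `/wat/` or `/wet/` and the `.warc.gz` suffix becomes `.warc.wat.gz` or `.warc.wet.gz`. The function name `derive_path` and the sample path are illustrative and not part of this repository.

```python
def derive_path(warc_path: str, kind: str) -> str:
    """Map a Common Crawl WARC path to its WAT or WET counterpart.

    Assumes the conventional layout:
      .../warc/<name>.warc.gz -> .../wat/<name>.warc.wat.gz
      .../warc/<name>.warc.gz -> .../wet/<name>.warc.wet.gz
    """
    if kind not in ("wat", "wet"):
        raise ValueError("kind must be 'wat' or 'wet'")
    if "/warc/" not in warc_path or not warc_path.endswith(".warc.gz"):
        raise ValueError("not a recognizable WARC path: " + warc_path)

    # Split at the last "/warc/" directory and drop the ".warc.gz" suffix.
    head, _, filename = warc_path.rpartition("/warc/")
    stem = filename[: -len(".warc.gz")]
    return f"{head}/{kind}/{stem}.warc.{kind}.gz"


# Illustrative path only; real segment and file names differ.
warc = "crawl-data/CC-MAIN-2019-35/segments/0000000000000.0/warc/CC-MAIN-example-00000.warc.gz"
wat = derive_path(warc, "wat")
wet = derive_path(warc, "wet")
```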

CC Catalog uses the AWS Data Pipeline service to automatically create an EMR cluster of 100 c4.8xlarge instances that parse the WAT archives to identify all domains that link to Creative Commons. Due to the volume of data, Apache Spark is used to streamline the processing. The output of this methodology is a series of parquet files that contain:

  • each domain with its respective content path and query string (i.e. the exact webpage that links to Creative Commons),
  • the CC referenced hyperlink (which may indicate a license),
  • HTML metadata in JSON format, which indicates the number of images on each webpage and the other domains that it references,
  • the location of the webpage in the WARC file so that the page contents can be found.
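
As a rough illustration of how one WAT record could be turned into the fields listed above, here is a minimal, standard-library-only sketch. It is not the repository's Spark job. It assumes the usual WAT JSON layout (`Envelope`, `Payload-Metadata`, `HTML-Metadata`, `Links`, and a `Container` block with the WARC filename and offset), assumes that `creativecommons.org` is the domain of interest, and uses hypothetical output field names.

```python
import json
from urllib.parse import urlparse

CC_DOMAIN = "creativecommons.org"  # assumption: the link target of interest


def summarize_wat_record(record_json: str) -> list:
    """Return one row per CC link found in a single WAT record (sketch)."""
    record = json.loads(record_json)
    envelope = record.get("Envelope", {})
    target = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    links = (envelope.get("Payload-Metadata", {})
             .get("HTTP-Response-Metadata", {})
             .get("HTML-Metadata", {})
             .get("Links", []))
    page = urlparse(target)

    cc_links, images, other_domains = [], 0, set()
    for link in links:
        url = link.get("url", "")
        host = urlparse(url).hostname or ""  # relative URLs have no host
        if link.get("path", "").startswith("IMG@"):
            images += 1
        if host == CC_DOMAIN or host.endswith("." + CC_DOMAIN):
            cc_links.append(url)
        elif host and host != page.hostname:
            other_domains.add(host)

    container = record.get("Container", {})
    # Hypothetical key names for the per-page metadata.
    meta = json.dumps({"Images": images, "Other_Domains": sorted(other_domains)})
    return [{
        "provider_domain": page.hostname,
        "content_path": page.path,
        "content_query_string": page.query,
        "cc_link": cc_link,
        "html_meta": meta,
        "warc_filename": container.get("Filename"),
        "warc_offset": int(container.get("Offset", 0)),
    } for cc_link in cc_links]


# Hand-built sample record; all values are made up for illustration.
sample = json.dumps({
    "Container": {"Filename": "example.warc.gz", "Offset": "1024"},
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "https://example.com/photos/1?size=large"},
        "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {"Links": [
            {"path": "A@/href", "url": "https://creativecommons.org/licenses/by/4.0/"},
            {"path": "IMG@/src", "url": "https://cdn.example.net/a.jpg"},
            {"path": "A@/href", "url": "/about"},
        ]}}},
    },
})
rows = summarize_wat_record(sample)
```

In a Spark job, a function of this shape would typically be applied to each record of each WAT file and the resulting rows written out as parquet; that wiring is omitted here.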

The steps above are performed in

Application Programming Interfaces (APIs)

Apache Airflow is used to manage the workflow for the various API ETL jobs. There are three workflows: 1) Daily_ETL_Workflow, 2) Monthly_Workflow and 3) DB_Loader.


Daily_ETL_Workflow

This workflow manages the daily ETL jobs for the following platforms:


Monthly_Workflow

This workflow manages the monthly jobs, which are scheduled to run on the 15th day of each month at 16:00 UTC. It is reserved for long-running jobs and for APIs that lack date filtering capabilities, whose data is reprocessed monthly to keep the catalog updated. The following tasks are performed:


DB_Loader

This workflow is scheduled to load data into the upstream database every four hours. It includes data preprocessing steps.
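
To make the monthly and four-hourly schedules concrete, the sketch below writes them as cron expressions and computes the next wall-clock firing time with the standard library. The cron strings and function names are assumptions for illustration, not taken from the repository's DAG files. The four-hour runs are assumed to be aligned to midnight UTC, and Airflow's own scheduling semantics may differ from plain wall-clock firing.

```python
from datetime import datetime, timedelta, timezone

# Assumed cron equivalents of the schedules described above.
MONTHLY_CRON = "0 16 15 * *"   # 16:00 UTC on the 15th of each month
LOADER_CRON = "0 */4 * * *"    # every four hours, on the hour


def next_monthly_run(now: datetime) -> datetime:
    """Next 15th-of-month 16:00 UTC strictly after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(day=15, hour=16, minute=0, second=0, microsecond=0)
    if candidate <= now:
        # Roll over to the following month, wrapping December to January.
        if now.month == 12:
            candidate = candidate.replace(year=now.year + 1, month=1)
        else:
            candidate = candidate.replace(month=now.month + 1)
    return candidate


def next_loader_run(now: datetime) -> datetime:
    """Next four-hour boundary (00, 04, ..., 20 UTC) strictly after `now`."""
    now = now.astimezone(timezone.utc)
    floor = now.replace(hour=now.hour - now.hour % 4,
                        minute=0, second=0, microsecond=0)
    return floor + timedelta(hours=4)


reference = datetime(2019, 8, 16, 21, 30, tzinfo=timezone.utc)
monthly = next_monthly_run(reference)
loader = next_loader_run(reference)
```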

Other API Jobs (not in the workflow)

Getting Started


JDK 9.0.1
Python 3.6
Pytest 4.3.1
Spark 2.2.1
Airflow 1.10.4

pip install -r requirements.txt

Running the tests

python -m pytest tests/


See the list of contributors who participated in this project.


This project is licensed under the MIT license - see the LICENSE file for details.
