
Creative Commons Catalog

Mapping the commons towards an open ledger and CC Search.


This repository contains the methods used to identify over 1.4 billion Creative Commons-licensed works. The challenge is that these works are dispersed throughout the web, and identifying them requires a combination of techniques. Two approaches are currently being explored:

  1. Web crawl data
  2. Application Programming Interfaces (APIs)

Web Crawl Data

The Common Crawl Foundation provides a repository of free web crawl data that is hosted on Amazon Public Datasets. At the beginning of each month, we process the most recent crawl metadata to identify all domains that link to Creative Commons. Due to the volume of data, Apache Spark is used to streamline the processing.
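At its core, that monthly pass filters outbound links for Creative Commons license URLs. The sketch below shows what such a matching step could look like in plain Python; the regular expression and the `parse_cc_link` helper are illustrative assumptions, not the repository's actual code, and the real job runs the equivalent logic under Spark:

```python
import re

# Matches CC license deed URLs such as
# https://creativecommons.org/licenses/by-sa/4.0/
# (pattern is an assumption for illustration, not the project's real one)
CC_LICENSE_RE = re.compile(
    r"https?://creativecommons\.org/"
    r"(licenses|publicdomain)/([a-z\-]+)/(\d\.\d)",
    re.IGNORECASE,
)

def parse_cc_link(url):
    """Return (license, version) if url is a CC license link, else None."""
    match = CC_LICENSE_RE.search(url)
    if match is None:
        return None
    return (match.group(2).lower(), match.group(3))

print(parse_cc_link("https://creativecommons.org/licenses/by-sa/4.0/"))
```

In the full pipeline, a function like this would be applied to every outbound link extracted from the crawl's link metadata, keeping only the domains whose pages link to a CC license deed.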

Application Programming Interfaces (APIs)

Getting Started


Prerequisites:

  - JDK 9.0.1
  - Python 2.7
  - Spark 2.2.0

Install the Python dependencies:

pip install -r requirements.txt

Running the tests

Run the unit test suite from the repository root:

python -m unittest discover -v


See the list of contributors who participated in this project.


This project is licensed under the MIT License; see the LICENSE file for details.
