Creative Commons Catalog

Mapping the commons towards an open ledger and cc search.

Description

This repository contains the methods used to identify over 1.4 billion Creative Commons-licensed works. The challenge is that these works are dispersed throughout the web, and identifying them requires a combination of techniques. Two approaches are currently being explored:

  1. Web crawl data
  2. Application Programming Interfaces (APIs)

Web Crawl Data

The Common Crawl Foundation provides an open repository of petabyte-scale web crawl data. A new dataset comprising over 200 TiB of uncompressed data is published at the end of each month.

The data is available in three formats:

  • WARC: the entire raw data, including HTTP response metadata, WARC metadata, etc.
  • WET: extracted plaintext from each webpage.
  • WAT: extracted HTML metadata, e.g. HTTP headers, hyperlinks, etc.
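
As a concrete illustration, the WAT archives for a given monthly crawl can be enumerated from Common Crawl's published path listing before any parsing begins. The sketch below is a minimal example and is not part of this repository; the crawl identifier and S3 URL pattern are assumptions based on Common Crawl's public hosting and may need adjusting.

import gzip
import io

import requests

# Example only: pick the monthly crawl you want to process.
CRAWL_ID = "CC-MAIN-2019-35"
PATHS_URL = f"https://commoncrawl.s3.amazonaws.com/crawl-data/{CRAWL_ID}/wat.paths.gz"

def list_wat_files(paths_url=PATHS_URL):
    """Return the S3 keys of every WAT archive in the chosen monthly crawl."""
    response = requests.get(paths_url, timeout=60)
    response.raise_for_status()
    with gzip.open(io.BytesIO(response.content), "rt") as listing:
        return [line.strip() for line in listing if line.strip()]

if __name__ == "__main__":
    wat_files = list_wat_files()
    print(f"{len(wat_files)} WAT archives in {CRAWL_ID}")
    print(wat_files[0])  # e.g. crawl-data/CC-MAIN-2019-35/segments/.../*.warc.wat.gz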

CC Catalog uses the AWS Data Pipeline service to automatically create an Amazon EMR cluster of 100 c4.8xlarge instances that parses the WAT archives to identify all domains that link to creativecommons.org. Due to the volume of data, Apache Spark is used to streamline the processing. The output of this methodology is a series of Parquet files that contain:

  • each domain and its respective content path and query string (i.e. the exact webpage that links to creativecommons.org),
  • the referenced CC hyperlink (which may indicate a license),
  • HTML metadata in JSON format, indicating the number of images on each webpage and the other domains it references,
  • the location of the webpage in the WARC file so that the page contents can be found.

The steps above are performed in ExtractCCLinks.py.
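
For orientation, here is a heavily simplified sketch of that kind of Spark job. It is not the project's implementation (see ExtractCCLinks.py for that); it assumes each input line is already the JSON metadata payload of a single WAT record, and the input/output paths and column names are illustrative.

import json
from urllib.parse import urlparse

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("cc-links-sketch").getOrCreate()

def cc_links(payload):
    """Yield one row per hyperlink on the page that points at creativecommons.org."""
    try:
        envelope = json.loads(payload)["Envelope"]
        page_url = envelope["WARC-Header-Metadata"]["WARC-Target-URI"]
        html_meta = envelope["Payload-Metadata"]["HTTP-Response-Metadata"]["HTML-Metadata"]
        links = html_meta.get("Links", [])
    except (ValueError, KeyError):
        return  # skip malformed or non-HTML records
    parsed = urlparse(page_url)
    for link in links:
        href = link.get("url", "")
        if "creativecommons.org" in href:
            yield Row(
                provider_domain=parsed.netloc,
                content_path=parsed.path,
                content_query_string=parsed.query,
                cc_link=href,
            )

# Hypothetical locations: pre-extracted WAT JSON payloads in, Parquet out.
wat_payloads = spark.sparkContext.textFile("s3://example-bucket/wat-json-payloads/")
links_df = wat_payloads.flatMap(cc_links).toDF()
links_df.write.parquet("s3://example-bucket/cc-links/")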

Application Programming Interfaces (APIs)

Apache Airflow is used to manage the workflows for the various API ETL jobs. There are three workflows: 1) Daily_ETL_Workflow, 2) Monthly_Workflow, and 3) DB_Loader.

Daily_ETL_Workflow

This manages the daily ETL jobs for the following platforms:

Monthly_Workflow

Manages the monthly jobs that are scheduled to run on the 15th day of each month at 16:00 UTC. This workflow is reserved for long-running jobs or APIs that do not have date-filtering capabilities, so the data is reprocessed monthly to keep the catalog up to date. The following tasks are performed:

DB_Loader

Scheduled to load data into the upstream database every four hours. It includes data preprocessing steps.
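
To make the scheduling concrete, the sketch below shows how the two schedules described above map onto Airflow 1.10-style DAG definitions. The DAG ids and placeholder tasks are illustrative; they are not the project's actual DAG files.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {"owner": "cc-catalog", "start_date": datetime(2019, 8, 1)}

# Monthly_Workflow: 16:00 UTC on the 15th day of each month (cron syntax).
monthly_dag = DAG(
    dag_id="example_monthly_workflow",  # illustrative name
    default_args=default_args,
    schedule_interval="0 16 15 * *",
    catchup=False,
)

# DB_Loader: load data into the upstream database every four hours.
loader_dag = DAG(
    dag_id="example_db_loader",  # illustrative name
    default_args=default_args,
    schedule_interval=timedelta(hours=4),
    catchup=False,
)

# Placeholder tasks; the real workflows run the platform-specific ETL steps.
DummyOperator(task_id="run_monthly_jobs", dag=monthly_dag)
DummyOperator(task_id="load_upstream_db", dag=loader_dag)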

Other API Jobs (not in the workflow)

Getting Started

Prerequisites

JDK 9.0.1
Python 3.6
Pytest 4.3.1
Spark 2.2.1
Airflow 1.10.4

pip install -r requirements.txt

Running the tests

python -m pytest tests/test_ExtractCCLinks.py

Authors

See the list of contributors who participated in this project.

License

This project is licensed under the MIT license - see the LICENSE file for details.
