The Creative Commons Catalog API allows programmatic access to search for CC-licensed and public domain digital media.

Creative Commons Catalog API


Purpose

The Creative Commons Catalog API ('cccatalog-api') is a system that allows programmatic access to CC-licensed and public domain digital media. Our ambition is to index and catalog billions of Creative Commons works, including articles, songs, videos, photographs, paintings, and more. Using this API, developers can access the digital commons in their own applications.

This repository is primarily concerned with back end infrastructure like datastores, servers, and APIs. The pipeline that feeds data into this system can be found in the cccatalog repository. A front end web application that interfaces with the API can be found at the cccatalog-frontend repository.

Project Status

The API is still in semantic version 0.*.*, meaning it can change without notice. Contact us if you are interested in using this API in production. No SLAs or warranties are provided to anonymous consumers of the API.

API Documentation

Browsable API documentation can be found here.

Running the server locally

Ensure that you have installed Docker and that the Docker daemon is running.

git clone https://github.com/creativecommons/cccatalog-api.git
cd cccatalog-api
docker-compose up

After executing docker-compose up, you will be running:

  • A Django API server
  • Two PostgreSQL instances (one simulates the upstream data source, the other serves as the application database)
  • Elasticsearch
  • Redis
  • Ingestion Server, a microservice for bulk ingesting and indexing search data.

Once everything has initialized, with docker-compose still running in the background, load the sample data. You will need to install PostgreSQL client tools to perform this step. On Debian, the package is called postgresql-client-common.

./load_sample_data.sh

You are now ready to start sending requests to the API server. To make sure it is working, send a search request:

curl "localhost:8000/image/search?q=honey"
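The same smoke test can be scripted. The sketch below uses only the Python standard library to build the search URL from the curl example and fetch it. The helper names `build_search_url` and `search_images` are hypothetical, and the response body is assumed to be JSON, which this README does not state.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the API server listens on localhost:8000, as in the curl example.
BASE_URL = "http://localhost:8000"


def build_search_url(query, base_url=BASE_URL):
    """Return the image search URL for a query string (hypothetical helper)."""
    return f"{base_url}/image/search?" + urllib.parse.urlencode({"q": query})


def search_images(query, base_url=BASE_URL, timeout=10):
    """Send the search request and decode the body, assumed here to be JSON."""
    url = build_search_url(query, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the docker-compose stack to be running with sample data loaded.
    print(search_images("honey"))
```

Building the URL with `urlencode` keeps queries containing spaces or punctuation correctly escaped, which a hand-written curl command does not do.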

Diagnosing local Elasticsearch issues

If the API server container failed to start, there's a good chance that Elasticsearch failed to start on your machine. Ensure that you have allocated enough memory to Docker applications; otherwise the container will exit immediately with an error. If the logs mention "insufficient max map count", increase the maximum number of memory map areas a process may have (the vm.max_map_count kernel setting). On most Linux machines, you can fix this by adding the following line to /etc/sysctl.conf:

vm.max_map_count=262144

To make this setting take effect, run:

sudo sysctl -p
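To confirm that the setting took effect, you can read the current value back from procfs. This is a minimal sketch for Linux only; `read_max_map_count` and `is_sufficient` are hypothetical helper names, and the threshold is simply the value given above.

```python
from pathlib import Path

# Value this README recommends for vm.max_map_count.
REQUIRED_MAX_MAP_COUNT = 262144
# Standard Linux procfs location of the setting; not available on other systems.
PROC_PATH = Path("/proc/sys/vm/max_map_count")


def read_max_map_count(path=PROC_PATH):
    """Read the current vm.max_map_count value as an integer."""
    return int(path.read_text().strip())


def is_sufficient(current, required=REQUIRED_MAX_MAP_COUNT):
    """Return True when the current limit meets the recommended minimum."""
    return current >= required


if __name__ == "__main__":
    current = read_max_map_count()
    status = "OK" if is_sufficient(current) else "too low"
    print(f"vm.max_map_count = {current} ({status})")
```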

System Architecture

[Figure: system architecture diagram]

Basic flow of data

Search data is ingested from upstream sources provided by the data pipeline. At the time of writing, this includes data from Common Crawl and multiple third-party APIs. Once the data has been scraped and cleaned, it is transferred to the upstream database, indicating that it is ready for production use.

Every week, the Ingestion Server automatically bulk copies ("ingests") the latest version of the data from the upstream database to the production database. Once the data has been downloaded and indexed inside the database, it is indexed in Elasticsearch, at which point the CC Catalog API servers can serve the new data.
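The ordering of these two steps can be illustrated with a toy in-memory model. This is not the Ingestion Server's code: plain dictionaries stand in for PostgreSQL and Elasticsearch, and the `id` and `title` fields, along with the `ingest` and `search` functions, are hypothetical names chosen for the sketch.

```python
def ingest(upstream_rows, production_db, search_index):
    """Copy rows into the production store, then index their titles."""
    # Step 1: bulk copy from the upstream source into the production store.
    for row in upstream_rows:
        production_db[row["id"]] = dict(row)
    # Step 2: index after the copy completes, matching the order described above.
    for row_id, row in production_db.items():
        for word in row["title"].lower().split():
            search_index.setdefault(word, set()).add(row_id)


def search(query, production_db, search_index):
    """Look up a single word in the index and return the matching rows."""
    ids = search_index.get(query.lower(), set())
    return [production_db[i] for i in sorted(ids)]


if __name__ == "__main__":
    upstream = [
        {"id": 1, "title": "Honey bee"},
        {"id": 2, "title": "Wild honey"},
        {"id": 3, "title": "Mountain lake"},
    ]
    db, index = {}, {}
    ingest(upstream, db, index)
    print(search("honey", db, index))
```

In this model a record only becomes searchable once it exists in the production store, which mirrors the point at which new data can be served.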

Description of subprojects

  • cccatalog-api is a Django REST Framework API server. For a full description of its capabilities, please see the browsable documentation.
  • ingestion_server is a RESTful microservice for downloading and indexing search data once it has been prepared by the CC Catalog.
  • ccbot is a slightly customized fork of Scrapy Cluster. The original intent was to find all of the dead links in our database, but it can easily be modified to perform other useful tasks, such as mass downloading images or scraping new content into the CC Catalog. It is not used in production at this time and is included in the repository for historical reasons.

Running the tests

Running API live integration tests

You can check the health of a live deployment of the API by running the live integration tests.

cd cccatalog-api
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
cd test
./run_test.sh

Running the Ingestion Server test

This end-to-end test ingests and indexes some dummy data using the Ingestion Server API.

cd ingestion_server
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
python3 test/integration_tests.py

Deploying and monitoring the API

The API infrastructure is orchestrated using Terraform hosted in creativecommons/ccsearch-infrastructure. More details can be found on this wiki page.

Contributing

Pull requests are welcome! Feel free to join us on Slack and discuss the project with the engineers on #cc-developers.
