Backend of Common Search. Analyses webpages and sends them to the index.

cosr-back

Chat with us on Slack | Build Status | Coverage Status | Apache License 2.0

This repository contains the main components of the Common Search backend.

Your help is welcome! We have a complete guide on how to contribute.

Understand the project

This repository has 4 components:

  • cosrlib: Python code for parsing, analyzing and indexing documents
  • spark: Spark jobs using cosrlib
  • urlserver: A service for getting metadata about URLs from static databases
  • explainer: A web service for explaining and debugging results, hosted at explain.commonsearch.org

Here is how they fit in our general architecture:

[Figure: General technical architecture of Common Search]

Local install

A complete guide is available in INSTALL.md.

Launching the tests

See tests/README.md.

Using plugins

Common Search supports user-provided plugins in its processing pipeline. Some are included by default. For instance, the following commands run the bundled plugins.grep.Words plugin against a single URL:

make docker_shell
spark-submit spark/jobs/pipeline.py --source url:https://about.commonsearch.org/ --plugin plugins.grep.Words:words="common search",output=/tmp/grep_result

See the plugins/ directory for more examples and Analyzing the web with Spark for a complete tutorial.
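The --plugin argument above appears to follow the shape module.Class:key=value,key=value. As a rough illustration of that shape only, here is a minimal Python sketch that splits such a string into its parts. The function name parse_plugin_spec is hypothetical, and this is not the parser cosrlib actually uses; it also assumes option values contain no commas.

```python
def parse_plugin_spec(spec):
    """Split 'package.module.Class:key=value,key2=value2' into its parts.

    Hypothetical helper for illustration; not part of cosrlib.
    Assumes option values never contain a comma.
    """
    # Everything before the first ":" is the dotted path to the plugin class.
    target, _, raw_options = spec.partition(":")
    module_path, _, class_name = target.rpartition(".")

    options = {}
    for item in raw_options.split(","):
        if not item:
            continue  # no options were given
        key, _, value = item.partition("=")
        # Drop surrounding double quotes in case the shell did not strip them.
        options[key.strip()] = value.strip().strip('"')
    return module_path, class_name, options


spec = 'plugins.grep.Words:words="common search",output=/tmp/grep_result'
print(parse_plugin_spec(spec))
# → ('plugins.grep', 'Words', {'words': 'common search', 'output': '/tmp/grep_result'})
```

Read this way, the example command selects the Words class from plugins/grep and passes it two options, words and output.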

Launching the explainer

The explainer allows you to debug results easily. Just run:

make docker_explainer

Then open http://192.168.99.100:9703 in your browser (assuming 192.168.99.100 is the IP of your Docker host).

Launching an index job

make docker_shell
spark-submit spark/jobs/pipeline.py --source commoncrawl:limit=1 --plugin plugins.filter.Homepages:index_body=1 --profile

After this, if you have a cosr-front instance connected to the same Elasticsearch service, you will see the results!

A tutorial is currently being written on this topic.
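If you want to script several index jobs, one option is to assemble the spark-submit command in Python. The sketch below uses only the flags shown above (--source, --plugin, --profile). The helper name build_pipeline_command is hypothetical, and repeating --plugin for several plugins is an assumption that this README does not confirm.

```python
import shlex


def build_pipeline_command(source, plugins=(), profile=False):
    """Build the argument list for spark/jobs/pipeline.py.

    Hypothetical helper for illustration; only uses flags shown in this README.
    """
    cmd = ["spark-submit", "spark/jobs/pipeline.py", "--source", source]
    for spec in plugins:
        # Assumption: one --plugin flag per plugin spec.
        cmd += ["--plugin", spec]
    if profile:
        cmd.append("--profile")
    return cmd


cmd = build_pipeline_command(
    "commoncrawl:limit=1",
    plugins=["plugins.filter.Homepages:index_body=1"],
    profile=True,
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from inside the shell opened by make docker_shell, where spark-submit is available.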