This repository contains the main components of the Common Search backend.
Your help is welcome! We have a complete guide on how to contribute.
Understand the project
This repository has 4 components:
- cosrlib: Python code for parsing, analyzing and indexing documents
- spark: Spark jobs using cosrlib
- urlserver: A service for getting metadata about URLs from static databases
- explainer: A web service for explaining and debugging results, hosted at explain.commonsearch.org
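To illustrate the parse → analyze → index flow that these components implement, here is a toy sketch in plain Python. It is an illustration of the stages only, not actual cosrlib code; every name in it is hypothetical:

```python
# Illustrative sketch only: hypothetical stand-ins for the parse/analyze/index
# stages performed by cosrlib. None of these names come from the real library.

def parse(raw_html):
    """Extract a title from raw HTML (toy implementation)."""
    start = raw_html.find("<title>") + len("<title>")
    end = raw_html.find("</title>")
    return {"title": raw_html[start:end]}

def analyze(document):
    """Derive simple features from the parsed document."""
    document["word_count"] = len(document["title"].split())
    return document

def index(document, search_index):
    """Store the analyzed document in an in-memory 'index'."""
    search_index[document["title"]] = document

search_index = {}
doc = analyze(parse("<html><title>Common Search</title></html>"))
index(doc, search_index)
print(search_index["Common Search"]["word_count"])  # -> 2
```

In the real pipeline, the Spark jobs in spark/ drive these stages at scale over many documents.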
Here is how they fit in our general architecture:
A complete guide is available in INSTALL.md.
Using plugins
Common Search supports the insertion of user-provided plugins in its processing pipeline. Some are included by default, for instance:
```
make docker_shell

spark-submit spark/jobs/pipeline.py --source url:https://about.commonsearch.org/ --plugin plugins.grep.Words:words="common search",output=/tmp/grep_result
```
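To give an idea of what a plugin like the one above does, here is a minimal sketch of a grep-style plugin. This does not reproduce cosrlib's actual plugin API; the base class and method names are assumptions, and only the idea of a plugin configured with keyword arguments (`words=...`, `output=...`) is taken from the command above:

```python
# Hypothetical sketch of a grep-style plugin. The real cosrlib plugin API is
# not reproduced here; class and method names are invented for illustration.

class Plugin:
    """Minimal stand-in base class (assumption, not cosrlib's real one)."""
    def __init__(self, **kwargs):
        self.args = kwargs

class Words(Plugin):
    """Collect URLs of documents whose text contains any of the given words."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.words = self.args["words"].lower().split(",")
        self.matches = []

    def process_document(self, url, text):
        text_lower = text.lower()
        if any(word in text_lower for word in self.words):
            self.matches.append(url)

plugin = Words(words="common search", output="/tmp/grep_result")
plugin.process_document("https://about.commonsearch.org/", "About Common Search")
plugin.process_document("https://example.com/", "Unrelated page")
print(plugin.matches)  # -> ['https://about.commonsearch.org/']
```

In the real pipeline, matching results would be written to the path given in `output` rather than kept in memory.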
Launching the explainer
The explainer allows you to debug results easily. Just run:
Then open http://192.168.99.100:9703 in your browser (assuming 192.168.99.100 is the IP of your Docker host).
Launching an index job
```
make docker_shell

spark-submit spark/jobs/pipeline.py --source commoncrawl:limit=1 --plugin plugins.filter.Homepages:index_body=1 --profile
```
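The `--source` and `--plugin` flags above share a `name:key=value,key=value` shape. A small sketch of how such a spec could be parsed, for illustration only (this is not the repository's actual parsing code, and it ignores edge cases such as values containing commas or colons):

```python
def parse_spec(spec):
    """Split a 'name:key=value,key=value' spec into (name, params dict).

    Illustrative only: mirrors the shape of arguments like
    'commoncrawl:limit=1', not the repository's real parser.
    """
    if ":" not in spec:
        return spec, {}
    name, _, raw_params = spec.partition(":")
    params = {}
    for pair in raw_params.split(","):
        key, _, value = pair.partition("=")
        params[key] = value
    return name, params

print(parse_spec("commoncrawl:limit=1"))
# -> ('commoncrawl', {'limit': '1'})
print(parse_spec("plugins.filter.Homepages:index_body=1"))
# -> ('plugins.filter.Homepages', {'index_body': '1'})
```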
After this, if you have a cosr-front instance connected to the same Elasticsearch service, you will see the results!
A tutorial is currently being written on this topic.