This is the repository hosting the public DocSearch configurations.
DocSearch is composed of 3 different projects:
- The scraper which browses & indexes web pages: https://github.com/algolia/docsearch-scraper
- The configurations for the scraper: https://github.com/algolia/docsearch-configs
If you want to run your own DocSearch instance on those configuration files, please get familiar with the scraper setup guidelines.
The DocSearch scraper will use a configuration file specifying:
- the Algolia index name that will store the records resulting from the crawling
- the URLs it needs to crawl
- the URLs it shouldn't crawl
- the (hierarchical) CSS selectors to use to extract the relevant content from your webpages
- the CSS selectors to skip
- An optional sitemap URL that will be crawled and then scraped
- additional options you might provide to fine-tune the scraping
How it works
Once you run the DocSearch scraper on a specific configuration, it will:
- crawl all the URLs you specified (from the start_urls or the sitemap)
- follow all the hyperlinks mentioned in the page, and continue the crawling there
- stop the crawling as soon as you've reached a URL that is not specified in your configuration or affiliated to a start url
- extract the content of every single crawled page following the logic you defined using the CSS selectors
- push the resulting records to the Algolia index you configured
Update You can check the DocSearch dedicated documentation website if you need more details regarding how to fine-tune your configuration.