Couch Crawler

A search engine built on top of couchdb-lucene.

Dependencies

CouchDB

Python

Optionally for Yammer spidering:

Installation

Assuming couchdb-lucene was installed to the "_fti" endpoint, you can push Couch Crawler to your CouchDB instance with the command:

cd couchapp
couchapp push

This creates a new CouchDB database called "crawler" on the CouchDB instance at localhost:5984. To use a different database, edit couchapp/.couchapprc and run couchapp push again.
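
As a rough illustration of what that edit involves, the sketch below renders a .couchapprc body for a different database URL. It assumes the common couchapp layout of an "env" object with a "default" entry holding a "db" URL; the file shipped in this repository may differ, so compare against it before overwriting anything. The function name and the "my_search" database are made up for the example.

```python
import json


def render_couchapprc(db_url):
    """Return .couchapprc text pointing the default push target at db_url.

    Assumption: couchapp reads the target from env -> default -> db.
    """
    config = {"env": {"default": {"db": db_url}}}
    return json.dumps(config, indent=2) + "\n"


if __name__ == "__main__":
    # Hypothetical target database; printed rather than written to disk
    # so the existing couchapp/.couchapprc is not clobbered by accident.
    print(render_couchapprc("http://localhost:5984/my_search"))
```

Paste the printed JSON into couchapp/.couchapprc, then run couchapp push again from the couchapp directory.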

To configure the crawler, copy python/couchcrawler-sample.cfg to python/couchcrawler.cfg and fill out the appropriate configuration values.

To start indexing pages, run the crawler script:

cd python
./scrapy-ctl.py crawl domain_to_crawl.com

While it's indexing, you can visit the search engine at the following URL:

http://localhost:5984/crawler/_design/crawler/index.html
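
To check that the crawl is actually storing pages, you can also ask CouchDB for the database's info document, which includes a doc_count field. The sketch below assumes the default "crawler" database on localhost:5984 with no authentication; the function names are illustrative and not part of Couch Crawler.

```python
import json
from urllib.request import urlopen


def parse_doc_count(info_json):
    """Extract doc_count from the JSON that CouchDB returns for GET /<db>."""
    return json.loads(info_json)["doc_count"]


def fetch_doc_count(base_url="http://localhost:5984", db="crawler"):
    """Fetch the database info document and return its doc_count.

    Assumption: the database is readable without credentials.
    """
    with urlopen(f"{base_url}/{db}") as response:
        return parse_doc_count(response.read().decode("utf-8"))
```

Calling fetch_doc_count() a few times while the crawler runs should show the number growing. The count includes the design document pushed by couchapp, so it is never exactly the number of crawled pages.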

Spiders

The crawler currently has spiders for:

  • MediaWiki
  • Twiki
  • Yammer

It's pretty easy to create your own. See python/couchcrawler/spiders/wiki.py for an example, or the Scrapy documentation for a more in-depth explanation.
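
To give a feel for what a spider has to do for each page, here is a standard-library-only sketch: pull out a title and body text to index, and collect same-domain links to crawl next. It is not the Scrapy API and not a copy of wiki.py; the class and function names and the document fields (_id, title, body) are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects the <title>, visible text, and href targets of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self.links = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(url, html, allowed_domain):
    """Return (document, links_to_follow) for one fetched page."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    absolute = [urljoin(url, href) for href in parser.links]
    # Stay on the crawled site: keep only links whose host matches.
    follow = [u for u in absolute if urlparse(u).netloc == allowed_domain]
    # Hypothetical document shape; the real spiders may store other fields.
    doc = {
        "_id": url,
        "title": parser.title.strip(),
        "body": " ".join(parser.text_parts),
    }
    return doc, follow
```

A real spider would wrap this kind of logic in whatever callback the installed Scrapy version expects and hand the resulting document to the pipeline that writes to CouchDB.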