Alida: a crawling, scraping and indexing tool written in Clojure
A video of the accompanying EuroClojure 2012 talk is available at: http://vimeo.com/45132055/
The Alida project was started as companion code to my talk at EuroClojure 2012 on the topic of "Building a search engine with Clojure". The goal of this application is to provide the back-end for a simple search engine, while the front-end (i.e. the part that performs the actual searches and displays the results to the visitor) is separate. Alida provides the following functionality:
- Data retrieval (web crawling).
- Storage (storing the crawled pages).
- Scraping (extracting interesting bits and pieces from documents).
- Indexing (storing the end result in a searchable index).
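The four stages above can be pictured as a small pipeline. What follows is only a minimal sketch using plain Clojure (1.7 or later) and in-memory stand-ins, not Alida's actual API: the namespace and the function names `crawl`, `store!`, `scrape` and `index` are hypothetical, an atom takes the place of CouchDB, a regular expression takes the place of Enlive/Jsoup, and a plain map takes the place of a Lucene index.

```clojure
(ns alida-sketch.pipeline
  (:require [clojure.string :as str]))

;; NOTE: every name in this namespace is hypothetical and for illustration only.

(defn crawl
  "Retrieval: calls fetch-fn on every URI and returns {:uri .. :body ..} maps."
  [fetch-fn uris]
  (map (fn [uri] {:uri uri :body (fetch-fn uri)}) uris))

(defn store!
  "Storage: saves each page in the atom `db`, keyed by URI (stand-in for CouchDB)."
  [db pages]
  (doseq [{:keys [uri] :as page} pages]
    (swap! db assoc uri page))
  db)

(defn scrape
  "Scraping: extracts the <title> text from a page with a regex
  (stand-in for a real HTML parser)."
  [{:keys [uri body]}]
  {:uri uri
   :title (some-> (re-find #"(?is)<title>(.*?)</title>" body)
                  second
                  str/trim)})

(defn index
  "Indexing: builds an inverted index from lower-cased title words
  to sets of URIs (stand-in for Lucene)."
  [docs]
  (reduce (fn [idx {:keys [uri title]}]
            (reduce (fn [m word] (update m word (fnil conj #{}) uri))
                    idx
                    (re-seq #"\w+" (str/lower-case (or title "")))))
          {}
          docs))

;; A map of canned pages acts as the fetch function, so no network access is needed.
(def pages
  {"http://example.com/a" "<html><title>Hello Clojure</title></html>"
   "http://example.com/b" "<html><title>Clojure search</title></html>"})

(def db (atom {}))
(store! db (crawl pages (keys pages)))

(def idx (index (map scrape (vals @db))))

;; Looking up a word returns the set of URIs whose title contains it.
(get idx "clojure")
```

Passing the fetch function in as an argument keeps the sketch testable offline; in a real crawl it would be replaced by an HTTP client call.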
The application depends on several external libraries and applications, including Apache CouchDB to store the crawled pages, Apache Lucene for indexing, clj-http as a wrapper around the Apache HttpClient, Clutch as a CouchDB library, and both Enlive and Jsoup for scraping data from the retrieved documents. As an experimental project, Alida isn't designed for immediate production use, and some technology choices reflect that. Apache CouchDB, for example, isn't the most obvious choice for storing crawled pages. For large-scale document storage something like HDFS would be more efficient, but CouchDB is easier to set up in a small, experimental setting. If you need something more mature, I'd suggest looking at Apache Nutch, which also includes a web crawler. That being said, one of the goals for Alida is to be able to power a real-world search engine project.
Check the docs/ directory for the slides. There are also two sub-projects. clojure-blog-search is a very basic example of how you could use Alida to crawl the web. Keep in mind that it performs an actual crawl of the personal blogs of fellow Clojurians, so please don't run it unless you have a very good reason to do so. clojure-blog-search-www is a small front-end web application that includes a pre-built Lucene index you can experiment with. If you would rather see a hosted example than dive straight into the code, you can find one at http://clojure-blog-search.vixu.com/.
You can use this code as a library in your own project by adding the following to the :dependencies in your project.clj file:
Don't hesitate to contact me if you have any questions or feedback. You can email me at firstname.lastname@example.org.
My name is Filip de Waard. As the founder of Vixu.com, I write Clojure code for a living. The main focus of Vixu.com is providing website-management software as a service. Under the hood we use the free, open-source Vix application to power the service. My company is also working on a product search application written in Clojure.