The robots.txt handling code in crawler-commons was based on the Bixo code but has subsequently been improved.
|contrib/helpful||Maintain SNAPSHOT version number in master|
|doc||Moved the contents of the doc/Releasing.txt to the "Creating a Releas…|
|examples||Maintain SNAPSHOT version number in master|
|lib||Complete the switch to using Maven for jar dependencies.|
|src||Switched to crawler-commons for processing robots.txt.|
|tools||Maintain SNAPSHOT version number in master|
|.gitignore||exclude examples/logs from Git|
|CHANGES||Updated changes file for 0.9.2|
|README||Changed license info.|
|build.properties||Maintain SNAPSHOT version number in master|
|build.xml||Get rid of the ec2 section in the dist target.|
|pom.xml||Switched to crawler-commons for processing robots.txt.|
===============================
Introduction
===============================

Bixo is an open source Java web mining toolkit that runs as a series of
Cascading pipes. It is designed to be used as a tool for creating customized
web mining apps. By building a customized Cascading pipe assembly, you can
quickly create a workflow using Bixo that fetches web content, parses,
analyzes, and publishes the results.

Bixo borrows heavily from the Apache Nutch project, as well as many other
open source projects at Apache and elsewhere.

Bixo is released under the Apache License, Version 2.0.

===============================
Building
===============================

See http://openbixo.org/documentation/building-bixo/ for full details.

You need Apache Ant 1.7 or higher.

To get a list of valid targets:

% cd <project directory>
% ant -p

To clean and build a jar (which also runs all tests):

% ant clean jar

Note that "ant clean test jar" will currently fail, due to a bug in the
maven ant task plugin used for managing dependencies.

To create Eclipse project files:

% ant eclipse

Then, from Eclipse follow the standard procedure to import an existing Java
project into your Workspace.
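
As a rough illustration of the fetch, parse, analyze, publish workflow
described in the Introduction, the sketch below chains four stages in plain
Java. It does not use the Bixo or Cascading APIs: the class WorkflowSketch,
its methods, and the in-memory "fake web" map are all hypothetical names
chosen for this example, and a real Bixo workflow would express each stage as
part of a Cascading pipe assembly instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names come from Bixo or Cascading.
class WorkflowSketch {

    // Stage 1: "fetch" a page. A real crawler would issue an HTTP request;
    // here an in-memory map stands in for the web (an assumption).
    static String fetch(Map<String, String> fakeWeb, String url) {
        return fakeWeb.getOrDefault(url, "");
    }

    // Stage 2: "parse" by replacing anything that looks like a tag with a space.
    static String parse(String html) {
        return html.replaceAll("<[^>]*>", " ").trim();
    }

    // Stage 3: "analyze" by counting whitespace-separated words.
    static int analyze(String text) {
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    // Stage 4: "publish" one tab-separated result line per URL.
    static List<String> run(Map<String, String> fakeWeb, List<String> urls) {
        List<String> results = new ArrayList<>();
        for (String url : urls) {
            int words = analyze(parse(fetch(fakeWeb, url)));
            results.add(url + "\t" + words);
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, String> fakeWeb = Map.of(
            "http://example.com/",
            "<html><body><p>Hello web mining</p></body></html>");
        run(fakeWeb, List.of("http://example.com/")).forEach(System.out::println);
    }
}
```

Each stage takes only the output of the previous one, which is the property
that lets a toolkit like Bixo swap in customized stages for fetching,
parsing, or analysis without changing the rest of the workflow.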