Skip to content
Fetching latest commit…
Cannot retrieve the latest commit at this time.
..
Failed to load latest commit information.
bin
doc
lib
src
README
build.properties
build.xml
pom.xml

README

The "examples" folder contains a set of examples to showcase Bixo.

The DemoSimpleCrawlTool demonstrates a simple web crawler. You provide
it with a starting domain name (e.g. cnn.com) and a user agent name,
and it will crawl for the specified number of loops.

The DemoWebMiningTool is an example of a focused crawler with the emphasis on 
extracting (un)structured data from web pages. It demonstrates how 
to analyze the fetched data from a page using a DOM parser. The result is a text
file with one entry per line : 
	<soruce url>\t<extracted img url>\t<description text (if any) via the alt tag>.
(http://www.scaleunlimited.com/about/focused-crawler/ is a good starting point 
to understand focused crawling)



To build a job jar:

% cd <path to Bixo>/examples
% ant clean job

To run the job jar (for DemoCrawlTool and DemoWebMiningTool):

% hadoop jar build/bixo-examples-job-<x.y.z>.jar bixo.examples.crawl.DemoCrawlTool \
	-agentname <your agent name> -domain <target domain> -numloops <number of loops> -outputdir <results dir>

Note that you should pick a number of loops > 1, so that it crawls more than just the top-level
page of your target domain.

% hadoop jar build/bixo-examples-job-<x.y.z>.jar bixo.examples.webmining.DemoWebMiningTool \
	-agentname <your agent name> -workingdir <results dir>

To create an Eclipse project:

% cd <path to Bixo>/examples
% ant eclipse

after which you would import the project into your Eclipse workspace.

Something went wrong with that request. Please try again.