Latest commit feb51bb Nov 21, 2012 bolthar Description in assemblyinfo

Tenteikura

A minimal C# multithreaded web crawler

Usage

From a user's point of view, all that is needed to start the crawler is a call to the Crawl method on a Crawler instance. Crawler's constructor takes a Cache instance as a parameter, which in turn requires a starting URL and a target directory to be instantiated.

    String targetDirectory = @"C:\tenteikura_cache";
    Uri startingURL        = new Uri("http://www.andreadallera.com");
    Cache cache            = new Cache(startingURL, targetDirectory);
    Crawler crawler        = new Crawler(cache);
    crawler.Crawl(startingURL); //starts the crawler at http://www.andreadallera.com

Crawler's constructor also takes an optional boolean parameter (default false) which, if true, instructs the crawler to fetch pages outside the starting URI's domain:

    new Crawler(cache, true);  //will follow urls outside the starting URI's domain
    new Crawler(cache, false); //will fetch only pages inside the starting URI's domain
    new Crawler(cache);        //same as above

Run this way, the crawler only keeps the downloaded pages in the Cache object, which is an IEnumerable:

    foreach(Page page in cache) 
    {
        Console.WriteLine(page.Title);  //page title
        Console.WriteLine(page.HTML);   //page full HTML
        Console.WriteLine(page.Uri);    //page URI object
        Console.WriteLine(page.Hash);   //a hash of the URI's AbsoluteUri
        foreach(Uri link in page.Links) 
        {
            //the page exposes an IEnumerable<Uri> which contains all the links found on the page itself
            Console.WriteLine(link.AbsoluteUri);
        }
    }

Crawler exposes two events - NewPageFetched and WorkComplete:

    //fired when a valid page not in cache is downloaded
    crawler.NewPageFetched += (page) => {
        //do something with the fetched page
    };
    //fired when the crawler has no more pages left to fetch
    crawler.WorkComplete += () => {
        //shut down the application, or forward to the GUI, or whatever
    };

If you want to persist the fetched pages, a very rudimentary file-system-backed storage option is available via the Persister class:

    Persister persister = new Persister(targetDirectory, startingURL);
    crawler.NewPageFetched += (page) => {
        persister.save(page);
    };

Persister saves each page in a subdirectory of targetDirectory named after startingURL.Authority, as two files: one, with filename page.Hash + ".link", contains the page's absolute URI; the other, with filename page.Hash, contains the page itself in full.
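The layout described above can be sketched as follows. This is a minimal, self-contained illustration of the directory and file naming scheme, not code from the library: the temp directory stands in for the target directory, and the hash value is made up to play the role of page.Hash.

```csharp
using System;
using System.IO;

class PersisterLayoutSketch
{
    static void Main()
    {
        // Illustrative stand-ins: a temp path instead of C:\tenteikura_cache,
        // and a fabricated hash; only the layout mirrors Persister's scheme.
        string targetDirectory = Path.Combine(Path.GetTempPath(), "tenteikura_cache");
        string authority = "www.andreadallera.com";   // startingURL.Authority
        string hash = "a1b2c3";                       // stands in for page.Hash

        // Pages live in a subdirectory named after the authority.
        string pageDirectory = Path.Combine(targetDirectory, authority);
        Directory.CreateDirectory(pageDirectory);

        // One file per page holds the absolute URI...
        File.WriteAllText(Path.Combine(pageDirectory, hash + ".link"),
                          "http://www.andreadallera.com/");
        // ...and one, named by the hash alone, holds the full HTML.
        File.WriteAllText(Path.Combine(pageDirectory, hash),
                          "<html><body>example</body></html>");

        // Reading a page back is a matter of knowing its hash.
        Console.WriteLine(File.ReadAllText(Path.Combine(pageDirectory, hash + ".link")));
    }
}
```

Two flat files per page keep the scheme trivially inspectable, at the cost of the scalability issues noted in the TO DO section below.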

There is an example console application in Tenteikura.Example.

TO DO

There's a hard dependency between Cache and Persister at the moment: Cache expects the pages in the targetDirectory + startingUri.Authority path to be in the same format as the ones saved by Persister, while the loading strategy should be injected (and ideally provided by Persister itself).

Persister should use a more effective storage strategy, perhaps backed by an RDBMS or a document store.

The pages are fetched in random order, so there is no traversal-priority strategy (such as breadth-first or depth-first) of any kind.