A simple web crawler for CSCI 3172 Assignment 1 (UNMAINTAINED)

    WWW::3172::Crawler - A simple web crawler for CSCI 3172 Assignment 1

    version v0.002

        use WWW::3172::Crawler;
        my $crawler = WWW::3172::Crawler->new(host => '', max => 50);
        my $stats = $crawler->crawl;
        # Present the stats however you want

    The constructor takes a mandatory 'host' parameter, which specifies the
    starting point for the crawler. The 'max' parameter specifies how many
    pages to visit, defaulting to 200.
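
    For example, a minimal constructor call can rely on the default 'max'
    (the host shown below is only a placeholder, not part of the
    documentation):

        use WWW::3172::Crawler;

        # The host is a placeholder; 'max' is omitted, so the crawler
        # defaults to visiting up to 200 pages.
        my $crawler = WWW::3172::Crawler->new(host => 'http://example.com/');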

    Additional settings are:

    *   debug - whether to print debugging information

    *   ua - a LWP::UserAgent object to use for crawling. This can be used
        to provide a mock useragent which doesn't connect to the internet,
        for testing (see the sketch after the callback example below).

    *   callback - a coderef which gets called for each page crawled. The
        coderef is called with two parameters: the URL and a hashref of
        data. This can be used to do incremental processing, instead of
        doing the crawl run all at once and returning a large hashref of
        data. This also reduces memory requirements.

            my $crawler = WWW::3172::Crawler->new(
                host     => '',
                callback => sub {
                    my $url  = shift;
                    my $data = shift;
                    print "Got data about $url:\n";
                    print "Stems: @{ $data->{stems} }\n";
                },
            );

    Begins crawling at the provided link, collecting statistics as it goes.
    The robot respects robots.txt. At the end of the crawling run, it
    reports some basic statistics for each page crawled:

    *   Description meta tag

    *   Keywords meta tag

    *   Page size

    *   Load time

    *   Page text

    *   Keywords extracted from page text using the following technique (a
        rough sketch follows this list):

        1)  Split page text on whitespace

        2)  Skip stopwords

        3)  "Normalize" to remove non-ASCII characters

        4)  Run Porter's stemming algorithm
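
    A rough Perl sketch of that pipeline, using Lingua::StopWords and
    Lingua::Stem as stand-ins (the module's actual implementation may use
    different helpers):

        use Lingua::StopWords qw(getStopWords);
        use Lingua::Stem;

        my $page_text = 'The quick brown foxes jumped over the lazy dogs';
        my $stopwords = getStopWords('en');                 # hashref of stopwords
        my $stemmer   = Lingua::Stem->new(-locale => 'EN');

        my @words = grep { !$stopwords->{$_} }              # 2) skip stopwords
                    map  { lc }
                    split /\s+/, $page_text;                # 1) split on whitespace
        s/[^[:ascii:]]//g for @words;                       # 3) "normalize" to ASCII
        my @stems = @{ $stemmer->stem(@words) };            # 4) Porter stemming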

    The data is returned as a hash keyed on URL.
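
    For illustration only (the key names below are assumptions, not a
    documented contract), the returned hash can be walked like this:

        my $stats = $crawler->crawl;
        for my $url (keys %$stats) {
            my $page = $stats->{$url};
            # Inspect the hashref to see exactly which fields the installed
            # version provides; 'size' here is only a guess.
            printf "%s: %d bytes\n", $url, $page->{size} // 0;
        }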

    Image, video, and audio files are also fetched and evaluated for size
    and load speed.

    Crawling ends when there are no more URLs in the crawl queue, or the
    maximum number of pages is reached.

    URLs are crawled in descending order of the number of times the crawler
    has seen links to them. This is somewhat similar to Google's PageRank
    algorithm, where the popularity of a page, as measured by inbound
    links, is a major factor in its ranking in search results.
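
    The ordering can be pictured as a frequency-sorted queue; this is an
    illustrative sketch only, not the module's actual internals:

        # %seen maps URL => number of times a link to it has been
        # encountered; the most frequently linked URLs are crawled first.
        my %seen = (
            'http://example.com/popular' => 3,
            'http://example.com/rare'    => 1,
        );
        my @queue = sort { $seen{$b} <=> $seen{$a} } keys %seen;
        my $next  = shift @queue;    # 'http://example.com/popular'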

    The project homepage is <>.

    The latest version of this module is available from the Comprehensive
    Perl Archive Network (CPAN). Visit <> to find a CPAN site near you.

    The development version lives at <> and may be cloned from
    <git://>. Instead of sending patches, please fork this project using
    the standard git and github infrastructure.

    No bugs have been reported.

    Please report any bugs or feature requests through the web interface
    at <>.
    Mike Doherty <>

    This software is copyright (c) 2011 by Mike Doherty.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.
