NAME

AnyEvent::Net::Curl::Queued - Moose wrapper for queued downloads via Net::Curl & AnyEvent

VERSION

version 0.010

SYNOPSIS

    #!/usr/bin/env perl

    # A minimal crawler: each download is a worker object subclassing
    # AnyEvent::Net::Curl::Queued::Easy.
    package CrawlApache;
    use common::sense;

    use HTML::LinkExtor;
    use Moose;

    extends 'AnyEvent::Net::Curl::Queued::Easy';

    # hook the end of every download through a Moose method modifier
    after finish => sub {
        my ($self, $result) = @_;

        say $result . "\t" . $self->final_url;

        if (
            not $self->has_error
            and $self->getinfo('content_type') =~ m{^text/html}
        ) {
            # extract local HTTP links from the downloaded HTML
            my @links;

            HTML::LinkExtor->new(sub {
                my ($tag, %links) = @_;
                push @links,
                    grep { $_->scheme eq 'http' and $_->host eq 'localhost' }
                    values %links;
            }, $self->final_url)->parse(${$self->data});

            # schedule a new worker for every link found
            for my $link (@links) {
                $self->queue->prepend(sub {
                    CrawlApache->new({ initial_url => $link });
                });
            }
        }
    };

    no Moose;
    __PACKAGE__->meta->make_immutable;

    1;

    package main;
    use common::sense;

    use AnyEvent::Net::Curl::Queued;

    # create the queue, seed it with the first URL and block until done
    my $q = AnyEvent::Net::Curl::Queued->new;
    $q->append(sub {
        CrawlApache->new({ initial_url => 'http://localhost/manual/' })
    });
    $q->wait;

DESCRIPTION

Efficient and flexible batch downloader with a straightforward interface:

  • create a queue;
  • append/prepend URLs;
  • wait for downloads to end (retry on errors).

Download init/finish/error handling is defined through Moose's method modifiers.
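For example, a worker can hook both the setup and the teardown of each download. The sketch below assumes the init and finish hooks of AnyEvent::Net::Curl::Queued::Easy (init is assumed to run right before the transfer starts); the MyDownload package name and the log messages are illustrative:

    package MyDownload;
    use common::sense;
    use Moose;

    extends 'AnyEvent::Net::Curl::Queued::Easy';

    after init => sub {
        my ($self) = @_;
        # runs right before the transfer starts
        say 'fetching ', $self->initial_url;
    };

    after finish => sub {
        my ($self, $result) = @_;
        # $result holds the libcurl status of the finished transfer
        warn $self->final_url, ' failed: ', $result if $self->has_error;
    };

    no Moose;
    __PACKAGE__->meta->make_immutable;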

MOTIVATION

I am very unhappy with the performance of LWP. It is almost perfect at properly handling HTTP headers, cookies and the like, but that thoroughness comes at the cost of speed. While this hardly matters for single downloads, it makes batch downloading a real pain.

When I download a large batch of documents, I don't care about cookies or headers; only the content and proper redirection matter. And since this is clearly an I/O-bound operation, I want to make as many parallel requests as possible.

CPAN already offers the building blocks to fulfill those needs: Net::Curl, a fast libcurl binding, and AnyEvent, an event loop framework.

AnyEvent::Net::Curl::Queued is a glue module to wrap it all together. It offers no callbacks and (almost) no default handlers. It's up to you to extend the base class AnyEvent::Net::Curl::Queued::Easy so it will actually download something and store it somewhere.

OVERHEAD

Obviously, the bottleneck of any download agent is the connection itself. However, socket handling and header parsing add a lot of overhead. The script eg/benchmark.pl compares AnyEvent::Net::Curl::Queued against several other download agents. Only AnyEvent::Net::Curl::Queued itself, AnyEvent::Curl::Multi and lftp support parallel connections natively; forks are used to reproduce the same behaviour for the remaining agents. Both AnyEvent::Curl::Multi and LWP::Curl are frontends for WWW::Curl. The download target is a local copy of the Apache documentation.

                                URL/s WWW::Mechanize LWP::UserAgent HTTP::Lite HTTP::Tiny AnyEvent::Curl::Multi  lftp AnyEvent::Net::Curl::Queued AnyEvent::HTTP  curl LWP::Curl  wget
    WWW::Mechanize                196             --           -60%       -80%       -85%                  -86%  -88%                        -89%           -92%  -97%      -97% -100%
    LWP::UserAgent                484           148%             --       -51%       -63%                  -66%  -70%                        -72%           -80%  -93%      -93%  -99%
    HTTP::Lite                    989           405%           104%         --       -25%                  -32%  -39%                        -42%           -59%  -85%      -86%  -99%
    HTTP::Tiny                   1312           569%           170%        33%         --                   -9%  -19%                        -23%           -46%  -80%      -82%  -99%
    AnyEvent::Curl::Multi        1446           638%           198%        46%        10%                    --  -10%                        -16%           -41%  -78%      -80%  -98%
    lftp                         1609           722%           232%        63%        23%                   11%    --                         -6%           -34%  -75%      -77%  -98%
    AnyEvent::Net::Curl::Queued  1713           773%           253%        73%        30%                   18%    6%                          --           -30%  -74%      -76%  -98%
    AnyEvent::HTTP               2437          1144%           403%       146%        86%                   69%   51%                         42%             --  -63%      -66%  -97%
    curl                         6512          3228%          1244%       559%       397%                  351%  305%                        281%           167%    --       -8%  -93%
    LWP::Curl                    7110          3524%          1364%       618%       442%                  391%  341%                        315%           191%    9%        --  -92%
    wget                        88875         45240%         18215%      8877%      6675%                 6045% 5418%                       5092%          3544% 1262%     1151%    --

AnyEvent::HTTP and LWP::Curl are actually faster, but both lack queueing and retries.

ATTRIBUTES

allow_dups

Allow duplicate requests (default: false). By default, requests to the same URL (more precisely, requests with the same signature) are issued only once. To seed POST parameters, you must extend the AnyEvent::Net::Curl::Queued::Easy class. Setting allow_dups to a true value disables duplicate request checks.

completed

Count completed requests.

cv

AnyEvent condition variable. Initialized automatically, unless you specify your own.

max

Maximum number of parallel connections (default: 4; minimum value: 1).

multi

Net::Curl::Multi instance.

queue

ArrayRef to the queue. It has the following helper methods (see the sketch after this list):

  • queue_push: append item at the end of the queue;
  • queue_unshift: prepend item at the top of the queue;
  • dequeue: shift item from the top of the queue;
  • count: number of items in queue.
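A brief sketch of how these can be used, assuming the helpers are delegated to the queue owner so they can be called on the queue object directly (the progress ticker is illustrative):

    use common::sense;
    use AnyEvent;
    use AnyEvent::Net::Curl::Queued;

    my $q = AnyEvent::Net::Curl::Queued->new;
    # ... append workers here ...

    # periodically report progress: count() is the number of items
    # still waiting in the queue, completed() the finished requests
    my $timer = AnyEvent->timer(after => 1, interval => 1, cb => sub {
        warn sprintf "%d queued / %d completed\n", $q->count, $q->completed;
    });

    $q->wait;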

share

Net::Curl::Share instance.

stats

AnyEvent::Net::Curl::Queued::Stats instance.

timeout

Timeout (default: 60 seconds).
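All of the above can be set at construction time, as with any Moose class; the values below are illustrative:

    use AnyEvent::Net::Curl::Queued;

    my $q = AnyEvent::Net::Curl::Queued->new({
        max        => 8,    # up to 8 parallel connections
        timeout    => 120,  # seconds
        allow_dups => 1,    # skip the duplicate request checks
    });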

METHODS

start()

Populate empty request slots with workers from the queue.

empty()

Returns true when there are neither active requests nor requests left in the queue.

add($worker)

Activate a worker.

append($worker)

Put the worker (an instance of AnyEvent::Net::Curl::Queued::Easy) at the end of the queue. For lazy initialization, wrap the worker in a sub { ... } block, the same way you would with a Moose default => sub { ... }:

    $queue->append(sub {
        AnyEvent::Net::Curl::Queued::Easy->new({ initial_url => 'http://.../' })
    });

prepend($worker)

Put the worker (an instance of AnyEvent::Net::Curl::Queued::Easy) at the beginning of the queue. For lazy initialization, wrap the worker in a sub { ... } block, the same way you would with a Moose default => sub { ... }:

    $queue->prepend(sub {
        AnyEvent::Net::Curl::Queued::Easy->new({ initial_url => 'http://.../' })
    });

wait()

Shortcut to $queue->cv->recv.
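Since wait() merely calls recv on the condition variable, the queue also composes with a larger AnyEvent program through the cv attribute; this is plain AnyEvent condvar usage, not an extra feature of this module:

    # blocking, equivalent to $q->wait
    $q->cv->recv;

    # non-blocking: run a callback once the queue drains
    $q->cv->cb(sub { warn "all downloads finished\n" });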

SEE ALSO

AnyEvent, Moose, Net::Curl, AnyEvent::Curl::Multi, LWP::Curl, AnyEvent::HTTP

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2011 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
