Perl Grepper is a recursive e-mail address crawler written in Perl. In the main branch, it does not consult robots.txt, does not download binary files, and does not get caught in traps like spampoison.com.
README for perl-grepper

Perl-grepper is a small web crawler for e-mail addresses, written in Perl using libwww-perl (the LWP module). It works recursively: it scans every line for strings that look like an e-mail address and stores them in a file. When it sees an http or https link, it fetches that page and searches it for e-mail addresses and further links. Relative links are not followed. You can specify patterns for URLs and file endings that perl-grepper should not visit; if a link matches such a pattern, perl-grepper skips it and moves on to the next link.

Switches:

-s 'regex pattern'
    A regex pattern (attention: the quotes are important in shells like bash!) which specifies a page at which perl-grepper should stop.

-n NUMBER
    How many pages to crawl. If you specify '500', it will stop at the 500th download (and not execute that download).

-r DEEPEST_RECURSION
    How deep links should be followed. For example, if you specify '3', perl-grepper will stop downloading pages once it is at the 3rd level: perl-grepper.pl -r3 http://www.golem.de/ will search and follow links on www.golem.de, go to www.golem.de/ticker, and from there to http://golem.ivwbox.de/cgi-bin/ivw/CP/Ticker. That is the deepest level (3), so it will not search for further links there, only for e-mail addresses. With this example, perl-grepper downloaded 950 pages.

Usage:

./perl-grepper.pl http://example.com
    # Begin at example.com and crawl until the apocalypse.
./perl-grepper.pl -n 500 http://example.net/
    # Begin at example.net, crawl until 500 pages have been downloaded, then stop immediately.
./perl-grepper.pl -s 'page\.at/which(/to)?/stop' http://example.org
    # Crawl until we visit the page given with -s. Perl regexes are allowed.
./perl-grepper.pl -n 3000 -r 3 http://example.com
    # Start at example.com, but do not download further pages beyond the 3rd level of recursion,
    # and download no more than 3000 pages.
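The per-line scanning described above (find e-mail-like strings, collect absolute http/https links for later fetching) can be sketched roughly like this. This is a hypothetical, simplified illustration, not the actual perl-grepper code; the regexes and the scan_line helper are assumptions for the example, and the real program additionally fetches each collected link with LWP:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: scan one line of HTML for e-mail addresses and
# absolute http/https links, in the spirit of perl-grepper's per-line search.
# The regexes are deliberately simple and illustrative, not exhaustive.
sub scan_line {
    my ($line) = @_;
    # Strings that look like an e-mail address.
    my @mails = $line =~ /([\w.+-]+@[\w-]+(?:\.[\w-]+)+)/g;
    # Absolute http/https links only; relative links are not followed.
    my @links = $line =~ /href="(https?:\/\/[^"]+)"/g;
    return (\@mails, \@links);
}

my $line = '<a href="http://example.com/contact">mail me: foo@example.com</a>';
my ($mails, $links) = scan_line($line);
print "mail: $_\n" for @$mails;   # prints "mail: foo@example.com"
print "link: $_\n" for @$links;   # prints "link: http://example.com/contact"
```

In the real crawler, each collected link would then be fetched (e.g. via LWP::UserAgent) and its pages scanned the same way, down to the recursion depth given with -r.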
(Any of the switches above can be combined.)

====== NECESSARY MODULES ======

Essential modules for perl-grepper are:

- LWP (often packaged; Debian/Ubuntu: libwww-perl; openSUSE: perl-libwww-perl; simply search for 'libwww' and pick the package that matches your distribution's naming scheme for Perl modules; or get LWP from CPAN)
- Getopt::Std (distributed with standard Perl)

====== PLATFORMS ======

The master branch is for all UNIX Perl distributions. If you want to crawl the web on Windows, please use the program from the 'windows' branch.

====== LICENSE ======

Copyright (C) 2012, 2013 lbo

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.