A web crawler in PHP
PHP
tree: f26af68a5d

.gitignore
CONFIG_db_example.php
LIB_db_functions.php
LIB_encoding.php
LIB_exclusion_list.php
LIB_http.php
LIB_parse.php
LIB_resolve_addresses.php
LIB_simple_spider.php
LICENCE.txt
README.md
listNodes_graphml_04_byDomain02_centralGov.php
spider.php

README.md

phpWebCralwer

A web crawler in PHP.

Note that an additional file, CONFIG_db.php, is required. It sets the database server, name and password. Full details are given in the comments in LIB_db_functions.php.

TODO:

  • Move domains into their own db table.
  • Set up per-domain timers, so that the rate limit applies per domain rather than globally.
  • Move to libcurl for fetching, so that oversized responses and responses with the wrong MIME type can be rejected.
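
The per-domain timer item could look something like the sketch below. It is not code from this repository: the class `DomainRateLimiter`, its method names, and the choice to pass the current time in as an argument are all assumptions made for illustration. The idea is to remember, for each host, the earliest time it may be fetched again, so that a slow domain does not hold up fetches from other domains.

```php
<?php
// Hypothetical helper, not part of this repository: a minimal per-domain
// rate limiter. Each host gets its own "next allowed" timestamp.
final class DomainRateLimiter
{
    /** @var array<string, float> host => earliest time of the next fetch */
    private array $nextAllowed = [];

    private float $minDelay;

    public function __construct(float $minDelaySeconds)
    {
        $this->minDelay = $minDelaySeconds;
    }

    /** Seconds to wait before $url may be fetched (0.0 if allowed now). */
    public function secondsUntilAllowed(string $url, float $now): float
    {
        $next = $this->nextAllowed[$this->hostOf($url)] ?? 0.0;
        return max(0.0, $next - $now);
    }

    /** Record that $url was fetched at time $now. */
    public function recordFetch(string $url, float $now): void
    {
        $this->nextAllowed[$this->hostOf($url)] = $now + $this->minDelay;
    }

    /** Lower-cased host part of the URL, or '' if it cannot be parsed. */
    private function hostOf(string $url): string
    {
        $host = parse_url($url, PHP_URL_HOST);
        return strtolower(is_string($host) ? $host : '');
    }
}
```

In a crawl loop, one would call `secondsUntilAllowed($url, microtime(true))` before each fetch, then either sleep for that long or move on to a URL from a different domain, and call `recordFetch()` once the request has been made. Passing the time in explicitly keeps the class easy to test without real delays.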