A web crawler in PHP
networks concat sql and execute as one query Sep 22, 2012
unit_test bug fix previous -- opps Aug 16, 2012
.gitignore Script to get size of strong component of each domain network Sep 6, 2012
LIB_db_functions.php make minCacheSize global in cache function Nov 11, 2012
LIB_encoding.php Initial upload of libraries, main crawling script and GraphML export … May 30, 2012
LIB_exclusion_list.php Back out pre-fetch suffix check botch (exclusion_list covers same gro… Jul 23, 2012
LIB_http.php Spider is up and running! Jul 28, 2012
LIB_parse.php Spider is up and running! Jul 28, 2012
LIB_resolve_addresses.php Initial upload of libraries, main crawling script and GraphML export … May 30, 2012
LIB_simple_spider.php annoying Nov 4, 2012
LICENCE.txt Remove surplus libraries Jul 3, 2012
README.md readme Oct 18, 2012
dumpfiles.php need db_connect call Oct 21, 2012
example_CONFIG_db.php add whitelistURL option Nov 4, 2012
example_db.sql missing ; at end of db create Oct 18, 2012
example_db_alt.sql to add to db after collection Oct 16, 2012
getExternalLinkedHosts.php comment output, protect against no output in db_run_select Sep 5, 2012
getNetworksPerDomain.php list nodes per domain, create networks for all domains Sep 5, 2012
listNodes_graphml.php fix error Nov 8, 2012
listNodes_graphml_04_byDomain02_centralGov.php Initial upload of libraries, main crawling script and GraphML export … May 30, 2012
listNodes_graphml_forDomain.php list nodes per domain, create networks for all domains Sep 5, 2012
set_status.php fix some error messages Aug 19, 2012
spider.php misc Oct 18, 2012

README.md

phpWebCralwer

A web crawler in PHP.

Note that an additional file, CONFIG_db.php, is required. It sets the database server, database name and password, as well as various other global options. An example file (example_CONFIG_db.php) is included and can be copied as a starting point.
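As a rough illustration only, a configuration file of this kind might look something like the sketch below. Every name in it (DB_SERVER, DB_NAME, DB_USER, DB_PASSWORD, build_dsn) is hypothetical and chosen for this example; the real option names are whatever example_CONFIG_db.php defines, so that file remains the reference.

```php
<?php
// Hypothetical settings for illustration; the real names are defined
// in example_CONFIG_db.php and may differ.
const DB_SERVER   = 'localhost';
const DB_NAME     = 'crawler';
const DB_USER     = 'crawler_user';
const DB_PASSWORD = 'change-me';

// Build a PDO DSN string for MySQL from a server and database name.
function build_dsn($server, $name)
{
    return "mysql:host={$server};dbname={$name};charset=utf8";
}

// Connecting is left commented out so the sketch runs without a database:
// $pdo = new PDO(build_dsn(DB_SERVER, DB_NAME), DB_USER, DB_PASSWORD);
```

Keeping credentials in a separate, untracked file like this means the example file can be committed while the real one stays out of version control.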

TODO:

  • Interface with the Public Suffix List, so that hosts are resolved to the correct registrable domains for the domains table
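To make the TODO item concrete, here is a minimal sketch of suffix-based domain parsing. It is not part of this project: the function name registrable_domain and the four-entry $suffixes array are assumptions made for the example, and a real implementation would load the full Public Suffix List and also handle its wildcard and exception rules, which this sketch ignores.

```php
<?php
// Tiny hand-picked subset of suffix rules, for illustration only.
// A real implementation would load the full Public Suffix List file.
$suffixes = array('com', 'uk', 'co.uk', 'gov.uk');

// Return the registrable domain (public suffix plus one label) for a host,
// or null if the host is itself a suffix or matches no rule in the list.
// Wildcard (*.) and exception (!) rules are NOT handled in this sketch.
function registrable_domain($host, array $suffixes)
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $count = count($labels);
    // Try candidates from longest to shortest so the longest rule wins.
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            if ($i === 0) {
                return null; // the host is itself a public suffix
            }
            return implode('.', array_slice($labels, $i - 1));
        }
    }
    return null;
}
```

With this subset, `registrable_domain('www.example.co.uk', $suffixes)` yields `example.co.uk` rather than the naive `co.uk`, which is the kind of mistake the TODO is meant to avoid.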

Prerequisites

  • PHP
  • MySQL
  • Tidy (php5-tidy)
  • cURL (php5-curl)
  • PDO (php5-mysql)
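
One possible way to check that the PHP extensions listed above are available before running the crawler is sketched below. The helper name missing_extensions is made up for this example and is not part of the repository; the extension names tidy, curl and pdo_mysql are the ones PHP itself reports for these packages.

```php
<?php
// Extensions corresponding to the prerequisites listed above.
$required = array('tidy', 'curl', 'pdo_mysql');

// Return the names of the given extensions that are not loaded.
function missing_extensions(array $required)
{
    $missing = array();
    foreach ($required as $ext) {
        if (!extension_loaded($ext)) {
            $missing[] = $ext;
        }
    }
    return $missing;
}

$missing = missing_extensions($required);
echo $missing
    ? 'Missing: ' . implode(', ', $missing) . "\n"
    : "All extensions loaded.\n";
```

Running this from the command line uses the CLI configuration, which can differ from the web server's, so it is worth checking in whichever environment the crawler will actually run.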