A web crawler in PHP
PHP R
Branch: master
networks concat sql and execute as one query
unit_test bug fix previous -- opps
.gitignore Script to get size of strong component of each domain network
LIB_db_functions.php make minCacheSize global in cache function
LIB_encoding.php Initial upload of libraries, main crawling script and GraphML export …
LIB_exclusion_list.php Back out pre-fetch suffix check botch (exclusion_list covers same gro…
LIB_http.php Spider is up and running!
LIB_parse.php Spider is up and running!
LIB_resolve_addresses.php Initial upload of libraries, main crawling script and GraphML export …
LIB_simple_spider.php annoying
LICENCE.txt Remove surplus libraries
README.md readme
dumpfiles.php need db_connect call
example_CONFIG_db.php add whitelistURL option
example_db.sql missing ; at end of db create
example_db_alt.sql to add to db after collection
getExternalLinkedHosts.php comment output, protect against no output in db_run_select
getNetworksPerDomain.php list nodes per domain, create networks for all domains
listNodes_graphml.php fix error
listNodes_graphml_04_byDomain02_centralGov.php Initial upload of libraries, main crawling script and GraphML export …
listNodes_graphml_forDomain.php list nodes per domain, create networks for all domains
set_status.php fix some error messages
spider.php misc

README.md

phpWebCralwer

A web crawler in PHP.

Note that an additional file, CONFIG_db.php, is required and is not part of the repository. It sets the database server, database name and password, as well as various other global options. An example file (example_CONFIG_db.php) is included as a starting point.
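
As a rough illustration of what such a configuration file might hold, here is a minimal sketch. Every variable name, the `buildDsn()` helper and the database user are hypothetical; the real option names are the ones defined in example_CONFIG_db.php, which should be treated as authoritative.

```php
<?php
// Illustrative only: these names are assumptions, not the repository's actual settings.
$dbServer   = 'localhost';    // database server
$dbName     = 'crawler';      // database name
$dbUser     = 'crawler_user'; // assumed; the README does not mention a user name
$dbPassword = 'change-me';    // database password

// Hypothetical helper that builds a PDO data source name for MySQL.
function buildDsn($server, $name)
{
    return 'mysql:host=' . $server . ';dbname=' . $name . ';charset=utf8';
}

// Hypothetical helper that opens the connection and makes PDO throw on errors.
function connectDb($server, $name, $user, $password)
{
    return new PDO(
        buildDsn($server, $name),
        $user,
        $password,
        array(PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION)
    );
}
```

Calling `connectDb($dbServer, $dbName, $dbUser, $dbPassword)` would then return a PDO handle, provided the MySQL server is reachable with those credentials.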

TODO:

  • Integrate the Public Suffix List, so that registrable domains are parsed correctly for the domains table
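
The TODO item above could be approached roughly as follows. This is a sketch, not code from this repository: `registrableDomain()` is a hypothetical helper, the suffix array is a tiny hand-picked sample rather than the full list, and the wildcard and exception rules of the real Public Suffix List are deliberately ignored.

```php
<?php
// Hypothetical helper: returns the registrable domain of $host,
// or null if the host is itself a suffix or no suffix rule matches.
function registrableDomain($host, $suffixes)
{
    $labels = explode('.', strtolower(rtrim($host, '.')));
    $count  = count($labels);

    // Walk from the longest candidate suffix to the shortest,
    // so that "co.uk" is matched before "uk".
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            if ($i === 0) {
                return null; // the whole host is a public suffix
            }
            // Registrable domain = matched suffix plus one label to its left.
            return $labels[$i - 1] . '.' . $candidate;
        }
    }
    return null;
}

// Sample rules only; a real implementation would load the published list.
$sampleSuffixes = array('com', 'uk', 'co.uk');
echo registrableDomain('www.example.co.uk', $sampleSuffixes), PHP_EOL;
```

The point of the longest-match rule is that a naive "last two labels" split would record `co.uk` instead of `example.co.uk` in the domains table.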

Prerequisites

  • PHP
  • MySQL
  • TidyHTML (php5-tidy)
  • CURL (php5-curl)
  • PDO (php5-mysql)
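
A quick way to check that the PHP extensions listed above are available is sketched below. The names in parentheses above are Debian/Ubuntu package names; the extension names used here (`tidy`, `curl`, `pdo_mysql`) are the ones PHP itself reports, and `pdo_mysql` assumes the crawler uses the MySQL PDO driver. `missingExtensions()` is a hypothetical helper, not part of this repository.

```php
<?php
// Hypothetical helper: returns the names from $required that are not loaded.
function missingExtensions($required)
{
    $missing = array();
    foreach ($required as $ext) {
        if (!extension_loaded($ext)) {
            $missing[] = $ext;
        }
    }
    return $missing;
}

$missing = missingExtensions(array('tidy', 'curl', 'pdo_mysql'));
if ($missing) {
    echo 'Missing PHP extensions: ' . implode(', ', $missing) . PHP_EOL;
} else {
    echo 'All required PHP extensions are loaded.' . PHP_EOL;
}
```

Running this from the command line uses the CLI configuration, which may differ from the web server's; that matters only if the crawler is run through a web server.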