PHOLCIDAE - Tiny python web crawler

Pholcidae

Pholcidae, commonly known as cellar spiders, are a spider family in the suborder Araneomorphae.

About

Pholcidae is a tiny Python module that lets you write your own crawl spider quickly and easily.

See the end of this README for changes in v2.

Dependencies

  • python 2.7 or higher

Install

pip install git+https://github.com/bbrodriges/pholcidae.git

Basic example

from pholcidae2 import Pholcidae

class MySpider(Pholcidae):

    def crawl(self, data):
        print(data.url)

settings = {'domain': 'www.test.com', 'start_page': '/sitemap/'}

spider = MySpider()
spider.extend(settings)
spider.start()

Allowed settings

Settings must be passed as a dictionary to the extend method of the crawler.

Params you can use:

Required

  • domain string - defines the domain whose pages will be parsed. Specify it without a trailing slash.

Additional

  • start_page string - URL used as the entry point to the parsed site. Default: /
  • protocol string - protocol the crawler uses. Default: http://
  • valid_links list - list of regular expression strings (or full URLs) used to filter which site URLs are passed to the crawl() method. Default: ['(.*)']
  • append_to_links string - text appended to each link before fetching it. Default: ''
  • exclude_links list - list of regular expression strings (or full URLs) matching site URLs that must not be checked at all. Default: []
  • cookies dict - dictionary of string key-values representing cookie names and values to be sent with each request. Default: {}
  • headers dict - dictionary of string key-values representing header names and values to be sent with each request. Default: {}
  • follow_redirects bool - whether the crawler follows 3xx redirects. Default: True
  • precrawl string - name of a function to be called before crawling starts. Default: None
  • postcrawl string - name of a function to be called after crawling ends. Default: None
  • callbacks dict - dictionary mapping URL patterns from valid_links to names of self-defined methods that receive the parsed data. Default: {}
  • proxy dict - dictionary mapping protocol names to proxy URLs, e.g. {'http': 'http://user:passwd@host:port'}. Default: {}
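A settings dictionary combining several of the options above might look like the sketch below. The domain, URL patterns, and callback name are illustrative placeholders, not values from the project:

```python
# Hypothetical settings dictionary; domain, patterns and the
# 'parse_article' callback name are placeholders for illustration.
settings = {
    'domain': 'www.example.com',          # required, no trailing slash
    'start_page': '/sitemap/',            # entry point
    'protocol': 'https://',
    'valid_links': ['/articles/(.*)'],    # only pass article pages to crawl()
    'exclude_links': ['(.*)\\.pdf'],      # never check PDF links
    'headers': {'User-Agent': 'MySpider/1.0'},
    'follow_redirects': True,
    # route URLs matching the pattern to a self-defined method
    'callbacks': {'/articles/(.*)': 'parse_article'},
}

print(sorted(settings))
```

You would then pass this dictionary to spider.extend(settings) as in the basic example.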

New in v2:

  • silent_links list - list of regular expression strings (or full URLs) for pages that must not pass their data to the callback function, but whose URLs are still collected. Default: []
  • valid_mimes list - list of strings representing valid MIME types. Only URLs identified as one of these MIME types will be parsed. Default: []
  • threads int - number of concurrent page-fetching threads. Default: 1
  • with_lock bool - whether to use a lock while syncing URLs. Slightly decreases crawling speed but eliminates race conditions. Default: True
  • hashed bool - whether to store parsed URLs as shortened SHA1 hashes. The crawler may run a little slower but consumes far less memory. Default: False
  • respect_robots_txt bool - whether to read the robots.txt file before starting and add its Disallow directives to the exclude_links list. Default: True
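To see why the hashed option saves memory, consider storing a shortened SHA1 digest in place of each full URL. The exact digest length pholcidae2 uses internally is an assumption here; this is only a sketch of the idea:

```python
import hashlib

url = 'http://www.example.com/some/very/long/path?page=42'

# Hypothetical: keep an 8-character SHA1 prefix instead of the full URL.
# The digest length actually used by pholcidae2 is an assumption.
short = hashlib.sha1(url.encode('utf-8')).hexdigest()[:8]

# The fixed-length digest is much shorter than most URLs,
# at the cost of hashing each URL before the visited-set lookup.
print(len(url), len(short))
```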

Response attributes

When you inherit from the Pholcidae class, you can override the built-in crawl() method to retrieve the data gathered from each page. The response object contains different attributes depending on whether the page was parsed successfully.

Successful parsing

  • body string - raw HTML/XML/XHTML etc. representation of page.
  • url string - URL of parsed page.
  • headers AttrDict - dictionary of response headers.
  • cookies AttrDict - dictionary of response cookies.
  • status int - HTTP status of response (e.g. 200).
  • match list - matched part from valid_links regex.

Unsuccessful parsing

  • body string - raw representation of error.
  • status int - HTTP status of response (e.g. 400). Default: 500
  • url string - URL of parsed page.
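Inside an overridden crawl() these attributes are read as plain object attributes. The sketch below uses a SimpleNamespace as a stand-in for the response object, and the branch-on-status logic is illustrative, not part of the library:

```python
from types import SimpleNamespace

# Stand-in logic for an overridden crawl(); in real use this body
# would live on a Pholcidae subclass and receive the library's
# response object instead of a SimpleNamespace.
def crawl(data):
    if data.status == 200:
        # successful parse: body, url, headers, cookies, match available
        return ('ok', data.url, len(data.body))
    # unsuccessful parse: body holds the raw error representation
    return ('error', data.url, data.status)

page = SimpleNamespace(status=200, url='http://www.test.com/', body='<html></html>')
print(crawl(page))  # ('ok', 'http://www.test.com/', 13)
```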

Example

See test.py

Note

Pholcidae does not contain any built-in XML, XHTML, HTML or other parser. You can parse the response body yourself using any available Python libraries you want.
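For example, the standard library's html.parser can extract links from a response body without any third-party dependency. The HTML string below stands in for the data.body you would receive inside crawl():

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags in a response body."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

# Inside crawl() you would feed data.body to the extractor;
# the literal string here is a placeholder for a real page.
extractor = LinkExtractor()
extractor.feed('<a href="/one">1</a><p>text</p><a href="/two">2</a>')
print(extractor.links)  # ['/one', '/two']
```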

v2 vs v1

Major changes have been made in version 2.0:

  • All code has been completely rewritten from scratch
  • Less abstractions = more speed
  • Threads support
  • Matches in page data are now a list and are no longer optional
  • Option stay_in_domain has been removed. Crawler cannot break out of initial domain anymore.

There are some minor code changes that break backward compatibility between versions 1.x and 2.0:

  • You need to explicitly pass settings to the extend method of your crawler
  • Option autostart has been removed. You must call spider.start() explicitly
  • Module is now called pholcidae2