
🕷 Crawler

🕷 Crawler is a simple (but effective!) web crawler written in Python. It outputs a flat dictionary keyed by page URL; each entry lists the static assets (e.g. images and scripts) found on that page and the pages it links to.

Key features:

  • Fast caching with the LRU cache from Python's standard library (functools.lru_cache)
  • Unit tests (more to come soon!)
  • Outputs a flat Python dict — easily serializable to JSON
  • Configurable maximum recursion depth
  • Restricted to crawling same-domain pages
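
To make the depth limit and the same-domain restriction concrete, here is a minimal sketch of how such a crawl could work. It is not this project's actual code: the names PageParser, crawl, fetch, max_depth, and fake_fetch are all hypothetical, and the page fetcher is passed in as a function so the example runs without network access.

```python
from functools import lru_cache
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects link, image, and script URLs from one HTML page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links, self.images, self.scripts = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Resolve relative URLs against the page they were found on.
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "script" and attrs.get("src"):
            self.scripts.append(urljoin(self.base_url, attrs["src"]))


def crawl(start_url, fetch, max_depth=2):
    """Return a flat dict of same-domain pages reachable from start_url.

    `fetch` is an assumed callable that maps a URL to its HTML text.
    """
    domain = urlparse(start_url).netloc
    cached_fetch = lru_cache(maxsize=128)(fetch)  # avoid refetching a URL
    result = {}

    def visit(url, depth):
        if url in result or depth > max_depth:
            return
        parser = PageParser(url)
        parser.feed(cached_fetch(url))
        # Keep only links that stay on the starting domain.
        links = [link for link in parser.links if urlparse(link).netloc == domain]
        result[url] = {
            "assets": {"images": parser.images, "scripts": parser.scripts},
            "links": links,
        }
        for link in links:
            visit(link, depth + 1)

    visit(start_url, 0)
    return result


# A two-page fake site standing in for real HTTP requests.
SITE = {
    "https://website.tld": (
        '<a href="/page.html">page</a>'
        '<a href="https://elsewhere.tld/">external</a>'
        '<img src="/image.png">'
        '<script src="https://othersite.tld/script.js"></script>'
    ),
    "https://website.tld/page.html": '<script src="/scripts/counter.js"></script>',
}


def fake_fetch(url):
    return SITE[url]


pages = crawl("https://website.tld", fake_fetch)
```

In this sketch, assets may live on other domains, but only same-domain links are recorded and followed, and a page deeper than max_depth is never fetched.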

A sample of the output format:

{
  "https://website.tld": {
    "assets": {
      "images": ["https://website.tld/image.png"],
      "scripts": ["https://othersite.tld/script.js"]
    },
    "links": ["https://website.tld/page.html"]
  },
  
  "https://website.tld/page.html": {
    "assets": {
      "images": [],
      "scripts": ["https://website.tld/scripts/counter.js"]
    },
    "links": []
  }
}
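
Because the output is a plain dict, it can be passed straight to the standard json module or walked with ordinary loops. The sketch below assumes that "links" and every entry under "assets" are always lists; crawl_result and external_assets are hypothetical names used only for illustration.

```python
import json
from urllib.parse import urlparse

# A result shaped like the sample above (assumed: every value is a list).
crawl_result = {
    "https://website.tld": {
        "assets": {
            "images": ["https://website.tld/image.png"],
            "scripts": ["https://othersite.tld/script.js"],
        },
        "links": ["https://website.tld/page.html"],
    },
    "https://website.tld/page.html": {
        "assets": {
            "images": [],
            "scripts": ["https://website.tld/scripts/counter.js"],
        },
        "links": [],
    },
}


def external_assets(result):
    """Map each page URL to the assets it loads from a different host."""
    external = {}
    for page, data in result.items():
        host = urlparse(page).netloc
        found = [
            url
            for urls in data["assets"].values()
            for url in urls
            if urlparse(url).netloc != host
        ]
        if found:
            external[page] = found
    return external


# The dict round-trips through JSON unchanged.
as_json = json.dumps(crawl_result, indent=2)
restored = json.loads(as_json)

print(external_assets(restored))
# {'https://website.tld': ['https://othersite.tld/script.js']}
```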

Run the tests from the root of this repository with python -m nose.