Summary

This is a simple web crawler that, given a list of starting URLs, visits all pages within the given domains. The output is plain JSON with the following structure:

{
    "http://www.example.com/starting_page_1.html": {
        "links": [
            "http://www.example.com/static/js/main.js",
            "http://www.example.com/static/images/picture.png",
            "http://www.example.com/blog/post/"
        ],
        "external": [
            "http://twitter.com/aTwitterAccount"
        ]
    }
}
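
As a rough illustration of how this output might be consumed, the sketch below parses a sitemap and counts the entries under "links" and "external" for each page. The `summarize` helper and the inline `SAMPLE` string are hypothetical and not part of crawler.py; treating both keys as optional is also an assumption.

```python
import json

# Sample output following the structure shown above.
SAMPLE = """
{
    "http://www.example.com/starting_page_1.html": {
        "links": [
            "http://www.example.com/static/js/main.js",
            "http://www.example.com/static/images/picture.png",
            "http://www.example.com/blog/post/"
        ],
        "external": [
            "http://twitter.com/aTwitterAccount"
        ]
    }
}
"""


def summarize(sitemap):
    """Return {page_url: (links_count, external_count)} for a sitemap dict."""
    summary = {}
    for page, entry in sitemap.items():
        # Assumption: either key may be absent, so default to an empty list.
        links = entry.get("links", [])
        external = entry.get("external", [])
        summary[page] = (len(links), len(external))
    return summary


sitemap = json.loads(SAMPLE)
for page, (n_links, n_external) in summarize(sitemap).items():
    print(f"{page}: {n_links} links, {n_external} external")
# prints: http://www.example.com/starting_page_1.html: 3 links, 1 external
```

To work with a real run, the `json.loads(SAMPLE)` call could be swapped for `json.load` on whatever file was written with `-O`.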

Installation

Just run:

pip install -r requirements.txt

Usage

$ python crawler.py -h
usage: crawler.py [-h] [-u START_URLS [START_URLS ...]]
                  [-d ALLOWED_DOMAINS [ALLOWED_DOMAINS ...]] [-D] [-O FILE]

Generate a JSON sitemap for website.

optional arguments:
  -h, --help            show this help message and exit
  -u START_URLS [START_URLS ...], --start-urls START_URLS [START_URLS ...]
                        URLs where the spider will start to crawl from
  -d ALLOWED_DOMAINS [ALLOWED_DOMAINS ...], --allowed-domains ALLOWED_DOMAINS [ALLOWED_DOMAINS ...]
                        List of domains that the spider is allowed to crawl
  -D, --debug           Enable debug output
  -O FILE, --output-document FILE
                        Write output to FILE instead of stdout

An example follows:

$ python crawler.py -d mydomain.com www.mydomain.com -u http://www.mydomain.com
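
To make the role of `-d` more concrete, here is a minimal sketch of how a link found on a page could be sorted into "links" or "external" by comparing its host against the allowed domains. This is an assumption about the behaviour, not the logic used in crawler.py: `is_allowed` and `classify` are hypothetical helpers, and exact host matching (no implicit subdomains) is a choice made for this sketch only.

```python
from urllib.parse import urljoin, urlparse


def is_allowed(url, allowed_domains):
    """True if the URL's host is exactly one of the allowed domains."""
    # hostname is None for URLs without a network location (e.g. mailto:).
    host = urlparse(url).hostname or ""
    return host in {domain.lower() for domain in allowed_domains}


def classify(page_url, hrefs, allowed_domains):
    """Split the hrefs found on page_url into on-domain and off-domain lists."""
    entry = {"links": [], "external": []}
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        key = "links" if is_allowed(absolute, allowed_domains) else "external"
        entry[key].append(absolute)
    return entry


allowed = ["mydomain.com", "www.mydomain.com"]
entry = classify(
    "http://www.mydomain.com/",
    ["/blog/post/", "http://twitter.com/aTwitterAccount"],
    allowed,
)
print(entry)
```

Under this exact-match assumption, a host such as `sub.mydomain.com` would count as external unless it is listed explicitly, which is consistent with the example command passing both `mydomain.com` and `www.mydomain.com`.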
