Skip to content
This repository was archived by the owner on Mar 23, 2026. It is now read-only.

alessio/simple-spider

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Summary

This is simple web crawler that, given a list of starting URLs, visits all pages within the given domains. The output is a plain JSON file whose structure meets the following specification:

{
    "http://www.example.com/starting_page_1.html": {
        "links": [
            "http://www.example.com/static/js/main.js",
            "http://www.example.com/static/images/picture.png"
            "http://www.example.com/blog/post/"
        ],
        "external": [
            "http://twitter.com/aTwitterAccount",
            
        ]
    }
}

Installation

Just run:

pip install -r requirements.txt

Usage

$ python crawler.py -h
usage: crawler.py [-h] [-u START_URLS [START_URLS ...]]
                  [-d ALLOWED_DOMAINS [ALLOWED_DOMAINS ...]] [-D] [-O FILE]

Generate a JSON sitemap for website.

optional arguments:
  -h, --help            show this help message and exit
  -u START_URLS [START_URLS ...], --start-urls START_URLS [START_URLS ...]
                        URLs where the spider will start to crawl from
  -d ALLOWED_DOMAINS [ALLOWED_DOMAINS ...], --allowed-domains ALLOWED_DOMAINS [ALLOWED_DOMAINS ...]
                        List of domains that the spider is allowed to crawl
  -D, --debug           Enable debug output
  -O FILE, --output-document FILE
                        Write output to FILE instead of stdout

An example follows:

$ python crawler.py -d mydomain.com www.mydomain.com -u http://www.mydomain.com

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages