crawler

A Web crawler.

  • Starts from a given URL and crawls web pages to a specified depth (see the sketch below).
  • Saves pages that contain a keyword (if provided) into an SQLite database.
  • Supports multi-threading.
  • Supports logging.
  • Supports self-testing.
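
The actual implementation lives in main.py; the sketch below is only a rough illustration of the approach the features above describe: a depth-limited crawl driven by a thread-safe queue of (url, depth) tasks, an optional keyword filter, and SQLite storage. All names here (crawl, the pages table, and so on) are hypothetical and are not taken from the repository.

    # Illustrative sketch only -- not the repository's code. A depth-limited
    # crawl with worker threads, an optional keyword filter, and SQLite storage.
    import re
    import sqlite3
    import threading
    from queue import Queue
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(start_url, depth, keyword=None, dbfile="data.sql", threads=10):
        conn = sqlite3.connect(dbfile, check_same_thread=False)
        conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, html TEXT)")
        lock = threading.Lock()          # guards the connection and the seen set
        seen = {start_url}
        tasks = Queue()
        tasks.put((start_url, 0))

        def worker():
            while True:
                url, level = tasks.get()
                try:
                    html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
                except Exception:
                    tasks.task_done()
                    continue
                # Save the page when no keyword is given or the keyword occurs in it.
                if keyword is None or keyword.lower() in html.lower():
                    with lock:
                        conn.execute("INSERT INTO pages VALUES (?, ?)", (url, html))
                        conn.commit()
                # Follow outgoing links until the depth limit is reached.
                if level + 1 < depth:
                    for link in re.findall(r'href=[\'"]?([^\'" >]+)', html):
                        absolute = urljoin(url, link)
                        with lock:
                            if absolute.startswith("http") and absolute not in seen:
                                seen.add(absolute)
                                tasks.put((absolute, level + 1))
                tasks.task_done()

        for _ in range(threads):
            threading.Thread(target=worker, daemon=True).start()
        tasks.join()                     # blocks until every queued task is processed
        conn.close()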

usage

main.py [-h] -u URL -d DEPTH [--logfile FILE] [--loglevel {1,2,3,4,5}]
               [--thread NUM] [--dbfile FILE] [--key KEYWORD] [--testself]

optional arguments:

  -h, --help            show this help message and exit
  -u URL                Specify the starting URL
  -d DEPTH              Specify the crawling depth
  --logfile FILE        The log file path. Default: spider.log
  --loglevel {1,2,3,4,5}
                        The level of logging detail. A larger number records
                        more details. Default: 3
  --thread NUM          The number of threads. Default: 10
  --dbfile FILE         The SQLite file path. Default: data.sql
  --key KEYWORD         The keyword to search for while crawling. Default: None.
                        To match more than one word, quote them, e.g.
                        --key 'Hello world'
  --testself            Run the crawler's self test
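
For example, to crawl to a depth of 2 with 5 threads and save pages containing the word "python" (the URL below is only a placeholder):

    python main.py -u http://www.example.com -d 2 --key python --thread 5 --dbfile data.sql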
