A basic Python crawler that harvests URLs and maintains a crawl index from a given set of seed URLs. The implementation is based on the algorithm in the book 'Programming Collective Intelligence' by Toby Segaran.
## What it Does
- Given a file of seed URLs, an index file in which to save the results, and a crawl depth (an integer >= 1), it performs the crawl and saves the harvested URLs to the index file. Duplicate URLs encountered during crawling are automatically discarded. Crawl speed varies with your bandwidth and the chosen crawl depth.
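
The crawl itself is essentially a breadth-first traversal of the link graph up to the given depth. Below is a minimal sketch of that idea in Python 2.7, assuming Beautiful Soup 4; the function and variable names (`crawl`, `save_index`, etc.) are illustrative and may not match those in `runcrawler.py`.

```python
# Illustrative sketch only -- crawl() and save_index() are hypothetical names
# and may not match the actual runcrawler.py implementation.
import urllib
from bs4 import BeautifulSoup  # assumes Beautiful Soup 4

def crawl(seed_urls, depth):
    """Breadth-first crawl: fetch each page, collect its outgoing links,
    and descend 'depth' levels, skipping URLs that were already seen."""
    seen = set(seed_urls)
    frontier = list(seed_urls)
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            try:
                html = urllib.urlopen(url).read()
            except Exception:
                continue  # unreachable or malformed page: skip it
            soup = BeautifulSoup(html, 'html.parser')
            for anchor in soup.find_all('a', href=True):
                link = anchor['href']
                if link.startswith('http') and link not in seen:
                    seen.add(link)            # duplicates are discarded here
                    next_frontier.append(link)
        frontier = next_frontier
    return seen

def save_index(urls, index_path):
    # Append the harvested URLs to the index file, one per line.
    with open(index_path, 'a') as index_file:
        for url in urls:
            index_file.write(url + '\n')
```

Keeping the visited URLs in a set makes the duplicate check constant-time, which matters once the frontier grows at higher crawl depths.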
## Requirements/Dependencies
- Python 2.7 - http://www.python.org
- Beautiful Soup Library - http://www.crummy.com/software/BeautifulSoup/
- urllib Library (included in the Python 2.7 standard library) - https://docs.python.org/2/library/urllib.html
## Platforms
This application is platform-agnostic as long as you have a Python interpreter. A quick test showed that it worked fine on:
- Windows XP SP3
- Mac OS X 10.6.4
- Ubuntu 12.10
## Usage
- Prepare the file containing the seed URLs (e.g., test_seed.txt); see the example below the run command.
- Create the index file if running for the first time (e.g., test_index.txt). You need not create it again for subsequent runs, as the crawl output is appended to it. If you want to launch a new crawl task, create a new index file.
- Run the crawler, passing the seed file, the index file, and the crawl depth. For example, with a crawl depth of 2:
python runcrawler.py test_seed.txt test_index.txt 2
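
The seed file is presumably a plain-text list of starting URLs, assumed here to be one per line. A hypothetical test_seed.txt might look like this (the URLs below are placeholders, not seeds shipped with the project):

```
http://www.example.com/
http://www.example.org/news/
```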
## Author
Birhanu Mekuria Eshete - birhanu.mekuria(at)gmail.com
## License
This code is released under the MIT License.