# PyCrawler

A basic Python crawler that harvests URLs and maintains a crawl index from a given set of seed URLs. The implementation is based on the algorithm in the book 'Programming Collective Intelligence' by Toby Segaran.

## What it Does

- Given a file of seed URLs, an index file in which to save the results, and a crawl depth (an integer >= 1), it performs the crawl and saves the harvested URLs in the index file. Duplicate URLs encountered during crawling are automatically discarded. Crawl speed naturally varies with available bandwidth and the chosen crawl depth. A sketch of the crawl loop appears below.
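
The crawl follows the breadth-first scheme from Segaran's chapter: fetch each page, extract its links, and descend one level at a time up to the given depth. The sketch below is illustrative only; the `crawl` and `LinkParser` names are not the repository's actual API, and it is written for Python 3 while the original code may target Python 2:

```python
# Illustrative breadth-first crawl loop (a sketch, not the repository's
# actual code): fetch each page, extract its links, and descend one
# level per iteration up to the requested depth, discarding duplicates.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(seeds, depth):
    seen = set(seeds)              # duplicate URLs are tossed out here
    frontier = list(seeds)
    for _ in range(depth):         # one pass per level of crawl depth
        next_frontier = []
        for url in frontier:
            try:
                html = urlopen(url, timeout=10).read().decode('utf-8', 'replace')
            except Exception:
                continue           # unreachable pages are skipped
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith('http') and absolute not in seen:
                    seen.add(absolute)
                    next_frontier.append(absolute)
        frontier = next_frontier
    return sorted(seen)
```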

## Requirements/Dependencies

## Platforms

This application is platform-agnostic as long as you have a Python interpreter. A quick test showed that it worked fine on:

- Windows XP SP3
- Mac OS X 10.6.4
- Ubuntu 12.10

## Usage

- Prepare the file containing the seed URLs (e.g., test_seed.txt).
- Create the index file if running for the first time (e.g., test_index.txt). You need not create it again for subsequent runs, as the crawl output is appended to the index file. To launch a fresh crawl task, create a new index file.
- Run the crawler. Supposing the crawl depth is 2: `python runcrawler.py test_seed.txt test_index.txt 2` (see the example after this list).
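
For concreteness, a run might look like the following. The seed URLs are hypothetical, and the one-URL-per-line seed file format is an assumption, not something the repository documents:

```
$ cat test_seed.txt
http://example.com/
http://example.org/

$ touch test_index.txt        # first run only
$ python runcrawler.py test_seed.txt test_index.txt 2
```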

## Author

Birhanu Mekuria Eshete - birhanu.mekuria(at)gmail.com

## License

This code is released under the MIT License.
