This is a Scrapy project to scrape websites from public web directories.
This project is only meant for educational purposes.
The items scraped by this project are websites, and the item is defined in the class:
See the source code for more details.
This project contains two spiders: googledir and dmoz. When in doubt,
you can check the available spiders with:
scrapy list
The googledir spider crawls the entire Google Directory, though you may
want to try it first by limiting the crawl to a certain number of items.
For example, to run the
googledir spider limited to 20 scraped items, use:
scrapy crawl googledir --set CLOSESPIDER_ITEMCOUNT=20
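If the limit should apply to every run rather than to a single invocation, the same setting could be placed in the project's settings module. This is a minimal sketch, assuming the project uses the conventional settings.py file; note that it would then apply to all spiders in the project, not only googledir.

```python
# settings.py (assumed location of the project's settings module)
# Close the spider once roughly this many items have been scraped.
CLOSESPIDER_ITEMCOUNT = 20
```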
The dmoz spider scrapes the Open Directory Project (dmoz.org), and it's
based on the dmoz spider described in the Scrapy tutorial.
Unlike the googledir spider, this spider doesn't crawl the entire dmoz.org
site but only a few pages by default (defined in the
attribute). These pages are:
So, if you run the spider normally (with
scrapy crawl dmoz), it will scrape
only those two pages. However, you can scrape any dmoz.org page by passing the
URL instead of the spider name. Scrapy internally resolves the spider to use by
looking at the allowed domains of each spider.
For example, to scrape a different URL use:
scrapy crawl http://www.dmoz.org/Computers/Programming/Languages/Erlang/
You can scrape any URL from dmoz.org using this spider.
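The domain-based resolution described above can be illustrated with a small standalone sketch. This is not Scrapy's internal code: the SPIDER_DOMAINS registry and both helper functions are hypothetical, and the allowed domain listed for dmoz is an assumption.

```python
from urllib.parse import urlparse

# Hypothetical registry: spider name -> allowed domains (assumed values).
SPIDER_DOMAINS = {
    "googledir": ["directory.google.com"],
    "dmoz": ["dmoz.org"],
}


def url_matches_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def spiders_for_url(url):
    """Return the names of all spiders whose allowed domains cover the URL."""
    return [
        name
        for name, domains in SPIDER_DOMAINS.items()
        if any(url_matches_domain(url, d) for d in domains)
    ]
```

With this registry, a URL under www.dmoz.org matches only the dmoz entry, which is why a dmoz.org URL can stand in for the spider name.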
googledir - for scraping Google Directory (directory.google.com)
This project uses a pipeline to filter out websites containing certain forbidden words in their description. This pipeline is defined in the class:
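As a rough illustration of how such a filter can work, here is a minimal sketch of a word-filtering pipeline. The class name ForbiddenWordsPipeline, the "description" field name, and the word list are all hypothetical placeholders, not the project's actual code; see the source for the real implementation.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class ForbiddenWordsPipeline:
    """Hypothetical pipeline: drop items whose description has a forbidden word."""

    # Placeholder words, not the project's actual list.
    forbidden_words = ("politics", "religion")

    def process_item(self, item, spider):
        # Case-insensitive substring match against the item's description.
        description = str(item.get("description", "")).lower()
        for word in self.forbidden_words:
            if word in description:
                raise DropItem(f"Contains forbidden word: {word}")
        return item
```

A pipeline like this only takes effect once it is listed in the project's ITEM_PIPELINES setting. Raising DropItem tells Scrapy to discard the item, while returning it passes it on to the next pipeline stage.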