scrapy-examples

Multifarious Scrapy examples with integrated proxies and user agents, covering sites such as alexa, amazon, douban, douyu, github and linkedin, to make writing a spider comfortable.

Don't use it to do anything illegal!


Real spider example: doubanbook

Tutorial

git clone https://github.com/geekan/scrapy-examples
cd scrapy-examples/doubanbook
scrapy crawl doubanbook

Depth

The spider works at several depths, and the real data is extracted at depth 2.

  • Depth 0: the entrance is http://book.douban.com/tag/
  • Depth 1: URLs like http://book.douban.com/tag/外国文学, found on the depth-0 page
  • Depth 2: URLs like http://book.douban.com/subject/1770782/, found on depth-1 pages
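
A minimal sketch of how these depths could map onto CrawlSpider rules; the patterns and class below are illustrative assumptions, not the doubanbook spider's actual code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DoubanBookSketchSpider(CrawlSpider):
    # Illustrative sketch only; the real rules live in the doubanbook project.
    name = "doubanbook_sketch"
    start_urls = ["http://book.douban.com/tag/"]  # depth 0: the tag index
    rules = [
        # depth 1: tag pages such as /tag/外国文学 are followed, not parsed
        Rule(LinkExtractor(allow=[r"/tag/[^/]+$"]), follow=True),
        # depth 2: book pages such as /subject/1770782/ hold the real data
        Rule(LinkExtractor(allow=[r"/subject/\d+/$"]), callback="parse_book"),
    ]

    def parse_book(self, response):
        # extract the actual book fields here
        pass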

Example image

[image: douban book]


Available Spiders

  • tutorial
    • dmoz_item
    • douban_book
    • page_recorder
    • douban_tag_book
  • doubanbook
  • linkedin
  • hrtencent
  • sis
  • zhihu
  • alexa
    • alexa
    • alexa.cn

Advanced

  • Use parse_with_rules to write a spider quickly.
    See the dmoz spider for more details.

  • Proxies

    • If you don't want to use a proxy, just comment out the proxy middleware in settings.py (see the sketch after this list).
    • If you want to customize it, hack misc/proxy.py yourself.
  • Notice

    • Don't use parse as your method name; it is an internal method of CrawlSpider.
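
A sketch of the relevant part of settings.py; the middleware class path below is an assumption for illustration, so check the project's own settings.py for the real name:

# settings.py (sketch): comment out the proxy middleware entry to disable proxies.
DOWNLOADER_MIDDLEWARES = {
    # 'misc.middleware.CustomHttpProxyMiddleware': 400,  # assumed name; re-enable to route requests through proxies
}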

Advanced Usage

  • Run ./startproject.sh <PROJECT> to start a new project.
    It automatically generates most of the project; the only files left for you to write are:
    • PROJECT/PROJECT/items.py
    • PROJECT/PROJECT/spider/spider.py

Example of hacking items.py and spider.py

Hacked items.py with additional fields url and description:

from scrapy.item import Item, Field

class exampleItem(Item):
    # These field names match the keys used in css_rules in spider.py below.
    url = Field()
    name = Field()
    description = Field()

Hacked spider.py with start rules and CSS rules (only the class exampleSpider is shown here):

class exampleSpider(CommonSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.com/",
    ]
    # The crawler starts from start_urls and follows the URLs allowed by the rules below.
    rules = [
        Rule(sle(allow=["/Arts/", "/Games/"]), callback='parse', follow=True),
    ]

    css_rules = {
        '.directory-url li': {
            '__use': 'dump', # dump data directly
            '__list': True, # it's a list
            'url': 'li > a::attr(href)',
            'name': 'a::text',
            'description': 'li::text',
        }
    }

    def parse(self, response):
        info('Parse '+response.url)
        # parse_with_rules is implemented here:
        #   https://github.com/geekan/scrapy-examples/blob/master/misc/spider.py
        self.parse_with_rules(response, self.css_rules, exampleItem)
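
The sketch below shows the rough idea behind parse_with_rules, assuming the '__use'/'__list' control keys and CSS selectors used above. The helper name and logic are a simplified illustration, not the actual implementation in misc/spider.py:

def parse_with_rules_sketch(response, rules, item_class):
    # Simplified illustration: for each list selector, build one item per matched
    # node and fill its fields from the CSS sub-selectors; control keys such as
    # '__use' and '__list' are skipped.
    items = []
    for list_selector, field_rules in rules.items():
        for node in response.css(list_selector):
            item = item_class()
            for field, css in field_rules.items():
                if field.startswith('__'):
                    continue
                item[field] = node.css(css).extract()
            items.append(item)
    return items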