
scrappy

scrapy best practice

requirements

pip install -r requirements.txt

structure

|____bin    #bash scripts
|____requirements.txt
|____scrappy
| |____dbs    #storage DAOs
| |____extensions    #scrapy extensions
| |____items
| |____middlewares
| |____resources    #static resources
| |____scripts    #py scripts
| |____services    #py services
| |____spiders    #spider definitions
| |____utils    #python utils
|____scrapy.cfg

usage

       write a spider in spiders:

  1. extend CrawlSpider

  2. define name

  3. define start_urls or a start_requests function

  4. define a parse function to parse the response

  5. define the models in items

  6. define the pipelines in pipelines

notice

items

  • handleInsert.

    processes the item before it is inserted

  • handleUpdate.

    processes the item before it is updated
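A plausible shape for these hooks, sketched as a plain dict-based item. The real items in this repo are scrapy Items and the exact signatures may differ; field names here are illustrative.

```python
class ArticleItem(dict):
    """Illustrative item with pre-insert/pre-update hooks.
    Assumption: each hook normalizes fields and returns the item."""

    def handleInsert(self):
        # parse/normalize fields before the item is inserted
        self["title"] = self.get("title", "").strip()
        self.setdefault("views", 0)  # fill defaults that only apply on insert
        return self

    def handleUpdate(self):
        # parse/normalize fields before the item is updated
        self["title"] = self.get("title", "").strip()
        return self
```

For example, ArticleItem(title="  Hello  ").handleInsert() would strip the title and fill the default views count before the storage layer writes it.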

spiders

extend BaseSpider

  • CrawlSpider.

    a normal spider

    the spider will run distributed if ENABLE_REDIS is set to True in settings

  • scrappy.extensions.scrapy_redis.spiders.RedisSpider.

    the spider never shuts down; it keeps popping requests from redis

resource

  • ResourceHelper.

reading, writing and creating files

middlewares

  • RemoveCookieMiddleware.

    removes cookies before each request

  • RandomProxyMiddleware.

    randomly switches the proxy before each request

  • UserAgentMiddleware.

    randomly switches the User-Agent before each request
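A minimal sketch of what a random User-Agent downloader middleware can look like. Scrapy middlewares are duck-typed, so this needs no base class; the USER_AGENTS pool below is illustrative, and the repo's own middleware may load its pool from settings or a resource file instead.

```python
import random

# Illustrative User-Agent pool; replace with a real list or load from config.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # pick a fresh User-Agent for every outgoing request
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let the request continue down the middleware chain
```

RandomProxyMiddleware follows the same pattern, setting request.meta["proxy"] instead of a header.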

setting

       the configuration file is switched automatically (Linux is treated as the production platform)

  • ENABLE_REDIS.

    enables redis-based distribution and redis stats collection
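One way this settings switch could be sketched; the actual mechanism in this repo may differ, and tying ENABLE_REDIS to the platform check is only an assumption for illustration.

```python
import platform

# Linux is treated as the production platform; anything else is
# considered a development machine.
IS_PRODUCTION = platform.system() == "Linux"

# Enable redis-based distribution (and redis stats) in production.
# With ENABLE_REDIS = True, CrawlSpider subclasses run distributed.
ENABLE_REDIS = IS_PRODUCTION
```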

have a nice day :)
