chw120/crawler

VERSION 1.0

Prerequisites:
    OS: Linux
    Python 2.7, pip
       on Windows:
          also install pywin32 and Visual C++ 9.0

    rendering JavaScript:
      Splash (a JavaScript rendering service), run via Docker
      scrapyjs
      start the Splash service:
        sudo docker run -p 8050:8050 scrapinghub/splash
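
Once the Splash container is listening on port 8050, a page can be rendered by calling Splash's HTTP API. The sketch below only builds the request URL for the render.html endpoint and does not contact the service. The helper name splash_render_url, the localhost base address, and the 0.5 second wait are assumptions for illustration, not part of this project.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7


def splash_render_url(page_url, splash_base='http://localhost:8050', wait=0.5):
    """Build the Splash render.html URL for page_url (hypothetical helper).

    splash_base is assumed to match the port published by `docker run -p 8050:8050`.
    wait is the number of seconds Splash waits after page load before returning HTML.
    """
    query = urlencode([('url', page_url), ('wait', wait)])
    return '%s/render.html?%s' % (splash_base, query)


if __name__ == '__main__':
    # Fetching this URL (browser, curl, or an HTTP client) returns the rendered HTML.
    print(splash_render_url('http://example.com'))
```

Opening the printed URL in a browser is a quick way to check that the service is reachable before wiring it into a spider.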

install scrapy:
    pip install scrapy

Start a Scrapy project
    in a terminal, change to the folder where the project should be created;
    scrapy startproject projectname
create spidername.py under the project's spiders/ folder

Add crawling logic in a Scrapy spider
    items.py: defines the data schema.
    pipelines.py: processes items after they are scraped.
    settings.py: holds configuration, e.g. proxy settings.
    spider: the crawling logic (normally one spider per website).
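
To illustrate what "process data after fetching" can look like, here is a minimal pipeline sketch for pipelines.py. Scrapy calls process_item(item, spider) on each enabled pipeline for every scraped item. The class name CleanTextPipeline and the title/url fields are hypothetical, and the item is shown as a plain dict for brevity.

```python
try:
    string_types = basestring  # Python 2.7
except NameError:
    string_types = str  # Python 3


class CleanTextPipeline(object):
    """Hypothetical pipeline: collapse runs of whitespace in every text field."""

    def process_item(self, item, spider):
        # Copy the items into a list so values can be reassigned while looping.
        for key, value in list(item.items()):
            if isinstance(value, string_types):
                item[key] = ' '.join(value.split())
        # Returning the item passes it on to the next enabled pipeline.
        return item


if __name__ == '__main__':
    raw = {'title': '  Hello \n world ', 'url': 'http://example.com'}
    print(CleanTextPipeline().process_item(raw, spider=None))
```

A pipeline only runs after it is listed in ITEM_PIPELINES in settings.py, for example {'projectname.pipelines.CleanTextPipeline': 300}, where the module path is assumed from the project name used above.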

Run the spider
    cd projectname
    scrapy crawl name    (name is the value of the spider's name attribute)


Tips:
When setting SPLASH_URL in settings.py, run "docker-machine ip" to get the IP address if Splash runs under Docker Machine.
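
As a sketch of the tip above, the fragment below shows how the address might be placed in settings.py. The IP 192.168.99.100 is a placeholder to be replaced with the actual output of docker-machine ip. The middleware and dupefilter entries follow the scrapyjs README as best recalled, so treat them as assumptions and check them against the installed scrapyjs version.

```python
# settings.py (fragment)

# Placeholder address: substitute the output of `docker-machine ip`.
_splash_host = '192.168.99.100'

# Port 8050 matches the port published by the docker run command.
SPLASH_URL = 'http://%s:8050' % _splash_host

# Assumed from the scrapyjs README; verify for your scrapyjs version.
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
```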
