This repo contains the Scrapy spiders demoed in Radius Intelligence's workshop "Data Collection with Scrapy: Build & Manage Production Web Scraping Pipelines".
#### Presentation materials available here
The spiders collect data about wine products from www.wine.com and are broken out into levels, each building new concepts on top of the previous one.
- **L0** (`wine_example/spiders/L0_barespider.py`)
  - Set up a basic spider that fetches a page from a wine.com URL.
- **L1** (`wine_example/spiders/L1_wine.py`)
  - Create a spider that returns an item type named `Wine` containing three fields: 1) the product page link, 2) the product name, and 3) the current sell price. Only do this for the first page of 25 wine products at www.wine.com/v6/wineshop.
- **L2** (`wine_example/spiders/L2_wine_meta.py`)
  - Add two more fields to the `Wine` item: 1) wine type and 2) region.
- **L3** (`wine_example/spiders/L3_wine_pagination.py`)
  - Teach your spider to crawl through all product pages to gather all 5,000+ products.
- **wine_login.py** (`wine_example/spiders/wine_login.py`)
  - Create a spider that can authenticate through a login form.
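The usual Scrapy pattern for form logins is `FormRequest.from_response`, which reads the form off the login page (including any hidden CSRF fields) and submits it. The login URL and field names below are hypothetical placeholders, not wine.com's actual form:

```python
import scrapy


class LoginSpider(scrapy.Spider):
    name = "wine_login"
    # Hypothetical login URL -- replace with the site's real login page.
    start_urls = ["https://www.wine.com/login"]

    def parse(self, response):
        # Fill in the form found on the page; hidden inputs (e.g. CSRF
        # tokens) are carried over automatically. Field names are placeholders.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"email": "you@example.com", "password": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Session cookies from the login response are kept by Scrapy's
        # cookie middleware, so every request from here on is authenticated.
        yield response.follow("/v6/wineshop", callback=self.parse_shop)

    def parse_shop(self, response):
        pass  # scrape as usual, now behind the login wall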
##### Take-Home Challenge
- **L4** (`wine_example/spiders/L4_wine_reviews.py`)
  - Complete this part on your own. Teach your spider to crawl one more page level deep and scrape all ratings and reviews for each product. Good luck and have fun!
- For those who do not have pip installed:

  ```
  curl -O https://bootstrap.pypa.io/get-pip.py
  sudo python get-pip.py            # writes to system Python
  ```

- Install & activate virtualenv:

  ```
  sudo pip install virtualenv       # writes to system Python
  virtualenv scrapy_learn           # isolated from system Python
  source scrapy_learn/bin/activate
  ```

- Install Scrapy & dependencies:

  ```
  pip install wheel
  pip install scrapy
  ```
- You will also need Chrome
- Scrapy Documentation
- CSS Selectors
- XPath
- Regex
- Beautiful Soup