# Exercise - Static web scraping 3

Author: Jun Sun (jun.sun@gesis.org)

## Task 1

1. Install the selectorlib add-on in your browser.
2. Install the selectorlib package in your Colab notebook.

## Solution

In [1]:
# install selectorlib package in colab notebook
!pip install selectorlib

Collecting selectorlib
  Downloading selectorlib-0.16.0-py2.py3-none-any.whl (5.8 kB)
Collecting parsel>=1.5.1 (from selectorlib)
  Downloading parsel-1.8.1-py2.py3-none-any.whl (17 kB)
Collecting cssselect>=0.9 (from parsel>=1.5.1->selectorlib)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Collecting jmespath (from parsel>=1.5.1->selectorlib)
  Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)
Collecting w3lib>=1.19.0 (from parsel>=1.5.1->selectorlib)
  Downloading w3lib-2.1.2-py3-none-any.whl (21 kB)
Installing collected packages: w3lib, jmespath, cssselect, parsel, selectorlib
Successfully installed cssselect-1.2.0 jmespath-1.0.1 parsel-1.8.1 selectorlib-0.16.0 w3lib-2.1.2


In [2]:
# import the required packages
from selectorlib import Extractor
import requests
import re

## Task 2

Use selectorlib and regex to scrape information about all Pokémon with the tag "overgrow".

https://scrapeme.live/product-tag/overgrow/

1. Get the name and price of each Pokémon with the tag "overgrow".
2. Get the stock level of each Pokémon with the tag "overgrow".
3. Calculate the total value of the stock (i.e., the sum of price × stock level over all these Pokémon).


## Solution

In [3]:
# the tag has only two pages of results
url1 = 'https://scrapeme.live/product-tag/overgrow/'
url2 = 'https://scrapeme.live/product-tag/overgrow/page/2/'
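
The two URLs above differ only in the `/page/2/` suffix. A minimal sketch, assuming every further page follows the same `/page/N/` pattern (the helper name `tag_page_urls` is hypothetical and not part of the exercise), builds the list for any number of pages:

```python
def tag_page_urls(tag, n_pages):
    """Return the listing URLs for a product tag, one per page."""
    base = f"https://scrapeme.live/product-tag/{tag}/"
    # page 1 is the bare tag URL; later pages append page/N/
    return [base if n == 1 else f"{base}page/{n}/" for n in range(1, n_pages + 1)]


urls = tag_page_urls("overgrow", 2)
print(urls)
```

With such a list, the requests and extractions further down could be written as a loop over `urls` instead of being repeated once per page.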

In [4]:
# the YAML templates were created with the selectorlib browser add-on
yaml = """
  product_name:
      css: 'li.product h2.woocommerce-loop-product__title'
      xpath: null
      multiple: true
      type: Text
  price:
      css: 'li.product span.woocommerce-Price-amount'
      xpath: null
      multiple: true
      type: Text
  link:
      css: 'li.product a.woocommerce-LoopProduct-link'
      xpath: null
      multiple: true
      type: Link
"""


In [5]:
# note: the key is named yaml_price, but it selects the stock-level text
yaml_stock = """
    yaml_price:
        css: p.stock
        xpath: null
        type: Text
"""

In [6]:
# create extractors for the product listing and for the stock level
e = Extractor.from_yaml_string(yaml)
e_stock = Extractor.from_yaml_string(yaml_stock)

In [7]:
# make requests
r1 = requests.get(url1)
r2 = requests.get(url2)

In [8]:
# extract information using the selectorlib extractor
extr1 = e.extract(r1.text)
extr2 = e.extract(r2.text)

In [9]:
# get the names, prices and the URLs of the pokemons
name_list = extr1['product_name'] + extr2['product_name']
price_list = extr1['price'] + extr2['price']
link_list = extr1['link'] + extr2['link']

print(name_list)
print(price_list)
print(link_list)

['Bulbasaur', 'Ivysaur', 'Venusaur', 'Chikorita', 'Bayleef', 'Meganium', 'Treecko', 'Grovyle', 'Sceptile', 'Turtwig', 'Grotle', 'Torterra', 'Snivy', 'Servine', 'Serperior', 'Chespin', 'Quilladin', 'Chesnaught', 'Dartrix', 'Decidueye']
['£ 63.00', '£ 87.00', '£ 105.00', '£ 127.00', '£ 44.00', '£ 163.00', '£ 96.00', '£ 190.00', '£ 37.00', '£ 101.00', '£ 154.00', '£ 87.00', '£ 102.00', '£ 154.00', '£ 88.00', '£ 185.00', '£ 161.00', '£ 96.00', '£ 169.00', '£ 106.00']
['https://scrapeme.live/shop/Bulbasaur/', 'https://scrapeme.live/shop/Ivysaur/', 'https://scrapeme.live/shop/Venusaur/', 'https://scrapeme.live/shop/Chikorita/', 'https://scrapeme.live/shop/Bayleef/', 'https://scrapeme.live/shop/Meganium/', 'https://scrapeme.live/shop/Treecko/', 'https://scrapeme.live/shop/Grovyle/', 'https://scrapeme.live/shop/Sceptile/', 'https://scrapeme.live/shop/Turtwig/', 'https://scrapeme.live/shop/Grotle/', 'https://scrapeme.live/shop/Torterra/', 'https://scrapeme.live/shop/Snivy/', 'https://scrapeme.l
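
The three lists are built independently, so they only line up if each selector matches exactly once per product. That may not hold everywhere: a product shown with both a regular and a sale price could, for instance, yield two price elements. A minimal sketch of a length check before combining the lists (the helper `combine_columns` and the shortened sample data are illustrative assumptions):

```python
def combine_columns(names, prices, links):
    """Zip the three parallel lists into one record per product."""
    lengths = {len(names), len(prices), len(links)}
    if len(lengths) != 1:
        # the selectors matched a different number of elements
        raise ValueError(
            f"lists are misaligned: {len(names)} names, "
            f"{len(prices)} prices, {len(links)} links"
        )
    return [
        {"name": name, "price": price, "link": link}
        for name, price, link in zip(names, prices, links)
    ]


# shortened sample taken from the output above
records = combine_columns(
    ["Bulbasaur", "Ivysaur"],
    ["£ 63.00", "£ 87.00"],
    ["https://scrapeme.live/shop/Bulbasaur/", "https://scrapeme.live/shop/Ivysaur/"],
)
print(records[0])
```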

In [10]:
# a function to get the numerical price
def get_price(str_price):
    # using regex to extract the numerical price
    regex_price = re.compile(r".\s(\d+\.\d+)")
    price = float(regex_price.findall(str_price)[0])

    return price
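
The pattern in `get_price` expects one arbitrary character (here the currency symbol), one whitespace character, and a plain decimal number, so it would find nothing in a string such as `'£ 1,050.00'`. A more tolerant variant might look like the sketch below; `parse_price` is a hypothetical helper, and the comma as thousands separator is an assumption rather than something observed on the site.

```python
import re

# digits, optional thousands separators, optional decimal part
PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Return the first number in text as a float, or None if there is none."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # drop thousands separators before converting
    return float(match.group().replace(",", ""))


print(parse_price("£ 63.00"))
print(parse_price("£ 1,050.00"))
print(parse_price("n/a"))
```

Returning `None` instead of raising lets the caller decide how to treat entries without a readable price.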

In [12]:
prices = [ get_price(str_p) for str_p in price_list ]
prices

[63.0,
 87.0,
 105.0,
 127.0,
 44.0,
 163.0,
 96.0,
 190.0,
 37.0,
 101.0,
 154.0,
 87.0,
 102.0,
 154.0,
 88.0,
 185.0,
 161.0,
 96.0,
 169.0,
 106.0]

In [13]:
# a function to get the stock level from the URL of a product page
def get_stock_level_from_link(url):
    # make a request and extract the stock-level text
    r = requests.get(url)
    extr = e_stock.extract(r.text)
    # the value may be None if the page has no p.stock element
    str_stock = extr['yaml_price'] or ""

    # use regex to extract the stock level, e.g. "45 in stock" -> 45
    regex_stock = re.compile(r"(\d+) in stock")
    stock_level = regex_stock.findall(str_stock)
    # no match (e.g., the product is out of stock): report 0
    if len(stock_level) == 0:
        return 0
    else:
        return int(stock_level[0])
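
The regex step can be checked without any network access by applying it to sample strings. In the sketch below, `parse_stock_text` is a hypothetical helper and the sample strings are assumed examples of what a product page might show; treating missing text as 0 is likewise an assumption.

```python
import re

STOCK_RE = re.compile(r"(\d+) in stock")


def parse_stock_text(text):
    """Return the stock level found in text, or 0 if none is stated."""
    if not text:
        # no stock text was extracted at all
        return 0
    match = STOCK_RE.search(text)
    return int(match.group(1)) if match else 0


for sample in ["45 in stock", "Out of stock", None]:
    print(repr(sample), "->", parse_stock_text(sample))
```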

In [14]:
# get the stock level of every product (one request per link)
stock_levels = [ get_stock_level_from_link(link) for link in link_list ]

In [16]:
stock_levels, prices

([45,
  142,
  30,
  98,
  299,
  71,
  93,
  195,
  159,
  254,
  187,
  146,
  47,
  82,
  164,
  267,
  61,
  145,
  169,
  268],
 [63.0,
  87.0,
  105.0,
  127.0,
  44.0,
  163.0,
  96.0,
  190.0,
  37.0,
  101.0,
  154.0,
  87.0,
  102.0,
  154.0,
  88.0,
  185.0,
  161.0,
  96.0,
  169.0,
  106.0])

In [17]:
# the total value is just the sum of the products of corresponding values in the two lists
sum([ s * p for s, p in zip(stock_levels, prices) ])

336488.0

In [18]:
# alternatively: using numpy
import numpy as np

np.dot(stock_levels, prices)

336488.0
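
To see which products contribute most to the total, the lists can be combined into per-product values. A sketch using the names, stock levels, and prices printed above (the variable names `names`, `stocks`, and `values` are illustrative):

```python
names = ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Chikorita', 'Bayleef',
         'Meganium', 'Treecko', 'Grovyle', 'Sceptile', 'Turtwig',
         'Grotle', 'Torterra', 'Snivy', 'Servine', 'Serperior',
         'Chespin', 'Quilladin', 'Chesnaught', 'Dartrix', 'Decidueye']
stocks = [45, 142, 30, 98, 299, 71, 93, 195, 159, 254,
          187, 146, 47, 82, 164, 267, 61, 145, 169, 268]
prices = [63.0, 87.0, 105.0, 127.0, 44.0, 163.0, 96.0, 190.0, 37.0, 101.0,
          154.0, 87.0, 102.0, 154.0, 88.0, 185.0, 161.0, 96.0, 169.0, 106.0]

# value of each product's stock: stock level times unit price
values = {name: s * p for name, s, p in zip(names, stocks, prices)}

total = sum(values.values())
top_name = max(values, key=values.get)

print(total)
print(top_name, values[top_name])
```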

## Bonus task

Work through the interactive regex exercises at https://regexone.com/.

Get as far as you can.
