# Lecture - Static web scraping 2

Big thanks to [Fabian Floeck](https://f-squared.org/) for the majority of the content in this notebook.

This notebook also borrows from the Lab for Data Mining with Pandas on Wikipedia data by [Brian Keegan](https://www.brianckeegan.com), [Department of Information Science, CU Boulder](https://www.colorado.edu/cmci/academics/information-science), as well as the [PyCon 2015 Pandas tutorial](https://github.com/brandon-rhodes/pycon-pandas-tutorial) by Brandon Rhodes and the [dataquest blog](https://www.dataquest.io/blog/web-scraping-tutorial-python/).

This notebook is copyrighted and made available under the [Apache License v2.0](https://creativecommons.org/licenses/by-sa/4.0/).

Maintained / presented by: Jun Sun (jun.sun@gesis.org)

## Learning goals
* Use `requests` to retrieve a webpage
* Use `beautifulsoup` to parse a webpage
* Navigate through a website

## Import modules and set up environment

In [1]:
# Query and download web resources. Alternatives: urllib2 (Python 2 only), urllib3
import requests

# Speaking of which, we can manipulate URLs easily with urllib.parse
import urllib.parse

# If you want HTML to make sense, you need soup
from bs4 import BeautifulSoup

# Avoids scroll-in-the-scroll in the entire notebook (only relevant on Google Colab)
from IPython.display import Javascript, HTML, display
if 'google.colab' in str(get_ipython()):
    def resize_colab_cell():
        display(Javascript('google.colab.output.setIframeHeight(0, true, {maxHeight: 600})'))
    get_ipython().events.register('pre_run_cell', resize_colab_cell)

## Starting example

Example page: http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html

In [2]:
content = """<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>"""

In [3]:
# display as HTML
HTML(content)

We can also get the same content by fetching the page with `requests`:

In [4]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
content = page.content
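
The call above assumes the request succeeds. A minimal sketch of a more defensive fetch follows; the helper name `fetch_soup` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and return it parsed as a BeautifulSoup object.

    `fetch_soup` and the default timeout are hypothetical choices
    made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Calling `fetch_soup("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")` should then return the same kind of object as the `soup` built in the next cell, provided the page is reachable.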

In [5]:
# load it into beautifulsoup
soup = BeautifulSoup(content, 'html.parser')

In [6]:
# print the soup nicely
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>



In [7]:
list(soup.html.body.div.children)

['\n',
 <p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 '\n',
 <p class="inner-text">
                 Second paragraph.
             </p>,
 '\n']
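
The `'\n'` entries are whitespace text nodes that sit between the tags. Here is a small sketch of two ways to keep only the element children, using a shortened inline HTML string as a stand-in for the page above:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

demo_html = """<div>
<p id="first">First paragraph.</p>
<p>Second paragraph.</p>
</div>"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# .children yields text nodes (the newlines) as well as tags
all_children = list(demo_soup.div.children)

# Option 1: keep only Tag objects
tag_children = [child for child in demo_soup.div.children if isinstance(child, Tag)]

# Option 2: find_all(True, recursive=False) returns only the direct child tags
direct_tags = demo_soup.div.find_all(True, recursive=False)

print(len(all_children), len(tag_children), len(direct_tags))  # 5 2 2
```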

In [8]:
# find all <p> </p> tags
soup.find_all('p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [9]:
# number of <p> </p> tags
len(soup.find_all('p'))

4

In [10]:
# access the first one
soup.find_all('p')[0]

<p class="inner-text first-item" id="first">
                First paragraph.
            </p>

In [11]:
# equivalently
soup.find('p')

<p class="inner-text first-item" id="first">
                First paragraph.
            </p>

In [12]:
# equivalently
soup.p

<p class="inner-text first-item" id="first">
                First paragraph.
            </p>

In [13]:
# find all <p> </p> tags with class 'outer-text'
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [14]:
# find all elements with id 'first'
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

* `find`: Returns only the first descendant of this tag that matches the given criteria. The first argument is a tag name.
* `select`: Performs a CSS selection on the current element and returns all matches. The first argument is a CSS selector.
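
To make the difference concrete, here is a small sketch on an inline HTML snippet (made up for illustration) that compares `find`, `find_all`, `select_one`, and `select`:

```python
from bs4 import BeautifulSoup

snippet = """
<div><p class="inner-text" id="a">one</p></div>
<p class="outer-text" id="b">two</p>
"""
demo = BeautifulSoup(snippet, "html.parser")

# find returns a single Tag, or None if nothing matches
first_p = demo.find("p")
missing = demo.find("table")

# find_all returns a list-like collection of every match
all_p = demo.find_all("p")

# select takes a CSS selector and returns a list;
# select_one returns only the first match (or None)
outer = demo.select("p.outer-text")
first_outer = demo.select_one("p.outer-text")

print(first_p["id"], missing, len(all_p), outer[0]["id"], first_outer["id"])
# a None 2 b b
```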

Notable CSS selectors:
* `ancestor descendant`
* `parent > child`
* `element.class`
* `element#id`

Examples:

In [15]:
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]

In [16]:
soup.select("p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [17]:
soup.select("p.first-item")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>]

In [18]:
soup.select("div p.first-item#second")

[]

In [19]:
soup.select("p.first-item")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>]

In [20]:
soup.select("#second")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>]

In [21]:
soup.select("p#second")


[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>]

In [22]:
soup.select("p#second")[0].get_text().strip()

'First outer paragraph.'
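
Besides the text, a tag's attributes can be read as well. This is a minimal sketch on a one-line HTML string modelled on the second outer paragraph above:

```python
from bs4 import BeautifulSoup

html = '<p class="outer-text first-item" id="second"><b>  First outer paragraph.  </b></p>'
tag = BeautifulSoup(html, "html.parser").p

# get_text(strip=True) strips surrounding whitespace from the text
text = tag.get_text(strip=True)

# Attributes are read like dictionary entries; class can hold several values, so it is a list
tag_id = tag["id"]
classes = tag["class"]

# .get returns None for a missing attribute instead of raising KeyError
href = tag.get("href")

print(text, tag_id, classes, href)
# First outer paragraph. second ['outer-text', 'first-item'] None
```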

## Navigating through a website

### Parse a page
Extract the title and price of a book from a mock e-commerce website:

https://books.toscrape.com/catalogue/adulthood-is-a-myth-a-sarahs-scribbles-collection_659/index.html


In [23]:
# we know the drill now: get the page for the book
book_url = 'https://books.toscrape.com/catalogue/adulthood-is-a-myth-a-sarahs-scribbles-collection_659/index.html'
book = BeautifulSoup(requests.get(book_url).content, 'html.parser')

In [24]:
# get the title
title = book.h1.text

In [25]:
# there are multiple elements with the same class
book.select('.price_color')

[<p class="price_color">£10.90</p>,
 <p class="price_color">£48.80</p>,
 <p class="price_color">£32.01</p>,
 <p class="price_color">£57.62</p>,
 <p class="price_color">£51.51</p>,
 <p class="price_color">£38.39</p>,
 <p class="price_color">£25.48</p>]

In [26]:
# select the parent first
price = book.select('.product_main > p.price_color')[0].text

In [27]:
# display the title and the price
print(title)
print(price)

Adulthood Is a Myth: A "Sarah's Scribbles" Collection
£10.90
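
The reason the child combinator helps can be seen on a simplified example. The markup below is made up and only mimics the structure described above; it is not the real page.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: one main product and one recommended product
mock_html = """
<div class="product_main">
  <h1>Main book</h1>
  <p class="price_color">£10.90</p>
</div>
<article class="product_pod">
  <p class="price_color">£48.80</p>
</article>
"""
mock_page = BeautifulSoup(mock_html, "html.parser")

# The class alone matches every price on the page
all_prices = [p.text for p in mock_page.select(".price_color")]

# The child combinator keeps only prices directly inside .product_main
main_price = mock_page.select_one(".product_main > p.price_color").text

print(all_prices, main_price)
# ['£10.90', '£48.80'] £10.90
```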


Wrap it in a function:

In [28]:
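def parse_book(book):
    # Fall back to empty strings when an element is missing
    title = ''
    price = ''

    try:
        title = book.h1.text
    except AttributeError:  # the page has no <h1>
        pass

    try:
        price = book.select('.product_main > p.price_color')[0].text
    except IndexError:  # the selector matched nothing
        pass

    return dict(title=title, price=price)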
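
The price comes back as a string such as `£10.90`. If numbers are needed later, a sketch along these lines could convert it. It assumes every price is a currency symbol followed by a decimal number, and the helper name `price_to_decimal` is hypothetical.

```python
import re
from decimal import Decimal


def price_to_decimal(price_text):
    """Return the numeric part of a price string as a Decimal, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    return Decimal(match.group()) if match else None


print(price_to_decimal("£10.90"))  # 10.90
print(price_to_decimal(""))        # None (e.g. when parse_book found no price)
```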

### Parse multiple pages

Extract all the book URLs from the front page of this listing: https://books.toscrape.com/index.html

In [29]:
# we start from here
BASE_URL = 'https://books.toscrape.com/'
current_page_url = BASE_URL
current_soup = BeautifulSoup(requests.get(current_page_url).content, 'html.parser')

In [30]:
# get the relative URLs of all books on the page and show the first one
book_urls = [a['href'] for a in current_soup.select('.product_pod .image_container a[href]')]
book_urls[0]

'catalogue/a-light-in-the-attic_1000/index.html'

In [31]:
# resolve the relative URL against the current page URL
urllib.parse.urljoin(current_page_url, book_urls[0])

'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
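
`urljoin` does more than string concatenation: it resolves the relative URL against the base, as a browser would. The short sketch below shows a few cases; the relative paths other than `catalogue/page-2.html` are made-up examples.

```python
from urllib.parse import urljoin

# Base is the site root: the relative path is appended to it
from_root = urljoin("https://books.toscrape.com/", "catalogue/page-2.html")

# Base is a page inside catalogue/: the last path segment (page-2.html) is replaced
from_page = urljoin("https://books.toscrape.com/catalogue/page-2.html", "page-3.html")

# "../" climbs one directory relative to the base
parent = urljoin("https://books.toscrape.com/catalogue/page-2.html", "../index.html")

# An absolute URL ignores the base entirely
absolute = urljoin("https://books.toscrape.com/", "https://example.com/a.html")

print(from_root)  # https://books.toscrape.com/catalogue/page-2.html
print(from_page)  # https://books.toscrape.com/catalogue/page-3.html
print(parent)     # https://books.toscrape.com/index.html
print(absolute)   # https://example.com/a.html
```

This is why the base should always be the URL of the page the link was found on, not a fixed site root.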

In [32]:
# resolve all relative URLs on the page
[ urllib.parse.urljoin(current_page_url, relative_url) for relative_url in book_urls ]

['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'https://books.toscrape.com/catalogue/soumission_998/index.html',
 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'https://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'https://books.toscrape.com/catalogue/the-black-maria_991/index.html',
 'https://books.toscrape.com/catalogue/starving-hearts-triangular-trade-tr

Wrap it in a function:

In [33]:
def extract_book_links(current_soup, current_page_url):
    # Collect the relative links, then resolve each against the current page URL
    absolute_book_urls = []
    try:
        book_urls = [a['href'] for a in current_soup.select('.product_pod .image_container a[href]')]
        absolute_book_urls = [urllib.parse.urljoin(current_page_url, relative_url) for relative_url in book_urls]
    except (AttributeError, ValueError):  # e.g. current_soup is None or a URL is malformed
        pass
    return absolute_book_urls

In [34]:
# try it out
extract_book_links(current_soup, current_page_url)

['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'https://books.toscrape.com/catalogue/soumission_998/index.html',
 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'https://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'https://books.toscrape.com/catalogue/the-black-maria_991/index.html',
 'https://books.toscrape.com/catalogue/starving-hearts-triangular-trade-tr

### Parse the whole website

In [35]:
# find the link for the next page
relative_next_page_url = current_soup.select(".next a[href]")[0]['href']
relative_next_page_url

'catalogue/page-2.html'

In [36]:
# resolve the relative URL against the current page URL
next_page_url = urllib.parse.urljoin(current_page_url, relative_next_page_url)
next_page_url

'https://books.toscrape.com/catalogue/page-2.html'
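
Putting the pieces together, one possible way to walk the whole listing is to keep following the "next" link until a page has none. The sketch below is an illustration under stated assumptions rather than the notebook's own solution: the function name `crawl_listing`, the `fetch_html` parameter, and the `max_pages` cap are hypothetical, and two tiny made-up pages stand in for the real site so the code runs offline.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def crawl_listing(start_url, fetch_html, max_pages=100):
    """Follow '.next a[href]' links from start_url and collect absolute book URLs.

    fetch_html is any callable that maps a URL to HTML (hypothetical parameter);
    max_pages is a safety cap chosen for illustration.
    """
    book_urls = []
    seen_pages = set()
    page_url = start_url
    while page_url and page_url not in seen_pages and len(seen_pages) < max_pages:
        seen_pages.add(page_url)
        soup = BeautifulSoup(fetch_html(page_url), "html.parser")
        # Same selector as in extract_book_links above
        for a in soup.select(".product_pod .image_container a[href]"):
            book_urls.append(urljoin(page_url, a["href"]))
        # Stop when the page has no "next" link
        next_link = soup.select_one(".next a[href]")
        page_url = urljoin(page_url, next_link["href"]) if next_link else None
    return book_urls


# Two made-up pages that mimic the listing structure
fake_site = {
    "https://example.test/index.html":
        '<div class="product_pod"><div class="image_container">'
        '<a href="catalogue/book-a/index.html"></a></div></div>'
        '<li class="next"><a href="catalogue/page-2.html">next</a></li>',
    "https://example.test/catalogue/page-2.html":
        '<div class="product_pod"><div class="image_container">'
        '<a href="book-b/index.html"></a></div></div>',
}

links = crawl_listing("https://example.test/index.html", fake_site.__getitem__)
print(links)
```

For the real site, `fetch_html` could be something like `lambda url: requests.get(url, timeout=10).content`, ideally with a short pause between requests to avoid overloading the server.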