# Exercise - Static web scraping 2

Author: Jun Sun (jun.sun@gesis.org)

## Import modules and set up environment

In [1]:
# Query and download web resources. Alternatives: urllib.request (standard library), urllib3
import requests

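# Speaking of which, urllib.parse lets us manipulate URLs easily
# (import the submodule explicitly; a bare "import urllib" does not guarantee it is loaded)
import urllib.parse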

# If you want HTML to make sense, you need soup
from bs4 import BeautifulSoup

# Avoids nested scrolling in output cells across the entire notebook (Google Colab only)
from IPython.display import Javascript, display
if 'google.colab' in str(get_ipython()):
    def resize_colab_cell():
        display(Javascript('google.colab.output.setIframeHeight(0, true, {maxHeight: 300})'))
    get_ipython().events.register('pre_run_cell', resize_colab_cell)

## Task 1
http://quotes.toscrape.com is a website that lists quotes from famous people.

1. Load this webpage into Beautiful Soup and inspect it using `.prettify()`: http://quotes.toscrape.com/tag/classic/
2. Retrieve only the HTML elements that contain the authors.
3. Retrieve all quotes with the tag "friendship", and print them together with their authors.
4. How would you retrieve all quotes that carry both of two given tags, X and Y? Think about the concept or plan; code is not needed.


### Solution

##### first

In [2]:
# request the page and parse the response with BeautifulSoup
page = requests.get("http://quotes.toscrape.com/tag/classic/")
soup2 = BeautifulSoup(page.content, 'html.parser')

In [3]:
# print the content nicely
print(soup2.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <h3>
    Viewing tag:
    <a href="/tag/classic/page/1/">
     classic
    </a>
   </h3>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
      </span>
      <span>
       by
       <small class="author" itemprop

##### second

In [4]:
# find the elements with class "author"
soup2.find_all(class_="author")

[<small class="author" itemprop="author">Jane Austen</small>,
 <small class="author" itemprop="author">Mark Twain</small>]

In [5]:
# alternatively
soup2.select(".author")

[<small class="author" itemprop="author">Jane Austen</small>,
 <small class="author" itemprop="author">Mark Twain</small>]
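
Both calls return `Tag` objects rather than plain strings. As a minimal sketch, the following shows how the author names could be pulled out as text. It runs on a small inline HTML fragment modeled on the output above (the quote texts are placeholders, not the site's content), so no network access is needed.

```python
from bs4 import BeautifulSoup

# Inline fragment modeled on the output above; quote texts are placeholders.
html = """
<div class="quote">
  <span class="text">“Quote A”</span>
  <span>by <small class="author" itemprop="author">Jane Austen</small></span>
</div>
<div class="quote">
  <span class="text">“Quote B”</span>
  <span>by <small class="author" itemprop="author">Mark Twain</small></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all(class_=...) and select(".author") match the same elements
by_find_all = soup.find_all(class_="author")
by_select = soup.select(".author")

# get_text(strip=True) drops the markup and any surrounding whitespace
author_names = [tag.get_text(strip=True) for tag in by_select]
print(author_names)
```

`select` takes a CSS selector, which tends to be more convenient once the query involves nesting (for example `.quote .author`), while `find_all` filters by tag name and attributes.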

##### third

In [6]:
# request the page and parse the response with BeautifulSoup
page_friendship = requests.get("http://quotes.toscrape.com/tag/friendship/")
soup_friendship = BeautifulSoup(page_friendship.content, 'html.parser')

In [7]:
for quote in soup_friendship.select(".quote"):
  print(quote.select_one(".text").text, '-', quote.select_one(".author").text)

“It is not a lack of love, but a lack of friendship that makes unhappy marriages.” - Friedrich Nietzsche
“Good friends, good books, and a sleepy conscience: this is the ideal life.” - Mark Twain
“The truth is, everyone is going to hurt you. You just got to find the ones worth suffering for.” - Bob Marley
“There is nothing I would not do for those who are really my friends. I have no notion of loving people by halves, it is not my nature.” - Jane Austen
“If I had a flower for every time I thought of you...I could walk through my garden forever.” - Alfred Tennyson


##### fourth
- Retrieve the first page of quotes with tag X. The URL of the first page should be "http://quotes.toscrape.com/tag/X/".
- If there are multiple pages, retrieve them as well. The URL of the nth page should be "http://quotes.toscrape.com/tag/X/page/n/".
- Examine the tags of each retrieved quote and keep only the quotes that also carry tag Y.
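
Although code is not required here, the plan above could be sketched roughly as follows. This is one possible approach, not the reference solution. It assumes that each quote lists its tags as `<a class="tag">` elements and that the pager marks the next page with a `.next a` link; neither is visible in the truncated output above, so check both against the real page. The function names `quotes_with_tag` and `quotes_with_both_tags` are made up for illustration.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def quotes_with_tag(soup, tag):
    """Return (text, author) pairs for quotes on one page that carry `tag`."""
    matches = []
    for quote in soup.select(".quote"):
        # assumption: tags are rendered as <a class="tag"> inside each quote
        tags = {a.get_text(strip=True) for a in quote.select(".tag")}
        if tag in tags:
            matches.append((quote.select_one(".text").get_text(strip=True),
                            quote.select_one(".author").get_text(strip=True)))
    return matches


def quotes_with_both_tags(tag_x, tag_y):
    """Walk every page listed for tag_x and keep quotes that also carry tag_y."""
    url = f"http://quotes.toscrape.com/tag/{tag_x}/"
    results = []
    while url:
        soup = BeautifulSoup(requests.get(url, timeout=10).content, "html.parser")
        results.extend(quotes_with_tag(soup, tag_y))
        # assumption: the pager exposes the next page as a link inside ".next"
        next_link = soup.select_one(".next a")
        url = urljoin(url, next_link["href"]) if next_link else None
    return results


# Offline check of the filtering step on a hypothetical fragment
sample = BeautifulSoup("""
<div class="quote"><span class="text">“A”</span><small class="author">P</small>
  <div class="tags"><a class="tag">love</a><a class="tag">friendship</a></div></div>
<div class="quote"><span class="text">“B”</span><small class="author">Q</small>
  <div class="tags"><a class="tag">love</a></div></div>
""", "html.parser")
print(quotes_with_tag(sample, "friendship"))
```

Calling `quotes_with_both_tags("love", "friendship")` would then fetch the live pages. Only the offline filtering step is exercised here.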

## Task 2

Starting from the URL https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html, navigate through the pages listing books, and print the titles of the parsed books.

### Solution
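
The loop below relies on `urllib.parse.urljoin` to turn the relative link of the "next" button into an absolute URL. As a small sketch of how that resolution behaves (the relative link `page-2.html` is a hypothetical example, not taken from the page):

```python
from urllib.parse import urljoin

listing_url = "https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html"

# hypothetical relative href, as a pager link might provide it
relative_next = "page-2.html"

# the last path segment of the base (index.html) is replaced by the relative part
next_url = urljoin(listing_url, relative_next)
print(next_url)

# a base ending in "/" resolves to the same URL
with_slash = urljoin(
    "https://books.toscrape.com/catalogue/category/books/nonfiction_13/", relative_next)

# without the trailing slash, "nonfiction_13" itself counts as the last
# segment and is replaced, which is usually not what you want
without_slash = urljoin(
    "https://books.toscrape.com/catalogue/category/books/nonfiction_13", relative_next)
print(without_slash)
```

This is why the base URL in the solution ends with a slash.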

In [8]:
base_url = 'https://books.toscrape.com/catalogue/category/books/nonfiction_13/'

# the page we are currently on; start at the first listing page
current_page_url = base_url

done = False
while not done:
    # print where you are now
    print('* fetching', current_page_url)

    # fetch the page and load it into beautifulsoup
    current_page_soup = BeautifulSoup(requests.get(current_page_url).content,
                                      'html.parser')

    # find where the book title is and print it out
    for book in current_page_soup.select(".product_pod h3 a"):
        print(book['title'])

    # locate the link to the next page, if there is one
    # (".next a[href]" is the valid CSS form; quoting the attribute name is a syntax error)
    next_link = current_page_soup.select_one(".next a[href]")
    if next_link is None:
        # if there is no next page, we are done!
        done = True
    else:
        # turn the relative link into an absolute URL and continue from there
        current_page_url = urllib.parse.urljoin(base_url, next_link['href'])

* fetching https://books.toscrape.com/catalogue/category/books/nonfiction_13/
Worlds Elsewhere: Journeys Around Shakespeare’s Globe
The Five Love Languages: How to Express Heartfelt Commitment to Your Mate
Reasons to Stay Alive
#HigherSelfie: Wake Up Your Life. Free Your Soul. Find Your Tribe.
Unseen City: The Majesty of Pigeons, the Discreet Charm of Snails & Other Wonders of the Urban Wilderness
Throwing Rocks at the Google Bus: How Growth Became the Enemy of Prosperity
The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing
The Gutsy Girl: Escapades for Your Life of Epic Adventure
The Electric Pencil: Drawings from Inside State Hospital No. 3
Spark Joy: An Illustrated Master Class on the Art of Organizing and Tidying Up
Reskilling America: Learning to Labor in the Twenty-First Century
In the Country We Love: My Family Divided
Everydata: The Misinformation Hidden in the Little Data You Consume Every Day
Call the Nurse: True Stories of a Country Nurse on