# Scraping a web page

Using the Python BeautifulSoup library, we will scrape the web page quotes.toscrape.com.

In [1]:
# first, install BeautifulSoup and requests if they are not already installed
!pip install beautifulsoup4 requests



## Crash course in BeautifulSoup

In [1]:
# pretty printer: a utility library, not essential for this lab
from pprint import PrettyPrinter

# bind the pprint method directly, so that pp(obj) pretty-prints obj
pp = PrettyPrinter(indent=4).pprint

In [2]:
from bs4 import BeautifulSoup
import requests

# demo HTML content on which we will run a few queries
demo_content = """<div class="quote">

        <span class="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
        <span>by <small class="author">Marilyn Monroe</small>
        <a href="/author/Marilyn-Monroe">(about)</a>
        </span>
    </div>"""

You can use BeautifulSoup to query HTML elements in a web page, similar to how you would query the DOM in JavaScript.

In [4]:
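# initialize a BeautifulSoup object with the content you want to scrape;
# naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(demo_content, 'html.parser')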

In [5]:
# find all 'span' elements in our content
spans = soup.find_all('span')
print("all the '<span>' elements extracted from the text:")
pp(spans)

all the '<span>' elements extracted from the text:
[   <span class="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
    <span>by <small class="author">Marilyn Monroe</small>
<a href="/author/Marilyn-Monroe">(about)</a>
</span>]


In [6]:
# find specific spans by attribute value (the class name in this case)
class_text_spans = soup.find_all('span', {'class': 'text'})
print("all the '<span>' elements with class 'text' extracted from the text:")
pp(class_text_spans)

all the '<span>' elements with class 'text' extracted from the text:
[   <span class="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>]
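The queries so far return whole tags. A scraper usually wants the text or an attribute inside a tag, so here is a minimal sketch of how that might look on the same demo markup. The variable names (`snippet`, `quote_span`, `about_href`, and so on) are illustrative and not part of the original lab; `'html.parser'` is Python's built-in parser, chosen here as an assumption so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Same markup as the demo content above; `snippet` is an illustrative name.
snippet = """<div class="quote">
        <span class="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
        <span>by <small class="author">Marilyn Monroe</small>
        <a href="/author/Marilyn-Monroe">(about)</a>
        </span>
    </div>"""

soup = BeautifulSoup(snippet, "html.parser")

# find() returns the first matching tag, or None if nothing matches
quote_span = soup.find("span", {"class": "text"})
author_tag = soup.find("small", {"class": "author"})
about_link = soup.find("a")

# get_text() returns the text inside a tag, without the markup
quote_text = quote_span.get_text()
author_name = author_tag.get_text(strip=True)

# attributes are read like dictionary keys
about_href = about_link["href"]

print(quote_text)
print(author_name)
print(about_href)
```

Because `find()` can return `None`, real scraping code should check the result before calling `get_text()` on it.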


In [9]:
# perform a search query on a result of a previous search query
fine_tuned_span_results = []
for span_result in spans:
    fine_tuned_span_results += span_result.find_all('small')
print("elements extracted from a previous query:")
pp(fine_tuned_span_results)

elements extracted from a previous query:
[<small class="author">Marilyn Monroe</small>]
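The nested query above can also be written as a single CSS selector, which is close to what `document.querySelectorAll` does in JavaScript. The sketch below assumes a BeautifulSoup version recent enough to ship `select()` with its `soupsieve` dependency (installed automatically with `beautifulsoup4`). The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Same markup as the demo content above; `snippet` is an illustrative name.
snippet = """<div class="quote">
        <span class="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
        <span>by <small class="author">Marilyn Monroe</small>
        <a href="/author/Marilyn-Monroe">(about)</a>
        </span>
    </div>"""

soup = BeautifulSoup(snippet, "html.parser")

# 'span small' matches every <small> nested anywhere inside a <span>
nested_small = soup.select("span small")

# 'span.text' matches <span> elements whose class list contains 'text'
text_spans = soup.select("span.text")

# select_one() returns only the first match, or None if there is none
first_author = soup.select_one("small.author")

print(nested_small)
print(len(text_spans))
print(first_author.get_text())
```

Whether to use `find_all` or `select` is mostly a matter of taste; selectors tend to be shorter when the query spans several levels of nesting.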


If you are interested, you can read more about BeautifulSoup in the documentation: https://beautiful-soup-4.readthedocs.io/en/latest/

## Scraping quotes.toscrape.com

Start by inspecting quotes.toscrape.com using "Inspect Element" from the context menu (right-click).

Your mission, should you choose to accept it, is to scrape the aforementioned website and extract all the quotes from the first 10 pages. Save the quotes in a Python dictionary, where each key is an author's name and each value is the list of quotes by that author.

So as an example, your final dictionary would look like this:

{

    'Jorge Luis Borges': [   '“I have always imagined that Paradise will be a '
                             'kind of library.”'],
                             
    'Khaled Hosseini':   [   '“But better to get hurt by the truth than '
                           'comforted with a lie.”'],
                           
    "Madeleine L'Engle": [   '“You have to write the book that wants to be '
                             'written. And if the book will be too difficult '
                             'for grown-ups, then you write it for children.”'],
    ...
}
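One possible way to build a dictionary of this shape is `collections.defaultdict`, which creates the empty list the first time an author is seen. The sketch below uses made-up placeholder pairs instead of scraped data; `scraped_pairs` and `quotes_by_author` are hypothetical names, not part of the lab.

```python
from collections import defaultdict

# Hypothetical (author, quote) pairs standing in for scraped results.
scraped_pairs = [
    ("Author A", "first quote"),
    ("Author B", "second quote"),
    ("Author A", "third quote"),
]

# defaultdict(list) returns a fresh empty list for any missing key
quotes_by_author = defaultdict(list)
for author, quote in scraped_pairs:
    quotes_by_author[author].append(quote)

# convert back to a plain dict for printing or saving
quotes_by_author = dict(quotes_by_author)
print(quotes_by_author)
```

A plain dictionary with an `if author not in quotes` check gives the same result; `defaultdict` only removes the need for that check.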

In [3]:
from bs4 import BeautifulSoup
import requests
import time

# note: each page is available at http://quotes.toscrape.com/page/PAGE_NUMBER/
link_to_scrape = 'http://quotes.toscrape.com/page/'
quotes = {}
for i in range(1, 11):
    full_link = "{}{}/".format(link_to_scrape, i)
    try:
        data = requests.get(full_link, timeout=10)
    except requests.exceptions.RequestException as e:
        # skip this page if the request fails; otherwise `data` would be undefined or stale
        print(e)
        continue
    if data.status_code == 200:
        # use the public `content` attribute and name the parser explicitly
        soup = BeautifulSoup(data.content, 'html.parser')
        # list all quotes on the page
        for qt in soup.find_all('div', {'class': 'quote'}):
            # extract the quote text
            txt = qt.find_all('span', {'class': 'text'})
            quote = ' '.join([t.text for t in txt])
            # extract the author name
            auth = qt.find_all('small', {'class': 'author'})
            auth = auth[0].text
            if auth not in quotes:
                quotes[auth] = []
            quotes[auth].append(quote)
    # sleep for 1 second so as not to overload the website with requests
    time.sleep(1)
pp(quotes)
        

{   'Albert Einstein': [   '“The world as we have created it is a process of '
                           'our thinking. It cannot be changed without '
                           'changing our thinking.”',
                           '“There are only two ways to live your life. One is '
                           'as though nothing is a miracle. The other is as '
                           'though everything is a miracle.”',
                           '“Try not to become a man of success. Rather become '
                           'a man of value.”',
                           "“If you can't explain it to a six year old, you "
                           "don't understand it yourself.”",
                           '“If you want your children to be intelligent, read '
                           'them fairy tales. If you want them to be more '
                           'intelligent, read them more fairy tales.”',
                           '“Logic will get you from A to Z; imagination wil