## Scraping Quotes 1

*Web scraping*, *web harvesting*, or *web data extraction* is data scraping used for extracting data from websites <a href='#[1]'>[1]</a>.

The general idea behind web scraping is to retrieve data that exists on a website and convert it into a format that is usable for analysis. Web pages are rendered by the browser from HTML and CSS code, but much of the information in the underlying HTML is not interesting to us <a href='#[2]'>[2]</a>.

The site we will use in this notebook is <a href="http://quotes.toscrape.com">http://quotes.toscrape.com</a>.  

We will use the BeautifulSoup library <a href='#[3]'>[3]</a> that will parse the source of the web page.  
This notebook will scrape the quotes of a single page.  
An advanced version (with pagination) is in [Scraping Quotes 2](Scraping Quotes 2.ipynb)  


In [None]:
# These are the needed libraries
import requests
from bs4 import BeautifulSoup

### Single page scraping

Let's start by scraping a single page of 'funny' quotes from: http://quotes.toscrape.com/tag/humor  
This page contains a list of 'funny' quotes; let's find the text, author and tags of each quote.

There are several libraries to get content from the web, but ```requests``` is one of the easiest to use.  
Define the URL and get the content of the page.  

In [None]:
url = "http://quotes.toscrape.com/tag/humor"
response = requests.get(url)
text = response.text
print(text)
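
The request above assumes the server answers promptly and successfully. As a minimal sketch of a more defensive variant, the helper below (the name `fetch_html` and the 10 second default are assumptions, not part of the original notebook) adds a timeout and turns HTTP error codes into exceptions:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML source of `url`, or raise if the request fails."""
    # The timeout keeps the call from hanging on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text
```

It could be used as `text = fetch_html("http://quotes.toscrape.com/tag/humor")`.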

This is the same content we would see if we chose 'View Page Source' in the browser.  
The page is made up of elements, consisting of 'tags' with content and attributes (key/value).  

## ```<p class="myclass">This is the content</p>``` 

- ```<p>``` and ```</p>``` are the opening and closing tags
- The content is between the tags
- *class* and *myclass* are the key/value of an attribute
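
The same three parts can be read back from a parsed element. This is a small illustration on the example element above, not on the real page; note that BeautifulSoup returns the value of `class` as a list, because an element may carry several classes.

```python
from bs4 import BeautifulSoup

html = '<p class="myclass">This is the content</p>'
element = BeautifulSoup(html, "html.parser").p

print(element.name)      # tag name, without the angle brackets
print(element.text)      # content between the opening and closing tags
print(element["class"])  # attribute value; class is returned as a list
print(element.attrs)     # all attributes as a dictionary
```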

Looking at the source and ignoring the irrelevant parts, we see the following structure:
- the whole quote is wrapped in a ```<div>``` with __class="quote"__
- the quote text is in a ```<span>``` with __class="text"__
- the author is in a ```<small>``` with __class="author"__
- each tag is in an ```<a>``` with __class="tag"__

```
<div class="quote" ...>
  <span class="text" ...>“The actual quote”</span>
  <span>by
    <small class="author" ...>Author name</small>
    <a href="...">(about)</a>
  </span>
  <div class="tags">
    Tags:
    <meta ... /> 
    <a class="tag" href="...">tag 1</a>
    <a class="tag" href="...">tag 2</a>
  </div>
</div>
```

Extracting all the info we want by hand is tedious, so we're going to delegate this to BeautifulSoup.  
Create a 'soup' object from the source so we can query the required elements.

The ```find_all``` method searches for elements with the given tag name (optionally filtered by attributes).  
The method returns a list (possibly empty) of all the matching elements.  
To get all the quotes (there should be 10), we're going to look for a ```<div>``` with __class="quote"__.

In [None]:
# Parse the source
page_soup = BeautifulSoup(text, 'html.parser')

# We don't need the angle brackets, just the tag name
quotes = page_soup.find_all('div', {'class': 'quote'})
print(len(quotes))
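
The attribute dictionary is one of several ways to express the same query. As a sketch on a hypothetical, trimmed-down sample (not the real page), the snippet below compares it with the `class_` keyword and with a CSS selector passed to `select`, and shows that a query without matches gives an empty list:

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample in the same shape as the page source
sample = """
<div class="quote"><span class="text">First</span></div>
<div class="quote"><span class="text">Second</span></div>
<div class="other">Not a quote</div>
"""
soup = BeautifulSoup(sample, "html.parser")

by_dict = soup.find_all("div", {"class": "quote"})  # form used in this notebook
by_keyword = soup.find_all("div", class_="quote")   # keyword form
by_selector = soup.select("div.quote")              # CSS selector form

print(len(by_dict), len(by_keyword), len(by_selector))

# No match gives an empty list rather than an error
missing = soup.find_all("div", {"class": "missing"})
print(missing)
```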

In [None]:
quote = quotes[0]
print(quote.prettify())

#### Find parts

Next we need to find the content, author and tags for this quote.  
Again, we'll use the ```find_all``` method with the corresponding tag names and attributes.

In [None]:
# Start with the content, find_all returns the whole element
# Note: we take the first element from the list
content_element = quote.find_all('span', {'class': "text"})[0]
print(content_element)

In [None]:
# We just want the text 
quote_text = content_element.text
print(quote_text)
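
Taking element `[0]` of the result raises an `IndexError` when nothing matches. When only the first match is needed, BeautifulSoup's `find` method is an alternative: it returns that element directly, or `None` if there is none. A sketch on a hypothetical quote block without an author (the fallback value `"unknown"` is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a quote block that has no author element
sample = '<div class="quote"><span class="text">Some quote</span></div>'
quote_block = BeautifulSoup(sample, "html.parser").div

text_element = quote_block.find("span", {"class": "text"})
author_element = quote_block.find("small", {"class": "author"})

print(text_element.text)
print(author_element)  # None, because nothing matched

# Guard against the missing element before reading its text
author = author_element.text if author_element is not None else "unknown"
print(author)
```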

In [None]:
# Let's do the same for the author
content_element = quote.find_all('small', {'class': "author"})[0]
quote_author = content_element.text
print(quote_author)

In [None]:
# And finally the tags, find_all will find more than one
content_elements = quote.find_all('a', {'class': "tag"})
for content_element in content_elements:
    tag = content_element.text
    print(tag)

#### Define function

In [None]:
def parse_quote(quote):
    # Extract the quote text 
    content_element = quote.find_all('span', {'class': "text"})[0]
    quote_text = content_element.text
    
    # Extract the author
    content_element = quote.find_all('small', {'class': "author"})[0]
    quote_author = content_element.text

    # Extract tags
    content_elements = quote.find_all('a', {'class': "tag"})
    tags = [content_element.text for content_element in content_elements]
    
    print()
    print(quote_text)
    print(quote_author)
    for tag in tags:
        print(tag, end=" ")
    print()

In [None]:
parse_quote(quote)
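
`parse_quote` prints its results, which is convenient for inspection but leaves nothing to work with afterwards. A possible variant, sketched here under the hypothetical name `quote_to_dict` and tried on made-up sample HTML, returns the same three parts as a dictionary:

```python
from bs4 import BeautifulSoup


def quote_to_dict(quote):
    """Return the text, author and tags of one quote element as a dict."""
    return {
        "text": quote.find("span", {"class": "text"}).text,
        "author": quote.find("small", {"class": "author"}).text,
        "tags": [a.text for a in quote.find_all("a", {"class": "tag"})],
    }


# Hypothetical sample in the same shape as the page source
sample = """
<div class="quote">
  <span class="text">A sample quote</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    <a class="tag" href="#">tag 1</a>
    <a class="tag" href="#">tag 2</a>
  </div>
</div>
"""
sample_quote = BeautifulSoup(sample, "html.parser").find("div", {"class": "quote"})
record = quote_to_dict(sample_quote)
print(record)
```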

#### Parse page

In [None]:
# First get the source of the page
url = "http://quotes.toscrape.com/tag/humor"
response = requests.get(url)
text = response.text

# Parse the page
page_soup = BeautifulSoup(text, 'html.parser')

# Loop through the quotes
for quote in page_soup.find_all('div', {'class': 'quote'}):
    parse_quote(quote)
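
To end up with a format that is usable for analysis, the parsed values can be collected instead of printed. The sketch below is one possible approach, not part of the original notebook: it uses hypothetical sample HTML in place of the downloaded page, gathers one row per quote, and writes the rows as CSV to an in-memory buffer. Joining the tags with `;` is an arbitrary choice made for this example.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample standing in for the downloaded page source
sample = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author A</small>
  <a class="tag" href="#">humor</a>
  <a class="tag" href="#">life</a>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <small class="author">Author B</small>
  <a class="tag" href="#">humor</a>
</div>
"""
soup = BeautifulSoup(sample, "html.parser")

rows = []
for quote in soup.find_all("div", {"class": "quote"}):
    rows.append({
        "text": quote.find("span", {"class": "text"}).text,
        "author": quote.find("small", {"class": "author"}).text,
        # Join the tags so that each quote fits on one CSV row
        "tags": ";".join(a.text for a in quote.find_all("a", {"class": "tag"})),
    })

# Write to an in-memory buffer; a file opened with newline="" works the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "author", "tags"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```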
    

<a id='[1]'>[1]</a> https://en.wikipedia.org/wiki/Web_scraping  
<a id='[2]'>[2]</a> https://en.wikipedia.org/wiki/HTML#Markup  
<a id='[3]'>[3]</a> https://www.crummy.com/software/BeautifulSoup/bs4/doc/  