<a href="http://quotes.toscrape.com/tag/humor">http://quotes.toscrape.com/tag/humor</a>

When you open the page, you'll find a list of 'funny' quotes.  
Let's see if we can extract these quotes and who they're attributed to.

Define the URL and get the content of the webpage.  
One of the many ways to do this is with the requests module.

In [None]:
import requests

url = "http://quotes.toscrape.com/tag/humor"
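response = requests.get(url, timeout=10)
# Stop early on an HTTP error (4xx/5xx) instead of parsing an error page
response.raise_for_status()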

The response consists of more than the text you see on the screen.  
But for now, we're not interested in things like headers.

In [None]:
text = response.text

This is what we see in the browser with 'View page source' or 'Inspect'.

In [None]:
# Only print the <body>
print(text[text.index("<body>"):])

```html
<span class="text" itemprop="text">
    “A day without sunshine is like, you know, night.”
</span>
<span>
    by
    <small class="author" itemprop="author">
        Steve Martin
    </small>
</span>
```

If we look closely, we can see that each quote is contained in a **span** with the attribute **itemprop** set to **text**.  
The author is in a **small** with **itemprop** set to **author**.

## Do not try this at home!

We'll use the **re** module to find every match and iterate over the text.

Print the first 200 chars after we find an instance of 'itemprop="text"'.

In [None]:
# Do not try this at home!
import re

# Find all the quotes on this page
for m in re.finditer('itemprop="text"', text):
    print()
    # Only print first 200 chars
    print(text[m.start():m.start()+200])

## Do not try this at home!

This works because the quotes are the only items marked by the 'text' item-property.  
It's also very error-prone: a perfectly valid variant such as 'itemprop = "text"' (extra spaces around the equals sign) would be missed.
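
To make the whitespace problem concrete, here is a minimal sketch. The `html` string is a shortened stand-in written for this example, not a copy of the page source, and the "tolerant" pattern is only an illustration: it still breaks on single quotes or other valid spellings.

```python
import re

# Two spellings of the same attribute; both are valid HTML.
# This snippet is a made-up stand-in for the page source.
html = (
    '<span class="text" itemprop="text">first</span>'
    '<span class="text" itemprop = "text">second</span>'
)

# The literal pattern only matches the exact spelling.
strict = re.findall('itemprop="text"', html)

# Allowing optional whitespace around "=" catches both spellings,
# but still misses single quotes, unquoted values, and so on.
tolerant = re.findall(r'itemprop\s*=\s*"text"', html)

print(len(strict), len(tolerant))
```

Every such patch makes the pattern longer without making it complete, which is why the heading says not to try this at home.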

We need to find the end of the quote, and do some cleaning.

In [None]:
# Do not try this at home!

# We need some cleaning
for m in re.finditer('itemprop="text"', text):
    # Find the marker
    quote = text[m.start():]
    
    # Cleanup of the quote
    # Cut off start marker
    quote = quote[quote.find(">")+1:]
    # Cut off end marker
    quote = quote[:quote.find("</span>")]
    # Replace the "&#39;" with "'"
    quote = quote.replace("&#39;", "'")
    
    print()
    print(quote)
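
The cell above replaces only one character reference, `&#39;`. Other references can occur in a page as well. As a hedged alternative, the standard library's `html.unescape` converts all of them at once; the `raw` string below is a made-up example, not taken from the page.

```python
import html

# Made-up sample string containing two different character references
raw = "I don&#39;t know &amp; I don&#39;t care"

# Replacing a single reference by hand leaves the others untouched
partial = raw.replace("&#39;", "'")

# html.unescape converts all named and numeric character references
full = html.unescape(raw)

print(partial)
print(full)
```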

## Do not try this at home!

But we also wanted to get the author of each quote.

Remember, the name of the author was inside a tag **small** with **itemprop** set to **author**.

Once we find the quote, we look for the author marker and do a similar cleanup.

In [None]:
# Do not try this at home!

# Also find the author of the quote
for m in re.finditer('itemprop="text"', text):
    # Find the marker
    quote = text[m.start():]

    # Find the marker for the author
    author = quote[quote.find('itemprop="author"'):]
    
    # Cleanup of the quote
    # Cut off start marker
    quote = quote[quote.find(">")+1:]
    # Cut off end marker
    quote = quote[:quote.find("</span>")]
    # Replace the "&#39;" with "'"
    quote = quote.replace("&#39;", "'")
   
    # Cleanup of the author
    # Cut off start marker
    author = author[author.find(">")+1:]
    # Cut off end marker
    author = author[:author.find("</small>")]
    
    print()
    print(author, "-", quote)
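
One more reason this approach is fragile: `str.find` returns -1 when a marker is missing, and slicing with -1 silently drops the last character instead of failing. The sketch below shows the effect on a small string where the end marker is absent (an assumed situation, chosen for illustration).

```python
# The end marker "</small>" is missing from this fragment
fragment = "Steve Martin"

# find() returns -1, so the slice quietly drops the final character
end = fragment.find("</small>")
silently_wrong = fragment[:end]

# index() raises instead, which at least makes the problem visible
try:
    fragment.index("</small>")
    raised = False
except ValueError:
    raised = True

print(end, repr(silently_wrong), raised)
```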

---

## Do try this at home! Or here!

Let's see if we can do this a bit more elegantly.

In [None]:
from bs4 import BeautifulSoup

# Parse the document
soup = BeautifulSoup(text, 'html.parser')

In [None]:
for quote in soup.find_all('span', {'itemprop': 'text'}):
    # find_next is the current name; findNext is a legacy alias
    author = quote.find_next('small', {'itemprop': 'author'}).text
    
    print()
    print(author, "-", quote.text)
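
To see why a real parser is more robust than string matching, here is a rough sketch that uses only the standard library's `html.parser` module, which is also what BeautifulSoup's `'html.parser'` option builds on. The `QuoteParser` class and the `sample` string are assumptions made for this example; the sample deliberately uses the `itemprop = "text"` spelling that the regular expression would miss.

```python
from html.parser import HTMLParser


class QuoteParser(HTMLParser):
    """Collect (author, quote) pairs based on itemprop attributes."""

    def __init__(self):
        super().__init__()
        self.current = None  # "text", "author", or None
        self.buffer = []     # text pieces of the element being read
        self.quote = None    # last completed quote
        self.results = []

    def handle_starttag(self, tag, attrs):
        itemprop = dict(attrs).get("itemprop")
        if (tag, itemprop) in (("span", "text"), ("small", "author")):
            self.current = itemprop
            self.buffer = []

    def handle_data(self, data):
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.current == "text" and tag == "span":
            self.quote = "".join(self.buffer).strip()
            self.current = None
        elif self.current == "author" and tag == "small":
            author = "".join(self.buffer).strip()
            self.results.append((author, self.quote))
            self.current = None


# Hypothetical variant of the markup shown earlier, with extra spaces
sample = '''
<span class="text" itemprop = "text">
    “A day without sunshine is like, you know, night.”
</span>
<span>
    by
    <small class="author" itemprop="author">
        Steve Martin
    </small>
</span>
'''

parser = QuoteParser()
parser.feed(sample)
for author, quote in parser.results:
    print(author, "-", quote)
```

The parser works on tags and attributes rather than on raw characters, so spacing inside a tag does not matter. BeautifulSoup gives the same benefit with far less code.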