# Part 2: Web scraping HTML
Here we will learn how to download and parse HTML.
We will use [this helpful website](http://toscrape.com/).

First we will import the packages we need:

In [None]:
import os
import json
import requests
import time
import pandas as pd
from bs4 import BeautifulSoup as bs

Now we will get the HTML of a URL we need: [http://quotes.toscrape.com/](http://quotes.toscrape.com/).

It's a website with quotations, the people they are attributed to, and the short biographies of those people.

We will use the Python `requests` library to send HTTP requests.

In [141]:
url = "http://quotes.toscrape.com/"
response = requests.get(url)
response

<Response [200]>

`<Response [200]>` means that our request was successful.
Usually what we want is the text from a website.
Let's get the text and print it. [Compare it to the source code of the actual webpage](view-source:http://quotes.toscrape.com/)

In [21]:
htmltext = response.text
print(htmltext)

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itempr
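Before reading `response.text`, it can help to confirm that the request actually succeeded, since a failed request still returns a response object. The sketch below is one possible way to do that. `fetch_html` is a hypothetical helper name (not part of `requests`), and the two `Response` objects are built by hand purely so the example runs without a network connection.

```python
import requests

def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page text, or raise on a 4xx/5xx status."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    return response.text

# Offline illustration: hand-built Response objects (no request is sent)
# show how a status code maps to .ok and raise_for_status().
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(ok_response.ok)       # True
print(missing_response.ok)  # False

try:
    missing_response.raise_for_status()
    raised = False
except requests.HTTPError:
    raised = True
print(raised)               # True
```

With a live connection, `htmltext = fetch_html("http://quotes.toscrape.com/")` would play the same role as the two lines in the cell above.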

We could use a combination of regular expressions, string matching, and loops to navigate the html, but luckily the Beautiful Soup package makes it much easier. [BeautifulSoup documentation is here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In [20]:
soup = bs(htmltext,'html.parser')
print(soup) # looks much the same as the raw text, but the parsed object is much easier to navigate

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="

There are several ways to navigate the parsed document.
We start with __tag names__.
Accessing a tag name as an attribute (for example `soup.head`) returns the first element with that tag name.

In [26]:
# head
print(soup.head)
# title
print(soup.title)
# body
print(soup.body)
# h1

<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">ch

What kinds of data structures are these returning?

In [32]:
print(type(soup))
print(type(soup.head))
print(type(soup.title))

<class 'bs4.BeautifulSoup'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>


<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>

A `bs4.element.Tag` can be navigated the same way as the `BeautifulSoup` object itself, so tag names can be chained one after another.
Try to get to the tag 

In [39]:
print(soup.body)

<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>


In [40]:
print(soup.body.div)
print(soup.body.div.div)
print(soup.body.div.div.div)
print(soup.body.div.div.div.div)
print(soup.body.div.div.div.h1.a)

<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
<

Note that the last line of the previous cell gives the same result as this shortcut:

In [50]:
print(soup.h1.a)

<a href="/" style="text-decoration: none">Quotes to Scrape</a>


To get the style of that tag:

In [53]:
print(soup.h1.a['style'])

text-decoration: none
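Attribute access works like a dictionary lookup on the tag. As a small sketch on an inline HTML string (the string is copied from the page output above, so no request is needed), the example shows square-bracket access, the full `.attrs` dictionary, and `.get`, which returns `None` instead of raising an error when an attribute is missing.

```python
from bs4 import BeautifulSoup

html = (
    '<h1><a href="/" style="text-decoration: none">Quotes to Scrape</a></h1>'
    '<a class="tag" href="/tag/change/page/1/">change</a>'
)
doc = BeautifulSoup(html, "html.parser")

link = doc.h1.a
print(link["href"])       # /
print(link.attrs)         # all attributes as a dict
print(link.get("title"))  # None, because there is no title attribute

# class can hold several values, so Beautiful Soup returns it as a list
tag_link = doc.find("a", class_="tag")
print(tag_link["class"])  # ['tag']
```

Using `link["title"]` here would raise a `KeyError`, so `.get` is the safer choice when an attribute may be absent.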


We can also use `.find` with the tag name and other attributes, and `.findAll` to return __all__ tags fitting those attributes. (`.findAll` is the older spelling of `.find_all`; both work the same way.)

In [74]:
# These are the same
print(soup.h1)
print(soup.find('h1'))
print('')
print(soup.find(style = "text-decoration: none"))
print(soup.h1.a)
print('')
#print(soup.findAll(div))
print(len(soup.findAll('div')))
print(type(soup.findAll('div')))
print(soup.find(''))
print('')

<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>

<a href="/" style="text-decoration: none">Quotes to Scrape</a>
None
<a href="/" style="text-decoration: none">Quotes to Scrape</a>

28
<class 'bs4.element.ResultSet'>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing 
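To make the difference between the search methods concrete, here is a minimal sketch on a small made-up HTML snippet (the authors and quotes are placeholders, not data from the site). It compares `find`, `find_all`, the `class_` keyword, and a CSS selector via `select`, and shows that a search with no match returns `None`.

```python
from bs4 import BeautifulSoup

html = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author A</small>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <small class="author">Author B</small>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

first = doc.find("div", {"class": "quote"})      # dictionary of attributes
same_first = doc.find("div", class_="quote")     # keyword form, same result
all_quotes = doc.find_all("div", class_="quote") # every match, as a ResultSet
via_css = doc.select("div.quote")                # CSS selector, returns a list
missing = doc.find("table")                      # no match

print(first is same_first)  # True
print(len(all_quotes))      # 2
print(len(via_css))         # 2
print(missing)              # None
```

Because `find` returns `None` when nothing matches, chaining such as `doc.find("table").get_text()` would fail with an `AttributeError`; checking for `None` first avoids that.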

Let's practice on the first quotation, by Albert Einstein.
We get it by finding the first `div` tag whose class is `quote`.

In [84]:
einstein = soup.find('div',{'class':'quote'})
print(einstein)


<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>


In [88]:
print(einstein.div)
print(einstein.span)
print(einstein.a)
print(einstein.findAll('a'))

<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<a href="/author/Albert-Einstein">(about)</a>
[<a href="/author/Albert-Einstein">(about)</a>, <a class="tag" href="/tag/change/page/1/">change</a>, <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>, <a class="tag" href="/tag/thinking/page/1/">thinking</a>, <a class="tag" href="/tag/world/page/1/">world</a>]


Let's get all of the tags for that quotation, and use `get_text` to get __only__ the text from each tag.

In [99]:
e_tags = einstein.findAll('a',{'class':'tag'})
e_tags_list = []
for e_tag in e_tags:
    print(e_tag.get_text())
    e_tags_list.append(e_tag.get_text())
e_tags_list

# We can do the equivalent task without a loop using this line:
e_tags_list = [e_tag.get_text() for e_tag in e_tags]


change
deep-thoughts
thinking
world
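`get_text` on a parent tag returns the text of all its children, including the whitespace between them. The following sketch, on an inline copy of a shortened tags block, shows how the `strip` and separator arguments tidy that up. The HTML string is an abbreviated stand-in for the real page.

```python
from bs4 import BeautifulSoup

html = """
<div class="tags">
    Tags:
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
</div>
"""
tags_div = BeautifulSoup(html, "html.parser").div

raw = tags_div.get_text()                   # keeps newlines and indentation
clean = tags_div.get_text(" ", strip=True)  # strips each piece, joins with a space
tag_names = [a.get_text(strip=True) for a in tags_div.find_all("a", class_="tag")]

print(repr(raw))
print(clean)      # Tags: change world
print(tag_names)  # ['change', 'world']
```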


Now navigate just to "Albert Einstein".

In [117]:
einstein
einstein.small.get_text()

'Albert Einstein'

Let's get Albert Einstein's quotation.

In [125]:
print(einstein.span.get_text())

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”


Now let's make a list of every person on this page, and then every quotation.

In [127]:
all_person_tags = soup.findAll('div',{'class':'quote'})
for person_tag in all_person_tags:
    print(person_tag.small.get_text())
    
# The loop variable must be the same name that is used inside the expression
persons = [person_tag.small.get_text() for person_tag in all_person_tags]

quotes = [person_tag.span.get_text() for person_tag in all_person_tags]

print(persons)
print(quotes)

Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']


Say what we really want is to make a big spreadsheet of all the names and quotations on this website. This means we need to go through the pages. Let's store every entry in a Python __dictionary__ before turning the collection into a spreadsheet with `pandas`.

We'll store each entry in this format:
`{'name':'Albert Einstein',
'quote':'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.'}`

First, let's make a __function__ to do that for us.

In [131]:
def storePerson(person_tag):
    name = person_tag.small.get_text()
    quote = person_tag.span.get_text()
    return {'name':name,'quote':quote}

print(storePerson(einstein))
    

{'name': 'Albert Einstein', 'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
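`storePerson` assumes every quote block contains both a `small` and a `span` tag. On less tidy pages that may not hold, so the sketch below shows one possible defensive variant. `store_person_safe` is a hypothetical name introduced here, and the two HTML snippets are made-up test inputs.

```python
from bs4 import BeautifulSoup

def store_person_safe(person_tag):
    """Hypothetical variant of storePerson that tolerates missing tags."""
    name_tag = person_tag.find("small", class_="author")
    quote_tag = person_tag.find("span", class_="text")
    return {
        "name": name_tag.get_text(strip=True) if name_tag is not None else None,
        "quote": quote_tag.get_text(strip=True) if quote_tag is not None else None,
    }

# One complete block and one block with the author missing
full = BeautifulSoup(
    '<div class="quote"><span class="text">Q</span>'
    '<small class="author">A</small></div>',
    "html.parser",
).div
partial = BeautifulSoup(
    '<div class="quote"><span class="text">Q</span></div>',
    "html.parser",
).div

print(store_person_safe(full))     # {'name': 'A', 'quote': 'Q'}
print(store_person_safe(partial))  # {'name': None, 'quote': 'Q'}
```

Searching by class as well as tag name also guards against picking up an unrelated `span` or `small` that happens to come first.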


Loop through every person/quote on the page, and return a __list__ of __dictionaries__, where every dictionary is composed of 2 __key-value__ pairs: 1) Person's name 2) Person's quotation

In [132]:
all_person_tags = soup.findAll('div',{'class':'quote'})
all_quotes = []
for person_tag in all_person_tags:
    all_quotes.append(storePerson(person_tag))
    
print(all_quotes)
    

[{'name': 'Albert Einstein', 'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}, {'name': 'J.K. Rowling', 'quote': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}, {'name': 'Albert Einstein', 'quote': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'}, {'name': 'Jane Austen', 'quote': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'}, {'name': 'Marilyn Monroe', 'quote': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"}, {'name': 'Albert Einstein', 'quote': '“Try not to become a man of success. Rather become a man of value.”'}, {'name': 'André Gide', 'quote': '“It is better to be hated for what you are than to be loved for what you are not.”'}, {'name': 'Thomas A. Edi

What we __really__ want is a list of __every person on this website__. To do this, we need to use `requests` to call on all the pages.

It's helpful to do some investigating first. Notice that [quotes.toscrape.com/page/1/](http://quotes.toscrape.com/page/1/) is the page we have been working with, [quotes.toscrape.com/page/2/](http://quotes.toscrape.com/page/2/) is the next page, and [quotes.toscrape.com/page/10/](http://quotes.toscrape.com/page/10/) is the last page. So our goal is to scrape these __10__ pages.

We can generate these 10 different URLs like this.

In [142]:
url = 'http://quotes.toscrape.com/page/'
page_num = 1
for page_num in range(1,11):
    print(page_num)
    print(url + str(page_num))

1
http://quotes.toscrape.com/page/1
2
http://quotes.toscrape.com/page/2
3
http://quotes.toscrape.com/page/3
4
http://quotes.toscrape.com/page/4
5
http://quotes.toscrape.com/page/5
6
http://quotes.toscrape.com/page/6
7
http://quotes.toscrape.com/page/7
8
http://quotes.toscrape.com/page/8
9
http://quotes.toscrape.com/page/9
10
http://quotes.toscrape.com/page/10
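String concatenation works well here. As an alternative sketch, the same URLs can be built with the standard library's `urljoin` and an f-string, which also keeps the trailing slash used in the page links. The variable names `base_url` and `page_urls` are chosen for this example only.

```python
from urllib.parse import urljoin

base_url = "http://quotes.toscrape.com/"

# Build the URL of each of the 10 pages in one list comprehension
page_urls = [urljoin(base_url, f"page/{n}/") for n in range(1, 11)]

print(page_urls[0])   # http://quotes.toscrape.com/page/1/
print(page_urls[-1])  # http://quotes.toscrape.com/page/10/
print(len(page_urls)) # 10
```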


We are basically going to repeat the process that we did to get all the information from the first page for all 10 pages.


In [145]:
all_persons_pages = []

for page_num in range(1,11):
    time.sleep(.5) # So as not to overload the server!
    print(url + str(page_num))
    response = requests.get(url + str(page_num))
    htmltext = response.text
    soup = bs(htmltext,'html.parser')
    all_person_tags = soup.findAll('div',{'class':'quote'})
    for person_tag in all_person_tags:
        all_persons_pages.append(storePerson(person_tag))

    

    

http://quotes.toscrape.com/page/1
http://quotes.toscrape.com/page/2
http://quotes.toscrape.com/page/3
http://quotes.toscrape.com/page/4
http://quotes.toscrape.com/page/5
http://quotes.toscrape.com/page/6
http://quotes.toscrape.com/page/7
http://quotes.toscrape.com/page/8
http://quotes.toscrape.com/page/9
http://quotes.toscrape.com/page/10
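The loop above relies on knowing in advance that there are 10 pages. A possible alternative, sketched below, is to keep following a "next" link until none is left. This assumes the site marks that link as `<li class="next"><a href="...">`, which should be checked in the page source first. To stay runnable offline, the example uses two tiny made-up pages in a dictionary (`FAKE_PAGES`) and a hypothetical `scrape_all` function that takes the fetching step as a parameter.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Stand-in for the website: two tiny hypothetical pages keyed by URL
FAKE_PAGES = {
    "http://example.test/page/1/": """
        <div class="quote"><span class="text">Q1</span><small class="author">A</small></div>
        <li class="next"><a href="/page/2/">Next</a></li>
    """,
    "http://example.test/page/2/": """
        <div class="quote"><span class="text">Q2</span><small class="author">B</small></div>
    """,
}

def scrape_all(start_url, fetch):
    """Collect quotes page by page until no 'next' link is found."""
    results = []
    url = start_url
    while url is not None:
        page = BeautifulSoup(fetch(url), "html.parser")
        for quote_div in page.find_all("div", class_="quote"):
            results.append({
                "name": quote_div.small.get_text(),
                "quote": quote_div.span.get_text(),
            })
        next_item = page.find("li", class_="next")
        # Resolve the relative href against the current URL, or stop
        url = urljoin(url, next_item.a["href"]) if next_item is not None else None
    return results

quotes = scrape_all("http://example.test/page/1/", FAKE_PAGES.__getitem__)
print(quotes)
```

Against the real site, `fetch` could be a small function that sleeps briefly and then returns `requests.get(url).text`, so the pause between requests is kept.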


We did it! Here is what the resulting list of dictionaries looks like if we print it out:

In [148]:
print(len(all_persons_pages))
print(all_persons_pages)

100
[{'name': 'Albert Einstein', 'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}, {'name': 'J.K. Rowling', 'quote': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}, {'name': 'Albert Einstein', 'quote': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'}, {'name': 'Jane Austen', 'quote': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'}, {'name': 'Marilyn Monroe', 'quote': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"}, {'name': 'Albert Einstein', 'quote': '“Try not to become a man of success. Rather become a man of value.”'}, {'name': 'André Gide', 'quote': '“It is better to be hated for what you are than to be loved for what you are not.”'}, {'name': 'Thomas A.

We can save this list as a JSON file like this:

In [152]:
with open('famous_quotes.json','w') as f:
    json.dump(all_persons_pages,f,indent=4)
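By default `json.dump` escapes non-ASCII characters, so the curly quotation marks and names such as André would be written as `\uXXXX` sequences. The sketch below shows one way to keep them readable with `ensure_ascii=False` and to read the file back. It writes to a temporary folder and uses a single sample record, both chosen only for this example.

```python
import json
import os
import tempfile

records = [
    {
        "name": "André Gide",
        "quote": "“It is better to be hated for what you are than to be loved for what you are not.”",
    }
]

path = os.path.join(tempfile.mkdtemp(), "famous_quotes.json")

# ensure_ascii=False writes é and the curly quotes as-is instead of \u escapes
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=4, ensure_ascii=False)

# Read the file back as text, then parse it to check the round trip
with open(path, encoding="utf-8") as f:
    raw_text = f.read()
loaded = json.loads(raw_text)

print("André" in raw_text)  # True
print(loaded == records)    # True
```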

We can also load it into a `pandas` DataFrame and export it as a CSV or Excel file.

In [159]:
df = pd.DataFrame(all_persons_pages)
df
df.to_csv('all_quotes.csv')

Unnamed: 0,name,quote
0,Albert Einstein,“The world as we have created it is a process ...
1,J.K. Rowling,"“It is our choices, Harry, that show what we t..."
2,Albert Einstein,“There are only two ways to live your life. On...
3,Jane Austen,"“The person, be it gentleman or lady, who has ..."
4,Marilyn Monroe,"“Imperfection is beauty, madness is genius and..."
...,...,...
95,Harper Lee,“You never really understand a person until yo...
96,Madeleine L'Engle,“You have to write the book that wants to be w...
97,Mark Twain,“Never tell the truth to people who are not wo...
98,Dr. Seuss,"“A person's a person, no matter how small.”"
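The `Unnamed: 0` column appears because `to_csv` writes the DataFrame index as the first column by default. A minimal sketch of avoiding that with `index=False` follows, together with a quick count of quotes per person. The three records are placeholders, and an in-memory buffer stands in for the CSV file.

```python
import io
import pandas as pd

records = [
    {"name": "Albert Einstein", "quote": "quote 1"},
    {"name": "Jane Austen", "quote": "quote 2"},
    {"name": "Albert Einstein", "quote": "quote 3"},
]
df = pd.DataFrame(records)

# index=False leaves the row numbers out of the CSV
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Number of quotes attributed to each person
counts = df["name"].value_counts()

print(list(round_trip.columns))   # ['name', 'quote']
print(counts["Albert Einstein"])  # 2
```

With the real data, `df.to_csv('all_quotes.csv', index=False)` would write the same two columns to disk.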
