## WIM Python API-Webscraping workshop: 2020-09-18
### Helge Marahrens (hmarahre@iu.edu) & Anne Kavalerchik (akavaler@iu.edu)
#### Part 2: Web scraping HTML

http://toscrape.com/

First we will import the packages we need:

In [None]:
import os
import json
import requests
import time
import pandas as pd
from bs4 import BeautifulSoup as bs

Now we will get the HTML of a URL we need: [http://quotes.toscrape.com/](http://quotes.toscrape.com/).

It's a website with quotations, the people they are attributed to, and the short biographies of those people.

We will use the Python `requests` library to send HTTP requests.

In [None]:
url = "http://quotes.toscrape.com/"
response = requests.get(url)
response

`<Response [200]>` means that our request was successful.

Usually what we want is the text from a website.
Let's get the text and print it, then [compare it to the source code of the actual webpage](view-source:http://quotes.toscrape.com/).
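A minimal sketch of this step, assuming the original cell read `response.text`. The helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the workshop code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as a string (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.text         # the body of the response, decoded to str


# In the notebook you could then run, for example:
# html = fetch_html("http://quotes.toscrape.com/")
# print(html[:500])  # the first 500 characters are enough for a comparison
```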

We could use a combination of regular expressions, string matching, and loops to navigate the HTML, but luckily the Beautiful Soup package makes it much easier. [BeautifulSoup documentation is here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).
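As a sketch of how the HTML becomes a navigable object, the block below parses a short hand-written sample instead of the live page (the sample only imitates the page's header). In the notebook, `response.text` would take the place of `html`.

```python
from bs4 import BeautifulSoup as bs

# Hand-written sample, standing in for response.text
html = """
<html>
  <head><title>Quotes to Scrape</title></head>
  <body>
    <h1><a href="/" style="text-decoration: none">Quotes to Scrape</a></h1>
  </body>
</html>
"""

soup = bs(html, "html.parser")  # "html.parser" is the parser bundled with Python
print(soup.prettify())          # indented view of the parsed document
```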

There are several ways to navigate the parsed document.
First start by navigating it using __tag names__.
Accessing a tag name as an attribute of the parsed object returns only the first element with that tag name.
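A small illustration of tag-name navigation, using a hand-written sample with two links so that the "first element only" behaviour is visible. The second link (`/login`) is made up for the example.

```python
from bs4 import BeautifulSoup as bs

# Hand-written sample with two <a> tags
html = """
<html>
  <head><title>Quotes to Scrape</title></head>
  <body>
    <h1><a href="/" style="text-decoration: none">Quotes to Scrape</a></h1>
    <p><a href="/login">Login</a></p>
  </body>
</html>
"""
soup = bs(html, "html.parser")

print(soup.title)  # the <title> element
print(soup.h1)     # the first (and only) <h1>
print(soup.a)      # only the first <a>, even though there are two
```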

What kinds of data structures are these returning?
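One way to check is to call `type` on the results. This sketch uses a one-line sample document; the variable names are illustrative.

```python
from bs4 import BeautifulSoup as bs
from bs4.element import Tag

soup = bs("<html><head><title>Quotes to Scrape</title></head></html>", "html.parser")
first_title = soup.title

print(type(soup))         # the parsed document as a whole
print(type(first_title))  # a single element inside it
print(isinstance(first_title, Tag))
```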

We can treat a `bs4.element.Tag` like a `BeautifulSoup` object and navigate it the same way.
Try to get to the tag `<a href="/" style="text-decoration: none">Quotes to Scrape</a>`.
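One possible path, sketched on a hand-written sample. The nesting of `div` elements here is an assumption about the page header; on the live page the chain may need more or fewer steps.

```python
from bs4 import BeautifulSoup as bs

# Hand-written sample; the div nesting is assumed, not copied from the site
html = """
<html><body>
  <div class="container">
    <div class="row header-box">
      <div class="col-md-8">
        <h1><a href="/" style="text-decoration: none">Quotes to Scrape</a></h1>
      </div>
    </div>
  </div>
</body></html>
"""
soup = bs(html, "html.parser")

# Each step returns a Tag, so tag names can be chained
link = soup.body.div.div.div.h1.a
print(link)
```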

Note that doing that was also the same as doing this:
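The original cell is not shown, so this is a guess at the shortcut it demonstrated: because a tag name returns the first matching element anywhere below the current one, the long chain can be collapsed. The sample HTML is hand-written.

```python
from bs4 import BeautifulSoup as bs

html = """
<html><body>
  <div class="container">
    <div class="row header-box">
      <h1><a href="/" style="text-decoration: none">Quotes to Scrape</a></h1>
    </div>
  </div>
</body></html>
"""
soup = bs(html, "html.parser")

long_way = soup.body.div.div.h1.a  # step by step through the nesting
short_way = soup.a                 # first <a> anywhere in the document
with_find = soup.find("a")         # soup.a is shorthand for this call

print(long_way == short_way == with_find)
```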

To get the style of that tag:
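Attributes of a tag can be read like dictionary entries. A short sketch on the header link from above:

```python
from bs4 import BeautifulSoup as bs

html = '<h1><a href="/" style="text-decoration: none">Quotes to Scrape</a></h1>'
link = bs(html, "html.parser").a

print(link["style"])        # raises KeyError if the attribute is missing
print(link.get("style"))    # returns None if the attribute is missing
print(link.attrs)           # all attributes as a dictionary
```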

We can also use `.find` with the tag name and other attributes, and `.find_all` (also available under its older name `.findAll`) to return __all__ tags fitting those attributes.
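A brief comparison of the two methods on a hand-written sample. The class names `text` and `author` are assumptions about the page's markup; only `quote` is named in this notebook.

```python
from bs4 import BeautifulSoup as bs

# Hand-written sample imitating two entries on the page
html = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
<div class="quote">
  <span class="text">It is our choices, Harry, that show what we truly are, far more than our abilities.</span>
  <span>by <small class="author">J.K. Rowling</small></span>
</div>
"""
soup = bs(html, "html.parser")

first_author = soup.find("small", class_="author")     # one Tag (or None)
all_authors = soup.find_all("small", class_="author")  # list-like, every match
all_quotes = soup.find_all("div", attrs={"class": "quote"})  # attrs as a dict

print(first_author)
print(len(all_authors), len(all_quotes))
```

`class_` has a trailing underscore because `class` is a reserved word in Python.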

Let's practice on the first quotation, by Albert Einstein.
We get this by going to the first tag that has the class `quote`.
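A sketch of that lookup on a hand-written sample, assuming the quotation sits in a `div` element:

```python
from bs4 import BeautifulSoup as bs

html = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
<div class="quote">
  <span class="text">It is our choices, Harry, that show what we truly are, far more than our abilities.</span>
  <span>by <small class="author">J.K. Rowling</small></span>
</div>
"""
soup = bs(html, "html.parser")

first_quote = soup.find("div", class_="quote")  # first match only
print(first_quote)
```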

And we can investigate this tag a bit.
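What the original cell inspected is not shown; the following are a few properties that are commonly worth checking, on a hand-written sample.

```python
from bs4 import BeautifulSoup as bs

html = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
"""
first_quote = bs(html, "html.parser").find("div", class_="quote")

print(first_quote.name)      # the tag name
print(first_quote["class"])  # class is returned as a list of class names
# Names of the direct child tags (recursive=False skips deeper levels)
child_names = [child.name for child in first_quote.find_all(recursive=False)]
print(child_names)
```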

Let's get all of the tags for that quotation, and use `get_text` to get __only__ the text from each tag.
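A possible version of this step: `find_all(True)` matches every tag inside the quotation, and `get_text` drops the markup. The sample HTML is hand-written and simpler than the live page.

```python
from bs4 import BeautifulSoup as bs

html = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
"""
first_quote = bs(html, "html.parser").find("div", class_="quote")

# " " joins nested pieces of text with a space; strip=True trims whitespace
texts = [tag.get_text(" ", strip=True) for tag in first_quote.find_all(True)]
for text in texts:
    print(text)
```

Note that the author's name appears twice, once for the outer `span` and once for the `small` tag nested inside it.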

Now navigate just to "Albert Einstein".
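A sketch assuming the name sits in a `small` tag with class `author` (an assumption about the markup, shown here on a hand-written sample):

```python
from bs4 import BeautifulSoup as bs

html = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
"""
first_quote = bs(html, "html.parser").find("div", class_="quote")

person = first_quote.find("small", class_="author").get_text()
print(person)  # Albert Einstein
```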

Let's get Albert Einstein's quotation.
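The same pattern works for the quotation itself, assuming it sits in a `span` with class `text` (again an assumption, on a hand-written sample):

```python
from bs4 import BeautifulSoup as bs

html = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
"""
first_quote = bs(html, "html.parser").find("div", class_="quote")

quotation = first_quote.find("span", class_="text").get_text()
print(quotation)
```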

Now let's make a list of every person on this page, and then every quotation.
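One way to build both lists is a list comprehension over `find_all`. The sample below is hand-written with two entries; the class names `author` and `text` are assumptions.

```python
from bs4 import BeautifulSoup as bs

html = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
<div class="quote">
  <span class="text">It is our choices, Harry, that show what we truly are, far more than our abilities.</span>
  <span>by <small class="author">J.K. Rowling</small></span>
</div>
"""
soup = bs(html, "html.parser")

people = [tag.get_text() for tag in soup.find_all("small", class_="author")]
quotations = [tag.get_text() for tag in soup.find_all("span", class_="text")]

print(people)
print(len(quotations))
```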

Say what we really want is to make a big spreadsheet of all the names and quotations on this website. This means we need to go through the pages. Let's store everything in a Python __dictionary__ before turning it into a spreadsheet with `pandas`.

We'll store each entry in this format:
`{'Person':'Albert Einstein',
'Quotation':'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.'}`

First, let's make a __function__ to do that for us.

Loop through every person/quote on the page, and return a __list__ of __dictionaries__, where every dictionary is composed of 2 __key-value__ pairs: 1) Person's name 2) Person's quotation
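A minimal sketch of such a function. The name `get_quotes` and the class names `author` and `text` are assumptions; the sample HTML is hand-written.

```python
from bs4 import BeautifulSoup as bs


def get_quotes(soup):
    """Return one {'Person': ..., 'Quotation': ...} dictionary per quote."""
    entries = []
    for quote in soup.find_all("div", class_="quote"):
        entries.append({
            "Person": quote.find("small", class_="author").get_text(),
            "Quotation": quote.find("span", class_="text").get_text(),
        })
    return entries


html = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
<div class="quote">
  <span class="text">It is our choices, Harry, that show what we truly are, far more than our abilities.</span>
  <span>by <small class="author">J.K. Rowling</small></span>
</div>
"""
page_quotes = get_quotes(bs(html, "html.parser"))
print(page_quotes[0])
```

Searching inside each `div` keeps every name paired with its own quotation, which two separate lists would not guarantee if an entry were missing one of the tags.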

What we __really__ want is a list of __every person on this website__. To do this, we need to use `requests` to call on all the pages.

It's helpful to do some investigating first. Notice that [quotes.toscrape.com/page/1/](http://quotes.toscrape.com/page/1/) is the page we have been working with, [quotes.toscrape.com/page/2/](http://quotes.toscrape.com/page/2/) is the next page, and [quotes.toscrape.com/page/10/](http://quotes.toscrape.com/page/10/) is the last page. So our goal is to scrape these __10__ pages.

We can generate these 10 different URLs like this.
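For instance, with a list comprehension and an f-string (the variable names are illustrative):

```python
base_url = "http://quotes.toscrape.com/page/{}/"

# range(1, 11) yields 1 through 10; the upper bound is excluded
urls = [base_url.format(page) for page in range(1, 11)]

for url in urls:
    print(url)
```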

We are basically going to repeat the process that we did to get all the information from the first page for all 10 pages.
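A sketch of the full loop under a few assumptions: the function names, the class names `author` and `text`, the timeout, and the one-second pause between requests are all illustrative choices rather than the workshop's own code. The download step is passed in as a parameter so the loop can be tried without network access.

```python
import time

import requests
from bs4 import BeautifulSoup as bs


def fetch_html(url):
    """Download one page and return its HTML."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_quotes(html):
    """Return a list of {'Person', 'Quotation'} dictionaries for one page."""
    soup = bs(html, "html.parser")
    return [
        {
            "Person": quote.find("small", class_="author").get_text(),
            "Quotation": quote.find("span", class_="text").get_text(),
        }
        for quote in soup.find_all("div", class_="quote")
    ]


def scrape_pages(urls, fetch=fetch_html, pause=1.0):
    """Collect the entries from every URL into one list."""
    results = []
    for url in urls:
        results.extend(parse_quotes(fetch(url)))
        time.sleep(pause)  # wait between requests to go easy on the server
    return results


# In the notebook:
# urls = ["http://quotes.toscrape.com/page/{}/".format(n) for n in range(1, 11)]
# all_quotes = scrape_pages(urls)
```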


We did it! Here is what the resulting list of dictionaries looks like if we print it out:

We can convert this to JSON like this:
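A sketch with the standard `json` module. The two records stand in for the scraped results, and `save_json` is a hypothetical helper name.

```python
import json

# Stand-in for the scraped list of dictionaries
records = [
    {"Person": "Albert Einstein",
     "Quotation": "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."},
    {"Person": "J.K. Rowling",
     "Quotation": "It is our choices, Harry, that show what we truly are, far more than our abilities."},
]


def save_json(records, path):
    """Write the records to `path` as indented, UTF-8 encoded JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


# json.dumps returns the JSON as a string instead of writing a file
json_text = json.dumps(records, indent=2, ensure_ascii=False)
print(json_text)

# save_json(records, "quotes.json")  # writes next to the notebook
```

`ensure_ascii=False` keeps accented letters and curly quotation marks readable in the output instead of escaping them.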

We can also turn it into a `pandas` DataFrame and export it as an Excel or CSV file.
To see where this saves, go to File -> Open in the header of this Notebook.
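A sketch of the conversion; the two records stand in for the scraped results and the file names are placeholders. The Excel export is left commented out because it needs an extra engine such as `openpyxl` to be installed.

```python
import pandas as pd

# Stand-in for the scraped list of dictionaries
records = [
    {"Person": "Albert Einstein",
     "Quotation": "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."},
    {"Person": "J.K. Rowling",
     "Quotation": "It is our choices, Harry, that show what we truly are, far more than our abilities."},
]

df = pd.DataFrame(records)  # one row per dictionary, one column per key
print(df)

# Without a path, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)

# To write files into the notebook's working directory instead:
# df.to_csv("quotes.csv", index=False)
# df.to_excel("quotes.xlsx", index=False)
```

`index=False` leaves out the row numbers, which are rarely useful in the exported spreadsheet.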