## WIM Python API-Webscraping workshop: 2020-09-18
### Helge Marahrens (hmarahre@iu.edu) & Anne Kavalerchik (akavaler@iu.edu)
#### Part 2: Web scraping HTML

http://toscrape.com/

First we will import the packages we need:

In [None]:
import os
import json
import requests
import time
import pandas as pd
from bs4 import BeautifulSoup as bs

Now we will get the HTML of a URL we need: [http://quotes.toscrape.com/](http://quotes.toscrape.com/).

It's a website with quotations, the people they are attributed to, and the short biographies of those people.

We will use the Python `requests` library to send HTTP requests.

In [None]:
url = "http://quotes.toscrape.com/"
response = requests.get(url)
response

`<Response [200]>` means that our request was successful.
Usually what we want is the text from a website.
Let's get the text and print it, then [compare it to the source code of the actual webpage](view-source:http://quotes.toscrape.com/).

In [None]:
htmltext = response.text
print(htmltext)
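Not every request comes back with a 200. Here is a minimal sketch, assuming we would rather stop with an error than parse an error page. `html_or_raise` is a hypothetical helper name (not part of `requests`), and it is demonstrated on a hand-built `Response` object so that it runs without a network connection.

```python
import requests

def html_or_raise(response):
    """Return the page text, or raise requests.HTTPError for 4xx/5xx responses."""
    response.raise_for_status()  # does nothing when the request succeeded
    return response.text

# Build a fake "404" response by hand, purely for illustration.
fake = requests.Response()
fake.status_code = 404

try:
    html_or_raise(fake)
    failed = False
except requests.HTTPError:
    failed = True

print(failed)
```

With a real request this would look like `html_or_raise(requests.get(url, timeout=10))`; the `timeout` argument keeps the script from waiting forever on a server that never answers.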

We could use a combination of regular expressions, string matching, and loops to navigate the HTML, but luckily the Beautiful Soup package makes this much easier. [The BeautifulSoup documentation is here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In [None]:
soup = bs(htmltext, 'html.parser')
print(soup) # this looks much the same as before parsing, but now we can navigate it more easily

There are several ways to navigate the parsed document.
Let's start by navigating it using __tag names__.
Accessing a tag name as an attribute returns the first element with that tag name.

In [None]:
# head
print(soup.head)
# title
print(soup.title)
# body
print(soup.body)
# h1
print(soup.h1)

What kinds of data structures are these returning?

In [None]:
print(type(soup))
print(type(soup.head))
print(type(soup.title))
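To see the types without the noise of a full page, here is a small sketch on a made-up HTML snippet (not the real website). It also shows one thing worth knowing early: asking for a tag that does not exist returns `None` rather than raising an error.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, used only for illustration.
html = "<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"
demo = BeautifulSoup(html, "html.parser")

print(type(demo).__name__)        # BeautifulSoup
print(type(demo.title).__name__)  # Tag
print(demo.h2)                    # None, because the snippet has no <h2>
```

Because a missing tag gives `None`, chaining further (for example `demo.h2.a`) would fail with an `AttributeError`, so it can be worth checking for `None` first.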

We can treat a `bs4.element.Tag` much like the `BeautifulSoup` object and navigate it the same way.
Try to get to the tag `<a href="/" style="text-decoration: none">Quotes to Scrape</a>`.

In [None]:
print(soup.body)
print(soup.body.div)
print(soup.body.div.div)
print(soup.body.div.div.div)
print(soup.body.div.div.div.div)
print(soup.body.div.div.div.h1.a)

Tag-name access returns the first matching element anywhere below the current tag, so this shortcut gives the same result:

In [None]:
print(soup.h1.a)

To get the style of that tag:

In [None]:
print(soup.h1.a['style'])
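Attributes behave much like a dictionary. The sketch below uses a one-line HTML snippet copied from the tag above (an illustration, not a live request) to compare square-bracket access with `.get`, which returns `None` when an attribute is absent.

```python
from bs4 import BeautifulSoup

html = '<h1><a href="/" style="text-decoration: none">Quotes to Scrape</a></h1>'
link = BeautifulSoup(html, "html.parser").a

print(link["href"])       # square brackets: raises KeyError if the attribute is missing
print(link.get("title"))  # .get: returns None if the attribute is missing
print(link.attrs)         # all attributes as a dictionary
```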

We can also use `.find` with the tag name and other attributes, and `.findAll` to return __all__ tags fitting those attributes.

In [None]:
# These are the same
print(soup.h1)
print(soup.find('h1'))
print('')
print(soup.find(style = "text-decoration: none"))
print(soup.h1.a)
print('')
# print(soup.findAll('div'))  # uncomment to print every div (long output)
print(len(soup.findAll('div')))
print(type(soup.findAll('div')))
print(soup.find(''))
print('')
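The difference between finding one element and finding all of them is easier to see on a small example. This is a sketch on made-up HTML; it uses `find_all`, the current spelling in the Beautiful Soup documentation (`findAll` is the older name for the same method), and the `class_` keyword as an alternative to the `{'class': ...}` dictionary.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only for illustration.
html = """
<div class="quote"><span class="text">First</span></div>
<div class="quote"><span class="text">Second</span></div>
<div class="footer">End</div>
"""
demo = BeautifulSoup(html, "html.parser")

first = demo.find("div", class_="quote")      # a single Tag (the first match)
every = demo.find_all("div", class_="quote")  # a list-like collection of Tags
missing = demo.find("div", class_="nope")     # no match

print(first.span.get_text())
print(len(every))
print(missing)
```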

Let's practice on the first quotation, by Albert Einstein.
We get this by finding the first `div` tag that has the class `quote`.

In [None]:
einstein = soup.find('div',{'class':'quote'})
print(einstein)


We can investigate this tag a bit:

In [None]:
print(einstein.div)
print(einstein.span)
print(einstein.a)
print(einstein.findAll('a'))

Let's get all of the topic tags for that quotation (the `a` elements with the class `tag`), and use `get_text` to get __only__ the text from each one.

In [None]:
e_tags = einstein.findAll('a',{'class':'tag'})
e_tags_list = []
for e_tag in e_tags:
    print(e_tag.get_text())
    e_tags_list.append(e_tag.get_text())
e_tags_list

# We can do the equivalent task without a loop using this line:
e_tags_list = [e_tag.get_text() for e_tag in e_tags]
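When `get_text` is called on a tag that contains other tags, the surrounding whitespace comes along too. The sketch below, on a made-up snippet, shows the optional separator and `strip=True` arguments of `get_text`, which tidy this up.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only for illustration.
html = '<div class="tags"> Tags: <a class="tag">change</a> <a class="tag">world</a> </div>'
div = BeautifulSoup(html, "html.parser").div

raw = div.get_text()                   # keeps leading and trailing whitespace
clean = div.get_text(" ", strip=True)  # strips each piece, joins the pieces with a space

print(repr(raw))
print(repr(clean))
```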


Now navigate just to "Albert Einstein".

In [None]:
einstein
einstein.small.get_text()

Let's get Albert Einstein's quotation.

In [None]:
print(einstein.span.get_text())

Now let's make a list of every person on this page, and then every quotation.

In [None]:
all_person_tags = soup.findAll('div',{'class':'quote'})
for person_tag in all_person_tags:
    print(person_tag.small.get_text())
    
# The loop variable must match the name used inside the expression
persons = [person_tag.small.get_text() for person_tag in all_person_tags]

quotes = [person_tag.span.get_text() for person_tag in all_person_tags]

print(persons)
print(quotes)

Say what we really want is to make a big spreadsheet of all the names and quotations on this website. This means we need to go through all of the pages. Let's store everything in a Python __dictionary__ before turning it into a spreadsheet with `pandas`.

We'll store each entry in this format:
`{'name':'Albert Einstein',
'quote':'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.'}`

First, let's make a __function__ to do that for us.

In [None]:
def storePerson(person_tag):
    name = person_tag.small.get_text()
    quote = person_tag.span.get_text()
    return {'name':name,'quote':quote}

print(storePerson(einstein))
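The same idea extends to more fields. As a sketch, here is a variant that also stores the topic tags. `store_person_with_tags` is a hypothetical name, and the HTML is a made-up snippet that assumes the same layout as the quote blocks on the website.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates one quote block (an assumption about the layout).
html = """
<div class="quote">
  <span class="text">A made-up quotation.</span>
  <span>by <small class="author">Jane Doe</small></span>
  <div class="tags"><a class="tag">demo</a><a class="tag">example</a></div>
</div>
"""

def store_person_with_tags(person_tag):
    """Return name, quote, and a list of topic tags for one quote block."""
    return {
        "name": person_tag.small.get_text(),
        "quote": person_tag.span.get_text(),
        "tags": [a.get_text() for a in person_tag.find_all("a", class_="tag")],
    }

record = store_person_with_tags(BeautifulSoup(html, "html.parser").div)
print(record)
```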
    

Loop through every person/quote on the page and build a __list__ of __dictionaries__, where every dictionary is composed of 2 __key-value__ pairs: the person's name and the person's quotation.

In [None]:
all_person_tags = soup.findAll('div',{'class':'quote'})
all_quotes = []
for person_tag in all_person_tags:
    all_quotes.append(storePerson(person_tag))
    
print(all_quotes)
    

What we __really__ want is a list of __every person on this website__. To do this, we need to use `requests` to call on all the pages.

It's helpful to do some investigating first. Notice that [quotes.toscrape.com/page/1/](http://quotes.toscrape.com/page/1/) is the page we have been working with, [quotes.toscrape.com/page/2/](http://quotes.toscrape.com/page/2/) is the next page, and [quotes.toscrape.com/page/10/](http://quotes.toscrape.com/page/10/) is the last page. So our goal is to scrape these __10__ pages.

We can generate these 10 different URLs like this.

In [None]:
url = 'http://quotes.toscrape.com/page/'
for page_num in range(1,11):  # range(1,11) gives 1, 2, ..., 10
    print(page_num)
    print(url + str(page_num))
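An equivalent way to build the URLs, shown here as a sketch, is a list comprehension with an f-string. The names `base_url` and `page_urls` are made up for this example, and the trailing slash matches the page links listed above.

```python
base_url = "http://quotes.toscrape.com/page/"

# One URL per page number from 1 through 10
page_urls = [f"{base_url}{page_num}/" for page_num in range(1, 11)]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```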

We will now repeat, for all 10 pages, the process we used to get the information from the first page.


In [None]:
all_persons_pages = []

for page_num in range(1,11):
    time.sleep(.5) # So as not to overload the server!
    print(url + str(page_num))
    response = requests.get(url + str(page_num))
    htmltext = response.text
    soup = bs(htmltext,'html.parser')
    all_person_tags = soup.findAll('div',{'class':'quote'})
    for person_tag in all_person_tags:
        all_persons_pages.append(storePerson(person_tag))
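Hard-coding 10 pages works here because we checked by hand. An alternative, sketched below, is to follow the "Next" link until there is none. This assumes the link sits inside an `<li class="next">` element, which you should confirm in the page source. To keep the sketch runnable offline, a small dictionary `fake_site` (made up for this example) stands in for the website.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Stand-in for the website: maps URL -> HTML. In real use, replace
# fake_site[current_url] with requests.get(current_url).text.
fake_site = {
    "http://example.test/": '<div class="quote">q1</div>'
                            '<li class="next"><a href="/page/2/">Next</a></li>',
    "http://example.test/page/2/": '<div class="quote">q2</div>',
}

current_url = "http://example.test/"
visited = []
while current_url is not None:
    visited.append(current_url)
    page = BeautifulSoup(fake_site[current_url], "html.parser")
    next_item = page.find("li", class_="next")  # None on the last page
    if next_item is not None and next_item.a is not None:
        # Turn the relative link ("/page/2/") into a full URL
        current_url = urljoin(current_url, next_item.a["href"])
    else:
        current_url = None

print(visited)
```

When scraping a real site this way, keep the `time.sleep` pause between requests, as in the loop above.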

    

    

We did it! Here is what the resulting list of dictionaries looks like if we print it out:

In [None]:
print(len(all_persons_pages))
print(all_persons_pages)

We can save this list to a JSON file like this:

In [None]:
with open('famous_quotes.json','w') as f:
    json.dump(all_persons_pages,f,indent=4)
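To check that the file was written correctly, we can read it back with `json.load`. The sketch below uses a made-up record and a temporary folder so that it does not touch `famous_quotes.json`. Passing `ensure_ascii=False` is optional; it keeps characters such as curly quotation marks readable in the file instead of writing them as `\u201c`-style escapes.

```python
import json
import os
import tempfile

# Made-up record, used only for illustration.
records = [{"name": "Jane Doe", "quote": "“A made-up quotation.”"}]

path = os.path.join(tempfile.mkdtemp(), "demo_quotes.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=4, ensure_ascii=False)

# Read the file back and compare it with what we wrote
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == records)
```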

We can also load it into a `pandas` DataFrame and export it as a CSV file (or, with `to_excel`, as an Excel file).

In [None]:
df = pd.DataFrame(all_persons_pages)
df.to_csv('all_quotes.csv')
df  # as the last line of the cell, this displays the DataFrame
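By default `to_csv` also writes the row index as an unnamed first column. If that column is not wanted, `index=False` leaves it out. The sketch below uses made-up records and writes to an in-memory buffer instead of a file, so that we can look at the CSV text directly.

```python
import io
import pandas as pd

# Made-up records, used only for illustration.
records = [
    {"name": "Jane Doe", "quote": "First made-up quotation."},
    {"name": "John Roe", "quote": "Second made-up quotation."},
]
demo_df = pd.DataFrame(records)

buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)  # index=False: no extra index column
csv_text = buffer.getvalue()
print(csv_text)

# Read the CSV text back in to confirm the structure survived
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.shape)
```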