# Introduction to Web Scraping, Part Two (Python)
- UMN LATIS & Libraries workshop, Nov 15, 2019
- Cody Hennesy (chennesy@umn.edu) and Michael Beckstrand (mjbeckst@umn.edu)

In this part of the workshop, we'll explore reproducible web scraping methods using Python. 

Specifically, we will:
* Use Python 3 in a JupyterLab computing environment
* Use the Requests and BeautifulSoup Python libraries to access HTML data from the web
* Create variables, lists and loops to work with web data in Python
* Store and view HTML data in Pandas dataframe format

Credits: Content for this workshop was adapted from [Rochelle Terman's Web Scraping workshop](https://github.com/rochelleterman/scrape-interwebz) and from [Software Carpentry Python lessons](http://swcarpentry.github.io/python-novice-inflammation/).

### Why Python? 
- Reproducible
- Repeatable
- Extensible
- Great for data access and data cleaning

### What's Jupyter?
- Web-based, easy to share
- Easy to read, easy to run
- Run code piece by piece

## Python variables
- You can use Python as a calculator. 
- To "run" a Jupyter cell, hold down Shift and press Return/Enter, or click the "play" icon (right-facing triangle) in the Jupyter menu above. 

In [None]:
weight_kg = 60

In [None]:
print(weight_kg)

In Python, variable names:

* can include letters, digits, and underscores
* cannot start with a digit
* are case sensitive.

You can do calculations and save text strings in variables too.

In [None]:
website = "All the words on a website"
print(website)
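Calculations work the same way as strings: the result of an expression can be stored in a new variable. A minimal sketch, where `weight_lb` and the 2.2 conversion factor are illustrative additions rather than part of the cells above:

```python
weight_kg = 60

# compute a new value from an existing variable and store the result
# (2.2 is an approximate kilograms-to-pounds factor, used here for illustration)
weight_lb = 2.2 * weight_kg
print('weight in pounds:', weight_lb)
```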

## Importing Libraries
Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too many libraries can sometimes complicate and slow down your programs - so we only import what we need for each program.

Our primary tools will be the [Requests library](http://docs.python-requests.org/en/latest/user/quickstart/)
and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
requests.packages

### Library functions
The expression ```requests.get(...)``` is a function call that asks Python to run the function ```get``` which belongs to the ```requests``` library. 

This dotted notation is used everywhere in Python: the thing that appears before the dot contains the thing that appears after.

As an example, we could use the dot notation to write the relationship between Minneapolis and Minnesota as ```Minnesota.Minneapolis```, just as *get* is a function that belongs to the *requests* library.
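The same dot notation can be tried out with Python's built-in `math` library, which needs no installation. This is only an illustration of the pattern; `math` is not otherwise used in this workshop:

```python
import math

# sqrt is a function that belongs to the math library
root = math.sqrt(16)
print(root)

# pi is a value (not a function) that also lives inside math
print(math.pi)
```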

In [None]:
requests.get('http://www.startribune.com/')

#### What did we do above?
1. Created a Python HTTP request object for a GET
2. Sent the HTTP request to the web server at http://www.startribune.com/
3. Received the response ```[200]``` from http://www.startribune.com/ - [what's that mean?](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
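If you'd rather look up a status code without leaving the notebook, Python's standard library includes a table of them. A small sketch using `http.HTTPStatus`; the particular codes chosen here are just common examples:

```python
from http import HTTPStatus

# look up the short description for a few common status codes
for code in [200, 301, 404, 500]:
    status = HTTPStatus(code)
    print(code, status.phrase)

ok_phrase = HTTPStatus(200).phrase
```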

In Jupyter notebooks using Python you can explore functions of a library using the *tab* key.

And to understand each function you can get information by putting a question mark after it:

In [None]:
requests.get?

You can store the data that is returned from the GET request in a variable:

In [None]:
star_trib = requests.get('http://www.startribune.com/')
print(star_trib)

### Ethical scraping
One way to make sure you're engaging in transparent and ethical scraping practices is to send the website information about *yourself* along with your request. 

```requests.get``` includes a ```headers=``` parameter that you can use to send your name, contact information, and details about the software you're using to collect data:

In [None]:
headers = {'user-agent': 'python-requests/2.22.0; chennesy@umn.edu; Cody Hennesy'}
star_trib = requests.get('http://www.startribune.com/', headers=headers)
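To double-check what would be sent, one option is to build the request without sending it and inspect its headers. This is a sketch that assumes the `requests.Request(...).prepare()` route is acceptable for inspection; the variable name `prepared` is ours:

```python
import requests

headers = {'user-agent': 'python-requests/2.22.0; chennesy@umn.edu; Cody Hennesy'}

# build the request without sending it, so nothing goes over the network
prepared = requests.Request('GET', 'http://www.startribune.com/', headers=headers).prepare()

# header names are case-insensitive, so 'User-Agent' finds our 'user-agent' entry
print(prepared.method, prepared.url)
print(prepared.headers['User-Agent'])
```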

Now you can explore the attributes of the data object stored in ```star_trib``` using the same dot notation. 

Use tab to explore the options, and the question mark to read more about the attribute.

```star_trib.text```, for example.

In [None]:
src = star_trib.text

Let's move the .text content that was returned from the Request into a BeautifulSoup object so we can start to explore the HTML tree.

In [None]:
# parse the response into an HTML tree by calling BeautifulSoup
soup = BeautifulSoup(src, 'lxml')

# look at what it looks like now, using the soup.prettify tool
# [:1000] will give us the first 1000 characters in the soup object so it doesn't fill up the whole screen
print(soup.prettify()[:1000])
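A live homepage is large and changes often, so it can help to practice on a tiny page first. The sketch below uses a made-up HTML string (`sample_html`) in place of a real response, and Python's built-in `'html.parser'` instead of `'lxml'` so that it runs without extra installs:

```python
from bs4 import BeautifulSoup

# a small made-up page, standing in for the text of a real response
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Top story</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""

# 'html.parser' ships with Python; 'lxml' is a separate install
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# dot notation walks the tree to the first <title> and <h1> tags
print(sample_soup.title.text)
print(sample_soup.h1.text)
```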

## Find Elements

BeautifulSoup has a number of functions to find things on a page. Like other webscraping tools, Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

#### HTML tags
Let's search first for **HTML tags**. 

The function `find_all` searches the `soup` tree to find all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?

In [None]:
# find all elements in a certain tag
#soup.find_all("a")

In [None]:
#soup.find_all('p')

In [None]:
#soup.find_all('h3')

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. 

These two lines of code are equivalent:

In [None]:
# soup.find_all("a")
# soup("a")
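To see the equivalence for yourself, here is a sketch on a made-up snippet of HTML (the `sample_html` string and variable names are illustrative):

```python
from bs4 import BeautifulSoup

sample_html = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# the long form and the shortcut return the same tags
long_form = sample_soup.find_all("a")
short_form = sample_soup("a")

print(len(long_form), len(short_form))
print(list(long_form) == list(short_form))
```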

#### HTML Attributes 
If you search for everything with the `a` tag, you're likely to get a lot of stuff, much of which you don't want. What if we wanted to search for HTML tags ONLY with certain attributes, like particular CSS classes? 

We can do this by adding an additional argument to the `find_all` like this: ```soup("a", class_="class_name")```
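A sketch of the class filter on a made-up snippet, where the class names `nav-link` and `story-link` are invented for illustration and are not taken from the Star Tribune site:

```python
from bs4 import BeautifulSoup

sample_html = """
<a class="nav-link" href="/home">Home</a>
<a class="story-link" href="/story-1">Story one</a>
<a class="story-link" href="/story-2">Story two</a>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ has a trailing underscore because class is a reserved word in Python
story_links = sample_soup("a", class_="story-link")
print(len(story_links))
for link in story_links:
    print(link.text)
```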

### Challenge 1: Find the Most Read and Emailed articles on the Star Tribune homepage
Use Chrome's *Inspect* feature to find the class name for the Most Read and Most Emailed articles lists. 
1. Create a variable called most_read, and use ```soup()``` to find all of the links with the appropriate class
2. Print out the matches below


In the example below, we are finding all the `a` tags, and then filtering those with `class_="feed-list-link"`.

In [None]:
# Get only the 'a' tags in the 'feed-list-link' class
most_read = soup("a", class_="feed-list-link")
most_read

#### CSS Selectors
It can be more efficient to search and find things on a website by **CSS selector.** For this we have to use a different method, `select()`. Just pass a string into the `.select()` to get all elements with that string as a valid CSS selector.

In the example above, we can use "a.feed-list-link" as a CSS selector, which returns all `a` tags with class `feed-list-link`.

In [None]:
# get elements with the "a.feed-list-link" CSS selector
most_read_select = soup.select("a.feed-list-link")

In [None]:
#most_read_select
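CSS selectors can also describe where an element sits in the page, not just its class. The sketch below compares a `tag.class` selector with a descendant selector on made-up HTML; the class names are invented for illustration:

```python
from bs4 import BeautifulSoup

sample_html = """
<ul class="stories">
  <li><a class="story-link" href="/story-1">Story one</a></li>
  <li><a class="story-link" href="/story-2">Story two</a></li>
</ul>
<a class="nav-link" href="/home">Home</a>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# tag.class selector: every <a> with class "story-link"
by_class = sample_soup.select("a.story-link")

# descendant selector: every <a> anywhere inside <ul class="stories">
by_parent = sample_soup.select("ul.stories a")

print([link.text for link in by_class])
print([link.text for link in by_parent])
```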

### Python Lists

The most popular kind of data collection in Python is the list, which takes the place of arrays in programming languages like C and Fortran.
Lists have two important characteristics:
1. They are mutable, i.e., they can be changed after they are created.
2. They are heterogeneous, i.e., they can store values of many different types.

To create a new list, you can just put some values in square brackets with commas in between.

In [None]:
my_list = ['red', 'orange', 'yellow']
my_list
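The two characteristics above can be seen in a few lines. In this sketch the replacement values and the `mixed` list are arbitrary examples:

```python
my_list = ['red', 'orange', 'yellow']

# mutable: change an element in place, then add a new one to the end
my_list[0] = 'crimson'
my_list.append('green')
print(my_list)

# heterogeneous: one list can mix strings, numbers, and even other lists
mixed = ['red', 3, 2.5, ['nested', 'list']]
print(mixed)
```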

To fetch the element at a specific location, put the *index* of that location in square brackets.

Let's go back to all of the a links with the class selector ```feed-list-link```.

In [None]:
#this is a Python list
most_read = soup.select("a.feed-list-link")

You can see how many items are in your list using the ```len()``` Python function.

In [None]:
len(most_read)

And you can look at the first element in the list using the syntax variable[0]. 

Note: [0] refers to the first element in a list in Python, and [1] refers to the second.

In [None]:
most_read[0]
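Indexing works the same way on any list. A short sketch on a plain list of strings, which also shows two things not covered above: negative indexes count from the end, and a slice returns a shorter list:

```python
colors = ['red', 'orange', 'yellow']

first = colors[0]      # first element
second = colors[1]     # second element
last = colors[-1]      # negative indexes count back from the end
first_two = colors[:2] # a slice: elements 0 and 1, as a new list

print(first, second, last, first_two)
```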

We can use a built in Python function called type() to explore the results.

In [None]:
# save the first element in the list to its own variable to make it easier to explore
first_link = most_read[0]

# check out its class
type(first_link)

It's a tag! If we look up Tag in the BeautifulSoup documentation, we know that we can use `.text` to look at the text.

In [None]:
first_link.text

We can also look at the href attribute to check out the URL:

In [None]:
first_link['href']
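Square brackets raise a `KeyError` if the tag doesn't have the attribute you ask for. A sketch of two safer ways to look at attributes, using a made-up link (`sample_link`) rather than the live page:

```python
from bs4 import BeautifulSoup

sample_html = '<a class="story-link" href="/story-1">Story one</a>'
sample_link = BeautifulSoup(sample_html, 'html.parser').a

# .attrs is a dictionary of every attribute on the tag
print(sample_link.attrs)

# .get() returns None for a missing attribute instead of raising KeyError
print(sample_link.get('href'))
print(sample_link.get('title'))
```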

### Loops

If we want to explore all of the most popular articles, we can loop through each link and only grab the information that we care about. 

#### Note the syntax: 

```
for x in y:
    do_something # the code in the loop needs to be indented
    do_another_thing
```
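A runnable version of that pattern, using a plain list of strings as a stand-in for scraped links:

```python
colors = ['red', 'orange', 'yellow']
lengths = []

for color in colors:
    # both indented lines run once for each item in the list
    print(color)
    lengths.append(len(color))

# this line is not indented, so it runs once, after the loop finishes
print(lengths)
```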

For 'a' tags, we also know there's an 'href' attribute that tells us where the link URL goes. 
To look at all of a tag's attributes, we can use its `.attrs` attribute, which holds them in a dictionary.

In [None]:
for link in most_read:
    print(link.text, link['href'])

In [None]:
#let's clean that up a bit:
for link in most_read:
    print(link.text.strip(), '\n', link['href'], '\n')
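Printing is fine for a quick look, but to keep the results you can collect them in a list and hand that list to Pandas. This is a sketch that assumes `pandas` is installed; the HTML, the `story-link` class, and the column names `headline` and `url` are made up for illustration:

```python
import pandas as pd
from bs4 import BeautifulSoup

sample_html = """
<a class="story-link" href="/story-1"> Story one </a>
<a class="story-link" href="/story-2"> Story two </a>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# build one dictionary per link, then collect them in a list
rows = []
for link in sample_soup.select("a.story-link"):
    rows.append({'headline': link.text.strip(), 'url': link['href']})

# each dictionary becomes a row; the keys become column names
links_df = pd.DataFrame(rows)
print(links_df)
```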

## Challenge 2: Find all of the a tags
Remember when we used `soup.find_all` to collect all of the links on the Star Tribune homepage? 
Create a variable to collect all of the links on the homepage, and then loop over the list to print the text of each link and the URL.

Hint: the results might look pretty messy! You can use ```.strip()``` to remove whitespace from strings and make the output easier on the eyes.


In [None]:
all_links = soup.find_all("a")
for link in all_links:
    # some <a> tags have no href; .get() returns '' instead of raising a KeyError
    print(link.text.strip(), link.get('href', '').strip())

## Let's look at a specific article page


In [None]:
page = requests.get("http://www.startribune.com/discover-lost-shipwrecks-in-lake-superior/514224542/")

In [None]:
src = page.text

In [None]:
page_soup = BeautifulSoup(src, 'lxml')

Exploring the HTML in Chrome is a great way to find the right selectors or attributes to scrape, but you can also take sneak peeks at common tags using ```.find_all()``` to help pinpoint specific elements. For example: ```.find_all('h1')``` or ```.find_all('p')```

In [None]:
#page_soup.find_all('p')

It looks like the p class ```Text_Body_mag``` would snag the full-text of the article for us:

In [None]:
article_text = page_soup.select("p.Text_Body_mag")

In [None]:
for article in article_text:
    print(article.text)
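If you want the article as a single string rather than printed paragraph by paragraph, one approach is to join the paragraph texts. A sketch on made-up HTML that reuses the `Text_Body_mag` class name; the `caption` class and the sample sentences are invented:

```python
from bs4 import BeautifulSoup

sample_html = """
<p class="Text_Body_mag"> First paragraph of the story. </p>
<p class="Text_Body_mag">Second paragraph of the story.</p>
<p class="caption">A photo caption we do not want.</p>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# keep only the body paragraphs, strip stray whitespace, and join with blank lines
paragraphs = sample_soup.select("p.Text_Body_mag")
full_text = "\n\n".join(p.text.strip() for p in paragraphs)
print(full_text)
```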

## Challenge 3: Scrape the headline, byline, and date
Explore the HTML for a Star Tribune article to see if you can scrape the headline, byline (author), and the date the article was posted from the page.

- Hint: there are several different routes to get to each element. 
- Hint two: The date and bylines can be a little confusing when you use ```.select()``` because they'll return a Python list, even if there's only one item on the list. To show the ```.text``` attribute from a list, you can point to the first item on the list using ```[0]```.

In [None]:
headline = page_soup.h1
headline.text

In [None]:
byline = page_soup.select('div.article-byline')
byline[0].a.text

In [None]:
date = page_soup.select('div.article-dateline')
date[0].text.strip()[:-9]
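When you only expect one match, Beautiful Soup's `select_one()` is an alternative to `select()` followed by `[0]`: it returns the first matching tag directly, or `None` if nothing matches. A sketch on made-up HTML; the headline and author are placeholders:

```python
from bs4 import BeautifulSoup

sample_html = """
<h1>Sample headline</h1>
<div class="article-byline">By <a href="/staff/1">Sample Author</a></div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# select_one returns a single tag, so no [0] is needed
byline = sample_soup.select_one('div.article-byline')
print(byline.a.text)

# nothing on this sample page matches, so the result is None rather than an error
missing = sample_soup.select_one('div.article-dateline')
print(missing)
```

Checking for `None` before using the result avoids an error on pages where the element is absent.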