# Scraping Data from the Web

## Objectives
1. Become familiar with Python modules for working with HTML
2. Use Python to programmatically extract useful information from HTML
3. Use Python containers to convert data to a Pandas DataFrame for further manipulation

## Background
Remember from CS110 how computers communicate with one another over the internet? We won't go into great detail about the 7-layer network model or the HyperText Markup Language (HTML), but the rest of this section provides a brief background along with references to learn more.

When you enter the URL of a web page in an internet browser and press Enter, your computer sends a 'GET' request to the server hosting the page. The server responds to that request, and the most common type of response is an [HTML](https://www.w3schools.com/html/default.asp) file. Your browser (application level) knows how to translate HTML into the text, pictures, buttons, and other items that you are so used to seeing on a web page. In this exercise, we will use a couple of Python modules that make it easy to work with HTML programmatically.

The other, less common, way to receive data from another computer over the internet is through an Application Programming Interface (API). An API, in the general sense, is a set of interfaces for programmers to use for some purpose. For example, Pandas provides a set of functions and objects for working with data in rows and columns, and its [documentation](https://pandas.pydata.org/docs/reference/index.html) is called an API reference. An API used for receiving data from a server is a set of interfaces provided by the service (Twitter, for instance) to automatically request and receive data in a machine-readable format. In these cases, the response is not an HTML file but a semi-structured format like [JavaScript Object Notation](https://www.json.org/json-en.html) (JSON). Using an API to retrieve data from a server is out of scope for this class, but it is important to understand the concept and to know that there may be a better way to get data from a web server than parsing the HTML file.
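
To make the contrast with HTML concrete, here is a minimal sketch of what handling a JSON response can look like. The payload, its field names, and the player values are all made up for illustration; a real API defines its own structure.

```python
import json

# Hypothetical JSON payload; a real API defines its own field names.
response_text = '''
{
  "players": [
    {"name": "Player A", "events": 25, "earnings": 1500000},
    {"name": "Player B", "events": 22, "earnings": 1250000}
  ]
}
'''

data = json.loads(response_text)   # parse the JSON text into Python dicts and lists
players = data["players"]          # a list of dictionaries, one per player

for player in players:
    print(player["name"], player["earnings"])
```

Because the values arrive already labeled and typed, there are no tags to search through, which is why an API is often the easier route when one is available.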

## Exercises
In this Notebook, you will use Python modules to scrape PGA data from [ESPN](http://www.espn.com/golf/statistics/_/year/2018). However, if there is another website that you would like to extract information from then you are free to apply these same concepts to your specific problem.

We have also provided a Python script to give you hints along the way. Please ensure `scraping_solution.py` is saved in the same folder as this Notebook.

### Exercise 0
Read the code below that imports the necessary modules and creates the necessary objects to work with web pages. Discuss with a neighbor to ensure that you understand what each statement does.
* Link to the `requests` documentation: https://requests.readthedocs.io/en/master/
* Link to the `BeautifulSoup` documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.espn.com/golf/statistics/_/year/2018'

r = requests.get(url)  # send an HTTP GET request and capture the response in r
html_doc = r.text  # the body of the response (the HTML) as a string
soup = BeautifulSoup(html_doc, 'html.parser')  # parse the HTML with Python's built-in parser so it is easier to search

pretty_soup = soup.prettify()
print(pretty_soup)

Printing the whole document is not overly helpful. A better way is to browse to the web page and view its source (right-click on the page and select View Source). There you can search for text that you know you want to extract and see which HTML tags enclose it. Looking at the web page, the first player we are interested in is Justin Thomas. Search (Ctrl-F) for Justin Thomas in the HTML source.

You should notice that each data value is enclosed in an HTML `<td>` tag, which marks a table cell. Each table row is enclosed in a `<tr>` tag.
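
As a small illustration of how rows and cells relate, the sketch below parses a made-up table (not the ESPN page) and counts its `<tr>` and `<td>` tags. The names `toy_html`, `toy_soup`, `toy_rows`, and `toy_cells` are chosen here only for the example.

```python
from bs4 import BeautifulSoup

# Made-up table used only for illustration; it is not the ESPN page.
toy_html = """
<table>
  <tr><td>RK</td><td>PLAYER</td></tr>
  <tr><td>1</td><td>Player A</td></tr>
  <tr><td>2</td><td>Player B</td></tr>
</table>
"""

toy_soup = BeautifulSoup(toy_html, "html.parser")

toy_rows = toy_soup.find_all("tr")    # one Tag per table row
toy_cells = toy_soup.find_all("td")   # one Tag per table cell, across all rows

print(len(toy_rows), len(toy_cells))
print(toy_cells[1].text)              # text inside the second cell
```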

### Exercise 1
1. Use the `find_all` method of the `soup` object to find all of the 'td' tags; store the result in `table_tags`
2. Iterate over each `cell` in the tags and print the cell text using the `text` attribute

In [None]:
# Uncomment the next line and run the cell to receive a hint as well as a command to see the solution
# %load -r 20:25 scraping_solution.py
### BEGIN SOLUTION
### END SOLUTION

There are a few things to notice in the output of the table rows and cell values.
* The table heading repeats every ten players
* The results are paginated so there are only 40 players per page

Our goal is to extract the data and populate a Pandas DataFrame. There are two ways to do this: 
1. create a dictionary where the keys are the column labels and the values are lists of each row value, or 
2. create a list of lists for each row in the table and then pass the column labels as the argument to the `columns` parameter of `pd.DataFrame`

We will do the latter, so we need to build a list of values for each row.
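
The two options can be compared side by side on a tiny, made-up data set. This is only a sketch: the labels and values below are invented, and the variable names are not part of the exercise.

```python
import pandas as pd

# Made-up values for illustration only.
labels = ["RK", "PLAYER", "EVENTS"]

# Option 1: dictionary mapping each column label to a list of column values
by_column = {
    "RK": ["1", "2"],
    "PLAYER": ["Player A", "Player B"],
    "EVENTS": ["25", "22"],
}
df_from_dict = pd.DataFrame(by_column)

# Option 2: list of rows, with the labels passed to the `columns` parameter
by_row = [
    ["1", "Player A", "25"],
    ["2", "Player B", "22"],
]
df_from_rows = pd.DataFrame(by_row, columns=labels)

print(df_from_rows)
print(df_from_dict.equals(df_from_rows))   # both routes build the same table
```

The row-oriented form matches how an HTML table is laid out, one `<tr>` at a time, which is why it tends to be the more convenient choice when scraping.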

### Exercise 2
1. Copy your Exercise 1 code to the code cell below, but change the tag to 'tr'
2. In the loop create a list of all the values; name the list `row`

*Hint:* Your list comprehension should look similar to the following (see [Lesson 13](https://learn.zybooks.com/zybook/USAFACS212Spring2020/chapter/13/section/8)):
```
row = [value.text for value in cell]
```
3. Update the `print` statement in the loop to print the list

In [None]:
# Uncomment the next line and run the cell to receive a hint as well as a command to see the solution
# %load -r 30:36 scraping_solution.py
### BEGIN SOLUTION
### END SOLUTION

### Exercise 3
In this exercise, we will create containers to store all of the pertinent information. Also, we will deal with the column labels repeating every ten players.
1. Create an empty list to store the column labels; name it `header`
2. Create an empty list to store table cell values; name it `table_vals`
3. In the loop from Exercise 2:
   * Populate the `header` list on the first occurrence of the column labels; ignore the column labels on each subsequent occurrence
   * Append to `table_vals` each `row` list
4. Convert `table_vals` to a `pandas.DataFrame` and specify `columns = header`; name the DataFrame `pga`
5. Inspect the DataFrame
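
The "keep the first heading, skip the repeats" logic can be sketched on plain lists before applying it to real rows. The rows below are made up, and the sketch assumes the very first row is the heading, which may not hold on the actual page.

```python
# Made-up rows standing in for the lists produced by the loop in Exercise 2.
toy_rows = [
    ["RK", "PLAYER"],
    ["1", "Player A"],
    ["2", "Player B"],
    ["RK", "PLAYER"],      # repeated heading row
    ["3", "Player C"],
]

toy_header = []
toy_vals = []
for toy_row in toy_rows:
    if not toy_header:
        toy_header = toy_row          # first heading row: keep it as the labels
    elif toy_row == toy_header:
        continue                      # later heading rows: skip them
    else:
        toy_vals.append(toy_row)      # data row: keep it

print(toy_header)
print(toy_vals)
```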

In [None]:
# Uncomment the next line and run the cell to receive a hint as well as a command to see the solution
# %load -r 42:48 scraping_solution.py
### BEGIN SOLUTION
### END SOLUTION

Apart from every column being stored as a non-numeric type, this is what we want. However, it only contains the first 40 golfers out of 559. On the web page, click to advance to the second page of players (bottom right of the table). How does the URL change?
```
http://www.espn.com/golf/statistics/_/year/2018/count/41
```
Each subsequent page's URL ends with '/count/##', where '##' is the rank of the player in the first row of that page's table. To extract the information from every page, we need to nest every statement, from the GET request through appending `row` to `table_vals`, inside a loop.
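
If the pattern holds for every page, the URLs can be generated rather than typed. The sketch below assumes 40 players per page and 559 players in total, as stated above; the variable names are illustrative.

```python
base_url = "http://www.espn.com/golf/statistics/_/year/2018"
players_per_page = 40
total_players = 559

# Rank of the first row on pages 2 onward: 41, 81, 121, ...
first_ranks = range(players_per_page + 1, total_players + 1, players_per_page)
page_urls = [f"{base_url}/count/{rank}" for rank in first_ranks]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```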

How do we know when we have iterated through every page? There are two different ways to try to iterate through each page:
1. The number of total golfers is listed in the footer of the table which means we can determine the URL of the last page; you would have to inspect the source to determine its HTML tag to access that information programmatically
2. Use the `try-except` framework that we learned in Lessons 19 & 20. This works well if requesting a page beyond the last one raises an exception, but that is not guaranteed: try entering '1000' in place of '##'. What happens? Browse the `requests` documentation to see what happens when a web page is not found or another exception occurs.
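
One detail worth knowing about `requests`: a "not found" status does not raise an exception on its own. `requests.get` raises for problems such as a failed connection, while an HTTP error status only becomes an exception when you call `raise_for_status()` on the response. The sketch below builds a response object by hand so that it runs without a network connection; in practice the object would come from `requests.get`.

```python
import requests

# Build a response by hand so the example runs without a network connection.
# In practice this object would be returned by requests.get(...).
fake_response = requests.Response()
fake_response.status_code = 404

caught = False
try:
    fake_response.raise_for_status()   # raises HTTPError for 4xx and 5xx status codes
except requests.exceptions.HTTPError as err:
    caught = True
    print("Request failed:", err)
```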

We will use the second approach where we stop iterating if an exception occurs or if we see the row with Justin Thomas a second time.

### Exercise 4
1. Initialize a Boolean variable, `page = True`; we will use this to keep iterating until an exception occurs or we return to the first page of the table
2. Initialize a variable, `page_num = 41`; we will use this to concatenate to the root of the URL
3. In a `while` loop, use a `try` block to:
   * Set up, send, and capture the HTTP GET request
   * Convert it to a BeautifulSoup object
   * Use similar logic to the previous exercises to:
     * If the `row` values are the same as the first player (back to the beginning), then change `page = False`
     * Otherwise, extract the `row` values and append them to `table_vals`
4. At the end of the `while` loop, use an `except` block to change `page = False` in case an exception occurs
5. Convert to a DataFrame and inspect
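
The control flow of the steps above can be sketched without touching the network. Everything here is a stand-in: `toy_pages` is made-up data, and `fetch_rows` is a hypothetical function that plays the role of the GET request plus HTML parsing. It returns the first page for any unknown page number, an assumption that mimics a site which never reports a missing page.

```python
# Made-up pages keyed by the rank in their first row.
toy_pages = {
    1: [["1", "Player A"], ["2", "Player B"]],
    3: [["3", "Player C"], ["4", "Player D"]],
}

def fetch_rows(start):
    """Hypothetical stand-in for the GET request plus HTML parsing."""
    return toy_pages.get(start, toy_pages[1])

collected = list(fetch_rows(1))   # rows from the first page
first_row = collected[0]
more_pages = True
start = 3
while more_pages:
    try:
        for row in fetch_rows(start):
            if row == first_row:
                more_pages = False    # wrapped around to the first page
                break
            collected.append(row)
        start += 2                    # two players per toy page
    except Exception:
        more_pages = False            # any failure also ends the loop

print(collected)
```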

In [None]:
# Uncomment the next line and run the cell to receive a hint as well as a command to see the solution
# %load -r 76:82 scraping_solution.py
### BEGIN SOLUTION
### END SOLUTION

At this point, you could begin to manipulate and analyze the DataFrame as we have previously. You could also save the data locally in a flat file, such as a comma-separated values (CSV) file:
```
pga.to_csv('PGA Stats.csv', index = False)
```

If there is time, edit your code to use `try-except` blocks to replace commas and dollar signs, and convert columns to integers. If an exception occurs, then you can simply add the value without converting it to an integer. To do this, we will not be able to use a list comprehension.
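
A minimal sketch of that conversion, applied to one made-up row (the values are invented, not taken from the ESPN table):

```python
# Made-up row for illustration only.
sample_row = ["1", "Player A", "$8,694,821", "69.5"]

converted = []
for value in sample_row:
    cleaned = value.replace(",", "").replace("$", "")
    try:
        converted.append(int(cleaned))   # keep the value as an integer when possible
    except ValueError:
        converted.append(value)          # otherwise keep the original string

print(converted)
```

Note that a decimal such as "69.5" is left as a string here, because `int` rejects it; converting those columns would need a separate step.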