# A web scraping demonstration using Python

We will be using *Jupyter Notebooks* to learn Python code in this class. Jupyter Notebooks should be installed by downloading Anaconda with Python 3.7 from the following link:
https://www.anaconda.com/products/individual.

Once Anaconda is installed, open the Anaconda Navigator and launch a Jupyter Notebook.
 
For additional Python practice see: https://docs.python.org/3/tutorial/,
starting with "An informal introduction to Python".

### Python print function

The Python print function, in its simplest form, is
```python
print(x)
```

which prints the value of *x*.

In [None]:
print('hello world')

We can also print the values of multiple objects, using
```python
print(x, y, ...)
```

In [None]:
print('hello', 'world')

### A variable stores a value

The statement 
```python
x = val
```

assigns the value *val* to the variable *x*.

In [None]:
x = 1
print('The value of x is', x)

In [None]:
x = x + 5
print('The value of x is', x)
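
As a small sketch of two shorthands that are not used elsewhere in this notebook: the augmented assignment `x += 5` is equivalent to `x = x + 5`, and an f-string places the value of a variable directly inside a string. The name *message* is only for illustration.

```python
x = 1
x += 5                                # same as x = x + 5
message = f'The value of x is {x}'    # {x} is replaced by the value of x
print(message)
```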

### A list is used to store a sequential collection of values

A list is specified using the notation

```python
listName = [item1, item2, ...]
```

In [None]:
x = ['one', 'two']
x

The *append* method adds an object to the end of the list.

In [None]:
# add the string 'three' to the list
x.append('three')
x
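
Individual items can be read back from a list by position. The sketch below uses an illustrative list called *numbers* (not a name from this notebook) to show indexing and the built-in *len* function.

```python
numbers = ['one', 'two', 'three']

first = numbers[0]      # indexing starts at 0
last = numbers[-1]      # negative indices count from the end
count = len(numbers)    # number of items in the list

print(first, last, count)
```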

### The *split* method splits a string into multiple pieces

```python
x.split(sep)
```
splits the string *x* at each occurrence of *sep* and returns a list of the resulting pieces; the separator itself is not included in the pieces.
If *sep* is not specified, the string is split on runs of whitespace (blank spaces, newlines, and tabs).

In [None]:
sentence = 'This is a sentence'
sentence.split('a')

In [None]:
sentence.split()
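
Two related string operations may be useful alongside *split*; this is a minimal sketch, and the variable names are illustrative. An optional second argument limits the number of splits, and the *join* method is the reverse operation, gluing a list of strings back together.

```python
sentence = 'This is a sentence'

words = sentence.split()               # split on whitespace
head, rest = sentence.split(' ', 1)    # split at the first space only
rejoined = '-'.join(words)             # join the pieces with a dash

print(words)
print(head, '|', rest)
print(rejoined)
```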

### A *for* loop is used to iterate over each element in a list

A for loop iterates over each item in a sequence (which includes lists and tuples, among others).

The syntax is

```python
for item in sequence :
    # do something with the item
```

In [None]:
words = ['hello', 'world', 'python']

print('The words are:')
for w in words :
    print(' ', w)
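
When the position of each item is also needed, the built-in *enumerate* function can supply it. The sketch below is one possible variation on the loop above; the list *lengths* is introduced only for illustration.

```python
words = ['hello', 'world', 'python']

lengths = []
for position, w in enumerate(words, start=1):
    print(position, w)        # position counts from 1 because of start=1
    lengths.append(len(w))    # collect the length of each word

print(lengths)
```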

## Webscraping using BeautifulSoup

We will use the Python module BeautifulSoup 4 (*bs4*) for web scraping in this class. More information and some examples can be found at the following site: https://beautiful-soup-4.readthedocs.io/en/latest/

We will scrape pages from Wikipedia, starting with https://en.wikipedia.org/wiki/Tom_Hanks

The robots file indicates that this is allowed: https://en.wikipedia.org/robots.txt
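
A robots file can also be checked programmatically with the standard-library module *urllib.robotparser*. The sketch below parses a short, made-up set of rules (these are not Wikipedia's actual rules, and the example.org URLs are placeholders) so that it runs without a network request.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(user_agent, url) reports whether the rules permit the request
allowed = parser.can_fetch('ASDD Web Scraper', 'https://example.org/wiki/Page')
blocked = parser.can_fetch('ASDD Web Scraper', 'https://example.org/private/data')

print(allowed, blocked)
```

To check a live site instead, the parser's *set_url* and *read* methods can download the real robots file before *can_fetch* is called.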

### Retrieve the web page using the *requests* module

We first need to *get* the webpage, which we do using the *requests* module. It is good practice to create a *user-agent*, which identifies our web scraper.

After making the request, a status of 200 indicates that the request was successful. Displaying *page* in the notebook shows this status, e.g. `<Response [200]>`.

In [None]:
from bs4 import BeautifulSoup
import requests
import time

headers = {'user-agent': 'ASDD Web Scraper'}

# get the page
url = "https://en.wikipedia.org/wiki/Tom_Hanks"
page = requests.get(url, headers = headers)

page
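
The status can also be checked in code rather than by eye. The following is a hedged sketch: so that it runs without network access, it builds a stand-in *Response* object by hand with a 404 status, whereas in the notebook *page* would come from *requests.get*.

```python
import requests

# Stand-in response built by hand (an assumption for this example only);
# normally: page = requests.get(url, headers=headers, timeout=10)
page = requests.Response()
page.status_code = 404

print(page.status_code)   # the numeric HTTP status
print(page.ok)            # False for 4xx and 5xx statuses

try:
    page.raise_for_status()   # raises HTTPError for 4xx and 5xx statuses
    failed = False
except requests.exceptions.HTTPError:
    failed = True

print(failed)
```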

The content of the page (the HTML) is in *page.content*. This is not meant to be easily readable, but we will use *BeautifulSoup* to parse it.

In [None]:
page.content

We parse the page (i.e., create the soup) by using the BeautifulSoup function, which returns a BeautifulSoup object.

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')

We can view the BeautifulSoup object by printing it; we can also use the prettify method for pretty printing (which may change the indentation).

In [None]:
print(soup.prettify())

The BeautifulSoup object stores the web page elements in a tree, which can be navigated and searched. Elements of this tree consist of strings or bs4 objects.

First, we can specify a tag name to get the first element of that type.

The code below gets the title of the page.

In [None]:
soup.title

For any element, we can extract the text of the element using the *text* attribute.

In [None]:
soup.title.text

### Searching for elements by tag name, class name, etc.

The two searching functions are as follows:
- *element.find* will return the first occurrence of the specified element (or None if there is no match)
- *element.find_all* will return a list with all occurrences of the specified elements

Print out the text of the (first) element with an id of 'firstHeading':

In [None]:
soup.find(id = 'firstHeading').text

Let's print out the text of all *h2* elements:

In [None]:
h2_list = soup.find_all('h2')
for h2 in h2_list :
    print(h2.text)
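
To see the difference between the two search functions in isolation, the sketch below runs them on a tiny, made-up page (the HTML and the name *demo* are assumptions for illustration, not part of the Wikipedia page). It also shows that *find* returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, so the example runs without a network request
html = """
<html><body>
  <h1 id="firstHeading">Sample page</h1>
  <h2>Early life</h2>
  <h2>Career</h2>
</body></html>
"""
demo = BeautifulSoup(html, 'html.parser')

first_h2 = demo.find('h2')                          # first match only
all_h2 = [h2.text for h2 in demo.find_all('h2')]    # text of every match
missing = demo.find('h3')                           # no match gives None

print(first_h2.text)
print(all_h2)
print(missing)
```

Because a failed *find* returns None, accessing *.text* on the result raises an AttributeError; checking for None first avoids this.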

The statement below indicates how we can get elements with a particular *class*.

In [None]:
age = soup.find('span', {'class': 'ForceAgeToShow'})
age
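
BeautifulSoup offers more than one way to search by class. The sketch below, run on a made-up snippet of HTML, compares the dictionary form used above with the *class_* keyword argument and the CSS selector method *select_one*; all three find the same element here.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the element has two classes
html = '<p>Born <span class="noprint ForceAgeToShow">(age 30)</span></p>'
demo = BeautifulSoup(html, 'html.parser')

by_dict = demo.find('span', {'class': 'ForceAgeToShow'})   # attribute dictionary
by_keyword = demo.find('span', class_='ForceAgeToShow')    # class_ keyword
by_css = demo.select_one('span.ForceAgeToShow')            # CSS selector

print(by_dict.text, by_keyword.text, by_css.text)
```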

We will want to extract the age from this element, which we accomplish using the function below.

In [None]:
def formatAge(x) :
    """Take a string of the form '(age #)' and return # as an integer."""
    x = x.split()[1]
    x = x.replace(')','')
    return int(x)

In [None]:
formatAge(age.text)
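
An alternative, shown here only as a sketch, is to pull the digits out with a regular expression. The hypothetical helper *parse_age* below is not part of the notebook; it returns None instead of raising an error when the text contains no number.

```python
import re

def parse_age(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

print(parse_age('(age 67)'))
print(parse_age('(age\xa067)'))   # non-breaking space instead of a plain space
print(parse_age('unknown'))
```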

### Exercise

Can you output Tom Hanks's full name (from the table on the right)?

### Let's compare the current ages of recent Grammy Award winners for Album of the Year

First we create a list of URLs we want to scrape. Grammy Award winners are available here:

https://en.wikipedia.org/wiki/Grammy_Award_for_Album_of_the_Year

We also create empty lists for the names and ages of each winner.

In [None]:
urls = ['https://en.wikipedia.org/wiki/Billie_Eilish',
        'https://en.wikipedia.org/wiki/Kacey_Musgraves',
        'https://en.wikipedia.org/wiki/Bruno_Mars',
        'https://en.wikipedia.org/wiki/Adele',
        'https://en.wikipedia.org/wiki/Taylor_Swift'
       ]

nameList = []
ageList = []

Now we use a *for* loop to iterate through each URL, and for each page, extract the name and age of the winner, and add these to the nameList and ageList. When scraping multiple pages, it is good practice to *sleep* so you do not overburden the web server. This is accomplished using the *time* module; after parsing each page, we sleep for 1 second before proceeding.

In [None]:
import time

# for each url
for url in urls :
    
    # make a get request to retrieve the page
    page = requests.get(url, headers = headers)

    # parse the page using BeautifulSoup
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # extract the name, and add it to the list
    name = soup.find(id = 'firstHeading').text
    nameList.append(name)
    
    # extract the age, and add it to the list
    age = soup.find(True, {'class': 'ForceAgeToShow'})
    ageList.append(formatAge(age.text))
    
    # sleep for 1 second
    time.sleep(1)
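
The loop above assumes every page has both a heading and an age element; if a page lacks the age element, *soup.find* returns None and *age.text* raises an AttributeError. One possible way to guard against this, sketched below with a hypothetical helper *extract_name_and_age* and made-up HTML in place of *page.content*, is to separate the parsing step into a function that returns None for anything it cannot find.

```python
from bs4 import BeautifulSoup

def extract_name_and_age(html):
    """Return (name, age) from page HTML; either may be None if not found."""
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.find(id='firstHeading')
    age_tag = soup.find(True, {'class': 'ForceAgeToShow'})

    name = heading.text if heading else None
    # keep only the digit characters of the age text, if the element exists
    digits = ''.join(ch for ch in age_tag.text if ch.isdigit()) if age_tag else ''
    age = int(digits) if digits else None
    return name, age

# Made-up HTML standing in for page.content
with_age = '<h1 id="firstHeading">Sample Artist</h1><span class="ForceAgeToShow">(age 30)</span>'
without_age = '<h1 id="firstHeading">Sample Band</h1>'

print(extract_name_and_age(with_age))
print(extract_name_and_age(without_age))
```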

Check the *nameList*:

In [None]:
nameList

Check the *ageList*:

In [None]:
ageList

Next we create a *data frame* using the *pandas* module. The data frame is created by first specifying a Python dictionary (not discussed) that has the form

```python
{'columnName1': list_of_values1, 'columnName2': list_of_values2, ...}
```
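
For readers who have not seen dictionaries before, here is a brief sketch of how one behaves: each key (here, a column name) maps to a value (here, a list). The values below are placeholders for illustration, not scraped data.

```python
# Illustrative values only, not scraped data
table = {'Name': ['A', 'B'], 'Age': [30, 40]}

names = table['Name']           # look up a value by its key
table['City'] = ['X', 'Y']      # add a new key and value
columns = list(table.keys())    # keys, in the order they were added

print(names)
print(columns)
```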

In [None]:
import pandas as pd
df = pd.DataFrame({'Name': nameList, 'Age': ageList})
df

Once we have a data frame, we can create a bar graph of our results. (Note: you may need to run this cell twice for the graph to show.)

In [None]:
ax = df.plot.bar(x = 'Name', y = 'Age', rot = 45, legend = False)
ax.set(xlabel = 'Grammy Winner', ylabel = 'Age', 
       title = 'Current ages of Grammy winners for Album of the Year')
None

### Exercises
1. Add more Grammy winners (or other individuals) to the list.
2. For each artist, extract and output the artist's birthplace, using the format *Taylor Swift's birthplace is West Reading, Pennsylvania, U.S.*
