## Intro to Web Scraping in Python

Today we will learn how to scrape HTML web pages in Python using the Beautiful Soup 4 library, which lets us programmatically gather information from websites for our own purposes. We will gather information about prominent UVA employees, first by walking through each step of the process and then by doing it in a more automated way.

First we need to install the Beautiful Soup 4 library. It is not part of base Python, and depending on your Anaconda or Miniconda setup it may not be installed yet.

In [None]:
!conda install --yes beautifulsoup4

Next, let's import the libraries we will be using in this Jupyter Notebook, including Beautiful Soup 4. The other two, requests and pandas, are already installed if you are using Anaconda.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


Now we are ready to make an HTTP request using the requests library. Once a connection is established with the server that hosts the website, the client (you) sends an HTTP GET request asking the server for the page, and the server sends back a response containing the page's HTML.

This is typically done by your web browser, but we can also do it in Python.

Notice that we are looking at the publicly available employee information for Tony Bennett, UVA's men's basketball coach. All Virginia public employees' salary (and other) information is publicly available thanks to the Richmond Times-Dispatch.

In [None]:
# This is how to make an HTTP GET request using the requests library.
# The URL has no placeholders, so a plain string is enough (no f-string needed).
source = requests.get('https://data.richmond.com/salaries/2018/state/university-of-virginia/tony-bennett')

print(source)      #this prints the type of response. 200 means "OK". There are many response codes
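
The number in the response is an HTTP status code. As a small illustration of what a few common codes mean, here is a sketch that uses only Python's standard library; the helper name `describe_status` is made up for this example. With requests, the same number is available as `source.status_code`, and `source.raise_for_status()` raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the status code together with its standard reason phrase."""
    # describe_status is a hypothetical helper, not part of requests.
    return f"{code} {HTTPStatus(code).phrase}"


# A few codes you are likely to meet while scraping.
for code in (200, 301, 403, 404, 500):
    print(describe_status(code))
```

Checking the status before parsing helps avoid running Beautiful Soup on an error page by mistake.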

Now we have a response from the server. If the response is good, the source code of the web page is contained within that response. Let's see what that looks like.

In [None]:
soup = BeautifulSoup(source.text, 'html.parser')

print(soup) 
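
To see what the parsed object offers beyond the raw text, here is a minimal sketch on a tiny made-up page, so it runs without a network connection. The HTML, the names in it, and the variable names (`sample_html`, `sample_soup`) are all assumptions for illustration and have nothing to do with the real salary page.

```python
from bs4 import BeautifulSoup

# A made-up page, used only to illustrate the parsed object.
sample_html = """
<html>
  <head><title>Example Salary Page</title></head>
  <body>
    <h1>Jane Doe</h1>
    <p class="note">Hypothetical employee record.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.text                # text inside <title>
heading = sample_soup.h1.text                      # text inside the first <h1>
note = sample_soup.find("p", class_="note").text   # first <p> with class "note"

print(page_title)
print(heading)
print(note)
```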


This is messy, but it is all the code for the page we have issued a request for. Some of it is human readable, some of it is not. Now, let's look at the source code of this page another way. Copy and paste this link into your web browser: https://data.richmond.com/salaries/2018/state/university-of-virginia/tony-bennett

Right-click somewhere on the page and choose "view page source", then right-click again and choose "inspect elements". While inspecting the page elements, you can see which parts of the page are controlled by which parts of the code. Notice that the code starts with large chunks (`<body>`, for example) and has divisions within them (`<div>` tags), among others.

The div with the class "container" looks like it contains the majority of the information in the body of the page. Let's start with this. We are going to scrape some information about Tony Bennett from this page: specifically, his salary and job title.

In [None]:
# the prettify() function makes the code somewhat more readable.
# I don't use this feature much but maybe you will appreciate it.
print(soup.prettify())

The find() function returns the first item matching the given criteria. Notice that our arguments are, first, the HTML tag and, second, the class of that tag.

In [None]:
container = soup.find("div", class_='container')    #class_, because 'class' is a reserved word in python
print(container)

However, I know there are actually three div elements with the class 'container' on this page. So let's use the find_all() function to get the information from all of them.

In [None]:
container = soup.find_all('div', class_='container')
print(container)
print()
print(len(container))  #container is just a list! See there are 3 items in this list
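
The difference between find() and find_all() is easier to see on a small example. The sketch below uses made-up HTML (the `sample_html` string is an assumption, not the real page) with three matching div elements.

```python
from bs4 import BeautifulSoup

# Made-up HTML with three "container" divs and one unrelated div.
sample_html = """
<div class="container">first</div>
<div class="container">second</div>
<div class="container">third</div>
<div class="footer">not a container</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_match = sample_soup.find("div", class_="container")      # a single tag
all_matches = sample_soup.find_all("div", class_="container")  # a list-like ResultSet

print(first_match.text)                    # first
print(len(all_matches))                    # 3
print([tag.text for tag in all_matches])   # ['first', 'second', 'third']
print(all_matches[1].text)                 # second
```

Because the result of find_all() behaves like a list, ordinary indexing such as `all_matches[1]` picks out one specific match.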



The second item in the container list (index 1) is the one that contains the information we want, so let's reassign container to that item.

Then, the next div under that has the class 'row col-12'.

In [None]:
container = container[1]

row_12 = container.find_all('div', class_='row col-12')

print(row_12)
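
One detail worth knowing when a tag has several classes: passing a string with a space, such as `class_='row col-12'`, matches the class attribute only when it is exactly that string. The sketch below, on made-up HTML, compares this with a CSS selector via `select()`, which matches any tag that has both classes regardless of order or extra classes.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the same two classes in different arrangements.
sample_html = """
<div class="row col-12">exact order</div>
<div class="col-12 row">reversed order</div>
<div class="row col-12 extra">extra class</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Matches only when the class attribute is exactly "row col-12".
exact = sample_soup.find_all("div", class_="row col-12")

# Matches every div that carries both classes, in any order.
by_selector = sample_soup.select("div.row.col-12")

print([tag.text for tag in exact])
print([tag.text for tag in by_selector])
```

If a search for a multi-class string unexpectedly returns an empty list, switching to `select()` is one option to try.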



So as you see, you can step through each tag on your way to the information you need. In order to speed up this process I will take some shortcuts to get to Tony Bennett's job title and salary.

You don't have to step through each HTML tag to get to what you want. You can identify the tag you need and go straight to it.

Below you see how I go straight from the "container" div to the individual element holding his job title. For his salary I find an 'h2' tag with the class 'pay', which contains this information. Also, note that I am using the find() function because there is only one instance of each of these specific tag and class combinations.

In [None]:
job_title = container.find('span', class_='small text-muted')

salary = container.find('h2', class_='pay')

print(job_title.text)
print(salary.text)
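
Keep in mind that find() returns `None` when nothing matches, and calling `.text` on `None` raises an `AttributeError`. A minimal sketch of a guard follows; the helper `text_or_default` and the tiny HTML snippet (including the "$1" value) are made up for illustration.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, tag_name, class_name, default="N/A"):
    """Return the stripped text of the first matching tag, or a default."""
    # text_or_default is a hypothetical helper, not part of Beautiful Soup.
    tag = parent.find(tag_name, class_=class_name)
    if tag is None:
        return default
    return tag.get_text(strip=True)


# Made-up snippet: it has an h2 with class "pay" but no matching span.
sample_soup = BeautifulSoup('<div><h2 class="pay"> $1 </h2></div>', "html.parser")

found = text_or_default(sample_soup, "h2", "pay")
missing = text_or_default(sample_soup, "span", "small text-muted")

print(found)    # $1
print(missing)  # N/A
```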

Now let's do a more programmatic example. Every UVA employee has a page with basically the same URL, except the person's name is different. 

In [None]:
#let's do a little more interesting example
names = ['Tony Bennett', 'James E Ryan', 'Bronco Mendenhall', 'Carla Williams', 'Scott C Beardsley', 'Craig Benson', 
        'Ian Baucom']

formatted_names_of_important_people = []

#start with a little string formatting. I am formatting each name so I can insert it into the URL
for important_person in names:
    important_person = important_person.replace(' ', '-')
    important_person = important_person.lower()
    formatted_names_of_important_people.append(important_person)

print(formatted_names_of_important_people)
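
The formatting loop above could also be wrapped in a small reusable function. This is just one possible sketch: `to_slug`, `build_url`, and `BASE_URL` are names invented here, and unlike `replace(' ', '-')` the `split()` call also collapses repeated spaces.

```python
BASE_URL = "https://data.richmond.com/salaries/2018/state/university-of-virginia"


def to_slug(full_name: str) -> str:
    """Lowercase a name and join its parts with hyphens."""
    return "-".join(full_name.lower().split())


def build_url(full_name: str) -> str:
    """Build the salary page URL for one employee name."""
    return f"{BASE_URL}/{to_slug(full_name)}"


names = ["Tony Bennett", "James E Ryan"]
slugs = [to_slug(name) for name in names]

print(slugs)                      # ['tony-bennett', 'james-e-ryan']
print(build_url("Tony Bennett"))
```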

In [None]:
#now I will use f-string formatting to insert each name into the source URL
for name in formatted_names_of_important_people:
    
    source = requests.get(f'https://data.richmond.com/salaries/2018/state/university-of-virginia/{name}')
    soup = BeautifulSoup(source.text, 'html.parser')

    main_box = soup.find("div", class_='pay')
    salary = main_box.find('h2').text
    
    print(name, salary)



In [None]:
#but let's make this look a little nicer

for name in formatted_names_of_important_people:
    
    source = requests.get(f'https://data.richmond.com/salaries/2018/state/university-of-virginia/{name}') 
    soup = BeautifulSoup(source.text, 'html.parser')
    
    main_box = soup.find("div", class_='pay')
    salary = main_box.find('h2').text

    main_box = soup.find("div", class_='col-12 col-lg-8')
    span_class = main_box.find_all("span")
    job_title = span_class[1].text
        
    print(f'Name = {name}')
    print(f'Job Title = {job_title}')
    print(f'Salary = {salary}')     
    print()
    
    

In [None]:
#A common thing you might do is take this data and use it for your own purposes elsewhere.
#Let's take this data we have scraped and put it into a pandas DataFrame

data = []

for name in formatted_names_of_important_people:
    
    source = requests.get(f'https://data.richmond.com/salaries/2018/state/university-of-virginia/{name}')
    soup = BeautifulSoup(source.text, 'html.parser')

    main_box = soup.find("div", class_='pay')
    salary = main_box.find('h2').text
    
    # collect one (name, salary) tuple per employee
    data.append((name, salary))
    
print(data)
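
Note that the scraped salaries are still strings. If you want to do arithmetic on them, they need to be converted to numbers first. The sketch below assumes the text looks like "$1,234,567" (a dollar sign and comma separators); that format is an assumption, so check it against what `print(data)` actually shows. The helper name `parse_salary` is made up.

```python
def parse_salary(salary_text: str) -> float:
    """Convert text such as '$1,234,567' to a float.

    Assumes a leading dollar sign and comma thousands separators.
    """
    cleaned = salary_text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


# Placeholder values, not real salaries.
print(parse_salary("$1,234,567"))     # 1234567.0
print(parse_salary(" $50,000.50 "))   # 50000.5
```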

Let's put this into a pandas DataFrame!

## Pandas ##

Pandas is an open source Python library providing high-performance, easy-to-use data structures. It is common to store data scraped from the web in a pandas DataFrame.

A pandas DataFrame is a two-dimensional data structure with rows and columns (like a spreadsheet).

In [None]:

df = pd.DataFrame(data)
print(df)

Now let's rename the columns in the pandas DataFrame to something more descriptive.

In [None]:

df.columns = ['Name', 'Salary']
print(df)
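
As a side note, the column names can also be passed when the DataFrame is created, and the result can be exported as CSV for use elsewhere. The sketch below uses placeholder rows in the same (name, salary) shape as `data`; the names and amounts are made up, not scraped values.

```python
import pandas as pd

# Placeholder rows in the same (name, salary) shape as `data`; values are made up.
sample_data = [("example-person-a", "$100,000"), ("example-person-b", "$85,500")]

# Name the columns in the same step that builds the DataFrame.
sample_df = pd.DataFrame(sample_data, columns=["Name", "Salary"])
print(sample_df)

# With no file path, to_csv() returns the CSV text as a string.
csv_text = sample_df.to_csv(index=False)
print(csv_text)
```

Passing a file name, for example `sample_df.to_csv('salaries.csv', index=False)`, writes the same text to disk instead (the file name here is only an example).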
