## Intro to Web Scraping in Python

Today we will learn how to scrape HTML web pages in Python using the Beautiful Soup 4 library. Scraping lets us programmatically gather information from websites for our own purposes. We will gather information about prominent UVA employees, first by walking through each step of the process and then by doing it in a more automated way.

First we need to make sure the Beautiful Soup 4, lxml, and requests libraries are installed. They are not part of base Python, and running the commands below is harmless if your Anaconda installation already includes them.

In [None]:
!conda install --yes beautifulsoup4
!conda install --yes lxml
!conda install --yes requests

Next, let's import the libraries we will use in this Jupyter Notebook, including Beautiful Soup 4. The other two, requests and pandas, usually come with Anaconda.

In [9]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


Now we are ready to make an HTTP request using the requests library. Once a connection is established with the destination (the server that hosts the website you want to communicate with), the client (you) sends an HTTP GET request asking the server for the page, and the server sends back a response containing it.

This is typically done by your web browser, but we can also do it in Python.
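
To make the idea of a GET request concrete, here is a minimal sketch that builds the request for the roster page without sending it, so it runs offline. It only inspects the request that requests would send; the variable names `request` and `prepared` are illustrative.

```python
import requests

# Build (but do not send) a GET request for the roster page used in this tutorial.
request = requests.Request("GET", "https://virginiasports.com/sports/mbball/roster/")
prepared = request.prepare()

print(prepared.method)    # the HTTP verb the client will use
print(prepared.url)       # the full address the request is addressed to
print(prepared.path_url)  # the path portion that is asked of the server
```

Calling `requests.get(url)` does this preparation for you and then sends the request over the network.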

Today we are going to scrape information about players on the UVA men's basketball team. You can find the team's roster for the 2021-2022 season here: https://virginiasports.com/sports/mbball/roster/ 

We will start by gathering information about one of UVA's players, Kihei Clark.

In [10]:
# Make an HTTP GET request using the requests library.
source = requests.get('https://virginiasports.com/sports/mbball/roster/season/2021-22/player/kihei-clark/')

print(source)      # prints the response object with its status code; 200 means "OK"

<Response [200]>
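
Since there are many response codes, it can help to look a few up. The sketch below uses only Python's standard library to print the standard phrase for some common codes. The helper `is_success` is a hypothetical name introduced here for illustration.

```python
from http import HTTPStatus

# Look up the standard reason phrase for a few common status codes.
for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase)

def is_success(code):
    """Return True for the 2xx family of status codes, which signal success."""
    return 200 <= code < 300

print(is_success(200))
print(is_success(404))
```

With a real response, the numeric code is available as `source.status_code`, and `source.raise_for_status()` raises an exception for 4xx and 5xx responses.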


Now we have a response from the server. If the request succeeded, the source code of the web page is contained within that response. Let's see what that looks like.

In [None]:
soup = BeautifulSoup(source.text, 'html.parser')

print(soup) 


This is messy, but it is all the code for the page we have issued a request for. Some of it is human readable, some of it is not. Now, let's look at the source code of this page another way. Copy and paste this link into your web browser: https://virginiasports.com/sports/mbball/roster/season/2021-22/player/kihei-clark/

### Note - These steps assume Google Chrome; other modern browsers offer similar developer tools

Right-click somewhere on the page and click "Inspect". Then choose the "Elements" tab to see the HTML source code of this page. While inspecting the page elements, you can see which parts of the page are controlled by which parts of the code. Notice that the code starts with large chunks (the `<body>` tag, for example) and has divisions within them (`<div>` tags), among others.

The class "bio-info" looks like it contains the majority of the information in the body of the page. Let's start with this. We are going to scrape some information about Kihei Clark from this page.

In [None]:
# the prettify() function makes the code somewhat more readable.
# I don't use this feature much but maybe you will appreciate it.
print(soup.prettify())

The find() function returns the first element that matches the given criteria. Notice that the first argument is the HTML tag and the second is the class of that tag.

In [11]:
bio_info = soup.find("div", class_='bio-info')    # class_, because 'class' is a reserved word in Python
print(bio_info)

<div class="bio-info">
<div class="text-block">
<h1>Kihei Clark</h1>
<div class="info-block">
<div>
<div class="value">Guard</div>
<div class="description">Position</div>
</div>
<div>
<div class="value">5'10''</div>
<div class="description">Height</div>
</div>
<div>
<div class="value">172 lbs.</div>
<div class="description">Weight</div>
</div>
<div>
<div class="value">Senior</div>
<div class="description">Class</div>
</div>
<div>
<div class="value">Woodland Hills, Calif.</div>
<div class="description">Hometown</div>
</div>
<div>
<div class="value">Taft Charter</div>
<div class="description">High School / Club</div>
</div>
<div>
<div class="value"><a href="https://twitter.com/ClarkKihei" rel="nofollow" target="_blank"><i class="fab fa-twitter"></i> @ClarkKihei</a></div>
<div class="description">Twitter</div>
</div>
<div>
<div class="value"><a href="https://instagram.com/kihei.clark/" rel="nofollow" target="_blank"><i class="fab fa-instagram"></i> @kihei.clark</a></div>
<div class="descr
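
To see how find() behaves next to find_all(), here is a small offline sketch. The HTML string is a hand-trimmed fragment modeled on the output above, and the name `snippet` is illustrative, so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# A trimmed fragment modeled on the player page, so this runs offline.
html = """
<div class="info-block">
  <div class="value">Guard</div>
  <div class="description">Position</div>
  <div class="value">Senior</div>
  <div class="description">Class</div>
</div>
"""
snippet = BeautifulSoup(html, "html.parser")

first_value = snippet.find("div", class_="value")         # first match only
all_values = snippet.find_all("div", class_="value")      # list of every match
missing = snippet.find("div", class_="no-such-class")     # None when nothing matches

print(first_value.text)
print([tag.text for tag in all_values])
print(missing)
```

Because find() returns None when nothing matches, chaining attributes onto its result can fail if the page layout changes.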

I now see that the player's name is nested a couple of tags deeper inside that 'bio-info' class. Let's drill down into the code and get the player's name value.

In [12]:
player_name = soup.find('div', class_='bio-info').div.h1.text
print(player_name)


Kihei Clark
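
As an alternative to chaining `.div.h1.text`, the same element can be reached with a CSS selector. The sketch below is one possible way to do it, not the only one. The function name `get_player_name` is hypothetical, and the HTML is a trimmed copy of the structure printed earlier.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the structure seen in the bio-info block.
html = '<div class="bio-info"><div class="text-block"><h1>Kihei Clark</h1></div></div>'
snippet = BeautifulSoup(html, "html.parser")

def get_player_name(soup):
    """Return the text of the <h1> inside div.bio-info, or None if it is absent."""
    heading = soup.select_one("div.bio-info h1")
    if heading is None:
        return None
    return heading.get_text(strip=True)

print(get_player_name(snippet))
```

Returning None instead of raising an error makes it easier to skip pages that do not follow the expected layout.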


Next, I want more information about the player such as his position, height, weight, etc. 

Looking again at the HTML, it looks like all of that information is in repetitive 'div' tags. Let's select all of those elements.

In [14]:
for item in soup.find_all('div', class_="value"):
    print(item.text)


Guard
5'10''
172 lbs.
Senior
Woodland Hills, Calif.
Taft Charter
 @ClarkKihei
 @kihei.clark


I also want to get the labels for this data. This would be things such as 'position', 'height', 'weight', etc. 

In [17]:
for item in soup.find_all('div', class_='description'):
    print(item.text)

Position
Height
Weight
Class
Hometown
High School / Club
Twitter
Instagram


Let's do a little more here to clean things up. In the following cell I will do the same as in the previous two cells, but this time I will store the data in two lists. Then I will zip those lists together to make a dictionary of my data.

In [19]:
player_stats=[]
labels=[]

for i in soup.find_all('div', class_='value'):
    player_stats.append(i.text)
    
for i in soup.find_all('div', class_='description'):
    labels.append(i.text)
    
#let's assume I only want the first five items in each list
player_stats = player_stats[:5]
labels = labels[:5]

player_dict = dict(zip(labels, player_stats))
print(player_dict)

{'Position': 'Guard', 'Height': "5'10''", 'Weight': '172 lbs.', 'Class': 'Senior', 'Hometown': 'Woodland Hills, Calif.'}
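
Zipping two separate lists relies on the values and labels appearing in the same order and in equal numbers. A slightly more defensive sketch pairs each value with the label that sits next to it in the HTML. The fragment below is trimmed from the page output shown earlier, and the names `snippet` and `info` are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed fragment of the info-block, including one entry with a nested link.
html = """
<div class="info-block">
<div>
<div class="value">Guard</div>
<div class="description">Position</div>
</div>
<div>
<div class="value"><a href="https://twitter.com/ClarkKihei"><i class="fab fa-twitter"></i> @ClarkKihei</a></div>
<div class="description">Twitter</div>
</div>
</div>
"""
snippet = BeautifulSoup(html, "html.parser")

info = {}
for value in snippet.find_all("div", class_="value"):
    # The label is the next sibling <div> with class "description".
    label = value.find_next_sibling("div", class_="description")
    if label is not None:
        # get_text(strip=True) also removes the stray leading space seen earlier.
        info[label.get_text(strip=True)] = value.get_text(strip=True)

print(info)
```

This keeps each label attached to its own value even if one of the entries is missing from the page.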


Now I have an easy-to-use dictionary!

In [21]:
print(player_dict['Weight'])

172 lbs.
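
Scraped values arrive as strings, so numeric work usually needs a conversion step. Here is a minimal sketch using the height and weight strings from the output above. The helper names `parse_weight_lbs` and `parse_height_inches` are hypothetical, and the patterns assume the formats shown on this page.

```python
import re

# Values copied from the dictionary printed above.
player_dict = {"Height": "5'10''", "Weight": "172 lbs."}

def parse_weight_lbs(text):
    """Return the first run of digits as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

def parse_height_inches(text):
    """Convert a feet'inches string such as 5'10'' to total inches, or None."""
    match = re.match(r"\s*(\d+)'\s*(\d+)", text)
    if not match:
        return None
    feet, inches = map(int, match.groups())
    return feet * 12 + inches

print(parse_weight_lbs(player_dict["Weight"]))
print(parse_height_inches(player_dict["Height"]))
```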



In [None]:
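# A common next step is to collect scraped values in a list so they can be reused elsewhere,
# for example in a pandas DataFrame.
# NOTE: formatted_names_of_important_people is not defined in this notebook as shown.
# It is assumed to be a list of name strings formatted the way this site's URLs expect.

data = []

for name in formatted_names_of_important_people:

    # Request the salary page for this person and parse its HTML
    source = requests.get(f'https://data.richmond.com/salaries/2018/state/university-of-virginia/{name}')
    soup = BeautifulSoup(source.text, 'html.parser')

    # The salary is the text of the <h2> inside the div with class 'pay'
    main_box = soup.find("div", class_='pay')
    salary = main_box.find('h2').text

    # Store one (name, salary) tuple per person
    data.append((name, salary))

print(data)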
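
The parsing step inside the loop can be pulled into a small function and tried on a hand-written page before hitting the real site. This is only a sketch. `extract_salary` is a hypothetical helper, and the sample HTML and its dollar amount are placeholders, not real data. Only the `pay` class and the `h2` tag come from the loop above.

```python
from bs4 import BeautifulSoup

def extract_salary(html):
    """Return the text of the <h2> inside div.pay, or None if either is missing."""
    soup = BeautifulSoup(html, "html.parser")
    main_box = soup.find("div", class_="pay")
    if main_box is None:
        return None
    heading = main_box.find("h2")
    if heading is None:
        return None
    return heading.get_text(strip=True)

# Placeholder page with a made-up amount, used only to exercise the function.
sample_page = '<div class="pay"><h2> $12,345 </h2></div>'

print(extract_salary(sample_page))
print(extract_salary("<p>Page not found</p>"))
```

Inside the loop, `extract_salary(source.text)` would then stand in for the two find() calls, and a None result signals a page that did not match the expected layout.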

Let's put this into a pandas DataFrame!

## Pandas ##

Pandas is an open-source Python library that provides high-performance, easy-to-use data structures. It is common to store data scraped from the web in a pandas DataFrame.

A pandas DataFrame is a two-dimensional data structure with rows and columns (like a spreadsheet).

In [None]:

df = pd.DataFrame(data)
print(df)

Now let's rename the columns in the pandas DataFrame to something more descriptive.

In [None]:

df.columns = ['Name', 'Salary']
print(df)
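
The column names can also be supplied when the DataFrame is created, and the salary strings can be converted to numbers for later analysis. The sketch below uses placeholder rows, not real scraped values, and assumes salaries look like `$100,000`. The column name `Salary_USD` is an illustrative choice.

```python
import pandas as pd

# Placeholder rows standing in for the scraped (name, salary) tuples.
data = [("person-a", "$100,000"), ("person-b", "$85,500")]

# Name the columns at construction time instead of renaming afterwards.
df = pd.DataFrame(data, columns=["Name", "Salary"])

# Strip the dollar sign and commas, then convert to numbers.
# errors="coerce" turns anything unparseable into NaN instead of raising.
df["Salary_USD"] = pd.to_numeric(
    df["Salary"].str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

print(df)
```

With a numeric column in place, operations such as sorting with `df.sort_values("Salary_USD")` behave as expected.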
