## Intro to Web Scraping in Python

Today we will learn how to scrape HTML web pages in Python using the Beautiful Soup 4 library. Scraping lets you programmatically gather information from websites for your own purposes. We will gather information about the UVA basketball team, first by walking through each step of the process by hand, and then by doing it in a more automated way.

First we need to install the Beautiful Soup 4 and lxml libraries. These are not part of base Python, and they may be missing from your Anaconda environment. The cell below also installs requests in case you do not already have it.

In [None]:
!conda install --yes beautifulsoup4
!conda install --yes lxml
!conda install --yes requests
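
Before running the install commands, it can help to check which packages are already importable. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper name introduced here, not part of any of these libraries.

```python
import importlib.util

def missing_packages(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# bs4 is the import name of the beautifulsoup4 package.
print(missing_packages(["bs4", "lxml", "requests", "pandas"]))
```

An empty list means everything is already available and the install cell can be skipped.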

Next, let's import the libraries we will be using in this Jupyter Notebook, including Beautiful Soup 4. The requests and pandas libraries are already installed if you are using Anaconda.

In [1]:
import requests
import lxml
from bs4 import BeautifulSoup
import pandas as pd


Now we are ready to make an HTTP request using the requests library. Once a connection is established with the server that hosts the website, the client (you) sends an HTTP GET request asking the server for a page, and the server sends back a response containing that page's content.

This is typically done by your web browser, but we can also do it in Python.
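
To make the pieces of a GET request concrete, here is a small sketch using only the standard library. It builds a request object for the roster URL without contacting the server, then shows the method, the host the client would connect to, and the resource it would ask for. The variable names are illustrative only.

```python
from urllib.parse import urlsplit
from urllib.request import Request

url = "https://virginiasports.com/sports/mbball/roster/"

# Building a Request object does not contact the server.
request = Request(url)
parts = urlsplit(url)

print(request.get_method())  # the HTTP method that would be used
print(parts.netloc)          # the server (host) the client connects to
print(parts.path)            # the resource requested from that server
```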

Today we are going to scrape information about players on the UVA men's basketball team. You can find the team's roster for the 2024-2025 season here: https://virginiasports.com/sports/mbball/roster/ 

We will start by gathering information about one of UVA's players, Blake Buchanan.

In [2]:
#This is how to make an HTTP GET request using the requests library.
source = requests.get('https://virginiasports.com/sports/mbball/roster/season/2024-25/player/blake-buchanan/')

print(source)      #this prints the type of response. 200 means "OK". There are many response codes

<Response [200]>
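
The 200 above is one of many HTTP status codes. As a rough illustration, the standard library's `http.HTTPStatus` can translate a code into its standard phrase; `describe_status` is a hypothetical helper written for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '200 OK' for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f"{code} (not a registered status code)"
    return f"{status.value} {status.phrase}"

for code in (200, 301, 404, 500):
    print(describe_status(code))
```

With requests, the number itself is available as `source.status_code`, and `source.raise_for_status()` raises an exception for 4xx and 5xx responses, which is a convenient check before parsing.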


Now we have a response from the server. If the response is good, the source code of the web page is contained within that response. Let's see what that looks like.

In [None]:
soup = BeautifulSoup(source.text, 'lxml')

print(soup) 


This is messy, but it is all the code for the page we have issued a request for. Some of it is human readable, some of it is not. Now, let's look at the source code of this page another way. Copy and paste this link into your web browser: https://virginiasports.com/sports/mbball/roster/season/2024-25/player/blake-buchanan/

### Note - These steps use the inspector and developer tools in Google Chrome; other browsers offer similar tools under slightly different names

Right-click somewhere on the page and click "Inspect". Then choose the "Elements" tab to see the HTML source code of the page. While inspecting the page elements, you can see which parts of the page are controlled by which parts of the code. Notice that the code starts with large chunks (`<body>`, for example) that contain smaller divisions (`<div>` tags, among others).

The class "bio-info" looks like it contains most of the information in the body of the page. Let's start with this. We are going to scrape some information about Blake Buchanan from this page.

In [None]:
# the prettify() function makes the code somewhat more readable.
# I don't use this feature much but maybe you will appreciate it.
print(soup.prettify())

The find() function returns the first element matching the given criteria. Notice that our arguments are, first, the HTML tag and, second, the class of that tag.

In [5]:
player_info = soup.find("div", class_='bio-info')    #class_, because 'class' is a reserved word in python
print(player_info.prettify())

<div class="bio-info">
 <div class="text-block">
  <h1>
   Blake Buchanan
  </h1>
  <div class="info-block">
   <div>
    <div class="value">
     Forward
    </div>
    <div class="description">
     Position
    </div>
   </div>
   <div>
    <div class="value">
     6'11''
    </div>
    <div class="description">
     Height
    </div>
   </div>
   <div>
    <div class="value">
     225 lbs.
    </div>
    <div class="description">
     Weight
    </div>
   </div>
   <div>
    <div class="value">
     Sophomore
    </div>
    <div class="description">
     Class
    </div>
   </div>
   <div>
    <div class="value">
     Coeur d’Alene, Idaho
    </div>
    <div class="description">
     Hometown
    </div>
   </div>
   <div>
    <div class="value">
     Lake City
    </div>
    <div class="description">
     High School
    </div>
   </div>
   <div>
    <div class="value">
     <a href="https://twitter.com/blake_buchanan4" rel="nofollow" target="_blank">
      <svg viewbox="0 0 512 51

I now see that the player's name is inside a couple more tags inside that 'bio-info' class. Let's drill down into the code and get the player's name value.

In [6]:
player_name = soup.find('div', class_='bio-info').div.h1.text
print(player_name)


Blake Buchanan
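
One thing to keep in mind with chained lookups like the one above: `find()` returns `None` when nothing matches, so a missing tag makes the chain fail with an `AttributeError`. Below is a minimal sketch of a more defensive version, run against a tiny HTML string that mimics the structure printed earlier. `get_player_name` is a hypothetical helper, and the built-in `html.parser` is used here only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# A trimmed-down imitation of the 'bio-info' block shown above.
html = '<div class="bio-info"><div class="text-block"><h1> Blake Buchanan </h1></div></div>'

def get_player_name(page_html):
    """Return the player's name from a bio page, or None if it is not found."""
    soup = BeautifulSoup(page_html, "html.parser")
    bio = soup.find("div", class_="bio-info")
    if bio is None:
        return None
    heading = bio.find("h1")
    # get_text(strip=True) removes the surrounding whitespace.
    return heading.get_text(strip=True) if heading else None

print(get_player_name(html))
print(get_player_name("<p>no bio here</p>"))
```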


Next, I want more information about the player such as his position, height, weight, etc. 

Looking again at the HTML, it looks like all of that information is in repetitive 'div' tags. Let's select all of those elements.

In [7]:
for item in soup.find_all('div', class_="value"):
    print(item.text)

Forward
6'11''
225 lbs.
Sophomore
Coeur d’Alene, Idaho
Lake City
 @blake_buchanan4
 @blake_buchanan4


I also want to get the labels for this data. This would be things such as 'position', 'height', 'weight', etc. 

In [8]:
for item in soup.find_all('div', class_='description'):
    print(item.text)

Position
Height
Weight
Class
Hometown
High School
Twitter
Instagram


Let's do a little more here to clean things up. In the following cell I will do the same as in the previous two cells, but this time I will store the data in two lists. Then I will zip those lists together to make a dictionary of the data.

In [9]:
player_stats=[]
labels=[]

for i in soup.find_all('div', class_='value'):
    player_stats.append(i.text)
    
for i in soup.find_all('div', class_='description'):
    labels.append(i.text)
    
#let's assume I only want the first five items in each list
player_stats = player_stats[:5]
labels = labels[:5]

player_dict = dict(zip(labels, player_stats))
print(player_dict)

{'Position': 'Forward', 'Height': "6'11''", 'Weight': '225 lbs.', 'Class': 'Sophomore', 'Hometown': 'Coeur d’Alene, Idaho'}


Now I have an easy-to-use dictionary!

In [10]:
print(player_dict['Hometown'])

Coeur d’Alene, Idaho
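
Zipping two separate lists works when both lists line up exactly. As an alternative sketch, each value can be paired with the label that sits next to it in the HTML, so the pairing does not depend on list positions. The HTML string below is a trimmed copy of the structure printed earlier, and `html.parser` is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the 'info-block' structure shown in the output above.
html = """
<div class="bio-info">
  <div class="text-block">
    <h1>Blake Buchanan</h1>
    <div class="info-block">
      <div><div class="value">Forward</div><div class="description">Position</div></div>
      <div><div class="value">6'11''</div><div class="description">Height</div></div>
      <div><div class="value">225 lbs.</div><div class="description">Weight</div></div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

player_dict = {}
for value_tag in soup.find_all("div", class_="value"):
    # The label is the <div class="description"> under the same parent <div>.
    label_tag = value_tag.parent.find("div", class_="description")
    if label_tag is not None:
        player_dict[label_tag.get_text(strip=True)] = value_tag.get_text(strip=True)

print(player_dict)
```

Looking values up by label (for example `player_dict['Height']`) then keeps working even if a page lists an extra field before it.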


## Selenium

Now that we have learned to scrape static HTML content, let's automate this task using Selenium. Check out the Selenium documentation here: https://www.selenium.dev/

### Selenium Web Driver
In order to use Selenium, we must download and install a web driver, which lets our code drive a browser.

**Important**
The following code assumes you are using Google Chrome and uses the associated web driver. If you are using another browser (Safari, Firefox, Edge, etc.), you will need to download the Selenium web driver for that browser; the Selenium documentation explains how.

You also need to download the web driver version that corresponds to your version of Google Chrome.

I have included a detailed write-up about this in the 'WebDriverInstall.md' file in the GitHub repository.
    

In [None]:
# install selenium
!conda install --yes selenium

In [6]:
# Import selenium and webdrivers

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

Let's make sure now that Selenium is working and you have all your paths set up correctly. 

If this works correctly, it will open up a blank browser with a message that 'this is being controlled by automated test software'

In [9]:
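# Make sure that Selenium is working and that MY_PATH points to your chromedriver file
# example: /Users/ep9k/Desktop/PythonWebScraping-master/chromedriver
# Note: executable_path is the Selenium 3 spelling. Recent Selenium 4 releases removed it;
# there, pass service=Service(MY_PATH), with Service imported from selenium.webdriver.chrome.service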
MY_PATH = "./chromedriver"

driver = webdriver.Chrome(executable_path=MY_PATH)

Now we can go directly to a page of our choice like so...

In [10]:
driver = webdriver.Chrome(executable_path=MY_PATH)
driver.get('https://virginiasports.com/sports/mbball/roster/')

We can now proceed to write our script just like any other program. Like BeautifulSoup, Selenium can select HTML elements by tag name, class name, id, and so on.

On the UVA roster page, the HTML shows that each player has its own box with a picture, name, and so on, and the HTML structure is the same for every player. In our example, we will simulate a user clicking on each player to open that player's individual page. Then we will scrape the information off of that page just as we did earlier. Let's start with just one player.

In [12]:
driver = webdriver.Chrome(executable_path=MY_PATH)
driver.get('https://virginiasports.com/sports/mbball/roster/')

#I am selecting by XPath, which can be used to navigate through XML and HTML documents
player = driver.find_element_by_xpath('//*[@id="players"]/div[1]/div[2]/div[1]')
player.click()

See above that I used the 'find_element_by_xpath' method to select this player's name before clicking it. Let's define what an XPath is.

**XPath**: XPath is a query language for selecting nodes in an XML document, and it works on HTML documents too. Don't worry too much about the details. I like to use 'find_element_by_xpath' because a copied XPath acts like an address for a single item in the HTML document, which makes selecting it easy. (In Selenium 4 the same lookup is written driver.find_element(By.XPATH, ...), with By imported from selenium.webdriver.common.by; recent releases no longer provide the find_element_by_* methods.)

To select an element (by XPath or another way), right-click the thing on the page you are interested in and click "Inspect". Then, in the "Elements" tab of the developer tools, right-click the HTML of that thing, click "Copy", and then click "Copy XPath".

There are many different ways to select HTML elements in Selenium. Check out the documentation for more examples: https://selenium-python.readthedocs.io/locating-elements.html
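
To see how a positional XPath picks out one element, here is a small sketch using the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The document below is a made-up, well-formed miniature of a roster and is not the real page's markup; the path is written relative to the root element, so it starts with `./` instead of `//*[@id="players"]`.

```python
import xml.etree.ElementTree as ET

# A made-up miniature roster: NOT the real page's markup.
roster = ET.fromstring("""
<section id="players">
  <div><div>photo</div><div><div>Blake Buchanan</div><div>Forward</div></div></div>
  <div><div>photo</div><div><div>Jalen Warley</div><div>Guard</div></div></div>
  <div><div>photo</div><div><div>Elijah Saunders</div><div>Forward</div></div></div>
</section>
""")

names = []
for count in range(1, 4):
    # XPath positions are 1-based: div[1] is the first <div> child.
    node = roster.find(f"./div[{count}]/div[2]/div[1]")
    names.append(node.text)

print(names)
```

Changing only the first index moves from one player box to the next, which is the idea behind building the XPath with an f-string inside a loop.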

Now that Selenium is installed and we have worked through a small example, we can expand on it. We will take just a few players from the team and iterate through them. On each player's page we will do exactly what we did with static HTML and BeautifulSoup.

In [13]:
driver = webdriver.Chrome(executable_path=MY_PATH)
driver.get('https://virginiasports.com/sports/mbball/roster/')


#we will use a while loop to grab data about the first three players
count = 0

while count < 3:
    count += 1
    time.sleep(1)
    
    #select player's name using find_element_by_xpath()
    #each player's name is a link to their bio page
    player = driver.find_element_by_xpath(f'//*[@id="players"]/div[{count}]/div[2]/div[1]')
    player.click()
    
    #now we use beautiful soup to parse the HTML just as we did last time
    soup = BeautifulSoup(driver.page_source, 'lxml')
    
    player_name = soup.find('div', class_='bio-info').div.h1.text
    print()
    print(player_name)
    
    for item in soup.find_all('div', class_="value"):
        print(item.text) 
        
    #go back to roster page after collecting information about current player    
    driver.get('https://virginiasports.com/sports/mbball/roster/')
    
driver.quit()






Blake Buchanan
Forward
6'11''
225 lbs.
Sophomore
Coeur d’Alene, Idaho
Lake City
 @blake_buchanan4
 @blake_buchanan4

Jalen Warley
Guard
6'7''
205 lbs.
Senior
Philadelphia, Pa.
Westtown
Florida State
 @jjwarley
 @jalenwarley

Elijah Saunders
Forward
6'8''
225 lbs.
Junior
Phoenix, Ariz.
Sunnyslope
San Diego State
 @elijahsaunders_
 @elijah_saunders


Let's do just a little more to make it pretty. This time we will collect the information about each player and put it into a pandas dataframe. There are many ways to do this but I will use lists to populate the columns of the dataframe.

We will do the whole thing together.

In [14]:

driver = webdriver.Chrome(executable_path=MY_PATH)
driver.get('https://virginiasports.com/sports/mbball/roster/')

#accumulator lists we will use later to make our pandas dataframe
names = []
positions = []
heights = []
weights = []
years = []


#we will use a for loop to grab data about every player on the roster
#team_players is assumed to be in the same order as the players on the roster page
team_players = ['Blake Buchanan', 'Jalen Warley', 'Elijah Saunders', 'Andrew Rohde',
                'Jacob Cofie', 'Dai Dai Ames', 'Bryce Walker', 'Ishan Sharma',
                'Anthony Robinson', 'Taine Murray', 'Isaac McKneely', 'Elijah Gertrude',
                'Desmond Roberts', 'TJ Power', 'Christian Bliss', 'Carter Lang']

#there are 16 players on the team; count is each player's 1-based position on the page
for count, i in enumerate(team_players, start=1):
    time.sleep(1)
    
    #select player's name using find_element_by_xpath()
    #each player's name is a link to their bio page
    player = driver.find_element_by_xpath(f'//*[@id="players"]/div[{count}]/div[2]/div[1]')
    
    player.click()
    
    #now we use beautiful soup to parse the HTML just as we did last time
    soup = BeautifulSoup(driver.page_source, 'lxml')
    
    #player_info will store the values scraped for each player in a list
    #this will look like:  ['Forward', "6'11''", '225 lbs.', 'Sophomore', ...]
    player_info = []
    
    for item in soup.find_all('div', class_="value"):
        player_info.append(item.text)
        
    #add the information from player_info to the accumulator lists outside this loop
    names.append(i)    # i is player name from team_players
    positions.append(player_info[0])
    heights.append(player_info[1])
    weights.append(player_info[2])
    years.append(player_info[3])
    
    #go back to roster page after collecting information about current player    
    driver.get('https://virginiasports.com/sports/mbball/roster/')
    
driver.quit()

print(names)
print(positions)


WebDriverException: Message: unknown error: cannot determine loading status
from unknown error: cannot determine loading status
from target frame detached
  (Session info: chrome=129.0.6668.71)
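
Browser automation can fail intermittently, as the traceback above shows. One common way to make a loop more tolerant is to retry a failing step a few times before giving up. The sketch below shows the idea with the standard library only; `retry` and `flaky_page_load` are hypothetical names, and the flaky function simply stands in for a page load that fails twice.

```python
import time

def retry(action, attempts=3, delay=1.0):
    """Call action() until it succeeds or attempts run out, then re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise          # out of attempts: let the caller see the error
            time.sleep(delay)  # brief pause before trying again

# Stand-in for a page load that fails twice before succeeding.
calls = {"count": 0}

def flaky_page_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("page not ready")
    return "page loaded"

result = retry(flaky_page_load, attempts=3, delay=0.0)
print(result, "after", calls["count"], "calls")
```

In the scraping loop, the action could be a small function that clicks a player and parses the page. Catching `Exception` keeps the sketch short; in practice it is safer to catch only `selenium.common.exceptions.WebDriverException`.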


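Lastly, we will take the lists we created above and put them into a new pandas DataFrame. Because the run above appears to have stopped with an error partway through the roster, the lists hold only the players collected before the error.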

In [15]:
# Make empty dataframe
bball_team_df = pd.DataFrame()

# Take lists of team information and put them into columns of dataframe
bball_team_df['Name'] = names
bball_team_df['Position'] = positions
bball_team_df['Height'] = heights
bball_team_df['Weight'] = weights
bball_team_df['Year'] = years
print(bball_team_df)



               Name Position  Height    Weight       Year
0    Blake Buchanan  Forward  6'11''  225 lbs.  Sophomore
1      Jalen Warley    Guard   6'7''  205 lbs.     Senior
2   Elijah Saunders  Forward   6'8''  225 lbs.     Junior
3      Andrew Rohde    Guard   6'6''  202 lbs.     Junior
4       Jacob Cofie  Forward  6'10''  230 lbs.   Freshman
5      Dai Dai Ames    Guard   6'1''  185 lbs.  Sophomore
6      Bryce Walker    Guard   6'2''  200 lbs.     Senior
7      Ishan Sharma    Guard   6'5''  185 lbs.   Freshman
8  Anthony Robinson    Guard   6'5''  208 lbs.     Senior
9      Taine Murray    Guard   6'4''  188 lbs.     Junior
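
The Height and Weight columns above are still text. If you want to sort or average them, one option is to convert them to numbers first. This is a minimal sketch using regular expressions; `height_to_inches` and `weight_to_pounds` are hypothetical helpers that assume the formats shown in the table (for example 6'11'' and 225 lbs.).

```python
import re

def height_to_inches(height_text):
    """Convert a height like "6'11''" to total inches, or None if it does not match."""
    match = re.match(r"\s*(\d+)'\s*(\d+)", height_text)
    if match is None:
        return None
    feet, inches = int(match.group(1)), int(match.group(2))
    return feet * 12 + inches

def weight_to_pounds(weight_text):
    """Convert a weight like "225 lbs." to an integer, or None if no number is found."""
    match = re.search(r"\d+", weight_text)
    return int(match.group()) if match else None

print(height_to_inches("6'11''"))    # 83
print(weight_to_pounds("225 lbs."))  # 225
```

With pandas, these could be applied to a column with something like `bball_team_df['Height'].apply(height_to_inches)`.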
