# Selenium

Selenium is a browser automation tool: it lets your code drive a real web browser.

You can use it to automate browser tasks such as filling out forms, clicking buttons, and navigating between pages. Once the content you want has loaded in the browser, you can hand the page to other tools such as BeautifulSoup (a Python scraping library) to extract data from it.

In [None]:
# Install and load the necessary libraries
!pip install selenium webdriver-manager beautifulsoup4 pandas
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
from bs4 import BeautifulSoup

Next, we will install a browser driver and use Selenium to open and control a web browser.

# Depending on which browser you want to use, run one of the following two chunks (one for Firefox, one for Chrome):

**I recommend using a browser that you do not use for your daily work, so you can avoid issues with your browser settings and extensions. For example, I use Chrome day to day, so I will install Firefox and use it only for Selenium.**

In [None]:
# If you want to use Firefox for Selenium, run only this chunk and skip the Chrome chunk below
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager
driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))

In [None]:
# If you want to use Chrome for Selenium, run only this chunk and skip the Firefox chunk above
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

Running the chunk above will open a Firefox or Chrome window.

The next chunk navigates the browser to the specified URL and performs some basic interactions with the page.

**Before running it, make sure to enter your username and password in the `username` and `password` variables.**

In [None]:
# Go to MyU portal
driver.get("https://myu.umn.edu")

# Wait for page to load
time.sleep(10)

# Enter credentials (yes, you will input your username and password here)
username = "" # Replace with your actual username
password = ""  # Replace with your actual password
# Never do this in shared or public code! Anyone who can read the file can log in as you.

# Fill in the username and password, then submit the login form
driver.find_element(By.ID, "username").send_keys(username)
driver.find_element(By.ID, "password").send_keys(password)
driver.find_element(By.NAME, "_eventId_proceed").click()
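
Typing a password directly into a notebook cell is easy to leak by accident. One possible alternative, sketched below, reads the credentials from environment variables and falls back to an interactive prompt. The function name `read_credentials` and the variable names `MYU_USERNAME` and `MYU_PASSWORD` are assumptions made for this example; they are not part of Selenium or the MyU portal.

```python
import os
from getpass import getpass


def read_credentials(user_var="MYU_USERNAME", pass_var="MYU_PASSWORD"):
    """Return (username, password) without hardcoding them in the notebook.

    The environment variable names are hypothetical; pick any names you like.
    """
    # Use the environment variable if it is set, otherwise ask interactively
    username = os.environ.get(user_var) or input("Username: ")
    # getpass hides the password while you type it
    password = os.environ.get(pass_var) or getpass("Password: ")
    return username, password
```

With this helper, `username, password = read_credentials()` could replace the two hardcoded assignments in the cell above, and the `send_keys` lines would stay the same.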

# After running the chunk above, you may need to authenticate with Duo or click something on the page to finish logging in. You have to do this step manually.

Now run the next chunk to switch to the "My Info" tab, which holds the data we will extract.

In [None]:
# Right-click on the My Info tab and inspect it to find the href value
# This is usually something like '#tab_UM_SSS_MY_INFORMATION'
# The JavaScript "click()" call below clicks the My Info tab for you
my_info_tab = driver.find_element(By.CSS_SELECTOR, 'a[href="#tab_UM_SSS_MY_INFORMATION"]')
driver.execute_script("arguments[0].click();", my_info_tab)

The chunk above should have switched your browser to the "My Info" tab.

Now, we will scrape the data from the My Info tab.

In [None]:
# You can right-click on the page and inspect it to find the structure of the HTML
# There you can see that your phone number is under the id 'UM_SSS_MY_INFO_PHONE_XFORM'
phone_number_element = driver.find_element(By.XPATH, '//*[@id="UM_SSS_MY_INFO_PHONE_XFORM"]//td')
phone_number = phone_number_element.text
print("My Phone Number:", phone_number)

# Likewise, your address is under the id 'UM_SSS_MY_INFO_ADDRESS_XFORM'
address_element = driver.find_element(By.XPATH, '//*[@id="UM_SSS_MY_INFO_ADDRESS_XFORM"]//td')
address = address_element.text
print("My Address:", address)

You can use `selenium` to do much more than this. You can control the browser as if you were a human: clicking buttons, filling out forms, and even taking screenshots.

By combining `selenium` with scraping tools such as BeautifulSoup, you can extract data from web pages and automate tasks that would otherwise be tedious and time-consuming.

Compared to directly extracting HTML from a website, as we did in R, this approach "simulates" a human user interacting with the browser. That is useful for websites that need JavaScript to load their content or that have complex interactions.

# NBA Scraping Again

Remember the NBA scraping we attempted in R? It did not work because the data was loaded dynamically with JavaScript, so it was not present in the HTML we downloaded.

However, with Selenium, we can automate the browser to load the page and extract the data. In essence, we are cheating the system: nba.com will see us as a human user, and we can extract the data as if we were browsing the site manually.

Again, run only one of the two chunks below, depending on which browser you want to use.

In [None]:
# If you want to use Firefox for Selenium, run only this chunk and skip the Chrome chunk below
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager
driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))

In [None]:
# If you want to use Chrome for Selenium, run only this chunk and skip the Firefox chunk above
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

Then run the following chunk to get the data from the NBA page.

In [None]:
# Go to NBA website
driver.get("https://www.nba.com/standings")

# Wait for page to load
time.sleep(10)

# Parse the page source with BeautifulSoup (an HTML parser; think of it as a library that helps you read HTML)
soup = BeautifulSoup(driver.page_source, "html.parser")

# Get both tables
tables = soup.find_all("table", class_="Crom_table__p1iZz")
print(f"Found {len(tables)} tables.")
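
A fixed `time.sleep(10)` is either longer than necessary or, on a slow connection, too short. A common alternative is to poll until the content you need is actually present. Selenium ships `WebDriverWait` in `selenium.webdriver.support.ui` for this purpose; the sketch below shows the same idea in plain Python so that it runs without a browser. The helper `wait_until` and its parameters are hypothetical names chosen for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page that becomes ready on the third check
attempts = {"count": 0}


def ready_on_third_call():
    attempts["count"] += 1
    return attempts["count"] >= 3


first_result = wait_until(ready_on_third_call, timeout=5, poll_interval=0.01)
print("Ready after", attempts["count"], "checks")
```

With a live driver, a call such as `wait_until(lambda: driver.find_elements(By.TAG_NAME, "table"), timeout=20)` would return as soon as at least one table exists, because `find_elements` returns an empty (falsy) list until then.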

Let's convert the two tables to DataFrames and preview them:

In [None]:
# Eastern Conference is first, Western Conference is second
eastern_table = tables[0]
western_table = tables[1]

def parse_table(html_table):
    rows = html_table.find_all("tr")
    data = []
    for row in rows:
        cols = [col.get_text(strip=True) for col in row.find_all(['th', 'td'])]
        if cols:  # skip empty rows
            data.append(cols)
    df = pd.DataFrame(data[1:], columns=data[0])  # First row is header
    return df

# Convert to DataFrames
east_df = parse_table(eastern_table)
west_df = parse_table(western_table)

# Preview results
print("Eastern Conference:")
print(east_df.head())

print("\nWestern Conference:")
print(west_df.head())

# Some cleaning is needed, but you see how to get the data now :)
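
As one possible first cleaning step, the sketch below converts text columns to numbers, since every cell scraped with `get_text` is a string. The helper `to_numeric_columns`, the column names `W`, `L`, and `GB`, and the demo rows are assumptions made up for illustration; check `east_df.columns` for the real column names before applying it.

```python
import pandas as pd


def to_numeric_columns(df, columns):
    """Return a copy of df with the given columns converted to numbers."""
    cleaned = df.copy()
    for col in columns:
        if col in cleaned.columns:
            # errors="coerce" turns cells that are not numbers (e.g. "-") into NaN
            cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    return cleaned


# Hypothetical rows standing in for scraped text; these are not real standings
demo = pd.DataFrame(
    [["Team A", "50", "32", "-"], ["Team B", "41", "41", "9.0"]],
    columns=["Team", "W", "L", "GB"],
)
demo_clean = to_numeric_columns(demo, ["W", "L", "GB"])
print(demo_clean.dtypes)
```

The same call could then be tried on the scraped tables, for example `to_numeric_columns(east_df, ["W", "L"])`, assuming those columns exist.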

In [None]:
# Optional: close the driver
driver.quit()