# Selenium

Selenium is a powerful tool for controlling a web browser from code and automating what you do in it.

You can use it to automate tasks in a web browser, such as filling out forms, clicking buttons, and navigating through pages. Once the content you want is loaded in the browser, you can use other tools such as BeautifulSoup (a Python scraping library) to extract data from the page.

In [1]:
# Load necessary libraries
! pip install webdriver-manager
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
from bs4 import BeautifulSoup


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


The difficult part of using Selenium is setting it up. You need to install the Selenium package (which we already did), **download a web driver for your browser**, and ensure that the driver is in your system's PATH.

The browsers I use the most for Selenium are Chrome and Firefox, but Selenium also supports Edge and Safari: https://www.selenium.dev/documentation/webdriver/browsers/

I recommend using a browser that you do not use for your daily work, so you can avoid issues with your browser settings and extensions. For example, in my day-to-day work, I use Chrome, so I will install Firefox and use it only for Selenium.

- Go to https://www.selenium.dev/downloads/
- Scroll down to the "Platforms Supported by Selenium" section
- Follow the instructions for the browser you want to use (e.g., "Firefox") to download the driver
  - For example, for Firefox, the website says "GeckoDriver is implemented and supported by Mozilla, refer to their documentation for supported versions."
  - Once you click the link, you will see that "The following table shows a mapping between geckodriver releases, and required versions of Selenium and Firefox."
  - Once you click the link for "geckodriver releases" and go to https://github.com/mozilla/geckodriver/releases you will see a list of releases at the bottom of the page.
  - Download the latest version of the driver for your operating system. For me, it would be `geckodriver-v0.36.0-macos-aarch64.tar.gz`
  - Extract the downloaded file, and you will get a file called `geckodriver`
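- Copy the path of the extracted `geckodriver` file to your clipboard. You only need it if you point Selenium at the driver yourself; the next code snippet uses `webdriver-manager`, which downloads a matching driver for you.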

In [None]:
# If you use Firefox, run only this part
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager
driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))

# If you use Chrome, comment out the Firefox lines above and uncomment these instead
# from selenium.webdriver.chrome.service import Service as ChromeService
# from webdriver_manager.chrome import ChromeDriverManager
# driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

Running the above chunk will open a Firefox browser.

The next chunk will navigate the browser to the specified URL and do some basic interactions with the page.

In [6]:
# Go to MyU portal
driver.get("https://myu.umn.edu")

# Wait for page to load
time.sleep(10)

# Enter your credentials (replace the placeholders with your own username and password)
username = "your_username"
password = "your_password"  # NEVER hard-code a real password in code you share or commit!

# Fill in the username and password, then submit the form
driver.find_element(By.ID, "username").send_keys(username)
driver.find_element(By.ID, "password").send_keys(password)
driver.find_element(By.NAME, "_eventId_proceed").click()
# You might need to authenticate with Duo, which you have to do manually.
# Once you are logged in, you can proceed to the next step.

After running the above chunk, you should be logged into your MyU account.

Now run the next chunk to navigate to the "My Info" tab.
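
Since typing a password into a notebook is risky, one common alternative is to read the credentials from environment variables and fall back to a hidden prompt. The following is a minimal sketch of that idea; the function name `read_credentials` and the variable names `MYU_USERNAME` and `MYU_PASSWORD` are assumptions made up for this example, not something the portal or Selenium defines.

```python
import os
from getpass import getpass


def read_credentials(user_var: str = "MYU_USERNAME", pass_var: str = "MYU_PASSWORD"):
    """Return (username, password) from environment variables or a prompt."""
    # Use the environment variable if it is set, otherwise ask interactively.
    username = os.environ.get(user_var) or input("Username: ")
    # getpass hides the typed characters, so the password is not echoed.
    password = os.environ.get(pass_var) or getpass("Password: ")
    return username, password


# Usage in the login step:
# username, password = read_credentials()
# driver.find_element(By.ID, "username").send_keys(username)
# driver.find_element(By.ID, "password").send_keys(password)
```

This keeps the password out of the notebook file itself, so sharing or committing the notebook does not leak it.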

In [None]:
# Right-click on the My Info tab and inspect it to find the href value
# This is usually something like '#tab_UM_SSS_MY_INFORMATION'
# The JavaScript click() call below clicks the My Info tab for you;
# unlike my_info_tab.click(), it also works when another element covers the tab
my_info_tab = driver.find_element(By.CSS_SELECTOR, 'a[href="#tab_UM_SSS_MY_INFORMATION"]')
driver.execute_script("arguments[0].click();", my_info_tab)
time.sleep(3)

The above chunk should have switched your browser to the "My Info" tab.
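
The fixed `time.sleep` pauses used so far either waste time on a fast connection or are too short on a slow one. A more robust pattern is to poll until the element you need actually exists. Selenium ships `WebDriverWait` and `expected_conditions` (in `selenium.webdriver.support`) for this purpose; the sketch below is a dependency-free version of the same polling idea, and the helper name `wait_until` is hypothetical.

```python
import time


def wait_until(condition, timeout: float = 10.0, poll: float = 0.5):
    """Call `condition` until it returns a truthy value, then return that value.

    Raises TimeoutError if `timeout` seconds pass without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll)


# Usage with the driver from this notebook: find_elements returns an empty
# list (falsy) until the element exists, so this returns as soon as it appears.
# fields = wait_until(lambda: driver.find_elements(By.ID, "username"), timeout=10)
# fields[0].send_keys(username)
```

The call returns as soon as the condition holds, so the script waits only as long as the page actually needs.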

Now, we will scrape the data from the My Info tab.

In [None]:
# You can right-click on the page and inspect it to find the structure of the HTML

# You can see that your phone number is under the id 'UM_SSS_MY_INFO_PHONE_XFORM'
phone_number_element = driver.find_element(By.XPATH, '//*[@id="UM_SSS_MY_INFO_PHONE_XFORM"]//td')
phone_number = phone_number_element.text
print("My Phone Number:", phone_number)

# You can see that your address is under the id 'UM_SSS_MY_INFO_ADDRESS_XFORM'
address_element = driver.find_element(By.XPATH, '//*[@id="UM_SSS_MY_INFO_ADDRESS_XFORM"]//td')
address = address_element.text
print("My Address:", address)

My Phone Number: 213/706-1360
My Address: 615 Ontario St SE
Apt 19
Minneapolis, MN 55414-3136


You can use `selenium` for much more than this. It lets you control the browser much as a human would, including clicking buttons, filling out forms, and even taking screenshots.

Combined with a scraping library such as BeautifulSoup, `selenium` lets you extract data from web pages and automate tasks that would otherwise be tedious and time-consuming.
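
To make the hand-off to BeautifulSoup concrete, here is a minimal sketch that parses an HTML string and pulls out the first table cell under a given id. In the notebook you would pass `driver.page_source` as the HTML; the `sample_html` fragment and its phone number are made up for illustration, and `first_cell_text` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def first_cell_text(html: str, element_id: str):
    """Return the text of the first <td> inside the element with `element_id`."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=element_id)
    if container is None:
        return None
    cell = container.find("td")
    if cell is None:
        return None
    # Join text fragments with newlines and trim surrounding whitespace.
    return cell.get_text(separator="\n", strip=True)


# A made-up fragment standing in for driver.page_source:
sample_html = """
<div id="UM_SSS_MY_INFO_PHONE_XFORM">
  <table><tr><td> 555/010-0000 </td></tr></table>
</div>
"""
print(first_cell_text(sample_html, "UM_SSS_MY_INFO_PHONE_XFORM"))

# With the live browser:
# first_cell_text(driver.page_source, "UM_SSS_MY_INFO_ADDRESS_XFORM")
```

Parsing `page_source` once and querying it with BeautifulSoup avoids a round trip to the browser for every element, which helps when you extract many fields from the same page.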

Compared to reading in the HTML directly, as we did in R, this approach simulates a human user interacting with the browser. That is useful for websites that load content with JavaScript or require complex interactions such as logging in.