# Configure Environment

```sh
python3 -m venv venv
source venv/bin/activate

# install necessary packages
pip3 install selenium
```

After installing Selenium, import the following modules:

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# Using Selenium Locators

For the following example, we will use [Washtenaw County precinct voter information](https://electionresults.ewashtenaw.org/electionreporting/nov2024/indexprecinctreport.html) to show how to retrieve a variety of elements from a static webpage using Selenium locators.

First, we need to find where the browser of our choice lives in our filesystem.

## macOS
On macOS systems, you can typically find this at `/Applications/(Your browser's name).app/Contents/MacOS/(Your browser's name)`.

## Linux/WSL
On Linux systems, you can find the browser's location using the `which` command (e.g. `which google-chrome-stable`).
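
As a rough sketch of the lookup described above, the snippet below first tries a few executable names with `shutil.which` and then a few macOS application bundles. The function name `find_browser` and the default candidate names are assumptions for illustration; adjust them to the browsers you actually have installed.

```python
import os
import shutil


def find_browser(names=("google-chrome-stable", "brave-browser", "chromium"),
                 mac_apps=("Google Chrome", "Brave Browser")):
    """Return the path of the first browser found, or None (hypothetical helper)."""
    # Linux/WSL: look each executable name up on PATH, like the `which` command
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    # macOS: check the conventional application bundle layout
    for app in mac_apps:
        path = f"/Applications/{app}.app/Contents/MacOS/{app}"
        if os.path.isfile(path):
            return path
    return None


print(find_browser())
```

If nothing is found the function returns `None`, in which case the path has to be located by hand.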

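For the following example, a Chromium-based browser on macOS is used. The variable is named `brave_path` after Brave Browser, but the path shown points to Google Chrome; substitute the path of whichever browser you have installed.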

In [3]:
brave_path = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"

## Installing ChromeDriver

In addition to the browser executable, we also need a corresponding driver so that Selenium can interface with it. Assuming that the browser is Chromium-based, we will use ChromeDriver, which can be downloaded [here](https://sites.google.com/chromium.org/driver/). The ChromeDriver version needs to match the version of the installed browser. After downloading and unpacking it, we can specify the executable's path:

In [13]:
chromedriver_path = "./chromedriver-mac-arm64/chromedriver"
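
A mismatch between the browser and ChromeDriver versions is a common reason the driver fails to start. The sketch below compares the major versions of the two, assuming each binary reports its version when run with `--version`. The version strings are made-up placeholders, and `major_version` and `versions_match` are hypothetical helper names.

```python
import re


def major_version(version_output):
    """Extract the major version from text like 'ChromeDriver 133.0.6943.53'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(browser_output, driver_output):
    """True when the browser and the driver share the same major version."""
    return major_version(browser_output) == major_version(driver_output)


# placeholder strings standing in for real `--version` output
print(versions_match("Google Chrome 133.0.6943.54",
                     "ChromeDriver 133.0.6943.53 (abc123)"))  # True
print(versions_match("Google Chrome 133.0.6943.54",
                     "ChromeDriver 131.0.6778.85 (def456)"))  # False
```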

We can now instantiate the WebDriver with our predefined options:

In [15]:
# Set up chrome options
opt = Options()
opt.binary_location = brave_path

# Instantiate driver
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service, options=opt)

Now, let's get the HTML content for the Washtenaw County precinct voter data:

In [16]:
# Send HTTP request
driver.get("https://electionresults.ewashtenaw.org/electionreporting/nov2024/indexprecinctreport.html")

From this page, we see that there is a long list of links, one for each precinct. Let's start by gathering all of these links so that we can visit every precinct in turn. To do this, we can simply get all the `<a>` elements on the page, since every link on the page points to a precinct.

In [18]:
# get all <a> elements
rows = driver.find_elements(By.TAG_NAME, "a")

Now that we have all the `<a>` elements, we need to get the link URLs. We can do this by accessing the `href` attribute for each:

In [19]:
# Get URLs
hrefs = []
for row in rows:
	hrefs.append(row.get_attribute("href"))
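
Before visiting the collected links, it can help to tidy the list. The sketch below drops missing values, resolves relative links against the index page, and removes duplicates while keeping the original order. `clean_hrefs` is a hypothetical helper and the file names in the example are placeholders; Selenium typically returns fully resolved URLs already, so the join is only a safeguard.

```python
from urllib.parse import urljoin


def clean_hrefs(raw_hrefs, base_url):
    """Drop missing hrefs, resolve relative ones, and remove duplicates in order."""
    seen = set()
    cleaned = []
    for href in raw_hrefs:
        if not href:  # an <a> without an href attribute yields None
            continue
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


base = "https://electionresults.ewashtenaw.org/electionreporting/nov2024/indexprecinctreport.html"
# placeholder file names, not the real precinct pages
for url in clean_hrefs(["precinct1.html", None, "precinct1.html", "precinct2.html"], base):
    print(url)
```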

If we visit any of these precinct links, we will find that they all share the same general structure. For example, each precinct page has exactly one `<font>` element which has a class of `h2`, which corresponds to the precinct's name. We can use this information to get the precinct name for every precinct.

Let's say we also want to tally the total number of registered voters across all of the precincts we collected. Inspecting the precinct pages in the same way as we did for the precinct name, we can see that the number of registered voters is always in the second `<td>` element of the page as a whole. To select it, let's use the following XPath expression:

`(//td)[2]`

Let's also break down how we came up with the syntax:
1. `//td`
- `//`: This selects all elements in the document that match the following node test, regardless of their position in the document tree

- `td`: This is the node test. It selects all `<td>` elements (table data cells in HTML).

So, `//td` selects all `<td>` elements in the entire HTML document.

2. `(//td)`
- The parentheses `()` are used to group the expression `//td`. This is useful when you want to apply an index or other operations to the result of the grouped expression.

3. `[2]`
- The `[2]` is a predicate. It filters the result of the grouped expression `(//td)` to select the second matching element.

In XPath, indexing starts at 1 (not 0, as in many programming languages), so `[2]` selects the second element rather than the third.
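
The parentheses matter: without them, `//td[2]` selects every `<td>` that is the second `<td>` child of its own parent, not the second `<td>` in the whole document. The sketch below illustrates the difference on a made-up fragment using Python's built-in `xml.etree.ElementTree`. That module supports only a subset of XPath, so indexing into the full list of cells stands in for `(//td)[2]` here.

```python
import xml.etree.ElementTree as ET

# made-up fragment: the first row has one cell, the second row has two
fragment = """
<table>
  <tr><td>Registered Voters</td></tr>
  <tr><td>2,908</td><td>Ballots Cast</td></tr>
</table>
"""
root = ET.fromstring(fragment)

# equivalent of (//td)[2]: gather every <td> in document order, take the 2nd
cells = root.findall(".//td")
second_overall = cells[1].text

# equivalent of //td[2]: every <td> that is the 2nd <td> child of its parent
second_in_row = [td.text for td in root.findall(".//td[2]")]

print(second_overall)  # 2,908
print(second_in_row)   # ['Ballots Cast']
```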

In [20]:
total_reg_voters = 0
for href in hrefs:
	# visit each precinct page in turn
	driver.get(href)
	print(f"Precinct: {driver.find_element(By.CLASS_NAME, 'h2').text}")
	reg_voters = driver.find_element(By.XPATH, "(//td)[2]")
	print(f"Registered Voters: {reg_voters.text}")
	total_reg_voters += int(reg_voters.text.replace(",",""))
	print("")

Precinct: CITY OF ANN ARBOR, WARD 1, PRECINCT 1
Registered Voters: 2,908

Precinct: CITY OF ANN ARBOR, WARD 1, PRECINCT 2
Registered Voters: 4,344

Precinct: CITY OF ANN ARBOR, WARD 1, PRECINCT 4
Registered Voters: 1,112

Precinct: CITY OF ANN ARBOR, WARD 1, PRECINCT 5
Registered Voters: 2,018

Precinct: CITY OF ANN ARBOR, WARD 1, PRECINCT 6
Registered Voters: 2,143

Precinct: CITY OF ANN ARBOR, WARD 1, PRECINCT 7
Registered Voters: 2,279

Precinct: CITY OF ANN ARBOR, WARD 1, PRECINCT 8
Registered Voters: 1,868

Precinct: CITY OF ANN ARBOR, WARD 1, PRECINCT 9
Registered Voters: 3,973

Precinct: CITY OF ANN ARBOR, WARD 1, PRECINCT 10
Registered Voters: 2,631

Precinct: CITY OF ANN ARBOR, WARD 2, PRECINCT 13
Registered Voters: 1,625

Precinct: CITY OF ANN ARBOR, WARD 2, PRECINCT 14
Registered Voters: 5,131

Precinct: CITY OF ANN ARBOR, WARD 2, PRECINCT 16
Registered Voters: 1,891

Precinct: CITY OF ANN ARBOR, WARD 2, PRECINCT 17
Registered Voters: 2,086

Precinct: CITY OF ANN ARBOR, WARD

Finally, let's print the total count of registered voters:

In [21]:
print(f"Total Registered Voters: {total_reg_voters}")

Total Registered Voters: 337738
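
The loop above adds up every precinct it visits. If only a subset is wanted, such as the precincts of a single municipality, one option is to filter on the precinct name. The sketch below does this for a few name and count pairs copied from the output above; `tally_by_ward` is a hypothetical helper, and it assumes names follow the `MUNICIPALITY, WARD n, PRECINCT n` pattern seen in that output.

```python
from collections import defaultdict

# a few (precinct name, registered voters) pairs copied from the output above
records = [
    ("CITY OF ANN ARBOR, WARD 1, PRECINCT 1", "2,908"),
    ("CITY OF ANN ARBOR, WARD 1, PRECINCT 2", "4,344"),
    ("CITY OF ANN ARBOR, WARD 2, PRECINCT 13", "1,625"),
]


def tally_by_ward(records, municipality):
    """Sum registered voters per ward for precincts of one municipality."""
    totals = defaultdict(int)
    for name, voters in records:
        parts = [part.strip() for part in name.split(",")]
        # skip other municipalities and names without a ward component
        if parts[0] != municipality or len(parts) < 2:
            continue
        totals[parts[1]] += int(voters.replace(",", ""))
    return dict(totals)


ann_arbor = tally_by_ward(records, "CITY OF ANN ARBOR")
print(ann_arbor)                # {'WARD 1': 7252, 'WARD 2': 1625}
print(sum(ann_arbor.values()))  # 8877
```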


# Using Selenium Interactions & Actions API

For the next example, we will use TikTok to show how Interactions and the Actions API can be used to scrape information from pages that require dynamic interaction.

### Imports

In [22]:
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.actions.wheel_input import ScrollOrigin
import time

For this example, let's load the page for a specific *The Onion* TikTok video:

In [23]:
# Send HTTP request
driver.get("https://www.tiktok.com/@theonion/video/7473543012786752811")

Right off the bat, if we try to interact with this page, we will likely come across a CAPTCHA or some other form of human verification. Since this step cannot be automated, we have to complete it by hand before running the next part of the code.

Throughout this example, we will be looking for DOM elements that might not be loaded yet when we search for them. To wait for them properly, we will use the following helper function:

In [24]:
def wait_then_find(by_method, name, driver):
	# wait up to 20 seconds for an element matching (by_method, name) to appear in the DOM
	element = WebDriverWait(driver, 20).until(
		EC.presence_of_element_located((by_method, name))
	)
	return element
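
Conceptually, an explicit wait polls a condition until it returns something truthy or a timeout expires. The sketch below shows that idea in plain Python with a stand-in condition. It is not Selenium's implementation, and `wait_until` and its parameters are hypothetical names.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# stand-in for a page whose element only "appears" on the third check
attempts = {"count": 0}


def element_appears():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


print(wait_until(element_appears, timeout=2.0, poll=0.01))  # element
```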

Once we have completed the CAPTCHA, analyzing the DOM of this page shows that TikTok uses specific class identifiers for different types of elements. One such class is `ejcng161`, which seems to be the unique class identifier for the comment count text box right above the comments. We can use this to get the total number of comments:

In [25]:
num_comments_el = wait_then_find(By.CLASS_NAME, "ejcng161", driver)

The text in this box has a format like `136 comments`, so let's parse the number out of the text content. Since TikTok uses an infinite-scroll-like interface to load comments, we need to scroll down to load all the comments into the DOM. Let's use the count we just got to define how far we need to scroll:

In [26]:
total_num_comments = int(num_comments_el.text.split(' ')[0])
scroll_amount = total_num_comments * 25

print(f"Scrolling by {scroll_amount} for {total_num_comments} comments") 

Scrolling by 3700 for 148 comments
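
Splitting on a space and calling `int` works for labels like `136 comments` but would fail on a thousands separator or an abbreviated count. Whether TikTok ever displays the count that way is an assumption; if it does, a more tolerant parser along the lines of this sketch could be used. `parse_comment_count` is a hypothetical helper.

```python
import re

# multipliers for abbreviated counts (assumed formats, e.g. "12.3K")
SUFFIXES = {"": 1, "K": 1_000, "M": 1_000_000}


def parse_comment_count(label):
    """Parse the leading number of a label such as '136 comments'."""
    match = re.match(r"\s*([\d,]+(?:\.\d+)?)([KkMm]?)", label)
    if match is None:
        raise ValueError(f"no count found in {label!r}")
    number = float(match.group(1).replace(",", ""))
    multiplier = SUFFIXES[match.group(2).upper()]
    return int(round(number * multiplier))


print(parse_comment_count("136 comments"))    # 136
print(parse_comment_count("1,234 comments"))  # 1234
print(parse_comment_count("12.3K comments"))  # 12300
```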


Next, let's scroll the page in increments of 500, up to the amount specified:

In [27]:
# scroll in increments of 500
scroll_pos = 0
while scroll_pos < scroll_amount:
	ActionChains(driver)\
		.scroll_by_amount(0, 500)\
		.perform()
	scroll_pos += 500
	time.sleep(1.5)
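
To get a feel for how long the loop above runs, the small sketch below works out the number of scroll steps, the total distance scrolled, and the time spent sleeping for a given scroll amount. `scroll_plan` is a hypothetical helper that mirrors the loop's step size and pause.

```python
import math


def scroll_plan(scroll_amount, step=500, pause=1.5):
    """Return (number of steps, total distance scrolled, seconds spent sleeping)."""
    # the loop runs while scroll_pos < scroll_amount, so it rounds up to whole steps
    steps = math.ceil(scroll_amount / step) if scroll_amount > 0 else 0
    return steps, steps * step, steps * pause


print(scroll_plan(3700))  # (8, 4000, 12.0)
```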

Once all comments are loaded, we want to expand all replies to comments. To do this, we will click each `View More` button to load as many replies as possible. Using the same strategy we used to get the total comment count, we can find the unique class identifier for this kind of button, `ezgpko45`. Since this class identifier is also used for the `Hide` button, we filter out any button that is not a `View More` button. To load as many replies as possible, we keep looping over all `View More` buttons that are present and clicking them until no new comment replies are loaded.

In [28]:
prev_num_comments = 0

# expand all "view more" buttons
while True:
	# get number of comment boxes
	num_comments = len(driver.find_elements(By.CLASS_NAME, "e1970p9w2"))

	# get all "view more" buttons
	comment_buttons = driver.find_elements(By.CLASS_NAME, "ezgpko45")
	view_mores = [vm for vm in comment_buttons if vm.find_element(By.TAG_NAME, "span").text != "Hide"]

	# stop trying if we did not get any new comments loaded
	if num_comments == prev_num_comments:
		break
	prev_num_comments = num_comments

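	for vm in view_mores:
		try:
			# accessing the element raises if it has gone stale since we collected it
			vm.find_element(By.TAG_NAME, "span")
		except Exception:
			continue

		# scroll the button into view, then pause briefly before clicking
		ActionChains(driver)\
			.scroll_from_origin(ScrollOrigin.from_element(vm, 0, 0), 0, 35)\
			.pause(1)\
			.perform()
		# implicitly_wait() only sets a lookup timeout and does not pause,
		# so sleep explicitly to give the page time to settle
		time.sleep(2)
		vm.click()
		time.sleep(5)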
	
	print(f"Found {len(driver.find_elements(By.CLASS_NAME, 'e1970p9w2')) - prev_num_comments} more replies!")

ReadTimeoutError: HTTPConnectionPool(host='localhost', port=50865): Read timed out. (read timeout=120)
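
The loop above stops only when a pass loads nothing new. A variation on the same idea adds a cap on the number of passes so that it cannot run indefinitely. The sketch below shows the pattern with a stand-in for the page; `expand_until_stable`, `max_rounds`, and the fake state are hypothetical and do not touch the browser.

```python
def expand_until_stable(count_items, expand_once, max_rounds=20):
    """Call expand_once until count_items stops growing or max_rounds is reached."""
    previous = count_items()
    for rounds in range(1, max_rounds + 1):
        expand_once()
        current = count_items()
        if current == previous:  # nothing new was loaded, so stop
            return current, rounds
        previous = current
    return previous, max_rounds


# stand-in for the page: each pass reveals up to 3 of 8 hidden replies
state = {"loaded": 10, "hidden": 8}


def fake_expand():
    revealed = min(3, state["hidden"])
    state["hidden"] -= revealed
    state["loaded"] += revealed


print(expand_until_stable(lambda: state["loaded"], fake_expand))  # (18, 4)
```

In the real loop, `count_items` would count the comment boxes and `expand_once` would click the currently visible `View More` buttons.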

Finally, let's print out the author and comment text for all of the comments we loaded:

In [None]:
# get all comment boxes
comments = driver.find_elements(By.CLASS_NAME, "e1970p9w2")

print(f"{len(comments)} comments out of {total_num_comments} TikTok reported total found.\n")
# print each comment's author and text
for c in comments:
	print("Author: \n " + c.find_element(By.CLASS_NAME, "e1vx58lt0").text)
	print("Comment: \n" + c.find_element(By.TAG_NAME, "span").find_element(By.TAG_NAME, "p").text)
	print("")

driver.quit()

98 comments out of 149 TikTok reported total found.

Author: 
 Confused Person🎧🎸⚭
Comment: 
Wow im supprised the onion is able to keep going in todays america

Author: 
 raider_1941
Comment: 
this video is like a decade old.

Author: 
 shampoomaster42
Comment: 
no 🤪

Author: 
 Zylonpylon
Comment: 
Old clip

Author: 
 Confused Person🎧🎸⚭
Comment: 
I know he is such a idiot

Author: 
 fiona <3
Comment: 
Listen to anything the maga Republican Party says and you’ll be surprised what you can say on air….😐

Author: 
 Doobie Houser MD
Comment: 
Why? Too many Magats will think it’s real?

Author: 
 chowchow
Comment: 
Why... it is the current state of 80% of Americans

Author: 
 Brandon
Comment: 
They just blend in with real news cuz reality is stupid

Author: 
 Sloth
Comment: 
It's because trump loves watching "the best documentaries"

Author: 
 TheLeterminator
Comment: 
and everyone doesn't get it lol. What happened to satire. I guess when the whole world is a joke then it's tough to tell the 