---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

# Code 

In the following code, we used Selenium to drive a headless Chrome browser and load the Bike Washington route listing at `http://bikewashington.org/routes/all.htm`. We then located three sets of elements with XPath expressions: the route names, the route descriptions, and the table rows holding each route's ratings. After reading the text of each element into a list, we combined the three lists into a Pandas DataFrame and used a regular expression to split the ratings string into separate `terrain`, `traffic`, and `scenery` columns. Finally, we saved the result as a CSV file for further use.

The code in this section was largely adapted from the @Pfalzgraf_2020_selenium walkthrough on scraping data for NBA players.

```python
# IMPORTS
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd
```

```python
# CONFIGURE DRIVER
# Create an options object and configure Chrome to run headless (without a GUI)
# ChatGPT helped address an error with Chrome crashing upon startup
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in headless mode
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
# Define the path to the driver (machine-specific; adjust for your own system)
DRIVER_PATH = '/home/gentry/.cache/selenium/chromedriver/linux64/131.0.6778.85/chromedriver'
# Use the path to the chromedriver to create a service object
cService = webdriver.ChromeService(executable_path=DRIVER_PATH)
# Pass the service object and options to create a driver object
driver = webdriver.Chrome(service=cService, options=chrome_options)

# Use the driver to access the bike path website
driver.get('http://bikewashington.org/routes/all.htm')
```

```python
# Use XPath expressions (found with the browser's inspect tool) to locate the data we want
# Search for instances of bike path names
names = driver.find_elements(By.XPATH, '//td/a')
# Search for instances of bike path descriptions
descriptions = driver.find_elements(By.XPATH, '//td[@colspan="3"]')
# Search for the table rows that hold bike path ratings
ratings = driver.find_elements(By.XPATH, "//tr[td/b[text()='Traffic:']]")

# Grab the text from the located elements and turn it into lists
bike_path_names = [name.text for name in names]
bike_path_descriptions = [description.text for description in descriptions]
bike_path_ratings = [rating.text for rating in ratings]

# Close the browser now that all text has been read from the page
driver.quit()

# Since the lists are all ordered the same way, we can combine them into a single data frame.
# pd.DataFrame raises a ValueError if the lists differ in length.
df = pd.DataFrame({'name': bike_path_names, 'description': bike_path_descriptions, 'ratings': bike_path_ratings})
# Extract each rating into its own column using a regex pattern on the ratings string
df[['terrain', 'traffic', 'scenery']] = df['ratings'].str.extract(r'Terrain: (\d+) Traffic: (\d+) Scenery: (\d+)')
# Drop the combined ratings column
df = df.drop(columns=['ratings'])

# Display the scraped data
print(df)
# Save to CSV
df.to_csv('../../data/raw-data/dc_bike_routes.csv', index=False)
```

```text
                             name  \
0                    Potomac Tour   
1                       BWI Trail   
2                  Airpark Cruise   
3              Seneca Valley Tour   
4                   For The Boyds   
5                  Key Chain Tour   
6          The Arlington Triangle   
7              Mount Vernon Trail   
8                 Gettysburg Tour   
9                  The Zoo Review   
10      Antietam Battlefield Tour   
11                Thurmont Ramble   
12                        Wye Not   
13              Lost Blossom Tour   
14  Patuxent Wildlife Center Loop   
15        Western Montgomery Loop   
16                Peach Tree Loop   
17                    Oxford Loop   
18            A Ride to the Falls   
19           Montgomery Lake Tour   
20        Shenandoah River Ramble   
21      The Great Washington Loop   
22               Chesapeake Bound   
23         Waterford Double-Cross   
24            South Mountain Loop   
25       Mason-Dixon Double-Cross   
```
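
The scraping code relies on two assumptions: that the three scraped lists line up one-to-one, and that every ratings string follows the `Terrain: … Traffic: … Scenery: …` pattern. The following is a minimal sketch of how those assumptions could be checked with the standard library alone, without a browser. The helper names `parse_ratings` and `combine_rows` and the sample values are hypothetical and are not part of the original scraper. Unlike `str.extract`, which returns strings, this sketch converts the scores to integers.

```python
import re

# Pattern mirrors the one passed to str.extract in the scraping code.
RATING_PATTERN = re.compile(r"Terrain: (\d+) Traffic: (\d+) Scenery: (\d+)")


def parse_ratings(text):
    """Return terrain, traffic, and scenery scores as ints, or None if no match."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    terrain, traffic, scenery = (int(value) for value in match.groups())
    return {"terrain": terrain, "traffic": traffic, "scenery": scenery}


def combine_rows(names, descriptions, ratings):
    """Zip the three scraped lists into row dicts, failing loudly on a length mismatch."""
    if not (len(names) == len(descriptions) == len(ratings)):
        raise ValueError(
            f"Misaligned lists: {len(names)} names, "
            f"{len(descriptions)} descriptions, {len(ratings)} ratings"
        )
    empty_scores = {"terrain": None, "traffic": None, "scenery": None}
    rows = []
    for name, description, rating in zip(names, descriptions, ratings):
        # Fall back to empty scores when a ratings string does not match the pattern
        scores = parse_ratings(rating) or empty_scores
        rows.append({"name": name, "description": description, **scores})
    return rows


# Hypothetical sample values, not real scraped output.
sample_rows = combine_rows(
    ["Example Route"],
    ["A made-up description."],
    ["Terrain: 2 Traffic: 3 Scenery: 4"],
)
print(sample_rows)
```

A list of row dictionaries like `sample_rows` can be passed directly to `pd.DataFrame`, so a check of this kind could sit between the scraping step and the data frame construction.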

{{< include closing.qmd >}} 