# **Web Scraping Task 2 : Using Selenium/BS4 to Extract Horse Race Data**

### **Background** ###
 This notebook provides a step-by-step guide to scraping HTML content from the web. We will use both Selenium and BeautifulSoup (BS4): Selenium to automate site navigation and render dynamic JavaScript content, and BeautifulSoup to parse the resulting HTML.
 <br>
 
 In this example, we’ll navigate to www.swiftbet.com.au, choose one of tomorrow's gallops races at random, and scrape the horse and odds data.

### **Step by Step Guide** ###
The steps outlined below will guide you through the entire process:
<br>

1) **Import Libraries :** Import the relevant libraries and configure the settings for our Selenium driver. Some libraries may require a pip install.

2) **Initialise Selenium :** Start the Selenium driver, open SwiftBet, and navigate to today's racing page.

3) **Navigate to Tomorrow's Races :** Use Selenium to navigate to tomorrow's races.

4) **Scrape Element Data and Choose Race :** Scrape all of the race elements, choose a race at random, and navigate to its page.

5) **Scrape HTML for Race Data :** Scrape the relevant data for horse names and odds.

6) **Construct DataFrame :** Construct our DataFrame.

7) **Make Title for DataFrame :** Format a title for our file name.

8) **Export DataFrame :** Export our DataFrame as a .csv file.
<br>

By the end of this notebook, you will have a clear understanding of how to scrape and process web data, even from complex sites with dynamic content.  

In [51]:
#Importing relevant libraries, some of which may require pip installs
from selenium import webdriver 
from selenium.webdriver.common.by import By  
from selenium.webdriver.chrome.options import Options 
from bs4 import BeautifulSoup 
import pandas as pd 
import random
from datetime import datetime, timedelta 

In [52]:
# Initialising selenium and adding optional settings
options = Options()
#options.add_argument("--headless") # Ran without headless mode as elements weren't loading properly
options.add_argument("--disable-gpu")  # Disables GPU hardware acceleration
options.add_argument("--no-sandbox")   # Disables Chrome's sandbox
options.add_argument("--disable-dev-shm-usage")  # Overcomes limited shared-memory problems
driver = webdriver.Chrome(options=options) # Using Chrome as web driver
driver.implicitly_wait(30) # Element lookups retry for up to 30 seconds; set once, applies for the whole session
site = "https://www.swiftbet.com.au" # SwiftBet's URL
driver.get(site) # Accessing site with selenium

In [53]:
# Clicking on the link to the racing page
# (the implicit wait set earlier still applies, so it does not need to be set again)
racing_button = driver.find_element(By.XPATH, "//div[contains(text(), 'Racing')]")
racing_button.click()

In [54]:
# Clicking on the link to tomorrow's races in AU/NZ
# (the implicit wait set earlier still applies, so it does not need to be set again)
tomorrow_button = driver.find_element(By.CSS_SELECTOR, "button[data-fs-title='page:racing-tab:tomorrow-header_bar']")
tomorrow_button.click()
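A note on waiting: `implicitly_wait` only sets how long element lookups keep retrying; it does not pause until a page has finished rendering. Waiting for one specific condition is often more reliable, and Selenium provides `WebDriverWait` for that purpose. The sketch below shows the same polling idea in plain Python so that it runs without a browser. `wait_until`, its parameters, and the `element_is_ready` stand-in are hypothetical names for illustration, not part of Selenium.

```python
import time


def wait_until(condition, timeout=30.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper that mirrors the idea behind an explicit wait.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose button only appears on the third check.
attempts = {"count": 0}


def element_is_ready():
    attempts["count"] += 1
    return "tomorrow-button" if attempts["count"] >= 3 else None


found = wait_until(element_is_ready, timeout=5, poll_interval=0.01)
print(found)
```

In a real run, the condition would be a lookup against the driver rather than a counter.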

In [55]:
# Extracting the elements from tomorrow's gallops races
race_elements_tomorrow = driver.find_elements(By.CSS_SELECTOR, "a[href*='/racing/gallops/']")

# Checking the length of race elements 
print(len(race_elements_tomorrow))


32


In [56]:
# Choosing a random race
if not race_elements_tomorrow:
    print("No race elements found.")
else:
    # Randomly select a race
    random_race = random.choice(race_elements_tomorrow)

    # Click on the randomly selected race
    random_race.click()
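Clicking a stored element can fail if the page re-renders between the lookup and the click. One alternative, sketched here under assumptions, is to collect each link's `href` first (for example with `element.get_attribute("href")`) and choose among plain strings. The helper name `pick_race_url`, the `seed` parameter, and the example URLs are hypothetical.

```python
import random


def pick_race_url(race_urls, seed=None):
    """Return one URL chosen at random, or None if the list is empty."""
    unique_urls = sorted(set(race_urls))  # drop duplicate links to the same race
    if not unique_urls:
        return None
    rng = random.Random(seed)  # a fixed seed makes the choice repeatable
    return rng.choice(unique_urls)


# Hypothetical hrefs standing in for the links found on the page.
example_urls = [
    "https://example.com/racing/gallops/race-1",
    "https://example.com/racing/gallops/race-2",
    "https://example.com/racing/gallops/race-2",
    "https://example.com/racing/gallops/race-3",
]

chosen = pick_race_url(example_urls, seed=42)
print(chosen)
```

The chosen URL could then be opened with `driver.get(chosen)` instead of a click. Passing a seed is only useful when a repeatable choice is wanted, such as while debugging.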

In [57]:
# Initialise BeautifulSoup with the page source
current_url = driver.current_url
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')

In [58]:
# Extracting the elements which encase the information of horses in the randomly selected race
horse_items = soup.find_all('li', class_='name styles egik0gw0 css-1c8uam-ListItem-ListItem-ListItem-RaceSelectionsListItem-RaceSelectionsListItem-RaceSelectionsListItem-RaceSelectionsListItem e1ad0cjx0')


In [59]:
# Extracting the horses' names

# Initialise list
horses_names = []

# Loop through elements to extract the horses' names
for item in horse_items:
    # Find the div containing the horse name
    horse_name_div = item.find('div', class_='e3trgs57 css-1bpf5z2-Text-Text-RaceSelectionDetails-RaceSelectionsDetails__Name-RaceSelectionDetails-RaceSelectionDetails ea6hjv30')
    
    if horse_name_div:
        # Get the full text from the div
        full_text = horse_name_div.get_text(strip=True)
        
        # Split by whitespace and remove the first element (the number and dot)
        name_parts = full_text.split()
        horse_name = ' '.join(name_parts[1:])
        
        # Remove the trailing number in parentheses if present
        horse_name = horse_name.split('(')[0].strip()
        
        # Append horse name to list
        horses_names.append(horse_name)

# Checking the result 
print(horses_names)

['Star Magnum', 'Ocean Emperor', 'Mathematics', 'Pompeii Empire']
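The split-and-join cleanup above assumes the runner number is always separated from the name by whitespace. A regular expression is one way to make that step tolerant of missing spaces. This is a minimal sketch: `clean_horse_name` and `RUNNER_PATTERN` are hypothetical names, and the raw strings are assumed examples of what the name div might contain.

```python
import re

# Leading runner number ("7."), the name, then an optional trailing number in parentheses.
RUNNER_PATTERN = re.compile(r"^\s*\d+\.?\s*(.*?)\s*(?:\(\d+\))?\s*$")


def clean_horse_name(raw_text):
    """Strip the leading runner number and the trailing bracketed number."""
    match = RUNNER_PATTERN.match(raw_text)
    # Fall back to the stripped text when there is no leading number.
    return match.group(1) if match else raw_text.strip()


print(clean_horse_name("1. Star Magnum (4)"))    # Star Magnum
print(clean_horse_name("12.Ocean Emperor(10)"))  # Ocean Emperor
print(clean_horse_name("Mathematics"))           # Mathematics
```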


In [60]:
# Extracting the selection of odds for each horse 

# Initialise list
odds_list = []

# Loop through the elements which contain the horse information to extract the odds data.
for item in horse_items:
    # Find all span elements containing the odds
    odds_spans = item.find_all('span', class_='eatknsg1 css-fvda5w-Text-Text-BettingAdd-styled-BettingAdd__Single-BettingAdd-styled ea6hjv30')
    
    if odds_spans:
        # Extract text from each odds span and combine them if there are multiple
        odds = ' / '.join([span.get_text(strip=True) for span in odds_spans])
        odds_list.append(odds)
    else:
        # In case no odds are found
        odds_list.append("N/A")  

# Checking the result 
print(odds_list) 

['MID', 'MID', 'MID', 'MID']


In [61]:
# Formatting the odds list into a list of lists, using ' / ' as a delimiter

# Initialising list
complete_odds_list = []

# Looping through to separate the different types of odds for each horse
for odd in odds_list:
    # Split on the same ' / ' separator used when joining, so values carry no
    # stray spaces and the "N/A" placeholder stays in one piece
    odd_array = odd.split(' / ')
    complete_odds_list.append(odd_array)

# Checking the result 
print(complete_odds_list)

[['MID'], ['MID'], ['MID'], ['MID']]
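If numeric odds are needed later, the text values can be converted while they are split. The sketch below assumes decimal odds joined with `' / '` and treats anything non-numeric, such as the `'MID'` placeholder seen above, as missing. `parse_odds` is a hypothetical helper, and the `"2.50 / 1.40"` input is an invented example.

```python
def parse_odds(odds_text, delimiter=" / "):
    """Split a joined odds string and convert each part to float where possible."""
    parsed = []
    for part in odds_text.split(delimiter):
        part = part.strip()
        try:
            parsed.append(float(part))
        except ValueError:
            parsed.append(None)  # non-numeric placeholder such as 'MID' or 'N/A'
    return parsed


print(parse_odds("2.50 / 1.40"))  # [2.5, 1.4]
print(parse_odds("MID"))          # [None]
print(parse_odds("N/A"))          # [None]
```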


In [62]:
# Extracting the names of the different types of odds

# Find the div which contains the odds types
odds_titles_div = soup.find('div', class_='en9z9v51 css-sfxk7f-Text-Text-RaceSelectionsList-RaceSelectionsList__HeaderRow-RaceSelectionsList ea6hjv30')

# Initialising list 
odds_titles = []

# Find all divs within the parent div that contain the odds types
if odds_titles_div:
    titles = odds_titles_div.find_all('div', class_='css-101kdtv-RaceSelectionsList-RaceSelectionsList__HeaderCell-RaceSelectionsList-RaceSelectionsList en9z9v52')
    
    for title in titles:
        odds_titles.append(title.get_text(strip=True))

# Checking the result
print(odds_titles)

['MID (W)']


In [65]:
# Quitting the selenium driver 
driver.quit()

In [64]:
# Checking if all lists are the same length
if len(horses_names) == len(odds_list):
    print('All lists are same length')
else:
    print('Lists are not the same length')

All lists are same length


In [66]:
# Constructing the DataFrame 
df_performed_bets = pd.DataFrame({'Horse Name' : horses_names})
print(df_performed_bets)

       Horse Name
0     Star Magnum
1   Ocean Emperor
2     Mathematics
3  Pompeii Empire


In [67]:
# Adding one column per odds type, matching each title's index to the odds for every horse
for i, title in enumerate(odds_titles):
    column_values = [horse_odds[i] for horse_odds in complete_odds_list]
    df_performed_bets[title] = column_values

print(df_performed_bets)

       Horse Name MID (W)
0     Star Magnum     MID
1   Ocean Emperor     MID
2     Mathematics     MID
3  Pompeii Empire     MID
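The column-by-column approach above raises an `IndexError` if any horse has fewer odds values than there are titles. One way to guard against that, sketched here with a hypothetical `build_rows` helper, is to build one dictionary per horse and pad missing values with `None`.

```python
def build_rows(names, odds_per_horse, odds_titles):
    """Combine names and odds into one dict per horse, padding missing odds with None."""
    if len(names) != len(odds_per_horse):
        raise ValueError("names and odds_per_horse must be the same length")
    rows = []
    for name, odds in zip(names, odds_per_horse):
        row = {"Horse Name": name}
        for i, title in enumerate(odds_titles):
            row[title] = odds[i] if i < len(odds) else None
        rows.append(row)
    return rows


rows = build_rows(
    ["Star Magnum", "Ocean Emperor"],
    [["MID"], []],  # assumed case: no odds scraped for the second horse
    ["MID (W)"],
)
print(rows)
```

Passing the result to `pd.DataFrame(rows)` would give the same table layout as above, with one row per horse.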


In [68]:
# Extracting the page title and forming a short title to use in the file name
title = soup.title.string.split()
full_title = '_'.join(title[:3])  # First three words, joined with underscores
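Page titles can contain characters that are awkward or invalid in file names, such as `:`, `/` or `|`. A small cleanup step, sketched below, keeps only letters, digits and hyphens in each word. `title_to_slug` is a hypothetical helper and the sample title is invented for illustration.

```python
import re


def title_to_slug(page_title, max_words=3):
    """Join the first `max_words` words with underscores, keeping only safe characters."""
    words = page_title.split()[:max_words]
    cleaned = [re.sub(r"[^A-Za-z0-9-]", "", word) for word in words]
    # Skip words that were made up entirely of removed characters.
    return "_".join(word for word in cleaned if word)


print(title_to_slug("Race 4: Ascot | Example Bookmaker"))  # Race_4_Ascot
```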

In [69]:
# Getting tomorrow's date, formatting the file name and exporting to .csv
tomorrow = datetime.now().date() + timedelta(days=1)
file_name = f'{full_title}_{tomorrow}.csv'
df_performed_bets.to_csv(file_name, index = True)
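The file name logic can be pulled into a small function so that the date arithmetic and the `.csv` extension are easy to check without running the scraper. This is a sketch under assumptions: `build_csv_path`, its parameters, and the example title are hypothetical.

```python
from datetime import date, timedelta
from pathlib import Path


def build_csv_path(title_slug, today, output_dir="."):
    """Return '<output_dir>/<title_slug>_<tomorrow>.csv' for the day after `today`."""
    race_date = today + timedelta(days=1)
    return Path(output_dir) / f"{title_slug}_{race_date.isoformat()}.csv"


path = build_csv_path("Race_4_Ascot", date(2024, 12, 31))
print(path.name)  # Race_4_Ascot_2025-01-01.csv
```

Taking `today` as an argument, rather than calling `datetime.now()` inside the function, keeps the result predictable. The returned path can be passed straight to `to_csv`.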
