# Web Scraping Ticket Prices from StubHub

This Jupyter Notebook demonstrates how to scrape ticket prices from StubHub using Selenium and process the data with pandas. The workflow includes:

1. Setting up the Selenium WebDriver with custom options.
2. Reading ticket sales data from an Excel file.
3. Logging into StubHub and navigating to the search results.
4. Extracting ticket prices for specified artists.
5. Saving the scraped data back to an Excel file for further analysis.

Below are the detailed steps and code implementation.


In [13]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
import pandas as pd
import numpy as np
from datetime import datetime
import re
import time


# Define a test user-agent string to simulate a browser request
test_ua = 'Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'

# Initialize Chrome options for the WebDriver
options = Options()

# Uncomment the following line if you want to run the browser in headless mode (no GUI)
# options.add_argument("--headless")

# Set the window size for the browser
options.add_argument("--window-size=1920,1080")

# Set the user-agent to the test user-agent defined above
options.add_argument(f'--user-agent={test_ua}')

# Add additional options to improve stability and compatibility
options.add_argument('--no-sandbox')  # Required for running in some environments
options.add_argument("--disable-extensions")  # Disable extensions to avoid potential conflicts

# Initialize the Chrome WebDriver with the specified options
driver = webdriver.Chrome(options=options)




In [14]:
# The following code reads an Excel file and loads the data into a pandas DataFrame
path = "../../Documents/Ticket Sales.xlsx"
sales = pd.read_excel(path, sheet_name ="Sheet1")


In [15]:
"""
This cell logs into the StubHub website with Selenium WebDriver. Step by step, it:
1. Navigates to StubHub's homepage.
2. Clicks on the 'Sign In' button.
3. Waits for the email input field to be present and enters the email.
4. Enters the password and submits the form.
5. Attempts to click the submit button if it appears.
Note: Ensure that the necessary imports for Selenium WebDriver, WebDriverWait, and expected conditions (EC) are included in your script.
"""
# Navigate to StubHub's homepage
driver.get("https://www.stubhub.ca")

# Click on the 'Sign In' button
driver.find_element(By.XPATH, "//*[text() ='Sign In']").click()

# Wait for the email input field to be present and enter the email.
# Credentials are read from environment variables instead of being hardcoded;
# the names STUBHUB_EMAIL and STUBHUB_PASSWORD are assumptions, set them before running.
import os

email_input = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "input[type='email']")))
email_input.send_keys(os.environ["STUBHUB_EMAIL"])

# Enter the password and submit the form
driver.find_element(By.CSS_SELECTOR, "input[type=password]").send_keys(os.environ["STUBHUB_PASSWORD"] + Keys.ENTER)
time.sleep(5)

# Attempt to click the submit button if it appears; ignore the error if it does not
try:
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
except Exception:
    pass
time.sleep(5)
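
The fixed `time.sleep(5)` pauses above always wait the full five seconds, even when the page is ready sooner, and give no signal when it is slower. Explicit waits such as `WebDriverWait` rest on a simpler idea: poll a condition until it holds or a timeout expires. The sketch below shows that idea in plain Python; `wait_until` is a hypothetical helper written for illustration, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper illustrating the polling idea behind explicit waits;
    it is not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Example: a condition that only becomes true on the third check.
attempts = []


def ready_on_third_try():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(ready_on_third_try, timeout=2.0, poll_interval=0.01)
```

In the notebook itself the same effect is available through `WebDriverWait(driver, 10).until(...)`, which this cell already uses for the email field.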

In [16]:
def generate_stubhub_url(search_query):
    """
    Generates a StubHub search URL for a given artist name.
    This function appends "Toronto" to the input string, replaces spaces
    with plus signs, and adds the result to a predefined StubHub search URL.
    If the input is None or NaN, it returns the placeholder string "lol".
    
    Args:
        search_query (str): The search query string.
    
    Returns:
        str: A formatted StubHub search URL or "lol" if the input is None or NaN.
    """
    # pd.isna catches None and every NaN value; `in [None, np.nan]` can miss
    # NaN values that are not the np.nan object itself.
    if pd.isna(search_query):
        return "lol"
    else:
        # Append the city before replacing spaces so the URL contains no raw space
        formatted_query = (search_query + " Toronto").replace(" ", "+")
        return "https://www.stubhub.ca/secure/search?q=" + formatted_query + "&sellSearch=false&sortBy="
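
Building the URL by string concatenation leaves characters such as `&` in an artist name unescaped. A minimal sketch of an alternative, using the standard library's `urlencode`, is shown below. The function name `build_search_url` and the `city` parameter are introduced here for illustration; the base URL and query parameters are the ones used in `generate_stubhub_url`.

```python
from urllib.parse import urlencode

# Base search URL taken from generate_stubhub_url.
SEARCH_BASE = "https://www.stubhub.ca/secure/search"


def build_search_url(artist, city="Toronto"):
    """Return a search URL with the artist and city safely encoded."""
    params = {
        "q": f"{artist} {city}",
        "sellSearch": "false",
        "sortBy": "",
    }
    # urlencode quotes with quote_plus by default: spaces become '+', '&' becomes '%26'.
    return f"{SEARCH_BASE}?{urlencode(params)}"


url = build_search_url("Two Friends")
print(url)
```

For plain names this yields the same URL shape as the concatenation approach; the difference only shows up when a name contains reserved characters.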


In [17]:
"""
Reads ticket sales data from an Excel file.

The input Excel file contains names of artists and dates of their shows.
"""
path = "../../Documents/Ticket Sales.xlsx"

sales = pd.read_excel(path, sheet_name ="Sheet1")

In [18]:

"""
This cell defines process_artist, which looks up a single artist on StubHub with Selenium WebDriver
and returns the prices it finds. The next cell loops over the sales DataFrame, calls process_artist
for each upcoming event, and stores the results in lists for further processing.
Variables (created in the next cell):
    artist_names (pd.Series): Series containing artist names from the sales DataFrame.
    processed_artists (list): List to store artist names that have been processed.
    ticket_prices (list): List to store ticket prices.
    user_ticket_prices (list): List to store ticket prices for the current user.
    event_dates (list): List to store event dates.
    current_date (datetime): Today's date.
Workflow:
    1. Extract artist names from the sales DataFrame.
    2. Initialize lists to store results.
    3. Get today's date.
    4. Iterate over each row in the sales DataFrame.
        a. Check if the event date is missing or in the future and the artist is not already processed.
        b. Generate the search URL for the artist and navigate to it using Selenium WebDriver.
        c. Wait for the event link to be present and get its href attribute.
        d. Close any modal that appears.
        e. Apply filters to the search results.
        f. Get the list of ticket listings and check if the listing is for the current user.
        g. Retrieve the price from the listing and append the results to the lists.
"""

def process_artist(index, row):
    """
    Processes an artist's event by navigating to the search URL, applying filters, and retrieving ticket prices.
    
    Args:
        index: Index label of the row in the sales DataFrame, used to write values back.
        row (pd.Series): A row from the sales DataFrame containing artist and event information.
    
    Returns:
        tuple: A tuple containing the artist's name, the ticket price from StubHub, and the user's ticket price,
        or (None, None, None) if the event page or price cannot be found.
    """
    # Generate the search URL for the artist and navigate to it using Selenium WebDriver
    artist_search_url = generate_stubhub_url(row["Artist"])
    driver.get(artist_search_url)
    try:
        # Wait for the event link to be present and get its href attribute
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH,  "//*[contains(text(),'See')]")))
        event_link = driver.find_element(By.XPATH, "/html/body/div[1]/div[1]/div[4]/div[2]/div/div[1]/div/div[2]/ul/li[2]/a").get_attribute("href") + "&betterValueTickets=false" + "&estimatedFees=false"
        # The event URL embeds the show date as month-day-year
        date_match = re.search(r'(\d{1,2})-(\d{1,2})-(\d{4})', event_link)
        if date_match:
            month, day, year = date_match.groups()
            extracted_date = datetime.strptime(f"{month}-{day}-{year}", '%m-%d-%Y')
            # Store the parsed date only when one was found; extracted_date is undefined otherwise
            sales.at[index, "Date"] = pd.Timestamp(extracted_date)
        else:
            print("No date found in the URL")
        driver.get(event_link)
    except Exception:
        print(row["Artist"] + ' no event link')
        return None, None, None

    # Close any modal that appears
    driver.find_element(By.XPATH, '//*[@id="modal-root"]/div/div/div/div[2]/div[3]/button').click()
    driver.find_element(By.CSS_SELECTOR, "div.sc-xrltsx-2").click()
    time.sleep(1)
    
    # Apply filters to the search results
    if driver.find_element(By.CSS_SELECTOR, "input[type=checkbox]").get_attribute("value") == "true":
        driver.find_element(By.CSS_SELECTOR, "input[type=checkbox]").click()
    time.sleep(6)

    user_ticket_price = 0
    # Get the list of ticket listings and check if the listing is for the current user
    ticket_listings = driver.find_elements(By.CLASS_NAME, "sc-57jg3s-0")
    for listing in ticket_listings:
        try:
            listing.find_element(By.CLASS_NAME, "sc-1l8fa2j-14") 
            user_ticket_price = listing.find_element(By.CLASS_NAME, "sc-1bp3ico-0").get_attribute("data-price")
            user_ticket_price = float(re.sub(r'[^\d.]', '', user_ticket_price))
            break
        except:
            pass

    if user_ticket_price ==  0:
        print(row["Artist"] + ' no user ticket price')

    try:
        # Retrieve the price from the listing
        ticket_price = driver.find_element(By.CLASS_NAME, "sc-1bp3ico-0").get_attribute("data-price")
        ticket_price = float(re.sub(r'[^\d.]', '', ticket_price))
        if (ticket_price) > (sales.at[index, "Max Resell"]) and (ticket_price) < 200:
            sales.at[index, "Max Resell"] = ticket_price
        return row["Artist"], ticket_price, user_ticket_price
    except:
        print('no ticket price')
        return None, None, None
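
The date handling inside `process_artist` can be checked without a browser. The sketch below isolates it in a small function using the same regular expression; `extract_event_date` is a name introduced here, and the sample URL is made up for illustration, so real StubHub links may be structured differently.

```python
import re
from datetime import datetime

# Same month-day-year pattern that process_artist applies to the event link.
DATE_IN_URL = re.compile(r"(\d{1,2})-(\d{1,2})-(\d{4})")


def extract_event_date(url):
    """Return the first month-day-year date found in `url`, or None."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    month, day, year = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:
        # The digits matched the pattern but do not form a real calendar date.
        return None


# Hypothetical URL, invented for this example only.
sample = "https://example.com/artist-toronto-tickets-2-7-2025/event/123"
event_date = extract_event_date(sample)
print(event_date)
```

Returning `None` for a missing or invalid date lets the caller decide whether to skip the row or continue without updating the `Date` column.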



In [19]:
# Extract artist names from the sales DataFrame
artist_names = sales["Artist"]

# Initialize lists to store results
processed_artists = []
ticket_prices = []
user_ticket_prices = []
event_dates = []

# Get today's date
current_date = datetime.today()

# Iterate over each row in the sales DataFrame
for index, row in sales.iterrows():
    # Check if the event date is missing or in the future and the artist is not already processed.
    # pd.isna handles None, NaN and NaT, which `in [None, np.nan]` can miss.
    if (pd.isna(row["Date"]) or current_date < row["Date"]) and row["Artist"] not in processed_artists:
        artist, ticket_price, user_ticket_price = process_artist(index, row)
        if artist:
            processed_artists.append(artist)
            ticket_prices.append(ticket_price)
            user_ticket_prices.append(user_ticket_price)
            event_dates.append(row["Date"])
            print(artist, ticket_price, user_ticket_price)


Billie Eillish no user ticket price
Billie Eillish 315.0 0
Dua Lipa no user ticket price
Dua Lipa 166.0 0
Myles Smith no user ticket price
Myles Smith 76.0 0
Allan Walker 122.0 C$123
Trivecta 72.0 C$72
Two Friends no user ticket price
Two Friends 85.0 0
Cloone no user ticket price
Cloone 83.0 0
Lilly Palmer 93.0 C$93
Markus Schulz 76.0 C$76
Layz 67.0 C$67
Virtual Riot no user ticket price
Virtual Riot 58.0 0
Chelsea Cutter and Jeremy Zucker no user ticket price
Chelsea Cutter and Jeremy Zucker 61.0 0
Black Tiger Sex Machine no user ticket price
Black Tiger Sex Machine 42.0 0
MK 75.0 C$75
Atliens 42.0 C$54
Ship Wrek 57.0 C$57
Sullivan King 73.0 C$74
Kiss of Life no user ticket price
Kiss of Life 203.0 0
Tinashe 41.0 C$68
Maddix 90.0 C$90
Vini Vici no user ticket price
Vini Vici 297.0 0
Polo G no user ticket price
Polo G 58.0 0
Pitbull no user ticket price
Pitbull 221.0 0
Kaivon 40.0 C$49
Lavern 58.0 C$58
Timmy Trumpet 73.0 C$73


KeyboardInterrupt: 
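
The row filter at the top of the loop decides which artists are scraped: events with a missing or future date, one lookup per artist. A minimal sketch of that rule in plain Python follows. `should_process` is a hypothetical helper, the artist names come from the notebook, and the dates are made up for illustration. Missing dates are represented by `None` here; with pandas data, `pd.isna` would take the place of the `is None` check.

```python
from datetime import datetime


def should_process(artist, event_date, seen_artists, now):
    """Return True for unseen artists whose event date is missing or in the future."""
    if artist in seen_artists:
        return False
    return event_date is None or event_date > now


now = datetime(2024, 10, 1)  # fixed "today" so the example is reproducible
rows = [
    ("Dua Lipa", datetime(2024, 12, 5)),  # upcoming: scrape
    ("Dua Lipa", datetime(2024, 12, 6)),  # artist already seen: skip
    ("Umek", datetime(2024, 9, 1)),       # event already past: skip
    ("Tinashe", None),                    # date unknown: scrape to fill it in
]

seen = set()
selected = []
for artist, event_date in rows:
    if should_process(artist, event_date, seen, now):
        seen.add(artist)
        selected.append(artist)

print(selected)
```

One difference from the notebook loop: there, an artist is added to `processed_artists` only after a successful scrape, so a failed lookup is retried on a later row for the same artist.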

In [8]:
# Define a function to combine the result lists into a DataFrame
def create_stubhub_dataframe(remain, dates, prices, my_prices):
    """
    Concatenates the lists into a DataFrame and sorts it.
    
    Args:
        remain (list): List of artist names.
        dates (list): List of event dates.
        prices (list): List of ticket prices from StubHub.
        my_prices (list): List of ticket prices for the current user.
    
    Returns:
        pd.DataFrame: DataFrame with columns Artist, Dates, Stubhub and Me, sorted by the 'Me' column in descending order.
    """
    # Concatenate the lists into a DataFrame
    df = pd.concat([pd.Series(remain), pd.Series(dates), pd.Series(prices), pd.Series(my_prices)], axis=1)
    
    # Set the column names for the DataFrame
    df.columns = ["Artist", "Dates", "Stubhub", "Me"]
    
    # Sort the DataFrame by the 'Me' column in descending order
    df = df.sort_values(by="Me", ascending=False)
    
    return df

# Create the DataFrame using the function
stubhub = create_stubhub_dataframe(processed_artists, event_dates, ticket_prices, user_ticket_prices)

# Write both DataFrames back to the workbook, replacing each sheet if it already exists
with pd.ExcelWriter(path, mode='a', engine="openpyxl", if_sheet_exists="replace") as writer:
    stubhub.to_excel(writer, sheet_name="stubhub", header=True)
    sales.to_excel(writer, sheet_name="Sheet1", header=True)
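
The printed scraping output shows user prices as strings such as `C$123` alongside the integer `0`. If mixed values like these reach the `Me` column, sorting and the later price comparison are likely to fail or misbehave. The sketch below shows one way to coerce such a column to numbers with `pd.to_numeric`. The three rows are copied from the printed output, and the cleaning step is an assumption about how the strings are formatted.

```python
import pandas as pd

# Rows copied from the printed scraping output: floats, currency strings and 0 mixed together.
rows = pd.DataFrame({
    "Artist": ["Allan Walker", "Tinashe", "Two Friends"],
    "Stubhub": [122.0, 41.0, 85.0],
    "Me": ["C$123", "C$68", 0],
})

# Keep only digits and dots, then convert; anything unparseable becomes NaN.
cleaned = rows["Me"].astype(str).str.replace(r"[^\d.]", "", regex=True)
rows["Me"] = pd.to_numeric(cleaned, errors="coerce")

# With a numeric column, sorting and comparisons behave as expected.
rows = rows.sort_values(by="Me", ascending=False)
higher = rows[rows["Me"] > rows["Stubhub"]]
print(higher)
```

Using `errors="coerce"` turns unexpected values into `NaN` instead of raising, which makes them easy to spot with `isna()` afterwards.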

In [9]:
# Filter the DataFrame to find shows where 'Me' price is higher than 'Stubhub' price
higher_price_shows = stubhub[stubhub["Me"] > stubhub["Stubhub"]]

# Display the filtered DataFrame
print(higher_price_shows)

          Artist      Dates  Stubhub     Me
2   Allan Walker 2025-02-07    122.0  123.0
13       Tinashe 2024-11-11     56.0   68.0
21          Umek 2024-10-25     27.0   67.0
9        Atliens 2024-11-16     48.0   54.0
