### Ticketmaster Tour Data: Can we predict revenue & venue capacity utilization?

**Context**

Imagine you are an independent touring musician armed with the latest Ticketmaster tour sales data from touringdata.org, and you want to plan your next show or tour. Using this data, can we make informed decisions about where to play in order to meet revenue requirements, or at least to play to rooms full of fans?

**Overview**

The goal of this project is to use data visualizations and probability distributions to distinguish between events that maximized revenue and/or venue capacity utilization and those that did not.

**Data**

touringdata.org provides access to this Ticketmaster touring data through year-specific master documents, from 2024 onward, via its Patreon page: https://www.patreon.com/c/touringdata (links to an external data-hosting site).

**Deliverables**

This notebook will deliver a brief report that highlights the differences between events that maximized revenue and/or venue capacity utilization and those that did not. Additionally, an appropriate predictive model will be proposed.





### Data Description

The attributes of this data set include:
- Event Start Date
- Event End Date
- Number of Shows
- Artist Name
- Revenue (USD)
- Tickets Sold
- Estimated Capacity
- Capacity Utilization (%)
- Venue
- City
- Country
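
The raw CSV headers are shorter than the descriptive names listed above (for example `Event Date.1`, `Shows`, and `%`). A minimal sketch of one way to rename them follows. The mapping is an assumption inferred from column order, not something the data provider documents, and `COLUMN_NAMES` and `sample` are hypothetical names used only for illustration.

```python
import pandas as pd

# Assumed mapping from raw CSV headers to the descriptive names in the list above.
# Inferred from column order; not confirmed by the data provider.
COLUMN_NAMES = {
    "Event Date": "Event Start Date",
    "Event Date.1": "Event End Date",
    "Shows": "Number of Shows",
    "Artist": "Artist Name",
    "Capacity": "Estimated Capacity",
    "%": "Capacity Utilization (%)",
}

# One-row sample copied from the first record of the dataset so the sketch runs on its own.
sample = pd.DataFrame([{
    "Event Date": "11/13/2025", "Event Date.1": "11/14/2025", "Shows": 2,
    "Artist": "Lady Gaga", "Revenue (USD)": 5297329, "Tickets Sold": 26324,
    "Capacity": 26324, "%": "100%", "Venue": "LDLC Arena",
    "City": "Lyon", "Country": "France",
}])

# rename() leaves columns that are not in the mapping untouched.
renamed = sample.rename(columns=COLUMN_NAMES)
print(list(renamed.columns))
```

The same call could be applied to the full DataFrame once it is loaded. The rest of this notebook keeps the raw header names.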

In [1]:
try:  # Mount Google Drive when running in Colab
    from google.colab import drive
    drive.mount('/content/drive')
    prefix = '/content/drive/MyDrive/Colab Notebooks/tour-data-eda/'
except ImportError:  # Running locally: use paths relative to the notebook
    prefix = ''

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `2024_2025_MasterDocument.csv` file.




In [3]:
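import pandas as pd

# Path to the dataset; prepend the Drive prefix when it is defined (Colab).
path = 'data/2024_2025_MasterDocument.csv'
try:
    path = prefix + path
except NameError:
    pass

df = pd.read_csv(path)

# Output directory for figures, resolved the same way.
image_path = 'images/'
try:
    image_path = prefix + image_path
except NameError:
    pass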

2. Investigate the dataset for missing or problematic data.

In [4]:
df.head()

|   | Event Date | Event Date.1 | Shows | Artist | Revenue (USD) | Tickets Sold | Capacity | % | Venue | City | Country | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11/13/2025 | 11/14/2025 | 2 | Lady Gaga | 5297329 | 26324 | 26324 | 100% | LDLC Arena | Lyon | France | NaN | NaN | NaN |
| 1 | 11/12/2025 | 11/12/2025 | 1 | Corona Capital Session | 1116932 | 9668 | 13356 | 72.39% | Estadio Banorte | Monterrey | Mexico | NaN | NaN | NaN |
| 2 | 11/12/2025 | 11/12/2025 | 1 | Zoé | 3721003 | 57332 | 57332 | 100% | Estadio GNP Seguros | Mexico City | Mexico | NaN | NaN | NaN |
| 3 | 11/11/2025 | 11/11/2025 | 1 | Katy Perry | 2412047 | 15399 | 15617 | 98.60% | Movistar Arena | Madrid | Spain | NaN | NaN | NaN |
| 4 | 11/10/2025 | 11/10/2025 | 1 | Ricky Martin | 1623385 | 11157 | 11157 | 100% | Qudos Bank Arena | Sydney | Australia | NaN | NaN | NaN |


In [5]:
df.shape

# Note: the data has 7776 rows and 14 columns

(7776, 14)

In [6]:
df.info()

# Note: City has 7774 non-null rows (2 missing); the three Unnamed columns are entirely null

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7776 entries, 0 to 7775
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Event Date     7776 non-null   object 
 1   Event Date.1   7776 non-null   object 
 2   Shows          7776 non-null   int64  
 3   Artist         7776 non-null   object 
 4   Revenue (USD)  7776 non-null   int64  
 5   Tickets Sold   7776 non-null   int64  
 6   Capacity       7776 non-null   int64  
 7   %              7776 non-null   object 
 8   Venue          7776 non-null   object 
 9   City           7774 non-null   object 
 10  Country        7776 non-null   object 
 11  Unnamed: 11    0 non-null      float64
 12  Unnamed: 12    0 non-null      float64
 13  Unnamed: 13    0 non-null      float64
dtypes: float64(3), int64(4), object(7)
memory usage: 850.6+ KB
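
The `df.info()` output above shows that both date columns and the `%` column are stored as `object` (strings). A minimal sketch of converting them to proper dtypes follows. It assumes the dates are in month/day/year order (consistent with values such as `11/13/2025`) and that every `%` value ends in a percent sign. The `sample` frame is a hypothetical stand-in built from values shown in `df.head()`.

```python
import pandas as pd

# Small stand-in for the loaded DataFrame, using values from df.head().
sample = pd.DataFrame({
    "Event Date": ["11/13/2025", "11/12/2025", "11/11/2025"],
    "Event Date.1": ["11/14/2025", "11/12/2025", "11/11/2025"],
    "%": ["100%", "72.39%", "98.60%"],
})

typed = sample.copy()

# Assumption: dates are month/day/year.
for col in ["Event Date", "Event Date.1"]:
    typed[col] = pd.to_datetime(typed[col], format="%m/%d/%Y")

# Strip the trailing percent sign and convert to a float on a 0-100 scale.
typed["%"] = typed["%"].str.rstrip("%").astype(float)

print(typed.dtypes)
```

With numeric and datetime columns in place, sorting by date and plotting utilization no longer depend on string comparisons.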


In [11]:
# Investigation:
# Look for unique column values to try to spot duplications, notable
# characteristics, or nonsensical values that can be cleaned (either dropped
# or substituted)

for col in df.columns:
    if col not in ['Event Date', 'Event Date.1']:  # Skip the two date columns
        print(f"{col}: {df[col].unique()}")

# Findings and observations:


Shows: [ 2  1  3  6  4  9 10  5 13  7  8 16 25 32 22 12 14 31 26 34 15 29]
Artist: ['Lady Gaga' 'Corona Capital Session' 'Zoé' ... 'MajestuOsos'
 'Beetlejuice - The Musical' 'Motion City Soundtrack']
Revenue (USD): [5297329 1116932 3721003 ... 2519652   51668  131857]
Tickets Sold: [26324  9668 57332 ...  2671   723 35817]
Capacity: [26324 13356 57332 ... 11776 18200 59310]
%: ['100%' '72.39%' '98.60%' ... '60.39%' '33.75%' '82.48%']
Venue: ['LDLC Arena' 'Estadio Banorte' 'Estadio GNP Seguros' ... 'Fenix Beach'
 'Hanover Theatre' 'Matakana Country Park']
City: ['Lyon' 'Monterrey' 'Mexico City' 'Madrid' 'Sydney' 'Perth' 'New York'
 'Evansville' 'Adeje' 'Morrison' 'San Francisco' 'Newark' 'Adelaide'
 'Bogotá' 'Berlin' 'Rosemont' 'Boise' 'Cleveland' 'Melbourne' 'Detroit'
 'Billings' 'Vienna' 'Atlanta' 'Manchester' 'Washington' 'Bismarck'
 'Bowling Green' 'Memphis' 'Boston' 'Santiago' 'Phoenix' 'Grand Forks'
 'Brisbane' 'Macau' 'Kansas City' 'Parramatta' 'Nottingham' 'Charlotte'
 'Chicago'
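
One way to look for nonsensical values is to check whether the reported `%` column agrees with `Tickets Sold` divided by `Capacity`. The sketch below runs that check on the five rows shown in `df.head()`. The two-decimal rounding and the `0.02` tolerance are assumptions, and `sample`, `reported`, `computed`, and `mismatch` are hypothetical names.

```python
import pandas as pd

# The five rows shown in df.head(), reduced to the columns needed for the check.
sample = pd.DataFrame({
    "Tickets Sold": [26324, 9668, 57332, 15399, 11157],
    "Capacity": [26324, 13356, 57332, 15617, 11157],
    "%": ["100%", "72.39%", "100%", "98.60%", "100%"],
})

# Reported utilization as a float on a 0-100 scale.
reported = sample["%"].str.rstrip("%").astype(float)

# Utilization recomputed from the raw counts, rounded to two decimals.
computed = (100 * sample["Tickets Sold"] / sample["Capacity"]).round(2)

# Rows where the two disagree by more than the assumed tolerance.
mismatch = sample[(reported - computed).abs() > 0.02]
print(mismatch)
```

On these five rows the result is empty. Any rows flagged in the full dataset would be candidates for closer inspection before modeling.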

In [13]:
# Count nulls once, then show only the columns that have any.
null_counts = df.isnull().sum()
null_counts[null_counts > 0]


City              2
Unnamed: 11    7776
Unnamed: 12    7776
Unnamed: 13    7776
dtype: int64
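
Given these counts, one possible cleanup is to drop the columns that are entirely null and to inspect the rows with a missing `City` before deciding how to handle them. The following is a sketch under that assumption, not necessarily the cleaning used later in this notebook. The second row of `sample` is invented purely to illustrate a missing city.

```python
import numpy as np
import pandas as pd

# Stand-in frame: the first row comes from df.head(); the second row is
# hypothetical and exists only to demonstrate a missing City value.
sample = pd.DataFrame({
    "Venue": ["LDLC Arena", "Hypothetical Venue"],
    "City": ["Lyon", np.nan],
    "Country": ["France", "France"],
    "Unnamed: 11": [np.nan, np.nan],
})

# Drop columns in which every value is null (the Unnamed columns).
cleaned = sample.dropna(axis="columns", how="all")

# Pull out rows with a missing City so they can be reviewed by hand.
missing_city = cleaned[cleaned["City"].isna()]

print(list(cleaned.columns))
print(missing_city)
```

With only two affected rows in the real data, either filling the city in manually from the venue name or dropping the rows would be reasonable.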

In [14]:
image_path = 'images/'
try:
    image_path = prefix + image_path
except NameError:
    pass