# This script contains:

1. Import Libraries and Data
2. Wrangling & Cleaning
3. Create choropleth of attendance 2015-2024
4. Create choropleth of pre-covid attendance
5. Create choropleth of post-covid attendance
6. Analysis

### Import Libraries & Data

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import os
import json
import folium

In [5]:
# Path to the GeoJSON file with US state boundaries
geo = "C:\\Users\\cschw\\OneDrive\\Desktop\\MLB Project\\Achievement 6\\6.3\\us-states.json"

In [6]:
geo

'C:\\Users\\cschw\\OneDrive\\Desktop\\MLB Project\\Achievement 6\\6.3\\us-states.json'

In [7]:
# Define Path
path = r"C:\Users\cschw\OneDrive\Desktop\MLB Project\Data Sets\Other Data"

In [8]:
# Load attendance by state csv
att_df = pd.read_csv(os.path.join(path, "Attendance by state 2015-2024.csv"))

In [9]:
att_df.head()

Unnamed: 0,State,Attendance/Game
0,Arizona,24039.66667
1,California,32178.77778
2,Colorado,32533.55556
3,District of Columbia,27137.44444
4,Florida,14629.55556


In [16]:
# Load pre-covid attendance csv
pre_df = pd.read_csv(os.path.join(path, "pre covid attendance by state.csv"))

In [18]:
pre_df.head()

Unnamed: 0,State,Attendance/Game
0,Arizona,26244.2
1,California,34342.48
2,Colorado,34746.0
3,District of Columbia,30657.4
4,Florida,15885.2


In [20]:
# Load post-covid attendance csv
post_df = pd.read_csv(os.path.join(path, "post covid attendance by state.csv"))

In [22]:
post_df.head()

Unnamed: 0,State,Attendance/Game
0,Arizona,21284.0
1,California,29474.15
2,Colorado,29768.0
3,District of Columbia,22737.5
4,Florida,13060.0


### Wrangling & Cleaning

Because the data needed for these maps is simple, and for efficiency's sake, I created pivot tables in Excel to aggregate the data for each state. The only additional cleaning step was to remove the year 2020 from the 2015-2024 attendance file. This was important because attendance was 0 that year due to the pandemic, which would distort both the long-term attendance trend and the comparison of attendance before and after the pandemic. I then created the CSV files loaded above so that I could build separate maps to show change over time. Finally, I renamed the State column to 'STATE_NAME' to match the JSON file.
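For reference, the Excel aggregation described above could also be reproduced in pandas. The following is only a rough sketch: the team-level input table, its column names (`State`, `Year`, `Attendance/Game`) and its values are hypothetical, and the split into pre-covid (2015-2019) and post-covid (2021-2024) seasons is an assumption about how the three CSV files were built.

```python
import pandas as pd

# Hypothetical season-level input; column names and values are made up for illustration.
raw = pd.DataFrame({
    "State": ["Arizona", "Arizona", "Arizona", "Florida", "Florida", "Florida"],
    "Year": [2019, 2020, 2021, 2019, 2020, 2021],
    "Attendance/Game": [26000.0, 0.0, 13000.0, 12000.0, 0.0, 9000.0],
})


def mean_by_state(df, years):
    """Average attendance per game by state, restricted to the given seasons."""
    subset = df[df["Year"].isin(list(years))]
    return subset.groupby("State", as_index=False)["Attendance/Game"].mean()


# Assumed season groupings: 2020 is dropped everywhere because attendance was 0.
all_years = [y for y in range(2015, 2025) if y != 2020]
att_sketch = mean_by_state(raw, all_years)
pre_sketch = mean_by_state(raw, range(2015, 2020))
post_sketch = mean_by_state(raw, range(2021, 2025))

print(att_sketch)
```

Keeping the year filter in one helper makes it harder to leave 2020 in one of the three tables by accident.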

In [26]:
# Change State column for att_df
att_df.rename(columns={'State': 'STATE_NAME'}, inplace=True)

In [28]:
att_df.head()

Unnamed: 0,STATE_NAME,Attendance/Game
0,Arizona,24039.66667
1,California,32178.77778
2,Colorado,32533.55556
3,District of Columbia,27137.44444
4,Florida,14629.55556


In [30]:
# change State column for pre_df
pre_df.rename(columns={'State': 'STATE_NAME'}, inplace=True)

In [32]:
pre_df.head()

Unnamed: 0,STATE_NAME,Attendance/Game
0,Arizona,26244.2
1,California,34342.48
2,Colorado,34746.0
3,District of Columbia,30657.4
4,Florida,15885.2


In [34]:
#change the state column for post_df
post_df.rename(columns={'State': 'STATE_NAME'}, inplace=True)

In [36]:
post_df.head()

Unnamed: 0,STATE_NAME,Attendance/Game
0,Arizona,21284.0
1,California,29474.15
2,Colorado,29768.0
3,District of Columbia,22737.5
4,Florida,13060.0
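Before mapping, it can help to confirm that every state name in the data has a matching feature in the GeoJSON, since unmatched names are not drawn with a value. The check below is a minimal sketch: `geo_sketch` is a tiny hypothetical stand-in for us-states.json (it deliberately omits one name to show the effect), `df_sketch` stands in for the attendance tables, and `unmatched_states` is an illustrative helper, not part of folium or pandas.

```python
import pandas as pd

# Hypothetical stand-in for us-states.json; only the "name" property matters here.
geo_sketch = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "Arizona"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "California"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "Colorado"}, "geometry": None},
    ],
}

# Hypothetical stand-in for att_df, pre_df or post_df.
df_sketch = pd.DataFrame(
    {"STATE_NAME": ["Arizona", "California", "District of Columbia"]}
)


def unmatched_states(df, geojson, column="STATE_NAME", prop="name"):
    """Return the names in df[column] that have no feature with a matching property."""
    geo_names = {feature["properties"][prop] for feature in geojson["features"]}
    return sorted(set(df[column]) - geo_names)


missing = unmatched_states(df_sketch, geo_sketch)
print(missing)
```

With the real file, the same helper could be called after loading it with `json.load`, passing each of the three DataFrames in turn.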


### Create choropleth map of Attendance per Game 2015-2024

In [39]:
# Use a descriptive name instead of `map`, which shadows the Python builtin
att_map = folium.Map(location=[40, -95], zoom_start=4)

folium.Choropleth(
    geo_data=geo,
    data=att_df,
    columns=['STATE_NAME', 'Attendance/Game'],  # key column, then value column
    key_on='feature.properties.name',  # GeoJSON property matched against STATE_NAME
    fill_color='Blues',  # sequential blue color scheme
    fill_opacity=0.7,
    line_opacity=0.2,
    nan_fill_color='lightgrey',  # states with no data appear light grey
    nan_fill_opacity=0.4,
    legend_name="Attendance/Game"
).add_to(att_map)

folium.LayerControl().add_to(att_map)

att_map

### Create choropleth map of Pre-Covid Attendance

In [42]:
# Use a descriptive name instead of `map`, which shadows the Python builtin
pre_map = folium.Map(location=[40, -95], zoom_start=4)

folium.Choropleth(
    geo_data=geo,
    data=pre_df,
    columns=['STATE_NAME', 'Attendance/Game'],  # key column, then value column
    key_on='feature.properties.name',  # GeoJSON property matched against STATE_NAME
    fill_color='Blues',  # sequential blue color scheme
    fill_opacity=0.7,
    line_opacity=0.2,
    nan_fill_color='lightgrey',  # states with no data appear light grey
    nan_fill_opacity=0.4,
    legend_name="Attendance/Game"
).add_to(pre_map)

folium.LayerControl().add_to(pre_map)

pre_map

### Create choropleth map of Post-Covid Attendance

In [45]:
# Use a descriptive name instead of `map`, which shadows the Python builtin
post_map = folium.Map(location=[40, -95], zoom_start=4)

folium.Choropleth(
    geo_data=geo,
    data=post_df,
    columns=['STATE_NAME', 'Attendance/Game'],  # key column, then value column
    key_on='feature.properties.name',  # GeoJSON property matched against STATE_NAME
    fill_color='Blues',  # sequential blue color scheme
    fill_opacity=0.7,
    line_opacity=0.2,
    nan_fill_color='lightgrey',  # states with no data appear light grey
    nan_fill_opacity=0.4,
    legend_name="Attendance/Game"
).add_to(post_map)

folium.LayerControl().add_to(post_map)

post_map
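The pre- and post-covid maps can also be compared numerically by computing the percent change per state. The sketch below uses only the five rows printed by `pre_df.head()` and `post_df.head()` earlier in this notebook; the names `pre_sample`, `post_sample` and `change` are introduced here for illustration, and the same merge should work on the full DataFrames assuming they share the `STATE_NAME` column.

```python
import pandas as pd

states = ["Arizona", "California", "Colorado", "District of Columbia", "Florida"]

# First five rows of the pre- and post-covid tables shown earlier in the notebook.
pre_sample = pd.DataFrame({
    "STATE_NAME": states,
    "Attendance/Game": [26244.2, 34342.48, 34746.0, 30657.4, 15885.2],
})
post_sample = pd.DataFrame({
    "STATE_NAME": states,
    "Attendance/Game": [21284.0, 29474.15, 29768.0, 22737.5, 13060.0],
})

# Join on state, then express the post-covid figure relative to the pre-covid one.
change = pre_sample.merge(post_sample, on="STATE_NAME", suffixes=("_pre", "_post"))
change["pct_change"] = (
    (change["Attendance/Game_post"] - change["Attendance/Game_pre"])
    / change["Attendance/Game_pre"]
    * 100
)

print(change[["STATE_NAME", "pct_change"]].round(1))
```

A column like `pct_change` could itself be passed to `folium.Choropleth` to map the change directly, ideally with a diverging color scheme so that increases and decreases are distinguishable.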

### Analysis

The teams with the highest attendance are spread across the country. There doesn't seem to be a strong geographic link to attendance, in the sense that there are teams with high attendance in most parts of the country. However, one commonality among states with high attendance is a large population. California, Texas and New York all have high attendance averages. Interestingly, these states also all have multiple MLB teams. There are some exceptions: Georgia and Colorado also have high attendance averages despite having relatively smaller cities.

The maps once again visualize the decline in attendance after the pandemic. Georgia presents one of the most interesting cases, because its attendance has actually risen in the post-covid world. This is likely attributable to the success of the Atlanta Braves in the last 5 years.

A new question that I have from this analysis is how attendance figures vary when taking into account the capacity of the stadium. Creating a new variable that shows attendance as a percentage of capacity could reveal different numbers or trends. This analysis is currently likely biased against teams with smaller stadiums: they may fill a large share of their seats, but their raw attendance still looks low next to teams with larger stadiums.
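The capacity-adjusted variable proposed above could be computed along these lines. This is only a sketch: the team names, attendance figures and capacities are placeholders rather than real data, and the column names `Capacity` and `Pct of Capacity` are hypothetical.

```python
import pandas as pd

# Placeholder figures, not real data; a capacity column would need to be sourced separately.
stadiums = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C"],
    "Attendance/Game": [30000.0, 20000.0, 36000.0],
    "Capacity": [50000, 25000, 40000],
})

# Share of seats filled on an average game day, in percent.
stadiums["Pct of Capacity"] = stadiums["Attendance/Game"] / stadiums["Capacity"] * 100

ranked = stadiums.sort_values("Pct of Capacity", ascending=False)
print(ranked[["Team", "Attendance/Game", "Pct of Capacity"]])
```

In this made-up example, Team A draws more fans per game than Team B but fills a smaller share of its stadium, which is the kind of reversal the raw averages cannot show.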