# Do locals run faster?

## A comparison of in-town, in-state, and out-of-state runners for two events at the Probility Ann Arbor 2019 race.

### Background

Running a marathon (26.2 miles) or half marathon (13.1 miles) is an endurance event that requires training and preparation. Runners often compete in an official race, which could take place in their home city or at a travel destination. When I ran a half marathon in a new town last year, I trained for six months in advance, but I was still unprepared for the climate and race course (not to mention tired from traveling). I wondered: do runners have a home-field advantage when running in their own home city or state?

### Methods

I used two datasets from one race to answer this question: recorded run times for the [half marathon](https://runsignup.com/Race/Results/29595/#resultSetId-147305) and [marathon](https://runsignup.com/Race/Results/29595/#resultSetId-147304) events at the Probility Ann Arbor 2019 race. I copied the data tables from the website and pasted them into sheets in Microsoft Excel.

To answer my question, I plotted run times for half marathon and marathon events in three categories of **local status**:
1. In Town: runners with home addresses in Ann Arbor, Michigan (blue)
2. In State: runners with home addresses in the state of Michigan, but in towns other than Ann Arbor (green)
3. Out of State: runners with home addresses in states other than Michigan (gray)

In [None]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# load and clean data
def clean_data(race_event):
    
    # load data
    df = pd.read_excel('race_data.xlsx', sheet_name=race_event, header=0, skiprows=[1], usecols=['City', 'State', 'Chip'])

    # add event column
    df['Event'] = race_event
    
    # remove records without city, state, or chip time
    df.dropna(inplace=True)

    # add column for local status
    # runner from Ann Arbor, Michigan = 2 = in-town
    # runner from Michigan = 1 = in-state
    # runner from out of state = 0 = out-of-state
    in_city = (df['City'] == 'Ann Arbor').astype('int')
    in_state = (df['State'] == 'MI').astype('int')
    local = (in_city + in_state)
    df['Local'] = local.replace([0,1,2], ['Out of State', 'In State', 'In Town'])
    
    # convert chip times to the hours it took to complete the race
    # (mapping over the column keeps each value on its own row after dropna)
    hourify = lambda x : x.hour + (x.minute/60) + (x.second/3600)
    df['Time'] = df['Chip'].map(hourify)
    
    # drop city, state, and chip columns
    df.drop(columns=['City', 'State', 'Chip'], inplace=True)
    
    # remove records without event, local status, or time in hours
    df.dropna(inplace=True)
    
    # return data
    return df
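
The time conversion above can be sketched on a few made-up chip times. This is a minimal illustration, not the race data: `chips`, `to_hours`, and `frame` are hypothetical names. It shows that mapping over the column keeps each converted value on its original row once rows have been dropped, whereas rebuilding a Series from a plain list restarts the index at 0 and misaligns on assignment.

```python
import datetime as dt
import pandas as pd

# hypothetical chip times; dropping the missing one leaves a gap in the index (0, 2)
chips = pd.Series([dt.time(1, 30, 0), None, dt.time(2, 15, 0)]).dropna()

def to_hours(t):
    # convert a time-of-day value to fractional hours
    return t.hour + t.minute / 60 + t.second / 3600

# mapping over the Series keeps the original index (0, 2)
aligned = chips.map(to_hours)
# rebuilding from a list creates a fresh index (0, 1)
rebuilt = pd.Series(list(map(to_hours, chips)))

frame = pd.DataFrame({'Chip': chips})
frame['aligned'] = aligned   # every row receives its own value
frame['rebuilt'] = rebuilt   # row 2 has no match in the fresh index and becomes NaN
print(frame)
```

A later `dropna` would silently discard the mismatched rows, so the loss would not show up as an error.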

In [None]:
# load marathon data
df_26mi = clean_data('Marathon')
# load half marathon data
df_13mi = clean_data('HalfMarathon')

### Results

In [None]:
# set color palette
colors = ['dodgerblue', 'seagreen', 'dimgrey']
sns.set_palette(colors)
# set tick mark style
sns.set_style({'xtick.bottom':False, 'ytick.left':False})

In [None]:
# create figure
plt.figure(figsize=(7, 7))

# designate subplot with 2 rows, 1 column; current axis is 1st axis
plt.subplot(2, 1, 1)

# create swarm plot
ax1 = sns.swarmplot(x='Local', y='Time', data=df_13mi, order=['In Town', 'In State', 'Out of State'], size=3)
# add y-axis label and remove x-axis label
ax1.set(ylabel='Time (hours)', xlabel='', xticklabels=[])
# add title
ax1.title.set_text('Half Marathon')
# remove spines
sns.despine()

# designate subplot with 2 rows, 1 column; current axis is 2nd axis
plt.subplot(2, 1, 2)

# create swarm plot
ax2 = sns.swarmplot(x='Local', y='Time', data=df_26mi, order=['In Town', 'In State', 'Out of State'], size=4)
# add y-axis label and remove x-axis label
ax2.set(ylabel='Time (hours)', xlabel='')
# add title
ax2.title.set_text('Marathon')
# remove spines
sns.despine()

# save figure
plt.savefig('do_locals_run_faster.png', dpi=100, facecolor= 'white', edgecolor='none')

My visualization shows the distribution of run times across these categories of local status. For the half marathon event (top panel), runners from all categories have similar run times: runners from Ann Arbor (blue), elsewhere in Michigan (green), and out of state (gray) finish the race in about 2 hours on average. For the marathon event (bottom panel), runners from Ann Arbor (blue) finish in about 4 hours on average, which is slightly faster than runners from elsewhere in Michigan (green) or out of state (gray), who finish in about 4.5 hours on average.
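
Averages read off a swarm plot are approximate, so one way to check them is to tabulate each category directly. The sketch below uses a small made-up table, not the race results; `runs`, `order`, and `summary` are hypothetical names, and the same `groupby` call could be applied to `df_13mi` or `df_26mi`.

```python
import pandas as pd

# hypothetical run times in hours, standing in for the cleaned race data
runs = pd.DataFrame({
    'Local': ['In Town', 'In Town', 'In State', 'In State', 'Out of State'],
    'Time': [3.5, 4.5, 4.0, 5.0, 4.5],
})

# same category order as the plots
order = ['In Town', 'In State', 'Out of State']

# sample size, mean, and median run time per category
summary = (
    runs.groupby('Local')['Time']
        .agg(['count', 'mean', 'median'])
        .reindex(order)
)
print(summary.round(2))
```

Reporting the count alongside the mean and median makes the differing sample sizes visible, which matters when one category has far fewer runners than the others.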

### Conclusion

**Do local runners run faster than non-local runners?** 

In the half marathon event, there is no apparent difference in run times between runners from Ann Arbor, elsewhere in Michigan, and out of state. In the marathon event, runners from Ann Arbor appear to run slightly faster than runners from other towns in Michigan and from other states. These data represent runners in the Probility Ann Arbor 2019 race only, and the results may not generalize to other races.

### Design Principles

I incorporated principles from Edward Tufte and Alberto Cairo in designing this graphic. Following Tufte's principle of a high data-ink ratio, I minimized the use of color and removed redundant x-axis labels, tick marks, and plot borders. Following Cairo's principle of functionality, I considered a boxplot, a violin plot, and a swarm plot for showing the distribution of quantitative data across three categories. Following Cairo's principle of truthfulness, I chose the swarm plot, which, unlike a violin plot or boxplot, depicts both the individual run times and the differing sample sizes in each category.