<h1>Analyzing Professor Brooks' Mysterious Strava CSV</h1>
<p>I am going to take you through a visual exploratory data analysis. We will read in some data and use mostly print statements and charts to explore the data set and draw some meaning from it. Throughout the notebook: (1) I use cell division to make the steps easy to follow; (2) I tell a story along the way to lead you through my thought process, adding some anecdotal levity for engagement; (3) I used version control throughout this process, but it is in a private repository, so instead (4) I share this as an interactive notebook and host the data; and (5) I have designed this notebook to be read, run, and explored.</p>

In [1]:
# First step importing libraries
import pandas as pd
import plotly.express as px
import datetime
import numpy as np

In [19]:
# import the data and parse the timestamp column as datetimes
# (parse_dates=True would only try to parse the index, and infer_datetime_format is deprecated in pandas 2.x)
strava = pd.read_csv('assets/strava.csv', parse_dates=['timestamp'])

<p>We don't know anything about this data other than that it exists, so let's print a report of the missing values in each column.</p>

In [3]:
# let's inspect null values
for col in strava.columns:
    print(f'{col} has {strava[col].isnull().sum()} null values')

Air Power has 22807 null values
Cadence has 22802 null values
Form Power has 22807 null values
Ground Time has 22802 null values
Leg Spring Stiffness has 22807 null values
Power has 22802 null values
Vertical Oscillation has 22802 null values
altitude has 25744 null values
cadence has 22 null values
datafile has 0 null values
distance has 0 null values
enhanced_altitude has 51 null values
enhanced_speed has 10 null values
fractional_cadence has 22 null values
heart_rate has 2294 null values
position_lat has 192 null values
position_long has 192 null values
speed has 25721 null values
timestamp has 0 null values
unknown_87 has 22 null values
unknown_88 has 2294 null values
unknown_90 has 22031 null values
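
<p>The same report can be put into a small table that also shows the share of rows missing, which makes the two groups of columns (mostly empty versus mostly complete) easier to see. This is a minimal sketch on a made-up toy frame; the name <code>null_report</code> and the values are assumptions for illustration, not part of the original notebook.</p>

```python
import numpy as np
import pandas as pd

# Toy stand-in for the strava frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Power": [np.nan, np.nan, 250.0, np.nan],
    "heart_rate": [120.0, np.nan, 130.0, 135.0],
    "distance": [0.0, 5.0, 10.0, 15.0],
})

# Count and share of missing values per column, most-missing first.
null_report = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_share": toy.isnull().mean(),
}).sort_values("null_count", ascending=False)

print(null_report)
```

<p>On the real data, replacing <code>toy</code> with <code>strava</code> should give one row per column.</p>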


In [None]:
# looks like we have a timestamp column, so let's make sure it is in datetime format
strava['timestamp'] = pd.to_datetime(strava.timestamp) # convert timestamp to datetime

In [26]:
# let's define a list of datafiles
events = list(strava.datafile.unique())

# Let's calculate elapsed time and elevation change per activity.
# Grouped, vectorized operations replace the row-by-row chained assignment
# (strava[col].iloc[i] = ...), which triggered pandas' SettingWithCopyWarning.
by_event = strava.groupby('datafile')
strava['elapsed_seconds'] = (strava['timestamp'] - by_event['timestamp'].transform('min')).dt.total_seconds()
# diff() leaves the first data point of each activity as NaN, so no change is calculated there
strava['elevation_change'] = by_event['enhanced_altitude'].diff()



In [27]:
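# aggregate per datafile; string function names avoid the pandas 2.x warning about passing builtins
strava_grouped = strava.groupby('datafile').agg({
    'elapsed_seconds': ['max'],
    'Power': ['min', 'max', 'mean'],
    'enhanced_speed': ['min', 'max', 'mean'],
    'distance': ['max'],
    'heart_rate': ['min', 'max', 'mean'],
    'cadence': ['min', 'max', 'mean'],
    'elevation_change': ['sum'],
}).reset_index()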

# flatten column hierarchy
strava_grouped.columns = strava_grouped.columns.to_flat_index().str.join('_')
strava_grouped.sort_values(by='distance_max', ascending=False, inplace=True)

# rename column so that names match
strava_grouped.rename(columns={'datafile_':'datafile'}, inplace=True)

# need to convert to a string for potential join/concat operations
strava_grouped['datafile'] = strava_grouped['datafile'].convert_dtypes()
strava['datafile'] = strava['datafile'].convert_dtypes()
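
<p>As an aside, pandas' named aggregation produces flat column names directly, so the flatten-and-rename step is not needed. A minimal sketch on a toy frame (the values are made up, and only a few of the notebook's columns are shown):</p>

```python
import pandas as pd

# Toy stand-in for the strava frame; the values are made up for illustration.
toy = pd.DataFrame({
    "datafile": ["a.fit", "a.fit", "b.fit", "b.fit"],
    "enhanced_speed": [2.0, 3.0, 5.0, 7.0],
    "distance": [10.0, 20.0, 30.0, 60.0],
})

# Each keyword becomes an output column: name=(source column, aggregation).
summary = (
    toy.groupby("datafile")
    .agg(
        enhanced_speed_mean=("enhanced_speed", "mean"),
        enhanced_speed_max=("enhanced_speed", "max"),
        distance_max=("distance", "max"),
    )
    .reset_index()
)

print(summary)
```

<p>The output column names here were chosen to match the flattened names used in the rest of this notebook.</p>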

<h2>Histogram, Bar, and Scatter Plots</h2>

In [6]:
#let's look at mean speed
sp_mean_fig = px.histogram(strava_grouped, x="enhanced_speed_mean", title="EDA to See if Mean Speed Can Classify Activity")
sp_mean_fig.show()

#let's look at max speed
sp_max_fig = px.histogram(strava_grouped, x="enhanced_speed_max", title="EDA to See if Max Speed Can Classify Activity")
sp_max_fig.show()

<h3> Okay so there seems to be a threshold at 4 m/s for mean speed</h3>

In [28]:
# now we have an assumption we can encode into a category variable
strava_grouped['activity'] = np.where(strava_grouped['enhanced_speed_mean'] >= 4, "ride", "run") #aggregate DF
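
<p>The 4 m/s cutoff was read off the histogram by eye. One way to sanity-check a cutoff like this is to sort the per-activity mean speeds and look for the widest gap between neighbours. This is a sketch on made-up speeds, not the notebook's data, and the midpoint-of-largest-gap rule is an assumption chosen for illustration.</p>

```python
import numpy as np

# Hypothetical per-activity mean speeds in m/s (made up, not from the data set).
mean_speeds = np.array([2.1, 2.3, 2.2, 2.6, 5.4, 6.1, 5.9])

# Sort the speeds and find the widest gap between neighbouring values.
ordered = np.sort(mean_speeds)
gaps = np.diff(ordered)
widest = int(np.argmax(gaps))

# Put the threshold in the middle of that gap.
threshold = (ordered[widest] + ordered[widest + 1]) / 2

# Same encoding as the notebook: at or above the threshold is a ride.
labels = np.where(mean_speeds >= threshold, "ride", "run")
print(threshold, labels)
```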

In [29]:
# let's also map this to the original dataframe
ride_data = list(strava_grouped[strava_grouped.activity == 'ride']['datafile'].unique())
strava['activity'] = np.where(strava.datafile.isin(ride_data), "ride", "run")
strava.activity.value_counts() # let's check if that worked

run     32779
ride     7870
Name: activity, dtype: int64

In [9]:
#let's look at mean speed
c_sp_mean_fig = px.histogram(strava_grouped, x="enhanced_speed_mean", color='activity', title="Confirming Mean Speed Classification")
c_sp_mean_fig.show()

#let's look at max speed
c_sp_max_fig = px.histogram(strava_grouped, x="enhanced_speed_max", color='activity', title="Confirming Max Speed Classification")
c_sp_max_fig.show()

<h3>Okay, that looks like it worked well. Let's see how this plays out in the timestamps using a scatter plot</h3>
<p>Encoding both the datafile and the activity in a single chart would get crowded, so we will draw two charts in the same time space: one colored by datafile and one colored by activity. The time column has object dtype, so plotly does not infer its order. Therefore, we have to define the y axis as an ascending category.</p>

In [228]:
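# Explode timestamp into columns: weekday, date, time
strava['weekday'] = strava["timestamp"].dt.strftime("%a")
strava['date'] = strava["timestamp"].dt.date
strava['time'] = strava["timestamp"].dt.time
# 'label' is queried later but never defined in this notebook; this format is an
# assumption inferred from the value '2019/07/24 Wed run' used further down
strava['label'] = strava["timestamp"].dt.strftime("%Y/%m/%d %a") + ' ' + strava['activity']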

# strava.sort_values(by='time',inplace=True)

In [259]:
time_line_file = px.scatter(strava, x="date", y='time', color='datafile', title="Visualizing Data Files in Time Space").update_layout(yaxis_title="Time", xaxis_title="Date").update_yaxes(categoryorder='category ascending')
time_line_file.show()

time_line = px.scatter(strava, x="date", y='time', color='activity', title='What Times Does Brooks Work Out').update_layout(yaxis_title="Time", xaxis_title="Date").update_yaxes(categoryorder='category ascending')
time_line.show()

<p>Wow, Brooks sure seems to enjoy late-night runs between 10 PM and 4 AM.</p>
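<p>One possible explanation worth checking is a time zone offset: the FIT format records timestamps in UTC, and if the CSV kept them that way, the <code>time</code> column is not local time. The sketch below assumes UTC timestamps and a hypothetical local offset of UTC-4; neither assumption is confirmed by the data, and the two timestamps are made up.</p>

```python
from datetime import timedelta, timezone

import pandas as pd

# Two made-up timestamps that look like "late night" workouts.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-07-24 02:00:00", "2019-07-24 04:00:00"]),
})

# Hypothetical local offset (UTC-4); replace with the athlete's real time zone if known.
local_tz = timezone(timedelta(hours=-4))

# Treat the naive timestamps as UTC, then convert them to the local offset.
toy["local_timestamp"] = toy["timestamp"].dt.tz_localize("UTC").dt.tz_convert(local_tz)
toy["local_hour"] = toy["local_timestamp"].dt.hour

print(toy)
```

<p>Under these assumptions a 2 AM reading becomes 10 PM the previous evening, so the offset alone shifts the picture without fully explaining it.</p>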

<h3>Okay, let's plot some time per activity and times per weekday.</h3>
<p>Now, we can plot a separate histogram per activity or we can effectively do this comparison by group in a bar chart.</p>

In [240]:
# let's upgrade the summary statistics
strava_summary = strava.groupby('datafile').agg({
    'weekday': ['first'],
    'activity': ['first'],
    'elapsed_seconds': ['max'],
    'Power': ['min', 'max', 'mean'],
    'enhanced_speed': ['min', 'max', 'mean'],
    'distance': ['max'],
    'heart_rate': ['min', 'max', 'mean'],
    'cadence': ['min', 'max', 'mean'],
    'elevation_change': ['sum'],
}).reset_index()

# flatten and rename columns for ease of access
strava_summary.columns = strava_summary.columns.to_flat_index().str.join('_')
strava_summary = strava_summary.rename(columns={'datafile_':'datafile','weekday_first':'weekday','activity_first':'activity'})
strava_summary.columns

Index(['datafile', 'weekday', 'activity', 'elapsed_seconds_max', 'Power_min',
       'Power_max', 'Power_mean', 'enhanced_speed_min', 'enhanced_speed_max',
       'enhanced_speed_mean', 'distance_max', 'heart_rate_min',
       'heart_rate_max', 'heart_rate_mean', 'cadence_min', 'cadence_max',
       'cadence_mean', 'elevation_change_sum'],
      dtype='object')

In [137]:
# Activity Time Spent Per Day of Week
weektime_fig = px.bar(strava_summary, x="weekday", y='elapsed_seconds_max', category_orders={"weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]}, color='activity', barmode='group', title='Activity Time Spent Per Day of Week').update_layout(yaxis_title="Total Time (s)", xaxis_title="Day of Week")
weektime_fig.show()

# Favorite Day for Activity
weekday_fig = px.histogram(strava_summary, x="weekday", category_orders={"weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]}, color='activity', barmode='group', title='Favorite Day for Activity').update_layout(yaxis_title="Count", xaxis_title="Day of Week")
weekday_fig.show()
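
<p>The weekday ordering only takes effect when the labels in <code>category_orders</code> match the values in the <code>weekday</code> column exactly. A minimal sketch on made-up dates, using <code>day_name()</code> truncated to three letters as a stand-in for <code>strftime("%a")</code> (under an English locale both give abbreviations such as "Thu"):</p>

```python
import pandas as pd

# The labels must match the abbreviations stored in the weekday column.
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Made-up activity dates: one Thursday, one Monday, two Wednesdays.
dates = pd.Series(pd.to_datetime(["2019-07-25", "2019-07-22", "2019-07-24", "2019-07-24"]))
weekday = dates.dt.day_name().str[:3]

# Count activities per weekday and put the counts in calendar order.
counts = weekday.value_counts().reindex(order, fill_value=0)
print(counts)
```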

<h3>Mondays are Brooks' favorite days to ride. Weekends are when he spends the most time running, yet he most frequently goes for a run on Wednesday. Next let's look at distance metrics.</h3>
<p>When plotting a histogram we can switch the barmode from stacked to overlay so that the bars show the count per activity rather than the combined total.</p>

In [149]:
# Now let's see some distance metrics
dist_max_fig = px.histogram(strava_summary, x="distance_max", color='activity',title='Distribution of Distances Traveled During Cardio Activity').update_layout(barmode='overlay')
dist_max_fig.show()

<h3>Interesting, how do we think speed is impacted by the distance? Let's plot this comparison and add a trendline.</h3>

In [202]:
# let's graph the trendline for each activity in a faceted plot
mean_speed_scatter = px.scatter(strava_summary, x='distance_max', y='enhanced_speed_mean', facet_col="activity", color='activity', trendline="lowess")
mean_speed_scatter.show()

<p>Okay, so that is interesting, but I think we would have to dig beyond the visual exploratory data analysis to understand what these trendlines are telling us. That said, I think it is safe to say that Brooks runs a slower pace for a 10k than for a 5k.</p>
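<p>Runners usually quote pace as time per kilometre rather than speed, so a slower run means a larger pace figure. A small helper makes the conversion from the m/s values in this data explicit. The function name <code>pace_per_km</code> is hypothetical and not part of the original notebook.</p>

```python
def pace_per_km(speed_mps: float) -> str:
    """Convert a speed in m/s to a pace string in minutes:seconds per kilometre."""
    if speed_mps <= 0:
        raise ValueError("speed must be positive")
    # Seconds needed to cover 1000 m, rounded to the nearest second.
    total_seconds = round(1000 / speed_mps)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


print(pace_per_km(2.5))   # 6:40
print(pace_per_km(2.24))  # 7:26
```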

<h2>Violin and Box Plots</h2>
<p>Next we will move back to our non-aggregated dataframe. First we will use violin plots to show the density curves and layer box plots on top to summarize the distribution. We will use these plot types to chart heart rate and speed, which will help show physiological effort levels.</p>

In [234]:
# let's plot a violin plot
mean_sp_violin = px.violin(strava, y="enhanced_speed", x="activity", color="activity", box=True, title='Distribution of Speed During Cardio Activity').update_layout(yaxis_title="Speed (m/s)", xaxis_title="Activity")
mean_sp_violin.show()

In [150]:
# let's take a look at the distribution of data in heart rate zones
violin2 = px.violin(strava, y="heart_rate", x="activity", color="activity", box=True, title='Inferring Heart Rate Zones Through Distribution of Data').update_layout(yaxis_title="Heart Rate (BPM)", xaxis_title="Activity",barmode='overlay')
violin2.show()

<p>So there seems to be a much more concentrated distribution of speed for running and, as expected, he travels faster by bike than on foot. Interestingly enough, during running there is much more fluctuation in heart rate as well as a much greater max heart rate. It seems that these two cardio activities stress the heart in very different ways. Let's investigate this further by looking at some runs specifically.</p>
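<p>The violin plots show spread visually; the same comparison can be put into numbers with a grouped mean and standard deviation. This is a sketch on made-up values that only mimic the pattern described above, so the numbers themselves mean nothing.</p>

```python
import pandas as pd

# Made-up readings that mimic the pattern in the violin plots.
toy = pd.DataFrame({
    "activity": ["run"] * 4 + ["ride"] * 4,
    "heart_rate": [120, 150, 170, 140, 130, 132, 134, 136],
    "enhanced_speed": [2.2, 2.3, 2.2, 2.3, 4.0, 6.5, 8.0, 5.5],
})

# Mean and standard deviation of each metric per activity.
spread = toy.groupby("activity")[["heart_rate", "enhanced_speed"]].agg(["mean", "std"])
print(spread)
```

<p>Running the same grouping on <code>strava</code> would show whether the heart rate spread really is larger for runs and the speed spread larger for rides.</p>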

<h2>Line plots</h2>
<p>Now we will do some line plotting to see the run activity over time. In particular, let's look at runs of 5k and above. The great thing about plotly express being interactive is that you can plot everything and then comb through the layers, turning specific workouts on and off. This lets us visually select some runs for targeted analysis, which is a particular benefit of using plotly for EDA.</p>

In [238]:
#Let's see if speed is maintained over distance
strava_5k_run = list(strava_summary.query("activity == 'run' and 6000 >= distance_max and distance_max >= 5000")['datafile'])
strava_10k_run = list(strava_summary.query("activity == 'run' and distance_max > 8000")['datafile'])
sp_dist = px.scatter(strava[strava.datafile.isin(strava_5k_run)], y="enhanced_speed", x="distance", opacity=.2, color='datafile', trendline="lowess", title='Moving Regression Line Showing 5k Effect on Run Speed')
sp_dist.show()
sp_dist_10k = px.scatter(strava[strava.datafile.isin(strava_10k_run)], y="enhanced_speed", x="distance", color='datafile', opacity=.2, trendline="lowess", title='Moving Regression Line Showing 10k Effect on Run Speed')
sp_dist_10k.show()

<p>Let's chart the 7-9 m/s max speed run that we found in our 'Confirming Max Speed Classification' chart and then use the 10k run for mapping.</p>

In [241]:
# let's find the datafile and the distance of this run
print(strava.query("label == '2019/07/24 Wed run'")['datafile'].unique()) #  'activities/2717598473.fit.gz', 'activities/2717660588.fit.gz'
print(strava_summary[strava_summary.datafile == 'activities/2717598473.fit.gz']['distance_max']) #3904.54

<StringArray>
['activities/2717598473.fit.gz', 'activities/2717660588.fit.gz']
Length: 2, dtype: string
17    3904.54
Name: distance_max, dtype: float64


In [65]:
# heart rate plot
speed_run = px.line(strava[strava.datafile == 'activities/2717598473.fit.gz'], x='timestamp', y='heart_rate', title='Heart Rate during 2019/07/24 Wed run of 3904.54 meters').update_layout(yaxis_title="Heart Rate (BPM)")
speed_run.show()

# speed plot
speed_run_speed = px.line(strava[strava.datafile== 'activities/2717598473.fit.gz'], x='timestamp', y='enhanced_speed', title='Enhanced Speed (m/s) during 2019/07/24 Wed Run of 3904.54 meters').update_layout(yaxis_title="Speed (m/s)")
speed_run_speed.show()


<p>Interesting. I bet he was doing what is known as a hill workout, where you run up and down the same hills in repetition.</p>
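<p>One rough way to test the hill-workout guess is to compare total ascent and descent with the net change in altitude: repeated climbs on the same hill give large totals but a net change near zero. The altitude values below are made up, and the clip-and-sum approach is only a sketch that ignores sensor noise.</p>

```python
import pandas as pd

# Made-up altitude trace in metres: two climbs and two descents.
altitude = pd.Series([100.0, 110.0, 120.0, 105.0, 100.0, 112.0, 121.0, 101.0])

# Point-to-point change; the first value is NaN and is skipped by sum().
change = altitude.diff()

# Keep only the positive steps for ascent and only the negative steps for descent.
total_ascent = change.clip(lower=0).sum()
total_descent = -change.clip(upper=0).sum()
net_change = altitude.iloc[-1] - altitude.iloc[0]

print(total_ascent, total_descent, net_change)  # 41.0 40.0 1.0
```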

<h2>Map Plotting</h2>
<p>The position columns hold latitude and longitude as FIT "semicircles" (32-bit integer units) rather than degrees, so we need to convert them to degrees for plotly.</p>

In [None]:
# creating a latitude and longitude degrees column for scatter_mapbox
strava['position_lat_degrees'] = strava['position_lat'] * ( 180 / 2**31)
strava['position_long_degrees'] = strava['position_long'] * ( 180 / 2**31)
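
<p>The conversion factor comes from the semicircle encoding: 2<sup>31</sup> semicircles span 180 degrees, so degrees = semicircles × 180 / 2<sup>31</sup>. A quick check with a hypothetical helper (the function name is not part of the original notebook):</p>

```python
def semicircles_to_degrees(semicircles: float) -> float:
    """Convert a FIT semicircle value to degrees (2**31 semicircles = 180 degrees)."""
    return semicircles * (180 / 2**31)


# A quarter of the full integer range corresponds to 90 degrees.
print(semicircles_to_degrees(2**30))   # 90.0
print(semicircles_to_degrees(-2**31))  # -180.0
```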

<h3>Next we are going to plot the run where he ran the fastest and color it by altitude because, as a runner myself, I suspect there is some hill work in here based on the line plots.</h3>

In [254]:
#token = open("assets/mapbox_token.rtf").read() # you will need your own token
token = 'pk.eyJ1IjoiY3dkYXRhIiwiYSI6ImNrbXY0ZXQwNTAxa2kyb2w1MGwzNGRkZGwifQ.ztVIgZW8bIeJqDu1nOjaEQ'
sp_mapping = px.scatter_mapbox(strava.query('datafile == "activities/2717598473.fit.gz"').sort_values(by='timestamp'), lat="position_lat_degrees", lon="position_long_degrees", color='enhanced_altitude', hover_name="label", hover_data=['timestamp','elapsed_seconds','enhanced_altitude',"enhanced_speed", "cadence",'Leg Spring Stiffness','Form Power','Power'],
                        color_discrete_sequence=["fuchsia"], zoom=14, height=400)
sp_mapping.update_layout(mapbox_style="satellite-streets", mapbox_accesstoken=token)
sp_mapping.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
sp_mapping.show()

<p>Yep, looks like he was running hills.</p>

<h3>Next we are going to plot his longest run and colorize his heart rate to see how it performed over time. Usually at the end of a run your heart rate is sustaining its highest BPM.</h3>

In [257]:
#token = open("assets/mapbox_token.rtf").read() # you will need your own token
token = 'pk.eyJ1IjoiY3dkYXRhIiwiYSI6ImNrbXY0ZXQwNTAxa2kyb2w1MGwzNGRkZGwifQ.ztVIgZW8bIeJqDu1nOjaEQ'
mapping = px.scatter_mapbox(strava.query('datafile == "activities/2722642748.fit.gz"'), lat="position_lat_degrees", lon="position_long_degrees", color='heart_rate', hover_data=['timestamp','elapsed_seconds','enhanced_altitude',"enhanced_speed","cadence",'distance'],
                        color_discrete_sequence=["fuchsia"], zoom=13, height=500)
mapping.update_layout(mapbox_style="satellite-streets", mapbox_accesstoken=token)
mapping.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
mapping.show()

<h1>In Summary</h1>

<h3>Running Analysis</h3>
<ul>
    <li>2.24 m/s seems to be the average running speed. As with most runners, Professor Brooks runs a faster pace for his 5k efforts than for his 10k efforts.</li>
    <li>Even though the variance in pace is quite small, the variance in heart rate is greater than when biking.</li>
    <li>Based on the rolling density lines in the violin charts it seems that Brooks consistently encounters changes in effort during his runs. I suspect that these rolling efforts correlate with hills.</li>
    <li>Wednesday seems to be when most of the runs are.</li>
    <li>Professor Brooks spends the most time on the weekends running.</li>
    <li>Professor Brooks also, strangely, seems to enjoy running between 10 PM and 4 AM.</li>
    <li>July to September were the running months.</li>
</ul>
<h3>Riding</h3>
<ul>
    <li>5.90 m/s seems to be the average biking speed.</li>
    <li>There is a much greater variance in pace when Brooks is on the bike than when he is running, yet he has a much more focused heart rate zone.</li>
    <li>Monday is when biking is most likely.</li>
    <li>Biking started up after running ended. September to October were the biking months.</li>
</ul>
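<p>The month-level claims above were read off the time-space scatter plots. A cross-tabulation of month against activity would state them as counts. This is a sketch with made-up dates, assuming one row per activity with a <code>date</code> and an <code>activity</code> column:</p>

```python
import pandas as pd

# Made-up per-activity rows; the real notebook would use one row per datafile.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-07-10", "2019-08-02", "2019-09-15", "2019-09-28", "2019-10-05"]),
    "activity": ["run", "run", "run", "ride", "ride"],
})

# Rows are calendar months (as numbers), columns are activities, cells are counts.
by_month = pd.crosstab(toy["date"].dt.month, toy["activity"])
print(by_month)
```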