# Fundamentals of Data Visualization - Final Project

### Intro
 
For my final project, I will be analyzing and visualizing data that I myself have generated throughout 2023. The dataset is training data I have recorded using the popular fitness tracking app Strava. For people who aren't familiar, Strava is an app many athletes use to track activities using GPS. The app also has social network components; however, the social network statistics will not be included in this analysis.

The app tracks metrics such as the activity type, elapsed time, distance, average speed, elevation gain and even weather statistics for the day and place you are exercising. The app allows you to download your historical data which is exactly how I obtained this dataset.


### Goals

The dataset I chose is the last year's worth of my Strava data (2022-09-01 to 2023-09-01). Each row of the data is a recorded workout with a unique “Activity ID” as its primary key. My goal for working with this data is to see if I have become faster throughout my year's worth of training. My initial idea is to look at total time and distance to determine if my average pace got faster (when controlling for additional factors).

For the sake of this analysis, I will visualize other sports I used as cross-training, but limit the question to my ability to run faster. The large majority of my workouts were running. 


### Tasks
$\textbf{1. Download the data from my Strava profile}$

$\textbf{2. Read in the dataset}$

$\textbf{3. Clean the dataset}$

$\textbf{4. Exploratory data analysis}$

$\textbf{5.1: Uncover trends in the data}$

$\textbf{- Why is a task pursued? (goal)}$

The goal of this task is to determine whether or not there are any particular trends in my training data.

$\textbf{- How is a task conducted? (means)}$

I will use the pandas package to manipulate my data and Altair to visualize my findings.

$\textbf{- What does a task seek to learn about the data? (characteristics)}$

I will look across all variables in the data to see if there are any seasonal, elevation, or activity trends throughout the data.

$\textbf{- Where does the task operate? (target data)}$

This task will operate in a local Jupyter notebook.

$\textbf{- When is the task performed? (workflow)}$

The task will be performed over the next few weeks.

$\textbf{- Who is executing the task? (roles)}$

I will be the only one working on this task.

$\textbf{5.2: Am I getting faster?}$

$\textbf{- Why is a task pursued? (goal)}$

The goal of this task is to determine whether or not I am becoming a better athlete (measured by running faster).

$\textbf{- How is a task conducted? (means)}$

I will use the pandas package to manipulate my data and Altair to visualize my findings.

$\textbf{- What does a task seek to learn about the data? (characteristics)}$

I will look at the average weighted pace of runs throughout the data to see if I am improving in running faster.

$\textbf{- Where does the task operate? (target data)}$

This task will operate in a local Jupyter notebook.

$\textbf{- When is the task performed? (workflow)}$

The task will be performed over the next few weeks.

$\textbf{- Who is executing the task? (roles)}$

I will be the only one working on this task.


$\textbf{6. Reflect}$


### Data Callouts

- $\textit{Distance}$: Distance in kilometers
- $\textit{Moving Time}$: Time in minutes
- $\textit{Elevation Gain}$: Elevation in meters
- $\textit{Average Speed}$: Average speed in m/s

The remaining columns should be self-explanatory.

### Imports

In [1]:
import pandas as pd
import altair as alt
import numpy as np

data = pd.read_csv("activities.csv")

### Data Cleaning

In [2]:
# removing many unnecessary columns that are not in the scope of this analysis
data.drop(columns = ['Activity Name', 'Activity Description', 'Commute', 'Activity Private Note', 'Activity Gear',
       'Filename', 'Athlete Weight', 'Bike Weight', 'Max Cadence', 'Average Cadence',  '<span class="translation_missing" title="translation missing: en-US.lib.export.portability_exporter.activities.horton_values.type">Type</span>',
       '<span class="translation_missing" title="translation missing: en-US.lib.export.portability_exporter.activities.horton_values.start_time">Start Time</span>',
       'Weighted Average Power', 'Power Count', 'Prefer Perceived Exertion',
       'Perceived Relative Effort', 'Commute.1', 'Total Weight Lifted', 'Sunrise Time',
       'Sunset Time', 'Moon Phase', 'Bike', 'Gear',
       'Precipitation Probability', '<span class="translation_missing" title="translation missing: en-US.lib.export.portability_exporter.activities.horton_values.jump_count">Jump Count</span>',
       '<span class="translation_missing" title="translation missing: en-US.lib.export.portability_exporter.activities.horton_values.total_grit">Total Grit</span>',
       '<span class="translation_missing" title="translation missing: en-US.lib.export.portability_exporter.activities.horton_values.avg_flow">Avg Flow</span>',
       '<span class="translation_missing" title="translation missing: en-US.lib.export.portability_exporter.activities.horton_values.flagged">Flagged</span>',
       '<span class="translation_missing" title="translation missing: en-US.lib.export.portability_exporter.activities.horton_values.avg_elapsed_speed">Avg Elapsed Speed</span>',
       '<span class="translation_missing" title="translation missing: en-US.lib.export.portability_exporter.activities.horton_values.dirt_distance">Dirt Distance</span>',
       '<span class="translation_missing" title="translation missing: en-US.lib.export.portability_exporter.activities.horton_values.newly_explored_distance">Newly Explored Distance</span>',
       '<span class="translation_missing" title="translation missing: en-US.lib.export.portability_exporter.activities.horton_values.newly_explored_dirt_distance">Newly Explored Dirt Distance</span>',
       '<span class="translation_missing" title="translation missing: en-US.lib.export.portability_exporter.activities.horton_values.sport_type">Sport Type</span>',
       '<span class="translation_missing" title="translation missing: en-US.lib.export.portability_exporter.activities.horton_values.total_steps">Total Steps</span>',
       'Media', 'Max Watts', 'Average Watts', 'From Upload', 'Average Positive Grade', 'Average Negative Grade', 'Max Heart Rate.1', 'Max Temperature', 'Number of Runs', 'Total Work', 'Uphill Time', 'Downhill Time'
                    , 'Other Time', 'Distance.1', 'Elapsed Time.1', 'Relative Effort.1'], inplace = True)


In [3]:
# changing the 'Activity Date' column to a datetime
data['Activity Date'] = pd.to_datetime(data['Activity Date'])

# removing the thousands separator (comma) from the 'Distance' column
data['Distance'] = data['Distance'].str.replace(',', '')

# changing the 'Distance' column to a float
data['Distance'] = data['Distance'].astype(float)

# limiting the dataset to a full year
data = data.loc[data['Activity Date'] >= '2022-09-01']
data = data.loc[data['Activity Date'] < '2023-09-01']
                
# removing data errors (I have never done a stair stepper or nordic ski workout)
data = data.loc[data['Activity Type'] != 'Stair-Stepper']
data = data.loc[data['Activity Type'] != 'Nordic Ski']

# putting the 'Moving Time' column from seconds into minutes
data['Moving Time'] = data['Moving Time'] / 60

# resetting the index so it's clean to work with
data = data.reset_index(drop = True)


### EDA

In [4]:
# here we can take a high level look at the data and see some summary statistics
data.describe()

| | Activity ID | Elapsed Time | Distance | Max Heart Rate | Relative Effort | Moving Time | Max Speed | Average Speed | Elevation Gain | Elevation Loss | ... | Weather Pressure | Wind Speed | Wind Gust | Wind Bearing | Precipitation Intensity | Precipitation Type | Cloud Cover | Weather Visibility | UV Index | Weather Ozone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 341.0 | 341.0 | 341.0 | 338.0 | 338.0 | 341.0 | 341.0 | 341.0 | 341.0 | 328.0 | ... | 305.0 | 305.0 | 305.0 | 305.0 | 305.0 | 164.0 | 305.0 | 305.0 | 305.0 | 168.0 |
| mean | 8706869000.0 | 7122.703812 | 17.416716 | 153.95858 | 87.420118 | 100.433187 | 7.983236 | 3.118162 | 529.626956 | 672.926498 | ... | 1016.548524 | 2.733246 | 5.383443 | 207.370492 | 0.076754 | 3.128049 | 0.454 | 22227.10187 | 3.101639 | 311.426785 |
| std | 575109800.0 | 5370.198857 | 15.55275 | 19.825606 | 139.658978 | 71.440584 | 5.264072 | 1.687865 | 508.425357 | 923.081351 | ... | 7.820893 | 1.18112 | 2.27261 | 86.114045 | 0.291913 | 2.213055 | 0.358965 | 10880.219161 | 2.917647 | 42.25592 |
| min | 7739487000.0 | 358.0 | 0.75 | 104.0 | 3.0 | 5.966667 | 1.768889 | 0.552477 | 0.0 | 0.0 | ... | 994.200012 | 0.49 | 1.36 | 2.0 | 0.0 | 1.0 | 0.0 | 1757.0 | 0.0 | 237.399994 |
| 25% | 8225450000.0 | 3915.0 | 9.86 | 143.0 | 26.25 | 60.266667 | 4.337988 | 2.242087 | 247.0 | 260.949997 | ... | 1011.400024 | 1.99 | 3.98 | 140.0 | 0.0 | 1.0 | 0.08 | 16093.0 | 1.0 | 276.974998 |
| 50% | 8678014000.0 | 5033.0 | 12.48 | 150.0 | 53.0 | 77.933333 | 5.069922 | 2.775987 | 380.0 | 410.0 | ... | 1015.710022 | 2.57 | 4.86 | 231.0 | 0.0 | 1.0 | 0.45 | 16093.0 | 2.0 | 304.099991 |
| 75% | 9202844000.0 | 8456.0 | 19.69 | 161.0 | 94.75 | 114.85 | 11.621973 | 3.362911 | 718.0 | 760.25 | ... | 1021.5 | 3.36 | 6.44 | 273.0 | 0.0 | 5.0 | 0.81 | 30760.589844 | 5.0 | 333.875008 |
| max | 9757197000.0 | 43333.0 | 165.11 | 224.0 | 1408.0 | 613.4 | 26.464014 | 9.569311 | 3879.770508 | 9479.0 | ... | 1038.0 | 8.61 | 14.93 | 356.0 | 2.5 | 6.0 | 1.0 | 51682.550781 | 12.0 | 429.100006 |


In [5]:
print(f'The total distance I logged was: {round(sum(data["Distance"]), 2)} kilometers')

The total distance I logged was: 5939.1 kilometers


In [6]:
print(f'The total time I logged was: {round(((sum(data["Elapsed Time"]) / 60) / 60) / 24, 2)} days worth of training')

The total time I logged was: 28.11 days worth of training


In [7]:
print(f'The activity with the largest elevation gain was: {round(max(data["Elevation Gain"]), 2)} meters')

The activity with the largest elevation gain was: 3879.77 meters


### Task 1: Visualizations

In [None]:
# woah that's a lot of kilometers
chart1 = alt.Chart(data).mark_bar(tooltip = True).encode(
    x = 'Activity Type',
    y = 'count(Activity ID)',
    color = alt.condition(
        alt.datum['Activity Type'] == 'Run', alt.value('red'), alt.value('navy'))
)

chart2 = alt.Chart(data).mark_bar(tooltip = True).encode(
    x = "Activity Type",
    y = "sum(Distance):Q",
    color = alt.condition(
        alt.datum['Activity Type'] == 'Run', alt.value('red'), alt.value('navy'))
)

chart3 = alt.Chart(data).mark_bar(tooltip = True).encode(
    x = "Activity Type",
    y = "average(Distance):Q",
    color = alt.condition(
        alt.datum['Activity Type'] == 'Run', alt.value('red'), alt.value('navy'))
)

chart1.properties(title = "Count by Activity", width = 200, height = 250) | chart2.properties(title = "Activity by Total Kilometers", width = 200, height = 250) | chart3.properties(title = "Activity by Average Kilometers", width = 200, height = 250)



These first graphs show each activity side by side with the activity count, total kilometers, and the average activity kilometers. The run category is highlighted as that will be the focus of the analysis later on.

You can see I run more than every other sport combined (more than 75% of activities). However, when I run I don't go as far on average as when I bike. This makes sense, as you can generally cover more kilometers on a bike than you can running. I also run year round, whereas I can only bike and backcountry ski seasonally.
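The activity share quoted above can be checked directly with pandas. This is a minimal sketch: the `sample` frame is a made-up stand-in for the cleaned `data` frame and only illustrates the call.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned `data` frame (one row per activity)
sample = pd.DataFrame({
    "Activity Type": ["Run", "Run", "Run", "Ride", "Run", "Hike", "Run", "Run"],
})

# Share of activities by type, as a fraction of all rows
share = sample["Activity Type"].value_counts(normalize=True)

print(share.round(3))
```

Running the same `value_counts(normalize = True)` call on the real 'Activity Type' column would give the actual percentages behind the first bar chart.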

In [9]:
input_dropdown = alt.binding_select(options = ['Alpine Ski', 'Backcountry Ski', 'Hike', 'Ride', 'Run', 'Virtual Ride'], name = 'Activity Type: ')
selection = alt.selection_point(fields = ['Activity Type'], bind = input_dropdown)
color = alt.condition(selection,
                     alt.Color('Activity Type:N', scale = alt.Scale(scheme = 'dark2')).legend(None),
                    alt.value('lightgray')
                     )


chart4 = alt.Chart(data).mark_circle().encode(
    x = 'Average Speed:Q',
    y = 'Distance:Q',
    color = color,
    tooltip = ['Activity Type', 'Distance', 'Average Speed']).add_params(selection)

chart5 = alt.Chart(data).mark_circle().encode(
    x = 'Average Speed:Q',
    y = 'Elevation Gain',
    color = color,
    tooltip = ['Activity Type', 'Elevation Gain', 'Average Speed']).add_params(selection)

chart4.properties(title = 'Distance on Speed') | chart5.properties(title = 'Elevation on Speed')


The visualization on the left begins to show the relationship between speed and distance. Overall, if I am traveling faster I am generally going farther. There is also a slight negative relationship between elevation and speed. Generally, if I am climbing a lot during a workout I am going slower on average.

What's interesting in this visualization is using the selector to highlight a different activity type. For example, if you select 'Backcountry Ski' (pink dots) you can see I am pretty much going the same distance, speed, and elevation each time. Comparing that graph to when the activity type is 'Ride' my data is much more variable with more outliers. Some days I am going very far and climbing a lot on the bike!
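The relationships read off the scatter plots can also be summarized as correlations. The sketch below uses a made-up `sample` frame (hypothetical values, not my real data) with the same column names as the cleaned dataset.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned `data` frame
sample = pd.DataFrame({
    "Average Speed": [2.5, 3.0, 3.5, 4.0, 6.0],
    "Distance": [8.0, 10.0, 12.0, 15.0, 40.0],
    "Elevation Gain": [900.0, 600.0, 400.0, 300.0, 350.0],
})

# Pearson correlation of every column with average speed
corr_with_speed = sample.corr()["Average Speed"]

print(corr_with_speed.round(2))
```

Because activity type has such a large effect on speed, computing these correlations separately for each activity type would likely be more informative than pooling all activities.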

In [10]:
chart6 = alt.Chart(data).mark_line().encode(
    x = 'Activity Date',
    y = 'Average Speed:Q',
    color = 'Activity Type',
    tooltip = ['Activity Type', 'Activity Date', 'Average Speed']).properties(title = 'Date on Average Speed', width = 600, height = 250).interactive()
chart6

Now we can start looking at time series data. Here we can see each activity's speed trend across the year. I tend to have the highest average speed when doing a virtual ride indoors. The green line representing my running is highly variable. In some workouts I average close to 4 m/s (about 14 km/hr), whereas in others I am barely at 2 m/s (about 7 km/hr). This makes sense, as depending on the day I might do a faster workout or a slow, easy one.

### Task 1 Conclusion

From the above section of graphs there are a few conclusions I can draw:

1. I run a lot compared to other activities. Running was by far my most consistent and popular activity type.

2. The activity type has a huge impact on how fast I'm going on average. Elevation has less of an impact.

3. It is hard to conclude that I am getting faster at running from the above graphs. I will look into this further below!

### Task 2: Am I getting faster?

For this second task, I am going to attempt to quantify whether or not I am getting faster. I will only be looking at running activities for this task. As you will see there are a couple of different considerations that must be made when we measure 'speed'.
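One of those considerations is how to average pace across runs of different lengths. A distance-weighted pace divides total time by total distance, so a long run counts for more than a short one. This is a rough sketch with made-up runs; the `runs` frame and the 'Pace (min/km)' column name are hypothetical.

```python
import pandas as pd

# Hypothetical runs; columns follow the cleaned data
# (Distance in kilometers, Moving Time in minutes)
runs = pd.DataFrame({
    "Activity Date": pd.to_datetime(
        ["2023-04-02", "2023-04-20", "2023-05-05", "2023-05-25"]
    ),
    "Distance": [10.0, 20.0, 10.0, 30.0],
    "Moving Time": [60.0, 110.0, 55.0, 150.0],
})

# Sum time and distance per calendar month, then divide.
# This weights each run by its distance instead of averaging per-run paces.
month = runs["Activity Date"].dt.to_period("M")
monthly = runs.groupby(month)[["Distance", "Moving Time"]].sum()
monthly["Pace (min/km)"] = monthly["Moving Time"] / monthly["Distance"]

paces = monthly["Pace (min/km)"].tolist()
print(monthly)
```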

In [11]:
# .copy() gives an independent frame, so columns can be added later without touching `data`
run_data = data[data['Activity Type'] == 'Run'].copy()

Here I am just limiting the dataset we previously visualized to just runs.

In [None]:
brush = alt.selection_interval(encodings = ['x'])

line = alt.Chart(run_data).mark_line(
    color = 'red',
    size = 3
).transform_window(
    rolling_mean = 'mean(Average Speed)',
    frame = [-7, 7]
).encode(
    x = 'Activity Date:T',
    y = 'rolling_mean:Q'
)

line2 = alt.Chart(run_data).mark_line(
    color = 'red',
    size = 3
).transform_window(
    rolling_mean = 'mean(Elevation Gain)',
    frame = [-7, 7]
).encode(
    x = 'Activity Date:T',
    y = 'rolling_mean:Q'
)

points = alt.Chart(run_data).mark_point().encode(
    x = 'Activity Date:T',
    y = alt.Y('Average Speed:Q'),
    color = alt.condition(brush, 'Activity Type', alt.value('lightgray'))
    ).properties(title = '2 week rolling average speed', width = 400, height = 450
    ).add_params(brush)

points + line | points.encode(y = 'Elevation Gain:Q') + line2


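In this visualization, we can see my average speed over time in the left chart and my elevation gain over time in the right chart. The red line on the left chart shows the rolling average of my speed, whereas the red line on the right chart shows the rolling average of my elevation gain. Each rolling average is centered on a run and covers the 7 runs before and the 7 runs after it (the frame of [-7, 7]), which I am treating as roughly a 2 week window.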
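The window frame of [-7, 7] in the chart code counts rows (runs), not calendar days. For comparison, here is a sketch of a rolling mean over a true 14 day window in pandas. The `runs` values are made up, and unlike the centered Altair window this one is trailing: it only looks back in time.

```python
import pandas as pd

# Hypothetical runs, one every 5 days; the speeds are made up
runs = pd.DataFrame({
    "Activity Date": pd.date_range("2023-01-01", periods=6, freq="5D"),
    "Average Speed": [2.0, 3.0, 4.0, 3.0, 2.0, 4.0],
})

# A time-based window needs a sorted datetime index
speed = runs.sort_values("Activity Date").set_index("Activity Date")["Average Speed"]

# Trailing 14 day mean: each value averages the runs in the 14 days up to that run
rolling_14d = speed.rolling("14D").mean()

print(rolling_14d)
```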

There is also functionality within the graph to click and drag across the x axis to highlight a particular time. The highlight is then carried over to the other graph. Try it out yourself!

In terms of the trend of the data, there are a few periods where my average speed picked up, notably from Feb to April and again from Jul to Aug. These were speed-specific pre-season and race-specific preparations, respectively. Both were specific training blocks.

However, comparing both graphs you will notice a pattern. As the rolling mean speed decreases, the rolling mean elevation gain increases. In other words, I am typically running slower because I am running uphill which takes a lot more strength and power. Is there a way to control for elevation in these numbers? 

To control for the elevation gain in each of my runs, I'm going to look at a new metric called Grade Adjusted Pace (GAP). This is a function that takes in the elevation gain during a run as an input, and the outcome is the adjusted pace. There are many different formulas to use to calculate this metric (including very complex ones), however, here is a simplistic function we can use in this case. In the formula i is the elevation gain and f is the adjusted speed:

$f = 1 + 0.03\,i + 1.5 \times 10^{-3}\,i^{2}$

If you are interested in learning more about GAP you can read some further information here: https://davethecanuck.github.io/runcalc/
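The formula above can also be wrapped in a small helper. The sketch below is one possible variant and rests on assumptions that differ from the notebook code that follows: here i is read as the average grade in percent (elevation gain divided by distance), and the resulting factor f is used to scale the recorded average speed. The function names `grade_factor` and `grade_adjusted_speed` are hypothetical, and the grade estimate ignores descents.

```python
def grade_factor(grade_percent: float) -> float:
    """Effort factor f from the formula above, with i read as grade in percent (an assumption)."""
    return 1 + 0.03 * grade_percent + 1.5e-3 * grade_percent ** 2.0


def grade_adjusted_speed(avg_speed: float, distance_km: float, elevation_gain_m: float) -> float:
    """Hypothetical helper: scale the recorded speed by the effort factor."""
    # average grade in percent = meters climbed per 100 meters travelled
    grade_percent = elevation_gain_m / (distance_km * 1000) * 100
    return avg_speed * grade_factor(grade_percent)


# Two made-up runs at the same recorded speed: one flat, one with 500 m of climbing
flat = grade_adjusted_speed(avg_speed=3.0, distance_km=10.0, elevation_gain_m=0.0)
hilly = grade_adjusted_speed(avg_speed=3.0, distance_km=10.0, elevation_gain_m=500.0)

print(flat, hilly)
```

Under this reading, a flat run keeps its recorded speed, while a hilly run at the same recorded speed is credited with a higher adjusted speed.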

In [None]:
# applying the simplified GAP formula to every run's elevation gain at once
elevation_gain = run_data['Elevation Gain']
avg_speed_elevation_adjusted = 1 + 0.03 * elevation_gain + 1.5e-3 * elevation_gain ** 2.0

In [None]:
run_data.insert(loc = 10, column = "Average Speed Elevation Adjusted", value = avg_speed_elevation_adjusted)

We will take the 'Elevation Gain' column and use our model to create a new column, 'Average Speed Elevation Adjusted' that we will look into further.

In [None]:
line = alt.Chart(run_data).mark_line(
    color = 'red',
    size = 3
).transform_window(
    rolling_mean = 'mean(Average Speed Elevation Adjusted)',
    frame = [-7, 7]
).encode(
    x = 'Activity Date:T',
    y = 'rolling_mean:Q'
)

points = alt.Chart(run_data).mark_point().encode(
    x = 'Activity Date:T',
    y = alt.Y('Average Speed Elevation Adjusted:Q')).properties(title = '2 week rolling average speed', width = 400, height = 450).interactive()

points + line

Here we can see a simple dot plot of our newly created 'Average Speed Elevation Adjusted' field. The results are very interesting. If we look in 2023, I made marginal gains throughout the winter months which would be expected. During these months, I am not running very much as there is typically snow on the ground. I would only be running at very slow paces.

Then as the weather warms up and the snow melts around mid Apr, I am getting back outside to prepare for the spring/summer race season. This can be seen in the upward trend of the 2 week rolling GAP from Apr to June. During June, I am typically at my race fitness and do not need the extra speed sessions. My GAP fluctuates a bit but remains steady until August. In August I am near the end of the race season and am typically just logging miles without worrying much about speed work. I had also signed up for a few longer races in the early fall, where training more miles and less speed is important. This is reflected in the dropoff in GAP beginning in August.

What's surprising to me is the steady increase in pace in Apr. I am clearly getting faster throughout those months. There is rarely a single approach to training. Continuing flatter speed work into June and July could very much help me in particular races.

These graphs are excellent to view and compare season after season.

### Task 2 Conclusion

From the above section of graphs there are a few conclusions I can draw:

1. My running speed is very seasonal. There are particular months I am running faster than others.

2. The GAP model seems fairly accurate in its measure. It did a good job of considering the elevation change when I'm training.

3. My running data has a good amount of outliers where I am climbing much more than average. This might be during a race or a specific day of training.

### Final Evaluation Approach

In conclusion, I believe my procedure was very well thought out and impactful. My initial ideas on scoping the analysis goals and key visualizations removed a lot of ambiguity from my analysis. Often when I am conducting analyses like this, I will not have an approach that I have thought through and will start visualizing datapoints without a larger objective in mind. The design study approach we have learned throughout this class greatly helped me and is a takeaway of mine from this class.

I also recruited three co-workers of mine to review my analysis and provide feedback. I believe this was another critical step in the design study process. Although there are sometimes time constraints for these types of analyses in the real world, having different people's perspectives on the analysis is very beneficial for drawing clear insights. I discussed my analysis and results with my three colleagues (who work in data analytics) for 30 minutes. The feedback was positive overall; however, it was clear to me there was some key context missing. None of my colleagues had any experience working with fitness data, and they immediately had questions about the structure and units of the dataset. The 'Data Callouts' points were added to the analysis after hearing their questions.

Another idea that was brought forward by my colleagues was using supplemental data to layer on perceived exertion and/or heart rate data to enrich the graphs above. I agreed with their suggestion; however, in the end I decided to leave out this data due to its sensitivity. Adding this data would be very interesting as a personal project.

Looking back on the analysis, one refinement I would make would be to remove outliers in my data. There were a good deal of outliers in my running data. Removing these outliers would have smoothed out the rolling mean functions I included. There are many additional questions I had while looking through this data such as: does the average temperature impact my speed? Is there a way to graph my overall 'fitness' with this data? 
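As a sketch of how that outlier removal could look, the snippet below filters runs with the interquartile range (IQR) rule. The IQR rule is a common rule of thumb rather than something I used in this analysis, and the `runs` values are made up.

```python
import pandas as pd

# Hypothetical elevation gains in meters; the last value is an extreme day
runs = pd.DataFrame({"Elevation Gain": [200.0, 250.0, 300.0, 350.0, 400.0, 3800.0]})

# IQR fences: values more than 1.5 * IQR outside the quartiles count as outliers
q1 = runs["Elevation Gain"].quantile(0.25)
q3 = runs["Elevation Gain"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the runs inside the fences
filtered = runs[runs["Elevation Gain"].between(lower, upper)]

print(len(runs), len(filtered))
```

Whether a big climbing day is an error or a legitimate race effort is a judgment call, so any filter like this should be checked against the actual activities it removes.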