# Getting Things Set Up

<a id='section_id'></a>

### Computational Rules: 

- `Use cell divisions to make steps clear`
    - In this notebook I lay out the steps for converting and cleaning the dataset into files that later notebooks use to produce plots. Rather than doing everything in one cell, I split the work across several cells so each step stays organized. Markdown cells between them explain each process and what the newly formed data will be used for. No cell here is long enough to confuse a reader or bury them in details.
- `Record dependencies`
    - This one came up in the `README`, but I include it here because I do not consider the `README` a notebook per se. Many libraries were installed, imported, and used to create this project, and I installed them with `pip install` commands in my terminal. A user who clones this repository and wants to run the cells could read the imports and install each library manually, but a much easier way is to run `pip install -r requirements.txt`. This works because I wrote all the libraries and their versions to a `requirements.txt` file using `pip`. That keeps everything in a single location and lets somebody install everything in one go and get their environment ready.
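
As a rough sketch of the "record dependencies" rule from inside Python, the snippet below uses the standard library's `importlib.metadata` to build pinned `name==version` lines. The helper name `pinned_requirements` and the package list are illustrative assumptions; the `requirements.txt` in this project was written with `pip`, not with this helper.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages):
    """Return 'name==version' lines for the packages that are installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Skip anything that is missing from the current environment.
            continue
    return lines


# Hypothetical package list; adjust it to match the imports in the next cell.
for line in pinned_requirements(["pandas", "numpy", "scipy"]):
    print(line)
```

Printing the lines rather than writing a file keeps the sketch side-effect free; redirecting them into `requirements.txt` is a one-line change.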

In [1]:
import pandas as pd
import seaborn as sns
import hvplot.pandas
import panel as pn
from hvplot.plotting import scatter_matrix
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import matplotlib.patches as mpatches

pn.extension()

### Reading/Updating Strava File

The first thing I did here was read the CSV file into a dataframe.

In [4]:
df = pd.read_csv('./csv/strava.csv')

Let's take a look at the types we have for the columns.

In [5]:
df.dtypes

Air Power               float64
Cadence                 float64
Form Power              float64
Ground Time             float64
Leg Spring Stiffness    float64
Power                   float64
Vertical Oscillation    float64
altitude                float64
cadence                 float64
datafile                 object
distance                float64
enhanced_altitude       float64
enhanced_speed          float64
fractional_cadence      float64
heart_rate              float64
position_lat            float64
position_long           float64
speed                   float64
timestamp                object
unknown_87              float64
unknown_88              float64
unknown_90              float64
dtype: object

I am mostly OK with keeping these data types. The one I want to convert is the `timestamp` column, which we can do with pandas' `to_datetime`.

In [7]:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m-%d-%Y')

df.dtypes

Air Power                      float64
Cadence                        float64
Form Power                     float64
Ground Time                    float64
Leg Spring Stiffness           float64
Power                          float64
Vertical Oscillation           float64
altitude                       float64
cadence                        float64
datafile                        object
distance                       float64
enhanced_altitude              float64
enhanced_speed                 float64
fractional_cadence             float64
heart_rate                     float64
position_lat                   float64
position_long                  float64
speed                          float64
timestamp               datetime64[ns]
unknown_87                     float64
unknown_88                     float64
unknown_90                     float64
dtype: object

I have now changed the column to a `datetime64` type, which will make things easier if we want to compare or manipulate the column in any way. Next I use this column to add a few more columns that split the timestamp into specific parts: the date, hour, minute, and second.

In [11]:
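# Note: %-m and %-d (no zero padding) are platform-specific strftime codes;
# they work on Linux/macOS but are not supported on Windows.
df['date'] = df['timestamp'].dt.strftime('%-m-%-d-%Y')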
df['hour'] = df['timestamp'].dt.hour
df['minute'] = df['timestamp'].dt.minute
df['seconds'] = df['timestamp'].dt.second
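
To make the `.dt` accessor concrete, here is a minimal sketch on a toy dataframe. The two timestamps are made up for illustration and are not rows from `strava.csv`; the real file's timestamp format may differ.

```python
import pandas as pd

# Toy timestamps (assumed format, for illustration only).
toy = pd.DataFrame({"timestamp": ["2019-07-08 21:04:03", "2019-07-08 21:04:04"]})
toy["timestamp"] = pd.to_datetime(toy["timestamp"])

# The .dt accessor exposes datetime components element-wise.
toy["hour"] = toy["timestamp"].dt.hour
toy["minute"] = toy["timestamp"].dt.minute
toy["seconds"] = toy["timestamp"].dt.second

print(toy)
```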

I also want to add a column that shows whether a bike was used in the workout. We will be comparing running and biking, and I want an easy way to tell the rows apart. How do we know if a bike was used? I made a judgement call and assumed that any row with no data for `Ground Time` was probably from a biking workout. According to Stryd's website, Ground Contact Time is the "amount of time per stride that a runner's foot is touching the ground". If a row had this column populated, I assumed it was a run. While biking, your foot generally does not touch the ground in any significant sense.

In [14]:
df['Used_Bike?'] = df['Ground Time'].isna()
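
A small sketch of how this flag behaves, using made-up `Ground Time` values. Note that a value of `0.0` is not missing, so `isna()` still treats that row as a run.

```python
import numpy as np
import pandas as pd

# Made-up values: two populated rows (one of them zero) and two missing rows.
toy = pd.DataFrame({"Ground Time": [250.0, np.nan, 0.0, np.nan]})

# True where Ground Time is missing, which this notebook reads as "bike".
toy["Used_Bike?"] = toy["Ground Time"].isna()

print(toy)
```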

Here is where we create the different csv files used to make our plots. This helps modularize the code across notebooks by separating out specific processes so we do not have to repeat them later. We clean, filter, and create the csv files in this notebook so that the other notebooks can focus on implementing the plots without being cluttered by these extra steps.

### Scatter Plot Data

Starting out with the data we need for our scatter plot, we are going to filter our dataframe from earlier to only include running data. I also need a second dataframe holding the x and y values of the line of best fit, which we will draw over the plot of all running data to compare heart rate and power.

In [17]:
# Keep only running rows that have a positive ground contact time
df_run = df[~df['Used_Bike?'] & (df['Ground Time'] > 0)]
df_run = df_run.dropna(subset=['Power'])

## Line of best fit
x = df_run['Power']
y = df_run['heart_rate']
m, b = np.polyfit(x,y,1)

x_fitted = np.linspace(x.min(), x.max(), 100)
y_fitted = (m * x_fitted) + b
df_fitted = pd.DataFrame({
    'x': x_fitted,
    'y': y_fitted
})
##


## Save to csv files
df_run.to_csv('./csv/df_run.csv')
df_fitted.to_csv('./csv/df_fitted.csv')

For the running data, I use the `Used_Bike?` column I created to filter rows out. I also want to reduce the number of outliers, so I only keep rows that have a `Ground Time` above 0. There were instances where the ground time was 0, which could be due to idle time or a workout that had not quite started yet.

To create the line of best fit, I used numpy's `polyfit` function, which takes data for your x (independent variable) and y (dependent variable). In this case, we want to see how heart rate changes with increasing power. From the `polyfit` function we get `m` (the slope) and `b` (the intercept), which we can use to calculate the `y_fitted` values. For the `x_fitted` values, I take points between the minimum and maximum of my `Power` values using numpy's `linspace` function, which creates an array of evenly spaced points in that range.
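
As a quick check of how `polyfit` and `linspace` fit together, here is a sketch on synthetic data that lies exactly on a known line. The numbers are invented so that the recovered slope and intercept are easy to verify; they are not values from the Strava data.

```python
import numpy as np

# Synthetic "power" values and a "heart rate" that is exactly 0.5 * power + 60.
x = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
y = 0.5 * x + 60.0

# Degree-1 fit returns the slope first, then the intercept.
# np.polyfit does not skip NaN values, so drop them before fitting.
m, b = np.polyfit(x, y, 1)

# 100 evenly spaced x values spanning the observed range, and the fitted y.
x_fitted = np.linspace(x.min(), x.max(), 100)
y_fitted = m * x_fitted + b

print(round(m, 3), round(b, 3))
```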

### Heatmap Data

For the heatmap, I again look only at the running data and will plot the correlations between several running metrics to see how they compare. I kept only the columns I needed in a separate dataframe and saved it to a csv file.

In [18]:
df_run_heatmap = df_run[['Air Power', 'Cadence', 'Form Power', 'Ground Time',
       'Leg Spring Stiffness', 'Power', 'Vertical Oscillation']]

## Save to csv files
df_run_heatmap.to_csv('./csv/df_run_heatmap.csv')
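
The correlations themselves are computed in a later notebook that is not shown here. As a hedged sketch, assuming Pearson correlation via pandas' `DataFrame.corr()`, the matrix behind such a heatmap could be built like this on toy values:

```python
import pandas as pd

# Toy metrics (invented): Cadence rises and Ground Time falls linearly with Power.
toy = pd.DataFrame({
    "Power": [200.0, 220.0, 240.0, 260.0],
    "Cadence": [80.0, 82.0, 84.0, 86.0],
    "Ground Time": [300.0, 290.0, 280.0, 270.0],
})

# Pairwise Pearson correlation (the default method) between all columns.
corr = toy.corr()

print(corr)
```

Because the toy columns are exact linear functions of each other, every off-diagonal entry is either 1 or -1; real metrics would fall somewhere in between.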

### Histogram & Lineplot

For the last section of figures, I want to look at heart rate while biking vs. running. A histogram shows how often heart rates fall into each bin, so we can see where heart rates tend to sit for biking and running. The line plot then looks at how much distance is covered at certain heart rates. Again, to keep outliers from overly affecting the data, I chose to only look at elevated levels of activity while running or biking. I filter on `cadence` because both the biking and running data track it, as steps or rotations per minute, so it gives us a common threshold and lets us exclude lower values that might skew the data.

In [19]:
df_filtered = df[df['cadence']>50]

I also have an advanced plot in the same figure that looks at the autocorrelation of heart rate while biking and running over distance. For this, I use the `df_filtered` dataframe from above. I first sort the dataframe by distance and store the result in a new dataframe, then manually calculate the autocorrelation for both biking and running, as seen below. Again, I drop any rows that have NaN values for `heart_rate`.

In [21]:
sorted_df = df_filtered.sort_values(by='distance')

sorted_df = sorted_df.dropna(subset=['heart_rate'])
sorted_bike_df = sorted_df[sorted_df['Used_Bike?']]
sorted_run_df = sorted_df[~sorted_df['Used_Bike?']]


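# Centre each series on its mean, then correlate it with itself.
# mode='full' returns 2N-1 lags, with lag 0 at index N-1.
x_bike = sorted_bike_df['heart_rate'] - sorted_bike_df['heart_rate'].mean()
autocorr_bike = np.correlate(x_bike, x_bike, mode='full')
autocorr_bike = autocorr_bike[x_bike.size - 1:]  # keep lags 0..N-1
autocorr_bike /= autocorr_bike.max()  # lag 0 is the maximum, so it becomes 1

x_run = sorted_run_df['heart_rate'] - sorted_run_df['heart_rate'].mean()
autocorr_run = np.correlate(x_run, x_run, mode='full')
autocorr_run = autocorr_run[x_run.size - 1:]  # keep lags 0..N-1
autocorr_run /= autocorr_run.max()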
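
For reference, here is a self-contained sketch of the mean-centre, correlate, slice, and normalise recipe on a toy signal. In `mode='full'` the output has 2N-1 entries with lag 0 at index N-1, so the hypothetical helper `autocorrelation` below starts its slice there and the first value is exactly 1.

```python
import numpy as np


def autocorrelation(values):
    """Normalised autocorrelation for lags 0..N-1 of a 1-D signal."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()                        # remove the mean first
    full = np.correlate(x, x, mode="full")  # lags -(N-1) .. (N-1)
    acf = full[x.size - 1:]                 # lag 0 sits at index N-1
    return acf / acf[0]                     # scale so lag 0 equals 1


# Toy signal (not heart-rate data): a short, steadily increasing ramp.
acf = autocorrelation([1, 2, 3, 4, 5])
print(acf)
```

A constant signal has zero variance, so `acf[0]` would be 0 and the division would fail; real data would need a guard for that case.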

I will store each of these in a dataframe and then write them to csv files for later use.

In [22]:
df_autocorr_bike = pd.DataFrame(autocorr_bike, columns=['autocorr_bike'])
df_autocorr_run = pd.DataFrame(autocorr_run, columns=['autocorr_run'])


df_filtered.to_csv('./csv/df_filtered.csv')
df_autocorr_bike.to_csv('./csv/df_autocorr_bike.csv')
df_autocorr_run.to_csv('./csv/df_autocorr_run.csv')
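
Because `to_csv` is called without `index=False`, each file also stores the dataframe index as an unnamed first column. How the later notebooks read these files is not shown here; one possible approach, sketched on an in-memory buffer with toy values, is to pass `index_col=0` to `read_csv`:

```python
import io

import pandas as pd

# Toy dataframe standing in for one of the saved files.
toy = pd.DataFrame({"autocorr_run": [1.0, 0.4, -0.1]})

# Write to an in-memory buffer instead of disk; the index goes in column 0.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index instead of creating an 'Unnamed: 0' column.
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```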