# Exploring Professor Chris Brooks' Strava data in Summer 2019

In this notebook, I attempt to make sense of the Strava data collected from Professor Chris Brooks' exercise routine in the summer of 2019. The assignment is split into the following sections: retrieving and cleaning the dataset, using visual analysis techniques to make sense of the data, and finally drawing conclusions about Professor Brooks' activities from the visualizations.

## Importing the data

In [4]:
# import the required dependencies
import pandas as pd
import numpy as np

In [7]:
df = pd.read_csv('strava.csv')
df.head()

Unnamed: 0,Air Power,Cadence,Form Power,Ground Time,Leg Spring Stiffness,Power,Vertical Oscillation,altitude,cadence,datafile,...,enhanced_speed,fractional_cadence,heart_rate,position_lat,position_long,speed,timestamp,unknown_87,unknown_88,unknown_90
0,,,,,,,,,0.0,activities/2675855419.fit.gz,...,0.0,0.0,68.0,,,0.0,2019-07-08 21:04:03,0.0,300.0,
1,,,,,,,,,0.0,activities/2675855419.fit.gz,...,0.0,0.0,68.0,,,0.0,2019-07-08 21:04:04,0.0,300.0,
2,,,,,,,,,54.0,activities/2675855419.fit.gz,...,1.316,0.0,71.0,,,1316.0,2019-07-08 21:04:07,0.0,300.0,
3,,,,,,,,3747.0,77.0,activities/2675855419.fit.gz,...,1.866,0.0,77.0,504432050.0,-999063637.0,1866.0,2019-07-08 21:04:14,0.0,100.0,
4,,,,,,,,3798.0,77.0,activities/2675855419.fit.gz,...,1.894,0.0,80.0,504432492.0,-999064534.0,1894.0,2019-07-08 21:04:15,0.0,100.0,


In [13]:
# let's see how much data we are working with
df.shape

(40649, 22)

## Interpreting the rows

When making sense of this dataset, the first thing we can do is look at the timestamp column. Does each row summarize an entire exercise session in a day, or is it a reading taken at some interval, perhaps every second? The timestamp column contains values in the format 'yyyy-mm-dd hh:mm:ss', and the first few rows are 1 to 7 seconds apart, so each row represents a reading captured every few seconds during his exercise.

With that knowledge, we can begin our analysis by grouping the data into sessions.

In [21]:
# convert the timestamp column into datetime
df['datetime'] = pd.to_datetime(df['timestamp'])
df['datetime']

0       2019-07-08 21:04:03
1       2019-07-08 21:04:04
2       2019-07-08 21:04:07
3       2019-07-08 21:04:14
4       2019-07-08 21:04:15
                ...        
40644   2019-10-03 23:04:54
40645   2019-10-03 23:04:56
40646   2019-10-03 23:04:57
40647   2019-10-03 23:05:02
40648   2019-10-03 23:05:05
Name: datetime, Length: 40649, dtype: datetime64[ns]

From the output above, we see that the timestamps range from 8 July 2019 to 3 October 2019, a span of roughly three months that touches four calendar months. Next, we will try to group the readings into sessions. A session is defined as a continuous exercise, allowing rest periods of up to 30 minutes. If the gap between consecutive readings exceeds that rest period, we treat the next reading as the start of a new exercise session.

In [48]:
# explore the timestamps further by looking at the differences between consecutive timestamps
tdiff = df['datetime'].diff()
tdiff.describe()

count                        40648
mean     0 days 00:03:05.102883290
std      0 days 02:04:36.469630194
min                0 days 00:00:01
25%                0 days 00:00:01
50%                0 days 00:00:01
75%                0 days 00:00:05
max               11 days 04:54:11
Name: datetime, dtype: object
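
The summary above shows that most consecutive readings are one second apart, while the largest gap exceeds 11 days. One possible way to apply the 30-minute rest rule is to flag every gap longer than the threshold and take a running count of those flags as a session label. The sketch below uses a small hypothetical series (`ts`) in place of `df['datetime']`, and the names `MAX_REST` and `session_id` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy timestamps standing in for df['datetime']
ts = pd.Series(pd.to_datetime([
    "2019-07-08 21:04:03",
    "2019-07-08 21:04:04",
    "2019-07-08 21:20:00",  # pause of about 16 minutes: still the same session
    "2019-07-08 22:00:01",  # pause of about 40 minutes: starts a new session
    "2019-07-10 07:00:00",  # gap of more than a day: starts a new session
]), name="datetime")

# Assumed threshold, taken from the 30-minute rest period defined earlier
MAX_REST = pd.Timedelta(minutes=30)

# Gap between each reading and the previous one (the first gap is NaT)
gaps = ts.diff()

# True wherever the gap exceeds the allowed rest; NaT compares as False
new_session = gaps > MAX_REST

# A running count of session starts gives each row a session label
session_id = new_session.cumsum()

print(pd.DataFrame({"datetime": ts, "gap": gaps, "session_id": session_id}))
```

On the real data, the same three steps could be applied to `df['datetime']` directly, provided the rows are sorted by time.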

In [55]:
# count occurrences of each timestamp within each calendar day
counts_by_day = df['datetime'].groupby(df['datetime'].dt.floor('D')).value_counts()
counts_by_day

datetime    datetime           
2019-07-08  2019-07-08 21:04:03    1
            2019-07-08 21:04:04    1
            2019-07-08 21:04:07    1
            2019-07-08 21:04:14    1
            2019-07-08 21:04:15    1
                                  ..
2019-10-03  2019-10-03 23:04:54    1
            2019-10-03 23:04:56    1
            2019-10-03 23:04:57    1
            2019-10-03 23:05:02    1
            2019-10-03 23:05:05    1
Name: datetime, Length: 40649, dtype: int64
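
Because every timestamp is unique, each count in the output above is 1. If the aim is to see how many readings were logged on each day, one alternative is to take the size of each daily group instead. This is a minimal sketch on hypothetical toy timestamps (`ts` stands in for `df['datetime']`, and `rows_per_day` is an illustrative name).

```python
import pandas as pd

# Hypothetical toy timestamps standing in for df['datetime']
ts = pd.Series(pd.to_datetime([
    "2019-07-08 21:04:03",
    "2019-07-08 21:04:04",
    "2019-07-08 21:04:07",
    "2019-07-10 07:00:00",
    "2019-07-10 07:00:01",
]), name="datetime")

# Floor each timestamp to midnight of its day, then count rows per day
rows_per_day = ts.groupby(ts.dt.floor("D")).size()

print(rows_per_day)
```

The result has one row per calendar day that contains at least one reading, which makes it easier to spot on which days exercise took place.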