# **POLAR PROJECT**

# Imports

In [1]:
import json
from functools import reduce
from os import listdir
from os.path import isfile, join
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import tabulate
from pylab import rcParams
from scipy.stats import shapiro
from statsmodels.graphics.gofplots import qqplot
from statsmodels.stats.diagnostic import het_goldfeldquandt
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [2]:
from IPython.display import display, HTML

# Setup

In [3]:
# Set figure size
rcParams['figure.figsize'] = (4, 4)

# Folder for images
Path('img').mkdir(parents=True, exist_ok=True)

# Nice float format
pd.options.display.float_format = "{:,.2f}".format

# Data description

Last year I purchased a Polar watch that tracks my vitals during workouts. I used the [Polar Flow](polar.flow.com) website to obtain a copy of my data. For privacy reasons I shall not be sharing the dataset.

In [4]:
path = './data/'

First, we create a list of files in the download.

In [5]:
files = [f for f in listdir(path) if isfile(join(path, f))]


We shall only consider files containing the string `'training-session'`.

In [6]:
files = [f for f in files if 'training-session' in f]


The number of files under consideration is:

In [7]:
len(files)


We loop over the files and append the contents of each one to a list.

In [8]:
data = []

for name in files:
    # Distinct names so the file handle does not shadow the loop variable
    with open(join(path, name)) as fh:
        d = json.load(fh)
        data.append(d)


We define a function to extract statistics about heart rate measured during the workouts.

In [None]:
quantiles = [0.01, 0.25, 0.5, 0.75, 0.99]

In [None]:
def extract_hr_info(workout, quantiles):
    """Summarise the heart rate samples of the first exercise in a workout."""

    stats = {'heartRateAvg2': np.nan,
             'heartRateStd': np.nan}

    for q in quantiles:
        stats['heartRateQ' + str(round(q * 100))] = np.nan

    # Check if data exists
    try:
        heart_rates = workout['exercises'][0]['samples']['heartRate']
    except (KeyError, IndexError):
        return stats

    # Keep only samples that actually carry a measurement
    hr_data = [hr['value'] for hr in heart_rates if 'value' in hr]

    # Nothing was measured, so keep the NaN placeholders
    if not hr_data:
        return stats

    stats['heartRateAvg2'] = np.mean(hr_data)
    stats['heartRateStd'] = np.std(hr_data)

    for q in quantiles:
        stats['heartRateQ' + str(round(q * 100))] = np.quantile(hr_data, q)

    return stats
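To make the expected input concrete, here is a minimal sketch on a hand-made record. The nesting (`exercises`, `samples`, `heartRate`, `value`) follows the keys accessed above; the `dateTime` key and all the numbers are made up for illustration and are not taken from my data.

```python
import numpy as np

# Hypothetical workout record shaped like the structure accessed above
workout = {
    'exercises': [{
        'samples': {
            'heartRate': [
                {'dateTime': '2019-04-01T12:00:00', 'value': 100},
                {'dateTime': '2019-04-01T12:00:01'},  # no 'value': nothing measured
                {'dateTime': '2019-04-01T12:00:02', 'value': 110},
                {'dateTime': '2019-04-01T12:00:03', 'value': 120},
            ]
        }
    }]
}

samples = workout['exercises'][0]['samples']['heartRate']

# Samples without a 'value' key are skipped, as in the function above
values = [s['value'] for s in samples if 'value' in s]

summary = {
    'heartRateAvg2': float(np.mean(values)),
    'heartRateStd': float(np.std(values)),  # population standard deviation (ddof=0)
    'heartRateQ50': float(np.quantile(values, 0.5)),
}
print(summary)
```

Note that `np.std` defaults to the population standard deviation, so `heartRateStd` is slightly smaller than what `pandas.Series.std` would report for the same samples.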

We extract the relevant information from the items in the list.

In [None]:
workouts = []

for d in data:
    basic = d['exercises'][0]
    hr = extract_hr_info(workout=d,
                         quantiles=quantiles)

    workouts.append({**basic, **hr})

Finally we create a dataframe containing the workout information.

In [None]:
df = pd.DataFrame(workouts)

# Data structure

We find the following columns in the dataframe.

In [None]:
df.info()

We remove columns that contain data from features I do not use in my training.

Due to privacy concerns I shan't be extracting longitude and latitude data.

In [None]:
df = df.drop(['zones', 'samples', 'autoLaps',
              'laps', 'latitude', 'longitude',
              'ascent', 'descent'], axis=1)

In [None]:
df.head()

# Missing Values

The watch tracks different information for different workouts. For example, it tracks location on a walk but not on a treadmill, hence there is quite a lot of missing data.

In [None]:
missing = (df.isna().sum() / df.shape[0] * 100)
missing.name = 'Percent missing'
missing = missing.to_frame()
missing = missing.sort_values('Percent missing', ascending=False)
missing = missing[missing['Percent missing'] > 0]
missing = missing.reset_index()
missing = missing.rename(columns={'index': 'Feature'})
np.round(missing, 2)
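The same percentage can be obtained more compactly, since the mean of a boolean mask is the fraction of `True` values. A small sketch on a toy dataframe (the rows are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: two indoor workouts without a recorded distance
toy = pd.DataFrame({
    'sport': ['walking', 'treadmill_running', 'walking', 'cycling'],
    'distance': [4200.0, np.nan, 3900.0, np.nan],
    'kiloCalories': [310, 450, 280, 390],
})

# isna().mean() gives the fraction of missing values per column
pct_missing = toy.isna().mean() * 100

# Keep only columns that actually have gaps, largest first
pct_missing = pct_missing[pct_missing > 0].sort_values(ascending=False)
print(pct_missing)
```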

# Transforms

We apply certain transforms to make the data easier to work with. First we convert strings to datetimes.

In [None]:
df['startTime'] = pd.to_datetime(df['startTime'])
df['stopTime'] = pd.to_datetime(df['stopTime'])

We calculate the total duration of each individual workout in minutes.

In [None]:
df['totalTime'] = (df['stopTime'] - df['startTime'])
df['totalTime'] = df['totalTime'].apply(lambda x: round(x.total_seconds() / 60, 2))
df.drop('duration', axis=1, inplace=True)
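One detail worth keeping in mind when turning timedeltas into minutes: the `.seconds` attribute holds only the sub-day remainder, while `.total_seconds()` covers the whole interval. For workouts shorter than a day the two agree. The sketch below uses the standard library `timedelta`, which pandas' `Timedelta` extends; the timestamps are made up.

```python
from datetime import datetime

start = datetime(2019, 4, 1, 23, 30)
stop = datetime(2019, 4, 3, 0, 15)  # one day and 45 minutes later

elapsed = stop - start

# .seconds ignores the days component of the timedelta
minutes_from_seconds = elapsed.seconds / 60

# .total_seconds() includes the days component
minutes_from_total = elapsed.total_seconds() / 60

print(minutes_from_seconds, minutes_from_total)
```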

We extract maximum, average and minimum heart rate values from the `heartRate` column.

In [None]:
df['heartRateMax'] = df['heartRate'].apply(lambda x: x['max'] if isinstance(x, dict) else np.nan)
df['heartRateAvg'] = df['heartRate'].apply(lambda x: x['avg'] if isinstance(x, dict) else np.nan)
df['heartRateMin'] = df['heartRate'].apply(lambda x: x['min'] if isinstance(x, dict) else np.nan)
df.drop('heartRate', axis=1, inplace=True)

We assume that if there is no `distance` then the workout was indoors:

In [None]:
df['isInside'] = df['distance'].isna()
df = df.drop(['distance', 'speed'], axis=1)

We map each sport to an activity type in a new `isStrength` column: strength training becomes `True` and cardiovascular work becomes `False`.

In [None]:
def sport_to_activity_type(x):
    # True for strength training, False for everything else (treated as cardio)
    return 'strength' in x.lower()

In [None]:
df['isStrength'] = df['sport'].apply(sport_to_activity_type)

In [None]:
df['sport'] = df['sport'].apply(lambda x: x.lower())
df['sport'] = pd.Categorical(df['sport'])

We extract a list of unique `sport` values:

In [None]:
sports = sorted(list(df['sport'].unique()))

We reorder the columns alphabetically.

In [None]:
order = sorted(df.columns.to_list())

In [None]:
df = df[order]

We check if there are any more `NaN`'s in the data.

In [None]:
df.isna().sum()

There is one row with `NaN`'s. This might be due to my watch having too little battery left to take the measurements.

In [None]:
df = df.dropna()

We proceed to sort the data with the latest workouts at the top of the dataframe.

In [None]:
sort_cols = ['startTime']
df = df.sort_values(sort_cols, ascending=False)
df = df.reset_index(drop=True)

We verify that the datatypes are correct.

In [None]:
df.info()

In [None]:
df.head()

# Data analysis

Given that we have produced a clean dataset we can proceed to analyse a few aspects.

## Time span

The date of the first workout is:

In [None]:
str(df['startTime'].min())

The date of the last workout is:

In [None]:
str(df['startTime'].max())

Workouts measured:

In [None]:
len(df)

## Descriptive statistics

In [None]:
df.drop('timezoneOffset', axis=1).describe()

## Kilocalories burned in total

First we count the total `kiloCalories` I burned during the period in question.

In [None]:
total_calories = df['kiloCalories'].sum()
print(total_calories)

We convert this number to kilograms of body fat.
According to [this article](https://www.livestrong.com/article/304137-how-many-calories-per-kilogram-of-weight/), one kilogram of body fat equates to roughly 7,700 kilocalories.

In [None]:
def kcal_to_kg(x):
    return round(x / 7700, 2)

In [None]:
kcal_to_kg(total_calories)
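As a worked example of the conversion, the sketch below applies the same 7,700 kcal per kilogram constant to the per-sport totals listed in the summary table at the end of this notebook. The function name `kcal_to_kg_sketch` is only used here for illustration.

```python
KCAL_PER_KG_FAT = 7700  # same constant as in kcal_to_kg above


def kcal_to_kg_sketch(kcal):
    """Convert kilocalories to the equivalent kilograms of body fat."""
    return round(kcal / KCAL_PER_KG_FAT, 2)


# Totals copied from the summary table at the end of the notebook
totals = {
    'walking': 33080,
    'strength_training': 31547,
    'treadmill_running': 19825,
    'cycling': 4029,
    'running': 940,
}

kilograms = {sport: kcal_to_kg_sketch(kcal) for sport, kcal in totals.items()}
overall = kcal_to_kg_sketch(sum(totals.values()))

print(kilograms)
print(overall)
```

The overall figure is computed from the summed kilocalories rather than from the rounded per-sport values, which avoids accumulating rounding error.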

## Kilocalories burned by sport

In [None]:
by_sport = df[['kiloCalories', 'sport']].groupby('sport', as_index=False)
by_sport = by_sport.sum()
by_sport['sport'] = by_sport['sport'].apply(lambda x: x.lower())
by_sport['kiloCalories'] = by_sport['kiloCalories'].astype(int)
by_sport = by_sport.rename(columns={'kiloCalories': 'Total kilocalories', 'sport': 'Sport'})
by_sport = by_sport.sort_values('Total kilocalories', ascending=False)
by_sport['Total kilograms'] = by_sport['Total kilocalories'].apply(kcal_to_kg)

# by_sport = by_sport.style.background_gradient(cmap='YlGn', subset='Total kilograms')
# by_sport = by_sport.set_precision(2)

by_sport

## Kilocalories burned over time

Next we plot the `kiloCalories` burned over a two-month period in 2019 (April and May). First we extract the relevant data.

In [None]:
start = pd.to_datetime('2019-04-1')
stop = pd.to_datetime('2019-06-1')

daily = df[['startTime', 'kiloCalories']]
mask = (daily['startTime'] >= start) & (daily['startTime'] < stop)
daily = daily[mask].copy()  # copy so the next assignment does not write to a view of df
daily['startTime'] = daily['startTime'].dt.date
daily = daily.groupby('startTime', as_index=False)
daily = daily.sum()
daily = daily.sort_values('startTime', ascending=False)
daily['startTime'] = pd.to_datetime(daily['startTime'])
daily = daily.reset_index(drop=True)

We create a dataframe with all the dates to perform a left join and fill the `NaN`'s with zeroes.

In [None]:
dates = pd.date_range(start, stop - pd.Timedelta(days=1))  # stop is exclusive, as in the mask
dates = dates.to_frame()
dates = dates.reset_index(drop=True)
dates.columns = ['startTime']

In [None]:
daily = pd.merge(dates, daily, on='startTime', how='left')
daily = daily.fillna(0)
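The effect of the left join is easiest to see on a tiny example. In this sketch (toy dates and values, not my data) two of four calendar days have a workout; the other two come out of the merge as `NaN` and are then set to zero.

```python
import pandas as pd

# Days on which a workout was recorded
burned = pd.DataFrame({
    'startTime': pd.to_datetime(['2019-04-01', '2019-04-03']),
    'kiloCalories': [300.0, 450.0],
})

# Every calendar day in the window, whether or not I trained
calendar = pd.DataFrame({'startTime': pd.date_range('2019-04-01', '2019-04-04')})

# Left join keeps all calendar days; rest days get NaN, then 0
filled = calendar.merge(burned, on='startTime', how='left').fillna(0)
print(filled)
```

Without this step the line plot would connect consecutive workout days directly and hide the rest days in between.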

Finally we produce the figure:

In [None]:
width = 800
height = 400
dpi = 100

plt.figure(figsize=(width/dpi, height/dpi))
plt.plot(daily['startTime'], daily['kiloCalories'])

plt.fill_between(x=daily['startTime'],
                 y1=0,
                 y2=daily['kiloCalories'],
                 alpha=1/2)

daily_avg = daily['kiloCalories'].mean()

plt.hlines(xmin=daily['startTime'].min(),
           xmax=daily['startTime'].max(),
           y=daily_avg,
           linestyle='dashed',
           label=f'Daily average = {round(daily_avg)} kcal',
           alpha=1/2)

plt.title('Kilocalories burned over time', fontsize=18)
plt.xticks(rotation=45, horizontalalignment='center')
plt.xlim(daily['startTime'].min(), daily['startTime'].max())
plt.ylim(0, daily['kiloCalories'].max() * 1.05)
plt.ylabel('Kilocalories')
plt.legend(loc='best')
plt.tight_layout()
plt.savefig('./img/kilocalories_ts.png')
plt.show()

## Kilocalories by intensity

In [None]:
plt.scatter(df['heartRateQ1'], df['heartRateQ99'], c=df['kiloCalories'])
plt.xlabel('0.01 quantile of heart rate (bpm)')
plt.ylabel('0.99 quantile of heart rate (bpm)')

cbar = plt.colorbar()
cbar.set_label('Kilocalories', rotation=270)
plt.savefig('./img/intensity_scatter.png')
plt.show()

## Workouts by sport

We count how many workouts I completed per sport.

In [None]:
stats = df[['sport', 'startTime']]
stats = stats.groupby(['sport'], as_index=False)
stats = stats.count()
stats = stats.rename(columns={'sport': 'Sport',
                              'startTime': 'Count'})
stats = stats.sort_values('Count', ascending=False)

# stats = stats.style.background_gradient(cmap='YlGn', subset='Count')
# stats = stats.set_precision(2)

stats

## By hour of day

We count workouts by hour of day.

In [None]:
by_hour = df[['startTime', 'sport']].copy()
by_hour['startHour'] = by_hour['startTime'].dt.hour
by_hour = by_hour.drop('startTime', axis=1)
by_hour = by_hour.groupby('startHour', as_index=False)
by_hour = by_hour.count()

all_hours = pd.DataFrame(range(0, 24), columns=['startHour'])

by_hour = pd.merge(all_hours, by_hour, how='left')
by_hour = by_hour.fillna(0)
by_hour = by_hour.sort_values('startHour')
by_hour = by_hour.rename(columns={'startHour': 'Hour of day',
                                 'sport': 'Total workouts'})

In [None]:
plt.bar(by_hour['Hour of day'], by_hour['Total workouts'])
plt.ylabel('Number of workouts')
plt.xlabel('Hour of day')
plt.tight_layout()
plt.savefig('./img/workouts_by_hour_of_day.png')
plt.show()

## By day of week

We count workouts by day of week.

In [None]:
by_day = df[['startTime', 'sport']].copy()
by_day['Day of week'] = pd.to_datetime(by_day['startTime']).dt.day_name()
by_day['Day number'] = pd.to_datetime(by_day['startTime']).dt.dayofweek
by_day = by_day.groupby(['Day of week', 'Day number'], as_index=False)
by_day = by_day.count()
by_day = by_day.drop('startTime', axis=1)
by_day = by_day.sort_values('Day number')
by_day = by_day.rename(columns={'sport': 'Total Workouts'})

In [None]:
plt.bar(by_day['Day of week'], by_day['Total Workouts'])
plt.xticks(rotation=90)
plt.ylabel('Number of workouts')
plt.savefig('./img/workouts_by_day_of_week.png')
plt.show()

## Scatter plot of walks data

We plot `totalTime` versus `kiloCalories`. As can be seen, there seems to be a linear relationship between the two.

In [None]:
walking = df[df['sport'] == 'walking']
plt.scatter(walking['totalTime'], walking['kiloCalories'], s=2)
plt.xlabel('Duration (minutes)')
plt.ylabel('Kilocalories')
plt.savefig('./img/walks_kilocalories_vs_time.png')
plt.show()

We plot `heartRateAvg` against `kiloCalories`. Again we see a linear relationship, although there are a couple of outliers.

In [None]:
walking = df[df['sport'] == 'walking']
plt.scatter(walking['heartRateAvg'], walking['kiloCalories'], s=2)
plt.ylabel('Kilocalories')
plt.xlabel('Average HR (bpm)')
plt.savefig('./img/walks_kilocalories_vs_avg_hr.png')
plt.show()

# Regression

## Data preparation

Now we proceed to build a regression model to predict `kiloCalories` burned during a workout. First we create a subset of the original data.

In [None]:
reg_df = df[['kiloCalories', 'totalTime',
             'heartRateQ99', 'isStrength', 'sport']].copy()

In [None]:
reg_df.head()

We remove the rows where `sport` is `running` because there were only two workouts recorded during the period in question.

In [None]:
reg_df = reg_df[reg_df['sport'] != 'running']

### Outliers

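The data is cleansed of outliers using the interquartile range (IQR): values that lie `k = 1.5` IQRs or more below the first quartile or above the third quartile are dropped.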

In [None]:
def is_outlier_iqr(series, k=1.5):
    """
    Check if value is an outlier
    using interquartile range.
    """

    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (series <= q1 - k * iqr) | (q3 + k * iqr <= series)

    return is_outlier

In [None]:
time_mask = is_outlier_iqr(series=reg_df['totalTime'])
kcal_mask = is_outlier_iqr(series=reg_df['kiloCalories'])
hr_mask = is_outlier_iqr(series=reg_df['heartRateQ99'])

In [None]:
reg_df = reg_df[~(time_mask | kcal_mask | hr_mask)]
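To illustrate the rule, here is a small sketch with invented durations in which a single 200-minute session stands out. It computes the fences explicitly instead of calling `is_outlier_iqr`, so the thresholds are visible.

```python
import pandas as pd

# Invented workout durations in minutes, with one obvious outlier
durations = pd.Series([30, 32, 35, 38, 40, 41, 45, 200])

q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1

# Tukey fences with k = 1.5
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

flagged = durations[(durations < lower) | (durations > upper)]
print(lower, upper, flagged.tolist())
```

Because the masks in the notebook are combined with `|`, a workout is removed as soon as any one of the three variables is flagged.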

## Histograms

We proceed to visualize histograms of each of the variables.

In [None]:
plt.hist(reg_df['kiloCalories'], bins=30)
plt.xlabel('Kilocalories')
plt.ylabel('Frequency')
plt.savefig('./img/kilocalories_histogram.png')
plt.show()

In [None]:
plt.hist(reg_df['totalTime'], bins=30)
plt.xlabel('Duration (minutes)')
plt.ylabel('Frequency')
plt.savefig('./img/duration_histogram.png')
plt.show()

In [None]:
plt.hist(reg_df['heartRateQ99'], bins=30)
plt.xlabel('0.99 quantile of heart rate (bpm)')
plt.ylabel('Frequency')
plt.savefig('./img/q99_hr_histogram.png')
plt.show()

## Scatter plots

The plot below gives reason to suspect a linear relationship between `kiloCalories` and `totalTime`.

In [None]:
for val in [False, True]:
    tmp = reg_df[reg_df['isStrength'] == val]
    plt.scatter(tmp['totalTime'],
                tmp['kiloCalories'],
                s=3,
                label=val)

plt.xlabel('Time (minutes)')
plt.ylabel('Kilocalories')
plt.legend(title='isStrength', loc='best')
plt.tight_layout()
plt.savefig('./img/time_vs_kilocalories_scatter_by_strength.png')
plt.show()

In [None]:
for val in [False, True]:
    tmp = reg_df[reg_df['isStrength'] == val]
    plt.scatter(tmp['heartRateQ99'],
                tmp['kiloCalories'],
                s=3,
                label=val)

plt.xlabel('0.99 quantile of heart rate (bpm)')
plt.ylabel('Kilocalories')
plt.legend(title='isStrength', loc='best')
plt.savefig('./img/99q_hr_vs_kilocalories_scatter_by_strength.png')
plt.show()

In [None]:
plt.scatter(reg_df['isStrength'] + np.random.normal(scale=1/20, size=len(reg_df)),
            reg_df['kiloCalories'], s=3)

plt.ylabel('Kilocalories')
plt.xticks(ticks=[0, 1], labels=['Cardio', 'Strength'])
plt.savefig('./img/is_strength_vs_kilocalories_jitter.png')
plt.show()

In [None]:
plt.scatter(reg_df['isStrength'] + np.random.normal(scale=1/20, size=len(reg_df)),
            reg_df['heartRateQ99'], s=3)

plt.ylabel('0.99 quantile of heart rate (bpm)')
plt.xticks(ticks=[0, 1], labels=['Cardio', 'Strength'])
plt.savefig('./img/is_strength_vs_99q_hr_scatter.png')
plt.show()

In [None]:
plt.scatter(reg_df['isStrength'] + np.random.normal(scale=1/20, size=len(reg_df)),
            reg_df['totalTime'], s=3)

plt.ylabel('Time (minutes)')
plt.xticks(ticks=[0, 1], labels=['Cardio', 'Strength'])
plt.savefig('./img/is_strength_vs_time_jitter.png')
plt.show()

## Correlation

We convert the binary feature `isStrength` to integers for the rest of the analysis.

In [None]:
reg_df['isStrength'] = reg_df['isStrength'].astype(int)

We inspect the correlation matrix to check for multicollinearity. It should be noted that the correlation between `kiloCalories` and `totalTime` is quite high, which is to be expected.

In [None]:
C = reg_df.drop('sport', axis=1).corr(method='pearson')  # sport is categorical, not numeric
# C = C.style.background_gradient(cmap='YlGn')
# C = C.set_precision(2)
C

## Multicollinearity

We inspect the respective variance inflation factors and are happy to see that all are below 10.

In [None]:
tmp = reg_df.drop(['kiloCalories', 'sport'], axis=1)

vifs = []
for i in range(tmp.shape[1]):
    vif = variance_inflation_factor(tmp.to_numpy(), i)
    vifs.append(round(vif, 2))

vifs = pd.DataFrame(vifs, index=tmp.columns, columns=['VIF'])
vifs = vifs.sort_values('VIF', ascending=False)
vifs = vifs.reset_index()
vifs = vifs.rename(columns={'index': 'Variable'})

# vifs = vifs.style.background_gradient(cmap='OrRd')
# vifs = vifs.set_precision(2)

vifs
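For reference, the variance inflation factor of a feature is `1 / (1 - R^2)`, where `R^2` comes from regressing that feature on the remaining ones. The sketch below computes it by hand with NumPy on synthetic data and includes an intercept in each auxiliary regression. As far as I understand, `variance_inflation_factor` uses the design matrix exactly as supplied, so the numbers can differ when no constant column is passed; treat that as an assumption to check against the statsmodels documentation. The helper name `vif_with_intercept` is hypothetical.

```python
import numpy as np


def vif_with_intercept(X):
    """VIF per column, regressing each column on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = []
    for i in range(k):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r_squared = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        result.append(1 / (1 - r_squared))
    return result


# Synthetic features: x3 is almost a copy of x1, x2 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)

vif_values = vif_with_intercept(np.column_stack([x1, x2, x3]))
print([round(v, 2) for v in vif_values])
```

The nearly duplicated pair gets a large factor while the independent feature stays close to 1, which is the pattern the threshold of 10 is meant to catch.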

## Modelling

Before the actual modelling we prepare a function to calculate `RMSE` to compare models and extract the true `kiloCalories` into a separate array.

In [None]:
y_true = reg_df['kiloCalories'].to_numpy()

In [None]:
def calc_rmse(y_true, y_pred):
    x = np.sqrt(np.mean(np.power(y_true - y_pred, 2)))
    return round(x, 4)
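A quick worked example of the metric, written with the standard library only and using made-up numbers: the errors are -10, 20 and -20 kcal, so the mean squared error is 300 and the RMSE is its square root.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of paired observations."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [300, 450, 500]      # made-up kilocalories
predicted = [310, 430, 520]   # made-up predictions

example_rmse = rmse(actual, predicted)
swapped_rmse = rmse(predicted, actual)
print(round(example_rmse, 4))
```

The errors are squared, so the metric is symmetric: swapping the two arguments does not change the result.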

In [None]:
all_results = []

### Time only

We start the modelling section off by building the simplest model that comes to mind: predict `kiloCalories` using `totalTime`.

In [None]:
formula = 'kiloCalories ~ totalTime'
mdl_time = smf.ols(formula=formula, data=reg_df)
mdl_time = mdl_time.fit()
mdl_time.summary2()

In [None]:
y_pred = mdl_time.predict(reg_df)
rmse = calc_rmse(y_pred, y_true)
all_results.append((rmse, formula))

In [None]:
print(rmse)

### By sport

Next we fit a separate univariate regression for each sport. This helps us answer the question of which sport is the most effective at burning calories per minute of a workout.

In [None]:
all_sports = sorted(reg_df['sport'].unique())
reg_sports_res = []

# Fit a simple linear regression for each sport
for sport in all_sports:
    tmp = reg_df[reg_df['sport'] == sport]
    formula = 'kiloCalories ~ totalTime'
    mdl_sport = smf.ols(formula=formula, data=tmp)
    mdl_sport = mdl_sport.fit()
    sport_stats = [formula, sport] + list(mdl_sport.params) + [mdl_sport.rsquared]
    reg_sports_res.append(sport_stats)

cols = ['Formula', 'Sport', 'Intercept', 'Slope', 'R squared']

reg_sports_res = pd.DataFrame(reg_sports_res, columns=cols)
reg_sports_res = reg_sports_res.sort_values(['Slope'], ascending=False)
reg_sports_res = reg_sports_res.reset_index(drop=True)

readme_df = reg_sports_res.copy().round(2)

# reg_sports_res = reg_sports_res.style.background_gradient(cmap='YlGn', subset='Slope')
# reg_sports_res = reg_sports_res.set_precision(2)

reg_sports_res
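The slope of each fit can be read as kilocalories burned per additional minute. The sketch below shows this on exactly linear toy data using `np.polyfit` rather than `statsmodels`; the numbers are invented so that the slopes come out at 10 and 7 kcal per minute.

```python
import numpy as np

# Toy (minutes, kilocalories) pairs per sport, exactly linear by construction
toy = {
    'treadmill_running': ([10, 20, 30, 40], [80, 180, 280, 380]),  # 10 * t - 20
    'walking': ([30, 60, 90, 120], [220, 430, 640, 850]),          # 7 * t + 10
}

slopes = {}
for sport, (minutes, kcal) in toy.items():
    # A degree-1 polynomial fit returns [slope, intercept]
    slope, intercept = np.polyfit(minutes, kcal, deg=1)
    slopes[sport] = slope

# Rank sports by kilocalories burned per minute
ranking = sorted(slopes, key=slopes.get, reverse=True)
print(ranking)
```

A negative intercept, as in the first toy series, simply means the fitted line is extrapolated below the shortest observed workout; only the slope is used for the ranking.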

### Time and heart rate

We try to enhance the model by adding `heartRateQ99`.

In [None]:
formula = 'kiloCalories ~ totalTime + heartRateQ99'
mdl_time_and_hr = smf.ols(formula=formula, data=reg_df)
mdl_time_and_hr = mdl_time_and_hr.fit()
mdl_time_and_hr.summary2()

In [None]:
y_pred = mdl_time_and_hr.predict(reg_df)
rmse = calc_rmse(y_pred, y_true)
all_results.append((rmse, formula))

In [None]:
print(rmse)

### Time with random effects by workout type

In [None]:
formula = 'kiloCalories ~ totalTime + heartRateQ99'
re_formula = ' ~ totalTime'
group = 'isStrength'

mdl_time_with_hr_re = smf.mixedlm(formula=formula,
                  data=reg_df,
                  groups=reg_df[group],
                  re_formula=re_formula)

mdl_time_with_hr_re = mdl_time_with_hr_re.fit(method='lbfgs')
mdl_time_with_hr_re.summary()

In [None]:
y_pred = mdl_time_with_hr_re.predict(reg_df)
rmse = calc_rmse(y_pred, y_true)
all_results.append((rmse, formula, re_formula, group))

In [None]:
print(rmse)

## Model evaluation

We compare the linear models created earlier:

In [None]:
comp_df = pd.DataFrame(all_results, columns=['RMSE', 'Formula', 'Random effects', 'Groups'])
comp_df = comp_df.sort_values('RMSE')

# comp_df = comp_df.style.background_gradient(cmap='OrRd', subset='RMSE')
# comp_df = comp_df.set_precision(2)

comp_df

For further evaluation we choose the random effects model.

In [None]:
mdl = mdl_time_with_hr_re
residuals = mdl_time_with_hr_re.resid

### Visual inspection

We proceed to inspect the residuals of the model. First we view the histogram of the residuals, which looks approximately normal.

In [None]:
plt.hist(residuals)
plt.ylabel('Frequency')
plt.xlabel('Residuals')
plt.savefig('./img/mdl_residuals.png')
plt.show()

The next plot is a Q-Q plot for visually inspecting the normality of the residuals. We see 3 nasty outliers in the top right corner.

In [None]:
plt.figure()
ax = plt.gca()

qqplot(data=mdl.resid,
       ax=ax,
       color='#1f77b4',
       markersize=3,
       line='45',
       fit=True,
       alpha=1/2)

plt.savefig('./img/mdl_qq.png')
plt.show()

The third plot shows the absolute standardized residuals, which we use to check for homoskedasticity. Again we see the same outliers as in the plot above.

In [None]:
residuals_std = np.abs((residuals - np.mean(residuals)) / np.std(residuals))
plt.plot(residuals_std, 'o', markersize=2)
plt.xlabel('Observation')
plt.ylabel('Standardized residuals')
plt.savefig('./img/mdl_residuals_std.png')
plt.show()

Finally we compare the predicted `kiloCalories` with the actual values.

In [None]:
y_pred = mdl.predict(reg_df)
y_pred = y_pred.to_numpy().reshape(len(y_pred))

m = np.min(np.hstack([y_true, y_pred]))
M = np.max(np.hstack([y_true, y_pred]))

x = np.linspace(m, M, len(y_pred))
plt.plot(y_true, y_pred, 'o', markersize=2)
plt.plot(x,x, alpha=3/4)
plt.ylabel('Predicted')
plt.xlabel('Actual')
plt.tight_layout()
plt.savefig('./img/mdl_predicted_vs_actual.png')
plt.show()

The next step is to take a look at the data points with the biggest error. As can be seen the model has issues predicting strength training workouts.

In [None]:
errors = reg_df.copy()
errors['kiloCaloriesPredicted'] = mdl.predict(reg_df)

errors['error'] = np.abs(errors['kiloCalories'] - errors['kiloCaloriesPredicted'])

errors = errors.sort_values('error', ascending=False)
errors = errors.reset_index(drop=True)

order = ['kiloCaloriesPredicted',
         'kiloCalories',
         'error',
         'totalTime',
         'isStrength']

errors = errors[order]

errors = errors.head(5)

errors = errors.style.background_gradient(cmap='OrRd', subset='error')
errors = errors.format(precision=2)  # Styler.set_precision was removed in pandas 2.0

errors

# Summary

In [None]:
# Make table for README
# print(tabulate.tabulate(by_sport.values, by_sport.columns, tablefmt="pipe"))

In [None]:
# Make table for README
# print(tabulate.tabulate(readme_df.values, readme_df.columns, tablefmt="pipe"))

* In this project I define a `workout` as each instance in time when my watch was recording me.

* I downloaded data generated by my Polar watch that tracks `heart rate` and estimates burned `kilocalories` during workouts.

* The data came in the form of `.json` files which were read, transformed and cleaned with `pandas`.

* The clean dataset contains `283` workouts over a nearly one-year period, during which I burned kilocalories equivalent to roughly `12kg` of body fat.

| Sport             |   Total kilocalories |   Total kilograms |
|:------------------|---------------------:|------------------:|
| walking           |                33080 |              4.3  |
| strength_training |                31547 |              4.1  |
| treadmill_running |                19825 |              2.57 |
| cycling           |                 4029 |              0.52 |
| running           |                  940 |              0.12 |

* The timing of my workouts appears to follow a `bimodal distribution` with peaks at `12:00` and `20:00`.

<!-- ![image](https://github.com/besiobu/data-science-portfolio/blob/master/polar/img/workouts_by_hour_of_day.png) -->

* After further transforming the data, I find that the `duration` of a workout and the `kilocalories` burned have a `0.92` correlation.

<!-- ![image](https://github.com/besiobu/data-science-portfolio/blob/master/polar/img/time_vs_kilocalories_scatter_by_strength.png) -->

* Several linear regressions were performed.

* `kilocalories ~ duration` on the entire dataset achieved `R^2 = 0.85` and `RMSE = 79`.

* Regressions were also performed on subsets of the data, specifically by sport. The highest slope is `10.14 kiloCalories` per minute (treadmill running).

| Formula                  | Sport             |   Intercept |   Slope |   R squared |
|:-------------------------|:------------------|------------:|--------:|------------:|
| kiloCalories ~ totalTime | treadmill_running |      -21.23 |   10.14 |        0.96 |
| kiloCalories ~ totalTime | cycling           |       -9.73 |    7.44 |        0.98 |
| kiloCalories ~ totalTime | walking           |       12.59 |    6.95 |        0.82 |
| kiloCalories ~ totalTime | strength_training |      -12.73 |    6.76 |        0.44 |

* A `linear mixed model with random effects` was created and validated. It achieved an `RMSE = 61` and normal-looking residuals.

<!-- ![image](https://github.com/besiobu/data-science-portfolio/blob/master/polar/img/mdl_predicted_vs_actual.png) -->

* The biggest `errors` made by the `mixed model` were on `strength training` data points.