Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

The first step is forking the repository in which this notebook lives. After that, there are two parts to be completed in this notebook:

- **Project information**:  The title of the project, a project description, etc.

- **Project introduction**: The three first text and code cells that will form the introduction of your project.

When complete, please email the link to your forked repo to projects@datacamp.com with the email subject line _DataCamp project audition_. If you have any questions, please reach out to projects@datacamp.com.

# Project information

**Project title**: Interactive visualization of historical training data.

**Name:** Andrii Pavlenko.

**Email address associated with your DataCamp account:** apavlenko69@gmail.com.

**Project description**: These days, people all over the world run for fun or fitness, and so do I. Runners track and collect data with gadgets (smartphones, watches or sport trackers) to keep themselves motivated. People are competitive by nature and want answers to questions such as:
- How good was my run today?
- Have I reached my goal?
- Am I progressing?
- What was my best achievement?
- How do I perform on average compared to others?

If you are a data science enthusiast, as I am, the collected data can feed your curiosity and answer the questions above and many more. And, of course, the answers can be presented in an attractive visual format.

Before starting this project, it will be handy to have completed:
- pandas Foundations
- Interactive Data Visualization with Bokeh

The dataset used in this project is my personal historical training data exported from Runkeeper, a service for tracking fitness activities.

# Project introduction

***Note: nothing needs to be filled out in this cell. It is simply setting up the template cells below.***

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://instructor-support.datacamp.com/projects/datacamp-projects-jupyter-notebook). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. Obtaining and reviewing raw data

One day I had a chat with an old friend. Since we are both runners, we discussed, among other news, our training habits and achievements. While answering his questions, I suddenly realized that I had only a very rough idea of my own running statistics. Fortunately, I had my training data collected (thanks to the Runkeeper service). That conversation triggered the idea to analyze the stored information properly. The unanswered questions can be addressed with proper data analytics tools, and the data can tell the unique story of several years of effort and motivation!
![Forrest Gump](img/RunningForrest.jpg "Explore world, explore your data!")

Since 2012 I have used the free Runkeeper service and tracked most of my fitness activities. So, I requested a data export for the whole period from 2012 to 2018. After a while, a number of CSV files were produced and made available for download. From the complete archive I picked the file containing the list of required parameters, where each row is a single training activity from the past.

The first step is to explore the data, find potential problems and remove them to get the dataset ready for analysis and visualization.

In [14]:
# Import required tools
import numpy as np
import pandas as pd

# Define file containing dataset
raw_data = 'datasets/cardioActivities.csv'

# Create dataframe with parse_dates=True for convenient slicing of time periods 
raw_df = pd.read_csv(raw_data, parse_dates=True, index_col='Date')

# First look at exported data: columns, data types and missing values.
# info() prints its summary and returns None, so no display() wrapper is needed
raw_df.info()

# Delete unnecessary columns
raw_df = raw_df.drop(columns=["Friend's Tagged", 'Route Name', 'GPX File', 'Activity Id'])

# Pick eight random rows to observe data
raw_df.sample(n=8)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 509 entries, 2018-11-11 14:05:12 to 2012-08-22 18:53:54
Data columns (total 13 columns):
Activity Id                 509 non-null object
Type                        509 non-null object
Route Name                  1 non-null object
Distance (km)               509 non-null float64
Duration                    509 non-null object
Average Pace                508 non-null object
Average Speed (km/h)        508 non-null float64
Calories Burned             509 non-null float64
Climb (m)                   509 non-null int64
Average Heart Rate (bpm)    294 non-null float64
Friend's Tagged             0 non-null float64
Notes                       231 non-null object
GPX File                    504 non-null object
dtypes: float64(5), int64(1), object(7)
memory usage: 55.7+ KB

| Date | Type | Distance (km) | Duration | Average Pace | Average Speed (km/h) | Calories Burned | Climb (m) | Average Heart Rate (bpm) | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 2014-03-11 19:14:15 | Running | 6.35 | 36:54 | 5:49 | 10.33 | 455.0 | 34 | NaN | NaN |
| 2015-08-23 17:35:47 | Cycling | 35.09 | 1:32:58 | 2:39 | 22.64 | 786.0 | 336 | NaN | NaN |
| 2017-08-19 18:01:57 | Running | 23.62 | 2:17:14 | 5:49 | 10.33 | 1667.0 | 377 | 138.0 | TomTom MySports Watch |
| 2018-07-03 18:00:05 | Running | 18.75 | 1:41:30 | 5:25 | 11.08 | 1332.0 | 300 | 139.0 | TomTom MySports Watch |
| 2017-04-23 11:50:12 | Running | 19.27 | 1:50:24 | 5:44 | 10.47 | 1450.0 | 305 | 149.0 | TomTom MySports Watch |
| 2018-01-24 18:13:26 | Running | 11.59 | 1:10:30 | 6:05 | 9.86 | 847.0 | 196 | 143.0 | TomTom MySports Watch |
| 2012-11-04 18:59:06 | Walking | 1.22 | 12:05 | 9:54 | 6.07 | 67.0 | 10 | NaN | NaN |
| 2014-12-26 17:50:00 | Running | 9.78 | 49:45 | 5:05 | 11.79 | 655.0 | 51 | NaN | NaN |

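The non-null counts reported by `info()` can also be turned into explicit missing-value counts per column. Here is a minimal sketch of that idea; it does not read the real file but uses a hypothetical three-row frame called `toy`, with values copied from the sample rows above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the export: three rows taken from the sample output
toy = pd.DataFrame({
    "Duration": ["36:54", "1:32:58", "2:17:14"],
    "Average Heart Rate (bpm)": [np.nan, np.nan, 138.0],
    "Notes": [np.nan, np.nan, "TomTom MySports Watch"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Check that every Duration value is still a plain string, not a time type
duration_is_string = toy["Duration"].map(type).eq(str).all()

print(missing_per_column)
print("Duration stored as strings:", duration_is_string)
```

On the full dataset the same `isna().sum()` call would be applied to `raw_df`.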

## 2. Dealing with missing values

Great, we have some achievements already! We built a data structure with an index of type DatetimeIndex. This will help us manipulate the data and easily select arbitrary date periods using familiar Python slicing techniques. We also kept only the important data by deleting unnecessary columns.
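As a small illustration of that kind of slicing, here is a minimal sketch on a hypothetical four-row frame (`toy`, built from dates and distances that appear in this dataset) rather than on the full export. It assumes the index is sorted in ascending order first, because the export lists the newest activity first.

```python
import pandas as pd

# Hypothetical miniature frame; dates and distances are taken from the dataset
toy = pd.DataFrame(
    {"Distance (km)": [10.44, 11.59, 23.62, 6.35]},
    index=pd.to_datetime([
        "2018-11-11 14:05:12",
        "2018-01-24 18:13:26",
        "2017-08-19 18:01:57",
        "2014-03-11 19:14:15",
    ]),
).sort_index()  # ascending order keeps label-based date slices predictable

# All activities from one year, selected with a partial date string
runs_2018 = toy.loc["2018"]

# All activities between two months (both endpoints are included)
late_2017_to_early_2018 = toy.loc["2017-08":"2018-01"]

print(runs_2018)
print(late_2017_to_early_2018)
```

Using `.loc` with partial date strings keeps the selection readable without building explicit timestamp objects.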

However, there are other issues to fix: missing values and incorrect data types for 'Duration' and 'Average Pace'. Both need to be of datetime type, but for now they are objects (strings).

Let's fix the missing values first. Some heart rate data is missing simply because I didn't use a cardio sensor at the beginning. 'Notes' is an optional field for comments about an activity.

It is logical to fill the missing heart rate data with the mean value calculated from the 294 available non-null observations. NaN values in 'Notes' will be filled with the word 'Missing'.

In [15]:
# Fill empty Notes cells with the string 'Missing'
# (assigning the result back avoids inplace=True on a column selection)
raw_df['Notes'] = raw_df['Notes'].fillna('Missing')

# Fill missing heart rates with the mean of the non-null values
mean_hr = raw_df['Average Heart Rate (bpm)'].mean()
raw_df['Average Heart Rate (bpm)'] = raw_df['Average Heart Rate (bpm)'].fillna(mean_hr)

# Drop the one remaining row that still contains NaN
df = raw_df.dropna()

# Evaluate dataset again: all columns should have no NaN values
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 508 entries, 2018-11-11 14:05:12 to 2012-08-22 18:53:54
Data columns (total 9 columns):
Type                        508 non-null object
Distance (km)               508 non-null float64
Duration                    508 non-null object
Average Pace                508 non-null object
Average Speed (km/h)        508 non-null float64
Calories Burned             508 non-null float64
Climb (m)                   508 non-null int64
Average Heart Rate (bpm)    508 non-null float64
Notes                       508 non-null object
dtypes: float64(4), int64(1), object(4)
memory usage: 39.7+ KB


| Date | Type | Distance (km) | Duration | Average Pace | Average Speed (km/h) | Calories Burned | Climb (m) | Average Heart Rate (bpm) | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 2018-11-11 14:05:12 | Running | 10.44 | 58:40 | 5:37 | 10.68 | 774.0 | 130 | 159.0 | Missing |
| 2018-11-09 15:02:35 | Running | 12.84 | 1:14:12 | 5:47 | 10.39 | 954.0 | 168 | 159.0 | Missing |
| 2018-11-04 16:05:00 | Running | 13.01 | 1:15:16 | 5:47 | 10.37 | 967.0 | 171 | 155.0 | Missing |
| 2018-11-01 14:03:58 | Running | 12.98 | 1:14:25 | 5:44 | 10.47 | 960.0 | 169 | 158.0 | Missing |
| 2018-10-27 17:01:36 | Running | 13.02 | 1:12:50 | 5:36 | 10.73 | 967.0 | 170 | 154.0 | Missing |


## 3. Duration and pace as datetime

Good progress so far! 

The next task is to convert the string values in 'Duration' and 'Average Pace' to datetime format. To do so, all records need to follow the same pattern. Comparing the 'Duration' values in the first and second rows of the dataset reveals an inconsistency: '58:40' has no hours part, while '1:14:12' does. To address that, we will create and apply a function that adds leading zero fields so that every value follows the same %H:%M:%S format.
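To see why the padding matters, here is a minimal sketch with the standard library's `datetime.strptime`. The helper name `parses` is made up for this illustration; it only reports whether a string matches the %H:%M:%S pattern.

```python
from datetime import datetime

FMT = "%H:%M:%S"

def parses(t_str):
    """Hypothetical helper: True if t_str matches the %H:%M:%S pattern."""
    try:
        datetime.strptime(t_str, FMT)
        return True
    except ValueError:
        return False

short_ok = parses("58:40")      # only minutes and seconds
full_ok = parses("1:14:12")     # hours, minutes and seconds
padded_ok = parses("00:58:40")  # short value after adding an hours field

print(short_ok, full_ok, padded_ok)
```

The two-field value is rejected, while the same value with an added hours field is accepted.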

In [17]:
# Import module for dealing with datetime data
import datetime

def validate_time(t_str):
    """ 
    Adds leading zeroes when needed to return time in format %H:%M:%S
    """
    tstamp = t_str.split(':')
    while len(tstamp) < 3:
        tstamp.insert(0, '00')
    iso_time_string = ':'.join(tstamp)
    return iso_time_string

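# Wrap the function so it can be called on a whole column
# (np.vectorize loops internally; it is a convenience, not a speed-up)
vfunc = np.vectorize(validate_time)

# Apply the function to the target columns of a copy
df1 = df.copy()
df1['Duration'] = vfunc(df['Duration'])
df1['Average Pace'] = vfunc(df['Average Pace'])

# Now it is safe to convert the strings to datetime.
# Plain column assignment replaces the column, so the datetime dtype is kept
df1['Duration'] = pd.to_datetime(df1['Duration'], format='%H:%M:%S')
df1['Average Pace'] = pd.to_datetime(df1['Average Pace'], format='%H:%M:%S')

# Show the first two rows to check the result
df1.head(2)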

| Date | Type | Distance (km) | Duration | Average Pace | Average Speed (km/h) | Calories Burned | Climb (m) | Average Heart Rate (bpm) | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 2018-11-11 14:05:12 | Running | 10.44 | 1900-01-01 00:58:40 | 1900-01-01 00:05:37 | 10.68 | 774.0 | 130 | 159.0 | Missing |
| 2018-11-09 15:02:35 | Running | 12.84 | 1900-01-01 01:14:12 | 1900-01-01 00:05:47 | 10.39 | 954.0 | 168 | 159.0 | Missing |

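The `1900-01-01` part in the output is the default date that gets filled in when the format string contains no date fields. If a pure elapsed time is preferred, one possible alternative (not what this notebook does) is the pandas timedelta type. A rough sketch follows, where `pad_time` is a hypothetical helper that, as an extra assumption, also zero-pads every field to two digits.

```python
import pandas as pd

def pad_time(t_str):
    """Hypothetical helper: pad 'M:SS' or 'H:MM:SS' to 'HH:MM:SS'."""
    parts = t_str.split(":")
    while len(parts) < 3:
        parts.insert(0, "0")
    return ":".join(p.zfill(2) for p in parts)

# The two Duration values from the first rows of the dataset
durations = pd.Series(["58:40", "1:14:12"])

# Elapsed time instead of a clock time on a dummy date
as_timedelta = pd.to_timedelta(durations.map(pad_time))
seconds = as_timedelta.dt.total_seconds()

print(as_timedelta)
print(seconds)
```

With timedeltas, durations can be summed or averaged directly, which may be convenient for later statistics.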

*Stop here! Only the three first tasks. :)*