# Data Management

In [1]:
import pandas as pd
import numpy as np
import re

The original dataframe "df" contains the date, distance, and run time of each run from 2009 to the present. The goal of this data management process is to calculate a "pace" column from the distance and time of each run.

In [2]:
df = pd.read_excel("Running_Log.xlsx")

In [3]:
pd.options.display.max_rows=20
print(df)

            Date  Mileage     Time
0     2009-04-25     3.10    29:38
1     2009-11-21     3.10    22:51
2     2010-01-04     1.30    17:00
3     2010-01-05     2.60    28:00
4     2010-01-07     2.60    28:00
5     2010-01-11     1.30      NaN
6     2010-01-14     2.60    26:00
7     2010-01-16     4.05    35:48
8     2010-01-19     2.60    24:00
9     2010-01-25     2.66    24:15
...          ...      ...      ...
2562  2020-07-23     7.50    52:58
2563  2020-07-24     7.09    53:15
2564  2020-07-25    11.56  1:21:42
2565  2020-07-27     7.90    57:02
2566  2020-07-28     8.26    55:39
2567  2020-07-29    10.89      NaN
2568  2020-07-30     5.68    40:48
2569  2020-07-31     7.39    52:51
2570  2020-08-01    16.69  1:47:15
2571  2020-08-03     8.00  1:00:08

[2572 rows x 3 columns]


The main obstacle in creating a pace column is that it is not easy to do math with times in mm:ss format. So the "Time" column must be reformatted into minutes, with fractions of minutes put in decimal format (e.g. instead of "29:38", output = "29.633").
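To make the target format concrete, here is a minimal sketch of the conversion for a single mm:ss value. The helper name `mmss_to_minutes` is hypothetical and is not used elsewhere in this notebook; it only illustrates the arithmetic.

```python
def mmss_to_minutes(text):
    """Convert an 'mm:ss' string to decimal minutes (illustrative helper)."""
    minutes, seconds = text.split(":")
    return int(minutes) + int(seconds) / 60

# 29 minutes plus 38/60 of a minute
example = mmss_to_minutes("29:38")
print(round(example, 3))  # 29.633
```

This only handles mm:ss input; the h:mm:ss entries discussed next need an extra step.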

In [4]:
print(df.iloc[[0,1389]])

            Date  Mileage     Time
0     2009-04-25      3.1    29:38
1389  2016-11-27     10.0  1:10:51


However, as can be seen above, some of my runs lasted longer than an hour and were thus entered in h:mm:ss format. When I attempted to split the column on ":", the "1" (hour) in row 1389 ended up in the same column as the "29" (minutes) in row 0.

To work around this problem, I computed a new column "time_h" with a standardized time format of "h:mm:ss", as seen below. I added an assert statement to ensure that every non-missing value in "time_h" started with a number and then a colon, as I had intended.

In [5]:
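# add a new column that standardizes the format of the time variable as "h:mm:ss"
# (np.where replaces the deprecated pd.np alias; raw strings keep the regex escapes intact)
df["time_h"] = np.where(df["Time"].str.contains(r"\d:\d\d:\d\d", regex=True), df["Time"], "0:" + df["Time"])
print(df.iloc[[0,1389]])

# check that every non-missing value starts with a digit followed by a colon
testseries = df['time_h'].dropna()
for i in testseries:
    assert re.match(r'\d:', i)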

            Date  Mileage     Time   time_h
0     2009-04-25      3.1    29:38  0:29:38
1389  2016-11-27     10.0  1:10:51  1:10:51
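The standardization step can also be tried on a small toy series. This is a sketch under assumptions: the three values are copied from rows 0, 1389, and 5 of the log, the names `times` and `has_hours` are illustrative, and `Series.str.fullmatch` requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "Time" column: mm:ss, h:mm:ss, and a missing value.
times = pd.Series(["29:38", "1:10:51", np.nan])

# Entries already in h:mm:ss form are kept; everything else gets a "0:" prefix.
# Missing values are treated as "no match" and stay missing after the prefix step.
has_hours = times.str.fullmatch(r"\d:\d\d:\d\d", na=False)
time_h = times.where(has_hours, "0:" + times)

# Every non-missing value should now be exactly h:mm:ss.
assert time_h.dropna().str.fullmatch(r"\d:\d\d:\d\d").all()
print(time_h.tolist())
```

Checking the full pattern is stricter than checking only the leading digit and colon, so it would also catch an entry such as "1:5:3".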


Then I created a new dataframe where df['time_h'] was delimited by ":" into three columns.

Next, I merged the "new" dataframe with the "df" dataframe and called the resulting dataframe "result".

Finally, I created "df2" from "result", renaming the columns that came from "new" to "hours", "minutes", and "seconds", and renaming "time_h" to "time_corr".

These changes can be seen in the following three sections of code.

In [6]:
# created new dataframe with "time_h" col expanded into hr, min, and sec vars
new = df["time_h"].str.split(pat = ":", expand = True)
print(new)

        0    1    2
0       0   29   38
1       0   22   51
2       0   17   00
3       0   28   00
4       0   28   00
5     NaN  NaN  NaN
6       0   26   00
7       0   35   48
8       0   24   00
9       0   24   15
...   ...  ...  ...
2562    0   52   58
2563    0   53   15
2564    1   21   42
2565    0   57   02
2566    0   55   39
2567  NaN  NaN  NaN
2568    0   40   48
2569    0   52   51
2570    1   47   15
2571    1   00   08

[2572 rows x 3 columns]


In [7]:
result = pd.concat([df, new], axis=1, sort=False)
print(result)

            Date  Mileage     Time   time_h    0    1    2
0     2009-04-25     3.10    29:38  0:29:38    0   29   38
1     2009-11-21     3.10    22:51  0:22:51    0   22   51
2     2010-01-04     1.30    17:00  0:17:00    0   17   00
3     2010-01-05     2.60    28:00  0:28:00    0   28   00
4     2010-01-07     2.60    28:00  0:28:00    0   28   00
5     2010-01-11     1.30      NaN      NaN  NaN  NaN  NaN
6     2010-01-14     2.60    26:00  0:26:00    0   26   00
7     2010-01-16     4.05    35:48  0:35:48    0   35   48
8     2010-01-19     2.60    24:00  0:24:00    0   24   00
9     2010-01-25     2.66    24:15  0:24:15    0   24   15
...          ...      ...      ...      ...  ...  ...  ...
2562  2020-07-23     7.50    52:58  0:52:58    0   52   58
2563  2020-07-24     7.09    53:15  0:53:15    0   53   15
2564  2020-07-25    11.56  1:21:42  1:21:42    1   21   42
2565  2020-07-27     7.90    57:02  0:57:02    0   57   02
2566  2020-07-28     8.26    55:39  0:55:39    0   55   

In [8]:
df2 = result.rename(columns={"time_h": "time_corr", 0: "hours", 1: "minutes", 2: "seconds"})
print(df2)

            Date  Mileage     Time time_corr hours minutes seconds
0     2009-04-25     3.10    29:38   0:29:38     0      29      38
1     2009-11-21     3.10    22:51   0:22:51     0      22      51
2     2010-01-04     1.30    17:00   0:17:00     0      17      00
3     2010-01-05     2.60    28:00   0:28:00     0      28      00
4     2010-01-07     2.60    28:00   0:28:00     0      28      00
5     2010-01-11     1.30      NaN       NaN   NaN     NaN     NaN
6     2010-01-14     2.60    26:00   0:26:00     0      26      00
7     2010-01-16     4.05    35:48   0:35:48     0      35      48
8     2010-01-19     2.60    24:00   0:24:00     0      24      00
9     2010-01-25     2.66    24:15   0:24:15     0      24      15
...          ...      ...      ...       ...   ...     ...     ...
2562  2020-07-23     7.50    52:58   0:52:58     0      52      58
2563  2020-07-24     7.09    53:15   0:53:15     0      53      15
2564  2020-07-25    11.56  1:21:42   1:21:42     1      21    
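The split, merge, and rename steps above could also be condensed. The following is only a sketch on toy data, not the code used in this notebook; the names `toy`, `parts`, and `toy2` are illustrative.

```python
import pandas as pd

# Toy frame with a standardized time column, including one missing value.
toy = pd.DataFrame({"time_h": ["0:29:38", "1:10:51", None]})

# Split on ":" into three columns and name them right away.
parts = toy["time_h"].str.split(pat=":", expand=True)
parts.columns = ["hours", "minutes", "seconds"]

# join() aligns on the index, so each row keeps its own pieces.
toy2 = toy.join(parts)
print(toy2)
```

Naming the columns before joining avoids carrying the integer labels 0, 1, and 2 into the combined dataframe. The pieces are still strings at this point.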

To calculate pace, I needed to get my three time columns into numeric format.

Then I needed to combine them back into one time variable in seconds.

In [9]:
df2['hours'] = pd.to_numeric(df2['hours'])
df2['minutes'] = pd.to_numeric(df2['minutes'])
df2['seconds'] = pd.to_numeric(df2['seconds'])

In [10]:
time_seconds = df2.hours*3600 + df2.minutes*60 + df2.seconds
df2['time_s'] = time_seconds
print(df2)

            Date  Mileage     Time time_corr  hours  minutes  seconds  time_s
0     2009-04-25     3.10    29:38   0:29:38    0.0     29.0     38.0  1778.0
1     2009-11-21     3.10    22:51   0:22:51    0.0     22.0     51.0  1371.0
2     2010-01-04     1.30    17:00   0:17:00    0.0     17.0      0.0  1020.0
3     2010-01-05     2.60    28:00   0:28:00    0.0     28.0      0.0  1680.0
4     2010-01-07     2.60    28:00   0:28:00    0.0     28.0      0.0  1680.0
5     2010-01-11     1.30      NaN       NaN    NaN      NaN      NaN     NaN
6     2010-01-14     2.60    26:00   0:26:00    0.0     26.0      0.0  1560.0
7     2010-01-16     4.05    35:48   0:35:48    0.0     35.0     48.0  2148.0
8     2010-01-19     2.60    24:00   0:24:00    0.0     24.0      0.0  1440.0
9     2010-01-25     2.66    24:15   0:24:15    0.0     24.0     15.0  1455.0
...          ...      ...      ...       ...    ...      ...      ...     ...
2562  2020-07-23     7.50    52:58   0:52:58    0.0     52.0    
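As a cross-check on the seconds totals, pandas can parse h:mm:ss strings directly with `pd.to_timedelta`. This is an alternative to the split-and-multiply approach used above, sketched here on toy values rather than on the full log.

```python
import pandas as pd

# Toy values in the standardized h:mm:ss format, plus a missing entry.
time_h = pd.Series(["0:29:38", "1:10:51", None])

# Parse to timedeltas, then express each duration in seconds.
time_s = pd.to_timedelta(time_h).dt.total_seconds()

# 0:29:38 -> 29*60 + 38 = 1778 s; 1:10:51 -> 3600 + 10*60 + 51 = 4251 s
print(time_s.tolist())
```

The first value agrees with the 1778.0 shown for row 0 in "time_s", and the missing entry stays missing.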

At last, I was able to calculate pace (after converting the column for distance to numeric format). I first calculated it in seconds-per-mile.

Then I created a new column in which the units for pace were a more meaningful minutes-per-mile.

In [11]:
df2['Mileage'] = pd.to_numeric(df2['Mileage'])
pace_s = df2.time_s / df2.Mileage
df2['pace_s'] = pace_s
print(df2)

            Date  Mileage     Time time_corr  hours  minutes  seconds  time_s  \
0     2009-04-25     3.10    29:38   0:29:38    0.0     29.0     38.0  1778.0   
1     2009-11-21     3.10    22:51   0:22:51    0.0     22.0     51.0  1371.0   
2     2010-01-04     1.30    17:00   0:17:00    0.0     17.0      0.0  1020.0   
3     2010-01-05     2.60    28:00   0:28:00    0.0     28.0      0.0  1680.0   
4     2010-01-07     2.60    28:00   0:28:00    0.0     28.0      0.0  1680.0   
5     2010-01-11     1.30      NaN       NaN    NaN      NaN      NaN     NaN   
6     2010-01-14     2.60    26:00   0:26:00    0.0     26.0      0.0  1560.0   
7     2010-01-16     4.05    35:48   0:35:48    0.0     35.0     48.0  2148.0   
8     2010-01-19     2.60    24:00   0:24:00    0.0     24.0      0.0  1440.0   
9     2010-01-25     2.66    24:15   0:24:15    0.0     24.0     15.0  1455.0   
...          ...      ...      ...       ...    ...      ...      ...     ...   
2562  2020-07-23     7.50   

In [12]:
pace_min = df2.pace_s / 60
df2['pace_min'] = pace_min
print(df2)

            Date  Mileage     Time time_corr  hours  minutes  seconds  time_s  \
0     2009-04-25     3.10    29:38   0:29:38    0.0     29.0     38.0  1778.0   
1     2009-11-21     3.10    22:51   0:22:51    0.0     22.0     51.0  1371.0   
2     2010-01-04     1.30    17:00   0:17:00    0.0     17.0      0.0  1020.0   
3     2010-01-05     2.60    28:00   0:28:00    0.0     28.0      0.0  1680.0   
4     2010-01-07     2.60    28:00   0:28:00    0.0     28.0      0.0  1680.0   
5     2010-01-11     1.30      NaN       NaN    NaN      NaN      NaN     NaN   
6     2010-01-14     2.60    26:00   0:26:00    0.0     26.0      0.0  1560.0   
7     2010-01-16     4.05    35:48   0:35:48    0.0     35.0     48.0  2148.0   
8     2010-01-19     2.60    24:00   0:24:00    0.0     24.0      0.0  1440.0   
9     2010-01-25     2.66    24:15   0:24:15    0.0     24.0     15.0  1455.0   
...          ...      ...      ...       ...    ...      ...      ...     ...   
2562  2020-07-23     7.50   
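The pace arithmetic can be checked by hand for row 0 (3.10 miles in 29:38). The last two lines, which format the pace as m:ss, are an optional extra that is not part of this notebook; `pace_label` is an illustrative name.

```python
# Row 0 of the log: 3.10 miles in 29:38.
time_s = 29 * 60 + 38          # 1778 seconds
mileage = 3.10

pace_s = time_s / mileage      # seconds per mile
pace_min = pace_s / 60         # decimal minutes per mile
print(round(pace_s, 6), round(pace_min, 6))  # 573.548387 9.55914

# Optional: show the pace as m:ss by rounding to the nearest whole second first.
whole_min, whole_sec = divmod(round(pace_s), 60)
pace_label = f"{whole_min}:{whole_sec:02d}"
print(pace_label)  # 9:34
```

Rounding to whole seconds before splitting into minutes and seconds avoids ever producing a label such as "9:60".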

I noticed that, because of some input errors in the original Excel file, every value in the "Date" column between 2012 and 2014 included a time of day. Since I had never entered a time, it showed up as a string of zeros at the end of the cell, as can be seen below in row 800.

I fixed this by changing the Date column to string format and then keeping only the first ten characters in each cell of that column.

In [13]:
print(df2.iloc[[0,800]])

                    Date  Mileage   Time time_corr  hours  minutes  seconds  \
0             2009-04-25     3.10  29:38   0:29:38    0.0     29.0     38.0   
800  2014-12-16 00:00:00     4.58  36:59   0:36:59    0.0     36.0     59.0   

     time_s      pace_s  pace_min  
0    1778.0  573.548387  9.559140  
800  2219.0  484.497817  8.074964  


In [14]:
df2.Date = df2.Date.astype(str)
df2['Date'] = df2['Date'].str[:10]
print(df2)

            Date  Mileage     Time time_corr  hours  minutes  seconds  time_s  \
0     2009-04-25     3.10    29:38   0:29:38    0.0     29.0     38.0  1778.0   
1     2009-11-21     3.10    22:51   0:22:51    0.0     22.0     51.0  1371.0   
2     2010-01-04     1.30    17:00   0:17:00    0.0     17.0      0.0  1020.0   
3     2010-01-05     2.60    28:00   0:28:00    0.0     28.0      0.0  1680.0   
4     2010-01-07     2.60    28:00   0:28:00    0.0     28.0      0.0  1680.0   
5     2010-01-11     1.30      NaN       NaN    NaN      NaN      NaN     NaN   
6     2010-01-14     2.60    26:00   0:26:00    0.0     26.0      0.0  1560.0   
7     2010-01-16     4.05    35:48   0:35:48    0.0     35.0     48.0  2148.0   
8     2010-01-19     2.60    24:00   0:24:00    0.0     24.0      0.0  1440.0   
9     2010-01-25     2.66    24:15   0:24:15    0.0     24.0     15.0  1455.0   
...          ...      ...      ...       ...    ...      ...      ...     ...   
2562  2020-07-23     7.50   
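The date clean-up can be illustrated on the two values shown for rows 0 and 800. The final conversion with `pd.to_datetime` is an optional extra step that this notebook does not perform; it is included as a sketch in case a true datetime column is wanted later.

```python
import pandas as pd

# Toy values: one clean date and one with the trailing time of day.
dates = pd.Series(["2009-04-25", "2014-12-16 00:00:00"])

# Same idea as above: cast to string and keep the first ten characters.
trimmed = dates.astype(str).str[:10]
print(trimmed.tolist())  # ['2009-04-25', '2014-12-16']

# Optional: parse the trimmed strings into real datetimes.
parsed = pd.to_datetime(trimmed, format="%Y-%m-%d")
print(parsed.dtype)
```

Slicing works here because every date begins with a ten-character YYYY-MM-DD prefix.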

As a final step before completing the data management process, I created a new dataframe, df3, containing only the columns of df2 that I needed. Then I wanted to view the entire dataset to ensure that there were no hidden mistakes.

In [15]:
df3 = df2[['Date','Mileage','time_corr','pace_min']]
pd.options.display.max_rows=3000
#print(df3)
print(df3.tail(10))
pd.options.display.max_rows=20

            Date  Mileage time_corr  pace_min
2562  2020-07-23     7.50   0:52:58  7.062222
2563  2020-07-24     7.09   0:53:15  7.510578
2564  2020-07-25    11.56   1:21:42  7.067474
2565  2020-07-27     7.90   0:57:02  7.219409
2566  2020-07-28     8.26   0:55:39  6.737288
2567  2020-07-29    10.89       NaN       NaN
2568  2020-07-30     5.68   0:40:48  7.183099
2569  2020-07-31     7.39   0:52:51  7.151556
2570  2020-08-01    16.69   1:47:15  6.426004
2571  2020-08-03     8.00   1:00:08  7.516667


Finally, I exported df3 to a CSV file so that the changes are saved externally. This file will be read in at the beginning of the df_visualization and df_analysis notebooks.

In [16]:
df3.to_csv("clean.csv", index=False)
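For reference, here is a sketch of how the exported file might be read back in. It uses an in-memory buffer and two rows copied from the output above instead of the real "clean.csv", and the `parse_dates` choice is an assumption about what the later notebooks will want.

```python
import io
import pandas as pd

# Two rows copied from the end of df3, standing in for the real file.
sample = pd.DataFrame({
    "Date": ["2020-08-01", "2020-08-03"],
    "Mileage": [16.69, 8.00],
    "time_corr": ["1:47:15", "1:00:08"],
    "pace_min": [6.426004, 7.516667],
})

# Write without the index, as in the export above, then read it back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["Date"])
print(reloaded.dtypes)
```

With the real file, the buffer would be replaced by the path "clean.csv".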