# Data Management

In [1]:
import pandas as pd
import numpy as np
import re

The original dataframe "df" contains the date, distance, and runtime of each run from 2009 to the present. The goal of this data management process is to calculate a "pace" column from the distance and time of each run.

In [2]:
df = pd.read_excel("Running_Log.xlsx")

In [3]:
pd.options.display.max_rows=20
print(df)

            Date  Mileage     Time
0     2009-01-05     1.00      NaN
1     2009-04-25     3.10    29:38
2     2009-11-21     3.10    22:51
3     2010-01-04     1.30    17:00
4     2010-01-05     2.60    28:00
...          ...      ...      ...
3295  2024-05-15     7.54    53:50
3296  2024-05-17     4.50    34:08
3297  2024-05-18    20.08  2:14:32
3298  2024-05-21     4.39    37:43
3299  2024-05-22     6.76    49:31

[3300 rows x 3 columns]


The main obstacle in creating a pace column is that times in mm:ss format are stored as strings, so they cannot be used directly in arithmetic. The "Time" column must therefore be converted to minutes, with the seconds expressed as a decimal fraction (e.g. "29:38" becomes 29.633).
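As a minimal sketch of that conversion, the helper below turns a single "mm:ss" string into decimal minutes. The name `mmss_to_minutes` is hypothetical and not part of this notebook, and the function assumes the input contains exactly one colon.

```python
def mmss_to_minutes(text):
    """Convert an "mm:ss" string to decimal minutes (hypothetical helper)."""
    minutes, seconds = text.split(":")
    # Seconds are a fraction of a minute, so divide by 60 before adding.
    return int(minutes) + int(seconds) / 60


print(round(mmss_to_minutes("29:38"), 3))  # 29.633
```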

In [4]:
print(df.iloc[[0,1389]])

            Date  Mileage   Time
0     2009-01-05     1.00    NaN
1389  2016-11-26     6.13  46:34


However, some of my runs lasted longer than an hour and were thus entered in h:mm:ss format (e.g. "2:14:32" in row 3297 of the first printout). When I attempted to split the column on ":", the hour value of such a row ended up in the same column as the minutes value of an mm:ss row such as "29:38" in row 1.
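The misalignment can be reproduced on a small scale. The sketch below uses two time values taken from the log, one in each format; the variable names `times` and `parts` are illustrative only.

```python
import pandas as pd

# Two values from the log: one in mm:ss format, one in h:mm:ss format.
times = pd.Series(["29:38", "2:14:32"])
parts = times.str.split(":", expand=True)

# Column 0 now mixes minutes ("29") with hours ("2"),
# and the mm:ss row has no value in column 2.
print(parts)
```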

To work around this problem, I computed a new column "time_h" with a standardized time format of "h:mm:ss", as seen below. I added an assert statement to check that every non-missing value in "time_h" starts with a digit followed by a colon, as I had intended.

In [5]:
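# added new column to standardize format of time variable as "h:mm:ss"
# (raw strings keep the regex backslashes from being read as string escapes)
df["time_h"] = np.where(df.Time.str.contains(r'\d:\d\d:\d\d', regex=True), df["Time"], "0:" + df["Time"])
print(df.iloc[[0,1389]])

# check that every non-missing value starts with a digit followed by a colon
testseries = df['time_h'].dropna()
for i in testseries:
    assert re.match(r'\d:', i)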

            Date  Mileage   Time   time_h
0     2009-01-05     1.00    NaN      NaN
1389  2016-11-26     6.13  46:34  0:46:34
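The same format check can also be written without a loop. This is only a sketch on sample values, assuming a pandas version that provides `Series.str.fullmatch` (1.1 or later); the stricter pattern used here is my own choice, not the one in the cell above.

```python
import pandas as pd

# Sample values in the standardized format, plus one missing entry.
time_h = pd.Series(["0:29:38", "2:14:32", None])

# fullmatch requires the whole string to fit the pattern, not just its start.
valid = time_h.dropna().str.fullmatch(r"\d+:\d{2}:\d{2}")
assert valid.all()
```

Checking the whole string would also catch entries with a malformed minutes or seconds part, which a start-of-string check lets through.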


Then I created a new dataframe, "new", by splitting df['time_h'] on ":" into three columns.

Next, I concatenated the "new" dataframe with the "df" dataframe column-wise and called the resulting dataframe "result".

Finally, I created "df2" from "result", renaming "time_h" to "time_corr" and the three columns from "new" to "hours", "minutes", and "seconds".

These changes can be seen in the following three sections of code.

In [6]:
# created new dataframe with "time_h" col expanded into hr, min, and sec vars
new = df["time_h"].str.split(pat = ":", expand = True)
print(new)

        0    1    2
0     NaN  NaN  NaN
1       0   29   38
2       0   22   51
3       0   17   00
4       0   28   00
...   ...  ...  ...
3295    0   53   50
3296    0   34   08
3297    2   14   32
3298    0   37   43
3299    0   49   31

[3300 rows x 3 columns]


In [7]:
result = pd.concat([df, new], axis=1, sort=False)
print(result)

            Date  Mileage     Time   time_h    0    1    2
0     2009-01-05     1.00      NaN      NaN  NaN  NaN  NaN
1     2009-04-25     3.10    29:38  0:29:38    0   29   38
2     2009-11-21     3.10    22:51  0:22:51    0   22   51
3     2010-01-04     1.30    17:00  0:17:00    0   17   00
4     2010-01-05     2.60    28:00  0:28:00    0   28   00
...          ...      ...      ...      ...  ...  ...  ...
3295  2024-05-15     7.54    53:50  0:53:50    0   53   50
3296  2024-05-17     4.50    34:08  0:34:08    0   34   08
3297  2024-05-18    20.08  2:14:32  2:14:32    2   14   32
3298  2024-05-21     4.39    37:43  0:37:43    0   37   43
3299  2024-05-22     6.76    49:31  0:49:31    0   49   31

[3300 rows x 7 columns]


In [8]:
df2 = result.rename(columns={"time_h": "time_corr", 0: "hours", 1: "minutes", 2: "seconds"})
print(df2)

            Date  Mileage     Time time_corr hours minutes seconds
0     2009-01-05     1.00      NaN       NaN   NaN     NaN     NaN
1     2009-04-25     3.10    29:38   0:29:38     0      29      38
2     2009-11-21     3.10    22:51   0:22:51     0      22      51
3     2010-01-04     1.30    17:00   0:17:00     0      17      00
4     2010-01-05     2.60    28:00   0:28:00     0      28      00
...          ...      ...      ...       ...   ...     ...     ...
3295  2024-05-15     7.54    53:50   0:53:50     0      53      50
3296  2024-05-17     4.50    34:08   0:34:08     0      34      08
3297  2024-05-18    20.08  2:14:32   2:14:32     2      14      32
3298  2024-05-21     4.39    37:43   0:37:43     0      37      43
3299  2024-05-22     6.76    49:31   0:49:31     0      49      31

[3300 rows x 7 columns]


To calculate pace, I needed to get my three time columns into numeric format.

Then I needed to combine them back into one time variable in seconds.

In [9]:
df2['hours'] = pd.to_numeric(df2['hours'])
df2['minutes'] = pd.to_numeric(df2['minutes'])
df2['seconds'] = pd.to_numeric(df2['seconds'])

In [10]:
time_seconds = df2.hours*3600 + df2.minutes*60 + df2.seconds
df2['time_s'] = time_seconds
print(df2)

            Date  Mileage     Time time_corr  hours  minutes  seconds  time_s
0     2009-01-05     1.00      NaN       NaN    NaN      NaN      NaN     NaN
1     2009-04-25     3.10    29:38   0:29:38    0.0     29.0     38.0  1778.0
2     2009-11-21     3.10    22:51   0:22:51    0.0     22.0     51.0  1371.0
3     2010-01-04     1.30    17:00   0:17:00    0.0     17.0      0.0  1020.0
4     2010-01-05     2.60    28:00   0:28:00    0.0     28.0      0.0  1680.0
...          ...      ...      ...       ...    ...      ...      ...     ...
3295  2024-05-15     7.54    53:50   0:53:50    0.0     53.0     50.0  3230.0
3296  2024-05-17     4.50    34:08   0:34:08    0.0     34.0      8.0  2048.0
3297  2024-05-18    20.08  2:14:32   2:14:32    2.0     14.0     32.0  8072.0
3298  2024-05-21     4.39    37:43   0:37:43    0.0     37.0     43.0  2263.0
3299  2024-05-22     6.76    49:31   0:49:31    0.0     49.0     31.0  2971.0

[3300 rows x 8 columns]
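As a possible shortcut, pandas can parse "h:mm:ss" strings directly, which would skip the split, rename, and to_numeric steps. The sketch below is an alternative to the approach used in this notebook, shown on two sample values from the table above plus a missing entry.

```python
import pandas as pd

time_corr = pd.Series(["0:29:38", "2:14:32", None])

# to_timedelta parses "h:mm:ss" strings; missing values become NaT,
# which total_seconds() turns into NaN.
time_s = pd.to_timedelta(time_corr).dt.total_seconds()
print(time_s)
```

The two parsed values agree with the time_s column computed above (1778 and 8072 seconds).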


At last, I was able to calculate pace (after converting the column for distance to numeric format). I first calculated it in seconds-per-mile.

Then I created a new column in which the units for pace were a more meaningful minutes-per-mile.

In [11]:
df2['Mileage'] = pd.to_numeric(df2['Mileage'])
pace_s = df2.time_s / df2.Mileage
df2['pace_s'] = pace_s
print(df2)

            Date  Mileage     Time time_corr  hours  minutes  seconds  time_s  \
0     2009-01-05     1.00      NaN       NaN    NaN      NaN      NaN     NaN   
1     2009-04-25     3.10    29:38   0:29:38    0.0     29.0     38.0  1778.0   
2     2009-11-21     3.10    22:51   0:22:51    0.0     22.0     51.0  1371.0   
3     2010-01-04     1.30    17:00   0:17:00    0.0     17.0      0.0  1020.0   
4     2010-01-05     2.60    28:00   0:28:00    0.0     28.0      0.0  1680.0   
...          ...      ...      ...       ...    ...      ...      ...     ...   
3295  2024-05-15     7.54    53:50   0:53:50    0.0     53.0     50.0  3230.0   
3296  2024-05-17     4.50    34:08   0:34:08    0.0     34.0      8.0  2048.0   
3297  2024-05-18    20.08  2:14:32   2:14:32    2.0     14.0     32.0  8072.0   
3298  2024-05-21     4.39    37:43   0:37:43    0.0     37.0     43.0  2263.0   
3299  2024-05-22     6.76    49:31   0:49:31    0.0     49.0     31.0  2971.0   

          pace_s  
0       

In [12]:
pace_min = df2.pace_s / 60
df2['pace_min'] = pace_min
print(df2)

            Date  Mileage     Time time_corr  hours  minutes  seconds  time_s  \
0     2009-01-05     1.00      NaN       NaN    NaN      NaN      NaN     NaN   
1     2009-04-25     3.10    29:38   0:29:38    0.0     29.0     38.0  1778.0   
2     2009-11-21     3.10    22:51   0:22:51    0.0     22.0     51.0  1371.0   
3     2010-01-04     1.30    17:00   0:17:00    0.0     17.0      0.0  1020.0   
4     2010-01-05     2.60    28:00   0:28:00    0.0     28.0      0.0  1680.0   
...          ...      ...      ...       ...    ...      ...      ...     ...   
3295  2024-05-15     7.54    53:50   0:53:50    0.0     53.0     50.0  3230.0   
3296  2024-05-17     4.50    34:08   0:34:08    0.0     34.0      8.0  2048.0   
3297  2024-05-18    20.08  2:14:32   2:14:32    2.0     14.0     32.0  8072.0   
3298  2024-05-21     4.39    37:43   0:37:43    0.0     37.0     43.0  2263.0   
3299  2024-05-22     6.76    49:31   0:49:31    0.0     49.0     31.0  2971.0   

          pace_s   pace_min
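Decimal minutes are convenient for arithmetic, but a pace is usually read as m:ss. The sketch below converts one decimal pace back to that form; `format_pace` is a hypothetical helper that is not used elsewhere in this notebook, and the input is one of the pace_min values from the log.

```python
def format_pace(pace_min):
    """Format decimal minutes per mile as "m:ss" (hypothetical helper)."""
    # Round to whole seconds first so that 59.6 s carries into the minutes.
    total_seconds = round(pace_min * 60)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


print(format_pace(7.359788))  # 7:22
```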

I noticed that, due to some input errors in the original Excel file, every value in the "Date" column between 2012 and 2014 included a time of day. Because I had never entered a time, it showed up as a string of zeros at the end of the cell, as in row 800 below.

I fixed this by changing the Date column to string format and then keeping only the first ten characters in each cell of that column.

In [13]:
print(df2.iloc[[0,800]])

                    Date  Mileage   Time time_corr  hours  minutes  seconds  \
0             2009-01-05      1.0    NaN       NaN    NaN      NaN      NaN   
800  2014-12-15 00:00:00      6.3  46:22   0:46:22    0.0     46.0     22.0   

     time_s      pace_s  pace_min  
0       NaN         NaN       NaN  
800  2782.0  441.587302  7.359788  


In [14]:
# convert Date to string and keep only the "YYYY-MM-DD" part
df2['Date'] = df2['Date'].astype(str).str[:10]
print(df2)

            Date  Mileage     Time time_corr  hours  minutes  seconds  time_s  \
0     2009-01-05     1.00      NaN       NaN    NaN      NaN      NaN     NaN   
1     2009-04-25     3.10    29:38   0:29:38    0.0     29.0     38.0  1778.0   
2     2009-11-21     3.10    22:51   0:22:51    0.0     22.0     51.0  1371.0   
3     2010-01-04     1.30    17:00   0:17:00    0.0     17.0      0.0  1020.0   
4     2010-01-05     2.60    28:00   0:28:00    0.0     28.0      0.0  1680.0   
...          ...      ...      ...       ...    ...      ...      ...     ...   
3295  2024-05-15     7.54    53:50   0:53:50    0.0     53.0     50.0  3230.0   
3296  2024-05-17     4.50    34:08   0:34:08    0.0     34.0      8.0  2048.0   
3297  2024-05-18    20.08  2:14:32   2:14:32    2.0     14.0     32.0  8072.0   
3298  2024-05-21     4.39    37:43   0:37:43    0.0     37.0     43.0  2263.0   
3299  2024-05-22     6.76    49:31   0:49:31    0.0     49.0     31.0  2971.0   

          pace_s   pace_min
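After trimming, the dates can be validated by parsing them with an explicit format. The sketch below runs on two sample strings shaped like the values in rows 0 and 800; it is an optional extra check, not a step this notebook performs.

```python
import pandas as pd

dates = pd.Series(["2009-01-05", "2014-12-15 00:00:00"])

# Keep the first ten characters, then parse with an explicit format so that
# any value that is not a YYYY-MM-DD date raises an error instead of slipping through.
parsed = pd.to_datetime(dates.str[:10], format="%Y-%m-%d")
print(parsed.dt.year.tolist())  # [2009, 2014]
```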

As a final step before completing the data management process, I created a new dataframe, df3, containing only the columns of df2 that I needed. I then checked the dataset for hidden mistakes; the full printout is commented out below, so only the last ten rows are shown.

In [15]:
df3 = df2[['Date','Mileage','time_corr','pace_min']]
pd.options.display.max_rows=3000
#print(df3)
print(df3.tail(10))
pd.options.display.max_rows=20

            Date  Mileage time_corr  pace_min
3290  2024-05-08     7.27   0:50:00  6.877579
3291  2024-05-10     4.40   0:36:01  8.185606
3292  2024-05-11    12.16   1:21:00  6.661184
3293  2024-05-13     6.01   0:47:53  7.967277
3294  2024-05-14     6.93   0:48:30  6.998557
3295  2024-05-15     7.54   0:53:50  7.139699
3296  2024-05-17     4.50   0:34:08  7.585185
3297  2024-05-18    20.08   2:14:32  6.699867
3298  2024-05-21     4.39   0:37:43  8.591496
3299  2024-05-22     6.76   0:49:31  7.324951


Finally, I exported df3 to a CSV file so that the changes are saved externally. This file will be read in at the beginning of the df_visualization and df_analysis notebooks.

In [16]:
df3.to_csv("clean.csv", index=False)
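When the file is read back in, the Date column can be parsed as dates at load time. The sketch below shows one way to do that on two rows copied from the final table; it writes to an in-memory buffer rather than to "clean.csv" so that it runs on its own, and how the later notebooks actually load the file is an assumption.

```python
import io
import pandas as pd

# Two rows copied from the final table above.
df3 = pd.DataFrame({
    "Date": ["2024-05-21", "2024-05-22"],
    "Mileage": [4.39, 6.76],
    "time_corr": ["0:37:43", "0:49:31"],
    "pace_min": [8.591496, 7.324951],
})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
df3.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["Date"])

print(reloaded.dtypes)
```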