# Data Management

In [1]:
import pandas as pd
import numpy as np
import re

The original dataframe "df" contains the date, distance, and runtime of each run from 2009 to the present. The goal of this data management process is to calculate a "pace" column from the distance and time of each run.

In [2]:
df = pd.read_excel("Running_Log.xlsx")

In [3]:
pd.options.display.max_rows=20
print(df)

            Date  Mileage   Time
0     2009-01-05     1.00    NaN
1     2009-04-25     3.10  29:38
2     2009-11-21     3.10  22:51
3     2010-01-04     1.30  17:00
4     2010-01-05     2.60  28:00
...          ...      ...    ...
3572  2025-09-11     5.09  43:26
3573  2025-09-13     8.01  56:42
3574  2025-09-15     4.09  33:39
3575  2025-09-16     6.11  45:20
3576  2025-09-17    10.00    NaN

[3577 rows x 3 columns]


The main obstacle in creating a pace column is that it is not easy to do math with times in mm:ss format. The "Time" column must therefore be converted to a plain number, with fractions of a minute expressed as decimals (e.g. "29:38" becomes 29.633 minutes).
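As a quick illustration of the target format, here is a minimal sketch of the conversion for a single value. The helper name `mmss_to_minutes` is hypothetical and not part of this notebook, which works on whole columns instead.

```python
def mmss_to_minutes(text):
    """Convert an "mm:ss" string into decimal minutes."""
    minutes, seconds = text.split(":")
    return int(minutes) + int(seconds) / 60


print(round(mmss_to_minutes("29:38"), 3))  # 29.633
```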

In [4]:
print(df.iloc[[1,1390]])

            Date  Mileage     Time
1     2009-04-25      3.1    29:38
1390  2016-11-27     10.0  1:10:51


However, as can be seen above, some of my runs lasted longer than an hour and were thus input in h:mm:ss format. When I attempted to split the column on ":", the "1" (hour) in row 1390 was put in the same column as the "29" (minutes) in row 1.
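The misalignment is easy to reproduce on a two-value series. This is a small sketch using made-up variable names (`times`, `parts`), not the notebook's own dataframe.

```python
import pandas as pd

# One run in mm:ss format and one in h:mm:ss format.
times = pd.Series(["29:38", "1:10:51"])

# Splitting on ":" fills columns left to right, so column 0 holds
# minutes for the first run but hours for the second.
parts = times.str.split(":", expand=True)
print(parts)
```

Column 0 ends up holding "29" and "1" side by side, and the shorter value is padded with a missing entry in the last column.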

To work around this problem, I computed a new column "time_h" with a standardized time format of "h:mm:ss", as seen below. I added an assert statement to ensure that every non-missing value in "time_h" starts with a digit followed by a colon, as I had intended.

In [5]:
# added new col to standardize format of time var as "h:mm:ss"
# raw strings keep "\d" from being treated as a string escape
df["time_h"] = np.where(df.Time.str.contains(r'\d:\d\d:\d\d', regex = True), df["Time"], "0:" + df["Time"])
print(df.iloc[[1,1390]])

# every non-missing value must start with a digit followed by a colon
testseries = df['time_h'].dropna()
for i in testseries:
    assert re.match(r'\d:', i)

            Date  Mileage     Time   time_h
1     2009-04-25      3.1    29:38  0:29:38
1390  2016-11-27     10.0  1:10:51  1:10:51
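The assert loop only looks at the start of each string. A stricter variant, sketched here on a hypothetical sample series, checks that the whole value fits the h:mm:ss pattern by using `Series.str.fullmatch`.

```python
import pandas as pd

# Hypothetical sample mirroring the "time_h" column, including a missing value.
time_h = pd.Series(["0:29:38", "1:10:51", None])

# fullmatch requires the entire string to match, not just a prefix.
valid = time_h.dropna().str.fullmatch(r"\d+:\d{2}:\d{2}")
assert valid.all()
```

A value such as "29:38" that was never standardized would fail this check, whereas a prefix-only test could let malformed strings through.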


Then I created a new dataframe, "new", in which df['time_h'] was split on ":" into three columns.

Next, I concatenated the "new" dataframe column-wise with the "df" dataframe and called the resulting dataframe "result".

Finally, I created "df2" from "result", renaming "time_h" to "time_corr" and the three columns from "new" to "hours", "minutes", and "seconds".

These changes can be seen in the following three code cells.

In [6]:
# created new dataframe with "time_h" col expanded into hr, min, and sec vars
new = df["time_h"].str.split(pat = ":", expand = True)
print(new)

        0    1    2
0     NaN  NaN  NaN
1       0   29   38
2       0   22   51
3       0   17   00
4       0   28   00
...   ...  ...  ...
3572    0   43   26
3573    0   56   42
3574    0   33   39
3575    0   45   20
3576  NaN  NaN  NaN

[3577 rows x 3 columns]


In [7]:
result = pd.concat([df, new], axis=1, sort=False)
print(result)

            Date  Mileage   Time   time_h    0    1    2
0     2009-01-05     1.00    NaN      NaN  NaN  NaN  NaN
1     2009-04-25     3.10  29:38  0:29:38    0   29   38
2     2009-11-21     3.10  22:51  0:22:51    0   22   51
3     2010-01-04     1.30  17:00  0:17:00    0   17   00
4     2010-01-05     2.60  28:00  0:28:00    0   28   00
...          ...      ...    ...      ...  ...  ...  ...
3572  2025-09-11     5.09  43:26  0:43:26    0   43   26
3573  2025-09-13     8.01  56:42  0:56:42    0   56   42
3574  2025-09-15     4.09  33:39  0:33:39    0   33   39
3575  2025-09-16     6.11  45:20  0:45:20    0   45   20
3576  2025-09-17    10.00    NaN      NaN  NaN  NaN  NaN

[3577 rows x 7 columns]


In [8]:
df2 = result.rename(columns={"time_h": "time_corr", 0: "hours", 1: "minutes", 2: "seconds"})
print(df2)

            Date  Mileage   Time time_corr hours minutes seconds
0     2009-01-05     1.00    NaN       NaN   NaN     NaN     NaN
1     2009-04-25     3.10  29:38   0:29:38     0      29      38
2     2009-11-21     3.10  22:51   0:22:51     0      22      51
3     2010-01-04     1.30  17:00   0:17:00     0      17      00
4     2010-01-05     2.60  28:00   0:28:00     0      28      00
...          ...      ...    ...       ...   ...     ...     ...
3572  2025-09-11     5.09  43:26   0:43:26     0      43      26
3573  2025-09-13     8.01  56:42   0:56:42     0      56      42
3574  2025-09-15     4.09  33:39   0:33:39     0      33      39
3575  2025-09-16     6.11  45:20   0:45:20     0      45      20
3576  2025-09-17    10.00    NaN       NaN   NaN     NaN     NaN

[3577 rows x 7 columns]
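The split, concat, and rename steps above could also be collapsed into a single assignment. The following is a minimal sketch on a hypothetical three-row frame called `demo`, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the standardized time column.
demo = pd.DataFrame({"time_h": ["0:29:38", "1:10:51", None]})

# Assign the three split pieces straight to named columns.
demo[["hours", "minutes", "seconds"]] = demo["time_h"].str.split(":", expand=True)
print(demo)
```

The new columns still hold strings at this point, so a numeric conversion is needed before doing arithmetic with them.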


To calculate pace, I needed to get my three time columns into numeric format.

Then I needed to combine them back into one time variable in seconds.

In [9]:
df2['hours'] = pd.to_numeric(df2['hours'])
df2['minutes'] = pd.to_numeric(df2['minutes'])
df2['seconds'] = pd.to_numeric(df2['seconds'])

In [10]:
time_seconds = df2.hours*3600 + df2.minutes*60 + df2.seconds
df2['time_s'] = time_seconds
print(df2)

            Date  Mileage   Time time_corr  hours  minutes  seconds  time_s
0     2009-01-05     1.00    NaN       NaN    NaN      NaN      NaN     NaN
1     2009-04-25     3.10  29:38   0:29:38    0.0     29.0     38.0  1778.0
2     2009-11-21     3.10  22:51   0:22:51    0.0     22.0     51.0  1371.0
3     2010-01-04     1.30  17:00   0:17:00    0.0     17.0      0.0  1020.0
4     2010-01-05     2.60  28:00   0:28:00    0.0     28.0      0.0  1680.0
...          ...      ...    ...       ...    ...      ...      ...     ...
3572  2025-09-11     5.09  43:26   0:43:26    0.0     43.0     26.0  2606.0
3573  2025-09-13     8.01  56:42   0:56:42    0.0     56.0     42.0  3402.0
3574  2025-09-15     4.09  33:39   0:33:39    0.0     33.0     39.0  2019.0
3575  2025-09-16     6.11  45:20   0:45:20    0.0     45.0     20.0  2720.0
3576  2025-09-17    10.00    NaN       NaN    NaN      NaN      NaN     NaN

[3577 rows x 8 columns]
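pandas can also parse h:mm:ss strings directly. As an alternative sketch (not the approach used in this notebook), `pd.to_timedelta` followed by `.dt.total_seconds()` gives the same total in seconds without separate hour, minute, and second columns. The series below is a hypothetical sample.

```python
import pandas as pd

# Hypothetical sample in the standardized h:mm:ss format.
time_h = pd.Series(["0:29:38", "1:10:51", None])

# Missing values become NaT and then NaN after total_seconds().
time_s = pd.to_timedelta(time_h).dt.total_seconds()
print(time_s)
```

For "0:29:38" this yields 1778.0 seconds, matching the value in the "time_s" column above.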


At last, I was able to calculate pace (after converting the column for distance to numeric format). I first calculated it in seconds-per-mile.

Then I created a new column in which the units for pace were a more meaningful minutes-per-mile.

In [11]:
df2['Mileage'] = pd.to_numeric(df2['Mileage'])
pace_s = df2.time_s / df2.Mileage
df2['pace_s'] = pace_s
print(df2)

            Date  Mileage   Time time_corr  hours  minutes  seconds  time_s  \
0     2009-01-05     1.00    NaN       NaN    NaN      NaN      NaN     NaN   
1     2009-04-25     3.10  29:38   0:29:38    0.0     29.0     38.0  1778.0   
2     2009-11-21     3.10  22:51   0:22:51    0.0     22.0     51.0  1371.0   
3     2010-01-04     1.30  17:00   0:17:00    0.0     17.0      0.0  1020.0   
4     2010-01-05     2.60  28:00   0:28:00    0.0     28.0      0.0  1680.0   
...          ...      ...    ...       ...    ...      ...      ...     ...   
3572  2025-09-11     5.09  43:26   0:43:26    0.0     43.0     26.0  2606.0   
3573  2025-09-13     8.01  56:42   0:56:42    0.0     56.0     42.0  3402.0   
3574  2025-09-15     4.09  33:39   0:33:39    0.0     33.0     39.0  2019.0   
3575  2025-09-16     6.11  45:20   0:45:20    0.0     45.0     20.0  2720.0   
3576  2025-09-17    10.00    NaN       NaN    NaN      NaN      NaN     NaN   

          pace_s  
0            NaN  
1     573.548

In [12]:
pace_min = df2.pace_s / 60
df2['pace_min'] = pace_min
print(df2)

            Date  Mileage   Time time_corr  hours  minutes  seconds  time_s  \
0     2009-01-05     1.00    NaN       NaN    NaN      NaN      NaN     NaN   
1     2009-04-25     3.10  29:38   0:29:38    0.0     29.0     38.0  1778.0   
2     2009-11-21     3.10  22:51   0:22:51    0.0     22.0     51.0  1371.0   
3     2010-01-04     1.30  17:00   0:17:00    0.0     17.0      0.0  1020.0   
4     2010-01-05     2.60  28:00   0:28:00    0.0     28.0      0.0  1680.0   
...          ...      ...    ...       ...    ...      ...      ...     ...   
3572  2025-09-11     5.09  43:26   0:43:26    0.0     43.0     26.0  2606.0   
3573  2025-09-13     8.01  56:42   0:56:42    0.0     56.0     42.0  3402.0   
3574  2025-09-15     4.09  33:39   0:33:39    0.0     33.0     39.0  2019.0   
3575  2025-09-16     6.11  45:20   0:45:20    0.0     45.0     20.0  2720.0   
3576  2025-09-17    10.00    NaN       NaN    NaN      NaN      NaN     NaN   

          pace_s   pace_min  
0            NaN     
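Decimal minutes are convenient for math but less familiar to read than a runner's usual m:ss pace. As a small sketch, the hypothetical helper `format_pace` below converts a decimal pace back into that form for display.

```python
def format_pace(pace_min):
    """Render decimal minutes per mile as an "m:ss" string."""
    total_seconds = round(pace_min * 60)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


# Row 1: 1778 seconds over 3.1 miles.
print(format_pace(1778 / 3.1 / 60))  # 9:34
```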

I noticed that, due to some input errors in the original Excel file, every value in the "Date" column between the years 2012 and 2014 included the time of day. Because I had never entered a time, it shows up as a string of zeros at the end of the cell, as can be seen below in row 800.

I fixed this by changing the Date column to string format and then keeping only the first ten characters in each cell of that column.

In [13]:
print(df2.iloc[[0,800]])

                    Date  Mileage   Time time_corr  hours  minutes  seconds  \
0             2009-01-05      1.0    NaN       NaN    NaN      NaN      NaN   
800  2014-12-15 00:00:00      6.3  46:22   0:46:22    0.0     46.0     22.0   

     time_s      pace_s  pace_min  
0       NaN         NaN       NaN  
800  2782.0  441.587302  7.359788  


In [14]:
df2.Date = df2.Date.astype(str)
df2['Date'] = df2['Date'].str[:10]
print(df2)

            Date  Mileage   Time time_corr  hours  minutes  seconds  time_s  \
0     2009-01-05     1.00    NaN       NaN    NaN      NaN      NaN     NaN   
1     2009-04-25     3.10  29:38   0:29:38    0.0     29.0     38.0  1778.0   
2     2009-11-21     3.10  22:51   0:22:51    0.0     22.0     51.0  1371.0   
3     2010-01-04     1.30  17:00   0:17:00    0.0     17.0      0.0  1020.0   
4     2010-01-05     2.60  28:00   0:28:00    0.0     28.0      0.0  1680.0   
...          ...      ...    ...       ...    ...      ...      ...     ...   
3572  2025-09-11     5.09  43:26   0:43:26    0.0     43.0     26.0  2606.0   
3573  2025-09-13     8.01  56:42   0:56:42    0.0     56.0     42.0  3402.0   
3574  2025-09-15     4.09  33:39   0:33:39    0.0     33.0     39.0  2019.0   
3575  2025-09-16     6.11  45:20   0:45:20    0.0     45.0     20.0  2720.0   
3576  2025-09-17    10.00    NaN       NaN    NaN      NaN      NaN     NaN   

          pace_s   pace_min  
0            NaN     
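If real datetime values are wanted later (for example, for plotting by date), the trimmed strings can be parsed with an explicit format. This is a sketch on a hypothetical two-value series, offered as an optional extra step rather than something this notebook does.

```python
import pandas as pd

# Hypothetical sample: one clean date and one with a trailing time of day.
dates = pd.Series(["2009-01-05", "2014-12-15 00:00:00"])

# Keep the first ten characters, then parse with a fixed format.
parsed = pd.to_datetime(dates.str[:10], format="%Y-%m-%d")
print(parsed)
```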

As a final step before completing the data management process, I created a new dataframe, df3, that keeps only the columns I needed from df2. I then viewed the entire dataset to ensure that there were no hidden mistakes (the full print is commented out below, so only the last ten rows are shown).

In [15]:
df3 = df2[['Date','Mileage','time_corr','pace_min']]
# None removes the row limit so that all 3577 rows would display
pd.options.display.max_rows=None
#print(df3)
print(df3.tail(10))
pd.options.display.max_rows=20

            Date  Mileage time_corr  pace_min
3567  2025-08-30     9.09   1:05:33  7.211221
3568  2025-09-01    14.29   1:45:37  7.390949
3569  2025-09-06     5.01   0:43:19  8.646041
3570  2025-09-08     5.00   0:36:53  7.376667
3571  2025-09-09     7.64   0:54:53  7.183682
3572  2025-09-11     5.09   0:43:26  8.533071
3573  2025-09-13     8.01   0:56:42  7.078652
3574  2025-09-15     4.09   0:33:39  8.227384
3575  2025-09-16     6.11   0:45:20  7.419531
3576  2025-09-17    10.00       NaN       NaN
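Scanning thousands of rows by eye can miss problems, so a programmatic check may help. The sketch below flags paces outside a plausible range. The bounds of 4 and 20 minutes per mile are my own assumption, and `demo` is a hypothetical sample rather than the real df3.

```python
import pandas as pd

# Hypothetical sample with the same columns used for the check.
demo = pd.DataFrame({
    "Mileage": [5.09, 8.01, 10.00],
    "pace_min": [8.533071, 7.078652, float("nan")],
})

# Assumed plausible pace range in minutes per mile.
low, high = 4.0, 20.0

# Rows with a pace outside the range, and rows with no pace at all.
suspicious = demo[demo["pace_min"].notna() & ~demo["pace_min"].between(low, high)]
missing = demo[demo["pace_min"].isna()]
print(len(suspicious), len(missing))
```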


Finally, I exported df3 to a CSV file so that the changes are saved externally. This file will be read in at the beginning of the df_visualization and df_analysis notebooks.

In [16]:
df3.to_csv("clean.csv", index=False)
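To confirm that an export like this survives a round trip, the file can be read back and compared with the original. The sketch below uses a hypothetical two-row frame and a temporary directory so that it does not touch clean.csv.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample with the same columns as df3.
demo = pd.DataFrame({
    "Date": ["2025-09-16", "2025-09-17"],
    "Mileage": [6.11, 10.00],
    "time_corr": ["0:45:20", None],
    "pace_min": [7.419531, None],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clean_demo.csv")
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded)
```

Note that "Date" comes back as plain text. Passing `parse_dates=["Date"]` to `pd.read_csv` would convert it to datetime values on load.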