# Data Cleaning - Running Metadata of the RICKD Analysis

In this notebook we clean the `run_data_meta.csv` file, applying what we learned in the previous notebook. The planned steps are:

- Coalesce `SpecInjury` and `SpecInjury2` into a single column holding the main injury.
- Consolidate duplicated categorical values (e.g. Other/Other, Plantar fasciitis/Plantar fasciitis).
- Remove duplicates if they exist.
- Clean free-text fields.
- Consolidate data from multiple sessions for the same subject.
- Separate subjects that were registered under the same ID but are different people (e.g. 200375).
- Investigate outliers in the data.
- Investigate missing data.

# Output datasets:
- run_data_meta_cleaned.csv

In [2]:
import pandas as pd
from core.constants import RICKD_RUNNING_METADATA_FILE
from core.data_quality import identify_conflicting_data

In [3]:
run_data_meta = pd.read_csv(RICKD_RUNNING_METADATA_FILE)

In [6]:
# 1. Coalesce SpecInjury and SpecInjury2 into a single column as the main injury
run_data_meta['MainInjury'] = run_data_meta['SpecInjury'].combine_first(run_data_meta['SpecInjury2'])
run_data_meta['MainInjury']

0                      pain
1         disc degeneration
2                     other
3                       NaN
4                       NaN
               ...         
1827                    NaN
1828    Pelvic malalignment
1829          Muscle strain
1830                    NaN
1831                  Other
Name: MainInjury, Length: 1832, dtype: object
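
To make the coalescing rule explicit, here is a minimal sketch on toy data. The values are hypothetical and not taken from `run_data_meta.csv`; only the column names match the notebook. It shows that `combine_first` keeps `SpecInjury` when it is present, falls back to `SpecInjury2` otherwise, and leaves `NaN` when both are missing.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, used only to illustrate the rule)
toy = pd.DataFrame({
    "SpecInjury":  ["pain", np.nan, np.nan],
    "SpecInjury2": ["other", "Muscle strain", np.nan],
})

# Keep SpecInjury where present, otherwise fall back to SpecInjury2
toy["MainInjury"] = toy["SpecInjury"].combine_first(toy["SpecInjury2"])
print(toy)
```

Note that when both columns are filled (first row), the value in `SpecInjury2` is not carried into `MainInjury`, so the original column is still needed if secondary injuries matter later.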

In [None]:
# 1. Coalesce SpecInjury and SpecInjury2 into a single column as the main injury
run_data_meta['MainInjury'] = run_data_meta['SpecInjury'].combine_first(run_data_meta['SpecInjury2'])

# 2. Consolidate categorical data (e.g., 'Other/Other', 'Plantar fasciitis/Plantar fasciitis')
def consolidate_categorical(val):
    if isinstance(val, str) and '/' in val:
        parts = [p.strip() for p in val.split('/')]
        unique_parts = list(dict.fromkeys(parts))  # Remove duplicates, preserve order
        return '/'.join(unique_parts)
    return val

categorical_cols = ['MainInjury']  # Add other categorical columns as needed
for col in categorical_cols:
    run_data_meta[col] = run_data_meta[col].apply(consolidate_categorical)


# 3. Remove duplicates if they exist
run_data_meta = run_data_meta.drop_duplicates()


# 4. Clean free text fields
free_text_cols = ['MainInjury']  # Add other free text columns as needed
for col in free_text_cols:
    # The .str accessor skips NaN; astype(str) would turn missing values into the literal 'nan'
    run_data_meta[col] = run_data_meta[col].str.strip().str.title()



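# 5. Consolidate data from multiple sessions for the same subject
# The subject identifier in this dataset is 'sub_id' (see the describe() output)
subject_col = 'sub_id'

def join_unique(values):
    # Join the distinct non-missing injuries; keep NaN when a subject has none
    unique_vals = sorted(set(values.dropna()))
    return '/'.join(unique_vals) if unique_vals else float('nan')

if subject_col in run_data_meta.columns:
    # Keep the first non-null value of every other column so they are not dropped
    agg_funcs = {col: 'first' for col in run_data_meta.columns if col != subject_col}
    agg_funcs['MainInjury'] = join_unique
    # Adjust the aggregation per column as needed
    run_data_meta = run_data_meta.groupby(subject_col, as_index=False).agg(agg_funcs)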



# 6. Handle subjects that share an ID but are different people (e.g., 200375)
# Manual step: once you know which rows belong to which person, assign new IDs.
# Run this before the consolidation in step 5, otherwise those people are merged.
# Example: run_data_meta.loc[run_data_meta.index == problematic_row_index, 'sub_id'] = '200375_A'
# (You need to specify which rows are which person.)




# 7. Investigate outliers in the data
display(run_data_meta.describe(include='all'))
# You can also use boxplots for visualization if needed
# import matplotlib.pyplot as plt
# run_data_meta.boxplot(rot=90)
# plt.show()



# 8. Investigate missing data
display(run_data_meta.isnull().sum())
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(run_data_meta.isnull(), cbar=False)
plt.show()
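
Step 6 requires knowing which IDs are shared by different people. One possible way to find candidates is to look for IDs whose rows disagree on attributes that should be stable for a single person. The sketch below is an assumption about how that could be done with plain pandas: `find_conflicting_ids` is a hypothetical helper (it is not the `identify_conflicting_data` function imported above, whose signature is not shown in this notebook), and the toy values are invented.

```python
import pandas as pd

def find_conflicting_ids(df, id_col, check_cols):
    """Return the IDs whose rows disagree on any column in check_cols."""
    # nunique() ignores NaN, so a missing value alone is not counted as a conflict
    distinct = df.groupby(id_col)[check_cols].nunique()
    return distinct[(distinct > 1).any(axis=1)].index.tolist()

# Toy data (hypothetical): subject 1 appears with two different genders
toy = pd.DataFrame({
    "sub_id": [1, 1, 2, 2, 3],
    "Gender": ["Female", "Male", "Male", "Male", "Female"],
    "DominantLeg": ["Right", "Right", "Left", None, "Right"],
})

conflicts = find_conflicting_ids(toy, "sub_id", ["Gender", "DominantLeg"])
print(conflicts)
```

IDs flagged this way still need a manual look, since a conflict can also come from a data-entry error for a single person.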

In [5]:
run_data_meta.describe(include='all')

Unnamed: 0,sub_id,datestring,filename,speed_r,age,Height,Weight,Gender,DominantLeg,InjDefn,...,SpecInjury2,Activities,Level,YrsRunning,RaceDistance,RaceTimeHrs,RaceTimeMins,RaceTimeSecs,YrPR,NumRaces
count,1832.0,1832,1832,1832.0,1832.0,1829.0,1830.0,1832,1480,1752,...,320,1516,1563,1315.0,1494,979,1025,924,425.0,504.0
unique,,1823,1832,,,,,3,3,4,...,55,800,2,,9,18,70,47,,
top,,NaT,20101005T132240.json,,,,,Female,Right,No injury,...,pain,running,Recreational,,Casual Runner (no times),HH,MM,SS,,
freq,,10,1,,,,,926,1131,659,...,74,153,1042,,436,424,324,479,,
mean,122721.658843,,,2.76016,38.170306,173.051919,71.017223,,,,...,,,,49.122624,,,,,1930.315294,5.329365
std,41154.448668,,,0.477627,13.145301,29.675143,37.466057,,,,...,,,,191.117688,,,,,394.518125,5.480369
min,100001.0,,,1.172048,18.0,0.0,0.0,,,,...,,,,0.0,,,,,0.0,0.0
25%,100608.75,,,2.482615,28.0,165.2,60.0,,,,...,,,,3.25,,,,,2012.0,2.0
50%,101256.0,,,2.72131,37.0,172.7,69.1,,,,...,,,,8.0,,,,,2012.0,4.0
75%,101795.25,,,2.933408,47.0,179.0,78.4,,,,...,,,,15.0,,,,,2012.0,8.0
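
The summary above hints at values that stand in for missing data: `Height`, `Weight` and `YrPR` have a minimum of 0, the most frequent `datestring` is `NaT`, and the most frequent race-time entries are the placeholders `HH`, `MM` and `SS`. A minimal sketch of turning such sentinels into proper missing values follows. It runs on invented toy rows, and the choice of which columns treat 0 as missing is an assumption that should be checked against the data (for example, 0 is a plausible value for `YrsRunning`).

```python
import pandas as pd

# Toy data (hypothetical rows that mimic the sentinels seen in describe())
toy = pd.DataFrame({
    "Height": [172.7, 0.0, 165.2],
    "Weight": [69.1, 60.0, 0.0],
    "YrPR": [2012.0, 0.0, 2012.0],
    "RaceTimeHrs": ["HH", "3", "HH"],
    "datestring": ["NaT", "2012-05-01 10:00:00", "NaT"],
})

# Assumption: a value of 0 in these columns means "not recorded"
for col in ["Height", "Weight", "YrPR"]:
    toy[col] = toy[col].mask(toy[col] == 0)

# Placeholder text left in the race-time field
toy["RaceTimeHrs"] = toy["RaceTimeHrs"].mask(toy["RaceTimeHrs"] == "HH")

# The literal string 'NaT' becomes a real missing timestamp
toy["datestring"] = pd.to_datetime(
    toy["datestring"].mask(toy["datestring"] == "NaT"), errors="coerce"
)

print(toy.isna().sum())
```

Doing this before computing statistics keeps the zeros from distorting the mean and standard deviation, and makes the missing-data counts reflect what is actually unknown.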
