**Goal**

The goal of this notebook is to create code that takes the raw problematic internet use data as input, incorporates additional variables, and outputs "cleaned" data.

Note that this is not written as a class that could be called from within a scikit-learn pipeline.

If we wanted to do that, here is a Kaggle notebook that has useful-looking examples:
https://www.kaggle.com/code/ksvmuralidhar/creating-custom-transformers-using-scikit-learn
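
For reference, here is a minimal sketch of what one cleaning step might look like as a pipeline-compatible transformer. The class name `SitReachAverager` and its parameters are hypothetical and are not used anywhere else in this notebook; the toy data is made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class SitReachAverager(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: averages the left/right sit & reach columns."""

    def __init__(self, left='FGC-FGC_SRL', right='FGC-FGC_SRR', out='FGC-FGC_SR'):
        self.left = left
        self.right = right
        self.out = out

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit just returns self
        return self

    def transform(self, X):
        X = X.copy()  # avoid modifying the caller's DataFrame
        X[self.out] = X[[self.left, self.right]].mean(axis=1)
        return X.drop(columns=[self.left, self.right])


# Toy data, made up for illustration
toy = pd.DataFrame({'FGC-FGC_SRL': [8.0, 10.0], 'FGC-FGC_SRR': [10.0, 12.0]})
pipe = Pipeline([('sit_reach', SitReachAverager())])
cleaned = pipe.fit_transform(toy)
print(cleaned)
```

Each cleaning step below could in principle be wrapped the same way, with any statistics that must be learned from the training data (for example outlier bounds) computed in `fit`.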

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

In [2]:
# The original data has been named train_original.
# Here I am splitting the data into a train and test set, stratified by age.
# Then I export both as CSV files, since we are working across multiple Jupyter notebooks.

from sklearn.model_selection import train_test_split

train_original=pd.read_csv('train_original.csv')

seed=1275
train, test = train_test_split(train_original, test_size=0.2, random_state=seed, stratify=train_original['Basic_Demos-Age'])

train.to_csv('train.csv', index=False)
test.to_csv('test.csv',index=False)

In [3]:
#This is the starting data.
train=pd.read_csv('train.csv')

**Adding Accelerometer Data**

We have accelerometer data that should be merged into this data set.

Note that there are some participants who appear to have accelerometer data but aren't listed in train (they're likely in test). So we'll need to do a 'left' join to avoid incorporating participants who aren't in train
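
To illustrate why a 'left' join is the right choice, here is a small sketch on made-up data. The column name `ENMO_Avg` and the ids are hypothetical; only the `id`/`ID` key names come from the actual files.

```python
import pandas as pd

# Toy data, made up for illustration.
# Participant 'c' has accelerometer data but is not in train;
# participant 'b' is in train but has no accelerometer data.
toy_train = pd.DataFrame({'id': ['a', 'b'], 'Basic_Demos-Age': [9, 12]})
toy_accel = pd.DataFrame({'ID': ['a', 'c'], 'ENMO_Avg': [0.03, 0.05]})  # 'ENMO_Avg' is a hypothetical column

# A left join keeps every row of toy_train and only the matching rows of toy_accel
merged = toy_train.join(toy_accel.set_index('ID'), on='id', how='left')
print(merged)
```

Participant 'c' does not appear in the result, and participant 'b' is kept with NaN in the accelerometer column.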

In [4]:
# Load the accelerometer data set Accelerometer_enmo_anglez_daily_averages.csv
accel = pd.read_csv('Accelerometer_enmo_anglez_daily_averages.csv')

# Join train and accel, matching the 'id' column of train to the 'ID' column of accel
train = train.join(accel.set_index('ID'), on='id', how='left')

# It seems unlikely that we're going to want the ENMO_Avg_All_Days_MVPA192, ENMO_Avg_All_Days_MVPA110, or Positive_Anglez_All_Days variables, so remove them
# Note: These variables no longer get generated from the Acclerometer_Computations, so we shouldn't need to remove them
#train = train.drop(columns=['ENMO_Avg_All_Days_MVPA192', 'ENMO_Avg_All_Days_MVPA110', 'Positive_Anglez_All_Days']) 

**Averaging Sit & Reach Data**

The Sit & Reach test is done twice, once with the left leg extended (SRL) and once with the right leg extended (SRR). Measuring this twice seems redundant, so we'll create a new variable that is the average of the right & left variables, then delete the SRL and SRR variables.
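
One detail worth knowing about the row-wise mean: pandas skips NaN by default, so a participant with only one leg measured still gets a value. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({'FGC-FGC_SRL': [8.0, np.nan, np.nan],
                    'FGC-FGC_SRR': [10.0, 7.0, np.nan]})

# mean(axis=1) averages across the two columns within each row, ignoring NaN
sr = toy[['FGC-FGC_SRL', 'FGC-FGC_SRR']].mean(axis=1)
print(sr)
```

The first row averages both legs, the second row falls back to the one available measurement, and the third row stays NaN because both are missing.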

In [5]:
# Create a new variable 'FGC-FGC_SR' that is the mean of FGC-FGC_SRL and FGC-FGC_SRR
train['FGC-FGC_SR'] = train[['FGC-FGC_SRL', 'FGC-FGC_SRR']].mean(axis=1)

# Remove the old sit & reach variables
train = train.drop(columns=['FGC-FGC_SRL', 'FGC-FGC_SRR', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR_Zone'])

**Calculating Sit & Reach Zone**

FitnessGram Healthy Fitness Zones are documented at https://pftdata.org/files/hfz-standards.pdf

We can use these to compute a new Zone variable for sit & reach.

Note that Basic_Demos-Sex is coded as 0=Male and 1=Female.

In [6]:
# Create a new variable 'FGC-FGC_SR_Zone' that is equal to 1 if any of the following are true:
# Basic_Demos-Sex==0 and FGC-FGC_SR >= 8
# Basic_Demos-Sex==1 and FGC-FGC_SR >= 9 and Basic_Demos-Age is between 5 and 10
# Basic_Demos-Sex==1 and FGC-FGC_SR >= 10 and Basic_Demos-Age is between 11 and 14
# Basic_Demos-Sex==1 and FGC-FGC_SR >= 12 and Basic_Demos-Age is at least 15

# One way to do this is to define a function that takes sex, age, and SR value as inputs and outputs 1 or 0
def sitreachzone(sex, age, sr):
    try:
        if np.isnan(sr):
            return np.nan
        elif sex == 0 and sr >= 8:
            return 1
        # Each age band is bounded so that a score below one band's cutoff
        # cannot fall through and pass on a younger band's lower cutoff
        elif sex == 1 and age >= 15 and sr >= 12:
            return 1
        elif sex == 1 and 11 <= age <= 14 and sr >= 10:
            return 1
        elif sex == 1 and 5 <= age <= 10 and sr >= 9:
            return 1
        else:
            return 0
    except TypeError:
        # Non-numeric input: treat the zone as missing
        return np.nan

# Apply sitreachzone to create a new column using the columns Basic_Demos-Sex, Basic_Demos-Age, and FGC-FGC_SR as inputs
train['FGC-FGC_SR_Zone'] = train.apply(lambda x: sitreachzone(x['Basic_Demos-Sex'], x['Basic_Demos-Age'], x['FGC-FGC_SR']), axis=1)


# Note: Boolean masks with .loc are vectorized, so they are much faster than .apply. Below is a faster version that we could use if necessary
#sex, age, sr = train['Basic_Demos-Sex'], train['Basic_Demos-Age'], train['FGC-FGC_SR']
#train['FGC-FGC_SR_Zone'] = 0.0
#train.loc[(sex==0) & (sr >= 8), 'FGC-FGC_SR_Zone'] = 1
#train.loc[(sex==1) & (age >= 15) & (sr >= 12), 'FGC-FGC_SR_Zone'] = 1
#train.loc[(sex==1) & (age >= 11) & (age <= 14) & (sr >= 10), 'FGC-FGC_SR_Zone'] = 1
#train.loc[(sex==1) & (age >= 5) & (age <= 10) & (sr >= 9), 'FGC-FGC_SR_Zone'] = 1
#train.loc[sr.isnull(), 'FGC-FGC_SR_Zone'] = np.nan
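
Another vectorized option is to look up each row's cutoff with `np.select` and then compare the score against it. This is only a sketch on made-up data, following the cutoffs listed in the comments at the top of this cell; it is not what the notebook currently runs.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (sex: 0=Male, 1=Female)
toy = pd.DataFrame({
    'Basic_Demos-Sex': [0, 0, 1, 1, 1, 1, 1],
    'Basic_Demos-Age': [9, 9, 8, 12, 12, 16, 16],
    'FGC-FGC_SR':      [8.0, 7.5, 9.0, 9.5, 10.0, 11.0, np.nan],
})
sex, age, sr = toy['Basic_Demos-Sex'], toy['Basic_Demos-Age'], toy['FGC-FGC_SR']

# Minimum passing score for each row. np.select uses the first condition
# that matches, so the age bands are checked from oldest to youngest.
cutoff = np.select(
    [sex == 0, (sex == 1) & (age >= 15), (sex == 1) & (age >= 11), (sex == 1) & (age >= 5)],
    [8, 12, 10, 9],
    default=np.nan,  # no cutoff defined (e.g. girls under 5): the comparison below is False
)

# 1 if the score meets the cutoff, 0 if not, NaN if the score is missing
zone = pd.Series(np.where(sr >= cutoff, 1.0, 0.0), index=toy.index).where(sr.notna())
print(zone.tolist())
```

The 12-year-old girl with a score of 9.5 gets 0 here, because the 11-14 cutoff is 10 even though 9.5 would pass the cutoff for ages 5-10.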


**Creating the PAQ_MVPA_60 Variable**

Some research (https://pubmed.ncbi.nlm.nih.gov/27759968/) has identified cut-off scores of 2.75 (PAQ-A) and 2.73 (PAQ-C) to discriminate >60 minutes of moderate-to-vigorous physical activity (MVPA). However, the study suggests that, while the cutoff is significant for PAQ-A, it isn't for PAQ-C.

With that caveat, before combining PAQ-A and PAQ-C, we'll create new binary variables that flag >60 minutes of MVPA. In the code these are named PAQA_Zone and PAQC_Zone; they are merged into a single PAQ_Zone variable in the next section.
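
The thresholding step has to keep missing scores missing rather than turning them into 0. A compact way to do that, sketched on made-up PAQ-A scores:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({'PAQ_A-PAQ_A_Total': [3.10, 2.75, 2.00, np.nan]})
score = toy['PAQ_A-PAQ_A_Total']

# The comparison gives True/False (NaN compares as False);
# .where(score.notna()) then restores NaN wherever the score was missing
flag = (score >= 2.75).astype(float).where(score.notna())
print(flag.tolist())
```

A score exactly at the cutoff (2.75) is flagged as 1, matching the ">=" used in the cell below.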

In [7]:
# Create a new variable that is 1 when PAQA/C Total is at least 2.75/2.73, 0 if it's less than these cutoffs, and NaN if PAQA/C is NaN
train['PAQA_Zone'] = np.where(train['PAQ_A-PAQ_A_Total']>=2.75, 1, 0)
train['PAQA_Zone'] = np.where(train['PAQ_A-PAQ_A_Total'].isnull(), np.nan, train['PAQA_Zone'])

train['PAQC_Zone'] = np.where(train['PAQ_C-PAQ_C_Total']>=2.73, 1, 0)
train['PAQC_Zone'] = np.where(train['PAQ_C-PAQ_C_Total'].isnull(), np.nan, train['PAQC_Zone'])

**Combining PAQ_A and PAQ_C Predictors**

The variables PAQ_A (Season and Total) and PAQ_C (Season and Total) both report "Information about children's participation in vigorous activities over the last 7 days." 
* More information about PAQ-C is available here: https://fcon_1000.projects.nitrc.org/indi/cmi_healthy_brain_network_old/assessments/paq-c.html. It is administered to participants ages 8-14.
* More information about PAQ-A is available here: https://fcon_1000.projects.nitrc.org/indi/cmi_healthy_brain_network_old/assessments/paq-a.html. It is administered to participants ages 14-19.

These scores appear to be comparable, so we can combine them. However, prior to doing so, we should note that there could be participants who had scores for both measures. This would occur if their age was recorded in a different season than the PCIAT and/or PAQ seasons.

If we were being really careful, we'd construct a complicated function to account for these cases. However, exploration of the original training set (3600 participants) only found one such case, so it seems like it might be a relatively rare event. In addition, by combining the two PAQ columns we are assuming that the two scores are comparable. Thus, it makes sense to keep either of the A/C values.

For the one subject in the original data, their recorded age was 13; their PCIAT-Season and PAQ_C-Season were Spring, so when we combine these variables we'll keep the PAQ_C values.
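
A sketch of the same "prefer PAQ_C, fall back to PAQ_A" logic on made-up data, together with a count of participants who have both scores. It uses `Series.combine_first`, which is an alternative to the `.loc` assignments in the cell below rather than what the notebook runs.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; the third participant has both scores
toy = pd.DataFrame({
    'PAQ_C-PAQ_C_Total': [2.5, np.nan, 3.0, np.nan],
    'PAQ_A-PAQ_A_Total': [np.nan, 1.8, 2.2, np.nan],
})
paq_c, paq_a = toy['PAQ_C-PAQ_C_Total'], toy['PAQ_A-PAQ_A_Total']

# How many participants have both a PAQ_C and a PAQ_A score?
n_both = int((paq_c.notna() & paq_a.notna()).sum())

# Keep the PAQ_C value where it exists, otherwise use the PAQ_A value
paq_total = paq_c.combine_first(paq_a)
print(n_both, paq_total.tolist())
```

For the participant with both scores, the PAQ_C value (3.0) wins; a participant with neither score stays NaN.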

In [8]:
# Create new variables that merge the three pairs of PAQ-A/PAQ-C variables (Total, Season, Zone)
train['PAQ_Total']=train['PAQ_C-PAQ_C_Total']
train.loc[train['PAQ_Total'].isnull(),'PAQ_Total']=train['PAQ_A-PAQ_A_Total']

train['PAQ_Season']=train['PAQ_C-Season']
train.loc[train['PAQ_Season'].isnull(),'PAQ_Season']=train['PAQ_A-Season']

train['PAQ_Zone']=train['PAQC_Zone']
train.loc[train['PAQ_Zone'].isnull(),'PAQ_Zone']=train['PAQA_Zone']

# Drop the PAQ variables we no longer need
train=train.drop(columns=['PAQ_C-PAQ_C_Total', 'PAQ_A-PAQ_A_Total', 'PAQ_C-Season', 'PAQ_A-Season', 'PAQA_Zone', 'PAQC_Zone'])

**Creating the Fitness Endurance Variable**

There are currently separate variables for fitness endurance minutes & seconds. We'll combine these into a single variable that measures the total number of seconds.
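
A quick sketch of the arithmetic on made-up values. Note that if either the minutes or the seconds entry is missing, the total is NaN.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({'Fitness_Endurance-Time_Mins': [7.0, 0.0, np.nan],
                    'Fitness_Endurance-Time_Sec':  [30.0, 45.0, 10.0]})

# total seconds = minutes * 60 + seconds; NaN in either column propagates
total_sec = toy['Fitness_Endurance-Time_Mins'] * 60 + toy['Fitness_Endurance-Time_Sec']
print(total_sec.tolist())
```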

In [9]:
# Combine the minutes and seconds of Fitness_Endurance into a single number (total number of seconds)
train['Fitness_Endurance_Total_Time_Sec'] = train['Fitness_Endurance-Time_Mins'] * 60 + train['Fitness_Endurance-Time_Sec']

train=train.drop(columns=['Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec'])

**Sleep Disturbance Scale Variable Removal**

The sleep disturbance scale was created/documented in 1996: https://pubmed.ncbi.nlm.nih.gov/9065877/

There are two Sleep Disturbance Scale variables: the "Raw" score and the Total T-Score. 
I can't find much information about what these mean. But if the T-Score is just a standardized version of the Raw score, then they should be conveying identical information.

Their correlation is 0.995731. It seems reasonable and safe to drop one of them.
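
To illustrate the reasoning: if a T-score were an exact linear rescaling of the raw score, the Pearson correlation between the two would be 1. The sketch below uses made-up raw scores and a hypothetical T-score formula (mean 50, standard deviation 10); whether the dataset's T-score is computed this way is an assumption.

```python
import pandas as pd

# Toy raw scores, made up for illustration
raw = pd.Series([30.0, 42.0, 38.0, 55.0, 47.0, 61.0])

# Hypothetical T-score: a linear rescaling to mean 50 and standard deviation 10
t_score = 50 + 10 * (raw - raw.mean()) / raw.std()

# Pearson correlation is unchanged by a positive linear rescaling
corr = raw.corr(t_score)
print(corr)
```

A value slightly below 1, like the one reported above, means the relationship in the data is very close to linear but not exactly so.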

In [10]:
# Remove the SDS-SDS_Total_T variable from train
train=train.drop(columns=['SDS-SDS_Total_T'])

**BIA Variable Removal**

Some of the BIA variables appear to either be redundant or computed from each other:
* BIA-BIA_BMI is measuring BMI. It has correlation 0.965105 with Physical_BMI. I can't find any information about how it differs. It seems likely that BMI varies from day to day and measurement to measurement, so the difference is likely due to measurement error. We'll remove BIA-BIA_BMI.

The other BIA variables are:
* BIA-BIA_BMC: Bone Mineral Content
* BIA-BIA_BMR: Basal Metabolic Rate
* BIA-BIA_DEE: Daily Energy Expenditure
* BIA-BIA_LDM: Lean Dry Mass
* BIA-BIA_LST: Lean Soft Tissue
* BIA-BIA_SMM: Skeletal Muscle Mass
* BIA-BIA_FFM: Fat Free Mass
* BIA-BIA_Fat: Body Fat Percentage
* BIA-BIA_Frame_num: Body Frame
* BIA-BIA_TBW: Total Body Water
* BIA-BIA_ECW: Extracellular Water
* BIA-BIA_ICW: Intracellular Water
* BIA-BIA_FMI: Fat Mass Index, calculated as FM divided by height squared
* BIA-BIA_FFMI: Fat Free Mass Index, calculated as FFM divided by height squared

There exist large correlations between many pairs of these predictors. Below is a list of all pairs with at least 0.99 correlation:
* FFM	BMR	0.9999999999991445
* BMR	TBW	0.9996843922178612
* FFM	TBW	0.9996843882486972
* ECW	BMR	0.99949744961038
* ECW	FFM	0.9994974489820077
* TBW	ECW	0.9994121539401678
* TBW	ICW	0.9989842717508111
* FFM	LDM	0.9989337841842884
* BMR	LDM	0.9989337768885788
* SMM	ICW	0.9984531398777805
* ICW	BMR	0.9981423016493606
* FFM	ICW	0.998142293296351
* ECW	LDM	0.9980088142821305
* TBW	SMM	0.9976684856970098
* TBW	LDM	0.9974587197051186
* ICW	LST	0.9969014072300802
* SMM	BMR	0.996864975676666
* FFM	SMM	0.9968649649869374
* ECW	ICW	0.996852207975357
* LST	SMM	0.9967543066184706
* SMM	ECW	0.9957140182568774
* TBW	LST	0.995487594282981
* BMC	LDM	0.9950199542105168
* LDM	ICW	0.9949518859314233
* ICW	DEE	0.9948088046663506
* DEE	TBW	0.9944107160263206
* DEE	BMR	0.9941186367579632
* DEE	FFM	0.9941186344807962
* LDM	SMM	0.9937473329914341
* LST	BMR	0.9936088620550899
* LST	FFM	0.99360884373626
* DEE	SMM	0.9933843458863473
* LST	ECW	0.9930574064543182
* ECW	DEE	0.992754802616644
* DEE	LDM	0.9919453675327521
* FFM	BMC	0.9914085178617896
* BMR	BMC	0.9914084966327703
* LST	DEE	0.9912776109118937
* BMC	ECW	0.9909706551293764

If we keep the BIA-BIA_FFM variable, then we could eliminate:
* BMR
* TBW
* ECW
* ICW
* LDM
* SMM
* DEE
* LST
* BMC
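
A list like the one above can be produced by scanning every pair of columns in a correlation matrix. The helper name `high_corr_pairs` is hypothetical, and the toy data is made up; this is a sketch of one way to do it, not necessarily how the list above was generated.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.99):
    """Return (col_a, col_b, corr) for each column pair with |corr| >= threshold."""
    corr = df.corr()
    pairs = [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)  # each unordered pair once
        if abs(corr.loc[a, b]) >= threshold
    ]
    # Strongest correlations first
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Toy data, made up for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'FFM': x,
    'BMR': 2 * x + 1,             # exact linear function of FFM
    'Fat': rng.normal(size=200),  # unrelated to the other two
})
pairs = high_corr_pairs(toy)
print(pairs)
```

On the real data, this would be called on the BIA columns of `train` before any of them are dropped.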

In [11]:
# Remove BIA-BIA_BMI from train
train=train.drop(columns=['BIA-BIA_BMI'])

# Remove the following variables from train: BIA-BIA_BMR, BIA-BIA_TBW, BIA-BIA_ECW, BIA-BIA_LDM, BIA-BIA_ICW, BIA-BIA_SMM, BIA-BIA_DEE, BIA-BIA_LST, and BIA-BIA_BMC
train=train.drop(columns=['BIA-BIA_BMR', 'BIA-BIA_TBW', 'BIA-BIA_ECW', 'BIA-BIA_LDM', 'BIA-BIA_ICW', 'BIA-BIA_SMM', 'BIA-BIA_DEE', 'BIA-BIA_LST', 'BIA-BIA_BMC'])

**Removing Negative Values**

None of the quantitative variables should have negative values, so we'll make a list of these numerical variables and replace any negative entries with NaN.
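
A sketch of the idea on made-up data, using `DataFrame.mask` restricted to the float columns so that non-numeric columns are left alone. The column names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({'height': [50.0, -1.0],
                    'weight': [-3.0, 80.0],
                    'label':  ['a', 'b']})

float_cols = toy.select_dtypes(include=['float']).columns

# mask() replaces entries where the condition is True with NaN
toy[float_cols] = toy[float_cols].mask(toy[float_cols] < 0)
print(toy)
```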

In [12]:
# Create a list of numerical columns of type float. Note that these columns include the "Zone" variables which are really categorical/ordinal:
float_columns = train.select_dtypes(include=['float']).columns

# Change negative values to NaN
train[train[float_columns] < 0] = np.nan

**Removing 0 Values from Physical Variables**

All of the Physical- variables should have non-zero values (this is not the case for most other variables).

In [13]:
# For each variable that starts with 'Physical-' replace any values that are 0 with NaN
for column in train.columns:
    if column.startswith('Physical-'):
        train[column] = train[column].replace(0, np.nan)

**Removing Outliers**

We'll define an "outlier" as any value in a column that is more than 5 standard deviations above or below the mean.
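
A sketch of this rule on a made-up column. The mean and standard deviation are computed once, from the original values, and both bounds are checked against them.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: 99 ordinary values plus one extreme value
values = pd.Series([10.0] * 50 + [12.0] * 49 + [1000.0])

mean, std = values.mean(), values.std()

# Replace anything more than 5 standard deviations from the mean with NaN
cleaned = values.mask((values - mean).abs() > 5 * std)
print(cleaned.isna().sum())
```

Because an extreme value also inflates the mean and standard deviation it is compared against, this rule can only flag a value when the column has enough other observations; in a very short column no single value can be 5 standard deviations from the mean.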

In [14]:
# For each column in float_columns, identify entries that are more than 5 standard deviations above or below the mean and replace them with NaN
for column in float_columns:
    # Compute the mean and standard deviation once, so both bounds use the same statistics
    col_mean, col_std = train[column].mean(), train[column].std()
    train[column] = train[column].mask((train[column] - col_mean).abs() > 5 * col_std)

**Export Dataframe to CSV**

We'll do this here to support our EDA work. But the code above will eventually make its way into a unified code file.

In [15]:
train.to_csv('train_cleaned.csv', index=False)