This notebook contains Aaron's EDA, feature selection/reduction, and first attempts at a few predictive regression models (including hyperparameter tuning).

The outline:

1. Import the data. These include the accelerometer predictors. We should consider a separate analysis without these variables, since relatively few participants have accelerometer data
2. Remove a couple of variables we won't be using, including:
    * id
    * ID
    * BIA-BIA_BMI (since we already have a BMI variable as part of the physical group)
    * PCIAT-PCIAT_01 through 20
    * FitnessGram Zone variables that were used to compute the Zone Total (i.e., CU, PU, SRL, SRR, TL) *Note - not entirely sure if we should be doing this*
3. Examine correlations between each predictor (individually) and PCIAT-PCIAT_Total
4. Examine NaN counts for all variables and, potentially, remove variables that:
    * Have very large NaN counts AND
    * Don't have face value as predictors AND
    * Have low correlations with the outcome variable
5. Examine pairwise correlations between all predictors and, potentially, remove one variable from each pair with extremely high (>0.9) correlation
6. Explore interactions between Seasons and associated predictors:
    1. For each predictor that is associated with a Season variable (e.g., within the Physical variables), make a scatterplot of the predictor vs. outcome and display regression lines by Season
    2. If there aren't any clear interactions, remove the Season variable from the list of predictors
    3. If there do appear to be interactions, create dummy variables from the Season variable
7. Create a linear regression model using a greedy algorithm from the "bottom up"
    1. Make a list of all numerical predictors and also a new empty data frame with 100(?) rows and the predictors as variables
    2. Randomly select a predictor from the list and create a linear model
    3. Randomly select a second predictor from the list and add it to the model
    4. Perform an F test to see if the new model is significantly better than the old
    5. Repeat until the F test is no longer significant
    6. Record the predictors that are in the model in the newly-created data frame
    7. Repeat the above steps 100 (??) times
    8. Compute the mean for each predictor in the data frame. This should give some sense of the "importance" of each predictor
8. Repeat the previous method but using a "top down" algorithm, starting with a full model and removing predictors one-by-one
9. *Maybe* try PCA followed by either linear or KNN regression to see if it appears to improve prediction
    * PCA on the entire set of predictors
    * On each set of grouped predictors
10. Use RandomForest regression on the entire set of predictors and examine the feature importances to try to find a potential list of predictors
11. *Maybe* try an XGBoost model. (Need to learn what this is)
12. Remove highly correlated predictors and use LASSO regression (with hyperparameter tuning) to identify important predictors
13. Compare the apparent predictive power of all the previous methods. If none stand out, then stick with linear regression(?)
14. Start to engage more formally with the modeling process, using Kfold splits

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load the cleaned training data set
train_cleaned = pd.read_csv('train_cleaned.csv')

In [5]:
# Remove some variables we won't need

# Remove the variables 'id', 'ID', 'BIA-BIA_BMI'
train_cleaned = train_cleaned.drop(['id', 'ID', 'BIA-BIA_BMI'], axis=1)

# Remove the 20 PCIAT item variables (PCIAT-PCIAT_01 through PCIAT-PCIAT_20),
# keeping PCIAT-PCIAT_Total
pciat_items = train_cleaned.columns.str.match(r'PCIAT-PCIAT_\d{2}$')
train_cleaned = train_cleaned.loc[:, ~pciat_items]

# Remove the FitnessGram Zone variables used to compute the Zone Total
# (CU, PU, SRL, SRR, TL), matching the list in the outline
zone_cols = [f'FGC-FGC_{name}_Zone' for name in ['CU', 'PU', 'SRL', 'SRR', 'TL']]
train_cleaned = train_cleaned.drop(columns=zone_cols)
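
Steps 3 through 5 of the outline (correlation with the outcome, NaN counts, and highly correlated predictor pairs) might start from something like the sketch below. The helper names `screen_predictors` and `high_corr_pairs` are hypothetical, only numeric columns are considered, and the 0.9 threshold is taken from the outline; nothing here decides which variables to drop.

```python
import numpy as np
import pandas as pd


def screen_predictors(df, target='PCIAT-PCIAT_Total'):
    """Hypothetical helper: correlation with the outcome and NaN fraction per numeric predictor."""
    numeric = df.select_dtypes(include='number')
    predictors = numeric.drop(columns=target)
    summary = pd.DataFrame({
        # Pearson correlation, computed on the rows where both values are present
        'corr_with_target': predictors.corrwith(numeric[target]),
        'nan_fraction': predictors.isna().mean(),
    })
    # Strongest absolute correlations first
    return summary.sort_values('corr_with_target', key=lambda s: s.abs(), ascending=False)


def high_corr_pairs(df, target='PCIAT-PCIAT_Total', threshold=0.9):
    """Hypothetical helper: predictor pairs whose absolute correlation exceeds the threshold."""
    predictors = df.select_dtypes(include='number').drop(columns=target)
    corr = predictors.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)
```

Calling `screen_predictors(train_cleaned)` and `high_corr_pairs(train_cleaned)` would then give two tables to inspect side by side when applying the three removal criteria in step 4.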