# Advanced ML Phase 1 Journal

## EDA

## Random Forest with Aggregation Statistics

The goal of our models is to predict our target variables, y1 and y2, for the 5 minutes after the last timestamp of a given observation unit. An observation unit is usually about 30-60 minutes of health data, sampled every minute. In our dataset, the target variables at each time step are the values of the health data for that 5-minute window after the observation. This means that every row within the same observation window has the same values for the two target variables. This leads to two key insights:

1. The problem is not a time series problem. Since we don't see how y1 and y2 vary over time, we are really trying to solve a regression problem with each observation as a single instance. This leads us to insight two.

2. We need to find a way to collapse the entries for each observation window into a single entry. One observation window has one set of target values, so it would be inappropriate to pass every row to a model as if it were a separate sample: rows from the same window share their targets and are strongly correlated with each other, which breaks the IID assumption.
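As a quick sanity check on the two insights above, the sketch below verifies that the targets really are constant within each window and then keeps one target row per window. The column name `obs_id`, the feature name `heart_rate`, and the helper `collapse_targets` are hypothetical, and the toy values are made up; only `y1` and `y2` come from our data.

```python
import pandas as pd


def collapse_targets(df, group_col="obs_id", target_cols=("y1", "y2")):
    """Return one row of targets per observation window.

    `group_col` is a hypothetical name for the window identifier.
    """
    target_cols = list(target_cols)
    # Each window should carry exactly one distinct value per target.
    distinct = df.groupby(group_col)[target_cols].nunique()
    if not (distinct == 1).all().all():
        raise ValueError("targets vary within at least one observation window")
    # Safe to keep the first row's targets, since all rows agree.
    return df.groupby(group_col)[target_cols].first()


# Toy data: two windows of three minutes each (values are invented).
toy = pd.DataFrame({
    "obs_id": [0, 0, 0, 1, 1, 1],
    "heart_rate": [61.0, 63.0, 62.0, 80.0, 79.0, 82.0],
    "y1": [1.5, 1.5, 1.5, 2.0, 2.0, 2.0],
    "y2": [0.3, 0.3, 0.3, 0.7, 0.7, 0.7],
})
targets = collapse_targets(toy)
print(targets)
```

If the check raises, the "one set of targets per window" assumption does not hold for that data and the collapsing step needs rethinking.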

Two options for collapsing our feature set are PCA and some combination of aggregation statistics. First, we attempt the simpler aggregation statistics method. This means turning each observation window into a single feature vector with the mean, std, min, and max values per feature.

This is pretty simple to implement. All we have to do is use the `.groupby` and `.agg` methods of pandas DataFrames to group the data by observation and calculate the mean, standard deviation, min, and max for each feature. This expands our dataset to $4 \text{ stats} \times 13 \text{ original features} = 52 \text{ features}$.
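A minimal sketch of that aggregation step is below. The identifier column `obs_id`, the feature names `f0` to `f12`, and the helper `aggregate_windows` are placeholders rather than the real column names, and the data is random.

```python
import numpy as np
import pandas as pd


def aggregate_windows(df, feature_cols, group_col="obs_id"):
    """Collapse each observation window into one row of summary statistics."""
    agg = df.groupby(group_col)[feature_cols].agg(["mean", "std", "min", "max"])
    # Flatten the (feature, stat) MultiIndex columns into names like "f0_mean".
    agg.columns = [f"{feat}_{stat}" for feat, stat in agg.columns]
    return agg


# Toy data: 4 windows x 30 minutes x 13 features of random noise.
rng = np.random.default_rng(0)
n_windows, n_minutes, n_features = 4, 30, 13
feature_cols = [f"f{i}" for i in range(n_features)]
toy = pd.DataFrame(
    rng.normal(size=(n_windows * n_minutes, n_features)), columns=feature_cols
)
toy.insert(0, "obs_id", np.repeat(np.arange(n_windows), n_minutes))

X = aggregate_windows(toy, feature_cols)
print(X.shape)  # (4, 52)
```

One thing to watch for: `std` is NaN for any window that contains a single row, so very short windows would need to be filled or dropped before fitting.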

After creating the new training set, we split off 20% to serve as our validation set. We then conducted a randomized search with the following parameters:  

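```python
import numpy as np
from scipy.stats import randint

# Search space sampled by the randomized search
param_grid = {
    'n_estimators': np.arange(1, 1000, 100),
    'max_features': [None],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 11),
}
```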

We keep `max_features` set to `None`, which lets every split consider all of the features, because it doesn't make sense to hide a single aggregation statistic of a feature from the RF while leaving the others in. The model submitted to Kaggle for this method had an MAE of 4.81445, although this was without hyperparameter tuning. We would expect the MAE to go down for the optimal RF configuration, but we are mostly using RF as a stepping stone, so we won't spend too much time on hyperparameters.
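For reference, a sketch of how the split and randomized search could be wired together with scikit-learn is below. The data is synthetic, and `n_iter=3`, `cv=3`, and fitting both targets with one multi-output forest are assumptions made to keep the example small, not a record of what we actually ran.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the aggregated data: 80 windows x 52 features,
# with two targets playing the role of y1 and y2.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 52))
y = np.column_stack([
    2.0 * X[:, 0] + rng.normal(scale=0.1, size=80),
    -1.0 * X[:, 1] + rng.normal(scale=0.1, size=80),
])

# Hold out 20% of the windows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

param_distributions = {
    "n_estimators": np.arange(1, 1000, 100),
    "max_features": [None],
    "max_depth": [None, 10, 20, 30, 40, 50],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 11),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=3,                             # assumption: tiny budget for the demo
    cv=3,                                 # assumption: 3-fold CV
    scoring="neg_mean_absolute_error",    # MAE, matching the Kaggle metric
    random_state=0,
)
search.fit(X_train, y_train)

val_mae = mean_absolute_error(y_val, search.predict(X_val))
print(search.best_params_)
print(f"validation MAE: {val_mae:.4f}")
```

Because the split happens after aggregation, each window lands entirely in either the training or the validation set, so no window leaks across the split.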

## Random Forest with PCA

The alternative to summarizing the features of each observation window with aggregation statistics is Principal Component Analysis (PCA). To do this, we must flatten the features of each observation window into a wide DataFrame with one row per window. Each column can best be described as feature n for row k of the window. We then rely on PCA to make sense of this. The scree plot used to choose the number of components K is shown below; it suggests keeping only the first 6-7 components.

![PCA Scree Plot](images/RF_PCA_Scree_Plot.png)
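The flatten-then-PCA step can be sketched roughly as follows. The column names (`obs_id`, `f0` to `f12`) and the helper `flatten_windows` are hypothetical. The sketch also assumes every window has the same number of rows and that features are standardized before PCA; real windows of 30-60 minutes would first need truncating or padding to a common length.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def flatten_windows(df, feature_cols, group_col="obs_id"):
    """Pivot each window into one wide row with columns like 'f3_t17'."""
    df = df.copy()
    # Position of each row inside its window (0, 1, 2, ...).
    df["step"] = df.groupby(group_col).cumcount()
    wide = df.set_index([group_col, "step"])[feature_cols].unstack("step")
    wide.columns = [f"{feat}_t{step}" for feat, step in wide.columns]
    return wide


# Toy data: 40 equal-length windows x 30 minutes x 13 random features.
rng = np.random.default_rng(0)
n_windows, n_minutes, n_features = 40, 30, 13
feature_cols = [f"f{i}" for i in range(n_features)]
toy = pd.DataFrame(
    rng.normal(size=(n_windows * n_minutes, n_features)), columns=feature_cols
)
toy.insert(0, "obs_id", np.repeat(np.arange(n_windows), n_minutes))

wide = flatten_windows(toy, feature_cols)          # 40 rows x 390 columns
scaled = StandardScaler().fit_transform(wide)      # assumption: scale first
pca = PCA(n_components=10).fit(scaled)

# These ratios are the heights plotted in a scree plot.
ratios = pca.explained_variance_ratio_
print(np.round(ratios, 3))

# Keep the first 7 components as model inputs.
components = pca.transform(scaled)[:, :7]
```

With random toy data the ratios fall off slowly; on real data, the elbow in these values is what motivates the 6-7 component cutoff.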

After using the same process to fit an RF as before, we get an MAE of 5.34102 on the Kaggle test data. This is noticeably worse than our aggregation statistics model (4.81445), which raises the question of why. One explanation is that keeping only 7 components is too few, and we are cutting off signal. Another is that random forest and PCA don't pair well together. PCA is most useful when features are highly correlated, since it condenses multicollinear features into a few uncorrelated components. However, RF is largely insensitive to multicollinearity, and feeding it only 7 features might result in trees that are too similar. This difference is something to keep in mind for the future, but we should continue to try both the PCA and aggregation statistics methods on future models.

## XGBoost