--- 
# Advanced Models 
***

In this section, we apply several other regression and variable selection techniques in an attempt to improve on our naive model. In particular, we try:

1. Lasso
2. PCA
3. Regression Tree
4. Step-wise Variable Selection

For reference, our naive model gives the following cross-validated $R^2$ values using $k = 5$ folds:

| | Cardio | Diabetes | Cancer |
|---|---|---|---|
|$R^2$ (Naive) | sdf | sdf | sdf |
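
As a rough illustration of how a 5-fold cross-validated $R^2$ like the one above can be computed, here is a minimal sketch. It assumes the naive model is an ordinary least-squares regression on all predictors, and it uses synthetic data from `make_regression` as a stand-in for the real predictors and outcome; both are assumptions, not the actual setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real predictors and one disease outcome (assumption).
X, y = make_regression(n_samples=150, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Assumed naive model: ordinary least squares on every predictor.
naive = LinearRegression()

# k = 5 folds; shuffling guards against any ordering in the rows.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
naive_scores = cross_val_score(naive, X, y, cv=cv, scoring="r2")

# Report the mean R^2 across the five held-out folds.
naive_r2 = naive_scores.mean()
print(f"Naive cross-validated R^2: {naive_r2:.3f}")
```

The same `cv` object can be reused for every other model so that all of them are scored on identical folds.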

## Lasso

The naive model from the previous section has one major flaw: because it includes all of the predictors, it is very likely to overfit the initial dataset. To reduce that overfitting, we use a variable selection technique, Lasso, to cut down the number of predictors in the model.
Using `LassoCV` from sklearn, we obtain the following cross-validated $R^2$:

| |Cardio   |  Diabetes | Cancer  |
|-----|---|---|---|
|$R^2$ (Lasso) |  f |  f |   f|
|$R^2$ (Naive) |  f |  f |   f|


In [1]:
# Code for Lasso
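
A minimal sketch of what the Lasso step might look like, assuming synthetic data in place of the real predictors and assuming the predictors are standardized before fitting (the source does not say whether they are). `LassoCV` picks the penalty strength by an inner cross-validation, and an outer 5-fold loop gives the reported $R^2$.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real predictors and one disease outcome (assumption).
X, y = make_regression(n_samples=150, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Standardize first so the L1 penalty treats every predictor on the same scale.
# LassoCV chooses the penalty strength alpha with its own inner 5-fold CV.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))

# Outer 5-fold CV gives an R^2 estimate that is not biased by the alpha search.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
lasso_scores = cross_val_score(lasso, X, y, cv=outer_cv, scoring="r2")
print(f"Lasso cross-validated R^2: {lasso_scores.mean():.3f}")

# Refit on all rows to see how many predictors keep a nonzero coefficient.
lasso.fit(X, y)
coefs = lasso.named_steps["lassocv"].coef_
n_kept = int((coefs != 0).sum())
print(f"Predictors kept by Lasso: {n_kept} of {X.shape[1]}")
```

Counting the nonzero coefficients is a quick way to check how much variable selection the penalty actually performed.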

## PCA
For similar reasons, we also tested a PCA-based model, which reduces the predictors to a smaller set of principal components before regressing on them. Using `sklearn.decomposition.PCA` and `model_selection.cross_val_score`, we obtained the following cross-validated $R^2$:

| |Cardio   |  Diabetes | Cancer  |
|-----|---|---|---|
|$R^2$ (PCA) |  f |  f |   f|
|$R^2$ (Naive) |  f |  f |   f|

In [2]:
# Code for PCA
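
One way the PCA-based model could be set up is as a pipeline that standardizes, projects onto principal components, and then fits a linear regression. This is a sketch on synthetic data; the choice of 5 components and the standardization step are assumptions, not values taken from the analysis.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real predictors and one disease outcome (assumption).
X, y = make_regression(n_samples=150, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Principal component regression: scale, project, then regress on the components.
# n_components=5 is an illustrative choice, not a tuned value.
pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("ols", LinearRegression()),
])

# Because PCA sits inside the pipeline, it is refit on each training fold only,
# so the held-out fold never influences the components.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
pca_scores = cross_val_score(pcr, X, y, cv=cv, scoring="r2")
print(f"PCA cross-validated R^2: {pca_scores.mean():.3f}")

# Refit on all rows to inspect how much variance the components retain.
pcr.fit(X, y)
explained = pcr.named_steps["pca"].explained_variance_ratio_
print(f"Variance explained by 5 components: {explained.sum():.3f}")
```

Keeping PCA inside the pipeline, rather than transforming the whole dataset once up front, avoids leaking information from the validation folds.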

## Regression Trees
Again with the aim of relying on fewer predictors, we also tested a regression-tree-based model. Using a regression tree estimator (such as `sklearn.tree.DecisionTreeRegressor`) and `model_selection.cross_val_score`, we obtained the following cross-validated $R^2$:

| |Cardio   |  Diabetes | Cancer  |
|-----|---|---|---|
|$R^2$ (Regression Tree) |  f |  f |   f|
|$R^2$ (Naive) |  f |  f |   f|

In [3]:
# Code for regression Trees
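
A regression tree can be scored the same way. The sketch below assumes `sklearn.tree.DecisionTreeRegressor` with a depth limit of 4, evaluated on synthetic data; neither the estimator settings nor the data come from the actual analysis.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real predictors and one disease outcome (assumption).
X, y = make_regression(n_samples=150, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# A shallow tree; max_depth=4 is an illustrative limit to curb overfitting.
tree = DecisionTreeRegressor(max_depth=4, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
tree_scores = cross_val_score(tree, X, y, cv=cv, scoring="r2")
print(f"Regression tree cross-validated R^2: {tree_scores.mean():.3f}")

# Refit on all rows; predictors never used in a split get zero importance,
# which is how the tree implicitly performs variable selection.
tree.fit(X, y)
importances = tree.feature_importances_
n_used = int((importances > 0).sum())
print(f"Predictors used in at least one split: {n_used} of {X.shape[1]}")
```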

## Step-wise Variable Selection
As a final option, we tried step-wise variable selection. Using **SOME PACKAGE** and `model_selection.cross_val_score`, we obtained the following cross-validated $R^2$:

| |Cardio   |  Diabetes | Cancer  |
|-----|---|---|---|
|$R^2$ (Step-wise) |  f |  f |   f|
|$R^2$ (Naive) |  f |  f |   f|

In [4]:
# Code for step-wise
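
The package used for step-wise selection is not specified above. As one possible approach, the sketch below uses scikit-learn's `SequentialFeatureSelector` for forward selection. Note that this selector adds predictors by cross-validated score rather than by p-values or AIC as in classical step-wise regression, so it is a swapped-in technique rather than necessarily the method used here. The data and the target of 5 selected predictors are also assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real predictors and one disease outcome (assumption).
X, y = make_regression(n_samples=150, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Forward selection: start with no predictors and greedily add the one that
# most improves the inner cross-validated score, until 5 are selected.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward", cv=5
)
stepwise = make_pipeline(selector, LinearRegression())

# The selector is refit inside each outer training fold, so the held-out
# fold plays no part in choosing the predictors.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
stepwise_scores = cross_val_score(stepwise, X, y, cv=cv, scoring="r2")
print(f"Step-wise cross-validated R^2: {stepwise_scores.mean():.3f}")

# Refit on all rows to see which predictor columns were chosen.
stepwise.fit(X, y)
support = stepwise.named_steps["sequentialfeatureselector"].get_support()
print("Selected predictor indices:", [i for i, keep in enumerate(support) if keep])
```

Setting `direction="backward"` would instead start from all predictors and remove them one at a time.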

Again, we can examine a world map to see the fractional differences. As before, dark blue/purple colors indicate an accurate estimate, brighter purples/pinks indicate an overestimate, and brighter blues indicate an underestimate. We tentatively note (**NOTE: NOT YET VERIFIED**) that the diabetes model performs slightly worse than our cancer model, perhaps indicating that diabetes is less related to dietary intake than **SOME OTHER DISEASE**.

## Preliminary Analysis/Results

Judging by $R^2$ alone, **SOME MODEL** performs poorly and can be eliminated from consideration. In the data analysis section, we examine the remaining models in more detail to determine which is the best choice.

| |Cardio   |  Diabetes | Cancer  |
|-----|---|---|---|
|$R^2$ (Naive) |  f |  f |   f|
|$R^2$ (Lasso) |  f |  f |   f|
|$R^2$ (PCA) |  f |  f |   f|
|$R^2$ (Regression Tree) |  f |  f |   f|
|$R^2$ (Step-wise) |  f |  f |   f|

Again, we can examine a world map to see the fractional differences. Dark blue/purple colors indicate an accurate estimate, brighter purples/pinks indicate an overestimate, and brighter blues indicate an underestimate. While this model appears to be fairly accurate for *THESE COUNTRIES*, it could be improved for *THESE*. This might be due to certain predictors, such as *SOME RANDOM PREDICTOR*, that are more heavily weighted for larger countries than for the countries showing a larger fractional difference.
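
For concreteness, the fractional difference plotted on the map can be sketched as follows. The definition used here, (predicted − actual) / actual, is an assumption chosen so that positive values are overestimates and negative values are underestimates, matching the color description above. The function name and the numbers are hypothetical and are not results from the models.

```python
import numpy as np

def fractional_difference(predicted, actual):
    """Signed error relative to the actual value (assumed definition)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return (predicted - actual) / actual

# Hypothetical per-country values, for illustration only.
actual = np.array([200.0, 100.0, 50.0])
predicted = np.array([220.0, 100.0, 40.0])

frac = fractional_difference(predicted, actual)

# Positive: overestimate; zero: accurate; negative: underestimate.
for value in frac:
    label = "over" if value > 0 else "under" if value < 0 else "accurate"
    print(f"{value:+.2f} ({label})")
```

In practice the `predicted` array would come from out-of-fold predictions, for example via `sklearn.model_selection.cross_val_predict`, so that each country is scored by a model that did not see it during fitting.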