--- 
# Model Fitting and Techniques 
***

The overall goal of this section is to fit a model of mortality rate from food consumption data using several techniques. First, we establish a null model, which represents the 'average' outcome and serves as a baseline that we then try to improve upon. Next, we fit a multiple linear regression on all of the predictors (all livestock and all crop predictors) and compute a cross-validated $R^2$ for this naive model. Finally, we try more advanced techniques such as Lasso, Ridge, step-wise selection, and regression trees to improve on this model.

To summarize, our null model achieved an $R^2$ score of 0 for all three diseases. Our naive model achieved a cross-validated score of $-4.49$ for diabetes, $-30.41$ for cancer, and $-5.28$ for cardiovascular diseases.

## Null Model

Before fitting the linear regression, we establish a simple null model for the global food consumption data. The null model ignores the individual predictors and always predicts the 'global average' mortality rate for the disease in question. We then use the null model to establish a baseline $R^2$ that we try to improve upon with our linear regression models.
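To make the baseline concrete, here is a minimal sketch of why a model that always predicts the mean scores an $R^2$ of exactly 0. It uses the usual definition $R^2 = 1 - SS_{res}/SS_{tot}$ in plain Python; the helper name `r_squared` and the mortality rates are made up for illustration and are not taken from our dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / float(len(y_true))
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical mortality rates, for illustration only (not from the dataset)
rates = [110.0, 95.0, 140.0, 125.0, 130.0]
mean_rate = sum(rates) / float(len(rates))

# The null model predicts the same mean value for every country
null_predictions = [mean_rate] * len(rates)
null_r2 = r_squared(rates, null_predictions)
print(null_r2)
```

Because the residual sum of squares of the mean prediction equals the total sum of squares, the ratio is 1 and the score is 0. Any model with a positive $R^2$ therefore beats this baseline on the data it is scored on.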

In [11]:
import numpy as np
import matplotlib 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.cm as cmx
import matplotlib.colors as colors
import pandas as pd
import math
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import KFold
from sklearn.linear_model import LinearRegression as LinReg
from sklearn.cross_validation import train_test_split as sk_split
import statsmodels.api as sm

%matplotlib inline

In [2]:
# Read in initial dataframe
x_df = pd.read_csv('predictors_filled_156.csv')

# read in disease rates
diabetes_df = pd.read_csv('diabetes_156.csv',index_col = 0)
cardio_df = pd.read_csv('cardio_156.csv',index_col = 0)
cancer_df= pd.read_csv('cancer_156.csv',index_col = 0)

### Null Model testing:

As expected, testing the null model gives us an $R^2$ of approximately zero for all three diseases.

#### Cancer: 
The null model for cancer will always predict the mean cancer mortality rate. Testing on cancer, we get an $R^2$ of 0.0:

In [3]:
# Null Model Cancer
null_model = LinReg()
null_model.fit(x_df, [np.mean(cancer_df['Cancer Mortality Rate'])]*x_df.shape[0])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [4]:
# Test Cancer.
null_model.score(x_df, cancer_df)

0.0

#### Diabetes
Testing on diabetes, we also get an $R^2$ of 0.

In [5]:
# Fit Diabetes Null Model
null_model = LinReg()
null_model.fit(x_df, [np.mean(diabetes_df['Diabetes Mortality Rate'])]*x_df.shape[0])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [6]:
# Test Diabetes.
null_model.score(x_df, diabetes_df)

0.0

#### Cardiovascular Diseases
Testing on cardiovascular diseases, we again get an $R^2$ of 0.

In [7]:
# Fit Cardiovascular Diseases Null Model
null_model = LinReg()
null_model.fit(x_df, [np.mean(cardio_df['Cardio Mortality Rate'])]*x_df.shape[0])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [8]:
# Test cardiovascular diseases.
null_model.score(x_df, cardio_df)

0.0

# Simple LinReg

## Cancer LinReg

Now, we will fit a simple multiple linear regression to all of the food consumption inputs for each of the diseases. First, for cancer, our regression has an $R^2$ on the training set of .84, and a cross-validated $R^2$ of -30.4 for $k = 5$ folds.

In [12]:
linreg = LinReg()
linreg.fit(x_df, cancer_df)
print "Training r^2:",linreg.score(x_df, cancer_df)

Training r^2: 0.842393103733


In [13]:
# Cross validated R-squared score
np.mean(cross_val_score(LinReg(), x_df,cancer_df, cv = KFold(151, 5), scoring = "r2"))

-30.410282842295079
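A strongly negative cross-validated $R^2$ can look surprising next to a high training score. The sketch below illustrates the mechanism with made-up numbers (not from our dataset): on a held-out fold, $R^2$ drops below 0 as soon as the predictions have a larger squared error than the held-out mean would. The helper `r_squared` and both prediction lists are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / float(len(y_true))
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical held-out fold (illustrative values only)
y_test = [100.0, 110.0, 120.0]

# An overfitted model can miss badly on countries it has not seen
overfit_predictions = [60.0, 150.0, 90.0]
# A constant prediction (e.g. a training-set mean that is slightly off)
constant_predictions = [108.0, 108.0, 108.0]

overfit_r2 = r_squared(y_test, overfit_predictions)
constant_r2 = r_squared(y_test, constant_predictions)
print(overfit_r2)
print(constant_r2)
```

In this toy example the erratic predictions score far below zero, while the constant prediction lands just under zero. This is the pattern we would expect when a regression with many predictors is fit on relatively few countries.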

To further examine the accuracy of this model, the map below displays the fractional difference between the model estimates and the actual cancer data on a world map. The vast majority of countries are colored dark blue/purple, indicating a low fractional difference. Countries colored a brighter purple/pink indicate an overestimate, while countries colored a brighter blue indicate an underestimate.
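As a rough sketch of the quantity being mapped, we assume here that the fractional difference is the signed relative error `(estimate - actual) / actual`, so that positive values are overestimates and negative values are underestimates. This definition, the function name, and the example countries are assumptions made for illustration.

```python
def fractional_difference(estimate, actual):
    """Signed relative error: positive means the model overestimates."""
    return (estimate - actual) / float(actual)

# Hypothetical (estimate, actual) mortality rates, for illustration only
examples = {
    "Country A": (132.0, 120.0),  # model estimate is too high
    "Country B": (90.0, 100.0),   # model estimate is too low
}

diffs = {}
for name, (estimate, actual) in examples.items():
    diffs[name] = fractional_difference(estimate, actual)

for name in sorted(diffs):
    print("{}: {:+.2f}".format(name, diffs[name]))
```

Dividing by the actual rate puts countries with very different mortality levels on a comparable scale, which is what makes a single color scale on the map meaningful.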

In [14]:
# PUT GRAPH HERE

## Diabetes LinReg
For diabetes, our regression has an $R^2$ on the training set of .807, and a cross-validated $R^2$ of -4.49.

In [15]:
linreg = LinReg()
linreg.fit(x_df, diabetes_df)
print "Training R^2", linreg.score(x_df, diabetes_df)

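# Cross-validated R^2 score for diabetes
# Note: in the old sklearn.cross_validation API the signature is KFold(n, n_folds=3),
# so KFold(5) means n=5 samples with the default number of folds, unlike the
# KFold(151, 5) used for cancer and cardiovascular diseases
print "CV R^2 score:",np.mean(cross_val_score(LinReg(), x_df,diabetes_df, cv = KFold(5), scoring = "r2"))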

Training R^2 0.807019987902
CV R^2 score: -4.48569244044


Again, we can examine a world map to see the fractional differences. As before, dark blue/purple colors indicate an accurate estimate, brighter purples/pinks indicate an overestimate, and brighter blues indicate an underestimate. We note that this model has slightly worse performance than our cancer model, which may have to do with the fact that diabetes does not lead to death as commonly as cancer does.

## Cardiovascular Diseases LinReg
For cardiovascular diseases, our regression has an $R^2$ on the training set of .851, and a cross-validated $R^2$ of -5.28.

In [16]:
linreg = LinReg()
linreg.fit(x_df, cardio_df)
print "Training R^2", linreg.score(x_df, cardio_df)
print "Cross-validated R^2", np.mean(cross_val_score(LinReg(), x_df,cardio_df, cv = KFold(151, 5), scoring = "r2"))

Training R^2 0.850938184461
Cross-validated R^2 -5.2779763269


Again, we can examine a world map to see the fractional differences. Dark blue/purple colors indicate an accurate estimate, brighter purples/pinks indicate an overestimate, and brighter blues indicate an underestimate. While this model appears to be fairly accurate for *THESE COUNTRIES*, it could be improved for *THESE*. This might be due to certain predictors, such as *SOME RANDOM PREDICTOR*, that are more heavily weighted for larger countries than for the countries that see a larger fractional difference.

--- 
# Advanced Models 
***

In this section, we will use various other regression techniques and variable selection techniques to attempt to improve upon our naive model. In particular, we will try 

1. Lasso
2. Ridge
3. PCA
4. Regression Tree
5. Step-wise Variable Selection

For reference, our naive model gives us the following training and cross-validated $R^2$ values:

| | Cardio | Diabetes | Cancer |
|--- | --- | --- | ---|
|$R^2$ (Training)| .851 | .807 | .842 |
|$R^2$ (Cross-validated)| -5.28 | -4.49 | -30.41 |

## Lasso

The naive model from the previous section has one major flaw: by including all of the predictors, it is very likely to be overfitted to the initial dataset. We would therefore like to reduce that overfitting by using a variable selection technique such as Lasso to reduce the number of predictors our model includes.
Using `Lasso` from sklearn, we obtain the following best cross-validated $R^2$ values:

| |Cardio   |  Diabetes | Cancer  |
|-----|---|---|---|
|$r^2$ (Lasso) |  .456 |  .31 |   .157|
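To illustrate why Lasso acts as a variable selection technique, the sketch below shows the soft-thresholding rule that the L1 penalty leads to in the simplified case of orthonormal predictors: each least-squares coefficient is pulled toward zero by a fixed amount, and any coefficient smaller than that amount becomes exactly zero. The function name, the `threshold` parameter, and the coefficients are hypothetical, and the threshold is not on the same scale as the `alpha` values used in this notebook.

```python
def soft_threshold(coef, threshold):
    """Shrink coef toward zero by threshold; smaller coefficients become exactly 0."""
    if coef > threshold:
        return coef - threshold
    if coef < -threshold:
        return coef + threshold
    return 0.0

# Hypothetical least-squares coefficients for four predictors
ols_coefs = [2.5, -0.3, 0.8, -1.7]

# Apply the same penalty strength to every coefficient
shrunk = [soft_threshold(c, 1.0) for c in ols_coefs]

# Indices of predictors that survive the penalty
kept = [i for i, c in enumerate(shrunk) if c != 0.0]
print(shrunk)
print(kept)
```

Here the two weakest predictors are dropped entirely, which is the behavior we rely on to reduce the number of predictors. A larger threshold drops more of them, until eventually no predictors remain.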


Below is a function we used to calculate the cross-validated r^2 for lasso over a number of folds for a certain parameter value.

In [40]:
def lasso_k_fold_r_squared(x_train, y_train, num_folds, param_val):
    n_train = x_train.shape[0]
    n = int(np.round(n_train * 1. / num_folds)) # points per fold

    # Iterate over folds
    cv_r_squared = 0
    
    for fold in range(1, num_folds + 1):
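        # Take k-1 folds for training
        # (the "+ 1" offsets below leave rows n*(fold-1) and n*fold out of both
        # the training and the test set for this fold)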
        x_first_half = x_train.iloc[:n * (fold - 1), :]
        x_second_half = x_train.iloc[n * fold + 1:, :]
        x_train_cv = np.concatenate((x_first_half, x_second_half), axis=0)
        
        y_first_half = y_train.iloc[:n * (fold - 1)]
        y_second_half = y_train.iloc[n * fold + 1:]
        y_train_cv = np.concatenate((y_first_half, y_second_half), axis=0)
        
        # Take the middle fold for testing
        x_test_cv = x_train.iloc[1 + n * (fold - 1):n * fold, :]
        y_test_cv = y_train.iloc[1 + n * (fold - 1):n * fold]

        # Fit Lasso model with parameter value on CV train set, and evaluate CV test performance
        reg = Lasso(alpha = param_val, normalize=True)
        reg.fit(x_train_cv, y_train_cv)
        coefficients = reg.coef_
        #print len([i for i, item in enumerate(coefficients) if abs(item) >0])
        r_squared = reg.score(x_test_cv, y_test_cv)
    
        # Cumulative R^2 value across folds
        cv_r_squared += r_squared

    # Return average R^2 value across folds
    return cv_r_squared * 1.0 / num_folds
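In the function above, the `+ 1` offsets in the slices mean that the rows at each fold boundary end up in neither the training nor the test slice. As an alternative sketch (not the code that produced the results in this notebook), the fold boundaries could be computed so that every row is tested exactly once. The helper name `contiguous_folds` is hypothetical, and 151 rows is assumed from the `KFold(151, 5)` calls earlier.

```python
def contiguous_folds(n_rows, num_folds):
    """Yield (train_indices, test_indices) for contiguous, non-overlapping folds."""
    base, extra = divmod(n_rows, num_folds)
    start = 0
    for fold in range(num_folds):
        # The first `extra` folds take one additional row,
        # so fold sizes differ by at most 1
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test_idx = list(range(start, stop))
        # Everything outside [start, stop) is used for training
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, test_idx
        start = stop

# Assumed dataset size (taken from the KFold(151, 5) calls above)
folds = list(contiguous_folds(151, 5))
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

The index lists can be passed to `.iloc[...]` on the predictor and response DataFrames to build each training and test split.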

### Cancer

Running Lasso on different values of alpha yields a top cross-validated $r^2$ score of .157, for alpha = .4

In [60]:
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV

print "LASSO"
for alpha in [.001, .01, .1,.4,.5,.6,1,5,10,100]:
    print "cancer",alpha, lasso_k_fold_r_squared(x_df,cancer_df,5, alpha)

LASSO
cancer 0.001 -10.0863952587
cancer 0.01 -2.62319418962
cancer 0.1 -0.189052791559
cancer 0.4 0.15663394152
cancer 0.5 0.13439828752
cancer 0.6 0.117227909906
cancer 1 0.0530720050382
cancer 5 -0.0745935162681
cancer 10 -0.0745935162681
cancer 100 -0.0745935162681
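Rather than reading the best value off the printed sweep by eye, the winning alpha can be picked programmatically. The sketch below copies the scores printed above into a dictionary and takes the key with the highest cross-validated $r^2$; the variable names are our own.

```python
# Cross-validated r^2 per alpha, copied from the Lasso output for cancer above
cancer_scores = {
    0.001: -10.0863952587,
    0.01: -2.62319418962,
    0.1: -0.189052791559,
    0.4: 0.15663394152,
    0.5: 0.13439828752,
    0.6: 0.117227909906,
    1: 0.0530720050382,
    5: -0.0745935162681,
    10: -0.0745935162681,
    100: -0.0745935162681,
}

# Pick the alpha whose score is largest
best_alpha = max(cancer_scores, key=cancer_scores.get)
best_score = cancer_scores[best_alpha]
print("best alpha: {}, r^2: {:.3f}".format(best_alpha, best_score))
```

Note that the scores are identical for alpha = 5, 10, and 100, which suggests that at those penalty strengths all coefficients have already been shrunk to zero and only the intercept remains.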


### Cardiovascular Diseases
Running Lasso on different values of alpha yields a top cross-validated $r^2$ score of .456, for alpha = 1

In [42]:
print "LASSO"
for alpha in [.001, .1, .5,.8, .9, 1, 2,5, 10, 100, 1000]:
    print "cardio",alpha, lasso_k_fold_r_squared(x_df,cardio_df,5, alpha)
    

LASSO
cardio 0.001 -5.42791396765
cardio 0.1 -0.111196584756
cardio 0.5 0.415841480013
cardio 0.8 0.451848911048
cardio 0.9 0.455639379888
cardio 1 0.456181146633
cardio 2 0.404302118408
cardio 5 0.0132238574549
cardio 10 -0.010213535932
cardio 100 -0.010213535932
cardio 1000 -0.010213535932


In [31]:
reg = Lasso(alpha = 1, normalize=True)
reg.fit(x_df, cancer_df)

Lasso(alpha=1, copy_X=True, fit_intercept=True, max_iter=1000, normalize=True,
   positive=False, precompute=False, random_state=None, selection='cyclic',
   tol=0.0001, warm_start=False)

### Diabetes 
Running Lasso on different values of alpha from .001 to 100 (using 4 folds) yields a top cross-validated $r^2$ score of .31, for alpha = 10

In [39]:
print "LASSO"
for alpha in [.001, .01, .05, .1, .5, 1, 10, 100]:
    print "diabetes",alpha, lasso_k_fold_r_squared(x_df,diabetes_df,4, alpha)

LASSO
diabetes 0.001 -11.0600616776
diabetes 0.01 -7.15802252011
diabetes 0.05 -2.82047154641
diabetes 0.1 -1.89295319289
diabetes 0.5 -0.634815280625
diabetes 1 -0.151038370671
diabetes 10 0.309595435793
diabetes 100 0.243569859518


## Ridge 

In this section, we use Ridge regression to try to improve the cross-validated $r^2$ for our model and reduce overfitting.
Further below is the function we used to calculate the cross-validated $r^2$ for ridge regression over a number of folds for a certain parameter value.

To summarize, we have:


| |Cardio   |  Diabetes | Cancer  |
|-----|---|---|---|
|$r^2$ (Ridge) |  .424 |  .326 |   .168|
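To show how the ridge penalty reduces overfitting, the sketch below uses the closed-form ridge solution for a single centered predictor with no intercept, $\beta = \sum x_i y_i / (\sum x_i^2 + \alpha)$. The function name and the data are hypothetical, and the sketch does not reproduce the `normalize=True` scaling used in this notebook, so its alpha values are not comparable to ours.

```python
def ridge_coef_1d(x, y, alpha):
    """Ridge coefficient for one centered predictor and no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

# Hypothetical centered data where y = 2 * x exactly
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.0, -2.0, 0.0, 2.0, 4.0]

# Larger penalties pull the coefficient toward zero without reaching it
alphas = [0.0, 10.0, 1000.0]
coefs = [ridge_coef_1d(x, y, a) for a in alphas]
for a, c in zip(alphas, coefs):
    print("alpha = {}: coefficient = {:.4f}".format(a, c))
```

Unlike Lasso, the coefficient here shrinks smoothly and never becomes exactly zero for a finite alpha, so Ridge keeps every predictor in the model while damping its influence.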

In [21]:
from sklearn.linear_model import Ridge
def ridge_k_fold_r_squared(x_train, y_train, num_folds, param_val):
    n_train = x_train.shape[0]
    n = int(np.round(n_train * 1. / num_folds)) # points per fold

    # Iterate over folds
    cv_r_squared = 0
    
    for fold in range(1, num_folds + 1):
        # Take k-1 folds for training 
        x_first_half = x_train.iloc[:n * (fold - 1), :]
        x_second_half = x_train.iloc[n * fold + 1:, :]
        x_train_cv = np.concatenate((x_first_half, x_second_half), axis=0)
        
        y_first_half = y_train.iloc[:n * (fold - 1)]
        y_second_half = y_train.iloc[n * fold + 1:]
        y_train_cv = np.concatenate((y_first_half, y_second_half), axis=0)
        
        # Take the middle fold for testing
        x_test_cv = x_train.iloc[1 + n * (fold - 1):n * fold, :]
        y_test_cv = y_train.iloc[1 + n * (fold - 1):n * fold]

        # Fit Ridge model with parameter value on CV train set, and evaluate CV test performance
        reg = Ridge(alpha = param_val, normalize=True)
        reg.fit(x_train_cv, y_train_cv)
        r_squared = reg.score(x_test_cv, y_test_cv)
    
        # Cumulative R^2 value across folds
        cv_r_squared += r_squared

    # Return average R^2 value across folds
    return cv_r_squared * 1.0 / num_folds

### Cardiovascular Diseases
Running Ridge regression on different values of alpha from .001 to 100 yields a top cross-validated $r^2$ score of .424, for alpha = 1.5

In [22]:
print "RIDGE"
for alpha in [.001, .01, .1, .9,1,1.1,1.3,1.5,1.9,5, 10, 11, 100]:
    print "cardio",alpha, ridge_k_fold_r_squared(x_df,cardio_df,5, alpha)

RIDGE
cardio 0.001 -4.9919143728
cardio 0.01 -2.75443505612
cardio 0.1 -0.175066648391
cardio 0.9 0.404437744071
cardio 1 0.410643001
cardio 1.1 0.415244083252
cardio 1.3 0.421038145156
cardio 1.5 0.423729737401
cardio 1.9 0.423597298657
cardio 5 0.36475407367
cardio 10 0.27837000173
cardio 11 0.265254747482
cardio 100 0.0444563936422


### Diabetes

Running Ridge regression on different values of alpha from .001 to 1000 (using 4 folds) yields a top cross-validated $r^2$ score of .326, for alpha = 3

In [23]:
print "RIDGE"
for alpha in [.001, .01, .1, .9,1,3, 5, 10,100, 1000]:
    print "diabetes",alpha, ridge_k_fold_r_squared(x_df,diabetes_df,4, alpha)

RIDGE
diabetes 0.001 -9.77735177572
diabetes 0.01 -3.76622924788
diabetes 0.1 -0.279553857695
diabetes 0.9 0.314979301771
diabetes 1 0.321214840769
diabetes 3 0.325946173079
diabetes 5 0.300072240351
diabetes 10 0.246363686822
diabetes 100 0.0540094335668
diabetes 1000 -0.00498501391727


In [54]:
reg = Ridge(alpha = 1, normalize=True)
reg.fit(x_df, diabetes_df)

Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None, normalize=True,
   random_state=None, solver='auto', tol=0.001)

In [72]:
map(lambda x: x*132.092646274 +319.841721854, [ 126.74044772,  119.02399505,  128.69894356,  117.6477955 ,
        117.81973665,  118.45159687,  123.18981711,  117.55372852,
        117.48444972,  117.31333624,  118.74778987,  119.65541055,
        124.16352213,  117.90761687,  123.02321862,  117.32209914,
        127.88921716,  127.05721709,  125.34798032,  117.32710788,
        117.91666707,  119.38835231,  118.21439185,  118.21908394,
        120.79331652,  117.4917535 ,  125.88519989,  118.29421702,
        118.83615898,  119.65999099,  118.21547272,  119.57074411,
        117.37698261,  118.00182149,  119.89682979,  120.71212605,
        119.12433306,  119.6768647 ,  122.48392108,  118.68048848,
        118.20949186,  120.70527368,  118.46042955,  119.80067375,
        125.01981762,  119.63286591,  119.75614751,  128.33624061,
        122.03570491,  126.14134988,  122.16600829,  117.35045716,
        120.0839777 ,  123.72143454,  117.36797717,  123.76289286,
        118.94773009,  118.04131492,  121.90849495,  128.85462126,
        123.10488598,  117.77727355,  117.70034641,  118.9764192 ,
        124.26159158,  123.47145934,  117.83044911,  121.66413294,
        125.32483126,  117.61977676,  117.60462204,  119.52016724,
        119.69148766,  118.01128748,  118.1185207 ,  126.81086274,
        117.77061031,  118.54514941,  131.21512135,  117.55041134,
        119.21623736,  119.51006939,  119.62577197,  123.53862982,
        119.34162921,  126.09972408,  121.11361867,  129.30116273,
        117.42273745,  121.13058284,  132.93239905,  120.0082676 ,
        119.0549999 ,  125.76653877,  117.73100524,  122.13895347,
        127.36461816,  117.66567067,  117.61869926,  122.83006392,
        117.3078453 ,  121.66308334,  118.0012801 ,  117.51483953,
        118.17176123,  119.9467897 ,  126.94827063,  117.54637903,
        118.58487913,  123.16984085,  119.33132297,  118.79971901,
        130.12137825,  127.91410014,  121.10032152,  123.12683437,
        119.82135166,  118.97319634,  117.55739124,  117.56243848,
        121.77615448,  120.445332  ,  121.31953588,  118.52473907,
        123.1706325 ,  119.95884235,  126.91595444,  127.52828781,
        119.02883873,  122.36543793,  120.81612583,  122.51237294,
        117.49206438,  117.54910862,  118.53496494,  118.74633315,
        129.20443405,  123.11713689,  131.25125182,  117.78927038,
        123.5222537 ,  120.87452718,  131.27148218,  118.177218  ,
        118.28957306,  117.62879086,  119.56342205,  119.52752925,
        118.04673368,  119.65406163,  128.17585226,  117.3682765 ,
        123.4557608 ,  117.83291979,  118.77455836,  119.33976681])

[17061.322851140354,
 16042.036198111977,
 17320.025749362572,
 15860.250357751389,
 15882.962519258283,
 15966.426607793357,
 16592.310657923987,
 15847.824801436187,
 15838.673581413499,
 15816.070749027145,
 16005.551524971192,
 16125.441542405399,
 16720.929930706065,
 15894.570850073225,
 16570.30422251463,
 15817.228263677182,
 17213.066846428654,
 17103.165755482198,
 16877.388147424077,
 15817.889881398278,
 15895.766314940534,
 16090.16511277452,
 15935.093568992079,
 15935.713359576735,
 16275.750553193682,
 15839.638357041502,
 16948.350902055554,
 15945.63788693665,
 16017.224434559968,
 16126.046584846097,
 15935.236343970657,
 16114.2577282952,
 15824.47796646618,
 15907.014587620262,
 16157.331248678456,
 16265.025889159151,
 16055.290111374745,
 16128.275477852458,
 16499.066983326975,
 15996.661506268172,
 15934.446316346262,
 16264.120741472603,
 15967.593339868248,
 16144.629742899628,
 16834.040267972654,
 16122.463561248504,
 16138.748154029397,
 17272.115356885686

### Cancer

Running Ridge regression on different values of alpha from .001 to 1000 yields a top cross-validated $r^2$ score of .168, for alpha = 4

In [63]:
print "RIDGE"

for alpha in [.001, .01, .1, 1, 3, 4, 5, 6, 10, 20, 1000]:
    print "cancer",alpha, ridge_k_fold_r_squared(x_df,cancer_df,5, alpha)

RIDGE
cancer 0.001 -9.11931708715
cancer 0.01 -3.21130503559
cancer 0.1 -0.681543066623
cancer 1 0.0574910468362
cancer 3 0.162757999505
cancer 4 0.168501040516
cancer 5 0.168009123824
cancer 6 0.164780179616
cancer 10 0.144581726236
cancer 20 0.0992315721652
cancer 1000 -0.0668769303926


In [68]:
reg_cancer = Ridge(alpha = 4, normalize=True)
reg_cancer.fit(x_df, cancer_df)
ridge2 = [139.241631199 ,
121.501955508 ,
140.297525424 ,
110.128754036 ,
106.336567507 ,
113.031223791 ,
132.499282598 ,
98.4223726067 ,
102.980979456 ,
104.583660622 ,
106.526359688 ,
130.804125375 ,
141.058425775 ,
115.010609151 ,
134.186246935 ,
102.230232014 ,
137.790323795 ,
145.380501412 ,
138.065311419 ,
98.3235369524 ,
105.34579378 ,
130.376110372 ,
106.923470213 ,
104.567157104 ,
109.735883671 ,
111.025881961 ,
146.228638437 ,
108.977200105 ,
115.621621913 ,
127.735132442 ,
114.265098534 ,
119.312254325 ,
109.751939359 ,
111.910611805 ,
123.517874961 ,
117.916556325 ,
114.298294616 ,
116.935376574 ,
123.354517174 ,
118.278870547 ,
106.331191226 ,
116.494207789 ,
106.798500949 ,
145.62235145 ,
143.717195256 ,
126.536368307 ,
123.593636978 ,
139.623462661 ,
121.503740398 ,
133.203719403 ,
113.372976011 ,
107.296650549 ,
114.828992577 ,
138.151751578 ,
106.049175297 ,
123.416047005 ,
134.003101034 ,
113.57936679 ,
119.513907567 ,
136.358105249 ,
135.344391433 ,
108.385577933 ,
93.5866982007 ,
113.55601884 ,
136.591127827 ,
136.770090972 ,
100.581646229 ,
145.262928377 ,
134.929930685 ,
106.057013059 ,
90.7874853717 ,
125.278009406 ,
118.434689553 ,
111.261483998 ,
111.551775238 ,
146.550288185 ,
111.042906554 ,
117.306982706 ,
148.985945144 ,
97.144577843 ,
120.8391306 ,
116.339694171 ,
119.060148909 ,
128.816958891 ,
115.390415861 ,
144.385662284 ,
126.146708162 ,
149.086505658 ,
94.7621957944 ,
113.079190264 ,
149.257826203 ,
113.64307757 ,
115.940186477 ,
141.604961257 ,
115.892274867 ,
125.742709616 ,
143.869140978 ,
110.824234996 ,
114.380587225 ,
120.585608617 ,
99.6587560981 ,
120.340533695 ,
111.569849799 ,
99.9165354362 ,
114.19402262 ,
111.786208167 ,
132.721795603 ,
103.626698207 ,
115.22703027 ,
137.732188838 ,
121.27873734 ,
111.161927907 ,
141.111100968 ,
138.383025182 ,
132.349352098 ,
117.216431638 ,
118.784355543 ,
108.525066318 ,
116.717076857 ,
119.635855031 ,
112.328308613 ,
127.126284828 ,
123.704252468 ,
113.713431738 ,
136.606596416 ,
102.987820478 ,
137.227747848 ,
135.816766495 ,
115.799459796 ,
132.320292255 ,
122.812506568 ,
139.410723081 ,
116.370613242 ,
103.300872479 ,
126.321863798 ,
115.854807124 ,
153.964245782 ,
132.06102788 ,
142.033829429 ,
112.283990228 ,
128.405908494 ,
119.058200209 ,
141.259064427 ,
122.037738432 ,
114.768593626 ,
107.991645402 ,
122.347076737 ,
110.53269814 ,
106.382939743 ,
127.621901377 ,
137.031200534 ,
104.592572924 ,
133.535283826 ,
104.942560859 ,
111.691404442 ,
116.090079159 ]

In [70]:
map (lambda x: x * 36.9716407189 + 120.637748344, ridge2)

[5268.6293101470055,
 4612.764394029548,
 5307.667452070867,
 4192.2784753831,
 4052.0751174938596,
 4299.587544362434,
 5019.353620069255,
 3759.474347060617,
 3928.013521671654,
 3987.2672739279533,
 4059.092045821048,
 4956.6808762584515,
 5335.799186470923,
 4372.768668736604,
 5081.723459442416,
 3900.2571569753964,
 5214.972094233638,
 5495.593414081998,
 5225.1388378703095,
 3755.8202307596207,
 4015.44458722549,
 4940.856459345236,
 4073.773873477042,
 3986.6571117898598,
 4177.753413399217,
 4225.446766705092,
 5526.950431450695,
 4149.703637177731,
 4395.358813047931,
 4843.215172170731,
 4345.205918052755,
 4531.807548609922,
 4178.347018527448,
 4258.156680630749,
 4687.296243764106,
 4480.206303601835,
 4346.433231669735,
 4443.930478367203,
 4681.256638354507,
 4493.601654844967,
 4051.876347564324,
 4427.61974455179,
 4069.1535547475287,
 5504.535006794787,
 5434.098256476831,
 4798.894895264809,
 4690.09728983677,
 5282.746245775241,
 4612.830384341351,
 5045.3978045328

---
## Regression Trees
---

In this section, we'll try regression trees to see if they improve our model.

Below, we have a function that we'll use to find the cross-validated $r^2$ value for a given number of folds and a given value of the `max_depth` hyperparameter.

In [25]:
from sklearn.tree import DecisionTreeRegressor
def rtree_k_fold_r_squared(x_train, y_train, num_folds, param_val):
    n_train = x_train.shape[0]
    n = int(np.round(n_train * 1. / num_folds)) # points per fold

    # Iterate over folds
    cv_r_squared = 0
    
    for fold in range(1, num_folds + 1):
        # Take k-1 folds for training 
        x_first_half = x_train.iloc[:n * (fold - 1), :]
        x_second_half = x_train.iloc[n * fold + 1:, :]
        x_train_cv = np.concatenate((x_first_half, x_second_half), axis=0)
        
        y_first_half = y_train.iloc[:n * (fold - 1)]
        y_second_half = y_train.iloc[n * fold + 1:]
        y_train_cv = np.concatenate((y_first_half, y_second_half), axis=0)
        
        # Take the middle fold for testing
        x_test_cv = x_train.iloc[1 + n * (fold - 1):n * fold, :]
        y_test_cv = y_train.iloc[1 + n * (fold - 1):n * fold]

        # Fit Decision Tree model with parameter value on CV train set, and evaluate CV test performance
        reg = DecisionTreeRegressor(max_depth=param_val)
        reg.fit(x_train_cv, y_train_cv)
        r_squared = reg.score(x_test_cv, y_test_cv)
    
        # Cumulative R^2 value across folds
        cv_r_squared += r_squared

    # Return average R^2 value across folds
    return cv_r_squared * 1.0 / num_folds
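To give a sense of what `max_depth` controls, the sketch below finds a single regression-tree split on one predictor by trying each midpoint between consecutive values and keeping the threshold with the lowest total squared error. A tree of depth 1 makes one such split, and deeper trees repeat the search inside each resulting group. This is a simplified illustration in plain Python, not sklearn's implementation, and the function name and data are hypothetical.

```python
def best_split(x, y):
    """Return (threshold, sse) for the split on x that minimizes squared error."""
    def sse(values):
        # Sum of squared deviations from the group mean (0 for an empty group)
        if not values:
            return 0.0
        mean = sum(values) / float(len(values))
        return sum((v - mean) ** 2 for v in values)

    best = (None, float("inf"))
    xs = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct x values
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical consumption values and mortality rates, for illustration only
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [100.0, 102.0, 101.0, 150.0, 152.0, 151.0]
threshold, error = best_split(x, y)
print("split at {}, total squared error {}".format(threshold, error))
```

Each additional level of depth lets the tree carve the training data into smaller groups, which lowers the training error but makes the fit more sensitive to individual countries.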

### Cardiovascular Diseases

For cardiovascular diseases, we see the best $r^2 = .424$ for max_depth = 2.

In [26]:
for depth in [2, 3, 4, 5, 10, 50, 100]:
    print depth, rtree_k_fold_r_squared(x_df,cardio_df,5, depth)

2 0.424335403226
3 0.313078907323
4 0.230103986075
5 0.154881054117
10 0.0746420574928
50 0.217683941889
100 0.0820025125761


### Diabetes

For diabetes, we see the best $r^2$ of $.136$ for a max_depth of 2.

In [27]:
for depth in [2, 3, 4, 5, 10, 50, 100]:
    print depth, rtree_k_fold_r_squared(x_df,diabetes_df,5, depth)

2 0.135877261593
3 0.0669070393589
4 -0.0048575681988
5 0.0545434405144
10 -0.150080332407
50 -0.299664889236
100 -0.305172491679


### Cancer

All of the $r^2$ for decision trees on cancer were negative, indicating that these actually perform worse than the null model and as such are not useful models to examine.

In [28]:
for depth in [2, 3, 5, 8, 9, 10, 20, 50, 70, 100]:
    print depth, rtree_k_fold_r_squared(x_df,cancer_df,5, depth)

2 -0.908504830182
3 -0.914686106965
5 -1.02371038451
8 -1.17331433126
9 -0.92569309602
10 -1.11543087937
20 -1.41288087569
50 -1.3521432352
70 -0.972645342176
100 -1.15913214825
