# Using Validation/Cross-Validation For Model Selection

This notebook demonstrates two typical workflows for using validation data to select models. It also demonstrates the use of some utility methods like generating **polynomial features**, converting **categorical features to "dummy variable"** binary columns, and **scaling features** when applying regularization.

**Notebook Contents**

> 1. Simple preprocessing and dummy variables
> 2. Basic validation method: Train/validation/test
> 3. Rigorous validation method: Cross-validation/test
> 4. Making CV less manual via scikit-learn



## 1. Preprocessing

In [1]:
#Data loading: game data set (using game characteristics to predict revenue)
import pandas as pd
import numpy as np

## Load in the Game Data
datafile = "game_data_final_features.csv"
df=pd.read_csv(datafile)

In [11]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,categories,contains_ads,downloads_and_revenue,has_iap,installs,most_popular_country,name,android,price,...,categories_Other,categories_Puzzle,categories_Role Playing,categories_Simulation,categories_Sports,categories_Strategy,revenue_club,revenue_clubx10,log_revenue,Free_Pay
0,0,Adventure,0.0,"{'downloads': '< 5k', 'revenue': '$9k', 'reven...",0,1000.0,US,KIDS,1.0,2.99,...,0,0,0,0,0,0,1,1,9.10498,1
1,1,Casual,1.0,"{'downloads': '5m', 'revenue': '< $5k', 'reven...",0,10000000.0,BR,Mini Block Craft,1.0,0.0,...,0,0,0,0,0,0,1,1,8.517193,0
2,2,Adventure,0.0,"{'downloads': '< 5k', 'revenue': '$10k', 'reve...",0,1000.0,US,Wonder Boy: The Dragon's Trap,1.0,9.99,...,0,0,0,0,0,0,2,100,9.21034,1


We're going to simplify things a bit by focusing on the numeric columns and a single categorical column, `categories`, which has already been expanded into the `categories_*` dummy columns.

In [13]:
# Already have the categories set up.
df.columns

Index(['Unnamed: 0', 'categories', 'contains_ads', 'downloads_and_revenue',
       'has_iap', 'installs', 'most_popular_country', 'name', 'android',
       'price', 'publisher_country', 'publisher_name', 'rating',
       'rating_breakdown', 'rating_count', 'revenue', 'downloads', 'one_star',
       'two_star', 'three_star', 'four_star', 'five_star', 'ios', 'one_star_1',
       'three_star_3', 'four_star_4', 'five_star_5', 'two_star_2',
       'overall_rating', 'categories_Action', 'categories_Adventure',
       'categories_Arcade', 'categories_Board', 'categories_Card',
       'categories_Casino', 'categories_Casual', 'categories_Other',
       'categories_Puzzle', 'categories_Role Playing', 'categories_Simulation',
       'categories_Sports', 'categories_Strategy', 'revenue_club',
       'revenue_clubx10', 'log_revenue', 'Free_Pay'],
      dtype='object')
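The `categories_*` columns above look like the result of one-hot ("dummy variable") encoding of `categories`. Here is a minimal sketch of how such columns can be produced with `pd.get_dummies`. The toy frame and its values are made up for illustration, and this is not necessarily how the columns in this file were generated.

```python
import pandas as pd

# Toy stand-in for the real data; names and values are hypothetical.
toy = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C"],
    "categories": ["Adventure", "Casual", "Adventure"],
})

# One binary column per category value, prefixed with the source column name.
dummies = pd.get_dummies(toy["categories"], prefix="categories", dtype=int)
toy_encoded = pd.concat([toy, dummies], axis=1)

print(toy_encoded.columns.tolist())
```

Each row gets a 1 in the column matching its category and 0 everywhere else.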

In [14]:
game = df.copy()

Now we're ready to start modeling! We're going to try out the validation process to choose between 3 models: simple linear regression, linear regression with ridge regularization, and linear regression with 2nd degree polynomial features.

## 2. Simple Validation Method: Train / Validation / Test

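Here we will break the data into 3 portions: training, validation (used to select the model), and final testing evaluation. We hold out 20% for testing, then 20% of the remaining 80% for validation, which gives roughly a 64% / 16% / 20% split.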

In [49]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge #ordinary linear regression + w/ ridge regularization
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

X = game.loc[:,['contains_ads', 'has_iap',
       'android', 'ios', 'Free_Pay','rating_count', 'downloads',
       'categories_Action','categories_Adventure', 'categories_Arcade', 'categories_Board',
       'categories_Card', 'categories_Casino', 'categories_Casual',
       'categories_Other', 'categories_Puzzle', 'categories_Role Playing',
       'categories_Simulation', 'categories_Sports', 'categories_Strategy', 
       'overall_rating', 'one_star','two_star','three_star', 'four_star', 'five_star']]
y = game.loc[:,['revenue']]
# hold out 20% of the data for final testing
X, X_test, y, y_test = train_test_split(X, y, test_size=.2, random_state=40)

In [50]:
# hold out 20% of the remaining data (16% of the total) for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.2, random_state=50)
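A quick check on the split sizes: because the second split is taken from the 80% left over after the test split, `test_size=.2` there corresponds to 16% of the full data. The sketch below uses 100 dummy indices (an assumption for illustration, not the real data) and also shows that `test_size=.25` on the second split would give an exact 60/20/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

idx = np.arange(100)  # 100 dummy row indices, so counts read as percentages

# Same two-stage pattern as above: 20% test, then 20% of the remainder.
rest, test = train_test_split(idx, test_size=.2, random_state=0)
train, val = train_test_split(rest, test_size=.2, random_state=0)
print(len(train), len(val), len(test))

# For an exact 60/20/20 split, hold out 25% of the remaining 80% instead.
train_b, val_b = train_test_split(rest, test_size=.25, random_state=0)
print(len(train_b), len(val_b), len(test))
```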

Now we need some model setup: **when using regularization, we must standardize** the data so that all features are on the same scale (we subtract the mean of each column and divide by the standard deviation, giving us features with mean 0 and std 1). Since this scaling is part of our model, we need to scale using the training set feature distributions and apply the same scaling to validation and test without refitting the scaler. 
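To make the "fit on train, apply to validation" idea concrete, here is a small sketch on synthetic data (the arrays and the `_demo` names are hypothetical). The scaler learns its mean and standard deviation from the training rows only, so the validation rows end up close to, but not exactly at, mean 0 and std 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50, scale=10, size=(200, 2))
val_demo = rng.normal(loc=50, scale=10, size=(50, 2))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean/std from train
val_scaled = scaler_demo.transform(val_demo)          # reuses the train mean/std

print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print(val_scaled.mean(axis=0).round(3), val_scaled.std(axis=0).round(3))
```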

Also, we need to get **polynomial features** for the poly model
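It helps to know how many columns the polynomial expansion creates. With the 26 input features selected above, a degree 2 expansion produces a bias column, the 26 original features, 26 squares, and 325 pairwise interaction terms. A minimal sketch that counts them (the all-zeros array is just a placeholder with the right number of columns):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 26  # number of columns selected for X above

poly_demo = PolynomialFeatures(degree=2)
poly_demo.fit(np.zeros((1, n_features)))

# 1 bias + 26 linear + 26 squared + 325 interaction terms
print(poly_demo.n_output_features_)
```

That is a lot of columns relative to a small data set, which is worth keeping in mind when reading the validation scores.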

In [51]:
#set up the 3 models we're choosing from:

lm = LinearRegression()

#Feature scaling for train, val, and test so that we can run our ridge model on each
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train.values)
X_val_scaled = scaler.transform(X_val.values)
X_test_scaled = scaler.transform(X_test.values)

lm_reg = Ridge(alpha=1)

#Feature transforms for train, val, and test so that we can run our poly model on each
poly = PolynomialFeatures(degree=2) 

X_train_poly = poly.fit_transform(X_train.values)
X_val_poly = poly.transform(X_val.values)
X_test_poly = poly.transform(X_test.values)

lm_poly = LinearRegression()

Now we can train, validate, and test.

In [52]:
#validate

lm.fit(X_train, y_train)
print(f'Linear Regression val R^2: {lm.score(X_val, y_val):.3f}')

lm_reg.fit(X_train_scaled, y_train)
print(f'Ridge Regression val R^2: {lm_reg.score(X_val_scaled, y_val):.3f}')

lm_poly.fit(X_train_poly, y_train)
print(f'Degree 2 polynomial regression val R^2: {lm_poly.score(X_val_poly, y_val):.3f}')

Linear Regression val R^2: 0.268
Ridge Regression val R^2: 0.297
Degree 2 polynomial regression val R^2: -2.261


Check out that negative R^2, some severe overfitting! 
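A negative R^2 may look odd, so here is a tiny sketch of what it means, using made-up numbers. R^2 compares the model's squared error to that of always predicting the mean of the targets: predicting the mean scores 0, and anything worse than that scores below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])        # worse than predicting the mean

print(r2_score(y_true, mean_pred))
print(r2_score(y_true, bad_pred))
```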

Having run this validation step, we see that ridge regression has the highest validation R^2 (0.297 vs. 0.268 for simple linear regression), though the margin is small. Our validation process lets us **select** that model, and as our final step we retrain it on the entire chunk of train/val data and see how it does on test data:

In [53]:
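# Note: lm_reg is the Ridge model, and X here is the unscaled train/val data
lm_reg.fit(X,y)
print(f'Ridge Regression (unscaled features) test R^2: {lm_reg.score(X_test, y_test):.3f}')

Ridge Regression (unscaled features) test R^2: 0.289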




## 3. Rigorous Validation Method: Cross-Validation / Test

Here we will break the data into 2 portions: 80% for a cross-validated training process, and 20% for final testing evaluation. 

Remember that the idea of CV is to make efficient use of the data available to us (every row in the non-test 80% gets used for both training and validation, instead of a single fixed training portion as above), while also performing multiple validation checks. For k-fold CV, we come up with k train/validation splits of the whole chunk of data, in such a way that **each observation is in the validation set exactly 1 time**. Here's a helpful diagram:

![](images/cross_validation_diagram.png)
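The "exactly 1 time" property is easy to check directly. A minimal sketch on 10 dummy rows (the `_demo` names are hypothetical): we count how often each row index lands in a validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # 10 dummy observations
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

val_counts = np.zeros(len(X_demo), dtype=int)
fold_sizes = []
for train_ind, val_ind in kf_demo.split(X_demo):
    val_counts[val_ind] += 1            # tally validation appearances per row
    fold_sizes.append((len(train_ind), len(val_ind)))

print(val_counts)   # every entry should be 1
print(fold_sizes)   # each fold trains on 8 rows and validates on 2
```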

For simplicity we'll focus on linear regression and ridge regression (we also can feel pretty comfortable throwing out the full degree 2 polynomial regression based on the poor results above!) As we loop through our CV folds, we will train and validate both models and collect the results to compare at the end. Note that we scale the training features within the CV loop.

In [63]:
from sklearn.model_selection import KFold

X = game.loc[:,['contains_ads', 'has_iap',
       'android', 'ios', 'Free_Pay','rating_count', 'downloads',
       'categories_Action','categories_Adventure', 'categories_Arcade', 'categories_Board',
       'categories_Card', 'categories_Casino', 'categories_Casual',
       'categories_Other', 'categories_Puzzle', 'categories_Role Playing',
       'categories_Simulation', 'categories_Sports', 'categories_Strategy', 
       'overall_rating', 'one_star','two_star','three_star', 'four_star', 'five_star']]
y = game.loc[:,['revenue']]

X, X_test, y, y_test = train_test_split(X, y, test_size=.2, random_state=28) #hold out 20% of the data for final testing

#this helps with the way kf will generate indices below
X, y = np.array(X), np.array(y)

In [64]:
#run the CV

kf = KFold(n_splits=5, shuffle=True, random_state = 10)
cv_lm_r2s, cv_lm_reg_r2s = [], [] #collect the validation results for both models

for train_ind, val_ind in kf.split(X,y):
    
    X_train, y_train = X[train_ind], y[train_ind]
    X_val, y_val = X[val_ind], y[val_ind] 
    
    #fresh models for each fold
    lm = LinearRegression()
    lm_reg = Ridge(alpha=1)

    #simple linear regression
    lm.fit(X_train, y_train)
    cv_lm_r2s.append(lm.score(X_val, y_val))
    
    #ridge with feature scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)
    
    lm_reg.fit(X_train_scaled, y_train)
    cv_lm_reg_r2s.append(lm_reg.score(X_val_scaled, y_val))

print('Simple regression scores: ', cv_lm_r2s)
print('Ridge scores: ', cv_lm_reg_r2s, '\n')

print(f'Simple mean cv r^2: {np.mean(cv_lm_r2s):.3f} +- {np.std(cv_lm_r2s):.3f}')
print(f'Ridge mean cv r^2: {np.mean(cv_lm_reg_r2s):.3f} +- {np.std(cv_lm_reg_r2s):.3f}')

Simple regression scores:  [0.37389313504560184, 0.14269151038883976, 0.3908220128228935, 0.30201471730937146, 0.4973273390460588]
Ridge scores:  [0.36921619058932, 0.1457945335894546, 0.396240500755199, 0.3033333311672918, 0.5209416871570955] 

Simple mean cv r^2: 0.341 +- 0.117
Ridge mean cv r^2: 0.347 +- 0.123


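The plot thickens! Cross-validation shows how noisy a single split can be: the fold R^2 scores range from about 0.14 to 0.52 for both models. Ridge is slightly better on average (0.347 vs. 0.341), as it was on the single validation set, but its scores vary slightly more (std 0.123 vs. 0.117), and the gap between the two models is much smaller than the fold-to-fold spread.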

**Since k-fold is more reliable than a single validation set, we select the ridge regression model**. This shows the dangers of relying on simple validation methods, especially when our sample sizes are small.
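One more way to read the fold scores printed above is to compare the two models fold by fold, since both were trained and validated on the same splits. The sketch below just reuses the printed numbers; the paired comparison itself is an addition here, not part of the original workflow.

```python
import numpy as np

# Per-fold validation R^2 values copied from the CV output above
simple = np.array([0.37389313504560184, 0.14269151038883976, 0.3908220128228935,
                   0.30201471730937146, 0.4973273390460588])
ridge = np.array([0.36921619058932, 0.1457945335894546, 0.396240500755199,
                  0.3033333311672918, 0.5209416871570955])

diff = ridge - simple                 # positive means ridge did better on that fold
ridge_wins = int((diff > 0).sum())

print(f'Ridge better in {ridge_wins} of {len(diff)} folds')
print(f'Mean per-fold difference: {diff.mean():.4f}')
```

The per-fold differences are tiny compared with the spread of the scores themselves, so the preference for ridge is a mild one.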

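Finally, we refit on all of the training data and evaluate on the test set. Note that this test set comes from a different random split than in section 2 (`random_state=28` vs. `40`), so the two test scores are not directly comparable.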

In [65]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(X_test)

lm_reg = Ridge(alpha=1)
lm_reg.fit(X_scaled,y)
print(f'Ridge Regression test R^2: {lm_reg.score(X_test_scaled, y_test):.3f}')

Ridge Regression test R^2: 0.384


## The mean CV R^2 and the test R^2 both stay below 0.4

## Let's try a log transform, for better or worse
### Some background on the log
`numpy.log(x[, out])` computes the natural logarithm, element-wise.

The natural logarithm log is the inverse of the exponential function, so that log(exp(x)) = x. The natural logarithm is logarithm in base e.

By taking logarithms of variables which are multiplicatively related and/or growing exponentially over time, we can often explain their behavior with linear models.

It straightens out exponential growth patterns and reduces **heteroscedasticity** (i.e., stabilizes variance).

When the marginal effect of one variable on the expected value of another is linear in terms of percentage changes rather than absolute changes, applying a natural log or diff-log transformation to both dependent and independent variables may be appropriate.

Small changes in the natural log of a variable are directly interpretable as percentage changes.
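Two of these points are easy to see numerically. The sketch below uses a made-up exponential series (the starting value 100 and growth rate 0.5 are arbitrary assumptions): after taking the log, a straight-line fit recovers the growth rate, and the log of a 1% increase is very close to 0.01.

```python
import numpy as np

t = np.arange(10)
y_exp = 100 * np.exp(0.5 * t)   # exponential growth, hypothetical parameters

# log(y) = log(100) + 0.5 * t is a straight line in t
slope, intercept = np.polyfit(t, np.log(y_exp), 1)
print(slope, np.exp(intercept))

# A small change in log is approximately the percentage change
print(np.log(1.01))
```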



Common reasons to log-transform a variable in a regression:

1) Your variable has a right skew (mean > median). Taking the log would make the distribution of your transformed variable appear more symmetric (more normal). However, this is not a very good reason to log your variable. There are no regression assumptions that require your independent or dependent variables to be normal. However, if you have outliers in your dependent or independent variables, a log transformation could reduce the influence of those observations.

2) The variance of your regression residuals is increasing with your regression predictions. Taking the log of your dependent and/or independent variables may eliminate the heteroscedasticity.

3) Your regression residuals are non-normal. This may or may not be a problem for you. Even if your residuals are non-normal, your estimates may still be BLUE (best linear unbiased estimates). However, in order for you to trust your inferences, you could log the dependent and/or independent variables and then check if the residuals are normal after the log transformation.

4) It transforms a non-linear model into a linear model. In economics, the production function is a non-linear combination of labor ($L$) and capital ($K$), $Y = K^a L^b$, where $a$ and $b$ are parameters that you want to estimate. If you log both sides of the equation and add a constant $c$, you get $\log(Y) = c + a\log(K) + b\log(L)$. The latter equation can be estimated using linear regression.
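The production function example can be sketched in a few lines. Everything here is synthetic: the parameter values, the noise-free data, and the scale factor `A_true` (whose log plays the role of the constant in the logged equation) are assumptions chosen so the result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
K = rng.uniform(1, 10, size=200)   # capital (synthetic)
L = rng.uniform(1, 10, size=200)   # labor (synthetic)

a_true, b_true, A_true = 0.3, 0.7, 2.0
Y = A_true * K**a_true * L**b_true  # non-linear in K and L

# After logging: log(Y) = log(A) + a*log(K) + b*log(L), which is linear
X_log = np.column_stack([np.log(K), np.log(L)])
model = LinearRegression().fit(X_log, np.log(Y))

a_hat, b_hat = model.coef_
c_hat = model.intercept_
print(a_hat, b_hat, np.exp(c_hat))
```

With real, noisy data the estimates would only be approximate, but the same log-log setup applies.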






### In my case: why take the logarithm of my target, revenue?
- 1) Revenue is bounded below: in principle at 0, and in this data at 5000. A linear model can predict any real value, and the log maps positive values onto the whole real line, so the lower bound at 0 disappears (the observed minimum becomes log(5000) ≈ 8.52).

- 2) It straightens out exponential growth patterns and reduces **heteroscedasticity** (i.e., stabilizes variance).

- 3) Revenue has a heavy right tail (skew). Applying the log makes y look more normal.
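The skew point can be illustrated with a synthetic right-skewed series. The lognormal draw below is a stand-in for revenue, not the real column, and its parameters are arbitrary. The sketch compares the sample skewness before and after the log and shows that `np.exp` undoes the transform, which is what we would use to turn predictions on the log scale back into revenue.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, heavily right-skewed stand-in for revenue
revenue_demo = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=5000))
log_revenue_demo = np.log(revenue_demo)

print(f'skew before log: {revenue_demo.skew():.2f}')
print(f'skew after log:  {log_revenue_demo.skew():.2f}')

# exp() inverts the log, so log-scale predictions can be mapped back
recovered = np.exp(log_revenue_demo)
print(np.allclose(recovered, revenue_demo))
```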