# MODEL VALIDATION

Model validation consists of:
* Ensuring your model performs as expected on new data
* Testing model performance on holdout datasets
* Selecting the best model, parameters, and accuracy metrics
* Achieving the best accuracy for the data given

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Basic Steps in Modeling: Candy Data

Models tend to have higher accuracy on observations they have seen before. In the candy dataset, predicting the popularity of Skittles will likely be more accurate than predicting the popularity of Andes Mints, because Skittles is in the dataset and Andes Mints is not.

You've built a model on 70% of the candy data (X_train) and need to report how accurate it is at predicting the popularity of the candies it was built on, as well as the 30% of candies (X_test) it has never seen. You will use the mean squared error, mse(), as the metric; because it measures error, lower values are better.
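
As a minimal sketch of what the mean squared error measures, the snippet below computes it by hand and compares the result with scikit-learn's `mean_squared_error`. The arrays are made-up values for illustration, not rows from the candy data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted win percentages (not taken from the candy data)
y_true = np.array([60.0, 45.0, 70.0, 30.0])
y_pred = np.array([55.0, 50.0, 65.0, 40.0])

# MSE is the mean of the squared differences between truth and prediction
manual_mse = np.mean((y_true - y_pred) ** 2)
sklearn_mse = mean_squared_error(y_true, y_pred)

print(manual_mse, sklearn_mse)  # both are 43.75
```

Because the errors are squared, one large miss (the 10-point error here) contributes far more than several small ones.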

### Explore Data

Load the data from the CSV file

In [2]:
candy = pd.read_csv(r'candy-data.csv')

View the first 5 observations

In [3]:
candy.head(5)

Unnamed: 0,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,100 Grand,1,0,1,0,0,1,0,1,0,0.732,0.86,66.971725
1,3 Musketeers,1,0,0,0,1,0,0,1,0,0.604,0.511,67.602936
2,One dime,0,0,0,0,0,0,0,0,0,0.011,0.116,32.261086
3,One quarter,0,0,0,0,0,0,0,0,0,0.011,0.511,46.116505
4,Air Heads,0,1,0,0,0,0,0,0,0,0.906,0.511,52.341465


Show the summary statistics

In [4]:
candy.describe()

Unnamed: 0,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
count,85.0,85.0,85.0,85.0,85.0,85.0,85.0,85.0,85.0,85.0,85.0,85.0
mean,0.435294,0.447059,0.164706,0.164706,0.082353,0.082353,0.176471,0.247059,0.517647,0.478647,0.468882,50.316764
std,0.498738,0.50014,0.373116,0.373116,0.276533,0.276533,0.383482,0.433861,0.502654,0.282778,0.28574,14.714357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011,0.011,22.445341
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22,0.255,39.141056
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.465,0.465,47.829754
75%,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.732,0.651,59.863998
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.988,0.976,84.18029


### Splitting Train and Test Data

In [5]:
X = candy.drop(['winpercent','competitorname'],axis=1)
y = candy['winpercent']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=1111)

### Fitting Model

In [6]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

In [10]:
# The model is fit using X_train and y_train
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Getting the Parameter Info

In [9]:
model.get_params()

{'bootstrap': True,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

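Well done! Recalling which parameters were used will be helpful going forward. Model validation and performance rely heavily on which parameters were used, and there is no way to replicate a model without keeping track of them. Note that random_state is None here, so refitting will not reproduce exactly the same forest; set it to an integer when you need repeatable results.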
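
One possible way to keep track of the parameters is to serialize the output of `get_params()` and rebuild the model from it later. This is a sketch under assumptions: the choice of JSON and the variable names are illustrative, and `random_state` is fixed here (unlike the model above) so that a refit would be repeatable.

```python
import json
from sklearn.ensemble import RandomForestRegressor

# Fix random_state so that refitting gives the same forest
original = RandomForestRegressor(n_estimators=10, random_state=1111)

# Serialize the hyperparameters; in practice this string would be written to a file
saved = json.dumps(original.get_params())

# Later: rebuild an unfitted model with the same settings
restored = RandomForestRegressor(**json.loads(saved))

print(restored.get_params() == original.get_params())
```

This stores only the settings, not the fitted trees, so the restored model still has to be fit on the same training data.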

### Prediction

In [None]:
# Create vectors of predictions
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

### Model Validation

In [8]:
from sklearn.metrics import mean_squared_error as mse

# Train/Test Errors
train_error = mse(y_true=y_train, y_pred=train_predictions)
test_error = mse(y_true=y_test, y_pred=test_predictions)

# Print the error for seen and unseen data
print("Model error on seen data: {0:.2f}.".format(train_error))
print("Model error on unseen data: {0:.2f}.".format(test_error))

Model error on seen data: 23.69.
Model error on unseen data: 144.02.


Excellent. When a model performs much worse on testing data than on training data (here an MSE of 144.02 versus 23.69), it is overfitting the training set, and you should look to model validation to find a better performing model. In the next lesson, you will start building models to validate.

## Splitting Data

Seen vs. Unseen data:
* Seen data (used for training)
* Unseen data (unavailable for training)
<img src='Capture.png'>

### Ratio Between Train and Test

Example (Train:Test):
* 80:20
* 90:10 (used when we have little data)
* 70:30 (used when model is computationally expensive)

In [13]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=1111)

print('X_train:', X_train.shape[0])
print('X_test:', X_test.shape[0])

X_train: 59
X_test: 26


<b>Parameters in train_test_split:</b>

test_size : float, int or None, optional (default=0.25)
    If float, should be between 0.0 and 1.0 and represent the proportion
    of the dataset to include in the test split. If int, represents the
    absolute number of test samples. If None, the value is set to the
    complement of the train size. By default, the value is set to 0.25.
    The default will change in version 0.21. It will remain 0.25 only
    if ``train_size`` is unspecified, otherwise it will complement
    the specified ``train_size``.

train_size : float, int, or None, (default=None)
    If float, should be between 0.0 and 1.0 and represent the
    proportion of the dataset to include in the train split. If
    int, represents the absolute number of train samples. If None,
    the value is automatically set to the complement of the test size.

random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used
    by `np.random`.

shuffle : boolean, optional (default=True)
    Whether or not to shuffle the data before splitting. If shuffle=False
    then stratify must be None.

stratify : array-like or None (default=None)
    If not None, data is split in a stratified fashion, using this as
    the class labels.
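
The following sketch tries out `test_size` and `shuffle` on a toy array, to make the docstring above concrete. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features (illustrative only)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Float test_size: a proportion of the data (0.25 * 20 = 5 test samples)
_, X_te_frac, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Integer test_size: an absolute number of test samples
_, X_te_abs, _, _ = train_test_split(X_demo, y_demo, test_size=3, random_state=0)

# shuffle=False keeps the original order, so the test set is the tail of the data
_, _, _, y_te_tail = train_test_split(X_demo, y_demo, test_size=0.25, shuffle=False)

print(len(X_te_frac), len(X_te_abs), y_te_tail)
```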

### Splitting by Stratification

It is important to stratify the split on (at least) the target variable, especially in a <b>classification problem</b>. A stratified split ensures that the distribution of the target in the train and test datasets is similar to its distribution in the whole data.

In [20]:
diabetes=pd.read_csv(r'diabetes.csv')

In [21]:
diabetes.head(2)

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [22]:
y = diabetes['diabetes']
X = diabetes.drop(['diabetes'],axis=1)

In [42]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=1111, stratify = y)

Checking the distribution of Target

In [43]:
train_target = y_train[y_train==1].value_counts()/len(y_train)
test_target = y_test[y_test==1].value_counts()/len(y_test)

print('Target Dist. in y_train:', train_target)
print()
print('Target Dist. in y_test:', test_target)

Target Dist. in y_train: 1    0.348231
Name: diabetes, dtype: float64

Target Dist. in y_test: 1    0.350649
Name: diabetes, dtype: float64
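
To isolate the effect of `stratify`, here is a minimal sketch on hypothetical imbalanced labels (90 negatives, 10 positives). The data is made up; only the `train_test_split` call mirrors the one used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# Stratified split: class proportions are preserved in both parts
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

# Plain random split: the positive share in the test set depends on the shuffle
_, _, y_tr_r, y_te_r = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

print('Stratified test positives:', y_te_s.mean())
print('Random test positives:', y_te_r.mean())
```

With stratification the 20 test samples contain exactly 2 positives (10%), whereas the unstratified share can drift away from 10% depending on the seed.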


## Model Validation: Holdout Sample

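What do we do when comparing different models, e.g. a Random Forest with 100 trees versus one with 1000 trees? If we pick the winner using the test set, the test score is no longer an honest estimate of performance on unseen data. A separate validation set lets us compare candidate models while keeping the test set untouched.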
<img src='Capture2.png'>

### Splitting the Data

In [65]:
# Do Train_Test_Split Twice
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=1111, stratify = y)
X_train,X_val,y_train,y_val = train_test_split(X_train,y_train,test_size=0.3, random_state=1111, stratify = y_train)

print('X_train:', X_train.shape[0])
print('X_test:', X_test.shape[0])
print('X_val:', X_val.shape[0])

X_train: 375
X_test: 231
X_val: 162


* Use the train dataset to train the model
* Use the validation dataset to compare and select models
* Use the test dataset to check the performance of the selected model on unseen data

In [57]:
from sklearn.ensemble import RandomForestClassifier

#define the model
model1 = RandomForestClassifier(n_estimators=10, random_state=11)
model2 = RandomForestClassifier(n_estimators=50, random_state=11)
model3 = RandomForestClassifier(n_estimators=100, random_state=11)

#fitting the model
model1.fit(X_train,y_train)
model2.fit(X_train,y_train)
model3.fit(X_train,y_train)

#accuracy score
print('Model1 Score:',round(model1.score(X_val,y_val),4))
print('Model2 Score:',round(model2.score(X_val,y_val),4))
print('Model3 Score:',round(model3.score(X_val,y_val),4))

Model1 Score: 0.7901
Model2 Score: 0.8086
Model3 Score: 0.7963


### Accuracy Score

The accuracy scores on the validation set show that the Random Forest with n_estimators = 50 performs best, so we evaluate that model on the test dataset.

In [58]:
#accuracy score
print('Model Train Score:',round(model2.score(X_train,y_train),4))
print('Model Validation Score:',round(model2.score(X_val,y_val),4))
print('Model Test Score:',round(model2.score(X_test,y_test),4))

Model Train Score: 1.0
Model Validation Score: 0.8086
Model Test Score: 0.7489


Wow, we have an overfitting model. The train score (1.0) is much higher than the validation score (0.8086) and the test score (0.7489). This means the model is only good on seen data and will give less accurate predictions on unseen data. Tuning the hyperparameters may help reduce the overfitting.

A disadvantage of the holdout method is that the performance estimate may be very sensitive to how we partition the training set into the training and validation subsets; the estimate will vary for different samples of the data.
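
This sensitivity can be illustrated with a rough sketch: the model settings stay fixed and only the seed of the split changes. The data comes from `make_classification` as a synthetic stand-in (an assumption, not the diabetes dataset), so the exact scores carry no meaning beyond showing the spread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the diabetes dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

holdout_scores = []
for seed in range(5):
    # Only the partition changes between iterations; the model settings stay fixed
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_syn, y_syn, test_size=0.3, random_state=seed, stratify=y_syn)
    clf = RandomForestClassifier(n_estimators=50, random_state=11)
    clf.fit(X_tr, y_tr)
    holdout_scores.append(clf.score(X_va, y_va))

print([round(s, 3) for s in holdout_scores])
print('Spread:', round(max(holdout_scores) - min(holdout_scores), 3))
```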

## Model Validation: Cross Validation

In k-fold cross-validation, we randomly split the training dataset into k folds without replacement, where k - 1 folds are used for the model training, and one fold is used for performance evaluation. This procedure is repeated k times so that we obtain k models and performance estimates.
<img src='Capture3.png'>

We then calculate the average performance of the models based on the different, independent folds to obtain a performance estimate that is less sensitive to the sub-partitioning of the training data compared to the holdout method. Typically, we use k-fold cross-validation for model tuning, that is, finding the optimal hyperparameter values that yield a satisfying generalization performance.
<img src='Capture4.png'>

### K-fold CV
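
Before running it on real data, the index mechanics of k-fold splitting can be seen on six toy samples. This is a small sketch with illustrative names; it uses `KFold` without shuffling, so the folds are consecutive blocks.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six toy samples, split into k = 3 folds without shuffling
samples = np.arange(6)
folds = list(KFold(n_splits=3).split(samples))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    # Each sample lands in the evaluation fold exactly once
    print('Fold %d: train=%s, test=%s' % (i, train_idx, test_idx))
```

The evaluation folds are [0 1], [2 3], and [4 5], and in each round the remaining four samples are used for training.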

Performing K-fold Cross Validation

In [76]:
#Running K-fold CV
from sklearn.model_selection import KFold

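#Defining Kfold (no shuffling, so random_state has no effect; recent scikit-learn versions raise an error if it is set without shuffle=True)
kfold = KFold(n_splits=10).split(X_train,y_train)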

#Defining the model
rfc = RandomForestClassifier(n_estimators = 50, random_state=11)

#Initialize empty list for the scores
scores = []

#Loop the kfold
for k, (train, test) in enumerate(kfold):
    rfc.fit(X_train.iloc[train], y_train.iloc[train])
    score = rfc.score(X_train.iloc[test], y_train.iloc[test])
    scores.append(score)
    print('Fold: %2d, Class dist.: %s, Acc: %.3f' % (k+1,np.bincount(y_train.iloc[train]), score))
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

Fold:  1, Class dist.: [223 114], Acc: 0.684
Fold:  2, Class dist.: [223 114], Acc: 0.658
Fold:  3, Class dist.: [219 118], Acc: 0.789
Fold:  4, Class dist.: [220 117], Acc: 0.684
Fold:  5, Class dist.: [220 117], Acc: 0.816
Fold:  6, Class dist.: [217 121], Acc: 0.730
Fold:  7, Class dist.: [218 120], Acc: 0.892
Fold:  8, Class dist.: [219 119], Acc: 0.784
Fold:  9, Class dist.: [218 120], Acc: 0.757
Fold: 10, Class dist.: [219 119], Acc: 0.730
CV accuracy: 0.752 +/- 0.067


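It turns out that a Random Forest with 50 trees has an average accuracy of 0.752 across the 10 folds, with a standard deviation of 0.067. Averaging over the folds gives a more reliable picture of the model's performance than a single holdout score, and the spread shows how much the estimate depends on the partition.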

### Stratified K-fold CV

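A slight improvement over the standard k-fold cross-validation approach is stratified k-fold cross-validation, which can yield better bias and variance estimates, especially in cases of unequal class proportions. Each fold preserves the class proportions of the full training set, as the class distributions in the output below show.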

In [78]:
#Running stratified K-fold CV
from sklearn.model_selection import StratifiedKFold

#Defining stratified Kfold (no shuffling, so random_state is omitted)
kfold = StratifiedKFold(n_splits=10).split(X_train,y_train)

#Defining the model
rfc = RandomForestClassifier(n_estimators = 50, random_state=11)

#Initialize empty list for the scores
scores = []

#Loop the kfold
for k, (train, test) in enumerate(kfold):
    rfc.fit(X_train.iloc[train], y_train.iloc[train])
    score = rfc.score(X_train.iloc[test], y_train.iloc[test])
    scores.append(score)
    print('Fold: %2d, Class dist.: %s, Acc: %.3f' % (k+1,np.bincount(y_train.iloc[train]), score))
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

Fold:  1, Class dist.: [219 117], Acc: 0.846
Fold:  2, Class dist.: [219 118], Acc: 0.737
Fold:  3, Class dist.: [219 118], Acc: 0.737
Fold:  4, Class dist.: [219 118], Acc: 0.763
Fold:  5, Class dist.: [220 118], Acc: 0.730
Fold:  6, Class dist.: [220 118], Acc: 0.784
Fold:  7, Class dist.: [220 118], Acc: 0.703
Fold:  8, Class dist.: [220 118], Acc: 0.838
Fold:  9, Class dist.: [220 118], Acc: 0.622
Fold: 10, Class dist.: [220 118], Acc: 0.730
CV accuracy: 0.749 +/- 0.062


### Cross_val_score

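Although the previous code example was useful to illustrate how k-fold cross-validation works, scikit-learn also implements a k-fold cross-validation scorer, cross_val_score, which allows us to evaluate our model using stratified k-fold cross-validation less verbosely. For a classifier, passing an integer cv uses stratified folds, which is why the scores below match the StratifiedKFold loop above.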

In [80]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator=rfc, X=X_train, y=y_train, cv=10, n_jobs=1)
print('CV accuracy scores: %s' % scores)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

CV accuracy scores: [0.84615385 0.73684211 0.73684211 0.76315789 0.72972973 0.78378378
 0.7027027  0.83783784 0.62162162 0.72972973]
CV accuracy: 0.749 +/- 0.062


### LOOCV

A special case of k-fold cross-validation is the leave-one-out cross-validation (LOOCV) method. In LOOCV, we set the number of folds equal to the number of training samples (k = n), so that only one sample is held out for testing in each iteration. This approach is recommended for very small datasets.

Use when:
* The amount of training data is limited
* You want the absolute best error estimate for new data

Be cautious when:
* Computational resources are limited
* You have a lot of data
* You have a lot of parameters to test
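
A minimal LOOCV sketch, assuming a small synthetic regression problem and a linear model (neither is from the datasets above): `LeaveOneOut` is passed as the `cv` argument of `cross_val_score`, which then performs one fit per sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic regression problem (illustrative only)
rng = np.random.RandomState(0)
X_small = rng.rand(20, 3)
y_small = X_small @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=20)

# One fit per sample; R^2 is undefined on a single test point,
# so an error metric is used for scoring instead
loo_scores = cross_val_score(LinearRegression(), X_small, y_small,
                             cv=LeaveOneOut(), scoring='neg_mean_squared_error')

print('Number of fits:', len(loo_scores))
print('LOOCV MSE: %.4f' % -loo_scores.mean())
```

With 20 samples this means 20 fits, which is why LOOCV becomes expensive on large datasets or when many parameter settings have to be tested.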

## Parameter & Hyperparameter

### Parameter

<b>Parameters</b> are:
* Learned or estimated from the data
* The result of fitting a model
* Used when making future predictions
* Not manually set

In [84]:
X = candy.drop(['winpercent','competitorname'],axis=1)
y = candy['winpercent']

In [85]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)

#These are the parameters from linear Regression
print(lr.coef_, lr.intercept_) 

[19.74806698  9.42232207  2.22448136 10.07068847  0.8043306   8.91896981
 -6.1653265   0.44154009 -0.85449954  9.08676286 -5.92836143] 34.53397841468771


### Hyperparameter

<b>Hyperparameters</b> are:
* Manually set before the training occurs
* Specify how the training is supposed to happen

In [86]:
#Example of Random Forest
rfc.get_params()

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 50,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 11,
 'verbose': 0,
 'warm_start': False}

n_estimators, max_depth, max_features, and min_samples_split are the main hyperparameters for Random Forest.

### Hyperparameter Tuning

* Select hyperparameters
* Run a single model type at different value sets
* Create ranges of possible values to select from
* Specify a single accuracy metric

In [89]:
#Example: Using Diabetes dataset
y = diabetes['diabetes']
X = diabetes.drop(['diabetes'],axis=1)

#Split data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=1111, stratify = y)

In [96]:
#Using Random Forest Model
rfc = RandomForestClassifier(n_estimators=50)

#Define the Hyperparameter
param_grid = {"max_depth": [4, 6, None],
              "max_features": [1,3,5],
              "min_samples_split": [0.2,0.3]} 

#### Grid Search

The approach of grid search is quite simple; it's a brute-force exhaustive search paradigm where we specify a list of values for different hyperparameters, and the computer evaluates the model performance for each combination of those to obtain the optimal combination of values from that set.

* Benefits: Tests every possible combination
* Drawbacks: Additional hyperparameters increase training time exponentially
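
To get a feel for how quickly the number of combinations grows, the grid can be counted with `ParameterGrid` before any model is trained. This is a small sketch that copies the `param_grid` defined earlier and assumes 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined above
grid = {"max_depth": [4, 6, None],
        "max_features": [1, 3, 5],
        "min_samples_split": [0.2, 0.3]}

n_candidates = len(ParameterGrid(grid))   # 3 * 3 * 2 combinations
n_fits = n_candidates * 5                 # each candidate is fit once per CV fold (cv=5)

print(n_candidates, n_fits)  # 18 90
```

Adding a fourth hyperparameter with three values would triple both numbers.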

In [98]:
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(estimator=rfc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=5,
                  verbose = 3,
                  n_jobs=-1)

gs.fit(X_train,y_train)

print(gs.best_score_)
print(gs.best_params_)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:    3.3s


0.7728119180633147
{'max_depth': 6, 'max_features': 3, 'min_samples_split': 0.2}


[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:   11.1s finished


#### Random Search

In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.
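
The sampling step can be sketched with `ParameterSampler`, which draws parameter settings from lists or from scipy distributions. The distributions and ranges below are assumptions chosen for illustration, not the values used in this notebook.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical distributions; the ranges are assumptions chosen for illustration
param_dist = {"max_depth": [4, 6, None],               # lists are sampled uniformly
              "max_features": randint(1, 6),            # integers 1..5
              "min_samples_split": uniform(0.1, 0.3)}   # floats in [0.1, 0.4]

# Draw n_iter settings, one dictionary per candidate model
sampled = list(ParameterSampler(param_dist, n_iter=5, random_state=11))

for setting in sampled:
    print(setting)
```

The same dictionary could be passed as `param_distributions` to `RandomizedSearchCV`, so the search is not restricted to a fixed list of values for each hyperparameter.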

In [104]:
from sklearn.model_selection import RandomizedSearchCV

rs = RandomizedSearchCV(estimator=rfc,
                  param_distributions=param_grid,
                  scoring='accuracy',
                  cv=5,
                  n_iter = 10,
                  verbose = 3,
                  n_jobs=-1)

rs.fit(X_train,y_train)

print(rs.best_score_)
print(rs.best_params_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:   28.3s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   31.7s finished


0.7672253258845437
{'min_samples_split': 0.2, 'max_features': 5, 'max_depth': 4}


#### Implementing The Result

In [110]:
model = rs.best_estimator_
model

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=4, max_features=5, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=0.2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [111]:
print('Model Train Score:',round(model.score(X_train,y_train),4))
print('Model Test Score:',round(model.score(X_test,y_test),4))

Model Train Score: 0.784
Model Test Score: 0.7576


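We get a slightly better test score than the earlier Random Forest model on the diabetes data (0.7576 versus 0.7489 in the Accuracy Score section), and the gap between the train and test scores is much smaller (0.784 versus 0.7576, compared with 1.0 versus 0.7489), although the model still scores higher on seen data. Hyperparameter tuning is repetitive work and an art of its own: choosing a good set of hyperparameter values to test takes experience and references.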