# Comparing different validation methods

In [1]:
# General imports
import itertools
import warnings
from datetime import datetime

import numpy as np
import pandas as pd
from sklearn import linear_model, metrics
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings('ignore') # suppress warnings (this also hides Lasso convergence warnings)

For this analysis, we will use crime rate data from different communities. The original data set can be found on the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized). It contains many attributes of each community, such as household size, the percentage of the population in different racial groups, and the number of police officers. We will use a cleaned version of this data set.

In [2]:
community=pd.read_csv("community.csv")
community.head(2)

Unnamed: 0,population,householdsize,racepctblack,racePctWhite,racePctAsian,racePctHisp,agePct12t21,agePct12t29,agePct16t24,agePct65up,...,PctForeignBorn,PctBornSameState,PctSameHouse85,PctSameCity85,PctSameState85,LandArea,PopDens,PctUsePubTrans,LemasPctOfficDrugUn,nonViolPerPop
0,11980,3.1,1.37,91.78,6.5,1.88,12.47,21.44,10.93,11.33,...,10.66,53.72,65.29,78.09,89.14,6.5,1845.9,9.63,0.0,1394.59
1,23123,2.82,0.8,95.57,3.44,0.85,11.01,21.3,10.48,17.18,...,8.3,77.17,71.27,90.22,96.12,10.6,2186.7,3.84,0.0,1955.95


In [3]:
print(community.columns.values) # Looking at the variables we will be working with
print("\nThe number of variables : {}".format(len(community.columns.values)))
print("\nThe number of observations : {}".format(len(community)))

['population' 'householdsize' 'racepctblack' 'racePctWhite' 'racePctAsian'
 'racePctHisp' 'agePct12t21' 'agePct12t29' 'agePct16t24' 'agePct65up'
 'numbUrban' 'pctUrban' 'medIncome' 'pctWWage' 'pctWFarmSelf' 'pctWInvInc'
 'pctWSocSec' 'pctWPubAsst' 'pctWRetire' 'medFamInc' 'perCapInc'
 'whitePerCap' 'blackPerCap' 'indianPerCap' 'AsianPerCap' 'HispPerCap'
 'NumUnderPov' 'PctPopUnderPov' 'PctLess9thGrade' 'PctNotHSGrad'
 'PctBSorMore' 'PctUnemployed' 'PctEmploy' 'PctEmplManu' 'PctEmplProfServ'
 'PctOccupManu' 'PctOccupMgmtProf' 'MalePctDivorce' 'MalePctNevMarr'
 'FemalePctDiv' 'TotalPctDiv' 'PersPerFam' 'PctFam2Par' 'PctKids2Par'
 'PctYoungKids2Par' 'PctTeen2Par' 'PctWorkMomYoungKids' 'PctWorkMom'
 'NumKidsBornNeverMar' 'PctKidsBornNeverMar' 'NumImmig' 'PctImmigRecent'
 'PctImmigRec5' 'PctImmigRec8' 'PctImmigRec10' 'PctRecentImmig'
 'PctRecImmig5' 'PctRecImmig8' 'PctRecImmig10' 'PctSpeakEnglOnly'
 'PctNotSpeakEnglWell' 'PctLargHouseFam' 'PctLargHouseOccup'
 'PersPerOccupHous' 'PersPerOwnO

In [4]:
# defining X and Y variables we will be using in our model
X=community.copy()
del X['nonViolPerPop']
y=community['nonViolPerPop'].copy() # response variable is the number of non-violent crimes per population

### The aim of this notebook is to implement different validation methods and compare their performances using the crime rate data from different communities. We will be using a Lasso regression model to predict the total number of non-violent crimes per population. 

### We will perform Lasso regression using the three validation methods below and determine the best hyperparameters for each

### 1. Train / validation / test split.
### 2. 5-Fold cross validation.
### 3. 10-Fold cross validation.

#### Let's start by splitting the data into train, validation and test sets. We will hold out 30% of the data as the test set

In [5]:
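# Hold out 30% of the data as the test set
X_train_valid, X_test, y_train_valid, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
# Split the remaining 70% into train and validation; 0.2857 of 70% is about 20% of the full data
X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid, test_size = 0.2857, random_state = 1)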
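
The second `test_size` is chosen so that the validation set ends up at roughly 20% of the full data. A quick check of the arithmetic, as a sketch that uses only the two fractions from the cell above:

```python
# Fractions taken from the two train_test_split calls above
test_frac = 0.3                 # first split: 30% held out as test
valid_frac_of_rest = 0.2857     # second split: share of the remaining 70%

valid_frac = (1 - test_frac) * valid_frac_of_rest   # validation share of all rows
train_frac = 1 - test_frac - valid_frac             # what is left for training

print(f"train: {train_frac:.3f}, valid: {valid_frac:.3f}, test: {test_frac:.3f}")
```

The overall split is therefore close to 50% train, 20% validation and 30% test.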

In [6]:
# Hyperparameters
alphas = np.logspace(-10,10,21) # regularization strengths (lambda), from 1e-10 to 1e10
max_iters = np.arange(50,75,5) # caps on the number of solver iterations: 50, 55, ..., 70
tols = np.linspace(0.0001,0.1,5) # tolerance for optimization

# Method 1: Train / validation / test split.

In [7]:
hyperparameter_trio=list(itertools.product(alphas,max_iters,tols)) # Forming all possible combinations of alpha, max iterations and tolerance from the defined ranges
print("The number of trios in total: {}".format(len(hyperparameter_trio)))

The number of trios in total: 525
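
The count of 525 follows directly from the sizes of the three grids. The sketch below rebuilds the same arrays and confirms the product; it uses nothing beyond the definitions in In [6].

```python
import itertools
import numpy as np

alphas = np.logspace(-10, 10, 21)      # 21 values: 1e-10, 1e-9, ..., 1e10
max_iters = np.arange(50, 75, 5)       # 5 values: 50, 55, 60, 65, 70
tols = np.linspace(0.0001, 0.1, 5)     # 5 evenly spaced values from 1e-4 to 0.1

# Every combination of one alpha, one max_iter and one tol
n_trios = len(list(itertools.product(alphas, max_iters, tols)))
print(len(alphas), len(max_iters), len(tols), n_trios)
```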


In [8]:
#scaling the data
scaler=StandardScaler() # Instantiate
scaler.fit(X_train) # Fitting the data
X_train=pd.DataFrame(scaler.transform(X_train)) # transforming the data
X_valid=pd.DataFrame(scaler.transform(X_valid)) # transforming the validation set
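
The scaler is fitted on the training rows only and then reused on the validation rows, so no validation statistics leak into the fit. A minimal sketch on synthetic data (the arrays `train` and `valid` are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))   # hypothetical training rows
valid = rng.normal(loc=50.0, scale=10.0, size=(80, 3))    # hypothetical validation rows

scaler = StandardScaler().fit(train)      # means and standard deviations come from train only
train_scaled = scaler.transform(train)
valid_scaled = scaler.transform(valid)    # validation reuses the training statistics

# Training columns are centred exactly; validation columns only approximately
print(train_scaled.mean(axis=0).round(6))
print(valid_scaled.mean(axis=0).round(3))
```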

In [9]:
Validation_Scores=[]
# Fit a Lasso model on the training set for every hyperparameter trio and record its MSE on the validation set
start = datetime.now()
for a in hyperparameter_trio:
    lm_trainlasso=linear_model.Lasso(alpha=a[0],max_iter=a[1],tol=a[2])
    lm_trainlasso.fit(X_train,y_train)
    Validation_Scores.append(metrics.mean_squared_error(y_valid,lm_trainlasso.predict(X_valid))) # (y_true, y_pred)
end = datetime.now()
M1 = end - start # elapsed time for Method 1


In [10]:
minerror_M1 = min(Validation_Scores) # lowest validation MSE
besttrio_M1 = hyperparameter_trio[np.argmin(Validation_Scores)] # hyperparameter trio that gives the lowest validation MSE

In [11]:
# Displaying the hyperparameter trio (alpha,max iteration,tolerance) with the lowest mean squared errors
bestparam_M1 = pd.DataFrame(zip(['alpha','max_iter','tol'], besttrio_M1))
bestparam_M1 = bestparam_M1.rename(columns = {0:'Hyperparameter', 1:'Best trio'})
bestparam_M1 = bestparam_M1.set_index('Hyperparameter')
bestparam_M1

Hyperparameter,Best trio
alpha,1.0
max_iter,50.0
tol,0.0001
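
Because warnings are suppressed at the top of the notebook, a fit that stops at `max_iter` before reaching `tol` would go unnoticed. One way to check is to compare the fitted model's `n_iter_` attribute with the `max_iter` that was passed in. The sketch below does this on synthetic data from `make_regression`, which stands in for the crime data; the helper `hit_iteration_cap` is a hypothetical name introduced here.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in for the scaled training data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

def hit_iteration_cap(model):
    """Return True if coordinate descent used every iteration it was allowed."""
    return model.n_iter_ >= model.max_iter

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # mirror the notebook, which silences warnings
    for cap in (2, 1000):
        model = Lasso(alpha=1.0, max_iter=cap, tol=1e-4).fit(X_demo, y_demo)
        print(cap, model.n_iter_, hit_iteration_cap(model))
```

Running the same check on the selected trio would show whether the chosen `max_iter` was actually binding.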


In [12]:
#Scaling the data
scaler = StandardScaler()
scaler.fit(X_train_valid)
X_train_valid = pd.DataFrame(scaler.transform(X_train_valid))
X_test = pd.DataFrame(scaler.transform(X_test)) # transforming the test set

In [13]:
# Refit model with train + validation set, perform prediction on test set
lm1 = linear_model.Lasso(alpha = besttrio_M1[0], max_iter = besttrio_M1[1], tol = besttrio_M1[2])
lm1.fit(X_train_valid, y_train_valid)
d1 = pd.DataFrame(zip(X.columns.values, lm1.coef_))
d1 = d1.rename(columns = {0:'Vars', 1:'Coef'})
d1 = d1.set_index('Vars')
d1.head() #printing the first 5 results

Vars,Coef
population,-191.307715
householdsize,-262.531808
racepctblack,271.710957
racePctWhite,286.458009
racePctAsian,89.959975


In [14]:
M1_terror = metrics.mean_squared_error(y_test, lm1.predict(X_test)) # (y_true, y_pred)
print("The prediction error for the test set is : {}".format(M1_terror))

The prediction error for the test set is : 3539035.5943705086


# Method 2: 5-Fold cross validation.

In [15]:
start = datetime.now()
estimator = Pipeline([('scale', StandardScaler()), ('lasso',Lasso())]) # setting up model pipeline
parameters = {'lasso__alpha':alphas, 'lasso__max_iter':max_iters, 'lasso__tol':tols} # adding the three hyperparameters
lm2 = GridSearchCV(estimator = estimator, param_grid = parameters, cv = 5, scoring = 'neg_mean_squared_error', n_jobs = -1)
lm2.fit(X_train_valid, y_train_valid) # fitting train+valid (already standardized in In [12]; the pipeline rescales within each fold)
end = datetime.now()
M2 = end - start


#### Best hyperparameter trio

In [16]:
# Displaying the hyperparameter trio (alpha,max iteration,tolerance) with the lowest mean squared error
bestparam_M2 = pd.DataFrame(zip(parameters.keys(), lm2.best_params_.values()))
bestparam_M2 = bestparam_M2.rename(columns = {0:'Hyperparameter', 1:'Best trio'})
bestparam_M2 = bestparam_M2.set_index('Hyperparameter')
bestparam_M2 # the best parameter from CV

Hyperparameter,Best trio
lasso__alpha,10.0
lasso__max_iter,65.0
lasso__tol,0.05005


In [17]:
M2_terror = metrics.mean_squared_error(y_test, lm2.predict(X_test)) # prediction on the test set
print("The prediction error for the test set is : {}".format(M2_terror))

The prediction error for the test set is : 3476161.6429157713


In [18]:
d2 = pd.DataFrame(zip(X.columns.values, lm2.best_estimator_.named_steps['lasso'].coef_))
d2 = d2.rename(columns = {0:'Vars', 1:'Coef'})
d2 = d2.set_index('Vars')
d2.head() #printing the first 5 results

Vars,Coef
population,-0.0
householdsize,-257.075396
racepctblack,87.032575
racePctWhite,0.0
racePctAsian,0.0


# Method 3: 10-Fold cross validation.

In [19]:
start = datetime.now()
estimator = Pipeline([('scale', StandardScaler()), ('lasso',Lasso())]) # Model Pipeline
parameters = {'lasso__alpha':alphas, 'lasso__max_iter':max_iters, 'lasso__tol':tols}
lm3 = GridSearchCV(estimator = estimator, param_grid = parameters, cv = 10, scoring = 'neg_mean_squared_error', n_jobs = -1) 
lm3.fit(X_train_valid, y_train_valid) 
end = datetime.now()
M3 = end - start

#### Best hyperparameter trio

In [20]:
# Displaying the hyperparameter trio (alpha,max iteration,tolerance) with the lowest mean squared errors
bestparam_M3 = pd.DataFrame(zip(parameters.keys(), lm3.best_params_.values()))
bestparam_M3 = bestparam_M3.rename(columns = {0:'Hyperparameter', 1:'Best trio'})
bestparam_M3 = bestparam_M3.set_index('Hyperparameter')
bestparam_M3 

Hyperparameter,Best trio
lasso__alpha,10.0
lasso__max_iter,70.0
lasso__tol,0.025075


In [21]:
M3_terror = metrics.mean_squared_error(y_test, lm3.predict(X_test)) # prediction on the test set
print("The prediction error for the test set is : {}".format(M3_terror))

The prediction error for the test set is : 3473558.939727634


In [22]:
d3 = pd.DataFrame(zip(X.columns.values, lm3.best_estimator_.named_steps['lasso'].coef_))
d3 = d3.rename(columns = {0:'Vars', 1:'Coef'})
d3 = d3.set_index('Vars')
d3.head() #printing the first 5 results

Vars,Coef
population,-0.0
householdsize,-258.593448
racepctblack,88.166913
racePctWhite,0.0
racePctAsian,0.0


### Comparing the three methods (M1,M2,M3)

#### 1) Time comparison

In [23]:
Methodname = ['M1','M2','M3']
Mintime = [M1, M2, M3]
d4 = pd.DataFrame(zip(Methodname, Mintime))
d4 = d4.rename(columns = {0:'Method', 1:'Time'})
print('Time taken by each model to run:\n\n', d4)

# finding the method that takes the least time
mintime = d4[['Method','Time']][d4['Time'] == d4['Time'].min()] 
print('\nMinimum time taken:\n\n', mintime)

Time taken by each model to run:

   Method            Time
0     M1 00:00:25.351023
1     M2 00:04:31.204233
2     M3 00:09:41.683957

Minimum time taken:

   Method            Time
0     M1 00:00:25.351023


###### Method 1 takes the least amount of time

#### 2) Hyperparameter trios

In [24]:
triocomp=pd.DataFrame(zip(['alpha','max_iter','tol'], besttrio_M1,lm2.best_params_.values(),lm3.best_params_.values()))
triocomp=triocomp.rename(columns={0:'Hyperparameters',1:'M1',2:'M2',3:'M3'})
triocomp


Unnamed: 0,Hyperparameters,M1,M2,M3
0,alpha,1.0,10.0,10.0
1,max_iter,50.0,65.0,70.0
2,tol,0.0001,0.05005,0.025075


###### The best alpha value is 10 for both M2 and M3, versus 1 for M1. The selected max_iter cap is highest for Method 3 (70) and lowest for Method 1 (50)

#### 3) Mean squared error comparison (test set)

In [25]:
Minerror = [M1_terror, M2_terror, M3_terror]
d5 = pd.DataFrame(zip(Methodname, Minerror))
d5 = d5.rename(columns = {0:'Method', 1:'MSE'})
print('MSE of three methods:\n', d5)

# finding the method with the lowest MSE
minerror = d5[['Method','MSE']][d5['MSE'] == d5['MSE'].min()]
print('\nMinimum error:\n', minerror)

MSE of three methods:
   Method           MSE
0     M1  3.539036e+06
1     M2  3.476162e+06
2     M3  3.473559e+06

Minimum error:
   Method           MSE
2     M3  3.473559e+06


###### Least error on the test set is given by method 3
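
MSE is in squared units of the response, so the gaps are easier to judge as RMSE or as relative differences. The sketch below reuses only the three test MSE values printed above.

```python
import math

# Test-set MSE values as printed in the notebook output
mse = {"M1": 3539035.5943705086, "M2": 3476161.6429157713, "M3": 3473558.939727634}

best = min(mse, key=mse.get)   # method with the lowest test MSE
for name, value in mse.items():
    rmse = math.sqrt(value)                        # same units as nonViolPerPop
    gap = (value - mse[best]) / mse[best] * 100    # percent above the best MSE
    print(f"{name}: RMSE = {rmse:.1f}, MSE gap to {best} = {gap:.2f}%")
```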

#### 4) Coefficient comparison

In [26]:
# Taking the absolute values of coefficients and printing the top five for each in descending order 

# Method 1
d1['Coef'] = abs(d1['Coef'].values)
print('Model_1\n', d1.sort_values(by='Coef', ascending=False).head())

# Method 2
d2['Coef'] = abs(d2['Coef'].values)
print('\nModel_2\n', d2.sort_values(by = 'Coef', ascending=False).head())

# Method 3
d3['Coef'] = abs(d3['Coef'].values)
print('\nModel_3\n', d3.sort_values(by = 'Coef', ascending=False).head())

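# Features shrunk to zero by both Model_2 and Model_3
# (Index.intersection is used because newer pandas versions no longer treat `&` as a set operation)
common_features = d2[d2['Coef'] == 0].index.intersection(d3[d3['Coef'] == 0].index)
print(common_features)

# Features shrunk to zero by only one of the two models
uncommon_features = d2[d2['Coef'] == 0].index.symmetric_difference(d3[d3['Coef'] == 0].index)
print(uncommon_features)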

Model_1
                             Coef
Vars                            
PersPerOccupHous     1057.493663
PctForeignBorn        831.529309
NumKidsBornNeverMar   744.253307
MalePctNevMarr        737.867836
PersPerOwnOccHous     669.837470

Model_2
                         Coef
Vars                        
PctForeignBorn    694.631090
PersPerOccupHous  615.307836
MalePctNevMarr    547.458027
PctPopUnderPov    533.548267
PctKids2Par       511.629813

Model_3
                         Coef
Vars                        
PctForeignBorn    750.458970
PctKids2Par       593.013168
PersPerOccupHous  581.750552
MalePctNevMarr    562.973969
PctPopUnderPov    477.270452
Index(['population', 'racePctWhite', 'racePctAsian', 'racePctHisp',
       'agePct65up', 'numbUrban', 'medIncome', 'pctWWage', 'perCapInc',
       'NumUnderPov', 'PctOccupManu', 'PersPerFam', 'PctYoungKids2Par',
       'PctTeen2Par', 'PctWorkMomYoungKids', 'NumImmig', 'PctImmigRec8',
       'PctRecImmig5', 'PctRecImmig10', 'PctNotSp

###### We take the absolute values of the coefficients and sort them in descending order to identify the most influential predictors in each model.
###### Shrinkage: We compare only Model_2 and Model_3 here, since Model_1 shrinks just one feature, 'PctEmplProfServ', to zero, and that feature is not among those shrunk to zero by M2 or M3.
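
The common and differing sets of zeroed features come from set operations on the pandas index. A toy sketch with made-up coefficients (the feature names `a` to `d` and the values are hypothetical):

```python
import pandas as pd

# Hypothetical absolute coefficients from two fitted models
coef_a = pd.DataFrame({"Coef": [0.0, 2.5, 0.0, 1.0]}, index=["a", "b", "c", "d"])
coef_b = pd.DataFrame({"Coef": [0.0, 3.0, 0.4, 0.0]}, index=["a", "b", "c", "d"])

zero_a = coef_a[coef_a["Coef"] == 0].index   # features zeroed by the first model
zero_b = coef_b[coef_b["Coef"] == 0].index   # features zeroed by the second model

both = zero_a.intersection(zero_b)               # zeroed by both models
only_one = zero_a.symmetric_difference(zero_b)   # zeroed by exactly one model
print(list(both), list(only_one))
```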

In [27]:
# counting the number of features shrunk to zero
print('Total features shrunk to zero by M1:', (d1['Coef'] == 0).sum())
print('Total features shrunk to zero by M2:', (d2['Coef'] == 0).sum())
print('Total features shrunk to zero by M3:', (d3['Coef'] == 0).sum())

Total features shrunk to zero by M1: 1
Total features shrunk to zero by M2: 30
Total features shrunk to zero by M3: 32


###### Method 1: Out of 101 predictors, this model shrinks only one feature to zero.
###### Method 2: This model gives a sparser result than the previous one, shrinking 30 features to zero.
###### Method 3: This model is sparser still, shrinking 32 features to zero.

##### Conclusion

Time: 
M1 takes the least time and M3 the most. This is expected: the more folds there are, the more models have to be fitted for each hyperparameter trio.
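
The ordering of the timings matches a rough count of model fits. Method 1 fits one model per trio, while a K-fold grid search fits K models per trio plus one final refit (`refit=True` is the `GridSearchCV` default). The sketch below compares that count with the timings reported in the time comparison; it ignores the effect of `n_jobs = -1` and of differing fold sizes, so it is only a back-of-the-envelope check.

```python
from datetime import timedelta

n_trios = 525   # size of the hyperparameter grid

# Number of Lasso fits each method performs during the search
fits = {"M1": n_trios, "M2": 5 * n_trios + 1, "M3": 10 * n_trios + 1}

# Wall-clock times as printed in the timing table
times = {
    "M1": timedelta(seconds=25, microseconds=351023),
    "M2": timedelta(minutes=4, seconds=31, microseconds=204233),
    "M3": timedelta(minutes=9, seconds=41, microseconds=683957),
}

fit_ratio = fits["M3"] / fits["M2"]      # expected if time scaled with the fit count
time_ratio = times["M3"] / times["M2"]   # observed ratio of wall-clock times
print(f"fits M3/M2: {fit_ratio:.2f}, time M3/M2: {time_ratio:.2f}")
```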

Mean squared error:
Among M1, M2 and M3, M3 has the lowest test MSE, which means M3 gives the most accurate predictions on this test set.

Coefficients: 
Comparing the top five coefficients across the three models, 'PctForeignBorn', 'PersPerOccupHous' and 'MalePctNevMarr' appear in the top five for all three.

Shrinkage: 
M1: 1, M2: 30, M3: 32
M1 shrinks only 1 feature to zero, whereas M2 and M3 shrink 30 and 32 features respectively.
So, with respect to shrinkage, M3 yields the simplest model since it retains the fewest features. Scoring will also be faster for the M2 and M3 models than for M1, as they use fewer predictors.

There is little difference between the test MSE of M2 and M3 (about 0.07%, visible only in the fourth significant digit), but M3 takes a little over twice as long as M2 (about 9 min 42 s versus 4 min 31 s). If time is a major factor for us, we should go with the M2 model.

If time is not a major factor, for this high-dimensional data we would select the M3 model, as it screens out the largest number of features, making it the simplest of the three models.

###### Pros and cons of the 3 methods

In the train/valid/test split method, we may run into problems if the data is not split at random, and we may end up overfitting the model to one particular split. This is more of a problem for a plain train/test split; including the extra validation step reduces the risk. On the plus side, when time is a major factor, this method takes much less time than CV.


The risk of running 10-fold CV on a small sample is that each validation fold contains very few observations, so the per-fold error estimates become noisy. It is therefore better to use 10-fold CV when we have a larger data set. Also, more training samples usually means that we are on a flatter part of the learning curve, so the difference between the surrogate models and the "real" model trained on all n samples becomes negligible.
If the slope of the learning curve is flat enough at, say, a training size of 90% of the total data set, then the bias can be ignored and K=10 is reasonable.

A high K means more folds and thus more computation time, and vice versa. A higher K also gives more fold-level scores from which to estimate a confidence interval for the error estimate.

A lower K generally means lower variance and higher bias in the error estimate; a higher K generally means higher variance and lower bias.
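
The bias side of this trade-off comes from how much data each surrogate model sees. A small sketch with scikit-learn's `KFold` on a hypothetical sample of 1,000 rows shows the training share per fold for K = 5 and K = 10:

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1000                 # hypothetical sample size
indices = np.arange(n_samples)

train_share = {}
for k in (5, 10):
    # Every split has the same sizes here because n_samples is divisible by k
    train_idx, valid_idx = next(iter(KFold(n_splits=k).split(indices)))
    train_share[k] = len(train_idx) / n_samples
    print(f"K={k}: train rows = {len(train_idx)}, validation rows = {len(valid_idx)}")
```

With K = 10 each surrogate model is trained on 90% of the rows, which is closer to the final model's training set than the 80% used with K = 5.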