In this notebook, I use different cross-validation techniques to test which validation split performs best for which model.

In [19]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import Lasso, LinearRegression, Ridge

In [20]:
train = pd.read_csv('C:/Users/Duy Nguyen/Downloads/9.1 discussion/train.csv')
test = pd.read_csv('C:/Users/Duy Nguyen/Downloads/9.1 discussion/test.csv')

In [21]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Since there are missing values, let's perform an imputation:
- Age has a relatively low number of missing values (177 of 891), so we use mean imputation.
- Cabin has a large number of missing values (687 of 891).
    - I decided to fill these with 'Unknown' so that the analysis treats missingness as its own category, with its own impact on the model.
    - Later in the process, we will want to inform the business analysts about the data collection process and try to infer what the missing data was.
- Embarked has only 2 missing values, so we drop those rows.

In [22]:
len(train)

891

In [23]:
# assign the result back instead of using inplace=True on a selected column,
# which newer pandas versions warn about and may silently ignore
train['Cabin'] = train['Cabin'].fillna('Unknown')
train['Age'] = train['Age'].fillna(train['Age'].mean())
train = train.dropna(subset=['Embarked'])

# same for the test set
test['Cabin'] = test['Cabin'].fillna('Unknown')
test['Age'] = test['Age'].fillna(test['Age'].mean())
test = test.dropna(subset=['Embarked'])
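
The cell above fills each frame with its own mean. A common alternative is to learn the fill value from the training data only and reuse it on the test data, so nothing about the test set leaks into preprocessing. Here is a minimal sketch on toy frames; `toy_train`, `toy_test` and their values are made up for illustration and are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train/test (hypothetical values).
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 30.0], "Cabin": ["C85", None, None]})
toy_test = pd.DataFrame({"Age": [np.nan, 40.0], "Cabin": [None, "E46"]})

# Learn the fill value from the training frame only.
train_age_mean = toy_train["Age"].mean()

for frame in (toy_train, toy_test):
    # Reuse the training mean for both frames; flag missing cabins explicitly.
    frame["Age"] = frame["Age"].fillna(train_age_mean)
    frame["Cabin"] = frame["Cabin"].fillna("Unknown")

print(toy_test)
```

Whether this matters for the comparison in this notebook depends on how the test set is used later; it is shown only as an option.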


Check for missing values one last time:

In [24]:
assert train['Age'].isnull().sum() == 0
assert train['Cabin'].isnull().sum() == 0
assert train['Embarked'].isnull().sum() == 0

assert test['Age'].isnull().sum() == 0
assert test['Cabin'].isnull().sum() == 0
assert test['Embarked'].isnull().sum() == 0


In [25]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Now that our data set is clean, let's shuffle it, split it, and perform k-fold cross-validation.
We shuffle before splitting because we want to make sure the data is randomly distributed across the folds.

In [26]:
# define the target: we want to predict whether a passenger survived
X = train.drop(columns=['Survived'])
y = train['Survived']

In [27]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  889 non-null    int64  
 1   Pclass       889 non-null    int64  
 2   Name         889 non-null    object 
 3   Sex          889 non-null    object 
 4   Age          889 non-null    float64
 5   SibSp        889 non-null    int64  
 6   Parch        889 non-null    int64  
 7   Ticket       889 non-null    object 
 8   Fare         889 non-null    float64
 9   Cabin        889 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.3+ KB


Keep in mind that there are 5 categorical columns that need to be one-hot encoded for the model to work.
For now, let's observe how KFold splits the data.

In [28]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

In [29]:
cnt = 1
for train_index, test_index in kf.split(X, y):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt+=1

Fold:1, Train set: 711, Test set:178
Fold:2, Train set: 711, Test set:178
Fold:3, Train set: 711, Test set:178
Fold:4, Train set: 711, Test set:178
Fold:5, Train set: 712, Test set:177
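
The fold sizes above follow from 889 rows not dividing evenly by 5: four folds hold out 178 rows and the last holds out 177. To see what the splitter does with the indices themselves, here is a small sketch on 10 made-up samples (`toy_X` and `toy_kf` are illustrative names, not part of the notebook's data).

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10).reshape(-1, 1)  # 10 toy samples
toy_kf = KFold(n_splits=5, shuffle=True, random_state=42)

seen = []
for fold, (train_idx, test_idx) in enumerate(toy_kf.split(toy_X), start=1):
    # Within a fold, train and test indices never overlap.
    assert set(train_idx).isdisjoint(test_idx)
    seen.extend(test_idx.tolist())
    print(f"Fold {fold}: test indices {test_idx.tolist()}")

# Across the 5 folds, every sample is held out exactly once.
print(sorted(seen) == list(range(10)))
```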


Now that we have the k-fold splits, let's build a training pipeline that we can pass kf into.
In the pipeline, let's include a preprocessor that one-hot encodes the categorical features and expands and scales the numeric ones.

In [30]:
categorical_columns = X.select_dtypes(include=['object', 'category']).columns.tolist()
numeric_columns = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

numeric_transformer = Pipeline(steps=[
    ('poly_features', PolynomialFeatures(degree=3, include_bias=False)),
    ('scaler', StandardScaler())
])
categorical_transformer = OneHotEncoder(drop='if_binary', handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_columns),
    ('num', numeric_transformer, numeric_columns),
])



In [31]:
model_selector_pipe = Pipeline([('preprocessor', preprocessor),
                                # The L1 penalty (norm-ball effect) drives some Lasso coefficients to
                                # exactly zero; SelectFromModel() keeps only the features whose
                                # coefficients survive, which helps avoid overfitting.
                                # Be careful with alpha: Lasso's default (alpha=1.0) might drop all features.
                                ('selector', SelectFromModel(Lasso(alpha=0.01))),
                                ('linreg', LinearRegression())])
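
The warning about alpha can be checked on synthetic data. The sketch below is an illustration under assumed data (10 random columns, only the first two carrying signal), not a result from the Titanic set; it counts how many features SelectFromModel keeps as alpha grows.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 10))
# Only the first two columns carry signal; the rest are noise.
toy_y = 3.0 * toy_X[:, 0] - 2.0 * toy_X[:, 1] + rng.normal(scale=0.1, size=200)

kept = {}
for alpha in (0.01, 1.0, 100.0):
    selector = SelectFromModel(Lasso(alpha=alpha)).fit(toy_X, toy_y)
    # get_support() is a boolean mask over the input features.
    kept[alpha] = int(selector.get_support().sum())
    print(f"alpha={alpha}: {kept[alpha]} features kept")
```

Once alpha is large enough that every coefficient is shrunk to zero, the selector keeps nothing, and the downstream regressor has no features to fit on.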

To connect the pipeline and k-fold CV, we use the function cross_val_score.

In [32]:
scores = cross_val_score(model_selector_pipe, X, y, cv=kf, scoring='neg_mean_squared_error')
np.mean(scores)



-0.14402071751497977

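For k-fold, the average score is -0.14402071751497977, i.e. a mean squared error of about 0.144 (scikit-learn negates the MSE so that a higher score is always better).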
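
The sign convention is easy to trip over, so here is a small sketch on synthetic regression data (the data and variable names are assumptions for illustration) showing how to turn the returned scores back into ordinary MSE and RMSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 3))
toy_y = toy_X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

toy_kf = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(LinearRegression(), toy_X, toy_y, cv=toy_kf,
                          scoring="neg_mean_squared_error")

mse_per_fold = -neg_mse                 # flip the sign to get ordinary MSE
rmse_per_fold = np.sqrt(mse_per_fold)   # same units as the target
print(mse_per_fold.mean(), rmse_per_fold.mean())
```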
Repeating this whole process is quite tedious, so I have created a class that automates training the pipeline and applies different CV methods to the same model.
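
The class itself is not shown in this notebook. As a rough idea of what such automation could look like, here is a hypothetical helper, `compare_cv_methods`, that scores one model under several splitters. It is not the CrossValidationComparison implementation; the choice of splitters (a single hold-out split, 5-fold, leave-one-out), the Ridge model, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, cross_val_score

def compare_cv_methods(model, X, y, splitters):
    """Return the mean MSE of `model` under each named CV splitter."""
    results = {}
    for name, splitter in splitters.items():
        scores = cross_val_score(model, X, y, cv=splitter,
                                 scoring="neg_mean_squared_error")
        results[name] = -scores.mean()  # undo scikit-learn's sign flip
    return results

# Small synthetic problem so leave-one-out stays cheap.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(60, 3))
toy_y = toy_X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=60)

splitters = {
    "holdout": ShuffleSplit(n_splits=1, test_size=0.2, random_state=42),
    "kfold": KFold(n_splits=5, shuffle=True, random_state=42),
    "loo": LeaveOneOut(),
}
results = compare_cv_methods(Ridge(alpha=0.1), toy_X, toy_y, splitters)
for name, mse in results.items():
    print(f"{name}: mean MSE = {mse:.4f}")
```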

In [35]:
from CrossValidationComparison import CrossValidationComparison

# create an instance with the data and set the penalty strength (alpha) for model training
cv_comparison = CrossValidationComparison(X, y, alpha=0.1)

# call setup_model() on the instance to process the data and set up the pipeline
cv_comparison.setup_model()

Please be patient: this is a lot of computation, because we go through each CV method in turn, and k-fold alone has 5 folds to compute.
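
To put rough numbers on the cost, each splitter can report how many model fits it requires. The sketch below assumes leave-one-out is one of the methods being compared (it is imported at the top of the notebook, but the class internals are not shown) and uses a placeholder array with the same number of rows as X.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

n_rows = 889  # rows left in the training frame after dropping missing Embarked
placeholder_X = np.zeros((n_rows, 1))  # only the row count matters here

kfold_fits = KFold(n_splits=5).get_n_splits(placeholder_X)
loo_fits = LeaveOneOut().get_n_splits(placeholder_X)
print(f"k-fold fits: {kfold_fits}, leave-one-out fits: {loo_fits}")
```

Each fit reruns the whole pipeline, including the degree-3 polynomial expansion and the Lasso-based selection, which is why methods with many splits take so long.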

In [36]:
# run compare_methods() to fit the data with 3 different CV methods
# finally, this method yields the best score for each CV method
cv_comparison.compare_methods()



KeyboardInterrupt: 

K-fold comes out as the winner. (The output above is from an interrupted run, so the scores themselves are not shown here.)