Creative Commons CC BY 4.0 Lynd Bacon & Associates, Ltd. Not warranted to be suitable for any particular purpose. (You're on your own!)

# Avoid the Embarrassment of Data Leakage: <br>Rescale or Transform Within CV Folds!

A basic cross-validation, "anti-data leakage" notion is that a test data fold consists of data that an algorithm hasn't yet "seen" while it is learning from a training data fold.  To be consistent with this idea, any rescaling or transformation that depends on the values of the data must be done separately for training data and for test data.

# Rescaling Transformations Within Folds

In the following example we'll "standardize" the patient satisfaction data so that every predictor has mean=0 and SD=1.  In the "olden days" of data mining, doing this was sometimes referred to as "sphering" the data.  See:

[scikit-Learn StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)
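
As a quick illustration of what standardizing does, here is a minimal sketch on a small made-up array (the values and the names `X_toy` and `scaler_demo` are assumptions for illustration, not the patient satisfaction data). After `fit_transform`, each column should have mean 0 and SD 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: 5 rows, 2 features on different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 150.0],
                  [3.0, 200.0],
                  [4.0, 250.0],
                  [5.0, 300.0]])

scaler_demo = StandardScaler()            # defaults: with_mean=True, with_std=True
X_std = scaler_demo.fit_transform(X_toy)  # learn column means/SDs, then rescale

col_means = X_std.mean(axis=0)  # per-column means after rescaling
col_sds = X_std.std(axis=0)     # per-column SDs (population SD, ddof=0)
print(col_means, col_sds)
```

Note that `StandardScaler` divides by the population SD (`ddof=0`), so `X_std.std(axis=0, ddof=1)` would not be exactly 1.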

There are several other ways of rescaling variables.  Another common method is "MinMax," which rescales each feature linearly so that its minimum and maximum values map to the ends of a chosen range, [0, 1] by default.

[scikit-Learn MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)
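
A minimal sketch of MinMax rescaling, again on made-up values (`X_toy`, `X_new`, and `mm` are hypothetical names). With the default settings each column is mapped to [0, 1] using `(x - min) / (max - min)`, where the min and max come from the data the scaler was fit on.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data, made up for illustration
X_toy = np.array([[1.0, 100.0],
                  [3.0, 200.0],
                  [5.0, 300.0]])

mm = MinMaxScaler()             # default feature_range=(0, 1)
X_mm = mm.fit_transform(X_toy)  # each column becomes (x - min) / (max - min)

# New data are rescaled with the min/max learned from X_toy,
# so values outside that range land outside [0, 1].
X_new = np.array([[6.0, 350.0]])
X_new_mm = mm.transform(X_new)
print(X_mm)
print(X_new_mm)
```

The second call is a reminder that the result depends on which data the scaler was fit to, which is why the choice matters inside CV folds.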

You can find other tools for rescaling at [scikit-learn preprocessing API](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing).

# Get Some Essential Packages

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn import linear_model  # OLS 
from sklearn.metrics import mean_squared_error, r2_score # Basic metrics
from sklearn.model_selection import KFold
from sklearn import preprocessing

# Get the PT Satisfaction Data

Assuming that the file is at `DATA/ML/DECART-patSat.csv`, relative to the pwd:

In [2]:
# Input into a DataFrame, check the column names

ptSatDF=pd.read_csv('DATA/ML/DECART-patSat.csv')
ptSatDF.columns

Index(['caseID', 'patSat', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9',
       'ptCat'],
      dtype='object')

## Dummy Code the pt Categories

Just for consistency with things elsewhere, we'll dummy code the categories of `ptCat`, leaving out the first (0) category, medical admission.  (The regular one, not the "highfalutin" concierge type.)

In [10]:
ptSatDF2=ptSatDF.drop('caseID',axis=1)  # Get rid of caseID
ptCats=pd.get_dummies(ptSatDF2.ptCat,drop_first=True) # get 0/1 dummies, drop the 0 category
ptCats.columns

Int64Index([1, 2], dtype='int64')

In [12]:
ptSatDF3=ptSatDF2.drop('ptCat',axis=1)
ptSatDF4=pd.concat([ptSatDF3,ptCats],axis=1,sort=False)
ptSatDF4=ptSatDF4.rename(index=str,columns={1:'ptCat1',2:'ptCat2'})
ptSatDF4.columns

Index(['patSat', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9', 'ptCat1',
       'ptCat2'],
      dtype='object')

In [13]:
ptSatDF4.shape

(1811, 11)

# K-Fold CV with Separate X Train and Test Standardization Within Folds


20 folds, using the defaults for `scikit-learn`'s `StandardScaler()`.

In [14]:
kf=KFold(n_splits=20,random_state=99,shuffle=True)
X=ptSatDF4.iloc[:,1:].to_numpy()
y=ptSatDF4.iloc[:,0].to_numpy()

cvres=[]  # Holder list for fold results

regr=linear_model.LinearRegression() # define a reg model to use

scaler=preprocessing.StandardScaler() # by default, mean=0, sd=1

for traindx, testdx in kf.split(X):  # loop over folds
    resDict={}                       # Dictionary to hold fold results
    XTrainS=scaler.fit_transform(X[traindx])  # Xtrain rescaled
    yTrain=y[traindx]
    XTestS=scaler.fit_transform(X[testdx])    # Xtest rescaled separately, with the test fold's own mean and SD
    yTest=y[testdx]
    regModel=regr.fit(XTrainS,yTrain) 
    trainPred=regModel.predict(XTrainS)
    trainR2=r2_score(yTrain,trainPred)
    trainMSE=mean_squared_error(yTrain,trainPred)
    testPred=regModel.predict(XTestS)
    testR2=r2_score(yTest,testPred)
    testMSE=mean_squared_error(yTest,testPred)
    resDict.update({'trainR2':trainR2,
                    'testR2':testR2,
                    'trainMSE':trainMSE,
                    'testMSE':testMSE})
    cvres.append(resDict)
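
The loop above refits the scaler on each test fold. A commonly used alternative convention is to estimate the mean and SD from the training fold only and reuse them to rescale the test fold. The sketch below shows one way to do that with a `scikit-learn` pipeline and `cross_validate`. It uses synthetic data from `make_regression` as a stand-in (an assumption for illustration, not the patient satisfaction data), and the names `pipe`, `kf_demo`, and `scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the data used above)
X_syn, y_syn = make_regression(n_samples=200, n_features=5,
                               noise=10.0, random_state=0)

# Within each fold the pipeline fits the scaler on the training rows only,
# then applies those training means/SDs to the test rows before predicting.
pipe = make_pipeline(StandardScaler(), LinearRegression())
kf_demo = KFold(n_splits=5, shuffle=True, random_state=99)

scores = cross_validate(pipe, X_syn, y_syn, cv=kf_demo,
                        scoring=("r2", "neg_mean_squared_error"),
                        return_train_score=True)

test_r2 = scores["test_r2"]
test_mse = -scores["test_neg_mean_squared_error"]  # flip sign back to MSE
print(test_r2.mean(), test_mse.mean())
```

Either way, the rescaling parameters used while fitting the model never depend on the test fold.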

In [15]:
# Rearranging cols to make train vs test comparisons easier

cvresDF=pd.DataFrame(cvres)[['trainMSE','testMSE','trainR2','testR2']]

In [16]:
cvresDF.describe()

| | trainMSE | testMSE | trainR2 | testR2 |
|---|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 | 20.0 |
| mean | 1.905219 | 2.053023 | 0.706933 | 0.675779 |
| std | 0.017148 | 0.309234 | 0.002749 | 0.052867 |
| min | 1.871845 | 1.542167 | 0.700704 | 0.561533 |
| 25% | 1.894593 | 1.879348 | 0.705033 | 0.650602 |
| 50% | 1.902233 | 2.088802 | 0.70727 | 0.668211 |
| 75% | 1.923043 | 2.229367 | 0.708789 | 0.702608 |
| max | 1.929971 | 2.599047 | 0.711835 | 0.795454 |


# A UDU: Radon Regression With MinMax Rescaling

This can be done essentially like what's above. But instead of

`scaler=preprocessing.StandardScaler()`

use

`scaler=preprocessing.MinMaxScaler()`

Use the radon data, of course.  Don't forget that there's an observation with a missing value on `hhincome`. 
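
One possible way to deal with the missing value before rescaling is to drop the incomplete row. The sketch below does this on a tiny made-up DataFrame: only the column name `hhincome` comes from the text above, while `y_toy`, `x_toy`, and all of the values are hypothetical stand-ins for the radon data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up stand-in for the radon data, with one missing hhincome value
toy = pd.DataFrame({
    "y_toy":    [1.2, 0.7, 2.1, 1.5, 0.9],
    "x_toy":    [3.0, 1.0, 4.0, 2.0, 5.0],
    "hhincome": [40.0, np.nan, 85.0, 60.0, 52.0],
})

n_missing = int(toy["hhincome"].isna().sum())   # count the missing values
toy_complete = toy.dropna(subset=["hhincome"])  # keep only complete rows

X_toy = toy_complete[["x_toy", "hhincome"]].to_numpy()
y_toy = toy_complete["y_toy"].to_numpy()

X_mm = MinMaxScaler().fit_transform(X_toy)      # each column mapped to [0, 1]
print(n_missing, X_mm.min(axis=0), X_mm.max(axis=0))
```

Dropping the row is only one option. Whatever is done about the missing value, it needs to happen before the regression is fit, since `LinearRegression` does not accept missing values.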