## Train Test Frameworks

This exercise practices the syntax of the sklearn functions and classes that split data into train and test sets. The goal is to get familiar with these splitting methods before tackling the more complex activities at the end of the day.

In [14]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [1]:
# import numpy
import numpy as np

In [2]:
X = np.random.normal(0,1,20).reshape(10,2)
y = np.random.normal(0,1,10)

* print X

In [3]:
X

array([[ 1.63890215,  1.00137746],
       [ 1.52380719,  2.03260452],
       [ 0.23580624, -0.45649076],
       [-0.68720811, -0.05388869],
       [-0.17807331, -0.11255123],
       [ 0.3095679 ,  0.71384931],
       [ 1.39010448,  1.36055944],
       [ 0.72646961, -0.86341564],
       [ 0.18998379, -1.61783135],
       [-0.32824382, -0.37729068]])

* print y

In [4]:
y

array([ 1.00755839,  0.4164915 , -1.85286893,  0.09658669,  0.43114454,
        0.45493933,  0.34300939, -0.7788784 ,  0.55106677,  0.02994444])

_____________________________
### Holdout split

* import the **train_test_split** function from sklearn

In [6]:
from sklearn.model_selection import train_test_split

* split the data into a train set and a test set, using a 70:30 or an 80:20 ratio.

In [8]:
X.shape

(10, 2)

In [9]:
y.shape

(10,)

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

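In [40]:
# Note the execution count: this cell was run after the shuffle=False split further down,
# so the output shows the unshuffled split (first 8 rows in train, last 2 in test)
X_train, y_train, X_test, y_test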

(array([[ 1.63890215,  1.00137746],
        [ 1.52380719,  2.03260452],
        [ 0.23580624, -0.45649076],
        [-0.68720811, -0.05388869],
        [-0.17807331, -0.11255123],
        [ 0.3095679 ,  0.71384931],
        [ 1.39010448,  1.36055944],
        [ 0.72646961, -0.86341564]]),
 array([ 1.00755839,  0.4164915 , -1.85286893,  0.09658669,  0.43114454,
         0.45493933,  0.34300939, -0.7788784 ]),
 array([[ 0.18998379, -1.61783135],
        [-0.32824382, -0.37729068]]),
 array([0.55106677, 0.02994444]))

* print X_train

In [16]:
X_train

array([[-0.68720811, -0.05388869],
       [ 0.23580624, -0.45649076],
       [-0.32824382, -0.37729068],
       [ 0.72646961, -0.86341564],
       [ 0.18998379, -1.61783135],
       [ 1.39010448,  1.36055944],
       [ 0.3095679 ,  0.71384931],
       [-0.17807331, -0.11255123]])

* split the data again, this time with the parameter shuffle=False

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, shuffle=False)

* print X_train

In [18]:
X_train

array([[ 1.63890215,  1.00137746],
       [ 1.52380719,  2.03260452],
       [ 0.23580624, -0.45649076],
       [-0.68720811, -0.05388869],
       [-0.17807331, -0.11255123],
       [ 0.3095679 ,  0.71384931],
       [ 1.39010448,  1.36055944],
       [ 0.72646961, -0.86341564]])

In [21]:
X_test

array([[ 0.18998379, -1.61783135],
       [-0.32824382, -0.37729068]])

* print the shape of X_train and X_test

In [19]:
X_train.shape

(8, 2)

In [20]:
X_test.shape

(2, 2)
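
The shuffled split above changes on every run. A minimal sketch of a reproducible holdout split follows, assuming a fixed `random_state` is acceptable for the exercise. The `_demo` arrays, the seeded generator, and the seed values are illustrative assumptions, not part of the original cells.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Seeded toy data (hypothetical, mirrors the shapes used in this exercise)
rng = np.random.default_rng(0)
X_demo = rng.normal(0, 1, size=(10, 2))
y_demo = rng.normal(0, 1, size=10)

# train_test_split returns X_train, X_test, y_train, y_test in that order.
# The same random_state gives the same shuffled split on every call.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_train, b_train))
```

Leaving `random_state` unset keeps the split random, which is fine for exploration but makes results harder to compare between runs.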

_________________________________
### K-fold split 

* import the **KFold** class from sklearn

In [22]:
from sklearn.model_selection import KFold

* instantiate KFold with k=5 (the `n_splits` parameter)

In [23]:
help(KFold)

Help on class KFold in module sklearn.model_selection._split:

class KFold(_BaseKFold)
 |  KFold(n_splits=5, *, shuffle=False, random_state=None)
 |  
 |  K-Folds cross-validator
 |  
 |  Provides train/test indices to split data in train/test sets. Split
 |  dataset into k consecutive folds (without shuffling by default).
 |  
 |  Each fold is then used once as a validation while the k - 1 remaining
 |  folds form the training set.
 |  
 |  Read more in the :ref:`User Guide <k_fold>`.
 |  
 |  Parameters
 |  ----------
 |  n_splits : int, default=5
 |      Number of folds. Must be at least 2.
 |  
 |      .. versionchanged:: 0.22
 |          ``n_splits`` default value changed from 3 to 5.
 |  
 |  shuffle : bool, default=False
 |      Whether to shuffle the data before splitting into batches.
 |      Note that the samples within each split will not be shuffled.
 |  
 |  random_state : int, RandomState instance or None, default=None
 |      When `shuffle` is True, `random_state` affect

In [24]:
kf = KFold(n_splits=5)

* iterate over train_index and test_index in kf.split(X) and print them

In [36]:
split_data = kf.split(X)

for train_index, test_index in split_data:
    print(f"training set indices: {train_index}, testing set indices: {test_index}")

training set indices: [2 3 4 5 6 7 8 9], testing set indices: [0 1]
training set indices: [0 1 4 5 6 7 8 9], testing set indices: [2 3]
training set indices: [0 1 2 3 6 7 8 9], testing set indices: [4 5]
training set indices: [0 1 2 3 4 5 8 9], testing set indices: [6 7]
training set indices: [0 1 2 3 4 5 6 7], testing set indices: [8 9]


* instantiate KFold with k=5 (the `n_splits` parameter) and shuffle=True

In [38]:
kf = KFold(n_splits=5, shuffle=True)

* iterate over train_index and test_index in kf.split(X) and print them

In [39]:
split_data = kf.split(X)

for train_index, test_index in split_data:
    print(f"training set indices: {train_index}, testing set indices: {test_index}")

training set indices: [0 1 3 4 6 7 8 9], testing set indices: [2 5]
training set indices: [0 1 2 3 4 5 7 8], testing set indices: [6 9]
training set indices: [1 2 3 4 5 6 7 9], testing set indices: [0 8]
training set indices: [0 2 4 5 6 7 8 9], testing set indices: [1 3]
training set indices: [0 1 2 3 5 6 8 9], testing set indices: [4 7]


_______________________________________
### Leave-One-Out split
This is the Leave-p-out technique from the previous readings with p=1: each observation is used once as a single-sample test set.
- This is a popular method for tiny datasets.
- It is slow on bigger datasets because one model is fitted per observation, and the resulting error estimate tends to have high variance.

* import the **LeaveOneOut** class from sklearn

In [41]:
from sklearn.model_selection import LeaveOneOut

* instantiate LeaveOneOut

In [47]:
loo = LeaveOneOut()

In [48]:
help(loo)

Help on LeaveOneOut in module sklearn.model_selection._split object:

class LeaveOneOut(BaseCrossValidator)
 |  Leave-One-Out cross-validator
 |  
 |  Provides train/test indices to split data in train/test sets. Each
 |  sample is used once as a test set (singleton) while the remaining
 |  samples form the training set.
 |  
 |  Note: ``LeaveOneOut()`` is equivalent to ``KFold(n_splits=n)`` and
 |  ``LeavePOut(p=1)`` where ``n`` is the number of samples.
 |  
 |  Due to the high number of test sets (which is the same as the
 |  number of samples) this cross-validation method can be very costly.
 |  For large datasets one should favor :class:`KFold`, :class:`ShuffleSplit`
 |  or :class:`StratifiedKFold`.
 |  
 |  Read more in the :ref:`User Guide <leave_one_out>`.
 |  
 |  Examples
 |  --------
 |  >>> import numpy as np
 |  >>> from sklearn.model_selection import LeaveOneOut
 |  >>> X = np.array([[1, 2], [3, 4]])
 |  >>> y = np.array([1, 2])
 |  >>> loo = LeaveOneOut()
 |  >>> loo.get

* iterate over train_index and test_index in loo.split(X) and print them

In [50]:
split_data = loo.split(X)

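In [61]:
# split() returns a one-shot generator; after all 10 splits have been consumed
# by earlier next() calls, a further call raises StopIteration
next(split_data)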

StopIteration: 
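
The `StopIteration` above is what an exhausted generator produces. A small sketch of that behaviour, using a hypothetical 3-row array so the generator runs out quickly:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_demo = np.zeros((3, 2))               # hypothetical 3-sample dataset
splits = LeaveOneOut().split(X_demo)    # a generator, consumed as it is iterated

consumed = list(splits)                 # uses up all 3 splits
print(len(consumed))

try:
    next(splits)
    exhausted = False
except StopIteration:
    exhausted = True                    # nothing left to yield
print(exhausted)

# Calling split() again returns a fresh generator starting at the first split
train_index, test_index = next(LeaveOneOut().split(X_demo))
print(train_index, test_index)
```

This is why the loop in the next cell calls `loo.split(X)` again instead of reusing `split_data`.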

In [62]:
for train_index, test_index in loo.split(X):
    print(train_index, test_index)

[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]
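
The help text above notes that `LeaveOneOut()` is equivalent to `KFold(n_splits=n)` and `LeavePOut(p=1)`. The sketch below checks that claim on a placeholder array. The helper `as_lists` and the `_demo` array are hypothetical names introduced for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut

X_demo = np.zeros((10, 2))   # only the number of rows matters for splitting
n = len(X_demo)

def as_lists(cv):
    """Collect (train, test) index pairs as plain lists for easy comparison."""
    return [(tr.tolist(), te.tolist()) for tr, te in cv.split(X_demo)]

loo_splits = as_lists(LeaveOneOut())
lpo_splits = as_lists(LeavePOut(p=1))
kf_splits = as_lists(KFold(n_splits=n))   # unshuffled, one sample per fold

print(loo_splits == lpo_splits == kf_splits)
```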


* print the number of splits

In [64]:
loo.get_n_splits(X)

10
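
The number of splits is also the number of model fits, which is where the cost of Leave-One-Out comes from. As a hedged illustration, the sketch below passes a `LeaveOneOut` splitter to `cross_val_score`. The choice of `LinearRegression`, the seeded `_demo` data, and the mean-squared-error scoring are assumptions, not part of the exercise. An error-based metric is used because R² is not defined on a single-sample test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Seeded toy data (hypothetical, same shapes as in this exercise)
rng = np.random.default_rng(0)
X_demo = rng.normal(0, 1, size=(10, 2))
y_demo = rng.normal(0, 1, size=10)

loo_demo = LeaveOneOut()

# One fit and one score per sample; scores are negated MSE values
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=loo_demo, scoring="neg_mean_squared_error",
)

print(len(scores), loo_demo.get_n_splits(X_demo))
print(-scores.mean())   # average squared error over the held-out samples
```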