## Train-test Split 

You can split your dataset into a training set and a test set using the `train_test_split` function from `sklearn.model_selection`.

Once the data is split, you can train your model on the training set and evaluate it on the unseen test set.
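As a rough illustration of that train-then-evaluate workflow, here is a minimal sketch. The synthetic dataset from `make_classification` and the choice of `LogisticRegression` are assumptions made for this example only; any estimator with `fit` and `score` would do. With no `test_size` given, `train_test_split` holds out 25% of the samples by default.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic dataset: 200 samples, 4 features, binary labels
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Default split: 75% for training, 25% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit on the training set only
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on data the model has never seen
test_accuracy = model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

The accuracy printed here will vary from run to run, because the split is reshuffled on every call.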

Useful parameters of the `train_test_split` function:
- `test_size` (float or int): a float between 0.0 and 1.0 sets the fraction of the dataset reserved for the test set. E.g., 0.2 reserves 20% of the dataset for testing and uses the remaining 80% for training. An int sets the absolute number of test samples.
- `random_state` (int): seeds the shuffling applied before the split. Pass an integer for reproducible output across multiple function calls.
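To make the `test_size` behaviour concrete, the sketch below splits a hypothetical list of 100 items (the name `data` is only for illustration), once with a fraction and once with an absolute count, and prints the resulting sizes.

```python
from sklearn.model_selection import train_test_split

data = list(range(100))  # hypothetical dataset of 100 items

# A float is read as a fraction of the dataset
train_a, test_a = train_test_split(data, test_size=0.2, random_state=0)
print(len(train_a), len(test_a))  # 80 20

# An int is read as an absolute number of test samples
train_b, test_b = train_test_split(data, test_size=15, random_state=0)
print(len(train_b), len(test_b))  # 85 15
```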

The `random_state` parameter matters because, without it, the train and test split will differ each time you run the function. To keep the split the same on every run, pass an integer value for `random_state`. Popular seed values are 0 and 42 (the latter from *The Hitchhiker's Guide to the Galaxy*).

This is important for debugging and reproducibility: you want the train and test split to be exactly the same whenever the function is run.
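One way to check this programmatically, rather than by eye, is to call the function twice with the same seed and compare the results. This is a small sketch; the variable names are illustrative.

```python
from sklearn.model_selection import train_test_split

data = list(range(1, 11))

# Same seed, two separate calls
first = train_test_split(data, test_size=0.2, random_state=42)
second = train_test_split(data, test_size=0.2, random_state=42)

# Each call returns [train, test]; with list input these are plain lists,
# so they can be compared directly
print(first == second)  # True
```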


In [3]:
# a very quick and basic example of train and test split using sklearn

from sklearn.model_selection import train_test_split

quick_test_dataset = [1,2,3,4,5,6,7,8,9,10]

# splits this dataset into a train and a test set 
train_test_split(quick_test_dataset, test_size=0.2)

# observe that the split is different each time the function is run because random_state hasn't been specified

[[4, 2, 6, 5, 3, 9, 8, 10], [7, 1]]

In [4]:
# split with random_state makes sure that the split remains the same across different runs of the function!

train_test_split(quick_test_dataset, random_state=42, test_size=0.2)

# the split doesn't change in this case!


[[6, 1, 8, 3, 10, 5, 4, 7], [9, 2]]

In [5]:
# Another example
import numpy as np

# let's create a synthetic classification dataset and try the train-test split on it
X = np.arange(1, 25).reshape(12, 2)
Y = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1])

# here X has 12 data points, each with 2 features (points in R^2)
# and Y contains the labels (0 or 1)

X,Y

(array([[ 1,  2],
        [ 3,  4],
        [ 5,  6],
        [ 7,  8],
        [ 9, 10],
        [11, 12],
        [13, 14],
        [15, 16],
        [17, 18],
        [19, 20],
        [21, 22],
        [23, 24]]),
 array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1]))

In [6]:
# splitting this dataset using train_test_split
# test_size=0.2 asks for 20% of 12 samples (2.4), which is rounded up to 3 test samples
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

print("X Train:", X_train, "X Train Shape:", X_train.shape)
print("X Test:", X_test, "X Test Shape:", X_test.shape)
print("Y Train:", Y_train, "Y Train Shape", Y_train.shape)
print("Y Test:", Y_test, "Y Test Shape", Y_test.shape)

X Train: [[21 22]
 [ 5  6]
 [17 18]
 [ 3  4]
 [15 16]
 [19 20]
 [ 7  8]
 [ 1  2]
 [11 12]] X Train Shape: (9, 2)
X Test: [[13 14]
 [23 24]
 [ 9 10]] X Test Shape: (3, 2)
Y Train: [0 0 1 1 1 0 1 1 0] Y Train Shape (9,)
Y Test: [1 1 1] Y Test Shape (3,)
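Two things in the output above are worth checking explicitly: the test set has 3 rows rather than 2.4 (the requested fraction is rounded up), and each row of `X` keeps its original label after shuffling. The sketch below repeats the split and verifies both. The index-recovery trick relies on how this particular `X` was built (the first feature of row `i` is `2*i + 1`), so it is specific to this toy dataset.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1, 25).reshape(12, 2)
Y = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# 0.2 * 12 = 2.4, rounded up to 3 test samples; the remaining 9 go to training
expected_test_size = math.ceil(0.2 * len(X))
print(len(X_test) == expected_test_size)  # True

# Recover each test row's original index from its first feature
# (1, 3, 5, ... maps back to 0, 1, 2, ...)
original_idx = (X_test[:, 0] - 1) // 2

# The labels returned for the test rows match the labels at those original indices
print(np.array_equal(Y[original_idx], Y_test))  # True
```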
