## Measuring Success: Splitting up the data for train, validation, and test set

Split the dataset up into the following segments:
1. Training Data: 60%
2. Validation Data: 20%
3. Test Data: 20%

### Read in data

_Welcome back. In this lesson we're going to take what we discussed in the last section and actually implement it._

_So we will start by importing the packages we'll need and reading in our data. I'll just call your attention to the `train_test_split` function we're importing from `sklearn` - that will make our job here **very** easy._

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

titanic = pd.read_csv('../titanic.csv')
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Split into train, validation, and test set

_Now, let's split the data into train, validation, and test sets. We need to start by separating our data into our features (the fields used to make a prediction) and our labels or target variable (in our case, whether somebody survived or not)._

_Next, we call `train_test_split`. We pass in our features first, then our labels, then `test_size`, the fraction of the dataset we want allocated to the test set. Lastly, `random_state` is just the seed for the randomizer, which makes the split reproducible (we don't need to discuss it further here). You might be wondering why I'm using a test size of 40%. Well, `train_test_split` can't split into three datasets in one call, so we'll handle this in two steps: allocate 60% to training and 40% to what it calls "test", then take that 40% and split it in half. That gives us our 60% training, 20% validation, and 20% test set._

_First, we need to name the outputs `train_test_split` gives us. It outputs four datasets: it takes the features and the labels and splits each of them into train and test. So the output is `X_train`, `X_test`, `y_train`, `y_test`, **and it is always in this order, which is very important**. `train_test_split` also keeps the features and labels aligned, so the same examples that are in `X_train` are in `y_train`, in the same order, and likewise for the test set._
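To make the output order and the feature/label alignment concrete, here is a minimal sketch on toy data. The ten-row `features` and `labels` objects are made up for illustration and are not the Titanic data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 10 rows, one feature, one label.
features = pd.DataFrame({"x": range(10)})
labels = pd.Series([0, 1] * 5, name="y")

# Outputs come back as: features-train, features-test, labels-train, labels-test.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.4, random_state=42
)

# Features and labels stay row-aligned: the shuffled indices match exactly.
print(X_train.index.equals(y_train.index))
print(X_test.index.equals(y_test.index))

# No row lands in both the train and the test portion.
print(set(X_train.index).isdisjoint(X_test.index))
```

If the four names were unpacked in a different order, nothing would raise an error; the variables would simply hold the wrong pieces, which is why the order matters so much.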

_Now, let's take that test set and split it into validation and test. We copy and paste the call down, change `features` to `X_test`, change `labels` to `y_test`, and change `test_size` to 50%. So we're taking the 40% of the full dataset that we assigned to the test set and splitting it in half to create our validation set and test set. Now, let's update the output names: `X_test`, `X_val`, `y_test`, and `y_val`._

In [2]:
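# Features are every column except the target; labels are the target column
features = titanic.drop('Survived', axis=1)
labels = titanic['Survived']

# Step 1: 60% train, 40% held out (temporarily called "test")
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)
# Step 2: split the held-out 40% in half into test and validation (20% each)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)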
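The two-step split in the cell above could be wrapped in a small helper if you need it more than once. This is only a sketch: `split_train_val_test` and its parameter names are hypothetical (not part of `sklearn`), and the toy data is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_val_test(features, labels, val_size=0.2, test_size=0.2, seed=42):
    """Hypothetical helper: a three-way split via two chained train_test_split calls."""
    holdout = val_size + test_size
    # Step 1: carve off the combined validation + test portion.
    X_train, X_hold, y_train, y_hold = train_test_split(
        features, labels, test_size=holdout, random_state=seed
    )
    # Step 2: split the held-out rows; validation's share of them is val_size / holdout.
    X_test, X_val, y_test, y_val = train_test_split(
        X_hold, y_hold, test_size=val_size / holdout, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Toy data (made up for illustration): 100 rows.
features = pd.DataFrame({"x": range(100)})
labels = pd.Series([0, 1] * 50, name="y")

X_train, X_val, X_test, y_train, y_val, y_test = split_train_val_test(features, labels)
print(len(X_train), len(X_val), len(X_test))
```

The second call uses `val_size / holdout` rather than `val_size` because it only sees the held-out rows: 20% of the full dataset is 50% of the 40% that was held out.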

_Now, let's quickly take a look at the length of each of these datasets to make sure that 60% went to train and 20% each to validation and test. So we print out the length of `labels` (the full dataset), the length of `y_train`, the length of `y_val`, and the length of `y_test`._

_And we can confirm that it's split out the way we expected. The 357 held-out rows can't be divided evenly, so the validation set ends up with one more row than the test set, but that's not a big deal._

In [3]:
print(len(labels), len(y_train), len(y_val), len(y_test))

891 534 179 178
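The sizes printed above can be reproduced by hand. The sketch below assumes that a fractional `test_size` is turned into a row count by rounding up, which is consistent with the numbers in the output; `split_sizes` is a hypothetical helper, not an `sklearn` function.

```python
import math


def split_sizes(n_samples, test_size):
    """Hypothetical helper: (train, test) row counts, assuming the test count is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# First split: 891 rows, 40% held out.
n_train, n_holdout = split_sizes(891, 0.4)

# Second split: half of the held-out rows. In the notebook, the rounded-up
# "test" output of this call is the one assigned to the validation set.
n_test, n_val = split_sizes(n_holdout, 0.5)

print(n_train, n_val, n_test)  # 534 179 178
```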


### Explore the data

_One last step. I'm not going to go over the code here, but basically it prints out some summary statistics in the training set, validation set, and test set for each numeric feature. The idea here is that each dataset should be representative of the full data. One way to check that is to make sure that each feature has a very similar distribution across the three datasets._

_So taking a quick look at `Pclass` - the mean, standard deviation, first quartile, median, and third quartile are all very similar. You can go down the list for `Age`, etc. and see it's the same for each of them. Now, it would be an issue if we saw that the average age in the training set was 17 while the average age in the validation set was 42. However, as long as you have a large enough dataset and it's being split up randomly, you really shouldn't run into any cases like that._
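One way to put a number on "very similar" is to compare the split means relative to the spread of the full data. The sketch below is only an illustration under stated assumptions: it uses synthetic data rather than the Titanic file, and it splits by shuffling with `DataFrame.sample` rather than `train_test_split`, purely to keep the example short.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the Titanic file): one numeric column, 1000 rows.
rng = np.random.default_rng(42)
data = pd.DataFrame({"Age": rng.normal(30, 14, size=1000)})

# Random 60/20/20 split by shuffling the row order and slicing.
shuffled = data.sample(frac=1, random_state=42)
splits = {
    "train": shuffled.iloc[:600],
    "val": shuffled.iloc[600:800],
    "test": shuffled.iloc[800:],
}

# One describe() column per split, labeled by split name.
summary = pd.concat(
    [part["Age"].describe() for part in splits.values()],
    axis=1,
    keys=list(splits.keys()),
)
print(summary)

# Largest gap between split means, in units of the full-data standard deviation.
gap = (summary.loc["mean"].max() - summary.loc["mean"].min()) / data["Age"].std()
print(round(gap, 3))
```

A gap that is a small fraction of one standard deviation suggests the splits look alike on that feature; a gap close to or above one standard deviation would be the kind of mismatch described above.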

In [4]:
columns = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

# For each feature, print describe() side by side; the columns are train, validation, test
for c in columns:
    print(pd.concat([X_train[c].describe(), X_val[c].describe(), X_test[c].describe()], axis=1))

           Pclass      Pclass      Pclass
count  534.000000  179.000000  178.000000
mean     2.337079    2.223464    2.308989
std      0.825628    0.864602    0.837017
min      1.000000    1.000000    1.000000
25%      2.000000    1.000000    2.000000
50%      3.000000    3.000000    3.000000
75%      3.000000    3.000000    3.000000
max      3.000000    3.000000    3.000000
              Age         Age         Age
count  427.000000  147.000000  140.000000
mean    29.284356   30.938231   29.663071
std     14.504577   14.608181   14.537967
min      0.420000    0.920000    0.830000
25%     20.000000   21.000000   20.750000
50%     28.000000   29.000000   28.000000
75%     37.000000   40.000000   38.250000
max     80.000000   71.000000   71.000000
            SibSp       SibSp       SibSp
count  534.000000  179.000000  178.000000
mean     0.588015    0.391061    0.460674
std      1.268291    0.736777    0.830984
min      0.000000    0.000000    0.000000
25%      0.000000    0.000000    0