# Split your data into train and test sets

#### Having data to test how your model performs is essential to give you confidence it works.

As with any data project, when developing a model, you want to check that everything works as expected. The data you use for developing a machine learning model can be split into **train** and **test** sets.

The **train set** is used to build the model and generally makes up around **70 to 80 percent** of your initial data. The **test set** makes up the remaining **20 to 30 percent**.

## Why do I need a test set?

You can use the test set to check how well your model performs.

You do this by feeding the **input variables** from the test set into your model and seeing what output it gives you. You can then compare this **predicted output** with the **actual output**.
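As a minimal sketch of that comparison, the snippet below counts how often the predicted output matches the actual output. The two lists are made-up values for illustration only; in practice they would come from your model and your test set.

```python
# Hypothetical values, for illustration only: the real ones would come
# from your model (predicted) and your test set (actual).
actual_output = [0, 1, 0, 1, 1, 0]
predicted_output = [0, 1, 1, 1, 0, 0]

# Count the records where the prediction matches the actual value.
correct = sum(a == p for a, p in zip(actual_output, predicted_output))

# Accuracy is the share of test records that were predicted correctly.
accuracy = correct / len(actual_output)
print(f"{correct} of {len(actual_output)} correct (accuracy = {accuracy:.2f})")
# prints: 4 of 6 correct (accuracy = 0.67)
```

Accuracy is only one possible measure; which one suits your project depends on the problem.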

Let’s take a look at an example of how to split your data into train and test sets.

## Step 1: Split input and output data

First, let’s import pandas and load the data.

In [6]:
import pandas as pd

df = pd.read_csv("life_insurance_data.csv")
df

Unnamed: 0,income_usd,property_status,has_life_insurance
0,20500,Owner with Mortgage,0
1,31500,Owner with Mortgage,1
2,37000,Owner with Mortgage,0
3,47000,Owner with Mortgage,1
4,49000,Owner with Mortgage,0
5,55000,Owner with Mortgage,1
6,67500,Owner with Mortgage,1
7,72000,Owner with Mortgage,1
8,98000,Owner with Mortgage,1
9,102000,Owner with Mortgage,1


There are **30 rows** in this dataframe.

From this data, we can model `has_life_insurance` using `income_usd` and `property_status` as inputs.

When we train a model, we always have to provide the **input data** (`income_usd`, `property_status`) and **output data** (`has_life_insurance`) separately.

The **input** dataframe is denoted using an upper case `X`:

In [7]:
X = df.drop(columns=['has_life_insurance'])
X

Unnamed: 0,income_usd,property_status
0,20500,Owner with Mortgage
1,31500,Owner with Mortgage
2,37000,Owner with Mortgage
3,47000,Owner with Mortgage
4,49000,Owner with Mortgage
5,55000,Owner with Mortgage
6,67500,Owner with Mortgage
7,72000,Owner with Mortgage
8,98000,Owner with Mortgage
9,102000,Owner with Mortgage


And the **output** is denoted using a lower case `y`:

In [27]:
y = df['has_life_insurance']
y

0     0
1     1
2     0
3     1
4     0
5     1
6     1
7     1
8     1
9     1
10    0
11    0
12    0
13    0
14    0
15    1
16    0
17    0
18    1
19    1
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    1
28    0
29    1
Name: has_life_insurance, dtype: int64

This is fine if all we want to do is train a model.

But we also need to split the data into train and test sets, **not just input and output sets**.

## Step 2: Split train and test data

For this we can use `train_test_split()`, which we first need to import:

In [10]:
from sklearn.model_selection import train_test_split

Then call `train_test_split()`, passing your input data `X`, your output data `y`, and the share of the original data you want in the test set. In this case, we specify 20%:

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

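`train_test_split()` returns a list of four objects, which we unpack into four variables. The two `X` objects are dataframes and the two `y` objects are Series:

- `X_train` – input training data
- `X_test` – input testing data
- `y_train` – output training data
- `y_test` – output testing data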
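A quick way to see what comes back is to inspect the result before unpacking it. The sketch below uses a small made-up dataframe as a stand-in for the article's CSV file, so the values and the `splits` variable name are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the article's data (values are made up for illustration).
df = pd.DataFrame({
    "income_usd": [20500, 31500, 37000, 47000, 49000,
                   55000, 67500, 72000, 98000, 102000],
    "property_status": ["Owner with Mortgage"] * 5 + ["Renter"] * 5,
    "has_life_insurance": [0, 1, 0, 1, 0, 1, 1, 1, 1, 1],
})
X = df.drop(columns=["has_life_insurance"])
y = df["has_life_insurance"]

# Keep the result in one variable instead of unpacking it straight away.
splits = train_test_split(X, y, test_size=0.2)
print(type(splits).__name__, len(splits))

# Show the type and shape of each of the four parts.
for name, part in zip(["X_train", "X_test", "y_train", "y_test"], splits):
    print(name, type(part).__name__, part.shape)
```

Because `X` is a dataframe and `y` is a single column, the input parts come back as `DataFrame` objects and the output parts as `Series` objects.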

`X_train` will now contain 80% of the input data along with the corresponding output data in `y_train`.

`X_test` will have 20% of the input data along with the corresponding output data in `y_test`. We can confirm this with a quick check of the number of records in each:

In [26]:
len(X_train)

24

In [28]:
len(y_train)

24

In [29]:
len(X_test)

6

In [23]:
len(y_test)

6
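Matching lengths are a good sign, but you may also want to confirm that each input row is still paired with its own output value, and that no record landed in both sets. The sketch below does this by comparing row labels; it uses made-up data with 30 rows as a stand-in for the article's file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in with 30 rows (values are made up for illustration).
X = pd.DataFrame({"income_usd": range(20000, 50000, 1000)})
y = pd.Series([i % 2 for i in range(30)], name="has_life_insurance")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Inputs and outputs should carry the same row labels, in the same order.
print(X_train.index.equals(y_train.index))  # True
print(X_test.index.equals(y_test.index))    # True

# No record should appear in both the train set and the test set.
overlap = set(X_train.index) & set(X_test.index)
print(len(overlap))  # 0
```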

💡 Top Tip:

The output of `train_test_split()` is a random sample, so the records chosen will be different each time you run the code.

If you need the random sample to stay the same, you can add the `random_state=n` argument, where `n` is any integer of your choice. As long as `n` stays the same, you’ll get the same random sample. For example:

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

`X_test` will now always contain the same rows, even if you run the code again:

In [34]:
X_test

Unnamed: 0,income_usd,property_status
5,55000,Owner with Mortgage
23,43000,Renter
22,37500,Renter
28,92000,Renter
1,31500,Owner with Mortgage
21,26500,Renter
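If you want to check this behaviour yourself, one option is to run the split twice with the same `random_state` and compare the results. This sketch uses made-up data, and the names `first`, `second` and `same` are just for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in with 30 rows (values are made up for illustration).
X = pd.DataFrame({"income_usd": range(20000, 50000, 1000)})
y = pd.Series([i % 2 for i in range(30)], name="has_life_insurance")

# Run the same split twice with the same random_state.
first = train_test_split(X, y, test_size=0.2, random_state=21)
second = train_test_split(X, y, test_size=0.2, random_state=21)

# Compare each of the four parts from the first run with the second run.
same = all(a.equals(b) for a, b in zip(first, second))
print(same)  # True
```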


You can now fit your model using `X_train` and `y_train`, and then test it using `X_test` and `y_test`.
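Here is a rough sketch of what that final step could look like. The article does not name a model, so the decision tree used here is an assumption, as are the made-up data and the one-hot encoding of `property_status` (needed because scikit-learn models expect numeric inputs).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the article's data (values are made up for illustration).
df = pd.DataFrame({
    "income_usd": [20500, 31500, 37000, 47000, 49000,
                   55000, 67500, 72000, 98000, 102000] * 2,
    "property_status": ["Owner with Mortgage"] * 10 + ["Renter"] * 10,
    "has_life_insurance": [0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
                           0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# Turn the text column into numeric 0/1 columns, then split inputs and output.
X = pd.get_dummies(df.drop(columns=["has_life_insurance"]),
                   columns=["property_status"], dtype=int)
y = df["has_life_insurance"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21
)

# Fit on the train set only (the choice of model is an assumption).
model = DecisionTreeClassifier(random_state=21)
model.fit(X_train, y_train)

# Predict on the test inputs and compare with the actual test outputs.
predicted = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Accuracy on the test set: {accuracy:.2f}")
```

With so few rows, the accuracy figure will vary a lot depending on which records end up in the test set, so treat it as a demonstration of the workflow rather than a meaningful result.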