# Data Preprocessing for Machine Learning: Splitting Your Data

## Why do you need to split your dataset into training and testing sets?

If you train your machine learning model on the whole dataset, you have two problems.

Firstly, you don't know how your model will perform on other datasets, because you haven't tested the model on anything else.

Secondly, you risk **overfitting** the model. That is, fitting it so closely to one dataset, noise included, that it performs worse on other datasets than it otherwise could.
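As a rough illustration of overfitting, the sketch below fits an unconstrained decision tree to synthetic, noisy data and compares accuracy on the rows it was trained on with accuracy on held-out rows. The data, the choice of `DecisionTreeClassifier`, and the seed values are assumptions made for this example; they are not part of the loan case study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (hypothetical; not the loan dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
# Only the first feature carries signal; the added noise makes labels partly random
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unconstrained tree can memorize the training rows, noise included
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print(f"training accuracy: {train_accuracy:.2f}")
print(f"testing accuracy:  {test_accuracy:.2f}")
```

A large gap between the two numbers suggests the model has learned quirks of the training rows rather than a pattern that generalizes.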

## When do you need to split your dataset?

Splitting the data into training and testing sets should happen **before feature engineering**. Otherwise, you can end up with **data leakage**, where information from the testing set influences how the model is trained. This is a type of "cheating" that inflates the apparent performance of the model and gives you a false sense that it predicts better than it really does.
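One common way to respect this ordering is to learn any preprocessing parameters from the training set only, then reuse them on the testing set. The following is a minimal sketch on synthetic data; the use of `StandardScaler` and the variable names are assumptions chosen for illustration, not a step from the case study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix and labels (hypothetical; not the loan dataset)
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Learn the scaling parameters (mean, standard deviation) from the training set only
scaler = StandardScaler().fit(X_train)

# 3. Apply those same parameters to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the full dataset instead would let the testing rows shape the mean and standard deviation used in training, which is a small but real form of leakage.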

## Case study: Classifying loan applications

Let's make predictions on some loan application data. If you want to try your own analysis, you can access this via [its Workspace template](https://app.datacamp.com/workspace/datasets/dataset-python-loans).

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
loan_applications = pd.read_csv("loan_data.csv", nrows=1000)
loan_applications

The response variable is `credit.policy`. It takes the value `1` when the application meets the underwriting policy (so a loan can be issued), and `0` otherwise.

In [None]:
response = loan_applications["credit.policy"]
response

All the other columns can be used for features.

Note the use of [`get_dummies()`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) to convert the categorical column, `purpose`, to several columns of ones and zeros.

In [None]:
features = pd.get_dummies(
    loan_applications.drop(columns="credit.policy")
)
features

## Splitting into training and testing sets

To split into training and testing sets, we call [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), passing the response and the features.

It returns a list of 4 things, one train/test pair for each input, in the order the inputs were passed:

- the responses for the training set
- the responses for the testing set
- the features for the training set
- the features for the testing set

To make the code easier to work with, we use **variable unpacking** to assign these to 4 separate variables.

In [None]:
response_train, response_test, features_train, features_test = train_test_split(response, features)

Looking at the **shape** of each of these variables makes it clear what each contains.

In [None]:
print(response_train.shape)
print(response_test.shape)
print(features_train.shape)
print(features_test.shape)

By default, 75% of the data ends up in the training set, and 25% ends up in the testing set.
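To check the output order and the default proportions without the loan data, here is a small sketch on a toy dataset of 1000 rows. The names `toy_response` and `toy_features` are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, one response column, two feature columns
toy_response = pd.Series(np.arange(1000) % 2, name="y")
toy_features = pd.DataFrame({"a": np.arange(1000), "b": np.arange(1000) * 2})

# Outputs come as train/test pairs, in the same order as the inputs
y_train, y_test, X_train, X_test = train_test_split(toy_response, toy_features)

print(y_train.shape, y_test.shape)  # (750,) (250,)
print(X_train.shape, X_test.shape)  # (750, 2) (250, 2)

# The same rows are selected for the response and the features
print(y_train.index.equals(X_train.index))  # True
```

With no `test_size` or `train_size` given, scikit-learn holds out a quarter of the rows for testing.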

## Controlling the train/test split quantities

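To change the amount of data used in the test set, set the `test_size` argument. A float between 0 and 1 is treated as a fraction of the rows; an integer is treated as an absolute number of rows. For very small datasets, it is common to reduce the fraction in the test set.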

In [None]:
response_train, response_test, features_train, features_test = train_test_split(response, features, test_size=0.2)

In [None]:
print(response_train.shape)
print(response_test.shape)
print(features_train.shape)
print(features_test.shape)
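As a quick check of how `test_size` behaves, the sketch below passes it once as a fraction and once as a row count. The arrays `X` and `y` are synthetic stand-ins assumed for this example, not the loan data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, two feature columns
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# A float is read as the fraction of rows to hold out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)

# An integer is read as the absolute number of testing rows
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X, y, test_size=30)
print(X_train_n.shape, X_test_n.shape)  # (70, 2) (30, 2)
```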

## Enforcing reproducibility

The training and testing sets are randomly generated. If you want to return exactly the same training and testing sets when you run your code repeatedly (for example when publishing the results in a report), you need to set the random seed with the `random_state` argument.

In [None]:
response_train, response_test, features_train, features_test = train_test_split(response, features, random_state=2022)
response_train2, response_test2, features_train2, features_test2 = train_test_split(response, features, random_state=2022)

In [None]:
response_train.equals(response_train2)
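The same check can be run on any data. Below is a minimal sketch with a hypothetical toy DataFrame: two splits that share a `random_state` are compared, and a third split uses a different seed for contrast.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows
data = pd.DataFrame({"x": range(100)})

# Same seed: the two splits select exactly the same rows in the same order
a_train, a_test = train_test_split(data, random_state=2022)
b_train, b_test = train_test_split(data, random_state=2022)
print(a_train.equals(b_train))  # True
print(a_test.equals(b_test))    # True

# Different seed: the rows will almost certainly differ
c_train, c_test = train_test_split(data, random_state=1)
print(a_train.equals(c_train))
```

Any fixed integer works as a seed; what matters is that the same value is used on every run.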