# How to Sample Data in Python

## Learning Objectives
To get an unbiased assessment of a supervised machine learning model's performance, we need to evaluate it on data that it did not encounter during training. To accomplish this, we split our data into a training subset and a test subset before building the model. One common way to do so is to create non-overlapping subsets of the original data using one of several **sampling** approaches. By the end of this tutorial, you will have learned:

+ how to split data using simple random sampling
+ how to split data using stratified random sampling
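
To make the idea of non-overlapping subsets concrete, here is a minimal sketch that splits a small, made-up DataFrame by hand with pandas. The data and the column names `feature` and `label` are illustrative assumptions, not part of the tutorial's dataset.

```python
import pandas as pd

# Made-up data for illustration only; not the tutorial's vehicles dataset.
data = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# Draw 30% of the rows at random (without replacement) as the test subset.
test = data.sample(frac=0.3, random_state=1234)

# Every row that was not drawn becomes part of the training subset.
train = data.drop(test.index)

# The two subsets share no rows and together cover the original data.
print(len(train), len(test))
print(set(train.index) & set(test.index))
```

The rest of the tutorial uses `train_test_split` from scikit-learn, which performs this kind of split in a single call.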

In [None]:
import pandas as pd
vehicles = pd.read_csv("vehicles.csv")
vehicles

## How to split data using Simple Random Sampling

In [None]:
from sklearn.model_selection import train_test_split

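In [None]:
# Assumes x (predictors) and y (response) were already selected from vehicles.
# With no test_size given, 25% of the rows are held out for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y)

In [None]:
x_train.shape

In [None]:
y_train.shape

In [None]:
x_test.shape

In [None]:
y_test.shape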
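
The cells in this section rely on `x` and `y` already being defined. As a self-contained sketch of simple random sampling, the example below builds a small stand-in DataFrame and splits it twice: once with the default test size and once with an explicit one. The data, the column names `cylinders` and `target`, and the 40% test size are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the tutorial's data; column names are hypothetical.
df = pd.DataFrame({
    "drive": ["FWD", "RWD", "AWD", "FWD"] * 25,   # 100 rows in total
    "cylinders": [4, 6, 8, 4] * 25,
    "target": range(100),
})

x = df[["drive", "cylinders"]]   # predictors
y = df["target"]                 # response

# Default split: 25% of the rows go to the test set.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1234)
default_shapes = (x_train.shape, x_test.shape)
print(default_shapes)   # ((75, 2), (25, 2))

# Explicit split: hold out 40% of the rows instead.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=1234
)
explicit_shapes = (x_train.shape, x_test.shape)
print(explicit_shapes)  # ((60, 2), (40, 2))
```

Setting `random_state` makes the split reproducible, so rerunning the cell yields the same rows in each subset.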

## How to split data using Stratified Random Sampling

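In [None]:
# Simple random split holding out 1% of the rows, for comparison.
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size = 0.01,
                                                    random_state = 1234)
x_test['drive'].value_counts(normalize = True)

In [None]:
# Stratified split: preserve the proportions of the drive column in both subsets.
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size = 0.01,
                                                    random_state = 1234,
                                                    stratify = x['drive'])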

In [None]:
x_test['drive'].value_counts(normalize = True)
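
Stratifying on a column keeps that column's class proportions the same, or nearly so, in the training and test subsets. The sketch below illustrates this with the `stratify` parameter of `train_test_split` on made-up, deliberately imbalanced data. The class counts, the `target` column, and the 10% test size are assumptions chosen so that the proportions work out exactly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, deliberately imbalanced data; not the tutorial's vehicles dataset.
df = pd.DataFrame({
    "drive": ["FWD"] * 700 + ["RWD"] * 200 + ["AWD"] * 100,
    "target": range(1000),
})
x = df[["drive"]]
y = df["target"]

# Class proportions in the full data: FWD 0.7, RWD 0.2, AWD 0.1.
print(x["drive"].value_counts(normalize=True))

# Simple random split: test-set proportions may drift from the originals.
_, x_test_random, _, _ = train_test_split(
    x, y, test_size=0.1, random_state=1234
)
print(x_test_random["drive"].value_counts(normalize=True))

# Stratified split: each class contributes in proportion to its size.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=1234, stratify=x["drive"]
)
print(x_test["drive"].value_counts(normalize=True))
```

The benefit is most visible with small test sets, where a simple random draw can under-represent or even miss a rare class.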