# Split data into train and test sets

A train/test split randomly partitions a dataset into a training set and a test set. A model evaluated on the same data it was trained on gives an overly optimistic performance estimate and hides overfitting, so part of the data is held out for testing. The test set is typically about 20% of the dataset.

Data snooping refers to statistical inference that the researcher decides to perform after looking at the data. Setting the test set aside before exploring the data helps to avoid it.

## Table of Contents
1. [Split data using custom function](#1.-Split-data-using-custom-function)
2. [Scikit-learn: train_test_split](#2.-Scikit-learn:-train_test_split)

# 1. Split data using custom function
### Define a custom function
- Shuffle the indices
- Select the test size/ratio
- Split data based on the shuffled indices
- Note: The random number generator `seed` can be set to reproduce the same shuffled indices across multiple reruns

In [1]:
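import numpy as np

def split_train_test(data, test_ratio):
    # fix the seed so every rerun produces the same shuffled indices
    np.random.seed(42)
    shuffled_indices = np.random.permutation(len(data))
    # number of test rows, rounded down
    test_set_size = int(len(data) * test_ratio)
    # the first test_set_size shuffled rows form the test set, the rest the training set
    train_set = data.iloc[shuffled_indices[test_set_size:]]
    test_set = data.iloc[shuffled_indices[:test_set_size]]
    return train_set, test_set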
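
The note on seeding can be checked on a small example. The sketch below is a variant of the function above, not the original: the name `split_train_test_seeded`, the `seed` parameter, and the toy DataFrame are all assumptions made for illustration. It uses a local NumPy generator instead of the global seed, so its shuffle order will not match the cell above.

```python
import numpy as np
import pandas as pd


def split_train_test_seeded(data, test_ratio, seed=42):
    """Hypothetical variant of split_train_test with the seed as a parameter."""
    rng = np.random.default_rng(seed)  # local generator, global state untouched
    shuffled_indices = rng.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    train_set = data.iloc[shuffled_indices[test_set_size:]]
    test_set = data.iloc[shuffled_indices[:test_set_size]]
    return train_set, test_set


# made-up toy data, only for illustration
toy = pd.DataFrame({"x": range(10)})

train_a, test_a = split_train_test_seeded(toy, 0.2)
train_b, test_b = split_train_test_seeded(toy, 0.2)

same_split = test_a.index.equals(test_b.index)            # same seed, same test rows
no_overlap = set(train_a.index).isdisjoint(test_a.index)  # no row in both sets
sizes = (len(train_a), len(test_a))
print(same_split, no_overlap, sizes)  # True True (8, 2)
```

Passing a different `seed` gives a different but equally reproducible split.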

### Import VDOT Traffic dataset and take a quick look

In [2]:
import pandas as pd

# read dataset + quick look
vdotDf = pd.read_csv('./datasets/VDOT_Traffic_Volume.csv')
vdotDf.info()
vdotDf.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121811 entries, 0 to 121810
Data columns (total 23 columns):
OBJECTID                 121811 non-null int64
DATA_DATE                121811 non-null object
ROUTE_COMMON_NAME        121811 non-null object
START_LABEL              121811 non-null object
END_LABEL                121811 non-null object
ADT                      121803 non-null float64
ADT_QUALITY              121811 non-null object
PERCENT_4_TIRE           25457 non-null float64
PERCENT_BUS              25457 non-null float64
PERCENT_TRUCK_2_AXLE     25457 non-null float64
PERCENT_TRUCK_3_AXLE     25457 non-null float64
PERCENT_TRUCK_1_TRAIL    25457 non-null float64
PERCENT_TRUCK_2_TRAIL    25457 non-null float64
CLASS_QUALITY_CODE       121811 non-null object
AAWDT                    27685 non-null float64
AAWDT_QUALITY_CODE       121796 non-null object
FROM_JURISDICTION        121714 non-null object
TO_JURISDICTION          121646 non-null object
ROUTE_NAME               

Unnamed: 0,OBJECTID,DATA_DATE,ROUTE_COMMON_NAME,START_LABEL,END_LABEL,ADT,ADT_QUALITY,PERCENT_4_TIRE,PERCENT_BUS,PERCENT_TRUCK_2_AXLE,...,CLASS_QUALITY_CODE,AAWDT,AAWDT_QUALITY_CODE,FROM_JURISDICTION,TO_JURISDICTION,ROUTE_NAME,FROM_DISTRICT,TO_DISTRICT,RTE_TYPE_CD,Shape__Length
0,1,2011-08-03T00:00:00.000Z,SC-2901N (Accomack County),Bus US 13,Dead End,100.0,R,,,,...,X,,X,Accomack County,Accomack County,R-VA001SC02901NB,Hampton Roads,Hampton Roads,SC,253.337714
1,2,2013-05-15T00:00:00.000Z,SC-1383N (Prince William County),Cul-de-Sac,76-1279 Longview Dr,60.0,M,,,,...,X,,X,Prince William County,Prince William County,R-VA076SC01383NB,Northern Virginia,Northern Virginia,SC,111.903491
2,3,2014-08-05T00:00:00.000Z,SC-2352N (Hanover County),42-1685 Daffodil Rd,42-2351 Sydnor Lane,100.0,R,,,,...,X,,X,Hanover County,Hanover County,R-VA042SC02352NB,Richmond,Richmond,SC,582.523332


### Execute custom split_train_test()
- Set the test size to 20% (a ratio of 0.2)

In [3]:
# execute split_train_test()
train_set, test_set = split_train_test(vdotDf, 0.2)
print(train_set.shape)
print(test_set.shape)

(97449, 23)
(24362, 23)


# 2. Scikit-learn: train_test_split
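Scikit-learn provides a `train_test_split` function that performs the same kind of split. Its `random_state` parameter sets the random number generator seed. Because `train_test_split` rounds the test set size up while the custom function above rounds it down, the resulting set sizes differ by one row from those in section 1.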

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

# read dataset
vdotDf = pd.read_csv('./datasets/VDOT_Traffic_Volume.csv')

# execute train_test_split
train_set, test_set = train_test_split(vdotDf, test_size=0.2, random_state=42)

# display set sizes
print(train_set.shape)
print(test_set.shape)

(97448, 23)
(24363, 23)
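
The one-row difference between the two results comes down to rounding. Below is a minimal sketch of the arithmetic, assuming (as the outputs above suggest) that `train_test_split` rounds a fractional test size up, while the custom function truncates it with `int()`. The variable names are illustrative only.

```python
import math

n_rows = 121811   # number of rows reported by vdotDf.info()
test_ratio = 0.2

# custom split_train_test(): int() truncates toward zero
custom_test = int(n_rows * test_ratio)
custom_train = n_rows - custom_test

# assumed scikit-learn behaviour: fractional test size rounded up
sklearn_test = math.ceil(n_rows * test_ratio)
sklearn_train = n_rows - sklearn_test

print(custom_train, custom_test)    # 97449 24362
print(sklearn_train, sklearn_test)  # 97448 24363
```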
