# Train Test Split

As you know, it is important to test our models thoroughly on new data that they haven't seen before. Our model-building decisions, and our feature engineering decisions in particular, should therefore be based only on what we see in our training data.

We can then test the effectiveness of these techniques and our model on the testing data.

### Import Basic Packages

In [14]:
#Load the required libraries

# Data manipulation libraries
import pandas as pd
import numpy as np

### Separate Training & Testing Data

Prior to our feature engineering step, it is important that we have clearly separated training and testing datasets. Sometimes, each of these datasets comes from a different file. 

In [15]:
# Option 1a: Train and Test data appear in different files. Import training data.
df_train = pd.read_csv('airbnb_dataset_training.csv')
df_train

Unnamed: 0,id,minimum_nights,number_of_reviews,neighbourhood_group,neighbourhood,room_type,price
0,2539,1,9,Brooklyn,Kensington,Private room,149
1,2595,1,45,Manhattan,Midtown,Entire home/apt,225
2,3647,3,0,Manhattan,Midtown,Private room,150
3,3831,1,270,Brooklyn,Clinton Hill,Entire home/apt,89
4,5022,10,9,Manhattan,Murray Hill,Entire home/apt,80
5,5099,3,74,Manhattan,Murray Hill,Entire home/apt,200
6,5121,45,49,Brooklyn,Bedford-Stuyvesant,Private room,60
7,5178,2,430,Manhattan,Hell's Kitchen,Private room,79


In [16]:
# Option 1b: Train and Test data appear in different files. Import testing data.
df_test = pd.read_csv('airbnb_dataset_testing.csv')
df_test

Unnamed: 0,id,minimum_nights,number_of_reviews,neighbourhood_group,neighbourhood,room_type,price
0,13808,1,112,Brooklyn,Bedford-Stuyvesant,Private room,80
1,16338,7,27,Brooklyn,Clinton Hill,Private room,55
2,16421,30,191,Manhattan,Hell's Kitchen,Private room,52
3,15220,2,289,Manhattan,Hell's Kitchen,Private room,69
4,5238,1,160,Manhattan,Chinatown,Entire home/apt,150
5,12937,3,248,Queens,Long Island City,Private room,130
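
When the training and testing data arrive in separate files, the inputs still have to be separated from the output in each one. Below is a minimal sketch of that step. It assumes `price` is the target for the Airbnb data (the notebook does not say so explicitly), it rebuilds two rows from each table above by hand rather than reading the CSV files, and the `_airbnb` variable names are hypothetical, chosen so they do not clash with names used later.

```python
import pandas as pd

# Two rows copied from each table above, standing in for the two CSV files.
df_train_demo = pd.DataFrame({
    'id': [2539, 2595],
    'minimum_nights': [1, 1],
    'number_of_reviews': [9, 45],
    'room_type': ['Private room', 'Entire home/apt'],
    'price': [149, 225],
})
df_test_demo = pd.DataFrame({
    'id': [13808, 16338],
    'minimum_nights': [1, 7],
    'number_of_reviews': [112, 27],
    'room_type': ['Private room', 'Private room'],
    'price': [80, 55],
})

# Assumption: 'price' is the target variable for this dataset.
y_train_airbnb = df_train_demo['price']
x_train_airbnb = df_train_demo.drop(columns=['price'])

y_test_airbnb = df_test_demo['price']
x_test_airbnb = df_test_demo.drop(columns=['price'])

# Both sets should expose the same input columns, in the same order.
same_columns = list(x_train_airbnb.columns) == list(x_test_airbnb.columns)
print(same_columns)
```

Checking that the two files share the same input columns is a cheap way to catch a mismatch before any feature engineering is applied.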


### SKLearn's Train Test Split

Often, all of the data arrives in a single file, so we must separate the training and testing data ourselves, as well as our inputs from our outputs. We can use SKLearn's Train Test Split to help us achieve this.


SKLearn Train Test Split Documentation can be found here:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [17]:
# Import data to a pandas dataframe
df_phones = pd.read_csv('phone_marketplace_dataset.csv')

# We will be working with the phones dataset. The target variable is price.

In [18]:
# Separate the target variable from the input features.
y = df_phones['price']    # Note that y becomes a series
x = df_phones.drop(columns=['price'])  # x becomes a dataframe (columns= already implies axis=1)

# Often, a capitalized X is used to denote the X input variables. For this course, we kept everything lowercase.

In [19]:
# Import SKLearn's train_test_split function
from sklearn.model_selection import train_test_split

In [20]:
# Separate the 4 parts of the dataset from each other, using a typical test size of 20%.
# Random state ensures repeatability.
test_size = 0.2
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size, random_state=42)
# Again, x_train and x_test are dataframes, whilst y_train and y_test are series.
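
To see what the split returns, it can help to run it on a tiny dataset and inspect the pieces. The sketch below uses a hypothetical 10-row dataframe (`toy`, with made-up `feature` and `target` columns) rather than the phones data, and checks three things: the sizes of the pieces, that inputs and outputs stay aligned by index, and that the same `random_state` reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one input column and one target column.
toy = pd.DataFrame({'feature': range(10), 'target': range(100, 110)})
toy_x = toy.drop(columns=['target'])
toy_y = toy['target']

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42
)

# With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing.
print(toy_x_train.shape, toy_x_test.shape)

# Rows are shuffled, but each x row keeps the same index label as its y value.
aligned = toy_x_train.index.equals(toy_y_train.index)

# Repeating the split with the same random_state selects the same rows.
repeat_x_train, _, _, _ = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42
)
repeatable = toy_x_train.index.equals(repeat_x_train.index)
print(aligned, repeatable)
```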

**Example Feature Engineering Application**

- MinMaxScaler FIT and TRANSFORM methods would be applied to x_train (scaler would apply to numerical cols only)
- MinMaxScaler TRANSFORM method only (no FIT) would be applied to x_test (scaler would apply to numerical cols only)
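
The two bullet points above can be sketched as follows. This is a minimal illustration, not the course's own implementation: `demo_train` and `demo_test` are hypothetical stand-ins for `x_train` and `x_test`, built by hand from a few of the phone rows displayed in this notebook, and selecting numerical columns with `select_dtypes` is one reasonable choice among several.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-ins for x_train and x_test (values taken from rows shown in this notebook).
demo_train = pd.DataFrame({
    'battery_life_percentage': [95, 76, 87, 74],
    'storage': [128, 128, 64, 256],
    'marketplace': ['craigslist', 'facebook', 'facebook', 'facebook'],
})
demo_test = pd.DataFrame({
    'battery_life_percentage': [79, 77],
    'storage': [128, 128],
    'marketplace': ['kijiji', 'facebook'],
})

# The scaler applies to numerical columns only.
numeric_cols = demo_train.select_dtypes(include='number').columns

scaler = MinMaxScaler()

# FIT and TRANSFORM on the training data: min and max are learned here.
train_scaled = pd.DataFrame(
    scaler.fit_transform(demo_train[numeric_cols]),
    columns=numeric_cols, index=demo_train.index,
)

# TRANSFORM only on the testing data: the training min and max are reused.
test_scaled = pd.DataFrame(
    scaler.transform(demo_test[numeric_cols]),
    columns=numeric_cols, index=demo_test.index,
)

print(train_scaled)
print(test_scaled)
```

Because the minimum and maximum come from the training data alone, scaled test values can fall outside the 0 to 1 range whenever a test value lies outside the range seen in training. That is expected and is not a reason to refit the scaler on the test data.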

### Visualizing x_train, y_train, x_test, y_test

A great way to understand the train test split process is to visualize the 4 parts of the data as a grid. You do not need to know how to do this. We've simply done this to help you visualize the data.

In [13]:
# Use the below function to display the 4 dataframes in a grid.
display_train_test_split()

Unnamed: 0,battery_life_percentage,year_made,rating,name,storage,magnet_charging,marketplace
228,95,2019,,iPhone_11,128,no,craigslist
78,76,2019,,iPhone_11,128,no,facebook
90,87,2020,,iPhone_12,64,yes,facebook
16,74,2021,,iPhone_13,256,yes,facebook
66,94,2020,,iPhone_12,64,yes,craigslist
287,82,2019,,iPhone_11,128,no,kijiji
7,84,2019,,iPhone_11,256,no,facebook
110,73,2021,,iPhone_13,64,yes,kijiji

Unnamed: 0,price
228,514
78,600
90,675
16,1179
66,723
287,403
7,452
110,1015

Unnamed: 0,battery_life_percentage,year_made,rating,name,storage,magnet_charging,marketplace
157,79,2022,,iPhone_14,128,yes,kijiji
341,77,2020,,iPhone_12,128,yes,facebook

Unnamed: 0,price
157,1377
341,831


### Used for teaching purposes only to display dataframes in a grid

In [12]:
# Display dataframes next to each other
from IPython.display import display_html

def display_train_test_split():

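    # Calculate rough test/train size (OUT OF TEN) to display in visual
    test_size_summary = int(round(test_size * 10))
    train_size_summary = 10 - test_size_summary
    
    # Flag the percentages as approximate when test_size is not a multiple of 10%
    # (a small tolerance avoids false positives from floating point rounding, e.g. 0.3 * 10)
    pct_display = ''
    if abs(test_size * 10 - test_size_summary) > 1e-9:
        pct_display = 'Approx '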

    train_pct = pct_display + f'{train_size_summary * 10}'
    test_pct = pct_display + f'{test_size_summary * 10}'


    # Create summary versions of the dataframes
    y_train_summary = pd.DataFrame(y_train).head(train_size_summary)
    x_train_summary = pd.DataFrame(x_train).head(train_size_summary)

    y_test_summary = pd.DataFrame(y_test).head(test_size_summary)
    x_test_summary = pd.DataFrame(x_test).head(test_size_summary)

    # Build the grid as one HTML table: a header row, a TRAIN row and a TEST row.
    # Each row has 4 cells: label, X dataframe, spacer, Y dataframe.
    html_str='<table>'
    
    html_str+='<tr>'
    html_str+='<td></td>'
    html_str+='<td style="vertical-align:top"><h2 style="text-align: center;">X</h2></td>'
    html_str+='<td></td>'
    html_str+='<td style="vertical-align:top"><h2 style="text-align: center;">Y</h2></td>'
    html_str+='</tr>'

    # Replace only the opening '<table' so that the closing </table> tag stays valid
    html_str+='<tr>'
    html_str+=f'<td><h2 style="text-align: center;">TRAIN</h2><p>({train_pct}% of data)</p></td>'
    html_str+='<td>' + x_train_summary.to_html().replace('<table','<table style="text-align:center"') + '</td>'
    html_str+='<td></td>'
    html_str+='<td>' + y_train_summary.to_html().replace('<table','<table style="text-align:center"') + '</td>'
    html_str+='</tr>'
    
    html_str+='<tr>'
    html_str+=f'<td><h2 style="text-align: center;">TEST</h2><p>({test_pct}% of data)</p></td>'
    html_str+='<td>' + x_test_summary.to_html().replace('<table','<table style="text-align:center"') + '</td>'
    html_str+='<td></td>'
    html_str+='<td>' + y_test_summary.to_html().replace('<table','<table style="text-align:center"') + '</td>'
    html_str+='</tr>'

    html_str+='</table>'
   
    display_html(html_str,raw=True)