# 1. Target and Features

An example of a supervised learning question related to the [Ames housing data](https://drive.google.com/file/d/1Jyu1qtuqGewyfhUdpOfdZYCj6o6b4FE1/view) is:

How well can the Sale Price of a home be predicted based on the features of the home?

# Load the data

In [None]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import pandas and change max columns
import pandas as pd
pd.set_option('display.max_columns',100)

# Load in the data from Google drive, set the index, and preview first 5 rows
fpath = "/content/drive/MyDrive/STUDENTS/refactory/DS/ames-housing-dojo_edited.csv"
df = pd.read_csv(fpath)
df = df.set_index("PID")
df.head()
#df.shape

PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remodeled,Exter Qual,Exter Cond,Bsmt Unf Sqft,Total Bsmnt Sqft,Central Air,Living Area Sqft,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom,Kitchen,Total Rooms,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold,SalePrice
907227090,RL,60,7300,Pave,MISSING,AllPub,CollgCr,1Fam,1Story,5,8,1972,1972,TA,TA,427.0,864.0,Y,864.0,0.0,0.0,1,0,3,1,5,Detchd,1977.0,1.0,297.0,TA,TA,Y,MnPrv,03-2006,119900.0
527108010,RL,134,19378,Pave,MISSING,AllPub,Gilbert,1Fam,2Story,7,5,2005,2006,Gd,TA,1335.0,1392.0,Y,2462.0,1.0,0.0,2,1,4,1,9,Attchd,2006.0,2.0,576.0,TA,TA,Y,MISSING,03-2006,320000.0
534275170,RL,-1,12772,Pave,MISSING,AllPub,NAmes,1Fam,1Story,6,8,1960,1998,TA,Gd,460.0,958.0,Y,958.0,0.0,0.0,1,0,2,1,5,Attchd,1960.0,1.0,301.0,TA,TA,Y,MISSING,04-2007,151500.0
528104050,RL,114,14803,Pave,MISSING,AllPub,NridgHt,1Fam,1Story,10,5,2007,2008,Ex,TA,442.0,2078.0,Y,2084.0,1.0,0.0,2,0,2,1,7,Attchd,2007.0,3.0,1220.0,TA,TA,Y,MISSING,06-2008,385000.0
533206070,FV,32,3784,Pave,Pave,AllPub,Somerst,TwnhsE,1Story,8,5,2006,2007,Gd,TA,1451.0,1511.0,Y,1565.0,1.0,0.0,2,0,2,1,5,Attchd,2006.0,2.0,476.0,TA,TA,Y,MISSING,02-2007,193800.0


You may be thinking: why would we need to predict the sale price? Isn't that value already given in the "SalePrice" column?

While we do already know the sale price for the homes in the dataset, the ultimate goal is to make predictions for homes whose sale price we don't yet know.

***NB: In order to make predictions on future unknown data, our computer will first have to "learn" from past cases where the answer is already known (labeled data).***

**Target (y)**

The target is the column we are trying to predict. In this case, the "SalePrice" column is the target.

**Features (X)**

The features are the columns we will use to make the prediction. In this case, all of the other columns ("MS Zoning", "Lot Frontage", "Lot Area", "Street", "Alley", etc.) are the features.

**Feature selection** is the process of selecting which features (columns) we want to use within our machine-learning model. There are often columns that are not going to be helpful for the machine learning process. For example, a column that has unique values in every single row or a value that we know is not predictive (such as ID number or phone number) is not going to provide any useful information on trends that help us predict our target.

However, a unique identifier can still serve as the index, which is what we did with the "PID" column immediately after loading the Ames housing data above.
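As a rough sketch of this idea, the snippet below builds a small made-up DataFrame (the `toy` name and its values are illustrative and are not taken from the Ames file) and flags columns whose values are different in every row. Such columns are candidates for the index rather than for the features matrix.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only
toy = pd.DataFrame({
    "PID": [101, 102, 103, 104],
    "Bedroom": [3, 4, 3, 2],
    "SalePrice": [120000.0, 320000.0, 150000.0, 120000.0],
})

# A column with as many distinct values as there are rows behaves like
# an identifier and carries no trend that generalizes to new rows
id_like = [col for col in toy.columns if toy[col].nunique() == len(toy)]
print(id_like)

# Keep the identifier as the index so it is not used as a feature
toy = toy.set_index("PID")
print(toy.columns.tolist())
```

This check is only a rule of thumb: in a small dataset a genuinely useful continuous column can also happen to be unique in every row, so the flagged columns still need a human look.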

# Arrange Data into Features Matrix and Target Vector

The code below arranges data into a features matrix and target vector.

In [None]:
df.columns

Index(['MS Zoning', 'Lot Frontage', 'Lot Area', 'Street', 'Alley', 'Utilities',
       'Neighborhood', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remodeled', 'Exter Qual',
       'Exter Cond', 'Bsmt Unf Sqft', 'Total Bsmnt Sqft', 'Central Air',
       'Living Area Sqft', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath',
       'Half Bath', 'Bedroom', 'Kitchen', 'Total Rooms', 'Garage Type',
       'Garage Yr Blt', 'Garage Cars', 'Garage Area', 'Garage Qual',
       'Garage Cond', 'Paved Drive', 'Fence', 'Date Sold', 'SalePrice'],
      dtype='object')

In [None]:
# Define the target vector and store it in the variable y
y = df['SalePrice']

# Select the feature columns explicitly (displayed here; assigned to x further below)
df[['MS Zoning', 'Lot Frontage', 'Lot Area', 'Street', 'Alley', 'Utilities',
       'Neighborhood', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remodeled', 'Exter Qual',
       'Exter Cond', 'Bsmt Unf Sqft', 'Total Bsmnt Sqft', 'Central Air',
       'Living Area Sqft', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath',
       'Half Bath', 'Bedroom', 'Kitchen', 'Total Rooms', 'Garage Type',
       'Garage Yr Blt', 'Garage Cars', 'Garage Area', 'Garage Qual',
       'Garage Cond', 'Paved Drive', 'Fence', 'Date Sold']]

PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remodeled,Exter Qual,Exter Cond,Bsmt Unf Sqft,Total Bsmnt Sqft,Central Air,Living Area Sqft,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom,Kitchen,Total Rooms,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold
907227090,RL,60,7300,Pave,MISSING,AllPub,CollgCr,1Fam,1Story,5,8,1972,1972,TA,TA,427.0,864.0,Y,864.0,0.0,0.0,1,0,3,1,5,Detchd,1977.0,1.0,297.0,TA,TA,Y,MnPrv,03-2006
527108010,RL,134,19378,Pave,MISSING,AllPub,Gilbert,1Fam,2Story,7,5,2005,2006,Gd,TA,1335.0,1392.0,Y,2462.0,1.0,0.0,2,1,4,1,9,Attchd,2006.0,2.0,576.0,TA,TA,Y,MISSING,03-2006
534275170,RL,-1,12772,Pave,MISSING,AllPub,NAmes,1Fam,1Story,6,8,1960,1998,TA,Gd,460.0,958.0,Y,958.0,0.0,0.0,1,0,2,1,5,Attchd,1960.0,1.0,301.0,TA,TA,Y,MISSING,04-2007
528104050,RL,114,14803,Pave,MISSING,AllPub,NridgHt,1Fam,1Story,10,5,2007,2008,Ex,TA,442.0,2078.0,Y,2084.0,1.0,0.0,2,0,2,1,7,Attchd,2007.0,3.0,1220.0,TA,TA,Y,MISSING,06-2008
533206070,FV,32,3784,Pave,Pave,AllPub,Somerst,TwnhsE,1Story,8,5,2006,2007,Gd,TA,1451.0,1511.0,Y,1565.0,1.0,0.0,2,0,2,1,5,Attchd,2006.0,2.0,476.0,TA,TA,Y,MISSING,02-2007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
903400030,RL,50,11672,Pave,Pave,AllPub,BrkSide,1Fam,1Story,5,5,1925,1950,TA,TA,816.0,816.0,Y,816.0,0.0,0.0,1,0,2,1,4,Detchd,1925.0,1.0,210.0,Fa,Fa,N,MISSING,07-2006
533234020,FV,79,10646,Pave,MISSING,AllPub,Somerst,1Fam,2Story,7,5,2001,2001,TA,TA,177.0,858.0,Y,1789.0,1.0,0.0,2,1,3,1,7,Attchd,2001.0,2.0,546.0,TA,TA,Y,MISSING,06-2008
908188140,RM,24,2522,Pave,MISSING,AllPub,Edwards,Twnhs,2Story,7,5,2004,2004,Gd,TA,970.0,970.0,Y,1709.0,0.0,0.0,2,0,3,1,7,Detchd,2004.0,2.0,380.0,TA,TA,Y,MISSING,04-2006
909254050,RL,54,7609,Pave,MISSING,AllPub,Crawfor,1Fam,2Story,8,9,1925,1997,Gd,Gd,392.0,798.0,Y,1512.0,1.0,0.0,2,0,3,1,7,Detchd,1925.0,1.0,180.0,TA,TA,P,GdPrv,06-2008


In [None]:
df.head()
df.shape

(2930, 36)

In [None]:
# Assign the target column to y
y = df['SalePrice']
y
# Assign the features to X (In this case we include all columns except the target column)
# x_1= df[['MS Zoning', 'Lot Frontage', 'Lot Area', 'Street', 'Alley', 'Utilities',
#        'Neighborhood', 'Bldg Type', 'House Style', 'Overall Qual',
#        'Overall Cond', 'Year Built', 'Year Remodeled', 'Exter Qual',
#        'Exter Cond', 'Bsmt Unf Sqft', 'Total Bsmnt Sqft', 'Central Air',
#        'Living Area Sqft', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath',
#        'Half Bath', 'Bedroom', 'Kitchen', 'Total Rooms', 'Garage Type',
#        'Garage Yr Blt', 'Garage Cars', 'Garage Area', 'Garage Qual',
#        'Garage Cond', 'Paved Drive', 'Fence', 'Date Sold',]]
# print(x_1.head())
# OR, more concisely:

# Define the features matrix by dropping the target column
x = df.drop(columns='SalePrice')
x.head()

PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remodeled,Exter Qual,Exter Cond,Bsmt Unf Sqft,Total Bsmnt Sqft,Central Air,Living Area Sqft,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom,Kitchen,Total Rooms,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold
907227090,RL,60,7300,Pave,MISSING,AllPub,CollgCr,1Fam,1Story,5,8,1972,1972,TA,TA,427.0,864.0,Y,864.0,0.0,0.0,1,0,3,1,5,Detchd,1977.0,1.0,297.0,TA,TA,Y,MnPrv,03-2006
527108010,RL,134,19378,Pave,MISSING,AllPub,Gilbert,1Fam,2Story,7,5,2005,2006,Gd,TA,1335.0,1392.0,Y,2462.0,1.0,0.0,2,1,4,1,9,Attchd,2006.0,2.0,576.0,TA,TA,Y,MISSING,03-2006
534275170,RL,-1,12772,Pave,MISSING,AllPub,NAmes,1Fam,1Story,6,8,1960,1998,TA,Gd,460.0,958.0,Y,958.0,0.0,0.0,1,0,2,1,5,Attchd,1960.0,1.0,301.0,TA,TA,Y,MISSING,04-2007
528104050,RL,114,14803,Pave,MISSING,AllPub,NridgHt,1Fam,1Story,10,5,2007,2008,Ex,TA,442.0,2078.0,Y,2084.0,1.0,0.0,2,0,2,1,7,Attchd,2007.0,3.0,1220.0,TA,TA,Y,MISSING,06-2008
533206070,FV,32,3784,Pave,Pave,AllPub,Somerst,TwnhsE,1Story,8,5,2006,2007,Gd,TA,1451.0,1511.0,Y,1565.0,1.0,0.0,2,0,2,1,5,Attchd,2006.0,2.0,476.0,TA,TA,Y,MISSING,02-2007


Checking the type of object, we see that y is now a series:

In [None]:
#y is a target vector
type(y)

We can view five entries of y (the slice below returns the rows at positions 5 through 9, not the first five):

In [None]:
y[5:10]

PID,SalePrice
908102320,149900.0
528174080,185850.0
532377130,131400.0
528102030,394617.0
534127170,185750.0


We can also check the type of object for X and we see that it is a dataframe:

In [None]:
#X is a feature matrix
type(x)

We can view the first five entries for X:

In [None]:
x.head()

PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remodeled,Exter Qual,Exter Cond,Bsmt Unf Sqft,Total Bsmnt Sqft,Central Air,Living Area Sqft,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom,Kitchen,Total Rooms,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold
907227090,RL,60,7300,Pave,MISSING,AllPub,CollgCr,1Fam,1Story,5,8,1972,1972,TA,TA,427.0,864.0,Y,864.0,0.0,0.0,1,0,3,1,5,Detchd,1977.0,1.0,297.0,TA,TA,Y,MnPrv,03-2006
527108010,RL,134,19378,Pave,MISSING,AllPub,Gilbert,1Fam,2Story,7,5,2005,2006,Gd,TA,1335.0,1392.0,Y,2462.0,1.0,0.0,2,1,4,1,9,Attchd,2006.0,2.0,576.0,TA,TA,Y,MISSING,03-2006
534275170,RL,-1,12772,Pave,MISSING,AllPub,NAmes,1Fam,1Story,6,8,1960,1998,TA,Gd,460.0,958.0,Y,958.0,0.0,0.0,1,0,2,1,5,Attchd,1960.0,1.0,301.0,TA,TA,Y,MISSING,04-2007
528104050,RL,114,14803,Pave,MISSING,AllPub,NridgHt,1Fam,1Story,10,5,2007,2008,Ex,TA,442.0,2078.0,Y,2084.0,1.0,0.0,2,0,2,1,7,Attchd,2007.0,3.0,1220.0,TA,TA,Y,MISSING,06-2008
533206070,FV,32,3784,Pave,Pave,AllPub,Somerst,TwnhsE,1Story,8,5,2006,2007,Gd,TA,1451.0,1511.0,Y,1565.0,1.0,0.0,2,0,2,1,5,Attchd,2006.0,2.0,476.0,TA,TA,Y,MISSING,02-2007


To recap the two steps:

 - Define the target vector with the slicing method (selecting the target column).
 - Define the features matrix with the drop method (dropping the target column from the original dataset).
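The two steps can be tried in isolation on a tiny stand-in DataFrame. The sketch below copies three rows' worth of values from the preview earlier in this lesson; the names `toy`, `X_toy`, and `y_toy` are illustrative only.

```python
import pandas as pd

# Small stand-in for the Ames DataFrame, indexed by PID
toy = pd.DataFrame(
    {
        "Living Area Sqft": [864.0, 2462.0, 958.0],
        "Bedroom": [3, 4, 2],
        "SalePrice": [119900.0, 320000.0, 151500.0],
    },
    index=pd.Index([907227090, 527108010, 534275170], name="PID"),
)

# Target vector: selecting a single column returns a Series
y_toy = toy["SalePrice"]

# Features matrix: dropping the target column returns a DataFrame
X_toy = toy.drop(columns="SalePrice")

print(type(y_toy).__name__, y_toy.shape)
print(type(X_toy).__name__, X_toy.shape)
```

Because `drop` returns a new DataFrame, the original `toy` still contains the "SalePrice" column afterwards.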


# Summary


In this lesson, you learned the terminology used in supervised learning. The column of data being predicted is the target vector, y. The columns used to make the predictions are the features matrix, X. You also learned how to define these variables in Python. This is an important step in preparing data for machine learning.

# 2. Train Test Split (Model Evaluation)

We will use the Ames housing data that we prepared for machine learning above.

For this machine learning task, the goal is to predict the price of a home (the target) based on a variety of features such as the living area in square feet, the number of bedrooms, the number of bathrooms, etc.

Is this a regression or a classification task, and why?
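One informal way to approach this question is to inspect the target itself. The sketch below uses a hypothetical helper, `looks_like_regression`, that applies a simple rule of thumb: a numeric target suggests regression, while a target made of labels suggests classification. The prices are copied from the preview earlier in this lesson; the "Central Air" labels are made up for contrast.

```python
import pandas as pd

# Example targets for illustration; neither is loaded from the Ames file
price = pd.Series([119900.0, 320000.0, 151500.0, 385000.0, 193800.0], name="SalePrice")
central_air = pd.Series(["Y", "Y", "N", "Y", "N"], name="Central Air")

def looks_like_regression(target: pd.Series) -> bool:
    """Rule of thumb only: treat a numeric target as a regression target."""
    return bool(pd.api.types.is_numeric_dtype(target))

print(looks_like_regression(price))        # numeric amounts
print(looks_like_regression(central_air))  # category labels
```

The rule has limits: class labels are sometimes stored as integers (for example 0 and 1), so a numeric dtype alone does not settle the question.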

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2930 entries, 907227090 to 902201120
Data columns (total 36 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   MS Zoning         2930 non-null   object 
 1   Lot Frontage      2930 non-null   int64  
 2   Lot Area          2930 non-null   int64  
 3   Street            2930 non-null   object 
 4   Alley             2930 non-null   object 
 5   Utilities         2930 non-null   object 
 6   Neighborhood      2930 non-null   object 
 7   Bldg Type         2930 non-null   object 
 8   House Style       2930 non-null   object 
 9   Overall Qual      2930 non-null   int64  
 10  Overall Cond      2930 non-null   int64  
 11  Year Built        2930 non-null   int64  
 12  Year Remodeled    2930 non-null   int64  
 13  Exter Qual        2930 non-null   object 
 14  Exter Cond        2930 non-null   object 
 15  Bsmt Unf Sqft     2930 non-null   float64
 16  Total Bsmnt Sqft  2930 non-null   

Once you have loaded in your data frame and split it into the target (y) and features (X), you can split it into a training set and a test set.

In [None]:
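# Inputs to the split: x (features) and y (target)

# Step 2 - Train set, test set (manual split by row position)
# Note: 1500 + 530 = 2030, so the middle 900 of the 2930 rows are left out here
train_x = x.head(1500)  # first 1500 rows
train_y = y.head(1500)
test_x = x.tail(530)    # last 530 rows
test_y = y.tail(530)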




In [None]:
# Import the train test split function from sklearn
from sklearn.model_selection import train_test_split

# Train test split: hold out 20% of the rows for testing
# (if test_size is omitted, the default is 0.25)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

We are defining four new variables here.

From our original X dataframe, we get two new dataframes:

 - X_train is the data we will use to train the model and contains only the features.
 - X_test is the data we set aside for testing (evaluating) our model and contains only the features.

From our original y series, we get two new series:

 - y_train contains the target values corresponding to the X_train features.
 - y_test contains the target values we have set aside for testing, corresponding to the X_test features.
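To see the four pieces on something small enough to read at a glance, here is a minimal sketch on made-up data with 10 rows. The names `toy_X`, `toy_y`, `X_tr`, `X_te`, `y_tr`, and `y_te` are illustrative and separate from the notebook's own variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 feature columns, 1 target
toy_X = pd.DataFrame({"sqft": range(10), "beds": range(10, 20)})
toy_y = pd.Series(range(100, 110), name="price")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# Features stay two-dimensional (DataFrames); targets stay one-dimensional (Series)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
print(y_tr.shape, y_te.shape)  # (8,) (2,)
```

With 10 rows and a test size of 0.2, two rows are held out for testing and the remaining eight are used for training.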

In [None]:
# Showing X_train and y_train have the same PIDs
display(X_train.head(3))
display(y_train.head(3))
PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remodeled,Exter Qual,Exter Cond,Bsmt Unf Sqft,Total Bsmnt Sqft,Central Air,Living Area Sqft,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom,Kitchen,Total Rooms,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold
528221100,RL,105,15578,Pave,MISSING,AllPub,Gilbert,1Fam,2Story,6,5,2006,2006,Gd,TA,728.0,728.0,Y,1456.0,0.0,0.0,2,1,3,1,8,Attchd,2006.0,2.0,429.0,TA,TA,Y,MISSING,05-2006
906475200,RL,62,70761,Pave,MISSING,AllPub,ClearCr,1Fam,1Story,7,5,1975,1975,TA,TA,878.0,1533.0,Y,1533.0,1.0,0.0,2,0,2,1,5,Attchd,1975.0,2.0,576.0,TA,TA,Y,MISSING,12-2006
527212060,RL,98,12328,Pave,MISSING,AllPub,StoneBr,1Fam,2Story,8,5,2005,2005,Gd,TA,163.0,1149.0,Y,2541.0,1.0,0.0,3,1,4,1,10,BuiltIn,2005.0,3.0,729.0,TA,TA,Y,MISSING,07-2007


PID,SalePrice
528221100,172785.0
906475200,280000.0
527212060,349265.0
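Comparing a few rows by eye works, but the match can also be checked programmatically. The sketch below does so on made-up data with invented identifiers in the index; the same `index.equals` comparison could be applied to X_train and y_train.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with invented identifiers in the index
idx = pd.Index([11, 22, 33, 44, 55, 66, 77, 88], name="PID")
toy_X = pd.DataFrame({"Bedroom": [3, 4, 2, 2, 2, 3, 3, 4]}, index=idx)
toy_y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], index=idx, name="SalePrice")

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, random_state=42)

# The same rows, in the same order, go to the features and to the target
print(X_tr.index.equals(y_tr.index))
print(X_te.index.equals(y_te.index))

# No identifier appears in both the training set and the test set
print(X_tr.index.intersection(X_te.index).empty)
```

All three checks print True, because the features and the target are split with the same row selection.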


Note that random_state was set to 42. Since the split is a random process (some rows of data end up in the training set and some end up in the test set), assigning a random state ensures that we get the same split, and therefore reproducible results, every time the code runs. Any integer can be used here.
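The effect of the random state can be sketched on a plain array of row positions (a stand-in, not the Ames data). Calling the split twice with the same random_state gives the same test rows; the seed 7 is an arbitrary second choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(20)  # stand-in for row positions; not the Ames data

# Same random_state: the split is identical on every call
_, test_a = train_test_split(rows, test_size=0.2, random_state=42)
_, test_b = train_test_split(rows, test_size=0.2, random_state=42)
print(np.array_equal(test_a, test_b))

# A different random_state will usually select different rows
_, test_c = train_test_split(rows, test_size=0.2, random_state=7)
print(sorted(test_a), sorted(test_c))
```

The first print shows True. The second lets you compare which rows each seed held out.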

To verify the split, you can check the length of each:

In [None]:
# Number of rows in X_train
len(X_train)

2344

In [None]:
len(y_train)

2344

In [None]:
# Check what % of the "X" data is included
len(X_train)/len(x) * 100

80.0

The output above shows that of the original 2930 rows, there are 2344 rows in the training set (which is 80% of the rows).

In [None]:
# Number of rows in X_test
len(X_test)

586

In [None]:
len(y_test)

586

In [None]:
# Check what % of the "X" data is included
len(X_test)/len(x) * 100

20.0

The output above shows that of the original 2930 rows, there are 586 rows in the test set (which is 20% of the rows).

**The lengths of y_train and y_test should match the lengths of X_train and X_test, respectively.**
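The expected sizes can be worked out by hand before running the split. The sketch below assumes that, for a fractional test_size, scikit-learn rounds the test count up and assigns the remaining rows to the training set; the variable names are illustrative.

```python
import math

n_rows = 2930    # rows in the Ames DataFrame, from df.shape
test_size = 0.2  # fraction passed to train_test_split above

# Assumption: the test count is rounded up and training gets the rest
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)  # 2344 586
print(round(n_train / n_rows * 100, 1), round(n_test / n_rows * 100, 1))  # 80.0 20.0
```

The 2344 training rows agree with the len(X_train) output shown earlier, and the two counts add up to the original 2930 rows.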

# Summary
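
Performing a train test split is an essential step for developing any supervised learning model, because it sets aside data the model has not seen during training so that its predictions can be evaluated fairly.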