The purpose of this section is to make the final preparations before modeling.  My goal is to complete two steps:

-Split my data into testing and training datasets

-Create dummy or indicator features for categorical variables

I will not scale the data. I am using tree-based models, which split on feature thresholds, so rescaling a feature does not change which rows end up on each side of a split.
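The point about scaling can be checked on toy data. The sketch below is only an illustration: `best_split` is a hypothetical helper written for this note, and the odometer and price values are made up rather than taken from the cars data. It finds the single threshold split that minimizes squared error and shows that standardizing the feature leaves the resulting partition of rows unchanged.

```python
import numpy as np

def best_split(x, y):
    """Return the boolean left-side mask of the threshold split on x
    that minimizes the summed squared error of y (hypothetical helper)."""
    best_mask, best_sse = None, np.inf
    for t in np.unique(x)[:-1]:  # every value except the largest is a candidate
        mask = x <= t
        sse = ((y[mask] - y[mask].mean()) ** 2).sum() \
            + ((y[~mask] - y[~mask].mean()) ** 2).sum()
        if sse < best_sse:
            best_mask, best_sse = mask, sse
    return best_mask

# Made-up values, for illustration only
odometer = np.array([20_000., 45_000., 60_000., 110_000., 150_000.])
price = np.array([21_000., 18_500., 17_000., 8_000., 6_500.])

# Standardize the feature, then compare the two partitions
scaled = (odometer - odometer.mean()) / odometer.std()
same_partition = np.array_equal(best_split(odometer, price),
                                best_split(scaled, price))
print(same_partition)
```

Standardizing is a monotonic transformation, so the ordering of the rows (and therefore every candidate partition) is preserved; only the numeric value of the threshold changes.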

In [1]:
#Import anything that I might use
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

In [5]:
cars = pd.read_csv('cars.csv', index_col=0)
cars=cars.drop(columns='id')
cars.head()

Unnamed: 0,price,year,manufacturer,model,cylinders,odometer,transmission,drive,paint_color,state
1,8750,2013.0,hyundai,sonata,4 cylinders,90821.0,automatic,fwd,grey,MN
2,10900,2013.0,toyota,prius,4 cylinders,92800.0,automatic,fwd,blue,CT
6,7995,2010.0,chevrolet,equinox,4 cylinders,108124.0,automatic,4wd,grey,MN
14,10995,2008.0,chevrolet,tahoe,Unknown,143528.0,automatic,4wd,grey,MN
17,14995,2011.0,chevrolet,silverado 1500,8 cylinders,102462.0,automatic,4wd,blue,MN


The first thing I need to do is split my data into training and testing sets.  I do not want to create dummy variables beforehand because I want the training and testing data to be completely independent.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(cars.drop(columns='price'),cars.price,test_size=0.3, random_state=42)


In [8]:
X_train.shape, X_test.shape

((128459, 9), (55055, 9))

In [9]:
y_train.shape, y_test.shape

((128459,), (55055,))
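The shapes above are consistent with a 70/30 split. As a quick arithmetic check, the sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remainder goes to training; that rounding rule is my understanding of scikit-learn's behavior, not something shown in this notebook.

```python
import math

n_rows = 128459 + 55055            # total rows implied by the shapes above
n_test = math.ceil(0.3 * n_rows)   # test share, rounded up (assumed rule)
n_train = n_rows - n_test          # everything else goes to training

print(n_rows, n_train, n_test)
```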

The second thing I will do is create dummy variables for my categorical variables: manufacturer, cylinders, transmission, drive, paint_color, and state. I will also drop the first level of each variable to avoid the dummy variable trap.  I will do this separately for the X training set and the X testing set, starting with the training set.

In [14]:
train_dummy_manufacturer = pd.get_dummies(X_train['manufacturer'],drop_first = True)
train_dummy_manufacturer


Unnamed: 0,alfa-romeo,aston-martin,audi,bmw,buick,cadillac,chevrolet,chrysler,datsun,dodge,...,pontiac,porche,ram,rover,saturn,subaru,tesla,toyota,volkswagen,volvo
248704,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1786,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
65689,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
380733,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
410489,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
301860,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
273463,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
323570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
350491,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
train_dummy_cylinders=pd.get_dummies(X_train['cylinders'],drop_first = True)
train_dummy_cylinders

Unnamed: 0,12 cylinders,3 cylinders,4 cylinders,5 cylinders,6 cylinders,8 cylinders,Unknown,other
248704,0,0,0,0,1,0,0,0
1786,0,0,0,0,0,0,1,0
65689,0,0,0,0,1,0,0,0
380733,0,0,0,0,0,0,1,0
410489,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...
301860,0,0,0,0,0,0,1,0
273463,0,0,0,0,0,1,0,0
323570,0,0,0,0,1,0,0,0
350491,0,0,0,0,1,0,0,0


In [16]:
train_dummy_transmission=pd.get_dummies(X_train['transmission'],drop_first = True)
train_dummy_transmission

Unnamed: 0,automatic,manual,other
248704,1,0,0
1786,1,0,0
65689,1,0,0
380733,0,0,1
410489,1,0,0
...,...,...,...
301860,1,0,0
273463,1,0,0
323570,1,0,0
350491,1,0,0


In [17]:
train_dummy_drive=pd.get_dummies(X_train['drive'],drop_first = True)
train_dummy_drive

Unnamed: 0,Unknown,fwd,rwd
248704,0,0,0
1786,0,0,0
65689,0,0,0
380733,0,1,0
410489,0,1,0
...,...,...,...
301860,0,0,1
273463,0,0,0
323570,0,0,1
350491,0,1,0


In [18]:
train_dummy_color=pd.get_dummies(X_train['paint_color'],drop_first = True)
train_dummy_color

Unnamed: 0,black,blue,brown,custom,green,grey,orange,purple,red,silver,white,yellow
248704,1,0,0,0,0,0,0,0,0,0,0,0
1786,1,0,0,0,0,0,0,0,0,0,0,0
65689,0,0,0,0,0,0,0,0,0,0,0,0
380733,0,0,0,0,0,0,0,0,0,1,0,0
410489,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
301860,0,0,0,0,0,0,0,0,0,0,1,0
273463,0,0,0,0,0,0,0,0,0,0,1,0
323570,0,0,0,1,0,0,0,0,0,0,0,0
350491,0,0,0,0,0,1,0,0,0,0,0,0


In [19]:
train_dummy_state=pd.get_dummies(X_train['state'],drop_first = True)
train_dummy_state

Unnamed: 0,AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
248704,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1786,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
65689,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
380733,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
410489,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
301860,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
273463,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
323570,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
350491,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now it's time to merge all of these dummy variables into a single DataFrame. I will also drop the original categorical columns that the dummy columns replace, along with model, which I did not encode.

In [20]:
X_train_dummies = pd.concat([X_train, train_dummy_manufacturer, train_dummy_cylinders, train_dummy_transmission, train_dummy_drive, train_dummy_color, train_dummy_state],axis=1)
X_train_dummies

Unnamed: 0,year,manufacturer,model,cylinders,odometer,transmission,drive,paint_color,state,alfa-romeo,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
248704,2011.0,ford,f150 xlt 4x4,6 cylinders,116000.0,automatic,4wd,black,WV,0,...,0,0,0,0,0,0,0,0,1,0
1786,2017.0,chrysler,300 s awd gas sedan,Unknown,55802.0,automatic,4wd,black,WA,0,...,0,0,0,0,0,0,1,0,0,0
65689,2018.0,ford,f-150,6 cylinders,95997.0,automatic,4wd,Unknown,WA,0,...,0,0,0,0,0,0,1,0,0,0
380733,2013.0,volkswagen,beetle 2.5l hatchback,Unknown,59785.0,other,fwd,silver,NC,0,...,0,0,0,0,0,0,0,0,0,0
410489,2005.0,toyota,camry le,4 cylinders,54000.0,automatic,fwd,yellow,OH,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
301860,2017.0,mercedes-benz,e-class,Unknown,34760.0,automatic,rwd,white,CA,0,...,0,0,0,0,0,0,0,0,0,0
273463,2011.0,chevrolet,silverado 2500 crewcab lt 4x4,8 cylinders,115014.0,automatic,4wd,white,DE,0,...,0,0,0,0,0,0,0,0,0,0
323570,2019.0,ford,f-150,6 cylinders,42397.0,automatic,rwd,custom,FL,0,...,0,0,0,0,0,0,0,0,0,0
350491,2019.0,nissan,maxima sv sedan 4d,6 cylinders,26944.0,automatic,fwd,grey,PA,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
X_train_fin = X_train_dummies.drop(['manufacturer','model','cylinders','transmission','drive','paint_color','state'], axis=1)
X_train_fin.head()

Unnamed: 0,year,odometer,alfa-romeo,aston-martin,audi,bmw,buick,cadillac,chevrolet,chrysler,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
248704,2011.0,116000.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1786,2017.0,55802.0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
65689,2018.0,95997.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
380733,2013.0,59785.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
410489,2005.0,54000.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
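One thing worth checking at this point: the cylinders and drive dummies both contain a column named Unknown, and the cylinders and transmission dummies both contain a column named other, so the concatenated frame carries repeated column labels. The sketch below uses a tiny made-up frame (not the cars data) to show the collision and one possible way around it, passing `prefix` to `pd.get_dummies`. The prefix names are my own choice.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "cylinders": ["4 cylinders", "Unknown", "other"],
    "drive": ["4wd", "Unknown", "fwd"],
})

# Without a prefix, both encodings produce a column called "Unknown".
plain = pd.concat(
    [pd.get_dummies(toy["cylinders"], drop_first=True),
     pd.get_dummies(toy["drive"], drop_first=True)],
    axis=1,
)
has_duplicates = plain.columns.duplicated().any()

# With a prefix, every column name records which feature it came from.
prefixed = pd.concat(
    [pd.get_dummies(toy["cylinders"], prefix="cylinders", drop_first=True),
     pd.get_dummies(toy["drive"], prefix="drive", drop_first=True)],
    axis=1,
)

print(has_duplicates)
print(list(prefixed.columns))
```

Repeated labels are easy to miss because selecting such a label returns several columns at once, and feature importances later become ambiguous.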


I will now repeat the process with the testing set.

In [23]:
test_dummy_manufacturer = pd.get_dummies(X_test['manufacturer'],drop_first = True)
test_dummy_manufacturer

Unnamed: 0,alfa-romeo,aston-martin,audi,bmw,buick,cadillac,chevrolet,chrysler,datsun,dodge,...,pontiac,porche,ram,rover,saturn,subaru,tesla,toyota,volkswagen,volvo
345036,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
344082,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
217092,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
398453,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
192478,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
348166,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
365446,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
376491,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
296233,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
test_dummy_cylinders=pd.get_dummies(X_test['cylinders'],drop_first = True)
test_dummy_cylinders

Unnamed: 0,12 cylinders,3 cylinders,4 cylinders,5 cylinders,6 cylinders,8 cylinders,Unknown,other
345036,0,0,0,0,0,0,1,0
344082,0,0,0,0,0,0,1,0
217092,0,0,1,0,0,0,0,0
398453,0,0,0,0,0,0,1,0
192478,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...
348166,0,0,1,0,0,0,0,0
365446,0,0,0,0,0,0,1,0
376491,0,0,0,0,0,0,1,0
296233,0,0,0,0,0,0,1,0


In [25]:
test_dummy_transmission=pd.get_dummies(X_test['transmission'],drop_first = True)
test_dummy_transmission

Unnamed: 0,automatic,manual,other
345036,0,1,0
344082,1,0,0
217092,1,0,0
398453,1,0,0
192478,1,0,0
...,...,...,...
348166,1,0,0
365446,1,0,0
376491,1,0,0
296233,1,0,0


In [26]:
test_dummy_drive=pd.get_dummies(X_test['drive'],drop_first = True)
test_dummy_drive

Unnamed: 0,Unknown,fwd,rwd
345036,0,0,1
344082,0,0,0
217092,0,0,1
398453,0,0,1
192478,1,0,0
...,...,...,...
348166,0,1,0
365446,0,1,0
376491,0,1,0
296233,1,0,0


In [27]:
test_dummy_color=pd.get_dummies(X_test['paint_color'],drop_first = True)
test_dummy_color

Unnamed: 0,black,blue,brown,custom,green,grey,orange,purple,red,silver,white,yellow
345036,0,0,0,0,0,0,0,0,0,0,1,0
344082,1,0,0,0,0,0,0,0,0,0,0,0
217092,0,0,0,0,0,0,0,0,0,0,1,0
398453,0,0,0,0,0,0,0,0,0,0,1,0
192478,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
348166,1,0,0,0,0,0,0,0,0,0,0,0
365446,0,0,0,0,0,0,0,0,0,0,1,0
376491,0,1,0,0,0,0,0,0,0,0,0,0
296233,0,0,0,0,0,0,0,0,0,0,0,0


In [29]:
test_dummy_state=pd.get_dummies(X_test['state'],drop_first = True)
test_dummy_state

Unnamed: 0,AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
345036,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
344082,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
217092,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
398453,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
192478,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
348166,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
365446,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
376491,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
296233,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As with the training set, it's now time to merge all of these dummy variables into a single DataFrame. I will again drop the original categorical columns that the dummy columns replace, along with model.

In [30]:
X_test_dummies = pd.concat([X_test, test_dummy_manufacturer, test_dummy_cylinders, test_dummy_transmission, test_dummy_drive, test_dummy_color, test_dummy_state],axis=1)
X_test_dummies

Unnamed: 0,year,manufacturer,model,cylinders,odometer,transmission,drive,paint_color,state,alfa-romeo,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
345036,2018.0,chevrolet,camaro,Unknown,19255.0,manual,rwd,white,MO,0,...,0,0,0,0,0,0,0,0,0,0
344082,2009.0,infiniti,m35,Unknown,35396.0,automatic,4wd,black,CT,0,...,0,0,0,0,0,0,0,0,0,0
217092,2013.0,ford,transit connect,4 cylinders,20234.0,automatic,rwd,white,CT,0,...,0,0,0,0,0,0,0,0,0,0
398453,2016.0,chevrolet,silverado 3500hd,Unknown,102089.0,automatic,rwd,white,CA,0,...,0,0,0,0,0,0,0,0,0,0
192478,2012.0,gmc,acadia,Unknown,117085.0,automatic,Unknown,Unknown,MI,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
348166,2014.0,chevrolet,equinox,4 cylinders,123865.0,automatic,fwd,black,MO,0,...,0,0,0,0,0,0,0,0,0,0
365446,2013.0,ford,explorer,Unknown,124140.0,automatic,fwd,white,OK,0,...,0,0,0,0,0,0,0,0,0,0
376491,2018.0,hyundai,elantra,Unknown,47932.0,automatic,fwd,blue,FL,0,...,0,0,0,0,0,0,0,0,0,0
296233,2019.0,jeep,wrangler unlimited,Unknown,23328.0,automatic,Unknown,Unknown,NJ,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
X_test_fin = X_test_dummies.drop(['manufacturer','model','cylinders','transmission','drive','paint_color','state'], axis=1)
X_test_fin.head()

Unnamed: 0,year,odometer,alfa-romeo,aston-martin,audi,bmw,buick,cadillac,chevrolet,chrysler,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
345036,2018.0,19255.0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
344082,2009.0,35396.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
217092,2013.0,20234.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
398453,2016.0,102089.0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
192478,2012.0,117085.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
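Because the two sets are encoded independently, a category that appears in only one of them would give the training and testing frames different columns. The columns displayed above look the same, but the truncated output does not confirm it for every column. A minimal sketch on toy data of one way to guard against this, reindexing the encoded test frame to the training columns (the state values and the `state` prefix here are illustrative assumptions):

```python
import pandas as pd

# Made-up rows, for illustration only
train = pd.DataFrame({"state": ["MN", "CT", "WA"]})
test = pd.DataFrame({"state": ["MN", "TX"]})   # "TX" never appears in train

train_enc = pd.get_dummies(train["state"], prefix="state", dtype=int)
test_enc = pd.get_dummies(test["state"], prefix="state", dtype=int)

# Keep exactly the training columns, in the training order: categories
# missing from test become all-zero columns, unseen ones are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```

A quick check such as `list(X_train_fin.columns) == list(X_test_fin.columns)` on the real frames would show whether any alignment is needed here.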


The data has been split, and dummy variables have been created for the training and testing data.  Everything should be ready for modeling in the next section of the capstone project.  I will save each dataset as a CSV file so it is easy to load later.

In [32]:
X_train_fin.to_csv('cars_x_train.csv')
X_test_fin.to_csv('cars_x_test.csv')
y_train.to_csv('cars_y_train.csv')
y_test.to_csv('cars_y_test.csv')
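When these files are loaded in the next notebook, the saved index comes back as an ordinary first column unless `index_col=0` is passed, as was done for cars.csv at the top of this section. A minimal round-trip sketch, using a temporary directory and a toy frame with made-up prices rather than the real files:

```python
import os
import tempfile

import pandas as pd

# Made-up values, for illustration only
X = pd.DataFrame({"year": [2011.0, 2017.0]}, index=[248704, 1786])
y = pd.Series([9500, 21000], index=[248704, 1786], name="price")

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "x.csv")
    y_path = os.path.join(tmp, "y.csv")
    X.to_csv(x_path)
    y.to_csv(y_path)

    # index_col=0 restores the original row labels as the index
    X_back = pd.read_csv(x_path, index_col=0)
    # The target comes back as a one-column DataFrame; selecting the
    # column turns it into a Series again.
    y_back = pd.read_csv(y_path, index_col=0)["price"]

print(X_back.index.tolist(), y_back.tolist())
```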