# This notebook prepares and pre-processes the Used Car Price data 
(https://www.kaggle.com/CooperUnion/cardataset) for various prediction models. It follows the initial data exploration in CarPricePredictionAnalysis.ipynb.  
The numerical variables are scaled with StandardScaler, and an imputation strategy is set up to replace 0 values with the mean.
A StratifiedShuffleSplit is done on the age of the car (Curr Year - Year of Car): an Age category (Age / 5, rounded up and capped at 5) is created, and the same distribution of Age categories is maintained in the Train and Test data.
The categorical variables (Engine Cylinders, Engine Fuel Type, Transmission Type, Vehicle Size, Vehicle Style, Driven Wheels, Make) are one-hot encoded and added to the feature vector. The numerical variables considered are Age, City mpg and Engine HP, with powers of 2, 3, 4 (based on the initial data analysis). Two sets of X features are 
produced, with Make and without Make, so that the results can be compared.

In [29]:
#Import all necessary libraries
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures

Read the pickle file prepared by stratifying the Car Sales Data based on Make and Price. This is necessary because
cars belong to different price segments: for the same features, the price range differs between makes, so fitting one
model across all makes and models is not viable. The stratification details can be found in the 
Data exploration notebook, which precedes this one and writes the data into .pkl files by car segment/price category.

In [30]:
df_ordinary=pd.read_pickle('C:/users/hackuser1/Hackathon18/ordinarydfUSJap.pkl')

In [31]:
df_ordinary.head()

Unnamed: 0,Make,Model,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Vehicle Size,Vehicle Style,MSRP,Age,log_MSRP,log_city mpg,log_Engine HP
3,Chrysler,200,flex-fuel (unleaded/E85),184.0,4.0,AUTOMATIC,front wheel drive,4.0,Midsize,Sedan,25170,2,10.133448,3.178054,5.220356
4,Chrysler,200,flex-fuel (unleaded/E85),184.0,4.0,AUTOMATIC,front wheel drive,4.0,Midsize,Sedan,23950,2,10.083765,3.178054,5.220356
5,Chrysler,200,flex-fuel (unleaded/E85),295.0,6.0,AUTOMATIC,all wheel drive,4.0,Midsize,Sedan,29370,2,10.287763,2.944439,5.690359
6,Chrysler,200,flex-fuel (unleaded/E85),184.0,4.0,AUTOMATIC,front wheel drive,4.0,Midsize,Sedan,21995,2,9.998616,3.178054,5.220356
7,Chrysler,200,flex-fuel (unleaded/E85),184.0,4.0,AUTOMATIC,front wheel drive,4.0,Midsize,Sedan,26625,2,10.189643,3.178054,5.220356


In [32]:
df_ordinary["Make"].value_counts()

Chevrolet    1123
Ford          879
Nissan        556
Honda         449
Mazda         423
Suzuki        351
Hyundai       303
Subaru        256
Chrysler      187
Pontiac       186
Name: Make, dtype: int64

In [33]:
print(len(df_ordinary))

4713


We take the ordinary segment, as it has the most data, and fit our model on it. Once done, the same model will be applied
to the other segments. (This is for the MVP; in practice the modeling exercise needs to be repeated for each segment, as the relationships 
may be different.)

In [34]:
df_ordinary["Number of Doors"] = df_ordinary["Number of Doors"].replace("?",0)
df_ordinary["Number of Doors"] = df_ordinary["Number of Doors"].astype('float32')
df_ordinary["MSRP"] = df_ordinary["MSRP"].replace("?",0)
df_ordinary["MSRP"] = df_ordinary["MSRP"].astype("float32")
df_ordinary["log_Engine HP"] = df_ordinary["log_Engine HP"].astype("float32")

We check the distribution of Car Sales by Age of Car, create Age-cat (Age / 5, rounded up and capped at 5), and check the distribution of the car data by Age-cat. We plan to use stratified sampling to make sure both Test and Train data represent the same distribution of cars by Age of Car.

In [35]:
df_ordinary["Age"].value_counts()
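#create a field Age-cat to divide the data into Age categories 0 to 5, based on the Age of the car
df_ordinary["Age-cat"] = np.ceil(df_ordinary["Age"] / 5)
#cap the category at 5; assign the result back instead of relying on inplace=True on a column selection
df_ordinary["Age-cat"] = df_ordinary["Age-cat"].where(df_ordinary["Age-cat"] < 5, 5.0)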
#check distribution of Age Cat in the original data
df_ordinary["Age-cat"].value_counts() / len(df_ordinary)

1.0    0.424570
2.0    0.154891
0.0    0.144494
3.0    0.119669
4.0    0.081052
5.0    0.075324
Name: Age-cat, dtype: float64
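
The binning rule can be checked on a handful of toy ages. This is only a minimal sketch: the `ages` values are made up for illustration, and `clip` is used in place of `where`, which gives the same cap at 5.

```python
import numpy as np
import pandas as pd

# Hypothetical ages in years; not taken from the data set
ages = pd.Series([0, 1, 5, 6, 10, 14, 21, 27])

# ceil(Age / 5) puts age 0 in bin 0, ages 1-5 in bin 1, ages 6-10 in bin 2, ...
age_cat = np.ceil(ages / 5)

# Merge every bin above 5 into bin 5, so that the few very old cars share one stratum
age_cat = age_cat.clip(upper=5.0)

print(age_cat.tolist())
```

Age 0 forms its own category, which is why the distribution above lists six values (0.0 to 5.0).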

We treat Engine Cylinders, Engine Fuel Type, Transmission Type, Driven_Wheels, Vehicle Size, Make and Vehicle Style as 
categorical variables, based on our Data exploration. We fit a LabelBinarizer for each variable on the entire 
set. The actual encoding is done later by applying these fitted encoders to the Train and Test samples.

In [36]:
car_eng_cyl = df_ordinary["Engine Cylinders"]
encoder_cyl = LabelBinarizer()
encoder_cyl.fit(car_eng_cyl)
print(encoder_cyl.classes_)

car_eng_fuel_type = df_ordinary["Engine Fuel Type"]
encoder_fuel = LabelBinarizer()
encoder_fuel.fit(car_eng_fuel_type)
print(encoder_fuel.classes_)

car_trans_type = df_ordinary["Transmission Type"]
encoder_trans = LabelBinarizer()
encoder_trans.fit(car_trans_type)
print(encoder_trans.classes_)

car_driven_wheels = df_ordinary["Driven_Wheels"]
encoder_wheels = LabelBinarizer()
encoder_wheels.fit(car_driven_wheels)
print(encoder_wheels.classes_)

car_vehicle_size = df_ordinary["Vehicle Size"]
encoder_size = LabelBinarizer()
encoder_size.fit(car_vehicle_size)
print(encoder_size.classes_)

car_make =df_ordinary["Make"]
encoder_make = LabelBinarizer()
encoder_make.fit(car_make)
print(encoder_make.classes_)


car_style =df_ordinary["Vehicle Style"]
encoder_style = LabelBinarizer()
encoder_style.fit(car_style)
print(encoder_style.classes_)

[ 0.  3.  4.  5.  6.  8.]
['diesel' 'electric' 'flex-fuel (unleaded/E85)'
 'flex-fuel (unleaded/natural gas)' 'natural gas'
 'premium unleaded (recommended)' 'premium unleaded (required)'
 'regular unleaded']
['AUTOMATED_MANUAL' 'AUTOMATIC' 'DIRECT_DRIVE' 'MANUAL' 'UNKNOWN']
['all wheel drive' 'four wheel drive' 'front wheel drive'
 'rear wheel drive']
['Compact' 'Large' 'Midsize']
['Chevrolet' 'Chrysler' 'Ford' 'Honda' 'Hyundai' 'Mazda' 'Nissan' 'Pontiac'
 'Subaru' 'Suzuki']
['2dr Hatchback' '2dr SUV' '4dr Hatchback' '4dr SUV' 'Cargo Minivan'
 'Cargo Van' 'Convertible' 'Convertible SUV' 'Coupe' 'Crew Cab Pickup'
 'Extended Cab Pickup' 'Passenger Minivan' 'Passenger Van'
 'Regular Cab Pickup' 'Sedan' 'Wagon']
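
The fit-once, transform-later pattern can be illustrated on a toy column. This is a sketch with made-up values (`all_sizes` and `subset` are hypothetical), not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical category values for illustration
all_sizes = np.array(["Compact", "Large", "Midsize", "Compact"])
subset = np.array(["Midsize", "Compact"])

encoder = LabelBinarizer()
encoder.fit(all_sizes)               # learn the full set of classes once

one_hot = encoder.transform(subset)  # one column per class, in classes_ order
print(encoder.classes_)
print(one_hot)
```

Fitting on the full sample guarantees that Train and Test get the same columns in the same order. One caveat: for a variable with exactly two classes, LabelBinarizer returns a single 0/1 column rather than two.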


In [9]:
# save encoders to file; the with-block closes each file handle after writing
encoders = {"cyl": encoder_cyl, "fuel": encoder_fuel, "trans": encoder_trans, "wheels": encoder_wheels,
            "size": encoder_size, "make": encoder_make, "style": encoder_style}
for name, encoder in encoders.items():
    with open("C:/users/hackuser1/encoder_" + name + ".pickle.dat", "wb") as f:
        pickle.dump(encoder, f)

In [37]:
split = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)

for train_index, test_index in split.split(df_ordinary,df_ordinary["Age-cat"]):
    strat_train_set = df_ordinary.iloc[train_index]
    strat_test_set = df_ordinary.iloc[test_index]

In [38]:
#check distribution of Age Cat in the train data
strat_train_set["Age-cat"].value_counts() / len(strat_train_set)

1.0    0.424668
2.0    0.154907
0.0    0.144562
3.0    0.119629
4.0    0.080902
5.0    0.075332
Name: Age-cat, dtype: float64

In [39]:
#check distribution of Age Cat in the test data
strat_test_set["Age-cat"].value_counts() / len(strat_test_set)

1.0    0.424178
2.0    0.154825
0.0    0.144221
3.0    0.119830
4.0    0.081654
5.0    0.075292
Name: Age-cat, dtype: float64
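
The effect of the stratified split can be reproduced on synthetic strata. The sketch below uses made-up class shares (60/30/10) rather than the Age-cat column, and checks that both parts keep those shares.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical strata: 60% class 0, 30% class 1, 10% class 2
strata = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.arange(len(strata)).reshape(-1, 1)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(X, strata))

# Class shares in each part should match the shares in the full sample
train_share = np.bincount(strata[train_idx]) / len(train_idx)
test_share = np.bincount(strata[test_idx]) / len(test_idx)
print(train_share, test_share)
```
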

Create the X and Y variables based on the feature analysis done in the Exploration notebook. The same operations 
are repeated for the Train and Test data.

In [40]:
# Drop the log_MSRP label from the training features. MSRP is still present at this point
# (see the head() output below); it is dropped later together with the other unused columns.
carSales_X = strat_train_set.drop("log_MSRP", axis=1)
carSales_Y = strat_train_set["log_MSRP"].copy() # use log MSRP as labels for training set, based on data Exploration
carSales_Y_orig = strat_train_set["MSRP"].copy() # use MSRP as labels also for training set, to compare fit based on Log and original Price

# Same operations for the test set
carSales_test_X = strat_test_set.drop("log_MSRP", axis=1)
carSales_test_Y = strat_test_set["log_MSRP"].copy()# use log MSRP as labels for test set, based on data Exploration
carSales_test_Y_orig = strat_test_set["MSRP"].copy()

In [41]:
carSales_Y = carSales_Y.values.reshape(carSales_Y.shape[0],1)
carSales_test_Y = carSales_test_Y.values.reshape(carSales_test_Y.shape[0],1)
carSales_Y_orig = carSales_Y_orig.values.reshape(carSales_Y_orig.shape[0],1)
carSales_test_Y_orig = carSales_test_Y_orig.values.reshape(carSales_test_Y_orig.shape[0],1)
print(carSales_X.shape)
print(carSales_Y.shape)
print(carSales_Y_orig.shape)
print(carSales_test_X.shape)
print(carSales_test_Y.shape)
print(carSales_test_Y_orig.shape)

(3770, 15)
(3770, 1)
(3770, 1)
(943, 15)
(943, 1)
(943, 1)
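
Because the main label is log_MSRP, predictions have to be mapped back to prices before they can be compared with MSRP. The values in the head() output appear consistent with a log(1 + x) transform (for example, log1p(25170) ≈ 10.133448), so the sketch below assumes `np.log1p` and inverts it with `np.expm1`. The exact transform is defined in the exploration notebook, and `log_pred` is a hypothetical stand-in for model output.

```python
import numpy as np

# MSRP values taken from the first rows of the head() output
msrp = np.array([25170.0, 23950.0, 29370.0])

# Assumed transform: log(1 + MSRP)
log_msrp = np.log1p(msrp)
print(log_msrp.round(6))

# Hypothetical model output in log space (true value plus a small error)
log_pred = log_msrp + 0.05

# Invert the assumed transform to get prices back
price_pred = np.expm1(log_pred)
print(price_pred.round(2))
```

An additive error in log space becomes a roughly proportional error in price, which is one reason to keep both carSales_Y and carSales_Y_orig for comparison.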


Now we need to remove unnecessary columns, based on the
correlation analysis done in the Exploration notebook, and encode the categorical variables. We also 
need to standardize the numerical features before applying regression models.

In [42]:
carSales_X.head()

Unnamed: 0,Make,Model,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Vehicle Size,Vehicle Style,MSRP,Age,log_city mpg,log_Engine HP,Age-cat
4464,Subaru,Outback,regular unleaded,175.0,4.0,AUTOMATIC,all wheel drive,4.0,Midsize,4dr SUV,27395.0,1,3.258097,5.170484,1.0
2379,Suzuki,Esteem,regular unleaded,122.0,4.0,AUTOMATIC,front wheel drive,4.0,Compact,Sedan,14299.0,15,3.178054,4.812184,3.0
3336,Suzuki,Grand Vitara,regular unleaded,166.0,4.0,AUTOMATIC,four wheel drive,4.0,Compact,4dr SUV,24949.0,6,2.995732,5.117994,2.0
1342,Honda,Civic,regular unleaded,158.0,4.0,MANUAL,front wheel drive,4.0,Midsize,Sedan,18640.0,1,3.332205,5.068904,1.0
1593,Chevrolet,Corvette,premium unleaded (required),650.0,8.0,MANUAL,rear wheel drive,2.0,Compact,Coupe,82270.0,2,2.772589,6.478509,1.0


We drop all categorical columns (drop returns a new DataFrame, so carSales_X is left intact) and retain only the numerical features of significance.

In [43]:
# Columns that are not numerical features: the categorical variables are one-hot encoded separately,
# the remaining ones (Model, Number of Doors, Engine HP, Age-cat, MSRP) are not used as features in this notebook
cols_to_drop = ["Make", "Model", "Engine Cylinders", "Engine Fuel Type", "Transmission Type",
                "Driven_Wheels", "Number of Doors", "Vehicle Style", "Engine HP", "Vehicle Size",
                "Age-cat", "MSRP"]
carSales_X_num = carSales_X.drop(cols_to_drop, axis=1)


In [44]:
carSales_X_num.head()

Unnamed: 0,Age,log_city mpg,log_Engine HP
4464,1,3.258097,5.170484
2379,15,3.178054,4.812184
3336,6,2.995732,5.117994
1342,1,3.332205,5.068904
1593,2,2.772589,6.478509


In [45]:
#Apply the same transformation on Test data
carSales_test_X_num = carSales_test_X.drop(
    ["Make", "Model", "Engine Cylinders", "Engine Fuel Type", "Transmission Type",
     "Driven_Wheels", "Number of Doors", "Vehicle Style", "Vehicle Size", "Engine HP",
     "Age-cat", "MSRP"], axis=1)


In [46]:
carSales_test_X_num.head()

Unnamed: 0,Age,log_city mpg,log_Engine HP
6013,6,3.044522,5.808143
4715,16,3.135494,4.94876
2385,15,3.178054,4.812184
2320,8,3.401197,5.181784
2216,0,3.367296,4.997212


In [47]:
carSales_X_num["log_Engine HP"] = carSales_X_num["log_Engine HP"].astype("float32")
carSales_X_num["Age"] = carSales_X_num["Age"].astype("float32")
carSales_X_num.replace('null',np.NaN,inplace=True)
carSales_X_num = pd.DataFrame(carSales_X_num)
carSales_X_num = carSales_X_num.replace('?',0)
carSales_X_num = carSales_X_num.replace('NaN',0)
carSales_X_num = carSales_X_num.replace(np.NaN,0)

carSales_test_X_num["log_Engine HP"] = carSales_test_X_num["log_Engine HP"].astype("float32")
carSales_test_X_num["Age"] = carSales_test_X_num["Age"].astype("float32")
carSales_test_X_num.replace('null',np.NaN,inplace=True)
carSales_test_X_num = pd.DataFrame(carSales_test_X_num)
carSales_test_X_num = carSales_test_X_num.replace('?',0)
carSales_test_X_num = carSales_test_X_num.replace('NaN',0)
carSales_test_X_num = carSales_test_X_num.replace(np.NaN,0)

In [48]:
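#columns that contain null values (expected: none)
m=carSales_X_num.isnull().any()
print(m[m])
#float64 columns in which every value is finite; all() is needed, since any() would accept a column with a single finite value
m=np.isfinite(carSales_X_num.select_dtypes(include=['float64'])).all()
print(m[m])
m=carSales_test_X_num.isnull().any()
print(m[m])
m=np.isfinite(carSales_test_X_num.select_dtypes(include=['float64'])).all()
print(m[m])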

Series([], dtype: bool)
Age              True
log_city mpg     True
log_Engine HP    True
dtype: bool
Series([], dtype: bool)
Age              True
log_city mpg     True
log_Engine HP    True
dtype: bool


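Zero values are treated as missing and are meant to be replaced by the column mean. Note that the cell below only fits the imputer; its transform is never applied to the data that is passed on to the scaler.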

In [49]:
imputer = Imputer(missing_values=0,strategy="mean")
# fit on the training data only, so that no test-set statistics are used
imputer.fit(carSales_X_num)

Imputer(axis=0, copy=True, missing_values=0, strategy='mean', verbose=0)
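
For newer scikit-learn versions, where `Imputer` is no longer available, a comparable imputation could be sketched with `SimpleImputer`. The arrays below are made-up stand-ins for a horsepower-like column, not the notebook's data; the imputer learns the mean on the training rows and reuses it for the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data where 0 marks a missing value
train = np.array([[100.0], [0.0], [200.0], [300.0]])
test = np.array([[0.0], [150.0]])

imputer = SimpleImputer(missing_values=0, strategy="mean")
train_imputed = imputer.fit_transform(train)  # mean learned from non-zero training rows
test_imputed = imputer.transform(test)        # test rows reuse the training mean

print(train_imputed.ravel(), test_imputed.ravel())
```

Treating 0 as missing affects every column passed to the imputer. A column such as Age, where 0 is a valid value (Age-cat 0.0 above), would have to be left out of this step.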

In [50]:
#Standardize the data using sklearn StandardScaler
scaler = StandardScaler()
train_X = scaler.fit_transform(carSales_X_num)
test_X = scaler.transform(carSales_test_X_num)
print(train_X.shape)

(3770, 3)
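
The introduction mentions powers 2, 3 and 4 of the numerical variables, while the cells here scale only the three base columns. One way such terms could be appended before scaling is sketched below. `add_powers` is a hypothetical helper and the arrays are toy values; it adds pure powers only, whereas `PolynomialFeatures` would also generate cross terms.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def add_powers(X, degrees=(2, 3, 4)):
    """Append element-wise powers of every column (no interaction terms)."""
    return np.hstack([X] + [X ** d for d in degrees])

# Toy stand-ins for two numerical columns
X_train = np.array([[1.0, 3.2], [15.0, 3.1], [6.0, 3.0]])
X_test = np.array([[2.0, 2.8]])

scaler = StandardScaler()
train_poly = scaler.fit_transform(add_powers(X_train))  # fit on training data only
test_poly = scaler.transform(add_powers(X_test))        # reuse training mean and std

print(train_poly.shape, test_poly.shape)
```
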


In [22]:
with open("C:/users/hackuser1/scaler.pickle.dat", "wb") as f:
    pickle.dump(scaler, f)

Now add the categorical variables in one-hot representation, using the encoders already fit on the entire sample.

In [51]:
car_eng_cyl = carSales_X["Engine Cylinders"]
car_eng_1hot = encoder_cyl.transform(car_eng_cyl)
print(car_eng_1hot.shape)

train_X = np.concatenate((train_X,car_eng_1hot),axis=1)

car_eng_fuel_type = carSales_X["Engine Fuel Type"]
car_fuel_1hot = encoder_fuel.transform(car_eng_fuel_type)
print(car_fuel_1hot.shape)

train_X = np.concatenate((train_X,car_fuel_1hot),axis=1)

car_trans_type = carSales_X["Transmission Type"]
car_trans_1hot = encoder_trans.transform(car_trans_type)
print(car_trans_1hot.shape)

train_X = np.concatenate((train_X,car_trans_1hot),axis=1)

car_driven_wheels = carSales_X["Driven_Wheels"]
car_drive_1hot = encoder_wheels.transform(car_driven_wheels)
print(car_drive_1hot.shape)

train_X = np.concatenate((train_X,car_drive_1hot),axis=1)

car_vehicle_size = carSales_X["Vehicle Size"]
car_size_1hot = encoder_size.transform(car_vehicle_size)
print(car_size_1hot.shape)

train_X = np.concatenate((train_X,car_size_1hot),axis=1)

car_vehicle_style = carSales_X["Vehicle Style"]
car_style_1hot = encoder_style.transform(car_vehicle_style)
print(car_style_1hot.shape)

train_X = np.concatenate((train_X,car_style_1hot),axis=1)

car_make = carSales_X["Make"]
car_make_1hot = encoder_make.transform(car_make)
print(car_make_1hot.shape)

train_X_make = np.concatenate((train_X,car_make_1hot),axis=1)

#We prepare two sets of train X features, with Make and without Make and compare the performance of both
print(train_X.shape)
print(train_X_make.shape)


(3770, 6)
(3770, 8)
(3770, 5)
(3770, 4)
(3770, 3)
(3770, 16)
(3770, 10)
(3770, 45)
(3770, 55)


In [52]:
car_eng_cyl = carSales_test_X["Engine Cylinders"]
car_eng_1hot = encoder_cyl.transform(car_eng_cyl)
print(car_eng_1hot.shape)

test_X = np.concatenate((test_X,car_eng_1hot),axis=1)

car_eng_fuel_type = carSales_test_X["Engine Fuel Type"]
car_fuel_1hot = encoder_fuel.transform(car_eng_fuel_type)
print(car_fuel_1hot.shape)

test_X = np.concatenate((test_X,car_fuel_1hot),axis=1)

car_trans_type_test = carSales_test_X["Transmission Type"]
car_trans_1hot_test = encoder_trans.transform(car_trans_type_test)
print(car_trans_1hot_test.shape)

test_X = np.concatenate((test_X,car_trans_1hot_test),axis=1)

car_driven_wheels_test = carSales_test_X["Driven_Wheels"]
car_drive_1hot_test = encoder_wheels.transform(car_driven_wheels_test)
print(car_drive_1hot_test.shape)

test_X = np.concatenate((test_X,car_drive_1hot_test),axis=1)

car_vehicle_size_test = carSales_test_X["Vehicle Size"]
car_size_1hot_test = encoder_size.transform(car_vehicle_size_test)
print(car_size_1hot_test.shape)

test_X = np.concatenate((test_X,car_size_1hot_test),axis=1)

car_vehicle_style_test = carSales_test_X["Vehicle Style"]
car_style_1hot_test = encoder_style.transform(car_vehicle_style_test)
print(car_style_1hot_test.shape)

test_X = np.concatenate((test_X,car_style_1hot_test),axis=1)

car_make_test = carSales_test_X["Make"]
car_make_1hot_test = encoder_make.transform(car_make_test)
print(car_make_1hot_test.shape)

test_X_make = np.concatenate((test_X,car_make_1hot_test),axis=1)

print(test_X.shape)
print(test_X_make.shape)

(943, 6)
(943, 8)
(943, 5)
(943, 4)
(943, 3)
(943, 16)
(943, 10)
(943, 45)
(943, 55)
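
The scale-then-concatenate steps for Train and Test could also be expressed with a single `ColumnTransformer`, which keeps the two feature sets (with and without Make) in sync. This is an alternative sketch, not the approach used in this notebook: the frames are made up, `build` is a hypothetical helper, and the categories are learned from the training frame only instead of the entire sample.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature frames using column names from this notebook
train_df = pd.DataFrame({
    "Age": [1.0, 15.0, 6.0, 2.0],
    "Vehicle Size": ["Midsize", "Compact", "Compact", "Large"],
    "Make": ["Subaru", "Suzuki", "Suzuki", "Honda"],
})
test_df = pd.DataFrame({
    "Age": [8.0], "Vehicle Size": ["Compact"], "Make": ["Honda"],
})

def build(cat_cols):
    """Scale Age and one-hot encode the given categorical columns."""
    return ColumnTransformer(
        [("num", StandardScaler(), ["Age"]),
         ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
        sparse_threshold=0,  # always return a dense array
    )

pre = build(["Vehicle Size"])               # feature set without Make
pre_make = build(["Vehicle Size", "Make"])  # feature set with Make

train_X = pre.fit_transform(train_df)
test_X = pre.transform(test_df)
train_X_make = pre_make.fit_transform(train_df)
print(train_X.shape, test_X.shape, train_X_make.shape)
```

With `handle_unknown="ignore"`, a category that appears only in the test data is encoded as all zeros instead of raising an error.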


In [53]:
#for each label set: print columns with null values, then float64 columns in which every value is finite
train_Y = pd.DataFrame(carSales_Y)
m=train_Y.isnull().any()
print(m[m])
m=np.isfinite(train_Y.select_dtypes(include=['float64'])).all()
print(m[m])

train_Y_orig = pd.DataFrame(carSales_Y_orig)
m=train_Y_orig.isnull().any()
print(m[m])
m=np.isfinite(train_Y_orig.select_dtypes(include=['float64'])).all()
print(m[m])

test_Y = pd.DataFrame(carSales_test_Y)
m=test_Y.isnull().any()
print(m[m])
m=np.isfinite(test_Y.select_dtypes(include=['float64'])).all()
print(m[m])

test_Y_orig = pd.DataFrame(carSales_test_Y_orig)
m=test_Y_orig.isnull().any()
print(m[m])
m=np.isfinite(test_Y_orig.select_dtypes(include=['float64'])).all()
print(m[m])

Series([], dtype: bool)
0    True
dtype: bool
Series([], dtype: bool)
Series([], dtype: bool)
Series([], dtype: bool)
0    True
dtype: bool
Series([], dtype: bool)
Series([], dtype: bool)


We now save the pre-processed data, so that modeling can start directly from it
at any later point in time.

In [54]:
train_X_ordinary='C:/users/hackuser1/train_X_ordUSJap.pkl'
test_X_ordinary='C:/users/hackuser1/test_X_ordUSJap.pkl'
train_Y_ordinary='C:/users/hackuser1/train_Y_ordUSJap.pkl'
test_Y_ordinary='C:/users/hackuser1/test_Y_ordUSJap.pkl'
train_Y_ordinary_orig='C:/users/hackuser1/train_Y_ord_origUSJap.pkl'
test_Y_ordinary_orig='C:/users/hackuser1/test_Y_ord_origUSJap.pkl'

with open(train_X_ordinary, "wb") as f:
    pickle.dump(train_X,f)
with open(test_X_ordinary, "wb") as f:
    pickle.dump(test_X,f)
with open(train_Y_ordinary, "wb") as f:
    pickle.dump(train_Y,f)
with open(test_Y_ordinary, "wb") as f:
    pickle.dump(test_Y,f)
with open(train_Y_ordinary_orig, "wb") as f:
    pickle.dump(train_Y_orig,f)
with open(test_Y_ordinary_orig, "wb") as f:
    pickle.dump(test_Y_orig,f)

train_X_ord_make='C:/users/hackuser1/train_X_ord_makeUSJap.pkl'
test_X_ord_make='C:/users/hackuser1/test_X_ord_makeUSJap.pkl'

with open(train_X_ord_make, "wb") as f:
    pickle.dump(train_X_make,f)
with open(test_X_ord_make, "wb") as f:
    pickle.dump(test_X_make,f)
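
A modeling notebook can read these files back with `pickle.load`. The round trip is sketched below on a small stand-in array written to a temporary directory; `train_X_demo` and the file name are hypothetical, and the real paths are the ones defined above.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for one of the pre-processed arrays
train_X_demo = np.arange(6, dtype=float).reshape(3, 2)

# Hypothetical location in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "train_X_demo.pkl")

with open(path, "wb") as f:
    pickle.dump(train_X_demo, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(np.array_equal(train_X_demo, restored))
```

Pickle files should only be loaded from a trusted source, and fitted scikit-learn objects such as the saved encoders and scaler are best loaded with the same library version that wrote them.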
