This notebook prepares and pre-processes data for various prediction models from the Used Car Price data 
at: https://www.kaggle.com/CooperUnion/cardataset, after the initial data exploration, as given in ..... 
The numerical variables are scaled with StandardScaler, and a mean-imputation strategy is set up to replace 0 values.
A StratifiedShuffleSplit is done based on the age of the car (current year - year of car): an age category (Age / 5, rounded up) is created and 
the cars are placed in the corresponding age buckets, so that the same distribution is maintained in the train and test data.
The categorical variables (Transmission Type, Driven_Wheels and Vehicle Size) are one-hot encoded and added to the feature vector. The numerical variables considered are: Age, city mpg and Engine Cylinders. Two sets of X features are 
produced, with Make and without Make, so that the results can be compared.


In [2]:
#Import all necessary libraries
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from pandas.plotting import scatter_matrix

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import SGDRegressor
from keras.models import Sequential
from keras.layers import Dense   
from keras import optimizers

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

Using TensorFlow backend.


Read the pickle files prepared by stratifying the car sales data based on Make and price. This is necessary since
cars belong to different price segments: the price range for the same features differs between makes, so fitting a
single model across all makes and models is not viable. The stratification details can be found in the 
data exploration notebook, which precedes this one and writes the data to .pkl files, one per car segment/price category.

In [3]:
df_ordinary=pd.read_pickle('C:/users/hackuser1/Hackathon18/ord_transformed.pkl')
df_deluxe=pd.read_pickle('C:/users/hackuser1/Hackathon18/del.pkl')
df_supdel=pd.read_pickle('C:/users/hackuser1/Hackathon18/supdel.pkl')
df_luxury=pd.read_pickle('C:/users/hackuser1/Hackathon18/luxury.pkl')
df_suplux=pd.read_pickle('C:/users/hackuser1/Hackathon18/suplux.pkl')

In [4]:
df_ordinary.head()


Unnamed: 0,Engine Cylinders,Number of Doors,MSRP,Age,log_MSRP,log_city mpg,diesel,electric,flex-fuel (unleaded/E85),flex-fuel (unleaded/natural gas),...,Mazda,Mitsubishi,Nissan,Oldsmobile,Plymouth,Pontiac,Scion,Subaru,Suzuki,Volkswagen
0,4.0,2.0,27495,0,10.221796,3.295837,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4.0,2.0,24995,0,10.126471,3.295837,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4.0,2.0,28195,0,10.246935,3.295837,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4.0,4.0,25170,2,10.133448,3.178054,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,4.0,4.0,23950,2,10.083765,3.178054,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
df_ordinary["Make"].value_counts()

Chevrolet     1123
Ford           881
Volkswagen     809
Dodge          626
Nissan         558
Honda          449
Mazda          423
Suzuki         351
Hyundai        303
Subaru         256
Kia            231
Mitsubishi     213
Chrysler       187
Pontiac        186
Oldsmobile     150
Plymouth        82
FIAT            62
Scion           60
Name: Make, dtype: int64

In [10]:
print(len(df_ordinary))
print(len(df_deluxe))
print(len(df_supdel))
print(len(df_luxury))
print(len(df_suplux))

6950
4372
52
521
19


We take the ordinary segment, as it has the most data, and fit our model on it. Once done, the same model will be applied
to the other segments. (This is for the MVP only; in practice the modeling exercise needs to be repeated for each segment, as the relationships 
may be different.)

In [5]:
df_ordinary["Number of Doors"] = df_ordinary["Number of Doors"].replace("?",0)
df_ordinary["Number of Doors"] = df_ordinary["Number of Doors"].astype('float32')
df_ordinary["MSRP"] = df_ordinary["MSRP"].replace("?",0)
df_ordinary["MSRP"] = df_ordinary["MSRP"].astype("float32")
df_ordinary["Engine HP"] = df_ordinary["Engine HP"].replace("?",0)
df_ordinary["Engine HP"] = df_ordinary["Engine HP"].astype("float32")
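The cell above replaces the "?" placeholder column by column. A more general way to get the same effect, sketched here on made-up data, is to let `pd.to_numeric` coerce anything unparseable to NaN and then fill with 0. The helper name `to_float_with_zero` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def to_float_with_zero(series):
    """Convert a column to float32, mapping unparseable entries (e.g. "?") to 0."""
    # errors="coerce" turns every value that is not a number into NaN
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.fillna(0).astype("float32")

# Toy data, not taken from the car dataset
toy = pd.DataFrame({
    "Number of Doors": ["2", "?", "4"],
    "Engine HP": [150, "?", 200],
})
for col in ["Number of Doors", "Engine HP"]:
    toy[col] = to_float_with_zero(toy[col])

print(toy.dtypes)
```

Unlike a plain `replace("?", 0)`, this also catches other stray strings, which may or may not be desirable depending on how strictly the input should be validated.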

Use a stratified shuffle split on the age of the car, so that the age distribution is represented accurately in both 
the train and test samples

In [6]:
df_ordinary["Age"].value_counts()
# Create a field Age-cat that buckets the cars by age in 5-year steps:
# 0 for Age 0, 1 for Age 1-5, 2 for Age 6-10, ..., with every car older than 20 years capped at category 5
df_ordinary["Age-cat"] = np.ceil(df_ordinary["Age"] / 5)
# Assign the result back instead of using inplace=True on a column selection
df_ordinary["Age-cat"] = df_ordinary["Age-cat"].where(df_ordinary["Age-cat"] < 5, 5.0)
# Check the distribution of Age-cat in the original data
df_ordinary["Age-cat"].value_counts() / len(df_ordinary)

1.0    0.436835
2.0    0.157122
0.0    0.127914
3.0    0.106187
5.0    0.100432
4.0    0.071511
Name: Age-cat, dtype: float64
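To make the bucketing rule concrete, here is a small sketch on made-up ages. It uses `clip(upper=5.0)`, which for these values gives the same result as the `where(... < 5, 5.0)` call in the cell above; the sample ages are assumptions chosen to hit each boundary.

```python
import numpy as np
import pandas as pd

# Toy ages chosen to show the bucket boundaries
ages = pd.Series([0, 1, 5, 6, 10, 23, 40], name="Age")

# ceil(Age / 5): 0 for new cars, 1 for ages 1-5, 2 for ages 6-10, ...
age_cat = np.ceil(ages / 5)

# Cap at 5 so that all cars older than 20 years share one bucket
age_cat = age_cat.clip(upper=5.0)

print(age_cat.tolist())
# [0.0, 1.0, 1.0, 2.0, 2.0, 5.0, 5.0]
```

The cap keeps the oldest bucket from becoming too sparse for the stratified split.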

In [7]:
split = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)

for train_index, test_index in split.split(df_ordinary,df_ordinary["Age-cat"]):
    strat_train_set = df_ordinary.iloc[train_index]
    strat_test_set = df_ordinary.iloc[test_index]

In [8]:
#check distribution of Age Cat in the train data
strat_train_set["Age-cat"].value_counts() / len(strat_train_set)

1.0    0.436871
2.0    0.157194
0.0    0.127878
3.0    0.106115
5.0    0.100360
4.0    0.071583
Name: Age-cat, dtype: float64

In [9]:
#check distribution of Age Cat in the test data
strat_test_set["Age-cat"].value_counts() / len(strat_test_set)

1.0    0.436691
2.0    0.156835
0.0    0.128058
3.0    0.106475
5.0    0.100719
4.0    0.071223
Name: Age-cat, dtype: float64
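The three distributions above can also be compared side by side in one table. The following is a minimal sketch on randomly generated categories rather than the car data; the helper `proportions` and the toy frame are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy category column with a fixed random seed (not the real Age-cat values)
rng = np.random.RandomState(0)
toy = pd.DataFrame({"Age-cat": rng.choice([0.0, 1.0, 2.0], size=1000, p=[0.2, 0.5, 0.3])})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["Age-cat"]))

def proportions(frame):
    """Share of rows per Age-cat value, ordered by category."""
    return frame["Age-cat"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "overall": proportions(toy),
    "train": proportions(toy.iloc[train_idx]),
    "test": proportions(toy.iloc[test_idx]),
})
print(comparison)
```

With stratification, the three columns should agree up to rounding, as they do in the notebook output.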

Create the X and Y variables from the Feature analysis done in Exploration notebook. Repeat the same operations 
for Train and Test data.

In [10]:
carSales_X = strat_train_set.drop("MSRP", axis=1) # drop labels for training set
carSales_Y = strat_train_set["MSRP"].copy()

carSales_test_X = strat_test_set.drop("MSRP", axis=1) # drop labels for test set
carSales_test_Y = strat_test_set["MSRP"].copy()

In [12]:
# Series.reshape is not available in recent pandas versions; reshape the underlying NumPy array instead
carSales_Y = carSales_Y.values.reshape(-1, 1)
carSales_test_Y = carSales_test_Y.values.reshape(-1, 1)
print(carSales_X.shape)
print(carSales_Y.shape)
print(carSales_test_X.shape)
print(carSales_test_Y.shape)

(5560, 18)
(5560, 1)
(1390, 18)
(1390, 1)


We have 5560 rows in the train data and 1390 rows in the test data. Now we need to remove the unnecessary columns, based on
the correlation analysis done in the exploration notebook, and encode the categorical variables. We also 
need to standardize the numerical features before applying regression models.

In [13]:
carSales_X.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Age,MSRP_Median,MSRP_group,Age-cat
6188,Dodge,Journey,2015,regular unleaded,283.0,6.0,AUTOMATIC,all wheel drive,4.0,Crossover,Midsize,4dr SUV,24,16,2,23115.0,ordinary,1.0
9434,Chevrolet,Silverado 1500 Classic,2007,regular unleaded,285.0,8.0,AUTOMATIC,rear wheel drive,2.0,Flex Fuel,Large,Regular Cab Pickup,19,15,10,26430.0,ordinary,2.0
8974,Hyundai,Santa Fe Sport,2015,regular unleaded,264.0,4.0,AUTOMATIC,front wheel drive,4.0,Crossover,Midsize,4dr SUV,27,19,2,23400.0,ordinary,1.0
3102,Honda,Crosstour,2013,regular unleaded,278.0,6.0,AUTOMATIC,front wheel drive,4.0,"Crossover,Hatchback",Midsize,4dr Hatchback,30,20,4,26140.0,ordinary,1.0
100,Nissan,240SX,1997,regular unleaded,155.0,4.0,MANUAL,rear wheel drive,2.0,Performance,Compact,Coupe,26,19,20,28995.0,ordinary,4.0


First we drop all other columns, working on a new DataFrame, and retain only the numerical features of significance

In [16]:
# drop() returns a new DataFrame, so carSales_X itself keeps all of its columns
carSales_X_num = carSales_X.drop(
    ["Make", "Model", "Year", "Engine Fuel Type", "Transmission Type",
     "Driven_Wheels", "Number of Doors", "Market Category", "Vehicle Style",
     "Vehicle Size", "highway MPG", "Age-cat", "MSRP_Median", "MSRP_group",
     "Engine HP"],
    axis=1)

In [17]:
carSales_X_num.head()

Unnamed: 0,Engine Cylinders,city mpg,Age
6188,6.0,16,2
9434,8.0,15,10
8974,4.0,19,2
3102,6.0,20,4
100,4.0,19,20


In [18]:
#Apply the same transformation on Test data
# drop() returns a new DataFrame, so carSales_test_X itself keeps all of its columns
carSales_test_X_num = carSales_test_X.drop(
    ["Make", "Model", "Year", "Engine Fuel Type", "Transmission Type",
     "Driven_Wheels", "Number of Doors", "Market Category", "Vehicle Style",
     "Vehicle Size", "highway MPG", "Age-cat", "MSRP_Median", "MSRP_group",
     "Engine HP"],
    axis=1)

In [19]:
carSales_test_X_num.head()

Unnamed: 0,Engine Cylinders,city mpg,Age
5711,4.0,22,8
11400,4.0,20,15
6075,5.0,24,4
9734,4.0,25,1
5366,4.0,25,2


In [20]:
# Cast the numerical features to float32 and map missing-value markers to 0 (train data)
#carSales_X_num["Engine HP"] = carSales_X_num["Engine HP"].astype("float32")
carSales_X_num["Engine Cylinders"] = carSales_X_num["Engine Cylinders"].astype("float32")
carSales_X_num["city mpg"] = carSales_X_num["city mpg"].astype("float32")
carSales_X_num["Age"] = carSales_X_num["Age"].astype("float32")
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
carSales_X_num.replace('null',np.nan,inplace=True)
carSales_X_num = pd.DataFrame(carSales_X_num)
carSales_X_num = carSales_X_num.replace('?',0)
carSales_X_num = carSales_X_num.replace('NaN',0)
carSales_X_num = carSales_X_num.replace(np.nan,0)

# Same steps for the test data
#carSales_test_X_num["Engine HP"] = carSales_test_X_num["Engine HP"].astype("float32")
carSales_test_X_num["Engine Cylinders"] = carSales_test_X_num["Engine Cylinders"].astype("float32")
carSales_test_X_num["city mpg"] = carSales_test_X_num["city mpg"].astype("float32")
carSales_test_X_num["Age"] = carSales_test_X_num["Age"].astype("float32")
carSales_test_X_num.replace('null',np.nan,inplace=True)
carSales_test_X_num = pd.DataFrame(carSales_test_X_num)
carSales_test_X_num = carSales_test_X_num.replace('?',0)
carSales_test_X_num = carSales_test_X_num.replace('NaN',0)
carSales_test_X_num = carSales_test_X_num.replace(np.nan,0)

In [21]:
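# List the columns that contain nulls (an empty Series means none were found)
m=carSales_X_num.isnull().any()
print(m[m])
# Note: isfinite(...).any() lists the float64 columns holding at least one finite value;
# it does not flag columns that contain inf or NaN
m=np.isfinite(carSales_X_num.select_dtypes(include=['float64'])).any()
print(m[m])
# Same checks for the test data
m=carSales_test_X_num.isnull().any()
print(m[m])
m=np.isfinite(carSales_test_X_num.select_dtypes(include=['float64'])).any()
print(m[m])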

Series([], dtype: bool)
Engine Cylinders    True
city mpg            True
Age                 True
dtype: bool
Series([], dtype: bool)
Engine Cylinders    True
city mpg            True
Age                 True
dtype: bool


Wherever there are 0 values, the intent is to replace them by the column mean; the cell below sets up and fits the imputer for this

In [22]:
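imputer = Imputer(missing_values=0,strategy="mean")
# Fit on the training data only, so that the means are not taken from the test set.
# Note that fit() alone does not change the data; transform() has to be called to apply the imputation.
imputer.fit(carSales_X_num)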

Imputer(axis=0, copy=True, missing_values=0, strategy='mean', verbose=0)
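For reference, here is a hedged sketch of how mean imputation of 0 values is usually applied: fit on the training data, then transform both sets with the same statistics. It uses `SimpleImputer` from `sklearn.impute`, which replaced `Imputer` in newer scikit-learn releases, and runs on made-up arrays rather than on `carSales_X_num`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: 0 marks a missing value in both columns
train = np.array([[4.0, 20.0],
                  [0.0, 30.0],
                  [8.0, 0.0]])
test = np.array([[0.0, 25.0],
                 [6.0, 0.0]])

imputer = SimpleImputer(missing_values=0, strategy="mean")
# Column means are learned from the non-zero training values only
train_imputed = imputer.fit_transform(train)
# The test set reuses the training means instead of computing its own
test_imputed = imputer.transform(test)

print(imputer.statistics_)  # [ 6. 25.]
print(test_imputed)
```

One caveat when applying this to the notebook's features: an Age of 0 is a legitimate value here (the Age-cat 0.0 bucket holds about 13% of the rows), so treating 0 as missing for every column would overwrite real data. Restricting the imputer to the columns where 0 really means "unknown" is probably the safer choice.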

In [27]:
#Standardize the data using sklearn StandardScaler
scaler = StandardScaler()
train_X = scaler.fit_transform(carSales_X_num)
test_X = scaler.transform(carSales_test_X_num)
print(train_X.shape)

(5560, 3)
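What the scaler does can be checked by hand: each column is shifted by the training mean and divided by the training standard deviation, and the test data is transformed with those same training statistics. The sketch below verifies this on made-up numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features (two columns), not taken from the car dataset
train = np.array([[4.0, 16.0],
                  [6.0, 20.0],
                  [8.0, 24.0]])
test = np.array([[6.0, 18.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

# Manual z-score using the population standard deviation (ddof=0), as StandardScaler does
manual_train = (train - train.mean(axis=0)) / train.std(axis=0)
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_scaled, manual_train))  # True
print(np.allclose(test_scaled, manual_test))    # True
```

Calling `fit_transform` only on the training data, as the cell above does, keeps information from the test set out of the scaling parameters.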


Now add the categorical variables using a one-hot representation

In [28]:
car_trans_type = carSales_X["Transmission Type"]
encoder = LabelBinarizer()
car_trans_1hot = encoder.fit_transform(car_trans_type)
print(car_trans_1hot.shape)

train_X = np.concatenate((train_X,car_trans_1hot),axis=1)

car_driven_wheels = carSales_X["Driven_Wheels"]
encoder = LabelBinarizer()
car_drive_1hot = encoder.fit_transform(car_driven_wheels)
print(car_drive_1hot.shape)

train_X = np.concatenate((train_X,car_drive_1hot),axis=1)

car_vehicle_size = carSales_X["Vehicle Size"]
encoder = LabelBinarizer()
car_size_1hot = encoder.fit_transform(car_vehicle_size)
print(car_size_1hot.shape)

train_X = np.concatenate((train_X,car_size_1hot),axis=1)

car_make = carSales_X["Make"]
encoder = LabelBinarizer()
car_make_1hot = encoder.fit_transform(car_make)
print(car_make_1hot.shape)

train_X_make = np.concatenate((train_X,car_make_1hot),axis=1)
print(train_X.shape)
print(train_X_make.shape)


(5560, 5)
(5560, 4)
(5560, 3)
(5560, 18)
(5560, 15)
(5560, 33)


In [29]:
car_trans_type_test = carSales_test_X["Transmission Type"]
encoder = LabelBinarizer()
car_trans_1hot_test = encoder.fit_transform(car_trans_type_test)
print(car_trans_1hot_test.shape)

test_X = np.concatenate((test_X,car_trans_1hot_test),axis=1)

car_driven_wheels_test = carSales_test_X["Driven_Wheels"]
encoder = LabelBinarizer()
car_drive_1hot_test = encoder.fit_transform(car_driven_wheels_test)
print(car_drive_1hot_test.shape)

test_X = np.concatenate((test_X,car_drive_1hot_test),axis=1)

car_vehicle_size_test = carSales_test_X["Vehicle Size"]
encoder = LabelBinarizer()
car_size_1hot_test = encoder.fit_transform(car_vehicle_size_test)
print(car_size_1hot_test.shape)

test_X = np.concatenate((test_X,car_size_1hot_test),axis=1)

car_make_test = carSales_test_X["Make"]
encoder = LabelBinarizer()
car_make_1hot_test = encoder.fit_transform(car_make_test)
print(car_make_1hot_test.shape)

test_X_make = np.concatenate((test_X,car_make_1hot_test),axis=1)

print(test_X.shape)
print(test_X_make.shape)

(1390, 5)
(1390, 4)
(1390, 3)
(1390, 18)
(1390, 15)
(1390, 33)
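The test cell above fits a fresh `LabelBinarizer` on the test data. The resulting columns only line up with the training columns when both sets contain exactly the same categories, which the matching shapes suggest is the case here. A more robust pattern, sketched below on made-up values, is to fit the encoder on the training data and reuse it for the test data. `OneHotEncoder` with `handle_unknown="ignore"` is swapped in for `LabelBinarizer` as an assumption; it encodes categories unseen during training as all zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy category values for illustration only
train_cat = pd.DataFrame({"Transmission Type": ["AUTOMATIC", "MANUAL", "AUTOMATED_MANUAL", "MANUAL"]})
test_cat = pd.DataFrame({"Transmission Type": ["MANUAL", "DIRECT_DRIVE"]})

encoder = OneHotEncoder(handle_unknown="ignore")
# The column layout is learned from the training categories (sorted alphabetically)
train_1hot = encoder.fit_transform(train_cat).toarray()
# The same layout is applied to the test data; unseen categories become all-zero rows
test_1hot = encoder.transform(test_cat).toarray()

print(encoder.categories_[0])
print(test_1hot)
# [[0. 0. 1.]
#  [0. 0. 0.]]
```

With this approach the train and test matrices always have the same number of columns in the same order, regardless of which categories happen to appear in the test split.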


In [30]:
train_Y = pd.DataFrame(carSales_Y)
m=train_Y.isnull().any()
print(m[m])
m=np.isfinite(train_Y.select_dtypes(include=['float64'])).any()
print(m[m])

test_Y = pd.DataFrame(carSales_test_Y)
m=test_Y.isnull().any()
print(m[m])
m=np.isfinite(test_Y.select_dtypes(include=['float64'])).any()
print(m[m])

Series([], dtype: bool)
Series([], dtype: bool)
Series([], dtype: bool)
Series([], dtype: bool)


We now save the pre-processed data, so that modeling can start directly from it
at any later point in time

In [31]:
train_X_ordinary='C:/users/hackuser1/Hackathon18/train_X_ord.pkl'
test_X_ordinary='C:/users/hackuser1/Hackathon18/test_X_ord.pkl'
train_Y_ordinary='C:/users/hackuser1/Hackathon18/train_Y_ord.pkl'
test_Y_ordinary='C:/users/hackuser1/Hackathon18/test_Y_ord.pkl'

# pickle.dump returns None, so its result is not assigned
with open(train_X_ordinary, "wb") as f:
    pickle.dump(train_X,f)
with open(test_X_ordinary, "wb") as f:
    pickle.dump(test_X,f)
with open(train_Y_ordinary, "wb") as f:
    pickle.dump(train_Y,f)
with open(test_Y_ordinary, "wb") as f:
    pickle.dump(test_Y,f)

train_X_ord_make='C:/users/hackuser1/Hackathon18/train_X_ord_make.pkl'
test_X_ord_make='C:/users/hackuser1/Hackathon18/test_X_ord_make.pkl'

with open(train_X_ord_make, "wb") as f:
    pickle.dump(train_X_make,f)
with open(test_X_ord_make, "wb") as f:
    pickle.dump(test_X_make,f)
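A later modeling notebook would read these files back with `pickle.load`. The round trip can be sketched as follows; the temporary directory and the toy arrays are assumptions standing in for the real paths and data.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy arrays standing in for the pre-processed feature matrices
arrays = {
    "train_X_ord": np.arange(6.0).reshape(3, 2),
    "test_X_ord": np.ones((2, 2)),
}

# A temporary directory is used here instead of the notebook's output folder
out_dir = tempfile.mkdtemp()
for name, array in arrays.items():
    with open(os.path.join(out_dir, name + ".pkl"), "wb") as f:
        pickle.dump(array, f)

# Read the files back and confirm that nothing changed
restored = {}
for name in arrays:
    with open(os.path.join(out_dir, name + ".pkl"), "rb") as f:
        restored[name] = pickle.load(f)

print(all(np.array_equal(arrays[k], restored[k]) for k in arrays))  # True
```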
