This notebook prepares and pre-processes data for various prediction models, using the Used Car Price data 
at https://www.kaggle.com/jpayne/852k-used-car-listings/data. It follows the initial Data exploration, as given in ..... 
The numerical variables are scaled with StandardScaler, and an imputation strategy is used to replace 0 values with the mean.
A StratifiedShuffleSplit is done based on the Age of the car (Current Year - Year of Car), by creating an Age category (Age / 5) and 
putting the values into different Age category buckets, so that the same distribution is maintained in the Train and Test data.
The categorical variables (Make, State) are one-hot encoded and added to the feature vector. The numerical variables considered are Age of Car in years and Mileage. 
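
The one-hot encoding of Make and State mentioned above can be sketched with pandas. The rows below are hypothetical, and `pd.get_dummies` is only one possible way to do it; it is not necessarily the method used in the exploration notebook.

```python
import pandas as pd

# Hypothetical listings; only the column names come from this notebook
toy = pd.DataFrame({
    "Make": ["Ford", "Toyota", "Ford"],
    "State": ["TX", "CA", "CA"],
    "Mileage": [35000, 60000, 12000],
    "Age": [3, 6, 1],
})

# Each category becomes its own indicator column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Make", "State"])
print(encoded.columns.tolist())
```

Every distinct Make and State adds one column, which is why the encoded feature matrix grows wide when all makes are kept together.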

In [6]:
#Import all necessary libraries
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures

Read the pickle files prepared by stratifying the Car Sales data based on Make, Model, and State. This is necessary because cars belong to different price segments: for the same feature values, the price range differs from one make to another, so fitting all makes and models together is not viable. The stratification details can be found in the 
Data exploration notebook, which precedes this one and writes the data to .pkl files by car segment/price category.

In [7]:
df_ordinary=pd.read_pickle('C:/users/hackuser1/carSalesFordF150.pkl')

In [8]:
df_ordinary.head()

Unnamed: 0,Price,Mileage,Age
1298,11900,9,9
1299,5850,15,15
1304,11800,13,13
1326,28700,3,3
1332,17977,11,11


In [9]:
print(len(df_ordinary))


13263


We check the distribution of car sales by Age of Car, then create Age-cat (Age / 5, rounded up and capped at 5) and check the distribution of the data across it. We plan to use stratified sampling so that both the Test and Train data represent the same distribution of cars by Age of Car.

In [10]:
df_ordinary["Age"].value_counts()
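#create a field Age-cat to divide the data into 5 Age categories, based on the Age of the car
df_ordinary["Age-cat"] = np.ceil(df_ordinary["Age"] / 5)
#merge every category of 5 and above into category 5; assign the result back instead of
#using inplace=True on a column selection, which may not modify the DataFrame in newer pandas
df_ordinary["Age-cat"] = df_ordinary["Age-cat"].where(df_ordinary["Age-cat"] < 5, 5.0)
#check distribution of Age-cat in the original data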
df_ordinary["Age-cat"].value_counts() / len(df_ordinary)

1.0    0.647893
2.0    0.253412
3.0    0.085350
4.0    0.011611
5.0    0.001734
Name: Age-cat, dtype: float64
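
To make the bucketing rule concrete, here is a minimal sketch on a handful of hypothetical ages (the values are made up for illustration; the notebook reads them from `df_ordinary["Age"]`).

```python
import numpy as np
import pandas as pd

# Hypothetical ages in years
ages = pd.Series([1, 5, 6, 10, 14, 19, 23, 40], name="Age")

# ceil(Age / 5) maps 1-5 -> 1, 6-10 -> 2, 11-15 -> 3, 16-20 -> 4, and so on
age_cat = np.ceil(ages / 5)

# Merge every bucket of 5 and above into a single bucket 5
age_cat = age_cat.where(age_cat < 5, 5.0)

print(age_cat.tolist())
# [1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 5.0]
```

Capping at 5 keeps the rare, very old cars in one bucket, so that each stratum has enough rows to be split between Train and Test.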

In [None]:
#We treat Make, Model and State as categorical variables; these are already one-hot encoded as part of the analysis

In [11]:
split = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)

for train_index, test_index in split.split(df_ordinary,df_ordinary["Age-cat"]):
    strat_train_set = df_ordinary.iloc[train_index]
    strat_test_set = df_ordinary.iloc[test_index]

In [12]:
#check distribution of Age Cat in the train data
strat_train_set["Age-cat"].value_counts() / len(strat_train_set)

1.0    0.647879
2.0    0.253440
3.0    0.085391
4.0    0.011593
5.0    0.001697
Name: Age-cat, dtype: float64

In [13]:
#check distribution of Age Cat in the test data
strat_test_set["Age-cat"].value_counts() / len(strat_test_set)

1.0    0.647946
2.0    0.253298
3.0    0.085187
4.0    0.011685
5.0    0.001885
Name: Age-cat, dtype: float64
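
The same check can be reproduced on synthetic data. This is a sketch under stated assumptions: the DataFrame `toy`, its class probabilities and its prices are invented for illustration, and only the `StratifiedShuffleSplit` settings mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in for the car data (hypothetical values)
toy = pd.DataFrame({
    "Age-cat": rng.choice([1.0, 2.0, 3.0], size=1000, p=[0.65, 0.25, 0.10]),
    "Price": rng.integers(5000, 30000, size=1000),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# With n_splits=1 the generator yields a single (train, test) pair of index arrays
train_index, test_index = next(split.split(toy, toy["Age-cat"]))
toy_train = toy.iloc[train_index]
toy_test = toy.iloc[test_index]

# Put the class proportions of the full, train and test data side by side
comparison = pd.DataFrame({
    "overall": toy["Age-cat"].value_counts(normalize=True),
    "train": toy_train["Age-cat"].value_counts(normalize=True),
    "test": toy_test["Age-cat"].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

The three columns should agree closely, which is the property the outputs of cells 10, 12 and 13 show for the real data.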

Create the X and Y variables based on the feature analysis done in the Exploration notebook. The same operations 
are repeated for the Train and Test data.

In [14]:
carSales_X = strat_train_set.copy()
carSales_X = carSales_X.drop("Price", axis=1) # drop labels for training set
carSales_Y = strat_train_set["Price"].copy() # use Price as labels for training set, based on data Exploration
carSales_X = carSales_X.drop("Age-cat", axis=1)

carSales_test_X = strat_test_set.copy()
carSales_test_X = carSales_test_X.drop("Price", axis=1) # drop labels for test set
carSales_test_X = carSales_test_X.drop("Age-cat", axis=1)
carSales_test_Y = strat_test_set["Price"].copy()# use Price as labels for test set, based on data Exploration

In [15]:
carSales_Y = carSales_Y.values.reshape(carSales_Y.shape[0],1)
carSales_test_Y = carSales_test_Y.values.reshape(carSales_test_Y.shape[0],1)
#carSales_Y_log = np.log(carSales_Y)
#carSales_test_Y_log = np.log(carSales_test_Y)
print(carSales_Y.shape)
print(carSales_test_Y.shape)


(10610, 1)
(2653, 1)


In [16]:
carSales_X.head()

Unnamed: 0,Mileage,Age
39714,9,9
695489,4,4
706430,3,3
702727,3,3
682169,3,3


In [18]:
carSales_X_num = carSales_X.filter(['Mileage','Age'],axis=1)
carSales_test_X_num=carSales_test_X.filter(['Mileage','Age'],axis=1)
carSales_X_num.head()


Unnamed: 0,Mileage,Age
39714,9,9
695489,4,4
706430,3,3
702727,3,3
682169,3,3


In [19]:
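#list the columns that contain null values (an empty Series means there are none)
m=carSales_X_num.isnull().any()
print(m[m])
#list the float columns that contain non-finite values (NaN or +/-inf)
m=(~np.isfinite(carSales_X_num.select_dtypes(include=['float64']))).any()
print(m[m])
#repeat both checks for the test data
m=carSales_test_X_num.isnull().any()
print(m[m])
m=(~np.isfinite(carSales_test_X_num.select_dtypes(include=['float64']))).any()
print(m[m])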

Series([], dtype: bool)
Series([], dtype: bool)
Series([], dtype: bool)
Series([], dtype: bool)
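
As an illustration of what these checks report when something is wrong, here is a small sketch on a hypothetical frame that contains one NaN and one infinite value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN and one infinite entry
toy = pd.DataFrame({
    "Mileage": [10.0, np.nan, 30.0],
    "Age": [2.0, 4.0, np.inf],
})

# True for every column that holds at least one null value
has_nan = toy.isnull().any()
print(has_nan[has_nan])

# True for every column that holds at least one non-finite value (NaN or +/-inf)
has_non_finite = (~np.isfinite(toy)).any()
print(has_non_finite[has_non_finite])
```

Only Mileage appears in the first printout, while both columns appear in the second, because `isnull` does not treat infinity as missing.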


Wherever there are 0 values, we replace them with the column mean.

In [20]:
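imputer = Imputer(missing_values=0,strategy="mean")
#fit on the training data only; a second fit on the test data would overwrite the training means
#note: fit() only learns the column means, transform() must be called for the replacement to take effect
imputer.fit(carSales_X_num)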

Imputer(axis=0, copy=True, missing_values=0, strategy='mean', verbose=0)
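
A minimal sketch of the full imputation step follows. It assumes a newer scikit-learn, where `Imputer` has been removed and `sklearn.impute.SimpleImputer` is its replacement; the arrays are hypothetical [Mileage, Age] rows, not the notebook's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical [Mileage, Age] rows; a 0 marks a value to be replaced
toy_train = np.array([[10.0, 2.0], [0.0, 4.0], [30.0, 0.0]])
toy_test = np.array([[0.0, 5.0]])

imputer = SimpleImputer(missing_values=0, strategy="mean")

# Learn the column means from the training rows only, then apply them to both sets
toy_train_imputed = imputer.fit_transform(toy_train)
toy_test_imputed = imputer.transform(toy_test)

print(imputer.statistics_)  # per-column means, computed while ignoring the 0 entries
print(toy_test_imputed)
```

Fitting on the training rows alone means the test set is filled with training means, so no test-set statistics leak into the pre-processing.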

In [22]:
#Standardize the data using sklearn StandardScaler
scaler = StandardScaler()
train_X = scaler.fit_transform(carSales_X_num)
test_X = scaler.transform(carSales_test_X_num)
print(train_X.shape)
print(test_X.shape)

(10610, 2)
(2653, 2)
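
What `fit_transform` on the training data followed by `transform` on the test data does can be checked by hand on a small example. The arrays below are hypothetical; the sketch only confirms that the test rows are scaled with the mean and standard deviation of the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical [Mileage, Age] rows
toy_train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
toy_test = np.array([[20.0, 7.0]])

scaler = StandardScaler()
toy_train_scaled = scaler.fit_transform(toy_train)  # learns mean and std from train
toy_test_scaled = scaler.transform(toy_test)        # reuses the train mean and std

# Manual version: z = (x - train mean) / train std, with the population std (ddof=0)
manual = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)
print(np.allclose(toy_test_scaled, manual))
```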


In [63]:
#carSales_X_cat = carSales_X.drop(['Mileage','Age'],axis=1)
#carSales_test_X_cat = carSales_test_X.drop(['Mileage','Age'],axis=1)
#print(carSales_X_cat.shape)
#print(carSales_test_X_cat.shape)

(120340, 270)
(30086, 270)


In [64]:
#train_X =  np.concatenate((train_X,carSales_X_cat.values),axis=1)
#test_X =  np.concatenate((test_X,carSales_test_X_cat.values),axis=1)
print(train_X.shape)
print(test_X.shape)

(120340, 272)
(30086, 272)


In [23]:
train_Y = pd.DataFrame(carSales_Y)
m=train_Y.isnull().any()
print(m[m])
m=(~np.isfinite(train_Y.select_dtypes(include=['float64']))).any() #flag float columns with non-finite values
print(m[m])

#train_Y_log = pd.DataFrame(carSales_Y_log)
#m=train_Y_log.isnull().any()
#print(m[m])
#m=np.isfinite(train_Y_log.select_dtypes(include=['float64'])).any()
#print(m[m])

test_Y = pd.DataFrame(carSales_test_Y)
m=test_Y.isnull().any()
print(m[m])
m=(~np.isfinite(test_Y.select_dtypes(include=['float64']))).any() #flag float columns with non-finite values
print(m[m])

#test_Y_log = pd.DataFrame(carSales_test_Y_log)
#m=test_Y_log.isnull().any()
#print(m[m])
#m=np.isfinite(test_Y_log.select_dtypes(include=['float64'])).any()
#print(m[m])



Series([], dtype: bool)
Series([], dtype: bool)
Series([], dtype: bool)
Series([], dtype: bool)


We now save a backup of the pre-processed data, so that modeling can start directly from it
at any later point in time.

In [24]:
train_X_mileage='C:/users/hackuser1/train_X_mileage1.pkl'
test_X_mileage='C:/users/hackuser1/test_X_mileage1.pkl'
train_Y_mileage='C:/users/hackuser1/train_Y_mileage1.pkl'
test_Y_mileage='C:/users/hackuser1/test_Y_mileage1.pkl'
#train_Y_logf='C:/users/hackuser1/train_Y_mileage_log1.pkl'
#test_Y_logf='C:/users/hackuser1/test_Y_mileage_log1.pkl'

with open(train_X_mileage, "wb") as f:
    pickle.dump(train_X,f)
with open(test_X_mileage, "wb") as f:
    pickle.dump(test_X,f)
with open(train_Y_mileage, "wb") as f:
    pickle.dump(train_Y,f)
with open(test_Y_mileage, "wb") as f:
    pickle.dump(test_Y,f)
#with open(train_Y_logf, "wb") as f:
#    w = pickle.dump(train_Y_log,f)
#with open(test_Y_logf, "wb") as f:
#    w = pickle.dump(test_Y_log,f)
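
For completeness, a later modeling notebook could load the saved arrays back with `pickle.load`. The sketch below uses a temporary directory and a hypothetical array so that it runs anywhere; the file name `toy_train_X.pkl` is made up and is not one of the files written above.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical scaled feature matrix standing in for train_X
toy_train_X = np.array([[0.5, -1.2], [1.1, 0.3]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_train_X.pkl")

    # Write the array in binary mode, as in the cell above
    with open(path, "wb") as f:
        pickle.dump(toy_train_X, f)

    # Read it back in binary mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(np.array_equal(toy_train_X, restored))
```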
