# USED CAR PRICE PREDICTION

This dataset contains used car prices in India and is available on Kaggle: https://www.kaggle.com/avikasliwal/used-cars-price-prediction.

The main objective is to draw insights from this dataset and build a model that predicts car prices accurately.

### DATA DICTIONARY 
- Owner_Type              = car's owner Type (First, Second, Third, Fourth & Above)
- Kilometers_Driven       = number of kilometers driven 
- Mileage                 = fuel efficiency (distance covered per unit of fuel)
- Seats                   = number of seats
- Engine                  = engine displacement (cc)
- Fuel_Type               = car's fuel type (CNG, Diesel, Petrol, LPG) 
- Year_Gap                = car's age in years (2020 minus the value in 'Year')
- Power                   = maximum engine power (bhp)
- Transmission            = car's transmission type (Manual, Automatic)

## IMPORT LIBRARIES

In [176]:
# data wrangling
import numpy as np
import pandas as pd

# modelling
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import ExtraTreeRegressor
from sklearn import metrics
import pickle

## OVERVIEW

In [177]:
# load data
df = pd.read_csv('cars_cleaned.csv')

In [178]:
# show top 5 data
df.head()

|   | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.6 | 998 | 58.16 | 5 | 1.75 |
| 1 | Hyundai | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 | 1582 | 126.2 | 5 | 12.5 |
| 2 | Honda | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.2 | 1199 | 88.7 | 5 | 4.5 |
| 3 | Maruti | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 | 1248 | 88.76 | 7 | 6.0 |
| 4 | Audi | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 | 1968 | 140.8 | 5 | 17.74 |


In [179]:
# show info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5872 entries, 0 to 5871
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               5872 non-null   object 
 1   Location           5872 non-null   object 
 2   Year               5872 non-null   int64  
 3   Kilometers_Driven  5872 non-null   int64  
 4   Fuel_Type          5872 non-null   object 
 5   Transmission       5872 non-null   object 
 6   Owner_Type         5872 non-null   object 
 7   Mileage            5872 non-null   float64
 8   Engine             5872 non-null   int64  
 9   Power              5872 non-null   float64
 10  Seats              5872 non-null   int64  
 11  Price              5872 non-null   float64
dtypes: float64(3), int64(4), object(5)
memory usage: 550.6+ KB


## DATA EXPLORATION

In [180]:
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Year | 5872.0 | 2013.477691 | 3.164568 | 1998.0 | 2012.0 | 2014.0 | 2016.0 | 2019.0 |
| Kilometers_Driven | 5872.0 | 58316.999149 | 92169.410006 | 171.0 | 33422.5 | 52609.0 | 72402.75 | 6500000.0 |
| Mileage | 5872.0 | 18.277839 | 4.365657 | 0.0 | 15.26 | 18.2 | 21.1 | 33.54 |
| Engine | 5872.0 | 1625.745572 | 601.641783 | 624.0 | 1198.0 | 1495.5 | 1991.0 | 5998.0 |
| Power | 5872.0 | 113.276894 | 53.881892 | 34.2 | 75.0 | 97.7 | 138.1 | 560.0 |
| Seats | 5872.0 | 5.283719 | 0.805081 | 2.0 | 5.0 | 5.0 | 5.0 | 10.0 |
| Price | 5872.0 | 9.603919 | 11.249453 | 0.44 | 3.5175 | 5.75 | 10.0 | 160.0 |


In [181]:
# summarize the categorical (object) columns
df.describe(include='object').T

|   | count | unique | top | freq |
|---|---|---|---|---|
| Name | 5872 | 30 | Maruti | 1175 |
| Location | 5872 | 11 | Mumbai | 775 |
| Fuel_Type | 5872 | 4 | Diesel | 3152 |
| Transmission | 5872 | 2 | Manual | 4170 |
| Owner_Type | 5872 | 4 | First | 4839 |


In [182]:
df.Location.unique()

array(['Mumbai', 'Pune', 'Chennai', 'Coimbatore', 'Hyderabad', 'Jaipur',
       'Kochi', 'Kolkata', 'Delhi', 'Bangalore', 'Ahmedabad'],
      dtype=object)

In [183]:
df.Name.unique()

array(['Maruti', 'Hyundai', 'Honda', 'Audi', 'Nissan', 'Toyota',
       'Volkswagen', 'Tata', 'Land', 'Mitsubishi', 'Renault',
       'Mercedes-Benz', 'BMW', 'Mahindra', 'Ford', 'Porsche', 'Datsun',
       'Jaguar', 'Volvo', 'Chevrolet', 'Skoda', 'Mini', 'Fiat', 'Jeep',
       'Ambassador', 'Isuzu', 'ISUZU', 'Force', 'Bentley', 'Lamborghini'],
      dtype=object)

We'll drop 'Name' and 'Location' because they have too many unique values to encode (30 and 11, respectively). We'll also drop 'Year', but first we derive the year gap by subtracting each 'Year' value from the current year (assumed to be 2020); once that feature exists, 'Year' is redundant.

In [184]:
df.Fuel_Type.value_counts()

Diesel    3152
Petrol    2655
CNG         55
LPG         10
Name: Fuel_Type, dtype: int64

In [185]:
df.Owner_Type.value_counts()

First             4839
Second             925
Third              101
Fourth & Above       7
Name: Owner_Type, dtype: int64

In [186]:
df.Transmission.value_counts()

Manual       4170
Automatic    1702
Name: Transmission, dtype: int64

In [187]:
# drop name and location by column name rather than by position
df = df.drop(columns=['Name', 'Location'])

In [188]:
# creating year gap feature
df['Year_Gap'] = 2020 - df['Year']

# dropping year feature
df = df.drop(columns=['Year'])

In [189]:
# show the processed dataframe
df.head()

|   | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | Price | Year_Gap |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72000 | CNG | Manual | First | 26.6 | 998 | 58.16 | 5 | 1.75 | 10 |
| 1 | 41000 | Diesel | Manual | First | 19.67 | 1582 | 126.2 | 5 | 12.5 | 5 |
| 2 | 46000 | Petrol | Manual | First | 18.2 | 1199 | 88.7 | 5 | 4.5 | 9 |
| 3 | 87000 | Diesel | Manual | First | 20.77 | 1248 | 88.76 | 7 | 6.0 | 8 |
| 4 | 40670 | Diesel | Automatic | Second | 15.2 | 1968 | 140.8 | 5 | 17.74 | 7 |


Next, we encode the categorical features as integers using the following mappings:

- Owner_Type              = 0 : Fourth & Above, 1 : Third, 2 : Second, 3 : First
- Fuel_Type               = 0 : CNG, 1 : Diesel, 2 : LPG , 3 : Petrol
- Transmission            = 0 : Manual, 1 : Automatic

In [190]:
# encode the categorical data
df['Owner_Type']   = df['Owner_Type'].replace({'Fourth & Above' : 0, 'Third' : 1, 'Second' : 2, 'First' : 3})
df['Fuel_Type']    = df['Fuel_Type'].replace({'CNG' : 0, 'Diesel' : 1, 'LPG' : 2, 'Petrol' : 3})
df['Transmission'] = df['Transmission'].replace({'Manual' : 0, 'Automatic' : 1})
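
As a hedged alternative to the `replace` calls above, the same mappings can be applied with `Series.map`, which turns any category missing from the dictionary into NaN so that it can be detected instead of silently passing through. This is only a sketch: the helper name `encode_categories` and the two toy rows are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Mappings copied from the list above.
ENCODINGS = {
    "Owner_Type": {"Fourth & Above": 0, "Third": 1, "Second": 2, "First": 3},
    "Fuel_Type": {"CNG": 0, "Diesel": 1, "LPG": 2, "Petrol": 3},
    "Transmission": {"Manual": 0, "Automatic": 1},
}


def encode_categories(frame, encodings=ENCODINGS):
    """Return a copy of `frame` with the listed columns integer-encoded.

    Hypothetical helper: raises ValueError if a column contains a value
    that has no entry in its mapping.
    """
    out = frame.copy()
    for column, mapping in encodings.items():
        encoded = out[column].map(mapping)  # unmapped values become NaN
        if encoded.isna().any():
            unknown = out.loc[encoded.isna(), column].unique().tolist()
            raise ValueError(f"unmapped values in {column}: {unknown}")
        out[column] = encoded.astype(int)
    return out


# Toy rows for illustration only (not taken from cars_cleaned.csv).
toy = pd.DataFrame({
    "Owner_Type": ["First", "Second"],
    "Fuel_Type": ["CNG", "Diesel"],
    "Transmission": ["Manual", "Automatic"],
})
encoded_toy = encode_categories(toy)
print(encoded_toy)
```

Note that the Fuel_Type codes are arbitrary labels rather than a ranking, whereas the Owner_Type codes follow the ownership order.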

In [191]:
# show top 5 data after encoding
df.head()

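|   | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | Price | Year_Gap |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72000 | 0 | 0 | 3 | 26.6 | 998 | 58.16 | 5 | 1.75 | 10 |
| 1 | 41000 | 1 | 0 | 3 | 19.67 | 1582 | 126.2 | 5 | 12.5 | 5 |
| 2 | 46000 | 3 | 0 | 3 | 18.2 | 1199 | 88.7 | 5 | 4.5 | 9 |
| 3 | 87000 | 1 | 0 | 3 | 20.77 | 1248 | 88.76 | 7 | 6.0 | 8 |
| 4 | 40670 | 1 | 1 | 2 | 15.2 | 1968 | 140.8 | 5 | 17.74 | 7 |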


## FEATURE SELECTION WITH EXTRA TREE REGRESSOR

In [192]:
# split target feature
X = df.drop(columns=['Price'])
y = df.Price

In [193]:
# show the features (X) and the target (y)
X, y

(      Kilometers_Driven  Fuel_Type  Transmission  Owner_Type  Mileage  Engine  \
 0                 72000          0             0           3    26.60     998   
 1                 41000          1             0           3    19.67    1582   
 2                 46000          3             0           3    18.20    1199   
 3                 87000          1             0           3    20.77    1248   
 4                 40670          1             1           2    15.20    1968   
 ...                 ...        ...           ...         ...      ...     ...   
 5867              27365          1             0           3    28.40    1248   
 5868             100000          1             0           3    24.40    1120   
 5869              55000          1             0           2    14.00    2498   
 5870              46000          3             0           3    18.90     998   
 5871              47000          1             0           3    25.44     936   
 
        Power 

In [194]:
# fit the extra-tree regressor model
feat = ExtraTreeRegressor(random_state=42)
feat.fit(X, y)

ExtraTreeRegressor(random_state=42)

In [195]:
# show importances
pd.Series(feat.feature_importances_, index=X.columns).sort_values(ascending=False)

Transmission         0.342983
Power                0.241244
Year_Gap             0.173409
Fuel_Type            0.067054
Engine               0.062262
Kilometers_Driven    0.044988
Mileage              0.030805
Seats                0.024968
Owner_Type           0.012287
dtype: float64
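
The ranking above comes from a single randomized tree (`sklearn.tree.ExtraTreeRegressor`), so it may shift with a different `random_state`. As a hedged alternative, which is not what this notebook does, the sketch below averages importances over an ensemble of such trees with `sklearn.ensemble.ExtraTreesRegressor`. The synthetic data and the column names `signal` and `noise` are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data (assumption): the target depends on `signal` only.
rng = np.random.default_rng(42)
n_rows = 500
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=n_rows)

# An ensemble of extremely randomized trees; importances are averaged
# across the trees, which makes the ranking less sensitive to the seed.
forest = ExtraTreesRegressor(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

On the real data, the same call pattern would take the notebook's `X` and `y` in place of `X_demo` and `y_demo`.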

In [196]:
# keep only the four most important features, plus the target
df = df[['Transmission', 'Power', 'Year_Gap', 'Fuel_Type', 'Price']]

## MODELLING

In [197]:
# split target feature
X = df.drop(columns=['Price'])
y = df.Price

In [198]:
# train test split, test size 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [199]:
# show train data
X_train, y_train

(      Transmission   Power  Year_Gap  Fuel_Type
 2029             1  126.32         6          1
 5589             1  272.00        11          3
 3129             0   85.80         5          3
 210              0   68.05         8          3
 3731             0   62.10        10          3
 ...            ...     ...       ...        ...
 3772             0   67.04         5          3
 5191             0   73.97         5          1
 5226             0   73.97         3          1
 5390             0   89.84         3          1
 860              1  184.00         9          3
 
 [4697 rows x 4 columns],
 2029     5.45
 5589    10.24
 3129     4.15
 210      2.90
 3731     1.50
         ...  
 3772     4.30
 5191     4.25
 5226     4.20
 5390     8.75
 860      9.75
 Name: Price, Length: 4697, dtype: float64)

In [200]:
# fit the random forest regressor model
model = RandomForestRegressor()
model.fit(X_train, y_train)

RandomForestRegressor()

In [201]:
# predict on the test set
y_pred = model.predict(X_test)

In [202]:
# check the train R² score
model.score(X_train,y_train)

0.9566966572917802

In [203]:
# check the test R² score
metrics.r2_score(y_test, y_pred)

0.8527712611102901

The model overfits somewhat (train R² of about 0.96 versus test R² of about 0.85), but we'll keep it and save it with pickle.
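
The gap between the train and test scores above can be sanity-checked with k-fold cross-validation, which reports one R² per held-out fold instead of relying on a single split. The following is a minimal sketch under stated assumptions: the data are synthetic, and `n_estimators=50`, `cv=5`, and the seeds are illustrative choices, not settings from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumption): four features, two of them informative.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=300)

model_demo = RandomForestRegressor(n_estimators=50, random_state=0)

# One R² score per fold; each fold is scored on data the model did not see.
scores = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring="r2")
print("R² per fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

A mean fold score close to the single test score would suggest the test estimate is stable; a large spread across folds would suggest it depends heavily on the split.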

In [204]:
# dump model (the context manager closes the file once writing is done)
filename = 'model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

## MODEL TESTING

To test the model, I'll load it from disk and run a couple of predictions.

In [205]:
# load model (the context manager closes the file once reading is done)
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
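
A quick way to confirm that a pickled model survives the round trip is to compare predictions before and after loading. The sketch below does this on a small stand-in model trained on synthetic data and written to a temporary directory; the data, the file name `model_demo.pkl`, and the hyperparameters are assumptions for illustration only.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model trained on synthetic data (assumption).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)
original = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_demo, y_demo)

# Write to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.pkl")
    with open(path, "wb") as f:
        pickle.dump(original, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(original.predict(X_demo), restored.predict(X_demo))
print("identical predictions:", same)
```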

In [206]:
# test prediction #1 (feature order: Transmission, Power, Year_Gap, Fuel_Type)
predict = model.predict([[1, 58.16, 10, 0]])
predict[0]

1.8194576388888892

In [207]:
# test prediction #2 (feature order: Transmission, Power, Year_Gap, Fuel_Type)
predict = model.predict([[1, 272.00, 11, 3]])
predict[0]

11.205821666666669
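
The predictions above pass bare lists, so the feature order (Transmission, Power, Year_Gap, Fuel_Type) and the integer codes have to be remembered by hand. A minimal sketch of a wrapper that accepts raw values and builds a correctly ordered, named row is shown below. `build_features` is a hypothetical helper, and the default reference year of 2020 follows the assumption made earlier in this notebook.

```python
import pandas as pd

# Column order used when the model was trained in this notebook.
FEATURE_ORDER = ["Transmission", "Power", "Year_Gap", "Fuel_Type"]

# Integer codes copied from the encoding step.
TRANSMISSION_CODES = {"Manual": 0, "Automatic": 1}
FUEL_CODES = {"CNG": 0, "Diesel": 1, "LPG": 2, "Petrol": 3}


def build_features(transmission, power, year, fuel_type, current_year=2020):
    """Hypothetical helper: turn raw inputs into a one-row feature frame."""
    row = {
        "Transmission": TRANSMISSION_CODES[transmission],
        "Power": float(power),
        "Year_Gap": current_year - int(year),
        "Fuel_Type": FUEL_CODES[fuel_type],
    }
    return pd.DataFrame([row], columns=FEATURE_ORDER)


# Same inputs as test prediction #1, expressed as raw values.
features = build_features("Automatic", 58.16, 2010, "CNG")
print(features)
```

With the model loaded as above, `model.predict(features)[0]` would then return the predicted price for that row.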