## Used Car Price Prediction
1) **Problem Statement**
* This dataset comprises used cars sold on cardekho.com in India, along with important features of these cars.
* The goal is to predict the price of a car from its input features.
* The predictions can be used to suggest a price to new sellers based on market conditions.

2) **Data Collection**
* The dataset was collected by scraping the cardekho website.
* The data consists of 13 columns and 15411 rows.

In [1]:
%pip install seaborn plotly

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.





In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings("ignore")
%matplotlib inline

In [3]:
data = pd.read_csv(r"A:\Krish Naik\22-Random Forest\cardekho_imputated.csv",index_col = [0])

In [4]:
data.head()

Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [5]:
data.shape


(15411, 13)

### Data Cleaning
* Handling Missing values
* Handling Duplicates
* Check data type
* Understand dataset

In [6]:
data.isnull().sum()

car_name             0
brand                0
model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64

In [7]:
# Remove unnecessary columns (the 'model' column is kept as the car identifier)
data.drop(columns=['car_name','brand'],inplace=True)

In [8]:
data.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [9]:
data['model'].nunique()

120

In [10]:
pd.DataFrame(data.duplicated())

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
19537,False
19540,False
19541,False
19542,False
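
The boolean frame above is hard to scan across thousands of rows. A minimal sketch of counting duplicated rows and dropping them, assuming a small made-up frame `toy` in place of `data` (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame(
    {
        "model": ["Alto", "i20", "Alto", "City"],
        "vehicle_age": [9, 11, 9, 4],
        "selling_price": [120000, 215000, 120000, 700000],
    }
)

# duplicated() flags every repeat of an earlier row; sum() counts the flags.
n_duplicates = int(toy.duplicated().sum())
print("Duplicate rows:", n_duplicates)

# drop_duplicates() keeps the first occurrence of each row by default.
deduplicated = toy.drop_duplicates()
print("Shape after dropping duplicates:", deduplicated.shape)
```

Whether duplicates should be dropped depends on the data: two identical listings may be genuine separate sales.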


In [11]:
# Getting all different types of Features
num_features = [feature for feature in data.columns if data[feature].dtype!='O']
print("Num. of Numerical Features:",len(num_features))
cat_features = [feature for feature in data.columns if data[feature].dtype=='O']
print("Num of categorical columns:",len(cat_features))
discrete_features = [feature for feature in num_features if len(data[feature].unique())<=25]
print("Num of Discrete features:",len(discrete_features))
continuous_features = [feature for feature in num_features if feature not in discrete_features]
print("Num of continuous features:",len(continuous_features))

Num. of Numerical Features: 7
Num of categorical columns: 4
Num of Discrete features: 2
Num of continuous features: 5


In [12]:
# Split into independent features (X) and the dependent target (y)
from sklearn.model_selection import train_test_split
X = data.drop(['selling_price'],axis=1)
y=data['selling_price']

In [13]:
X.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


In [14]:
y.head()

0    120000
1    550000
2    215000
3    226000
4    570000
Name: selling_price, dtype: int64

### Feature Encoding and Scaling
**One-hot encoding for columns that have few unique values and are not ordinal**
* One-hot encoding converts a categorical variable into a set of binary indicator columns, one per category, so that ML algorithms can use it without assuming any order among the categories.
* The `model` column has 120 unique values, so it is label encoded instead.

In [15]:
data['model'].value_counts()

model
i20             906
Swift Dzire     890
Swift           781
Alto            778
City            757
               ... 
Altroz            1
C                 1
Ghost             1
Quattroporte      1
Gurkha            1
Name: count, Length: 120, dtype: int64

In [16]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
X['model'] = le.fit_transform(X['model'])

In [17]:
X.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,7,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,54,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,118,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,7,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,38,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5
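
The integers in the `model` column come from `LabelEncoder`, which numbers the classes in sorted order. A small sketch on a made-up list of model names shows the mapping and how to reverse it:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of model names; the real column has 120 distinct values.
models_seen = ["i20", "Alto", "Grand", "Alto"]

encoder = LabelEncoder()
codes = encoder.fit_transform(models_seen)

# classes_ holds the sorted unique labels; each code is an index into it.
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original names.
print(encoder.inverse_transform(codes).tolist())
```

The codes carry an ordering that has no real meaning. Tree-based models usually tolerate this, while linear and distance-based models may read it as a true numeric scale, so treat label encoding here as a convenience rather than a neutral choice.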


In [20]:
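# Create a ColumnTransformer with two transformers: one-hot encoding and standard scaling
# Note: numeric_features also includes the label-encoded 'model' column, so it is scaled too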
numeric_features = X.select_dtypes(exclude="object").columns
onehot_columns = ['seller_type','fuel_type','transmission_type']

from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder",oh_transformer,onehot_columns),
        ("StandardScaler",numeric_transformer,numeric_features)
    ],remainder='passthrough' 
)
# remainder='passthrough' forwards any column not named in a transformer unchanged
# and appends it to the transformers' output. Here every column of X is already
# covered by one of the two transformers, so nothing is actually passed through.



In [21]:
X = preprocessor.fit_transform(X)

In [23]:
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-1.519714,0.983562,1.247335,-0.000276,-1.324259,-1.263352,-0.403022
1,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.225693,-0.343933,-0.690016,-0.192071,-0.554718,-0.432571,-0.403022
2,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.536377,1.647309,0.084924,-0.647583,-0.554718,-0.479113,-0.403022
3,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-1.519714,0.983562,-0.360667,0.292211,-0.936610,-0.779312,-0.403022
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-0.666211,-0.012060,-0.496281,0.735736,0.022918,-0.046502,-0.403022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15406,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.508844,0.983562,-0.869744,0.026096,-0.767733,-0.757204,-0.403022
15407,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.556082,-1.339555,-0.728763,-0.527711,-0.216964,-0.220803,2.073444
15408,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.407551,-0.012060,0.220539,0.344954,0.022918,0.068225,-0.403022
15409,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.426247,-0.343933,72.541850,-0.887326,1.329794,0.917158,2.073444


In [25]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape,X_test.shape

((12328, 14), (3083, 14))
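
Note that the preprocessor above was fitted on all rows before the split, so the scaler's mean and variance include the test rows. One common alternative is to split first and put the preprocessing inside a `Pipeline`, so it is fitted on the training rows only. The sketch below assumes a made-up ten-row frame and a plain `LinearRegression`; it illustrates the pattern and is not the workflow this notebook actually ran:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data; the column names mirror the notebook, the values do not.
toy = pd.DataFrame(
    {
        "fuel_type": ["Petrol", "Diesel"] * 5,
        "vehicle_age": [9, 5, 11, 9, 6, 3, 7, 4, 10, 2],
        "km_driven": [120000, 20000, 60000, 37000, 30000,
                      15000, 52000, 41000, 90000, 8000],
        "selling_price": [120000, 550000, 215000, 226000, 570000,
                          800000, 400000, 650000, 180000, 900000],
    }
)
X_toy = toy.drop(columns="selling_price")
y_toy = toy["selling_price"]

# Split BEFORE fitting any transformer.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

pipeline = Pipeline(
    [
        (
            "preprocess",
            ColumnTransformer(
                [
                    # handle_unknown="ignore" encodes unseen categories as zeros.
                    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["fuel_type"]),
                    ("scale", StandardScaler(), ["vehicle_age", "km_driven"]),
                ]
            ),
        ),
        ("model", LinearRegression()),
    ]
)

# fit() learns the encoder categories and scaler statistics from X_tr only;
# predict() reuses them on X_te without refitting.
pipeline.fit(X_tr, y_tr)
predictions = pipeline.predict(X_te)
print(predictions.shape)
```

A pipeline like this can also be passed directly to `RandomizedSearchCV`, which then refits the preprocessing inside each cross-validation fold.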

### Model training and model selection

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error,root_mean_squared_error

In [38]:
def evaluate_model(true,predicted):
    """Return MAE, RMSE and R2 score for a set of predictions."""
    mae = mean_absolute_error(true,predicted)
    rmse = root_mean_squared_error(true,predicted)
    r2 = r2_score(true,predicted)
    return mae,rmse,r2
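
As a worked example of the three metrics returned above, the sketch below computes them by hand in plain Python for three made-up prices. It follows the standard definitions of MAE, RMSE and R2 and does not call scikit-learn:

```python
import math

# Made-up true and predicted values, chosen for easy arithmetic.
true = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

n = len(true)
errors = [p - t for t, p in zip(true, predicted)]  # 10, -10, 30

# MAE: mean of absolute errors.
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the mean squared error.
rmse = math.sqrt(sum(e * e for e in errors) / n)

# R2: 1 - (residual sum of squares / total sum of squares around the mean).
mean_true = sum(true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in true)
r2 = 1 - ss_res / ss_tot

print(round(mae, 4), round(rmse, 4), round(r2, 4))
```

RMSE is larger than MAE here because squaring gives the single 30-unit miss more weight than the two 10-unit misses.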

In [42]:
models = {
    "Linear Regression":LinearRegression(),
    "Lasso":Lasso(),
    "Ridge":Ridge(),
    "K-Neighbors Regressor":KNeighborsRegressor(),
    "Decision Tree":DecisionTreeRegressor(),
    "Random Forest Regressor":RandomForestRegressor(),
}

for name,model in models.items():
    model.fit(X_train,y_train)

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate train and test set
    model_train_mae,model_train_rmse,model_train_r2 = evaluate_model(y_train,y_train_pred)

    model_test_mae,model_test_rmse,model_test_r2 = evaluate_model(y_test,y_test_pred)

    print(name)

    print('Model performance for training set')
    print("-Root mean squared error: {:.4f}".format(model_train_rmse))
    print("-Mean absolute error:{:.4f}".format(model_train_mae))
    print("-R2 Score:{:.4f}".format(model_train_r2))

    print("-"*40)

    print("Performance for test set-")
    print("-Root mean squared error: {:.4f}".format(model_test_rmse))
    print("-Mean absolute error:{:.4f}".format(model_test_mae))
    print("-R2 Score:{:.4f}".format(model_test_r2))

    print("="*40)
    print("\n")

Linear Regression
Model performance for training set
-Root mean squared error: 553855.6665
-Mean absolute error:268101.6071
-R2 Score:0.6218
----------------------------------------
Performance for test set-
-Root mean squared error: 502543.5930
-Mean absolute error:279618.5794
-R2 Score:0.6645


Lasso
Model performance for training set
-Root mean squared error: 553855.6710
-Mean absolute error:268099.2226
-R2 Score:0.6218
----------------------------------------
Performance for test set-
-Root mean squared error: 502542.6696
-Mean absolute error:279614.7461
-R2 Score:0.6645


Ridge
Model performance for training set
-Root mean squared error: 553856.3160
-Mean absolute error:268059.8015
-R2 Score:0.6218
----------------------------------------
Performance for test set-
-Root mean squared error: 502533.8230
-Mean absolute error:279557.2169
-R2 Score:0.6645


K-Neighbors Regressor
Model performance for training set
-Root mean squared error: 325873.0088
-Mean absolute error:91425.4705
-R2

### Hyperparameter tuning

In [45]:
knn_params = {"n_neighbors":[2,3,10,20,40,50]}
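rf_params = {
    "max_depth":[5,8,15,None,10],
    # "auto" was removed from RandomForestRegressor in scikit-learn 1.3;
    # 1.0 (use all features) is what it meant for regressors.
    "max_features":[5,7,1.0,8],
    "min_samples_split":[2,8,15,20],
    "n_estimators":[100,200,500,1000]
}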

In [46]:
# Models list for hyperparameter tuning
randomcv_models = [
    ('KNN',KNeighborsRegressor(),knn_params),
    ('RF',RandomForestRegressor(),rf_params),
]

In [47]:
from sklearn.model_selection import RandomizedSearchCV

model_param = {}
for name,model,params in randomcv_models:
    random_search = RandomizedSearchCV(
        estimator=model,
        param_distributions=params,
        n_iter=100,
        cv=3,
        verbose=2,
        n_jobs=-1
    )

    random_search.fit(X_train,y_train)
    model_param[name] = random_search.best_params_

for model_name in model_param:
    print(f"------------------Best Params for {model_name}---------------------")
    print(model_param[model_name])

Fitting 3 folds for each of 6 candidates, totalling 18 fits
Fitting 3 folds for each of 100 candidates, totalling 300 fits
------------------Best Params for KNN---------------------
{'n_neighbors': 10}
------------------Best Params for RF---------------------
{'n_estimators': 1000, 'min_samples_split': 2, 'max_features': 7, 'max_depth': None}
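
The two "Fitting 3 folds" lines can be reproduced with simple arithmetic. When every parameter is given as a list, the search cannot try more candidates than the grid contains, so KNN runs only 6 of the requested 100 iterations. The sketch below is plain Python; `planned_fits` is a hypothetical helper, and the sizes are the list lengths from the parameter cell above:

```python
from math import prod

# Number of values listed for each hyperparameter in the cell above.
knn_sizes = {"n_neighbors": 6}
rf_sizes = {
    "max_depth": 5,
    "max_features": 4,
    "min_samples_split": 4,
    "n_estimators": 4,
}


def planned_fits(sizes, n_iter, cv):
    """Return (grid size, candidates actually tried, total model fits)."""
    grid_size = prod(sizes.values())
    candidates = min(n_iter, grid_size)  # cannot sample more than the grid holds
    return grid_size, candidates, candidates * cv


print("KNN:", planned_fits(knn_sizes, n_iter=100, cv=3))
print("RF: ", planned_fits(rf_sizes, n_iter=100, cv=3))
```

For the random forest the grid holds 320 combinations, so `n_iter=100` samples roughly a third of them. The reported best parameters are therefore the best among the sampled candidates, not necessarily the best in the whole grid.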


In [48]:
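# Retraining the models with the tuned parameters
# Note: the search above selected n_estimators=1000 for the random forest; this cell
# uses n_estimators=100, and the metrics printed below come from that setting.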
models = {
    "K-Neighbors Regressor":KNeighborsRegressor(n_neighbors=10),
    "Random Forest Regressor":RandomForestRegressor(n_estimators=100,min_samples_split=2,max_features=7,max_depth=None),
}

for name,model in models.items():
    model.fit(X_train,y_train)

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate train and test set
    model_train_mae,model_train_rmse,model_train_r2 = evaluate_model(y_train,y_train_pred)

    model_test_mae,model_test_rmse,model_test_r2 = evaluate_model(y_test,y_test_pred)

    print(name)

    print('Model performance for training set')
    print("-Root mean squared error: {:.4f}".format(model_train_rmse))
    print("-Mean absolute error:{:.4f}".format(model_train_mae))
    print("-R2 Score:{:.4f}".format(model_train_r2))

    print("-"*40)

    print("Performance for test set-")
    print("-Root mean squared error: {:.4f}".format(model_test_rmse))
    print("-Mean absolute error:{:.4f}".format(model_test_mae))
    print("-R2 Score:{:.4f}".format(model_test_r2))

    print("="*40)
    print("\n")

K-Neighbors Regressor
Model performance for training set
-Root mean squared error: 363460.7706
-Mean absolute error:103472.0474
-R2 Score:0.8371
----------------------------------------
Performance for test set-
-Root mean squared error: 263888.0623
-Mean absolute error:117496.2131
-R2 Score:0.9075


Random Forest Regressor
Model performance for training set
-Root mean squared error: 116853.9245
-Mean absolute error:39315.6117
-R2 Score:0.9832
----------------------------------------
Performance for test set-
-Root mean squared error: 214420.4372
-Mean absolute error:99325.3622
-R2 Score:0.9389
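
The retraining cell types the tuned values in by hand, which is how `n_estimators` ended up as 100 while the search reported 1000. One way to avoid that kind of slip is to unpack the stored dictionaries directly. In the sketch below, `best_params` is a hypothetical stand-in for the notebook's `model_param`, with the values copied from the search output above:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# Stand-in for `model_param`; values copied from the printed search results.
best_params = {
    "KNN": {"n_neighbors": 10},
    "RF": {
        "n_estimators": 1000,
        "min_samples_split": 2,
        "max_features": 7,
        "max_depth": None,
    },
}

# ** unpacks each dictionary into keyword arguments, so the estimator
# receives exactly what the search selected.
tuned_models = {
    "K-Neighbors Regressor": KNeighborsRegressor(**best_params["KNN"]),
    "Random Forest Regressor": RandomForestRegressor(**best_params["RF"]),
}

for name, model in tuned_models.items():
    print(name, model.get_params())
```

The metrics printed above were produced with `n_estimators=100`. This sketch only constructs the estimators, and no results are claimed for the 1000-tree variant.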


