# MATHCO.THON - CAR PRICE PREDICTION

## About the Dataset:

With the rise in the variety of cars with differentiated capabilities and features such as model, production year, category, brand, fuel type, engine volume, mileage, cylinders, colour, airbags and many more, we are bringing a car price prediction challenge for all. We all aspire to own a car within budget with the best features available. To solve the price problem we have created a dataset of 19237 for the training dataset and 8245 for the test dataset.  
                                         - The Math.Co

## Loading important libraries

In [None]:
import pandas as pd 
import numpy as np

import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import LabelEncoder, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, StratifiedKFold, KFold
from sklearn.metrics import mean_squared_log_error

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

## Creating custom function for RMSLE (Root Mean Squared Log Error)

In [None]:
# Evaluation metric
# Custom function for RMSLE
def RMSLE(y_true, y_pred):
    score = np.sqrt(mean_squared_log_error(y_true, y_pred))
    return score
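
For reference, RMSLE is the square root of the mean squared difference between log(1 + prediction) and log(1 + target), so it penalises relative rather than absolute errors. The NumPy-only sketch below illustrates this; the helper name `rmsle_manual` and the toy prices are assumptions for illustration and are not part of the notebook. Doubling a price of 1,000 costs about the same as doubling a price of 100,000.

```python
import numpy as np

def rmsle_manual(y_true, y_pred):
    """RMSLE written out by hand (hypothetical helper, illustration only)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which keeps the metric defined for zero prices.
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# The same relative error (a factor of 2) at two very different price levels.
same_ratio_small = rmsle_manual([1000], [2000])
same_ratio_large = rmsle_manual([100000], [200000])

print(same_ratio_small, same_ratio_large)  # both are close to log(2)
```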

## 1. Loading the data 

In [None]:
##Loading train data 
train_data = pd.read_csv("Car Price Prediction/train.csv")

##Loading test data
test_data = pd.read_csv("Car Price Prediction/test.csv")

## 2. Exploratory Data Analysis 

To speed up model building, I use the `Pandas Profiling` library, an open-source Python module that runs an exploratory data analysis in a few lines of code and generates a detailed report in `html` format.

The report can be downloaded [here](https://drive.google.com/file/d/1u1wZCiFwMTPKu4doIW8oXAoFEObSqf1n/view?usp=sharing).

## 3. Dropping duplicated values 

In [None]:
## Dropping duplicated values in train data 
train_data.drop_duplicates(inplace = True)

## 4. Data Preprocessing & Feature Engineering

Creating a new feature `Turbo`. 

In [None]:
# Flag every row whose raw `Engine volume` string mentions "Turbo".
# A substring test replaces the hand-written value lists, so engine sizes
# missing from those lists can no longer slip through unlabelled.
train_data['Turbo'] = np.where(train_data['Engine volume'].str.contains('Turbo', na=False), 'Turbo', 'Non-Turbo')
test_data['Turbo'] = np.where(test_data['Engine volume'].str.contains('Turbo', na=False), 'Turbo', 'Non-Turbo')

Stripping the `Turbo` suffix from the `Engine volume` column and converting the column to `float`.

In [None]:
train_data['Engine volume']=train_data['Engine volume'].str.replace('Turbo','')
train_data['Engine volume']=train_data['Engine volume'].astype(float)

test_data['Engine volume']=test_data['Engine volume'].str.replace('Turbo','')
test_data['Engine volume']=test_data['Engine volume'].astype(float)
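
To make the two `Engine volume` steps concrete, here is a small sketch on made-up values in the same format as the raw column. It uses a substring test for the flag, which is one compact way to get the `Turbo` / `Non-Turbo` split; the variable names are illustrative. The flag has to be derived before the suffix is stripped, otherwise the information is lost.

```python
import pandas as pd

# Made-up raw values in the same format as the `Engine volume` column.
raw = pd.Series(["2.0 Turbo", "3.5", "1.4 Turbo", "3"], name="Engine volume")

# Flag: does the raw string mention "Turbo"?
turbo = raw.str.contains("Turbo").map({True: "Turbo", False: "Non-Turbo"})

# Numeric part: drop the suffix, trim the leftover space, convert to float.
volume = raw.str.replace("Turbo", "", regex=False).str.strip().astype(float)

print(turbo.tolist())   # ['Turbo', 'Non-Turbo', 'Turbo', 'Non-Turbo']
print(volume.tolist())  # [2.0, 3.5, 1.4, 3.0]
```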

Stripping the `km` suffix from the `Mileage` column and converting the column to `int`.

In [None]:
train_data['Mileage']=train_data['Mileage'].str.replace('km',' ')
train_data['Mileage']=train_data['Mileage'].astype(int)

test_data['Mileage']=test_data['Mileage'].str.replace('km',' ')
test_data['Mileage']=test_data['Mileage'].astype(int)

Treating the placeholder `'-'` in the `Levy` feature as a null value, then imputing the nulls in both train and test data with the mean of the training data.

In [None]:
train_data['Levy']=train_data['Levy'].replace({'-':np.nan})
train_data['Levy']=train_data['Levy'].astype(float)
train_data['Levy']=train_data['Levy'].fillna(train_data['Levy'].mean())


test_data['Levy']=test_data['Levy'].replace({'-':np.nan})
test_data['Levy']=test_data['Levy'].astype(float)
test_data['Levy']=test_data['Levy'].fillna(train_data['Levy'].mean())
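
The imputation above can be traced on a tiny made-up example. The point of the sketch is that the mean is computed once, on the training values, and then reused for the test values so that no information from the test set enters the preprocessing. All values and variable names here are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up levy values; "-" marks a missing entry, as in the raw data.
train_levy = pd.Series(["1000", "-", "500"])
test_levy = pd.Series(["-", "700"])

train_num = train_levy.replace({"-": np.nan}).astype(float)
test_num = test_levy.replace({"-": np.nan}).astype(float)

# mean() skips NaN: (1000 + 500) / 2 = 750.0
train_mean = train_num.mean()

train_filled = train_num.fillna(train_mean)
test_filled = test_num.fillna(train_mean)  # train statistic reused on test

print(train_filled.tolist())  # [1000.0, 750.0, 500.0]
print(test_filled.tolist())   # [750.0, 700.0]
```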

Creating a new feature `Age`.

In [None]:
train_data['Age'] = 2021 - train_data['Prod. year']

test_data['Age'] = 2021 - test_data['Prod. year']

Removing outliers in the target variable with the 1.5 × IQR rule.

In [None]:
Q1 = train_data.Price.quantile(0.25)
Q3 = train_data.Price.quantile(0.75)
print(Q1,Q3)


IQR = Q3 - Q1
print(IQR)

lower_limit = Q1 - 1.5*IQR
upper_limit = Q3 + 1.5*IQR
print( lower_limit,upper_limit)


train_data = train_data[(train_data.Price < upper_limit) & (train_data.Price > lower_limit)]
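
As a worked example of the same rule, the sketch below applies the 1.5 × IQR fences to seven made-up prices, one of which is an obvious outlier. The numbers are illustrative only.

```python
import pandas as pd

prices = pd.Series([10, 12, 14, 16, 18, 20, 1000])

q1 = prices.quantile(0.25)    # 13.0
q3 = prices.quantile(0.75)    # 19.0
iqr = q3 - q1                 # 6.0

lower_limit = q1 - 1.5 * iqr  # 4.0
upper_limit = q3 + 1.5 * iqr  # 28.0

# Keep only the rows strictly inside the fences.
kept = prices[(prices > lower_limit) & (prices < upper_limit)]
print(kept.tolist())  # [10, 12, 14, 16, 18, 20]
```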

Getting the target feature. 

In [None]:
target = train_data['Price']

Dropping a few columns that are not required for model building, based on the EDA and on feature importance from a baseline model.

In [None]:
train_data = train_data.drop(['ID','Prod. year','Price','Model','Manufacturer','Cylinders','Doors'],axis = 1)
test_data = test_data.drop(['ID','Prod. year','Price','Model','Manufacturer','Cylinders','Doors'],axis = 1)

Label Encoding the categorical features.

In [None]:
# Fit a fresh encoder per column on the training data and reuse it for the test data.
label_cols = ['Category', 'Leather interior', 'Fuel type', 'Gear box type',
              'Drive wheels', 'Wheel', 'Color', 'Turbo']

for col in label_cols:
    lbl = LabelEncoder()
    train_data[col] = lbl.fit_transform(train_data[col])
    test_data[col] = lbl.transform(test_data[col])
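
A short sketch of what `LabelEncoder` does with made-up category values: classes are sorted and mapped to integers on `fit`, and `transform` raises a `ValueError` for a label that was not seen during fitting. That last point matters whenever the test data contains a category absent from the training data. The category values below are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

train_categories = ["Sedan", "Jeep", "Sedan", "Coupe"]
test_categories = ["Jeep", "Coupe"]
unseen_categories = ["Limousine"]  # not present in the training values

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_categories)
test_codes = encoder.transform(test_categories)

print(encoder.classes_.tolist())  # ['Coupe', 'Jeep', 'Sedan']
print(train_codes.tolist())       # [2, 1, 2, 0]
print(test_codes.tolist())        # [1, 0]

# A label the encoder has never seen cannot be transformed.
try:
    encoder.transform(unseen_categories)
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)  # False
```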

Listing the numerical and the categorical columns separately.

In [None]:
num_cols = ['Age','Levy','Engine volume','Mileage']
cat_cols = [ 'Category','Leather interior', 'Fuel type', 'Gear box type', 'Drive wheels',
       'Wheel','Airbags','Turbo','Color']

Converting the categorical features to the `category` datatype.

In [None]:
def cat_converter(df):
    # Cast every categorical column to the pandas `category` dtype, in place.
    for col in cat_cols:
        df[col] = df[col].astype('category')
            
cat_converter(train_data)

cat_converter(test_data)

Scaling numerical features with `RobustScaler`.

In [None]:
# fit on training data column
scale = RobustScaler().fit(train_data[num_cols])
# transform the training data column
train_data[num_cols] = scale.transform(train_data[num_cols])
# transform the testing data column
test_data[num_cols] = scale.transform(test_data[num_cols])
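
With its default settings, `RobustScaler` subtracts the median of each column and divides by its interquartile range (25th to 75th percentile). The NumPy sketch below reproduces that arithmetic by hand on made-up values, assuming those defaults; it is an illustration of the formula, not a replacement for the scaler.

```python
import numpy as np

# Made-up values with one extreme outlier.
mileage = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(mileage)                # 3.0
q1, q3 = np.percentile(mileage, [25, 75])  # 2.0 and 4.0
scaled = (mileage - median) / (q3 - q1)

print(scaled.tolist())  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier stays an outlier, but because the median and IQR ignore it, the remaining values keep a sensible spread.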

One-hot encoding the categorical features.

In [None]:
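train_data = pd.get_dummies(train_data, columns = cat_cols)
test_data = pd.get_dummies(test_data, columns = cat_cols)
# Align the test columns with the training columns: categories seen only in
# train become all-zero columns, categories seen only in test are dropped.
test_data = test_data.reindex(columns = train_data.columns, fill_value = 0)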
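
Because `get_dummies` is applied to the train and test frames separately, each frame only gets columns for the categories it actually contains. The toy sketch below (made-up colours, illustrative names) shows the resulting mismatch and one way to align the test columns with the training columns using `reindex`.

```python
import pandas as pd

train_toy = pd.DataFrame({"Color": ["Black", "White", "Red"]})
test_toy = pd.DataFrame({"Color": ["Black", "Blue"]})

train_dummies = pd.get_dummies(train_toy, columns=["Color"], dtype=int)
test_dummies = pd.get_dummies(test_toy, columns=["Color"], dtype=int)

print(train_dummies.columns.tolist())  # ['Color_Black', 'Color_Red', 'Color_White']
print(test_dummies.columns.tolist())   # ['Color_Black', 'Color_Blue']

# Force the test frame into the training layout: missing columns are filled
# with 0, columns unknown to the training data (Color_Blue) are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned.values.tolist())    # [[1, 0, 0], [0, 0, 0]]
```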

Creating `train set` and `validation set` for model building and checking model performance.

In [None]:
# `X_test` / `y_test` form the validation split; `test_data` stays untouched for the final prediction.
X_train, X_test, y_train, y_test = train_test_split(train_data,target, test_size=0.20, random_state=42)

## 5. Model Building

#### XGBoost Regressor

In [None]:
XGB_model = XGBRegressor(max_depth = 15,learning_rate=0.1,n_estimators=1000)
XGB_model.fit(X_train, y_train)

# abs() guards the metric: mean_squared_log_error rejects negative values.
XGB_Model_Y_train_pred = abs(XGB_model.predict(X_train))
XGB_Model_Y_test_pred = abs(XGB_model.predict(X_test))

Train_score_XGB = RMSLE(y_train,XGB_Model_Y_train_pred)
Test_score_XGB = RMSLE(y_test,XGB_Model_Y_test_pred)

print(Train_score_XGB)
print(Test_score_XGB)

#### Grid Search with KFold CV on XGBoost Regressor

In [None]:
# Pass the splitter itself; GridSearchCV calls split() on the data given to fit().
gkf = KFold(n_splits=10, shuffle=True, random_state=42)

param_grid = {"learning_rate"    : [0.10, 0.15] ,
              "max_depth"        : [10,15],
              "n_estimators"     : [500,1000,2000]}

xgb_estimator = XGBRegressor()

gsearch = GridSearchCV(estimator=xgb_estimator, param_grid=param_grid, cv=gkf,verbose =0,n_jobs=-1)
XGB_model = gsearch.fit(X_train, y_train)

XGB_Model_Y_train_pred = abs(XGB_model.predict(X_train))
XGB_Model_Y_test_pred = abs(XGB_model.predict(X_test))

Train_score_XG= RMSLE(y_train,XGB_Model_Y_train_pred)
Test_score_XG = RMSLE(y_test,XGB_Model_Y_test_pred)

print(Train_score_XG)
print(Test_score_XG)
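
One thing to keep in mind: when `scoring` is not set, `GridSearchCV` ranks candidates with the estimator's own `score` method, which for regressors is R², not RMSLE. A minimal sketch of selecting on RMSLE instead is shown below. It uses synthetic data and a `DecisionTreeRegressor` as a lightweight stand-in for `XGBRegressor` so that it runs quickly; the data, the grid and the names `rmsle` and `rmsle_scorer` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

# Synthetic, strictly positive target so the log error is always defined.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 1000 + 500 * X[:, 0] + rng.normal(0, 50, size=200)

# greater_is_better=False makes GridSearchCV minimise RMSLE (scores are negated).
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),  # stand-in for XGBRegressor
    param_grid={"max_depth": [2, 4, 6]},
    scoring=rmsle_scorer,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated RMSLE of the best setting
```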

#### LightGBM Regressor

In [None]:
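from lightgbm import early_stopping, log_evaluation

LGB_model = LGBMRegressor(boosting_type = 'dart',
                          num_leaves = 62,
                          objective = 'regression_l1', # other options: l2, mape
                          max_depth = 10,
                          learning_rate = 0.1, # other options: 0.05, 0.001
                          metric = 'l1') # other options: l2, mape

# LightGBM 4.0 removed the `early_stopping_rounds` and `verbose` arguments of fit();
# the callbacks below are the equivalent. Note that LightGBM disables early
# stopping in `dart` mode, so all boosting rounds are trained either way.
LGB_model.fit(X_train, y_train, eval_set = [(X_test, y_test)],
              callbacks = [early_stopping(stopping_rounds = 50), log_evaluation(period = 0)])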

LGB_Model_Y_train_pred = abs(LGB_model.predict(X_train))
LGB_Model_Y_test_pred = abs(LGB_model.predict(X_test))

Train_score_LGB= RMSLE(y_train,LGB_Model_Y_train_pred)
Test_score_LGB = RMSLE(y_test,LGB_Model_Y_test_pred)

print(Train_score_LGB)
print(Test_score_LGB)

#### Grid Search with KFold CV on LightGBM Regressor

In [None]:
# Pass the splitter itself; GridSearchCV calls split() on the data given to fit().
gkf = KFold(n_splits=10, shuffle=True, random_state=42)

param_grid = {
    'num_leaves': [31, 62, 127],
    'reg_alpha': [0.1, 0.5],
    'max_depth': [4,5,6,7,8,10],
    'min_data_in_leaf': [30, 50, 100, 300],
    'learning_rate': [0.1,0.01,0.001]
    }

lgb_estimator = LGBMRegressor(boosting_type= 'dart',objective = 'regression_l1')

gsearch = GridSearchCV(estimator=lgb_estimator, param_grid=param_grid, cv=gkf,verbose =0,n_jobs=-1)
LGB_model = gsearch.fit(X_train, y_train)

LGB_Model_Y_train_pred = abs(LGB_model.predict(X_train))
LGB_Model_Y_test_pred = abs(LGB_model.predict(X_test))

Train_score_LGB= RMSLE(y_train,LGB_Model_Y_train_pred)
Test_score_LGB = RMSLE(y_test,LGB_Model_Y_test_pred)

print(Train_score_LGB)
print(Test_score_LGB)
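
It is worth estimating the cost of this search before running it. The sketch below simply multiplies out the grid defined above and the number of folds; it uses only the standard library.

```python
from math import prod

# Same grid and fold count as in the search above.
param_grid = {
    "num_leaves": [31, 62, 127],
    "reg_alpha": [0.1, 0.5],
    "max_depth": [4, 5, 6, 7, 8, 10],
    "min_data_in_leaf": [30, 50, 100, 300],
    "learning_rate": [0.1, 0.01, 0.001],
}
n_splits = 10

n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * n_splits

print(n_candidates)  # 432
print(n_fits)        # 4320
```

That is 4,320 model fits, plus one final refit of the best candidate on the full training split.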

## 6. Prediction on test data

In [None]:
# XGB_model is the fitted GridSearchCV object, so predict() uses its refitted best estimator.
Price = abs(XGB_model.predict(test_data))
Price = pd.DataFrame(Price,columns = ['Price'])
Price.to_csv("Model_XGB.csv",index = False)