## **Car Price Predictor**

In [31]:
import pandas as pd 
import numpy as np

In [32]:
df = pd.read_csv('quikr_car.csv')
df.head()

index,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,"45,000 kms",Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,40 kms,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,Ask For Price,"22,000 kms",Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000,"28,000 kms",Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000,"36,000 kms",Diesel


In [33]:
df.shape

(892, 6)

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        892 non-null    object
 1   company     892 non-null    object
 2   year        892 non-null    object
 3   Price       892 non-null    object
 4   kms_driven  840 non-null    object
 5   fuel_type   837 non-null    object
dtypes: object(6)
memory usage: 41.9+ KB


### **Quality Issues** 
- Year has many non-numeric values.
- Year is stored as an object column instead of an integer.
- Price has 35 "Ask For Price" entries.
- Price is stored as an object column instead of an integer.
- Price has one outlier.
- Kilometers Driven is stored as an object column (for example "45,000 kms") and has one "Petrol" entry.
- Fuel Type has NaN values.
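
The checks behind the list above can be sketched roughly as follows. The `sample` frame is made-up data that only mimics the listed problems (it is not the Quikr file), and the `report` keys are illustrative names.

```python
import pandas as pd

# Tiny made-up sample that mimics the issues listed above (not the real data).
sample = pd.DataFrame({
    "year": ["2007", "2014", "sale", "2018"],
    "Price": ["80,000", "Ask For Price", "3,25,000", "5,75,000"],
    "kms_driven": ["45,000 kms", "Petrol", "28,000 kms", None],
    "fuel_type": ["Petrol", None, "Diesel", None],
})

# Rows whose year is not a plain run of digits.
bad_year = ~sample["year"].str.isnumeric()

# Rows where the price is a placeholder instead of a number.
ask_price = sample["Price"].eq("Ask For Price")

# Rows where kms_driven holds a fuel type by mistake.
kms_is_fuel = sample["kms_driven"].eq("Petrol")

# Rows with a missing fuel type.
fuel_missing = sample["fuel_type"].isna()

report = {
    "non_numeric_year": int(bad_year.sum()),
    "ask_for_price": int(ask_price.sum()),
    "kms_is_petrol": int(kms_is_fuel.sum()),
    "fuel_type_missing": int(fuel_missing.sum()),
}
print(report)
```

Running the same boolean masks against `df` before cleaning should give the counts that motivate each cleaning step.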

### **Cleaning**

In [35]:
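# Keep only rows whose year is all digits; .copy() avoids SettingWithCopyWarning on the column assignments that follow.
df = df[df['year'].str.isnumeric()].copy()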

In [36]:
df['year'] = df['year'].astype(int)

In [37]:
df = df[df['Price'] != 'Ask For Price']

In [38]:
df['Price'] = df["Price"].str.replace(',','').astype(int)

In [39]:
# Drop the Price outlier by keeping only rows priced below 6,000,000.
df = df[df['Price'] < 6e6]

In [40]:
df = df[df['kms_driven'] != 'Petrol']

In [41]:
df['kms_driven'] = df['kms_driven'].str.split(' ').str.get(0).str.replace(",","").astype(int)
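
As an alternative to filtering bad rows first and then calling `astype(int)`, the parsing can be made tolerant of stray values. This is a minimal sketch on made-up strings, assuming the "<number> kms" pattern seen in the data; `pd.to_numeric(..., errors="coerce")` turns anything unparseable into NaN, which can then be dropped.

```python
import pandas as pd

# Made-up values in the same shape as the kms_driven column.
raw = pd.Series(["45,000 kms", "40 kms", "Petrol", None])

# Keep the token before the first space, drop thousands separators,
# then coerce anything non-numeric (such as "Petrol") to NaN.
kms = pd.to_numeric(
    raw.str.split(" ").str.get(0).str.replace(",", "", regex=False),
    errors="coerce",
)
print(kms)
```

The result is a float column because NaN is present; after `dropna()` it can be cast to `int`.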

In [42]:
df = df[~df['fuel_type'].isna()]

In [43]:
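# Shorten each name to its first three words, e.g. 'Hyundai Santro Xing XO eRLX Euro III' -> 'Hyundai Santro Xing'.
df['name'] = df['name'].str.split(' ').str.slice(0,3).str.join(' ')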

In [44]:
df.describe()

statistic,year,Price,kms_driven
count,815.0,815.0,815.0
mean,2012.442945,401793.3,46277.096933
std,4.005079,381588.8,34318.459638
min,1995.0,30000.0,0.0
25%,2010.0,175000.0,27000.0
50%,2013.0,299999.0,41000.0
75%,2015.0,490000.0,56879.0
max,2019.0,3100000.0,400000.0


In [45]:
df = df.reset_index(drop=True)

In [46]:
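# index=False keeps the row index out of the file, so reloading does not produce an extra "Unnamed: 0" column.
df.to_csv('clean_quikr_car.csv', index=False)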

In [89]:
df['name'].value_counts()

name
Maruti Suzuki Swift            51
Maruti Suzuki Alto             42
Maruti Suzuki Wagon            28
Maruti Suzuki Ertiga           16
Hyundai Santro Xing            15
                               ..
Mercedes Benz A                 1
Tata Manza ELAN                 1
Volkswagen Polo Comfortline     1
Nissan Sunny                    1
Tata Zest XM                    1
Name: count, Length: 254, dtype: int64

### **Model**

In [48]:
X = df.drop(columns=['Price'])
y = df['Price']

In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=124)

In [50]:
X_train

index,name,company,year,kms_driven,fuel_type
467,Maruti Suzuki Vitara,Maruti,2017,36000,Diesel
212,Maruti Suzuki Alto,Maruti,2015,5000,Petrol
685,Hyundai Santro,Hyundai,2003,51000,Petrol
538,Maruti Suzuki Alto,Maruti,2019,9800,Petrol
786,Hyundai Eon,Hyundai,2018,25000,Petrol
...,...,...,...,...,...
681,Hyundai Santro AE,Hyundai,2011,45000,Petrol
135,Toyota Corolla Altis,Toyota,2012,59000,Petrol
17,Maruti Suzuki Alto,Maruti,2014,35550,Petrol
668,Maruti Suzuki Swift,Maruti,2014,11523,Petrol


In [72]:
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

In [84]:
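# Fit a throwaway encoder on the full feature set only to collect every category,
# so the pipeline's encoder also knows categories that appear in the test split but not in training.
ohe = OneHotEncoder()
ohe.fit(X[['name', 'company', 'fuel_type']])

transformer = make_column_transformer(
    (OneHotEncoder(categories=ohe.categories_), ['name', 'company', 'fuel_type']),
    remainder='passthrough')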

lr = LinearRegression()

pipe = make_pipeline(transformer, lr)

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)

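# r2_score expects (y_true, y_pred); the score recorded below was obtained with the arguments swapped.
r2_score(y_test, y_pred)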

0.834097589979544
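
R² is normalised by the variance of the true targets, so it is not symmetric in its two arguments, and scikit-learn documents the order as `r2_score(y_true, y_pred)`. The sketch below uses a hand-written `r2` helper (a hypothetical stand-in that follows the usual 1 - SS_res / SS_tot definition) and made-up numbers to show how much the order can matter.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions, for illustration only.
y_true = [1, 2, 3, 4]
y_pred = [2, 2, 3, 3]

correct_order = r2(y_true, y_pred)   # 1 - 2/5 = 0.6
swapped_order = r2(y_pred, y_true)   # 1 - 2/1 = -1.0
print(correct_order, swapped_order)
```

The residual sum of squares is the same in both calls; only the denominator changes, which is why the two scores differ.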

In [86]:
import pickle

In [87]:
with open('pipe.pkl', 'wb') as f:
    pickle.dump(pipe, f)

In [32]:
with open('df.pkl', 'wb') as f:
    pickle.dump(df, f)

In [88]:
query = pd.DataFrame(
    [['Maruti Suzuki Swift', 'Maruti', 2019, 100, 'Petrol']],
    columns=['name', 'company', 'year', 'kms_driven', 'fuel_type'])
pipe.predict(query)

array([438777.10574272])
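
To use the saved pipeline elsewhere (for example in a small web app), it has to be loaded back with `pickle.load` and given a DataFrame with the same column names it was trained on. The following is a minimal round-trip sketch on made-up rows; `toy`, `toy_pipe`, and the temporary file name are illustrative, not part of this notebook.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows, only to have something to fit (not the Quikr data).
toy = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Diesel"],
    "year": [2010, 2012, 2015, 2018],
    "Price": [100000, 180000, 250000, 400000],
})
X_toy, y_toy = toy[["fuel_type", "year"]], toy["Price"]

toy_pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(), ["fuel_type"]),
        remainder="passthrough",
    ),
    LinearRegression(),
)
toy_pipe.fit(X_toy, y_toy)

# Save to a temporary file and load it back, closing each handle explicitly.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_pipe.pkl"
    with open(path, "wb") as f:
        pickle.dump(toy_pipe, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should reproduce the original prediction.
query = pd.DataFrame([["Petrol", 2016]], columns=["fuel_type", "year"])
same = abs(restored.predict(query)[0] - toy_pipe.predict(query)[0]) < 1e-6
print(same)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that created them, and pickle files should only be loaded from trusted sources.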