# USED CAR PRICE PREDICTION

This dataset contains used car prices in India. It is available on Kaggle: https://www.kaggle.com/avikasliwal/used-cars-price-prediction.

The main objective is to extract insights from this dataset and build a model that predicts car prices accurately.

### DATA DICTIONARY 
- Owner_Type              = car's owner Type (First, Second, Third, Fourth & Above)
- Kilometers_Driven       = number of kilometers driven 
- Mileage                 = fuel efficiency (given in kmpl or km/kg)
- Seats                   = number of seats
- Engine                  = car's engine displacement (in CC)
- Fuel_Type               = car's fuel type (CNG, Diesel, Petrol, LPG) 
- Year_Gap                = number of year gap
- Power                   = car's engine power (in bhp)
- Transmission            = car's transmission type (Manual, Automatic)


## IMPORT LIBRARIES

In [61]:
# data wrangling
import numpy as np
import pandas as pd

## OVERVIEW

In [62]:
# load data
df = pd.read_csv('cars.csv').drop(columns=['Unnamed: 0', 'New_Price'])

In [63]:
# show top 5
df.head()

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,1.75
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,12.5
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,4.5
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,6.0
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,17.74


In [64]:
# show info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6019 entries, 0 to 6018
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               6019 non-null   object 
 1   Location           6019 non-null   object 
 2   Year               6019 non-null   int64  
 3   Kilometers_Driven  6019 non-null   int64  
 4   Fuel_Type          6019 non-null   object 
 5   Transmission       6019 non-null   object 
 6   Owner_Type         6019 non-null   object 
 7   Mileage            6017 non-null   object 
 8   Engine             5983 non-null   object 
 9   Power              5983 non-null   object 
 10  Seats              5977 non-null   float64
 11  Price              6019 non-null   float64
dtypes: float64(2), int64(2), object(8)
memory usage: 564.4+ KB


The output of `df.info()` shows that some columns contain null values and that several features that should be numeric (float or int), such as 'Mileage', 'Engine' and 'Power', are stored as `object`. We'll start by dropping the rows with null values, although this is not best practice.

In [65]:
# check null values
df.isna().sum()

Name                  0
Location              0
Year                  0
Kilometers_Driven     0
Fuel_Type             0
Transmission          0
Owner_Type            0
Mileage               2
Engine               36
Power                36
Seats                42
Price                 0
dtype: int64

In [66]:
# drop null values
df.dropna(inplace=True)
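
Dropping rows is the simplest option. As a rough sketch of one alternative, the snippet below fills missing seat counts with the column median. The toy frame and the choice of the median are assumptions for illustration only; this is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "Seats": [5.0, 7.0, np.nan, 5.0],
    "Price": [1.75, 6.0, 4.5, 12.5],
})

# Fill missing seat counts with the column median instead of dropping the row
seats_median = toy["Seats"].median()
imputed = toy.assign(Seats=toy["Seats"].fillna(seats_median))

print(imputed)
```

Imputation keeps every row, at the cost of introducing values that were never observed, so whether it is preferable depends on how many rows would otherwise be lost.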

## PREPROCESSING

Next, we fix the datatype problem noted above by stripping the unit suffixes ('kmpl', 'CC', 'bhp', etc.) and writing only the numeric part back to the original dataframe.

In [67]:
# mileage feature: keep only the number before the unit (e.g. '26.6 km/kg' -> '26.6')
mil = [i.split(' ')[0] for i in df.Mileage]

# pass to original dataframe
df.Mileage = mil

In [68]:
# engine feature: keep only the number before the unit (e.g. '998 CC' -> '998')
eng = [i.split(' ')[0] for i in df.Engine]

# pass to original dataframe
df.Engine = eng

In [69]:
# power feature: keep only the number before the unit (e.g. '58.16 bhp' -> '58.16')
pwr = [i.split(' ')[0] for i in df.Power]

# pass to original dataframe
df.Power = pwr
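
The unit-stripping cells above all apply the same operation to different columns. A minimal sketch of a vectorized version, using pandas string methods on a toy frame built from the preview rows, might look like this. The helper name `strip_unit` is hypothetical.

```python
import pandas as pd

# Toy rows taken from the preview shown earlier
toy = pd.DataFrame({
    "Mileage": ["26.6 km/kg", "19.67 kmpl"],
    "Engine": ["998 CC", "1582 CC"],
    "Power": ["58.16 bhp", "126.2 bhp"],
})

def strip_unit(col: pd.Series) -> pd.Series:
    """Keep the text before the first space, e.g. '998 CC' -> '998'."""
    return col.str.split(" ").str[0]

# Apply the same helper to every column that carries a unit suffix
for name in ["Mileage", "Engine", "Power"]:
    toy[name] = strip_unit(toy[name])

print(toy)
```

The values are still strings after this step; the datatype conversion happens separately.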

There appears to be an anomaly: some rows have the string 'null' in the 'Power' feature instead of a bhp value.

In [70]:
# show anomaly
df.loc[76]

Name                 Ford Fiesta 1.4 SXi TDCi
Location                               Jaipur
Year                                     2008
Kilometers_Driven                      111111
Fuel_Type                              Diesel
Transmission                           Manual
Owner_Type                              First
Mileage                                  17.8
Engine                                   1399
Power                                    null
Seats                                       5
Price                                       2
Name: 76, dtype: object

In [71]:
# show anomaly
df[df.Power == 'null']

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
76,Ford Fiesta 1.4 SXi TDCi,Jaipur,2008,111111,Diesel,Manual,First,17.8,1399,null,5.0,2.00
79,Hyundai Santro Xing XL,Hyderabad,2005,87591,Petrol,Manual,First,0.0,1086,null,5.0,1.30
89,Hyundai Santro Xing XO,Hyderabad,2007,73745,Petrol,Manual,First,17.0,1086,null,5.0,2.10
120,Hyundai Santro Xing XL eRLX Euro III,Mumbai,2005,102000,Petrol,Manual,Second,17.0,1086,null,5.0,0.85
143,Hyundai Santro Xing XO eRLX Euro II,Kochi,2008,80759,Petrol,Manual,Third,17.0,1086,null,5.0,1.67
...,...,...,...,...,...,...,...,...,...,...,...,...
5861,Hyundai Santro Xing XO,Chennai,2007,79000,Petrol,Manual,First,17.0,1086,null,5.0,1.85
5873,Hyundai Santro Xing XO eRLX Euro II,Pune,2006,47200,Petrol,Manual,Second,17.0,1086,null,5.0,1.20
5925,Skoda Laura Classic 1.8 TSI,Pune,2010,85000,Petrol,Manual,First,17.5,1798,null,5.0,2.85
5943,Mahindra Jeep MM 540 DP,Chennai,2002,75000,Diesel,Manual,First,0.0,2112,null,6.0,1.70


In [72]:
# drop the anomaly (copy so later column assignments don't act on a view)
df = df[df.Power != 'null'].copy()
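
Filtering on the literal string 'null' only catches that one placeholder. As a hedged alternative, `pd.to_numeric` with `errors="coerce"` converts any unparseable entry to NaN, which can then be handled like other missing values. The sample values below are made up for illustration.

```python
import pandas as pd

# Toy Power values after unit stripping; 'null' mimics the anomaly
power = pd.Series(["58.16", "null", "126.2"], name="Power")

# errors="coerce" turns anything unparseable into NaN instead of raising
power_num = pd.to_numeric(power, errors="coerce")

# The NaN rows can then be dropped (or imputed) like any other missing value
clean = power_num.dropna()

print(clean)
```

This also performs the string-to-float conversion in the same step, so a separate cast for that column would not be needed.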

In [73]:
# name feature: keep only the first word of the name (the brand)
nme = [i.split(' ')[0] for i in df.Name]

# pass to original dataframe
df.Name = nme
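
For reference, the same brand extraction can be sketched with pandas string methods. The two names below are taken from the preview rows; the variable names are illustrative.

```python
import pandas as pd

names = pd.Series([
    "Maruti Wagon R LXI CNG",
    "Audi A4 New 2.0 TDI Multitronic",
])

# n=1 splits only at the first space; .str[0] keeps the leading word
brand = names.str.split(" ", n=1).str[0]

print(brand.tolist())  # ['Maruti', 'Audi']
```

Note that taking the first word would truncate any brand whose name consists of two words, so such cases (if present in the data) would need separate handling.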

Now, we'll fix the datatype.

In [74]:
# fixing datatype: Engine and Seats are whole numbers, Mileage and Power are decimals
df = df.astype({'Engine': 'int64', 'Seats': 'int64', 'Mileage': float, 'Power': float})

In [75]:
# show info on processed data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5872 entries, 0 to 6018
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               5872 non-null   object 
 1   Location           5872 non-null   object 
 2   Year               5872 non-null   int64  
 3   Kilometers_Driven  5872 non-null   int64  
 4   Fuel_Type          5872 non-null   object 
 5   Transmission       5872 non-null   object 
 6   Owner_Type         5872 non-null   object 
 7   Mileage            5872 non-null   float64
 8   Engine             5872 non-null   int64  
 9   Power              5872 non-null   float64
 10  Seats              5872 non-null   int64  
 11  Price              5872 non-null   float64
dtypes: float64(3), int64(4), object(5)
memory usage: 596.4+ KB


In [76]:
# show top 5 processed data
df.head()

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
0,Maruti,Mumbai,2010,72000,CNG,Manual,First,26.6,998,58.16,5,1.75
1,Hyundai,Pune,2015,41000,Diesel,Manual,First,19.67,1582,126.2,5,12.5
2,Honda,Chennai,2011,46000,Petrol,Manual,First,18.2,1199,88.7,5,4.5
3,Maruti,Chennai,2012,87000,Diesel,Manual,First,20.77,1248,88.76,7,6.0
4,Audi,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2,1968,140.8,5,17.74


I'll save this processed data to a new file and continue with the modelling part.

In [77]:
# save to a new CSV file, leaving out the index
df.to_csv('cars_cleaned.csv', index=False)
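
As a quick sanity check on the export step, the sketch below writes a toy frame to an in-memory buffer (standing in for the CSV file) and reads it back. The frame contents are illustrative; the point is that leaving out the index avoids an extra unnamed column on reload.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"Name": ["Maruti", "Audi"], "Price": [1.75, 17.74]},
    index=[0, 4],  # non-contiguous index, as left behind by row drops
)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False leaves the index out of the file
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())  # ['Name', 'Price']
```

The reloaded frame gets a fresh 0-based index, which is usually what the modelling step expects.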