# Building a car price prediction model for the CarDekho website

## About the project
[CarDekho](https://www.cardekho.com/) is India's leading car search venture that helps users buy cars. Its website and app carry rich automotive content such as expert reviews, detailed specs and prices, comparisons as well as videos and pictures of all car brands and models available in India. The company has tie-ups with many auto manufacturers, more than 4000 car dealers and numerous financial institutions to facilitate the purchase of vehicles.

In this project, we'll collect data about used cars from the CarDekho website and use it to build a car price prediction model. CarDekho could later use this model to suggest a price for a used car posted by a user.

In [2]:
# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data Collection

Since we don't have access to the CarDekho database, there are two main ways to collect data: find existing datasets or scrape the CarDekho website. Kaggle hosts two datasets that suit our case: [Vehicle dataset](https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho?select=Car+details+v3.csv) (\~8000 rows) and [Used Car Prices in India](https://www.kaggle.com/saisaathvik/used-cars-dataset-from-cardekhocom?select=Cardekho_Extract.csv) (\~20000 rows). We will use these datasets for model building because:
* both datasets contain similar columns, which makes them easy to combine;
* together they contain about 28000 rows, which is enough for model building;
* using existing datasets takes less time than scraping new data from the website;
* according to Kaggle, \~30% of the used cars in [Vehicle dataset](https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho?select=Car+details+v3.csv) were bought in 2016-2020 and \~16% of the used cars in [Used Car Prices in India](https://www.kaggle.com/saisaathvik/used-cars-dataset-from-cardekhocom?select=Cardekho_Extract.csv) were bought in 2018-2021, so the data is fresh enough.
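The freshness figures above are taken from Kaggle's column summaries. Once a dataset is loaded, a similar share could be recomputed with a few lines of pandas. The sketch below is only an illustration: the `sample` frame contains made-up years (not rows from either dataset), and `share_in_range` is a hypothetical helper name.

```python
import pandas as pd

# Made-up purchase years for illustration only (not taken from either dataset).
sample = pd.DataFrame(
    {"year": [2008, 2012, 2014, 2016, 2017, 2018, 2019, 2020, 2011, 2015]}
)


def share_in_range(years: pd.Series, start: int, end: int) -> float:
    """Return the fraction of rows whose year lies in [start, end], inclusive."""
    return float(years.between(start, end).mean())


recent_share = share_in_range(sample["year"], 2016, 2020)
print(f"{recent_share:.0%} of the sample rows fall in 2016-2020")
```

On the real data, the same helper would be called on the `year` column of each loaded frame.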

# Importing the datasets

We'll start by taking a look at the first rows and the shape of each dataset.

## Vehicle dataset

In [3]:
# read in the data
raw_data_1 = pd.read_csv('Car details v3.csv')

raw_data_1.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,190Nm@ 2000rpm,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,250Nm@ 1500-2500rpm,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,"12.7@ 2,700(kgm@ rpm)",5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0 kmpl,1396 CC,90 bhp,22.4 kgm at 1750-2750rpm,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1 kmpl,1298 CC,88.2 bhp,"11.5@ 4,500(kgm@ rpm)",5.0


In [4]:
# check the shape of dataset
raw_data_1.shape

(8128, 13)

Description of fields in the Vehicle dataset:
* *name* - name of the car
* *year* - year in which the car was first bought
* *selling_price* - selling price of the car
* *km_driven* - number of kilometers the car has been driven
* *fuel* - fuel type of the car
* *seller_type* - whether the car is sold by an individual or a dealer
* *transmission* - gear transmission of the car (Automatic/Manual)
* *owner* - number of previous owners
* *mileage* - mileage (fuel efficiency) of the car
* *engine* - engine capacity of the car
* *max_power* - max power of the engine
* *torque* - torque of the car
* *seats* - number of seats in the car
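The preview above shows that *mileage*, *engine* and *max_power* are stored as text with a unit suffix, so they would need to be converted to numbers before modelling. A minimal sketch of one way to do that is shown below, assuming the number always comes first in the string. The `specs` frame copies three rows from the preview, and `leading_number` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# First three rows of the preview above, limited to the unit-suffixed columns.
specs = pd.DataFrame(
    {
        "mileage": ["23.4 kmpl", "21.14 kmpl", "17.7 kmpl"],
        "engine": ["1248 CC", "1498 CC", "1497 CC"],
        "max_power": ["74 bhp", "103.52 bhp", "78 bhp"],
    }
)


def leading_number(column: pd.Series) -> pd.Series:
    """Extract the first decimal number from each string; no match becomes NaN."""
    extracted = column.astype(str).str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(extracted, errors="coerce")


# Apply the helper column by column; the result has the same column names.
specs_numeric = specs.apply(leading_number)
print(specs_numeric)
```

Because the helper drops the unit, values reported in a different unit would have to be converted separately before the columns are comparable.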

## Used Car Prices in India dataset

In [6]:
# read in the data
raw_data_2 = pd.read_csv('Cardekho_Extract.csv')

raw_data_2.head()

Unnamed: 0,Source.Name,web-scraper-order,web-scraper-start-url,full_name,selling_price,new-price,year,seller_type,km_driven,owner_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,cardekho_extract(0-2000).csv,1611917819-1662,https://www.cardekho.com/used-car-details/used...,Maruti Alto Std,1.2 Lakh*,,2012.0,Individual,"1,20,000 kms",First Owner,Petrol,Manual,Mileage19.7 kmpl,Engine796 CC,Max Power46.3 bhp,Seats5
1,cardekho_extract(0-2000).csv,1611918361-1902,https://www.cardekho.com/used-car-details/used...,Hyundai Grand i10 Asta,5.5 Lakh*,New Car (On-Road Price) : Rs.7.11-7.48 Lakh*,2016.0,Individual,"20,000 kms",First Owner,Petrol,Manual,Mileage18.9 kmpl,Engine1197 CC,Max Power82 bhp,Seats5
2,cardekho_extract(0-2000).csv,1611917012-1306,https://www.cardekho.com/used-car-details/used...,Hyundai i20 Asta,2.15 Lakh*,,2010.0,Individual,"60,000 kms",First Owner,Petrol,Manual,Mileage17.0 kmpl,Engine1197 CC,Max Power80 bhp,Seats5
3,cardekho_extract(0-2000).csv,1611917695-1607,https://www.cardekho.com/used-car-details/used...,Maruti Alto K10 2010-2014 VXI,2.26 Lakh*,,2012.0,Individual,"37,000 kms",First Owner,Petrol,Manual,Mileage20.92 kmpl,Engine998 CC,Max Power67.1 bhp,Seats5
4,cardekho_extract(0-2000).csv,1611914861-367,https://www.cardekho.com/used-car-details/used...,Ford Ecosport 2015-2021 1.5 TDCi Titanium BSIV,5.7 Lakh*,New Car (On-Road Price) : Rs.10.14-13.79 Lakh*,2015.0,Dealer,"30,000 kms",First Owner,Diesel,Manual,Mileage22.77 kmpl,Engine1498 CC,Max Power98.59 bhp,Seats5


In [7]:
# check the shape of dataset
raw_data_2.shape

(20026, 16)

Description of fields in the Used Car Prices in India dataset:
* *Source.Name* - name of the scraped extract file the row came from (not a car attribute)
* *web-scraper-order* - row id
* *web-scraper-start-url* - URL of the data source for the current row
* *full_name* - name of the car
* *selling_price* - current selling price of the used car
* *new-price* - current on-road price of the same car when bought brand new
* *year* - year in which the car was first bought
* *seller_type* - type of seller
* *km_driven* - kilometers driven by the car
* *owner_type* - number of previous owners
* *fuel_type* - fuel type of the car
* *transmission_type* - gear transmission of the car (Automatic/Manual)
* *mileage* - mileage (fuel efficiency) of the car
* *engine* - engine capacity of the car
* *max_power* - max horsepower of the engine
* *seats* - number of seats in the car
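The two field lists overlap heavily but use slightly different column names, so the datasets need to be aligned before they can be combined. The sketch below shows one possible approach on a single row from each preview. `RENAME_MAP` is an assumed mapping based on the field descriptions above, and the variable names are illustrative only.

```python
import pandas as pd

# One row from each preview above, limited to a few columns for brevity.
vehicle = pd.DataFrame(
    {
        "name": ["Maruti Swift Dzire VDI"],
        "year": [2014],
        "fuel": ["Diesel"],
        "transmission": ["Manual"],
        "owner": ["First Owner"],
        "torque": ["190Nm@ 2000rpm"],
    }
)
used_cars = pd.DataFrame(
    {
        "Source.Name": ["cardekho_extract(0-2000).csv"],
        "full_name": ["Maruti Alto Std"],
        "year": [2012.0],
        "fuel_type": ["Petrol"],
        "transmission_type": ["Manual"],
        "owner_type": ["First Owner"],
    }
)

# Assumed mapping from the second dataset's column names to the first's.
RENAME_MAP = {
    "full_name": "name",
    "fuel_type": "fuel",
    "transmission_type": "transmission",
    "owner_type": "owner",
}
used_cars_renamed = used_cars.rename(columns=RENAME_MAP)

# Keep only the columns present in both frames, in the first frame's order.
shared_columns = [c for c in vehicle.columns if c in used_cars_renamed.columns]

# Stack the rows of both frames and renumber the index.
combined = pd.concat(
    [vehicle[shared_columns], used_cars_renamed[shared_columns]],
    ignore_index=True,
)
print(combined)
```

Columns that exist in only one dataset (for example *torque* or *new-price*) are dropped in this sketch. Columns such as *selling_price* share a name but not a format, so they would still need cleaning before the combined frame is usable.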

## Dictionary of units in data
* Lakh - a unit in the Indian numbering system equal to one hundred thousand (100,000). In the Indian 2,2,3 convention of digit grouping, it is written as 1,00,000. For example, in India 150,000 rupees becomes 1.5 lakh rupees, written as ₹1,50,000 or INR 1,50,000.
* kmpl - Kilometers per litre is a term used to express the fuel efficiency of a vehicle. Fuel efficiency is defined as the distance travelled per unit of fuel consumed. Conversion used in this project: 1 km/kg = 1.4 kmpl.
* CC - The term “cc” stands for cubic centimeters (cm³), a metric unit for measuring an engine's capacity, or volume. A larger capacity usually means more power.
* bhp - Traditionally ‘brake horsepower’ (bhp) has been used as the definitive measurement of engine power. It’s distinct from horsepower because it takes into account power loss due to friction – it’s measured by running an engine up to full revs, then letting it naturally slow down to a dead stop.
* Nm at rpm - Torque is rotational force, and since an engine relies on a rotating crank to do its work, torque is the force the engine is able to generate. Modern engines generate different levels of torque at different engine speeds (RPM, or revolutions per minute). It’s expressed in Newton-metres (Nm), and this is what you actually feel when you’re pushed back into your seat on acceleration. A car brochure will indicate the maximum torque the engine is able to generate and the specific RPM at which it is generated. For instance, the Maruti Dzire generates 113Nm at 4200RPM (petrol) and 190Nm at 2000RPM (diesel). This means the petrol engine produces less torque at a much higher engine speed than the diesel motor, which produces more at a quite low engine speed. The bottom line: look for good torque (over 110Nm) at a low RPM (4,000 or so). Conversion: 1 kgm = 9.80665 Nm.
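The unit definitions above translate directly into small conversion helpers. The sketch below is a minimal illustration, assuming the string formats seen in the two previews ("1.2 Lakh*", "1,20,000 kms", "12.7@ 2,700(kgm@ rpm)"). The function names are hypothetical, and other formats in the full data may need extra handling.

```python
import re

LAKH = 100_000       # 1 lakh = 100,000
KGM_TO_NM = 9.80665  # 1 kgm = 9.80665 Nm


def lakh_to_rupees(text: str) -> int:
    """Convert a price string such as '1.2 Lakh*' to whole rupees."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*Lakh", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"no lakh amount found in {text!r}")
    # round() removes floating-point noise such as 214999.99999999997.
    return round(float(match.group(1)) * LAKH)


def indian_number_to_int(text: str) -> int:
    """Convert an Indian-grouped figure such as '1,20,000 kms' to an integer."""
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


def torque_to_nm(text: str) -> float:
    """Return the torque value in Nm, converting from kgm when that unit appears."""
    match = re.search(r"(\d+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"no torque value found in {text!r}")
    value = float(match.group(1))
    if "kgm" in text.lower():
        value *= KGM_TO_NM
    return value


print(lakh_to_rupees("1.2 Lakh*"))
print(indian_number_to_int("1,20,000 kms"))
print(torque_to_nm("12.7@ 2,700(kgm@ rpm)"))
```

`torque_to_nm` assumes the first number in the string is the torque value, which holds for the rows shown in the preview but is not guaranteed for every row.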