# UK Used Cars - Car Price Prediction Regression Model
--- 

## Business Understanding
---

### Stakeholder:
Online used-car listing company (e.g., OLX Autos, Carmudi, Carsome.id)

### Business Model: 
They buy used cars in good condition at below-market prices and resell them on their platform at a higher price.

### Business Problem:
- If they pay too much for a car, their profit shrinks.

### Business Success Criteria:
- They want to know the price range of the used cars they are about to buy, so that they can set an appropriate selling price.
- The predicted used-car price must still account for profit.

### Solution:
- Build a machine learning model that predicts used-car prices, so that when a purchasing agent finds a car to buy, they know the appropriate purchase price range and can maximize profit.
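
To make the link between a predicted price and a purchase range concrete, here is a minimal sketch. The function name `purchase_price_range`, the use of a typical model error (for example a validation MAE) to widen the prediction into a range, and the 10% target margin are all assumptions for illustration; none of them come from this notebook.

```python
def purchase_price_range(predicted_price, typical_error, target_margin):
    """Return (low, high) purchase prices that leave room for the target margin.

    predicted_price: the model's estimate of the resale (market) price
    typical_error:   assumed typical model error, e.g. a validation MAE
    target_margin:   desired profit as a fraction of the resale price
    """
    if not 0 <= target_margin < 1:
        raise ValueError("target_margin must be in [0, 1)")
    # Widen the point prediction into a resale range (hypothetical rule).
    low_resale = max(predicted_price - typical_error, 0)
    high_resale = predicted_price + typical_error
    # Reserve the target margin, so the purchase price is what remains.
    keep = 1 - target_margin
    return low_resale * keep, high_resale * keep


# Illustrative numbers only: a £16,000 prediction, £1,000 error, 10% margin.
low, high = purchase_price_range(16000, 1000, 0.10)
print(f"Offer between £{low:,.0f} and £{high:,.0f}")
```

How the error band and margin should actually be set is a business decision; the sketch only shows that both can be applied on top of the model's output.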

## Data Preparation

In [1]:
# EDA standard libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.stats as ss

In [2]:
import os
# Define ROOT_DIR as the project root on your local machine (this resolves to the notebook's working directory)
ROOT_DIR = os.path.realpath(os.path.join(os.path.dirname('1-Data Preparation.ipynb'), '.'))
ROOT_DIR


'/Users/Dwika/My Projects/FINAL PROJECT PURWADHIKA/finpro-alpha-repo/finpro-alpha'

*NOTE: Refer to files by relative path!*

In [3]:
#Load data from the data folder
df_audi = pd.read_csv(os.path.join(ROOT_DIR, 'Raw Dataset/audi.csv'))
df_bmw = pd.read_csv(os.path.join(ROOT_DIR, 'Raw Dataset/bmw.csv'))
df_cclass = pd.read_csv(os.path.join(ROOT_DIR, 'Raw Dataset/cclass.csv'))
df_focus = pd.read_csv(os.path.join(ROOT_DIR, 'Raw Dataset/focus.csv'))
df_ford = pd.read_csv(os.path.join(ROOT_DIR, 'Raw Dataset/ford.csv'))
df_merc = pd.read_csv(os.path.join(ROOT_DIR, 'Raw Dataset/merc.csv'))
df_skoda = pd.read_csv(os.path.join(ROOT_DIR, 'Raw Dataset/skoda.csv'))
df_toyota = pd.read_csv(os.path.join(ROOT_DIR, 'Raw Dataset/toyota.csv')) 
df_unclean_cclass = pd.read_csv(os.path.join(ROOT_DIR, 'Raw Dataset/unclean cclass.csv'))
df_unclean_focus = pd.read_csv(os.path.join(ROOT_DIR, 'Raw Dataset/unclean focus.csv'))
df_vw = pd.read_csv(os.path.join(ROOT_DIR, 'Raw Dataset/vw.csv'))
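
The eleven `read_csv` calls above follow one pattern, so they could be collapsed into a loop. The sketch below is one possible way to do that with a relative path; the helper name `load_raw_datasets` is hypothetical, and it assumes every file sits in a single folder and is named `<name>.csv`.

```python
from pathlib import Path

import pandas as pd


def load_raw_datasets(data_dir, names):
    """Read <data_dir>/<name>.csv for each name and return {name: DataFrame}."""
    data_dir = Path(data_dir)
    frames = {}
    for name in names:
        frames[name] = pd.read_csv(data_dir / f"{name}.csv")
    return frames


# Usage in the notebook, relative to the working directory:
# frames = load_raw_datasets("Raw Dataset", ["audi", "bmw", "unclean cclass"])
# df_audi = frames["audi"]
```

A dictionary keyed by file name avoids one variable per brand and makes it easy to iterate over all datasets later.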

In [4]:
df_audi

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize
0,A1,2017,12500,Manual,15735,Petrol,150,55.4,1.4
1,A6,2016,16500,Automatic,36203,Diesel,20,64.2,2.0
2,A1,2016,11000,Manual,29946,Petrol,30,55.4,1.4
3,A4,2017,16800,Automatic,25952,Diesel,145,67.3,2.0
4,A3,2019,17300,Manual,1998,Petrol,145,49.6,1.0
...,...,...,...,...,...,...,...,...,...
10663,A3,2020,16999,Manual,4018,Petrol,145,49.6,1.0
10664,A3,2020,16999,Manual,1978,Petrol,150,49.6,1.0
10665,A3,2020,17199,Manual,609,Petrol,150,49.6,1.0
10666,Q3,2017,19499,Automatic,8646,Petrol,150,47.9,1.4


In [22]:
df_merc 

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize
0,SLK,2005,5200,Automatic,63000,Petrol,325,32.1,1.8
1,S Class,2017,34948,Automatic,27000,Hybrid,20,61.4,2.1
2,SL CLASS,2016,49948,Automatic,6200,Petrol,555,28.0,5.5
3,G Class,2016,61948,Automatic,16000,Petrol,325,30.4,4.0
4,G Class,2016,73948,Automatic,4000,Petrol,325,30.1,4.0
...,...,...,...,...,...,...,...,...,...
13114,C Class,2020,35999,Automatic,500,Diesel,145,55.4,2.0
13115,B Class,2020,24699,Automatic,2500,Diesel,145,55.4,2.0
13116,GLC Class,2019,30999,Automatic,11612,Diesel,145,41.5,2.1
13117,CLS Class,2019,37990,Automatic,2426,Diesel,145,45.6,2.0


In [7]:
df_cclass_new = pd.concat([df_cclass, df_unclean_cclass])

In [8]:
df_cclass_new

Unnamed: 0,model,year,price,transmission,mileage,fuelType,engineSize,fuel type,engine size,mileage2,fuel type2,engine size2,reference
0,C Class,2020.0,30495,Automatic,1200,Diesel,2.0,,,,,,
1,C Class,2020.0,29989,Automatic,1000,Petrol,1.5,,,,,,
2,C Class,2020.0,37899,Automatic,500,Diesel,2.0,,,,,,
3,C Class,2019.0,30399,Automatic,5000,Diesel,2.0,,,,,,
4,C Class,2019.0,29899,Automatic,4500,Diesel,2.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4001,C Class,2017.0,"£14,700",Manual,31357,,,25,£150,70.6,Diesel,1.598,/ad/25451436
4002,C Class,2018.0,"£18,500",Automatic,28248,,,31,£150,64.2,Diesel,2.143,/ad/25451481
4003,C Class,2014.0,"£11,900",Manual,48055,,,31,£20,65.7,Diesel,2.143,/ad/25057204
4004,C Class,2014.0,"£11,300",Automatic,49865,,,46,£145,56.5,Diesel,2.143,/ad/25144481
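
After the concatenation, the `price` column mixes plain numbers (e.g. `30495`) with strings such as `£14,700`. A minimal sketch of one way to normalise it follows; the helper name `clean_price` is hypothetical, and it assumes the only extra characters are the pound sign, commas, and whitespace.

```python
import pandas as pd


def clean_price(price: pd.Series) -> pd.Series:
    """Strip currency symbols and thousands separators, then convert to numbers."""
    as_text = price.astype(str).str.replace(r"[£,\s]", "", regex=True)
    # Anything that still is not a number (e.g. missing values) becomes NaN.
    return pd.to_numeric(as_text, errors="coerce")


# Sample values in the two formats seen in the output above, plus a missing one.
sample = pd.Series([30495, "£14,700", "£18,500", None])
cleaned = clean_price(sample)
print(cleaned)
```

Applying something like this before comparing rows matters because `14000` and `"£14,000"` are different values to pandas until both are numeric.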


In [18]:
# Select rows that are duplicated on the key columns (keep=False returns every copy)
df_cclass_new[df_cclass_new.duplicated(subset=['model','year','price','mileage'], keep=False)]

Unnamed: 0,model,year,price,transmission,mileage,fuelType,engineSize,fuel type,engine size,mileage2,fuel type2,engine size2,reference
37,C Class,2016.0,14000,Automatic,45000,Diesel,2.1,,,,,,
60,C Class,2017.0,20990,Automatic,18000,Diesel,2.1,,,,,,
69,C Class,2017.0,22767,Automatic,30676,Diesel,2.1,,,,,,
70,C Class,2015.0,12000,Automatic,40005,Diesel,2.1,,,,,,
72,C Class,2016.0,14000,Automatic,45000,Diesel,2.1,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3959,C Class,2016.0,"£14,000",Automatic,,,,Diesel,2.1,45000,,,/ad/25438794
3960,C Class,2015.0,"£12,000",Automatic,,,,Diesel,2.1,40005,,,/ad/25483371
3986,C Class,2016.0,"£14,000",Automatic,,,,Diesel,2.1,45000,,,/ad/25438704
3987,C Class,2015.0,"£12,000",Automatic,,,,Diesel,2.1,40005,,,/ad/25483276
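
The cell above only lists the duplicated rows. As a sketch of a possible next step, the toy example below contrasts `duplicated(keep=False)`, which marks every copy for inspection, with `drop_duplicates`, which keeps one row per group. The data is made up to resemble the listings and is not taken from the dataset; whether the first occurrence is the right one to keep is an assumption.

```python
import pandas as pd

# Toy rows shaped like the listings above; the values are illustrative only.
toy = pd.DataFrame({
    "model": ["C Class"] * 4,
    "year": [2016.0, 2016.0, 2015.0, 2017.0],
    "price": [14000, 14000, 12000, 20990],
    "mileage": [45000, 45000, 40005, 18000],
})
key = ["model", "year", "price", "mileage"]

# keep=False marks every member of a duplicate group, which helps inspection.
all_copies = toy[toy.duplicated(subset=key, keep=False)]

# drop_duplicates keeps the first row of each group and removes the rest.
deduped = toy.drop_duplicates(subset=key, keep="first")

print(len(all_copies), len(deduped))
```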
