Rusty Bargain, a used car sales service, is developing an app to attract new customers. In the app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that determines the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.
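
All three criteria can be measured the same way for every candidate model. The block below is a minimal sketch, not the project's method: `timed`, `MeanBaseline`, and the choice of RMSE as the quality metric are assumptions made for illustration, and the toy data is made up.

```python
import math
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


class MeanBaseline:
    """Hypothetical stand-in model: always predicts the mean training price."""

    def fit(self, features, target):
        self.mean_ = sum(target) / len(target)
        return self

    def predict(self, features):
        return [self.mean_] * len(features)


def rmse(actual, predicted):
    """Root mean squared error (assumed here as the quality metric)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


# Toy data, made up for illustration only
train_x, train_y = [[2003], [2008], [1999]], [2700, 6400, 1050]
test_x, test_y = [[2005], [2001]], [3500, 1500]

model, fit_seconds = timed(MeanBaseline().fit, train_x, train_y)      # training time
predictions, predict_seconds = timed(model.predict, test_x)           # prediction speed
score = rmse(test_y, predictions)                                     # prediction quality
print(f"fit: {fit_seconds:.6f}s, predict: {predict_seconds:.6f}s, RMSE: {score:.2f}")
```

Swapping `MeanBaseline` for a real model with the same `fit`/`predict` interface would leave the measurement code unchanged, which keeps the comparison between models fair.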

## Data preparation

In [1]:
import pandas as pd
import numpy as np
import math as mt
import matplotlib.pyplot as plt
import seaborn as sns


data = pd.read_csv('car_data.csv')

In [2]:
print(data.info())

# re-formatting column names
data = data.rename(columns = {'DateCrawled' : 'date_crawled', 'Price': 'price', 'VehicleType': 'vehicle_type',
                              'RegistrationYear': 'registration_year', 'Gearbox': 'gearbox', 'Power': 'power',
                              'Model': 'model', 'Mileage': 'mileage', 'RegistrationMonth': 'registration_month',
                              'FuelType': 'fuel_type', 'Brand': 'brand', 'NotRepaired': 'repaired',
                              'DateCreated': 'date_created', 'NumberOfPictures': 'number_of_pictures',
                              'PostalCode': 'postal_code', 'LastSeen': 'last_seen'})

print(data.isna().sum())
display(data.describe())
#print(data.sample(30, random_state=777))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

| | price | registration_year | power | mileage | registration_month | number_of_pictures | postal_code |
|---|---|---|---|---|---|---|---|
| count | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 |
| mean | 4416.656776 | 2004.234448 | 110.094337 | 128211.172535 | 5.714645 | 0.0 | 50508.689087 |
| std | 4514.158514 | 90.227958 | 189.850405 | 37905.34153 | 3.726421 | 0.0 | 25783.096248 |
| min | 0.0 | 1000.0 | 0.0 | 5000.0 | 0.0 | 0.0 | 1067.0 |
| 25% | 1050.0 | 1999.0 | 69.0 | 125000.0 | 3.0 | 0.0 | 30165.0 |
| 50% | 2700.0 | 2003.0 | 105.0 | 150000.0 | 6.0 | 0.0 | 49413.0 |
| 75% | 6400.0 | 2008.0 | 143.0 | 150000.0 | 9.0 | 0.0 | 71083.0 |
| max | 20000.0 | 9999.0 | 20000.0 | 150000.0 | 12.0 | 0.0 | 99998.0 |


**- ['NotRepaired'] is a bit confusing at first: NotRepaired == no reads as a double negative. The documentation describes the column as "vehicle repaired or not", so no means the vehicle was not repaired. For clarity, the column is renamed to simply ['repaired'].**

**- ['number_of_pictures'] is filled entirely with zeros, so it will be removed from the features: a constant column adds nothing to training.**

**- ['registration_month'] has zero values, which doesn't make logical sense, because months range from 1 (January) to 12 (December).**

**- ['registration_year'] has values as low as 1000 and as high as 9999. Some vehicles are also registered after the year of data extraction (2016), meaning time travel?**

**- Registration date does affect vehicle resale value in the European Union (the target variable, price, is in euros), so the registration year and month columns are kept for training.**

**- ['date_crawled', 'date_created', 'number_of_pictures', 'postal_code', 'last_seen'] have no predictive power in relation to model training, so these will be removed.**
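
The removal described in the notes above can be sketched as follows. This is a minimal illustration on a made-up two-row frame (the values and the names `sample`, `uninformative`, and `features` are assumptions); on the real data the same `drop` call would be applied to `data`.

```python
import pandas as pd

# Toy frame using the renamed columns; all values are made up for illustration
sample = pd.DataFrame({
    'date_crawled': ['2016-03-24 11:52:17', '2016-03-24 10:58:45'],
    'price': [480, 18300],
    'registration_year': [1993, 2011],
    'number_of_pictures': [0, 0],
    'postal_code': [70435, 66954],
    'date_created': ['2016-03-24 00:00:00', '2016-03-24 00:00:00'],
    'last_seen': ['2016-04-07 03:16:57', '2016-04-07 01:46:50'],
})

# Columns the notes above mark for removal
uninformative = ['date_crawled', 'date_created', 'number_of_pictures',
                 'postal_code', 'last_seen']

# A column with a single unique value cannot help a model tell cars apart
constant_columns = [col for col in sample.columns if sample[col].nunique() == 1]
print(constant_columns)

# drop() returns a new frame; the original is left untouched
features = sample.drop(columns=uninformative)
print(features.columns.tolist())  # ['price', 'registration_year']
```

Checking `nunique()` first is a cheap way to confirm that a column such as `number_of_pictures` really is constant before discarding it.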

In [3]:
print('BEFORE:\nnumber of vehicles registered before the year 1898: ', (data['registration_year'] < 1898).sum())
print('number of vehicles registered after 2016: ', (data['registration_year'] > 2016).sum())
print('number of vehicles registered in the month 0: ', (data['registration_month'] < 1).sum())
#print(data[data['registration_year'] < 1898])
#print(data[data['registration_year'] > 2016])

# removing entries of vehicles registered before 1898 or after 2016
# (all data was crawled during 2016, so later registrations are paradoxical)
# .copy() avoids a SettingWithCopyWarning on the assignment below
data = data[(data['registration_year'] >= 1898) & (data['registration_year'] <= 2016)].copy()

# replacing registration_month == 0 with the rounded mean of the non-zero months
mean_month = data.loc[data['registration_month'] != 0, 'registration_month'].mean()
data.loc[data['registration_month'] == 0, 'registration_month'] = round(mean_month)

print('AFTER:\nnumber of vehicles registered before the year 1898: ', (data['registration_year'] < 1898).sum())
print('number of vehicles registered after 2016: ', (data['registration_year'] > 2016).sum())
print('number of vehicles registered in the month 0: ', (data['registration_month'] < 1).sum())

BEFORE:
number of vehicles registered before the year 1898:  66
number of vehicles registered after 2016:  14530
number of vehicles registered in the month 0:  37352
AFTER:
number of vehicles registered before the year 1898:  0
number of vehicles registered after 2016:  0
number of vehicles registered in the month 0:  0


**- The Netherlands introduced national vehicle registration in 1898, one of the earliest such schemes, so registrations before that year are treated as invalid. These entries also contain wildly inconsistent data, with multiple columns missing. They are removed.**

**- All data was captured during 2016, so cars registered from 2017 onward are logically impossible. The maximum of this feature is 9999, which confirms that the column contains impossible values. These entries account for roughly 4% of the data set, so they are safe to remove.**

**- Month values range from 1 to 12; 0 values are replaced with the rounded mean of the valid months.**
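
The "roughly 4%" figure can be checked directly from the counts printed in the output above. A short arithmetic sketch (the variable names are illustrative):

```python
# Counts copied from the info() and BEFORE output above
total_rows = 354369
before_1898 = 66
after_2016 = 14530

removed = before_1898 + after_2016
share_removed = removed / total_rows
remaining = total_rows - removed

print(f"removed {removed} rows ({share_removed:.2%}), {remaining} remain")
# removed 14596 rows (4.12%), 339773 remain
```

The 339773 remaining rows match the row count that `describe()` reports after filtering.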

In [6]:
print(data.describe())
print(data.info())
#display(data.loc[data['power'] > 1000])


               price  registration_year          power        mileage  \
count  339773.000000      339773.000000  339773.000000  339773.000000   
mean     4471.307373        2002.482222     111.002711  128086.119262   
std      4546.019252           7.091181     186.879221   37895.647481   
min         0.000000        1910.000000       0.000000    5000.000000   
25%      1099.000000        1999.000000      69.000000  125000.000000   
50%      2799.000000        2003.000000     105.000000  150000.000000   
75%      6500.000000        2007.000000     143.000000  150000.000000   
max     20000.000000        2016.000000   20000.000000  150000.000000   

       registration_month  number_of_pictures    postal_code  
count       339773.000000            339773.0  339773.000000  
mean             6.349992                 0.0   50605.581132  
std              3.182215                 0.0   25806.453730  
min              1.000000                 0.0    1067.000000  
25%              4.000000  

**WE HAVE TO FIX THE 'power' AND 'price' RANGES: 'power' RUNS FROM 0 TO 20000, AND 'price' HAS A MINIMUM OF 0**
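
One possible way to handle these ranges is a simple bounds filter. The sketch below is only an illustration: the thresholds `MIN_PRICE`, `MIN_POWER`, and `MAX_POWER` are placeholder assumptions, not values taken from this notebook, and should be chosen after inspecting the distributions. The toy rows are made up.

```python
import pandas as pd

# Toy rows, made up for illustration only
sample = pd.DataFrame({
    'price': [0, 1099, 6500, 20000],
    'power': [105, 0, 143, 20000],
})

# ASSUMED bounds: placeholders, to be replaced after inspecting the data
MIN_PRICE = 1
MIN_POWER, MAX_POWER = 1, 1000

# between() is inclusive on both ends by default
valid = (sample['price'] >= MIN_PRICE) & sample['power'].between(MIN_POWER, MAX_POWER)

# .copy() so later assignments do not act on a view of the original frame
cleaned = sample[valid].copy()
print(len(sample) - len(cleaned), 'rows dropped')
```

Dropping rows is only one option; replacing implausible `power` values with a per-model median would keep more data at the cost of extra assumptions.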

In [7]:
#numerical_features = ['registration_year', 'power', 'mileage', 'registration_month']
#melted = data[numerical_features].melt(var_name='feature', value_name='value')

#for feature in numerical_features:
    #plt.figure(figsize=(6, 4))
    #sns.boxplot(y=data[feature])
    #plt.title(f'Distribution of {feature}')
    #plt.show()

## Model training

## Model analysis

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  Code is error free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The analysis of speed and quality of the models has been performed