# Data Preparation for Modelling

In this section we prepare the dataset for modelling: we separate the target variable from the features and decide which features to feed into the model. This supports both predicting a vehicle's value and analysing feature importance afterwards.

In [15]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt

In [16]:
df = pd.read_csv('Data/feature_engineered_data.csv', index_col=0)

In [17]:
df.columns

Index(['title', 'Price', 'Mileage(miles)', 'Registration_Year',
       'Previous Owners', 'Fuel type', 'Body type', 'Engine', 'Gearbox',
       'Doors', 'Seats', 'Emission Class', 'Has_Service_History', 'Mileage',
       'Car_Age', 'Engine_Bin', 'Mileage_per_Year', 'Log_Price', 'Log_Mileage',
       'Age_Band', 'title_lower', 'Is_Premium', 'Brand', 'Model',
       'Usage_Level', 'Expected_Mileage', 'Mileage_Delta', 'Owners_per_Year',
       'Price_per_Seat', 'Price_per_Year_Age', 'Door_Category',
       'Is_Family_Car', 'Brand_Avg_Price', 'Model_Avg_Price',
       'Engine_per_Seat', 'Car_Age_Squared', 'Premium_Age'],
      dtype='object')

## Target Variable

We have two price variables, `Price` and `Log_Price`. Because `Price` has a skewed distribution, using `Log_Price` as the target should improve the performance of linear models and make the fit more statistically stable.
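The effect of a log transform on skew can be illustrated with a short sketch. It uses synthetic log-normal prices rather than the notebook's data, and it assumes a natural log of strictly positive prices; how `Log_Price` was actually computed upstream (for example `np.log` versus `np.log1p`) is not shown in this section.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed prices (log-normal). This is NOT the notebook's data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=8.5, sigma=0.8, size=5_000), name="Price")

# Assumption: a plain natural log, valid only for strictly positive prices.
log_prices = np.log(prices)

# Sample skewness before and after the transform.
raw_skew = prices.skew()
log_skew = log_prices.skew()
print(f"skew(Price)     = {raw_skew:.2f}")
print(f"skew(Log_Price) = {log_skew:.2f}")

# Predictions made on the log scale are mapped back with the inverse transform.
recovered = np.exp(log_prices)
```

If the model is trained on the log scale, its predictions need the inverse transform (`np.exp` here) before they can be read as prices.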

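Because price is our target, we must remove every feature created during analysis that was derived from `Price`. Such features leak the target into the inputs and would make the model look better than it really is. These include: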

- `Price_per_Seat`
- `Price_per_Year_Age`
- `Brand_Avg_Price`
- `Model_Avg_Price`

In [21]:
df.drop(columns=["Price_per_Seat", "Price_per_Year_Age", "Brand_Avg_Price", "Model_Avg_Price"], inplace=True)
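One way to double-check that no price-derived columns remain after a drop like the one above is a simple name-based scan. The sketch below is only an illustration: `df_demo` is a hypothetical toy frame that reuses a few of the notebook's column names, the `Price_per_Seat` and `Brand_Avg_Price` values are made up, and the assumption that every leaky column has "price" in its name may not hold for other datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; derived-feature values are illustrative only.
df_demo = pd.DataFrame({
    "Price": [6900, 1495, 949],
    "Car_Age": [7, 15, 12],
    "Price_per_Seat": [1380.0, 299.0, 189.8],
    "Brand_Avg_Price": [5200.0, 3100.0, 4000.0],
})
df_demo["Log_Price"] = np.log(df_demo["Price"])

# Both price columns are targets; neither may be used as a feature.
targets = ["Price", "Log_Price"]

# Assumption: leaky columns can be recognised by "price" in their name.
leaky = [c for c in df_demo.columns
         if "price" in c.lower() and c not in targets]

# Feature matrix without targets or leaky columns; log price as the target.
X = df_demo.drop(columns=targets + leaky)
y = df_demo["Log_Price"]

# Sanity check: nothing price-related should be left among the features.
remaining = [c for c in X.columns if "price" in c.lower()]
print("dropped:", leaky)
print("remaining price-like features:", remaining)
```

A name-based scan is only a heuristic. Features that encode price under a different name would still need to be identified by hand.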


In [22]:
df.head()

|   | title | Price | Mileage(miles) | Registration_Year | Previous Owners | Fuel type | Body type | Engine | Gearbox | Doors | ... | Model | Usage_Level | Expected_Mileage | Mileage_Delta | Owners_per_Year | Door_Category | Is_Family_Car | Engine_per_Seat | Car_Age_Squared | Premium_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SKODA Fabia | 6900 | 70189 | 2016 | 3 | Diesel | Hatchback | 1.4 | Manual | 5.0 | ... | Fabia | Normal | 63000 | 7189 | 0.428571 | Family | 1 | 0.28 | 49 | 0 |
| 1 | Vauxhall Corsa | 1495 | 88585 | 2008 | 4 | Petrol | Hatchback | 1.2 | Manual | 3.0 | ... | Corsa | Normal | 135000 | -46415 | 0.266667 | Small | 1 | 0.24 | 225 | 0 |
| 2 | Hyundai i30 | 949 | 137000 | 2011 | 3 | Petrol | Hatchback | 1.4 | Manual | 5.0 | ... |  | Normal | 108000 | 29000 | 0.25 | Family | 1 | 0.28 | 144 | 0 |
| 3 | MINI Hatch | 2395 | 96731 | 2010 | 5 | Petrol | Hatchback | 1.4 | Manual | 3.0 | ... | Hatch | Normal | 117000 | -20269 | 0.384615 | Small | 0 | 0.35 | 169 | 0 |
| 4 | Vauxhall Corsa | 1000 | 85000 | 2013 | 3 | Diesel | Hatchback | 1.3 | Manual | 5.0 | ... | Corsa | Normal | 90000 | -5000 | 0.3 | Family | 1 | 0.26 | 100 | 0 |


## Encoding Categorical Variables

In [24]:
df.dtypes

title                   object
Price                    int64
Mileage(miles)           int64
Registration_Year        int64
Previous Owners          int64
Fuel type               object
Body type               object
Engine                 float64
Gearbox                 object
Doors                  float64
Seats                  float64
Emission Class          object
Has_Service_History      int64
Mileage                  int64
Car_Age                  int64
Engine_Bin              object
Mileage_per_Year       float64
Log_Price              float64
Log_Mileage            float64
Age_Band                object
title_lower             object
Is_Premium               int64
Brand                   object
Model                   object
Usage_Level             object
Expected_Mileage         int64
Mileage_Delta            int64
Owners_per_Year        float64
Door_Category           object
Is_Family_Car            int64
Engine_per_Seat        float64
Car_Age_Squared          int64
Premium_