# Data Preparation for Modelling

In this section we prepare the dataset for modelling: we separate the target variable from the features and decide which features to pass to the model. A clean feature set supports both predicting the value of a vehicle and analysing feature importance later.

In [12]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import pickle

In [22]:
df = pd.read_csv('Data/feature_engineered_data.csv', index_col=0)

In [14]:
df.columns

Index(['title', 'Price', 'Mileage(miles)', 'Registration_Year',
       'Previous Owners', 'Fuel type', 'Body type', 'Engine', 'Gearbox',
       'Doors', 'Seats', 'Emission Class', 'Has_Service_History', 'Mileage',
       'Car_Age', 'Engine_Bin', 'Mileage_per_Year', 'Log_Price', 'Log_Mileage',
       'Age_Band', 'title_lower', 'Is_Premium', 'Brand', 'Model',
       'Usage_Level', 'Expected_Mileage', 'Mileage_Delta', 'Owners_per_Year',
       'Price_per_Seat', 'Price_per_Year_Age', 'Door_Category',
       'Is_Family_Car', 'Brand_Avg_Price', 'Model_Avg_Price',
       'Engine_per_Seat', 'Car_Age_Squared', 'Premium_Age'],
      dtype='object')

## Target Variable

We have two price variables, `Price` and `Log_Price`. Because `Price` has a skewed distribution, we use `Log_Price` as the target: a log transform typically reduces the skew, which tends to improve linear model performance and makes the fit more statistically stable.
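To illustrate why the log scale helps, here is a minimal sketch on synthetic data. The prices are randomly generated rather than taken from the dataset, and it assumes `Log_Price` is the natural log of `Price` (if it was built with `np.log1p`, the inverse would be `np.expm1`).

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed prices (illustrative only, not the real dataset)
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=9.0, sigma=0.6, size=5000), name="Price")

# Assumption: Log_Price is the natural log of Price
log_price = np.log(price)

# Skewness near 0 indicates a roughly symmetric distribution
skew_raw = price.skew()
skew_log = log_price.skew()
print(f"skew of Price:     {skew_raw:.2f}")
print(f"skew of Log_Price: {skew_log:.2f}")

# Values on the log scale map back to the original price scale with exp
recovered = np.exp(log_price)
```

Because the model is trained on `Log_Price`, its predictions would need the same inverse transform before being read as prices.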

Because price is our target, we must remove every feature created during analysis that is derived from it; leaving these in would leak the target into the model. These include:

- `Price_per_Seat`
- `Price_per_Year_Age`
- `Brand_Avg_Price`
- `Model_Avg_Price`

In [15]:
df = df.drop(columns=[
    'Brand_Avg_Price',
    'Model_Avg_Price',
    'Price_per_Seat',
    'Price_per_Year_Age'
], errors='ignore')
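Note that `errors='ignore'` silently skips any name that is not in the frame, so a misspelled column would go unnoticed. A quick check along the following lines can confirm that no price-derived feature is left. This is a sketch on a toy frame with made-up values; the name-based filter is an assumption and would not catch a leaky column whose name does not contain "price".

```python
import pandas as pd

# Toy frame reusing a few column names from this notebook (values are made up)
toy = pd.DataFrame({
    "Price": [5000, 7500],
    "Log_Price": [8.52, 8.92],
    "Car_Age": [10, 6],
    "Price_per_Seat": [1000.0, 1500.0],
    "Brand_Avg_Price": [6000.0, 8000.0],
})

leaky = ["Brand_Avg_Price", "Model_Avg_Price", "Price_per_Seat", "Price_per_Year_Age"]
toy = toy.drop(columns=leaky, errors="ignore")

# Anything that still mentions "price" and is not a target deserves a second look
targets = {"Price", "Log_Price"}
remaining = [c for c in toy.columns if "price" in c.lower() and c not in targets]
print(remaining)
```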


## Encoding Categorical Variables

**Categorical variables include:**

- `Fuel type`
- `Body type`
- `Gearbox`
- `Emission Class`
- `Engine_Bin`
- `Age_Band`
- `Brand`
- `Model`
- `Usage_Level`
- `Door_Category`

We will encode these for modelling.

In [16]:

categorical_cols = [
    'Fuel type',
    'Body type',
    'Gearbox',
    'Emission Class',
    'Engine_Bin',
    'Age_Band',
    'Brand',
    'Model',
    'Usage_Level',
    'Door_Category'
]


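A regression model needs numeric inputs, so we one-hot encode the categorical features. Setting `drop_first=True` drops one level per feature, which avoids perfectly collinear dummy columns.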

In [17]:

df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
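As a small illustration of what `drop_first=True` does, the sketch below encodes a toy frame with made-up rows. Each two-level feature becomes a single indicator column, with the alphabetically first level acting as the implicit baseline.

```python
import pandas as pd

# Toy frame with made-up rows; column names borrowed from this notebook
toy = pd.DataFrame({
    "Gearbox": ["Manual", "Automatic", "Manual"],
    "Fuel type": ["Petrol", "Diesel", "Petrol"],
    "Car_Age": [10, 6, 3],
})

encoded = pd.get_dummies(toy, columns=["Gearbox", "Fuel type"], drop_first=True)

# "Automatic" and "Diesel" are dropped and become the reference levels
print(encoded.columns.tolist())
```

High-cardinality features such as `Model` will add one column per level (minus one), so it may be worth checking `df.shape` after encoding.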

## Test/Train Split 

In [18]:
X = df.drop(columns=['Price', 'Log_Price'])
y = df['Log_Price']   
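The column listing earlier includes text fields such as `title` and `title_lower` that are not in `categorical_cols`. If they are still stored as strings in `X`, most scikit-learn estimators will fail on them. The sketch below, on a toy frame with made-up values, shows one way to detect and drop such columns; whether dropping is the right treatment for the real data is an assumption.

```python
import pandas as pd

# Toy frame: one free-text column alongside numeric and boolean features
toy = pd.DataFrame({
    "title": ["FORD FIESTA 1.0", "BMW 320D"],
    "Car_Age": [10, 6],
    "Gearbox_Manual": [True, False],
    "Log_Price": [8.52, 9.21],
})

X_toy = toy.drop(columns=["Log_Price"])

# Columns that are neither numeric nor boolean cannot be fed to most estimators
non_numeric = X_toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(non_numeric)

X_toy = X_toy.drop(columns=non_numeric)
```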


In [19]:


X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=23
)
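The following sketch, on a toy frame with made-up values, shows what `test_size=0.2` and a fixed `random_state` give: an 80/20 split that selects the same rows every time it is run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target (10 rows, made-up values)
X_toy = pd.DataFrame({"Car_Age": np.arange(10), "Mileage": np.arange(10) * 1000})
y_toy = pd.Series(np.linspace(8.0, 9.0, 10), name="Log_Price")

split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=23)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=23)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))

# The same random_state reproduces the same test rows
same_rows = X_te.index.equals(split_b[1].index)
print(same_rows)
```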




In [20]:


with open("Data/train_test_split.pkl", "wb") as f:
    pickle.dump((X_train, X_test, y_train, y_test), f)
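A later notebook can restore the split by unpacking the tuple in the same order it was saved. Below is a minimal round-trip sketch using toy objects and a temporary directory, so it does not touch the real `Data/train_test_split.pkl`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-ins for the real split (made-up values)
X_train_toy = pd.DataFrame({"Car_Age": [10, 6, 3]})
X_test_toy = pd.DataFrame({"Car_Age": [8]})
y_train_toy = pd.Series([8.5, 8.9, 9.4], name="Log_Price")
y_test_toy = pd.Series([8.7], name="Log_Price")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_test_split.pkl")

    # Save the four objects as one tuple
    with open(path, "wb") as f:
        pickle.dump((X_train_toy, X_test_toy, y_train_toy, y_test_toy), f)

    # Load them back in the same order
    with open(path, "rb") as f:
        X_tr, X_te, y_tr, y_te = pickle.load(f)

print(X_tr.equals(X_train_toy), y_te.equals(y_test_toy))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.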
