# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client, a used car dealership, as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

In [None]:
To address the business task, we will construct a regression model to predict used car prices.

In [None]:
The model will analyze features such as vehicle age, mileage, make, model, and condition.

In [None]:
By examining the relationships between these features and the sale prices, we aim to identify the most significant factors driving price variations.

In [None]:
The objective is to translate these insights into actionable strategies for pricing and inventory management.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [None]:
To understand the dataset, we start by loading it and inspecting its structure, checking for missing values, data types, and duplicates.
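
A minimal sketch of that first inspection pass, assuming pandas is available. The helper `summarize_quality` and the tiny `sample` frame are hypothetical, introduced only for illustration; the real listings file would take the place of `sample`.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count and share, and distinct-value count."""
    # Hypothetical helper, not part of the original notebook
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
        'pct_missing': (df.isna().mean() * 100).round(1),
        'n_unique': df.nunique(),
    })


# Made-up rows standing in for the real dataset
sample = pd.DataFrame({
    'price': [15000, 7200, None, 7200],
    'year': [2015, 2008, 2019, 2008],
    'fuel': ['gas', 'diesel', None, 'diesel'],
})

print(summarize_quality(sample))
print('duplicate rows:', sample.duplicated().sum())
```

Running the same summary before any rows are dropped shows which columns would cost the most data if missing values were removed wholesale.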

In [None]:
Perform exploratory data analysis (EDA) using visualizations to identify patterns, distributions, and correlations between features. 

In [None]:
Assess the data quality by detecting outliers, inconsistencies, and potential inaccuracies that could impact the analysis.
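
One common way to flag outliers is the interquartile-range (IQR) rule. The sketch below is an assumption about how this step could be done, not the method used in this notebook; `iqr_bounds` and the `prices` values are made up for illustration.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule for a numeric series."""
    # Hypothetical helper; k=1.5 is the conventional multiplier, not a tuned value
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices, including a $0 listing and an implausibly large one
prices = pd.Series([0, 4500, 6000, 7500, 9000, 12000, 15000, 3_000_000])

low, high = iqr_bounds(prices)
outliers = prices[(prices < low) | (prices > high)]
print(f"fences: ({low}, {high})")
print(outliers.tolist())
```

In this toy example the rule flags the 3,000,000 listing but not the $0 one, because the lower fence is negative. Placeholder prices of that kind would need a separate explicit filter.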

In [None]:
Finally, determine if the dataset is suitable for modeling or if additional data is needed to address any gaps.

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
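
As one illustration of the logarithm transformation mentioned above, scikit-learn's `TransformedTargetRegressor` can fit a model on the log of the target and convert predictions back to the original scale. This is a sketch on synthetic data, not a step the notebook below performs; the `age` and `price` arrays and the decay rate are made-up assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (assumption): price decays exponentially with age, with multiplicative noise
age = rng.uniform(1, 20, size=200)
price = 30000 * np.exp(-0.12 * age) * rng.lognormal(0, 0.1, size=200)

# Fit on log1p(price); predictions are mapped back to dollars with expm1
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(age.reshape(-1, 1), price)

pred = log_model.predict(np.array([[5.0]]))
print(f"predicted price for a 5-year-old car: {pred[0]:,.0f}")
```

Because the inverse transform is an exponential, predictions stay positive, which a plain linear fit on raw prices does not guarantee.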

In [None]:
Clean the dataset by addressing any remaining missing values, correcting outliers, and ensuring data integrity. 

In [None]:
Encode categorical variables and split the data into training and testing sets.  

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [None]:

# Load dataset
data = pd.read_csv('data/vehicles.csv')

# Initial data inspection (before cleaning, so missing values are still visible)
print(data.head())
data.info()  # info() prints its report itself and returns None
print(data.describe(include='all'))

# Data Cleaning
data.drop_duplicates(inplace=True)
# Drop every row with a missing value in any column. This is aggressive and can
# discard many listings, but it leaves no NaNs for the encoder to handle.
data.dropna(inplace=True)

# Feature Engineering
data['age'] = 2024 - data['year']
data['mileage_per_year'] = data['odometer'] / data['age']

# Define features and target variable
X = data[['age', 'mileage_per_year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'drive', 'size', 'type', 'paint_color', 'state', 'transmission']]
y = data['price']

# Data Transformation
numeric_features = ['age', 'mileage_per_year']
categorical_features = ['manufacturer', 'model', 'condition', 'fuel', 'drive', 'size', 'type', 'paint_color', 'state', 'transmission', 'cylinders']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [None]:
Build multiple regression models, including Linear Regression, Random Forest, and Gradient Boosting, to predict car prices. 

In [None]:
Explore and optimize different hyperparameters using GridSearchCV or RandomizedSearchCV. 

In [None]:
Apply k-fold cross-validation to ensure consistent performance across different data splits.
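
The search and cross-validation steps above can be combined, since `GridSearchCV` scores every parameter combination with k-fold cross-validation. The sketch below runs on a small synthetic frame; the `toy` data, the grid values, and the 3-fold setting are illustrative assumptions, not the settings behind the results reported later.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 120

# Synthetic stand-in for the car data (assumption): one numeric and one categorical feature
toy = pd.DataFrame({
    'age': rng.integers(1, 20, size=n),
    'fuel': rng.choice(['gas', 'diesel'], size=n),
})
toy_price = (25000 - 900 * toy['age']
             + 3000 * (toy['fuel'] == 'diesel')
             + rng.normal(0, 500, size=n))

pipe = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['age']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['fuel']),
    ])),
    ('regressor', GradientBoostingRegressor(random_state=0)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {
    'regressor__n_estimators': [50, 100],
    'regressor__max_depth': [2, 3],
}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring='neg_mean_squared_error',
)
search.fit(toy, toy_price)

print(search.best_params_)
print('cross-validated RMSE:', np.sqrt(-search.best_score_))
```

Keeping the preprocessor inside the pipeline means scaling and encoding are refit on each training fold, so no information from the validation fold leaks into the fit.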

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Define models
models = {
    # Ordinary least squares on the scaled and one-hot encoded features
    'Linear Regression': Pipeline(steps=[('preprocessor', preprocessor),
                                         ('regressor', LinearRegression())]),
    # 512 trees, max depth 100, 8 candidate features per split
    'Gradient Boosting': Pipeline(steps=[('preprocessor', preprocessor),
                                         ('regressor', GradientBoostingRegressor(n_estimators=512, max_depth=100, max_features=8))]),
    # 512 trees, max depth 100, 8 candidate features per split
    'Random Forest': Pipeline(steps=[('preprocessor', preprocessor),
                                     ('regressor', RandomForestRegressor(n_estimators=512, max_depth=100, max_features=8))])
}

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [None]:
Evaluate the models by comparing their performance metrics against the business objective of identifying key drivers for used car prices.

In [None]:
Analyze how well the models provide insights into which features most influence price and assess their predictive accuracy. 

In [None]:
Reflect on whether the results meet the business goals or if there are areas needing further refinement or exploration.

In [None]:
Based on this assessment, decide if any earlier phases of data preparation or modeling need revisiting or if the findings are ready to be presented to the client.

In [None]:
# Train and evaluate models
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results[name] = {'MSE': mse, 'R2': r2}
    
    print(f"{name} - MSE: {mse}, R2: {r2}")

# Model Comparison Plot
results_df = pd.DataFrame(results).T

# MSE as bars on the left axis, R2 as a line on the right axis
fig, ax = plt.subplots(figsize=(10, 6))
results_df['MSE'].plot(kind='bar', ax=ax)
ax.set_title('Model Comparison')
ax.set_xlabel('Model')
ax.set_ylabel('MSE')
ax.tick_params(axis='x', rotation=45)

# Reuse the bar positions so the R2 markers line up with the bars
ax2 = ax.twinx()
ax2.plot(ax.get_xticks(), results_df['R2'], color='red', marker='o')
ax2.set_ylabel('R2')

plt.tight_layout()
plt.show()

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

In [None]:
Gradient Boosting appears to be the best model on both MSE and R2. Numerous parameter combinations were tried with all three models.

In [None]:

# Print top five important features for each model
for name, model in models.items():
    regressor = model.named_steps['regressor']
    onehot = model.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
    feature_names = numeric_features + list(onehot.get_feature_names_out(categorical_features))

    # Linear models expose coefficients; tree ensembles expose feature importances
    if hasattr(regressor, 'coef_'):
        scores = regressor.coef_
    else:
        scores = regressor.feature_importances_

    # Rank by absolute value so large negative coefficients are not overlooked
    top_features = pd.Series(scores, index=feature_names).abs().sort_values(ascending=False).head(5)
    print(f"Top five important features for {name}:")
    print(top_features)
    print()


In [None]:
The results are as follows.

In [None]:
Linear Regression - MSE: 92038296.64641926, R2: 0.5127299385145626
Gradient Boosting - MSE: 45754142.68705401, R2: 0.7577679647203505
Random Forest - MSE: 89977228.91472253, R2: 0.5236416637087793
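
MSE is expressed in squared dollars, which is hard to explain to a dealership. Taking the square root gives the RMSE in dollars. The short sketch below applies that to the three MSE values printed above; the dictionary names are introduced here only for illustration.

```python
import math

# MSE values copied from the results above
mse_by_model = {
    'Linear Regression': 92038296.64641926,
    'Gradient Boosting': 45754142.68705401,
    'Random Forest': 89977228.91472253,
}

# RMSE = sqrt(MSE), in the same unit as price
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = ${rmse:,.0f}")
```

On this scale Gradient Boosting has an RMSE of roughly $6,800, against roughly $9,500 to $9,600 for the other two models.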

In [None]:
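Random Forest and Gradient Boosting also gave the most interpretable feature rankings: both place age and mileage per year first, whereas the largest Linear Regression coefficients all belong to individual `model` categories.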

In [None]:
Top five important features for Linear Regression:
model_benz sprinter           114278.431500
model_challenger srt demon    101050.294715
model_5 window coupe           88114.664943
model_850i                     84044.609071
model_hot rod                  83465.557273
dtype: float64

Top five important features for Gradient Boosting:
age                      0.132557
mileage_per_year         0.069720
cylinders_8 cylinders    0.030959
fuel_diesel              0.030813
type_truck               0.029214
dtype: float64

Top five important features for Random Forest:
age                 0.098791
mileage_per_year    0.061506
fuel_diesel         0.044064
fuel_gas            0.041248
type_truck          0.032176
dtype: float64
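
One-hot encoding splits each categorical column into many indicator features, so a column's importance is spread over several rows in the lists above. A possible way to read them at column level is to sum the importances per source column. The sketch below does this for the five Random Forest values shown above; `group_importances` is a hypothetical helper, and summing only the top five understates the true column totals.

```python
import pandas as pd


def group_importances(importances: pd.Series, categorical_columns: list[str]) -> pd.Series:
    """Sum one-hot feature importances back to their source column."""
    # Hypothetical helper. Longest names are checked first so that a short
    # column name never captures a longer one sharing the same prefix.
    ordered = sorted(categorical_columns, key=len, reverse=True)

    def source_column(encoded_name: str) -> str:
        for col in ordered:
            if encoded_name.startswith(col + '_'):
                return col
        return encoded_name  # numeric features keep their own name

    return importances.groupby(source_column).sum().sort_values(ascending=False)


# Top five Random Forest importances copied from the output above
rf_top = pd.Series({
    'age': 0.098791,
    'mileage_per_year': 0.061506,
    'fuel_diesel': 0.044064,
    'fuel_gas': 0.041248,
    'type_truck': 0.032176,
})

grouped = group_importances(rf_top, ['fuel', 'type'])
print(grouped)
```

Matching against the known column names, rather than splitting on the first underscore, keeps names such as `mileage_per_year` and `paint_color` intact.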

In [None]:
Age and mileage per year appear to be the strongest indicators of car price. We recommend that the dealership focus on these two factors when buying and pricing inventory.