# <span style="color: #51e2f5;">What drives the price of a car?</span>

![](../images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

## <span style="color: #9df9ef;">CRISP-DM Framework</span>

<center>
    <img src="../images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

## <span style="color: #9df9ef;">Business Understanding</span>

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

### <span style="color: #edf756;"> Data-task reframing </span>

Turn the business question, “what drives the price of a used car?”, into a supervised-learning and exploratory-data-analysis problem: build and validate a regression model that estimates a car’s sale price from its attributes (make, model, year, mileage, trim, fuel/engine type, transmission, location, condition, etc.), then use model-interpretation techniques (feature importance, partial dependence, SHAP) and rigorous EDA to identify the attributes most strongly associated with price. Because the listings are observational data, these drivers are correlational and should not be read as causal effects.

**Objective:** Predict price (continuous target) and rank the features that most strongly influence price while controlling for confounders and avoiding target leakage.

**Approach:** perform data cleaning (missing-data treatment, outlier detection), feature engineering (age, mileage-per-year, market-region indicators), categorical encoding, and exploratory analysis; train several regression models (linear models with regularization, tree-based models like Random Forest / XGBoost); evaluate with cross-validated metrics (MAE, RMSE, R²) and robustness checks.

**Interpretability & business output:** use global and local explainability (feature importances, partial dependence plots, SHAP values) to produce clear, actionable recommendations for the dealership (which attributes to highlight, which trade-ins to prioritize, pricing adjustments by age/mileage/brand/region).

**Success criteria:** a validated model with acceptable predictive error (business-defined MAE or RMSE threshold), stable feature ranking across models, and a short list of actionable drivers that explain price variance and map to dealership decisions (pricing, procurement, marketing).
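
The error metrics named above can be made concrete with a small worked example. This is a minimal sketch on made-up prices, not results from the dataset; `MAE_THRESHOLD` is a hypothetical stand-in for whatever tolerance the dealership would define.

```python
import numpy as np

# Made-up true and predicted prices (illustration only, not from the dataset).
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_pred = np.array([11000.0, 14000.0, 21000.0, 23000.0])

errors = y_pred - y_true

# MAE: average absolute error, in the same units as price.
mae = np.mean(np.abs(errors))

# RMSE: like MAE, but penalizes large errors more heavily.
rmse = np.sqrt(np.mean(errors ** 2))

# R^2: share of price variance explained by the predictions.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Hypothetical business-defined tolerance (an assumption for this sketch).
MAE_THRESHOLD = 1500.0
meets_threshold = bool(mae <= MAE_THRESHOLD)

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}  ok={meets_threshold}")
```

MAE is usually the easier number to explain to a dealer ("predictions are off by about this many dollars on average"), while RMSE is more sensitive to a few badly mispriced cars.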

### <span style="color: #edf756;"> Data Problem Definition (CRISP-DM): </span>
This project is a supervised machine learning regression task where the goal is to model and predict the continuous target variable price using vehicle attributes such as make, model, year, mileage, condition, and specifications. We will perform data preparation, including cleaning, transformation, and feature engineering, followed by exploratory data analysis (EDA) to understand patterns and correlations. The objective is to build an interpretable regression model and use feature importance and related analytical techniques to determine which variables most significantly influence used-car prices.

## <span style="color: #9df9ef;"> Data Understanding </span>

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

#### <span style="color: #ffa8B6;"> 1. Load and Inspect the Raw Data </span>
- Review dataset size (rows, columns) and data types.
- Display sample records to see how values are represented.
- Check for column descriptions or metadata.
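
A first inspection along these lines might look like the sketch below. The small `toy` frame is a made-up stand-in so the snippet runs on its own; in the notebook the same calls would be applied to the frame read from `../data/vehicles.csv`.

```python
import pandas as pd

# Made-up stand-in for the listings data (illustration only).
toy = pd.DataFrame({
    "price": [26985, 0, 15500, 8200],
    "year": [2016.0, 2009.0, None, 2012.0],
    "manufacturer": ["rover", "ford", "honda", None],
    "odometer": [57181.0, 143000.0, 98000.0, 120500.0],
})

n_rows, n_cols = toy.shape        # dataset size
dtypes = toy.dtypes               # storage type of each column
preview = toy.head(3)             # first records, to see how values are represented
numeric_summary = toy.describe()  # count, mean, std, min, quartiles, max (numeric columns)

print(n_rows, n_cols)
print(dtypes)
print(preview)
print(numeric_summary)
```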

#### <span style="color: #ffa8B6;"> 2. Assess Data Completeness </span>
- Identify missing values in each column.
- Analyze missingness patterns (random vs. systematic).
- Flag fields with high missing rates for potential removal or imputation.
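
The completeness checks above can be sketched as follows. The data are made up, and the 50% cutoff (`HIGH_MISSING_THRESHOLD`) is an arbitrary assumption for illustration, not a recommended value.

```python
import pandas as pd

# Made-up listings with gaps (illustration only).
toy = pd.DataFrame({
    "price": [26985, 0, 15500, 8200, 12000],
    "condition": ["excellent", None, None, "good", None],
    "cylinders": ["4 cylinders", None, None, "6 cylinders", None],
    "fuel": ["gas", "gas", None, "diesel", "gas"],
})

# Percentage of missing values per column, highest first.
miss_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

# Columns above the (assumed) threshold are candidates for removal or imputation.
HIGH_MISSING_THRESHOLD = 50.0
high_missing = miss_pct[miss_pct > HIGH_MISSING_THRESHOLD].index.tolist()

# Correlation between missing-value indicators: values near 1 mean two fields
# tend to be missing together, which points to systematic missingness.
co_missing = toy[["condition", "cylinders"]].isna().astype(int).corr()

print(miss_pct)
print(high_missing)
print(co_missing)
```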

#### <span style="color: #ffa8B6;"> 3. Evaluate Data Quality </span>
- Look for inconsistent or impossible values, for example negative mileage, future model years, or `price = 0`.
- Identify duplicate entries.
- Detect outliers using histograms, boxplots, IQR, or z-scores.
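
One way to express these quality checks in pandas is sketched below, on made-up rows. `REFERENCE_YEAR` is an assumption (the year the listings were collected), and the 1.5 × IQR fence is the conventional boxplot rule rather than a threshold taken from this project.

```python
import pandas as pd

# Made-up listings containing deliberate problems (illustration only).
toy = pd.DataFrame({
    "price": [26985, 0, 15500, 8200, 15500, 3736929000],
    "year": [2016, 2009, 2031, 2012, 2031, 2015],
    "odometer": [57181, 143000, -10, 120500, -10, 90000],
})

REFERENCE_YEAR = 2022  # assumption: year the listings were collected

# Impossible values: non-positive price, negative mileage, or a model year
# more than one year ahead of the reference year.
impossible = (
    (toy["price"] <= 0)
    | (toy["odometer"] < 0)
    | (toy["year"] > REFERENCE_YEAR + 1)
)

# Exact duplicate rows (the first occurrence is not flagged).
duplicates = toy.duplicated(keep="first")

# IQR rule: flag prices beyond 1.5 * IQR from the quartiles.
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
price_outlier = (toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)

print(toy.assign(impossible=impossible, duplicate=duplicates, outlier=price_outlier))
```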

#### <span style="color: #ffa8B6;"> 4. Understand Variable Distributions </span>
- Plot histograms and density curves for numerical variables.
- Review category frequencies for categorical variables.
- Identify dominant categories that may introduce imbalance.
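
Alongside the plots, the same questions can be answered numerically. The sketch below uses made-up values; the 70% cutoff for calling a category "dominant" is an arbitrary assumption. In the notebook, `sns.histplot` or `px.histogram` would give the visual counterpart.

```python
import numpy as np
import pandas as pd

# Made-up listings with one very expensive car (illustration only).
toy = pd.DataFrame({
    "price": [4500, 5200, 6100, 7000, 8200, 9900, 15500, 26985, 250000],
    "fuel": ["gas"] * 7 + ["diesel", "electric"],
})

# Skewness of price before and after a log transform; a long right tail
# yields a large positive skew that the log transform reduces.
price_skew = toy["price"].skew()
log_price_skew = np.log1p(toy["price"]).skew()

# Share of each category; a category above the (assumed) 70% cutoff dominates.
fuel_share = toy["fuel"].value_counts(normalize=True)
dominant = fuel_share[fuel_share > 0.7]

print(price_skew, log_price_skew)
print(fuel_share)
```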

#### <span style="color: #ffa8B6;"> 5. Explore Relationships Between Variables </span>
- Compute correlation matrices for numeric fields.
- Use scatter plots or boxplots to visualize relationships with price.
- Investigate interactions, for example mileage vs. price across different brands.
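
A minimal sketch of these relationship checks, on made-up data constructed so that newer, lower-mileage cars cost more. The column names follow the dataset (`manufacturer`, `year`, `odometer`, `price`); the values do not.

```python
import pandas as pd

# Made-up listings for two brands (illustration only).
toy = pd.DataFrame({
    "manufacturer": ["ford"] * 4 + ["honda"] * 4,
    "year": [2010, 2013, 2016, 2019, 2010, 2013, 2016, 2019],
    "odometer": [150000, 110000, 70000, 30000, 140000, 100000, 65000, 25000],
    "price": [5000, 9000, 15000, 24000, 6500, 11000, 17000, 26000],
})

# Pearson correlation matrix for the numeric fields.
corr = toy[["price", "year", "odometer"]].corr()

# Mileage-price correlation within each brand, to see whether the
# relationship differs across manufacturers.
by_brand = {
    brand: group["odometer"].corr(group["price"])
    for brand, group in toy.groupby("manufacturer")
}

# Median price per brand, the numeric counterpart of a boxplot by category.
median_price = toy.groupby("manufacturer")["price"].median()

print(corr.round(2))
print(by_brand)
print(median_price)
```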

#### <span style="color: #ffa8B6;"> 6. Identify Potential Feature Engineering Needs </span>
- Consider new features such as:  
  - *Vehicle age* (current year – model year)  
  - *Mileage per year*  
  - *Brand segment* (Pickup, Truck, SUV, Hatchback)  
  - *Regional indicators*
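
The first, second, and fourth of these features could be derived roughly as follows. This is a sketch on made-up rows; `REFERENCE_YEAR` is an assumed collection year, and treating an age of 0 as one year when computing mileage per year is a choice made here to avoid division by zero.

```python
import pandas as pd

# Made-up listings (illustration only).
toy = pd.DataFrame({
    "year": [2016, 2009, 2022, 2012],
    "odometer": [57181, 143000, 5000, 120500],
    "region": ["kansas city, MO", "denver", "denver", "austin"],
})

REFERENCE_YEAR = 2022  # assumption: year the listings were collected

# Vehicle age in years, floored at 0 for model years ahead of the reference year.
toy["age"] = (REFERENCE_YEAR - toy["year"]).clip(lower=0)

# Mileage per year; an age of 0 is treated as 1 year to avoid dividing by zero.
toy["miles_per_year"] = toy["odometer"] / toy["age"].replace(0, 1)

# Regional indicators: one 0/1 column per region.
region_dummies = pd.get_dummies(toy["region"], prefix="region")
features = pd.concat([toy.drop(columns="region"), region_dummies], axis=1)

print(features)
```

With 404 distinct regions in the full dataset, one indicator column per region may be unwieldy; grouping regions into broader markets before encoding is one alternative.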

#### <span style="color: #ffa8B6;"> 7. Validate Data Relevance to the Business Objective </span>
- Assess whether available data supports the goal of predicting car prices.
- Ensure essential factors (for example condition, make/model, and cylinders) are present.
- Align dataset capabilities with dealership questions.

#### <span style="color: #ffa8B6;"> 8. Document Data Limitations </span>
- Note missing attributes that might impact model fidelity.
- Identify sampling or geographic biases.
- Record assumptions that may require clarification.

##### <span style="color: #a28089;"> Notes: </span>

## <span style="color: #9df9ef;"> Data Preparation </span>

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

### <span style="color: #edf756;"> Imports </span>

```python
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

from sklearn.preprocessing import StandardScaler

import sys
from pathlib import Path

# Put the project root on the import path so the local `helpers` package resolves.
parent = Path.cwd().parent
sys.path.insert(0, str(parent))

from helpers.data_cleaners import summary_stats

# Silence library warnings to keep the notebook output readable.
import warnings
warnings.filterwarnings("ignore")
```

### <span style="color: #edf756;"> Load Data </span>

```python
df = pd.read_csv('../data/vehicles.csv')
```

### <span style="color: #edf756;"> Raw Data Stats </span>
Use `summary_stats` to flag columns whose values are all unique (identifier-like, such as `id`) and columns with only a single value (no variation); neither kind carries useful signal for modeling.

```python
summary_stats(df)
```

| column | dtype | sample_val | vals | miss_pct | unique | mean | mode | min | max | std | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | int64 | 7311443555 | 426880 | 0.0 | 426880 | 7311487000.0 | 7207408000.0 | 7207408000.0 | 7317101000.0 | 4473170.41 | -1.4 | 17.1 |
| region | object | kansas city, MO | 426880 | 0.0 | 404 | | | | | | | |
| price | int64 | 26985 | 426880 | 0.0 | 15655 | 75199.03 | 0.0 | 0.0 | 3736929000.0 | 12182282.17 | 254.4 | 69205.1 |
| year | float64 | 2016.0 | 425675 | 0.3 | 114 | 2011.24 | 2017.0 | 1900.0 | 2022.0 | 9.45 | -3.6 | 19.6 |
| manufacturer | object | rover | 409234 | 4.1 | 42 | | | | | | | |
| model | object | cr-v ex l 4dr suv | 421603 | 1.2 | 29649 | | | | | | | |
| condition | object | excellent | 252776 | 40.8 | 6 | | | | | | | |
| cylinders | object | 4 cylinders | 249202 | 41.6 | 8 | | | | | | | |
| fuel | object | gas | 423867 | 0.7 | 5 | | | | | | | |
| odometer | float64 | 57181.0 | 422480 | 1.0 | 104870 | 98043.33 | 100000.0 | 0.0 | 10000000.0 | 213881.5 | 38.0 | 1690.8 |
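
The statistics above show a modal price of 0, a maximum price of about 3.7 billion, and odometer readings up to 10,000,000, so some filtering is likely needed before modeling. The sketch below shows one possible cleaning pass on made-up rows. `PRICE_RANGE` and `MAX_ODOMETER` are hypothetical bounds chosen for illustration, not values derived from the data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows that mimic the problems seen in the summary (illustration only).
toy = pd.DataFrame({
    "price": [26985, 0, 15500, 3736929000, 8200],
    "year": [2016.0, 2009.0, np.nan, 2015.0, 2012.0],
    "odometer": [57181.0, 143000.0, 98000.0, 10000000.0, 120500.0],
})

# Hypothetical plausibility bounds (assumptions for this sketch).
PRICE_RANGE = (500, 150_000)
MAX_ODOMETER = 500_000

# Keep rows with a plausible price and mileage, then drop rows missing year.
plausible = toy["price"].between(*PRICE_RANGE) & (toy["odometer"] <= MAX_ODOMETER)
clean = toy[plausible].dropna(subset=["year"]).copy()

# Log-transform the heavily right-skewed target.
clean["log_price"] = np.log1p(clean["price"])

# Standardize numeric features to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(clean[["year", "odometer"]])
clean["year_z"] = scaled[:, 0]
clean["odometer_z"] = scaled[:, 1]

print(clean)
```

In a full workflow the scaler would normally be fit on the training split only, for example inside an `sklearn` `Pipeline`, so that test data do not leak into the scaling parameters.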


## <span style="color: #9df9ef;"> Modeling </span>

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
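
One possible shape for this step is a preprocessing-plus-model pipeline tuned with cross-validated grid search. The sketch below runs on synthetic data generated in the snippet itself, so the feature names (`age`, `odometer`, `fuel`), the coefficients used to generate `log_price`, and the alpha grid are all assumptions for illustration. Ridge regression is only one of the candidate models.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: log-price falls with age and mileage (illustration only).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(0, 20, n),
    "odometer": rng.uniform(5_000, 200_000, n),
    "fuel": rng.choice(["gas", "diesel"], n),
})
toy["log_price"] = (
    10.5
    - 0.06 * toy["age"]
    - 3e-6 * toy["odometer"]
    + 0.1 * (toy["fuel"] == "diesel")
    + rng.normal(0, 0.1, n)
)

# Scale numeric columns and one-hot encode categorical ones inside the
# pipeline, so each cross-validation fold is preprocessed independently.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "odometer"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
])
model = Pipeline([("prep", preprocess), ("ridge", Ridge())])

# 5-fold cross-validated search over the regularization strength.
search = GridSearchCV(
    model,
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0, 100.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_absolute_error",
)
search.fit(toy[["age", "odometer", "fuel"]], toy["log_price"])

best_alpha = search.best_params_["ridge__alpha"]
cv_mae = -search.best_score_  # mean absolute error in log-price units
print(best_alpha, round(cv_mae, 3))
```

Swapping `Ridge()` for another estimator (for example `Lasso` or `RandomForestRegressor`) and adjusting `param_grid` lets several model families be compared under the same cross-validation splits.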

## <span style="color: #9df9ef;"> Evaluation </span>

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
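
To connect model quality back to price drivers, hold-out metrics can be paired with permutation importance, which measures how much the test score drops when one feature is shuffled. The sketch below uses synthetic data in which `age` matters most by construction and `noise` has no effect; the names, coefficients, and the choice of a random forest are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: `age` dominates, `noise` is irrelevant (illustration only).
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "age": rng.integers(0, 20, n),
    "odometer": rng.uniform(5_000, 200_000, n),
    "noise": rng.normal(size=n),
})
y = 10.5 - 0.08 * X["age"] - 2e-6 * X["odometer"] + rng.normal(0, 0.1, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Hold-out performance.
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)

# Permutation importance on the test set: mean drop in R^2 when a feature
# is shuffled, averaged over 10 repeats.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = pd.Series(result.importances_mean, index=X.columns).sort_values(
    ascending=False
)

print(round(mae, 3), round(r2, 3))
print(ranking)
```

Comparing this ranking across model families is one way to check the "stable feature ranking" success criterion; strongly correlated features (such as age and mileage) can share or trade importance, so the ranking should be read with that in mind.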

## <span style="color: #9df9ef;"> Deployment </span>

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.