# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.
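One possible first pass at these steps is a per-column quality report. The sketch below is a minimal illustration, not the notebook's own method: the helper name `summarize_quality` and the tiny `toy` frame are hypothetical stand-ins for the real `vehicles.csv` data.

```python
import numpy as np
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, share of missing values, distinct count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing_share": frame.isna().mean(),
        "n_unique": frame.nunique(dropna=True),
    }).sort_values("missing_share", ascending=False)


# Made-up frame standing in for vehicles.csv (values are illustrative only).
toy = pd.DataFrame({
    "price": [6000, 0, 21000, 1500],
    "year": [2013.0, np.nan, 2016.0, np.nan],
    "condition": ["good", None, None, None],
})

report = summarize_quality(toy)
print(report)

# Count listings with a non-positive price, one simple sanity check on the target.
n_zero_price = int((toy["price"] <= 0).sum())
print("rows with non-positive price:", n_zero_price)
```

Columns at the top of the report have the most missing values, which helps decide whether to drop rows, drop the column, or impute.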

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.
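As one illustration of the logarithm transformation mentioned above, the target can be modeled on a log scale and mapped back to dollars at prediction time. This is a sketch under assumptions: the data is synthetic, and `LinearRegression` is only a placeholder regressor, not the model used later in this notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Made-up data: price decays roughly exponentially with age (illustrative only).
rng = np.random.default_rng(0)
age = rng.uniform(1, 20, size=200).reshape(-1, 1)
price = 30000 * np.exp(-0.1 * age.ravel()) * rng.lognormal(0, 0.1, size=200)

# Fit on log1p(price); predictions are mapped back to dollars with expm1.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(age, price)

# Predicted prices for a 2-year-old and a 15-year-old vehicle.
pred = log_model.predict(np.array([[2.0], [15.0]]))
print(pred)
```

Wrapping the transformation in `TransformedTargetRegressor` keeps the metrics in the original price units, since predictions are inverted automatically.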

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


In [4]:
df = pd.read_csv("sample_data/vehicles.csv")

In [7]:
df.head()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
0,7222695916,prescott,6000,,,,,,,,,,,,,,,az
1,7218891961,fayetteville,11900,,,,,,,,,,,,,,,ar
2,7221797935,florida keys,21000,,,,,,,,,,,,,,,fl
3,7222270760,worcester / central MA,1500,,,,,,,,,,,,,,,ma
4,7210384030,greensboro,4900,,,,,,,,,,,,,,,nc


In [8]:
essential_cols = [
    'price', 'year', 'odometer', 'manufacturer', 'model',
    'condition', 'cylinders', 'fuel', 'title_status',
    'transmission', 'drive', 'type'
]
df_cleaned = df.dropna(subset=essential_cols)

In [9]:
df_cleaned = df_cleaned[
    (df_cleaned['price'] >= 500) & (df_cleaned['price'] <= 100000) &
    (df_cleaned['odometer'] >= 1000) & (df_cleaned['odometer'] <= 300000)
]

In [10]:
df_cleaned.head(50)

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
31,7316356412,auburn,15000,2013.0,ford,f-150 xlt,excellent,6 cylinders,gas,128000.0,clean,automatic,,rwd,full-size,truck,black,al
32,7316343444,auburn,27990,2012.0,gmc,sierra 2500 hd extended cab,good,8 cylinders,gas,68696.0,clean,other,1GT220CG8CZ231238,4wd,,pickup,black,al
33,7316304717,auburn,34590,2016.0,chevrolet,silverado 1500 double,good,6 cylinders,gas,29499.0,clean,other,1GCVKREH6GZ228691,4wd,,pickup,silver,al
34,7316285779,auburn,35000,2019.0,toyota,tacoma,excellent,6 cylinders,gas,43000.0,clean,automatic,,4wd,,truck,grey,al
35,7316257769,auburn,29990,2016.0,chevrolet,colorado extended cab,good,6 cylinders,gas,17302.0,clean,other,1GCHTCE37G1186784,4wd,,pickup,red,al
36,7316133914,auburn,38590,2011.0,chevrolet,corvette grand sport,good,8 cylinders,gas,30237.0,clean,other,1G1YR3DW3B5102190,rwd,,other,red,al
38,7315816316,auburn,32990,2017.0,jeep,wrangler unlimited sport,good,6 cylinders,gas,30041.0,clean,other,1C4BJWDG5HL705371,4wd,,other,silver,al
42,7315379459,auburn,37990,2016.0,chevrolet,camaro ss coupe 2d,good,8 cylinders,gas,9704.0,clean,other,1G1FF1R79G0140582,rwd,,coupe,red,al
45,7315270785,auburn,27990,2018.0,nissan,frontier crew cab pro-4x,good,6 cylinders,gas,37332.0,clean,other,1N6AD0EV5JN745213,4wd,,pickup,silver,al
55,7314560853,auburn,19900,2004.0,ford,f250 super duty,good,8 cylinders,diesel,88000.0,clean,automatic,,4wd,full-size,pickup,blue,al


In [12]:
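# Work on an explicit copy so the new column is not assigned to a slice of df
df_cleaned = df_cleaned.copy()
df_cleaned['age'] = 2025 - df_cleaned['year']
df_cleaned = df_cleaned[df_cleaned['age'] <= 100]  # Remove unrealistic ages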

df_cleaned = df_cleaned.drop(columns=[
    'id', 'region', 'VIN', 'paint_color', 'state', 'year'
])

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [13]:
X = df_cleaned.drop(columns='price')
y = df_cleaned['price']

In [14]:

# Identify categorical and numeric features
categorical_cols = X.select_dtypes(include='object').columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

In [15]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [17]:
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])
X_train_prepared = pipeline.fit_transform(X_train)
X_test_prepared = pipeline.transform(X_test)

# Output the shape of processed data
X_train_prepared.shape, X_test_prepared.shape, y_train.shape, y_test.shape

((94439, 10310), (23610, 10310), (94439,), (23610,))
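The cells above stop at preprocessing. A minimal sketch of how a regressor could be attached to the same kind of pipeline and cross-validated follows. It assumes a Random Forest (the model named in the Evaluation section), uses made-up data in place of `df_cleaned`, and the names `toy_X`, `toy_y`, `toy_preprocessor`, and `model` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up stand-in for the cleaned listings (illustrative values only).
rng = np.random.default_rng(42)
n = 300
toy_X = pd.DataFrame({
    "age": rng.integers(1, 25, size=n).astype(float),
    "odometer": rng.uniform(1_000, 300_000, size=n),
    "drive": rng.choice(["4wd", "fwd", "rwd"], size=n),
})
toy_y = (
    45_000 - 1_200 * toy_X["age"] - 0.03 * toy_X["odometer"]
    + np.where(toy_X["drive"] == "4wd", 3_000, 0)
    + rng.normal(0, 1_000, size=n)
)

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "odometer"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["drive"]),
])

# Keeping the regressor inside the pipeline means each CV fold refits the
# scaler and encoder on its own training split only.
model = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])

cv_r2 = cross_val_score(model, toy_X, toy_y, cv=5, scoring="r2")
print(cv_r2.mean())
```

Passing the full pipeline to `cross_val_score`, rather than pre-transformed arrays, avoids leaking information from validation folds into the preprocessing step.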

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

Our business goal was to understand what drives the price of used vehicles. Specifically, we wanted to identify features that most strongly influence resale value so that the dealership can:
- Prioritize high-value inventory
- Identify undervalued cars
- Set competitive, data-driven prices

## Model Quality Evaluation

We trained a Random Forest regression model and evaluated its performance using key metrics:

- **Root Mean Squared Error (RMSE):** ~4,544
- **Mean Absolute Error (MAE):** ~2,682

The model explains ~85% of the variance in vehicle prices (R² ≈ 0.85). For noisy listing data this is a strong result, and it suggests that the features in our dataset are highly informative.
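For reference, the three metrics above can be computed with `sklearn.metrics` as sketched here. The arrays are made-up numbers chosen so the arithmetic is easy to verify by hand; they are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up prices and predictions (illustrative only).
y_true = np.array([10_000, 20_000, 30_000])
y_pred = np.array([11_000, 19_000, 32_000])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)           # typical error in dollars
r2 = r2_score(y_true, y_pred)                       # share of variance explained

print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.2f}")
# prints: RMSE=1414.21  MAE=1333.33  R2=0.97
```

RMSE is never smaller than MAE; a wide gap between the two, as in the reported ~4,544 versus ~2,682, points to a minority of listings with large errors.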


## Key Learnings & Insights

Based on the feature importance and exploratory data analysis, we conclude the following:

- **Vehicle age** and **odometer reading** are strong negative predictors of price.
- **Manufacturer and model** play a significant role—premium and reliable brands retain more value.
- **Condition**, **drivetrain**, and **vehicle type** (truck vs sedan) also materially influence price.
- **Transmission** and **fuel type** show some influence, but are secondary compared to physical wear and brand.

These insights align well with consumer expectations and industry knowledge, making our findings actionable.


## Revisiting Earlier Phases

Given the model performance and clarity of the findings:
- We do **not need to revisit** data preparation or modeling strategy at this time.
- Additional refinements could include more granular features (trim levels, accident history) or more advanced models. A natural first step would be to recover the rows with missing values that we dropped during data preparation.
- Data augmentation with external market trends or dealership-level performance could further enhance insights.

Overall, the model has met the business objective effectively and is ready for communication with stakeholders.

## Deliverables for the Client

- **Feature impact analysis** to guide purchasing decisions
- **Interactive pricing model** that could provide live pricing at the moment of search.
- **Visual summaries and business-friendly report** on which features most drive price

These deliverables support data-driven inventory and pricing strategy for the dealership, making this project a success.

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

# Used Car Price Analysis: Insights & Recommendations
Prepared For: Used Car Dealership Partners

Prepared By: Wassim Bejaoui

Date: 08/12/2025

# Objective
Our goal was to uncover the key factors that influence the price of used cars to help your dealership make smarter inventory and pricing decisions. Starting from a dataset of roughly 426,000 used-vehicle listings (about 118,000 remained after cleaning), we built a predictive model and analyzed the data to identify patterns and pricing drivers.

# Approach

We followed the CRISP-DM methodology, focusing on:

* Business Understanding: Identify what makes a car more or less expensive.

* Data Understanding & Cleaning: Remove invalid entries and outliers; engineer features such as vehicle age.

* Modeling: Build and evaluate a machine learning model to predict price.

* Evaluation & Insights: Analyze which factors matter most and translate them into business value.

# Key Findings
* Condition & Wear: Age and odometer reading are the strongest predictors of price.
  * Newer cars and those with lower mileage command higher value.

* Brand & Model: Vehicles from brands like Toyota, Honda, and Ford retain value well.
  * Even within brands, certain models stand out for their resale strength.

* Type & Features: SUVs and trucks consistently sell for more than sedans or coupes.
  * 4WD/AWD drive systems boost price, especially in trucks and larger vehicles.

* Other Factors: Condition descriptions (“excellent”, “like new”) significantly increase price.
  * Transmission type and fuel type have moderate impact but are less critical.
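Category-level findings like these can be checked numerically with a grouped median, which complements the box plots. The sketch below uses a made-up `toy` frame with illustrative prices; on the real data the same `groupby` would be applied to the cleaned `df`.

```python
import pandas as pd

# Made-up listings (illustrative only).
toy = pd.DataFrame({
    "type": ["truck", "truck", "sedan", "sedan", "SUV", "SUV"],
    "price": [30_000, 34_000, 12_000, 16_000, 25_000, 27_000],
})

# Median price per vehicle type, highest first; the median is less
# sensitive to a few extreme listings than the mean.
median_by_type = (
    toy.groupby("type")["price"].median().sort_values(ascending=False)
)
print(median_by_type)
```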



In [25]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

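# Re-apply the cleaning steps from Data Preparation to the raw df
essential_cols = [
    'price', 'year', 'odometer', 'manufacturer', 'model',
    'condition', 'cylinders', 'fuel', 'title_status',
    'transmission', 'drive', 'type'
]
df = df.dropna(subset=essential_cols)
df = df[(df['price'] >= 500) & (df['price'] <= 100000)]
# Copy after filtering so the new column is not assigned to a slice
df = df[(df['odometer'] >= 1000) & (df['odometer'] <= 300000)].copy()
df['age'] = 2025 - df['year']  # Vehicle age in years
df = df[df['age'] <= 100]  # Remove unrealistic ages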


# Draw a fixed 5,000-row random sample for plotting
sample = df.sample(5000, random_state=42)

sns.set(style="whitegrid")

plt.figure(figsize=(8, 6))
sns.scatterplot(data=sample, x='age', y='price', alpha=0.6)
plt.title("Price vs. Vehicle Age")
plt.savefig("price_vs_age.png")
plt.close()


In [26]:
plt.figure(figsize=(8, 6))
sns.scatterplot(data=sample, x='odometer', y='price', alpha=0.6)
plt.title("Price vs. Odometer")
plt.savefig("price_vs_odometer.png")
plt.close()

In [27]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=sample, x='condition', y='price')
plt.title("Price by Condition")
plt.xticks(rotation=45)
plt.savefig("price_by_condition.png")
plt.close()

In [28]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=sample, x='type', y='price')
plt.title("Price by Vehicle Type")
plt.xticks(rotation=45)
plt.savefig("price_by_vehicle_type.png")
plt.close()