What Drives the Price of a Car?

Practical Application 11.1 — CRISP-DM Used Car Price Analysis

This project explores a Kaggle dataset of ~426K used-car listings from Craigslist to determine which vehicle attributes most strongly influence price. The findings are translated into actionable inventory recommendations for a used-car dealership.

📓 Notebook

The full analysis lives in prompt_II.ipynb.

🗂 Repository Structure

.
├── README.md             # this file
├── prompt_II.ipynb       # full CRISP-DM analysis
└── data/
    └── vehicles.csv      # raw dataset (download separately, see below)

🚀 How to Reproduce

Clone this repository.
Download the dataset from the assignment link and place vehicles.csv in the data/ folder.

Install dependencies:

pip install pandas numpy scikit-learn matplotlib seaborn jupyter

Launch the notebook:
```
jupyter notebook prompt_II.ipynb
```
Run all cells.

🧭 CRISP-DM Steps Followed

Business Understanding: translate the dealer's question into a regression problem with RMSE as the success metric.
Data Understanding: profile the 426K-row dataset, examining missingness, the distribution of price, and the cardinality of the categorical features.
Data Preparation: drop identifiers, filter implausible prices/mileages/years, engineer car_age, impute missing values, and one-hot encode categoricals (with rare-category bucketing).
Modeling: fit five regression models (Linear, Ridge, Lasso, Random Forest, Gradient Boosting) and score them with 5-fold cross-validation. Only the linear models are tuned; the ensemble models use fixed, sensible hyperparameters.
Evaluation: compare models on test-set RMSE, MAE, and R²; interpret the linear-model coefficients (Ridge, and the features Lasso retains) to identify price drivers.
Deployment: translate model output into plain-English recommendations for the dealership.
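
The Data Preparation step can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the helper names `bucket_rare` and `prepare`, the plausibility bounds, the reference year, the median imputation, and the `min_count` threshold are all assumptions chosen for the example, and the tiny frame stands in for `vehicles.csv`.

```python
import numpy as np
import pandas as pd


def bucket_rare(series: pd.Series, min_count: int) -> pd.Series:
    """Replace categories seen fewer than min_count times with 'other'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "other")


def prepare(df, cat_cols, ref_year=2022, min_count=2):
    """Filter, engineer car_age, impute, and one-hot encode (hypothetical helper)."""
    df = df.drop(columns=["id", "VIN"], errors="ignore")  # identifiers
    # Assumed plausibility bounds; the notebook's actual cut-offs may differ.
    keep = (
        df["price"].between(500, 150_000)
        & (df["odometer"].isna() | df["odometer"].between(0, 400_000))
        & df["year"].between(1990, ref_year)
    )
    df = df.loc[keep].copy()
    df["car_age"] = ref_year - df["year"]
    df = df.drop(columns="year")
    # Median imputation is an assumption; other strategies are equally plausible.
    df["odometer"] = df["odometer"].fillna(df["odometer"].median())
    for col in cat_cols:
        df[col] = bucket_rare(df[col].fillna("unknown"), min_count)
    return pd.get_dummies(df, columns=list(cat_cols))


# Toy stand-in for the raw listings.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "price": [12000, 0, 8500, 30000, 15000],
    "year": [2015, 2010, 2008, 2019, 2016],
    "odometer": [60000, 90000, np.nan, 20000, 55000],
    "manufacturer": ["toyota", "ford", "toyota", "ram", "toyota"],
    "condition": ["good", None, "good", "like new", "good"],
})
clean = prepare(raw, cat_cols=["manufacturer", "condition"])
print(clean.columns.tolist())
```

Bucketing before encoding keeps the one-hot matrix from growing a column for every manufacturer or model that appears only a handful of times.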

🔑 Key Findings

After cleaning the data and fitting five regression models, the dominant patterns in used-car pricing are:

Age and mileage are the two biggest negative drivers of price. Each additional year of age and each additional 10K miles takes a measurable, consistent bite out of resale value.
Condition only really matters at the top. New and like-new vehicles command a clear premium; the gap between good, excellent, and fair is comparatively small.
Powertrain configuration carries a premium. Diesel engines, 4-wheel drive, and pickup/truck body types consistently outprice their gasoline / 2wd / sedan counterparts even after controlling for age and mileage.
Brand matters. Toyota, Ford, RAM, and GMC hold value notably better than budget brands.
Many listing fields add little signal. Lasso pruned a large share of the categorical features — paint colour and several minor fields don't meaningfully move price.
Non-linear models outperform linear ones. Random Forest and Gradient Boosting capture interaction effects (e.g., age × mileage × condition) that Ridge and Lasso cannot, delivering an estimated 15–25% lower RMSE.
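
The last finding can be illustrated on synthetic data. The sketch below is not drawn from the notebook: `age` and `miles` are made-up standardized features, the target is constructed to contain an age × mileage interaction, and the helper `cv_rmse` is hypothetical. It only shows why a tree ensemble can beat an additive linear model when such an interaction exists; the size of the gain here says nothing about the 15–25% figure above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 600
# Synthetic, standardized stand-ins for age and mileage (not the real data).
age = rng.normal(size=n)
miles = rng.normal(size=n)
X = np.column_stack([age, miles])
# The target includes an age x mileage product that an additive model cannot express.
y = -1.0 * age - 1.0 * miles + 2.0 * age * miles + rng.normal(scale=0.1, size=n)


def cv_rmse(model) -> float:
    """Mean 5-fold cross-validated RMSE for a scikit-learn regressor."""
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    return float(-scores.mean())


rmse_ridge = cv_rmse(Ridge(alpha=1.0))
rmse_rf = cv_rmse(RandomForestRegressor(n_estimators=50, random_state=0))
# Relative improvement of the forest over the linear baseline.
relative_gain = (rmse_ridge - rmse_rf) / rmse_ridge
print(f"Ridge RMSE {rmse_ridge:.2f}, RF RMSE {rmse_rf:.2f}, gain {relative_gain:.0%}")
```

The same relative-gain formula, `(rmse_linear - rmse_ensemble) / rmse_linear`, is one way to express a claim such as "15–25% lower RMSE".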

💡 Recommendations to the Dealership

| Recommendation | Why |
| --- | --- |
| Prioritise low-age, low-mileage inventory. | Age and odometer carry the strongest negative coefficients in the linear models and are the biggest negative price drivers overall. |
| Stock diesel pickups and 4wd SUVs. | They consistently price above sedans of similar age. |
| Don't over-pay for "excellent" versus "good" condition. | The price premium largely disappears below "like new". |
| Lean toward Toyota, Ford, RAM, GMC. | They have the largest positive manufacturer coefficients. |
| Don't worry about paint colour. | Negligible impact once other features are controlled for. |

⚠️ Caveats

The data are listing prices, not sale prices — actual transactions likely run ~5–10% lower.
Geographic features (state/region) were dropped in this iteration; a dealer in a rural truck market and one in a dense urban market will see meaningfully different effects from drive and type.
Listing quality (photos, description text) is invisible to the model — two identical cars can sell at very different prices based purely on presentation.
The ensemble models (Random Forest, Gradient Boosting) use sensible defaults but have not been hyperparameter-tuned; their true ceiling is higher than the numbers shown.
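
As a rough worked example of the first caveat, a predicted listing price can be converted into a plausible sale-price band. The helper name `sale_price_range` is hypothetical, and the 5–10% discount is the approximate figure quoted above, not something measured from this dataset.

```python
def sale_price_range(listing_price: float, low: float = 0.05, high: float = 0.10):
    """Return (lowest, highest) plausible sale price for a given listing price.

    low and high are the assumed fractional discounts from listing to sale.
    """
    if not 0 <= low <= high < 1:
        raise ValueError("expected 0 <= low <= high < 1")
    return listing_price * (1 - high), listing_price * (1 - low)


floor, ceiling = sale_price_range(20_000)
print(f"${floor:,.0f} to ${ceiling:,.0f}")
```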

🔭 Next Steps

Tune the ensemble models. Random Forest and Gradient Boosting already outperform the linear models; a GridSearchCV over n_estimators, max_depth, and min_samples_leaf (RF) or learning_rate and max_features (GB) could squeeze out another 5–10% RMSE improvement.
Re-introduce geography via target encoding on state or region to capture local market effects.
Retrain on actual sale prices if available — the current data are listing prices, which run ~5–10% above real transaction values.
Build a price-suggestion tool. Wrap the final model in a small internal app so a buyer at auction can enter year/mileage/condition and get a recommended bid ceiling.
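
The tuning step suggested above might look something like the sketch below. It runs on synthetic data from `make_regression` rather than the vehicle listings, and the grid values are small placeholders picked so the example runs quickly; a real search would use wider ranges and the prepared feature matrix.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared feature matrix and prices.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Placeholder grid over the Random Forest parameters named above.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the negation
)
search.fit(X, y)

best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 2))
```

Scoring on negative RMSE keeps the search aligned with the project's stated success metric.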
