Practical Application 11.1 — CRISP-DM Used Car Price Analysis
This project explores a Kaggle dataset of ~426K used-car listings from Craigslist to determine which vehicle attributes most strongly influence price. The findings are translated into actionable inventory recommendations for a used-car dealership.
The full analysis lives in prompt_II.ipynb.
.
├── README.md # this file
├── prompt_II.ipynb # full CRISP-DM analysis
└── data/
    └── vehicles.csv # raw dataset (download separately, see below)
- Clone this repository.
- Download the dataset from the assignment link and place vehicles.csv in the data/ folder.
- Install dependencies:
pip install pandas numpy scikit-learn matplotlib seaborn jupyter
- Launch the notebook:
jupyter notebook prompt_II.ipynb
- Run all cells.
- Business Understanding — translate the dealer's question into a regression problem with RMSE as the success metric.
- Data Understanding — profile the 426K-row dataset, examine missingness, distribution of price, and cardinality of categoricals.
- Data Preparation — drop identifiers, filter implausible prices/mileages/years, engineer car_age, impute missing values, and one-hot encode categoricals (with rare-category bucketing).
- Modeling — fit five regression models (Linear, Ridge, Lasso, Random Forest, Gradient Boosting) with 5-fold cross-validation; only the linear models are tuned, while the ensemble models use fixed hyperparameters.
- Evaluation — compare models on test-set RMSE, MAE, and R²; interpret the surviving Ridge coefficients to identify price drivers.
- Deployment — translate model output into plain-English recommendations for the dealership.
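The Data Preparation step can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the column names price, year, odometer, and manufacturer, the plausibility bounds, the reference year, and the rare-category threshold are all assumptions chosen for the toy example, and imputation is omitted for brevity.

```python
import pandas as pd


def prepare(df, reference_year=2022, min_count=2):
    """Filter implausible rows, add car_age, bucket rare categories, one-hot encode."""
    out = df.copy()

    # Illustrative plausibility bounds (assumed, not the notebook's values).
    keep = (
        out["price"].between(500, 150_000)
        & out["odometer"].between(0, 400_000)
        & out["year"].between(1990, reference_year)
    )
    out = out.loc[keep].copy()

    # Engineer car_age from the model year, then drop the raw year.
    out["car_age"] = reference_year - out["year"]
    out = out.drop(columns=["year"])

    # Bucket manufacturers seen fewer than min_count times into "other".
    counts = out["manufacturer"].value_counts()
    rare = counts[counts < min_count].index
    is_rare = out["manufacturer"].isin(rare)
    out["manufacturer"] = out["manufacturer"].where(~is_rare, "other")

    # One-hot encode the (bucketed) categorical column.
    return pd.get_dummies(out, columns=["manufacturer"])


# Toy rows: one zero price and one absurdly high price should be filtered out.
toy = pd.DataFrame(
    {
        "price": [12000, 0, 18500, 7000, 950000],
        "year": [2015, 2018, 2019, 2008, 2020],
        "odometer": [80000, 40000, 30000, 150000, 10000],
        "manufacturer": ["toyota", "ford", "toyota", "saab", "ford"],
    }
)
prepared = prepare(toy)
print(prepared)
```

Bucketing before encoding keeps the one-hot matrix small: a manufacturer that appears only a handful of times would otherwise get its own column with almost no data behind its coefficient.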
After cleaning the data and fitting five regression models, the dominant patterns in used-car pricing are:
- Age and mileage are the two biggest negative drivers of price. Each additional year of age and each additional 10K miles takes a measurable, consistent bite out of resale value.
- Condition only really matters at the top. New and like-new vehicles command a clear premium; the gap between good, excellent, and fair is comparatively small.
- Powertrain configuration carries a premium. Diesel engines, 4-wheel drive, and pickup/truck body types consistently outprice their gasoline / 2wd / sedan counterparts even after controlling for age and mileage.
- Brand matters. Toyota, Ford, RAM, and GMC hold value notably better than budget brands.
- Many listing fields add little signal. Lasso pruned a large share of the categorical features — paint colour and several minor fields don't meaningfully move price.
- Non-linear models outperform linear ones. Random Forest and Gradient Boosting capture interaction effects (e.g., age × mileage × condition) that Ridge and Lasso cannot, delivering an estimated 15–25% lower RMSE.
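To illustrate the last finding, the sketch below compares Ridge and Random Forest on synthetic data in which age and mileage interact multiplicatively. The data-generating formula, noise level, and hyperparameters are assumptions made up for this example; it shows how a relative RMSE reduction is computed, not the figures from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(0, 20, n)        # years
miles_10k = rng.uniform(0, 20, n)  # odometer in units of 10K miles

# Synthetic price (assumed): depreciation compounds, so age and mileage interact.
price = 30_000 * np.exp(-0.10 * age - 0.05 * miles_10k) + rng.normal(0, 500, n)

X = np.column_stack([age, miles_10k])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)


def rmse(model):
    """Fit on the training split and return RMSE on the held-out split."""
    model.fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))


ridge_rmse = rmse(Ridge(alpha=1.0))
rf_rmse = rmse(RandomForestRegressor(n_estimators=100, random_state=0))

# Relative RMSE reduction of the forest over the linear baseline.
reduction = 1 - rf_rmse / ridge_rmse
print(f"Ridge RMSE: {ridge_rmse:,.0f}")
print(f"Random Forest RMSE: {rf_rmse:,.0f}")
print(f"Relative reduction: {reduction:.0%}")
```

A linear model can only fit a plane through this surface, so its error stays large wherever the curvature is strongest; the tree ensemble approximates the curved surface piecewise.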
| Recommendation | Why |
|---|---|
| Prioritise low-age, low-mileage inventory. | Age and odometer are the strongest negative coefficients in every model. |
| Stock diesel pickups and 4wd SUVs. | They consistently price above sedans of similar age. |
| Don't overpay for "excellent" over "good" condition. | Price differences between condition tiers below like new are small. |
| Lean toward Toyota, Ford, RAM, GMC. | Top positive manufacturer coefficients. |
| Don't worry about paint colour. | Negligible impact once other features are controlled for. |
- The data are listing prices, not sale prices — actual transactions likely run ~5–10% lower.
- Geographic features (state/region) were dropped in this iteration; a dealer in a rural truck market and one in a dense urban market will see meaningfully different effects from drive and type.
- Listing quality (photos, description text) is invisible to the model: two identical cars can sell at very different prices based purely on presentation.
- The ensemble models (Random Forest, Gradient Boosting) use fixed hyperparameters and have not been tuned, so the reported results likely understate what they can achieve.
- Tune the ensemble models. Random Forest and Gradient Boosting already outperform the linear models; a GridSearchCV over n_estimators, max_depth, and min_samples_leaf (RF) or learning_rate and max_features (GB) could squeeze out another 5–10% RMSE improvement.
- Re-introduce geography via target encoding on state or region to capture local market effects.
- Retrain on actual sale prices if available; the current data are listing prices, which run ~5–10% above real transaction values.
- Build a price-suggestion tool. Wrap the final model in a small internal app so a buyer at auction can enter year/mileage/condition and get a recommended bid ceiling.
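The tuning step suggested above could look roughly like this for Random Forest. It is a sketch under stated assumptions: the synthetic data, the grid values, and the choice of 5 folds are illustrative and deliberately tiny so the example runs quickly; a real search on the full dataset would use a wider grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for the prepared features (assumed): age and mileage.
X = np.column_stack([rng.uniform(0, 20, n), rng.uniform(0, 20, n)])
y = 30_000 * np.exp(-0.10 * X[:, 0] - 0.05 * X[:, 1]) + rng.normal(0, 500, n)

# Small illustrative grid over the three parameters named above.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [4, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises, hence the sign
    cv=5,
)
search.fit(X, y)

# Flip the sign back to get a positive cross-validated RMSE.
best_rmse = -search.best_score_
print("Best parameters:", search.best_params_)
print(f"Cross-validated RMSE: {best_rmse:,.0f}")
```

Scoring on negated RMSE keeps the search aligned with the project's stated success metric, so the selected parameters are the ones that minimise cross-validated RMSE.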