# What drives the price of a car?

![](images/kurt.jpeg)

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
contentRoot = '/content/drive/MyDrive/Github/machinelearning/used-car-pricing'

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer, TransformedTargetRegressor
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SequentialFeatureSelector
import numpy as np
import plotly.express as px
import pandas as pd
import warnings
from sklearn.model_selection import train_test_split

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars; the provided dataset contains information on 426K cars to keep processing fast. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client, a used car dealership, on what consumers value in a used car.

### CRISP-DM Framework


To frame the task, throughout our practical applications we will refer back to CRISP-DM, an industry-standard process for data projects. This process provides a framework for working through a data problem. Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip). After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.



---

#### Business Objective
The client for this task is a used car dealership, whose interests are:


1.   Identify the right price to set for a used car in their inventory
2.   Stock cars in inventory that are likely to sell

With this information, the business objectives of this task, or the outputs, can be understood as:

*   (Primary) Given the attributes of an automobile, identify the features that are good predictors of the price at which it will sell
*   (Secondary) Given the attributes of an automobile, predict the price it will sell for
*   (Future) Identify the features that contribute most to the sale of a used car. In other words, given the attributes of a customer and an automobile, which attributes are good predictors of a used car sale? Since the provided dataset does not have enough information to predict this outcome, this can be treated as future work.

#### Situation Assessment
The major risks and constraints associated with this task are:


*   The quality of the dataset might affect the outcome. This can be mitigated to an extent by cleaning the data and checking models on train/test data splits.
*   New factors affecting the price of a used car (for example, market conditions or a pandemic) could limit how well our predictions apply to the client.
*   The time available to complete this task is only one week, which is a major constraint.

#### Data Mining Goals

The goal of this task is to follow the CRISP-DM framework to deliver a README.md and Jupyter notebook of the analysis. The final result will provide clear directions to the client on how to proceed with pricing their used car inventory.

---



### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

#### Import and explore the data

In [7]:
data = pd.read_csv('{}/data/vehicles.csv'.format(contentRoot))

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [9]:
data.head()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
0,7222695916,prescott,6000,,,,,,,,,,,,,,,az
1,7218891961,fayetteville,11900,,,,,,,,,,,,,,,ar
2,7221797935,florida keys,21000,,,,,,,,,,,,,,,fl
3,7222270760,worcester / central MA,1500,,,,,,,,,,,,,,,ma
4,7210384030,greensboro,4900,,,,,,,,,,,,,,,nc


In [10]:
print(data.isna().sum())

id                   0
region               0
price                0
year              1205
manufacturer     17646
model             5277
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
VIN             161042
drive           130567
size            306361
type             92858
paint_color     130203
state                0
dtype: int64


In [11]:
num_unique = data.nunique(axis=0)

print("No.of.unique values in each column :\n", num_unique)

No.of.unique values in each column :
 id              426880
region             404
price            15655
year               114
manufacturer        42
model            29649
condition            6
cylinders            8
fuel                 5
odometer        104870
title_status         6
transmission         3
VIN             118246
drive                3
size                 4
type                13
paint_color         12
state               51
dtype: int64




---


##### Column Types
Most columns in the dataset are categorical, and many can be one-hot encoded to prepare for analysis. Some have high cardinality: `model` has 29,649 unique values, so its categories may need to be grouped or clustered before one-hot encoding. Other relatively high-cardinality columns are `region` (404), `year` (114), and `manufacturer` (42).
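
One simple way to tame a high-cardinality column before one-hot encoding is to fold rare categories into a single bucket. The sketch below uses frequency-based grouping rather than clustering, and it runs on a made-up series; `collapse_rare`, `min_count`, and the toy values are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def collapse_rare(series: pd.Series, min_count: int, other: str = "other") -> pd.Series:
    """Replace categories seen fewer than min_count times with one label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; swap the rare ones for the `other` label.
    return series.where(~series.isin(rare), other)


# Toy stand-in for the `model` column (values are invented for illustration).
toy_models = pd.Series(
    ["f-150", "f-150", "civic", "civic", "civic", "model x", "rare trim"],
    name="model",
)

collapsed = collapse_rare(toy_models, min_count=2)
encoded = pd.get_dummies(collapsed, prefix="model")
print(encoded.columns.tolist())
```

scikit-learn 1.1 and later also offers a `min_frequency` option on `OneHotEncoder` that groups infrequent categories during encoding, which may be more convenient inside a pipeline.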

##### Data Quality
The data has a large number of NaNs, especially in the columns `size` (about 72% missing), `cylinders` (42%), `condition` (41%), `drive` (31%), and `paint_color` (31%). The column `VIN` also has NaNs (about 38% missing); because a VIN identifies an individual car, it is unlikely to help predict price, and the column can be dropped.
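
A quick way to compare missingness across columns is to look at the share of NaNs rather than raw counts. The following is a minimal sketch on a tiny made-up frame (the `toy` data and variable names are assumptions for illustration); the same two calls should apply to the full `data` frame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the vehicles data (values are invented for illustration).
toy = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500],
    "size": [np.nan, np.nan, "full-size", np.nan],
    "condition": ["good", np.nan, "excellent", np.nan],
    "VIN": ["A1", np.nan, "C3", "D4"],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Drop the identifier column, which is not expected to carry price signal.
toy_no_vin = toy.drop(columns=["VIN"])
```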


---



### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.
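
As a rough illustration of the transformations mentioned above, the sketch below scales numeric columns, one-hot encodes a categorical column, and log-transforms the price. The toy frame, the choice of columns, and the use of `log1p` are assumptions made for the example, not the final preparation used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for a cleaned subset of the data (values are invented).
toy = pd.DataFrame({
    "year": [2012.0, 2015.0, 2018.0, 2020.0],
    "odometer": [120000.0, 90000.0, 40000.0, 15000.0],
    "fuel": ["gas", "diesel", "gas", "electric"],
    "price": [6000, 11900, 21000, 35000],
})

# Scale numeric features and one-hot encode the categorical one.
preprocess = make_column_transformer(
    (StandardScaler(), ["year", "odometer"]),
    (OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
)
X = preprocess.fit_transform(toy.drop(columns=["price"]))

# Log-transform the target to reduce the skew typical of prices.
y_log = np.log1p(toy["price"])
print(X.shape)
```

`handle_unknown="ignore"` keeps the encoder from failing on categories that appear only at prediction time.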

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
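
One possible shape for a cross-validated model is a scaled ridge regression on log price, with the regularization strength chosen by grid search. This is only a sketch under stated assumptions: the data is synthetic, and the feature names, coefficients, and `alpha` grid are invented for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: price falls with age and mileage (made-up relationship).
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(0, 20, n)
miles = rng.uniform(0, 200_000, n)
log_price = 10.5 - 0.08 * age - 0.000004 * miles + rng.normal(0, 0.1, n)
X = np.column_stack([age, miles])
y = np.exp(log_price)

# Fit on log(price) and convert predictions back to the price scale.
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
model = TransformedTargetRegressor(regressor=pipe, func=np.log, inverse_func=np.exp)

# 5-fold cross-validation over a small grid of regularization strengths.
search = GridSearchCV(
    model,
    param_grid={"regressor__ridge__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

The same pattern extends to `Lasso` or to a pipeline with `PolynomialFeatures` by changing the regressor and the parameter grid.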

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
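
Permutation importance on a held-out split is one way to see which features a fitted model relies on. The sketch below is illustrative only: it uses synthetic data in which `age` drives price and `noise_feature` does not, and both names are assumptions rather than columns from the dataset.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: price depends on age; the second feature is pure noise.
rng = np.random.default_rng(0)
n = 300
age = rng.uniform(0, 20, n)
noise_feature = rng.normal(size=n)
X = np.column_stack([age, noise_feature])
y = 30000 - 1200 * age + rng.normal(0, 500, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Shuffle each feature on the test split and measure the drop in score.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in zip(["age", "noise_feature"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```

A feature whose shuffling barely changes the score contributes little to this model's predictions, which helps separate likely price drivers from incidental columns.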

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.