# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars; the provided dataset contains information on 426K cars to keep processing fast. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client, a used car dealership, on what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 



As an analyst for a used car dealership, my goal is to determine which features of a car are the strongest predictors of its price. This is a supervised regression problem: price is the target, and the make, model, mechanical specifications, color, and region of each car are the candidate features. Geographic location may also factor into the sale price, depending on the economic conditions of the area. We will fit regression models with feature selection to identify the features that best predict price.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

1. Remove null values. Initial inspection shows that several columns contain nulls, which would make it difficult to fit a model.
2. Remove prices that don't make sense, such as 0 dollar cars.
3. Remove irrelevant features. The VIN and ID columns are unlikely to be useful in predicting the price of the car.
4. Encode the categorical variables. The model cannot interpret string values such as the make and model of the car, so these columns need a numeric encoding.
5. Normalize the data. The numeric features span very different ranges; putting them on a common scale will help the model converge faster.
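Before applying the steps above, it can help to quantify the quality issues first. The following is a minimal sketch, not the notebook's own method: `quality_report` is a hypothetical helper, the toy frame uses made-up values rather than rows from `vehicles.csv`, and the upper price bound is a placeholder assumption.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, share of missing values, and cardinality per column."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'missing_frac': frame.isna().mean(),
        'n_unique': frame.nunique(dropna=True),
    }).sort_values('missing_frac', ascending=False)


# Tiny illustrative frame (made-up values, not rows from vehicles.csv)
toy = pd.DataFrame({
    'price': [0, 4000, 13950, 3736929000],
    'year': [2002.0, np.nan, 2013.0, 2017.0],
    'size': [np.nan, np.nan, 'compact', np.nan],
})

report = quality_report(toy)
print(report)

# Flag implausible prices; the 500,000 upper bound is an assumed placeholder
suspect_price = toy[(toy['price'] <= 0) | (toy['price'] > 500_000)]
print(len(suspect_price), 'rows with suspect prices')
```

Running the same helper on the full dataframe would show which columns lose the most rows under a blanket `dropna()`, which may inform whether to drop a sparse column instead of its rows.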

In [31]:
import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.read_csv('data/vehicles.csv')

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [33]:
df.describe()

| | id | price | year | odometer |
|---|---|---|---|---|
| count | 426880.0 | 426880.0 | 425675.0 | 422480.0 |
| mean | 7311487000.0 | 75199.03 | 2011.235191 | 98043.33 |
| std | 4473170.0 | 12182280.0 | 9.45212 | 213881.5 |
| min | 7207408000.0 | 0.0 | 1900.0 | 0.0 |
| 25% | 7308143000.0 | 5900.0 | 2008.0 | 37704.0 |
| 50% | 7312621000.0 | 13950.0 | 2013.0 | 85548.0 |
| 75% | 7315254000.0 | 26485.75 | 2017.0 | 133542.5 |
| max | 7317101000.0 | 3736929000.0 | 2022.0 | 10000000.0 |


In [34]:
df['fuel'].unique()

array([nan, 'gas', 'other', 'diesel', 'hybrid', 'electric'], dtype=object)

In [35]:
df['title_status'].unique()

array([nan, 'clean', 'rebuilt', 'lien', 'salvage', 'missing',
       'parts only'], dtype=object)

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
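One of the transformations mentioned above is the logarithm. Used car prices tend to be right-skewed, so fitting on a log-transformed target is a common option. The sketch below is only an illustration under stated assumptions: the data is synthetic (price decaying exponentially with age), and `log_model`, `raw_model`, and `X_toy` are hypothetical names. It uses scikit-learn's `TransformedTargetRegressor` so that predictions come back in dollars.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: price decays exponentially with age (an assumption)
age = rng.uniform(0, 20, size=500)
price = 30_000 * np.exp(-0.12 * age) * rng.lognormal(0.0, 0.1, size=500)
X_toy = age.reshape(-1, 1)

# Fit on log1p(price); predictions are mapped back to dollars with expm1
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_toy, price)

# Baseline: the same linear model fit on the raw price
raw_model = LinearRegression().fit(X_toy, price)

print('R^2 with log target:', log_model.score(X_toy, price))
print('R^2 with raw target:', raw_model.score(X_toy, price))
```

Whether the log target helps on the real data depends on how skewed the cleaned prices are, so it is worth comparing both variants under cross-validation.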

In [36]:
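# Drop every row with a missing value in any column
# (this runs before VIN is dropped, so rows missing only a VIN are removed too)
df = df.dropna()
# Keep only listings with a positive price
df = df[df['price'] > 0]
# Identifier columns are not useful for predicting price
df = df.drop(['id', 'VIN'], axis=1)
df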

| | region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | drive | size | type | paint_color | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 215 | birmingham | 4000 | 2002.0 | toyota | echo | excellent | 4 cylinders | gas | 155000.0 | clean | automatic | fwd | compact | sedan | blue | al |
| 219 | birmingham | 2500 | 1995.0 | bmw | 525i | fair | 6 cylinders | gas | 110661.0 | clean | automatic | rwd | mid-size | sedan | white | al |
| 268 | birmingham | 9000 | 2008.0 | mazda | miata mx-5 | excellent | 4 cylinders | gas | 56700.0 | clean | automatic | rwd | compact | convertible | white | al |
| 337 | birmingham | 8950 | 2011.0 | ford | f-150 | excellent | 6 cylinders | gas | 164000.0 | clean | automatic | fwd | full-size | truck | white | al |
| 338 | birmingham | 4000 | 1972.0 | mercedes-benz | benz | fair | 6 cylinders | gas | 88100.0 | clean | automatic | rwd | full-size | coupe | silver | al |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 426785 | wyoming | 23495 | 2015.0 | ford | f150 xlt 4x4 | like new | 8 cylinders | gas | 146795.0 | clean | automatic | 4wd | full-size | truck | black | wy |
| 426788 | wyoming | 12995 | 2016.0 | chevrolet | cruze lt | like new | 4 cylinders | gas | 61127.0 | clean | automatic | fwd | compact | sedan | silver | wy |
| 426792 | wyoming | 32999 | 2014.0 | ford | f350, xlt | excellent | 8 cylinders | diesel | 154642.0 | clean | automatic | 4wd | full-size | pickup | brown | wy |
| 426793 | wyoming | 15999 | 2018.0 | chevrolet | cruze, lt | excellent | 4 cylinders | gas | 36465.0 | clean | automatic | fwd | mid-size | sedan | black | wy |


In [37]:
df['cylinders'].unique()

array(['4 cylinders', '6 cylinders', '8 cylinders', '5 cylinders',
       '10 cylinders', '3 cylinders', 'other', '12 cylinders'],
      dtype=object)

In [39]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
le = LabelEncoder()

# List of categorical columns to encode
categorical_columns = ["region", "manufacturer", "model", "condition", "fuel", "title_status", "transmission", "drive", "size", "type", "paint_color", "state"]

# Apply label encoding to categorical columns.
# The integer codes follow the sorted order of the category names and do not
# represent a meaningful ranking.
for col in categorical_columns:
    df[col] = le.fit_transform(df[col])

# Drop rows where cylinders is 'other'; .copy() makes df an independent frame,
# which avoids SettingWithCopyWarning on the assignment below
df = df[df['cylinders'] != 'other'].copy()

# Convert cylinders to numeric; the raw string keeps \d from being read as an
# invalid string escape, and expand=False returns a Series
df['cylinders'] = df['cylinders'].str.extract(r'(\d+)', expand=False).astype(int)

df


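| | region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | drive | size | type | paint_color | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 215 | 28 | 4000 | 2002.0 | 38 | 1591 | 0 | 4 | 2 | 155000.0 | 0 | 0 | 1 | 0 | 9 | 1 | 1 |
| 219 | 28 | 2500 | 1995.0 | 4 | 382 | 1 | 6 | 2 | 110661.0 | 0 | 0 | 2 | 2 | 9 | 10 | 1 |
| 268 | 28 | 9000 | 2008.0 | 25 | 3094 | 0 | 4 | 2 | 56700.0 | 0 | 0 | 2 | 0 | 2 | 10 | 1 |
| 337 | 28 | 8950 | 2011.0 | 13 | 1926 | 0 | 6 | 2 | 164000.0 | 0 | 0 | 1 | 1 | 10 | 10 | 1 |
| 338 | 28 | 4000 | 1972.0 | 26 | 751 | 1 | 6 | 2 | 88100.0 | 0 | 0 | 2 | 1 | 3 | 9 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 426785 | 383 | 23495 | 2015.0 | 13 | 2129 | 3 | 8 | 2 | 146795.0 | 0 | 0 | 0 | 1 | 10 | 0 | 50 |
| 426788 | 383 | 12995 | 2016.0 | 7 | 1346 | 3 | 4 | 2 | 61127.0 | 0 | 0 | 1 | 0 | 9 | 9 | 50 |
| 426792 | 383 | 32999 | 2014.0 | 13 | 2223 | 0 | 8 | 0 | 154642.0 | 0 | 0 | 0 | 1 | 8 | 2 | 50 |
| 426793 | 383 | 15999 | 2018.0 | 7 | 1356 | 0 | 4 | 2 | 36465.0 | 0 | 0 | 1 | 2 | 9 | 0 | 50 |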


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [41]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline

# df is the preprocessed dataframe from the Data Preparation step

# Separate features and target
X = df.drop(['price'], axis=1)
y = df['price']

# Define the models. Each pipeline starts with StandardScaler so that scaling
# is fit on the training folds only and no validation data leaks into it.
models = {
    'SFS_Ridge': Pipeline([
        ('scaler', StandardScaler()),
        ('sfs', SequentialFeatureSelector(Ridge())),
        ('ridge', Ridge())
    ]),
    'Lasso': Pipeline([
        ('scaler', StandardScaler()),
        ('lasso', Lasso())
    ]),
    'SFS_Linear': Pipeline([
        ('scaler', StandardScaler()),
        ('sfs', SequentialFeatureSelector(LinearRegression())),
        ('lr', LinearRegression())
    ])
}

# Parameters for grid search. X has 15 columns and SequentialFeatureSelector
# requires n_features_to_select < n_features, so 15 is not a valid value.
param_grids = {
    'SFS_Ridge': {
        'sfs__n_features_to_select': [5, 10],
        'sfs__direction': ['forward', 'backward'],
        'ridge__alpha': np.logspace(-3, 3, 7)
    },
    'Lasso': {
        'lasso__alpha': np.logspace(-3, 3, 7)
    },
    'SFS_Linear': {
        'sfs__n_features_to_select': [5, 10],
        'sfs__direction': ['forward', 'backward']
    }
}

# Define cross-validation strategies. LeaveOneOut is omitted: it needs one fit
# per row for every parameter combination, which is impractical at this size.
cv_strategies = {
    'KFold_5': KFold(n_splits=5, shuffle=True, random_state=42),
    'KFold_10': KFold(n_splits=10, shuffle=True, random_state=42)
}

# Perform grid search for each model and CV strategy
results = {}

for model_name, model in models.items():
    for cv_name, cv in cv_strategies.items():
        print(f"Running GridSearchCV for {model_name} with {cv_name}")

        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[model_name],
            cv=cv,
            scoring='neg_mean_squared_error',
            n_jobs=-1
        )

        grid_search.fit(X, y)

        results[f"{model_name}_{cv_name}"] = {
            'best_params': grid_search.best_params_,
            'best_score': -grid_search.best_score_  # Convert back to MSE
        }

# Print results
for key, value in results.items():
    print(f"\n{key}:")
    print(f"Best parameters: {value['best_params']}")
    print(f"Best MSE score: {value['best_score']}")

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
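One way to connect a fitted model back to the business question is to score it on held-out data and then rank the standardized coefficients by magnitude. The sketch below illustrates the idea on synthetic data only: the relationships between `year`, `odometer`, and price are assumed for the example and are not findings from the dataset, and `paint_code` is a hypothetical uninformative feature.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-in features (assumed relationships, not real findings)
toy = pd.DataFrame({
    'year': rng.integers(2000, 2022, size=n).astype(float),
    'odometer': rng.uniform(5_000, 200_000, size=n),
    'paint_code': rng.integers(0, 12, size=n).astype(float),
})
price = (800 * (toy['year'] - 2000) - 0.05 * toy['odometer']
         + 15_000 + rng.normal(0, 1_000, size=n))

X_train, X_test, y_train, y_test = train_test_split(
    toy, price, test_size=0.2, random_state=42)

# Standardizing first puts the coefficients on a comparable scale
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)

# Error on data the model has not seen, in the same units as price
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f'Held-out RMSE: {rmse:.0f}')

# Rank features by the absolute size of their standardized coefficient
coefs = pd.Series(model.named_steps['ridge'].coef_, index=toy.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Coefficient rankings are only meaningful for features on a genuine numeric scale; label-encoded nominal columns would need a different treatment before being interpreted this way.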

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.