# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

The data task is to formulate a supervised regression problem, with price as the target, that identifies the features most strongly associated with used car prices. This entails using historical data on car attributes and machine learning techniques to build a model that accurately predicts the price of a used car.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

First, I'd inspect the columns in the DataFrame to see which features are available and whether any data types need to change. I'd also check how many values are missing in each column and drop rows with nulls where necessary. Categorical features can then be encoded as numeric features.
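
The inspection steps above can be sketched as follows. This is a minimal illustration on a tiny made-up frame (the `sample` data and its values are assumptions, not rows from `vehicles.csv`); the same calls would apply to the real `df`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the vehicles data (values are made up for illustration)
sample = pd.DataFrame({
    'price': [4500, 12000, 0, 8700],
    'year': [2012.0, np.nan, 2008.0, 2015.0],
    'condition': ['good', None, None, 'excellent'],
    'size': [None, None, None, 'mid-size'],
})

# Fraction of missing values per column, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)

# Dtypes and distinct-value counts help spot columns to recast or encode
summary = pd.DataFrame({'dtype': sample.dtypes.astype(str),
                        'n_unique': sample.nunique()})
print(summary)
```

Looking at the missing share before calling `dropna()` shows which columns would cost the most rows if complete cases are required.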

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, make_column_selector
import plotly.express as px
from random import shuffle, seed
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error 
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder, OrdinalEncoder

In [42]:
df = pd.read_csv('data/vehicles.csv')

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
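
As one possible take on the integrity checks and transformations mentioned above, the sketch below removes implausible prices and adds a log-transformed target. The `listings` frame and the `MIN_PRICE` / `MAX_PRICE` bounds are hypothetical and chosen only for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up listings, including a zero price and an implausibly large one
listings = pd.DataFrame({
    'price': [0, 3500, 8000, 12500, 21000, 3_000_000_000],
    'odometer': [150000, 180000, 95000, 60000, 30000, 10],
})

# Hypothetical plausibility bounds; tune these to the market being studied
MIN_PRICE, MAX_PRICE = 500, 150_000
cleaned = listings[listings['price'].between(MIN_PRICE, MAX_PRICE)].copy()

# log1p compresses the right-skewed price scale; np.expm1 inverts it
cleaned['log_price'] = np.log1p(cleaned['price'])
print(cleaned)
```

If a model is trained on `log_price`, its predictions need to be passed through `np.expm1` before they are reported as prices.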

In [44]:
# Custom function to extract the leading cylinder count and convert it to float
def convert_cylinders_to_float(value):
    # Assumes strings such as '6 cylinders'; use the whole first token so that
    # '10 cylinders' becomes 10.0 rather than 1.0
    if isinstance(value, str) and value.split() and value.split()[0].isdigit():
        return float(value.split()[0])
    else:
        return float('nan')  # 'other', empty strings, and non-string values

In [45]:
df['cylinders'] = df['cylinders'].apply(convert_cylinders_to_float)
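
A vectorized alternative to the row-wise `apply` is sketched below. It assumes the column holds strings of the form `'6 cylinders'` plus `'other'` and missing values; the `raw` series is made up for illustration. Capturing the full run of leading digits keeps two-digit counts such as 10 and 12 intact.

```python
import pandas as pd

# Example values in the assumed format of the `cylinders` column
raw = pd.Series(['4 cylinders', '10 cylinders', 'other', None, '12 cylinders'])

# Capture the full leading run of digits; rows without digits become NaN
cylinders = raw.str.extract(r'^(\d+)', expand=False).astype(float)
print(cylinders.tolist())
```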

In [46]:
from scipy import stats

In [47]:
# Drop irrelevant columns
df = df.drop(columns = ['id', 'region', 'model', 'VIN', 'paint_color', 'state'])

In [52]:
numerical_columns = df.select_dtypes(include=['number']).columns

(0, 12)

In [11]:
df['condition'].unique()
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82784 entries, 31 to 426836
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         82784 non-null  int64  
 1   year          82784 non-null  float64
 2   manufacturer  82784 non-null  object 
 3   condition     82784 non-null  object 
 4   cylinders     82784 non-null  float64
 5   fuel          82784 non-null  object 
 6   odometer      82784 non-null  float64
 7   title_status  82784 non-null  object 
 8   transmission  82784 non-null  object 
 9   drive         82784 non-null  object 
 10  size          82784 non-null  object 
 11  type          82784 non-null  object 
dtypes: float64(3), int64(1), object(8)
memory usage: 8.2+ MB


In [8]:
X = df.drop(columns = ['price'])
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [9]:
ordinal_poly = make_column_transformer((OrdinalEncoder(categories = 'auto'), ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive', 'size', 'type']),
                                           (PolynomialFeatures(include_bias = False, degree = 2), make_column_selector(dtype_include=np.number)))

In [10]:
pipe = Pipeline([('transformer', ordinal_poly), 
                 ('scale', StandardScaler()),
                 ('select', SequentialFeatureSelector(estimator = Lasso(), n_features_to_select = 5)),
                 ('ridge', Ridge())])
param_dict = {'ridge__alpha': [0.001, 0.1, 1.0, 10.0, 100.0, 1000.0]}
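
The notebook defines `param_dict`, but the search that uses it is not shown. One way such a search could be run with `GridSearchCV` is sketched below. To stay self-contained it uses synthetic data and a shortened pipeline (`X_demo`, `y_demo`, and `demo_pipe` are hypothetical stand-ins for `X_train`, `y_train`, and `pipe`).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

demo_pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge())])
param_dict = {'ridge__alpha': [0.001, 0.1, 1.0, 10.0, 100.0, 1000.0]}

# 5-fold cross-validation over the alpha grid, scored by negative MSE
grid = GridSearchCV(demo_pipe, param_grid=param_dict, cv=5,
                    scoring='neg_mean_squared_error')
grid.fit(X_demo, y_demo)

# best_score_ is negated MSE, so flip the sign to report an error
print(grid.best_params_, -grid.best_score_)
```

With the full pipeline, the same call would be `GridSearchCV(pipe, param_grid=param_dict, ...)` followed by `fit(X_train, y_train)`, although the sequential selector makes each fit considerably slower.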

In [13]:
pipe = Pipeline([('transformer', ordinal_poly), 
                 ('scale', StandardScaler()),
                 ('select', SequentialFeatureSelector(estimator = Lasso(), n_features_to_select = 5)),
                 ('ridge', Ridge(alpha=1000))])

In [14]:
pipe.fit(X_train, y_train)

In [15]:
pipe.named_steps['transformer'].get_feature_names_out()


array(['ordinalencoder__manufacturer', 'ordinalencoder__condition',
       'ordinalencoder__fuel', 'ordinalencoder__title_status',
       'ordinalencoder__transmission', 'ordinalencoder__drive',
       'ordinalencoder__size', 'ordinalencoder__type',
       'polynomialfeatures__year', 'polynomialfeatures__cylinders',
       'polynomialfeatures__odometer', 'polynomialfeatures__year^2',
       'polynomialfeatures__year cylinders',
       'polynomialfeatures__year odometer',
       'polynomialfeatures__cylinders^2',
       'polynomialfeatures__cylinders odometer',
       'polynomialfeatures__odometer^2'], dtype=object)

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
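
One way to judge whether a model is high quality is to compare its error on held-out data against a naive baseline. The sketch below does this with synthetic data (`X_demo`, `y_demo`, and the plain `Ridge` model are illustrative assumptions); with the notebook's objects, the same metric calls would take `pipe.predict(X_test)` and `y_test`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared car features and prices
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
y_demo = 5000 * X_demo[:, 0] - 2000 * X_demo[:, 1] + rng.normal(scale=1000, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=41)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)

# RMSE is in the same units as the target, which is easier to explain to a client
train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

# A mean-only baseline shows how much the features actually help
baseline_rmse = np.sqrt(mean_squared_error(y_te, np.full_like(y_te, y_tr.mean())))
print(train_rmse, test_rmse, baseline_rmse)
```

A test RMSE close to the baseline would suggest revisiting the earlier phases, while a large gap between train and test RMSE would point to overfitting.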

In [16]:
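# Ridge coefficients for the features kept by the sequential selector
coef = pipe.named_steps['ridge'].coef_
# Expand to one value per transformed feature: the fitted coefficient if the
# feature was selected, 0 otherwise
cols = []
i = 0
for s in pipe.named_steps['select'].support_:
    if s:
        cols.append(coef[i])
        i += 1
    else:
        cols.append(0)
print(cols)
# One-row DataFrame labelled with the transformer's output feature names
features = pd.DataFrame([cols], columns=pipe.named_steps['transformer'].get_feature_names_out())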

[-8604.171170269228, 0, 0, -4653.511216675449, -8113.356531363876, 0, -8619.915048724104, 0, 0, 0, 0, 0, 0, 0, 0, 0, -682.2016040660842]


In [17]:
features

| | ordinalencoder__manufacturer | ordinalencoder__condition | ordinalencoder__fuel | ordinalencoder__title_status | ordinalencoder__transmission | ordinalencoder__drive | ordinalencoder__size | ordinalencoder__type | polynomialfeatures__year | polynomialfeatures__cylinders | polynomialfeatures__odometer | polynomialfeatures__year^2 | polynomialfeatures__year cylinders | polynomialfeatures__year odometer | polynomialfeatures__cylinders^2 | polynomialfeatures__cylinders odometer | polynomialfeatures__odometer^2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -8604.17117 | 0 | 0 | -4653.511217 | -8113.356531 | 0 | -8619.915049 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -682.201604 |


In [150]:
px.bar(x=pipe.named_steps['transformer'].get_feature_names_out(), y = cols)

In [21]:
result = df.groupby(['manufacturer'])['price'].mean()

print(result)

manufacturer
acura                8956.156499
alfa-romeo          17436.771429
aston-martin        53367.000000
audi                13587.650000
bmw                 12366.155752
buick              118961.165480
cadillac            11633.101629
chevrolet           20357.656344
chrysler             7548.056683
datsun              13434.928571
dodge               10362.261905
ferrari             76622.812500
fiat                 8855.896825
ford                85567.724056
gmc                 54779.126852
harley-davidson     14341.413793
honda                8560.113111
hyundai              7856.925275
infiniti            11130.894583
jaguar              13693.396552
jeep                13648.353800
kia                  8066.977946
land rover          18116.875000
lexus               12743.884179
lincoln             10318.800842
mazda                8047.718935
mercedes-benz       13721.061031
mercury              5440.327907
mini                 9170.752914
mitsubishi          12470.5600

In [160]:
result = df.groupby(['title_status'])['price'].mean()

print(result)

title_status
clean         81108.485969
lien          22058.972543
missing        5300.025773
parts only     3114.282609
rebuilt       19866.019410
salvage        9266.875427
Name: price, dtype: float64


In [18]:
result = df.groupby(['transmission'])['price'].mean()

print(result)

transmission
automatic    83386.798916
manual       11951.965883
other         8031.945285
Name: price, dtype: float64


In [19]:
result = df.groupby(['size'])['price'].mean()

print(result)

size
compact          8567.708281
full-size      131327.368856
mid-size        10418.362728
sub-compact      9181.960554
Name: price, dtype: float64
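
Group means like the ones above can be pulled upward by a handful of extreme listings, and some of the averages reported here (for example Buick and Ford in the manufacturer table) look unusually high, which may point to this. A hedged way to check is to report the median and the group size alongside the mean. The `demo` frame below is made up for illustration.

```python
import pandas as pd

# Made-up prices: one mistyped listing inflates the 'full-size' mean
demo = pd.DataFrame({
    'size': ['compact', 'compact', 'full-size', 'full-size', 'full-size'],
    'price': [8000, 9000, 12000, 15000, 3_000_000],
})

# Mean, median, and row count per group in a single table
summary = demo.groupby('size')['price'].agg(['mean', 'median', 'count'])
print(summary)
```

When the mean and median of a group differ by a large factor, the median is usually the safer figure to quote to a client.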


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

After analyzing data from historical used car sales and creating a model to predict the price of a used car, I've identified some key points you can use to tune your inventory.

The five features the model selected as most important for predicting the price of a used car are manufacturer, title status, transmission, size, and the squared odometer reading (odometer^2).

Manufacturer - The manufacturer of a car has a lot to do with its price, both new and used. In general, cars from higher-end manufacturers resell for more.

Title Status - Buyers prefer a car with a clean title. In this dataset, clean-title cars average about 81,100, far above every other title status (the next highest, lien, averages about 22,100).

Transmission - Automatic cars tend to sell for more than cars with manual or other transmissions (an average of about 83,400 versus about 12,000 for manual). However, this could simply reflect age: manual cars in the data may tend to be older, while newer cars are almost all automatic.

Size - Full-size cars sell for the most by a wide margin. Ranked by average price from highest to lowest, the sizes are full-size, mid-size, sub-compact, and compact, so the pattern is not strictly "bigger is pricier" among the smaller classes.

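Odometer^2 - The number of miles on a car is a big indicator of how much it will sell for. The model uses the squared odometer reading with a negative coefficient, so the predicted price falls as mileage rises, and it falls faster at higher mileage. This is a quadratic relationship, not an exponential one.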

In conclusion, you should try to stock cars from higher-end manufacturers, with clean titles, automatic transmissions, larger sizes, and low mileage.