# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

In [6]:
# --- Business Understanding: the task reframed ---
#
# We analyze historical used-car listings to determine which factors drive a car's price.
# This is a supervised regression problem: price is the target, and the listing attributes
# (such as manufacturer, model, year, and odometer reading) are the features.
# The most influential features in the fitted model indicate what consumers value,
# which helps the dealership set competitive prices and boost sales.


### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.
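The steps described above can be sketched as a short profiling pass. The frame below is a tiny, made-up stand-in for the real data (the name `sample` and all of its values are assumptions for illustration); the same calls would apply to the full dataset once it is loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the vehicles data; values are invented
sample = pd.DataFrame({
    "price": [0, 4500, 15990, 123456789, 8995],
    "year": [2012.0, np.nan, 2018.0, 2015.0, 1999.0],
    "odometer": [120000.0, 98000.0, np.nan, 45000.0, 210000.0],
    "condition": ["good", None, "excellent", None, "fair"],
})

# Share of missing values per column, worst first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)

# Summary statistics expose implausible values (a price of 0, extreme maxima)
print(sample[["price", "odometer"]].describe())

# Number of rows whose price cannot be a real sale price
n_bad_price = int((sample["price"] <= 0).sum())
print("rows with price <= 0:", n_bad_price)

# Number of distinct levels in a categorical column (missing values are not counted)
n_conditions = sample["condition"].nunique()
print("distinct condition levels:", n_conditions)
```

Looking at missing shares per column before cleaning shows whether a few sparse columns are responsible for most of the gaps.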

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, Ridge
from sklearn.model_selection import cross_val_score, KFold

from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder, OrdinalEncoder, LabelEncoder, StandardScaler
from sklearn.pipeline import Pipeline


In [9]:
data = pd.read_csv("/Users/m.jalloh/Downloads/practical_application_II_starter/data/vehicles.csv")


In [10]:
# Data cleaning

# Drop every row that has a missing (NaN) value in any column.
# Note: this is aggressive; only 34,868 of the roughly 426K rows survive (see info() output).
data_filtered = data.dropna()
data_filtered.info()


<class 'pandas.core.frame.DataFrame'>
Index: 34868 entries, 126 to 426836
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            34868 non-null  int64  
 1   region        34868 non-null  object 
 2   price         34868 non-null  int64  
 3   year          34868 non-null  float64
 4   manufacturer  34868 non-null  object 
 5   model         34868 non-null  object 
 6   condition     34868 non-null  object 
 7   cylinders     34868 non-null  object 
 8   fuel          34868 non-null  object 
 9   odometer      34868 non-null  float64
 10  title_status  34868 non-null  object 
 11  transmission  34868 non-null  object 
 12  VIN           34868 non-null  object 
 13  drive         34868 non-null  object 
 14  size          34868 non-null  object 
 15  type          34868 non-null  object 
 16  paint_color   34868 non-null  object 
 17  state         34868 non-null  object 
dtypes: float64(2), int64(2), object(14)

In [11]:
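# Preview ten random rows of the cleaned data
# (a notebook cell displays only its last expression, so this preview is not shown)
data_filtered.sample(10)

# Confirm that no missing values remain after cleaning
data_filtered.isnull().sum()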


id              0
region          0
price           0
year            0
manufacturer    0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
title_status    0
transmission    0
VIN             0
drive           0
size            0
type            0
paint_color     0
state           0
dtype: int64
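A blanket `dropna()` removes a row if any single column is missing, which is why only 34,868 of the roughly 426K rows remain. One possible alternative, sketched below on a made-up frame, is to drop very sparse columns first and then require values only in the columns the model cannot do without. The frame `toy`, the 0.5 threshold, the choice of essential columns, and the `"unknown"` level are all assumptions for illustration, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration
toy = pd.DataFrame({
    "price": [4500, 15990, 8995, 12000, 7000, 30000],
    "year": [2012.0, 2018.0, np.nan, 2015.0, 2010.0, 2020.0],
    "odometer": [120000.0, 30000.0, 210000.0, np.nan, 150000.0, 10000.0],
    "size": [None, None, "full-size", None, None, None],
    "fuel": ["gas", "gas", "gas", "diesel", None, "electric"],
})

# Baseline: drop every row with any missing value
rows_all_dropna = len(toy.dropna())

# Step 1: drop columns that are mostly empty (the threshold is an assumption)
max_missing_share = 0.5
keep_cols = toy.columns[toy.isna().mean() <= max_missing_share]
trimmed = toy[keep_cols]

# Step 2: require values only in the essential columns
essential = ["price", "year", "odometer"]
trimmed = trimmed.dropna(subset=essential)

# Step 3: give remaining categorical gaps an explicit "unknown" level
trimmed = trimmed.fillna({"fuel": "unknown"})

print("rows kept by blanket dropna:", rows_all_dropna)
print("rows kept by selective cleaning:", len(trimmed))
```

Whether the retained rows are worth the extra imputation is a judgment call that depends on how the missing values are distributed in the real data.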

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
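The transformations mentioned above (scaling, encoding, a logarithm on the target) can be bundled into a single `sklearn` estimator. The following is a minimal sketch on synthetic data, not this notebook's actual preparation: the frame `cars`, its column subset, the price formula, and the choice of `Ridge` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor, make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption: invented values, not the real listings)
rng = np.random.default_rng(0)
n = 200
cars = pd.DataFrame({
    "year": rng.integers(2000, 2022, size=n).astype(float),
    "odometer": rng.uniform(5_000, 250_000, size=n),
    "fuel": rng.choice(["gas", "diesel", "electric"], size=n),
    "drive": rng.choice(["fwd", "rwd", "4wd"], size=n),
})
# Made-up price that is roughly log-linear in year and odometer
log_price = (9.0 + 0.05 * (cars["year"] - 2010)
             - 4e-6 * cars["odometer"] + rng.normal(0, 0.1, size=n))
cars["price"] = np.exp(log_price)

numeric_features = ["year", "odometer"]
categorical_features = ["fuel", "drive"]

# Scale numeric columns and one-hot encode categorical ones in one step
preprocess = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),
)

# Fit on log1p(price) and map predictions back to the price scale with expm1
model = TransformedTargetRegressor(
    regressor=make_pipeline(preprocess, Ridge(alpha=1.0)),
    func=np.log1p,
    inverse_func=np.expm1,
)

X = cars[numeric_features + categorical_features]
y = cars["price"]
model.fit(X, y)

predictions = model.predict(X.head())
print(predictions)
```

Keeping the preprocessing inside the estimator means the scaler and encoder are refit on each training fold during cross-validation, which avoids leaking information from held-out rows.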

In [13]:
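# Numeric and categorical features
num_cols = ["id", "price", "year", "odometer"]
cat_cols = ["region", "manufacturer", "condition", "cylinders", "fuel", "title_status", "transmission", "drive", "size", "type", "paint_color", "state"]

# Numeric-only feature matrix and target
# Note: "id" is a listing identifier, not a vehicle attribute, so it carries no
# meaningful price signal; the scores reported below include it.
X_num = data_filtered[num_cols].drop("price", axis = 1)
y = data_filtered["price"]

# One-hot encode the categorical features
data_filtered_encode = pd.get_dummies(data_filtered, columns = cat_cols, drop_first = True)

# Drop the target, the identifier, and the text columns that were not encoded.
# Casting to float keeps the dummy columns (bool or uint8, depending on the pandas version);
# selecting only int64/float64 dtypes would silently discard them.
X_cat = data_filtered_encode.drop(["price", "id", "model", "VIN"], axis = 1).astype(float)
y = data_filtered_encode["price"].astype(float)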


In [14]:
# Fit a linear regression on the one-hot encoded features (full data, no hold-out split)
model_LinReg = LinearRegression()
model_LinReg.fit(X_cat, y)



In [15]:
# Split the numeric features into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X_num, y, test_size = 0.3, random_state = 22)

# Fit a linear regression on the training set and report R^2 on the test set
model_LinReg = LinearRegression()
model_LinReg.fit(X_train, y_train)
print("LinReg model score", model_LinReg.score(X_test, y_test))


LinReg model score 0.10022832370938661


In [16]:
# Lasso (L1) regression; score() reports R^2 on the test set
model_lasso = Lasso(alpha = 0.1)
model_lasso.fit(X_train, y_train)
print("Lasso model score", model_lasso.score(X_test, y_test))

# Ridge (L2) regression
model_ridge = Ridge(alpha = 1.0)
model_ridge.fit(X_train, y_train)
print("Ridge model score", model_ridge.score(X_test, y_test))


Lasso model score 0.10022829466824468
Ridge model score 0.1002283188960571


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
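Exploring different parameters and cross-validating the findings can be done in one step with `GridSearchCV`. The sketch below uses synthetic data and an assumed grid of `alpha` values; none of the numbers come from this notebook. The search runs on the training split only, so the held-out split stays untouched for a final check.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the prepared car features)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=22
)

# Scale inside the pipeline so each fold is standardized on its own training part
pipe = make_pipeline(StandardScaler(), Ridge())

# Candidate regularization strengths (illustrative values)
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validate every candidate on the training split
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="r2")
search.fit(X_train, y_train)

print("best alpha:", search.best_params_["ridge__alpha"])
print("mean CV R^2 of best alpha:", search.best_score_)

# Single final check on data the search never saw
test_r2 = search.score(X_test, y_test)
print("held-out R^2:", test_r2)
```

The same pattern works for `Lasso` by swapping the estimator and using the `lasso__alpha` parameter name.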

In [18]:
# Cross-validation setup: 5 shuffled folds with a fixed seed for reproducibility
kf_5 = KFold(n_splits = 5, shuffle = True, random_state = 42)


In [19]:
# Note: these calls cross-validate on the held-out split (X_test, y_test);
# cross-validation is normally run on the training split instead.

# Linear Regression
LinReg_score = cross_val_score(model_LinReg, X_test, y_test, cv = kf_5, scoring = "r2")
print("Linear Regression score: ", LinReg_score )
print("Mean score: ", np.mean(LinReg_score) )

# Lasso Regression
Lasso_score = cross_val_score(model_lasso, X_test, y_test, cv = kf_5, scoring = "r2")
print("Lasso Regression score: ", Lasso_score )
print("Mean score: ", np.mean(Lasso_score))

# Ridge Regression
Ridge_score = cross_val_score(model_ridge, X_test, y_test, cv = kf_5, scoring = "r2")
print("Ridge Regression score: ", Ridge_score )
print("Mean score: ", np.mean(Ridge_score))



Linear Regression score:  [0.11678728 0.11364483 0.1082419  0.15053774 0.04528549]
Mean score:  0.10689944576956581
Lasso Regression score:  [0.11678739 0.1136449  0.10824193 0.15053767 0.04528523]
Mean score:  0.10689942395804757
Ridge Regression score:  [0.11678733 0.11364486 0.10824191 0.1505377  0.04528538]
Mean score:  0.10689943873458736


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
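One way to turn a fitted linear model into statements about price drivers is to compare coefficients on standardized features, since each coefficient then measures the price change per one standard deviation of its feature. The sketch below illustrates the idea on synthetic data; the frame `features`, the price formula, and the deliberately irrelevant `noise` column are assumptions for illustration and say nothing about the real listings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): price rises with year and falls with odometer
rng = np.random.default_rng(1)
n = 500
features = pd.DataFrame({
    "year": rng.integers(2000, 2022, size=n).astype(float),
    "odometer": rng.uniform(5_000, 250_000, size=n),
    "noise": rng.normal(size=n),  # irrelevant column, included for contrast
})
price = (15_000 + 500.0 * (features["year"] - 2000)
         - 0.05 * features["odometer"] + rng.normal(scale=1_000, size=n))

# Standardize first so coefficients share a common scale
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(features, price)

# Label each coefficient with its feature name and rank by absolute size
coefs = pd.Series(pipe.named_steps["linearregression"].coef_, index=features.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

The sign gives the direction of the effect and the magnitude gives its relative weight; such a ranking is only as trustworthy as the model's fit, so a low R^2 calls for revisiting the earlier phases before reporting drivers to the client.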

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.