# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

The CRISP-DM framework can be used for data mining projects across a variety of industries.
In this case, I need to develop a machine learning model that uses historical data on used cars, including attributes such as model, size, year, region, etc., to predict the selling price of a used car. This is a supervised regression task with price as the target variable. From the fitted model, we can then identify which attributes most strongly influence the price of a used car, and the dealership can use that information to guide its business decisions.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [176]:
#First thing to do is to read the data into a dataframe
import pandas as pd
car_data = pd.read_csv('data/vehicles.csv')

In [177]:
#Looking at the info of the data
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [178]:
#Looking at the head of the data
car_data.head(100)

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
0,7222695916,prescott,6000,,,,,,,,,,,,,,,az
1,7218891961,fayetteville,11900,,,,,,,,,,,,,,,ar
2,7221797935,florida keys,21000,,,,,,,,,,,,,,,fl
3,7222270760,worcester / central MA,1500,,,,,,,,,,,,,,,ma
4,7210384030,greensboro,4900,,,,,,,,,,,,,,,nc
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,7309798041,auburn,2100,2006.0,subaru,impreza,fair,4 cylinders,gas,97000.0,clean,automatic,,,,hatchback,,al
96,7309361405,auburn,80,2004.0,honda,,excellent,6 cylinders,gas,94020.0,clean,automatic,,,,,,al
97,7309271279,auburn,15990,2016.0,,Scion iM Hatchback 4D,good,,gas,29652.0,clean,other,JTNKARJEXGJ517925,fwd,,hatchback,blue,al
98,7309271051,auburn,20590,2013.0,acura,mdx sport utility 4d,good,6 cylinders,gas,77087.0,clean,other,2HNYD2H30DH510846,,,other,silver,al


A car's VIN is unique, so any rows that share a VIN can be considered duplicates and removed. Note that `drop_duplicates` also treats missing VINs as equal to one another, so all rows without a VIN collapse into a single row.


In [179]:
#removing duplicate VINS
car_data = car_data.drop_duplicates(subset = 'VIN', keep = 'first')
car_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 118247 entries, 0 to 426833
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            118247 non-null  int64  
 1   region        118247 non-null  object 
 2   price         118247 non-null  int64  
 3   year          117653 non-null  float64
 4   manufacturer  115179 non-null  object 
 5   model         117542 non-null  object 
 6   condition     61082 non-null   object 
 7   cylinders     70266 non-null   object 
 8   fuel          116608 non-null  object 
 9   odometer      116671 non-null  float64
 10  title_status  114608 non-null  object 
 11  transmission  117010 non-null  object 
 12  VIN           118246 non-null  object 
 13  drive         90898 non-null   object 
 14  size          30310 non-null   object 
 15  type          102062 non-null  object 
 16  paint_color   87564 non-null   object 
 17  state         118247 non-null  object 
dtypes: f
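
The drop in row count above comes partly from rows that have no VIN at all. The following is a minimal sketch, on made-up toy data rather than `vehicles.csv`, of how the two cases could be separated so that only rows with a real VIN are deduplicated. The names `toy`, `naive`, and `careful` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from vehicles.csv
toy = pd.DataFrame({
    "VIN": ["A1", "A1", None, None, "B2"],
    "price": [5000, 5200, 3000, 4000, 9000],
})

# drop_duplicates treats missing values as equal to one another,
# so only the first row with a missing VIN survives
naive = toy.drop_duplicates(subset="VIN", keep="first")

# Alternative: deduplicate only the rows that have a VIN and keep the rest untouched
has_vin = toy["VIN"].notna()
careful = pd.concat([
    toy[has_vin].drop_duplicates(subset="VIN", keep="first"),
    toy[~has_vin],
]).sort_index()

print(len(naive), len(careful))
```

Whether rows without a VIN should be kept is a judgment call; keeping them retains more data but may leave some true duplicates in place.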

It is also ideal to keep only one row for each combination of model and year, so that near-identical listings with
different prices do not confuse the learning algorithm.

In [180]:
#removing the specified rows
car_data = car_data.drop_duplicates(subset = ['model', 'year'] , keep = 'first')
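
Before dropping rows on a subset of columns, it can help to count how many rows the rule would remove. This is a small sketch on toy data (the values are illustrative, not taken from the dataset) that uses `DataFrame.duplicated` with the same `subset` and `keep` arguments as the cell above.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "model": ["impreza", "impreza", "impreza", "mdx"],
    "year": [2006, 2006, 2010, 2013],
    "odometer": [97000, 120000, 60000, 77087],
    "price": [2100, 1800, 5500, 20590],
})

# Rows flagged True here are exactly the rows drop_duplicates would remove
n_removed = int(toy.duplicated(subset=["model", "year"], keep="first").sum())
deduped = toy.drop_duplicates(subset=["model", "year"], keep="first")

print(n_removed, len(deduped))
```

In the toy frame the two 2006 Imprezas differ in odometer and price, yet only the first one is kept, which shows how much information this rule can discard.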

The id column isn't really a factor that affects the price of a car. The column can be removed completely.

In [181]:
car_data = car_data.drop(columns = 'id')

The VIN of a car doesn't affect the price either, so that column can be removed as well.

In [182]:
car_data = car_data.drop(columns = 'VIN')

In [183]:
car_data = car_data.convert_dtypes()
car_data.isna().sum().sort_values(ascending = True)

region              0
price               0
state               0
model              55
year              202
transmission      217
odometer          401
fuel              932
title_status     2016
manufacturer     2088
type             4464
drive            8474
paint_color      9787
cylinders       11843
condition       14260
size            26257
dtype: int64
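
The missing-value counts above are easier to compare as a share of all rows. Here is a minimal sketch on toy data; the 0.5 cutoff is an arbitrary assumption for illustration, not a threshold used in the original notebook.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500],
    "condition": ["good", None, None, "fair"],
    "size": [None, None, None, "mid-size"],
})

# Fraction of missing values per column, smallest first
missing_share = toy.isna().mean().sort_values()

# Columns where more than half of the values are missing (assumed cutoff)
sparse_cols = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print(sparse_cols)
```

A column that is mostly empty is a candidate for removal, while a column with few gaps might be better handled by dropping or imputing individual rows.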

In [184]:
#Dropping other trivial attributes
car_data = car_data.drop(columns = ['paint_color', 'cylinders', 'state', 'region', 'size', 'drive'])
#car_data = car_data.drop(columns = 'title_status')

In [185]:
#Here is the final dataset with all the important attributes
car_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 36223 entries, 0 to 426833
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   price         36223 non-null  Int64 
 1   year          36021 non-null  Int64 
 2   manufacturer  34135 non-null  string
 3   model         36168 non-null  string
 4   condition     21963 non-null  string
 5   fuel          35291 non-null  string
 6   odometer      35822 non-null  Int64 
 7   title_status  34207 non-null  string
 8   transmission  36006 non-null  string
 9   type          31759 non-null  string
dtypes: Int64(3), string(7)
memory usage: 3.1 MB


In [186]:
#Dropping any remaining rows with na values
car_data = car_data.dropna()

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

In [187]:
#Converting the non-numeric columns into integer codes so that a learning algorithm can be used on the data
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
categorical_columns = ['manufacturer', 'model', 'condition', 'fuel', 'title_status', 'transmission', 'type']
#The encoder is refit for each column, so every column gets its own set of codes
for column in categorical_columns:
    car_data[column] = le.fit_transform(car_data[column])
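
`LabelEncoder` assigns integer codes in sorted order of the labels, so a linear model will read those codes as an ordered scale. One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on toy data; it is not what the notebook does, and a high-cardinality column such as `model` would produce a very large number of indicator columns.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "fuel": ["gas", "diesel", "electric", "gas"],
    "odometer": [97000, 94020, 29652, 77087],
})

# One indicator column per category; drop_first avoids a redundant column
encoded = pd.get_dummies(toy, columns=["fuel"], drop_first=True)

print(encoded.columns.tolist())
```

With this encoding each category gets its own coefficient, so the model no longer assumes that, for example, "electric" lies between "diesel" and "gas".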


In [188]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17725 entries, 27 to 426833
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   price         17725 non-null  Int64
 1   year          17725 non-null  Int64
 2   manufacturer  17725 non-null  int32
 3   model         17725 non-null  int32
 4   condition     17725 non-null  int32
 5   fuel          17725 non-null  int32
 6   odometer      17725 non-null  Int64
 7   title_status  17725 non-null  int32
 8   transmission  17725 non-null  int32
 9   type          17725 non-null  int32
dtypes: Int64(3), int32(7)
memory usage: 1.1 MB


In [189]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
normalized_car_data = pd.DataFrame(scaler.fit_transform(car_data), columns = car_data.columns)
car_data = normalized_car_data

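Now that all the relevant columns have been converted into numbers and scaled to the [0, 1] range (price included), the learning algorithms can be applied. Because the target is scaled too, any error metric computed later is in scaled units rather than dollars.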
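
Since `MinMaxScaler` is applied to every column, including `price`, an error measured on the scaled target can be mapped back to dollars through the scaler's `data_range_` attribute. The sketch below uses toy data and a hypothetical scaled MSE; neither value comes from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data for illustration only
toy = pd.DataFrame({
    "price": [1000.0, 6000.0, 11000.0],
    "odometer": [150000.0, 90000.0, 30000.0],
})

scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# data_range_ holds (max - min) per column, in the original column order
price_range = scaler.data_range_[toy.columns.get_loc("price")]

scaled_mse = 0.0004  # hypothetical MSE measured on the scaled price
rmse_dollars = np.sqrt(scaled_mse) * price_range

print(price_range, rmse_dollars)
```

An alternative design is to scale only the feature columns and leave `price` in dollars, which keeps the reported errors directly interpretable.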

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [191]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [192]:
#making the target variable and attribute dataframes
X = car_data.drop('price', axis = 1)
y = car_data['price']

In [193]:
#splitting the data into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [194]:
#running a linear regression model on the data
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

In [195]:
y_pred = model.predict(X_test)

In [196]:
mse = mean_squared_error(y_test, y_pred)
print("MSE: " + str(mse))

MSE: 0.0001682849614032033
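
The single train/test split above gives one estimate of the error. A cross-validated comparison of more than one model could look like the following sketch. It runs on synthetic data generated with NumPy, and the `Ridge` model with `alpha=1.0` is an assumed example, not a model fitted in the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the scaled features and target (illustration only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(200, 3))
y_demo = 0.03 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(scale=0.01, size=200)

candidates = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}

cv_mse = {}
for name, estimator in candidates.items():
    # scikit-learn returns negated MSE so that larger is always better
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()

print(cv_mse)
```

On the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`, and the mean error across the five folds gives a more stable basis for comparing models than one split.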


In [197]:
#getting the coefficients of the attributes
coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients)

year            0.026460
manufacturer   -0.002289
model           0.000860
condition       0.003377
fuel           -0.009012
odometer       -0.195607
title_status   -0.002731
transmission    0.006861
type           -0.000399
dtype: float64
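
Because every column was min-max scaled, the coefficient magnitudes can be compared with one another. A short sketch of ranking them by absolute value follows; the numbers are copied from the output above, and the variable name `ranked` is illustrative.

```python
import pandas as pd

# Coefficients copied from the printed output above
coefficients = pd.Series({
    "year": 0.026460,
    "manufacturer": -0.002289,
    "model": 0.000860,
    "condition": 0.003377,
    "fuel": -0.009012,
    "odometer": -0.195607,
    "title_status": -0.002731,
    "transmission": 0.006861,
    "type": -0.000399,
})

# Sort by absolute size so that strong negative effects rank alongside positive ones
ranked = coefficients.abs().sort_values(ascending=False)

print(ranked)
```

For the label-encoded columns the sign and size of a coefficient depend on the arbitrary integer codes, so their ranking should be read with more caution than that of `odometer` and `year`.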


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

With the linear regression model, I think I have enough information about the factors that most affect the price of a used car.

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

By using a linear regression model, I have come to the following conclusions:
Odometer reading is the number one factor that dictates the price of a car, and the relationship is negative: the higher the odometer reading,
the lower the price and vice versa. The second most important factor is the year the
car was manufactured in. The newer the car, the higher the price. By coefficient magnitude, fuel type is the third most important factor and
transmission type is the fourth, followed by condition.

In summary, if a used car dealership is looking to sell higher priced cars, it should consider obtaining cars with low odometer readings that are also relatively new.