# What drives the price of a car?

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

* Realign inventory to what customers value in order to boost sales.  
* Provide competitive pricing by understanding the market and customer preferences.
* Reduce operating costs by eliminating older vehicles that may require more maintenance and repairs, are less fuel efficient, and may lack the latest features that customers are looking for.
* Reduce operating costs by eliminating expensive vehicles that may take longer to sell, freeing up space for vehicles that are more likely to sell.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

* Explore price and eliminate outliers that may be skewing the data.
* Explore columns that may be related to price such as year, mileage, and condition.
* Remove vehicles that do not have a clean title, as they may be less desirable to customers and could be priced lower than their market value.
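
The steps above could start with a quick quality summary and an outlier check on price. The sketch below is one possible approach, not the method used in this notebook: `summarize_quality` and `price_outlier_bounds` are hypothetical helper names, the 1.5 x IQR fence is an assumed rule of thumb, and the small `sample` frame is made up so the block runs without `data/vehicles.csv`.

```python
import pandas as pd


def summarize_quality(df):
    """Per-column dtype, share of missing values, and number of distinct values."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_share": df.isna().mean(),
        "n_unique": df.nunique(),
    })
    return summary.sort_values("missing_share", ascending=False)


def price_outlier_bounds(price, k=1.5):
    """Lower and upper IQR fences (assumed rule of thumb, not from the notebook)."""
    q1, q3 = price.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up rows for illustration only; swap in the real `cars` frame.
sample = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500, 4900, 0, 1_000_000],
    "year": [2012, 2015, 2018, 2003, 2009, None, 2016],
    "title_status": ["clean", "clean", "clean", "salvage", "clean", None, "clean"],
})

quality = summarize_quality(sample)
low, high = price_outlier_bounds(sample["price"])
flagged = sample[(sample["price"] < low) | (sample["price"] > high)]

print(quality)
print(flagged)
```

On the real data, columns with a high `missing_share` are candidates for dropping or imputing, and the flagged rows are worth inspecting by hand before removing them.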

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
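
As a rough illustration of the transformations mentioned above, the sketch below log-transforms the price, derives an age feature, and prepares numeric and categorical columns for `sklearn`. Everything here is an assumption for illustration: `build_preprocessor` is a hypothetical helper, the `age` feature and the choice of columns are not part of the original notebook, and the `sample` rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Scale numeric columns and one-hot encode categorical ones."""
    return ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500, 4900],
    "year": [2012, 2015, 2018, 2003, 2009],
    "odometer": [90000, 60000, 30000, 180000, 120000],
    "condition": ["good", "excellent", "excellent", "fair", "good"],
})

# Engineered feature (assumption): age relative to the newest car in the data.
sample["age"] = sample["year"].max() - sample["year"]

# Log-transform the skewed target; np.expm1 maps predictions back to dollars.
y_log = np.log1p(sample["price"])

preprocessor = build_preprocessor(["age", "odometer"], ["condition"])
X_prepared = preprocessor.fit_transform(sample)

print(X_prepared.shape)
```

Fitting the preprocessor inside a pipeline, rather than on the full dataset up front, keeps test-set statistics out of the scaling step.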

In [2]:
#Import necessary libraries for data analysis and visualization
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [3]:
# read in data
cars = pd.read_csv('data/vehicles.csv')
cars.head()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
0,7222695916,prescott,6000,,,,,,,,,,,,,,,az
1,7218891961,fayetteville,11900,,,,,,,,,,,,,,,ar
2,7221797935,florida keys,21000,,,,,,,,,,,,,,,fl
3,7222270760,worcester / central MA,1500,,,,,,,,,,,,,,,ma
4,7210384030,greensboro,4900,,,,,,,,,,,,,,,nc


In [4]:
# Data cleaning 
# Drop rows where 'model' is NaN
cars = cars.dropna(subset=['model'])

# Drop rows where 'manufacturer' is NaN
cars = cars.dropna(subset=['manufacturer'])

# Let's clean up undesirable title statuses. Parts-only listings are covered by this list,
# so no separate filter is needed. 'parts only' (with a space) is included in case the raw data uses that spelling.
remove_title_status = ['salvage', 'parts_only', 'parts only', 'lien', 'missing', 'rebuilt']
cars = cars[~cars['title_status'].isin(remove_title_status)]

# Let's drop cars that are over 30 years old (model year before 1995). Older cars may be less desirable to customers and priced below market value.
# They may also need more maintenance and repairs, lack safety features, be less fuel efficient, and cost money to store on the lot.
cars = cars[cars['year'] >= 1995]

# Minimum price worth keeping on the lot: drop cars priced under $600, since storing a car costs at least $50/month ($600 over a year).
cars = cars[cars['price'] >= 600]

# Let's get rid of cars priced over $120,000, as they may be outliers that skew the data. They may also be less desirable to customers and priced above market value.
# Anything over $120,000 is likely a luxury or exotic car, which we can analyze separately if we want to.
cars = cars[cars['price'] <= 120000]

# Let's drop records where the manufacturer is Harley-Davidson, as they are motorcycles. Like luxury cars, motorcycles have different features and are typically sold by specialized dealerships.
cars = cars[cars['manufacturer'] != 'harley-davidson']

# Let's drop cars with an odometer reading over 250,000 miles, as they may be less desirable to customers and priced below market value. Past 250,000 miles, it may be time to replace the motor.
cars = cars[cars['odometer'] <= 250000]
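
Because each filter above silently discards rows, it can help to record how many listings every step removes. The sketch below is one way to do that under stated assumptions: `apply_filters` is a hypothetical helper, the thresholds mirror the ones used above, and the `sample` rows are made up.

```python
import pandas as pd


def apply_filters(df, filters):
    """Apply named boolean filters in order and log the rows each one removes."""
    log = []
    for name, keep in filters:
        before = len(df)
        df = df[keep(df)]
        log.append({"step": name, "removed": before - len(df), "remaining": len(df)})
    return df, pd.DataFrame(log)


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "price": [6000, 300, 21000, 150000, 4900],
    "year": [2012, 2015, 1988, 2019, 2009],
    "odometer": [90000, 120000, 60000, 10000, 300000],
})

filters = [
    ("year >= 1995", lambda d: d["year"] >= 1995),
    ("600 <= price <= 120000", lambda d: d["price"].between(600, 120000)),
    ("odometer <= 250000", lambda d: d["odometer"] <= 250000),
]

cleaned, report = apply_filters(sample, filters)
print(report)
```

A step that removes a surprisingly large share of rows is a signal to revisit its threshold before modeling.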

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
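
One way to explore parameters with cross-validation is a grid search over a small pipeline. The sketch below is an assumption-laden example rather than the notebook's model: `tune_ridge` is a hypothetical helper, Ridge regression with polynomial features is just one candidate family, and the `demo` data is synthetic with a made-up price formula.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def tune_ridge(X, y, alphas=(0.1, 1.0, 10.0), degrees=(1, 2), cv=5):
    """Grid-search polynomial degree and Ridge alpha with k-fold cross-validation."""
    pipe = make_pipeline(
        PolynomialFeatures(include_bias=False),
        StandardScaler(),
        Ridge(),
    )
    param_grid = {
        "polynomialfeatures__degree": list(degrees),
        "ridge__alpha": list(alphas),
    }
    search = GridSearchCV(pipe, param_grid, cv=cv, scoring="neg_mean_squared_error")
    return search.fit(X, y)


# Synthetic data (assumption): price rises with year and falls with mileage.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "year": rng.integers(1995, 2022, n),
    "odometer": rng.uniform(0, 250000, n),
})
demo["price"] = (
    15000 + 1000 * (demo["year"] - 1995) - 0.05 * demo["odometer"]
    + rng.normal(0, 1500, n)
)

search = tune_ridge(demo[["year", "odometer"]], demo["price"])
print(search.best_params_)
print("CV MSE:", -search.best_score_)
```

`best_score_` is the negated mean squared error averaged over the folds, so it is flipped back to a positive MSE before printing.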

In [None]:
# Features: model year and odometer reading; target: price
X = cars[['year', 'odometer']]
y = cars['price']
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
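
To connect a fitted model back to price drivers, one option is to compare coefficients on standardized features and report the hold-out error in dollars. The sketch below assumes a plain linear model: `rank_drivers` is a hypothetical helper, the `demo` data is synthetic with a made-up price formula, and standardized coefficients are only a rough ranking when features are correlated.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def rank_drivers(X, y, random_state=42):
    """Return standardized coefficients sorted by magnitude and hold-out RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)
    rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    coefs = pd.Series(model.named_steps["linearregression"].coef_, index=X.columns)
    order = coefs.abs().sort_values(ascending=False).index
    return coefs.reindex(order), rmse


# Synthetic data (assumption): price rises with year and falls with mileage.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "year": rng.integers(1995, 2022, n),
    "odometer": rng.uniform(0, 250000, n),
})
demo["price"] = (
    15000 + 1000 * (demo["year"] - 1995) - 0.05 * demo["odometer"]
    + rng.normal(0, 1500, n)
)

coefs, rmse = rank_drivers(demo[["year", "odometer"]], demo["price"])
print(coefs)
print(f"Hold-out RMSE: ${rmse:,.0f}")
```

Each coefficient reads as the change in price for a one standard deviation change in that feature, which gives the client a like-for-like comparison across features measured in different units.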

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.