# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

The business goal is to identify the key variables that influence used car prices, in order to give used car dealers recommendations on optimizing inventory, leading to cost avoidance and increased profitability.

The data task is to perform regression analysis on the used car dataset, modeling used car price as the target variable and attributes such as make, model, mileage, and condition as the predictor variables. Here are the steps:

Data Preparation: Handle missing data, encode categorical variables

Model Fitting: Split data into train/test sets, fit regression models (linear, lasso) on train set

Model Evaluation: Evaluate model performance on test set, analyze coefficients

Business Insights: Identify key variables influencing used car prices based on model results to provide inventory recommendations

In summary, we will leverage regression modeling and analysis to quantify the relationships between used car attributes and price, in order to extract insights to guide used car dealer inventory decisions. 

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

First step was to load the dataset into a pandas dataframe and display a few rows.

There appear to be a few quality issues with the dataset:

1/ Missing Values: Many columns in the first few rows contain NaN values, indicating that the data is missing for these fields. Some of these columns include 'year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color'. 

2/ Data Consistency: The 'price' column has values for all rows displayed, but most other columns are missing values.

3/ Data Relevance: Some columns are not relevant to the task at hand (i.e., predicting used car prices). For example, 'region', 'fuel' and 'VIN' might not be as impactful as 'year', 'manufacturer', or 'model' in determining a car's price.
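The missing-value check described above can be sketched as follows. This is a minimal illustration on a tiny made-up frame; the column names come from the list above, but the values are invented and the real data would be loaded with `pd.read_csv`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the used car data (values are invented for illustration).
cars = pd.DataFrame({
    "price": [5000, 12000, 7500, 30000],
    "year": [2010, np.nan, 2015, 2019],
    "manufacturer": ["ford", None, None, "bmw"],
    "size": [None, None, None, "mid-size"],
})

# Share of missing values per column, as a percentage, largest first.
missing_pct = cars.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting by percentage makes it easy to see which columns are candidates for dropping and which can reasonably be imputed.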

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

Initial Data Preparation:
As part of the initial data preparation, I first identified the missing data and sorted the columns by percentage missing. I then dropped columns with a high percentage of missing values, using 20% as my threshold. Next, I imputed the remaining missing values and removed outliers. The last step was to drop irrelevant columns and remove duplicate rows.
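A minimal sketch of these cleaning steps is shown below. The 20% threshold comes from the text; the imputation rule (median or most frequent value) and the 1.5 × IQR outlier rule on `price` are assumptions chosen for illustration, as is the helper name `prepare`.

```python
import numpy as np
import pandas as pd

def prepare(df: pd.DataFrame, max_missing: float = 0.20) -> pd.DataFrame:
    """Hypothetical helper: drop sparse columns, impute, trim outliers, dedupe."""
    # Drop columns whose missing share exceeds the threshold (20% in the text).
    out = df.loc[:, df.isna().mean() <= max_missing].copy()
    # Assumed imputation: median for numeric columns, most frequent value otherwise.
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    # Assumed outlier rule: keep prices within 1.5 * IQR of the quartiles.
    q1, q3 = out["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    out = out[out["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    # Remove exact duplicate rows.
    return out.drop_duplicates().reset_index(drop=True)

# Invented example data: one sparse column, one gap, one outlier, one duplicate.
cars = pd.DataFrame({
    "price": [5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, 12000, 1_000_000],
    "odometer": [90000, 80000, np.nan, 60000, 50000, 40000, 30000, 20000, 20000, 10000],
    "size": [None] * 8 + ["mid-size", "full-size"],
})
clean = prepare(cars)
print(clean)
```

Dropping sparse columns before imputing avoids filling in columns that are mostly guesses.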

Feature Engineering:
This step involves creating new features from existing ones that might better represent the underlying problem. For this exercise, I created a new feature that represents the age of the car at the time of the sale.
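The age feature can be derived roughly as follows. This sketch assumes a single reference sale year (`SALE_YEAR`, a hypothetical constant); if a listing date column is available, its year would be the better choice.

```python
import pandas as pd

# Invented example rows; 'year' is the model year of the car.
cars = pd.DataFrame({"year": [2010, 2015, 2019], "price": [5000, 12000, 30000]})

# Assumption: all listings come from one reference year.
SALE_YEAR = 2021

# Age in years at the time of sale; clip guards against negative ages
# caused by data-entry errors in 'year'.
cars["age"] = (SALE_YEAR - cars["year"]).clip(lower=0)
print(cars)
```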

Finally, I split the dataset into a training set and a test set. This allows us to evaluate how well our model generalizes to unseen data.
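A hedged sketch of the split, using `train_test_split` from `sklearn.model_selection` on synthetic data. The 80/20 ratio and the `random_state` value are assumptions; the text does not state which were used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(1, 20, 100),
    "odometer": rng.integers(10_000, 200_000, 100),
})
y = 30_000 - 1_000 * X["age"] - 0.05 * X["odometer"]

# Hold out 20% of the rows for testing (assumed ratio); fix the seed
# so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```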

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

I used Lasso and Ridge Regression modeling techniques and used 'cross_val_score' to perform cross-validation by splitting the training data into several subsets, training the model on some of these subsets, and evaluating it on the remaining ones. 

PS: the regressions took forever to run :-) and kept killing the kernel, so I took a sample of the data and reduced the number of cross-validation folds (the `cv` argument) from 5 to 3.
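The workaround might look something like this. The 10% sampling fraction is an assumption for illustration (the text does not say how large the sample was), and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the full prepared dataset.
rng = np.random.default_rng(0)
cars = pd.DataFrame({"age": rng.integers(0, 25, 1_000)})
cars["price"] = 40_000 - 1_200 * cars["age"] + rng.normal(scale=1_000, size=1_000)

# Work on a reproducible random sample to cut run time (10% is assumed).
sample = cars.sample(frac=0.1, random_state=42)

# Fewer folds means fewer model fits: cv=3 instead of cv=5.
scores = cross_val_score(
    Ridge(alpha=1.0), sample[["age"]], sample["price"],
    cv=3, scoring="neg_mean_squared_error",
)
print(len(sample), -scores.mean())
```

Sampling trades some precision in the estimates for run time, so it is worth confirming the final model on the full training data if resources allow.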

The next step was to conduct cross-validation for each model on our training data (X_train, y_train) using the cross_val_score function from sklearn.model_selection to perform 5-fold cross-validation on our model. The cross_val_score function trains the model on a portion of the training data and then evaluates it on a separate portion. This process is repeated five times (in the code it is cv=5), with different partitions of the data used each time. The function then returns the scores from each of these five evaluations.
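A minimal sketch of this cross-validation step on synthetic data. The scaling step, the `alpha` values, and the MSE scoring choice are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data standing in for X_train and y_train.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = X_train @ np.array([3.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=200)

models = {"ridge": Ridge(alpha=1.0), "lasso": Lasso(alpha=0.1)}
cv_mse = {}
for name, model in models.items():
    # Scaling inside the pipeline is refit on each training fold,
    # so no information leaks from the validation fold.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(
        pipe, X_train, y_train, cv=5, scoring="neg_mean_squared_error"
    )
    # sklearn returns negated MSE; flip the sign and average the five folds.
    cv_mse[name] = -scores.mean()

print(cv_mse)
```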

After that, we need to find the best hyperparameters for the Ridge and Lasso regression models, tuning `alpha` (the regularization strength) as the only hyperparameter.
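One common way to run this search is `GridSearchCV`; whether the original notebook used it is an assumption, and the grid of `alpha` values below is illustrative only.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic training data standing in for X_train and y_train.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = X_train @ np.array([3.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=200)

# Candidate regularization strengths (assumed grid, log-spaced).
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

best_alpha = {}
for name, model in {"ridge": Ridge(), "lasso": Lasso(max_iter=10_000)}.items():
    # Each alpha is scored by 5-fold cross-validated (negated) MSE.
    search = GridSearchCV(model, param_grid, cv=5, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)
    best_alpha[name] = search.best_params_["alpha"]

print(best_alpha)
```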

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

Business Objective

The primary goal is to understand the drivers of used car prices. Our models can offer insight into this by examining the coefficients of the regression models. These coefficients indicate how much a unit increase in the corresponding feature will change the price, holding all other features constant.

In the models we tested, Ridge and Linear Regression had the lowest MSE, making them the best performing models. However, interpreting the coefficients from the Ridge Regression model can be tricky, because the penalty shrinks them toward zero, so our focus is on the Linear Regression model.


Model Insights
Looking at the coefficients of the Linear Regression model, the features with the largest absolute coefficients are the most influential in predicting used car prices, provided the features are on comparable scales (for example, after standardization).
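As a rough illustration of ranking features by coefficient size, the sketch below standardizes synthetic features first so that the coefficients are comparable. The feature names and data are invented; the standardization step is an assumption about the workflow.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: price depends strongly on age, weakly on odometer, not on noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.uniform(1, 20, 300),
    "odometer": rng.uniform(10_000, 200_000, 300),
    "noise": rng.normal(size=300),
})
y = 30_000 - 1_000 * X["age"] - 0.02 * X["odometer"] + rng.normal(scale=500, size=300)

# Standardize so each coefficient is the price change per one standard deviation.
X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

# Order the coefficients by absolute size, keeping their signs.
coefs = pd.Series(model.coef_, index=X.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Without standardization, a feature measured in large units (such as odometer miles) would show a tiny coefficient even when its overall effect on price is substantial.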

However, because we used target encoding for 'manufacturer' and 'model', interpreting these coefficients might be less straightforward. The encoded values represent the average price for each category, so the coefficients would represent the change in price for a unit change in these average prices.
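For reference, a minimal hand-rolled version of target encoding is sketched below. The data is invented, and falling back to the global mean for unseen categories is an assumption; the original notebook may have used a library encoder instead.

```python
import pandas as pd

# Invented training and test rows.
train = pd.DataFrame({
    "manufacturer": ["bmw", "bmw", "ford", "ford", "ford"],
    "price": [30000, 34000, 10000, 12000, 14000],
})
test = pd.DataFrame({"manufacturer": ["bmw", "ford", "kia"]})

# Learn the mean price per category from the training set only,
# so no test-set prices leak into the encoding.
category_means = train.groupby("manufacturer")["price"].mean()
global_mean = train["price"].mean()

train["manufacturer_enc"] = train["manufacturer"].map(category_means)
# Categories unseen in training fall back to the overall mean (assumed rule).
test["manufacturer_enc"] = test["manufacturer"].map(category_means).fillna(global_mean)
print(test)
```

Because the encoded column is itself measured in price units, its regression coefficient reads as "price change per one unit increase in the category's average price".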

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

Key Insights
The features that most strongly influence used car prices are those with the largest coefficients (in absolute value) in our model. Basically, the bigger the value, the more impact the feature has on the price of a used car.

However, there's a bit of complexity when considering 'manufacturer' and 'model'. Since we've used a special method to convert these categories into numbers, the impact they have on price is tied to the average price of cars in each category. So, a change in these values actually reflects changes in these average prices.

Firstly, the age of a car plays a significant role in determining its price. As expected, newer cars tend to command higher prices in the used car market. Therefore, it might be beneficial to stock more recent models in your inventory, as they are likely to bring in higher profits.

Our analysis also revealed that the manufacturer and model of a car significantly impact the car's price. For example, premium brands like Mercedes-Benz and BMW tend to have higher prices compared to other brands. It may be a good strategy to stock more cars from premium brands, as they could potentially bring higher returns.