# What Drives the Price of a Car?

## Business Understanding

Used car dealers have multiple factors to consider when pricing a used car. Market, mileage, overall condition and popularity all affect a vehicle's worth to the consumer. How does a dealer evaluate all of those factors? The goals of this project are to identify the factors that make a car more or less expensive and to make recommendations to a used car dealership on what consumers value. The data analysis can be found in the associated notebook.

## Data Understanding

The dataset comes from Kaggle and contains information on 426,000 cars. The variable of interest is car price (in US dollars). Figure 1 shows the distribution of the price. Prices range from \\$0 to \\$3.7 billion. The distribution is strongly right-skewed: most prices fall between \\$5,000 and \\$25,000, and the median is \\$13,950.

![box_price.png](attachment:box_price.png) 

![Box_Median_Price.png](attachment:Box_Median_Price.png)

**Figure 1.** The top boxplot shows the full distribution of price. The lower boxplot has outliers removed and shows the median of the variable.

In addition to the target variable price, there are 17 features included in this dataset. Three of these are numeric and 14 are objects. The numeric features include the following:

##### Numeric Variables
* id: This is a unique identifier for each sample.
* year: The relationship of year and price is shown below in Figure 2. Note that price is plotted on a logarithmic scale. There may be a negative correlation with the log(price).

![year.png](attachment:year.png)

**Figure 2.** Scatterplot showing the relationship of vehicle price and year. Price is plotted on a logarithmic scale.

* odometer: The relationship of odometer and price is shown below in Figure 3. Note that price is plotted on a logarithmic scale. There is not a clear relationship.

![odometer.png](attachment:odometer.png)

**Figure 3.** Scatterplot showing the relationship of odometer reading with vehicle price. Price is plotted on a logarithmic scale.

##### Object Variables

There are 14 object features.

* region: Region where car was sold, generally a city or multiple cities, but also includes colloquial names. 404 unique values.
* manufacturer: Manufacturer of the vehicle. 42 unique values.
* model: The model of the vehicle. 29649 unique values.
* condition: Qualitative assessment of vehicle conditions. Six values: good, excellent, like new, fair, new, salvage.        
* cylinders: Number of cylinders in the vehicle engine: 6, 4, 8, 5, 10, 3, 12 or other.
* fuel: Type of fuel the vehicle uses: gas, diesel, hybrid, electric, other.
* title_status: Status of the vehicle title: clean, rebuilt, salvage, lien, missing, or parts only.
* transmission: Type of transmission: automatic, manual, other.
* VIN: Unique identifier for each vehicle.
* drive: Identifies how power is delivered to the wheels: front wheel drive (fwd), rear wheel drive (rwd), or 4-wheel drive (4wd).
* size: Size of the vehicle: full-size, mid-size, compact, or sub-compact.
* type: Type of vehicle: sedan, SUV, pickup, truck, other, coupe, hatchback, wagon, van, convertible, mini-van, offroad, or bus.
* paint_color: Color of the vehicle, includes 12 color values.
* state: State in which the vehicle was sold, 50 unique values.

## Data Preparation

##### Target Variable

Samples that listed a price of \\$0 were removed from the dataset. It's possible these reflected salvage values, but it was unclear whether they were actually missing data. Cars with prices greater than or equal to \\$100,000 were also removed so that the price variable was closer to a normal distribution.

The target variable was transformed by taking log of price.
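The filtering and transformation described above can be sketched roughly as follows. This is a minimal illustration on a toy table, not the notebook's actual code; the DataFrame name `cars`, the new column name `log_price`, and the choice of the natural logarithm are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (hypothetical values).
cars = pd.DataFrame({"price": [0, 4500, 13950, 25000, 100000, 3_700_000_000]})

# Keep prices strictly above $0 and strictly below $100,000.
filtered = cars[(cars["price"] > 0) & (cars["price"] < 100_000)].copy()

# Assumption: natural log; the text only says "log of price".
filtered["log_price"] = np.log(filtered["price"])

print(filtered)
```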

##### Features Removed

The columns 'id' and 'VIN' were removed because they are unique identifiers and not correlated with price. The columns 'region' and 'model' were also removed. These included too many unique values and were thought to be too detailed for this application.

##### Numeric Variables

The remaining numeric variables, 'year' and 'odometer', were not transformed. The squares of these variables were explored in relation to price and did not visually show a stronger relationship.

###### Removing Null Values

Samples containing null values in any of the remaining columns were removed. The variable 'size' contained 306361 null values so this reduced the data by almost 75%. Size was considered to be a relevant variable so this cut was considered essential.
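As a rough sketch of this step, the null counts per column and the effect of dropping incomplete rows can be checked with pandas. The toy DataFrame and its values below are hypothetical; only `isnull()` and `dropna()` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (hypothetical values).
cars = pd.DataFrame({
    "year": [2015, 2012, np.nan, 2018],
    "size": ["full-size", None, "compact", None],
    "fuel": ["gas", "diesel", "gas", "gas"],
})

# Count nulls per column to see which feature drives the row loss.
null_counts = cars.isnull().sum()

# Drop every row that has a null in any remaining column.
complete = cars.dropna()
fraction_dropped = 1 - len(complete) / len(cars)

print(null_counts)
print(f"Fraction of rows dropped: {fraction_dropped:.2f}")
```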

##### Object Variables

Dummy variables were created for all of the remaining object variables using get_dummies().
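A minimal sketch of the encoding step is shown below. It assumes `pd.get_dummies()` was called with its default arguments (for example, without `drop_first`), which the text does not state; the toy columns are hypothetical.

```python
import pandas as pd

# Toy stand-in with one numeric and two categorical columns.
cars = pd.DataFrame({
    "year": [2015, 2012, 2018],
    "fuel": ["gas", "diesel", "gas"],
    "drive": ["fwd", "4wd", "rwd"],
})

# Columns to one-hot encode; numeric columns pass through unchanged.
categorical_cols = ["fuel", "drive"]
encoded = pd.get_dummies(cars, columns=categorical_cols)

print(encoded.columns.tolist())
```

Each category becomes its own column, which is how 14 object features can expand into well over a hundred model inputs.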

##### Final Dataset

The resulting dataset consisted of 154 features in addition to 'price' and 76620 samples. Ten percent of this dataset was randomly selected and removed to serve as a test set that would not be used for training or cross validation. 
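One common way to hold out a random 10% is scikit-learn's `train_test_split`. The text does not say which function was used, so this is an assumption, as are the synthetic data and the `random_state` value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and log-price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Hold out 10% as a test set that is never used in training or cross validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

print(X_train.shape, X_test.shape)
```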

## Modelling

##### Feature Elimination using Ridge 

Trying to model 154 variables was too computationally involved for this example. The first step was to eliminate variables to make modelling more manageable. I ran a Ridge regression on the full data set. The data was first standardized using StandardScaler(). GridSearchCV() was used to select the optimum alpha value with 5-fold cross validation. The optimum alpha value was 670. Figure 4 shows the coefficients produced using an alpha of 670. Feature labels are not included, but the goal of the plot is to show the range of coefficient values. Certain features have stronger weights than those that have coefficients close to zero.

![ridge_coef.png](attachment:ridge_coef.png)

**Figure 4.** Ridge coefficients using an optimized alpha value = 670. Feature labels are not included, but the plot shows the range of coefficient values.
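The Ridge step can be sketched as a pipeline that standardizes the features and then searches over alpha with 5-fold cross validation. The synthetic data, the alpha grid, and the scoring metric below are assumptions for illustration; the actual grid used in the notebook is not given in this write-up.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three informative features, seven pure-noise features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([1.0, -0.5, 0.3] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Scaling inside the pipeline keeps each CV fold's scaler fit on training data only.
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
param_grid = {"ridge__alpha": np.logspace(-2, 3, 20)}  # assumed grid

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
ridge_coefs = search.best_estimator_.named_steps["ridge"].coef_
print(best_alpha)
print(ridge_coefs.round(3))
```

Because the features are standardized, the coefficient magnitudes are on a comparable scale, which is what makes a plot like Figure 4 usable for ranking features.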

I used the information in Figure 4 to eliminate variables. I eliminated any feature whose coefficient had an absolute value below 0.05. This reduced the dataset to 23 features.
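The thresholding rule can be expressed in a couple of lines. The feature names and coefficient values below are hypothetical and only illustrate the mechanics; they are not the coefficients from Figure 4.

```python
import pandas as pd

# Hypothetical standardized Ridge coefficients, indexed by feature name.
coefs = pd.Series({
    "year": 0.31,
    "odometer": -0.12,
    "fuel_diesel": 0.08,
    "paint_color_blue": 0.004,
    "state_oh": -0.02,
})

# Keep features whose absolute coefficient is at least 0.05.
threshold = 0.05
kept = coefs[coefs.abs() >= threshold].index.tolist()

print(kept)
```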

##### Feature Selection using SequentialFeatureSelector() and LASSO

SequentialFeatureSelector() was used with linear regression to identify the best features. I used GridSearchCV() with 5-fold cross validation to determine the optimum number of features from the reduced dataset of 23 features. Twenty-one was the optimum number of features. These 21 features were selected to create a final model.
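One way to tune the number of selected features is to place `SequentialFeatureSelector` in a pipeline and let `GridSearchCV` search over `n_features_to_select`. This is a sketch under assumptions: the synthetic data, forward selection, the inner `cv=3`, and the candidate grid are illustrative choices, not details confirmed by the notebook.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data: two informative features, three noise features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

pipe = Pipeline([
    ("select", SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=2, direction="forward", cv=3
    )),
    ("model", LinearRegression()),
])

# Candidate feature counts (must be smaller than the total number of features).
param_grid = {"select__n_features_to_select": [1, 2, 3, 4]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_k = search.best_params_["select__n_features_to_select"]
selected_mask = search.best_estimator_.named_steps["select"].get_support()
print(best_k, selected_mask)
```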

The reduced dataset was also used to model price using a Lasso regression. GridSearchCV() with 5-fold cross validation was used to determine the optimum alpha (0.1). This regression reduced the dataset to 16 features with non-zero coefficients. These 16 features were used to create an alternative final model to compare to the other final model.
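The Lasso step can be sketched in the same way: search over alpha, then count the features whose coefficients survive. The synthetic data, the alpha grid, and the decision to standardize inside the pipeline are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative features, six noise features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

pipe = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(max_iter=10_000))])
param_grid = {"lasso__alpha": [0.001, 0.01, 0.1, 1.0]}  # assumed grid

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_alpha = search.best_params_["lasso__alpha"]
lasso_coefs = search.best_estimator_.named_steps["lasso"].coef_

# Lasso sets some coefficients exactly to zero; the rest are the selected features.
n_nonzero = int(np.sum(lasso_coefs != 0))
print(best_alpha, n_nonzero)
```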

##### Final Model

Model 1 used the 21 features selected using SequentialFeatureSelector() and linear regression. Model 2 used the 16 features selected using Lasso. Both models were trained with the full dataset (minus the 10% withheld as a test set). Both models were then used to predict the test set. Results are shown below. 

![results.PNG](attachment:results.PNG)

**Table 1.** Results of two possible final models.

Model 1 performed slightly better and was chosen as the final model. The results of the best model (Model 1) are shown below.

![ModelCoef.PNG](attachment:ModelCoef.PNG)

**Table 2.** Features selected for the final model along with their coefficients.

## Evaluation

##### Limitations of the Model

Outliers were removed from the training and test datasets. These included any samples with a price of \\$0 or a price greater than or equal to \\$100,000. The model will not perform well on cars with prices outside the range of the dataset. The adjusted R-squared was equal to 0.340969. This means the model can only account for 34% of the variability in price. In other words, the model does not account for two-thirds of the variability in price. This model should not be used to predict price. However, it can be used to identify the factors that most influence price and whether they have a negative or positive effect on it.
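For reference, adjusted R-squared penalizes the ordinary R-squared for the number of predictors: adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of samples and p the number of predictors. A small helper, with a hypothetical function name and made-up inputs, might look like this:

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R-squared for a model with n_features predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical inputs, not the values from this project.
print(adjusted_r2(r2=0.5, n_samples=101, n_features=10))
```

With tens of thousands of samples and about twenty features, the adjustment is small, so the adjusted and unadjusted values are nearly the same.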

##### Suggestions for Model Improvement

The correlations between the original 154 predictor variables are shown below in a heatmap. Most pairs of variables have a correlation near zero. However, some pairs are highly correlated (black or light squares). This indicates that not all of the variables are independent. Removing variables that lack independence could improve the model.

![heatmap-2.png](attachment:heatmap-2.png)

**Figure 5.** Heatmap of correlation between predictor variables.
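One possible way to act on this suggestion is to scan the upper triangle of the absolute correlation matrix and flag one variable from each highly correlated pair. This is a sketch of a common approach rather than something done in the notebook; the synthetic columns and the 0.9 cutoff are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: "b" is nearly a copy of "a"; "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag the later column of any pair whose correlation exceeds the cutoff.
cutoff = 0.9  # assumed threshold
to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]

print(to_drop)
```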

The target variable 'price' was transformed using log(price). This variable could be explored further. Different functions of this variable could be tested and cross validated to find the best fit.

## Deployment

While this model has limitations in accurately predicting price, it can be used to identify the factors that influence the value of a car. The model coefficients from Table 2 are plotted below. This information is used to make recommendations to car dealers.

![modelCoeff-2.png](attachment:modelCoeff-2.png)

##### Recommendations for Car Dealers

* Year and odometer were selected by the model as influencing price, but they have very little impact on the price.
* The condition of the vehicle is important to the consumer. Buyers prefer a used vehicle that is 'like new.' Other conditions (fair, good and salvage) negatively impact the price.
* Engines with 10 cylinders decrease the value of the car.
* Vehicles that use diesel fuel are more valuable to consumers.
* The type of transmission influences the value of the vehicle. Automatic transmissions slightly lower the value. Other transmissions strongly decrease the value of the car.
* The size of the vehicle influences the value. Compact size decreases the value, full size increases the value.
* The type of vehicle influences the value. Convertibles, pickups and trucks increase the value. Sedans and wagons decrease the value.
* Vehicles sold in the state of Florida tend to have a lower value.