# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

The client wants to be able to predict prices for used cars. In data terms, this is a supervised regression problem with `price` as the target: we need to identify the set of features that have the strongest effect on price and estimate the direction and size of each effect.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [None]:
import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from scipy import stats
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, PolynomialFeatures, StandardScaler

plt.rcParams['figure.figsize'] = (10, 6)

In [None]:
cars = pd.read_csv("data/vehicles.csv")

In [None]:
cars.info()

In [None]:
cars.head()

In [None]:
cars.describe()

In [None]:
cars.nunique()

In [None]:
cars["condition"].unique()

In [None]:
cars["drive"].unique()

It seems that we have 4 numerical properties (id, which is not useful for our analysis, plus price, year, and odometer/mileage) and 13 text-based categorical properties, including VIN (which is useless for this analysis because it is a unique identifier of a specific car).

Now let's see how much of it is empty.

In [None]:
missing_values = (cars.isna().mean() * 100).reset_index().rename(columns={"index": "Features", 0: "Missing %"})
missing_values["Missing %"] = round(missing_values["Missing %"])
missing_values["# of missing"] = cars.isna().sum().to_frame().reset_index()[0]

missing_values

In [None]:
# Let's check for duplicates
cars.drop("id", axis=1).duplicated().sum()

In [None]:
cars[cars.drop(["id", "VIN"], axis=1).duplicated()]

In [None]:
print(cars[cars["VIN"].notna()].duplicated(["VIN"]).sum())

There are ~56.2k fully duplicated rows, plus about 200 more that differ only in VIN. Speaking of VIN, about 147.5k listings (~34%) repeat a VIN that already appears elsewhere in the dataset. Combined with the 38% of rows where VIN is missing, well over half of this column is unusable.
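
To make the three duplicate checks above easier to compare, here is a minimal sketch on a toy frame. The rows are invented for illustration and are not taken from `vehicles.csv`. `duplicated` flags every repeat after the first occurrence, and dropping columns (or passing `subset`) controls which columns are compared.

```python
import pandas as pd

# Toy listings, invented purely for illustration
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "VIN": ["A1", "A1", "B2", None],
    "price": [5000, 5000, 5000, 7000],
    "year": [2010, 2010, 2010, 2015],
})

# Rows identical in everything except the listing id
full_dupes = toy.drop(columns="id").duplicated().sum()

# Rows identical once both id and VIN are ignored
dupes_ignoring_vin = toy.drop(columns=["id", "VIN"]).duplicated().sum()

# Repeated VINs among the rows that actually have a VIN
repeated_vins = toy[toy["VIN"].notna()].duplicated(subset=["VIN"]).sum()

print(full_dupes, dupes_ignoring_vin, repeated_vins)
```

The difference between the first two counts corresponds to the rows that differ only in VIN.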

In [None]:
sns.boxplot(cars["price"])
plt.title("\n\nBox Plot of Price")

The vast majority of cars are sold at prices lower than 1M USD, typically in the tens of thousands, but the box plot is dominated by outliers that go as high as 3.73 billion USD. Let's try removing them.

In [None]:
cars_no_price_out = cars[(np.abs(stats.zscore(cars["price"])) < 3)]

In [None]:
cars_no_price_out = cars_no_price_out[(np.abs(stats.zscore(cars_no_price_out["price"])) < 3)]
cars_no_price_out = cars_no_price_out[(np.abs(stats.zscore(cars_no_price_out["price"])) < 3)]

In [None]:
cars_no_price_out.describe()

In [None]:
sns.boxplot(cars_no_price_out["price"])
plt.title("\n\nBox plot of prices, no outliers")

In [None]:
px.box(cars_no_price_out["price"], title="Box plot of prices, no outliers")

As we can see, after removing outliers (three passes were needed to get a clear picture), most used cars sell for between 0 and 56k USD, with the middle 50% of sales between 5.8k and 26k USD.
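
Why repeat the filter three times? A single extreme value inflates the standard deviation so much that other outliers can end up with small z-scores on the first pass. The sketch below shows this on synthetic prices. The helper name `drop_zscore_outliers` and the data are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

def drop_zscore_outliers(df, column, threshold=3.0, passes=3):
    """Repeatedly keep only rows whose |z-score| in `column` is below `threshold`."""
    result = df
    for _ in range(passes):
        # Recompute z-scores on what is left, so each pass uses a less inflated std
        z = np.abs(stats.zscore(result[column].to_numpy()))
        result = result[z < threshold]
    return result

# Synthetic prices: 1,000 ordinary listings plus two absurd ones
rng = np.random.default_rng(0)
toy = pd.DataFrame({"price": np.append(rng.normal(15_000, 5_000, 1_000), [3.7e9, 1e6])})

cleaned = drop_zscore_outliers(toy, "price")
print(len(toy), len(cleaned), cleaned["price"].max())
```

In this toy data the first pass only removes the multi-billion listing; the million-dollar one survives until the standard deviation is recomputed on the second pass.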

Now, let's do the same with odometer readings.

In [None]:
sns.boxplot(cars["odometer"])
plt.title("\n\nBox plot of odometer")

In [None]:
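# nan_policy="omit" keeps the mask aligned with `cars`; missing readings compare as False
cars_no_odometer_out = cars[np.abs(stats.zscore(cars["odometer"].to_numpy(), nan_policy="omit")) < 3].dropna()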

In [None]:
cars_no_odometer_out = cars_no_odometer_out[(np.abs(stats.zscore(cars_no_odometer_out["odometer"])) < 3)]

In [None]:
cars_no_odometer_out.describe()

In [None]:
px.box(cars_no_odometer_out["odometer"], title = "Box plot of odometer, no outliers")

Most cars are sold with a mileage between 0 and 253k (assuming the odometer is recorded in miles), with the middle 50% between 65k and 140k.
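
The ranges quoted above are read off the box plots. Assuming the plots follow the common Tukey convention (whiskers reach at most 1.5 times the interquartile range beyond the quartiles), the same bounds can be computed directly. A small sketch with made-up readings and a hypothetical helper `tukey_fences`:

```python
import pandas as pd

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up odometer readings with one obvious outlier
toy_odometer = pd.Series([10_000, 60_000, 90_000, 120_000, 150_000, 900_000])

lower, upper = tukey_fences(toy_odometer)
inside = toy_odometer[toy_odometer.between(lower, upper)]
print(lower, upper, inside.max())
```

A negative lower fence, as in this toy example, simply means that no low reading counts as an outlier.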

Now let's look at the distributions of price and odometer once the outliers are removed.

In [None]:
cars_no_price_out["price"].plot(kind="kde")
plt.title("KDE plot of price, no outliers")

In [None]:
cars_no_odometer_out["odometer"].plot(kind="kde")
plt.title("KDE plot of odometer, no outliers")

Now, let's make some bulk plots for individual groups.

In [None]:
features = ["condition", "transmission", "drive", "title_status", "fuel", "cylinders", "type"]
fig, axs = plt.subplots(4, 2, figsize=(18, 15))

for ax, feature in zip(axs.flat, features):
    # value_counts().reset_index() yields the columns [feature, "count"] in pandas >= 2.0
    count = cars[feature].value_counts().reset_index()
    sns.barplot(data=count, x=feature, y="count", ax=ax)
    # Label the subplot itself; plt.xlabel would only touch the most recent axes
    ax.set_title(f"\n\nVolume per {feature.title()}")
    ax.set_xlabel(feature.title())
    ax.set_ylabel("Count")

# Seven features on a 4x2 grid leave the last panel unused
axs.flat[-1].set_visible(False)
plt.tight_layout()

In [None]:
count = cars["title_status"].value_counts().reset_index()
ax = sns.barplot(data=count[count["title_status"] != "clean"], x="title_status", y="count")
ax.set(title='\n\nVolume per "title_status"', xlabel="title_status", ylabel="Count")

Some interesting finds here. As expected, most cars on the used market run on gas, have 4-, 6-, or 8-cylinder engines in similar proportions (other configurations are very rare), and have clean titles; among the relatively minuscule number of other title types, rebuilt and salvage come well ahead of lien or missing. Likewise, it is well known that most of the US used-car market consists of cars with automatic transmissions. However, I find it unusual that the number of "good" and "excellent" cars vastly exceeds that of "new" and "like new". Similarly surprising, at least to me, are the comparable counts of 4WD and FWD cars, as well as the volume of less popular body types like coupes and wagons, which I expected to be much lower.

In [None]:
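# Median price for every combination of the categorical features; each bar plot below
# then shows the mean of these group medians per category
total_price = cars.groupby(['condition', "transmission", "drive", "title_status", "fuel", "cylinders", "type"])['price'].median().reset_index()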

fig, axs = plt.subplots(4, 2, figsize=(18, 15))

sns.barplot(x='condition', y='price', data=total_price, ax=axs[0, 0])
axs[0, 0].set_title('\nAvg. price by condition')

sns.barplot(x='transmission', y='price', data=total_price, ax=axs[0, 1])
axs[0, 1].set_title('\nAvg. price by transmission')

sns.barplot(x='drive', y='price', data=total_price, ax=axs[1, 0])
axs[1, 0].set_title('\n\nAvg. price by type of drive')

sns.barplot(x='title_status', y='price', data=total_price, ax=axs[1, 1])
axs[1, 1].set_title('\n\nAvg. price by title status')

sns.barplot(x='fuel', y='price', data=total_price, ax=axs[2, 0])
axs[2, 0].set_title('\n\nAvg. price by fuel type')

sns.barplot(x='cylinders', y='price', data=total_price, ax=axs[2, 1])
axs[2, 1].set_title('\n\nAvg. price by number of cylinders')

sns.barplot(x='type', y='price', data=total_price, ax=axs[3, 0])
axs[3, 0].set_title('\n\nAvg. price by type of the car')

plt.tight_layout()
# plt.show()

A quick glance at average prices reveals some interesting conclusions. For example, the type of transmission has very little effect on the price of a car, while all other categories do have some, although it remains to be seen whether that is a direct influence or a consequence of another factor.

In [None]:
# value_counts() already sorts in descending order
cars["manufacturer"].value_counts().plot(kind="bar")
plt.title("Number of cars sold per manufacturer")
plt.xlabel("Manufacturer")
plt.ylabel("Count")

The big three names (>2.5k sold) in this dataset seem to be Ford, Chevrolet, and Toyota. After them, the large gaps between adjacent manufacturers end and the counts decrease gradually from just over 2,000 sold (Honda) to minuscule amounts for brands like Ferrari and Land Rover. Speaking of premium brands, we should check the prices.

In [None]:
px.box(cars, x="manufacturer", y="price", title="Box plot of price per manufacturer")

<img src = "images/manufacturer-price-box fixed scale.png"/>

*Above plot with scale adjusted to actually see the individual boxes*

As you can see, there are some severe outliers (surprisingly, the most expensive individual cars were made by Toyota, Chevrolet, and Mercedes, with Ford and Jeep getting honorable mentions), but for most manufacturers used cars go for about 5-30k USD. Notable exceptions are Ferrari, where the middle 50% falls between 53k and 141k USD, and Aston Martin, which can easily go for as much as 75k or even 180k USD.

In [None]:
sns.countplot(data=cars, x="condition", order=cars["condition"].value_counts().index)
plt.title("Number of cars sold per condition")

As expected, people usually buy cars that are in decent condition, don't usually sell new or like new cars, and only rarely sell (or buy) salvage. Still, it is a good thing to check.

In [None]:
px.box(cars, x="condition", y="price", title="Box plot of price per condition")

<img src = "images/condition-price-box fixed scale.png"/>

Outliers aside, it is somewhat surprising that cars in "excellent" and "like new" condition generally sell for less than those in "new" and "good" condition. Unsurprisingly, cars marked as "fair" and "salvage" are worth much less.

It is also obvious that age of car plays a large role in the purchase decision. Let's see how exactly.

In [None]:
px.histogram(data_frame=cars, x="year")

In [None]:
cars.loc[cars["year"] > 2000, "year"].plot(kind="kde")
plt.title("KDE plot of year of manufacturing")

In [None]:
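# Note: despite its name, this frame drops odometer outliers (and incomplete rows), not year outliers
cars_no_year_out = cars[np.abs(stats.zscore(cars["odometer"].to_numpy(), nan_policy="omit")) < 3].dropna()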
px.box(data_frame=cars_no_year_out, x="year", y="price", title="Price of cars per year")

It looks like this dataset was created in 2021 or 2022, and people mostly buy cars that are up to 15 years old. Age has a curious relationship with price: for the first 30 years or so, higher age means a lower price, but after that the car becomes a classic and the price grows with age.

The last two topics I want to touch on are sales across states and across different models.

In [None]:
px.histogram(data_frame=cars, x="state", title="Number of cars sold per state").update_xaxes(categoryorder='total descending')

In [None]:
px.box(data_frame=cars, x="state", y="price", title="Price of cars per state")

<img src = "images/prices per state.png"/>

It seems that despite significant differences in the number of cars sold per state (California stands out), prices are more or less in line across the country, and even the largest upper fences (e.g. Texas) differ from other states by only ~20k USD, or ~30%.

In [None]:
px.box(data_frame=cars, x="size", y="price", title="Price of cars per size")

<img src = "images/size-price-box fixed scale.png"/>


As we can see, size doesn't really affect the price of a car. There is some difference between full-size and the other categories, but it can just as easily be attributed to the full-size category also including trucks.

In [None]:
cars_top15_models = cars.groupby(['model']).size().to_frame().sort_values([0], ascending = False).head(15).reset_index()
px.histogram(data_frame=cars_top15_models, x='model', y=0, title="Top 15 used car models")

In hindsight, looking at individual models was a mistake given how many there are. Still, it is interesting that the most frequently resold cars are pickups and similar vehicles (F-150, Silverado (regular and 1500), what I assume to be the Ram 1500, Jeep Wrangler), as well as everyday Asian cars: Toyota Camry, Honda Civic and Accord, Nissan Altima.

In [None]:
cars_top15_models_prices = cars[["model", "price"]]
cars_top15_models_prices = cars_top15_models_prices.merge(cars_top15_models, on="model")
#.groupby(['model']).size().to_frame().sort_values([0], ascending = False).head(15).reset_index()

cars_top15_models_prices.head()

In [None]:
px.box(data_frame=cars_top15_models_prices, x='model', y="price", title="Average price on Top 15 (by overall sales) used car models")

<img src = "images/model-price top15 fixed scale.png"/>

Here we can see (after some scaling to hide outliers) box plots of prices for the aforementioned top 15 models by number of sales. Interestingly, while there is some variation and some tendencies (e.g. median prices of ~7k for sedans versus ~20k for trucks), the overall distinction is not large enough to be clear-cut. Also, doing this requires some computing power and is not very reliable, because it needs data on *all* 29k models, many of which are not clearly defined. Manual inspection turned up gems such as "#NAME?", "150" (no manufacturer), "-3500", "1" (again no manufacturer, unlike with Mazda 3/6), years between 1990 and 2010 with no manufacturer, "-", ",2012,2013, SOME 2014 MODELS", ".. ect.", and a variety of misapplied labels, such as listing all the info on the car ("2011 f-450 4x4 crew cab in-closed utility") or inconsistently splitting different packages (e.g. standard and limited) into models of their own. Overall, while model *could* provide *some* information, it is not worth the effort.

In [None]:
px.box(data_frame=cars, x="paint_color", y="price", title="Average price per color")

<img src = "images/color vs price.png"/>


As we can see, while color of the car might have some effect on its price (e.g. green cars sell cheaper than others), it is not enough of an effect to warrant using in this study.

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.

**Convert number of cylinders to numeric**

This step will make building models based on the number of cylinders easier and will bring electric cars into the fold.

In [None]:
# Converting the Cylinder columns to numerical and replacing the 'other' with 0 since this will be applied to electric cars

cars["cylinders"] = cars["cylinders"].str.replace("cylinders", "")
cars["cylinders"] = cars["cylinders"].str.strip()
cars["cylinders"] = cars["cylinders"].str.replace("other", "0")

**Convert electric**

Some of the cars on the list are electric. As such, they don't have any cylinders and use an "other" transmission. This step will ensure they all have this format.

In [None]:
# Converting the cylinders of electric cars to 0 

cars.loc[cars["fuel"] == "electric", "cylinders"] = cars.loc[cars["fuel"] == "electric", "cylinders"].fillna("0")

# Converting transmission

cars.loc[cars["fuel"] == "electric", "transmission"] = cars.loc[cars["fuel"] == "electric", "transmission"].str.replace("automatic", "other")

In [None]:
cars["cylinders"] = cars["cylinders"].astype(float)
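
As an alternative sketch of the same conversion, the digits can be extracted with a regular expression instead of chained string replacements. The values below are made up to mimic the format of the `cylinders` column, and the approach is only an illustration, not the method used in this notebook.

```python
import pandas as pd

# Made-up values in the same style as the raw "cylinders" column
raw = pd.Series(["4 cylinders", "8 cylinders", "other", None, "10 cylinders"])

# Pull out the digits; "other" and missing values become NaN
parsed = raw.str.extract(r"(\d+)", expand=False).astype(float)

# Mirror the notebook's choice of treating "other" as 0 cylinders
parsed = parsed.mask(raw == "other", 0.0)

print(parsed.tolist())
```

Genuinely missing values stay as NaN here, so they can still be handled separately (for example, filled with 0 only for electric cars).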

**Remove useless features**

Some of the features have next to no effect on this study. As such, it will be better to delete them:

- VIN stands for Vehicle Identification Number and is unique to each vehicle. Since it is purely an administrative identifier with no real effect on the purchase, it can be dropped immediately.
- Model might provide some information, but given how much variation there is (29.5k models would produce far too many encoded columns) and how inconsistently this column is filled in (see the earlier discussion), it is effectively useless. The same goes for region.
- Paint color has no noticeable effect on the car price.
- Manufacturer might have some effect, but with 42 different manufacturers it is computationally expensive to encode. Dropping it was considered, but it is kept in the final dataset.

In [None]:
cars = cars.drop(["VIN", "model", "paint_color", "region"], axis=1)
#cars = cars.drop("manufacturer", axis=1)

**Remove outliers**

Earlier, we found some boundaries for outliers in price and odometer readings. Now we will remove the outliers manually, using cutoffs slightly more generous than the fences found earlier (65k USD and 300k miles).

In [None]:
cars = cars.query("0 < price < 65_000").copy()
cars = cars.query("0 < odometer < 300_000").copy()

**Remove empty values**

Initial data exploration revealed that many positions are filled with NaN. To provide better analysis, these need to be removed.

In [None]:
"""
#cars_drop = cars.dropna()
#cars_drop.info()

# Dropping Rows that have more than 10 columns as NaN

cars_drop = cars.drop(cars[cars.isna().sum(axis=1) > 10].index)

# Dropping the rows with Year NaN

cars_drop.drop(cars_drop[cars_drop.year.isna()].index, inplace=True)

# Dropping the rows with Manufacturer NaN

cars_drop.drop(cars_drop[cars_drop.manufacturer.isna()].index, inplace=True)
"""

cars_drop = cars.dropna()

In [None]:
cars_drop.info()

**Remove duplicates**

In [None]:
#cars_drop = cars_drop[cars_drop.drop("id", axis=1).duplicated() == False]
cars_drop = cars_drop.drop_duplicates(subset=cars_drop.columns.drop("id"))

In [None]:
cars_drop.info()

**OneHot Encoding**

Finally, we need to convert the categorical features into numerical ones with one-hot encoding.

In [None]:
dummies=pd.get_dummies(cars_drop[['condition', 'fuel', 'title_status', 'transmission',
                             'drive', 'size', 'type', "state", "manufacturer"]])
cars_dummies = pd.concat([cars_drop, dummies], axis=1)
cars_dummies=cars_dummies.drop(columns = ['condition', 'fuel', 'title_status', 'transmission',
                             'drive', 'size', 'type', "state", "manufacturer"])
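
For reference, `pd.get_dummies` can also encode and drop the original columns in a single call through its `columns` argument. A minimal sketch on a toy frame (the rows are invented for illustration):

```python
import pandas as pd

# Toy frame, invented for illustration
toy = pd.DataFrame({
    "price": [5000, 12000, 8000],
    "fuel": ["gas", "diesel", "gas"],
    "drive": ["fwd", "4wd", "rwd"],
})

# `columns=` encodes the listed columns and removes the originals in one step
encoded = pd.get_dummies(toy, columns=["fuel", "drive"])

print(encoded.columns.tolist())
```

Each new column is named `<feature>_<value>`, which is what later makes it possible to group coefficients back by feature.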

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [None]:
# Split the dataset
y = cars_dummies["price"]
# id is a listing identifier rather than a property of the car, so it is excluded as well
X = cars_dummies.drop(columns=["price", "id"])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

In [None]:
train_mse = []
test_mse = []
explained_variance = []
model =[]

We will use several models: linear, polynomial, LASSO, and ridge regression.
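
Since the task asks for cross-validated findings, here is a minimal sketch of how any of these pipelines could be scored with k-fold cross-validation. It runs on synthetic data from `make_regression` as a stand-in for `X_train` and `y_train`; the names `X_demo`, `y_demo`, and `demo_pipe` are assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("linreg", LinearRegression()),
])

# scikit-learn maximises scores, so the MSE comes back negated
neg_mse = cross_val_score(demo_pipe, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
cv_mse = -neg_mse

print(cv_mse.mean(), cv_mse.std())
```

Because the scaler sits inside the pipeline, it is refit on each training fold, which avoids leaking information from the validation fold.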

In [None]:
# Linear
# Took just 1.6s to compute

linear_model = Pipeline([
    ('transform', PolynomialFeatures(degree=1, include_bias=False)),
    ('scale', StandardScaler()),
    ('linreg', LinearRegression())
])
linear_model.fit(X_train, y_train)

In [None]:
# Record MSE
linear_train_mse= round(mean_squared_error(linear_model.predict(X_train), y_train), 4)
linear_test_mse=round(mean_squared_error(linear_model.predict(X_test), y_test),4)
linear_EV = explained_variance_score(y_train, linear_model.predict(X_train))

train_mse.append(linear_train_mse)
test_mse.append(linear_test_mse)
explained_variance.append(linear_EV)
model.append("Model 1 - linear")

In [None]:
# Polynomial degree 2
# Took me 3m 51.5s to compute

polynomial_model = Pipeline([
    ('transform', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),
    ('linreg', LinearRegression())
])
polynomial_model.fit(X_train, y_train)

In [None]:
# Record MSE
polynomial_train_mse= round(mean_squared_error(polynomial_model.predict(X_train), y_train), 4)
polynomial_test_mse=round(mean_squared_error(polynomial_model.predict(X_test), y_test),4)
polynomial_EV= explained_variance_score(y_train, polynomial_model.predict(X_train))

train_mse.append(polynomial_train_mse)
test_mse.append(polynomial_test_mse)
explained_variance.append(polynomial_EV)
model.append("Model 2 - polynomialDeg2")

In [None]:
# LASSO
# Warning! This took me 16m 4s to compute

lasso_model = Pipeline([
    ('transform', PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()), 
    ('lasso', Lasso())
])
lasso_model.fit(X_train, y_train)

lasso_coef = lasso_model.named_steps['lasso'].coef_
lasso_coef

In [None]:
# Record MSE
lasso_train_mse= round(mean_squared_error(lasso_model.predict(X_train), y_train), 4)
lasso_test_mse=round(mean_squared_error(lasso_model.predict(X_test), y_test),4)
lasso_EV= explained_variance_score(y_train, lasso_model.predict(X_train))

train_mse.append(lasso_train_mse)
test_mse.append(lasso_test_mse)
explained_variance.append(lasso_EV)
model.append("Model 3 - LASSO")

In [None]:
# Ridge
# Took me 5m 15.7s to compute

ridge_model = Pipeline([
    ('transform', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),
    ('ridge', Ridge())
])

alpha_value = {'ridge__alpha': [0.1,1,10]}

model_finder = GridSearchCV(estimator = ridge_model, 
                           param_grid=alpha_value,
                           scoring = "neg_mean_squared_error"
                           )

model_finder.fit(X_train, y_train)

best_ridge_model=model_finder.best_estimator_
best_alpha = model_finder.best_params_

In [None]:
# Record MSE
ridge_train_mse = round(mean_squared_error(best_ridge_model.predict(X_train), y_train),4)
ridge_test_mse = round(mean_squared_error(best_ridge_model.predict(X_test), y_test),4)
ridge_EV= explained_variance_score(y_train, best_ridge_model.predict(X_train))

train_mse.append(ridge_train_mse)
test_mse.append(ridge_test_mse)
explained_variance.append(ridge_EV)
model.append("Model 4 - Ridge")

**Ridge2**

I had the idea of analyzing the factors individually and decided to do it with a second ridge model. There seems to be no other functional difference between the two.

In [None]:
# Making another X/y and Test/Train split due to this idea requiring non-OHE'd dataset

y = cars_drop["price"]
X = cars_drop.drop(columns="price")

X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, random_state=42, test_size=0.3)

In [None]:
# Grabbing the object columns for OneHotEncoding
# Condition will use OrdinalEncoding, hence removing it from the list

obj_cols = cars_drop.select_dtypes("object").columns.to_list()

obj_cols.remove('condition')
obj_cols

In [None]:
# Using Ridge Regression 

alphas = np.linspace(0.1, 100, 50)

# Note: Ridge only uses random_state with the stochastic "sag"/"saga" solvers
random_states = [10, 100]

params = {"ridge__alpha": alphas, "ridge__random_state": random_states}

col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(sparse_output=False), obj_cols),
    ("ord", OrdinalEncoder(), ['condition']),
    ("poly", PolynomialFeatures(degree=3, include_bias=False), ["cylinders", "odometer", "year"])
])


pipe = Pipeline([
    ("transformer", col_transformer),
    ("scaler", StandardScaler()),
    ("ridge", Ridge())
    ])

grid_ridge = GridSearchCV(pipe, param_grid=params, cv=3, scoring="neg_mean_squared_error", n_jobs=-1)


In [None]:
# Fitting

grid_ridge.fit(X_train2, y_train2)
best_model_ridge = grid_ridge.best_estimator_
best_model_ridge

In [None]:
# Record MSE
ridge_train_mse2 = round(mean_squared_error(best_model_ridge.predict(X_train2), y_train2),4)
ridge_test_mse2 = round(mean_squared_error(best_model_ridge.predict(X_test2), y_test2),4)
ridge_EV2 = explained_variance_score(y_train2, best_model_ridge.predict(X_train2))

train_mse.append(ridge_train_mse2)
test_mse.append(ridge_test_mse2)
explained_variance.append(ridge_EV2)
model.append("Model 5 - Ridge2")

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [None]:
MetricsTable = pd.DataFrame({
    'model': model,
    'train_mse': train_mse,
    'test_mse':test_mse,
    'explained_variance': explained_variance
})

MetricsTable

From the table above, it is obvious that the best performing model is the Ridge model, and therefore is the one I will deploy.
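
MSE is expressed in squared dollars, which is hard to interpret. Taking the square root (RMSE) gives an error in dollars, and comparing train and test error hints at overfitting. The sketch below uses made-up numbers purely to show the mechanics; they are not results from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up metrics, only to show the mechanics
demo_metrics = pd.DataFrame({
    "model": ["A", "B", "C"],
    "train_mse": [4.0e7, 2.5e7, 3.0e7],
    "test_mse": [4.1e7, 3.6e7, 3.1e7],
})

# RMSE is in the same unit as the target (USD)
demo_metrics["test_rmse"] = np.sqrt(demo_metrics["test_mse"])

# A large gap between test and train error suggests overfitting
demo_metrics["gap"] = demo_metrics["test_mse"] - demo_metrics["train_mse"]

best = demo_metrics.loc[demo_metrics["test_mse"].idxmin(), "model"]
print(demo_metrics.sort_values("test_mse"))
print(best)
```

In this toy table, model B has the lowest training error but the largest gap, so the model with the lowest test error is preferred.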

However, we can take this a step further and identify the most and least important factors in price of a car, which would be of much better immediate use to the business.

In [None]:
coefs = best_model_ridge.named_steps["ridge"].coef_
cols = best_model_ridge.named_steps.transformer.get_feature_names_out()

In [None]:
#coef_features_df = pd.DataFrame({"features":cols, "coefs":np.abs(coefs)}).sort_values(by="coefs", ascending=False)
coef_features_df = pd.DataFrame({"features":cols, "coefs":coefs})#.sort_values(by="coefs", ascending=False)

In [None]:
px.bar(coef_features_df, x="features", y="coefs", title="Coefficients of all Features")

We have got coefficients for a bunch of internal features. Now we need to combine them into actual features.

In [None]:
def break_OneHot_coefs(df, feature):
    # Match on the full encoder prefix so one feature name cannot match inside another
    prefix = f"ohe__{feature}_"
    coef_df = df[df.features.str.startswith(prefix)].copy()
    coef_df.features = coef_df.features.str.replace(prefix, "", regex=False).str.title()
    return {feature.title(): coef_df}

def break_poly_coefs(df):
    # Keep the polynomial block, minus any term containing a power (e.g. "year^2")
    polycol = df[df.features.str.startswith("poly__")].copy()
    polycol = polycol[~polycol.features.str.contains("^", regex=False)]
    polycol.features = polycol.features.str.replace("poly__", "", regex=False).str.title()
    return polycol

In [None]:
#man_coef_df = break_OneHot_coefs(coef_features_df, "manufacturer")
type_coef_df = break_OneHot_coefs(coef_features_df, "type")
fuel_coef_df = break_OneHot_coefs(coef_features_df, "fuel")
title_coef_df = break_OneHot_coefs(coef_features_df, "title_status")
transmission_coef_df = break_OneHot_coefs(coef_features_df, "transmission")
drive_coef_df = break_OneHot_coefs(coef_features_df, "drive")
state_coef_df = break_OneHot_coefs(coef_features_df, "state")

#man_coef_df, 
all_coefs = [type_coef_df, fuel_coef_df, title_coef_df, transmission_coef_df, drive_coef_df, state_coef_df]

In [None]:
poly_cols = break_poly_coefs(coef_features_df)
poly_cols

In [None]:
px.bar(poly_cols, x="features", y="coefs")

In [None]:
fig, axs = plt.subplots(4, 2, figsize=(18, 15))

for ax, coef_dict in zip(axs.flat, all_coefs):
    # Each entry is a one-item dict: {label: dataframe}
    label, dataframe = next(iter(coef_dict.items()))
    sns.barplot(data=dataframe, x="features", y="coefs", ax=ax)

    # Draw the zero line and labels on this subplot, not on the most recent axes
    ax.axhline(0, color='red', linestyle='--')
    ax.set_title(f'\n\n{label} Coefficients')
    ax.set_xlabel(label)
    ax.set_ylabel("Coefficients")

plt.tight_layout()
    

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

In the end, we can find several regularities between features and the price of a car:

- Among manufacturers, Tesla, Toyota, Porsche, and Lexus have very good markup coefficients, as do some others like Mercedes, Audi, and Ram. More popular budget manufacturers, like Nissan, Hyundai, Kia, and Mitsubishi, are the opposite, with a pretty severe negative markup. This is actually tied to the specific models, but that required too much outside information and wasn't analyzed;
- Diesel cars usually sell for more than other fuel types, and gas cars are cheaper;
- Among body types, trucks, convertibles, and coupes carry a markup. Sedans, SUVs, and hatchbacks, on the other hand, are usually marked down;
- Unsurprisingly, cars with clean and lien titles sell for more than those with other title statuses. Interestingly, offering a car as parts only is a good way to recoup some of the loss if that is the only chance of selling it;
- Manual transmission is more expensive, and "other" (IIRC, it was electric) is much cheaper, but overall the price changes little with transmission type;
- 4WD cars are valued much more than others and FWD cars less so, with RWD almost exactly in the middle;
- Finally, you can expect higher prices if you are selling in CA, AK, AL, TN, or UT, and lower prices if you are located in NY, OH, or PA.

The following example shows how to run your own car through my model to see the price it is expected to sell for.

In [None]:
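# save the model to disk
# Note: this pickles the Model 5 pipeline, which encodes the raw (non-OHE) columns itself

filename_ridge = 'Ridge.sav'
with open(filename_ridge, 'wb') as f:
    pickle.dump(best_model_ridge, f)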

In [None]:
# Function to check the car

def predict_price(used_car):
    # Load the pickled pipeline; the context manager closes the file afterwards
    with open("Ridge.sav", 'rb') as f:
        model = pickle.load(f)
    feature_set = [
        'manufacturer',
        'condition',
        'cylinders',
        'fuel',
        'title_status',
        'transmission',
        'drive',
        'size',
        'type',
        'state',
        'year',
        'odometer'
    ]
    used_car_df = pd.DataFrame([used_car], columns=feature_set)
    display(used_car_df)
    pred_with_ridge = model.predict(used_car_df)
    msg = f"The Estimated Price Of The Given Car Is: ${round(pred_with_ridge[0], 2)}"
    return msg

In [None]:
# Key:
# Manufacturer, string
# Condition, string
# # of cylinders, int
# Fuel type, string
# Title status, string
# Transmission type, string
# Drive type, string
# Size, string
# Body type, string
# State, string
# Year of manufacturing, int
# Odometer, int


my_car = [
    "ford",  # lowercase, matching the other category values
    "good", 
    4,  
    "gas", 
    "clean", 
    "automatic", 
    "fwd", 
    "compact",
    "hatchback",
    'ca',
    2012, 
    200000
]


predict_price(my_car)