# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

# Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

New car prices are largely set by the manufacturer, with additional costs such as sales tax and destination charges, so buyers care a great deal about getting value for their money. After COVID, the chip shortage sharply reduced new car production, which increased demand for used cars. Supply chain disruption caused by the Russia-Ukraine war pushed used car demand even higher.

Rising inflation in the United States has left people with less money to spend, which has raised used car demand further. Consumers want a good-quality car for the money they spend, and a dealership wants to stock its inventory with cars that increase both revenue and customer satisfaction.

In this project, I build several regression models that predict used car prices as accurately as possible (price is the target; the remaining listing attributes are the features) and use them to show the dealership which features customers value most and which drive prices up.

# Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [None]:
import matplotlib
import numpy as np
import pandas as pd

%matplotlib inline
import pickle
import warnings

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objs as go
import seaborn as sns
import ydata_profiling
from sklearn import preprocessing
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.feature_selection import RFE
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import (BayesianRidge, Lasso, LassoCV,
                                  LinearRegression, Ridge, RidgeCV)
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error, r2_score)
from sklearn.model_selection import (GridSearchCV, KFold, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor
from tqdm import tqdm
from yellowbrick.regressor import AlphaSelection

warnings.filterwarnings("ignore")

In [None]:
cars = pd.read_csv("data/vehicles.csv")

In [None]:
cars.head()

In [None]:
print(f"There are {cars.shape[0]} rows and {cars.shape[1]} columns in Dataset!")

In [None]:
# Pandas Dataset profiling which gives insights into data.

ydata_profiling.ProfileReport(cars)

#### Key findings from the pandas profiling report

* 4 numeric columns and 14 categorical columns
* 15.8% of cells are NaN
* Price (the target) and odometer are highly skewed
* size has 71% missing values, followed by condition, cylinders, VIN, drive, etc. These columns need to be fixed.
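
The figures in the list above come from the profiling report, but the same kind of summary can be computed directly with pandas. The following is a minimal sketch on a made-up `toy` frame (the values are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values are made up for illustration).
toy = pd.DataFrame({
    "price": [0, 4500, 8900, 15000, 250000],
    "odometer": [120000.0, np.nan, 60000.0, 30000.0, 5000.0],
    "size": [np.nan, np.nan, "compact", np.nan, "full-size"],
})

# Share of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Share of missing cells over the whole table.
missing_cells = toy.isnull().to_numpy().mean()

# Skewness of the numeric columns; a large positive value means a long right tail.
skewness = toy[["price", "odometer"]].skew()

print(missing_share)
print(f"Missing cells: {missing_cells:.1%}")
print(skewness)
```

Running the same three expressions on `cars` should give numbers comparable to the report, without rendering the full profile.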

## Exploring all columns to understand data better

#### Price column

Price is the target column that the models will predict.
* It has 15655 distinct values
* Zero is the most frequent value, appearing in 32895 rows.
* Mean is 75199, min is 0 and max is 3736928700. The values are highly skewed and need to be treated.

In [None]:
# Leaving out the 880 rows with the highest prices (outliers)
px.histogram(cars.nsmallest(n=426000, columns=['price']), x="price", nbins=20, title="Price histogram")
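
To see why a mean of 75199 says little about a typical listing, here is a small sketch with made-up prices (an assumption for illustration) showing how one extreme value and placeholder zeros distort the mean while the median stays stable:

```python
import pandas as pd

# Made-up prices: one placeholder zero and one absurdly large listing.
prices = pd.Series([0, 3500, 7200, 9900, 12500, 3_736_928_700])

zero_rows = int((prices == 0).sum())   # listings with no real price
mean_price = prices.mean()             # pulled up by the single extreme value
median_price = prices.median()         # barely affected by it

print(zero_rows, mean_price, median_price)
```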

#### Year column

* Year ranges from 1900 to 2022, with 1905 missing values.
* Most cars fall in the 1990 to 2022 range.

In [None]:
px.histogram(cars, x="year", nbins=20, title="Number of cars categorized on Year")

#### Manufacturer Column
* There are 42 distinct Manufacturers 
* 17646 rows have values missing
* Ford, Toyota and Chevy are top brands listed in this dataset.
* Model years 1995 to 2022 seem to cover most of the dataset.

In [None]:
px.histogram(cars, x="manufacturer", nbins=20, title="Number of cars categorized on Manufacturer")

#### Model Column
* Model has high cardinality with 29649 unique values.
* ~5000 rows have values missing.

In [None]:
top_50_models = cars['model'].value_counts()[:50].index.tolist()
px.histogram(cars.query('model in @top_50_models'), x="model", nbins=20, title="Number of cars categorized on Model(Top 50 models)")

#### condition column
* Condition has 6 unique values
* 174104 rows have missing values
* Good and excellent are the most common conditions

In [None]:
px.histogram(cars, x="condition", nbins=20,title="Number of cars categorized on Car condition")

#### Cylinder column
* Has 8 unique values
* 177678 rows have missing values
* 8, 6 and 4 cylinders account for most values.

In [None]:
px.histogram(cars, x="cylinders", nbins=20, title="Number of cars categorized on Cylinders")

#### Fuel column
* Fuel has 5 unique values
* 3013 rows have values missing
* Gas, Other, Diesel constitute majority

In [None]:
px.histogram(cars, x="fuel", nbins=20, title="Number of cars categorized on Fuel type")

#### Odometer column
* The odometer column is highly skewed
* Only 4400 rows have missing values
* The majority of odometer readings are below 350k miles

In [None]:
px.histogram(cars, x="odometer", nbins=500,title="Odometer histogram")

#### Title Status column
* has 6 distinct values
* Only 8242 rows have missing values
* Most vehicles have a clean title status


In [None]:
px.histogram(cars, x="title_status", title="Number of cars categorized on Title status")

#### Transmission column
* Transmission has 3 unique values
* 2556 rows have missing values

In [None]:
px.histogram(cars, x="transmission", title="Number of cars categorized on Transmission")

#### Drive column
* has 3 unique values
* 130567 rows have missing values

In [None]:
px.histogram(cars, x="drive", title="Number of cars categorized on Drive type")

#### Size column
* has 4 unique values
* 306361 rows have values missing

In [None]:
px.histogram(cars, x="size", title="Number of cars categorized on Size")

#### Type column
* 13 unique values
* 92858 rows have values missing

In [None]:
px.histogram(cars, x="type", title="Number of cars categorized on Type")

#### Paint color column
* 12 distinct values
* 130203 rows have values missing

In [None]:
px.histogram(cars, x="paint_color", title="Number of cars categorized on Color")

## Observations

* Many categorical columns have NaN values. We need to either fill them in or remove the rows altogether.
* Some columns have outliers, which need to be handled.
* The target column 'Price' has outliers and is skewed as well; both need fixing.

# Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

In [None]:
cars = cars.reindex(columns=[
    'id', 'region', 'year',
    'manufacturer', 'model', 'condition',
    'cylinders', 'fuel', 'odometer',
    'title_status', 'transmission',
    'VIN', 'drive', 'size',
    'type', 'paint_color', 'state', 'price'])

In [None]:
cars.head()

In [None]:
print("Below are number of Nan values in each column!")
cars.isnull().sum()

In [None]:
print("Dropping ID and VIN since it does not affect car prices!")
print("Dropping state and region as this does not affect "
      "car prices much when there is demand!")
cars = cars.drop(columns=['id', 'VIN', 'state', 'region'])

In [None]:
fig = px.imshow(cars.isnull())
fig.update_layout(
    title = "Heatmap showing Nan values in each column")
fig.update_layout(barmode='group', bargap=0.30,bargroupgap=0.0)
fig.show()

In [None]:
num_features = [
    'year',
    'odometer'
]
cat_features = [
    'manufacturer',
    'model',
    'condition',
    'cylinders',
    'fuel',
    'title_status',
    'transmission',
    'drive',
    'size',
    'type',
    'paint_color'
]
print(f"These are numerical features in dataset: {num_features}")
print(f"These are categorical features in dataset: {cat_features}")

In [None]:
# Making a copy of cars dataset.
cars_imputer = cars.copy()

encoder = preprocessing.LabelEncoder()

def encode(data):
    """Label-encode the non-null entries of a Series in place; NaNs are left untouched."""
    nonulls = np.array(data.dropna())
    # LabelEncoder expects a 1-D array of labels
    impute_ordinal = encoder.fit_transform(nonulls)
    data.loc[data.notnull()] = impute_ordinal
    return data

# Encoding categorical values into numerical codes using LabelEncoder()
for col in tqdm(cat_features):
    # Assign the result back so the encoding does not depend on in-place modification of a column view
    cars_imputer[col] = encode(cars_imputer[col])
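
The idea behind the cell above is to turn categories into integer codes while keeping the missing entries missing, so that an imputer can fill them later. A minimal alternative sketch of the same idea, using pandas categorical codes instead of `LabelEncoder` (the helper name `encode_keep_nan` is hypothetical and not part of the notebook):

```python
import numpy as np
import pandas as pd

def encode_keep_nan(series: pd.Series) -> pd.Series:
    """Map each category to an integer code, leaving missing values as NaN."""
    codes = series.astype("category").cat.codes     # missing values get code -1
    return codes.astype("float64").where(codes >= 0)  # turn -1 back into NaN

fuel = pd.Series(["gas", np.nan, "diesel", "gas"])
encoded = encode_keep_nan(fuel)
print(encoded.tolist())
```

Unlike the in-place version, this returns a new Series, which avoids surprises when the input is a view of a DataFrame column.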

In [None]:
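# Estimate the score on the entire dataset after filling missing values, comparing 4 different estimators.
# Note: the imputer below is fitted on one column at a time. With a single feature, IterativeImputer
# has no other columns to learn from and returns its initial (mean) imputation, so these scores
# mainly compare the estimators as regressors on mean-imputed data.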
estimators = [
    BayesianRidge(),
    DecisionTreeRegressor(
        max_features='sqrt',
        random_state=0
    ),
    ExtraTreesRegressor(
        n_estimators=10,
        random_state=0
    ),
    KNeighborsRegressor(
        n_neighbors=15
    )
]
score = pd.DataFrame()
for estimator in estimators:
    print(f"Estimating using {estimator.__class__.__name__} estimator!")
    imputer = IterativeImputer(estimator)
    cars_impute = cars_imputer.copy()
    for col in cars_imputer.columns:
        impute_data=imputer.fit_transform(
            cars_impute[col].values.reshape(-1,1)
        )
        impute_data=impute_data.astype('int64')
        impute_data = pd.DataFrame(
            np.ravel(impute_data)
        )
        cars_impute[col]=impute_data
    X = cars_impute.iloc[:,:-1]
    y = np.ravel(cars_impute.iloc[:,-1:])
    score[estimator.__class__.__name__] = cross_val_score(
        estimator,
        X,
        y,
        scoring='neg_mean_squared_error',
        cv=6
    )
del cars_imputer

In [None]:
# Negative MSE scores of each estimator for cv=6
score

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
means = -score.mean()
errors = score.std()
means.plot.barh(xerr=errors, ax=ax)
ax.set_title('MSE with Different Imputation Methods')
ax.set_xlabel('MSE')
ax.set_yticks(np.arange(means.shape[0]))
ax.set_yticklabels(means.index.tolist())
plt.tight_layout(pad=1)
plt.show()

#### The figure above shows that BayesianRidge has the lowest MSE, so it is used as the imputer's estimator below
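
One detail worth keeping in mind when reading this comparison: `IterativeImputer` models each column from the other columns, so when it is given a single column it can only fall back to the column mean. The sketch below illustrates the difference on synthetic data (the arrays are assumptions for illustration, not the car data):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Synthetic data: the second column is an exact linear function of the first.
x1 = np.arange(20, dtype=float)
x2 = 2 * x1 + 1
true_value = x2[5]
x2_missing = x2.copy()
x2_missing[5] = np.nan

# One column at a time: nothing to regress on, so the gap is filled with the column mean.
single = IterativeImputer(BayesianRidge(), random_state=0).fit_transform(
    x2_missing.reshape(-1, 1))
single_fill = single[5, 0]

# Both columns together: the estimator predicts the gap from the other column.
both = IterativeImputer(BayesianRidge(), random_state=0).fit_transform(
    np.column_stack([x1, x2_missing]))
joint_fill = both[5, 1]

print(true_value, single_fill, joint_fill)
```

Passing several related columns to the imputer at once (for example year and odometer together, as the numerical cell below does) is what lets the chosen estimator influence the filled values.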

In [None]:
# Nan values in Numerical features
cars.isnull().sum()[num_features]

In [None]:
cars_num = cars[num_features]

# Using estimators[0] = BayesianRidge to fill Nan values in Numerical features.
imputer_num = IterativeImputer(estimators[0])
impute_data = imputer_num.fit_transform(cars_num)
cars[num_features] = impute_data

In [None]:
# Missing values after filling
cars.isnull().sum()[num_features]

In [None]:
# Nan values in Categorical features
cars.isnull().sum()[cat_features]

In [None]:
# Using BayesianRidge imputer for categorical columns as well.
cars_cat = cars[cat_features].copy()
encoder = preprocessing.LabelEncoder()

for column in cat_features:
    # encode() refits the shared `encoder` on this column, so inverse_transform below maps back to its labels
    cars_cat[column] = encode(cars_cat[column])
    imputer = IterativeImputer(BayesianRidge())
    impute_data = imputer.fit_transform(cars_cat[column].values.reshape(-1, 1))
    # Imputed codes are truncated to integers so that they map onto an existing category
    impute_data = impute_data.astype('int64').ravel()
    cars_cat[column] = encoder.inverse_transform(impute_data)
cars[cat_features] = cars_cat

In [None]:
cars.isnull().sum()[cat_features]

In [None]:
fig = px.imshow(cars.isnull())
fig.update_layout(
    title = "Heatmap showing all Nan values are eliminated!")
fig.update_layout(barmode='group', bargap=0.30,bargroupgap=0.0)
fig.show()

In [None]:
# Again check for Nan values on whole dataset! Just to make sure :)
cars.isnull().sum()

In [None]:
# Saving the cleaned data to disk in case I need a fresh copy further down.
cars.to_csv('cars_cleaned.csv', index=False)

In [None]:
cars.head()

In [None]:
def outliers_range(arr: pd.DataFrame, col: str) -> tuple:
    """
    Function to find the outlier bounds for a given DataFrame and column,
    i.e. Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
    """
    # quantile() interpolates between neighbouring values with a fractional weight in [0, 1)
    q1, q3 = arr[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    x_values_1 = q1 - 1.5 * iqr
    x_values_2 = q3 + 1.5 * iqr
    return (x_values_1, x_values_2)
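
As a worked example of the IQR rule, here is a small sketch on five made-up values (the helper name `iqr_fences` is hypothetical). With quartiles Q1 = 2 and Q3 = 4, the IQR is 2 and the bounds are 2 - 3 = -1 and 4 + 3 = 7, so the value 100 is flagged:

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)

odometer = pd.Series([1, 2, 3, 4, 100])
low, high = iqr_fences(odometer)
kept = odometer[(odometer >= low) & (odometer <= high)]
print(low, high, kept.tolist())   # -1.0 7.0 [1, 2, 3, 4]
```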

In [None]:
def min_max_price(df: pd.DataFrame) -> tuple:
    """
    Function to find min and max price to remove outliers
    """
    range_ = []
    q1, q3 = (df['logprice'].quantile([0.25,0.75]))
    range_.append(q1 - 1.5 * (q3 - q1))
    range_.append(q3 + 1.5 * (q3 - q1))
    return (range_)

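# Adding logprice since the price column is skewed; the log brings it closer to a normal distribution.
# Rows with price 0 become -inf here and are removed together with the outliers further down.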
cars['logprice'] = np.log(cars['price'])
x = cars['logprice']
price_range = list(range(0, int(max(cars['logprice'])) + 1))
red_square = dict(markerfacecolor='g', marker='s')
plt.boxplot(x, vert=False)
plt.xticks(price_range)
plt.text(min_max_price(cars)[0]-0.3,1.05,str(round(min_max_price(cars)[0],2)))
plt.text(min_max_price(cars)[1]-0.5,1.05,str(round(min_max_price(cars)[1],2)))
plt.title("Box Plot of Price")
plt.show()

#### The box plot above shows that log prices below 6.43 and above 12.44 are outliers.
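
The effect of the log transform, and the way zero prices behave under it, can be checked on synthetic data. This is a sketch under the assumption that prices are roughly log-normal; the sample below is generated, not taken from the dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal), plus one zero-priced listing.
prices = pd.Series(np.append(rng.lognormal(mean=9.0, sigma=1.0, size=5000), 0.0))

with np.errstate(divide="ignore"):
    log_prices = np.log(prices)      # log(0) evaluates to -inf

n_inf = int(np.isinf(log_prices).sum())

# Compare skewness before and after the transform, using positive prices only.
positive = prices[prices > 0]
skew_raw = positive.skew()
skew_log = np.log(positive).skew()
print(n_inf, round(skew_raw, 2), round(skew_log, 2))
```

The `-inf` entries fall below any finite lower bound, which is why zero-priced rows disappear when the IQR filter is applied to `logprice`.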

In [None]:
fig, ax1 = plt.subplots()
ax1.set_title('Figure 2: Box Plot of Odometer')
ax1.boxplot(cars['odometer'], vert=False, flierprops=red_square)
plt.show()

#### The box plot above shows that odometer readings below -107725.0 and above 282235.0 are outliers. Since readings cannot be negative, only the upper bound matters in practice.

In [None]:
fig,(ax1,ax2)=plt.subplots(ncols=2,figsize=(12,5))

# plotting boxplot
o1, o2 = outliers_range(cars,'year')
ax1.boxplot(sorted(cars['year']), vert=False, flierprops=red_square)
ax1.set_xlabel("Years")
ax1.set_title("Figure 3: Box Plot of Year")
ax1.text(o1-8,1.05,str(round(o1,2)))

# plotting histogram
hist,bins=np.histogram(cars['year'])
n, bins, patches = ax2.hist(x=cars['year'], bins=bins)
ax2.set_xlabel("Years")
ax2.set_title("Figure 4: Histogram of Year")
for i in range(len(n)):
    if(n[i]>2000):
        ax2.text(bins[i],n[i]+3000,str(n[i]))

plt.tight_layout()
plt.show()

#### The box plot above shows that years below 1995 and above 2022 are outliers.

In [None]:
# Removing outliers using the outliers_range() function on the logprice, odometer and year columns
cars_new = cars.copy()
out = ['logprice', 'odometer', 'year']
for col in out:
    o1,o2 = outliers_range(cars_new, col)
    cars_new = cars_new[(cars_new[col]>=o1) & (cars_new[col]<=o2)]
    print('IQR of',col,'=',o1,o2)
cars_new = cars_new[cars_new['price']!=0]
cars_new.drop('logprice',axis=1,inplace=True)

In [None]:
print(f"Shape before process={cars.shape}")
print(f"Shape After process={cars_new.shape}")
print(
    f"Total {cars.shape[0]-cars_new.shape[0]} rows "
    f"and {cars.shape[1]-cars_new.shape[1]} columns were removed")
cars_new.to_csv("vehicles_finalized.csv",index=False)

In [None]:
cars_new.head()

### In a nutshell, the following data cleanups were done

* Dropped the id, VIN, state and region columns. id and VIN add no value for price prediction, and state and region were assumed to have little effect on price.
* Compared BayesianRidge, DecisionTreeRegressor, ExtraTreesRegressor and KNeighborsRegressor as estimators for IterativeImputer; BayesianRidge had the lowest MSE, so it was used to fill missing values in both the numerical and categorical columns.
* Found outliers in the Price, Odometer and Year columns and removed them using the IQR rule.

### In total, 62427 rows were eliminated while removing outliers from the Year, Price and Odometer columns.

### Data visualization after data preparation

In [None]:
cars_cleaned = cars_new.copy()
cars_cleaned['year'] = cars_cleaned['year'].astype('int64')

In [None]:
cars_cleaned.shape

In [None]:
cars_cleaned.columns

In [None]:
cars_sample = cars_cleaned.sample(1000, random_state=0)  # fixed seed so the pair plot is reproducible
cars_sample.shape

In [None]:
# Plotting a pairplot to view distribution of numerical features.
colors = iter([
    'xkcd:red purple', 'xkcd:pale teal', 'xkcd:warm purple',
    'xkcd:light forest green', 'xkcd:blue with a hint of purple',
    'xkcd:light peach', 'xkcd:dusky purple', 'xkcd:pale mauve',
    'xkcd:bright sky blue'])

def my_scatter(x,y, **kwargs):
    kwargs['color'] = next(colors)
    plt.scatter(x,y, **kwargs)

def my_hist(x, **kwargs):
    kwargs['color'] = next(colors)
    plt.hist(x, **kwargs)

g = sns.PairGrid(cars_sample)
g.map_diag(my_hist)
g.map_offdiag(my_scatter)

#### Pair plot is not very conclusive. 

In [None]:
fig = px.histogram(cars_cleaned, x="price", nbins=20, title="Price histogram")
fig.show()

In [None]:
def barplot_generator(df, x, y, title='', hue=''):
    """
    Function which takes df, x, y, title and hue as input
    and generates a bar plot using seaborn.barplot.
    """
    fig, axis = plt.subplots(figsize=(10, 6))
    # An empty hue string means "no hue"
    sns.barplot(x=x, y=y, data=df, ax=axis, hue=hue or None)
    axis.set_title(title)
    plt.xticks(rotation=45)
    plt.show()

In [None]:
barplot_generator(cars_cleaned, 'fuel', 'price', 'Car price by Fuel Type')

#### Hybrid cars have lower price. Diesel cars cost more than electric ones.
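
By default, `seaborn.barplot` draws the mean of `y` for each category of `x`, so the bar heights behind an observation like this one can also be read as plain numbers. A minimal sketch on made-up listings (the values are assumptions for illustration):

```python
import pandas as pd

# Made-up listings for illustration.
toy = pd.DataFrame({
    "fuel": ["gas", "gas", "diesel", "diesel", "hybrid"],
    "price": [10000, 14000, 30000, 34000, 9000],
})

# The same per-category means that the bars show, largest first.
mean_price_by_fuel = toy.groupby("fuel")["price"].mean().sort_values(ascending=False)
print(mean_price_by_fuel)
```

Applying the same `groupby` to `cars_cleaned` would give exact figures to quote next to each plot.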

In [None]:
barplot_generator(cars_cleaned, 'fuel', 'price', 'Car price by Fuel Type with condition as hue', hue='condition')

#### Irrespective of fuel type, car condition strongly affects price. Salvage cars have the lowest price point.

In [None]:
barplot_generator(cars_cleaned, 'year', 'price', 'Car price by Year')

#### Car prices rise steadily with model year from 2000 onward

In [None]:
barplot_generator(cars_cleaned, 'condition', 'price', 'Car price by Condition')

In [None]:
barplot_generator(cars_cleaned, 'condition', 'price', 'Car price by Condition', hue='size')

#### The two plots above show that car condition drives the car price. The size of the car affects the price as well.

In [None]:
barplot_generator(cars_cleaned, 'transmission', 'price', 'Car price by Transmission')

#### Manual cars are priced lower. Other transmission types have higher prices.

In [None]:
barplot_generator(cars_cleaned, 'transmission', 'price', 'Car price by Transmission, hue by size', hue='size')

In [None]:
barplot_generator(cars_cleaned, 'type', 'price', 'Car price by Type')

In [None]:
barplot_generator(cars_cleaned, 'manufacturer', 'price', 'Car price by manufacturer')

In [None]:
barplot_generator(cars_cleaned, 'size', 'price', 'Car price by Size')

In [None]:
barplot_generator(cars_cleaned, 'size', 'price', 'Car price by Size and hue as drive', hue='drive')

In [None]:
barplot_generator(cars_cleaned, 'drive', 'price', 'Car price by Drive')

In [None]:
barplot_generator(cars_cleaned, 'drive', 'price', 'Car price by Drive and hue as size', hue='size')

# Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [None]:
num_features = ['year','odometer']
cat_features = [
    'manufacturer',
    'model',
    'condition',
    'cylinders',
    'fuel',
    'title_status',
    'transmission',
    'drive',
    'size',
    'type',
    'paint_color'
]

#### Converting all categorical values to numerical values using sklearn's LabelEncoder.
#### Converting price to a logarithmic scale since the data is skewed and not normally distributed.
#### Standardizing the other numerical columns.

In [None]:
label_encoder = preprocessing.LabelEncoder()
cars_cleaned[cat_features] = cars_cleaned[cat_features].apply(
    label_encoder.fit_transform)
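
Because the cell above refits a single `LabelEncoder` for every column, the mapping from codes back to labels is not kept. If that mapping is needed later (for example to encode new inputs at prediction time), one option is to keep a fitted encoder per column. This is a sketch on a made-up frame; the `encoders` dictionary is an assumption, not part of the notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas"],
    "drive": ["fwd", "4wd", "rwd"],
})

# Keep one fitted encoder per column so codes can be mapped back to labels,
# and so new inputs can be encoded the same way at prediction time.
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder().fit(toy[col])
    encoded[col] = encoders[col].transform(toy[col])

print(encoded)
print(encoders["fuel"].inverse_transform([0, 1]))   # ['diesel' 'gas']
```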

In [None]:
cars_cleaned

In [None]:
# Scaling numerical data
norm = StandardScaler()
cars_cleaned['price'] = np.log(cars_cleaned['price'])
cars_cleaned['odometer'] = norm.fit_transform(np.array(cars_cleaned['odometer']).reshape(-1,1))
cars_cleaned['year'] = norm.fit_transform(np.array(cars_cleaned['year']).reshape(-1,1))
cars_cleaned['model'] = norm.fit_transform(np.array(cars_cleaned['model']).reshape(-1,1))

# Removing remaining outliers in the log-transformed target using the IQR rule
q1, q3 = (cars_cleaned['price'].quantile([0.25,0.75]))
o1 = q1-1.5*(q3-q1)
o2 = q3+1.5*(q3-q1)
cars_cleaned = cars_cleaned[(cars_cleaned.price>=o1) & (cars_cleaned.price<=o2)]
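
The cell above fits the scaler on all rows before the train/test split. A common alternative, sketched here on synthetic mileage values (an assumption for illustration), is to fit the scaler on the training rows only and reuse its statistics for the test rows, so that no information from the test set reaches the model:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
odometer = rng.uniform(0, 250_000, size=(200, 1))   # synthetic mileage values

X_train, X_test = train_test_split(odometer, test_size=0.1, random_state=0)

scaler = StandardScaler().fit(X_train)      # statistics come from the training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # test rows reuse the training mean and std

print(X_train_scaled.mean(), X_train_scaled.std(), X_test_scaled.shape)
```

Keeping the fitted `scaler` object around also makes it possible to scale new inputs consistently when the model is used for prediction.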

In [None]:
cars_cleaned.head()

#### Using 90% of dataset as Training data and 10% as Test data

In [None]:
def split_dataset(df, n):
    """
    Function to split training and test dataset
    """
    X = df.iloc[:,n]
    y = df.iloc[:,-1:].values.T
    y = y[0]
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        train_size=0.9,
        test_size=0.1,
        random_state=0
    )
    return (X_train,X_test,y_train,y_test)

X_train, X_test, y_train, y_test = split_dataset(
    cars_cleaned,
    list(range(len(list(cars_cleaned.columns))-1))
)

In [None]:
def remove_neg(y_test, y_pred):
    """
    Function to keep only the samples with a positive predicted value
    before computing MSLE.
    """
    mask = y_pred > 0
    return (y_test[mask], y_pred[mask])

def evaluate(y_test, y_pred):
    """
    Function to evaluate the model.
    Returns [MSLE, root MSLE, R2 score, R2 score as a percentage].
    """
    msle = mean_squared_log_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return [msle, np.sqrt(msle), r2, round(r2 * 100, 4)]

# Dataframe to store the performance of each model
# Note: the target is already log(price), so MSLE is computed here on the log scale;
# the 'Accuracy(%)' row is the R2 score expressed as a percentage.
accuracy = pd.DataFrame(index=['MSLE', 'Root MSLE', 'R2 Score','Accuracy(%)'])
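
For reference, MSLE on the price scale is the mean squared error of `log(1 + price)`, so a model trained on log prices can be scored on the original scale by applying `np.exp` to its predictions first. The sketch below checks this relationship on made-up prices (the numbers are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score

# Made-up actual and predicted prices in dollars.
price_true = np.array([5000.0, 12000.0, 30000.0, 8000.0])
price_pred = np.array([5500.0, 11000.0, 27000.0, 9000.0])

# MSLE on the price scale is the MSE of log(1 + price).
msle = mean_squared_log_error(price_true, price_pred)
mse_of_logs = mean_squared_error(np.log1p(price_true), np.log1p(price_pred))

# A model trained on log(price) can be scored this way after np.exp() of its output.
log_pred = np.log(price_pred)
msle_from_log_model = mean_squared_log_error(price_true, np.exp(log_pred))

r2_on_log_scale = r2_score(np.log(price_true), log_pred)
print(msle, mse_of_logs, msle_from_log_model, r2_on_log_scale)
```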

## Linear regression with RFE

In [None]:
folds = KFold(n_splits = 5, shuffle = True, random_state = 100)

hyper_params = [{'n_features_to_select': list(range(1, 14))}]

lm = LinearRegression()
lm.fit(X_train, y_train)
rfe = RFE(lm)  

# Performing GridSearchCV with RFE
model_cv = GridSearchCV(
    estimator = rfe, 
    param_grid = hyper_params, 
    scoring= 'r2', 
    cv = folds, 
    verbose = 1,
    return_train_score=True
)

model_cv.fit(X_train, y_train)   

cv_results = pd.DataFrame(model_cv.cv_results_)

In [None]:
fig = go.Figure()
# go.Line is deprecated; go.Scatter with mode='lines' draws the same line trace
fig.add_trace(go.Scatter(x=cv_results["param_n_features_to_select"], y=cv_results["mean_test_score"], mode='lines', name="Test Score"))
fig.add_trace(go.Scatter(x=cv_results["param_n_features_to_select"], y=cv_results["mean_train_score"], mode='lines', name="Train Score"))
fig.update_layout(
    title="Optimal number of features",
    xaxis_title="Number of features",
    yaxis_title="R2 Score"
)
fig.show()

In [None]:
# Based on above graph picking 10 as optimal number of features
n_features_optimal = 10

lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=n_features_optimal)             
rfe = rfe.fit(X_train, y_train)

y_pred = rfe.predict(X_test)

In [None]:
# calculating error/accuracy
y_test_1, y_pred_1 = remove_neg(
    y_test,
    y_pred
)
r1_lr = evaluate(y_test_1,y_pred_1)

print(f"MSLE : {r1_lr[0]}")
print(f"Root MSLE : {r1_lr[1]}")
print(f"R2 Score : {r1_lr[2]} or {r1_lr[3]}%")

accuracy['Linear Regression'] = r1_lr

In [None]:
fig = px.scatter(x=y_test, y=y_pred, labels={'x': "Actual Car Price", 'y': "Predicted Car Price"}, title="Linear regression: Used Car Prediction with Log Price")
fig.show()

In [None]:
fig = px.scatter(x=np.exp(y_test), y=np.exp(y_pred), labels={'x': "Actual Car Price", 'y': "Predicted Car Price"}, title="Linear regression: Used Car Prediction with Actual Price")
fig.show()

In [None]:
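# rfe.estimator is the LinearRegression fitted on all columns above, so there is one coefficient per column.
# The coefficients of the model refit on the RFE-selected columns are in rfe.estimator_.coef_.
fig = px.bar(x=X_train.columns, y=rfe.estimator.coef_, title="Linear Regression with RFE Model Coefs")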
fig.show()

### Highlights from the Linear Regression model with RFE

* Model accuracy (R2 score) is ~63%
* Year is the most important feature: the newer the car, the higher the price. The same holds for cylinders.
* Odometer has a large negative coefficient: the higher the reading, the larger the penalty.
* Paint color does not affect car price much.
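
When reading coefficients from an RFE model, the fitted attributes `support_` (which columns were kept) and `estimator_` (the model refit on those columns) go together. A minimal sketch on synthetic data, where only two of five columns carry signal (the data and the names `f0` to `f4` are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only columns 0 and 3 influence the synthetic target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)
names = np.array(["f0", "f1", "f2", "f3", "f4"])

rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)

selected = names[rfe.support_]       # columns kept by RFE
coefs = rfe.estimator_.coef_         # coefficients of the model refit on those columns
print(dict(zip(selected, coefs.round(2))))
```

In the notebook, the equivalent pairing would be `X_train.columns[rfe.support_]` with `rfe.estimator_.coef_`.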

## Ridge Regression

In [None]:
# Predicting value of alpha

alphas = 10**np.linspace(10,-2,400)
model = RidgeCV(alphas=alphas)
visualizer = AlphaSelection(model)
visualizer.fit(X_train,y_train)
visualizer.show()

In [None]:
# Using alpha found in above graph

ridge_model = Ridge(alpha=23.357,solver='auto')
ridge_model.fit(X_train, y_train)
y_pred = ridge_model.predict(X_test)
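
Instead of reading the best alpha off the plot and typing it in, the value chosen by cross-validation can be taken from the fitted `RidgeCV` object through its `alpha_` attribute. A sketch on synthetic data (the data and the alpha grid are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=150)

alphas = np.logspace(-2, 3, 50)        # candidate regularization strengths
search = RidgeCV(alphas=alphas).fit(X, y)
best_alpha = search.alpha_             # value chosen by cross-validation

# Refit a plain Ridge with the selected alpha instead of typing it in by hand.
ridge = Ridge(alpha=best_alpha).fit(X, y)
print(best_alpha, ridge.coef_.round(2))
```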

In [None]:
fig = px.scatter(x=y_test, y=y_pred, labels={'x': "Actual Car Price", 'y': "Predicted Car Price"}, title="Ridge: Used Car Prediction with Log Price")
fig.show()

In [None]:
fig = px.scatter(x=np.exp(y_test), y=np.exp(y_pred), labels={'x': "Actual Car Price", 'y': "Predicted Car Price"}, title="Ridge: Used Car Prediction with Actual price")
fig.show()

In [None]:
# calculating error/accuracy

y_test_2, y_pred_2 = remove_neg(
    y_test,
    y_pred
)
r2_ridge = evaluate(y_test_2, y_pred_2)

print(f"MSLE : {r2_ridge[0]}")
print(f"Root MSLE : {r2_ridge[1]}")
print(f"R2 Score : {r2_ridge[2]} or {r2_ridge[3]}%")

accuracy['Ridge Regression']=r2_ridge

In [None]:
fig = px.bar(x=X_train.columns, y=ridge_model.coef_, title="Ridge Model Coefs")
fig.show()

### Highlights from the Ridge Regression model with best alpha=23.357

* Model accuracy (R2 score) is ~63%
* The Ridge coefficients are similar to those of the Linear Regression model
* Car year drives the price higher.
* Odometer reading drives it lower.
* Cylinders, transmission, title status and fuel also drive the price of a used car.

## Lasso Regression

In [None]:
# predicting value of alpha

alphas = 10**np.linspace(10,-2,400)
model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model)
visualizer.fit(X_train,y_train)
visualizer.show()

In [None]:
# model object and fitting it
lasso_model = Lasso(alpha=0.010)
lasso_model.fit(X_train,y_train)
y_pred = lasso_model.predict(X_test)

In [None]:
# calculating error/accuracy

y_test_3, y_pred_3 = remove_neg(
    y_test,
    y_pred
)
r3_lasso = evaluate(y_test_3,y_pred_3)

print(f"MSLE : {r3_lasso[0]}")
print(f"Root MSLE : {r3_lasso[1]}")
print(f"R2 Score : {r3_lasso[2]} or {r3_lasso[3]}%")

accuracy['Lasso Regression'] = r3_lasso

In [None]:
fig = px.scatter(x=y_test, y=y_pred, labels={'x': "Actual Car Price", 'y': "Predicted Car Price"}, title="Lasso Model: Used Car Prediction with Log price")
fig.show()

In [None]:
fig = px.scatter(x=np.exp(y_test), y=np.exp(y_pred), labels={'x': "Actual Car Price", 'y': "Predicted Car Price"}, title="Lasso Model: Used Car Prediction with Actual price")
fig.show()

In [None]:
fig = px.bar(x=X_train.columns, y=lasso_model.coef_, title="Lasso Model Coefs")
fig.show()

### Highlights from the Lasso Regression model with best alpha=0.010

* Model accuracy (R2 score) is ~63%
* The Lasso coefficients are similar to those of the Ridge and Linear Regression models
* Car year drives the price higher.
* Odometer reading drives it lower.
* Cylinders, transmission, title status and fuel also drive the price of a used car.
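
What distinguishes Lasso from Ridge is that its L1 penalty can shrink coefficients exactly to zero, which makes it a rough feature selector. The sketch below shows this on synthetic data where only one column carries signal (the data and the alpha value are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Only the first column carries signal in this synthetic target.
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0))   # coefficients shrunk exactly to zero
print(lasso.coef_.round(3), n_zero)
```

Counting the zero entries of `lasso_model.coef_` in the notebook would show whether any of the car features were dropped at alpha=0.010.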

## Random Forest Regressor

In [None]:
random_forest_reg = RandomForestRegressor(
    n_estimators=500,
    random_state=0,
    min_samples_leaf=2,
    max_features=0.5,
    n_jobs=-1,
    oob_score=True
)
random_forest_reg.fit(X_train,y_train)
y_pred = random_forest_reg.predict(X_test)

In [None]:
r5_rf = evaluate(y_test,y_pred)
print(f"MSLE : {r5_rf[0]}")
print(f"Root MSLE : {r5_rf[1]}")
print(f"R2 Score : {r5_rf[2]} or {r5_rf[3]}%")
accuracy['RandomForest Regressor']=r5_rf

In [None]:
fig = px.scatter(x=y_test, y=y_pred, labels={'x': "Actual Car Price", 'y': "Predicted Car Price"}, title="Random Forest: Used Car Prediction with Log price")
fig.show()

In [None]:
fig = px.scatter(x=np.exp(y_test), y=np.exp(y_pred), labels={'x': "Actual Car Price", 'y': "Predicted Car Price"}, title="Random Forest: Used Car Prediction with Actual price")
fig.show()

In [None]:
importances = random_forest_reg.feature_importances_
features = X_train.columns
fig = px.bar(x=features, y=importances, title="Random Forest Regression Feature Importance")
fig.show()

### Highlights from Random Forest Regression

* Model accuracy (R2 score) is ~91%
* Year is the most important feature; size is the least important.
* Odometer, model, manufacturer, cylinders and fuel are the other major factors affecting used car price.
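
The forest above is created with `oob_score=True`, which makes an out-of-bag R2 estimate available as `oob_score_` after fitting, and the importances are easier to rank when paired with column names. A sketch on synthetic data (the data and column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["year", "odometer", "size"])
# Synthetic target: driven mostly by "year", a little by "odometer", not by "size".
y = 3.0 * X["year"] - 1.0 * X["odometer"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=0).fit(X, y)

# Out-of-bag R2: each row is scored only by trees that did not see it in training.
oob_r2 = forest.oob_score_

# Importances paired with column names, largest first; they sum to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(round(oob_r2, 3))
print(importances)
```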

# Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [None]:
accuracy

In [None]:
model_accuracy = accuracy.loc['Accuracy(%)']
model_accuracy = pd.DataFrame({'Algorithm':model_accuracy.index, 'Accuracy':model_accuracy.values})

In [None]:
fig = px.line(model_accuracy, x='Algorithm', y='Accuracy', title='Model Performance!', markers=True)
fig.update_traces(textposition="bottom right")
fig.show()

## The plot above shows that the Random Forest Regression model has the highest accuracy.

In [None]:
# Saving this model for later use
with open('random_forest_reg.sav', 'wb') as model_file:
    pickle.dump(random_forest_reg, model_file)

## Recommendations to the car dealership

* Year and odometer are the attributes consumers value most, and they determine the price range of the car.
* Diesel and electric cars can sell for higher prices than gas cars.
* A higher number of cylinders drives the car price up.
* Title status and condition affect prices. Salvage cars are penalized most, which brings their prices down.
* RWD cars are penalized more than FWD/4WD cars and have lower price points.
* Automatic and other transmission types have higher price points, while manual reduces the car price.

These are some recommendations the dealership can use when procuring used cars to drive sales up and provide great customer satisfaction.

# Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

## Here I have tried to deploy my model so that the car dealership can use it to get price predictions for used cars.

In [None]:
features = [
    "year",
    "manufacturer",
    "model",
    "condition",
    "cylinders",
    "fuel",
    "odometer",
    "title_status",
    "transmission",
    "drive",
    "size",
    "type",
    "paint_color"
]

In [None]:
# Load my best model from disk for prediction; the with-block closes the file handle
with open('random_forest_reg.sav', 'rb') as model_file:
    my_best_model = pickle.load(model_file)
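
Before relying on a reloaded model, it can help to confirm that the save/load round trip preserves its predictions. The sketch below is an assumption-laden illustration: it trains a small synthetic `RandomForestRegressor` as a stand-in for `random_forest_reg` and writes to a temporary file (`model_check.sav` is a hypothetical name) rather than the real `.sav` file.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for `random_forest_reg`.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Write to a temporary path so the sketch does not touch the real .sav file.
path = os.path.join(tempfile.mkdtemp(), "model_check.sav")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

Pickle files can execute arbitrary code when loaded, so only unpickle model files from a trusted source.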

In [None]:
df = pd.read_csv('vehicles_finalized.csv')
df.head()

## The car dealership can use the Python function below to get predicted car prices by providing values for all of the following arguments

1. **Year**: 1995 to 2022 </br></br>

2. **Odometer**: Integer value greater than 0 </br></br>

3. **Manufacturer**: Use one of below values. </br></br>
    'infiniti' 'gmc' 'chevrolet' 'toyota' 'ford' 'jeep' 'nissan' 'ram' </br>
    'mazda' 'cadillac' 'honda' 'dodge' 'lexus' 'jaguar' 'buick' 'chrysler' </br>
    'volvo' 'audi' 'lincoln' 'alfa-romeo' 'subaru' 'acura' 'hyundai' </br>
    'mercedes-benz' 'bmw' 'mitsubishi' 'volkswagen' 'porsche' 'kia' 'rover' </br>
    'ferrari' 'mini' 'pontiac' 'fiat' 'tesla' 'saturn' 'mercury' </br>
    'harley-davidson' 'aston-martin' 'land rover' 'morgan' </br></br>
 
4. **Condition**: Use one of below values. </br></br>
    'fair' 'good' 'excellent' 'like new' 'new' 'salvage'</br></br>
    
5. **Cylinders**: Use one of below values. </br></br>
    '5 cylinders' '8 cylinders' '6 cylinders' '4 cylinders' '3 cylinders' '10 cylinders' 'other' '12 cylinders'</br></br>

6. **Fuel**: Use one of below values. </br></br>
    'gas' 'other' 'diesel' 'hybrid' 'electric'</br></br>

7. **Transmission**: Use one of below values. </br></br>
    'automatic' 'other' 'manual'</br></br>

8. **Drive**: Use one of below values. </br></br>
    '4wd' 'rwd' 'fwd' </br></br>
   
9. **Size**: Use one of below values. </br></br>
    'full-size' 'mid-size' 'compact' 'sub-compact'  </br></br>

10. **Type**: Use one of below values. </br></br>
    'offroad' 'pickup' 'truck' 'other' 'coupe' 'SUV' 'hatchback' 'mini-van' 'sedan' 'bus' 'convertible' 'wagon' 'van'</br></br>
    
11. **Paint Color**: Use one of below values. </br></br>
    'grey' 'white' 'blue' 'red' 'black' 'silver' 'brown' 'yellow' 'orange' 'green' 'custom' 'purple' </br></br>
  
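12. **Model**: Use a model name exactly as it appears in the dataset's `model` column (for example, 'g series'). </br></br>

13. **Title Status**: Use a value that appears in the dataset's `title_status` column (for example, 'clean').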
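
Because the prediction function looks each value up with `list.index`, a value outside these lists raises a bare `ValueError`. One way to give the dealership a clearer message is to check inputs first. The sketch below is only an illustration: `ALLOWED_VALUES` and `validate_inputs` are hypothetical names, and only a few of the arguments listed above are covered.

```python
# Allowed values copied from the list above; only a few arguments are shown.
ALLOWED_VALUES = {
    "fuel": {"gas", "other", "diesel", "hybrid", "electric"},
    "transmission": {"automatic", "other", "manual"},
    "drive": {"4wd", "rwd", "fwd"},
    "size": {"full-size", "mid-size", "compact", "sub-compact"},
}


def validate_inputs(year, odometer, **categorical):
    """Return a list of problems; an empty list means the inputs look valid."""
    problems = []
    if not 1995 <= year <= 2022:
        problems.append(f"year must be between 1995 and 2022, got {year}")
    if odometer <= 0:
        problems.append(f"odometer must be greater than 0, got {odometer}")
    for name, value in categorical.items():
        allowed = ALLOWED_VALUES.get(name)
        # Arguments without a known list of values are not checked here.
        if allowed is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    return problems


print(validate_inputs(year=2019, odometer=70000, fuel="gas", drive="4wd"))
print(validate_inputs(year=1990, odometer=70000, fuel="petrol"))
```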
    

In [None]:
def predict_used_cars_prices(
        year=2019,
        manufacturer='bmw',
        model='g series',
        condition='good',
        cylinders='5 cylinders',
        fuel='gas',
        odometer=70000,
        title_status='clean',
        transmission='automatic',
        drive='4wd',
        size='full-size',
        type_='coupe',
        paint_color='red'):
    """
    Function to predict Used car prices based on below parameters.
    "year",
    "manufacturer",
    "model",
    "condition",
    "cylinders",
    "fuel",
    "odometer",
    "title_status",
    "transmission",
    "drive",
    "size",
    "type",
    "paint_color"
    """
    # Standardize year and odometer with a scaler fitted on the dataset.
    # `df` is deliberately left unmodified: overwriting df['year'] and
    # df['odometer'] with scaled values would make every later call fit the
    # scaler on already-standardized data and mis-scale the inputs.
    norm = StandardScaler()
    norm.fit(df[['year', 'odometer']])
    year_odometer = pd.DataFrame(
        data=[[year, odometer]],
        columns=['year', 'odometer']
    )
    values = norm.transform(year_odometer).flatten()

    def encode(column, value):
        # Position of the value among the column's unique values; assumes the
        # training data used this same order-of-appearance encoding.
        return list(df[column].unique()).index(value)

    input_ = pd.DataFrame(data=[[
        values[0],
        encode('manufacturer', manufacturer),
        encode('model', model),
        encode('condition', condition),
        encode('cylinders', cylinders),
        encode('fuel', fuel),
        values[1],
        encode('title_status', title_status),
        encode('transmission', transmission),
        encode('drive', drive),
        encode('size', size),
        encode('type', type_),
        encode('paint_color', paint_color),
    ]], columns=features)
    pred = my_best_model.predict(input_)
    # np.exp undoes the log transform assumed to have been applied to price in training
    price = np.exp(pred[0])
    print(f"Predicted price of {manufacturer} {model}->{year} car is: {price:.2f}")
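
The function standardizes `year` and `odometer` before calling the model. A minimal sketch of that step in isolation is shown here, assuming a tiny made-up frame `cars` in place of the real `df`. It fits the scaler once on the raw columns and checks the result against the formula `(x - mean) / std`, which is what `StandardScaler` computes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset standing in for the real `df`.
cars = pd.DataFrame({
    "year": [2010, 2015, 2020],
    "odometer": [120000, 80000, 20000],
})

# Fit once on the raw columns; the raw columns are never overwritten.
scaler = StandardScaler().fit(cars[["year", "odometer"]])

new_car = pd.DataFrame([[2019, 70000]], columns=["year", "odometer"])
scaled = scaler.transform(new_car).flatten()

# Same computation by hand, using the mean and population standard deviation
# that the scaler learned from `cars`.
manual = (new_car.to_numpy().flatten() - scaler.mean_) / scaler.scale_
print(scaled, manual)
```

Because the scaler is fitted on unchanged raw data, transforming the same car twice gives the same scaled values.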


In [None]:
predict_used_cars_prices(
    year=2022,
    manufacturer='infiniti',
    model='g series',
    condition='good',
    cylinders='5 cylinders',
    fuel='gas',
    odometer=70000,
    title_status='clean',
    transmission='automatic',
    drive='4wd',
    size='full-size',
    type_='coupe',
    paint_color='red'
)

In [None]:
predict_used_cars_prices(
    year=1995,
    manufacturer='bmw',
    model='525i',
    condition='salvage',
    cylinders='5 cylinders',
    fuel='gas',
    odometer=170000,
    title_status='clean',
    transmission='automatic',
    drive='rwd',
    size='full-size',
    type_='coupe',
    paint_color='black'
)

In [None]:
predict_used_cars_prices(
    year=1995,
    manufacturer='honda',
    model='odyssey',
    condition='salvage',
    cylinders='5 cylinders',
    fuel='gas',
    odometer=270000,
    title_status='clean',
    transmission='automatic',
    drive='4wd',
    size='full-size',
    type_='coupe',
    paint_color='red'
)

In [None]:
predict_used_cars_prices(
    year=2020,
    manufacturer='tesla',
    model='model 3 long range sedan',
    condition='good',
    cylinders='5 cylinders',
    fuel='electric',
    odometer=3996,
    title_status='clean',
    transmission='other',
    drive='4wd',
    size='full-size',
    type_='sedan',
    paint_color='white'
)

# We can also build a basic Flask or Django web app that accepts these values from a UI and runs this function to return predicted prices.

# This enables a full end-to-end (e2e) solution for a used car price prediction ML application.
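
As a rough illustration of the Flask option, the sketch below exposes a single `/predict` endpoint that accepts JSON. Everything in it is an assumption: the route name, the `REQUIRED_FIELDS` subset, and `predict_price`, which is a hypothetical placeholder returning a fixed number where a real app would call the pickled model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical subset of required inputs; a real app would require all arguments.
REQUIRED_FIELDS = ["year", "manufacturer", "model", "odometer"]


def predict_price(payload):
    """Hypothetical stand-in for the notebook's prediction logic.

    A real app would load the pickled model and return its prediction;
    this placeholder just returns a fixed number.
    """
    return 12345.0


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True yields None instead of an error when the body is not valid JSON.
    payload = request.get_json(silent=True) or {}
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        return jsonify({"error": f"missing fields: {missing}"}), 400
    return jsonify({"predicted_price": predict_price(payload)})
```

A form-based UI would post the same fields to this endpoint and display the returned `predicted_price`.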

# Next Steps

* Continue to analyze each feature further and remove features that do not affect prices much, to improve model performance.
* Try to deploy a basic Flask/Django application to build an e2e ML solution.
* Apply XGBoost and other algorithms once we cover them in future modules.
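
For the first step, one possible starting point is the impurity-based feature importances that a fitted random forest exposes. The sketch below uses synthetic data (the column names and coefficients are invented for illustration), so it only shows the mechanics, not which features matter in the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on `year`, weakly on `odometer`,
# and not at all on `noise`.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "year": rng.normal(size=500),
    "odometer": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y = 3.0 * X["year"] - 0.5 * X["odometer"] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances sum to 1; low values mark candidates for removal.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Features near the bottom of this ranking are candidates to drop, after which the model can be retrained and its accuracy compared with the original.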
 