### Business Scenario

RestoreMasters is a car restoration company based in New York, USA. Within a short span of time, the company has become renowned for restoring vintage cars. The team takes great pride in each project, no matter how big or small. They offer paint jobs, frame build-ups, engine restoration, body work, and more. They restore cars of various origins, including the USA, Europe, and Asia.

The management wants to expand the business by increasing the number of cars that can be restored. They want to generate greater revenue through cost cutting and by bringing a data-driven approach to their current process. They believe that insights from existing data will help them make data-driven decisions and automate some of the key tasks in the process.

<hr style="border:2px solid gray">

#**STEP: 0/4** - Run the following lines of code and move to Step 1. 

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline as sklearn_pipeline
from sklearn.metrics import classification_report

from sklearn.inspection import permutation_importance

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [None]:
%%capture
!pip install category_encoders==2.*
!pip install pdpbox
!pip install shap

from category_encoders import OrdinalEncoder
from pdpbox.pdp import pdp_interact, pdp_interact_plot
import shap

In [None]:
#Update the DATA_PATH variable

import sys

if 'google.colab' in sys.modules:
  # If you're on Colab:
  DATA_PATH = 'https://raw.githubusercontent.com/bloominstituteoftechnology/ds_code_along_unit_2/main/data/restoremasters/'
else:
  # If you're working locally:
  DATA_PATH = '..../data/'

In [None]:
# importing the dataset to Pandas DataFrame: cars_df
cars_df=pd.read_csv(DATA_PATH +'auto_mpg.csv')

In [None]:
# display the data in DataFrame: cars_df
cars_df.head()

In [None]:
# Replace missing horsepower values with the mean horsepower of cars that share the same cylinders and model_year
# (transform keeps the original row index, so the result aligns with cars_df)
cars_df['horsepower'] = cars_df.groupby(['cylinders', 'model_year'])['horsepower'].transform(lambda x: x.fillna(x.mean()))

In [None]:
# checking for missing values in columns of the DataFrame: cars_df
cars_df.isna().any()

In [None]:
# get the duplicate records in the DataFrame: cars_df
cars_df[cars_df.duplicated()]

In [None]:
# dropping the duplicate records in the DataFrame: cars_df
cars_df.drop_duplicates(inplace=True)
cars_df.duplicated().sum()

In [None]:
# finding the upper and lower limits of the horsepower and acceleration columns

def find_outlier_limits(col_name):
    Q1,Q3=cars_df[col_name].quantile([.25,.75])
    IQR=Q3-Q1
    low=Q1-(1.5* IQR)
    high=Q3+(1.5* IQR)
    return (high,low)

high_hp,low_hp=find_outlier_limits('horsepower')
print('Horsepower: ','upper limit: ',high_hp,' lower limit: ',low_hp)
high_acc,low_acc=find_outlier_limits('acceleration')
print('Acceleration: ','upper limit: ',high_acc,' lower limit:',low_acc)

# Replacing outlier values in the horsepower and acceleration columns with the
# respective upper and lower limits

cars_df.loc[cars_df['horsepower']>high_hp,'horsepower']=high_hp
cars_df.loc[cars_df['acceleration']>high_acc,'acceleration']=high_acc
cars_df.loc[cars_df['acceleration']<low_acc,'acceleration']=low_acc
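The capping step above can be illustrated on a tiny list without pandas. This is a minimal sketch, not the notebook's code: `iqr_limits`, `clip_to_limits`, and the `sample_hp` readings are hypothetical names and values, and it assumes that `statistics.quantiles(..., method='inclusive')` agrees with the linear interpolation pandas uses by default.

```python
from statistics import quantiles


def iqr_limits(values):
    """Return (upper, lower) fences at 1.5 * IQR, in the same order as find_outlier_limits."""
    # 'inclusive' interpolates linearly between data points (assumed to match pandas' default).
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q3 + 1.5 * iqr, q1 - 1.5 * iqr


def clip_to_limits(values, high, low):
    """Cap each value at the fences instead of dropping the row."""
    return [min(max(v, low), high) for v in values]


# Hypothetical horsepower readings; 400 is an obvious outlier.
sample_hp = [70, 80, 90, 100, 110, 120, 400]
high, low = iqr_limits(sample_hp)
clipped = clip_to_limits(sample_hp, high, low)
print('upper limit:', high, 'lower limit:', low)
print(clipped)
```

Capping keeps every row in the dataset, which matters for a small table; the trade-off is that several extreme cars end up sharing the same boundary value.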

In [None]:
# Extracting the company name from the name column
cars_df['company']=cars_df['name'].apply(lambda x:x.split()[0])
cars_df.head()

In [None]:
# based on mpg and cylinders, creating a new column - car_type (hatchback, sedan, SUV, sports)

cars_df.loc[(cars_df['cylinders']==3),'car_type']='Hatchback'
cars_df.loc[(cars_df['cylinders']==4) & (cars_df['mpg']>=30),'car_type']='Hatchback'
cars_df.loc[(cars_df['cylinders']==5),'car_type']='Sedan'
cars_df.loc[(cars_df['cylinders']==4) & (cars_df['mpg']<30),'car_type']='Sedan'
cars_df.loc[(cars_df['cylinders']==6),'car_type']='SUV'
cars_df.loc[(cars_df['cylinders']==8),'car_type']='Sports'
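The labelling rules above can also be read as a single function of `cylinders` and `mpg`. The sketch below is only an illustration of that rule table; `classify_car_type` is a hypothetical helper, not something the notebook defines.

```python
def classify_car_type(cylinders, mpg):
    """Return the car_type label implied by the rules above, or None if no rule applies."""
    if cylinders == 3 or (cylinders == 4 and mpg >= 30):
        return 'Hatchback'
    if cylinders == 5 or (cylinders == 4 and mpg < 30):
        return 'Sedan'
    if cylinders == 6:
        return 'SUV'
    if cylinders == 8:
        return 'Sports'
    return None  # e.g. a cylinder count the rules do not cover


# A few hypothetical cars to check each branch.
examples = [(4, 32.0), (4, 24.5), (6, 19.0), (8, 14.0), (5, 25.0), (3, 21.0)]
labels = [classify_car_type(c, m) for c, m in examples]
print(labels)
```

Because the target is derived from `cylinders` and `mpg`, and only `cylinders` is dropped afterwards, `mpg` will likely rank high in the importance plots later in the notebook.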

In [None]:
# drop the cylinders and name columns

cars_df.drop(columns=['cylinders','name'],inplace=True)

In [None]:
cars_df.head()

In [None]:
# Split the data into Feature Matrix and Target Vector
target = 'car_type'
y = cars_df[target]
X = cars_df.drop(columns=[target])

# Split data into Train and Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
# Ordinal Encode categorical values (we will not be using pipelines today)
oe = OrdinalEncoder()
oe.fit(X_train)
XT_train = oe.transform(X_train)
XT_test = oe.transform(X_test)
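To see why the encoder is fit on the training set only, here is a library-free sketch of ordinal encoding. It is an approximation, not the `category_encoders` implementation: `fit_ordinal` and `transform_ordinal` are hypothetical helpers, and the choice of numbering from 1 and mapping unseen categories to -1 is an assumption about that library's default behaviour.

```python
def fit_ordinal(values):
    """Assign each distinct category an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return mapping


def transform_ordinal(values, mapping, unknown=-1):
    """Encode values with a fitted mapping; unseen categories get a sentinel."""
    return [mapping.get(v, unknown) for v in values]


# Hypothetical company names; 'saab' appears only in the test split.
train_companies = ['ford', 'toyota', 'ford', 'bmw']
test_companies = ['toyota', 'saab']

mapping = fit_ordinal(train_companies)
encoded_train = transform_ordinal(train_companies, mapping)
encoded_test = transform_ordinal(test_companies, mapping)
print(mapping)
print(encoded_train, encoded_test)
```

Learning the mapping from the training rows alone means the test set cannot influence the codes, and any company that shows up only at test time is flagged rather than silently given a new number.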

In [None]:
# Build a Random Forest Classifier Model
model_rf = RandomForestClassifier(random_state=42, n_jobs=-1)
model_rf.fit(XT_train, y_train);

<hr style="border:2px solid gray">




#**STEP: 1/4** - Evaluate and interpret model performance


In [None]:
# check accuracy
print('Test Accuracy:', accuracy_score(y_test, model_rf.predict(XT_test)))

In [None]:
# build classification report
print(classification_report(y_test, model_rf.predict(XT_test)))
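The columns of the classification report can be reproduced by hand, which helps when interpreting them. The following is a minimal sketch on made-up labels; `per_class_metrics`, `y_true_demo`, and `y_pred_demo` are hypothetical and are not taken from the notebook's data.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and support for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed a car that belonged to this class
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {'precision': precision, 'recall': recall,
                         'f1': f1, 'support': actual}
    return report


# Hypothetical labels: one Sedan is mistaken for an SUV.
y_true_demo = ['Sedan', 'Sedan', 'SUV', 'Sports', 'SUV', 'Sedan']
y_pred_demo = ['Sedan', 'SUV', 'SUV', 'Sports', 'SUV', 'Sedan']
report = per_class_metrics(y_true_demo, y_pred_demo)
for label, row in report.items():
    print(label, row)
```

In this toy case the single mistake lowers recall for Sedan (a Sedan was missed) and precision for SUV (an SUV prediction was wrong), which is the pattern to look for when reading the real report.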

In [None]:
# Plot the top ten feature importances

importances = model_rf.feature_importances_
# column names of the encoded training data, in the order the model saw them
features = XT_train.columns
feat_imp = pd.Series(importances, index=features).sort_values(key=abs)
feat_imp.tail(10).plot(kind='barh')
plt.xlabel('Reduction in Gini Impurity')
plt.ylabel('Features')
plt.title('Feature Importances');

<hr style="border:2px solid gray">

#**STEP: 2/4** - Permutation Importances


In [None]:
# calculate permutation importances

perm_imp = 

In [None]:
perm_imp

In [None]:
#create a dataframe for easy interpretation

data_perm = {'imp_mean':perm_imp['importances_mean'],
             'imp_std':perm_imp['importances_std']}

df_perm = pd.DataFrame(data_perm, index=XT_test.columns).sort_values('imp_mean')

In [None]:
df_perm
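In the notebook, the blank `perm_imp` cell would typically be filled with a call along the lines of `permutation_importance(model_rf, XT_test, y_test, random_state=42)`; the exact arguments are left to the exercise. To show what that function measures, here is a from-scratch sketch that does not use scikit-learn. Everything in it (`accuracy`, `permutation_importance_sketch`, `toy_model`, the demo data) is hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance_sketch(model, X, y, n_repeats=5, seed=42):
    """Mean drop in accuracy when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted model: it only looks at feature 0.
def toy_model(row):
    return 'Sports' if row[0] > 200 else 'Sedan'


X_demo = [[300, 1], [120, 2], [350, 3], [100, 4], [310, 5], [90, 6]]
y_demo = ['Sports', 'Sedan', 'Sports', 'Sedan', 'Sports', 'Sedan']
imp = permutation_importance_sketch(toy_model, X_demo, y_demo)
print(imp)
```

A feature the model never uses gets an importance of exactly zero, because shuffling it cannot change any prediction. Unlike the Gini-based importances, this measure is computed on held-out data, so it reflects what the model relies on when predicting unseen cars.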

<hr style="border:2px solid gray">

#**STEP: 3/4** - PDP Interact Plot


In [None]:
#select two features
two_selected_features = 

In [None]:
# instantiate pdp_interact class

interact = 

In [None]:
# Plot PDP interact plot
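The notebook imports `pdp_interact` and `pdp_interact_plot` from pdpbox for this step; their arguments depend on the installed pdpbox version, so they are not reproduced here. The idea behind the plot can be sketched without the library: force two features to each pair of grid values for every row, then average the predictions. `partial_dependence_2d`, `toy_predict`, and the demo data below are hypothetical.

```python
def partial_dependence_2d(predict, X, i, j, grid_i, grid_j):
    """Average prediction when features i and j are forced to each grid pair."""
    surface = []
    for a in grid_i:
        surface_row = []
        for b in grid_j:
            total = 0.0
            for row in X:
                modified = list(row)             # copy so the data is not changed
                modified[i], modified[j] = a, b  # overwrite the two chosen features
                total += predict(modified)
            surface_row.append(total / len(X))
        surface.append(surface_row)
    return surface


# Hypothetical scoring function with an interaction between features 0 and 1.
def toy_predict(row):
    return row[0] * row[1] + row[2]


X_demo = [[1, 1, 0], [2, 2, 10]]
surface = partial_dependence_2d(toy_predict, X_demo, 0, 1, [1, 2], [10, 20])
for surface_row in surface:
    print(surface_row)
```

In the printed grid, moving feature 1 from 10 to 20 changes the average prediction by a different amount depending on the value of feature 0. That dependence is the interaction a PDP interact plot is meant to reveal.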


<hr style="border:2px solid gray">

#**STEP: 4/4** - Shapley Plot

In [None]:
# select sample

sample_row = 

In [None]:
# final model prediction for this sample


In [None]:
# Create an instance of TreeExplainer
explainer = shap.TreeExplainer(model_rf)  # does not like pipelines

# get shap values
shap_values = explainer.shap_values(sample_row)

shap.initjs()  # load the JavaScript that SHAP needs to render interactive plots

# force plot
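A force plot shows how each feature pushes one prediction away from a base value. In shap this is drawn with `shap.force_plot`, which takes a base value, the SHAP values, and the feature values; for a multi-class classifier these are organised per class, and the exact layout depends on the shap version. The arithmetic behind it can be sketched exactly for a tiny model. This is a brute-force illustration, not the algorithm `TreeExplainer` uses, and all names and numbers are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, sample, background):
    """Exact Shapley values by enumerating every feature subset (small n only)."""
    n = len(sample)

    def value(subset):
        # Features in `subset` take the sample's values; the rest are
        # filled in from each background row, and predictions are averaged.
        total = 0.0
        for ref in background:
            mixed = [sample[k] if k in subset else ref[k] for k in range(n)]
            total += predict(mixed)
        return total / len(background)

    phis = []
    for k in range(n):
        others = [f for f in range(n) if f != k]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {k}) - value(s))
        phis.append(phi)
    return phis, value(set())


# Hypothetical linear scoring function and data.
def toy_predict(row):
    return 2 * row[0] + 3 * row[1]


sample = [3, 1]
background = [[1, 1], [1, 3]]
phis, base_value = shapley_values(toy_predict, sample, background)
print('base value:', base_value)
print('contributions:', phis)
print('prediction:', toy_predict(sample))
```

The contributions add up to the difference between the prediction for the sample and the base value, which is the property the force plot visualises.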
