## Predictive Model for Car Rental Reservations

### **Project Objective**: 
This project uses car rental data from Statistics Canada to predict which customers are looking to reserve a car rental.

### Modelling Process

Our client wants to determine which customers are likely to rent an automobile, so that they can be targetted with marketing campaigns. From a modelling perspective, that would require a binary classifier, with the outcome of interest being whether people will rent a car or not. The steps taken to build the binary classifier are as follows - 

<b> 1. Data Cleaning: </b>

   - The data provided by the client did not have very intuitive column names, so one of the first data cleaning tasks was to rename the columns, to make them more intuitive and easier to work with. 
   - The data collected seems to be from a customer survey / form and questions are often skipped by the customer - so the data points have to be adjusted based on what is known for a given feature / column. 
   - Categorical data is present as integers in the raw data and have to be one-hot encoded / converted into dummy variables.


<b> 2. Feature Engineering: </b> Different columns are added, binned to generate features. Features can be bucketed into the following broad categories - 

        - Route of Entry into Canada
        - Class of Travel while Entering Canada
        - Reason for Trip to Canada
        - Mode of Transportation while in Canada
        - Duration of Stay
        - Accomodation in Canada 
        - Demographic Features of Customer
        - Activity in Canada
        - Customer Spend in Canada


<b> 3. Feature Extraction: </b> An econometric method to feature selection is chosen here, where the features which can potentially impact a customer's decision to rent a car or not are selected, to be input into the model. 


<b> 4. Model Fitting: </b> 
- The data provided is split into in a 70:30 ratio into training and test sets respectively. The model will be fitted on the training set, essentially the model will learn the underlying patterns in the data from the training set. The model's performance will then be evaluated on the test set. The training and test sets are kept completely separate from each other, to prevent data leakage. 


- Three kinds of models are chosen for this binary classification problem:

    1. <b> Decision Tree Classifier: </b> A decision tree approach splits the data in a way the gini impurity / misclassification rate can be minimized, until the positive and negative classes are distinctly separated. The positive class, in this case, is people who will rent, and the negative class constitutes people who would not rent. The decision tree also provides us with a feature importance chart, where the most important features impacting the decision to rent or not, can be seen to bubble up.

    2. <b> K-Nearest Neighbors Classifier (KNN): </b> The KNN classifier segments customers based on their resemblance / proximity to other similar data points. The most important features identified from the decision tree is used as input to the KNN classifier as it is a very computationally expensive algorithm. The data input to the model is scaled and normalized so that all features range between -1 and 1.  
    
    3. <b> Logistic Regression Classifier: </b> The logistic regression classifier fits a sigmoid function to the underlying data to estimate the probability of a customer renting a car or not. Since the training data was imbalanced (there were more people who didn't rent than people who did), threshold tuning was performed.
    

<b> 5. Hyper Parameter Tuning + Cross Validation: </b> The hyper-parameters, or configuration settings of the model which can impact model performance, are selected. In this case, a grid search approach is taken, where the model is provided with a range of input values for the hyper-parameters - the model is trained on each value in the solution space and the set of values which provide the best output is chosen. The range of input values are then narrowed based on the findings and the model is trained on each value again, now in a smaller solution space, to identify the best output. This is repeated a few times, until the model performance stops changing. 

This is paired with cross-validation, implying the model is trained multiple times for each hyper parameter, on different slices of the data, every single time, and the average of the results is presented. This ensures the model stays general and does not memorize the noise / overfit on any specific part of the dataset.

While the hyper-parameter tuning is performed, 'recall' is used as the scoring criteria and the hyper-parameter tuning aimed to maximize the recall for each model. Since the model will be used for marketing purposes, it is essential to cast a wider net and capture as many potential customers as possible. (Further explanation provided in Answer to 1c and 1d) 


<b> 6. Model Evaluation: </b> For each model, the precision, recall, F-1 score and Area Under the Curve (AUC) are calculated. The precision score indicates how accurately people who will rent a car are being predicted. The recall indicates what proportion of people who will actually rent are being captured by the model. F-1 score and AUC indicate how well the model fits the underlying data. These metrics are calculated differently for the training datasets and the testing datasets for each model. The confusion matrix is also printed for each model, as that can help us quantify the False Positives and False Negatives in each model. The evaluation metrics in the train and test sets are compared and contrasted for each model, to ensure there is no overfitting.

In [94]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import recall_score, precision_score, accuracy_score, f1_score, auc
from sklearn.linear_model import LogisticRegression
from dmba import classificationSummary
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

In [95]:
df = pd.read_excel('/Users/mohammadananjaved/Desktop/MMA867/VTS_2018_PUMF_CSV.xlsx', index_col = 'VPUMFID')

In [96]:
df.head()

Unnamed: 0_level_0,VQUARTER,VPRVENTP,VTPSZE,VRSN3P,VMODENTP,VRTEN,VRTEX,VCFARE1,VCFARE2,VCFARE3,...,VM45_54P,VM55_64P,VM_65P,VF0_17P,VF18_24P,VF25_34P,VF35_44P,VF45_54P,VF55_64P,VF_65P
VPUMFID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
100000,3,35,5,1,2,1,1,6,6,6,...,0,0,0,1,0,1,0,0,0,0
100001,2,35,2,1,2,1,1,6,6,6,...,0,1,1,0,0,0,0,0,0,0
100002,4,35,1,10,1,2,2,2,2,1,...,0,0,0,0,0,0,0,0,0,0
100003,3,59,2,5,2,1,1,6,6,6,...,0,0,0,0,0,0,1,0,0,0
100004,2,24,1,2,1,2,2,2,2,1,...,1,0,0,0,0,0,0,0,0,0


### Binary Classifiers to Predict Whether Customer Will Rent Car or Not

#### Filtering for to Ontario visitors only

In [97]:
df.rename(columns = {'VPRVENTP' : 'Canadian_province_of_entry'}, inplace = True)

In [98]:
df = df[df['Canadian_province_of_entry'] == 35]

#### Outcome Variable - Rented Automobile 

In [99]:
df['Rented_Automobile'] = np.where(df['VTRNIN02'] == 1, 1, 0)

### Data Cleaning + Feature Engineering

#### Route & Class of Travel while Entering Canada

In [100]:
df['First_Class'] = np.where(df['VCFARE1'] == 1, 1, 0)
df['Business_Class'] = np.where(df['VCFARE2'] == 1, 1, 0)
df['Economy_Class'] = np.where(df['VCFARE3'] == 1, 1, 0)
df['Charter_Flight'] = np.where(df['VCFARE4'] == 1, 1, 0)
df['Booked_with_rewards'] = np.where(df['VCFARE5'] == 1, 1, 0)

df['Mode_of_Entry'] = df['VMODENTP'].map({1:'Air', 2:'Land', 3:'Bus_Train'})
df['Location_of_Entry'] = df['VRTEN'].map({1:'US_Direct', 2:'Other_Country_Direct', 3:'Other_Country_via_US'})
df['Location_of_Exit'] = df['VRTEN'].map({1:'US_Direct', 2:'Other_Country_Direct', 3:'Other_Country_via_US'})

#### Reason for trip reason

In [101]:
df['Trip_Reason'] = df['VRSN3P'].map({1:'Holiday', 2:'Personal', 4:'Work',
                                     5:'Study', 6:'Medical', 8:'Personal', 10:'Work', 11:'Work', 14:'Work', 
                                      15:'In_Transit'})

#### Modes of Transportation While in Canada

In [102]:
df['Flights'] = np.where(df['VTRNIN01'] == 1, 1, np.where(df['VTRNIN06'] == 1, 1, 0))

df['VTRNIN03'] = np.where(df['VTRNIN03'] == 1, 1, 0)
df['VTRNIN08'] = np.where(df['VTRNIN08'] == 1, 1, 0)

df['Bus_Train'] = df[['VTRNIN03', 'VTRNIN08']].max(axis = 1)

df['VTRNIN05'] = np.where(df['VTRNIN05'] == 1, 1, 0)
df['VTRNIN04'] = np.where(df['VTRNIN04'] == 1, 1, 0)
df['VTRNIN09'] = np.where(df['VTRNIN09'] == 1, 1, 0)

df['Ferry_Cruise_Ship_Private_Boat'] = df[['VTRNIN05', 'VTRNIN04']].max(axis = 1)

df['Private_vehicle'] = np.where(df['VTRNIN07'] == 1, 1, 0)

#### Duration of Stay

In [103]:
df['Number_of_nights_ON'] = df['VNIGHTON']
df['Time_Spent_ON_Nights'] = df['VPRSNONP']
df['Time_Spent_Other_Provinces_Nights'] = df['VPRSNYUP'] + df['VPRSNSAP'] + df['VPRSNPQP'] + df['VPRSNNBP'] + df['VPRSNPEP'] + df['VPRSNNWP'] + df['VPRSNNUP'] + df['VPRSNNSP'] + df['VPRSNNFP'] + df['VPRSNMAP'] + df['VPRSNBCP'] + df['VPRSNATP']
df['First_Visit_to_Canada'] = np.where(df['VVISIT'] == 1, 1, 0)

df['Timing_of_Visit_Q1'] = np.where(df['VQUARTER'] == 1, 1, 0) 
df['Timing_of_Visit_Q2'] = np.where(df['VQUARTER'] == 2, 1, 0) 
df['Timing_of_Visit_Q3'] = np.where(df['VQUARTER'] == 3, 1, 0) 
df['Timing_of_Visit_Q4'] = np.where(df['VQUARTER'] == 4, 1, 0) 

#### Accomodation in Canada

In [104]:
accomodation_types = ['A', 'B', 'C', 'D', 'E']

for accomodation in accomodation_types:
    for visit_num in range(1, 11):
        if visit_num < 10:
            column_name = 'VACCV0' + str(visit_num) + accomodation
        else:
            column_name = 'VACCV' + str(visit_num) + accomodation            
        df[column_name] = np.where(df[column_name] == 1, 1, 0)
        

df['Hotel'] = df[['VACCV01A', 'VACCV02A', 'VACCV03A', 'VACCV04A', 'VACCV05A', 'VACCV06A', 'VACCV07A', 'VACCV08A',
                  'VACCV09A', 'VACCV10A', 'VACCV01B', 'VACCV02B', 'VACCV03B', 'VACCV04B', 'VACCV05B', 'VACCV06B',
                  'VACCV07B', 'VACCV08B', 'VACCV09B', 'VACCV10B']].max(axis = 1)

df['Home_of_Friends_Family'] = df[['VACCV01C', 'VACCV02C', 'VACCV03C', 'VACCV04C', 'VACCV05C', 'VACCV06C', 'VACCV07C', 'VACCV08C',
                  'VACCV09C', 'VACCV10C']].max(axis = 1)

df['Camp_Trailer_Cottage'] = df[['VACCV01D', 'VACCV02D', 'VACCV03D', 'VACCV04D', 'VACCV05D', 'VACCV06D', 'VACCV07D', 'VACCV08A',
                  'VACCV09D', 'VACCV10D', 'VACCV01E', 'VACCV02E', 'VACCV03E', 'VACCV04E', 'VACCV05E', 'VACCV06E',
                  'VACCV07E', 'VACCV08E', 'VACCV09E', 'VACCV10E']].max(axis = 1)

#### Demographic Features

In [105]:
df['Male_Travellers_Below_17'] = np.where(df['VM0_17P'] != 99, df['VM0_17P'], 0) 
df['Male_Travellers_18_to_24'] = np.where(df['VM18_24P'] != 99, df['VM18_24P'], 0) 
df['Male_Travellers_25_to_34'] = np.where(df['VM25_34P'] != 99, df['VM25_34P'], 0) 
df['Male_Travellers_35_to_44'] = np.where(df['VM35_44P'] != 99, df['VM35_44P'], 0) 
df['Male_Travellers_45_to_54'] = np.where(df['VM45_54P'] != 99, df['VM45_54P'], 0) 
df['Male_Travellers_55_to_64'] = np.where(df['VM55_64P'] != 99, df['VM55_64P'], 0) 
df['Male_Travellers_Above_65'] = np.where(df['VM_65P'] != 99, df['VM_65P'], 0) 

df['Female_Travellers_Below_17'] = np.where(df['VF0_17P'] != 99, df['VF0_17P'], 0) 
df['Female_Travellers_18_to_24'] = np.where(df['VF18_24P'] != 99, df['VF18_24P'], 0) 
df['Female_Travellers_25_to_34'] = np.where(df['VF25_34P'] != 99, df['VF25_34P'], 0) 
df['Female_Travellers_35_to_44'] = np.where(df['VF35_44P'] != 99, df['VF35_44P'], 0) 
df['Female_Travellers_45_to_54'] = np.where(df['VF45_54P'] != 99, df['VF45_54P'], 0) 
df['Female_Travellers_55_to_64'] = np.where(df['VF55_64P'] != 99, df['VF55_64P'], 0) 
df['Female_Travellers_Above_65'] = np.where(df['VF_65P'] != 99, df['VF_65P'], 0) 

df['Travel_Party_Size'] = df['VTPSZE']
df['People_Visiting_Ontario'] = df['VPRPR06P']

#### Money Spent During Visit

In [106]:
df['Total_Spend'] = df['VGLTOTSP']
df['Transportation_Spend'] = df['VGLTRASP']
df['Spend_on_Canadian_Carriers'] = df['VCDNFARE']
df['Money_Spent_in_Canada'] = np.where(df['VFARES'] != 9999996, df['VFARES'], 0) 

#### Activity During Visit

In [107]:
df['Visiting_Friends_Family'] = np.where(df['VACT01'] == 1, 1, 0)
df['Shopping'] = np.where(df['VACT02'] == 1, 1, 0)

for activity in range(3, 33):
    if activity < 10:
        column_name = 'VACT0' + str(activity)
    else:
        column_name = 'VACT' + str(activity)
    
    df[column_name] = np.where(df[column_name] == 1, 1, 0)


df['Indoor_Activities'] = df['VACT03'] + df['VACT04'] + df['VACT05'] + df['VACT06'] + df['VACT07'] + df['VACT08'] + df['VACT09'] + df['VACT10'] + df['VACT11'] + df['VACT12'] + df['VACT13'] + df['VACT14']      
df['Outdoor_Activities'] = df['VACT15'] + df['VACT16'] + df['VACT20'] + df['VACT21'] + df['VACT22'] + df['VACT23'] + df['VACT24'] + df['VACT25'] + df['VACT26'] + df['VACT27'] + df['VACT28'] + df['VACT29'] + df['VACT30'] + df['VACT31'] + df['VACT32']

df['Medical_Treatment'] = np.where(df['VACT17'] == 1, 1, 0)
df['Business_Trip'] = np.where(df['VACT18'] == 1, 1, 0)
df['Play_Sports'] = np.where(df['VACT19'] == 1, 1, 0)

#### Feature Selection

The features in the model can be broadly categorized into the following 9 themes. An econometric approach to feature selection is taken, where the features which are potentially thought to impact a customer's decision to rent or not rent a vehicle, based on theory, are used as feature inputs to the model 

        - Route of Entry into Canada
        - Class of Travel while Entering Canada
        - Reason for Trip to Canada
        - Mode of Transportation while in Canada
        - Duration of Stay
        - Accomodation in Canada 
        - Demographic Features of Customer
        - Activity in Canada
        - Customer Spend in Canada

In [108]:
model_input = df.iloc[:, 268:]

In [109]:
model_input.head()

Unnamed: 0_level_0,Rented_Automobile,First_Class,Business_Class,Economy_Class,Charter_Flight,Booked_with_rewards,Mode_of_Entry,Location_of_Entry,Location_of_Exit,Trip_Reason,...,Transportation_Spend,Spend_on_Canadian_Carriers,Money_Spent_in_Canada,Visiting_Friends_Family,Shopping,Indoor_Activities,Outdoor_Activities,Medical_Treatment,Business_Trip,Play_Sports
VPUMFID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
100000,0,0,0,0,0,0,Land,US_Direct,US_Direct,Holiday,...,47,9999996,0,0,0,4,0,0,0,0
100001,0,0,0,0,0,0,Land,US_Direct,US_Direct,Holiday,...,115,0,0,0,0,0,2,0,0,0
100002,0,0,0,1,0,0,Air,Other_Country_Direct,Other_Country_Direct,Work,...,90,0,1600,0,1,2,0,0,1,0
100007,0,0,0,1,1,0,Air,Other_Country_Direct,Other_Country_Direct,Personal,...,0,1140,1140,1,1,0,0,0,0,0
100009,0,0,0,1,0,0,Air,Other_Country_Direct,Other_Country_Direct,Personal,...,0,0,850,1,1,3,1,0,0,0


#### Function for Evaluation of Classifier

In [110]:
def model_evaluation(y_test, y_predicted, model_name):

    from sklearn import metrics
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_predicted)
    auc = metrics.auc(fpr, tpr)
    precision = precision_score(y_test, y_predicted)
    recall = recall_score(y_test, y_predicted)
    f1_score = 2 * (precision * recall) / (precision + recall)

    plt.plot([0,1], [0,1], 'k--')
    plt.plot(fpr, tpr, label = "auc = " + str(auc))
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve - ' + model_name)
    plt.legend()
    plt.show()

    print(f'Precision for ' + model_name + ' = ' + str(precision))
    print(f'Recall for ' + model_name + ' = ' + str(recall))
    print(f'F1 Score for ' + model_name + ' = ' + str(f1_score))
    print(f'Area under the curve for ' + model_name + ' = ' + str(auc))

### Binary Classifier 1 - Decision Tree

In [111]:
outcome = 'Rented_Automobile'

X = model_input.drop(columns = outcome)
y = model_input[outcome]

# Creating dummy variables for any categorical variables

X = pd.get_dummies(X, drop_first = True)

# Creating a train/test split of 70:30

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

param_grid = {'max_depth': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
             'min_samples_split': [10, 20, 30, 40, 50, 100],
             'min_impurity_decrease': [0, 0.1, 0.001, 0.0001]}

# Grid search along with cross-validation is used to identify the best set of hyper parameters for the model

tree_grid_search = GridSearchCV(DecisionTreeClassifier(random_state = 1), param_grid = param_grid, 
                                cv = 3, n_jobs = -1, scoring = 'recall')

tree_grid_search.fit(X_train, y_train)

predicted_train_tree = tree_grid_search.predict(X_train)
predicted_test_tree = tree_grid_search.predict(X_test)

tree_grid_search.best_params_

{'max_depth': 25, 'min_impurity_decrease': 0, 'min_samples_split': 10}

In [112]:
param_grid = {'max_depth': [5, 7, 10, 13, 15],
             'min_samples_split': [10, 13, 15],
             'min_impurity_decrease': [0, 0.00005, 0.0001]}

tree_grid_search = GridSearchCV(DecisionTreeClassifier(random_state = 1), param_grid = param_grid, 
                                cv = 5, n_jobs = -1, scoring = 'recall')

tree_grid_search.fit(X_train, y_train)

best_tree = tree_grid_search.best_estimator_

predicted_train_tree = best_tree.predict(X_train)
predicted_test_tree = best_tree.predict(X_test)

In [113]:
tree_grid_search.best_params_

{'max_depth': 7, 'min_impurity_decrease': 5e-05, 'min_samples_split': 10}

In [114]:
model_evaluation(y_train, predicted_train_tree, 'Decision Tree Train')

Precision for Decision Tree Train = 0.8094059405940595
Recall for Decision Tree Train = 0.7062634989200864
F1 Score for Decision Tree Train = 0.754325259515571
Area under the curve for Decision Tree Train = 0.8415406008825932


  plt.show()


In [115]:
model_evaluation(y_test, predicted_test_tree, 'Decision Tree Test')

Precision for Decision Tree Test = 0.7254901960784313
Recall for Decision Tree Test = 0.6332518337408313
F1 Score for Decision Tree Test = 0.6762402088772846
Area under the curve for Decision Tree Test = 0.7993480607350136


  plt.show()


In [116]:
classificationSummary(y_train, predicted_train_tree)

Confusion Matrix (Accuracy 0.9437)

       Prediction
Actual    0    1
     0 6489  154
     1  272  654


In [117]:
classificationSummary(y_test, predicted_test_tree)

Confusion Matrix (Accuracy 0.9236)

       Prediction
Actual    0    1
     0 2738   98
     1  150  259


In [118]:
importances = best_tree.feature_importances_
feature_names = X_train.columns

# Sort the feature importances in descending order
indices = np.argsort(importances)[::-1]
sorted_feature_names = [feature_names[i] for i in indices]
sorted_importances = importances[indices]

# Plot the feature importances
plt.figure(figsize=(15, 7))
plt.title("Feature Importances")
plt.bar(range(len(importances)), sorted_importances, align="center")
plt.xticks(range(len(importances)), sorted_feature_names, rotation=90)
plt.xlabel("Features")
plt.ylabel("Importance")
plt.tight_layout()
plt.show()

  plt.show()


### Binary Classifier 2 - KNN Classifier

The top features as seen in the feature importance chart above, are chosen as inputs to the KNN Classifier. Since, KNN is a clustering algorithm and works based off of calculating the distance between subsequent data points, it is a very computationally expensive algorithm, and requires a small feature set to complete the training step in a short time frame.

In [119]:
predictors = ['Transportation_Spend', 'Hotel', 'Private_vehicle', 'Bus_Train', 'Total_Spend', 
              'Outdoor_Activities', 'Trip_Reason_Study', 'Money_Spent_in_Canada', 'Number_of_nights_ON', 
              'Mode_of_Entry_Land', 'Flights', 'Spend_on_Canadian_Carriers', 'Indoor_Activities', 
              'Time_Spent_ON_Nights', 'Mode_of_Entry_Bus_Train', 'Economy_Class', 'Male_Travellers_55_to_64',
             'Camp_Trailer_Cottage']

The numerical and one-hot encoded columns are separated. A prerequisite to modelling using a clustering algorithm, such as, KNN is to scale and normalize the data to a range between -1 and +1, so that all columns / features are equally weighted by the KNN algorithm when the distances are being calculated for the classification to take place.

In [120]:
numerical_columns = ['Transportation_Spend', 'Total_Spend', 
                     'Outdoor_Activities', 'Money_Spent_in_Canada',  
                    'Number_of_nights_ON', 'Spend_on_Canadian_Carriers', 'Indoor_Activities',
                     'Time_Spent_ON_Nights', 'Male_Travellers_55_to_64'
                    ]

one_hot_encoded_columns = ['Hotel', 'Private_vehicle', 'Bus_Train', 
                           'Trip_Reason_Study', 'Mode_of_Entry_Land',
                          'Flights', 'Mode_of_Entry_Bus_Train', 'Economy_Class', 'Camp_Trailer_Cottage'
                          ]

In [121]:
sc = StandardScaler()

X_train_scaled = pd.DataFrame(sc.fit_transform(X_train[numerical_columns]), columns = numerical_columns)
X_test_scaled = pd.DataFrame(sc.fit_transform(X_test[numerical_columns]), columns = numerical_columns)

X_train_one_hot_encoded = X_train[one_hot_encoded_columns].reset_index().drop(columns = 'VPUMFID')
X_test_one_hot_encoded = X_test[one_hot_encoded_columns].reset_index().drop(columns = 'VPUMFID')

# The one hot encoded columns are merged back with the scaled and normalized columns.

X_train_scaled = X_train_scaled.merge(X_train_one_hot_encoded, how = 'left', left_index = True, right_index = True)
X_test_scaled = X_test_scaled.merge(X_test_one_hot_encoded, how = 'left', left_index = True, right_index = True)

In [122]:
X_train_scaled.head()

Unnamed: 0,Transportation_Spend,Total_Spend,Outdoor_Activities,Money_Spent_in_Canada,Number_of_nights_ON,Spend_on_Canadian_Carriers,Indoor_Activities,Time_Spent_ON_Nights,Male_Travellers_55_to_64,Hotel,Private_vehicle,Bus_Train,Trip_Reason_Study,Mode_of_Entry_Land,Flights,Mode_of_Entry_Bus_Train,Economy_Class,Camp_Trailer_Cottage
0,-0.468059,-0.044649,2.10916,-0.748362,-0.338739,1.808713,0.477296,-0.202557,5.373413,0,1,0,0,1,0,0,0,1
1,-0.468059,-0.30207,-0.430489,0.136509,-0.065786,-0.552742,-0.024531,-0.145003,-0.374783,0,1,0,0,0,0,0,1,0
2,-0.467876,2.048258,2.10916,1.56189,-0.236382,-0.552286,0.477296,-0.145003,-0.374783,1,0,1,0,0,0,0,0,0
3,-0.468028,0.364144,-0.430489,-0.226288,-0.168144,-0.553026,-0.526358,-0.231334,-0.374783,1,0,0,0,0,0,0,1,0
4,-0.468059,-0.460692,-0.430489,0.431466,0.241286,-0.553026,0.979123,0.11399,-0.374783,0,1,0,0,0,0,0,1,0


In [123]:
param_grid = {'n_neighbors': np.arange(5, 100, 5)}

knn = KNeighborsClassifier()

knn_gscv = GridSearchCV(knn, param_grid, cv = 5, scoring = 'recall')
knn_gscv.fit(X_train_scaled, y_train)

best_knn = knn_gscv.best_estimator_

predicted_train_knn = best_knn.predict(X_train_scaled)
predicted_test_knn = best_knn.predict(X_test_scaled)

In [124]:
knn_gscv.best_params_

{'n_neighbors': 5}

In [125]:
model_evaluation(y_train, predicted_train_knn, 'KNN - Train')

  plt.show()


Precision for KNN - Train = 0.768377253814147
Recall for KNN - Train = 0.5982721382289417
F1 Score for KNN - Train = 0.6727383120825743
Area under the curve for KNN - Train = 0.7865664469558077


In [126]:
model_evaluation(y_test, predicted_test_knn, 'KNN - Test')

  plt.show()


Precision for KNN - Test = 0.6397306397306397
Recall for KNN - Test = 0.46454767726161367
F1 Score for KNN - Test = 0.5382436260623229
Area under the curve for KNN - Test = 0.713409240605419


In [127]:
classificationSummary(y_train, predicted_train_knn)

Confusion Matrix (Accuracy 0.9288)

       Prediction
Actual    0    1
     0 6476  167
     1  372  554


In [128]:
classificationSummary(y_test, predicted_test_knn)

Confusion Matrix (Accuracy 0.8995)

       Prediction
Actual    0    1
     0 2729  107
     1  219  190


### Binary Classifier 3 - Logistic Regression

In [129]:
model = LogisticRegression(penalty = 'l2', C=1e42, solver='liblinear')

model.fit(X_train, y_train)

predicted_train_logistic = model.predict(X_train)
predicted_test_logistic = model.predict(X_test)

#### Threshold Tuning - To account for class imbalance. The dataset has much more people who won't rent than people who will.

In [130]:
predicted_proba_train_logistic = model.predict_proba(X_train)
predicted_proba_test_logistic = model.predict_proba(X_test)

recall_score_results = []

for threshold in np.arange(0.1, 1, 0.1):
    predicted_train_logistic_tuned = [1 if x >= threshold else 0 for x in predicted_proba_train_logistic[:,1:]]
    recall_score_results.append([threshold, recall_score(y_train, predicted_train_logistic_tuned)])

recall_score_results = pd.DataFrame(recall_score_results).rename(columns = {0:'Threshold', 1:'Recall'})

In [131]:
recall_score_results

Unnamed: 0,Threshold,Recall
0,0.1,0.859611
1,0.2,0.666307
2,0.3,0.49568
3,0.4,0.207343
4,0.5,0.060475
5,0.6,0.0
6,0.7,0.0
7,0.8,0.0
8,0.9,0.0


In [132]:
tuned_predicted_results_train = [1 if x >= 0.1 else 0 for x in predicted_proba_train_logistic[:,1:]]
tuned_predicted_results_test = [1 if x >= 0.1 else 0 for x in predicted_proba_test_logistic[:,1:]]

In [133]:
model_evaluation(y_train, tuned_predicted_results_train, 'Logistic Regression - Train')

  plt.show()


Precision for Logistic Regression - Train = 0.1419147798181494
Recall for Logistic Regression - Train = 0.8596112311015118
F1 Score for Logistic Regression - Train = 0.24361132364192806
Area under the curve for Logistic Regression - Train = 0.5675445889061677


In [134]:
model_evaluation(y_test, tuned_predicted_results_test, 'Logistic Regression - Test')

  plt.show()


Precision for Logistic Regression - Test = 0.14452027298273787
Recall for Logistic Regression - Test = 0.8801955990220048
F1 Score for Logistic Regression - Test = 0.2482758620689655
Area under the curve for Logistic Regression - Test = 0.5643925808932309


In [135]:
classificationSummary(y_train, tuned_predicted_results_train)

Confusion Matrix (Accuracy 0.3469)

       Prediction
Actual    0    1
     0 1830 4813
     1  130  796


In [136]:
classificationSummary(y_test, tuned_predicted_results_test)

Confusion Matrix (Accuracy 0.3282)

       Prediction
Actual    0    1
     0  705 2131
     1   49  360


### Evaluation of Model Performance

In [137]:
model_evaluation = pd.DataFrame({
    
    'Binary Classifier (Model)'    : ['Decision Tree', 'KNN', 'Logistic Regression'],
    
    'Precision (Train)' : [precision_score(y_train, predicted_train_tree), 
                           precision_score(y_train, predicted_train_knn),
                           precision_score(y_train, tuned_predicted_results_train)],
    
    'Precision (Test)' : [precision_score(y_test, predicted_test_tree), 
                           precision_score(y_test, predicted_test_knn),
                           precision_score(y_test, tuned_predicted_results_test)],
    
    'Recall (Train)' : [recall_score(y_train, predicted_train_tree), 
                           recall_score(y_train, predicted_train_knn),
                           recall_score(y_train, tuned_predicted_results_train)],
    
    'Recall (Test)' : [recall_score(y_test, predicted_test_tree), 
                           recall_score(y_test, predicted_test_knn),
                           recall_score(y_test, tuned_predicted_results_test)],

    'F1 Score (Train)' : [f1_score(y_train, predicted_train_tree), 
                           f1_score(y_train, predicted_train_knn),
                           f1_score(y_train, tuned_predicted_results_train)],
    
    'F1 Score (Test)' : [f1_score(y_test, predicted_test_tree), 
                           f1_score(y_test, predicted_test_knn),
                           f1_score(y_test, tuned_predicted_results_test)],
    
    'Accuracy (Train)' : [accuracy_score(y_train, predicted_train_tree), 
                           accuracy_score(y_train, predicted_train_knn),
                           accuracy_score(y_train, tuned_predicted_results_train)],
    
    'Accuracy (Test)' : [accuracy_score(y_test, predicted_test_tree), 
                           accuracy_score(y_test, predicted_test_knn),
                           accuracy_score(y_test, tuned_predicted_results_test)]
    
})

In [138]:
model_evaluation

Unnamed: 0,Binary Classifier (Model),Precision (Train),Precision (Test),Recall (Train),Recall (Test),F1 Score (Train),F1 Score (Test),Accuracy (Train),Accuracy (Test)
0,Decision Tree,0.809406,0.72549,0.706263,0.633252,0.754325,0.67624,0.943718,0.923575
1,KNN,0.768377,0.639731,0.598272,0.464548,0.672738,0.538244,0.928788,0.899538
2,Logistic Regression,0.141915,0.14452,0.859611,0.880196,0.243611,0.248276,0.346941,0.328197


The evaluation metrics for the 3 binary classifiers constructed are shown in the table above. 

- <b> Precision </b> indicates how correctly the people who are looking to rent are captured by the model. A high precision indicates a low False Positive rate.


- <b> Recall </b> indicates what proportion of the people who are actually going to rent are predicted by the model. In the context of this model, recall would quantify the reach of the our client through their marketing campaigns, and is a measure of how wide of a net are they casting, in terms of who sees the campaign. A high recall indicates a low False Negative rate.


- <b> F-1 score and Accuracy score </b> are both measures of how well the model predicts both positive and negative classes (people who will rent an automobile and not rent respectively) and quantifies how well the model fits the underlying data.

The most important KPI for evaluating model performance in this case would be recall. Since our client's intent is to use the results of the model in a marketing campaign, it is ok to have False Positives in the final outcome. However, False Negatives would be more damaging to our client in this case, as False Negatives represent people who would have rented, but the model erroneously predicted they will not. False Negatives, in the context of this problem, represent opportunity cost, and should be minimized, in whatever model is picked. Hence, the best model for this use case would be model with the highest recall, or the lowest false negative rate. This would ensure a wide net is being cast and the maximum number of people are seeing the campaign and missed opportunities are minimized.

<b> The decision tree classifier </b> performs similarly in the train and test sets across all 4 KPI's, indicating there is no overfitting in the model. The test KPI's are lower than the train KPI's, which is expected. The model does quite well in capturing the maximum number of people who will rent and does so correctly (shown by the high precision and recall across both train & test sets).

<b> The KNN classifier </b> performs similarly across all 4 evaluation KPI's. However, the KPI's in the test set are much lower than their equivalent KPI's in the training set, indicating there might be some overfitting in the model. 

<b> The logistic regression classifier </b> performs really well in capturing most of the people who will rent, indicated by the high recall, across both train and test sets, but has a really high False Positive rate (indicated by the very low precision). This means, while the model does capture almost all customers who will actually rent, but also erroneously includes a lot of customers who do not express an intent to rent a vehicle from our client. From a marketing perspective, this might make the target audience for the campaign super diluted and reduce the Return on Ad Spend (ROAS) from the campaign. The model also poorly fits the underlying data.

### Recommendations to Client : Decision Tree Classifier

In [139]:
importances = best_tree.feature_importances_
feature_names = X.columns

# Sort the feature importances in descending order
indices = np.argsort(importances)[::-1]
sorted_feature_names = [feature_names[i] for i in indices]
sorted_importances = importances[indices]

# Plot the feature importances
plt.figure(figsize=(15, 7))
plt.title("15 Most Important Features Impacting Decision to Rent a Car")
plt.bar(range(len(importances[:15])), sorted_importances[:15], align="center")
plt.xticks(range(len(importances[:15])), sorted_feature_names[:15], rotation=90)
plt.xlabel("Features")
plt.ylabel("Importance")
plt.tight_layout()
plt.show()

  plt.show()


The decision tree classifier should be presented to our client. 

<b> Model Performance </b> : From a model performance standpoint, the decision tree classifier fits the underlying data well, as it has the highest F-1 score and accuracy score out of all 3 binary classifiers. 

The decision tree classifier has a high recall (may be not the highest), but does a good job of capturing most customers who express an intent to rent from our client, without including too many False Positives in the final outcome, which might dilute our client's target audience and reduce the Return on Ad Spend (ROAS). The decision tree classifier has the best balance of precision and recall out of all 3 classifiers constructed.

<b> Explainability & Interpretability </b> : The decision tree model also allows us to build a feature importance chart, which essentially bubbles up the most important features impacting a customer's decision to rent or not to rent. This makes it much easier to explain the inner workings of the model to our client, making the model all the more interpretable and explainable, and seem less "black box". The tree can also be plotted, if the client is interested in understanding the logic behind the classifier. 

The KNN and logistic regression approaches do not offer the same level of interpretability as the decision tree and does not have the same level of model performance either. The decision tree provides a very fine balance of an explainable, interpretable model that captures most people who will rent from our client.

As we can see from the feature importance chart above, the most important features all make intuitive sense. If customers have a large budget for transportation, they are more likely to rent a car. If customers do not have their own vehicle in Ontario, they are more likely to rent. Customers staying in hotels are also more likely to rent - if they are staying with friends/family, they will most likely use their car or friends and family will show them around and do not need to rent a car. Customers who come to Ontario to study are also more likely to rent as students do not have their own vehicle in a new location / country. Customers who come to Ontario for outdoor activities are also likely to rent as beaches, natural parks, hiking trails, etc. tend to be far away and not very accessible with public transit.

<b> Callouts </b> : 
- There is a lot of instances of customers skipping questions while answering the survey, leading to a lot of missing data. A customer might have actually done something but it was not captured in the data because the question was skipped, which introduces noise to the data.
- The data for the most important features shown above, might not always be available, at the time of making the prediction. Hence, the model might need to be re-trained based on the data at hand, which might reduce accuracy.