Create a new notebook. Use the Airbnb data with a target of price_gte_150 to fit a decision tree model using the random search/grid search approach demonstrated in the tutorial. Use precision as the scoring measure to optimize.

Create a discussion section at the end of your notebook. In this section, present and discuss your findings.

* Importing the modules

In [12]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier 
from matplotlib import pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

np.random.seed(86089106)

* Load the data

In [13]:
X_train = pd.read_csv('airbnb_train_X_price_gte_150.csv') 
y_train = pd.read_csv('airbnb_train_y_price_gte_150.csv') 
X_test = pd.read_csv('airbnb_test_X_price_gte_150.csv') 
y_test = pd.read_csv('airbnb_test_y_price_gte_150.csv') 

* Conducting an initial random search across a wide range of possible parameters.

In [14]:
score_measure = "precision"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(4,200),  
    'min_samples_leaf': np.arange(2,200),
    'min_impurity_decrease': np.arange(0.0001, 0.001, 0.00005),
    'max_leaf_nodes': np.arange(10, 200), 
    'max_depth': np.arange(3,50), 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=1000,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestPrecisionTree = rand_search.best_estimator_  # the search optimizes precision, not recall

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
The best precision score is 0.8511730370184265
... with parameters: {'min_samples_split': 127, 'min_samples_leaf': 16, 'min_impurity_decrease': 0.0008000000000000001, 'max_leaf_nodes': 166, 'max_depth': 27, 'criterion': 'entropy'}


In [15]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")

Accuracy=0.8378632 Precision=0.8185053 Recall=0.8662900 F1=0.8417200


### Discussion

* In the above cells, I trained the model using 5-fold cross validation with precision as the measure to optimize. This means that the training data was divided into 5 equal folds and each candidate model was trained and evaluated 5 times, each time using a different fold as the validation set and the remaining folds as the training set. This process was repeated for a total of 1000 different candidate models, resulting in 5000 fits.

* The best precision score achieved by one of the candidate models is 0.8511730370184265. This is the mean precision across the 5 validation folds. Here, precision is the proportion of positive predictions that are correct, TP/(TP+FP). A higher precision score means that a smaller share of the model's positive predictions are false positives.

* Overall, among the 1000 candidate models, the one with the parameters indicated in the output achieved the best precision score.

* From the confusion matrix of the test data, we can analyse the performance of the model. Here, the model predicts whether the price of a room on Airbnb is at least 150 (the price_gte_150 target), not the price itself.

    * The model achieved an accuracy of 0.8378632, which is the proportion of all test predictions that are correct.
    * Precision is the proportion of positive predictions that are correct, so it shows how well the model avoids false positives. The precision score of 0.8185053 means that about 82% of the rooms predicted to be at or above 150 actually are. This is lower than the cross-validated precision of 0.8511730 obtained on the training data.
    * Recall indicates the model's ability to find all the actual positive instances. The recall score of 0.8662900 means that the model identified about 87% of the rooms that are actually at or above 150.
    * F1 score is the harmonic mean of precision and recall, combining both into a single metric where a higher value indicates a better balance between the two. The F1 score here is 0.8417200.
    * From the results, I can say the model achieved reasonably good results on the test data for predicting whether the price of a room on Airbnb is at least 150.
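
The four metrics discussed above all follow from the confusion matrix counts. As a minimal sketch, the labels below are made-up illustrative values (not the Airbnb data, whose confusion matrix was not printed), and the hand-written formulas from the notebook are checked against scikit-learn's own metric functions:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels, for illustration only (not the Airbnb data).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Same formulas as in the notebook cell above.
manual = {
    "accuracy": (tp + tn) / (tp + tn + fp + fn),
    "precision": tp / (tp + fp),
    "recall": tp / (tp + fn),
    "f1": 2 * tp / (2 * tp + fp + fn),
}

# The equivalent scikit-learn functions.
library = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

for name in manual:
    print(f"{name}: manual={manual[name]:.4f} sklearn={library[name]:.4f}")
```

Using the library functions avoids indexing mistakes such as swapping FP and FN when reading the matrix by hand.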
    
    

### Further Exploration

*  We can also conduct a further exhaustive search across a smaller range of parameters around the parameters found in the initial random search for further optimization.
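
To get a feel for how large such a neighborhood search becomes, the sketch below counts the candidates with scikit-learn's `ParameterGrid`. The centre values are the ones reported by the random search above. As an assumption for illustration, `np.linspace` is swapped in for the float parameter so that the number of steps does not depend on floating-point rounding.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Centre values taken from the random-search output reported earlier.
best = {"min_samples_split": 127, "min_samples_leaf": 16,
        "max_leaf_nodes": 166, "max_depth": 27}

# np.arange excludes its stop value, so (c - 2, c + 2) yields c-2 ... c+1,
# i.e. four values per integer parameter.
grid = {name: np.arange(c - 2, c + 2) for name, c in best.items()}

# Assumption: np.linspace instead of np.arange for the float parameter,
# giving exactly five evenly spaced values including both end points.
grid["min_impurity_decrease"] = np.linspace(0.0007, 0.0009, 5)
grid["criterion"] = ["entropy"]

n_candidates = len(ParameterGrid(grid))
print(grid["min_samples_split"])
print(f"{n_candidates} candidates, {n_candidates * 5} fits with 5 folds")
```

Because the candidate count is the product of the sizes of all ranges, widening each range by even one value increases the number of fits quickly.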

In [16]:
score_measure = "precision"

kfolds = 5
min_samples_split = rand_search.best_params_['min_samples_split']
min_samples_leaf = rand_search.best_params_['min_samples_leaf']
min_impurity_decrease = rand_search.best_params_['min_impurity_decrease']
max_leaf_nodes = rand_search.best_params_['max_leaf_nodes']
max_depth = rand_search.best_params_['max_depth']
criterion = rand_search.best_params_['criterion']

# np.arange excludes its stop value, so each integer range below covers best-2 .. best+1
param_grid = {
    'min_samples_split': np.arange(min_samples_split-2,min_samples_split+2),  
    'min_samples_leaf': np.arange(min_samples_leaf-2,min_samples_leaf+2),
    'min_impurity_decrease': np.arange(min_impurity_decrease-0.0001, min_impurity_decrease+0.0001, 0.00005),
    'max_leaf_nodes': np.arange(max_leaf_nodes-2,max_leaf_nodes+2), 
    'max_depth': np.arange(max_depth-2,max_depth+2), 
    'criterion': [criterion]
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestPrecisionTree = grid_search.best_estimator_  # the search optimizes precision, not recall

Fitting 5 folds for each of 1280 candidates, totalling 6400 fits
The best precision score is 0.8511730370184265
... with parameters: {'criterion': 'entropy', 'max_depth': 25, 'max_leaf_nodes': 164, 'min_impurity_decrease': 0.0007000000000000001, 'min_samples_leaf': 15, 'min_samples_split': 125}


In [17]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.4f} Precision={TP/(TP+FP):.4f} Recall={TP/(TP+FN):.4f} F1={2*TP/(2*TP+FP+FN):.4f}")

Accuracy=0.8379 Precision=0.8185 Recall=0.8663 F1=0.8417


* Importance of features

In [18]:
np.round(grid_search.best_estimator_.feature_importances_,2)

array([0.  , 0.  , 0.06, 0.1 , 0.64, 0.01, 0.01, 0.06, 0.  , 0.  , 0.01,
       0.  , 0.  , 0.02, 0.01, 0.  , 0.01, 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.01, 0.03, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ])
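
The raw array above is hard to read without the column names. A minimal sketch of pairing importances with feature names follows. It uses synthetic stand-in data and hypothetical column names, since the Airbnb column names are not shown in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the column names are hypothetical.
X_arr, y = make_classification(n_samples=300, n_features=6,
                               n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Label each importance with its column name and sort, largest first.
importances = (pd.Series(tree.feature_importances_, index=X.columns)
               .sort_values(ascending=False))
print(importances.round(2))
```

In this notebook the same idea would presumably be `pd.Series(grid_search.best_estimator_.feature_importances_, index=X_train.columns)`, which would show which column carries the 0.64 importance.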

### Summary of Further Exploration

* Above, we have done grid search using cross validation to further optimize the decision tree model again with precision as scoring measure.

* Here we have used grid search with 5-fold cross validation. The grid search evaluates every combination of hyperparameters within the specified ranges (1280 candidates, 6400 fits) using precision as the scoring metric. The best precision score and the corresponding best parameters are printed in the output.

* Comparison of performance with respect to the confusion matrix results -

    * The grid search produced neither an improvement nor a deterioration compared to the random search. The best cross-validated precision is the same (0.8511730370184265), and the accuracy, precision, recall, and F1 score on the test data are identical to four decimal places. This suggests that the small hyperparameter changes explored by the grid did not change the predictions of the fitted tree, so fine-tuning in this neighborhood brought no gain.
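
An unchanged best score after a neighborhood grid search often means that many candidates tie. The sketch below shows one way to check this through `cv_results_`. It uses synthetic data and a small hypothetical grid, chosen so that the depth and leaf-count limits are too loose to affect the tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Airbnb data).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hypothetical grid: with min_samples_leaf=15 and 400 rows, the tree can
# never reach a depth of 30 or 160 leaves, so these limits are not binding.
param_grid = {
    "max_depth": np.arange(30, 34),
    "max_leaf_nodes": np.arange(160, 164),
    "min_samples_leaf": [15],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="precision")
search.fit(X, y)

# Candidates with rank 1 all share the best mean cross-validated score.
ranks = search.cv_results_["rank_test_score"]
n_tied = int(np.sum(ranks == 1))
print(f"{n_tied} of {len(ranks)} candidates share the best mean precision")
```

Running the same check on `grid_search.cv_results_` in this notebook would show whether the 1280 candidates are effectively the same tree.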
