# WE03b-Ensembles - Modelling

In this notebook, I will be implementing the following ensemble models to find the best fit for predicting the Car Acceptability.

* Decision Tree with RandomSearchCV
* Decision Tree with GridSearchCV
* Random Forest (Default)
* Random Forest with RandomSearch
* Random Forest with GridSearch
* AdaBoost (Default)
* Gradiant Boosting (Default)


## Install and import necessary packages

In [1]:
# You may need to install xgboost (it's not part of the sklearn package)
# !conda install xgboost 

In [2]:
# import packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier

np.random.seed(86089106)

## Load data 

In [3]:
X_train = pd.read_csv('car-X-train_data.csv') 
y_train = pd.read_csv('car-y-train_data.csv') 
X_test = pd.read_csv('car-X-test_data.csv') 
y_test = pd.read_csv('car-y-test_data.csv')

### Creating a dataframe to store the results of the models

In [4]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

## Prediction with Decision Tree (using default parameters)



In [5]:
dtree=DecisionTreeClassifier()

Fit the model to the training data

In [6]:
_ = dtree.fit(X_train, y_train)

Review of the performance of the model on the validation/test data

In [7]:
y_pred = dtree.predict(X_test)

In [8]:
performance = pd.concat([performance, pd.DataFrame({'model':"Dtree Default",       
                                                    'Accuracy': accuracy_score(y_test, y_pred), 
                                                    'Precision': precision_score(y_test, y_pred, average = 'macro'), 
                                                    'Recall': recall_score(y_test, y_pred, average = 'macro'), 
                                                    'F1': f1_score(y_test, y_pred, average = 'macro')
                                                     }, index=[0])])

performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Dtree Default,0.971098,0.932645,0.966359,0.948756


## USING DECISION TREE WITH RANDOM SEARCH CV TO TRAIN AND TEST THE MODEL

In [9]:
score_measure = "accuracy"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(2,200),  
    'min_samples_leaf': np.arange(1,200),
    'min_impurity_decrease': np.arange(0.0001, 0.001, 0.00005),
    'max_leaf_nodes': np.arange(10, 200), 
    'max_depth': np.arange(3,50), 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=1000,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_



Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
The best accuracy score is 0.9536847158876581
... with parameters: {'min_samples_split': 7, 'min_samples_leaf': 3, 'min_impurity_decrease': 0.0004, 'max_leaf_nodes': 123, 'max_depth': 44, 'criterion': 'entropy'}


In [10]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

performance = pd.concat([performance, pd.DataFrame({'model':"Dtree with RandomSearch", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Dtree Default,0.971098,0.932645,0.966359,0.948756
0,Dtree with RandomSearch,0.944882,0.73913,0.944444,0.829268


## USING DECISION TREE WITH GRID SEARCH CV TO TRAIN AND TEST THE MODEL

In [11]:
score_measure = "accuracy"
kfolds = 5

param_grid = {
    'min_samples_split': [2,10,50,100,200],  
    'min_samples_leaf': [1,5,10,20,50],
    'min_impurity_decrease': [0.0001, 0.0005, 0.0010, 0.0020, 0.0050],
    'max_leaf_nodes': [10,25,50,100,200], 
    'max_depth': [5,10,20,30], 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_

Fitting 5 folds for each of 5000 candidates, totalling 25000 fits
The best accuracy score is 0.9669112856212063
... with parameters: {'criterion': 'entropy', 'max_depth': 20, 'max_leaf_nodes': 50, 'min_impurity_decrease': 0.0001, 'min_samples_leaf': 1, 'min_samples_split': 2}


In [12]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
#print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.4f} Precision={TP/(TP+FP):.4f} Recall={TP/(TP+FN):.4f} F1={2*TP/(2*TP+FP+FN):.4f}")

performance = pd.concat([performance, pd.DataFrame({'model':"Dtree with GridSearch", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Dtree Default,0.971098,0.932645,0.966359,0.948756
0,Dtree with RandomSearch,0.944882,0.73913,0.944444,0.829268
0,Dtree with GridSearch,0.945312,0.772727,0.894737,0.829268


## Prediction with RandomForest (using default parameters)

In [13]:
rforest = RandomForestClassifier()

In [14]:
#_ = rforest.fit(X_train, y_train)
_ = rforest.fit(X_train, y_train.values.ravel())

In [15]:
y_pred = rforest.predict(X_test)

In [16]:
performance = pd.concat([performance, pd.DataFrame({'model':"RandomForest Default",       
                                                    'Accuracy': accuracy_score(y_test, y_pred), 
                                                    'Precision': precision_score(y_test, y_pred, average = 'macro'), 
                                                    'Recall': recall_score(y_test, y_pred, average = 'macro'), 
                                                    'F1': f1_score(y_test, y_pred, average = 'macro')
                                                     }, index=[0])])

performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Dtree Default,0.971098,0.932645,0.966359,0.948756
0,Dtree with RandomSearch,0.944882,0.73913,0.944444,0.829268
0,Dtree with GridSearch,0.945312,0.772727,0.894737,0.829268
0,RandomForest Default,0.953757,0.869281,0.879193,0.874113


## USING RANDOM FOREST CLASSIFIER WITH RANDOM SEARCH CV TO TRAIN AND TEST THE MODEL 

In [17]:
score_measure = "accuracy"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(2,200),  
    'min_samples_leaf': np.arange(1,200),
    'min_impurity_decrease': np.arange(0.0001, 0.001, 0.00005),
    'max_leaf_nodes': np.arange(10, 200), 
    'max_depth': np.arange(3,50), 
    'criterion': ['entropy', 'gini'],
}

rforest = RandomForestClassifier()
rand_search = RandomizedSearchCV(estimator = rforest, param_distributions=param_grid, cv=kfolds, n_iter=1000,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train.values.ravel())

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
The best accuracy score is 0.9462295531703303
... with parameters: {'min_samples_split': 6, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.0009000000000000002, 'max_leaf_nodes': 79, 'max_depth': 44, 'criterion': 'entropy'}


In [18]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
#print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")

performance = pd.concat([performance, pd.DataFrame({'model':"Random Forest with Random search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

## USING RANDOM FOREST CLASSIFIER WITH GRID SEARCH CV TO TRAIN AND TEST THE MODEL 

In [19]:
score_measure = "accuracy"
kfolds = 5

param_grid = {
    'min_samples_split': [2,10,50,100,200],  
    'min_samples_leaf': [1,5,10,20,50],
    'min_impurity_decrease': [0.0001, 0.0005, 0.0010, 0.0020, 0.0050],
    'max_leaf_nodes': [10,25,50,100,200], 
    'max_depth': [5,10,20,30], 
    'criterion': ['entropy', 'gini'],
}

rforest = RandomForestClassifier()
grid_search = GridSearchCV(estimator = rforest, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train.values.ravel())

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_

Fitting 5 folds for each of 5000 candidates, totalling 25000 fits
The best accuracy score is 0.9578169472926168
... with parameters: {'criterion': 'entropy', 'max_depth': 30, 'max_leaf_nodes': 100, 'min_impurity_decrease': 0.0005, 'min_samples_leaf': 1, 'min_samples_split': 2}


In [20]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
#print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")

performance = pd.concat([performance, pd.DataFrame({'model':"Random Forest with Grid search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Dtree Default,0.971098,0.932645,0.966359,0.948756
0,Dtree with RandomSearch,0.944882,0.73913,0.944444,0.829268
0,Dtree with GridSearch,0.945312,0.772727,0.894737,0.829268
0,RandomForest Default,0.953757,0.869281,0.879193,0.874113
0,Random Forest with Random search,0.941176,0.785714,0.733333,0.758621
0,Random Forest with Grid search,0.941176,0.785714,0.733333,0.758621


## Prediction with ADABoost (using default parameters)

In [21]:
aboost = AdaBoostClassifier()

In [22]:
_ = aboost.fit(X_train, y_train.values.ravel())

In [23]:
y_pred = aboost.predict(X_test)

In [24]:

performance = pd.concat([performance, pd.DataFrame({'model':"AdaBoost Default",       
                                                    'Accuracy': accuracy_score(y_test, y_pred), 
                                                    'Precision': precision_score(y_test, y_pred, average = 'macro'), 
                                                    'Recall': recall_score(y_test, y_pred, average = 'macro'), 
                                                    'F1': f1_score(y_test, y_pred, average = 'macro')
                                                     }, index=[0])])

performance

  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Dtree Default,0.971098,0.932645,0.966359,0.948756
0,Dtree with RandomSearch,0.944882,0.73913,0.944444,0.829268
0,Dtree with GridSearch,0.945312,0.772727,0.894737,0.829268
0,RandomForest Default,0.953757,0.869281,0.879193,0.874113
0,Random Forest with Random search,0.941176,0.785714,0.733333,0.758621
0,Random Forest with Grid search,0.941176,0.785714,0.733333,0.758621
0,AdaBoost Default,0.809249,0.488789,0.604545,0.521558


## Prediction with GradientBoostingClassifier

In [25]:
gboost = GradientBoostingClassifier()

In [26]:
_ = gboost.fit(X_train, y_train.values.ravel())

In [27]:
y_pred = gboost.predict(X_test)

In [28]:

performance = pd.concat([performance, pd.DataFrame({'model':"GradientBoost Default",       
                                                    'Accuracy': accuracy_score(y_test, y_pred), 
                                                    'Precision': precision_score(y_test, y_pred, average = 'macro'), 
                                                    'Recall': recall_score(y_test, y_pred, average = 'macro'), 
                                                    'F1': f1_score(y_test, y_pred, average = 'macro')
                                                     }, index=[0])])

performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Dtree Default,0.971098,0.932645,0.966359,0.948756
0,Dtree with RandomSearch,0.944882,0.73913,0.944444,0.829268
0,Dtree with GridSearch,0.945312,0.772727,0.894737,0.829268
0,RandomForest Default,0.953757,0.869281,0.879193,0.874113
0,Random Forest with Random search,0.941176,0.785714,0.733333,0.758621
0,Random Forest with Grid search,0.941176,0.785714,0.733333,0.758621
0,AdaBoost Default,0.809249,0.488789,0.604545,0.521558
0,GradientBoost Default,0.965318,0.899981,0.958024,0.924825


## Results     


In [29]:
performance.sort_values(by=['Accuracy'])

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,AdaBoost Default,0.809249,0.488789,0.604545,0.521558
0,Random Forest with Random search,0.941176,0.785714,0.733333,0.758621
0,Random Forest with Grid search,0.941176,0.785714,0.733333,0.758621
0,Dtree with RandomSearch,0.944882,0.73913,0.944444,0.829268
0,Dtree with GridSearch,0.945312,0.772727,0.894737,0.829268
0,RandomForest Default,0.953757,0.869281,0.879193,0.874113
0,GradientBoost Default,0.965318,0.899981,0.958024,0.924825
0,Dtree Default,0.971098,0.932645,0.966359,0.948756


## Conclusion

Based on the above results of ensemble models for used car acceptability, we can analyse the results as follows:

1. AdaBoost Default:
   - Accuracy: 0.809249
   - Precision: 0.488789
   - Recall: 0.604545
   - F1 Score: 0.521558
   * From the scores, I can say, this model seems to have moderate accuracy and performance metrics compared to the other models.

2. Random Forest with Random Search:
   - Accuracy: 0.941176
   - Precision: 0.785714
   - Recall: 0.733333
   - F1 Score: 0.758621
   
   * Random Forest with Random Search shows improved performance across all metrics compared to the AdaBoost model.

3. Random Forest with Grid Search:
   - Accuracy: 0.941176
   - Precision: 0.785714
   - Recall: 0.733333
   - F1 Score: 0.758621
   
   * The Random Forest model with Grid Search achieves the same performance as the Random Forest with Random Search.

4. Decision Tree with Random Search:
   - Accuracy: 0.944882
   - Precision: 0.739130
   - Recall: 0.944444
   - F1 Score: 0.829268
   * The Decision Tree model with Random Search demonstrates higher recall but lower precision compared to the Random Forest models.

5. Decision Tree with Grid Search:
   - Accuracy: 0.945312
   - Precision: 0.772727
   - Recall: 0.894737
   - F1 Score: 0.829268
   
   * Similar to the Decision Tree with Random Search, this model has high recall but slightly improved precision.

6. RandomForest Default:
   - Accuracy: 0.953757
   - Precision: 0.869281
   - Recall: 0.879193
   - F1 Score: 0.874113
   
   * The default Random Forest model has high accuracy and performance across all metrics.

7. Gradient Boost Default:
   - Accuracy: 0.965318
   - Precision: 0.899981
   - Recall: 0.958024
   - F1 Score: 0.924825
   
   * The Gradient Boosting model achieves even higher accuracy and performance metrics, making it one of the top-performing models.

8. Decision Tree Default:
   - Accuracy: 0.971098
   - Precision: 0.932645
   - Recall: 0.966359
   - F1 Score: 0.948756
   
   * The default Decision Tree model exhibits the highest accuracy and performance metrics among all the models.

Finally, by observation, I can say that Decision Tree models, especially the default one, performed the best on the used car acceptability data. Among the ensemble models, the Random Forest models also showed strong performance, while AdaBoost falls behind in terms of accuracy and metrics. Gradient Boosting achieved the highest accuracy, precision, recall, and F1 score among all the models.