# Car Evaluations dataset

We now look at the car evaluations dataset. We download it, load it into a pandas dataframe, and then convert it into a numpy array, as we did with the wdbc dataset.

There are 6 attributes in the dataset:

1. buying       vhigh, high, med, low
2. maint        vhigh, high, med, low
3. doors        2, 3, 4, 5more
4. persons      2, 4, more
5. lug_boot     small, med, big
6. safety       low, med, high

## Importing Libraries

We start by importing all necessary libraries for our analysis.

In [1]:
# importing all necessary libraries
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, tree, linear_model, naive_bayes
from sklearn.svm import SVC
from sklearn import datasets, preprocessing, metrics
from sklearn.metrics import f1_score, make_scorer, accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
import scikitplot

import warnings
warnings.filterwarnings('ignore')

## Loading Data

In [2]:
# load entire data in one numpy array
car_raw = pd.read_csv('car.data', delimiter=",", names=[
        'buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'])
car_raw = car_raw.to_numpy()

## Training Testing Split

We split this data into training and testing sets with a 70:30 ratio (`test_size=0.3`), keeping the test set aside for our final model evaluation. We do our model selection and hyperparameter search on the training part.

In [3]:
car_x = car_raw[:, :-1].copy()
car_y = car_raw[:, -1].copy()

car_x_train, car_x_test, car_y_train, car_y_test = train_test_split(
                                          car_x, car_y, test_size=0.3, random_state=5)

car_y_train = car_y_train.ravel()
car_y_test = car_y_test.ravel()

#see sample division of training and testing
print("Sample count of training: ", car_y_train.shape[0])
print("Sample count of testing: ", car_y_test.shape[0])

Sample count of training:  1209
Sample count of testing:  519


## Benchmark

We first compute the performance of a trivial model that predicts the majority class for every sample, and use it as a benchmark against which to compare our final model.

In [4]:
cat, cat_count = np.unique(car_y_train, return_counts=True)
max_cat = cat[cat_count == cat_count.max()][0]

print("The category occurring max number of times in our target variable: ", max_cat)

pred = np.full(car_y_test.shape, max_cat)

print("Accuracy for this model: ", metrics.accuracy_score(car_y_test, pred))
print("Confusion Matrix for this model: \n", metrics.confusion_matrix(car_y_test, pred))
print("F-scores for each class for this model: ", metrics.f1_score(car_y_test, pred, average=None))
print("Kappa coefficient for this model is: ", metrics.cohen_kappa_score(car_y_test, pred))
print("MCC for this model is: ", metrics.matthews_corrcoef(car_y_test, pred))
print("Precision for each class for this model: ", metrics.precision_score(car_y_test, pred, average=None))
print("Recall for each class for this model: ", metrics.recall_score(car_y_test, pred, average=None))

The category occurring max number of times in our target variable:  unacc
Accuracy for this model:  0.7090558766859345
Confusion Matrix for this model: 
 [[  0   0 119   0]
 [  0   0  14   0]
 [  0   0 368   0]
 [  0   0  18   0]]
F-scores for each class for this model:  [0.         0.         0.82976325 0.        ]
Kappa coefficient for this model is:  0.0
MCC for this model is:  0.0
Precision for each class for this model:  [0.         0.         0.70905588 0.        ]
Recall for each class for this model:  [0. 0. 1. 0.]
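
The same majority-class benchmark can also be expressed with scikit-learn's `DummyClassifier`. The sketch below is illustrative only: the label arrays are small made-up stand-ins for `car_y_train` and `car_y_test`, and the all-zero feature matrices are placeholders, since the `most_frequent` strategy ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Made-up labels standing in for car_y_train / car_y_test (illustrative only).
y_train = np.array(['unacc'] * 7 + ['acc'] * 2 + ['good'])
y_test = np.array(['unacc', 'unacc', 'acc', 'unacc', 'vgood'])

# The 'most_frequent' strategy ignores the features, so zeros suffice.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Fit the baseline and predict the majority class for every test sample.
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, baseline_pred)
baseline_mcc = matthews_corrcoef(y_test, baseline_pred)
print("Accuracy:", baseline_accuracy)
print("MCC:", baseline_mcc)
```

As with the manual benchmark, a constant prediction yields an MCC of 0.0 regardless of how high the accuracy is.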


## Preprocessing

Here, we create a copy of our training and testing data with one-hot encoding, intended for algorithms that rely on distance calculations, e.g., KNN.

We also create a second copy in which the categorical data is converted into numerical data, taking into account the natural ordering of the categories of each attribute.

In [5]:
enc = preprocessing.OneHotEncoder(handle_unknown='ignore')

#create copies of dataset
car_x_train_encoded = car_x_train.copy()
car_x_test_encoded = car_x_test.copy()

#fit encoder to training data
enc = enc.fit(car_x_train_encoded)

#transform datasets
car_x_train_encoded = enc.transform(car_x_train_encoded).toarray()
car_x_test_encoded = enc.transform(car_x_test_encoded).toarray()

#ordinal encoding for numerical processing of variables
ord_enc = preprocessing.OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan,
                                      categories=[['low', 'med', 'high', 'vhigh'],
                                                 ['low', 'med', 'high', 'vhigh'],
                                                 ['2', '3', '4', '5more'],
                                                 ['2', '4', 'more'],
                                                 ['small', 'med', 'big'],
                                                 ['low', 'med', 'high']])

#create copies of dataset
car_x_train_ordinal_encoded = car_x_train.copy()
car_x_test_ordinal_encoded = car_x_test.copy()

#fit encoder to training data
ord_enc = ord_enc.fit(car_x_train_ordinal_encoded)

#transform datasets
car_x_train_ordinal_encoded = ord_enc.transform(car_x_train_ordinal_encoded)
car_x_test_ordinal_encoded = ord_enc.transform(car_x_test_ordinal_encoded)
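
To make the difference between the two encodings concrete, here is a minimal sketch on three made-up rows with only two attributes. The array `rows` and the variable names are hypothetical and not part of the analysis above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Three made-up rows with two attributes: safety and lug_boot.
rows = np.array([['low', 'small'],
                 ['high', 'big'],
                 ['med', 'med']], dtype=object)

# One-hot: one binary column per (attribute, category) pair.
one_hot = OneHotEncoder(handle_unknown='ignore').fit(rows)
rows_one_hot = one_hot.transform(rows).toarray()

# Ordinal: one numeric column per attribute, following the given category order.
ordinal = OrdinalEncoder(categories=[['low', 'med', 'high'],
                                     ['small', 'med', 'big']]).fit(rows)
rows_ordinal = ordinal.transform(rows)

print(rows_one_hot.shape)  # 3 rows, 3 + 3 binary columns
print(rows_ordinal)
```

The one-hot copy treats every category as unrelated to the others, while the ordinal copy preserves the order low < med < high in a single column.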

We define a dictionary here which will record the best performance evaluation of all the models we will look at. We use MCC as our scoring method because the data is imbalanced: about 70% of the samples belong to the 'unacc' class.

The Matthews Correlation Coefficient (MCC), introduced by Matthews in 1975, takes all four cells of the binary confusion matrix into consideration in its formula.

Like a correlation coefficient, MCC ranges from -1 to +1. For binary classification it is equivalent to the Pearson Correlation Coefficient computed between two binary variables, the prediction and the label. That is to say, MCC is the discrete case of the Pearson Correlation Coefficient. scikit-learn's `matthews_corrcoef` also accepts multiclass targets such as ours, using a generalization of the same statistic.
ref: https://sarit-maitra.medium.com/mathews-correlation-coefficient-for-imbalanced-classes-705d93184aed
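
The link between MCC and the Pearson correlation can be checked numerically in the binary case. The sketch below uses made-up labels and predictions (not taken from the car dataset) and compares the confusion-matrix formula, scikit-learn's `matthews_corrcoef`, and `np.corrcoef`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Made-up binary labels and predictions (illustrative only).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])

# The four cells of the binary confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# MCC from its confusion-matrix formula.
mcc_formula = (tp * tn - fp * fn) / np.sqrt(
    float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))

# MCC from scikit-learn, and the Pearson correlation of the two 0/1 vectors.
mcc_sklearn = matthews_corrcoef(y_true, y_pred)
pearson = np.corrcoef(y_true, y_pred)[0, 1]

print(mcc_formula, mcc_sklearn, pearson)
```

All three values agree, which is why MCC can be read like a correlation: 0 means no better than chance, and +1 means perfect agreement.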

To compare the performance of all the strategies and models we try, we save the mean nested cross-validation MCC score of each in a dictionary, and check at the end which model performs best.

In [6]:
mcc_scores = {}

#adding benchmark
mcc_scores['benchmark'] = 0.0

The code below defines the custom scorer and the inner and outer cross validation folds, which are used in evaluating all models.

In [7]:
# defining a custom scorer
# we use mcc for all our model evaluations
custom_scorer = make_scorer(matthews_corrcoef, greater_is_better=True)

# code for accuracy scorer maker
# custom_scorer = make_scorer(accuracy_score, greater_is_better=True)

# create folds for cross validation
inner_cv = KFold(n_splits=5, shuffle=True, random_state=5)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=5)
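
To show how the inner and outer folds interact, here is a minimal sketch of nested cross validation written as an explicit loop. It runs on synthetic data from `make_classification` with a deliberately tiny KNN grid; the data, the grid, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the car dataset).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

scorer = make_scorer(matthews_corrcoef)
inner = KFold(n_splits=3, shuffle=True, random_state=5)
outer = KFold(n_splits=3, shuffle=True, random_state=5)

outer_scores = []
for train_idx, valid_idx in outer.split(X):
    # Inner loop: tune hyperparameters using only the outer-training part.
    search = GridSearchCV(KNeighborsClassifier(),
                          {'n_neighbors': [1, 5, 9]},
                          cv=inner, scoring=scorer)
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: score the tuned model on data the search never saw.
    fold_pred = search.predict(X[valid_idx])
    outer_scores.append(matthews_corrcoef(y[valid_idx], fold_pred))

outer_scores = np.array(outer_scores)
print(outer_scores, outer_scores.mean())
```

Passing a search object to `cross_val_score` with an outer splitter, as done in the model sections, follows the same pattern in a single call.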

## KNN

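We start by implementing and evaluating the performance of KNN. All the attributes in our dataset are categorical, so Hamming distance (the number of attributes on which two cars differ) is the natural distance between data points. The code below keeps the default metric of `KNeighborsClassifier` (Euclidean); on one-hot encoded data this ranks neighbours in the same order as Hamming distance, because the squared Euclidean distance equals twice the number of mismatched attributes. The rest of the process remains the same: we use the inner cross validation for hyperparameter tuning, and the outer cross validation for performance evaluation using our chosen scoring technique.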
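
On one-hot vectors, Euclidean and Hamming distance order neighbours identically, which a quick check on a few made-up rows illustrates. The array `rows` below is hypothetical and only serves the demonstration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Three made-up cars described by two categorical attributes.
rows = np.array([['low', 'small'],
                 ['low', 'big'],
                 ['high', 'big']])
encoded = OneHotEncoder().fit_transform(rows).toarray()

query = 0  # compare every row against the first one

# Hamming-style count: number of attributes that differ from the query row.
mismatches = (rows != rows[query]).sum(axis=1)

# Squared Euclidean distance between the one-hot vectors.
sq_euclid = ((encoded - encoded[query]) ** 2).sum(axis=1)

print(mismatches)
print(sq_euclid)
```

Each mismatched attribute flips exactly two one-hot entries, so the squared Euclidean distance is always twice the mismatch count.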

This code is for the one-hot encoded dataset.

In [8]:
# defining ranges for hyperparameters for KNN
p_grid_knn = {'n_neighbors' : range(1, 100), 'weights' : ['uniform', 'distance']}

# classifier
knn = neighbors.KNeighborsClassifier()

# inner CV
clf = GridSearchCV(estimator=knn, param_grid=p_grid_knn, 
                   cv=inner_cv, scoring=custom_scorer)

# outer CV
nested_score = cross_val_score(clf, X=car_x_train_encoded, y=car_y_train, 
                               cv=outer_cv, scoring=custom_scorer, error_score='raise')

mcc_scores['knn_one_hot'] = nested_score.mean()

The below code is for the ordinal encoded (numerical) attributes.

In [9]:
# defining ranges for hyperparameters for KNN
p_grid_knn = {'n_neighbors' : range(1, 100), 'weights' : ['uniform', 'distance']}

# classifier
knn = neighbors.KNeighborsClassifier()

# inner CV
clf = GridSearchCV(estimator=knn, param_grid=p_grid_knn, 
                   cv=inner_cv, scoring=custom_scorer)

# outer CV
nested_score = cross_val_score(clf, X=car_x_train_ordinal_encoded, y=car_y_train, 
                               cv=outer_cv, scoring=custom_scorer, error_score='raise')

mcc_scores['knn_numerical'] = nested_score.mean()

## Decision Tree

We now move on to the decision tree classifier. As with KNN, we evaluate it on both the one-hot encoded and the numerical dataset, using nested cross validation with hyperparameter tuning.

The below code is for one hot encoded dataset.

In [10]:
# defining search space for hyperparameters for Decision tree
p_grid_dtree = {'max_depth' : Integer(1, 15), 
          'min_samples_split': Integer(2,20), 
          'min_samples_leaf': Integer(4,10),
          'min_impurity_decrease': Real(0,2)
         }

# classifier
dtree = tree.DecisionTreeClassifier(criterion="gini")

# inner CV
clf = BayesSearchCV(estimator=dtree, search_spaces=p_grid_dtree, 
                   cv=inner_cv, scoring=custom_scorer)

# outer CV
nested_score = cross_val_score(clf, X=car_x_train_encoded, y=car_y_train, 
                               cv=outer_cv, scoring=custom_scorer, error_score='raise')

mcc_scores['dtree_one_hot'] = nested_score.mean()

The below code is for our numerical dataset.

In [11]:
# defining search space for hyperparameters for Decision tree
p_grid_dtree = {'max_depth' : Integer(1, 15), 
          'min_samples_split': Integer(2,20), 
          'min_samples_leaf': Integer(4,10),
          'min_impurity_decrease': Real(0,2)
         }

# classifier
dtree = tree.DecisionTreeClassifier(criterion="gini")

# inner CV
clf = BayesSearchCV(estimator=dtree, search_spaces=p_grid_dtree, 
                   cv=inner_cv, scoring=custom_scorer)

# outer CV
nested_score = cross_val_score(clf, X=car_x_train_ordinal_encoded, y=car_y_train, 
                               cv=outer_cv, scoring=custom_scorer, error_score='raise')

mcc_scores['dtree_numerical'] = nested_score.mean()

## Naive Bayes

We now evaluate the Naive Bayes classifier on our training dataset. The process is the same as for the other algorithms. The only difference is that we use only the ordinal encoded (numerical) dataset, since scikit-learn provides a dedicated categorical classifier, `CategoricalNB`, which takes each attribute as a single column of category codes.

In [12]:
# defining search space for hyperparameters for Naive Bayes
p_grid_nb = {'alpha' : Real(0,100), 
          'fit_prior': Categorical([True,False])
         }

# classifier
nb = naive_bayes.CategoricalNB()

# inner CV
clf = BayesSearchCV(estimator=nb, search_spaces=p_grid_nb, 
                   cv=inner_cv, scoring=custom_scorer)

# outer CV
nested_score = cross_val_score(clf, X=car_x_train_ordinal_encoded, y=car_y_train, 
                               cv=outer_cv, scoring=custom_scorer, error_score='raise')

mcc_scores['nb'] = nested_score.mean()
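
As a small illustration of why ordinal codes suit `CategoricalNB`, the sketch below fits the classifier on six made-up rows with two attributes. The rows, labels, and variable names are hypothetical; only the encoder and classifier classes come from the analysis above.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Six made-up cars: safety and lug_boot, with made-up labels.
rows = np.array([['low', 'small'], ['low', 'med'], ['high', 'big'],
                 ['high', 'med'], ['med', 'big'], ['med', 'small']])
labels = np.array(['unacc', 'unacc', 'acc', 'acc', 'acc', 'unacc'])

# Each attribute becomes one column of category codes 0, 1, 2.
encoder = OrdinalEncoder(categories=[['low', 'med', 'high'],
                                     ['small', 'med', 'big']])
codes = encoder.fit_transform(rows)

# CategoricalNB estimates one smoothed probability table per attribute and class.
model = CategoricalNB(alpha=1.0).fit(codes, labels)

new_car = encoder.transform([['high', 'big']])
new_pred = model.predict(new_car)
print(new_pred)
```

The classifier uses the codes only as category labels, so the ordering imposed by the encoder does not affect its predictions.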

## Logistic Regression

We are now evaluating logistic regression on our training dataset using the same process. We again return to our one hot encoded dataset for logistic regression.

The below evaluation is for one-hot encoded dataset.

In [13]:
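# defining ranges for hyperparameters for Logistic regression
# note: the string 'none' is accepted only by older scikit-learn releases
# (deprecated in 1.2 in favour of None)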
p_grid_lreg = {
    'penalty' : Categorical(['l1', 'l2', 'none', 'elasticnet']), 
    'tol': Real(0,1),
    'C': Real(1,100),
    'l1_ratio': Real(0,1)
}

#classifier
lreg = linear_model.LogisticRegression(solver='saga') 

# inner CV
clf = BayesSearchCV(estimator=lreg, search_spaces=p_grid_lreg, 
                   cv=inner_cv, scoring=custom_scorer)

# outer CV
nested_score = cross_val_score(clf, X=car_x_train_encoded, y=car_y_train, 
                               cv=outer_cv, scoring=custom_scorer, error_score='raise')

mcc_scores['lreg_one_hot'] = nested_score.mean()

The below code is for the ordinal encoded (numerical) attributes.

In [14]:
# defining ranges for hyperparameters for Logistic regression
p_grid_lreg = {
    'penalty' : Categorical(['l1', 'l2', 'none', 'elasticnet']), 
    'tol': Real(0,1),
    'C': Real(1,100),
    'l1_ratio': Real(0,1)
}

#classifier
lreg = linear_model.LogisticRegression(solver='saga') 

# inner CV
clf = BayesSearchCV(estimator=lreg, search_spaces=p_grid_lreg, 
                   cv=inner_cv, scoring=custom_scorer)

# outer CV
nested_score = cross_val_score(clf, X=car_x_train_ordinal_encoded, y=car_y_train, 
                               cv=outer_cv, scoring=custom_scorer, error_score='raise')

mcc_scores['lreg_numerical'] = nested_score.mean()

## Support Vector Classifier

We now evaluate the model performance for support vector classifier.

The below code evaluates the performance for the one-hot encoded dataset, i.e., categorical handling of the attributes.

In [15]:
# defining search space
p_grid_svc = {
    'C': Categorical([0.0000001, 0.00001, 0.001, 0.1, 1, 10, 100, 1000]),
    'kernel' : Categorical(['linear', 'poly', 'rbf', 'sigmoid']), 
    'gamma': Real(0.000000001,1),
    'tol': Real(0.000000001,1)
}

# classifier
svc = SVC() 

# inner CV
clf = BayesSearchCV(estimator=svc, search_spaces=p_grid_svc, 
                   cv=inner_cv, scoring=custom_scorer)

# outer CV
nested_score = cross_val_score(clf, X=car_x_train_encoded, y=car_y_train, 
                               cv=outer_cv, scoring=custom_scorer, error_score='raise')

mcc_scores['svc_one_hot'] = nested_score.mean()

We run the below code for numerical handling of our attributes.

In [16]:
# defining search space
# reducing the hyperparameter search space due to complexity constraints
p_grid_svc = {
    'C': Categorical([0.001, 0.1, 1, 10, 100]),
    'kernel' : Categorical(['linear', 'rbf']), 
    'gamma': Categorical(['scale', 'auto'])
}

# classifier
svc = SVC() 

# inner CV
clf = BayesSearchCV(estimator=svc, search_spaces=p_grid_svc, 
                   cv=inner_cv, scoring=custom_scorer)

# outer CV
nested_score = cross_val_score(clf, X=car_x_train_ordinal_encoded, y=car_y_train, 
                               cv=outer_cv, scoring=custom_scorer, error_score='raise')

mcc_scores['svc_numerical'] = nested_score.mean()

## Selecting final model

We now compare our MCC scores on the training datasets for all the models we have built. 

In [17]:
mcc_scores

{'benchmark': 0.0,
 'knn_one_hot': 0.7744087597260918,
 'knn_numerical': 0.8254232667751232,
 'dtree_one_hot': 0.8643358805588685,
 'dtree_numerical': 0.878822387236512,
 'nb': 0.6859574123987107,
 'lreg_one_hot': 0.8376353578093255,
 'lreg_numerical': 0.61253573919297,
 'svc_one_hot': 0.966959440927441,
 'svc_numerical': 0.9386309882815548}

In [18]:
best_mcc_score = max(mcc_scores, key=mcc_scores.get)
print(best_mcc_score)

svc_one_hot
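
Beyond picking the single best entry, it can help to view the full ranking. The sketch below sorts the scores reported above from best to worst; the dictionary is copied from the output of `mcc_scores`, and the variable `ranking` is introduced here for illustration.

```python
# Scores copied from the mcc_scores output above.
mcc_scores = {'benchmark': 0.0,
              'knn_one_hot': 0.7744087597260918,
              'knn_numerical': 0.8254232667751232,
              'dtree_one_hot': 0.8643358805588685,
              'dtree_numerical': 0.878822387236512,
              'nb': 0.6859574123987107,
              'lreg_one_hot': 0.8376353578093255,
              'lreg_numerical': 0.61253573919297,
              'svc_one_hot': 0.966959440927441,
              'svc_numerical': 0.9386309882815548}

# Sort (name, score) pairs by score, highest first.
ranking = sorted(mcc_scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    print(f"{name:16s} {score:.3f}")
```

The two support vector variants lead the ranking, followed by the two decision trees.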


## Evaluate performance of selected model - Support Vector

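We now run the hyperparameter search for our selected model on the training data, and check performance on the entire dataset. The entire dataset includes the 519 test samples, which the model has not seen before, but also the 1209 training samples, so the metrics below are likely more optimistic than a test-only evaluation would be.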

In [25]:
# one hot encoding
enc = preprocessing.OneHotEncoder(handle_unknown='ignore')

#create copies of dataset
car_x_train_encoded = car_x_train.copy()
car_x_test_encoded = car_x_test.copy()
car_x_encoded = car_x.copy()

#fit encoder to training data
enc = enc.fit(car_x_train_encoded)

#transform datasets
car_x_train_encoded = enc.transform(car_x_train_encoded).toarray()
car_x_test_encoded = enc.transform(car_x_test_encoded).toarray()
car_x_encoded = enc.transform(car_x_encoded).toarray()

custom_scorer = make_scorer(matthews_corrcoef, greater_is_better=True)

# create folds for cross validation
inner_cv = KFold(n_splits=5, shuffle=True, random_state=5)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=5)

#search space for Bayesian hyperparameter setting
# defining search space
p_grid_svc = {
    'C': Categorical([0.0000001, 0.00001, 0.001, 0.1, 1, 10, 100, 1000]),
    'kernel' : Categorical(['linear', 'poly', 'rbf', 'sigmoid']), 
    'gamma': Real(0.000000001,1),
    'tol': Real(0.000000001,1)
}

# classifier
svc = SVC() 

# inner CV
clf = BayesSearchCV(estimator=svc, search_spaces=p_grid_svc, 
                   cv=inner_cv, scoring=custom_scorer)

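# run the hyperparameter search on the training data only
clf.fit(X=car_x_train_encoded, y=car_y_train)
# keep the best estimator; BayesSearchCV has already refit it on the full
# training set (refit=True by default), so the explicit fit below is redundant
clf = clf.best_estimator_
clf.fit(X=car_x_train_encoded, y=car_y_train)
# predict on all rows (training and test)
pred = clf.predict(car_x_encoded)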

# evaluation metrics
print("Accuracy for this model: ", metrics.accuracy_score(car_y, pred))
print("Confusion Matrix for this model: \n", metrics.confusion_matrix(car_y, pred))
print("F-scores for each class for this model: ", metrics.f1_score(car_y, pred, average=None))
print("Kappa coefficient for this model is: ", metrics.cohen_kappa_score(car_y, pred))
print("MCC for this model is: ", metrics.matthews_corrcoef(car_y, pred))
print("Precision for each class for this model: ", metrics.precision_score(car_y, pred, average=None))
print("Recall for each class for this model: ", metrics.recall_score(car_y, pred, average=None))

Accuracy for this model:  0.9965277777777778
Confusion Matrix for this model: 
 [[ 380    0    4    0]
 [   0   69    0    0]
 [   2    0 1208    0]
 [   0    0    0   65]]
F-scores for each class for this model:  [0.9921671  1.         0.99752271 1.        ]
Kappa coefficient for this model is:  0.9923976565306976
MCC for this model is:  0.9924012988646376
Precision for each class for this model:  [0.9947644  1.         0.99669967 1.        ]
Recall for each class for this model:  [0.98958333 1.         0.99834711 1.        ]


We see that the overall accuracy of our final model on the full dataset is about 99.65%. The precision for every class is above 99%, and the recall for every class is above 98.9%. The MCC for this model is 0.99, far above the benchmark value of 0.0. The confusion matrix shows only six misclassified samples, all of them confusions between 'acc' and 'unacc': four 'acc' cars predicted as 'unacc' and two 'unacc' cars predicted as 'acc'.
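
Since the figures above are computed on all rows, a test-only score is a useful complement. The sketch below shows the pattern on synthetic data from `make_classification` with a default `SVC`; the data, the model settings, and the variable names are stand-ins and do not reproduce the tuned model above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic three-class stand-in data (not the car dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5)

# Fit on the training rows only.
model = SVC().fit(X_train, y_train)

# Score once on all rows and once on the held-out rows only.
mcc_all = matthews_corrcoef(y, model.predict(X))
mcc_test = matthews_corrcoef(y_test, model.predict(X_test))

print("MCC on all rows: ", mcc_all)
print("MCC on test rows:", mcc_test)
```

In the notebook, the equivalent step would be to call `clf.predict(car_x_test_encoded)` and compare the result with `car_y_test`.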

In [23]:
# code for looking at categories in confusion matrix
np.unique(car_y, return_counts=True)

(array(['acc', 'good', 'unacc', 'vgood'], dtype=object),
 array([ 384,   69, 1210,   65]))