## Car Evaluation using Nested Cross-validation with Categorical and Ordinal target variable

Allison Liu

2023/09/29

# Car Evaluation

### Workflow of Building Models using Nested Cross-Validation with Categorical and Ordinal Datasets
#### 1. Data Preprocessing - Encoding Categorical Data
   * One Hot Encoding 
   * Ordinal Data  
#### 2. Modeling
   * Build five different models with two kinds of datasets
   * Evaluate overall scores
#### 3. Retrain Model
   * Retrain the model using the best-performing one
   * Tune hyperparameters
#### 4. Performance Evaluation
   * Classification report
   * Confusion matrix, predictive accuracy, precision, recall, f-measure
   * ROC and Lift curve

In [5]:
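import warnings

import numpy as np
import pandas as pd
from sklearn import neighbors, tree
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, cohen_kappa_score, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.svm import SVC
from ucimlrepo import fetch_ucirepo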
  
# fetch dataset 
car_evaluation = fetch_ucirepo(id=19) 
  
# data (as pandas dataframes) 
X = car_evaluation.data.features 
y = car_evaluation.data.targets 
print(X.shape)
print(y.shape)
# variable information 
print(car_evaluation.variables) 

(1728, 6)
(1728, 1)
       name     role         type demographic  \
0    buying  Feature  Categorical        None   
1     maint  Feature  Categorical        None   
2     doors  Feature  Categorical        None   
3   persons  Feature  Categorical        None   
4  lug_boot  Feature  Categorical        None   
5    safety  Feature  Categorical        None   
6     class   Target  Categorical        None   

                                         description units missing_values  
0                                       buying price  None             no  
1                           price of the maintenance  None             no  
2                                    number of doors  None             no  
3              capacity in terms of persons to carry  None             no  
4                           the size of luggage boot  None             no  
5                        estimated safety of the car  None             no  
6  evaulation level (unacceptable, acceptable, go...  N

In [6]:
X.head(10)

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
0,vhigh,vhigh,2,2,small,low
1,vhigh,vhigh,2,2,small,med
2,vhigh,vhigh,2,2,small,high
3,vhigh,vhigh,2,2,med,low
4,vhigh,vhigh,2,2,med,med
5,vhigh,vhigh,2,2,med,high
6,vhigh,vhigh,2,2,big,low
7,vhigh,vhigh,2,2,big,med
8,vhigh,vhigh,2,2,big,high
9,vhigh,vhigh,2,4,small,low


In [7]:
for col in X:
    unique_vals = X[col].unique()
    print("{}: {}".format(col, unique_vals))

buying: ['vhigh' 'high' 'med' 'low']
maint: ['vhigh' 'high' 'med' 'low']
doors: ['2' '3' '4' '5more']
persons: ['2' '4' 'more']
lug_boot: ['small' 'med' 'big']
safety: ['low' 'med' 'high']


In [8]:
# Display total counts for each of the unique values in the label column.
y.value_counts()

class
unacc    1210
acc       384
good       69
vgood      65
dtype: int64
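
The counts above show a strongly imbalanced target. As a minimal sketch (the notebook itself uses plain `KFold`; `StratifiedKFold` is only an assumption about what could be tried), the snippet below converts the recorded counts to proportions and checks how a stratified split spreads the rarest class across folds. The label vector is rebuilt from the counts, so its row order is illustrative and not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Class counts copied from the value_counts() output above.
counts = pd.Series({"unacc": 1210, "acc": 384, "good": 69, "vgood": 65})
proportions = (counts / counts.sum()).round(3)
print(proportions)

# Rebuild a label vector with the same class counts (row order is illustrative only).
labels = np.repeat(counts.index.to_numpy(), counts.to_numpy())

# StratifiedKFold keeps roughly the same class mix in every test fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_vgood = [
    int((labels[test_idx] == "vgood").sum())
    for _, test_idx in skf.split(np.zeros((len(labels), 1)), labels)
]
print(fold_vgood)  # number of 'vgood' rows in each test fold
```

With only 65 `vgood` rows, an unstratified split can leave some folds with noticeably fewer examples of that class than others, which makes per-fold scores noisier.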

## Data Preprocessing
#### Encoding Categorical Data
Although every variable in the original dataset is categorical, the levels **have a natural order, such as "vhigh, high, med, low"**, so I tried two encoding methods for the categorical data.

#### Pros and cons of treating ordinal data as categorical or ordinal
1. Pros and cons of one-hot encoding   
* Pros: It makes no assumption about the ordering, so if the order has little effect we do not overstate it   
* Cons: It loses the ordinal information, because each level becomes an unrelated 0/1 column; when the original dataset has many categories, the number of columns grows and may run into the curse of dimensionality  

2. Pros and cons of ordinal encoding    
* Pros: It preserves the ordering information  
* Cons: It assumes that the numerical distance between each pair of adjacent categories is equal  
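
The trade-off listed above can be made concrete by looking at distances between levels. The sketch below is an illustration only: it encodes the four `buying` levels both ways and prints the pairwise Euclidean distances. The helper `pairwise_dist` is a hypothetical name introduced here, and no dummy column is dropped so that all four levels stay visible.

```python
import numpy as np
import pandas as pd

levels = ["low", "med", "high", "vhigh"]
col = pd.DataFrame({"buying": levels})

# Ordinal encoding: one integer per level, in rank order.
rank = {lvl: i for i, lvl in enumerate(levels)}
ordinal = col["buying"].map(rank).to_numpy(dtype=float).reshape(-1, 1)

# One-hot encoding: one indicator column per level (no column dropped here).
one_hot = pd.get_dummies(col["buying"]).to_numpy(dtype=float)


def pairwise_dist(a):
    """Euclidean distance between every pair of rows of a 2-D array."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


print(pd.DataFrame(pairwise_dist(ordinal), index=levels, columns=levels))
print(pd.DataFrame(pairwise_dist(one_hot), index=levels, columns=levels).round(3))
```

Under ordinal encoding, `low` and `vhigh` are three units apart and adjacent levels are one unit apart, which is the equal-spacing assumption. Under one-hot encoding every pair of distinct levels is equally far apart, so the order is no longer visible to distance-based models such as KNN or an RBF-kernel SVM.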

### 1. One Hot Encoding dataset

In [9]:
X_encoded = pd.get_dummies(X, columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'], drop_first=True)
X_encoded.head()

Unnamed: 0,buying_low,buying_med,buying_vhigh,maint_low,maint_med,maint_vhigh,doors_3,doors_4,doors_5more,persons_4,persons_more,lug_boot_med,lug_boot_small,safety_low,safety_med
0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0
1,0,0,1,0,0,1,0,0,0,0,0,0,1,0,1
2,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0
3,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0
4,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1


In [10]:
y_encoded = pd.get_dummies(y, columns = ['class'], drop_first=True)
y_encoded.head()

Unnamed: 0,class_good,class_unacc,class_vgood
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0


In [11]:
X_encoded.shape

(1728, 15)

In [12]:
y_encoded.shape

(1728, 3)

### 2. Ordinal dataset

In [109]:
X_order = X.copy()
y_order = y.copy()

In [110]:
# Create the new encoded columns in the DataFrame by mapping the feature and label columns with the desired order.
X_order['buying_ordinal'] = X_order['buying'].map({'low':0, 'med':1, 'high':2, 'vhigh':3})
X_order['maint_ordinal'] = X_order['maint'].map({'low':0, 'med':1, 'high':2, 'vhigh':3})
X_order['doors_ordinal'] = X_order['doors'].map({'2':0, '3':1, '4':2, '5more':3})
X_order['persons_ordinal'] = X_order['persons'].map({'2':0, '4':1, 'more':2})
X_order['lug_boot_ordinal'] = X_order['lug_boot'].map({'small':0, 'med':1, 'big':2})
X_order['safety_ordinal'] = X_order['safety'].map({'low':0, 'med':1, 'high':2})
y_order['class_ordinal'] = y_order['class'].map({'unacc':0, 'acc':1, 'good':2, 'vgood':3})
# Remove the original columns.
X_order.drop(columns=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'], inplace=True)
y_order.drop(columns=['class'], inplace=True)

# View last five rows of DataFrame.
y_order.tail()

Unnamed: 0,class_ordinal
1723,2
1724,3
1725,0
1726,2
1727,3


In [111]:
X_order.tail()

Unnamed: 0,buying_ordinal,maint_ordinal,doors_ordinal,persons_ordinal,lug_boot_ordinal,safety_ordinal
1723,0,0,3,2,1,1
1724,0,0,3,2,1,2
1725,0,0,3,2,2,0
1726,0,0,3,2,2,1
1727,0,0,3,2,2,2
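
The manual `map` calls above could also be written with scikit-learn's `OrdinalEncoder`, which is already imported at the top of the notebook. This is a sketch under assumptions, not the notebook's method: `toy` is a made-up three-row frame that reuses two of the real column names and their levels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Small stand-in frame with the same column names and levels as the car data.
toy = pd.DataFrame({
    "buying": ["vhigh", "low", "med"],
    "safety": ["low", "high", "med"],
})

# One list of levels per column, ordered from lowest to highest.
categories = [
    ["low", "med", "high", "vhigh"],  # buying
    ["low", "med", "high"],           # safety
]
encoder = OrdinalEncoder(categories=categories)
toy_ordinal = pd.DataFrame(
    encoder.fit_transform(toy),
    columns=[f"{c}_ordinal" for c in toy.columns],
).astype(int)
print(toy_ordinal)
```

Passing `categories` explicitly matters: without it the encoder sorts the levels alphabetically, which would not match the intended order (for example `high` would come before `low`).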


## Build the five models
I built five models on both the **one-hot encoded data** and the **ordinal data**:
1. Decision tree
2. Logistic regression
3. KNN
4. Naïve Bayes classifier
5. Support vector machine

In [45]:
import warnings

# Use filterwarnings to suppress specific warning
warnings.filterwarnings("ignore") 

#Define the inner and outer loops
inner_cv = KFold(n_splits = 5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits = 5, shuffle=True, random_state=1)
#scoring = ['precision', 'recall', 'f1', 'accuracy']

#Set the hyperparameters of separate models
dt_grid = {'max_depth':range(1,21),
          'min_samples_split':range(2,21),  # an integer min_samples_split must be at least 2
          'criterion':['gini']}

lr_grid = {'penalty':['l1', 'l2'],
           'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000],
           'multi_class':['ovr'],
           'solver' :['liblinear']}

knn_grid = {'n_neighbors':range(1,21),
            'leaf_size':range(1,11),
            'weights':['uniform']}

nb_grid = {'var_smoothing': np.logspace(0,-9, num=100)}

svc_grid = {"C": [1, 10, 100],
            "gamma": [0.01, 0.001, 0.0001, 0.1],
            'kernel': ['rbf', 'poly', 'linear']}

dt = tree.DecisionTreeClassifier()
lr = LogisticRegression()
knn = neighbors.KNeighborsClassifier()
svm = SVC(probability=True)  # enable predict_proba at construction time
gnb = GaussianNB()

#Create inner loop cv using GridSearchCV
dt_clf = GridSearchCV(estimator=dt, param_grid=dt_grid, cv=inner_cv, refit=True)
lr_clf = GridSearchCV(estimator=lr, param_grid=lr_grid, cv=inner_cv, refit=True)
knn_clf = GridSearchCV(estimator=knn, param_grid=knn_grid, cv=inner_cv, refit=True)
nb_clf = GridSearchCV(estimator=gnb, param_grid=nb_grid, cv=inner_cv, refit=True)
svc_clf = GridSearchCV(estimator=svm, param_grid=svc_grid, cv=inner_cv, refit=True)

#Create outer loop cv using cross_val_score
dt_score = cross_val_score(dt_clf, X=X_encoded, y=y, cv=outer_cv)
print("Mean score of decision tree:\n", dt_score.mean())

lr_score = cross_val_score(lr_clf, X=X_encoded, y=y, cv=outer_cv)
print("Mean score of Logistic regression:\n", lr_score.mean())

knn_score = cross_val_score(knn_clf, X=X_encoded, y=y, cv=outer_cv)
print("Mean score of KNN:\n", knn_score.mean())

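# Score Naive Bayes with its own search object (nb_clf). The recorded output below came from
# a run that passed knn_clf here, which is why its Naive Bayes value repeats the KNN score.
nb_score = cross_val_score(nb_clf, X=X_encoded, y=y, cv=outer_cv)
print("Mean score of Naive Bayes:\n", nb_score.mean())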

svc_score = cross_val_score(svc_clf, X=X_encoded, y=y, cv=outer_cv)
print("Mean score of SVM:\n", svc_score.mean())

Mean score of decision tree:
 0.9398022953840999
Mean score of Logistic regression:
 0.897555499706794
Mean score of KNN:
 0.8477909022367429
Mean score of Naiye Bayes:
 0.8477909022367429
Mean score of SVM:
 0.9953723716176593
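
Passing a `GridSearchCV` object to `cross_val_score`, as in the cell above, is what makes the cross-validation nested. The sketch below spells out the outer loop by hand and compares it with the one-line form. It runs on synthetic data from `make_classification` with a deliberately tiny grid so that it finishes quickly; the dataset, grid, and fold counts are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the car dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=1
)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
param_grid = {"max_depth": [2, 4, 6]}

outer_scores = []
for train_idx, test_idx in outer_cv.split(X_demo):
    # Inner loop: pick hyperparameters using only the outer training fold.
    search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=inner_cv)
    search.fit(X_demo[train_idx], y_demo[train_idx])
    # Outer loop: score the refitted best model on data the search never saw.
    outer_scores.append(search.score(X_demo[test_idx], y_demo[test_idx]))

# The same computation in one call, as used in the notebook.
reference = cross_val_score(
    GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=inner_cv),
    X_demo, y_demo, cv=outer_cv,
)
print(np.mean(outer_scores), reference.mean())
```

Because the hyperparameters are chosen separately inside each outer training fold, the outer score estimates the performance of the whole tuning procedure, not of one fixed parameter setting.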


In [19]:
import warnings

# Use filterwarnings to suppress specific warning
warnings.filterwarnings("ignore") 

#Define the inner and outer loops
inner_cv = KFold(n_splits = 5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits = 5, shuffle=True, random_state=1)
#scoring = ['precision', 'recall', 'f1', 'accuracy']

#Set the hyperparameters of separate models
dt_grid = {'max_depth':range(1,21),
          'min_samples_split':range(2,21),
          'criterion':['gini']}

lr_grid = {'penalty':['l1', 'l2'],
           'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000],
           'multi_class':['auto'],
           'solver' :['liblinear']}

knn_grid = {'n_neighbors':range(1,21),
            'leaf_size':range(1,11),
            'weights':['uniform']}

nb_grid = {'var_smoothing': np.logspace(0,-9, num=100)}

svc_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
            "gamma": [0.01, 0.001, 0.0001, 0.1],
            'kernel': ['rbf', 'poly', 'linear']}

dt = tree.DecisionTreeClassifier()
lr = LogisticRegression()
knn = neighbors.KNeighborsClassifier()
svm = SVC(probability=True)  # enable predict_proba at construction time
gnb = GaussianNB()

#Create inner loop cv using GridSearchCV
dt_clf = GridSearchCV(estimator=dt, param_grid=dt_grid, cv=inner_cv, refit=True)
lr_clf = GridSearchCV(estimator=lr, param_grid=lr_grid, cv=inner_cv, refit=True)
knn_clf = GridSearchCV(estimator=knn, param_grid=knn_grid, cv=inner_cv, refit=True)
nb_clf = GridSearchCV(estimator=gnb, param_grid=nb_grid, cv=inner_cv, refit=True)
svc_clf = GridSearchCV(estimator=svm, param_grid=svc_grid, cv=inner_cv, refit=True)

#Create outer loop cv using cross_val_score
dt_score = cross_val_score(dt_clf, X=X_order, y=y_order, cv=outer_cv)
print("Mean score of decision tree:\n", dt_score.mean())

lr_score = cross_val_score(lr_clf, X=X_order, y=y_order, cv=outer_cv)
print("Mean score of Logistic regression:\n", lr_score.mean())

knn_score = cross_val_score(knn_clf, X=X_order, y=y_order, cv=outer_cv)
print("Mean score of KNN:\n", knn_score.mean())

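# Score Naive Bayes with its own search object (nb_clf). The recorded output below came from
# a run that passed knn_clf here, which is why its Naive Bayes value repeats the KNN score.
nb_score = cross_val_score(nb_clf, X=X_order, y=y_order, cv=outer_cv)
print("Mean score of Naive Bayes:\n", nb_score.mean())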

svc_score = cross_val_score(svc_clf, X=X_order, y=y_order, cv=outer_cv)
print("Mean score of SVM:\n", svc_score.mean())

Mean score of decision tree:
 0.9826405294462596
Mean score of Logistic regression:
 0.8177012649744493
Mean score of KNN:
 0.9369272011393148
Mean score of Naiye Bayes:
 0.9369272011393148
Mean score of SVM:
 0.9843880371952751
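
Scoring each model with its own copy-pasted pair of lines makes it easy to pass the wrong estimator or print the wrong variable. One way to reduce that risk, sketched here on synthetic data with small assumed grids, is to keep the estimators and grids in a dictionary and loop over it. The names `models` and `mean_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the car dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=1
)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Each entry pairs an estimator with its own (deliberately small) grid.
models = {
    "Decision tree": (DecisionTreeClassifier(random_state=1), {"max_depth": [2, 4, 6]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "Naive Bayes": (GaussianNB(), {"var_smoothing": [1e-9, 1e-6, 1e-3]}),
}

mean_scores = {}
for name, (estimator, grid) in models.items():
    search = GridSearchCV(estimator, grid, cv=inner_cv)
    scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv)
    mean_scores[name] = scores.mean()
    print(f"Mean score of {name}: {scores.mean():.3f}")
```

The label, the estimator, and the printed score all come from the same loop iteration, so they cannot drift apart.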


### Overall model performance comparison
**1. Models of One hot encoding vs. Ordinal dataset**   
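* The ordinal dataset gives higher scores for the decision tree (0.98 vs. 0.94) and KNN (0.94 vs. 0.85), while logistic regression (0.90 vs. 0.82) and SVM (0.995 vs. 0.984) score higher with one-hot encoding.  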

**2. Model comparison on the ordinal dataset**  
* Among the models trained on ordinal data, **most reach a mean accuracy above 90%, and the support vector machine performs best.** I therefore use SVM to rebuild the model and tune its hyperparameters again.

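**3. Model comparison**

Model | Mean accuracy (one-hot encoded data) | Mean accuracy (ordinal data)
--- | --- | --- 
Decision Tree | 0.940 | 0.983 
Logistic Regression | 0.898 | 0.818 
KNN | 0.848 | 0.937 
Naive Bayes | not measured | not measured 
**SVM** | **0.995** | **0.984** 

The Naive Bayes lines in the recorded outputs above were computed with the KNN search object (`knn_clf`), so they repeat the KNN scores and no separate Naive Bayes score is available.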

## Final model using the best-performing model: SVM
SVM is the only model that scores above 0.98 on both encodings, so I choose it, with the ordinal data, to rebuild the model and tune the hyperparameters again.

In [33]:
#X_train_order, X_test_order, y_train_order, y_test_order = train_test_split(X_order, y_order, test_size=0.3, random_state=20)
X_train_enc, X_test_enc, y_train_enc, y_test_enc = train_test_split(X_encoded, y, test_size=0.3, random_state=20)

In [34]:
# Ordinal features with the original string labels (y), so the report below shows class names.
X_train_order, X_test_order, y_train_order, y_test_order = train_test_split(X_order, y, test_size=0.3, random_state=20)

In [35]:
print(X_train_order.shape)
print(X_test_order.shape)
print(y_train_order.shape)
print(y_test_order.shape)

(1209, 6)
(519, 6)
(1209, 1)
(519, 1)


In [36]:
warnings.filterwarnings("ignore") 
inner_cv = KFold(n_splits = 5, shuffle=True, random_state=20)

# Build the model again using the best model
svc_grid_tune = {"C": [0.1, 1, 10],
            "gamma": [0.1, 1, 10],
            'kernel': ['rbf']}
svm = SVC()
svc_clf_final = GridSearchCV(estimator=svm, param_grid=svc_grid_tune, cv = inner_cv)
svc_clf_final.fit(X_train_order, y_train_order)
svc_clf_final.best_params_

{'C': 10, 'gamma': 1, 'kernel': 'rbf'}

In [37]:
warnings.filterwarnings("ignore") 
inner_cv = KFold(n_splits = 5, shuffle=True, random_state=20)

# Build the model again using the best model
svc_grid_tune = {"C": [9.5, 10, 10.5],
            "gamma": [0.1, 1, 10],
            'kernel': ['rbf']}
svm = SVC()
svc_clf_final = GridSearchCV(estimator=svm, param_grid=svc_grid_tune, cv = inner_cv)
svc_clf_final.fit(X_train_order, y_train_order)
svc_clf_final.best_params_

{'C': 9.5, 'gamma': 1, 'kernel': 'rbf'}
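
`best_params_` reports a single winner, but neighbouring values such as `C = 9.5` and `C = 10` may score the same or nearly the same. A quick way to check is to read the full `cv_results_` table. The sketch below does this on synthetic data (an assumption; the numbers will differ from the car dataset) with the same grid as the cell above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Synthetic stand-in data (not the car dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=20
)

cv = KFold(n_splits=5, shuffle=True, random_state=20)
grid = {"C": [9.5, 10, 10.5], "gamma": [0.1, 1, 10], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), grid, cv=cv).fit(X_demo, y_demo)

# One row per parameter combination, best rank first.
columns = ["param_C", "param_gamma", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results.head())
```

If several rows share rank 1, or their mean scores differ by less than the fold-to-fold standard deviation, the exact value of `C` is probably not important.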

#### After the final round of hyperparameter tuning, I rebuild the model with the best hyperparameters and make predictions.

In [114]:
warnings.filterwarnings("ignore") 
svc_model = SVC(C = 9.5, gamma = 1, kernel= 'rbf', probability=True)
svc_model.fit(X_train_order, y_train_order)
y_pred_ord = svc_model.predict(X_test_order)
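# Column 1 is the probability of the second class in svc_model.classes_ ('good'); it is not used further below.
y_pred_proba = svc_model.predict_proba(X_test_order)[:, 1]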

report_svc = classification_report(y_test_order, y_pred_ord)

print("Classification report of SVM:\n", report_svc)

print("The kappa stats is: ", cohen_kappa_score(y_test_order, y_pred_ord))
print("The MCC stats is: ", matthews_corrcoef(y_test_order, y_pred_ord))

Classification report of SVM:
               precision    recall  f1-score   support

         acc       0.97      0.97      0.97       117
        good       0.95      1.00      0.98        20
       unacc       0.99      1.00      0.99       361
       vgood       1.00      0.90      0.95        21

    accuracy                           0.99       519
   macro avg       0.98      0.97      0.97       519
weighted avg       0.99      0.99      0.99       519

The kappa stats is:  0.9706843544990196
The MCC stats is:  0.9707344473823967
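
Cohen's kappa compares the observed agreement $p_o$ with the agreement $p_e$ expected by chance from the row and column totals of the confusion matrix: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it by hand on a small made-up label set (not the notebook's predictions) and compares it with `cohen_kappa_score`.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Made-up labels for illustration only.
y_true = ["unacc", "unacc", "acc", "acc", "good", "vgood", "unacc", "acc"]
y_hat = ["unacc", "unacc", "acc", "unacc", "good", "vgood", "unacc", "good"]

cm = confusion_matrix(y_true, y_hat)
n = cm.sum()

# Observed agreement: share of predictions on the diagonal.
p_observed = np.trace(cm) / n
# Chance agreement: product of matching row and column totals, summed over classes.
p_expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2

kappa_manual = (p_observed - p_expected) / (1 - p_expected)
print(kappa_manual, cohen_kappa_score(y_true, y_hat))
```

Because the chance term grows when one class dominates, kappa is a stricter summary than plain accuracy on an imbalanced target like this one.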


#### Final SVM model
Description | Result 
--- | --- 
Model | SVM 
Best parameters | {'C': 9.5, 'gamma': 1, 'kernel': 'rbf'} 
Accuracy | 0.99 
Kappa | 0.97 
MCC | 0.97 

Based on the classification report, Kappa score, and MCC score, **we could conclude that the overall SVM model performs well with 99% accuracy, kappa = 0.97, and MCC score = 0.97.**
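
The workflow also lists ROC analysis. For a four-class target, one common option is a one-vs-rest ROC AUC computed from the full `predict_proba` matrix, not from a single column. The sketch below shows the call on synthetic four-class data with the tuned SVM settings; the data and the choice of macro averaging are assumptions, so the resulting number says nothing about the car dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic four-class stand-in data (not the car dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=20,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=20, stratify=y_demo
)

model = SVC(C=9.5, gamma=1, kernel="rbf", probability=True, random_state=20)
model.fit(X_tr, y_tr)

# Shape (n_samples, n_classes); columns follow the order of model.classes_.
proba = model.predict_proba(X_te)

# One-vs-rest AUC, averaged over the classes with equal weight.
auc_ovr = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
print(f"Macro one-vs-rest ROC AUC: {auc_ovr:.3f}")
```

Macro averaging gives every class the same weight, so the rare classes count as much as the dominant one.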

#### Reference
https://www.theanalysisfactor.com/pros-and-cons-of-treating-ordinal-variables-as-nominal-or-continuous/  
https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/   
https://www.kaggle.com/code/satishgunjal/multiclass-logistic-regression-using-sklearn  