# Car Evaluation

Dataset: http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

I built five models (decision tree, logistic regression, KNN, naive Bayes, and SVM) to classify each car into one of four classes: unacc, acc, good, and vgood.

## Data Processing

In [27]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [28]:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

In [29]:
df = pd.read_csv('car.data', header = None)
col_names =['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
df.columns = col_names
df.head()

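|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|--------|-------|-------|---------|----------|--------|-------|
| 0 | vhigh  | vhigh | 2     | 2       | small    | low    | unacc |
| 1 | vhigh  | vhigh | 2     | 2       | small    | med    | unacc |
| 2 | vhigh  | vhigh | 2     | 2       | small    | high   | unacc |
| 3 | vhigh  | vhigh | 2     | 2       | med      | low    | unacc |
| 4 | vhigh  | vhigh | 2     | 2       | med      | med    | unacc |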


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   persons   1728 non-null   object
 4   lug_boot  1728 non-null   object
 5   safety    1728 non-null   object
 6   class     1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


## Data Exploration

In [31]:
# Check missing value
df.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

In [32]:
# Check if there is imbalanced data
df.groupby('class')['class'].count()

class
acc       384
good       69
unacc    1210
vgood      65
Name: class, dtype: int64
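
The counts above are uneven: unacc makes up about 70% of the rows, while good and vgood are each under 4%. This notebook uses plain `KFold` later on; as an alternative that is not part of the original analysis, scikit-learn's `StratifiedKFold` keeps the class proportions roughly equal across folds. The sketch below rebuilds a label vector from the printed counts (the `features` array is only a placeholder) and prints the per-fold class counts.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Class counts copied from the output above
class_counts = {"acc": 384, "good": 69, "unacc": 1210, "vgood": 65}
labels = np.repeat(list(class_counts), list(class_counts.values()))
features = np.zeros((len(labels), 1))  # placeholder: only the labels drive the split

# Assumption: 4 shuffled folds, mirroring the KFold settings used in this notebook
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

fold_counts = []
for _, test_idx in skf.split(features, labels):
    values, counts = np.unique(labels[test_idx], return_counts=True)
    fold_counts.append(dict(zip(values.tolist(), counts.tolist())))

for i, counts in enumerate(fold_counts):
    print(f"fold {i}: {counts}")
```

With stratification, every test fold should contain a near-equal share of the two rare classes, which makes per-fold accuracy less sensitive to how the shuffle happens to fall.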

In [33]:
# define dependent and independent variables
x = df.iloc[:,0:6]
y = df['class']

### Transform input values

Every feature is ordinal, but I cannot quantify the distance between adjacent levels (e.g. the gap between 'low' and 'med'). I therefore use OneHotEncoder to treat the features in x as unordered categories (x_cat) instead of assigning them integer ranks.

In [34]:
# one-hot encode the ordinal features: one binary column per level
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
x_ohe = ohe.fit_transform(x).toarray()
x_cat = pd.DataFrame(x_ohe)
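
To make the encoding step concrete, here is a minimal sketch on a toy single-column frame (the `toy` data is made up for illustration and is not part of the dataset). Each level becomes its own 0/1 column, and `categories_` records which column corresponds to which level. Note that the levels are sorted alphabetically, so the original low < med < high ordering is deliberately discarded.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column using the same levels as the 'safety' feature
toy = pd.DataFrame({"safety": ["low", "med", "high", "med"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy).toarray()  # dense array, one column per level

# Column order follows the sorted category values
categories = list(encoder.categories_[0])
print(categories)
print(encoded)
```

Applied to the six features of this dataset, the same transformation yields one column per feature level, so `x_cat` is considerably wider than `x`.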

## Modeling

### Nested Grid Search CV

In [35]:
# Cross-validation for inner and outer loops
i = 42
inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)
score = 'accuracy'

### Tuning hyperparameters

In [36]:
# Decision Tree
tree = DecisionTreeClassifier()
depth = list(range(2,10))
min_s_leaf = list(range(1,5))
weight = ["entropy", "gini"]
t_grid = dict(max_depth = depth, min_samples_leaf = min_s_leaf, criterion = weight)

# Logistic Regression
# liblinear supports only the 'l1' and 'l2' penalties (elastic net needs solver='saga')
reg = LogisticRegression(solver='liblinear')
c_rng = [0.01, 0.1, 0.5, 1, 10, 100]
penal = ['l1','l2']
lr_grid = {'C': c_rng, 'penalty':penal}

# KNN
knn =  KNeighborsClassifier()
k = list(range(2,5))
knn_grid = {'n_neighbors':k}

# Naive Bayes
nb = GaussianNB()
nb_grid = {}

# SVM
svm = SVC()
kernels = ['rbf']
c = [1,10,100]
g = [0.1,0.5,1]
svm_grid = {'kernel':kernels,'C':c,'gamma':g} 
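
As a rough guide to how much work these grids imply, the sketch below counts the candidate settings with scikit-learn's `ParameterGrid` and multiplies by the fold counts used in this notebook (4 inner, 4 outer). The grids are re-declared here so the snippet runs on its own; the per-outer-fold refit of the best candidate is extra and not counted.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as defined above, repeated so this snippet is self-contained
t_grid = {
    "max_depth": list(range(2, 10)),
    "min_samples_leaf": list(range(1, 5)),
    "criterion": ["entropy", "gini"],
}
svm_grid = {"kernel": ["rbf"], "C": [1, 10, 100], "gamma": [0.1, 0.5, 1]}

n_tree = len(ParameterGrid(t_grid))   # number of decision-tree candidates
n_svm = len(ParameterGrid(svm_grid))  # number of SVM candidates

inner_folds = outer_folds = 4
tree_fits = n_tree * inner_folds * outer_folds
svm_fits = n_svm * inner_folds * outer_folds

print(f"tree: {n_tree} candidates, {tree_fits} inner fits under nested CV")
print(f"svm:  {n_svm} candidates, {svm_fits} inner fits under nested CV")
```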

### Non-nested parameter search and scoring

In [37]:
tree_clf = GridSearchCV(estimator=tree, param_grid=t_grid, scoring = score, cv=inner_cv)
reg_clf = GridSearchCV(estimator=reg, param_grid=lr_grid, scoring = score, cv=inner_cv)
knn_clf = GridSearchCV(estimator=knn, param_grid=knn_grid, scoring = score, cv=inner_cv)
nb_clf = GridSearchCV(estimator=nb, param_grid=nb_grid, scoring = score, cv=inner_cv)
svm_clf = GridSearchCV(estimator=svm, param_grid=svm_grid, scoring = score, cv=inner_cv)

### Nested CV with parameter optimization

In [38]:
# decision tree
t_score = cross_val_score(tree_clf, X=x_cat, y=y, cv=outer_cv) 

# logistic regression
lr_score = cross_val_score(reg_clf, X=x_cat, y=y, cv=outer_cv) 

# knn
knn_score = cross_val_score(knn_clf, X=x_cat, y=y, cv=outer_cv) 

# naive bayes
nb_score = cross_val_score(nb_clf, X=x_cat, y=y, cv=outer_cv)

# SVM
svm_score = cross_val_score(svm_clf, X=x_cat, y=y, cv=outer_cv)
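
The pattern in this cell is what makes the procedure nested: `cross_val_score` clones the `GridSearchCV` object for each outer fold, so hyperparameters are tuned only on that fold's training portion and then scored on data the search never saw. Below is a minimal self-contained sketch of the same idea. It uses the bundled iris data and a reduced SVM grid purely as stand-ins, so the numbers it prints say nothing about the car dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Stand-in data (assumption: any small classification set works for the illustration)
X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=4, shuffle=True, random_state=42)  # tunes hyperparameters
outer_cv = KFold(n_splits=4, shuffle=True, random_state=42)  # estimates generalization

# Inner loop: grid search over a small, illustrative grid
search = GridSearchCV(
    estimator=SVC(kernel="rbf"),
    param_grid={"C": [1, 10], "gamma": [0.1, 1]},
    scoring="accuracy",
    cv=inner_cv,
)

# Outer loop: the whole search is refit from scratch on each outer training fold
nested_scores = cross_val_score(search, X=X, y=y, cv=outer_cv)
print(nested_scores, nested_scores.mean())
```

Reporting the mean of the outer scores, as done in this notebook, gives an estimate that is not biased by the tuning itself.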

### Compare scores between 5 models

In [39]:
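# mean outer-fold accuracy per model
# (named `scores` so the `score = 'accuracy'` setting from In [35] is not overwritten)
scores = {}
scores['Decision Tree'] = t_score.mean()
scores['Logistic Regression'] = lr_score.mean()
scores['KNN'] = knn_score.mean()
scores['NB'] = nb_score.mean()
scores['SVM'] = svm_score.mean()
scores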

{'Decision Tree': 0.9542824074074074,
 'Logistic Regression': 0.8981481481481481,
 'KNN': 0.8304398148148149,
 'NB': 0.8026620370370371,
 'SVM': 0.9942129629629629}

## Final Model

I chose the SVM for the final model because it achieved the highest mean nested-CV accuracy (0.99).
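
The selection can also be done programmatically. This small sketch ranks the models by the mean accuracies reported above (values copied from the printed output; `mean_scores` is a name chosen here for illustration).

```python
# Mean nested-CV accuracies copied from the comparison output above
mean_scores = {
    "Decision Tree": 0.9542824074074074,
    "Logistic Regression": 0.8981481481481481,
    "KNN": 0.8304398148148149,
    "NB": 0.8026620370370371,
    "SVM": 0.9942129629629629,
}

# Sort from best to worst and pick the top entry
ranking = sorted(mean_scores.items(), key=lambda item: item[1], reverse=True)
best_model, best_score = ranking[0]

for name, value in ranking:
    print(f"{name:<20} {value:.4f}")
```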

In [40]:
# split training(0.8)-testing(0.2) dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
x_train, x_test, y_train, y_test = train_test_split(x_cat, y, test_size=0.2, random_state = 42)

In [41]:
# rerun the grid search on the training split, then predict on the held-out test split
svm_clf.fit(x_train, y_train)
y_pred = svm_clf.predict(x_test)
print('best params: ', svm_clf.best_params_)
print('best score: ', svm_clf.best_score_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

best params:  {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
best score:  0.99131272514032
[[ 76   7   0   0]
 [  0  11   0   0]
 [  0   0 235   0]
 [  1   2   0  14]]
              precision    recall  f1-score   support

         acc       0.99      0.92      0.95        83
        good       0.55      1.00      0.71        11
       unacc       1.00      1.00      1.00       235
       vgood       1.00      0.82      0.90        17

    accuracy                           0.97       346
   macro avg       0.88      0.93      0.89       346
weighted avg       0.98      0.97      0.97       346



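Based on the confusion matrix above, the overall accuracy on the held-out test set is 0.97 (336 of 346 cars classified correctly). Most of the errors involve the 'good' class: 7 'acc' cars and 2 'vgood' cars were predicted as 'good', which lowers its precision to 0.55.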
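
For reference, the figures in the classification report can be recomputed directly from the confusion matrix. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the label order shown in the report.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted)
labels = ["acc", "good", "unacc", "vgood"]
cm = [
    [76, 7, 0, 0],
    [0, 11, 0, 0],
    [0, 0, 235, 0],
    [1, 2, 0, 14],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))  # diagonal = correct predictions
accuracy = correct / total

precision = {}
recall = {}
for j, label in enumerate(labels):
    predicted_as_label = sum(cm[i][j] for i in range(len(labels)))  # column sum
    actually_label = sum(cm[j])                                     # row sum
    precision[label] = cm[j][j] / predicted_as_label
    recall[label] = cm[j][j] / actually_label

print(f"accuracy: {correct}/{total} = {accuracy:.2f}")
for label in labels:
    print(f"{label:<6} precision={precision[label]:.2f} recall={recall[label]:.2f}")
```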