This file shows the model building process using nested CV to find the best model that fit with the car evulation dataset,
I tested multiple learning techniques including decision tree, KNN, Naive Bayes, SVM and Logistic regression
Processing steps: 
1. Data encoding / ordinal data trans as a categorical data
2. Normalized the data
3. Doing nested CV for each model and pick the best performance model on the dataset
4. Grind Search for tuning the model with best hyperparameter
5. Apply the model on the testing set to evulate the performance

In [1]:
# import the library 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn import neighbors, datasets
from sklearn.svm import SVC
from sklearn import tree
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
df = pd.read_csv(url, names=columns)

df.head()
# Display the first few rows and class distribution
df['class'].value_counts()

class
unacc    1210
acc       384
good       69
vgood      65
Name: count, dtype: int64

#### Data Encoding as Numerical

In [4]:
# code as a numeric
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
df = pd.DataFrame(encoder.fit_transform(df), columns=df.columns)

df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,3.0,3.0,0.0,0.0,2.0,1.0,2.0
1,3.0,3.0,0.0,0.0,2.0,2.0,2.0
2,3.0,3.0,0.0,0.0,2.0,0.0,2.0
3,3.0,3.0,0.0,0.0,1.0,1.0,2.0
4,3.0,3.0,0.0,0.0,1.0,2.0,2.0


In [7]:
# Split dataset into features and target variable
X = df.drop("class", axis=1) # features
y = df["class"] # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [13]:
# Apply SMOTE only to the training data
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Check Balanced Class Distribution in Training Data
balanced_classes = pd.Series(y_train).value_counts()
print("Balanced Class Distribution after SMOTE on Training Data:\n", balanced_classes)

Balanced Class Distribution after SMOTE on Training Data:
 class
3.0    852
2.0    852
0.0    852
1.0    852
Name: count, dtype: int64


In [15]:
# Normalize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

1. Decision Tree Model 

In [17]:
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn.model_selection import GridSearchCV

In [19]:
# Decision Tree Model with Nested Cross-Validation
param_grid_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

nested_scores = []

for train_idx, test_idx in outer_cv.split(X_train_scaled, y_train):
    X_train_outer, X_test_outer = X_train_scaled[train_idx], X_train_scaled[test_idx]
    y_train_outer, y_test_outer = y_train.iloc[train_idx], y_train.iloc[test_idx]

    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    grid_search_dt = GridSearchCV(tree.DecisionTreeClassifier(random_state=42), param_grid_dt, cv=inner_cv, scoring='accuracy', n_jobs=-1)
    grid_search_dt.fit(X_train_outer, y_train_outer)

    best_model = grid_search_dt.best_estimator_

    y_pred = best_model.predict(X_test_outer)
    nested_scores.append(grid_search_dt.best_score_)
    #print(classification_report(y_test_outer, y_pred))

print("Average Nested CV Score:", sum(nested_scores)/len(nested_scores))


Average Nested CV Score: 0.9870153094604615


2. KNN

In [21]:
from sklearn.neighbors import KNeighborsClassifier

In [23]:
# K-NN Model with Nested Cross-Validation
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # Manhattan and Euclidean distances
}

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

nested_scores = []

for train_idx, test_idx in outer_cv.split(X_train_scaled, y_train):
    X_train_outer, X_test_outer = X_train_scaled[train_idx], X_train_scaled[test_idx]
    y_train_outer, y_test_outer = y_train.iloc[train_idx], y_train.iloc[test_idx]

    grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=inner_cv, scoring='accuracy', n_jobs=-1)
    grid_search_knn.fit(X_train_outer, y_train_outer)

    best_model = grid_search_knn.best_estimator_
    y_pred = best_model.predict(X_test_outer)
    nested_scores.append(grid_search_knn.best_score_)
    # print(classification_report(y_test_outer, y_pred))

print("Average Nested CV Score:", np.mean(nested_scores))


Average Nested CV Score: 0.9487956137757682


3. Naive Beyes

In [25]:
from sklearn.naive_bayes import GaussianNB

In [27]:
# Naive Bayes Model with Nested Cross-Validation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

nested_scores = []

for train_idx, test_idx in outer_cv.split(X_train_scaled, y_train):
    X_train_outer, X_test_outer = X_train_scaled[train_idx], X_train_scaled[test_idx]
    y_train_outer, y_test_outer = y_train.iloc[train_idx], y_train.iloc[test_idx]

    nb_model = GaussianNB()
    nb_model.fit(X_train_outer, y_train_outer)

    y_pred = nb_model.predict(X_test_outer)
    nested_scores.append(nb_model.score(X_test_outer, y_test_outer))

print("Average Nested CV Score:", np.mean(nested_scores))

Average Nested CV Score: 0.6015244099370858


4. Logistic Regression

In [29]:
from sklearn.linear_model import LogisticRegression

In [31]:
# Logistic Regression with Nested Cross-Validation
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l2']
}

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

nested_scores = []
for train_idx, test_idx in outer_cv.split(X_train_scaled, y_train):
    X_train_outer, X_test_outer = X_train_scaled[train_idx], X_train_scaled[test_idx]
    y_train_outer, y_test_outer = y_train.iloc[train_idx], y_train.iloc[test_idx]

    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    grid_search_lr = GridSearchCV(LogisticRegression(max_iter=1000), param_grid_lr, cv=inner_cv, scoring='accuracy', n_jobs=-1)
    grid_search_lr.fit(X_train_outer, y_train_outer)

    best_model = grid_search_lr.best_estimator_
    y_pred = best_model.predict(X_test_outer)
    nested_scores.append(grid_search_lr.best_score_)
    
print("Average Nested CV Score:", np.mean(nested_scores))


Average Nested CV Score: 0.5779777684082652


5. SVM

In [33]:
from sklearn.svm import SVC

In [35]:
# SVM Model with Nested Cross-Validation
param_grid_svm = {
    'C': [0.1,1,10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

nested_scores = []

for train_idx, test_idx in outer_cv.split(X_train_scaled, y_train):
    X_train_outer, X_test_outer = X_train_scaled[train_idx], X_train_scaled[test_idx]
    y_train_outer, y_test_outer = y_train.iloc[train_idx], y_train.iloc[test_idx]

    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    grid_search_svm = GridSearchCV(SVC(), param_grid_svm, cv=3, scoring='accuracy', n_jobs=-1)
    grid_search_svm.fit(X_train_scaled, y_train)

    best_model = grid_search_svm.best_estimator_
    y_pred = best_model.predict(X_test_outer)
    nested_scores.append(grid_search_svm.best_score_)

print("Average Nested CV Score:", np.mean(nested_scores))


Average Nested CV Score: 0.9976525821596244


##### After doing the nested CV based on the training dataset, the SVM classification has the best performance on the average nested CV score. Further, I'll run the grid search corss validation again to find the best hyper parameter and testing the model use the testing dataset to evulate the model performance.

In [112]:
from sklearn.model_selection import train_test_split, GridSearchCV

param_grid_svm = {
    'C': [0.1, 1, 0.01],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

grid_search_svm = GridSearchCV(SVC(), param_grid_svm, cv=3, scoring='accuracy', n_jobs=-1)
grid_search_svm.fit(X_train_scaled, y_train)

# Display best parameters and results
print("Best Parameters:", grid_search_svm.best_params_)


Best Parameters: {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}


In [114]:
# Train SVM model with best parameters
final_model = SVC(C=1, gamma='scale', kernel='rbf')
final_model.fit(X_train_scaled, y_train)

# Evaluate the final model on the test set
y_pred_test = final_model.predict(X_test_scaled)
print("SVM Model Performance on Test Data:")
print(classification_report(y_test, y_pred_test))

SVM Model Performance on Test Data:
              precision    recall  f1-score   support

         0.0       0.95      0.98      0.97       356
         1.0       0.98      1.00      0.99       347
         2.0       1.00      0.95      0.98       385
         3.0       1.00      1.00      1.00       364

    accuracy                           0.98      1452
   macro avg       0.98      0.98      0.98      1452
weighted avg       0.98      0.98      0.98      1452



##### The final SVM model achieved on overall accuracy of 98%, illustrating its good preditive capability. It perform well between class 0 2 and 3. The class 1 has a lower precision for class 1 suggests a potential overfitting compare to other classes.

##### Now i started to do the categorical encoding version

In [41]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Apply one-hot encoding to the entire dataset (except the target variable) and convert booleans to integers
car_encoded_df = pd.get_dummies(df, columns=df.columns[:-1]).astype(int)
# Initialize LabelEncoder
le = LabelEncoder()

# Label encode the target variable
car_encoded_df['class'] = le.fit_transform(df['class'])

# Display the first few rows of the encoded dataset
car_encoded_df.head()


Unnamed: 0,class,buying_0.0,buying_1.0,buying_2.0,buying_3.0,maint_0.0,maint_1.0,maint_2.0,maint_3.0,doors_0.0,...,doors_3.0,persons_0.0,persons_1.0,persons_2.0,lug_boot_0.0,lug_boot_1.0,lug_boot_2.0,safety_0.0,safety_1.0,safety_2.0
0,2,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,1,0,1,0
1,2,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,1,0,0,1
2,2,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,1,1,0,0
3,2,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,1,0,0,1,0
4,2,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,1,0,0,0,1


In [45]:
from imblearn.over_sampling import RandomOverSampler

# Step 1: Prepare Features and Target
X = car_encoded_df.drop(columns=['class']).reset_index(drop=True)  # Reset index before splitting
y = car_encoded_df['class'].reset_index(drop=True)

# Step 2: Train-Test Split BEFORE applying Over-Sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Step 3: Apply Random Over-Sampling only to Training Data
ros = RandomOverSampler(random_state=42)
X_train, y_train = ros.fit_resample(X_train, y_train)

# Step 4: Convert Resampled Data Back to DataFrame
X_train = pd.DataFrame(X_train, columns=X.columns)  # Restore column names
y_train= pd.Series(y_train, name='class')  # Restore target as a Pandas Series

# Step 5: Check Class Balance
balanced_classes = y_train.value_counts()
print("Balanced Class Distribution After Random Over-Sampling:\n", balanced_classes)

Balanced Class Distribution After Random Over-Sampling:
 class
2    847
0    847
1    847
3    847
Name: count, dtype: int64


1. Descision Tree

In [47]:
# Define Hyperparameter Grid
param_grid_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Outer CV for model evaluation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

nested_scores = []

# Nested Cross-Validation Loop
for train_idx, test_idx in outer_cv.split(X_train, y_train):
    # Ensure proper indexing (assuming X_train is a DataFrame and y_train is a Series)
    X_train_outer, X_test_outer = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_train_outer, y_test_outer = y_train.iloc[train_idx], y_train.iloc[test_idx]

    # Inner CV for hyperparameter tuning
    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    grid_search_dt = GridSearchCV(tree.DecisionTreeClassifier(random_state=42),param_grid_dt,cv=inner_cv,scoring='accuracy',n_jobs=-1)
    grid_search_dt.fit(X_train_outer, y_train_outer)
    
    best_model = grid_search_dt.best_estimator_
    y_pred = best_model.predict(X_test_outer)
    outer_accuracy = accuracy_score(y_test_outer, y_pred)
    nested_scores.append(outer_accuracy)

# Final Nested Cross-Validation Score
print("Average Nested CV Score:", sum(nested_scores) / len(nested_scores))


Average Nested CV Score: 0.9946854725210563


2. KNN

In [49]:
from sklearn.neighbors import KNeighborsClassifier

In [56]:
# K-NN Model with Nested Cross-Validation
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # Manhattan and Euclidean distances
}

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

nested_scores = []

for train_idx, test_idx in outer_cv.split(X_train, y_train):
    X_train_outer, X_test_outer = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_train_outer, y_test_outer = y_train.iloc[train_idx], y_train.iloc[test_idx]

    grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=inner_cv, scoring='accuracy', n_jobs=-1)
    grid_search_knn.fit(X_train_outer, y_train_outer)

    best_model = grid_search_knn.best_estimator_
    y_pred = best_model.predict(X_test_outer)
    nested_scores.append(grid_search_knn.best_score_)
    # print(classification_report(y_test_outer, y_pred))

print("Average Nested CV Score:", np.mean(nested_scores))


Average Nested CV Score: 0.9726991640451199


3. Naive Beyes

In [60]:
from sklearn.naive_bayes import GaussianNB


# Outer and inner cross-validation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

nested_scores = []
best_models = []

for train_idx, test_idx in outer_cv.split(X_train, y_train):
    X_train_outer, X_test_outer = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_train_outer, y_test_outer = y_train.iloc[train_idx], y_train.iloc[test_idx]

    # Naive Bayes does not have hyperparameters to tune extensively, so we fit directly
    nb_model = GaussianNB()
    nb_model.fit(X_train_outer, y_train_outer)

    best_models.append(nb_model)
    y_pred = nb_model.predict(X_test_outer)
    nested_scores.append(accuracy_score(y_test_outer, y_pred))

print("Average Nested CV Score:", np.mean(nested_scores))



Average Nested CV Score: 0.8645220323917334


4. SVM

In [65]:
param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

# Outer and inner cross-validation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

nested_scores = []
best_models = []

for train_idx, test_idx in outer_cv.split(X_train, y_train):
    X_train_outer, X_test_outer = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_train_outer, y_test_outer = y_train.iloc[train_idx], y_train.iloc[test_idx]

    # Grid search for hyperparameter tuning
    grid_search_svm = GridSearchCV(SVC(), param_grid_svm, cv=inner_cv, scoring='accuracy', n_jobs=-1)
    grid_search_svm.fit(X_train_outer, y_train_outer)

    best_models.append(nb_model)
    y_pred = nb_model.predict(X_test_outer)
    nested_scores.append(accuracy_score(y_test_outer, y_pred))

print("Average Nested CV Score:", np.mean(nested_scores))

Average Nested CV Score: 0.8645220323917334


5. Logistic Regresssion

In [68]:
# Logistic Regression with Nested Cross-Validation
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l2']
}

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

nested_scores = []
for train_idx, test_idx in outer_cv.split(X_train, y_train):
    X_train_outer, X_test_outer = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_train_outer, y_test_outer = y_train.iloc[train_idx], y_train.iloc[test_idx]

    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    grid_search_lr = GridSearchCV(LogisticRegression(max_iter=1000), param_grid_lr, cv=inner_cv, scoring='accuracy', n_jobs=-1)
    grid_search_lr.fit(X_train_outer, y_train_outer)

    best_model = grid_search_lr.best_estimator_
    y_pred = best_model.predict(X_test_outer)
    nested_scores.append(grid_search_lr.best_score_)
    
print("Average Nested CV Score:", np.mean(nested_scores))


Average Nested CV Score: 0.9650979037426867


##### Desision tree has the best Nested CV score, so the following section shows the decision tree grid search with testing dataset performance on the model

In [75]:
param_grid_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10]
}

grid_search_dt = GridSearchCV(tree.DecisionTreeClassifier(random_state=42), param_grid_dt, cv=5, scoring='accuracy')
grid_search_dt.fit(X_train, y_train)

# Retrieve the best parameters and the best model
best_params_dt = grid_search_dt.best_params_
best_model_dt = grid_search_dt.best_estimator_

# Evaluate the best model on the test set
y_pred_test_dt = best_model_dt.predict(X_test)
final_accuracy_dt = accuracy_score(y_test, y_pred_test_dt)

print("Best Parameters for Decision Tree:", best_params_dt)
print("Test Set Accuracy for Decision Tree:", final_accuracy_dt)
print(classification_report(y_test, y_pred_test_dt))

Best Parameters for Decision Tree: {'criterion': 'entropy', 'max_depth': 10}
Test Set Accuracy for Decision Tree: 0.9730250481695568
              precision    recall  f1-score   support

           0       0.95      0.93      0.94       115
           1       0.72      1.00      0.84        21
           2       1.00      0.98      0.99       363
           3       1.00      1.00      1.00        20

    accuracy                           0.97       519
   macro avg       0.92      0.98      0.94       519
weighted avg       0.98      0.97      0.97       519



#### Comparing with numerical and categorical data, SVM with numerical encoding stands out as the best performer.


#### Treating Ordinal Variables as Numeric / Map ordinal categories to integers and use directly in the model.
Pros:
Preserves Order: Models can leverage the ordinal relationships (e.g., "high" > "medium" > "low").
Simplifies Data: Reduces dimensionality,  leading to faster training and less memory usage.
Works Well with Distance-Based Models: Algorithms like SVM and K-NN that rely on distance metrics
Cons:
Assumes Uniform Distance: Implies equal spacing between categories, which may not reflect reality.

#### Treating Ordinal Variables as Categorical (One-Hot Encoding) / Approach: Convert each ordinal variable into binary dummy variables, representing the presence/absence of each category.
Pros:
No Assumed Order or Distance: Models are free from potentially misleading numeric assumptions.
Better for Tree-Based Models: Decision Trees and Random Forests can split on individual binary features, leading to clearer decision boundaries.
Cons:
Loss of Ordinal Information: The natural order between categories is not preserved.
Increased Dimensionality: One-hot encoding can slow down training and increase the risk of overfitting

#### The SVM with numeric encoding outperformed the Decision Tree with categorcial encoding, achieving higher accuracy and more balanced class-wise performance.