<a href="https://colab.research.google.com/github/eju1377/car-evaluation-machine-learning/blob/main/Machine_Learning_on_Car_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set Up and Preprocessing

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Import Dataset From UCI Machine Learning Repository

In [None]:
%pip install ucimlrepo

In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
car_evaluation = fetch_ucirepo(id=19)

# data (as pandas dataframes)
X = car_evaluation.data.features
y = car_evaluation.data.targets

# metadata
print(car_evaluation.metadata)

# variable information
print(car_evaluation.variables)


In [None]:
# combine features and targets into one dataframe
df = pd.concat([X, y], axis=1)

### EDA


In [None]:
df.head()

In [None]:
df.info()

In [None]:
!pip -q install ydata-profiling

In [None]:
from ydata_profiling import ProfileReport

profile = ProfileReport(
    df,
    title="Car Evaluation EDA",  # name your EDA report here
    explorative=True,
    minimal=False,
)

profile.to_file("car_data_eda1.html")

In [None]:
from google.colab import files

files.download("car_data_eda1.html")

In [None]:
# Source - https://stackoverflow.com/a/55329863
# Posted by Tal
# Retrieved 2026-02-18, License - CC BY-SA 4.0

from IPython.display import HTML

# Render the saved profiling report inline in the notebook
HTML(filename="/content/car_data_eda1.html")


In [None]:
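# Pass the column by keyword so seaborn plots the count of each class on the x-axis
sns.countplot(data=df, x='class')
plt.title('Distribution of Class')
plt.show()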

The target variable (class) is imbalanced, so we need to account for this when splitting the data and fitting the models. The target variable is mainly correlated with safety and persons.
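
One common way to account for class imbalance is to weight each class inversely to its frequency, which is what scikit-learn's `class_weight='balanced'` option does. The sketch below reproduces that weighting on a small set of made-up labels; `y_toy` and its counts are illustrative assumptions, not the real class counts of this dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels only (not the real class counts of the dataset)
y_toy = np.array(["unacc"] * 6 + ["acc"] * 3 + ["good"] * 1)
classes = np.unique(y_toy)  # sorted: ['acc', 'good', 'unacc']

# 'balanced' weight for class c = n_samples / (n_classes * count(c))
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weights = dict(zip(classes, weights))
print(class_weights)
```

Rare classes receive larger weights, so misclassifying them costs more during training.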

### Preprocessing

In [None]:
# Drop duplicates
df = df.drop_duplicates()

In [None]:
# Check for null values
print(df.isnull().sum())

In [None]:
# Encode categorical variables manually to avoid incorrect ordering
buying_order = {'low' : 0, 'med' : 1, 'high' : 2, 'vhigh' : 3}
df['buying'] = df['buying'].map(buying_order)

maint_order = {'low' : 0, 'med' : 1, 'high' : 2, 'vhigh' : 3}
df['maint'] = df['maint'].map(maint_order)

doors_order = {'2' : 0, '3' : 1, '4' : 2, '5more' : 3}
df['doors'] = df['doors'].map(doors_order)

persons_order = {'2' : 0, '4' : 1, 'more' : 2}
df['persons'] = df['persons'].map(persons_order)

lug_boot_order = {'small' : 0, 'med' : 1, 'big' : 2}
df['lug_boot'] = df['lug_boot'].map(lug_boot_order)

safety_order = {'low' : 0, 'med' : 1, 'high' : 2}
df['safety'] = df['safety'].map(safety_order)

class_order = {'unacc' : 0, 'acc' : 1, 'good' : 2, 'vgood' : 3}
df['class'] = df['class'].map(class_order)

# Check DataFrame is properly encoded
df.head()
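
Because `Series.map` silently returns NaN for any value missing from the dictionary, it can be worth checking that every category was mapped. The following is a minimal sketch on a hypothetical toy frame (`toy` and the misspelled value are assumptions made for illustration); the same check could be applied to each encoded column of `df`.

```python
import pandas as pd

# Hypothetical toy frame; 'vhihg' is a deliberate typo to show the failure mode
toy = pd.DataFrame({"buying": ["low", "med", "vhihg", "high"]})
buying_order = {"low": 0, "med": 1, "high": 2, "vhigh": 3}

encoded = toy["buying"].map(buying_order)

# Series.map returns NaN for any value that is missing from the dictionary
unmapped = toy.loc[encoded.isna(), "buying"].unique().tolist()
print(unmapped)
```

An empty list means every value had an entry in the mapping.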

In [None]:
from sklearn.model_selection import train_test_split

# Splitting data into features and target
X = df.drop('class', axis = 1)
y = df['class']

# Splitting data into train and test sets
# Making sure to stratify
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify=y, random_state = 42)
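
To see what `stratify` does, the sketch below splits a small synthetic target and compares class proportions in the two subsets. The names `toy_X` and `toy_y` and the 80/16/4 class counts are assumptions chosen for illustration, not values from this dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 / 16 / 4 samples per class (illustrative values only)
toy_y = pd.Series([0] * 80 + [1] * 16 + [2] * 4, name="class")
toy_X = pd.DataFrame({"feature": range(len(toy_y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=42
)

# With stratify, the class shares should be nearly the same in both subsets
proportions = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True).sort_index(),
    "test": y_te.value_counts(normalize=True).sort_index(),
})
print(proportions)
```

Without `stratify`, a rare class could end up underrepresented in, or missing from, the test set.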

## Entropy Model Implementation and Evaluation

The goal for this model is to predict the class of a car based on given features.

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Instantiate the model
dt = DecisionTreeClassifier(criterion = 'entropy', class_weight= 'balanced', random_state = 42)

# Fit the model
dt.fit(X_train, y_train)

# Predict results using test data
y_e_dt_pred = dt.predict(X_test)

In [None]:
from sklearn.tree import plot_tree

plt.figure(figsize = (20, 10))
plot_tree(
    dt,
    filled=True,
    feature_names=X_train.columns,
    class_names=['Unacceptable', 'Acceptable', 'Good', 'Very Good'],
    max_depth=2,
)
plt.title("Decision Tree")
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_e_dt_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=dt.classes_)
disp.plot(cmap = plt.cm.Blues)
plt.title('Entropy Decision Tree Confusion Matrix')
plt.show()


In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_e_dt_pred))

### Analyzing Our Results

In [None]:
baseline_accuracy = df['class'].value_counts(normalize = True).max()

print(f"Baseline Accuracy: {baseline_accuracy:.2%}")

The accuracy of this model is well above the baseline accuracy for this dataset, and the precision and recall are both high and balanced. Scores this high could indicate overfitting to the dataset as a whole, but I don't have any other data to validate this. The confusion matrix shows that there are 10 incorrect predictions. With test_size = 0.25 on the 1,728-row dataset, the test set has 432 samples, so that is about 2.3% of the test set (0.58% when measured against the full dataset).
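
When no additional data is available, stratified k-fold cross-validation is one way to check whether a score depends on a single lucky split. The sketch below is not part of the original analysis: it runs on synthetic stand-in data from `make_classification` (an assumption made so the example is self-contained), and in this notebook the encoded `X` and `y` would take its place.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook this would be the encoded X and y
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.7, 0.2, 0.1],
    random_state=42,
)

model = DecisionTreeClassifier(
    criterion="entropy", class_weight="balanced", random_state=42
)
# Each fold keeps roughly the same class proportions as the full data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

A small spread across folds would suggest the test-set score is not an artifact of one particular split.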

## Gini Model Implementation and Evaluation

The goal of this model is to predict the class of a car based on given features.

In [None]:
# Instantiate the model
dt_gini = DecisionTreeClassifier(criterion = 'gini', class_weight= 'balanced', random_state = 42)

# Fit the model
dt_gini.fit(X_train, y_train)

# Predict results using test data
y_dt_gini_pred = dt_gini.predict(X_test)

In [None]:
from sklearn.tree import plot_tree

plt.figure(figsize = (20, 10))
plot_tree(
    dt_gini,
    filled=True,
    feature_names=X_train.columns,
    class_names=['Unacceptable', 'Acceptable', 'Good', 'Very Good'],
    max_depth=3,
)
plt.title("Decision Tree Using Gini")
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_dt_gini_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=dt_gini.classes_)
disp.plot(cmap = plt.cm.Blues)
plt.title('Gini Decision Tree Confusion Matrix')
plt.show()

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_dt_gini_pred))

### Analyzing Our Results

In [None]:
baseline_accuracy = df['class'].value_counts(normalize = True).max()

print(f"Baseline Accuracy: {baseline_accuracy:.2%}")

These results are practically identical to those of the entropy model. The Gini criterion offers no significant advantage here other than computational efficiency, since it avoids computing logarithms.
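
To make the difference between the two criteria concrete, the sketch below computes Gini impurity and entropy for a few class distributions. The helper names `gini` and `entropy` are written for this illustration and are not scikit-learn functions.

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_i^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # avoid log2(0)
    return -np.sum(p * np.log2(p))

# Both measures are 0 for a pure node and largest for an even split
for dist in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(dist, round(gini(dist), 3), round(entropy(dist), 3))
```

Both measures rank these distributions in the same order, which is consistent with the two trees producing very similar results.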

## Random Forest Model Implementation and Evaluation

A random forest trains multiple decision trees on bootstrap samples of the training data and combines their votes, which helps reduce overfitting.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate the model
classifier = RandomForestClassifier(n_estimators=10, criterion = 'entropy', random_state=42)

# Fit the model
classifier.fit(X_train, y_train)

# Predict results using test data
y_rand_pred = classifier.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_rand_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classifier.classes_)
disp.plot(cmap = plt.cm.Blues)
plt.title('Random Forest Confusion Matrix')
plt.show()

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_rand_pred))

### Analyzing Our Results

The Random Forest model was slightly better than the decision tree models. The incorrect predictions are spread across different labels rather than concentrated in one class, which suggests the model is not systematically failing on any single class. It incorrectly predicted 0.52% of the labels when measured against the full dataset; relative to the 432-sample test set, that is about 2.1%.
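
The misclassification rate is the number of off-diagonal confusion-matrix entries divided by the number of test samples. The sketch below shows that arithmetic on a hypothetical 3-class matrix; `cm_demo` and its values are invented for illustration and are not results from this notebook.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true, columns = predicted)
cm_demo = np.array([
    [50, 2, 0],
    [1, 20, 1],
    [0, 1, 5],
])

n_test = cm_demo.sum()         # total number of test samples
n_correct = np.trace(cm_demo)  # diagonal entries are correct predictions
n_errors = n_test - n_correct
error_rate = n_errors / n_test

print(f"{n_errors} errors out of {n_test} test samples = {error_rate:.2%}")
```

Applying the same steps to the `cm` arrays computed above gives an error rate relative to the test set rather than the full dataset.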

## Conclusion

The Random Forest model performed best on this dataset. On the held-out test set, it correctly predicted more labels than either of the Decision Tree models.