# About the dataset

The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure:

CAR: car acceptability
- PRICE: overall price
  - buying: buying price
  - maint: price of the maintenance
- TECH: technical characteristics
  - COMFORT: comfort
    - doors: number of doors
    - persons: capacity in terms of persons to carry
    - lug_boot: the size of luggage boot
  - safety: estimated safety of the car

---

The Car Evaluation Database contains examples with the structural information removed, i.e., it directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety.

Because the underlying concept structure is known, this database may be particularly useful for testing constructive induction and structure discovery methods.
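
As a rough illustration of how the six input attributes relate to the concept structure, the hierarchy can be written as a nested dictionary whose leaves are the attributes. This is only a sketch: the dictionary layout and the helper name `leaf_attributes` are assumptions made for illustration, not part of the dataset or of DEX.

```python
# Concept hierarchy from the description above: inner nodes are dicts,
# leaves (the six input attributes) map to their descriptions.
CAR = {
    "PRICE": {
        "buying": "buying price",
        "maint": "price of the maintenance",
    },
    "TECH": {
        "COMFORT": {
            "doors": "number of doors",
            "persons": "capacity in terms of persons to carry",
            "lug_boot": "the size of luggage boot",
        },
        "safety": "estimated safety of the car",
    },
}


def leaf_attributes(node):
    """Collect leaf names in depth-first order (hypothetical helper)."""
    leaves = []
    for name, child in node.items():
        if isinstance(child, dict):
            leaves.extend(leaf_attributes(child))
        else:
            leaves.append(name)
    return leaves


input_attributes = leaf_attributes(CAR)
print(input_attributes)
```

Removing the structural information amounts to keeping only these leaves and the top-level CAR label, which is the flat layout the dataset uses.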

---
Attribute Information:

Class Values:

(unacc, acc, good, vgood) for (unacceptable, acceptable, good, very good)

Attributes:

- buying: vhigh, high, med, low.
- maint: vhigh, high, med, low.
- doors: 2, 3, 4, 5more.
- persons: 2, 4, more.
- lug_boot: small, med, big.
- safety: low, med, high.
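
The attribute values listed above define a small, finite attribute space. The sketch below enumerates it with `itertools.product`; the original UCI release is documented as having 1728 instances that completely cover this space, which matches the product of the value counts. The variable names are chosen for illustration.

```python
from itertools import product

# Attribute values exactly as listed in the dataset description.
attribute_values = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Every possible car description is one element of the Cartesian product.
combinations = list(product(*attribute_values.values()))
print(len(combinations))  # 4 * 4 * 4 * 3 * 3 * 3 = 1728
```

A loaded DataFrame with fewer than 1728 rows is therefore worth a second look, since it suggests that at least one combination went missing during loading.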

---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [4]:
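col_names = ['buying_price', 'maintenance_cost', 'no_of_doors', 'no_of_persons', 'lug_boot', 'safety', 'decision']

# NOTE: skiprows=1 discards the first line of the file. The original UCI file has no
# header row, so if this CSV matches it, one record is lost (1727 rows instead of 1728).
# In that case, use header=None instead of skiprows=1.
df = pd.read_csv('car_evaluation.csv', skiprows=1, names=col_names)
df.head()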

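|   | buying_price | maintenance_cost | no_of_doors | no_of_persons | lug_boot | safety | decision |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 2 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | med | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | high | unacc |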


In [5]:
df.shape

(1727, 7)
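
The shape `(1727, 7)` is one row short of the 1728 instances in the original UCI dataset. One possible cause, assuming the CSV has no header line, is that `skiprows=1` discards the first record. The sketch below reproduces the effect on a hypothetical three-line sample in the headerless UCI layout; the sample text and the variable names `skipped` and `kept` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical sample: three records, no header line.
sample = (
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "vhigh,vhigh,2,2,small,med,unacc\n"
    "vhigh,vhigh,2,2,small,high,unacc\n"
)
col_names = ['buying_price', 'maintenance_cost', 'no_of_doors',
             'no_of_persons', 'lug_boot', 'safety', 'decision']

# skiprows=1 drops the first line, which here is a real record.
skipped = pd.read_csv(io.StringIO(sample), skiprows=1, names=col_names)

# header=None keeps every line as data and still applies the column names.
kept = pd.read_csv(io.StringIO(sample), header=None, names=col_names)

print(len(skipped), len(kept))
print(skipped.iloc[0]['safety'], kept.iloc[0]['safety'])
```

If the real file does start with a header line, `skiprows=1` is the right call and the missing row has another cause; checking the first line of the file settles which case applies.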

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1727 entries, 0 to 1726
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   buying_price      1727 non-null   object
 1   maintenance_cost  1727 non-null   object
 2   no_of_doors       1727 non-null   object
 3   no_of_persons     1727 non-null   object
 4   lug_boot          1727 non-null   object
 5   safety            1727 non-null   object
 6   decision          1727 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


In [7]:
for col in df.columns:
    print(df[col].value_counts())

buying_price
high     432
med      432
low      432
vhigh    431
Name: count, dtype: int64
maintenance_cost
high     432
med      432
low      432
vhigh    431
Name: count, dtype: int64
no_of_doors
3        432
4        432
5more    432
2        431
Name: count, dtype: int64
no_of_persons
4       576
more    576
2       575
Name: count, dtype: int64
lug_boot
med      576
big      576
small    575
Name: count, dtype: int64
safety
med     576
high    576
low     575
Name: count, dtype: int64
decision
unacc    1209
acc       384
good       69
vgood      65
Name: count, dtype: int64
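
The `decision` counts above show a strongly imbalanced target. As a quick sanity check, the sketch below turns the printed counts into proportions and computes the accuracy a classifier would reach by always predicting the majority class. The counts are copied from the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output for 'decision'.
class_counts = {"unacc": 1209, "acc": 384, "good": 69, "vgood": 65}
total = sum(class_counts.values())

for label, count in class_counts.items():
    print(f"{label:>5}: {count:4d} ({count / total:.1%})")

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = max(class_counts.values()) / total
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Because roughly 70% of the rows are `unacc`, plain accuracy should be read alongside per-class precision and recall, especially for `good` and `vgood`.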


In [8]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Label-encode each feature column in place and print the resulting mapping.
# LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels.
feature_cols = ['buying_price', 'maintenance_cost', 'no_of_doors', 'no_of_persons', 'lug_boot', 'safety']
for col in feature_cols:
    df[col] = encoder.fit_transform(df[col])
    print(f"{[col]} {dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))}")


['buying_price'] {'high': 0, 'low': 1, 'med': 2, 'vhigh': 3}
['maintenance_cost'] {'high': 0, 'low': 1, 'med': 2, 'vhigh': 3}
['no_of_doors'] {'2': 0, '3': 1, '4': 2, '5more': 3}
['no_of_persons'] {'2': 0, '4': 1, 'more': 2}
['lug_boot'] {'big': 0, 'med': 1, 'small': 2}
['safety'] {'high': 0, 'low': 1, 'med': 2}
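
The mappings printed above follow alphabetical order, so for example `safety` becomes `high: 0, low: 1, med: 2`, which does not reflect the natural ranking of the levels. A possible alternative is an explicit ordinal mapping, sketched below. The orderings (lowest to highest) are an assumption based on the attribute descriptions, and `ordered_levels` and `ordinal_maps` are hypothetical names.

```python
# Assumed natural orderings, lowest to highest (not taken from the notebook).
ordered_levels = {
    "buying_price": ["low", "med", "high", "vhigh"],
    "maintenance_cost": ["low", "med", "high", "vhigh"],
    "no_of_doors": ["2", "3", "4", "5more"],
    "no_of_persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Build one {level: code} dictionary per column, with codes following the order.
ordinal_maps = {
    col: {level: code for code, level in enumerate(levels)}
    for col, levels in ordered_levels.items()
}

print(ordinal_maps["safety"])
print(ordinal_maps["buying_price"])
```

With a DataFrame of string values, each column could then be encoded with `df[col].map(ordinal_maps[col])`, or the same lists could be passed to scikit-learn's `OrdinalEncoder` through its `categories` parameter. A decision tree can still split on alphabetical codes, so whether this changes the results here is not something the notebook tests.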


In [9]:
df.head()

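|   | buying_price | maintenance_cost | no_of_doors | no_of_persons | lug_boot | safety | decision |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 0 | 0 | 2 | 2 | unacc |
| 1 | 3 | 3 | 0 | 0 | 2 | 0 | unacc |
| 2 | 3 | 3 | 0 | 0 | 1 | 1 | unacc |
| 3 | 3 | 3 | 0 | 0 | 1 | 2 | unacc |
| 4 | 3 | 3 | 0 | 0 | 1 | 0 | unacc |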


In [10]:
# separating the dataset to features and target
X = df.drop(['decision'], axis=1)
y = df['decision']

In [11]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [12]:
X_train.shape, X_test.shape

((1157, 6), (570, 6))
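
The split sizes above can be checked by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of test rows and that the training set receives the remainder; this assumption is consistent with the shapes printed above.

```python
import math

n_samples = 1727   # rows in df, from df.shape
test_size = 0.33   # fraction passed to train_test_split

# Assumed rounding rule: test rows are rounded up, training rows get the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The six columns in each shape are the feature columns left after dropping `decision`.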

In [13]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score

# Define parameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 4, 6, 8, 10, 12, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    verbose=2,
    n_jobs=-1
)

# Fit the search
grid_search.fit(X_train, y_train)

# Extract the best parameters
print("Best Parameters:", grid_search.best_params_)
best_model = grid_search.best_estimator_

import joblib

# Save the model after training
joblib.dump(best_model, 'best_decision_tree_model.pkl')


# Evaluate the optimized model
y_test_pred = best_model.predict(X_test)
print(f"Accuracy on Test Data: {accuracy_score(y_test, y_test_pred)}")
print("Classification Report:")
print(classification_report(y_test, y_test_pred))
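
The size of the search follows directly from the parameter grid: every combination of values is one candidate, and each candidate is fitted once per cross-validation fold. The sketch below counts them with `itertools.product`, using the same grid and `cv=5` as in the cell above.

```python
from itertools import product

# Same grid as passed to GridSearchCV.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 4, 6, 8, 10, 12, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5

# One candidate per combination of parameter values: 2 * 7 * 3 * 3.
candidates = list(product(*param_grid.values()))
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
```

These are the 126 candidates and 630 fits reported in the search log. The final refit of the best estimator on the full training set is an additional fit that the log does not count.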


Fitting 5 folds for each of 126 candidates, totalling 630 fits
Best Parameters: {'criterion': 'entropy', 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2}
Accuracy on Test Data: 0.9596491228070175
Classification Report:
              precision    recall  f1-score   support

         acc       0.96      0.90      0.93       127
        good       0.62      0.83      0.71        18
       unacc       0.99      0.99      0.99       399
       vgood       0.80      0.92      0.86        26

    accuracy                           0.96       570
   macro avg       0.84      0.91      0.87       570
weighted avg       0.96      0.96      0.96       570
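
The two average rows in the report differ in how they treat class size: the macro average weights all four classes equally, while the weighted average weights each class by its support. The sketch below recomputes both from the per-class rows above. Because the inputs are already rounded to two decimals, the results should agree with the report only to about two decimals. The `report` dictionary is transcribed by hand from the output.

```python
# (precision, recall, f1-score, support) per class, copied from the report.
report = {
    "acc":   (0.96, 0.90, 0.93, 127),
    "good":  (0.62, 0.83, 0.71, 18),
    "unacc": (0.99, 0.99, 0.99, 399),
    "vgood": (0.80, 0.92, 0.86, 26),
}
total_support = sum(row[3] for row in report.values())

# Macro average: unweighted mean over classes.
macro = [sum(row[i] for row in report.values()) / len(report) for i in range(3)]

# Weighted average: mean over classes weighted by support.
weighted = [
    sum(row[i] * row[3] for row in report.values()) / total_support
    for i in range(3)
]

print("macro avg:   ", " ".join(f"{value:.4f}" for value in macro))
print("weighted avg:", " ".join(f"{value:.4f}" for value in weighted))

# Number of correct test predictions implied by the reported accuracy.
n_correct = round(0.9596491228070175 * total_support)
print(n_correct, "of", total_support)
```

The macro averages are noticeably lower than the weighted ones because the small `good` and `vgood` classes, which the model handles less well, count as much as the dominant `unacc` class.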



---