# Car Evaluator Predictor

## About Dataset

The Car Acceptability Classification dataset was derived from a simple hierarchical decision model originally developed to demonstrate DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990).

This is a multiclass classification dataset: given the attributes of a car, the task is to predict which acceptability class the car belongs to.

## Source

This dataset is available on Kaggle at the following link:

> https://www.kaggle.com/datasets/subhajeetdas/car-acceptability-classification-dataset/data

## Data Dictionary

* **Buying_Price**: Buying price of the car. Categorical data (v-high, high, med, low)
* **Maintenance_Price**: Maintenance price of the car. Categorical data (v-high, high, med, low)
* **No_of_Doors**: Number of doors in the car. Categorical data (2, 3, 4, 5-more)
* **Person_Capacity**: Number of persons the car can carry. Categorical data (2, 4, more)
* **Size_of_Luggage**: Size of the luggage boot in the car. Categorical data (small, med, big)
* **Safety**: Estimated safety of the car. Categorical data (low, med, high)
* **Car_Acceptability**: Acceptability of the car; this is the target. Categorical data (unacc: unacceptable, acc: acceptable, good: good, v-good: very good)
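
The file loaded later in this notebook already stores each category as an integer. The encoding step itself is not shown here, so the following is only a minimal sketch of one way to do it: each ordered category is mapped to 1..n with a plain dictionary. The `ORDINAL_MAPS` name, the chosen integer values, and the two sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical ordinal mappings (assumed, not taken from the original encoding step)
ORDINAL_MAPS = {
    "Buying_Price": {"low": 1, "med": 2, "high": 3, "v-high": 4},
    "Safety": {"low": 1, "med": 2, "high": 3},
}

# Two illustrative rows using the category labels from the data dictionary
raw = pd.DataFrame({
    "Buying_Price": ["v-high", "low"],
    "Safety": ["low", "high"],
})

# Replace each label with its integer rank, column by column
encoded = raw.copy()
for column, mapping in ORDINAL_MAPS.items():
    encoded[column] = raw[column].map(mapping)

print(encoded)
```

An ordinal mapping like this keeps the natural order of the labels (low < med < high), which a linear model such as logistic regression can exploit; one-hot encoding would be the alternative if no order should be assumed.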

## Problem Statement

1. **Model Training**: The objective of classification is to predict the acceptability class of a car.
2. **Model Evaluation**: The objective of model evaluation is to measure the performance of the trained model using accuracy, precision, recall, and F1 score.
3. **Hyperparameter Tuning**: The objective of hyperparameter tuning is to find the best-performing model by trying different hyperparameter settings.
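
Because the target has more than two classes, precision, recall, and F1 have to be averaged across classes. The sketch below shows what a support-weighted average means by computing weighted precision by hand and comparing it with scikit-learn's result. The toy labels are made up for illustration and are not taken from the car dataset.

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy labels for illustration only (not from the car dataset)
y_true = np.array([1, 1, 1, 2, 2, 3])
y_pred = np.array([1, 1, 2, 2, 2, 3])

# Per-class precision and class support (number of true samples per class)
per_class = precision_score(y_true, y_pred, average=None)
support = np.bincount(y_true)[1:]  # counts for classes 1, 2, 3

# Weighted average = sum(precision_c * support_c) / total number of samples
manual = np.sum(per_class * support) / support.sum()
weighted = precision_score(y_true, y_pred, average="weighted")

print(per_class, manual, weighted)
```

Here class 2 has a precision of 2/3 (one of its three predictions is wrong), and the weighted result works out to (3·1 + 2·(2/3) + 1·1) / 6 = 8/9.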

### Load Necessary Libraries

In [1]:
# General
import pandas as pd
import numpy as np
import warnings

# Preprocessing
from sklearn.model_selection import train_test_split

# Model & Evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hyperparameter
from sklearn.model_selection import GridSearchCV

### Settings

In [2]:
warnings.filterwarnings("ignore")

### Load Dataset

In [19]:
# csv_path = "car_encoded.csv"
csv_path = "car_oversampled.csv"
df = pd.read_csv(csv_path)

In [20]:
# Sanity check
df.head()

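| Unnamed: 0 | Buying_Price | Maintenance_Price | No_of_Doors | Person_Capacity | Size_of_Luggage | Safety | Car_Acceptability |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 4 | 2 | 2 | 1 | 1 | 1 |
| 1 | 4 | 4 | 2 | 2 | 1 | 2 | 1 |
| 2 | 4 | 4 | 2 | 2 | 1 | 3 | 1 |
| 3 | 4 | 4 | 2 | 2 | 2 | 1 | 1 |
| 4 | 4 | 4 | 2 | 2 | 2 | 2 | 1 |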
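
The `Unnamed: 0` column in the output above is what pandas produces when a CSV was saved together with its row index. Any positional slice that takes "all columns except the last" will treat it as an input feature. A possible way to avoid that, sketched here on a few rows copied from the output above rather than on the real file, is to read the first column as the index.

```python
import io
import pandas as pd

# First rows copied from the df.head() output, standing in for the CSV file
csv_text = """,Buying_Price,Maintenance_Price,No_of_Doors,Person_Capacity,Size_of_Luggage,Safety,Car_Acceptability
0,4,4,2,2,1,1,1
1,4,4,2,2,1,2,1
2,4,4,2,2,1,3,1
"""

# index_col=0 treats the unnamed first column as the row index, not as a feature
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The same positional slice now yields only the six real input features
X_demo = df_indexed.iloc[:, :-1]
print(list(X_demo.columns))
```

Whether this changes the scores reported in this notebook has not been checked here; it only shows how to keep the saved index out of the feature matrix.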


### Preprocessing 

In [21]:
# Separate input features (X) and target (y)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [22]:
# Split into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
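
The split above is purely random. When class proportions matter, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both parts. The following is a small sketch on a made-up imbalanced target (the `X_toy` and `y_toy` names are hypothetical), not a change to the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target for illustration: 80 samples of class 1, 20 of class 2
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [2] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class counts in the 20-sample test set follow the original 80/20 ratio
test_counts = np.bincount(y_te)[1:]
print(test_counts)
```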

## Model Training and Evaluation

In [23]:
# Initialize model
model = LogisticRegression()

# Train the model with data
model.fit(X_train, y_train)

In [24]:
# Define model evaluation function
def evaluate_model(true, pred):
    avg = "weighted"  # average per-class scores, weighted by class support
    print(f"Accuracy: {accuracy_score(true, pred): .2f}")
    print(f"Precision: {precision_score(true, pred, average=avg): .2f}")
    print(f"Recall: {recall_score(true, pred, average=avg): .2f}")
    print(f"F1: {f1_score(true, pred, average=avg): .2f}")

In [25]:
# Predict training data
y_train_pred = model.predict(X_train)
evaluate_model(y_train, y_train_pred)

Accuracy:  0.89
Precision:  0.89
Recall:  0.89
F1:  0.89


In [26]:
# Predict Test data
y_test_pred = model.predict(X_test)
evaluate_model(y_test, y_test_pred)

Accuracy:  0.87
Precision:  0.87
Recall:  0.87
F1:  0.86
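
The weighted averages printed by `evaluate_model` summarize all classes in one number, which can hide a class that is predicted poorly. A per-class view is available through `confusion_matrix` and `classification_report`. The sketch below uses made-up labels, and the class codes 1 to 4 are an assumed encoding of the four acceptability classes.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels for illustration; codes 1-4 are an assumed encoding of the classes
y_true_demo = [1, 1, 1, 2, 2, 3, 4]
y_pred_demo = [1, 1, 2, 2, 2, 3, 3]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3, 4])
print(cm)

# Precision, recall, and F1 per class; zero_division=0 avoids warnings
# for classes that are never predicted (class 4 in this toy example)
print(classification_report(y_true_demo, y_pred_demo, zero_division=0))
```

In this toy example the single class-4 sample is predicted as class 3, which shows up as an off-diagonal entry in the last row of the matrix.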


### Hyperparameter Tuning

In [36]:
# Define Parameter Grid
params = {
    "penalty": ["l1", "l2", "none"],
    "max_iter": [100, 150, 200, 250, 300],
    "solver": ["lbfgs", "liblinear", "sag"]
}
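
Not every solver in scikit-learn's `LogisticRegression` supports every penalty: `lbfgs` and `sag` handle only `l2` or no penalty, while `liblinear` handles `l1` and `l2`. A full cross-product therefore contains combinations that cannot be fitted. `GridSearchCV` also accepts a list of dictionaries, so one possible alternative is to search only compatible combinations, as sketched below. The `compatible_params` name is hypothetical, and the no-penalty option is left out because its spelling depends on the scikit-learn version.

```python
from sklearn.model_selection import ParameterGrid

max_iters = [100, 150, 200, 250, 300]

# One sub-grid per group of solvers that share the same supported penalties
compatible_params = [
    {"solver": ["lbfgs", "sag"], "penalty": ["l2"], "max_iter": max_iters},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "max_iter": max_iters},
]

# ParameterGrid expands the sub-grids the same way GridSearchCV would
n_candidates = len(ParameterGrid(compatible_params))
print(n_candidates)
```

This list can be passed to `GridSearchCV` in place of a single dictionary; it yields 20 candidates, all of which can actually be fitted.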

In [37]:
# Initialize Estimator
lr = LogisticRegression()

# Setup Hyperparameter Tuning
gscv = GridSearchCV(lr, params, cv=10, verbose=1)

# Train the model with dataset
gscv.fit(X, y)

Fitting 10 folds for each of 45 candidates, totalling 450 fits
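
The search above is fitted on the full `X` and `y`, so the rows later used as the test set also take part in choosing the hyperparameters. A common variant is to tune on the training split only and keep the test split for the final check. The sketch below illustrates that variant on synthetic data; the data, the variable names, and the small `C` grid are assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: the real X, y come from the car dataset)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

search = GridSearchCV(
    LogisticRegression(max_iter=300), {"C": [0.1, 1.0, 10.0]}, cv=5
)
search.fit(X_tr, y_tr)  # only the training split is used for tuning

# best_estimator_ is refit on the whole training split (refit=True by default),
# so it can be scored directly on the untouched test split
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 2))
```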


In [38]:
# Getting Best parameters and Scores
best_params = gscv.best_params_
print(best_params)
print(f"Best Accuracy: {gscv.best_score_}")

{'max_iter': 150, 'penalty': 'l2', 'solver': 'sag'}
Best Accuracy: 0.8719008264462811
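
`best_score_` is the mean cross-validated score of the winning candidate only. The scores of all candidates are stored in `cv_results_`, which can be loaded into a DataFrame to see how close the runners-up are. The sketch below runs a small search on synthetic data for illustration; the data and the reduced grid are assumptions, not the search from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary data, used only to have something to search over
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    LogisticRegression(),
    {"max_iter": [100, 200], "solver": ["lbfgs", "liblinear"]},
    cv=5,
)
search.fit(X_syn, y_syn)

# One row per candidate, best-ranked first
results = pd.DataFrame(search.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["param_solver", "param_max_iter", "mean_test_score", "std_test_score"]
]
print(ranked)
```

Candidates whose fits fail appear with a `NaN` mean score under the default `error_score` setting, which makes this table a quick way to spot invalid parameter combinations.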


In [39]:
model = LogisticRegression(**best_params)
# Train the model
model.fit(X_train, y_train)

In [40]:
# Predict and evaluate the tuned model on training data
y_train_pred = model.predict(X_train)
evaluate_model(y_train, y_train_pred)

Accuracy:  0.88
Precision:  0.88
Recall:  0.88
F1:  0.88


In [41]:
# Predict and evaluate the tuned model on test data
y_test_pred = model.predict(X_test)
evaluate_model(y_test, y_test_pred)

Accuracy:  0.87
Precision:  0.86
Recall:  0.87
F1:  0.86
