# Car Evaluator Predictor

## About Dataset

The Car Acceptability Classification Database was derived from a simple hierarchical decision model originally developed to demonstrate DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990).

This is a multiclass classification dataset: each car is assigned one of several acceptability classes based on its price, capacity, and safety attributes.

## Source

This dataset is available on Kaggle at the following link:

> https://www.kaggle.com/datasets/subhajeetdas/car-acceptability-classification-dataset/data

## Data Dictionary

* **Buying_Price**: Buying price of the car. Categorical data (vhigh, high, med, low)
* **Maintenance_Price**: Maintenance cost of the car. Categorical data (vhigh, high, med, low)
* **No_of_Doors**: Number of doors in the car. Categorical data (2, 3, 4, 5more)
* **Person_Capacity**: Number of persons the car can carry. Categorical data (2, 4, more)
* **Size_of_Luggage**: Size of the luggage boot. Categorical data (small, med, big)
* **Safety**: Estimated safety of the car. Categorical data (low, med, high)
* **Car_Acceptability**: Target variable. (unacc: unacceptable, acc: acceptable, good: good, vgood: very good)
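
As a quick check that a loaded frame matches this dictionary, the observed values in each column can be compared against the expected levels. The sketch below is only an illustration: it uses a tiny hand-made frame instead of `car.csv`, and `expected_levels` and `unexpected_values` are hypothetical names that do not appear in the notebook. Levels are spelled as they appear in the raw data preview (`vhigh`, `5more`, `vgood`).

```python
import pandas as pd

# Expected category levels per column, spelled as in the raw data.
expected_levels = {
    "Buying_Price": {"vhigh", "high", "med", "low"},
    "Maintenance_Price": {"vhigh", "high", "med", "low"},
    "No_of_Doors": {"2", "3", "4", "5more"},
    "Person_Capacity": {"2", "4", "more"},
    "Size_of_Luggage": {"small", "med", "big"},
    "Safety": {"low", "med", "high"},
    "Car_Acceptability": {"unacc", "acc", "good", "vgood"},
}


def unexpected_values(frame, levels):
    """Return {column: values not listed in `levels`} for columns with surprises."""
    problems = {}
    for column, allowed in levels.items():
        seen = set(frame[column].astype(str).unique())
        extra = seen - allowed
        if extra:
            problems[column] = extra
    return problems


# Toy stand-in for the real file (assumption: same column names).
sample = pd.DataFrame({
    "Buying_Price": ["vhigh", "low"],
    "Maintenance_Price": ["vhigh", "med"],
    "No_of_Doors": ["2", "5more"],
    "Person_Capacity": ["2", "more"],
    "Size_of_Luggage": ["small", "big"],
    "Safety": ["low", "high"],
    "Car_Acceptability": ["unacc", "v-good"],  # deliberately misspelled label
})

problems = unexpected_values(sample, expected_levels)
print(problems)
```

An empty result would mean every observed value is one of the documented levels.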

## Problem Statement

1. **Feature Engineering**: The objective of feature engineering is to encode the categorical features into numerical values.
2. **Feature Selection**: The objective of feature selection is to select the most influential features for prediction.
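
One possible way to score how influential a categorical feature is, offered here as an assumption rather than the method used in this notebook, is the mutual information between the feature and the target. The sketch below computes it from a normalized contingency table on a toy frame; `mutual_information` and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd


def mutual_information(feature, target):
    """Mutual information (in nats) between two categorical Series."""
    joint = pd.crosstab(feature, target, normalize=True).to_numpy()
    p_feature = joint.sum(axis=1, keepdims=True)  # marginal over target
    p_target = joint.sum(axis=0, keepdims=True)   # marginal over feature
    independent = p_feature @ p_target            # joint distribution if independent
    nonzero = joint > 0                           # skip empty cells (0 * log 0 = 0)
    return float((joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])).sum())


# Toy data: one feature determines the target, the other is unrelated to it.
toy = pd.DataFrame({
    "informative": [1, 1, 2, 2],
    "unrelated": [1, 2, 1, 2],
    "target": [1, 1, 2, 2],
})

scores = {
    name: mutual_information(toy[name], toy["target"])
    for name in ["informative", "unrelated"]
}
print(scores)
```

Higher scores indicate features that carry more information about the target; a score near zero suggests the feature could be a candidate for removal.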

### Load Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
import warnings

### Settings

In [2]:
warnings.filterwarnings("ignore")

## Load Dataset

In [14]:
csv_path = "car.csv"
df = pd.read_csv(csv_path)

In [15]:
# Check 1st 5 rows
df.head()

Unnamed: 0,Buying_Price,Maintenance_Price,No_of_Doors,Person_Capacity,Size_of_Luggage,Safety,Car_Acceptability
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


### Feature Encoding

1. Convert the categorical features to their corresponding numeric values.
2. All the features are ordinal, so we encode each one by mapping its categories to integers that preserve their order.
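
A caveat with `Series.map` is that any label missing from the dictionary silently becomes `NaN`. The following is a minimal sketch of an ordinal encoder that fails loudly instead. The helper `encode_ordinal` is hypothetical and is shown on a toy Series rather than the real data.

```python
import pandas as pd


def encode_ordinal(series, mapping):
    """Map ordered category labels to integers; raise on labels missing from `mapping`."""
    encoded = series.map(mapping)
    unmapped = series[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels in {series.name!r}: {list(unmapped)}")
    return encoded.astype(int)


safety_order = {"low": 1, "med": 2, "high": 3}  # low < med < high
safety = pd.Series(["low", "high", "med"], name="Safety")

encoded = encode_ordinal(safety, safety_order)
print(encoded.tolist())
```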

In [20]:
df["Car_Acceptability"].value_counts()

Car_Acceptability
unacc    1210
acc       384
good       69
vgood      65
Name: count, dtype: int64
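
To make the imbalance in these counts concrete, the class shares and the majority-to-minority ratio can be derived directly from them. This short sketch re-enters the counts shown above by hand; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}, name="count")

total = counts.sum()                            # total number of rows
share = (counts / total).round(4)               # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()   # largest class vs. smallest class

print(total)
print(share)
print(round(imbalance_ratio, 2))
```

On the real frame, `df["Car_Acceptability"].value_counts(normalize=True)` gives the same shares in one call.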

In [21]:
# Encode Buying_Price and Maintenance_Price
price_dict = {
    "low": 1,
    "med": 2,
    "high": 3,
    "vhigh": 4
}
df["Buying_Price"] = df["Buying_Price"].map(price_dict)
df["Maintenance_Price"] = df["Maintenance_Price"].map(price_dict)



In [22]:
# Encode Doors 
doors_dict = {
    "2": 2,
    "3": 3,
    "4": 4,
    "5more": 5
}
df["No_of_Doors"] = df["No_of_Doors"].map(doors_dict)


In [23]:
# Encode Person_Capacity
capacity_dict = {
    "2": 2,
    "4": 4,
    "more": 5
}
df["Person_Capacity"] = df["Person_Capacity"].map(capacity_dict)
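
The string keys in `doors_dict` and `capacity_dict` only match if pandas reads these columns as strings. That should be the case here, since values such as `5more` and `more` prevent numeric parsing, but if a column were ever parsed as integers, `.map` would return `NaN` for every row. A defensive sketch on a hypothetical toy Series:

```python
import pandas as pd

doors_dict = {"2": 2, "3": 3, "4": 4, "5more": 5}

# Hypothetical case: a subset without "5more" that pandas parsed as integers.
doors_as_int = pd.Series([2, 3, 4], name="No_of_Doors")

# Integer values never match the string keys, so every entry becomes NaN.
naive = doors_as_int.map(doors_dict)

# Casting to str first makes the lookup independent of the parsed dtype.
robust = doors_as_int.astype(str).map(doors_dict)

print(naive.isna().all())
print(robust.tolist())
```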

In [24]:
# Encode Size_of_Luggage
size_dict = {
    "small": 1,
    "med": 2,
    "big": 3
}
df["Size_of_Luggage"] = df["Size_of_Luggage"].map(size_dict)

In [25]:
# Encode Safety
safety_dict = {
    "low": 1,
    "med": 2,
    "high": 3
}
df["Safety"] = df["Safety"].map(safety_dict)

In [26]:
# Encode Car_Acceptability
evaluate_dict = {
    "unacc": 1,
    "acc": 2,
    "good": 3,
    "vgood": 4
}
df["Car_Acceptability"] = df["Car_Acceptability"].map(evaluate_dict)

In [27]:
# Sanity Check
df.head()

Unnamed: 0,Buying_Price,Maintenance_Price,No_of_Doors,Person_Capacity,Size_of_Luggage,Safety,Car_Acceptability
0,4,4,2,2,1,1,1
1,4,4,2,2,1,2,1
2,4,4,2,2,1,3,1
3,4,4,2,2,2,1,1
4,4,4,2,2,2,2,1


In [None]:
# Save the encoded dataframe
df.to_csv("car_encoded.csv", index=False)

### Data Oversampling

The target classes are heavily imbalanced. To counteract the resulting bias we can oversample the minority classes, which also increases the number of instances available to train the model.

Here we use the Synthetic Minority Over-sampling Technique (SMOTE) to up-sample the minority classes. Because it generates synthetic instances instead of duplicating existing rows, it lowers the risk of overfitting to repeated samples.
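
SMOTE creates each synthetic instance by interpolating between a minority-class sample and one of its nearest neighbors from the same class. The sketch below illustrates only that interpolation step with NumPy. It is a simplified illustration, not the `imblearn` implementation, and the function name `synthesize` and the example vectors are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def synthesize(sample, neighbor, rng):
    """Return a random point on the line segment between `sample` and `neighbor`."""
    gap = rng.uniform(0.0, 1.0)  # interpolation weight in [0, 1)
    return sample + gap * (neighbor - sample)


# Two hypothetical minority-class rows with ordinal-encoded features.
sample = np.array([1.0, 2.0, 3.0])
neighbor = np.array([2.0, 2.0, 3.0])

new_point = synthesize(sample, neighbor, rng)
print(new_point)
```

Features on which the two rows agree are left unchanged, while the others take a value between the two originals. With ordinal-encoded features this means a synthetic row can contain non-integer values.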

In [28]:
# Separate Input and output features
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [29]:
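# Oversample each minority class, interpolating between a sample and its 2 nearest same-class neighbors
s = SMOTE(k_neighbors=2)
X_r, y_r = s.fit_resample(X, y)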

In [44]:
# Recombine the resampled features and target into a single dataframe
df_y = pd.DataFrame(y_r, columns=["Car_Acceptability"])
df_r = pd.concat([X_r, df_y], axis=1)

In [45]:
df_r.shape

(4840, 7)

In [46]:
df_r["Car_Acceptability"].value_counts()

Car_Acceptability
1    1210
2    1210
4    1210
3    1210
Name: count, dtype: int64
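
The balanced counts follow from simple arithmetic: every class is raised to the size of the largest one. The sketch below reproduces the row total and the number of synthetic rows per class from the original counts; it assumes SMOTE's default behavior of resampling all classes except the majority, and the variable names are illustrative.

```python
# Original class counts after encoding (1: unacc, 2: acc, 3: good, 4: vgood).
original_counts = {1: 1210, 2: 384, 3: 69, 4: 65}

target = max(original_counts.values())  # every class is raised to this size
synthetic_per_class = {label: target - n for label, n in original_counts.items()}

total_rows = target * len(original_counts)
total_synthetic = sum(synthetic_per_class.values())

print(synthetic_per_class)
print(total_rows, total_synthetic)
```

The total matches the 4840 rows reported by `df_r.shape`, which means well over half of the oversampled rows are synthetic.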

In [47]:
# Save oversampled dataframe
df_r.to_csv("car_oversampled.csv", index=False)