# Cars Safety Classification Project

Welcome to the Cars Safety Classification Project! In this project, we aim to classify the safety level of cars based on various attributes using machine learning algorithms. The dataset used in this project contains information about the buying price, maintenance cost, number of doors, capacity, trunk size, and safety of different cars.

## Project Overview

In this project, we will:
- Explore the dataset to understand its structure and attributes.
- Perform data preprocessing, including checking for missing values, encoding categorical variables, and scaling the encoded features.
- Split the dataset into training and testing sets.
- Build and train machine learning models using various algorithms such as Decision Trees, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Logistic Regression.
- Evaluate the performance of each model using metrics such as accuracy, precision, recall, and F1-score.
- Analyze the results and select the best-performing model for classifying the safety level of cars.

## Dataset Description

The dataset consists of the following features:
- **Buying**: The buying price of the car (e.g., vhigh, high, med, low).
- **Maint**: The maintenance cost of the car (e.g., vhigh, high, med, low).
- **Doors**: The number of doors (e.g., 2, 3, 4, 5more).
- **Person**: The capacity of the car (e.g., 2, 4, more).
- **Lug_boot**: The size of the trunk/boot (e.g., small, med, big).
- **Safety**: The estimated safety of the car (e.g., low, med, high).
- **Classes**: The target label to predict (unacc, acc, good, vgood).

## Notebooks and Files

- `cars_safety_classification.ipynb`: Jupyter Notebook containing the code implementation of the project.
- `cars_train.csv`: CSV file containing the clean training dataset.
- `cars_test.csv`: CSV file containing the clean testing dataset.

Let's dive into the project and start exploring the dataset!


## Imported Libraries

Below are the Python libraries imported for this project and their purposes:

- **Pandas**: Used for data manipulation and analysis.
- **NumPy**: Provides support for mathematical functions on arrays and matrices.
- **LabelEncoder**: Used for encoding categorical variables into numerical values.
- **StandardScaler**: Used for scaling numerical features to have a mean of 0 and a standard deviation of 1.
- **DecisionTreeClassifier**: A classifier based on decision trees for classification tasks.
- **confusion_matrix**: Function to compute confusion matrix to evaluate the performance of a classification model.
- **accuracy_score**: Function to compute accuracy score to evaluate the performance of a classification model.
- **classification_report**: Function to generate a text report showing the main classification metrics.
- **tree**: Module containing functions related to decision trees in scikit-learn.
- **LogisticRegression**: A classifier for logistic regression.
- **SVC**: Support Vector Classification.
- **KNeighborsClassifier**: A classifier based on the k-nearest neighbors algorithm.



In [7]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score , classification_report
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier


## Data Loading and Exploration

### Reading the Dataset
To load the dataset into a Pandas DataFrame, we used the `pd.read_csv()` function. For example:
```python
df = pd.read_csv('dataset.csv')
```
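Because the project's CSV files are read with `header=None`, the column names can also be supplied at load time instead of being assigned afterwards. The sketch below uses an in-memory string in place of the real file; the two sample rows and the `columns` list are illustrative assumptions, not the actual file contents.

```python
import io

import pandas as pd

# Two illustrative rows standing in for a header-less CSV file (assumed sample data).
raw = "vhigh,high,3,more,small,low,unacc\nlow,low,3,more,big,med,good\n"

columns = ["buying", "maint", "doors", "person", "lug_boot", "safety", "classes"]

# header=None tells pandas that the first line is data, not column labels;
# names= assigns the labels in the same call.
df = pd.read_csv(io.StringIO(raw), header=None, names=columns)

print(df.shape)
print(df.head())
```

With a real file, `io.StringIO(raw)` would simply be replaced by the file path.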








In [8]:
cars_train=pd.read_csv(r'cars_train.csv',header=None)
cars_test=pd.read_csv(r'cars_test.csv',header=None)

### Shape of the DataFrame
To get the shape of the DataFrame (number of rows and columns), we used the `shape` attribute. For example:
```python
data_shape = df.shape
```


In [9]:
cars_train.shape


(1382, 7)

### Column Names
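To get the column names of the DataFrame, we used the `columns` attribute (it is an attribute, not a method, so it takes no parentheses). Assigning a list to it renames the columns. For example: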
```python
column_names = df.columns
```

In [10]:
cars_train.columns=['buying','maint','doors','person','lug_boot','safety','classes']
cars_test.columns=['buying','maint','doors','persons','lug_boot','safety','classes']


### Displaying the First Few Rows
To display the first few rows of the DataFrame, we used the `head()` method. For example:
```python
first_few_rows = df.head()
```

In [11]:
cars_train.head()

Unnamed: 0,buying,maint,doors,person,lug_boot,safety,classes
0,vhigh,high,3,more,small,low,unacc
1,low,vhigh,3,4,small,med,unacc
2,low,high,5more,more,big,low,unacc
3,high,med,4,2,small,med,unacc
4,low,low,3,more,big,med,good


In [12]:
cars_test.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,classes
0,med,vhigh,5more,4,small,low,unacc
1,vhigh,high,2,2,big,med,unacc
2,low,high,2,more,small,low,unacc
3,vhigh,vhigh,3,2,big,high,unacc
4,low,med,4,4,med,med,good



### Missing Values
To check for missing values in the DataFrame, we chained `isnull()` with `sum()`. The result is the count of missing values in each column. For example:
```python
missing_values_count = df.isnull().sum()
```


In [13]:
cars_train.isnull().sum()

buying      0
maint       0
doors       0
person      0
lug_boot    0
safety      0
classes     0
dtype: int64

In [14]:
cars_test.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
classes     0
dtype: int64

### Dropping Columns or Rows
To drop columns or rows from the DataFrame, we used the `drop()` method. For example, to drop a single column:
```python
df.drop(columns=['Column_Name'], inplace=True)
```
To drop multiple columns:
```python
df.drop(columns=['Column1', 'Column2'], inplace=True)
```
To drop rows based on index:
```python
df.drop(index=[0, 1, 2], inplace=True)
```


In [15]:
cars_test.drop("classes",axis=1,inplace=True)
cars_test.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
0,med,vhigh,5more,4,small,low
1,vhigh,high,2,2,big,med
2,low,high,2,more,small,low
3,vhigh,vhigh,3,2,big,high
4,low,med,4,4,med,med


In [16]:
colname=cars_train.columns
colname

Index(['buying', 'maint', 'doors', 'person', 'lug_boot', 'safety', 'classes'], dtype='object')

## Label Encoding Using a For Loop

### Description
Label encoding is a technique used to convert categorical variables into numerical representations. In this project, we applied label encoding to multiple columns using a for loop.

### Usage
We used the `LabelEncoder` from the `sklearn.preprocessing` module to perform label encoding. Here's an example of how we applied label encoding using a for loop:
```python
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
le = LabelEncoder()

# Iterate through column names
for col in colname:
    cars_train[col] = le.fit_transform(cars_train[col])
```

### Outcome
Label encoding replaces categorical values with numerical representations, making it easier for machine learning algorithms to process the data. Using a for loop allows for efficient label encoding of multiple columns in the DataFrame.
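One caveat worth illustrating: `LabelEncoder` assigns integer codes in sorted order of the labels it sees during `fit`, so fitting a fresh encoder on a different DataFrame can produce a different mapping if a category is missing there. A minimal sketch of one way to guard against this is to keep one fitted encoder per column and reuse it. The tiny `train_demo` and `test_demo` frames and the `encoders` dictionary are hypothetical names introduced only for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature stand-ins for the training and test DataFrames.
train_demo = pd.DataFrame({
    "buying": ["vhigh", "low", "med", "high"],
    "safety": ["low", "med", "high", "med"],
})
test_demo = pd.DataFrame({
    "buying": ["med", "low"],
    "safety": ["high", "high"],
})

# Fit one encoder per column on the training data and remember it.
encoders = {}
for col in train_demo.columns:
    enc = LabelEncoder()
    train_demo[col] = enc.fit_transform(train_demo[col])
    encoders[col] = enc

# Reuse the fitted encoders so the test data gets exactly the same codes.
for col in test_demo.columns:
    test_demo[col] = encoders[col].transform(test_demo[col])

# Inspect the learned mapping for one column (labels are sorted alphabetically).
buying_mapping = {label: code for code, label in enumerate(encoders["buying"].classes_)}
print(buying_mapping)
print(test_demo)
```

Note that `transform` raises an error for a label that was not present during `fit`, which makes mismatches visible instead of silently miscoding them.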



In [17]:
le=LabelEncoder()

for x in colname:
    cars_train[x]=le.fit_transform(cars_train[x])

In [18]:
cars_train.head()
#acc ==> 0
#good ==> 1
#unacc ==> 2
#vgood ==> 3

Unnamed: 0,buying,maint,doors,person,lug_boot,safety,classes
0,3,0,1,2,2,1,2
1,1,3,1,1,2,2,2
2,1,0,3,2,0,1,2
3,0,2,2,0,2,2,2
4,1,1,1,2,0,2,1


In [19]:
colname=cars_test.columns
colname

Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'], dtype='object')

In [20]:
le=LabelEncoder()

for x in colname:
    cars_test[x]=le.fit_transform(cars_test[x])

In [21]:
cars_test.head()
#acc ==> 0
#good ==> 1
#unacc ==> 2
#vgood ==> 3

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
0,2,3,3,1,2,1
1,3,0,0,0,0,2
2,1,0,0,2,2,1
3,3,3,1,0,0,0
4,1,2,2,1,1,2


## Feature Selection

### Description
Before training, we separate the input features (X) from the target variable (y). Strictly speaking, this is a feature/target split rather than feature selection, because all six input columns are kept. In this project, we extracted the features (X) and the target variable (y) from the encoded training DataFrame.

### Usage
We used NumPy arrays to extract the features (X) and the target variable (y) from the DataFrame. Here's how we did it:
```python
x = cars_train.values[:, 0:-1]
y = cars_train.values[:, -1]
```

### Explanation
- `x`: Represents the input features (independent variables) used for training the model. We extracted all rows and columns from the DataFrame except for the last column, which contains the target variable.
- `y`: Represents the target variable (dependent variable) used for training the model. We extracted all rows from the last column of the DataFrame.

### Outcome
By separating the features (X) and the target variable (y), we are able to train our machine learning model using the appropriate input data and target variable.
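The same split can also be expressed by column name rather than by position, which keeps working if the column order changes. The following is a small sketch on a hypothetical three-row frame (`demo` is an assumed name, not part of the notebook):

```python
import pandas as pd

# Hypothetical, already label-encoded miniature frame.
demo = pd.DataFrame({
    "buying": [3, 1, 1],
    "safety": [1, 2, 1],
    "classes": [2, 2, 1],
})

# Select the target by name and drop it from the feature frame.
X_demo = demo.drop(columns="classes")
y_demo = demo["classes"]

# Convert to NumPy arrays, matching what `.values` slicing produces.
X_arr = X_demo.to_numpy()
y_arr = y_demo.to_numpy()

print(X_arr.shape, y_arr.shape)
```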



In [22]:
x=cars_train.values[:,0:-1]   # --> 0 to -2
y=cars_train.values[:,-1]

## Data Type Conversion

We cast the target variable `y` to integer using the `astype(int)` method, so that the class labels are guaranteed to be stored as integers.


In [23]:
y=y.astype(int)
print(x.shape)
print(y.shape)

(1382, 6)
(1382,)


## Feature Scaling

### Description
Feature scaling is a technique used to standardize the range of independent variables or features in the dataset. In this project, we applied feature scaling to the input features (X) using the `StandardScaler` class.

### Usage
We used the `StandardScaler` from the `sklearn.preprocessing` module to perform feature scaling. Here's how we did it:
```python
scaler = StandardScaler()
scaler.fit(x)
x = scaler.transform(x)
```



In [24]:
scaler = StandardScaler()
scaler.fit(x)
x=scaler.transform(x)


### Explanation
- `scaler`: Represents the StandardScaler object used to scale the features.
- `fit(x)`: Computes the mean and standard deviation of each feature in the training dataset (x).
- `transform(x)`: Standardizes the input features (x) using the computed mean and standard deviation.
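To make the two steps concrete, the standardization can be reproduced by hand with NumPy: each column has its mean subtracted and is divided by its standard deviation (the population standard deviation, which is what `StandardScaler` uses). The array `x_demo` is a made-up example.

```python
import numpy as np

# Made-up feature matrix: 3 samples, 2 features on very different scales.
x_demo = np.array([[0.0, 10.0],
                   [1.0, 20.0],
                   [2.0, 60.0]])

# "fit": compute per-column statistics.
mean = x_demo.mean(axis=0)
std = x_demo.std(axis=0)  # ddof=0, i.e. population standard deviation

# "transform": z = (x - mean) / std, applied column by column.
z = (x_demo - mean) / std

print(z)
print(z.mean(axis=0), z.std(axis=0))
```

After the transformation every column has mean 0 and standard deviation 1, up to floating-point error.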


In [25]:
print(x)

[[ 1.33507272 -1.3488262  -0.45682233  1.21505861  1.22565305  0.00176987]
 [-0.44760409  1.32688358 -0.45682233 -0.01064285  1.22565305  1.22474807]
 [-0.44760409 -1.3488262   1.33418038  1.21505861 -1.21505663  0.00176987]
 ...
 [-1.33894249  1.32688358  1.33418038 -0.01064285  0.00529821 -1.22120833]
 [ 0.44373431  0.43498032  0.43867903 -0.01064285 -1.21505663  0.00176987]
 [ 0.44373431 -0.45692294  1.33418038  1.21505861  1.22565305 -1.22120833]]


### Outcome
Feature scaling puts all features on a comparable scale, which can improve the performance and convergence of scale-sensitive algorithms such as KNN, SVM, and models fitted with gradient-based optimizers. Decision trees, by contrast, are largely unaffected by it.

## Train-Test Split

### Description
Train-test split is a technique used to divide the dataset into two subsets: one for training the model and the other for testing the model's performance. In this project, we split the dataset into training and testing sets using the `train_test_split` function.

### Usage
We used the `train_test_split` function from the `sklearn.model_selection` module to split the dataset. Here's how we did it:
```python
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)
```



In [26]:
from sklearn.model_selection import train_test_split

#Split the data into test and train
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=10)

### Explanation
- `x_train`: Represents the features (input variables) of the training set.
- `x_test`: Represents the features (input variables) of the testing set.
- `y_train`: Represents the target variable (output variable) of the training set.
- `y_test`: Represents the target variable (output variable) of the testing set.
- `test_size`: Specifies the proportion of the dataset to include in the testing set (e.g., 0.2 for 20%).
- `random_state`: Controls the randomness of the data splitting process to ensure reproducibility.

### Outcome
Train-test split allows us to evaluate the performance of the machine learning model on unseen data, which helps assess its generalization ability and avoid overfitting.
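Two optional refinements are sketched below, as a variation on the notebook's approach rather than a description of it. First, `stratify=y` keeps the class proportions the same in both subsets, which can matter when some classes are rare. Second, fitting the scaler on the training subset only avoids letting test-set statistics influence preprocessing. The synthetic `X_demo` and `y_demo` arrays are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data: 80 samples of class 0 and 20 of class 1.
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo preserves the 80/20 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

# Fit the scaler on the training subset only, then apply it to both subsets.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(np.bincount(y_tr), np.bincount(y_te))
```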


In [27]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(1105, 6)
(1105,)
(277, 6)
(277,)


## Model Creation and Fitting

### Description
In machine learning, creating a model involves selecting an appropriate algorithm and configuring its parameters. Once the model is created, it is trained using the training dataset to learn patterns and relationships between features and target variables. In this project, we created a Decision Tree model and fitted it to the training data.

### Usage
We used the `DecisionTreeClassifier` class from the `sklearn.tree` module to create a Decision Tree model. Here's how we did it:
```python
from sklearn.tree import DecisionTreeClassifier

# Create a Decision Tree model
model_DecisionTree = DecisionTreeClassifier(random_state=10)

# Fit the model to the training data
model_DecisionTree.fit(x_train, y_train)
```




In [28]:
#create a model
model_DecisionTree=DecisionTreeClassifier(random_state=10) # ,splitter='best',
                                          # criterion ='entropy',min_samples_split=5,
                                          # min_samples_leaf=3,max_depth=10)
#min_samples_leaf, min_samples_split, max_depth, max_features, max_leaf_nodes

#fitting training data to the model
model_DecisionTree.fit(x_train,y_train)

y_pred=model_DecisionTree.predict(x_test)
#print(y_pred)
print(list(zip(y_test,y_pred)))

[(2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (1, 1), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (0, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (2, 2), (0, 0), (1, 0), (2, 2), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (3, 3), (0, 0), (2, 2), (0, 0), (3, 3), (0, 0), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (2, 2), (0, 0), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (1, 0), (2, 2), (0, 0), (0, 0), (2, 2), (2, 2), (0, 0), (3, 3), (0, 0), (0, 0), (2, 2), (2, 2), (0, 0), (2, 2), (0, 0), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (3, 3), (2, 2), (2, 2), (0, 0), (0, 0), (2, 2), (2, 2), (2, 2), (0, 0), (2, 2), (0, 0), (3, 3), (2, 2), (0, 0), (3, 3), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (3, 3), (2, 2), (0, 0), (2, 2), (0, 0),

### Explanation
- `model_DecisionTree`: Represents the Decision Tree model, created with default hyperparameters apart from `random_state=10`. Additional parameters, such as `splitter`, `criterion`, `min_samples_split`, `min_samples_leaf`, `max_depth`, `max_features`, and `max_leaf_nodes`, can be specified for customization.
- `x_train`: Represents the features (input variables) of the training set.
- `y_train`: Represents the target variable (output variable) of the training set.

### Outcome
The model is trained on the training data, and patterns and relationships are learned from the features to predict the target variable. Once trained, the model can be used to make predictions on unseen data.


----------------------------------

## Model Evaluation

### Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It allows visualization of the performance of an algorithm. In this project, we generated the confusion matrix using the `confusion_matrix` function.

```python
cfm = confusion_matrix(y_test, y_pred)
print(cfm)
```

### Classification Report
A classification report is used to evaluate the quality of predictions from a classification algorithm. It provides various metrics such as precision, recall, F1-score, and support for each class. In this project, we generated the classification report using the `classification_report` function.

```python
print("Classification report:")
print(classification_report(y_test, y_pred))
```

### Accuracy Score
Accuracy is the ratio of correctly predicted observations to the total observations. It is one of the most straightforward metrics used for evaluating classification models. In this project, we calculated the accuracy score using the `accuracy_score` function.

```python
acc = accuracy_score(y_test, y_pred)
print("Accuracy of the model:", acc * 100, "%")
```

### Outcome
The confusion matrix provides insights into the model's performance by showing the number of correct and incorrect predictions for each class. The classification report presents precision, recall, F1-score, and support for each class, allowing a comprehensive evaluation of the model's performance. The accuracy score represents the overall accuracy of the model in predicting the target variable.

In [29]:
cfm=confusion_matrix(y_test,y_pred)
print(cfm)

print("Classification report: ")

print(classification_report(y_test,y_pred))

acc=accuracy_score(y_test,y_pred)
print("Accuracy of the model: ",acc*100,'%')

[[ 69   1   1   0]
 [  4   8   0   0]
 [  0   0 185   0]
 [  0   0   0   9]]
Classification report: 
              precision    recall  f1-score   support

           0       0.95      0.97      0.96        71
           1       0.89      0.67      0.76        12
           2       0.99      1.00      1.00       185
           3       1.00      1.00      1.00         9

    accuracy                           0.98       277
   macro avg       0.96      0.91      0.93       277
weighted avg       0.98      0.98      0.98       277

Accuracy of the model:  97.83393501805054 %
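The figures in the classification report can be recomputed directly from the confusion matrix, which helps when reading it: rows are true classes and columns are predicted classes. The sketch below uses the matrix printed above and plain NumPy.

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class).
cm = np.array([[69, 1, 1, 0],
               [4, 8, 0, 0],
               [0, 0, 185, 0],
               [0, 0, 0, 9]])

correct = np.diag(cm)                    # correctly classified samples per class
accuracy = correct.sum() / cm.sum()      # overall fraction correct
recall = correct / cm.sum(axis=1)        # per class: correct / all true samples
precision = correct / cm.sum(axis=0)     # per class: correct / all predicted samples
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy: {accuracy:.4f}")
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print("f1-score: ", f1.round(2))
```

For class 1, for example, 8 of the 12 true samples are predicted correctly, which gives the recall of 0.67 shown in the report.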


## Model Performance on Training Set

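We evaluated the performance of the Decision Tree model on the training set using the `score` method. The score represents the accuracy of the model's predictions on the training data. An unpruned decision tree typically fits its training data almost perfectly, so this number says little about generalization and should be read alongside the test accuracy.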


In [30]:
model_DecisionTree.score(x_train, y_train)

1.0

## Feature Importances

We calculated the feature importances of the Decision Tree model using the `feature_importances_` attribute. This attribute provides information about the importance of each feature in predicting the target variable.

```python
print(list(zip(cars_train.columns[0:-1], model_DecisionTree.feature_importances_)))
```

The output is a list of tuples, where each tuple contains the name of a feature and its corresponding importance score assigned by the model.


In [31]:
print(list(zip(cars_train.columns[0:-1],model_DecisionTree.feature_importances_)))

[('buying', 0.21976792783843332), ('maint', 0.18220929667385857), ('doors', 0.062005934167191426), ('person', 0.19425872158174767), ('lug_boot', 0.09772725134941933), ('safety', 0.24403086838934968)]


In [32]:
sample = pd.DataFrame()
sample['Column']=cars_train.columns[0:-1]
sample['Imp value']=model_DecisionTree.feature_importances_

sample.sort_values("Imp value",ascending=False)

Unnamed: 0,Column,Imp value
5,safety,0.244031
0,buying,0.219768
3,person,0.194259
1,maint,0.182209
4,lug_boot,0.097727
2,doors,0.062006
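An equivalent, slightly more compact way to rank the importances is a pandas `Series` indexed by feature name. The sketch below reuses the values printed above; it also checks that they sum to 1, which is how scikit-learn normalizes tree feature importances.

```python
import pandas as pd

# Importance values copied from the output above.
importances = pd.Series({
    "buying": 0.21976792783843332,
    "maint": 0.18220929667385857,
    "doors": 0.062005934167191426,
    "person": 0.19425872158174767,
    "lug_boot": 0.09772725134941933,
    "safety": 0.24403086838934968,
})

# Sort from most to least important.
ranked = importances.sort_values(ascending=False)

print(ranked.round(3))
print("sum of importances:", round(ranked.sum(), 6))
```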


## Export Decision Tree Model as Graphviz File

We exported the Decision Tree model as a Graphviz file using the `export_graphviz` function from the `sklearn.tree` module. This allows us to visualize the Decision Tree's structure and decision-making process.

```python
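with open(r"model_DecisionTree.txt", "w") as f:
    # export_graphviz writes the DOT source to the open file and returns None,
    # so its result is not assigned back to f
    tree.export_graphviz(model_DecisionTree, feature_names=cars_train.columns[0:-1], out_file=f)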
```

### Visualization
To visualize the Decision Tree, you can generate the file and upload the code in a Graphviz viewer such as [webgraphviz.com](http://www.webgraphviz.com/) or use the Graphviz software locally.
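If installing Graphviz or using an online viewer is inconvenient, scikit-learn can also print a tree as indented text rules with `sklearn.tree.export_text`. The sketch below trains a toy one-feature tree purely for illustration; `X_toy`, `y_toy`, and `toy_tree` are assumed names and the data is made up.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: one encoded feature, two classes separable by a single threshold.
X_toy = [[0], [1], [2], [3]]
y_toy = [0, 0, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=10).fit(X_toy, y_toy)

# export_text returns the learned rules as a plain string.
rules = export_text(toy_tree, feature_names=["safety"])
print(rules)
```

For a large tree the text output becomes long; the `max_depth` argument of `export_text` limits how many levels are printed.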



In [33]:

with open(r"model_DecisionTree.txt","w") as f:
    # export_graphviz returns None when out_file is given, so do not rebind f
    tree.export_graphviz(model_DecisionTree, feature_names=cars_train.columns[0:-1],
                         out_file=f)

# generate the file and upload the code in webgraphviz.com to plot the decision tree

## Logistic Regression Model Evaluation

### Description
We created a Logistic Regression model to evaluate its performance on the dataset. Logistic Regression is a linear classification algorithm that is often introduced for binary tasks; scikit-learn's implementation also handles multiclass targets such as the four classes here. In this project, we used it to predict the target variable based on the input features.

### Usage
We instantiated a LogisticRegression classifier and fitted it to the training data. Then, we used the trained model to make predictions on the test data.

```python
# Create a Logistic Regression model
classifier = LogisticRegression()

# Fit the model to the training data
classifier.fit(x_train, y_train)

# Make predictions on the test data
y_pred = classifier.predict(x_test)
print(y_pred)
```



In [34]:

#create a model
classifier=LogisticRegression()
#fitting training data to the model
classifier.fit(x_train,y_train)

y_pred=classifier.predict(x_test)
print(y_pred)

[2 2 2 2 2 0 2 2 2 0 0 2 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 2 3 0 2 2 2 0 2 2 2
 2 2 2 2 2 0 2 0 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 0 3
 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 0 2 2 2 2 2
 0 2 0 0 2 2 2 2 2 2 2 2 2 0 2 2 0 2 2 0 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 0 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 0 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 0 2 0 2 2 2 0 2 2 0 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 0 0 2 2 0 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 3 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2]


### Outcome
The `y_pred` variable contains the predicted labels generated by the Logistic Regression model. We can further evaluate the model's performance using metrics such as accuracy, precision, recall, and F1-score.


In [35]:
from sklearn.metrics import confusion_matrix, accuracy_score , classification_report

cfm=confusion_matrix(y_test,y_pred)
print(cfm)

print("Classification report: ")

print(classification_report(y_test,y_pred))

acc=accuracy_score(y_test,y_pred)
print("Accuracy of the model: ",acc*100,'%')

[[ 11   0  58   2]
 [  2   0  10   0]
 [ 18   0 167   0]
 [  6   0   2   1]]
Classification report: 
              precision    recall  f1-score   support

           0       0.30      0.15      0.20        71
           1       0.00      0.00      0.00        12
           2       0.70      0.90      0.79       185
           3       0.33      0.11      0.17         9

    accuracy                           0.65       277
   macro avg       0.33      0.29      0.29       277
weighted avg       0.56      0.65      0.59       277

Accuracy of the model:  64.62093862815884 %




## Multi-Model Testing

### Description
In this step, we tested several classification algorithms on the same train/test split to compare their performance. We initialized four different classifiers: Decision Tree, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Logistic Regression. Then, we created a list containing these classifier objects.

In [36]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
 
# first, initialize the classifiers
# (named dtree so that it does not shadow the sklearn `tree` module imported earlier)
dtree= DecisionTreeClassifier(random_state=10) # using the random state for reproducibility
knn= KNeighborsClassifier(n_neighbors=5,metric='euclidean')
svm= SVC(kernel="rbf", gamma=0.1, C=90,random_state=10)
logreg=LogisticRegression(multi_class="multinomial",random_state=10)

In [37]:
# now, create a list with the objects 
models= [dtree, knn, svm, logreg]

### Usage
We used a for loop to iterate through each model in the list and performed the following steps:
1. Fitted the model to the training data.
2. Predicted the target variable for the test data using the trained model.
3. Calculated the accuracy score, confusion matrix, and classification report to evaluate the model's performance.
4. Printed the confusion matrix, test accuracy, classification report, and training-set score for each model.
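The steps above can also collect the scores into a table instead of printing them, which makes the models easier to compare side by side. The following is a minimal sketch on synthetic data from `make_classification`; the dataset, the `candidates` dictionary, and the two models chosen are assumptions for illustration and do not reproduce the notebook's numbers.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the cars dataset).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=10
)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=10)

candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=10),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

rows = []
for name, clf in candidates.items():
    clf.fit(X_a, y_a)  # step 1: fit on the training split
    rows.append({
        "model": name,
        "train_accuracy": clf.score(X_a, y_a),  # accuracy on data the model has seen
        "test_accuracy": clf.score(X_b, y_b),   # accuracy on held-out data
    })

# One row per model, best test accuracy first.
summary = pd.DataFrame(rows).sort_values("test_accuracy", ascending=False)
print(summary)
```

Keeping the training and test accuracy in adjacent columns makes a large gap between the two, a common sign of overfitting, easy to spot.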


In [38]:
from sklearn.metrics import confusion_matrix, accuracy_score,classification_report
 
for model in models:
    model.fit(x_train, y_train) # fit the model
    y_pred= model.predict(x_test) # then predict on the test set
    score = model.score(x_train,y_train)
    accuracy= accuracy_score(y_test, y_pred) 
    clf_report= classification_report(y_test, y_pred) 
    print(confusion_matrix(y_test,y_pred))
    print("The accuracy of the ",type(model).__name__, " model is ", accuracy*100 )
    print("Classification report:\n", clf_report)
    print("\n")
    print("Score of the model: ",score*100)
    print("\n")

[[ 69   1   1   0]
 [  4   8   0   0]
 [  0   0 185   0]
 [  0   0   0   9]]
The accuracy of the  DecisionTreeClassifier  model is  97.83393501805054
Classification report:
               precision    recall  f1-score   support

           0       0.95      0.97      0.96        71
           1       0.89      0.67      0.76        12
           2       0.99      1.00      1.00       185
           3       1.00      1.00      1.00         9

    accuracy                           0.98       277
   macro avg       0.96      0.91      0.93       277
weighted avg       0.98      0.98      0.98       277



Score of the model:  100.0


[[ 65   1   5   0]
 [  8   4   0   0]
 [  1   0 184   0]
 [  2   0   1   6]]
The accuracy of the  KNeighborsClassifier  model is  93.50180505415162
Classification report:
               precision    recall  f1-score   support

           0       0.86      0.92      0.88        71
           1       0.80      0.33      0.47        12
           2       0.97  



## Model Performance Insights

Based on the outcome of the multi-model testing, we have the following insights:

1. **DecisionTreeClassifier:**
   - Achieved an accuracy of approximately 97.83%.
   - Shows high precision, recall, and F1-score for classes 0, 2, and 3. Class 1 is weaker (recall 0.67, F1-score 0.76), and it has only 12 samples in the test split.
   - Score of the model on the training set is 100.0%.

2. **KNeighborsClassifier:**
   - Achieved an accuracy of approximately 93.50%.
   - Precision, recall, and F1-score vary across different classes, with lower scores for class 1.
   - Score of the model on the training set is 96.92%.

3. **SVC (Support Vector Classifier):**
   - Achieved an accuracy of approximately 98.56%, the highest among the models.
   - Shows high precision, recall, and F1-score for all classes, similar to the DecisionTreeClassifier.
   - Score of the model on the training set is 100.0%.

4. **LogisticRegression:**
   - Achieved the lowest accuracy of approximately 64.62%.
   - Precision, recall, and F1-score are relatively low, especially for class 1.
   - Score of the model on the training set is 69.50%.

### Insights:
- The DecisionTreeClassifier and SVC models performed exceptionally well, with high accuracy and balanced performance across different classes.
- KNeighborsClassifier also performed reasonably well but showed slightly lower accuracy compared to DecisionTreeClassifier and SVC.
- LogisticRegression showed the lowest performance, indicating that it might not be the best choice for this dataset compared to the other models.
- Further analysis and fine-tuning of hyperparameters could potentially improve the performance of the models, especially for LogisticRegression, to make it more competitive with the other models.
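One possible factor in the weak Logistic Regression result, not verified in this project, is the encoding: `LabelEncoder` assigns codes alphabetically (for example `high`=0, `low`=1, `med`=2, `vhigh`=3), so the integers do not follow the natural order of the categories, and a linear model treats them as magnitudes. Two common alternatives are sketched below on a made-up frame: an explicit ordinal mapping and one-hot encoding. The `price_order` and `safety_order` dictionaries are assumptions introduced for the example.

```python
import pandas as pd

# Made-up sample with two of the categorical columns.
demo = pd.DataFrame({
    "buying": ["vhigh", "low", "med", "high"],
    "safety": ["low", "med", "high", "med"],
})

# Option 1: explicit ordinal mapping that respects the natural order.
price_order = {"low": 0, "med": 1, "high": 2, "vhigh": 3}
safety_order = {"low": 0, "med": 1, "high": 2}
encoded = pd.DataFrame({
    "buying": demo["buying"].map(price_order),
    "safety": demo["safety"].map(safety_order),
})

# Option 2: one-hot encoding, one indicator column per category.
one_hot = pd.get_dummies(demo, columns=["buying", "safety"])

print(encoded)
print(one_hot.columns.tolist())
```

Whether either option actually improves accuracy on this dataset would need to be checked by retraining and re-evaluating the model.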


Of the models we tried, the base decision tree and the tuned SVC perform best. We select the base decision tree as the final model: it is the simpler of the two and nearly as accurate, so we use it to predict on the test file. Accuracies of the models we tried:

- Tuned SVC: 98.56%
- Base SVC: 82.67%
- Base Decision Tree: 97.83%
- Pruned Decision Tree: 96.02%
- Logistic Regression: 64.62%
- KNN: 93.50%


In [39]:
cars_test.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
0,2,3,3,1,2,1
1,3,0,0,0,0,2
2,1,0,0,2,2,1
3,3,3,1,0,0,0
4,1,2,2,1,1,2


In [40]:
test = cars_test.values
test = scaler.transform(test)
#print(test)

## Final Model: Decision Tree Classifier

We have selected the Decision Tree Classifier as our final model for predicting the values. Here are the details:

- **Algorithm**: Decision Tree Classifier
- **Criterion**: Gini impurity
- **Random State**: 10
- **Splitter**: Best

We fitted the model on the training data (`x_train`, `y_train`) and used it to predict values.

```python 
# importing libraries
from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree Classifier with specified parameters
model_DecisionTree = DecisionTreeClassifier(criterion="gini",
                                            random_state=10,
                                            splitter="best")

# Fit the model on the training data
model_DecisionTree.fit(x_train, y_train)
```

In [41]:
#predicting using the Decision_Tree_Classifier
from sklearn.tree import DecisionTreeClassifier

model_DecisionTree = DecisionTreeClassifier (criterion ="gini",
                                            random_state=10,
                                            splitter="best")

#fit the model on the data and predict the values
model_DecisionTree.fit(x_train,y_train)

DecisionTreeClassifier(random_state=10)

In [42]:
test_pred = model_DecisionTree.predict(test)
test_pred

array([2, 2, 2, 2, 1, 2, 0, 0, 2, 0, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 3,
       2, 0, 2, 2, 2, 2, 2, 0, 1, 3, 1, 2, 0, 2, 0, 2, 2, 2, 2, 3, 2, 2,
       0, 0, 2, 2, 3, 2, 2, 2, 1, 2, 0, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
       2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 0, 2, 2, 3, 2, 2, 0, 2, 0, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2,
       2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 3, 2, 2,
       2, 2, 0, 0, 2, 2, 2, 2, 3, 2, 0, 2, 1, 0, 2, 2, 2, 2, 2, 3, 0, 0,
       2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 3, 0, 2, 2, 2, 3, 2, 2, 0, 2,
       2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 3, 2, 2, 0,
       2, 0, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 0, 2,
       2, 2, 0, 2, 2, 0, 2, 2, 2, 1, 1, 2, 2, 2, 0, 2, 2, 0, 3, 3, 0, 2,
       0, 2, 2, 2, 3, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2,

## Predictions on Test Data

We have loaded the test data from the 'cars_test.csv' file and assigned column names to it. After making predictions using the trained model, we have added a new column 'Pred' to the test data DataFrame to store the predicted class labels.

```python
import pandas as pd

# Load the test data and assign column names
cars_test = pd.read_csv(r'cars_test.csv', header=None)
cars_test.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'classes']

# Attach the predictions, then map the integer codes back to class labels
cars_test['Pred'] = test_pred
cars_test['Pred'] = cars_test['Pred'].replace({0: 'acc', 1: 'good', 2: 'unacc', 3: 'vgood'})

# Display the first few rows of the updated test data
cars_test.head()
```

In [43]:
cars_test = pd.read_csv(r'cars_test.csv',header = None)
cars_test.columns=['buying','maint','doors','persons','lug_boot','safety','classes']
cars_test['Pred'] =test_pred
cars_test['Pred']=cars_test['Pred'].replace({0:'acc', 1:'good',2:'unacc',3:'vgood'})

cars_test.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,classes,Pred
0,med,vhigh,5more,4,small,low,unacc,unacc
1,vhigh,high,2,2,big,med,unacc,unacc
2,low,high,2,more,small,low,unacc,unacc
3,vhigh,vhigh,3,2,big,high,unacc,unacc
4,low,med,4,4,med,med,good,good


The 'Pred' column now contains the predicted class labels ('acc', 'good', 'unacc', 'vgood') for each data point in the test set.
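As an alternative to writing the code-to-label dictionary by hand, a `LabelEncoder` fitted on the target labels can reverse its own mapping with `inverse_transform`. The sketch below fits a dedicated encoder on the four class names for illustration; `target_encoder` and `demo_pred` are assumed names, and the predictions are made up.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Encoder fitted on the target labels only (classes are stored in sorted order).
target_encoder = LabelEncoder().fit(["unacc", "acc", "good", "vgood"])

# Made-up integer predictions, as a classifier would return them.
demo_pred = np.array([2, 0, 3, 1])

# Map the integer codes back to the original class names.
labels = target_encoder.inverse_transform(demo_pred)

print(list(target_encoder.classes_))
print(list(labels))
```

This keeps the mapping in one place, so it cannot drift out of sync with the encoding step.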

In [44]:
from sklearn.metrics import confusion_matrix, accuracy_score,classification_report
# confusion matrix 
print(confusion_matrix(cars_test.classes,cars_test.Pred))
print(accuracy_score(cars_test.classes,cars_test.Pred))
print(classification_report(cars_test.classes,cars_test.Pred))

[[ 60   1   3   0]
 [  3  10   0   0]
 [  0   0 251   0]
 [  0   0   0  18]]
0.9797687861271677
              precision    recall  f1-score   support

         acc       0.95      0.94      0.94        64
        good       0.91      0.77      0.83        13
       unacc       0.99      1.00      0.99       251
       vgood       1.00      1.00      1.00        18

    accuracy                           0.98       346
   macro avg       0.96      0.93      0.94       346
weighted avg       0.98      0.98      0.98       346



## Saving Predictions to Excel

We have saved the predictions made on the test data using the Decision Tree Classifier to an Excel file named "Decision Tree Output for car safety.xlsx". The DataFrame containing the test data along with the predicted class labels is written to the Excel file without including the index and with column headers included.

```python
# Save the DataFrame to an Excel file
cars_test.to_excel("Decision Tree Output for car safety.xlsx", header=True, index=False)
```
This Excel file can be used for further analysis or reporting purposes. Writing `.xlsx` files with pandas requires an Excel engine such as `openpyxl` to be installed.



In [45]:
cars_test.to_excel("Decision Tree Output for car safety.xlsx",header = True,index=False)

## Conclusion

In this project, we explored the classification of car safety using machine learning techniques. We began by analyzing the dataset and preprocessing the data to prepare it for model training. We then experimented with various classification algorithms, including Decision Tree Classifier, K-Nearest Neighbors, Support Vector Machine, and Logistic Regression.

After evaluating the performance of each model using metrics such as accuracy, precision, recall, and F1-score, we selected the Decision Tree Classifier as our final model due to its high accuracy and simplicity. The final model sets `criterion="gini"` and `splitter="best"` explicitly; these match scikit-learn's defaults, so it is the same base decision tree evaluated earlier rather than a separately tuned one.

Finally, we applied the trained model to make predictions on the test dataset and saved the results to an Excel file for further analysis or reporting.

This project demonstrates the application of machine learning algorithms for car safety classification and highlights the importance of model selection and evaluation in achieving accurate predictions.
