## Data preprocessing

**Create a notebook that pre-processes this data for model fitting**

**In your notebook, analyze and process your chosen data. Identify your target variable and your input variables. Be sure to include details about what you observed, what changes you are making, how you are making these changes, and why you are making these changes. For instance, you might have a section in your notebook where you analyze the data to identify issues with missing data. Present your results, then argue for what (if anything) you need to do and why, and then show clearly the steps you undertook to accomplish this. Save the results into csv files (these files should therefore be pre-processed and ready for model fitting). Later model fitting notebooks should not need data manipulation/processing.**

## Introduction

I have chosen the following data source for this assignment:
https://www.kaggle.com/datasets/subhajeetdas/car-acceptability-classification-dataset?select=car.data

### Brief overview of the data

The data set consists of 7 categorical attributes describing 1,728 cars.

#### Attribute description
* Car Acceptability Classification Dataset

1. Buying_Price - Categorical Data [vhigh, high, med, low]
2. Maintenance_Price - Categorical Data [vhigh, high, med, low]
3. No_of_Doors - Categorical Data [2, 3, 4, 5more]
4. Person_Capacity - Categorical Data [2, 4, more]
5. Size_of_Luggage - Categorical Data [small, med, big]
6. Safety - Categorical Data [low, med, high]
7. Car_Acceptability - Categorical Data [unacc, acc, good, vgood]


## Goal 

The goal of this analysis is to predict the car acceptability status by developing a predictive model using Keras.

* The attribute "Car_Acceptability" from the list of 7 attributes is used as the target variable to classify a car into the unacceptable, acceptable, good and very good categories.
* The remaining 6 attributes are used as predictors. 

## Importing the required libraries

In [132]:
# import numpy and pandas libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.impute import SimpleImputer


# set random seed to ensure that results are repeatable
np.random.seed(86089106)

## Loading the data

In [133]:
df = pd.read_csv("car.csv")
df.head(3)

Unnamed: 0,Buying_Price,Maintenance_Price,No_of_Doors,Person_Capacity,Size_of_Luggage,Safety,Car_Acceptability
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc


## Inspect the data

In [134]:
df.describe()

Unnamed: 0,Buying_Price,Maintenance_Price,No_of_Doors,Person_Capacity,Size_of_Luggage,Safety,Car_Acceptability
count,1728,1728,1728,1728,1728,1728,1728
unique,4,4,4,3,3,3,4
top,vhigh,vhigh,2,2,small,low,unacc
freq,432,432,432,576,576,576,1210
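The summary above shows that `unacc` is the most frequent target value (1210 of 1728 rows, roughly 70%), so the classes are imbalanced. A minimal sketch of how the class shares could be inspected is shown here; the `sample` frame is a hypothetical miniature stand-in, and in the notebook the same call would be applied to `df["Car_Acceptability"]`.

```python
import pandas as pd

# Hypothetical stand-in for the target column (not the real data).
sample = pd.DataFrame(
    {"Car_Acceptability": ["unacc", "unacc", "unacc", "unacc",
                           "acc", "acc", "good", "vgood"]}
)

# Share of each class, largest first.
class_share = sample["Car_Acceptability"].value_counts(normalize=True)
print(class_share)
```

Knowing the class shares up front helps when judging model accuracy later, since always predicting the majority class would already score about 70% on the real data.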


In [135]:
# Check the missing values by summing the total na's for each variable
df.isna().sum()

Buying_Price         0
Maintenance_Price    0
No_of_Doors          0
Person_Capacity      0
Size_of_Luggage      0
Safety               0
Car_Acceptability    0
dtype: int64

In [136]:
# Strip any whitespace from the column names before analysis.
df.columns = [s.strip() for s in df.columns] 
df.columns

Index(['Buying_Price', 'Maintenance_Price', 'No_of_Doors', 'Person_Capacity',
       'Size_of_Luggage', 'Safety', 'Car_Acceptability'],
      dtype='object')
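Before encoding, it can be useful to confirm that every column contains only the categories listed in the attribute description. The sketch below is one possible check; the helper name `unexpected_levels` and the `toy` frame (with a deliberately wrong `Safety` value) are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Allowed categories, copied from the attribute description.
expected_levels = {
    "Buying_Price": {"vhigh", "high", "med", "low"},
    "Maintenance_Price": {"vhigh", "high", "med", "low"},
    "No_of_Doors": {"2", "3", "4", "5more"},
    "Person_Capacity": {"2", "4", "more"},
    "Size_of_Luggage": {"small", "med", "big"},
    "Safety": {"low", "med", "high"},
    "Car_Acceptability": {"unacc", "acc", "good", "vgood"},
}


def unexpected_levels(frame, allowed_by_column):
    """Return {column: values found in the column but not in the allowed set}."""
    problems = {}
    for column, allowed in allowed_by_column.items():
        # Compare as stripped strings so stray whitespace is not reported.
        seen = set(frame[column].astype(str).str.strip())
        extra = seen - allowed
        if extra:
            problems[column] = extra
    return problems


# Hypothetical two-row frame; "hi" is an intentional error in Safety.
toy = pd.DataFrame({
    "Buying_Price": ["vhigh", "low"],
    "Maintenance_Price": ["med", "high"],
    "No_of_Doors": ["2", "5more"],
    "Person_Capacity": ["4", "more"],
    "Size_of_Luggage": ["small", "big"],
    "Safety": ["low", "hi"],
    "Car_Acceptability": ["unacc", "acc"],
})

problems = unexpected_levels(toy, expected_levels)
print(problems)
```

In the notebook, calling `unexpected_levels(df, expected_levels)` would be expected to return an empty dictionary if the data matches the description.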

## Using one-hot encoding to replace categorical variables 

In [137]:

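# Note: the columns already appear to hold string labels (see df.head() above),
# so these integer-to-label mappings should match nothing and leave the data unchanged.
# They only matter if the source file stores the categories as integer codes.
df['Buying_Price'] = df['Buying_Price'].replace({1: "low", 2: 'med', 3: 'high', 4: 'vhigh'})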

df['Maintenance_Price'] = df['Maintenance_Price'].replace({1: "low", 2: 'med', 3: 'high', 4: 'vhigh'})

df['No_of_Doors'] = df['No_of_Doors'].replace({1: '2', 2: '3', 3: '4', 4: '5more'})

df['Person_Capacity'] = df['Person_Capacity'].replace({1: '2', 2: '4', 3: 'more'})

df['Size_of_Luggage'] = df['Size_of_Luggage'].replace({1: 'small', 2: 'med', 3: 'big'})

df['Safety'] = df['Safety'].replace({1: 'low', 2: 'med', 3: 'high'})

#df['Car_Acceptability'] = df['Car_Acceptability'].replace({1: 'unacc', 2: 'acc', 3: 'good', 4: 'vgood'})


In [138]:
buying_price_dummies = pd.get_dummies(df['Buying_Price'], prefix='Buying_Price', drop_first=False)
df = df.join(buying_price_dummies)

maintenance_price_dummies = pd.get_dummies(df['Maintenance_Price'], prefix='Maintenance_Price', drop_first=False)
df = df.join(maintenance_price_dummies)

no_of_doors_dummies = pd.get_dummies(df['No_of_Doors'], prefix='No_of_Doors', drop_first=False)
df = df.join(no_of_doors_dummies)

person_capacity_dummies = pd.get_dummies(df['Person_Capacity'], prefix='Person_Capacity', drop_first=False)
df = df.join(person_capacity_dummies)

size_of_luggage_dummies = pd.get_dummies(df['Size_of_Luggage'], prefix='Size_of_Luggage', drop_first=False)
df = df.join(size_of_luggage_dummies)

safety_dummies = pd.get_dummies(df['Safety'], prefix='Safety', drop_first=False)
df = df.join(safety_dummies)

#car_acceptability_dummies = pd.get_dummies(df['Car_Acceptability'], prefix='Car_Acceptability', drop_first=False)
#df = df.join(car_acceptability_dummies)
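The column-by-column encoding above can also be written as a single call, since `pd.get_dummies` accepts a `columns` argument and then drops the original columns itself. The following is a sketch on a hypothetical `toy` frame with only two predictors, not a replacement for the cell above; `dtype=int` is an optional choice made here so the dummies are 0/1 integers regardless of pandas version.

```python
import pandas as pd

# Hypothetical miniature frame with two predictors and the target.
toy = pd.DataFrame({
    "Safety": ["low", "med", "high"],
    "Size_of_Luggage": ["small", "med", "big"],
    "Car_Acceptability": ["unacc", "unacc", "acc"],
})

predictor_columns = ["Safety", "Size_of_Luggage"]

# Encode all listed columns at once; the target column is left untouched.
encoded = pd.get_dummies(toy, columns=predictor_columns,
                         drop_first=False, dtype=int)
print(list(encoded.columns))
```

With this form, the later step that drops the original categorical columns would not be needed, because the encoded columns replace them.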

In [139]:
#Inspecting the data after the above changes
df.head(5)

Unnamed: 0,Buying_Price,Maintenance_Price,No_of_Doors,Person_Capacity,Size_of_Luggage,Safety,Car_Acceptability,Buying_Price_high,Buying_Price_low,Buying_Price_med,...,No_of_Doors_5more,Person_Capacity_2,Person_Capacity_4,Person_Capacity_more,Size_of_Luggage_big,Size_of_Luggage_med,Size_of_Luggage_small,Safety_high,Safety_low,Safety_med
0,vhigh,vhigh,2,2,small,low,unacc,0,0,0,...,0,1,0,0,0,0,1,0,1,0
1,vhigh,vhigh,2,2,small,med,unacc,0,0,0,...,0,1,0,0,0,0,1,0,0,1
2,vhigh,vhigh,2,2,small,high,unacc,0,0,0,...,0,1,0,0,0,0,1,1,0,0
3,vhigh,vhigh,2,2,med,low,unacc,0,0,0,...,0,1,0,0,0,1,0,0,1,0
4,vhigh,vhigh,2,2,med,med,unacc,0,0,0,...,0,1,0,0,0,1,0,0,0,1


In [140]:
# Drop the original categorical columns, which are now represented by the dummy columns
df = df.drop(['Buying_Price', 'Maintenance_Price', 'No_of_Doors', 'Person_Capacity',
       'Size_of_Luggage', 'Safety'], axis=1)

In [141]:
# generate a basic summary of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 22 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Car_Acceptability        1728 non-null   object
 1   Buying_Price_high        1728 non-null   uint8 
 2   Buying_Price_low         1728 non-null   uint8 
 3   Buying_Price_med         1728 non-null   uint8 
 4   Buying_Price_vhigh       1728 non-null   uint8 
 5   Maintenance_Price_high   1728 non-null   uint8 
 6   Maintenance_Price_low    1728 non-null   uint8 
 7   Maintenance_Price_med    1728 non-null   uint8 
 8   Maintenance_Price_vhigh  1728 non-null   uint8 
 9   No_of_Doors_2            1728 non-null   uint8 
 10  No_of_Doors_3            1728 non-null   uint8 
 11  No_of_Doors_4            1728 non-null   uint8 
 12  No_of_Doors_5more        1728 non-null   uint8 
 13  Person_Capacity_2        1728 non-null   uint8 
 14  Person_Capacity_4        1728 non-null  

In [142]:
# generate a statistical summary of the numeric values in the data
df.describe()

Unnamed: 0,Buying_Price_high,Buying_Price_low,Buying_Price_med,Buying_Price_vhigh,Maintenance_Price_high,Maintenance_Price_low,Maintenance_Price_med,Maintenance_Price_vhigh,No_of_Doors_2,No_of_Doors_3,...,No_of_Doors_5more,Person_Capacity_2,Person_Capacity_4,Person_Capacity_more,Size_of_Luggage_big,Size_of_Luggage_med,Size_of_Luggage_small,Safety_high,Safety_low,Safety_med
count,1728.0,1728.0,1728.0,1728.0,1728.0,1728.0,1728.0,1728.0,1728.0,1728.0,...,1728.0,1728.0,1728.0,1728.0,1728.0,1728.0,1728.0,1728.0,1728.0,1728.0
mean,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,...,0.25,0.333333,0.333333,0.333333,0.333333,0.333333,0.333333,0.333333,0.333333,0.333333
std,0.433138,0.433138,0.433138,0.433138,0.433138,0.433138,0.433138,0.433138,0.433138,0.433138,...,0.433138,0.471541,0.471541,0.471541,0.471541,0.471541,0.471541,0.471541,0.471541,0.471541
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,...,0.25,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Properties and observations of cleaned data

In [143]:
df.head(5)

Unnamed: 0,Car_Acceptability,Buying_Price_high,Buying_Price_low,Buying_Price_med,Buying_Price_vhigh,Maintenance_Price_high,Maintenance_Price_low,Maintenance_Price_med,Maintenance_Price_vhigh,No_of_Doors_2,...,No_of_Doors_5more,Person_Capacity_2,Person_Capacity_4,Person_Capacity_more,Size_of_Luggage_big,Size_of_Luggage_med,Size_of_Luggage_small,Safety_high,Safety_low,Safety_med
0,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,1,0,1,0
1,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,1,0,0,1
2,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,1,1,0,0
3,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,1,0,0,1,0
4,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,1,0,0,0,1


### Splitting the data for training and testing (data partitioning 70/30)

In [144]:
train_df, test_df = train_test_split(df, test_size=0.3)
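Because the target classes are imbalanced, one option worth considering is a stratified split, which keeps the class proportions roughly equal in the training and test sets. The sketch below uses the `stratify` and `random_state` parameters of `train_test_split` on a hypothetical `toy` frame; it is an alternative to, not a description of, the split used in this notebook, and the seed value `0` is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 30 "unacc" rows and 10 "acc" rows.
toy = pd.DataFrame({
    "feature": range(40),
    "Car_Acceptability": ["unacc"] * 30 + ["acc"] * 10,
})

# stratify keeps the 75/25 class ratio in both partitions;
# random_state makes the split repeatable on its own.
train_toy, test_toy = train_test_split(
    toy,
    test_size=0.3,
    random_state=0,
    stratify=toy["Car_Acceptability"],
)

print(train_toy["Car_Acceptability"].value_counts())
print(test_toy["Car_Acceptability"].value_counts())
```

Passing `random_state` directly also makes the split independent of any other code that consumes NumPy's global random state before this cell runs.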

### Separating the predictors and target variables

In [145]:
target = 'Car_Acceptability'
predictors = list(df.columns)
predictors.remove(target)

### Saving the datasets for testing and training

In [146]:
X_train = train_df[predictors]
y_train = train_df[target]
X_test = test_df[predictors]
y_test = test_df[target]

In [147]:
X_train.isna().sum()

Buying_Price_high          0
Buying_Price_low           0
Buying_Price_med           0
Buying_Price_vhigh         0
Maintenance_Price_high     0
Maintenance_Price_low      0
Maintenance_Price_med      0
Maintenance_Price_vhigh    0
No_of_Doors_2              0
No_of_Doors_3              0
No_of_Doors_4              0
No_of_Doors_5more          0
Person_Capacity_2          0
Person_Capacity_4          0
Person_Capacity_more       0
Size_of_Luggage_big        0
Size_of_Luggage_med        0
Size_of_Luggage_small      0
Safety_high                0
Safety_low                 0
Safety_med                 0
dtype: int64

In [148]:
X_train.head(3)

Unnamed: 0,Buying_Price_high,Buying_Price_low,Buying_Price_med,Buying_Price_vhigh,Maintenance_Price_high,Maintenance_Price_low,Maintenance_Price_med,Maintenance_Price_vhigh,No_of_Doors_2,No_of_Doors_3,...,No_of_Doors_5more,Person_Capacity_2,Person_Capacity_4,Person_Capacity_more,Size_of_Luggage_big,Size_of_Luggage_med,Size_of_Luggage_small,Safety_high,Safety_low,Safety_med
965,0,0,1,0,0,0,0,1,0,0,...,1,0,0,1,0,0,1,1,0,0
516,1,0,0,0,0,0,0,1,0,0,...,1,1,0,0,0,1,0,0,1,0
160,0,0,0,1,1,0,0,0,0,1,...,0,0,0,1,1,0,0,0,0,1


In [149]:
y_train.head(3)

965      acc
516    unacc
160    unacc
Name: Car_Acceptability, dtype: object
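The target is still stored as text labels at this point. A Keras classifier would typically need numeric class codes, so a later step may have to map the labels to integers. The sketch below shows one way to do this with an explicit mapping; the ordering in `class_order` is an assumption taken from the attribute description, and `y_toy` is a hypothetical stand-in for `y_train`.

```python
import pandas as pd

# Assumed class ordering, following the attribute description.
class_order = ["unacc", "acc", "good", "vgood"]
label_to_code = {label: code for code, label in enumerate(class_order)}

# Hypothetical stand-in for y_train.
y_toy = pd.Series(["acc", "unacc", "unacc", "vgood"], name="Car_Acceptability")

# Map each text label to its integer code.
y_codes = y_toy.map(label_to_code)
print(y_codes.tolist())
```

An explicit mapping keeps the code-to-label correspondence fixed and easy to document, which matters when the same encoding must be applied to both the training and test targets.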

In [150]:
X_test.head(3)

Unnamed: 0,Buying_Price_high,Buying_Price_low,Buying_Price_med,Buying_Price_vhigh,Maintenance_Price_high,Maintenance_Price_low,Maintenance_Price_med,Maintenance_Price_vhigh,No_of_Doors_2,No_of_Doors_3,...,No_of_Doors_5more,Person_Capacity_2,Person_Capacity_4,Person_Capacity_more,Size_of_Luggage_big,Size_of_Luggage_med,Size_of_Luggage_small,Safety_high,Safety_low,Safety_med
1217,0,0,1,0,0,1,0,0,0,1,...,0,1,0,0,0,0,1,1,0,0
1312,0,1,0,0,0,0,0,1,1,0,...,0,0,1,0,1,0,0,0,0,1
840,1,0,0,0,0,1,0,0,0,0,...,1,1,0,0,0,1,0,0,1,0


In [151]:
y_test.head(3)

1217    unacc
1312      acc
840     unacc
Name: Car_Acceptability, dtype: object

In [152]:
#saving the training and testing data to .csv files
X_train.to_csv('car-X-train_data.csv', index=False)
y_train.to_csv('car-y-train_data.csv', index=False)
X_test.to_csv('car-X-test_data.csv', index=False)
y_test.to_csv('car-y-test_data.csv', index=False)
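Since the files are written with `index=False`, the original row labels are not saved, and the predictors and targets stay aligned only by row order. A quick round-trip check, sketched below with hypothetical toy data and temporary file names, can confirm that the saved files load back with the expected shape.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for X_train and y_train, with shuffled row labels.
toy_X = pd.DataFrame({"Safety_high": [1, 0], "Safety_low": [0, 1]},
                     index=[965, 516])
toy_y = pd.Series(["acc", "unacc"], index=[965, 516],
                  name="Car_Acceptability")

with tempfile.TemporaryDirectory() as folder:
    x_path = os.path.join(folder, "toy-X.csv")
    y_path = os.path.join(folder, "toy-y.csv")

    # Same options as the cell above: the row index is not written.
    toy_X.to_csv(x_path, index=False)
    toy_y.to_csv(y_path, index=False)

    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)

# Predictors and targets must have the same number of rows.
rows_match = len(X_back) == len(y_back)
print(X_back.shape, y_back.shape, rows_match)
```

After reloading, both frames carry a fresh 0-based index, so a modelling notebook should not reorder one file without reordering the other.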

## Conclusion

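In this notebook, I preprocessed the car acceptability dataset using the techniques discussed in class: I checked for missing values, one-hot encoded the six categorical predictors, and split the data 70/30 into training and test sets. The resulting training and test data are saved to .csv files that can be used directly for modelling.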