# Predicting Fuel Efficiency of Vehicles - Part 2

### Data Preparation


1. Handling categorical Attributes - OneHotEncoder
2. Data Cleaning - Imputer
3. Attribute Addition - Adding custom transformation
4. Setting up Data Transformation Pipeline for numerical and categorical columns

We will automate all the steps and use scikit-learn's base classes to write custom transformations that add new features to the dataset, features that turned out to be important to us in the last step.

At the end, we will have a pipeline to which we pass just the dataframe and from which we receive the prepared data, ready to go into the ML model, both for training and for getting predictions.

In [1]:
# Importing general use case libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedShuffleSplit

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Importing the dataset

# Defining the column names
cols = ['MPG','Cylinders','Displacement','Horsepower','Weight',
        'Acceleration','Model Year','Origin']

# Loading the dataset into a dataframe
df = pd.read_csv('auto-mpg.data',names=cols, na_values='?',
                 comment='\t',sep=' ',skipinitialspace=True)

# Making a copy of this dataframe
data = df.copy()

# Splitting training and test datasets
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data,data['Cylinders']):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

#### Segregating Target and Feature Variables

Since the target variable and the features are in one dataframe, I will separate them.

In [3]:
data = strat_train_set.drop('MPG',axis=1)
data_labels = strat_train_set['MPG'].copy()  # Target variable

data.head()

Unnamed: 0,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
145,4,83.0,61.0,2003.0,19.0,74,3
151,4,79.0,67.0,2000.0,16.0,74,2
388,4,156.0,92.0,2585.0,14.5,82,1
48,6,250.0,88.0,3139.0,14.5,71,1
114,4,98.0,90.0,2265.0,15.5,73,2


We have 7 attributes to work with.

The next step is to preprocess the Origin column by mapping its integer codes to country names.


#### Preprocessing the Origin Column

In [4]:
def preprocess_origin_cols(df):
    df['Origin'] = df['Origin'].map({1:'India', 2:'USA', 3:'Germany'})
    return df

data_tr = preprocess_origin_cols(data)
data_tr.head()

Unnamed: 0,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
145,4,83.0,61.0,2003.0,19.0,74,Germany
151,4,79.0,67.0,2000.0,16.0,74,USA
388,4,156.0,92.0,2585.0,14.5,82,India
48,6,250.0,88.0,3139.0,14.5,71,India
114,4,98.0,90.0,2265.0,15.5,73,USA
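
Note that `preprocess_origin_cols` assigns to the dataframe it receives, so `data` itself now holds the country names. The sketch below is one possible variant that works on a copy instead. It also shows that `Series.map` turns any code missing from the dictionary into NaN. The toy frame, the extra code 4, and the name `preprocess_origin_copy` are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy frame; the code 4 is deliberately absent from the mapping
toy = pd.DataFrame({"Origin": [1, 2, 3, 4]})
origin_map = {1: "India", 2: "USA", 3: "Germany"}

def preprocess_origin_copy(df, mapping=origin_map):
    """Return a new frame with the Origin codes replaced by labels."""
    out = df.copy()                           # leave the caller's frame untouched
    out["Origin"] = out["Origin"].map(mapping)
    return out

toy_tr = preprocess_origin_copy(toy)
print(toy_tr["Origin"].tolist())   # the unmapped code becomes NaN
print(toy["Origin"].tolist())      # the original codes are still integers
```

Working on a copy makes the function safe to call more than once; calling the in-place version twice would map the already converted strings to NaN.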


### 1. One Hot Encoding the Origin Column

In [5]:
data_tr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 318 entries, 145 to 362
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Cylinders     318 non-null    int64  
 1   Displacement  318 non-null    float64
 2   Horsepower    314 non-null    float64
 3   Weight        318 non-null    float64
 4   Acceleration  318 non-null    float64
 5   Model Year    318 non-null    int64  
 6   Origin        318 non-null    object 
dtypes: float64(4), int64(2), object(1)
memory usage: 19.9+ KB


There are missing values in the Horsepower column. That is all right; we will take care of them later.

The data type of the Origin column is object, which means it is a categorical column, a qualitative value that we have to encode before a model can use it.

In [6]:
# isolating the categorical variables
# (Origin is the only categorical variable in this dataset)
data_cat = data_tr[["Origin"]]
data_cat.head()

Unnamed: 0,Origin
145,Germany
151,USA
388,India
48,India
114,USA


In [7]:
# One hot encoding the categorical values
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
data_cat_1hot = cat_encoder.fit_transform(data_cat)
data_cat_1hot      # returns a sparse matrix

<318x3 sparse matrix of type '<class 'numpy.float64'>'
	with 318 stored elements in Compressed Sparse Row format>

We could have used the get_dummies() method from pandas, but since we are automating the process, it's better to use sklearn's OneHotEncoder class, which returns a sparse matrix and remembers the categories it learned.

First we instantiate the OneHotEncoder class. This instance has a method called fit_transform(<categorical_data>), which finds the categories present in each column and creates a one-hot vector for each category. It then returns the result as a sparse matrix.
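
To make the fit and transform steps concrete, here is a minimal sketch on a toy column (hypothetical data, not the auto-mpg frame). It separates `fit` from `transform` to show that the fitted encoder can be reused on new rows, which is the main practical difference from a one-off `get_dummies()` call.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column, not the auto-mpg data
toy_cat = pd.DataFrame({"Origin": ["USA", "India", "Germany", "India"]})

encoder = OneHotEncoder()
encoder.fit(toy_cat)                     # learns the categories (stored sorted)
one_hot = encoder.transform(toy_cat)     # SciPy sparse matrix

print(encoder.categories_)
print(one_hot.toarray())

# The fitted encoder can be reused on new rows with the same column
new_rows = pd.DataFrame({"Origin": ["Germany"]})
print(encoder.transform(new_rows).toarray())

# pandas alternative for a one-off encoding
print(pd.get_dummies(toy_cat["Origin"]).columns.tolist())
```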

In [8]:
# Converts the matrix into a 2D array 
data_cat_1hot.toarray()[:5]

array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [9]:
# Lists the categories that were encoded as one-hot vectors
cat_encoder.categories_  

[array(['Germany', 'India', 'USA'], dtype=object)]

We have now inspected and transformed the categorical variable. Later, this whole step will be a single statement in a pipeline.

The above was an explanation of what happens behind the scenes. We will soon build a pipeline that automates all of this, which is why we keep working with the original data.

### 2. Handling Missing Values using SimpleImputer

We saw that the data has null values, so we will take care of them now.

Steps:
1. Segregate the numerical columns
2. Use the SimpleImputer class to impute missing values with the column medians
3. Convert the result back to a dataframe
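
The three steps can be sketched end to end on a toy frame. The data below is hypothetical, and the sketch selects numerical columns by dtype with `select_dtypes` rather than by position, which is one possible way to do the first step.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing Horsepower value
toy = pd.DataFrame({
    "Horsepower": [60.0, np.nan, 90.0, 100.0],
    "Weight": [2000.0, 2500.0, 3000.0, 3500.0],
    "Origin": ["India", "USA", "Germany", "USA"],
})

# Step 1: keep only the numerical columns
toy_num = toy.select_dtypes(include="number")

# Step 2: replace each NaN with the median of its column
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(toy_num)      # returns a NumPy array

# Step 3: wrap the array back into a dataframe with the original labels
toy_filled = pd.DataFrame(filled, columns=toy_num.columns, index=toy_num.index)

print(imputer.statistics_)    # per-column medians
print(toy_filled)
```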

In [13]:
# segregating the numerical columns,
# leaving the Origin column as it is string.
num_data = data.iloc[:,:-1]
num_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 318 entries, 145 to 362
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Cylinders     318 non-null    int64  
 1   Displacement  318 non-null    float64
 2   Horsepower    314 non-null    float64
 3   Weight        318 non-null    float64
 4   Acceleration  318 non-null    float64
 5   Model Year    318 non-null    int64  
dtypes: float64(4), int64(2)
memory usage: 17.4 KB


In [23]:
# handling missing values
from sklearn.impute import SimpleImputer

# refer to the documentation for other imputation strategies
imputer = SimpleImputer(strategy="median")   # creates an imputer object of the SimpleImputer class
imputer.fit(num_data)      # learns each column's median and returns the fitted estimator

In [24]:
# returns medians of all the 6 columns we have
imputer.statistics_   

array([   4. ,  146. ,   92. , 2844. ,   15.5,   76. ])

In [25]:
# We can compute the same medians with pandas
# (use num_data, since data still contains the string Origin column)
num_data.median().values

array([   4. ,  146. ,   92. , 2844. ,   15.5,   76. ])

In [27]:
# imputing the missing values by transforming the dataframe
x = imputer.transform(num_data)  # imputes all missing values in num_data and returns imputed data in ndarray form
x

array([[   4. ,   83. ,   61. , 2003. ,   19. ,   74. ],
       [   4. ,   79. ,   67. , 2000. ,   16. ,   74. ],
       [   4. ,  156. ,   92. , 2585. ,   14.5,   82. ],
       ...,
       [   4. ,  135. ,   84. , 2295. ,   11.6,   82. ],
       [   4. ,  113. ,   95. , 2372. ,   15. ,   70. ],
       [   6. ,  146. ,  120. , 2930. ,   13.8,   81. ]])

In [28]:
# converting the 2D array back into a dataframe 
# (because easier to look at dataframe)
data_tr = pd.DataFrame(x, columns=num_data.columns, 
                       index=num_data.index)
data_tr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 318 entries, 145 to 362
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Cylinders     318 non-null    float64
 1   Displacement  318 non-null    float64
 2   Horsepower    318 non-null    float64
 3   Weight        318 non-null    float64
 4   Acceleration  318 non-null    float64
 5   Model Year    318 non-null    float64
dtypes: float64(6)
memory usage: 17.4 KB


The full dataset had 6 missing values, but only 4 of them are in the training set; the other 2 are in the test set.

When we evaluate on the test data, we will preprocess the test set in the same way.
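
A minimal sketch of what preprocessing the test set could look like for imputation: the imputer is fitted on training rows only, and the stored training median is then reused to fill the test rows. The two small arrays are hypothetical stand-ins for a single column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy arrays standing in for one column of train and test data
train_hp = np.array([[60.0], [80.0], [100.0], [np.nan]])
test_hp = np.array([[np.nan], [150.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train_hp)                        # median learned from training rows only

test_filled = imputer.transform(test_hp)     # reuses the training median
print(imputer.statistics_)
print(test_filled)
```

Calling `fit` again on the test rows would compute a different median and leak test information into the preprocessing, so only `transform` is used there.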

### 3. Adding Attributes using BaseEstimator and TransformerMixin

We add our custom attributes with a small transformer class that inherits from BaseEstimator and TransformerMixin.
BaseEstimator gives the class the get_params and set_params methods, which are useful for hyperparameter tuning.

TransformerMixin adds a fit_transform method built from the fit and transform methods that we define ourselves, so we can write any kind of transformer we need.
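
To show what each base class contributes, here is a minimal sketch with a hypothetical `RatioAdder` transformer (not the class used in this notebook). Only `__init__`, `fit`, and `transform` are written by hand; `get_params` and `fit_transform` are inherited.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column num_ix divided by column den_ix."""
    def __init__(self, num_ix=0, den_ix=1):
        self.num_ix = num_ix
        self.den_ix = den_ix
    def fit(self, X, y=None):
        return self                           # nothing to learn
    def transform(self, X):
        ratio = X[:, self.num_ix] / X[:, self.den_ix]
        return np.c_[X, ratio]                # append the ratio as a new column

X_toy = np.array([[19.0, 4.0], [16.0, 8.0]])
adder = RatioAdder()
print(adder.get_params())            # inherited from BaseEstimator
print(adder.fit_transform(X_toy))    # inherited from TransformerMixin
```

For `get_params` to work, every constructor argument should be stored on `self` under the same name, which is why `__init__` takes explicit keyword arguments.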

In [29]:
num_data.head()

Unnamed: 0,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year
145,4,83.0,61.0,2003.0,19.0,74
151,4,79.0,67.0,2000.0,16.0,74
388,4,156.0,92.0,2585.0,14.5,82
48,6,250.0,88.0,3139.0,14.5,71
114,4,98.0,90.0,2265.0,15.5,73


In [32]:
from sklearn.base import BaseEstimator, TransformerMixin

# column indexes of Acceleration, Horsepower, Cylinders
acc_ix, hpower_ix, cyl_ix = 4, 2, 0

class CustomAttrAdder(BaseEstimator, TransformerMixin):    # inherits from BaseEstimator and TransformerMixin; works on ndarrays
    def __init__(self, acc_on_power=True):                 # no *args or **kwargs
        self.acc_on_power = acc_on_power
    def fit(self, X, y=None):
        return self                                        # nothing else to do
    def transform(self, X):                                # X is a 2D array
        acc_on_cyl = X[:, acc_ix] / X[:, cyl_ix]
        if self.acc_on_power:
            acc_on_power = X[:, acc_ix] / X[:, hpower_ix]
            return np.c_[X, acc_on_power, acc_on_cyl]      # concatenates arrays
        return np.c_[X, acc_on_cyl]
    

attr_adder = CustomAttrAdder(acc_on_power=True)
data_tr_extra_attrs = attr_adder.transform(data_tr.values)
data_tr_extra_attrs[0]

array([4.0000000e+00, 8.3000000e+01, 6.1000000e+01, 2.0030000e+03,
       1.9000000e+01, 7.4000000e+01, 3.1147541e-01, 4.7500000e+00])

### 4. Creating a Pipeline of tasks

So far we have seen the individual elements. Now we will create a pipeline, so that we only pass the data, everything gets handled automatically, and we get the prepared data back.

For this we will use the Pipeline class from the sklearn.pipeline module. The other step we will add is StandardScaler, because it is good practice to scale the numerical values.
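
As a small illustration of what chaining and scaling do, the sketch below runs a two-step pipeline on a hypothetical toy array. After `StandardScaler`, each column should have a mean close to 0 and a standard deviation close to 1.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy array with one missing value in the second column
X_toy = np.array([[1.0, 100.0], [2.0, np.nan], [3.0, 300.0], [4.0, 200.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),   # step 1: fill NaN
    ("std_scaler", StandardScaler()),                # step 2: standardize
])
X_scaled = toy_pipeline.fit_transform(X_toy)

print(X_scaled.mean(axis=0))   # close to 0 for every column
print(X_scaled.std(axis=0))    # close to 1 for every column

# Each fitted step stays accessible by the name given in the pipeline
print(toy_pipeline.named_steps["imputer"].statistics_)
```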

In [34]:
# Using Pipeline class
from sklearn.pipeline import Pipeline
# Using StandardScaler to scale all the numerical attributes
from sklearn.preprocessing import StandardScaler

numerics = ['float64','int64']

# select the numerical columns from the original (not yet imputed) data
num_data = data.select_dtypes(include=numerics)

# pipeline for numerical attributes
# imputing -> adding attributes -> scale them
num_pipeline = Pipeline([
    ('imputer',  SimpleImputer(strategy='median')),         # Task 1: Impute Missing Values
    ('attrs_adder', CustomAttrAdder()),                     # Task 2: Add Custom Attributes
    ('std_scaler', StandardScaler()),                       # Task 3: Scale the values
])

num_data_tr = num_pipeline.fit_transform(num_data)
num_data_tr[0]

array([-0.85657842, -1.07804475, -1.15192977, -1.17220298,  1.21586943,
       -0.54436373,  1.70952741,  1.29565517])

#### Transforming Numerical and Categorical Attributes

Above we saw the processing for numerical values in the pipeline. Next, we can add categorical processing as well.

In [37]:
# Transform different columns or subsets using ColumnTransformer
from sklearn.compose import ColumnTransformer

num_attrs = list(num_data)
cat_attrs = ['Origin']

# Complete pipeline to transform
# Both numerical and categorical attributes
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attrs),
    ('cat', OneHotEncoder(), cat_attrs),
])

prepared_data = full_pipeline.fit_transform(data)
prepared_data

array([[-0.85657842, -1.07804475, -1.15192977, ...,  1.        ,
         0.        ,  0.        ],
       [-0.85657842, -1.1174582 , -0.9900351 , ...,  0.        ,
         0.        ,  1.        ],
       [-0.85657842, -0.3587492 , -0.31547399, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.85657842, -0.56566984, -0.53133355, ...,  0.        ,
         1.        ,  0.        ],
       [-0.85657842, -0.78244384, -0.23452666, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.32260746, -0.45728283,  0.44003446, ...,  1.        ,
         0.        ,  0.        ]])
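
Once fitted, the same object can prepare unseen rows with `transform`. The sketch below shows this on a hypothetical two-column toy frame with a simplified numerical pipeline (no custom attribute step), so the names `toy_num_pipeline` and `toy_full_pipeline` are illustrative assumptions rather than the objects defined above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy training frame and one unseen row
train = pd.DataFrame({
    "Horsepower": [60.0, 90.0, np.nan, 120.0],
    "Origin": ["India", "USA", "Germany", "USA"],
})
new_rows = pd.DataFrame({"Horsepower": [np.nan], "Origin": ["Germany"]})

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
toy_full_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, ["Horsepower"]),
    ("cat", OneHotEncoder(), ["Origin"]),
])

prepared_train = toy_full_pipeline.fit_transform(train)   # learn medians, scales, categories
prepared_new = toy_full_pipeline.transform(new_rows)      # reuse them on unseen rows
print(prepared_new)
```

The missing Horsepower in the new row is filled with the training median before scaling, and the Origin value is encoded with the categories learned from the training frame.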