<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Dummies

Before I introduce the topic of dummies, let's import the automobiles dataset. 

We'll also deal with the missing data.

## Load the Automobiles Dataset

In [None]:
import numpy as np
import pandas as pd
# ensure that we can see all columns when we display a dataframe
pd.set_option('display.max_columns', None)

# read the automobiles dataset into a dataframe
auto_df = pd.read_csv('../../Data/automobiles.csv')

# drop the symboling and normalised losses columns
auto_df = auto_df.drop(['symboling', 'normalised_losses'], axis=1)
# drop the rows without a price
auto_df = auto_df.dropna(subset=['price']) 
auto_df


In [None]:
auto_df['body_style'].unique()

## Impute missing values

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
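imp_mode = SimpleImputer(strategy='most_frequent', missing_values=np.nan)   # "most_frequent" is the same as the mode
# fit_transform returns an (n, 1) array, so flatten it before assigning to a single column
auto_df['num_of_doors'] = imp_mode.fit_transform(auto_df[['num_of_doors']]).ravel()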

In [None]:
imp_mean = SimpleImputer(strategy='mean', missing_values=np.nan)
auto_df[['bore', 'stroke']] = imp_mean.fit_transform(auto_df[['bore','stroke']])

In [None]:
imp_median = SimpleImputer(strategy='median', missing_values=np.nan)
# use the median imputer here (not imp_mean)
auto_df[['horsepower','peak_rpm']] = imp_median.fit_transform(auto_df[['horsepower','peak_rpm']])
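
To see how the `mean` and `median` strategies differ, here is a minimal sketch on a made-up column (the values are hypothetical and not taken from the automobiles dataset). One large value pulls the mean up, while the median stays close to the typical rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value and one unusually large value
toy_df = pd.DataFrame({'horsepower': [100.0, 110.0, np.nan, 300.0]})

# Fill the missing value with the column mean, then separately with the median
mean_filled = SimpleImputer(strategy='mean').fit_transform(toy_df)
median_filled = SimpleImputer(strategy='median').fit_transform(toy_df)

# Row 2 held the missing value; compare what each strategy put there
print(mean_filled[2, 0], median_filled[2, 0])
```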

## Are we ready to train a model?

No, we can't train a model yet because lots of our columns contain strings. 

How could we fix this? 

If you look at the columns that contain strings, you can see that each one contains only a small, fixed set of values. 

In [None]:
auto_df['make'].unique()

In [None]:
auto_df['fuel_type'].unique()

In [None]:
auto_df['aspiration'].unique()

## A possible solution?

In cases like this we could substitute a number for each of the values. For example, perhaps 'gas' could be 0 and 'diesel' could be 1?

Or maybe we could number the car makes from 'alfa-romero'=0 through 'volvo'=21?

There's a problem with this approach: it imposes an order on the values, suggesting that somehow a 'volvo' is worth "more" than an 'alfa-romero'.
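
The numbering idea can be sketched as follows, on a hypothetical three-make toy dataframe rather than the real dataset. It uses pandas category codes, which is only one of several ways to assign the integers.

```python
import pandas as pd

# Hypothetical toy data, not the real automobiles dataset
toy_df = pd.DataFrame({'make': ['volvo', 'audi', 'bmw', 'audi']})

# Assign each make an integer code, in alphabetical order of the makes
toy_df['make_code'] = toy_df['make'].astype('category').cat.codes

# A model that treats make_code as a number would see volvo as "larger"
# than bmw and audi, an ordering that means nothing for car makes
print(toy_df)
```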

# Creating Dummies

What we actually need to do is create a column for each make. Each column then holds a 1 if the row is that make (e.g. "alfa-romero") and a 0 if it isn't. 

This sounds like a lot of work, but don't worry, it's not. This process is called creating dummies (also known as one-hot encoding), and a function for doing it, `get_dummies`, is built into pandas. 

In [None]:
# dtype=int gives 0/1 columns (recent pandas versions default to True/False)
auto_dummies_df = pd.get_dummies(auto_df, dtype=int)
auto_dummies_df
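
On the full dataset the result is wide, so here is a small sketch on a hypothetical two-column dataframe that makes the behaviour easier to see. String columns are expanded into one column per value, and numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy data: one string column and one numeric column
toy_df = pd.DataFrame({
    'fuel_type': ['gas', 'diesel', 'gas'],
    'price': [13495, 7099, 16500],
})

# Only the string column is expanded; dtype=int asks for 0/1 values
toy_dummies_df = pd.get_dummies(toy_df, dtype=int)
print(toy_dummies_df)
```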


> OneHotEncoder()
>
> Note that you can also create dummies with scikit-learn's OneHotEncoder().
>
> Older versions of OneHotEncoder only worked on numerical data, so string categories first had to be converted to numbers (e.g. with LabelEncoder). Since scikit-learn 0.20, OneHotEncoder accepts string columns directly, and LabelEncoder is intended for encoding the target rather than the features.
>
> For quick exploration in a notebook, pandas' get_dummies is usually simpler to use. OneHotEncoder is worth considering when the same encoding has to be reapplied to new data, because it remembers the categories it was fitted on.
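
As a rough sketch of the scikit-learn route: in current scikit-learn releases (0.20 and later), OneHotEncoder can be fitted directly on a string column, with no LabelEncoder step. The toy dataframe is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the real automobiles dataset
toy_df = pd.DataFrame({'fuel_type': ['gas', 'diesel', 'gas']})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(toy_df[['fuel_type']]).toarray()

# categories_ records the values the encoder learned, one array per input column
print(encoder.categories_)
print(encoded)
```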


# Dummies: drop_first=True

Consider the fuel types for 4 rows of data - shown as dummies.

| # | fuel_type_diesel | fuel_type_gas |
|---|------------------|---------------|
|  1|                 0|              1|
|  2|                 0|              1|
|  3|                 1|              0|
|  4|                 0|              1|

Because there are only 2 possible values, if the fuel type isn't gas then we know for certain that it must be diesel. Therefore I can represent the same information with only one column: 

| # | fuel_type_gas |
|---|---------------|
|  1|              1|
|  2|              1|
|  3|              0|
|  4|              1|
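
A quick way to check that nothing is lost is to rebuild the dropped column from the one that remains. The sketch below does this for the four rows shown in the tables, using a hypothetical toy dataframe.

```python
import pandas as pd

# The same four rows as in the tables above
toy_df = pd.DataFrame({'fuel_type': ['gas', 'gas', 'diesel', 'gas']})

full = pd.get_dummies(toy_df, dtype=int)
reduced = pd.get_dummies(toy_df, drop_first=True, dtype=int)

# The dropped diesel column is fully determined by the gas column
rebuilt_diesel = 1 - reduced['fuel_type_gas']
print(rebuilt_diesel.tolist() == full['fuel_type_diesel'].tolist())
```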


The same goes for the makes of cars: if we know that the make isn't any of "audi" through "volvo", then it must be "alfa-romero". 

Because the first column can always be inferred from the others, we should drop it; otherwise we're carrying redundant information. 

Furthermore, keeping this column causes problems for some models. In a linear regression with an intercept, for example, the full set of dummy columns always sums to 1, which makes the columns perfectly collinear (the "dummy variable trap").

In [None]:
# dtype=int gives 0/1 columns (recent pandas versions default to True/False)
auto_dummies_df = pd.get_dummies(auto_df, drop_first=True, dtype=int)
auto_dummies_df
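
One way to see the kind of trouble a redundant dummy column can cause is to look at the rank of the matrix a linear model would work with. This is only an illustrative sketch on hypothetical toy data; `design_matrix` is a helper made up for this example, not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real automobiles dataset
toy_df = pd.DataFrame({'fuel_type': ['gas', 'gas', 'diesel', 'gas']})

def design_matrix(dummies_df):
    """Prepend an intercept column of ones, as a linear model would."""
    intercept = np.ones((len(dummies_df), 1))
    return np.hstack([intercept, dummies_df.to_numpy(dtype=float)])

X_full = design_matrix(pd.get_dummies(toy_df))
X_reduced = design_matrix(pd.get_dummies(toy_df, drop_first=True))

# Full dummies: the diesel and gas columns add up to the intercept column,
# so the rank is lower than the number of columns
print(X_full.shape[1], np.linalg.matrix_rank(X_full))
# drop_first=True: the rank matches the number of columns
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```

When the rank is lower than the number of columns, the coefficients of an unregularised linear model are not uniquely determined.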