# Dummy Encoding

When storing categorical values such as __France__, __Germany__, and __Spain__, the values shouldn't simply be stored as 1, 2, and 3. Many machine learning algorithms will treat 3 as a greater value than 1 and 2, which implies an ordering between the countries that doesn't exist and distorts the model. Instead, each category gets its own column, with 1 indicating the value is present and 0 indicating it isn't. This way, no category is ranked above another.
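To make the column layout concrete, here is a minimal sketch in plain Python. The `countries` list is hypothetical sample data, and the alphabetical column order is a choice made for this illustration.

```python
# Hypothetical sample data; the country names come from the text above.
countries = ["France", "Spain", "Germany", "Spain"]

# Sort the distinct categories so the column order is deterministic.
categories = sorted(set(countries))  # ['France', 'Germany', 'Spain']

# One row per observation, one 0/1 column per category.
one_hot = [[1 if value == category else 0 for category in categories]
           for value in countries]

for value, row in zip(countries, one_hot):
    print(f"{value:<8} {row}")
```

Each row contains exactly one 1, so no country ends up numerically "larger" than another.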

<img src="dummyEncoding.png" alt="Dummy Encoding" style="width: 800px;"/>


<hr>

## The Dummy Variable Trap

<img src="dummyVariableTrap.png" alt="Dummy Variable Trap" style="width: 700px;"/>

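Qualitative (categorical) data must be turned into dummy variables before it can be used in a regression. The problem is that, when every category gets its own dummy column, the California column can be predicted exactly from the New York column _(multicollinearity)_. Writing $D_1$ for the New York dummy and $D_2$ for the California dummy, every row satisfies...

$$\LARGE D_2 = 1 - D_1$$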

This can easily be fixed by removing one of the dummy variables from the independent variable dataset. The effect of California is then absorbed into the $b_0$ constant. So how does the algorithm differentiate between the constant and the remaining dummy variable? Simply put, $b_4$ will not just be the coefficient for New York; it will be the coefficient for the difference between New York and California. In general, a variable with $k$ categories needs only $k - 1$ dummy columns.
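
The following sketch illustrates both points in plain Python. The `states` and `profit` values are made up for illustration, and the model is reduced to a single New York dummy plus a constant so that the closed-form least-squares solution can be used.

```python
from statistics import mean

# Hypothetical data: the state of each company and its profit (made-up numbers).
states = ["New York", "California", "California", "New York", "California"]
profit = [120.0, 100.0, 90.0, 140.0, 110.0]

new_york = [1 if s == "New York" else 0 for s in states]
california = [1 if s == "California" else 0 for s in states]

# The trap: the two dummies always sum to 1, which is exactly the constant
# column that carries b0, so the columns are linearly dependent.
assert all(ny + ca == 1 for ny, ca in zip(new_york, california))

# Keep only the New York dummy and fit profit = b0 + b4 * new_york
# using the closed-form least-squares solution for a single regressor.
x_bar, y_bar = mean(new_york), mean(profit)
b4 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(new_york, profit))
      / sum((x - x_bar) ** 2 for x in new_york))
b0 = y_bar - b4 * x_bar

mean_ca = mean([p for s, p in zip(states, profit) if s == "California"])
mean_ny = mean([p for s, p in zip(states, profit) if s == "New York"])

print(f"b0 = {b0:.2f}, California mean = {mean_ca:.2f}")
print(f"b4 = {b4:.2f}, New York mean - California mean = {mean_ny - mean_ca:.2f}")
```

In this reduced model the constant equals the California average and the New York coefficient equals the gap between the two state averages, which is the "difference" interpretation described above.
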
<hr>

## Code

In [5]:
# Encoding Categorical Data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

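# One-hot encode column 0 and pass the remaining columns through unchanged
ct = ColumnTransformer([('one-hot-encoder', OneHotEncoder(categories='auto'), [0])], remainder='passthrough')
X = ct.fit_transform(X)

# Preventing the Dummy Variable Trap: drop the first dummy column
X = X[:, 1:]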

X

array([[0.0, 0.0, 44.0, 72000.0],
       [0.0, 1.0, 27.0, 48000.0],
       [1.0, 0.0, 30.0, 54000.0],
       [0.0, 1.0, 38.0, 61000.0],
       [1.0, 0.0, 40.0, 63777.77777777778],
       [0.0, 0.0, 35.0, 58000.0],
       [0.0, 1.0, 38.77777777777778, 52000.0],
       [0.0, 0.0, 48.0, 79000.0],
       [1.0, 0.0, 50.0, 83000.0],
       [0.0, 0.0, 37.0, 67000.0]], dtype=object)
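
Slicing off the first column works here because the dummy columns come first in the transformed array. As an alternative sketch, assuming a scikit-learn version whose `OneHotEncoder` accepts the `drop` parameter, the encoder can omit the first category itself. The `countries` sample below is hypothetical and only stands in for the first column of `X`.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column sample; the encoder expects 2-D input.
countries = [["France"], ["Spain"], ["Germany"], ["Spain"]]

# drop="first" omits the dummy for the first category in sorted order
# (France here), leaving one column each for Germany and Spain.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(countries)

# The default output is a sparse matrix; convert it for display.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

print(encoder.categories_)
print(encoded)
```

France is then represented by a row of zeros, matching the pattern in the first two columns of the output above.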

Unlike the first column of _X_, which held 3 categories, the _y_ column only holds 2 categories, which can be represented as the values 0 and 1. The _LabelEncoder_ is a more fitting choice than the _OneHotEncoder_ here: it is intended for encoding target labels, and with only two classes a single 0/1 column introduces no artificial ordering.

In [6]:
## Dummy Encoding - 2 Categories
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(y)

y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
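
To show roughly what the label encoding step does, here is a plain-Python sketch that assigns integer codes to the distinct labels in sorted order. The `y_raw` labels are hypothetical, since the original values of `y` are not shown in this section; they were chosen so the result matches the output above.

```python
# Hypothetical labels; the original y values are not shown in this section.
y_raw = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

# Distinct labels in sorted order; each label's position becomes its code.
classes = sorted(set(y_raw))  # ['No', 'Yes']
to_code = {label: code for code, label in enumerate(classes)}

y_encoded = [to_code[label] for label in y_raw]
print(y_encoded)
```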