# #️⃣ One Hot Encoding #️⃣

* the process by which **categorical variables** are **converted** to a numeric form that ML algorithms can use, which can **improve their predictions**

From my recollection, **one hot encoding** can be done with either:

* ```sklearn```'s ```preprocessing.OneHotEncoder```, OR
* ```keras```'s ```utils.to_categorical``` (found under ```np_utils``` in older versions).

# 📋 Suppose we have the following dataset:

In [1]:
# ╔═════════════╦══════════════════╦════════╗
# ║ CompanyName ║ CategoricalValue ║ Price  ║
# ╠═════════════╬══════════════════╬════════╣
# ║ VW          ║        1         ║ 20000  ║
# ║ Acura       ║        2         ║ 10011  ║
# ║ Honda       ║        3         ║ 50000  ║
# ║ Honda       ║        3         ║ 10000  ║
# ╚═════════════╩══════════════════╩════════╝

* Categorical value assignment (also called **integer encoding**) can be done with ```sklearn```'s ```LabelEncoder```.
* The **categorical value** is the **integer** that stands in for the entry's category (here, its company name).
* If we had another company, say *Tesla*, its **categorical value** would be 4.
* <font size="+2">This is just one example. In general, categorical values start from ```0``` and go up to ```n-1``` for ```n``` categories.</font>
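
As a rough sketch of the integer encoding step, the snippet below runs ```LabelEncoder``` on the company names from the table. The variable names are my own. Note that ```LabelEncoder``` numbers categories from ```0``` in sorted order, so the codes will not match the hand-assigned 1, 2, 3 in the table above.

```python
from sklearn.preprocessing import LabelEncoder

# Company names from the example table (variable names are illustrative).
companies = ["VW", "Acura", "Honda", "Honda"]

encoder = LabelEncoder()
codes = encoder.fit_transform(companies)

# classes_ holds the distinct categories in sorted order; a category's
# position in that array is its integer code.
print(list(encoder.classes_))  # ['Acura', 'Honda', 'VW']
print(codes.tolist())          # [2, 0, 1, 1]
```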


### One hot encoding *converts categories into binary features* with the following steps:

1. Start with a sequence of **already integer encoded** data.
1. Count the distinct categories and add that **same quantity** of new features (one per category).
1. **Each new feature is binary** and reflects whether the data point has the specific corresponding categorical value or not (i.e. **is_VW**, **is_Acura**, **is_Honda**)
1. So for our example, **each data point** should have exactly one new feature with a value of ```1```, and the rest of the new features have values of ```0```.
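
The steps above can be sketched in plain Python. This is only an illustration, not how any library does it: the helper name ```one_hot``` is hypothetical, and the codes are assumed to be 0-based (VW = 0, Acura = 1, Honda = 2) so they can be used directly as list positions.

```python
def one_hot(codes, n_categories):
    """Turn 0-based integer codes into rows of binary features."""
    rows = []
    for code in codes:
        row = [0] * n_categories  # one new feature per category, all off
        row[code] = 1             # switch on the feature for this category
        rows.append(row)
    return rows

# VW, Acura, Honda, Honda with assumed 0-based codes
codes = [0, 1, 2, 2]
encoded = one_hot(codes, n_categories=3)

for row in encoded:
    print(row)
```

Each row sums to 1, which is the "single feature with a value of 1" property from the last step.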

# After one hot encoding the first dataset:

In [2]:
# ╔════╦═══════╦═══════╦════════╗
# ║ VW ║ Acura ║ Honda ║ Price  ║
# ╠════╬═══════╬═══════╬════════╣
# ║ 1  ║ 0     ║ 0     ║ 20000  ║
# ║ 0  ║ 1     ║ 0     ║ 10011  ║
# ║ 0  ║ 0     ║ 1     ║ 50000  ║
# ║ 0  ║ 0     ║ 1     ║ 10000  ║
# ╚════╩═══════╩═══════╩════════╝
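
For reference, here is a minimal sketch of producing the three company columns with ```sklearn```'s ```OneHotEncoder```. It encodes only the company name (the Price column is left out), and it passes the category order explicitly, which is my own choice so the columns line up with the table; by default the encoder would sort the categories alphabetically.

```python
from sklearn.preprocessing import OneHotEncoder

# One column (company name), four rows; the encoder expects 2D input.
companies = [["VW"], ["Acura"], ["Honda"], ["Honda"]]

# Explicit category order keeps the columns as VW, Acura, Honda.
encoder = OneHotEncoder(categories=[["VW", "Acura", "Honda"]])

# fit_transform returns a sparse matrix by default, so convert it for display.
one_hot = encoder.fit_transform(companies).toarray()
print(one_hot)
```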

# ❓ What's the big deal about one hot encoding? Why not just use *label encoding* when it can assign categorical values?

* Because a model reads label-encoded integers as ordered numbers, so it **assumes greater categorical value indicates superior category**. ❌
    * So Honda > Acura > VW (*because 3 > 2 > 1*)... 😭 Errors aplenty will abound!
* By taking the extra step to make the data **binary** with **one hot encoding**, each category gets its own feature and no ordering is implied, so we escape this trap & all its associated error-prone models!
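
To make the trap concrete, the sketch below compares the two encodings using the codes from the example table. The dictionaries and the use of Euclidean distance are my own illustration, assuming a model that treats the features as plain numbers.

```python
import math
from itertools import combinations

# Integer codes from the example table, and the matching one hot vectors.
labels = {"VW": 1, "Acura": 2, "Honda": 3}
one_hot = {"VW": (1, 0, 0), "Acura": (0, 1, 0), "Honda": (0, 0, 1)}

# With integer codes, the "average" of VW and Honda lands exactly on Acura,
# and VW looks twice as far from Honda as it does from Acura.
mean_code = (labels["VW"] + labels["Honda"]) / 2
label_gaps = {(a, b): abs(labels[a] - labels[b]) for a, b in combinations(labels, 2)}

# With one hot vectors, every pair of companies is the same distance apart.
one_hot_gaps = {(a, b): math.dist(one_hot[a], one_hot[b]) for a, b in combinations(one_hot, 2)}

print(mean_code)     # 2.0, the same as Acura's code
print(label_gaps)    # gaps of 1, 2 and 1
print(one_hot_gaps)  # every gap is sqrt(2)
```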

*Resources*:
* Notes from this HackerNoon article: https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
* ```sklearn``` documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
* Blog post on one hot encoding sequence data with ```keras```: https://jovianlin.io/keras-one-hot-encode-decode-sequence-data/