# One-Hot Encoding

Review: Why do we preprocess data?

1. To transform the data to better suit a model's assumptions
2. To format the data in the way a model expects

### Inputs in Neural Networks & why we need One-Hot Encoding

<strong>Vectors</strong>: the inputs to a neural network. Each entry in the vector corresponds to a feature, and the network uses those features to make predictions.  

<strong>Caution</strong>: vectors can only contain numerical data. They cannot contain strings.
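
To see why strings are a problem, here is a minimal sketch, assuming NumPy arrays stand in for the input vectors. The values are taken from the first iris row; the variable names are illustrative only.

```python
import numpy as np

# Four numeric measurements convert to a float vector without trouble
features = np.array([5.1, 3.5, 1.4, 0.2], dtype=float)
print(features.dtype)

# Adding the string label makes the conversion to a numeric vector fail
try:
    np.array([5.1, 3.5, 1.4, 0.2, "Iris-setosa"], dtype=float)
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(conversion_failed)
```

The string has no numeric value, so it has to be encoded as numbers before it can go into a vector.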

```python
# Step 1: Get the data (the iris dataset, read from the UCI repository rather than via `load_iris()`)

import pandas as pd

# The raw file has no header row, so supply the column names ourselves
names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names=names)
df.head()
```

|   | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


Notice that one of the columns, `class`, holds strings. A neural network needs every value to be numerical, so this column has to be converted before we can build the model.  

The `class` column takes one of three values (spelled here as they appear in the data file):  
* Iris-setosa
* Iris-versicolor
* Iris-virginica
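
One way to confirm which columns still need encoding is to check each column's dtype. This is a rough sketch on a hand-typed sample of the rows shown above, so it runs without downloading the file; `sample` and `non_numeric` are illustrative names, not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# A few rows copied from the df.head() output above
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2],
    "petal_length": [1.4, 1.4, 1.3],
    "petal_width": [0.2, 0.2, 0.2],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

# Collect every column that pandas does not treat as numeric
non_numeric = [col for col in sample.columns if not is_numeric_dtype(sample[col])]
print(non_numeric)
```

On the full dataset, `df['class'].unique()` would list the distinct class strings.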

# Implementing One-Hot Encoding

One-hot encoding represents each class as a vector with a single `1` and `0` everywhere else. For example:

* setosa = `[1, 0, 0]`
* versicolor = `[0, 1, 0]`
* virginica = `[0, 0, 1]`

We get there in two steps:

1. **Label Encoding**. First, we convert the three possible classes to integer labels. E.g., `Iris-setosa` will be `1`; `Iris-versicolor`, `2`; and `Iris-virginica`, `3`.
2. **One-Hot Encoding**. Then, we set each row's `class` value to an _array_. This array has a `1` in the slot whose position matches the integer label and a `0` in every other slot. E.g., after one-hot encoding, a row with the class `Iris-setosa` will have the array `[1, 0, 0]`; a row with the class `Iris-virginica`, the array `[0, 0, 1]`; etc.
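
The two steps above can be sketched as follows. This is one possible implementation, not necessarily the one the notebook goes on to use: it assumes a plain dictionary for the label mapping and rows of a NumPy identity matrix for the one-hot arrays. The names `label_map`, `labels`, and `one_hot` are illustrative, and a short hand-typed series stands in for the `class` column.

```python
import numpy as np
import pandas as pd

# Stand-in for df['class'], with one repeated class
classes = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
    name="class",
)

# Step 1: label encoding, using the 1-based labels from the text
label_map = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}
labels = classes.map(label_map).to_numpy()

# Step 2: one-hot encoding. Row i of the identity matrix has its 1 in slot i,
# so we index with (label - 1) to turn 1-based labels into 0-based positions.
one_hot = np.eye(len(label_map), dtype=int)[labels - 1]

print(labels)
print(one_hot)
```

Libraries also offer this directly, for example `pd.get_dummies` in pandas or `OneHotEncoder` in scikit-learn; the manual version is shown here only because it mirrors the two steps.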