# 🤔 One Hot Encoding Practice 🤔

* 😈 Manual One Hot Encoding
* 🧪 One Hot Encode with ```sklearn.preprocessing.OneHotEncoder```
* 💗 One Hot Encode with ```keras.utils.to_categorical```
* 💝 More with Keras

# 😈 Manual One Hot Encoding

In [1]:
from numpy import argmax

### Define input string

In [2]:
data = 'hello world'
print(data)

hello world


### Define universe of possible input values

In [3]:
alphabet = 'abcdefghijklmnopqrstuvwxyz '

### Define a mapping of characters to integers

In [4]:
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

### Use the mapping to *integer encode* the input data

In [5]:
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)

[7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]


### One hot encode

In [6]:
onehot_encoded = list()
for value in integer_encoded:
    letter = [0 for _ in range(len(alphabet))]
    letter[value] = 1
    onehot_encoded.append(letter)
print(onehot_encoded)

[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
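
The loop above can also be written without an explicit Python loop. The following is a minimal NumPy sketch of the same idea, not part of the original notebook; the name ```onehot_np``` is introduced here only for illustration. It relies on the fact that row *i* of an identity matrix is the one hot vector for integer *i*.

```python
import numpy as np

alphabet = 'abcdefghijklmnopqrstuvwxyz '
char_to_int = {c: i for i, c in enumerate(alphabet)}
integer_encoded = [char_to_int[c] for c in 'hello world']

# Row i of the identity matrix is the one hot vector for integer i,
# so indexing the identity matrix with the integer codes encodes the whole sequence.
onehot_np = np.eye(len(alphabet), dtype=int)[integer_encoded]

print(onehot_np.shape)  # (11, 27)
print(onehot_np[0])     # the vector for 'h': a single 1 at index 7
```

The result is a 2D array with one row per character, which should match the nested list printed above row for row.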


### Invert encoding on first letter

Done like so:
* Locate the index of the largest value in the binary vector using the NumPy ```argmax()``` function
* Look that integer up in the reverse lookup table of integers to characters (```int_to_char```, *created in cell 4*)

The first letter was "h", so that should be what is returned.

In [7]:
inverted = int_to_char[argmax(onehot_encoded[0])]
print(inverted)

h
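
The same two steps can be applied to every vector to recover the full string. This is a small sketch that rebuilds the variables from the cells above so it runs on its own; the name ```decoded``` is an illustrative addition.

```python
from numpy import argmax

alphabet = 'abcdefghijklmnopqrstuvwxyz '
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for i, c in enumerate(alphabet)}

data = 'hello world'

# Rebuild the one hot vectors the same way as in the cells above
onehot_encoded = []
for char in data:
    letter = [0] * len(alphabet)
    letter[char_to_int[char]] = 1
    onehot_encoded.append(letter)

# Invert every vector: argmax finds the position of the 1,
# int_to_char maps that position back to a character
decoded = ''.join(int_to_char[int(argmax(vector))] for vector in onehot_encoded)
print(decoded)  # hello world
```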


# 🧪 One Hot Encode with ```sklearn.preprocessing.OneHotEncoder```

Use this approach when the input sequence **fully captures** the expected range of input values, because the encoders learn the set of categories from the data they are fitted on.

In [8]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

### Define example

In [9]:
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
print(values)

['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']


### Integer encode

In [10]:
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

[0 0 2 0 1 1 2 0 2 1]


### Binary encode

* The ```OneHotEncoder``` class by default returns a more memory-efficient sparse encoding
* A sparse matrix is not suitable for some applications, e.g. as input to the ```keras``` deep learning library
* Here, we **disable** the sparse return type by setting the *sparse=False* argument (renamed to *sparse_output=False* in scikit-learn 1.2; the old name was removed in 1.4)

In [11]:
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]


In scikit-learn 0.20 and later, the OneHotEncoder accepts string categories directly. If you only used a LabelEncoder beforehand to convert the categories to integers, you can skip that step and pass the strings straight to the OneHotEncoder.
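
Using the ```OneHotEncoder``` directly on the string labels might look like the sketch below. It assumes a scikit-learn version that accepts string input, and it calls ```toarray()``` on the default sparse result instead of passing a sparse-related argument, so that the snippet does not depend on that argument's name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

values = np.array(['cold', 'cold', 'warm', 'cold', 'hot',
                   'hot', 'warm', 'cold', 'warm', 'hot'])

# OneHotEncoder expects a 2D array of shape (n_samples, n_features)
column = values.reshape(-1, 1)

encoder = OneHotEncoder()
# The default return type is a SciPy sparse matrix; toarray() makes it dense
onehot = encoder.fit_transform(column).toarray()

print(encoder.categories_[0])  # ['cold' 'hot' 'warm']
print(onehot[0])               # [1. 0. 0.]

# inverse_transform maps one hot rows straight back to the original strings
print(encoder.inverse_transform(onehot[:1]))  # [['cold']]
```

With this variant the encoder itself handles the inversion, so no separate ```LabelEncoder``` is needed to get back to text labels.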


### Invert first example

* Useful when we receive a prediction in this 3-value (cold, hot, warm) one hot encoding
* Uses the NumPy ```argmax()``` function to find the index of the column with the largest value
* This column index is fed to the ```LabelEncoder``` to inverse transform it back to a text label

In [12]:
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print(inverted)

['cold']


# 💗 One Hot Encode with ```keras.utils.to_categorical```

```to_categorical``` one hot encodes an integer-encoded sequence directly.

Say we start with an integer-encoded sequence:
```python
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
```

* **IF** the sequence contains **all known values,**
    * **THEN** we can use ```to_categorical()``` directly.
* **IF** the sequence **does not represent all possible values**,
    * **THEN** we specify the number of classes explicitly: *to_categorical(data, num_classes=4)*.
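
To illustrate why the number of classes matters, here is a NumPy sketch that imitates the behaviour described above. The helper ```one_hot``` is hypothetical and is not the Keras implementation. It assumes, as ```to_categorical``` does to my understanding, that a missing ```num_classes``` is inferred as the largest label plus one.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Hypothetical NumPy stand-in for to_categorical on a 1D integer array."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        # Without num_classes, the width is inferred as largest label + 1
        num_classes = int(labels.max()) + 1
    encoded = np.zeros((labels.size, num_classes), dtype='float32')
    # Set a single 1 per row, at the column given by that row's label
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

full = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]  # contains every class 0..3
partial = [1, 0, 2, 1]                  # class 3 never appears

print(one_hot(full).shape)                    # (10, 4)
print(one_hot(partial).shape)                 # (4, 3), one column short
print(one_hot(partial, num_classes=4).shape)  # (4, 4)
```

Passing the class count explicitly keeps the encoded width stable across batches, even when a batch happens to miss the highest class.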

In [13]:
from numpy import array
from numpy import argmax
from keras.utils import to_categorical

Using TensorFlow backend.


### Define example

In [14]:
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
data = array(data)
print(data)

[1 3 2 0 3 2 2 1 0 1]


### One hot encode

In [15]:
encoded = to_categorical(data)
print(encoded)

[[0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]]


### Invert encoding

In [16]:
inverted = argmax(encoded[0])
print(inverted)

1


# 💝 More with Keras

In [17]:
import numpy as np
from keras.utils import to_categorical

In [18]:
data = np.array([1, 5, 3, 8])
print(data)

[1 5 3 8]


### Encode

In [19]:
def encode(data):
    print('Shape of data BEFORE encode: %s' % str(data.shape))
    encoded = to_categorical(data)
    print('Shape of data AFTER encode: %s\n' % str(encoded.shape))
    return encoded

In [20]:
encoded_data = encode(data)
print(encoded_data)

Shape of data BEFORE encode: (4,)
Shape of data AFTER encode: (4, 9)

[[0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]
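
The encoded array has 9 columns even though only 4 distinct values occur, because the labels are used as column indices and the largest label is 8. If a compact encoding is preferred, one option (not covered in the original notebook) is to remap the labels to contiguous indices first. The sketch below uses ```np.unique``` for that; the names ```classes```, ```indices``` and ```compact``` are illustrative.

```python
import numpy as np

data = np.array([1, 5, 3, 8])

# classes holds the sorted distinct labels;
# indices gives the position of each label within classes
classes, indices = np.unique(data, return_inverse=True)

# One column per distinct label instead of one per integer up to the maximum
compact = np.eye(len(classes))[indices]
print(compact.shape)  # (4, 4)

# Decoding goes through classes to recover the original labels
decoded = classes[np.argmax(compact, axis=1)]
print(decoded)  # [1 5 3 8]
```

The trade-off is that ```classes``` must be kept alongside the encoding, since the column index alone no longer equals the original label.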


### Decode

In [21]:
def decode(datum):
    return np.argmax(datum)

In [22]:
for i in range(encoded_data.shape[0]):
    datum = encoded_data[i]
    print('index: %d' % i)
    print('encoded datum: %s' % datum)
    decoded_datum = decode(datum)
    print('decoded datum: %s' % decoded_datum)
    print()

index: 0
encoded datum: [0. 1. 0. 0. 0. 0. 0. 0. 0.]
decoded datum: 1

index: 1
encoded datum: [0. 0. 0. 0. 0. 1. 0. 0. 0.]
decoded datum: 5

index: 2
encoded datum: [0. 0. 0. 1. 0. 0. 0. 0. 0.]
decoded datum: 3

index: 3
encoded datum: [0. 0. 0. 0. 0. 0. 0. 0. 1.]
decoded datum: 8
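
The row-by-row loop can be collapsed into one call by giving ```argmax``` an axis. In this short sketch the encoded matrix is rebuilt with NumPy so that it runs without Keras; ```decoded_all``` is an illustrative name.

```python
import numpy as np

# Same matrix as printed above, rebuilt with NumPy
encoded_data = np.eye(9)[[1, 5, 3, 8]]

# axis=1 takes the argmax within each row, decoding all rows in one call
decoded_all = np.argmax(encoded_data, axis=1)
print(decoded_all)  # [1 5 3 8]
```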



*Sources/References:*
* https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
* https://jovianlin.io/keras-one-hot-encode-decode-sequence-data/