# Encoding and Scaling

The two best-known encoding methods are one-hot encoding and label encoding.

## Label Encoding

Label encoding is a simple way to convert categorical data into numerical data: each category is assigned an integer. It is typically used when the categories have a natural order.

The advantage of label encoding is that it adds no new columns and therefore no additional degrees of freedom.

A potential problem is that the integer codes can suggest an order among the categories that does not actually exist.

In Python, we can perform label encoding with the `LabelEncoder` class from the `sklearn.preprocessing` module. The encoder sorts the labels it has seen and numbers them starting at 0.

In [20]:
from sklearn.preprocessing import LabelEncoder

labels = ['BB', 'B', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'AA']

# The encoder sorts the labels and assigns each one an integer code
encoder = LabelEncoder()

# The "long" way: first let the encoder "learn" the labels, then transform
encoder.fit(labels)
encoder.transform(['B', 'D', 'F', 'H', 'J', 'A'])

array([ 2,  5,  7,  9, 11,  0])
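Separating `fit` from `transform` has one consequence worth seeing in practice: the encoder only knows the labels it was fitted on. The following is a minimal sketch with a hypothetical three-label list and the made-up name `small_encoder`; it shows that transforming an unseen label raises a `ValueError`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical example data, not taken from the notebook above
small_encoder = LabelEncoder()
small_encoder.fit(["A", "B", "C"])

try:
    # "Z" was not part of the fitted labels
    small_encoder.transform(["A", "Z"])
    unseen_rejected = False
except ValueError as err:
    unseen_rejected = True
    print(err)
```

If unseen labels can occur in practice, they need to be handled before calling `transform`, for example by fitting on the complete set of expected labels.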

In [21]:
# The "short" way: learn and transform in a single step
encoder.fit_transform(labels)

# # Afterwards, the fitted encoder can still be used with other lists
# encoder.transform(['B', 'D', 'F', 'H', 'J', 'A'])

array([ 3,  2,  0,  4,  5,  6,  7,  8,  9, 10, 11,  1])

In [22]:
# We can also go the other way and convert the numeric codes back into labels
encoder.inverse_transform([1, 3, 5, 7, 9, 2])

array(['AA', 'BB', 'D', 'F', 'H', 'B'], dtype='<U2')

In [23]:
encoder.classes_

array(['A', 'AA', 'B', 'BB', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
      dtype='<U2')
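Because `LabelEncoder` numbers the labels in sorted order, the codes do not necessarily follow a natural order of the categories. The sketch below uses hypothetical size labels to illustrate this, and shows one possible alternative: `OrdinalEncoder` from the same module, which accepts an explicit category order. The example data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordered categories: small < medium < large
sizes = ["small", "medium", "large", "medium"]

# LabelEncoder sorts alphabetically: large=0, medium=1, small=2
alphabetical = LabelEncoder().fit_transform(sizes)
print(alphabetical)

# OrdinalEncoder with an explicit order: small=0, medium=1, large=2
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered = ordinal.fit_transform(np.array(sizes).reshape(-1, 1)).ravel()
print(ordered)
```

With the alphabetical codes, "large" would rank below "small"; the explicit category list keeps the intended order.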

## One-Hot Encoding

One-hot encoding is a method for converting categorical data into numerical data by giving each category its own binary column. It is used when the categories have no natural order.

Advantages: The result does not depend on an arbitrary numbering of the categories, and no artificial order is introduced.

Disadvantages: Each category gets its own column, so new columns and degrees of freedom are created.

In Python, we can perform one-hot encoding with the `OneHotEncoder` class from the `sklearn.preprocessing` module. Alternatively, we can use the `get_dummies` function from the `pandas` module.

In [29]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

out = encoder.fit_transform(pd.DataFrame(labels)).toarray()
out

array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [30]:
classes = encoder.get_feature_names_out()
classes 

array(['x0_A', 'x0_AA', 'x0_B', 'x0_BB', 'x0_C', 'x0_D', 'x0_E', 'x0_F',
       'x0_G', 'x0_H', 'x0_I', 'x0_J'], dtype=object)

In [31]:
pd.DataFrame(out, columns=classes)

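|    | x0_A | x0_AA | x0_B | x0_BB | x0_C | x0_D | x0_E | x0_F | x0_G | x0_H | x0_I | x0_J |
|---:|-----:|------:|-----:|------:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
|  0 |  0.0 |   0.0 |  0.0 |   1.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |
|  1 |  0.0 |   0.0 |  1.0 |   0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |
|  2 |  1.0 |   0.0 |  0.0 |   0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |
|  3 |  0.0 |   0.0 |  0.0 |   0.0 |  1.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |
|  4 |  0.0 |   0.0 |  0.0 |   0.0 |  0.0 |  1.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |
|  5 |  0.0 |   0.0 |  0.0 |   0.0 |  0.0 |  0.0 |  1.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |
|  6 |  0.0 |   0.0 |  0.0 |   0.0 |  0.0 |  0.0 |  0.0 |  1.0 |  0.0 |  0.0 |  0.0 |  0.0 |
|  7 |  0.0 |   0.0 |  0.0 |   0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  1.0 |  0.0 |  0.0 |  0.0 |
|  8 |  0.0 |   0.0 |  0.0 |   0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  1.0 |  0.0 |  0.0 |
|  9 |  0.0 |   0.0 |  0.0 |   0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  1.0 |  0.0 |
| 10 |  0.0 |   0.0 |  0.0 |   0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  1.0 |
| 11 |  0.0 |   1.0 |  0.0 |   0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |  0.0 |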


In [33]:
pd.get_dummies(labels, dtype=int)

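|    | A | AA | B | BB | C | D | E | F | G | H | I | J |
|---:|--:|---:|--:|---:|--:|--:|--:|--:|--:|--:|--:|--:|
|  0 | 0 |  0 | 0 |  1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|  1 | 0 |  0 | 1 |  0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|  2 | 1 |  0 | 0 |  0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|  3 | 0 |  0 | 0 |  0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|  4 | 0 |  0 | 0 |  0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
|  5 | 0 |  0 | 0 |  0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
|  6 | 0 |  0 | 0 |  0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
|  7 | 0 |  0 | 0 |  0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
|  8 | 0 |  0 | 0 |  0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
|  9 | 0 |  0 | 0 |  0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 10 | 0 |  0 | 0 |  0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 11 | 0 |  1 | 0 |  0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |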
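The disadvantage mentioned above, additional columns and degrees of freedom, can be made concrete by comparing output shapes. The following sketch uses a hypothetical `color` column; it shows that the default encoding produces one column per category, while the `drop="first"` option of `OneHotEncoder` omits the first category and yields one column fewer. Whether dropping a column is appropriate depends on the model used afterwards.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical example data with three distinct categories
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Default: one column per category (blue, green, red)
full = OneHotEncoder().fit_transform(colors).toarray()
print(full.shape)

# drop="first": the first category (blue) becomes the all-zero row
reduced = OneHotEncoder(drop="first").fit_transform(colors).toarray()
print(reduced.shape)
print(reduced)
```

In pandas, `pd.get_dummies(colors["color"], drop_first=True)` has a comparable effect.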
