# 1. Introduction
- Machine learning models cannot work directly with categorical features, so we transform them into numerical features. We can apply the following methods:
    - One-hot encoding
    - Ordinal encoding
    - Label encoding of the target variable
- Exercises
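As a quick orientation, the three encoders listed above can be compared side by side. This is a minimal sketch on hypothetical toy data (the `colour`, `size` and `target` values are made up for illustration and are not used elsewhere in this notebook).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one nominal feature, one ordered feature, one target.
colour = np.array([["red"], ["green"], ["red"]])
size = np.array([["small"], ["large"], ["small"]])
target = ["yes", "no", "yes"]

# One output column per category (categories are sorted: green, red).
onehot = OneHotEncoder().fit_transform(colour).toarray()

# One integer per category, following the order given in `categories`.
ordinal = OrdinalEncoder(categories=[["small", "large"]]).fit_transform(size)

# Integer labels for a 1-D target (classes are sorted: no, yes).
labels = LabelEncoder().fit_transform(target)

print(onehot)
print(ordinal)
print(labels)
```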

# 2. One-hot encoding

In [1]:
import numpy as np

from sklearn.preprocessing import OneHotEncoder

In [2]:
X = np.array([["A"],["A"],["B"],["C"]])

In [3]:
X

array([['A'],
       ['A'],
       ['B'],
       ['C']], dtype='<U1')

In [4]:
enc = OneHotEncoder()

In [6]:
enc.fit_transform(X).todense()

matrix([[1., 0., 0.],
        [1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])

## 2.1 Output with dense matrix 

In [9]:
enc = OneHotEncoder(sparse_output=False)

In [10]:
enc.fit_transform(X)

array([[1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

## 2.2 Removing one dummy variable

In [11]:
enc = OneHotEncoder(sparse_output=False, drop='first')

In [12]:
enc.fit_transform(X)

array([[0., 0.],
       [0., 0.],
       [1., 0.],
       [0., 1.]])

In [13]:
print(type(X))

<class 'numpy.ndarray'>


## 2.3 Removing one dummy variable from binary features

In [14]:
X = np.array([["A"],["A"],["A"],["C"]])

In [15]:
X

array([['A'],
       ['A'],
       ['A'],
       ['C']], dtype='<U1')

In [16]:
enc = OneHotEncoder(sparse_output=False, drop="if_binary")
enc.fit_transform(X)

array([[0.],
       [0.],
       [0.],
       [1.]])

## 2.4 Error handling

In [18]:
enc = OneHotEncoder(sparse_output=False)

enc.fit(X)

In [19]:
enc.transform(X)

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])

In [23]:
Y = np.array([["A"],["A"],["B"],["C"]])

In [24]:
enc.transform(Y)

ValueError: Found unknown categories ['B'] in column 0 during transform

The category `B` was not seen during `fit`, and the default `handle_unknown='error'` rejects it.

In [25]:
enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

In [26]:
enc.fit(X)

In [27]:
enc.transform(Y)

array([[1., 0.],
       [1., 0.],
       [0., 0.],
       [0., 1.]])

In [28]:
X = [["A","X"],["B","Y"],["C","Z"]]

## 2.5 Categories parameter

In [29]:
enc = OneHotEncoder(sparse_output=False, categories=[["A","B","C","D"], ["X","Y","Z"]])

In [30]:
enc.fit_transform(X)

array([[1., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0., 0., 1.]])

In [32]:
Y = [["D","Z"]]

In [33]:
enc.transform(Y)

array([[0., 0., 0., 1., 0., 0., 1.]])

# 3. Ordinal encoding

In [34]:
from sklearn.preprocessing import OrdinalEncoder

In [40]:
X = [["High"],["Low"],["Low"],["Medium"]]

In [41]:
enc = OrdinalEncoder()

In [42]:
enc.fit_transform(X)

array([[0.],
       [1.],
       [1.],
       [2.]])

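By default, the categories are sorted alphabetically (High -> 0, Low -> 1, Medium -> 2), which does not reflect their natural order. We want: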
- Low -> 0
- Medium -> 1
- High -> 2

In [43]:
enc = OrdinalEncoder(categories=[["Low","Medium","High"]])

enc.fit_transform(X)

array([[2.],
       [0.],
       [0.],
       [1.]])
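The order used by each encoder can be read from its `categories_` attribute. This sketch, assuming the same `X`, compares the default alphabetical order with the explicitly ranked one; the position of a category in the list is the integer it is encoded as.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [["High"], ["Low"], ["Low"], ["Medium"]]

default_enc = OrdinalEncoder().fit(X)
ranked_enc = OrdinalEncoder(categories=[["Low", "Medium", "High"]]).fit(X)

# Default: categories sorted alphabetically.
print(default_enc.categories_[0])
# Explicit: categories kept in the order we provided.
print(ranked_enc.categories_[0])
```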

In [46]:
X = [["High","A"],["Low","C"],["Low","B"],["Medium","C"]]
X

[['High', 'A'], ['Low', 'C'], ['Low', 'B'], ['Medium', 'C']]

In [47]:
enc = OrdinalEncoder(categories=[["Low","Medium","High"],["A","B","C"]])

enc.fit_transform(X)

array([[2., 0.],
       [0., 2.],
       [0., 1.],
       [1., 2.]])
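The encoding can also be reversed. As a small sketch with the same two-column setup, `inverse_transform` maps integer codes back to the original category names, column by column.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [["High", "A"], ["Low", "C"], ["Low", "B"], ["Medium", "C"]]

enc = OrdinalEncoder(categories=[["Low", "Medium", "High"], ["A", "B", "C"]])
encoded = enc.fit_transform(X)

# Decode two rows of integer codes back to category names.
decoded = enc.inverse_transform([[2, 0], [0, 2]])
print(decoded)
```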

In [48]:
X = [["Under 18"],["18-25"],["25-30"],["Over 30"]]

In [49]:
enc = OrdinalEncoder(categories=[["Under 18","18-25","25-30","Over 30"]])

enc.fit_transform(X)

array([[0.],
       [1.],
       [2.],
       [3.]])
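Like `OneHotEncoder`, `OrdinalEncoder` raises an error on unseen categories by default. One possible way to handle them, sketched below, is `handle_unknown='use_encoded_value'` together with `unknown_value`; the value `-1` and the category `"Unknown"` are illustrative choices, not part of the original example.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [["Under 18"], ["18-25"], ["25-30"], ["Over 30"]]

enc = OrdinalEncoder(
    categories=[["Under 18", "18-25", "25-30", "Over 30"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # code assigned to categories not seen during fit
)
enc.fit(X)

# "Unknown" was never seen, so it is mapped to unknown_value.
result = enc.transform([["Over 30"], ["Unknown"]])
print(result)
```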

# 4. Label encoding for target variable

In [50]:
from sklearn.preprocessing import LabelEncoder

In [51]:
# LabelEncoder expects a 1-D sequence of labels, not a column of single-element lists
y = ["A","B","B","C","D"]

In [52]:
enc = LabelEncoder()

In [53]:
enc.fit_transform(y)

array([0, 1, 1, 2, 3])

In [54]:
enc.inverse_transform(enc.fit_transform(y))

array(['A', 'B', 'B', 'C', 'D'], dtype='<U1')
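The mapping learned by `LabelEncoder` is stored in `classes_`: the sorted unique labels, where the position of a label is its integer code. A short sketch, assuming a 1-D list of labels:

```python
from sklearn.preprocessing import LabelEncoder

y = ["A", "B", "B", "C", "D"]

enc = LabelEncoder()
codes = enc.fit_transform(y)

# Sorted unique labels; the index of each label is its code.
print(enc.classes_)

# Already-seen labels can be encoded again with transform.
new_codes = enc.transform(["D", "A"])
print(new_codes)
```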

# 5. Exercises

---

**Exercise 1**

- Use the dataset
  ```python
  X = [["X","High"],["Y","Low"],["Z","Medium"],["X","Low"]]
  ```

- Apply one-hot encoding to all variables.
- Apply one-hot encoding to the first column and ordinal encoding to the second column, using the ranking Low < Medium < High.

---

**Exercise 2**

- Use the dataset
  ```python
  X = [["X","High"],["Y","Low"],["Z","Medium"],["X","Low"]]
  ```

- Apply One-hot encoding to the first variable and ordinal encoding to the second variable.
  - The first variable must be encoded considering this set of values: X, Y, Z, W.
  - The second variable must be encoded considering this set of ranked values: Low, Medium, High, Very high
- Consider the dataset:
  ```python
  y = [["W", "Very high"]]
  ```
- Transform this dataset according to the fitted encoder