<h1>Categorical Transformations</h1>

To convert categorical data into numbers that M.L. algorithms can work with, we use categorical transformations:
- `One Hot Encoding (O.H.E.)`
- `Label Encoding`

<h1>One Hot Encoding</h1>

This transformation technique works in 2 steps:
- Step 1: For each column, <b>compute</b> all the <b>unique categories</b> and <b>create</b> a <b>feature vector</b> from them.
- Step 2: Apply the One Hot Encoding transformation, i.e. put `1` in the position of the row's category and `0` everywhere else.
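
To make the two steps concrete before handing them to sklearn, here is a minimal NumPy-only sketch. It is an illustration of the idea, not how `OneHotEncoder` is implemented internally; the variable names `categories` and `one_hot` are just for this example.

```python
import numpy as np

gender = np.array(["Male", "Male", "Female", "Male", "Female"])

# Step 1: learn the unique categories (np.unique returns them sorted)
categories = np.unique(gender)

# Step 2: one column per category, 1.0 where the row matches that category
one_hot = (gender[:, None] == categories[None, :]).astype(float)

print("Categories:", categories)
print("Encoded values:\n", one_hot)
```

Each row has exactly one `1`, in the column of its own category, which is the same layout the sklearn encoder produces in the next cell.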

In [None]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd

# Initializing the object
enc = OneHotEncoder(handle_unknown='ignore',sparse_output=False)
# sparse_output=False returns a dense array; sparse output is covered in the NLP section

gender = ["Male","Male","Female","Male","Female"]

# Fitting the data
enc.fit(np.array(gender).reshape(-1,1))
print("The categories are: ",enc.categories_)
print()

# Transforming data
print("Encoded Values:\n",enc.transform(np.array(gender).reshape(-1,1)))
print()

# Beautifying the output: reuse the fitted encoder and take the column names from it
gender_OHE = pd.DataFrame(enc.transform(np.array(gender).reshape(-1,1)),columns=enc.categories_[0])
gender_OHE

The categories are:  [array(['Female', 'Male'], dtype='<U6')]

Encoded Values:
 [[0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]]



Unnamed: 0,Female,Male
0,0.0,1.0
1,0.0,1.0
2,1.0,0.0
3,0.0,1.0
4,1.0,0.0


- Once the encoder is fitted, we can use it to transform unseen test data as well.

In [None]:
test = ["Male","Male","Other","Female"]

df = pd.DataFrame(enc.transform(np.array(test).reshape(-1,1)),columns=["Female","Male"])
df

Unnamed: 0,Female,Male
0,0.0,1.0
1,0.0,1.0
2,0.0,0.0
3,1.0,0.0


<h3>Observation</h3>

- On unseen data we still get the proper encodings, except for the `Other` gender.
- Because the encoder was created with `handle_unknown='ignore'`, any `unseen category(ies)` in the test data passed to `O.H.E.` are encoded as a row of all `0`s.

<h3>How to handle unseen data for O.H.E.?</h3>

The best way to handle this is to ensure that no unseen category appears in the test data. It is the Data Scientist's responsibility to clean the data thoroughly before performing the train-test split. After the split, make sure that every category that may appear in the test data has at least one sample in the training data.
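
One simple way to verify this after the split is to compare the categories of the two sets. The sketch below uses plain Python sets; `find_unseen_categories` is a hypothetical helper name introduced only for this example.

```python
def find_unseen_categories(train_values, test_values):
    """Return the categories that occur in test but never in train."""
    return sorted(set(test_values) - set(train_values))

train = ["Male", "Male", "Female", "Male", "Female"]
test = ["Male", "Male", "Other", "Female"]

unseen = find_unseen_categories(train, test)
print("Unseen categories:", unseen)
```

An empty list means every test category was seen during training; anything else should be dealt with before encoding.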

<h2>An important point when it comes to M.L. is that duplicates and unnecessary columns are always removed.</h2>

In the case of O.H.E., even if we remove the first (leftmost) column we still get a unique encoding for every category.
<br>Let's see an example for it:

In [None]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd

# Initializing the object
enc = OneHotEncoder(drop="first",sparse_output=False)

gender = ["Male","Male","Female","Male","Female"]

# Fitting the data
enc.fit(np.array(gender).reshape(-1,1))
print("The categories are: ",enc.categories_)
print()

# Transforming data
print("Encoded Values:\n",enc.transform(np.array(gender).reshape(-1,1)))
print()

# Beautifying the Output
gender_OHE = pd.DataFrame({"Encoded Gender":enc.fit_transform(np.array(gender).reshape(-1,1)).flatten(),"Gender":gender})
gender_OHE

The categories are:  [array(['Female', 'Male'], dtype='<U6')]

Encoded Values:
 [[1.]
 [1.]
 [0.]
 [1.]
 [0.]]



Unnamed: 0,Encoded Gender,Gender
0,1.0,Male
1,1.0,Male
2,0.0,Female
3,1.0,Male
4,0.0,Female


<h3>Observation</h3>

- Even after dropping one column we still capture the same information as before.
- If we drop any more columns after this, we lose information and the encodings are no longer unique.
- `drop="first"` removes one redundant column per encoded feature, which helps a little with a commonly occurring problem called the `Curse of Dimensionality`.
- `CURSE OF DIMENSIONALITY`: It refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces, such as data sparsity and the difficulty of measuring closeness between points.
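
The first two observations can be checked by hand. The sketch below uses a made-up three-category Color feature (an assumption for illustration, not data from this notebook) and plain NumPy slicing in place of `drop="first"`.

```python
import numpy as np

# Full one-hot encoding of a toy feature (columns: Blue, Green, Red)
full = np.array([
    [1.0, 0.0, 0.0],  # Blue
    [0.0, 1.0, 0.0],  # Green
    [0.0, 0.0, 1.0],  # Red
    [0.0, 1.0, 0.0],  # Green
])

# Equivalent of drop="first": remove the leftmost column
reduced = full[:, 1:]

# The dropped column is recoverable: it is 1 exactly when all remaining columns are 0
recovered_first = 1.0 - reduced.sum(axis=1)
print("Recovered first column:", recovered_first)

# Dropping a second column loses information: Blue and Green now look identical
too_few = full[:, 2:]
print("Blue row:", too_few[0], "Green row:", too_few[1])
```

So one column per feature is redundant, but only one.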

<h1>Label Encoding</h1>

It is another categorical data transformation which works in the following way:
- `Step 1`: Compute the unique categories for each column.
- `Step 2`: Apply the label encoding transformation, i.e. assign a numerical value to each category.<br>

- Its advantage over O.H.E. is that it does not introduce the curse of dimensionality problem, since each column stays a single column.
- One of its drawbacks is that it introduces unwanted patterns if applied in the wrong way, e.g. Male > Female, Japan > U.S., etc.
- However, the same drawback becomes an advantage when the categories have a genuine order.

Let's see a code implementation for the same:

In [None]:
# defining the unique categories and their rankings
cat_encoder = {"Very Good":1,"Ideal":2,"Premium":3}

# categorical data for diamond quality
quality = pd.DataFrame(["Very Good","Very Good","Premium","Very Good","Ideal","Ideal"],columns=["Quality"])

# Applying Label Encoding
quality_encoding = quality["Quality"].apply(lambda x : cat_encoder[x])

df = pd.concat([quality,quality_encoding],axis=1)
df.columns = ["Quality","Quality Encoding"]
df

Unnamed: 0,Quality,Quality Encoding
0,Very Good,1
1,Very Good,1
2,Premium,3
3,Very Good,1
4,Ideal,2
5,Ideal,2


<h3>Observation</h3>

- Based on the data, we treat `Premium` as the best diamond quality, so it is given the highest number to indicate the highest rank.

In the above example we know that a `Premium` diamond is better than `Ideal` and `Very Good`. <br>So a pattern like <b>Premium > Ideal > Very Good</b> being learnt by the model turns out to be favorable for us.
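
If you would rather keep the encoding inside sklearn, `OrdinalEncoder` accepts an explicit category order. The sketch below reproduces the ranking used above under that assumption; the `+ 1` is only there because `OrdinalEncoder` starts counting at 0 while our dictionary started at 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

quality = pd.DataFrame(
    ["Very Good", "Very Good", "Premium", "Very Good", "Ideal", "Ideal"],
    columns=["Quality"],
)

# Explicit order, lowest to highest, mirroring the ranking used above
order = [["Very Good", "Ideal", "Premium"]]
ord_enc = OrdinalEncoder(categories=order)

# fit_transform returns a 2-D float array; flatten it and shift to 1..3
encoded = ord_enc.fit_transform(quality[["Quality"]]).ravel().astype(int) + 1
quality["Quality Encoding"] = encoded
print(quality)
```

The order is still something we supply ourselves, so the business understanding behind the ranking matters just as much as with the dictionary.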

---



<h1> RECAP </h1>
<h2> Converting Categorical Features to Numerical</h2>

The conversion of categorical data to numerical is needed because M.L. algorithms are designed to work with numerical data.

<h3>1. One Hot Encoding (O.H.E.):</h3>

- Step 1:
    - Learn all the unique categories for each column.
    - Create feature vector using these categories.

- Step 2: Apply the One Hot Encoding transformation using the feature vector from Step 1 as reference, i.e. put `1` for the category that is present and `0` for all the rest.
- Why is it recommended to use `drop="first"`?
    - It removes one redundant column per feature, which reduces the `Curse of Dimensionality` slightly.
    - It also helps in removing `multicollinearity`, which is a big problem for algorithms like Naïve Bayes, Linear Regression, etc.
- Apply `O.H.E.` on `Nominal Data/Columns`, e.g. Color, Gender, etc.
- <b>PROBLEMS</b>:
    - If we use `drop="first"` and unseen category(ies) come in the test data, they are all assigned the all-`0` encoding, which is also the encoding of the dropped category. This creates an encoding clash, as it is impossible to tell which category the `0`s stand for.
    - If we avoid `drop="first"` and only 1 unseen category comes in the test data, it is given the all-`0` encoding, but this is still usable because every known category has a `1` somewhere in its encoding.
      - However, if `2 or more` unseen categories come in the test data, they all share the all-`0` encoding and we have the same problem as in the point above.
    - If the categorical column has high cardinality (a large number of unique categories), performing O.H.E. on it causes the `Curse of Dimensionality`, and using `drop="first"` makes no real difference.
- <b>Solutions</b>:
    - As an alternative to O.H.E., you can explore:
      - Target Encoding
      - Leave One out Encoding
      - K-Fold Encoding

<h3>2. Label Encoding:</h3>

- Step 1: Compute the unique categories for each column.
- Step 2: Apply the label encoding transformation, i.e. assign a numerical value to each category based on a criterion.
- One of the key advantages of Label Encoding is that it does not produce the `Curse of Dimensionality`.
- Apply `Label Encoding` on `Ordinal Data/Columns`, e.g. Grade, diamond Quality, etc.
- <b>PROBLEMS</b>:
  - If applied to nominal data, it introduces a ranking that is not desirable. For example, a Gender column is nominal, but label encoding it leads the model to infer patterns like Male > Female, which is not something we want the model to learn.
  - If there is a nominal column with only 2 categories, applying O.H.E. with `drop="first"` leads to the same situation and introduces a ranking, because the final representation is the same as label encoding.
  - sklearn's Label Encoding assigns the codes in alphabetical order of the categories, which may or may not be useful depending on the problem statement.
- <b>Solutions</b>:
  - Before applying label encoding, understand the business problem properly.
  - Use a custom encoding rather than the sklearn or any other module-based implementation of Label Encoding.
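
The alphabetical behaviour is easy to see on the diamond grades from earlier. The sketch below compares sklearn's `LabelEncoder` with the custom dictionary; note that sklearn documents `LabelEncoder` as intended for target labels, so using it on a feature here is purely for illustration.

```python
from sklearn.preprocessing import LabelEncoder

grades = ["Very Good", "Ideal", "Premium"]

le = LabelEncoder()
codes = le.fit_transform(grades)

# classes_ is sorted, so the codes follow alphabetical order, not quality
print("Learned order:", list(le.classes_))
print("sklearn codes:", dict(zip(grades, codes.tolist())))

# Custom mapping that reflects the ranking we actually want
custom = {"Very Good": 1, "Ideal": 2, "Premium": 3}
custom_codes = [custom[g] for g in grades]
print("custom codes: ", dict(zip(grades, custom_codes)))
```

With the alphabetical codes `Very Good` ends up ranked highest, which is the opposite of the ranking we chose.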
  

