# Categorical Data Encoding

### What is Categorical Data Encoding

Categorical data is a type of data that represents categories or groups. Examples of categorical data include gender, color, or product type. In data exploration, categorical data encoding is the process of converting categorical data into numerical data that can be analyzed by machine learning algorithms.

### Why do we Encode our data

Most machine learning algorithms cannot work with categorical data directly. They require numerical input because they perform mathematical operations such as addition and multiplication on the feature values. Categorical data encoding is therefore needed to convert categories into numbers.

![Encoding Example](assests/encoding-3.png "Encoding Example")

In the picture above, the *Qualitative* *Height* column is encoded into a *Quantitative* representation by assigning a number to each unique entry, as shown in the table on the right. Each height value is now represented by a distinct number. This is the simplest way of encoding a categorical column.


## One-Hot Encoding

One-hot encoding is a technique that creates a binary column for each category in a categorical variable. The value is 1 if the category is present and 0 if it is not.

![One-Hot Encoding Example](assests/oneHotenc.png "One-Hot Encoding Example")

### One-Hot Encoding in Pandas

One-hot encoding can be done in Pandas by using the *get_dummies* function.
Let's take a look at how that works.

In [2]:
# importing Pandas library
import pandas as pd

# creating a dictionary with data
data = {
    "Team": ["A","A","B","B","B","B","C","C"]
}

# initializing a dataframe with the data
df = pd.DataFrame(data)

# applying one-hot encoding by using get_dummies function
# (dtype=int gives 0/1 columns; pandas 2.0 and later default to True/False)
df_encoded = pd.get_dummies(df, columns=['Team'], dtype=int)

# printing out the encoded data frame
print(df_encoded)

   Team_A  Team_B  Team_C
0       1       0       0
1       1       0       0
2       0       1       0
3       0       1       0
4       0       1       0
5       0       1       0
6       0       0       1
7       0       0       1


### Let's try "One-Hot Encoding" on a Dataset!

For this we will use the Iris dataset. It contains the sepal and petal length and width of three types of irises (Setosa, Versicolor, and Virginica). The type of iris is the qualitative data that we will encode.

In [3]:
# loading Iris dataset
iris_data = pd.read_csv("data/iris.csv")

# printing head
print(iris_data.head())

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa


In [4]:
# extracting the qualitative data column which is the variety column
iris_variety = iris_data["variety"]

print(iris_variety.head())

0    Setosa
1    Setosa
2    Setosa
3    Setosa
4    Setosa
Name: variety, dtype: object


In [5]:
# applying one-hot encoding on the extracted column
# (dtype=int gives 0/1 columns; pandas 2.0 and later default to True/False)
iris_variety_enc = pd.get_dummies(iris_variety, dtype=int)

# sampling random rows from the encodings
sample = iris_variety_enc.sample(n=10)

# printing encoded values
print(sample.head(10))

     Setosa  Versicolor  Virginica
9         1           0          0
26        1           0          0
33        1           0          0
15        1           0          0
141       0           0          1
16        1           0          0
50        0           1          0
1         1           0          0
37        1           0          0
52        0           1          0


## Label Encoding

Label encoding is a technique that assigns a unique integer value to each category in a categorical variable.

<div>
<img src="assests/labelEnc.png" width="500"/>
</div>

### Label Encoding in Pandas and SkLearn

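Label encoding can be applied to a Pandas column by using the *LabelEncoder* class from the scikit-learn (Sklearn) library. Note that scikit-learn documents *LabelEncoder* as intended for target labels; for input features, its *OrdinalEncoder* class plays the same role.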

More about Sklearn's *LabelEncoder* class [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder)

More about Sklearn [here](https://scikit-learn.org/stable/)

Let's take a look at how that works.

In [6]:
# installing sklearn
!pip install scikit-learn



In [7]:
# importing the LabelEncoder from sklearn
from sklearn.preprocessing import LabelEncoder

# creating a dictionary with data
data = {
    "Team": ["A","A","B","B","B","B","C","C"]
}

# initializing a dataframe with the data
df = pd.DataFrame(data)

# initializing the LabelEncoder class object
label_encoder = LabelEncoder()

# applying label encoding by using the fit_transform method
df['Team Encoded'] = label_encoder.fit_transform(df['Team'])

# printing out the encoded data frame
print(df)

  Team  Team Encoded
0    A             0
1    A             0
2    B             1
3    B             1
4    B             1
5    B             1
6    C             2
7    C             2
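
As a rough cross-check, the same integer codes can be obtained with pandas alone by converting the column to the `category` dtype. This is only a sketch: the names `team_cat`, `codes`, and `decoded` are illustrative, and it relies on pandas numbering the categories in sorted order, which matches what *LabelEncoder* produced above.

```python
import pandas as pd

# same toy data as in the LabelEncoder example
teams = pd.Series(["A", "A", "B", "B", "B", "B", "C", "C"], name="Team")

# converting to the category dtype; categories are stored in sorted order
team_cat = teams.astype("category")

# integer code of each row (0 for "A", 1 for "B", 2 for "C")
codes = team_cat.cat.codes

# lookup table from code back to the original label
categories = team_cat.cat.categories

# decoding the integers back into labels
decoded = [categories[int(c)] for c in codes]

print(codes.tolist())
print(list(categories))
print(decoded)
```

Keeping the code-to-label lookup around is what makes the encoding reversible.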


### Let's try "Label Encoding" on the Iris Dataset!

In [8]:
# applying label encoding on the variety column
iris_data["variety encoded"] = label_encoder.fit_transform(iris_data["variety"])

# sampling random rows from the encodings
sample = iris_data.sample(n=10)

# printing encoded values
print(sample[["variety", "variety encoded"]].head(10))

        variety  variety encoded
84   Versicolor                1
61   Versicolor                1
20       Setosa                0
88   Versicolor                1
130   Virginica                2
63   Versicolor                1
18       Setosa                0
139   Virginica                2
115   Virginica                2
37       Setosa                0


## Binary Encoding

Binary encoding is a technique that, like one-hot encoding, represents a categorical variable with binary columns. However, instead of creating one column per category, binary encoding first assigns each category an integer and then writes that integer in binary, using one column per binary digit. A variable with n categories therefore needs only about log2(n) columns instead of n.

<br>
<div>
<img src="assests/binaryEnc.png" width="500"/>
</div>
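
To make the idea concrete, here is a minimal pandas-only sketch of how binary-encoded columns can be derived by hand: each category gets an integer, and the bits of that integer become the columns. The 1-based numbering in order of first appearance and the names `codes`, `n_bits`, and `manual_binary` are assumptions made for illustration, not a description of how any particular library implements it.

```python
import pandas as pd

fruits = pd.Series(["Apple", "Banana", "Orange"], name="Fruits")

# step 1: give each category a 1-based integer in order of first appearance
codes = pd.factorize(fruits)[0] + 1

# step 2: number of binary digits needed to write the largest code
n_bits = int(codes.max()).bit_length()

# step 3: one column per binary digit, most significant digit first
manual_binary = pd.DataFrame({
    f"Fruits_{i}": (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
})

print(manual_binary)
```

With three categories, two columns are enough: Apple becomes 01, Banana 10, and Orange 11.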

### Binary Encoding in Pandas and Category Encoder library

Binary encoding can be applied to a Pandas DataFrame by using the *BinaryEncoder* class from the Category Encoders library.

More about the *Category Encoders* library [here](https://pypi.org/project/category-encoders/)

Let's take a look at how that works.

In [9]:
# installing category encoder library
!pip install category-encoders



In [10]:
# importing the BinaryEncoder from category encoder
from category_encoders import BinaryEncoder

# creating a dictionary with data
data = {
    "Fruits": ["Apple","Banana","Orange"]
}

# initializing a dataframe with the data
df = pd.DataFrame(data)

# initializing the BinaryEncoder class object and specifying the column to encode
binary_encoder = BinaryEncoder(cols=["Fruits"])

# applying binary encoding
encoded_data = binary_encoder.fit_transform(df)

# printing out the encoded data frame
print(encoded_data)

   Fruits_0  Fruits_1
0         0         1
1         1         0
2         1         1


### Let's try "Binary Encoding" on the Iris Dataset!

In [11]:
# initializing the BinaryEncoder class object and specifying the column to encode
binary_encoder = BinaryEncoder(cols=["variety"])

# applying binary encoding on the variety column (a DataFrame is returned by default)
iris_variety_enc = binary_encoder.fit_transform(iris_data)

# sampling random rows from the encodings
sample = iris_variety_enc.sample(n=10)

# printing encoded values
print(sample.head())

     sepal.length  sepal.width  petal.length  petal.width  variety_0  \
2             4.7          3.2           1.3          0.2          0   
40            5.0          3.5           1.3          0.3          0   
52            6.9          3.1           4.9          1.5          1   
103           6.3          2.9           5.6          1.8          1   
131           7.9          3.8           6.4          2.0          1   

     variety_1  variety encoded  
2            1                0  
40           1                0  
52           0                1  
103          1                2  
131          1                2  


## Count Encoding

Count encoding is a technique that replaces each category with the number of times it occurs in the dataset. It is typically used for categorical variables with a large number of categories, and it works best when the frequencies differ between categories and carry information themselves.

<br>
<div>
<img src="assests/countEnc.png" width="90%"/>
</div>

### Count Encoding in Pandas and Category Encoder library

Count encoding can be applied to a Pandas column by using the *CountEncoder* class from the Category Encoders library.

More about the *Category Encoders* library [here](https://pypi.org/project/category-encoders/)

Let's take a look at how that works.

In [12]:
# importing the CountEncoder from category encoder
from category_encoders import CountEncoder

# creating a dictionary with data
data = {
    "Fruits": ["Apple","Banana","Apple","Banana","Banana"]
}

# initializing a dataframe with the data
df = pd.DataFrame(data)

# initializing the CountEncoder class object
count_encoder = CountEncoder()

# applying count encoding
df["Fruits Encoded"] = count_encoder.fit_transform(df["Fruits"])

# printing out the data frame with count encodings
print(df)

   Fruits  Fruits Encoded
0   Apple               2
1  Banana               3
2   Apple               2
3  Banana               3
4  Banana               3
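
The same counts can be reproduced with plain pandas, which may help to see what the encoder is doing. This is a small sketch; the column name `Fruits Count` and the variable `counts` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Fruits": ["Apple", "Banana", "Apple", "Banana", "Banana"]})

# counting how often each category occurs
counts = df["Fruits"].value_counts()

# replacing each category with its count
df["Fruits Count"] = df["Fruits"].map(counts)

print(df)
```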


### Let's try "Count Encoding" on the Iris Dataset!

In [13]:
# initializing the CountEncoder class object
count_encoder = CountEncoder()

# making a copy of the loaded iris DataFrame
iris_data_count_df = iris_data.copy()

# applying count encoding on the variety column (a DataFrame is returned by default)
iris_data_count_df["variety encoded"] = count_encoder.fit_transform(iris_data_count_df["variety"])

# sampling random rows from the encodings
sample = iris_data_count_df.sample(n=10)

# printing encoded values
print(sample.head())

     sepal.length  sepal.width  petal.length  petal.width     variety  \
81            5.5          2.4           3.7          1.0  Versicolor   
1             4.9          3.0           1.4          0.2      Setosa   
19            5.1          3.8           1.5          0.3      Setosa   
32            5.2          4.1           1.5          0.1      Setosa   
131           7.9          3.8           6.4          2.0   Virginica   

     variety encoded  
81                50  
1                 50  
19                50  
32                50  
131               50  


**In the above output can you spot the problem?**

## Target Encoding

Target encoding is a technique that replaces each category in the feature with the mean of the target variable for that category. This method works well for nominal and ordinal data, but can lead to overfitting when the number of instances in a category is small.

<br>
<div>
<img src="assests/targetEnc.png" width="60%"/>
</div>

### Target Encoding in Pandas and Category Encoder library

Target encoding can be applied to a Pandas column by using the *TargetEncoder* class from the Category Encoders library.

More about the *Category Encoders* library [here](https://pypi.org/project/category-encoders/)

Let's take a look at how that works.

In [14]:
# importing the TargetEncoder from category encoder
from category_encoders import TargetEncoder

# creating a dictionary with data
data = {
    "feature_1": [1, 2, 3, 4, 5, 6, 7, 8],
    "feature_2": ["A", "A", "A", "A", "A", "B", "B", "B"],
    "target": [1, 0, 1, 1, 1, 1, 0, 1],
}

# initializing a dataframe with the data
df = pd.DataFrame(data)

# initializing the TargetEncoder class object
target_encoder = TargetEncoder(cols=["feature_2"])

# applying target encoding
df["feature_2 encoded"] = target_encoder.fit_transform(df["feature_2"], df['target'])

# printing out the data frame with target encodings
print(df)

   feature_1 feature_2  target  feature_2 encoded
0          1         A       1           0.759121
1          2         A       0           0.759121
2          3         A       1           0.759121
3          4         A       1           0.759121
4          5         A       1           0.759121
5          6         B       1           0.737128
6          7         B       0           0.737128
7          8         B       1           0.737128
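
The encoded values above (0.759121 and 0.737128) are not the plain per-category means. *TargetEncoder* has `smoothing` and `min_samples_leaf` parameters that blend each category mean with the overall target mean, which pulls small categories toward the global average. The sketch below computes the unsmoothed means with pandas for comparison; the names `category_means`, `global_mean`, and `feature_2 raw mean` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "feature_2": ["A", "A", "A", "A", "A", "B", "B", "B"],
    "target": [1, 0, 1, 1, 1, 1, 0, 1],
})

# plain mean of the target for each category (no smoothing)
category_means = df.groupby("feature_2")["target"].mean()

# overall mean of the target, toward which smoothed encodings are pulled
global_mean = df["target"].mean()

# replacing each category with its raw target mean
df["feature_2 raw mean"] = df["feature_2"].map(category_means)

print(category_means)
print(global_mean)
```

Here the raw means are 0.8 for A and about 0.667 for B, while the overall mean is 0.75; both encoder outputs lie between the raw category mean and 0.75.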


### You know the drill, try it on the Iris dataset!
### Discuss whether we can truly apply Target Encoding to the Iris dataset or not.

# Data Normalization and Scaling

### What is normalization and scaling?

Data normalization and scaling are important preprocessing steps in data exploration and analysis. The purpose of normalization and scaling is to bring all features to the same scale or range, so that they can be compared and analyzed meaningfully.

### Why do we normalize and scale data?

1. Improving accuracy and stability of machine learning models
2. Facilitating convergence of optimization algorithms
3. Improving interpretability of data


## Min-Max Scaling

Min-max scaling, also known as normalization, scales the data to a fixed range, usually between 0 and 1. The formula for min-max scaling is:

```
X_scaled = (X - X_min) / (X_max - X_min)
```
Where X is a feature value, X_min is the minimum value of that feature, and X_max is the maximum value of that feature.
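
As a quick illustration, the formula can be applied by hand with pandas, column by column. This is a minimal sketch on a small example DataFrame; the name `manual_min_max` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "B": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# (X - X_min) / (X_max - X_min), computed separately for each column
manual_min_max = (df - df.min()) / (df.max() - df.min())

print(manual_min_max)
```

Both columns end up in the range 0 to 1 even though their original scales differ by a factor of ten.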

### Min-Max Scaler in Pandas using Sklearn

To apply scaling to the DataFrame, we will be using the *preprocessing* module of Sklearn.

Read more about Sklearn's preprocessing module [here](https://scikit-learn.org/stable/modules/preprocessing.html)

More on MinMaxScaler [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)

In [20]:
# importing the MinMax scaler from Sklearn
from sklearn.preprocessing import MinMaxScaler

# creating a dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
})

# printing the dataframe
print(df)

    A    B
0   1   10
1   2   20
2   3   30
3   4   40
4   5   50
5   6   60
6   7   70
7   8   80
8   9   90
9  10  100


In [27]:
# create a MinMaxScaler object
min_max_scaler = MinMaxScaler()

# fit and transform the data
df_scaled = min_max_scaler.fit_transform(df)

# datatype of df_scaled
print("Data Type: {}".format(type(df_scaled)))

# create a new DataFrame with the normalized data
normalized_df = pd.DataFrame(df_scaled, columns=df.columns)

# printing the scaled DataFrame
print(normalized_df)

Data Type: <class 'numpy.ndarray'>
          A         B
0  0.000000  0.000000
1  0.111111  0.111111
2  0.222222  0.222222
3  0.333333  0.333333
4  0.444444  0.444444
5  0.555556  0.555556
6  0.666667  0.666667
7  0.777778  0.777778
8  0.888889  0.888889
9  1.000000  1.000000


*What if we add outliers to the above data sample and see what the results are? Is the normalization acceptable or not?*
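
One way to start exploring this question is to replace the last value of a column with a large outlier and apply the min-max formula by hand. The sketch below uses a hypothetical series `values` and an arbitrarily chosen outlier of 1000.

```python
import pandas as pd

# the same values as column A, but with the last entry replaced by an outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], name="A")

# min-max scaling by hand
scaled = (values - values.min()) / (values.max() - values.min())

print(scaled.round(4))
```

The outlier maps to 1 while the remaining values are squeezed into a narrow band close to 0.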

## Standardization

Standardization scales the data so that it has a mean of 0 and a standard deviation of 1. The formula for standardization is:

```
X_scaled = (X - mean(X)) / std(X)
```
Where X is a feature value, mean(X) is the mean value of that feature, and std(X) is the standard deviation of that feature.
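
The formula can also be applied by hand with pandas. One detail in this sketch is an easy source of mismatches: pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, whereas Sklearn's *StandardScaler* uses the population standard deviation, so `ddof=0` is passed here. The name `manual_standard` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "B": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# (X - mean(X)) / std(X), using the population standard deviation
manual_standard = (df - df.mean()) / df.std(ddof=0)

print(manual_standard)

# checking that each column now has mean 0 and standard deviation 1
print(manual_standard.mean().round(10))
print(manual_standard.std(ddof=0))
```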

### Standardization in Pandas using Sklearn

To apply scaling to the DataFrame, we will be using the *preprocessing* module of Sklearn.

Read more about Sklearn's preprocessing module [here](https://scikit-learn.org/stable/modules/preprocessing.html)

More on StandardScaler [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)

In [28]:
# importing the StandardScaler scaler from Sklearn
from sklearn.preprocessing import StandardScaler

# creating a dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
})

# printing the dataframe
print(df)

    A    B
0   1   10
1   2   20
2   3   30
3   4   40
4   5   50
5   6   60
6   7   70
7   8   80
8   9   90
9  10  100


In [29]:
# create a StandardScaler object
standard_scaler = StandardScaler()

# fit and transform the data
df_scaled = standard_scaler.fit_transform(df)

# datatype of df_scaled
print("Data Type: {}".format(type(df_scaled)))

# create a new DataFrame with the normalized data
normalized_df = pd.DataFrame(df_scaled, columns=df.columns)

# printing the scaled DataFrame
print(normalized_df)

Data Type: <class 'numpy.ndarray'>
          A         B
0 -1.566699 -1.566699
1 -1.218544 -1.218544
2 -0.870388 -0.870388
3 -0.522233 -0.522233
4 -0.174078 -0.174078
5  0.174078  0.174078
6  0.522233  0.522233
7  0.870388  0.870388
8  1.218544  1.218544
9  1.566699  1.566699


*Let's check manually whether the mean of each column is equal to 0.*

## Robust Scaling

Robust scaling is similar to standardization, but it uses the median and interquartile range instead of the mean and standard deviation. The formula for robust scaling is:

```
X_scaled = (X - median(X)) / IQR(X)
```
where X is a feature value, median(X) is the median value of that feature, and IQR(X) is the interquartile range of that feature.
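
As a sketch of the formula in pandas, the interquartile range is taken here as the difference between the 75th and 25th percentiles. The names `median`, `iqr`, and `manual_robust` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "B": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# median of each column
median = df.median()

# interquartile range: 75th percentile minus 25th percentile
iqr = df.quantile(0.75) - df.quantile(0.25)

# (X - median(X)) / IQR(X), computed separately for each column
manual_robust = (df - median) / iqr

print(manual_robust)
```

Because the median and the quartiles barely move when a few extreme values are added, this scaling is less sensitive to outliers than min-max scaling or standardization.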

### Robust Scaling in Pandas using Sklearn

To apply scaling to the DataFrame, we will be using the *preprocessing* module of Sklearn.

Read more about Sklearn's preprocessing module [here](https://scikit-learn.org/stable/modules/preprocessing.html)

More on RobustScaler [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler)

In [30]:
# importing the RobustScaler scaler from Sklearn
from sklearn.preprocessing import RobustScaler

# creating a dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
})

# printing the dataframe
print(df)

    A    B
0   1   10
1   2   20
2   3   30
3   4   40
4   5   50
5   6   60
6   7   70
7   8   80
8   9   90
9  10  100


In [31]:
# create a RobustScaler object
robust_scaler = RobustScaler()

# fit and transform the data
df_scaled = robust_scaler.fit_transform(df)

# datatype of df_scaled
print("Data Type: {}".format(type(df_scaled)))

# create a new DataFrame with the normalized data
normalized_df = pd.DataFrame(df_scaled, columns=df.columns)

# printing the scaled DataFrame
print(normalized_df)

Data Type: <class 'numpy.ndarray'>
          A         B
0 -1.000000 -1.000000
1 -0.777778 -0.777778
2 -0.555556 -0.555556
3 -0.333333 -0.333333
4 -0.111111 -0.111111
5  0.111111  0.111111
6  0.333333  0.333333
7  0.555556  0.555556
8  0.777778  0.777778
9  1.000000  1.000000


## Max Abs Scaling

Max Abs scaling divides each feature by its maximum absolute value, so that the scaled values fall in the range [-1, 1]. The formula for max abs scaling is:

```
X_scaled = X / max(abs(X))
```
where X is a feature value.
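
A minimal pandas sketch of the formula follows. The example DataFrame is made up for illustration and deliberately includes negative values in column `A`; the name `manual_max_abs` is also illustrative.

```python
import pandas as pd

# hypothetical data with negative values in column A
df = pd.DataFrame({
    "A": [-4, -2, 0, 1, 2],
    "B": [10, 20, 30, 40, 50],
})

# X / max(abs(X)), computed separately for each column
manual_max_abs = df / df.abs().max()

print(manual_max_abs)
```

Zeros stay at zero and signs are preserved, since the data is only divided by a positive constant and never shifted.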

*What is the purpose of abs (absolute value) in this scaling technique? What does this operation add to the process?*

### Max Abs Scaling in Pandas using Sklearn

To apply scaling to the DataFrame, we will be using the *preprocessing* module of Sklearn.

Read more about Sklearn's preprocessing module [here](https://scikit-learn.org/stable/modules/preprocessing.html)

More on MaxAbsScaler [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler)

In [33]:
# importing the MaxAbsScaler scaler from Sklearn
from sklearn.preprocessing import MaxAbsScaler

# creating a dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
})

# printing the dataframe
print(df)

    A    B
0   1   10
1   2   20
2   3   30
3   4   40
4   5   50
5   6   60
6   7   70
7   8   80
8   9   90
9  10  100


In [35]:
# create a MaxAbsScaler object
max_abs_scaler = MaxAbsScaler()

# fit and transform the data
df_scaled = max_abs_scaler.fit_transform(df)

# datatype of df_scaled
print("Data Type: {}".format(type(df_scaled)))

# create a new DataFrame with the normalized data
normalized_df = pd.DataFrame(df_scaled, columns=df.columns)

# printing the scaled DataFrame
print(normalized_df)

Data Type: <class 'numpy.ndarray'>
     A    B
0  0.1  0.1
1  0.2  0.2
2  0.3  0.3
3  0.4  0.4
4  0.5  0.5
5  0.6  0.6
6  0.7  0.7
7  0.8  0.8
8  0.9  0.9
9  1.0  1.0
