# Data Encoding 🧩

### 1. Categorical Feature Encoding 🌸

Categorical features are variables that contain label values rather than numeric values. Some machine learning algorithms can’t operate on label data directly, so categorical data must be converted into a numerical format.

#### 1.1 Ordinal Encoding 📊

**What it is:**

Ordinal encoding assigns numerical values to categorical features based on their order. This is useful when there's an inherent order in the categories.

**Example:**

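Consider the Iris dataset, where the species (target) is one of three categories: **setosa**, **versicolor**, or **virginica**. Species names have no natural order, so they serve here only to demonstrate the mechanics: encoding them ordinally gives **0, 1, and 2**. In practice, reserve ordinal encoding for features with a genuine ranking, such as small < medium < large.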
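
As a minimal sketch of ordinal encoding on a feature that really is ordered, the snippet below uses a hypothetical `size` column (not part of the Iris data) and passes the order explicitly through the `categories` parameter of `OrdinalEncoder`. Without that parameter, the encoder would sort the labels alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature with a genuine order: small < medium < large
sizes = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# Pass the intended order explicitly; the position in the list becomes the code
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = encoder.fit_transform(sizes[["size"]]).ravel()

print(size_codes)
```

Here `small` is assigned 0, `medium` 1, and `large` 2, so the numeric codes preserve the ranking.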

---

#### 1.2 One Hot Encoding 🔥

**What it is:**

One Hot Encoding creates a new binary column (0 or 1) for each unique category, giving ML algorithms a numeric representation that implies no order among the categories.

**Example:**

For the species feature, one hot encoding would create three new columns: **species_setosa**, **species_versicolor**, and **species_virginica**. Each column contains a binary value (0 or 1) indicating the presence of that species.
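
For a quick look at the idea without scikit-learn, pandas offers `pd.get_dummies`. The sketch below uses a small hand-made `species` column (an illustrative stand-in, not the full dataset) and shows the three binary columns described above.

```python
import pandas as pd

# Small illustrative sample of the species feature
species = pd.DataFrame({"species": ["setosa", "versicolor", "virginica", "setosa"]})

# One new 0/1 column per unique category, prefixed with the original column name
dummies = pd.get_dummies(species, columns=["species"], dtype=int)

print(dummies)
```

Each row contains exactly one 1, in the column that matches its species.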

In [1]:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df

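| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |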


In [2]:
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |


In [7]:
# Apply ordinal encoding 🌟
ordinal_encoder = OrdinalEncoder()  # Create the encoder object
df['species_ordinal'] = ordinal_encoder.fit_transform(df[['species']])  # Codes start at 0

df.sample()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | species_ordinal |
|---|---|---|---|---|---|---|
| 57 | 4.9 | 2.4 | 3.3 | 1.0 | versicolor | 1.0 |


- Result: with the default settings, `OrdinalEncoder` sorts the categories alphabetically, so setosa is mapped to 0, versicolor to 1, and virginica to 2.
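
One way to confirm the mapping rather than infer it from a sampled row is to read the fitted encoder's `categories_` attribute. The sketch below rebuilds the species column so that it runs on its own; the `mapping` dictionary is a helper introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import OrdinalEncoder

iris = load_iris()
species = pd.DataFrame({"species": iris.target_names[iris.target]})

encoder = OrdinalEncoder()
encoder.fit(species[["species"]])

# categories_ holds one array per encoded column;
# a label's position in that array is the code it receives
mapping = {str(name): code for code, name in enumerate(encoder.categories_[0])}

print(mapping)
```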

In [8]:
from sklearn.preprocessing import OneHotEncoder

# Apply one hot encoding 🌈
one_hot_encoder = OneHotEncoder(sparse_output=False) # Returns a dense array; on scikit-learn < 1.2, use sparse=False instead.
one_hot_encoder

In [10]:
species_one_hot = one_hot_encoder.fit_transform(df[['species']]) # Transforms the categorical feature into a binary matrix.
species_one_hot



array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

In [11]:
# Convert to DataFrame and merge with the original 🧩
df_one_hot = pd.DataFrame(species_one_hot, columns=one_hot_encoder.get_feature_names_out(['species']))
df_one_hot

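| | species_setosa | species_versicolor | species_virginica |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... |
| 145 | 0.0 | 0.0 | 1.0 |
| 146 | 0.0 | 0.0 | 1.0 |
| 147 | 0.0 | 0.0 | 1.0 |
| 148 | 0.0 | 0.0 | 1.0 |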


In [12]:
df = pd.concat([df, df_one_hot], axis=1)

df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | species_ordinal | species_setosa | species_versicolor | species_virginica |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 |
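
A one-hot matrix can also be mapped back to the original labels with `inverse_transform`. The sketch below uses three hand-picked labels rather than the full DataFrame, and keeps the encoder's default sparse output (converted with `.toarray()`) so that it does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Three illustrative labels, one per row
labels = np.array([["setosa"], ["virginica"], ["versicolor"]])

encoder = OneHotEncoder()  # default output is a sparse matrix
encoded = encoder.fit_transform(labels).toarray()

# Recover the original label from each one-hot row
decoded = encoder.inverse_transform(encoded)

print(encoded)
print(decoded.ravel())
```

The column order follows the alphabetically sorted categories, so `virginica` lights up the third column even though it appears second in the input.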


### 2. Numerical Feature Encoding 🔢

Numerical features are variables that contain numeric data. Depending on the data and the problem at hand, you might want to transform these features in various ways.

#### 2.1 Discretization 📏

Discretization involves dividing a continuous feature into discrete bins or intervals.

---

##### Unsupervised Binning 🎛️

**What it is:**

Unsupervised binning divides the range of continuous features into bins without considering the target variable.

**Example:**

You can divide the sepal length values into equal-width bins to categorize them.
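
To make "equal-width" concrete, the following sketch computes the bin edges by hand with NumPy, assuming three bins that span the observed minimum and maximum. It is an illustration of the idea, not necessarily how any particular library implements it.

```python
import numpy as np
from sklearn.datasets import load_iris

sepal_length = load_iris().data[:, 0]
n_bins = 3

# n_bins equal-width bins need n_bins + 1 evenly spaced edges
edges = np.linspace(sepal_length.min(), sepal_length.max(), n_bins + 1)

# Compare each value against the interior edges to get a code in 0..n_bins-1
codes = np.digitize(sepal_length, edges[1:-1])

print(edges)
print(np.bincount(codes))
```

Equal-width bins generally do not hold equal numbers of samples, which the per-bin counts printed at the end make visible.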

---

##### Custom Binning 🛠️

**What it is:**

Custom binning involves manually defining bin edges based on domain knowledge or specific criteria.

**Example:**

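You might want to define bins for sepal length as **(4, 5]**, **(5, 6]**, and **(6, 8]**. By default, `pd.cut` treats each bin as right-inclusive, so a value of exactly 5.0 falls into the first bin.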
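
Which bin a boundary value lands in depends on the `right` parameter of `pd.cut`. The sketch below uses a few hand-picked lengths (illustrative values, not rows from the dataset) to contrast the default right-closed bins with left-closed ones.

```python
import pandas as pd

# Illustrative values, including two that sit exactly on a bin edge
lengths = pd.Series([4.6, 5.0, 5.1, 6.0, 7.9])
bins = [4, 5, 6, 8]

# Default: intervals are (4, 5], (5, 6], (6, 8]
right_closed = pd.cut(lengths, bins=bins, labels=[0, 1, 2])

# right=False: intervals are [4, 5), [5, 6), [6, 8)
left_closed = pd.cut(lengths, bins=bins, labels=[0, 1, 2], right=False)

print(pd.DataFrame({"length": lengths, "right_closed": right_closed, "left_closed": left_closed}))
```

The values 5.0 and 6.0 move up one bin when `right=False`, while the other values stay where they were.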

In [13]:
from sklearn.preprocessing import KBinsDiscretizer

# Create an instance of KBinsDiscretizer to discretize continuous features
# n_bins=3 specifies the number of bins to divide the data into
# encode='ordinal' returns integer-encoded bins
# strategy='uniform' means that the bins will have uniform width
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')

# Fit the discretizer to the 'sepal length (cm)' column and transform the data
# The transformed data will be divided into 3 equal-width bins, which are integer-encoded
df['sepal_length_binned'] = discretizer.fit_transform(df[['sepal length (cm)']])

# Display the first few rows of the DataFrame to check the binned column
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | species_ordinal | species_setosa | species_versicolor | species_virginica | sepal_length_binned |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [15]:
# Apply custom binning 🧩
df['sepal_length_custom_binned'] = pd.cut(df['sepal length (cm)'], bins=[4, 5, 6, 8], labels=[0, 1, 2])

df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | species_ordinal | species_setosa | species_versicolor | species_virginica | sepal_length_binned | sepal_length_custom_binned |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 |


#### 2.2 Binarization ⚙️

**What it is:**

Binarization converts continuous numerical data into binary values (0 and 1) based on a threshold.

**Example:**

You can binarize sepal length by setting a threshold at 5.0 cm. If sepal length is greater than 5.0, it's 1; otherwise, it's 0.
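
The thresholding rule is a strict "greater than" comparison, which a plain NumPy expression can reproduce. The sketch below checks this on three hand-picked values (illustrative, chosen to straddle the threshold) and compares the result with scikit-learn's `Binarizer`.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Illustrative values below, at, and above the threshold
lengths = np.array([[4.9], [5.0], [5.1]])

by_sklearn = Binarizer(threshold=5.0).fit_transform(lengths)

# Same rule written out by hand: 1 if strictly greater than the threshold, else 0
by_numpy = (lengths > 5.0).astype(float)

print(by_sklearn.ravel())
print(np.array_equal(by_sklearn, by_numpy))
```

A value exactly equal to the threshold (5.0 here) is mapped to 0, not 1.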

In [16]:
from sklearn.preprocessing import Binarizer

# Create an instance of Binarizer to convert continuous data into binary values
# threshold=5.0 means that any value above 5.0 will be converted to 1, and any value below or equal to 5.0 will be converted to 0
binarizer = Binarizer(threshold=5.0)
binarizer

In [17]:
# Apply the binarizer to the 'sepal length (cm)' column
# Binarizer is stateless, so fit learns nothing here; the output is 1s and 0s based on the threshold
df['sepal_length_binarized'] = binarizer.fit_transform(df[['sepal length (cm)']])

# Display the first few rows of the DataFrame to check the binarized column
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | species_ordinal | species_setosa | species_versicolor | species_virginica | sepal_length_binned | sepal_length_custom_binned | sepal_length_binarized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 1.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 |
