# Categorical Encoding

## Ordinal Data with Low Cardinality
1. If you have ordinal data (categories with a natural order, such as low < medium < high) and relatively few unique categories, you can use Ordinal Encoding, which maps each category to an integer.
2. This preserves the order among the categories, which benefits algorithms that can exploit that ordinal relationship.
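
As a minimal sketch of the idea, the order can be spelled out explicitly with a pandas `Categorical`. The `size` column and its values are hypothetical toy data, not part of the Titanic dataset used later.

```python
import pandas as pd

# Hypothetical toy data: shirt sizes with a natural order.
sizes = pd.Series(["medium", "small", "large", "small"], name="size")

# State the order explicitly. Alphabetical order would give
# large < medium < small, which is not the intended ranking.
size_order = ["small", "medium", "large"]
size_codes = pd.Categorical(sizes, categories=size_order, ordered=True).codes

print(list(size_codes))  # small -> 0, medium -> 1, large -> 2
```

Passing the order by hand matters because most encoders otherwise fall back to sorting the category values.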

## Frequency-Based Ordinal Encoding
1. Instead of mapping categories to integers directly, you can encode them by how often they occur.
2. Assign higher integers to more frequent categories, or use the relative frequency itself as the encoded value.
3. This can help your model differentiate between more and less common categories. Categories that occur equally often receive the same value and become indistinguishable.
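
The steps above can be sketched with `value_counts` and `rank`. The `color` data is a hypothetical toy example, and using a dense rank (ties share one integer) is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical toy data.
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="color")

# Count occurrences: red 3, blue 2, green 1.
counts = colors.value_counts()

# Rank ascending so the most frequent category gets the highest integer.
# method="dense" gives tied categories the same rank with no gaps.
freq_rank = counts.rank(method="dense").astype(int)

encoded = colors.map(freq_rank)
print(encoded.tolist())
```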

## Ordinal Target Encoding
1. If the ordinal variable is strongly correlated with the target variable, you can apply target encoding.
2. Replace each category with the mean or median of the target variable for that category.
3. This captures the relationship between the variable and the target in a single numeric column. Compute the statistics on the training data only; otherwise the target leaks into the feature.
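
A minimal sketch of mean target encoding with `groupby`, assuming a hypothetical `grade` feature and a binary `target`. The means here are computed on the whole toy frame for brevity; on real data they would come from the training split only.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "grade": ["low", "low", "mid", "mid", "high", "high"],
    "target": [0, 0, 0, 1, 1, 1],
})

# Mean of the target within each category.
category_means = toy.groupby("grade")["target"].mean()

# Replace each category with its mean.
toy["grade_encoded"] = toy["grade"].map(category_means)
print(toy["grade_encoded"].tolist())
```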

## Nominal Data with Low Cardinality
1. For nominal data (categories with no inherent order) with low cardinality, One-Hot Encoding is a common choice.
2. Each category gets its own 0/1 column, so no ordinal relationship is implied between categories.
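
To make the "one column per category" idea concrete, here is a small sketch on hypothetical toy data using `pd.get_dummies`. The `dtype=int` argument is only there to print 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One 0/1 column per category; exactly one column is 1 in each row.
one_hot = pd.get_dummies(toy, columns=["color"], dtype=int)
print(one_hot)
```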

## Nominal Data with High Cardinality
1. When dealing with nominal data with high cardinality (many unique categories), Binary Encoding or Hashing Encoding can be helpful.
2. These methods create far fewer features than one-hot encoding, which adds one column per category and becomes computationally expensive for high-cardinality data.
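
A rough sketch of the hashing idea using only the standard library. The helper names `hash_bucket` and `hash_encode` and the bucket count of 8 are hypothetical choices for illustration; this is not how any particular library implements its hashing encoder.

```python
import hashlib

def hash_bucket(category: str, n_buckets: int = 8) -> int:
    """Map a category to a bucket index in [0, n_buckets)."""
    # hashlib gives the same result on every run, unlike the built-in
    # hash(), which is salted per process for strings.
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(category: str, n_buckets: int = 8) -> list:
    """Return a fixed-width 0/1 row with a single 1 at the bucket index."""
    row = [0] * n_buckets
    row[hash_bucket(category, n_buckets)] = 1
    return row

# The width stays at n_buckets no matter how many categories exist.
# Different categories can land in the same bucket (a collision).
for cabin in ["C85", "C123", "E46"]:
    print(cabin, hash_encode(cabin))
```

Fewer buckets mean fewer columns but more collisions, so the bucket count is a trade-off to tune.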

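## Nominal Data with High Cardinality: Mean Encoding
Mean (target) encoding is another option for high-cardinality nominal data: each category is replaced by a single number, so no extra columns are created.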
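
With many categories, some appear only a few times, so a raw per-category mean can be noisy. One common remedy, sketched here as an assumption and not taken from the original notes, blends each category mean with the global mean. The `city` data and the smoothing weight `m` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: category "a" is common, "b" and "c" are rare.
toy = pd.DataFrame({
    "city": ["a", "a", "a", "b", "c"],
    "target": [1, 1, 0, 1, 0],
})

global_mean = toy["target"].mean()
stats = toy.groupby("city")["target"].agg(["mean", "count"])

# m is a hypothetical smoothing weight: the larger it is, the more
# rare categories are pulled toward the global mean.
m = 2
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

toy["city_encoded"] = toy["city"].map(smoothed)
print(toy)
```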

## High Cardinality and Limited Data
When you have limited data and high cardinality, Hashing Encoding or Binary Encoding can be a viable choice, because they reduce the number of features while still keeping most categories distinguishable.

# Implementation

In [1]:
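# import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')  # silence library warnings in the notebook output
df = pd.read_csv('titanic.csv')
df.dropna(inplace=True)  # keep only rows with no missing values
data = df.copy()  # untouched copy, used to reset df in later cells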

### One-Hot Encoding

In [None]:
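'''from sklearn.preprocessing import OneHotEncoder
# `sparse_output` replaced `sparse` in scikit-learn 1.2
one_hot_encoder = OneHotEncoder(sparse_output=False)
# The encoder expects 2-D input and returns one column per category
sex_encoded = pd.DataFrame(one_hot_encoder.fit_transform(df[['Sex']]),
                           columns=one_hot_encoder.get_feature_names_out(), index=df.index)
df = pd.concat([df.drop(columns='Sex'), sex_encoded], axis=1)'''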

### The version above is verbose, so use `pd.get_dummies` instead

In [2]:
# drop_first=True drops one dummy column per feature, since it is implied by the others
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)

### Ordinal Encoding

In [3]:
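from sklearn.preprocessing import OrdinalEncoder
# With default settings the integers follow the sorted order of the category values,
# so pass `categories=[...]` when the data has a meaningful order.
ordinal_encoder = OrdinalEncoder()
# fit_transform expects 2-D input and returns a 2-D array; ravel() flattens it to one column
df['Cabin_Ordinal'] = ordinal_encoder.fit_transform(df[['Cabin']]).ravel()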

### Binary Encoding:

1. Method: Binary encoding first maps each category to an integer, writes that integer in binary, and then creates one column per binary digit.
2. Number of Columns: A categorical variable with n unique categories needs about ceil(log2(n)) binary columns. For example, 8 unique categories fit in 3 binary columns when the integer codes start at 0; an implementation that starts at 1 or reserves a code for unknown values needs one more.
3. Example: With 3 binary columns, the encoding might look like: Red -> 001, Blue -> 010, Green -> 011, Yellow -> 100, ...
4. Advantage: It gives a compact representation of high-cardinality categorical variables, with far fewer columns than one-hot encoding.
5. Disadvantage: Categories share bit columns according to their arbitrary integer codes, so the model may pick up relationships between categories that do not exist, and the columns are harder to interpret than one-hot columns.
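
The Red/Blue/Green/Yellow example can be reproduced by hand with a short sketch. The helper `binary_encode` is hypothetical and assumes integer codes start at 1 in order of first appearance; library encoders may assign codes and column counts differently.

```python
def binary_encode(categories):
    """Map each distinct category to a fixed-width binary string."""
    # dict.fromkeys keeps the first-appearance order and removes duplicates.
    unique = list(dict.fromkeys(categories))
    # Codes start at 1, so the largest code equals the number of categories.
    n_bits = len(unique).bit_length()
    return {
        cat: format(code, f"0{n_bits}b")
        for code, cat in enumerate(unique, start=1)
    }

mapping = binary_encode(["Red", "Blue", "Green", "Yellow", "Red"])
print(mapping)
# {'Red': '001', 'Blue': '010', 'Green': '011', 'Yellow': '100'}
```

Each character of the binary string becomes one 0/1 column, so four categories need three columns here instead of the four that one-hot encoding would create.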

In [5]:
from category_encoders import BinaryEncoder
# Initialize the BinaryEncoder
binary_encoder = BinaryEncoder()

# Fit and transform the 'Name' feature (a high-cardinality column)
binary_encoded = binary_encoder.fit_transform(df[['Name']])

# Concatenate the binary encoded DataFrame to the original DataFrame
df = pd.concat([df, binary_encoded], axis=1)

# Drop the original 'Name' column
df.drop('Name', axis=1, inplace=True)

### Frequency-Based Encoding

In [12]:
df = data.copy()
# Relative frequency of each cabin value (share of rows)
cabin_frequency = df['Cabin'].value_counts(normalize=True)
df['Cabin'] = df['Cabin'].map(cabin_frequency)

In [16]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,0.005464,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,0.010929,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,0.005464,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,0.021858,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,0.005464,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,0.010929,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,0.010929,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,0.005464,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,0.005464,S


### Target Encoding

In [18]:
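from category_encoders import TargetEncoder
df = data.copy()
target_encoder = TargetEncoder()
# Replace each cabin with a smoothed mean of 'Survived' for that cabin.
# Fitting on the full dataset leaks the target; fit on the training split only in practice.
df['Cabin'] = target_encoder.fit_transform(df[['Cabin']], df['Survived'])['Cabin']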

In [19]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,0.714790,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,0.647714,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,0.584681,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,0.643216,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,0.714790,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,0.718640,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,0.647714,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,0.714790,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,0.714790,S


### Label Encoding

In [21]:
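from sklearn.preprocessing import LabelEncoder
# LabelEncoder is designed for target labels; for input features OrdinalEncoder is the usual choice.
# Classes are numbered in sorted order (in the output below, C -> 0 and S -> 2).
label_encoder = LabelEncoder()
df['Embarked'] = label_encoder.fit_transform(df['Embarked'])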

In [22]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,0.714790,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,0.647714,2
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,0.584681,2
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,0.643216,2
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,0.714790,2
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,0.718640,2
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,0.647714,2
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,0.714790,0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,0.714790,2
