# Data Encoding

Data encoding converts categorical data (text labels) into numbers so that machine learning models, which operate on numeric inputs, can process it.

| Type                              | When to Use                                                                                          | Example                                          | Technique / Library                         |
| --------------------------------- | ---------------------------------------------------------------------------------------------------- | ------------------------------------------------ | ------------------------------------------- |
| **1. Label Encoding**             | When any distinct integer per category will do; `LabelEncoder` codes follow **alphabetical** order, not rank | Size → Large=0, Medium=1, Small=2                | `LabelEncoder()`                            |
| **2. One-Hot Encoding**           | When categorical variable is **nominal** (no order)                                                  | Color → Red=[1,0,0], Blue=[0,1,0], Green=[0,0,1] | `pd.get_dummies()` / `OneHotEncoder()`      |
| **3. Ordinal Encoding**           | When the variable is **ordinal** and you **manually assign rank/order** to categories                | Education → Primary=0, Secondary=1, Graduate=2   | `OrdinalEncoder(categories=...)`            |
| **4. Frequency / Count Encoding** | When there are many categories; replace each with its count                                          | City → Delhi=50, Mumbai=30, Pune=20              | `df['City'].map(df['City'].value_counts())` |
| **5. Target / Mean Encoding**     | Replace each category with the **mean of the target** for that category                              | Product type → mean(sales)                       | `category_encoders.TargetEncoder`           |
| **6. Binary Encoding**            | Compress high-cardinality categorical data into a few 0/1 columns                                    | "India" → binary code representation             | `category_encoders.BinaryEncoder`           |
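
Binary encoding is the only technique in the table that is not demonstrated later in this notebook, so here is a minimal sketch of the idea using only pandas and NumPy. The country names and the `Country_bit*` column names are made up for illustration, and the exact codes produced by `category_encoders` may differ from this hand-rolled version.

```python
import pandas as pd

# Toy data (hypothetical, not from the housing dataset).
countries = pd.Series(
    ["India", "Japan", "Brazil", "India", "Kenya", "Chile"], name="Country"
)

# Step 1: give each distinct category an integer code (order of first appearance).
codes, uniques = pd.factorize(countries)

# Step 2: work out how many bits are needed for the largest code.
n_bits = max(int(codes.max()).bit_length(), 1)

# Step 3: write each code in binary, one column per bit (most significant first).
bit_columns = {
    f"Country_bit{shift}": (codes >> shift) & 1
    for shift in range(n_bits - 1, -1, -1)
}
binary_encoded = pd.DataFrame(bit_columns, index=countries.index)

print(pd.concat([countries, binary_encoded], axis=1))
```

Five distinct categories fit into three columns here, whereas one-hot encoding would need five. The saving grows with cardinality, since the column count scales with the logarithm of the number of categories.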


In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv('D:/Btech_CS/Python/Feature_Engineering/house-prices-advanced-regression-techniques/train.csv')

In [4]:
df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

# 1. Label Encoding

In [8]:
from sklearn.preprocessing import LabelEncoder

col = "MSZoning"

print("Before", df[col].unique())

# Method 1: pandas factorize
# assigns codes in order of first appearance
df["MSZoning_label"] = pd.factorize(df[col])[0]

# Method 2: sklearn LabelEncoder
# sorts the labels alphabetically before assigning numbers
le = LabelEncoder()
df["MSZoning_label_sklearn"] = le.fit_transform(df[col])

print("After Pandas factorize:", df["MSZoning_label"].unique())
print("After Sklearn:", df["MSZoning_label_sklearn"].unique())

Before ['RL' 'RM' 'C (all)' 'FV' 'RH']
After Pandas factorize: [0 1 2 3 4]
After Sklearn: [3 4 0 1 2]
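
The two methods give different codes because they order the categories differently. A small sketch of how to inspect the mapping each one produced, using a toy series that holds the same five zoning labels (the series itself is constructed here for illustration, not read from `train.csv`):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same five zoning labels as in the output above, in order of first appearance.
zoning = pd.Series(["RL", "RM", "C (all)", "FV", "RH", "RL"])

le = LabelEncoder()
encoded = le.fit_transform(zoning)

# classes_ holds the sorted labels; a label's position is its integer code.
sklearn_mapping = {label: code for code, label in enumerate(le.classes_)}
print(sklearn_mapping)

# inverse_transform recovers the original labels from the codes.
print(le.inverse_transform(encoded))

# pd.factorize returns the uniques in order of first appearance instead.
codes, uniques = pd.factorize(zoning)
pandas_mapping = {label: code for code, label in enumerate(uniques)}
print(pandas_mapping)
```

Neither ordering carries any meaning about the zones themselves, which is why these codes should not be read as ranks.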


In [9]:
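# Ordinal encoding (with an explicit, manually assigned order)
from sklearn.preprocessing import OrdinalEncoder

# Order of first appearance in the data: ['RL' 'RM' 'C (all)' 'FV' 'RH']
# Rank assigned manually below: 'C (all)'=0, 'RM'=1, 'RH'=2, 'RL'=3, 'FV'=4
ordinal_order = [['C (all)', 'RM', 'RH', 'RL', 'FV']]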

oe = OrdinalEncoder(categories=ordinal_order)
df["MSZoning_ordinal"] = oe.fit_transform(df[[col]])

print("Before:", df[col].unique())
print("After Pandas factorize:", df["MSZoning_label"].unique())
print("After OrdinalEncoder (with order):", df["MSZoning_ordinal"].unique())


Before: ['RL' 'RM' 'C (all)' 'FV' 'RH']
After Pandas factorize: [0 1 2 3 4]
After OrdinalEncoder (with order): [3. 1. 0. 4. 2.]
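
With a fixed category list, it helps to decide up front what happens when a label outside that list turns up later. A minimal sketch, assuming a scikit-learn version that supports `handle_unknown="use_encoded_value"`; the label `"A (agr)"` is a hypothetical unseen value added only to trigger that path:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same manual ranking as in the cell above.
ordinal_order = [["C (all)", "RM", "RH", "RL", "FV"]]

train = pd.DataFrame({"MSZoning": ["RL", "RM", "C (all)", "FV", "RH"]})
# "A (agr)" is a hypothetical label that the encoder never saw during fit.
new_rows = pd.DataFrame({"MSZoning": ["RL", "A (agr)"]})

oe = OrdinalEncoder(
    categories=ordinal_order,
    handle_unknown="use_encoded_value",  # do not raise on unseen labels
    unknown_value=-1,                    # code assigned to unseen labels
)
oe.fit(train)

encoded_new = oe.transform(new_rows).ravel()
print(encoded_new)
```

Using `-1` keeps unseen labels clearly separate from the valid ranks 0 to 4; whether that sentinel suits a given model is a design choice left to the reader.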


# 2. One-Hot Encoding

In [10]:
# One-hot encoding with pandas and scikit-learn

col = "MSZoning"
print("Before:", df[col].unique())


# Pandas
mszoning_ohe_pandas = pd.get_dummies(df[col], prefix="MSZoning")

# Sklearn
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, drop=None)
mszoning_ohe_sklearn = ohe.fit_transform(df[[col]])

ohe_df = pd.DataFrame(mszoning_ohe_sklearn, columns=ohe.get_feature_names_out([col]))

pandas_comparison = pd.concat([df[col], mszoning_ohe_pandas], axis=1)

sklearn_comparison = pd.concat([df[col], ohe_df], axis=1)

unique_pandas_ohe = pandas_comparison.drop_duplicates(subset=[col]).reset_index(drop=True)
unique_sklearn_ohe = sklearn_comparison.drop_duplicates(subset=[col]).reset_index(drop=True)

print("Unique Values and OHE (Pandas get dummies)")
print(unique_pandas_ohe)
print("==="*30)
print("\n\nUnique Values and OHE (Scikit OneHotEncoder)")
print(unique_sklearn_ohe)

Before: ['RL' 'RM' 'C (all)' 'FV' 'RH']
Unique Values and OHE (Pandas get dummies)
  MSZoning  MSZoning_C (all)  MSZoning_FV  MSZoning_RH  MSZoning_RL  \
0       RL             False        False        False         True   
1       RM             False        False        False        False   
2  C (all)              True        False        False        False   
3       FV             False         True        False        False   
4       RH             False        False         True        False   

   MSZoning_RM  
0        False  
1         True  
2        False  
3        False  
4        False  


Unique Values and OHE (Scikit OneHotEncoder)
  MSZoning  MSZoning_C (all)  MSZoning_FV  MSZoning_RH  MSZoning_RL  \
0       RL               0.0          0.0          0.0          1.0   
1       RM               0.0          0.0          0.0          0.0   
2  C (all)               1.0          0.0          0.0          0.0   
3       FV               0.0          1.0          0.0   

# 3. Target Encoding

In [11]:
import category_encoders as ce

col = "Neighborhood"
print("Before:", df[col].unique())

target_enc = ce.TargetEncoder(cols=[col])
df["Neighborhood_target"] = target_enc.fit_transform(df[col], df["SalePrice"])

# The next step only displays the category-to-value mapping for inspection.
# It is for visualization and is not part of an ML pipeline.
unique_neighborhood_target = df[[col, "Neighborhood_target"]].drop_duplicates().reset_index(drop=True)

print("\n--- Unique Neighborhood values and their Target Encoded values ---")
print(unique_neighborhood_target)

Before: ['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste']

--- Unique Neighborhood values and their Target Encoded values ---
   Neighborhood  Neighborhood_target
0       CollgCr        197965.734807
1       Veenker        197643.209810
2       Crawfor        209344.287867
3       NoRidge        318453.591177
4       Mitchel        157555.763763
5       Somerst        225319.439258
6        NWAmes        189009.693995
7       OldTown        128230.118126
8       BrkSide        126061.309722
9        Sawyer        136991.546950
10      NridgHt        315819.259117
11        NAmes        145847.080044
12      SawyerW        186444.004409
13       IDOTRR        112604.177463
14      MeadowV        145878.781837
15      Edwards        128237.373454
16       Timber        233548.253290
17      Gilb
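
Because the encoder above is fitted on every row, each row's encoded value is influenced by its own `SalePrice`, which can leak target information into the feature. One common way to reduce this is out-of-fold encoding, sketched below with plain pandas and `KFold`. The toy data and the `Neighborhood_target_oof` column name are made up for illustration, and this version uses unsmoothed category means, so it is not a reproduction of what `ce.TargetEncoder` computes (the library blends each category mean with the overall mean).

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy data (hypothetical values, not the housing dataset).
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "SalePrice": [100, 120, 140, 200, 220, 240, 300, 320],
})

oof = pd.Series(index=toy.index, dtype="float64")  # filled fold by fold
kf = KFold(n_splits=4, shuffle=True, random_state=0)

for fit_idx, enc_idx in kf.split(toy):
    fit_rows = toy.iloc[fit_idx]
    # Category means computed only from rows outside the current fold.
    fold_means = fit_rows.groupby("Neighborhood")["SalePrice"].mean()
    # Categories absent from the fitting rows fall back to their overall mean.
    fallback = fit_rows["SalePrice"].mean()
    encoded = toy["Neighborhood"].iloc[enc_idx].map(fold_means).fillna(fallback)
    oof.iloc[enc_idx] = encoded.to_numpy()

toy["Neighborhood_target_oof"] = oof
print(toy)
```

Each row is encoded from statistics that exclude its own target value, at the cost of slightly different encodings for rows of the same category.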

# 4. Frequency Encoding

In [12]:
col = "Neighborhood"
print("Before:", df[col].unique()[:5])


# Method 1: pandas map with value counts
freq_encoding = df[col].value_counts().to_dict()
df["Neighborhood_freq"] = df[col].map(freq_encoding)

# Method 2: category_encoders CountEncoder
count_enc = ce.CountEncoder(cols=[col])
df["Neighborhood_freq_ce"] = count_enc.fit_transform(df[col])

unique_neighborhood_frequency_encoding = df[[col, "Neighborhood_freq_ce"]].drop_duplicates().reset_index(drop=True)

print("\n--- Unique Neighborhood values and their frequency-encoded values ---")
print(unique_neighborhood_frequency_encoding)

Before: ['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel']

--- Unique Neighborhood values and their frequency-encoded values ---
   Neighborhood  Neighborhood_freq_ce
0       CollgCr                   150
1       Veenker                    11
2       Crawfor                    51
3       NoRidge                    41
4       Mitchel                    49
5       Somerst                    86
6        NWAmes                    73
7       OldTown                   113
8       BrkSide                    58
9        Sawyer                    74
10      NridgHt                    77
11        NAmes                   225
12      SawyerW                    59
13       IDOTRR                    37
14      MeadowV                    17
15      Edwards                   100
16       Timber                    38
17      Gilbert                    79
18      StoneBr                    25
19      ClearCr                    28
20      NPkVill                     9
21      Blmngtn                  

In [13]:
df[[col, "Neighborhood_freq_ce"]].head(50)

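|     | Neighborhood | Neighborhood_freq_ce |
| --- | ------------ | -------------------- |
| 0   | CollgCr      | 150                  |
| 1   | Veenker      | 11                   |
| 2   | CollgCr      | 150                  |
| 3   | Crawfor      | 51                   |
| 4   | NoRidge      | 41                   |
| 5   | Mitchel      | 49                   |
| 6   | Somerst      | 86                   |
| 7   | NWAmes       | 73                   |
| 8   | OldTown      | 113                  |
| 9   | BrkSide      | 58                   |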
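
The pandas `map` approach extends naturally to two variations that are often useful: relative frequencies instead of raw counts, and a defined value for categories that were not present when the counts were computed. A minimal sketch on toy data follows; the short series and the choice of `0` for unseen categories are assumptions made for illustration.

```python
import pandas as pd

# Toy data (hypothetical subset of neighborhood labels).
train = pd.Series(
    ["NAmes", "NAmes", "CollgCr", "NAmes", "Veenker", "CollgCr"],
    name="Neighborhood",
)
# "Blueste" does not occur in the toy training series above.
new_rows = pd.Series(["CollgCr", "Blueste"], name="Neighborhood")

counts = train.value_counts()               # raw counts per category
freqs = train.value_counts(normalize=True)  # share of rows per category

train_count = train.map(counts)
train_freq = train.map(freqs)

# map() yields NaN for categories missing from the lookup; fill with 0 here.
new_count = new_rows.map(counts).fillna(0).astype(int)
new_freq = new_rows.map(freqs).fillna(0.0)

print(pd.DataFrame({"Neighborhood": new_rows, "count": new_count, "freq": new_freq}))
```

One caveat of this encoding is that two categories with the same count receive the same value, so the model can no longer tell them apart.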
