#### **There are three main types of encoding**
- **One Hot Encoding**
- **Label Encoding**
- **Ordinal Encoding**

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
    "gender": ["Male", "Female", "Male", "Female", "Male"],
    "size": ["small", "medium", "large", "medium", "small"]
})

### **One Hot encoding using pandas**

- **Definition**: When data contains categorical values (such as "Red", "Blue", "Green"), machine learning models cannot use them directly because they expect numbers.
- **One‑Hot Encoding**: Converts each category into a binary vector of 0s and 1s.
- **Example**:
  - Red → [1, 0, 0]
  - Blue → [0, 1, 0]
  - Green → [0, 0, 1]
***
This way, each category becomes its own column, which the model can work with directly.
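
The Red/Blue/Green illustration above can be reproduced with a minimal sketch. The `colors` Series is a hypothetical toy column, not part of the notebook's `df`. Note that pandas orders the generated columns alphabetically, so the position of the 1 differs from the illustration (Red ends up in the last column).

```python
import pandas as pd

# Hypothetical toy column mirroring the Red/Blue/Green illustration
colors = pd.Series(["Red", "Blue", "Green", "Blue"], name="color")

# One column per category; every row contains exactly one 1
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)

# Columns are sorted alphabetically: Blue, Green, Red
print(list(one_hot.columns))
```

Each row sums to 1 because a value belongs to exactly one category.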


In [2]:
pd.get_dummies(df, dtype=int, drop_first=True)

Unnamed: 0,city_Houston,city_Los Angeles,city_New York,city_Phoenix,gender_Male,size_medium,size_small
0,0,0,1,0,1,0,1
1,0,1,0,0,0,1,0
2,0,0,0,0,1,0,0
3,1,0,0,0,0,1,0
4,0,0,0,1,1,0,1


### **One Hot encoding using scikit-learn**

In [3]:
from sklearn.preprocessing import OneHotEncoder


# Initialize encoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform
encoded = encoder.fit_transform(df)

# Convert back to DataFrame with column names
pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

Unnamed: 0,city_Chicago,city_Houston,city_Los Angeles,city_New York,city_Phoenix,gender_Female,gender_Male,size_large,size_medium,size_small
0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
1,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0


***Another way: drop the first category of each column and convert the sparse result with `.toarray()`***

In [4]:
from sklearn.preprocessing import OneHotEncoder


# Initialize encoder
encoder = OneHotEncoder(drop='first')

# Fit and transform
encoded = encoder.fit_transform(df).toarray()

# Convert back to DataFrame with column names
pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

Unnamed: 0,city_Houston,city_Los Angeles,city_New York,city_Phoenix,gender_Male,size_medium,size_small
0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
1,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,1.0,0.0,1.0


### ***Example: sparse vs. dense output***

In [5]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

encoder = OneHotEncoder(drop='first')

# Sparse matrix (default)
encoded_sparse = encoder.fit_transform(df)
print(type(encoded_sparse))
# Output: <class 'scipy.sparse._csr.csr_matrix'>

# Dense array
encoded_dense = encoder.fit_transform(df).toarray()
print(type(encoded_dense))
# Output: <class 'numpy.ndarray'>

<class 'scipy.sparse._csr.csr_matrix'>
<class 'numpy.ndarray'>


- `pd.get_dummies()` (pandas) → Quick and simple, great for small datasets.
- `OneHotEncoder()` (scikit‑learn) → More flexible; integrates directly with ML pipelines and works well with scikit‑learn models.
---
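To illustrate the pipeline point, here is a minimal sketch that places `OneHotEncoder` in front of a classifier. The training data, the labels in `y_train`, and the choice of `DecisionTreeClassifier` are all hypothetical and only serve the illustration. `handle_unknown="ignore"` is used so that a city not seen during fitting ("Phoenix") is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data and labels, for illustration only
X_train = pd.DataFrame({
    "city": ["New York", "Chicago", "Houston", "Chicago"],
    "size": ["small", "large", "medium", "small"],
})
y_train = [1, 0, 1, 0]

# The encoder learns its categories during fit and reuses them at predict time
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X_train, y_train)

# "Phoenix" was never seen during fit; its city columns are all zeros
X_new = pd.DataFrame({"city": ["Phoenix"], "size": ["large"]})
predictions = model.predict(X_new)
encoded_new = model.named_steps["onehotencoder"].transform(X_new).toarray()
print(encoded_new)
```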

### **Label Encoding using pandas**
- `Definition`: Assigns each category a numeric label (0, 1, 2, …).
- `Use case`: When the categories are ordinal or simple (e.g., "Low", "Medium", "High").
- `Limitation`: If the categories are nominal (e.g., "Red", "Blue", "Green"), the model may wrongly assume that the order of the numbers carries meaning. For nominal data, One‑Hot Encoding is therefore usually the better choice.
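
One detail worth checking for the "Low", "Medium", "High" use case: by default, pandas category codes follow sorted (alphabetical) order, not the intended ranking. The sketch below uses a hypothetical `levels` Series and shows one way to impose the order explicitly with `pd.CategoricalDtype`.

```python
import pandas as pd

# Hypothetical ordinal column, for illustration only
levels = pd.Series(["Low", "High", "Medium", "Low"])

# Default: categories are sorted alphabetically (High=0, Low=1, Medium=2)
alphabetical = levels.astype("category").cat.codes

# Explicit ordered dtype: codes follow the intended ranking (Low=0, Medium=1, High=2)
ordered_dtype = pd.CategoricalDtype(["Low", "Medium", "High"], ordered=True)
ranked = levels.astype(ordered_dtype).cat.codes

print(alphabetical.tolist())  # [1, 0, 2, 1]
print(ranked.tolist())        # [0, 2, 1, 0]
```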


In [6]:
df["city_encode"] = df['city'].astype('category').cat.codes
df

Unnamed: 0,city,gender,size,city_encode
0,New York,Male,small,3
1,Los Angeles,Female,medium,2
2,Chicago,Male,large,0
3,Houston,Female,medium,1
4,Phoenix,Male,small,4


### **Label Encoding using scikit-learn**

In [7]:
from sklearn.preprocessing import LabelEncoder

# Initialize encoder
encoder = LabelEncoder()

# Fit and transform
df['city_encoded'] = encoder.fit_transform(df["city"])
df


Unnamed: 0,city,gender,size,city_encode,city_encoded
0,New York,Male,small,3,3
1,Los Angeles,Female,medium,2,2
2,Chicago,Male,large,0,0
3,Houston,Female,medium,1,1
4,Phoenix,Male,small,4,4


*`In short`: Label Encoding is simple and fast, but it can be misleading when the categories have no natural order.*