## Data Encoding
#### Data encoding is the process of converting categorical data (text values or labels) into a numerical format that machine learning algorithms can use.
##### Most ML models (such as logistic regression, SVMs, and neural networks) operate only on numbers; they cannot directly use values like "Male", "Female", "Yes", "No", "Red", "Blue".
👉 Encoding maps these categories to numbers while preserving the distinctions between them.

# One-Hot Encoding

 One-Hot Encoding is a data encoding technique used to convert categorical values into binary (0/1) vectors.

In [6]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [8]:
df=pd.DataFrame({
    'color':['red','blue','red','green','blue']
})

In [9]:
df.head()

   color
0    red
1   blue
2    red
3  green
4   blue


In [11]:
# Create an instance of OneHotEncoder
encoder=OneHotEncoder()

In [26]:
encoder.fit_transform(df[['color']]).toarray()

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [33]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample dataset
data = {"Color": ["Red", "Blue", "Green", "Blue", "Red"]}
df = pd.DataFrame(data)
print("Original Data:")
print(df)

# Create encoder
encoder = OneHotEncoder(sparse_output=False)  # return a dense array (scikit-learn >= 1.2)
# On older scikit-learn versions, use OneHotEncoder(sparse=False) instead

# Fit and transform
encoded = encoder.fit_transform(df[["Color"]])

# Convert to DataFrame
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(["Color"]))

print("\nAfter One-Hot Encoding:")
print(encoder_df)


Original Data:
   Color
0    Red
1   Blue
2  Green
3   Blue
4    Red

After One-Hot Encoding:
   Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         1.0          0.0        0.0
4         0.0          0.0        1.0
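
In practice, data seen after fitting may contain categories the encoder never saw. The sketch below shows one way to deal with that using the `handle_unknown="ignore"` option, and also shows that `inverse_transform` recovers the original labels. The `"Yellow"` value and the `new_data` frame are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue", "Red"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["Color"]])

# "Yellow" is a hypothetical category that was not present during fit
new_data = pd.DataFrame({"Color": ["Green", "Yellow"]})
encoded_new = encoder.transform(new_data[["Color"]]).toarray()
print(encoded_new)

# inverse_transform maps one-hot rows back to labels;
# an all-zero row comes back as None
decoded = encoder.inverse_transform(encoded_new)
print(decoded)
```

The column order is Blue, Green, Red, so "Green" becomes `[0, 1, 0]` and the unseen "Yellow" becomes `[0, 0, 0]`.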


# Label Encoding
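### Label Encoding converts each category into an integer label. `LabelEncoder` assigns the integers in sorted order of the class names (here Blue=0, Green=1, Red=2), so the numbers do not imply any real ranking.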

In [34]:
import pandas as pd

data = {"Color": ["Red", "Blue", "Green", "Blue", "Red"]}
df = pd.DataFrame(data)
print("Original Data:")
print(df)


Original Data:
   Color
0    Red
1   Blue
2  Green
3   Blue
4    Red


In [35]:
from sklearn.preprocessing import LabelEncoder

# Create encoder
le = LabelEncoder()

# Fit and transform the column
df["Color_Label"] = le.fit_transform(df["Color"])

print("\nAfter Label Encoding:")
print(df)



After Label Encoding:
   Color  Color_Label
0    Red            2
1   Blue            0
2  Green            1
3   Blue            0
4    Red            2
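
To see which integer was assigned to which category, the fitted encoder can be inspected. This is a small sketch using the same sample data; the names `label_map` and `restored` are chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue", "Red"]})

le = LabelEncoder()
codes = le.fit_transform(df["Color"])

# classes_ holds the sorted unique labels; each label's position is its code
label_map = {label: code for code, label in enumerate(le.classes_)}
print(label_map)

# inverse_transform recovers the original strings from the integer codes
restored = le.inverse_transform(codes)
print(list(restored))
```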


# Ordinal Encoding
Ordinal Encoding is a way to convert categorical values with a natural order into numbers, preserving that order.

#### Example of a natural order: High School < Bachelor < Master < PhD

In [1]:
from sklearn.preprocessing import OrdinalEncoder

In [5]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample data
df = pd.DataFrame({
    'Education': ['Primary', 'Graduate', 'Secondary', 'Post-Graduate', 'Primary']
})

# Define order of categories
education_order = [['Primary', 'Secondary', 'Graduate', 'Post-Graduate']]

# Apply Ordinal Encoding
encoder = OrdinalEncoder(categories=education_order)
df['Education_encoded'] = encoder.fit_transform(df[['Education']])

print(df)



### The encoded values follow the specified category order:

       Education  Education_encoded
0        Primary                0.0
1       Graduate                2.0
2      Secondary                1.0
3  Post-Graduate                3.0
4        Primary                0.0
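
The same mapping can be reproduced without scikit-learn, which makes it easier to see what the encoder does. The sketch below is an alternative using an ordered pandas `Categorical`, not the method used above; the column name `Education_code` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Education': ['Primary', 'Graduate', 'Secondary', 'Post-Graduate', 'Primary']
})

# Same category order as before, lowest to highest
order = ['Primary', 'Secondary', 'Graduate', 'Post-Graduate']

# An ordered Categorical stores each value as its position in `order`
cat = pd.Categorical(df['Education'], categories=order, ordered=True)
df['Education_code'] = cat.codes

print(df)
```

The codes are integers (0, 2, 1, 3, 0) rather than the floats returned by `OrdinalEncoder`. A value missing from `order` would get the code -1.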


# Target Guided Ordinal Encoding

### Target Guided Ordinal Encoding orders the categories by the mean (or median) of the target variable within each category, then assigns integer ranks in that order.

In [7]:
import pandas as pd

# Sample data
df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Bangalore', 'Mumbai', 'Bangalore', 'Delhi'],
    'Purchase': [200, 500, 300, 100, 700, 150, 400]  # Target variable
})

print("Original Data:")
print(df)

# Step 1: Compute the mean target per category and sort the categories by it (ascending)
order = df.groupby('City')['Purchase'].mean().sort_values().index

# Step 2: Map each category to its rank in that order
mapping = {k: i for i, k in enumerate(order)}

# Step 3: Add the encoded column
df['City_encoded'] = df['City'].map(mapping)

print("\nAfter Target Guided Encoding:")
print(df)


Original Data:
        City  Purchase
0      Delhi       200
1     Mumbai       500
2      Delhi       300
3  Bangalore       100
4     Mumbai       700
5  Bangalore       150
6      Delhi       400

After Target Guided Encoding:
        City  Purchase  City_encoded
0      Delhi       200             1
1     Mumbai       500             2
2      Delhi       300             1
3  Bangalore       100             0
4     Mumbai       700             2
5  Bangalore       150             0
6      Delhi       400             1
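
The ranks come directly from the per-city means, which can be printed to check the result. The sketch below also applies the learned mapping to new rows. The `new_data` frame, the city `'Chennai'`, and the `-1` placeholder for unseen cities are assumptions made for illustration.

```python
import pandas as pd

train = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Bangalore', 'Mumbai', 'Bangalore', 'Delhi'],
    'Purchase': [200, 500, 300, 100, 700, 150, 400],
})

# Mean target per city, sorted ascending: this determines the ranks
city_means = train.groupby('City')['Purchase'].mean().sort_values()
print(city_means)

mapping = {city: rank for rank, city in enumerate(city_means.index)}

# Hypothetical new rows; 'Chennai' never appeared in the training data
new_data = pd.DataFrame({'City': ['Mumbai', 'Chennai', 'Bangalore']})

# map() yields NaN for unseen cities; -1 is an arbitrary placeholder chosen here
new_data['City_encoded'] = new_data['City'].map(mapping).fillna(-1).astype(int)
print(new_data)
```

The means are Bangalore 125, Delhi 300, and Mumbai 600, which gives the ranks 0, 1, and 2. Because the mapping is built from the target, it is usually computed on the training data only and then reused on other data, as done here.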
