# One-Hot Encoding

In [16]:
import pandas as pd

In [17]:
data = {
    'Car_Type': ['Sedan', 'SUV', 'Hatchback', 'SUV'],
    'Price': [30000, 40000, 25000, 45000]
}
df = pd.DataFrame(data)
print("Original Data:")
df

Original Data:


    Car_Type  Price
0      Sedan  30000
1        SUV  40000
2  Hatchback  25000
3        SUV  45000


In [19]:
one_hot_encoded = pd.get_dummies(df, columns=['Car_Type'], dtype=int)
print("\nAfter One-Hot Encoding:")
print(one_hot_encoded)


After One-Hot Encoding:
   Price  Car_Type_Hatchback  Car_Type_SUV  Car_Type_Sedan
0  30000                   0             0               1
1  40000                   0             1               0
2  25000                   1             0               0
3  45000                   0             1               0


### Avoiding the Dummy Variable Trap
  
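If we keep all three dummy columns, they always sum to 1, so each column is fully determined by the others. In models with an intercept (e.g., linear regression), this perfect multicollinearity means the coefficients cannot be estimated uniquely.

Solution: Drop one column (e.g., drop_first=True). The dropped category becomes the baseline.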



In [17]:
one_hot_encoded_fixed = pd.get_dummies(df, columns=['Car_Type'], drop_first=True, dtype=int)
print("\nAvoiding Dummy Trap (Drop First Column):")
print(one_hot_encoded_fixed)


Avoiding Dummy Trap (Drop First Column):
   Price  Car_Type_SUV  Car_Type_Sedan
0  30000             0               1
1  40000             1               0
2  25000             0               0
3  45000             1               0


### Interpretation:

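With drop_first=True, pandas drops the first category in sorted order, here Hatchback, which becomes the baseline. If both Car_Type_SUV=0 and Car_Type_Sedan=0 (row 2), the car must be a Hatchback.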
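Both points can be checked directly. The sketch below rebuilds the same data, confirms that the full set of dummy columns sums to 1 in every row (the source of the collinearity), and then recovers the dropped category from the reduced encoding. The helper name `recover_car_type` is hypothetical and exists only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Car_Type': ['Sedan', 'SUV', 'Hatchback', 'SUV'],
    'Price': [30000, 40000, 25000, 45000],
})

# Full encoding: exactly one dummy is 1 per row, so the columns sum to 1.
full = pd.get_dummies(df, columns=['Car_Type'], dtype=int)
dummy_cols = [c for c in full.columns if c.startswith('Car_Type_')]
row_sums = full[dummy_cols].sum(axis=1)
print(row_sums.tolist())

# Reduced encoding: the first category in sorted order (Hatchback) is dropped.
reduced = pd.get_dummies(df, columns=['Car_Type'], drop_first=True, dtype=int)

def recover_car_type(row, baseline='Hatchback', prefix='Car_Type_'):
    """Hypothetical helper: map a reduced one-hot row back to its category."""
    for col in row.index:
        if col.startswith(prefix) and row[col] == 1:
            return col[len(prefix):]
    return baseline  # all dummies are 0, so this is the dropped category

recovered = reduced.apply(recover_car_type, axis=1).tolist()
print(recovered)
```

No information is lost by dropping the column: the recovered list matches the original Car_Type values.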

### When to Use One-Hot Encoding?

✅ Best for:

* Nominal data (no inherent order, e.g., colors, cities).
* Few categories (each category adds a column, so many categories → many sparse columns → curse of dimensionality).

❌ Avoid when:

* Ordinal data (use Label Encoding with an explicit order instead, e.g., Low=1, Medium=2, High=3).
* Too many categories (e.g., 100+ cities → use other techniques like Target Encoding).
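For the high-cardinality case, a naive form of target encoding replaces each category with the mean of the target for that category. The sketch below is an illustration only: the `City` and `Sold` columns and their values are made up, and it skips the smoothing and out-of-fold fitting that a real workflow would likely need to limit target leakage.

```python
import pandas as pd

# Hypothetical data: 'City' stands in for a high-cardinality column.
df_city = pd.DataFrame({
    'City': ['Pune', 'Delhi', 'Pune', 'Mumbai', 'Delhi', 'Pune'],
    'Sold': [1, 0, 1, 0, 1, 0],
})

# Mean of the target per category, learned from the training rows.
city_means = df_city.groupby('City')['Sold'].mean()

# One numeric column replaces what would be one dummy column per city.
df_city['City_Target_Enc'] = df_city['City'].map(city_means)
print(df_city)
```

The encoded frame gains a single column regardless of how many distinct cities appear, which is the main appeal over one-hot encoding here.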

# Label Encoding

* A technique to convert categorical variables into numerical labels (0, 1, 2, ...).

Each category is assigned a unique integer.

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [69]:
data = {
    'Size': ['Small', 'Medium', 'Large', 'medium'],
    'Price': [10, 20, 30, 25]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)

Original Data:
     Size  Price
0   Small     10
1  Medium     20
2   Large     30
3  medium     25


In [71]:
label_encoder = LabelEncoder()
df['Size_Encoded'] = label_encoder.fit_transform(df['Size'])
print("\nAfter Label Encoding:")
print(df)


After Label Encoding:
     Size  Price  Size_Encoded
0   Small     10             2
1  Medium     20             1
2   Large     30             0
3  medium     25             3


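### Why is Small=2, Large=0?

By default, LabelEncoder sorts the unique values and numbers them in that order. Uppercase letters sort before lowercase ones, so 'medium' is treated as a separate category and placed last:

Large → 0

Medium → 1

Small → 2

medium → 3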
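The learned ordering can be inspected rather than inferred. The sketch below, using the same four size strings, reads the encoder's `classes_` attribute and round-trips the labels with `inverse_transform`; the variable names are chosen for illustration.

```python
from sklearn.preprocessing import LabelEncoder

sizes = ['Small', 'Medium', 'Large', 'medium']

encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)

# classes_ holds the learned categories; each one's position is its label.
print(list(encoder.classes_))
print(list(codes))

# Python's own string sort also puts uppercase letters first.
print(sorted(set(sizes)))

# inverse_transform maps integer labels back to the original strings.
print(list(encoder.inverse_transform(codes)))
```

Because the order comes from sorting, it carries no meaning about size; it only guarantees that each distinct string gets a distinct integer.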

In [79]:
# Manually specify the order.
# 'medium' (lowercase) is listed as its own category here; a value missing
# from the categories list would be encoded as -1.
df['Size'] = pd.Categorical(df['Size'], categories=['Small', 'Medium', 'Large', 'medium'], ordered=True)
df['size_encoded'] = df['Size'].cat.codes
df

     Size  Price  Size_Encoded  size_encoded
0   Small     10             2             0
1  Medium     20             1             1
2   Large     30             0             2
3  medium     25             3             3


In [67]:
# Normalize the case first; otherwise 'medium' has no key in the mapping and becomes NaN.
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}
df['Size_Encoded_Correct'] = df['Size'].astype(str).str.capitalize().map(size_order)
print("\nCorrect Label Encoding (Preserving Order):")
print(df)


Correct Label Encoding (Preserving Order):
     Size  Price  Size_Encoded  size_encoded  Size_Encoded_Correct
0   Small     10             2             0                     0
1  Medium     20             1             1                     1
2   Large     30             0             2                     2
3  medium     25             3             3                     1
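scikit-learn's OrdinalEncoder is one alternative to a hand-written mapping when the encoding needs to sit inside a scikit-learn workflow. The sketch below assumes the same three-level order (Small < Medium < Large) and normalizes the case of the input first; `df_sizes` and `Size_Ordinal` are names introduced for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_sizes = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'medium']})

# Normalize case first so 'medium' and 'Medium' count as one category.
df_sizes['Size'] = df_sizes['Size'].str.capitalize()

# The categories list fixes the order: Small < Medium < Large.
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df_sizes['Size_Ordinal'] = ordinal_encoder.fit_transform(df_sizes[['Size']]).ravel()
print(df_sizes)
```

Unlike a dictionary passed to map, which silently yields NaN for a missing key, OrdinalEncoder raises an error by default when it meets a value outside the listed categories.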


In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Step 1: Create a simple dataset with categorical string values
data = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red", "Blue", "Green"],
    "Target": [1, 0, 0, 1, 0, 1]
})

print("Original Dataset:")
print(data)

# Step 2: Try fitting a model with raw strings (this will FAIL)
try:
    model = LogisticRegression()
    model.fit(data[["Color"]], data["Target"])
except Exception as e:
    print("\n❌ Model failed with error:")
    print(e)

# Step 3: Apply One-Hot Encoding
data_encoded = pd.get_dummies(data, columns=["Color"])

print("\nEncoded Dataset (One-Hot):")
print(data_encoded)

# Step 4: Fit model with encoded data (this will WORK)
model.fit(data_encoded.drop("Target", axis=1), data_encoded["Target"])
print("\n✅ Model trained successfully with encoded data!")


Original Dataset:
   Color  Target
0    Red       1
1   Blue       0
2  Green       0
3    Red       1
4   Blue       0
5  Green       1

❌ Model failed with error:
could not convert string to float: 'Red'

Encoded Dataset (One-Hot):
   Target  Color_Blue  Color_Green  Color_Red
0       1       False        False       True
1       0        True        False      False
2       0       False         True      False
3       1       False        False       True
4       0        True        False      False
5       1       False         True      False

✅ Model trained successfully with encoded data!
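One limitation of pd.get_dummies is that it does not remember which columns it produced, so new data with a different set of categories yields a different set of columns. A possible way around this, sketched below, is to put scikit-learn's OneHotEncoder and the model into a single Pipeline. The step names and the "Purple" value are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red", "Blue", "Green"],
    "Target": [1, 0, 0, 1, 0, 1],
})

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at prediction time.
preprocess = ColumnTransformer([
    ("color", OneHotEncoder(handle_unknown="ignore"), ["Color"]),
])
pipeline = Pipeline([
    ("encode", preprocess),
    ("model", LogisticRegression()),
])
pipeline.fit(train[["Color"]], train["Target"])

# "Purple" (hypothetical) never appeared in the training data.
new_data = pd.DataFrame({"Color": ["Red", "Purple"]})
predictions = pipeline.predict(new_data)
print(predictions)
```

The encoder learns its categories once during fit, so the model always receives the same three columns, whatever values show up later.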
