# Part 1.2: Encoding Categorical Variables

Most machine learning models require numerical input features. This notebook covers standard methods for converting categorical (typically text-based) features into numbers.

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {
    'city': ['New York', 'Los Angeles', 'Chicago', 'New York'], # Nominal
    'experience': ['Entry', 'Senior', 'Mid', 'Senior'] # Ordinal
}
df = pd.DataFrame(data)
print("Original DataFrame:")
df

Original DataFrame:


|   | city | experience |
|---|---|---|
| 0 | New York | Entry |
| 1 | Los Angeles | Senior |
| 2 | Chicago | Mid |
| 3 | New York | Senior |


### Strategy 1: Label Encoding (for Ordinal Data)
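Label Encoding converts each category into an integer. This is suitable for **ordinal** data, where the categories have a natural order (e.g., Entry < Mid < Senior). Note that scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order and is intended for target labels, so an explicit mapping is the safer way to preserve the intended order.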

In [2]:
# We can define the order explicitly with a mapping
experience_mapping = {'Entry': 0, 'Mid': 1, 'Senior': 2}
df['experience_encoded'] = df['experience'].map(experience_mapping)

print("DataFrame after Label Encoding 'experience':")
df

DataFrame after Label Encoding 'experience':


|   | city | experience | experience_encoded |
|---|---|---|---|
| 0 | New York | Entry | 0 |
| 1 | Los Angeles | Senior | 2 |
| 2 | Chicago | Mid | 1 |
| 3 | New York | Senior | 2 |
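
The same ordinal encoding can also be sketched with scikit-learn's `OrdinalEncoder`, which is designed for input features and accepts an explicit category order. This is a minimal illustration, not part of the original notebook; `df_demo` and `encoder` are names introduced here for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical standalone frame mirroring the 'experience' column above
df_demo = pd.DataFrame({'experience': ['Entry', 'Senior', 'Mid', 'Senior']})

# Pass the order explicitly; without it, categories are sorted alphabetically
encoder = OrdinalEncoder(categories=[['Entry', 'Mid', 'Senior']])

# fit_transform returns a 2D float array, so flatten it and cast to int
df_demo['experience_encoded'] = (
    encoder.fit_transform(df_demo[['experience']]).ravel().astype(int)
)
print(df_demo)
```

Unlike a plain `.map`, the fitted encoder can be reused on new data and placed inside a scikit-learn pipeline.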


### Strategy 2: One-Hot Encoding (for Nominal Data)
One-Hot Encoding is used for **nominal** data, where categories have no intrinsic order (e.g., New York vs. Chicago). It creates a new binary (0 or 1) column for each category.

In [3]:
# Using pandas get_dummies is the easiest way
df_one_hot = pd.get_dummies(df, columns=['city'], prefix='city')
print("DataFrame after One-Hot Encoding 'city':")
df_one_hot

DataFrame after One-Hot Encoding 'city':


|   | experience | experience_encoded | city_Chicago | city_Los Angeles | city_New York |
|---|---|---|---|---|---|
| 0 | Entry | 0 | False | False | True |
| 1 | Senior | 2 | False | True | False |
| 2 | Mid | 1 | True | False | False |
| 3 | Senior | 2 | False | False | True |


#### Using Scikit-learn's OneHotEncoder
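This is useful when you want to build a reusable preprocessing pipeline: the fitted encoder remembers the categories seen during `fit` and produces the same columns for new data. With `handle_unknown='ignore'`, a category not seen during fitting is encoded as all zeros instead of raising an error.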

In [4]:
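# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
city_encoded = ohe.fit_transform(df[['city']])
# Reuse df's index so the concat below aligns rows correctly
city_df = pd.DataFrame(city_encoded, columns=ohe.get_feature_names_out(['city']), index=df.index)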

df_sklearn_ohe = pd.concat([df.drop('city', axis=1), city_df], axis=1)
print("DataFrame using Scikit-learn's OneHotEncoder:")
df_sklearn_ohe

DataFrame using Scikit-learn's OneHotEncoder:


|   | experience | experience_encoded | city_Chicago | city_Los Angeles | city_New York |
|---|---|---|---|---|---|
| 0 | Entry | 0 | 0.0 | 0.0 | 1.0 |
| 1 | Senior | 2 | 0.0 | 1.0 | 0.0 |
| 2 | Mid | 1 | 1.0 | 0.0 | 0.0 |
| 3 | Senior | 2 | 0.0 | 0.0 | 1.0 |


### Advanced Technique: Target Encoding
Target encoding replaces each category with the mean of the target variable for that category. It is powerful, especially for high-cardinality features, but it leaks target information into the features and can overfit unless the encodings are computed out-of-fold (i.e., from data that excludes the row being encoded).
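
One way to sketch out-of-fold target encoding uses `KFold` from scikit-learn: each row's encoding is the category mean computed from the other folds only. The `toy` frame, its `salary` target, and all values are invented for illustration; this is one possible approach under those assumptions, not a definitive implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data; 'salary' is an invented target column
toy = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'Chicago',
             'New York', 'Chicago', 'Los Angeles'],
    'salary': [100, 90, 70, 110, 80, 95],
})
toy['city_target_enc'] = np.nan

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(toy):
    fit_part = toy.iloc[fit_idx]
    # Category means come only from rows outside the fold being encoded
    fold_means = fit_part.groupby('city')['salary'].mean()
    # Categories missing from the fitting rows fall back to their overall mean
    fallback = fit_part['salary'].mean()
    encoded = toy.iloc[enc_idx]['city'].map(fold_means).fillna(fallback)
    toy.loc[toy.index[enc_idx], 'city_target_enc'] = encoded

print(toy)
```

Because no row contributes to its own encoding, the encoded feature cannot simply memorize the target. For real datasets, smoothing the category means toward the overall mean is a common further safeguard for rare categories.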