# Categorical data is classified into two types:

Ordinal Data: Categorical data with an inherent order (e.g., Low, Medium, High).
Nominal Data: Categorical data with no inherent order (e.g., Red, Green, Blue).

# Encoding Ordinal Data: Ordinal Encoding
Ordinal encoding assigns each category an integer that reflects its position in the order.

Example: Low → 0, Medium → 1, High → 2.
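The Low/Medium/High mapping above can be sketched by hand with a plain dictionary before bringing in scikit-learn. The `levels` series and the `level_to_int` name are made up for illustration only.

```python
import pandas as pd

# Hypothetical column, used only to illustrate the mapping.
levels = pd.Series(['Low', 'High', 'Medium', 'Low'])

# Explicit mapping that preserves the order Low < Medium < High.
level_to_int = {'Low': 0, 'Medium': 1, 'High': 2}

# Series.map replaces each label with its integer code.
encoded = levels.map(level_to_int)
print(encoded.tolist())  # [0, 2, 1, 0]
```

A label missing from the dictionary would become NaN, so this manual approach is probably best kept for small, fixed category sets.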

In [1]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

In [3]:
data = {'experienceLevel' : ['Entry', 'Mid', 'Entry','Senior', 'Mid']}
df = pd.DataFrame(data)

In [4]:
experienceOrder = ['Entry', 'Mid', 'Senior']

In [6]:
# The categories parameter accepts a list of lists; each inner list gives the category order for one column.
encoder = OrdinalEncoder(categories=[experienceOrder])

In [7]:
df['Experience Level Encoder'] = encoder.fit_transform(df[['experienceLevel']])

In [8]:
print(df)

  experienceLevel  Experience Level Encoder
0           Entry                       0.0
1             Mid                       1.0
2           Entry                       0.0
3          Senior                       2.0
4             Mid                       1.0
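As a quick sanity check, the integer codes can be mapped back to their labels with the encoder's `inverse_transform` method. This is a minimal sketch that rebuilds the same data as above; the variable names `codes` and `decoded` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'experienceLevel': ['Entry', 'Mid', 'Entry', 'Senior', 'Mid']})

# Same explicit order as in the cells above.
encoder = OrdinalEncoder(categories=[['Entry', 'Mid', 'Senior']])
codes = encoder.fit_transform(df[['experienceLevel']])

# inverse_transform turns the numeric codes back into the original labels.
decoded = encoder.inverse_transform(codes)
print(decoded.ravel().tolist())  # ['Entry', 'Mid', 'Entry', 'Senior', 'Mid']
```

If the round trip returns the original labels, the order passed to `categories` was applied as intended.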


# Encoding Nominal Data: One-Hot Encoding
Converts categorical variables into multiple binary (0/1) columns.

Suitable when there is no meaningful order.

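For example, a color column with the values Red, Blue, and Green becomes three binary columns:

| Color | Red | Blue | Green |
|-------|-----|------|-------|
| Red   | 1   | 0    | 0     |
| Blue  | 0   | 1    | 0     |
| Green | 0   | 0    | 1     |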
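The Red/Blue/Green table above can be reproduced with `pandas.get_dummies`, shown here as one possible shortcut rather than the approach used in the rest of this section. The `colors` frame is a made-up example.

```python
import pandas as pd

# Hypothetical three-row frame matching the table above.
colors = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})

# get_dummies creates one 0/1 column per distinct value.
one_hot = pd.get_dummies(colors['Color'], dtype=int)

# Columns come back in sorted order; reorder them to match the table.
one_hot = one_hot[['Red', 'Blue', 'Green']]
print(one_hot)
```

Each row contains exactly one 1, marking the color that row had originally.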

In [19]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df2 = pd.DataFrame({'City': ['New York', 'London', 'Tokyo', 'New York', 'Tokyo', 'London']})
# drop='first' removes the first category's column to avoid the dummy variable trap (perfect multicollinearity), which matters for statistical models such as linear regression.
encode = OneHotEncoder(drop='first', sparse_output=False)
encoded_data = encode.fit_transform(df2[['City']])
df_encoded = pd.DataFrame(
    encoded_data,
    columns=encode.get_feature_names_out(['City'])
)
print(df_encoded)

   City_New York  City_Tokyo
0            1.0         0.0
1            0.0         0.0
2            0.0         1.0
3            1.0         0.0
4            0.0         1.0
5            0.0         0.0


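Without drop='first', the encoder would produce three columns: City_London, City_New York, and City_Tokyo. Notice that if you know any two of them, you can deduce the third. For example, a row that is 0 in both City_London and City_New York must be Tokyo. In the output above, London is the row where both remaining columns are 0.

This redundancy causes multicollinearity: the three columns always sum to 1, so they are linearly dependent. With an intercept in the model, the coefficients of models like linear regression are no longer uniquely determined.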
