# 4.1 Categorical variables

- **One-Hot Encoding**: We replace each categorical variable with one or more new binary features (one per category), each taking the value 0 or 1.

In [9]:
import pandas as pd

data = pd.read_csv("../data/credit.csv")

# for illustration purposes, we select only a few columns
data = data[["Income", "Age", "Ethnicity", "Gender"]]

display(data.head())


| | Income | Age | Ethnicity | Gender |
|---|---|---|---|---|
| 0 | 14.891 | 34 | Caucasian | Male |
| 1 | 106.025 | 82 | Asian | Female |
| 2 | 104.593 | 71 | Asian | Male |
| 3 | 148.924 | 36 | Asian | Female |
| 4 | 55.882 | 68 | Caucasian | Male |


Before encoding, it is good practice to check that the categorical values are consistent, for example by listing each category and how often it occurs.

In [10]:
print(data.Ethnicity.value_counts())

Ethnicity
Caucasian           199
Asian               102
African American     99
Name: count, dtype: int64


In [11]:
print(f"Original features: \n {list(data.columns)} \n")
data_dummies = pd.get_dummies(data)

print(f"Features after get_dummies:\n {list(data_dummies.columns)}")

Original features: 
 ['Income', 'Age', 'Ethnicity', 'Gender'] 

Features after get_dummies:
 ['Income', 'Age', 'Ethnicity_African American', 'Ethnicity_Asian', 'Ethnicity_Caucasian', 'Gender_ Male', 'Gender_Female']
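The column name `'Gender_ Male'` in the output above suggests that the raw `Gender` values carry a leading space, which is the kind of inconsistency worth catching before encoding. A minimal sketch of one way to handle it, assuming stray whitespace is the only problem; the `sample` DataFrame below is a hypothetical stand-in for the real data, not taken from `credit.csv`:

```python
import pandas as pd

# Hypothetical sample that mimics the leading space seen in " Male"
sample = pd.DataFrame({"Gender": [" Male", "Female", " Male", "Female"]})

# Column names before cleaning keep the stray space
raw_columns = list(pd.get_dummies(sample).columns)
print(raw_columns)

# Strip surrounding whitespace from the string values, then encode again
sample["Gender"] = sample["Gender"].str.strip()
clean_columns = list(pd.get_dummies(sample).columns)
print(clean_columns)
```

Cleaning the values first keeps the generated feature names predictable and prevents variants such as `" Male"` and `"Male"` from being encoded as two different categories.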


Sometimes, datasets have categorical values encoded as numbers. For example, a web form could return the values 1, 2, or 3 depending on whether the user selected option 1, option 2, or option 3. By default, `pd.get_dummies` treats such columns as numeric and leaves them unchanged. In these cases, we need to either use `OneHotEncoder` from scikit-learn or be explicit about the columns we want to encode.
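To see why this matters, here is a small sketch of what `pd.get_dummies` does when called without a `columns` argument on a frame that mixes an integer-coded category with a string category. The frame is illustrative and the variable names are assumptions made for this example:

```python
import pandas as pd

# Illustrative frame: one integer-coded category and one string category
mixed_df = pd.DataFrame({"Integer Feature": [0, 1, 2, 1],
                         "Categorical Feature": ["birds", "cats", "dogs", "cats"]})

# Without an explicit `columns` argument, only string-like columns are encoded
default_dummies = pd.get_dummies(mixed_df)
default_columns = list(default_dummies.columns)
print(default_columns)
```

The integer column passes through untouched, so a model would read its values as ordered quantities rather than as category labels.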

In [14]:
# create a DataFrame with an integer feature and a categorical string feature

demo_df = pd.DataFrame({"Integer Feature": [0, 1, 2, 1],
                        "Categorical Feature": ["birds", "cats", "dogs", "cats"]})

display(demo_df)

# Explicitly mark the column as categorical rather than numeric by converting it to strings
demo_df["Integer Feature"] = demo_df["Integer Feature"].astype(str)

pd.get_dummies(demo_df, columns=["Integer Feature", "Categorical Feature"])



| | Integer Feature | Categorical Feature |
|---|---|---|
| 0 | 0 | birds |
| 1 | 1 | cats |
| 2 | 2 | dogs |
| 3 | 1 | cats |


| | Integer Feature_0 | Integer Feature_1 | Integer Feature_2 | Categorical Feature_birds | Categorical Feature_cats | Categorical Feature_dogs |
|---|---|---|---|---|---|---|
| 0 | True | False | False | True | False | False |
| 1 | False | True | False | False | True | False |
| 2 | False | False | True | False | False | True |
| 3 | False | True | False | False | True | False |
