# 4.1 Categorical variables

- **One-Hot Encoding**: We replace each categorical variable with one or more new features that can only take the values 0 and 1, one feature per category.

In [17]:
import pandas as pd

data = pd.read_csv("../data/credit.csv")

# for illustration purposes, we select only a few columns
data = data[["Income", "Age", "Ethnicity", "Gender"]]

display(data.head())


,Income,Age,Ethnicity,Gender
0,14.891,34,Caucasian,Male
1,106.025,82,Asian,Female
2,104.593,71,Asian,Male
3,148.924,36,Asian,Female
4,55.882,68,Caucasian,Male


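Before encoding, it is good practice to check that the category values are consistent, for example that there are no misspellings or stray spaces that would create spurious categories.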

In [18]:
print(data.Ethnicity.value_counts())

Ethnicity
Caucasian           199
Asian               102
African American     99
Name: count, dtype: int64
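The same check is worth running on every categorical column. The `Gender_ Male` column name in the `get_dummies` output further down suggests that the `Gender` values carry a leading space. Below is a minimal sketch of how such whitespace could be spotted and stripped before encoding. It uses a small made-up DataFrame (`toy` is hypothetical, not the credit data):

```python
import pandas as pd

# Hypothetical toy data that mimics a stray leading space in "Gender"
toy = pd.DataFrame({"Gender": [" Male", "Female", " Male", "Female"]})

# repr() makes otherwise invisible whitespace visible in the labels
print([repr(value) for value in toy["Gender"].unique()])

# Strip leading/trailing whitespace so identical labels compare as equal
toy["Gender"] = toy["Gender"].str.strip()

toy_dummies = pd.get_dummies(toy)
print(list(toy_dummies.columns))
```

After stripping, the dummy columns are named from the cleaned labels, so no column name contains an embedded space.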


In [11]:
print(f"Original features: \n {list(data.columns)} \n")
data_dummies = pd.get_dummies(data)

print(f"Features after get_dummies:\n {list(data_dummies.columns)}")

Original features: 
 ['Income', 'Age', 'Ethnicity', 'Gender'] 

Features after get_dummies:
 ['Income', 'Age', 'Ethnicity_African American', 'Ethnicity_Asian', 'Ethnicity_Caucasian', 'Gender_ Male', 'Gender_Female']


Sometimes, datasets have categorical values encoded as numbers. For example, a web form could return the values 1, 2, or 3 depending on whether the user selected option 1, option 2, or option 3. By default, `pd.get_dummies` leaves numeric columns unchanged, so in these cases we need to either use `OneHotEncoder` from scikit-learn or be explicit about the columns we want to encode.

In [14]:
# create a DF with an integer feature and a categorical string feature

demo_df = pd.DataFrame({"Integer Feature": [0, 1, 2, 1],
"Categorical Feature": ["birds", "cats", "dogs", "cats"]})

display(demo_df)

# Explicitly mark the column as a category, not a "number", by casting it to str
demo_df["Integer Feature"] = demo_df["Integer Feature"].astype(str)

pd.get_dummies(demo_df, columns=["Integer Feature", "Categorical Feature"])



,Integer Feature,Categorical Feature
0,0,birds
1,1,cats
2,2,dogs
3,1,cats


,Integer Feature_0,Integer Feature_1,Integer Feature_2,Categorical Feature_birds,Categorical Feature_cats,Categorical Feature_dogs
0,True,False,False,True,False,False
1,False,True,False,False,True,False
2,False,False,True,False,False,True
3,False,True,False,False,True,False


# 4.2 Binning (also known as discretisation)

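Some models work better when continuous variables are binned, that is, split into a fixed set of intervals that are then treated as categories. This usually helps linear models, which can then fit a separate value for each bin. Tree-based models usually gain little, because they already learn their own splits of the data.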

In [23]:
import numpy as np

# First, let's check the youngest and oldest ages
print(data.Age.value_counts().sort_index())

# Now let's create the bin edges. Edges are split points, so n edges give n - 1 bins:
# for 10 equal-width bins, np.linspace would need 11 as its third argument.

# I would use the code below if decimal bin edges were acceptable
#bins = np.linspace(23, 98, 11)

# I use this code instead to ensure the bin edges are integers (bins 10 years wide)
bins = np.arange(23, 99, 10)

print(f"bins: {bins}")

Age
23    1
24    3
25    7
26    1
27    2
     ..
86    1
87    2
89    1
91    1
98    1
Name: count, Length: 68, dtype: int64
bins: [23 33 43 53 63 73 83 93]
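`np.linspace` and `np.arange` build bin edges in different ways, which is why one yields decimal edges and the other integer ones. The sketch below compares them, using the minimum and maximum ages printed above. The variable names are illustrative.

```python
import numpy as np

# np.linspace takes the NUMBER of edges: 11 edges delimit 10 equal-width bins
edges_linspace = np.linspace(23, 98, 11)

# np.arange takes the STEP between edges and excludes the stop value
edges_arange = np.arange(23, 99, 10)

print(edges_linspace)  # decimal edges, 7.5 years apart
print(edges_arange)    # integer edges, 10 years apart
```

With `np.arange`, the last edge is 93, so the oldest age (98) lies beyond the final edge rather than inside a closed interval.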


In [25]:
# Now we compute, for each data point, which bin it falls into

which_bin = np.digitize(data.Age, bins=bins)
print(f"\n Data points {data[:5]}")
print(f"\n Membership of the data points {which_bin[:5]}")


 Data points     Income  Age  Ethnicity  Gender
0   14.891   34  Caucasian    Male
1  106.025   82      Asian  Female
2  104.593   71      Asian    Male
3  148.924   36      Asian  Female
4   55.882   68  Caucasian    Male

 Membership of the data points [2 6 5 2 5]
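To make the bin numbering concrete, here is a small sketch of how `np.digitize` treats values that sit exactly on an edge or outside the range. The `ages` array is made up for illustration and is not taken from the credit data.

```python
import numpy as np

bins = np.arange(23, 99, 10)            # [23 33 43 53 63 73 83 93]
ages = np.array([22, 23, 32, 33, 98])   # made-up ages chosen to hit the edges

# With the default right=False, index i means bins[i-1] <= age < bins[i].
# 0 means "below the first edge"; len(bins) means "at or above the last edge".
membership = np.digitize(ages, bins=bins)
print(membership)  # [0 1 1 2 8]
```

Because 23 is the youngest age in the data, bin 0 never occurs here, and bin 8 is open-ended: it collects every age of 93 or above.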


In [43]:
# Now we can one-hot encode the bin membership

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)

# Reshape which_bin to shape (400, 1) so it can be stacked as an extra column
which_bin_reshaped = which_bin.reshape(-1, 1) # -1 keeps the number of rows; 1 makes it a single column
data_combined = np.hstack([data, which_bin_reshaped])

print("\n Data stacked:\n", data_combined[:5])

#encoder.fit finds the unique values that appear in which_bin
encoder.fit(which_bin.reshape(-1,1))

#transform creates the one-hot encoding
data_binned = encoder.transform(which_bin.reshape(-1,1))

print("\nData binned:\n", data_binned[:5])


 Data stacked:
 [[14.891 34 'Caucasian' ' Male' 2]
 [106.025 82 'Asian' 'Female' 6]
 [104.593 71 'Asian' ' Male' 5]
 [148.924 36 'Asian' 'Female' 2]
 [55.882 68 'Caucasian' ' Male' 5]]

Data binned:
 [[0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]]


In `data_binned`, each row is the one-hot encoding of the bin that the corresponding data point belongs to. For example, the first data point (age 34) belongs to bin 2, so the second column of the first row is 1 and all the other columns are 0.
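`np.hstack` on mixed numeric and string columns produces an object array without column names. If a labelled DataFrame is preferred, one alternative (not the approach used above) is to store the bin index as a column and let `pd.get_dummies` encode it. The sketch below is minimal: `toy` copies the first five rows shown earlier, and `Age_bin` is an illustrative column name.

```python
import numpy as np
import pandas as pd

# Stand-in for the first five rows of the credit data shown earlier
toy = pd.DataFrame({"Income": [14.891, 106.025, 104.593, 148.924, 55.882],
                    "Age": [34, 82, 71, 36, 68]})

bins = np.arange(23, 99, 10)
toy["Age_bin"] = np.digitize(toy["Age"], bins=bins)

# Passing columns= makes get_dummies encode the integer bin index
toy_binned = pd.get_dummies(toy, columns=["Age_bin"])
print(toy_binned)
```

Note that `pd.get_dummies` only creates columns for bins that actually occur in the data (here bins 2, 5, and 6), whereas an encoder fitted on the full dataset produces one column for every bin it has seen.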