<img src="Images/aiwithai.png" width="100%">

## One Hot Encoding (Nominal) - pandas
### Encoding into k-1 dummy variables
## In this demo:
We will see how to perform one hot encoding with pandas using the Titanic dataset.

In [42]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [43]:
# load titanic dataset

usecols = ["pclass", "sibsp", "parch", "sex", "embarked", "cabin", "survived"]

data = pd.read_csv("./data/titanic_dataset.csv", usecols=usecols)

data.head()

Unnamed: 0,pclass,survived,sex,sibsp,parch,cabin,embarked
0,1,1,female,0,0,B5,S
1,1,1,male,1,2,C22 C26,S
2,1,0,female,1,2,C22 C26,S
3,1,0,male,1,2,C22 C26,S
4,1,0,female,1,2,C22 C26,S


In [44]:
# let's capture only the first letter of the
# cabin for this demonstration

data["cabin"] = data["cabin"].str[0]

data.head()

Unnamed: 0,pclass,survived,sex,sibsp,parch,cabin,embarked
0,1,1,female,0,0,B,S
1,1,1,male,1,2,C,S
2,1,0,female,1,2,C,S
3,1,0,male,1,2,C,S
4,1,0,female,1,2,C,S


### Important

Just like imputation, every categorical encoding method should be fitted on the training set only, and the resulting encoding then propagated to the test set.

Why?

Because these methods "learn" parameters from the train data (for example, which categories exist), and you want to avoid leaking information from the test set and overfitting.

In [45]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("survived", axis=1),  # predictors
    data["survived"],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((916, 6), (393, 6))

### Let's explore the cardinality

In [46]:
# sex has 2 labels

X_train["sex"].unique()

array(['female', 'male'], dtype=object)

In [47]:
# embarked has 3 labels and missing data

X_train["embarked"].unique()

array(['S', 'C', 'Q', '?'], dtype=object)

In [48]:
# cabin has 8 labels and missing data

X_train["cabin"].unique()

array(['?', 'E', 'C', 'D', 'B', 'A', 'F', 'T', 'G'], dtype=object)

## One hot encoding with pandas
### k dummy variables

In [49]:
# we can create dummy variables with the built-in
# pandas function get_dummies

tmp = pd.get_dummies(X_train["sex"])

tmp.head()

Unnamed: 0,female,male
501,1,0
588,1,0
402,1,0
1193,0,1
686,1,0


In [50]:
# for better visualisation let's put the dummies next
# to the original variable

pd.concat([X_train["sex"], pd.get_dummies(X_train["sex"])], axis=1).head()

Unnamed: 0,sex,female,male
501,female,1,0
588,female,1,0
402,female,1,0
1193,male,0,1
686,female,1,0


In [51]:
# and now let's repeat for embarked

tmp = pd.get_dummies(X_train["embarked"])

tmp.head()

Unnamed: 0,?,C,Q,S
501,0,0,0,1
588,0,0,0,1
402,0,1,0,0
1193,0,0,1,0
686,0,0,1,0


In [52]:
# for better visualisation

pd.concat([X_train["embarked"], pd.get_dummies(X_train["embarked"])], axis=1).head()

Unnamed: 0,embarked,?,C,Q,S
501,S,0,0,0,1
588,S,0,0,0,1
402,C,0,1,0,0
1193,Q,0,0,1,0
686,Q,0,0,1,0


In [53]:
# and now for cabin

tmp = pd.get_dummies(X_train["cabin"])

tmp.head()

Unnamed: 0,?,A,B,C,D,E,F,G,T
501,1,0,0,0,0,0,0,0,0
588,1,0,0,0,0,0,0,0,0
402,1,0,0,0,0,0,0,0,0
1193,1,0,0,0,0,0,0,0,0
686,1,0,0,0,0,0,0,0,0


In [54]:
# and now for all variables together: train set

# ========
# get_dummies automatically recognises variables of type
# object and categorical, ignoring numerical variables.
# ========

X_train_enc = pd.get_dummies(X_train)

print(X_train_enc.shape)

X_train_enc.head()

(916, 18)


Unnamed: 0,pclass,sibsp,parch,sex_female,sex_male,cabin_?,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T,embarked_?,embarked_C,embarked_Q,embarked_S
501,2,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1
588,2,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1
402,2,1,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0
1193,3,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0
686,3,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0


In [55]:
# and now for all variables together: test set

X_test_enc = pd.get_dummies(X_test)

print(X_test_enc.shape)

X_test_enc.head()

(393, 16)


Unnamed: 0,pclass,sibsp,parch,sex_female,sex_male,cabin_?,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,embarked_C,embarked_Q,embarked_S
1139,3,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1
533,2,0,1,1,0,1,0,0,0,0,0,0,0,0,0,1
459,2,1,0,0,1,1,0,0,0,0,0,0,0,0,0,1
1150,3,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1
393,2,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1


Notice the positives of pandas `get_dummies`:

- Dataframe returned with feature names.
- Automatically recognises variables of type object or categorical.
- Dummies appended to original data in place of the original categorical variables.

**And the limitations:**

The train set contains more dummy features than the test set (18 vs 16 columns). This occurred because the test set contains neither category T in cabin nor the missing-value label "?" in embarked.

This will cause problems when training and scoring models with scikit-learn, because estimators expect the test set to have the same columns as the train set.
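
One common workaround for the limitation above is to reindex the encoded test set on the columns of the encoded train set. The snippet below is a minimal sketch on toy data (the `train` and `test` frames are made up for illustration, not the Titanic data); it assumes that a dummy missing from the test set should be filled with 0.

```python
import pandas as pd

# toy data: category "T" appears only in the train set
train = pd.DataFrame({"cabin": ["A", "B", "T"], "pclass": [1, 2, 3]})
test = pd.DataFrame({"cabin": ["A", "B", "B"], "pclass": [3, 1, 2]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# align the test set on the train columns:
# dummies missing from the test set are created and filled with 0,
# dummies that exist only in the test set would be dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc.columns.tolist())
```

After the reindex both frames have the same columns in the same order, so they can be passed to the same model.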

### k-1 dummy variables

In [56]:
# obtaining k-1 labels: we tell get_dummies
# to drop the first dummy variable

tmp = pd.get_dummies(X_train["sex"], drop_first=True)

tmp.head()

Unnamed: 0,male
501,0
588,0
402,0
1193,1
686,0


In [57]:
# obtaining k-1 labels: we tell get_dummies
# to drop the first dummy variable

tmp = pd.get_dummies(X_train["embarked"], drop_first=True)

tmp.head()

Unnamed: 0,C,Q,S
501,0,0,1
588,0,0,1
402,1,0,0
1193,0,1,0
686,0,1,0


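Note that the dropped dummy here is "?": in this dataset missing values are stored as the string "?", which sorts before C, Q and S. So for embarked, if an observation shows 0 for C, Q and S, then its value must be "?", the dropped category.

Caveat: if the missing data were stored as NaN instead, `get_dummies` would ignore it by default, so unless we encode missing data as well, the dropped category and missing data would be treated equally.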

In [58]:
# altogether: train set

X_train_enc = pd.get_dummies(X_train, drop_first=True)

print(X_train_enc.shape)

X_train_enc.head()

(916, 15)


Unnamed: 0,pclass,sibsp,parch,sex_male,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T,embarked_C,embarked_Q,embarked_S
501,2,0,1,0,0,0,0,0,0,0,0,0,0,0,1
588,2,1,1,0,0,0,0,0,0,0,0,0,0,0,1
402,2,1,0,0,0,0,0,0,0,0,0,0,1,0,0
1193,3,0,0,1,0,0,0,0,0,0,0,0,0,1,0
686,3,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [59]:
# altogether: test set

X_test_enc = pd.get_dummies(X_test, drop_first=True)

print(X_test_enc.shape)

X_test_enc.head()

(393, 13)


Unnamed: 0,pclass,sibsp,parch,sex_male,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,embarked_Q,embarked_S
1139,3,0,0,1,0,0,0,0,0,0,0,0,1
533,2,0,1,0,0,0,0,0,0,0,0,0,1
459,2,1,0,1,0,0,0,0,0,0,0,0,1
1150,3,0,0,1,0,0,0,0,0,0,0,0,1
393,2,0,0,1,0,0,0,0,0,0,0,0,1
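
Comparing the two outputs above, the train set keeps `embarked_C` while the test set does not: `drop_first=True` removes whichever category sorts first in the data it receives, and that differs between the two sets. A possible way to keep the encodings consistent is to fix the list of categories from the train set before encoding. The sketch below uses toy series (hypothetical stand-ins for the embarked column) and `pd.CategoricalDtype`.

```python
import pandas as pd

# toy stand-ins: "?" appears only in the train set
train = pd.Series(["?", "C", "Q", "S"], name="embarked")
test = pd.Series(["C", "Q", "S"], name="embarked")

# naive encoding: a different category is dropped in each set
naive_train = pd.get_dummies(train, drop_first=True)
naive_test = pd.get_dummies(test, drop_first=True)

# fix the categories learned from the train set
dtype = pd.CategoricalDtype(categories=sorted(train.unique()))

# now the same first category is dropped in both sets
fixed_train = pd.get_dummies(train.astype(dtype), drop_first=True)
fixed_test = pd.get_dummies(test.astype(dtype), drop_first=True)

print([str(c) for c in naive_test.columns])
print([str(c) for c in fixed_test.columns])
```

Categories that appear in the test set but not in the train set would become missing values after the cast, so they need separate handling.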


### Bonus: get_dummies() can handle missing values

In [None]:
# we can add an additional dummy variable to indicate
# missing data

pd.get_dummies(X_train["embarked"], drop_first=True, dummy_na=True).head()
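
In this dataset the missing values show up as the string "?" rather than NaN (see the unique values printed earlier), so `dummy_na=True` would have nothing to flag. The following sketch, on a toy series that stands in for the embarked column, masks "?" to NaN first; treating "?" as missing is an assumption based on those outputs.

```python
import pandas as pd

# toy stand-in for the embarked column, with "?" marking missing data
embarked = pd.Series(["S", "C", "?", "Q"], name="embarked")

# as-is, "?" is an ordinary category and the NaN indicator stays empty
as_is = pd.get_dummies(embarked, dummy_na=True)

# mask() replaces the entries where the condition holds with NaN
cleaned = embarked.mask(embarked == "?")

# now the last column flags the missing observation
flagged = pd.get_dummies(cleaned, dummy_na=True)

print(flagged)
```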


## Ordinal Encoding - pandas

Ordinal encoding consists of replacing the categories with integers from 1 to n (or 0 to n-1, depending on the implementation), where n is the number of distinct categories of the variable.

The numbers are assigned arbitrarily. This encoding method allows for quick benchmarking of machine learning models. It is also suitable for tree based machine learning algorithms.


### Advantages

- Straightforward to implement
- Does not expand the feature space


### Limitations

- Does not capture any information about the category labels
- Not suitable for linear models.

Ordinal encoding is better suited for non-linear methods, which are able to navigate through the arbitrarily assigned digits to try and find patterns that relate them to the target.


## In this demo:

We will see how to perform ordinal encoding with pandas using the House Prices dataset.

In [60]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [61]:
# load dataset

data = pd.read_csv(
    "./data/houseprice.csv",
    usecols=["Neighborhood", "Exterior1st", "Exterior2nd", "SalePrice"],
)

data.head()

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd,SalePrice
0,CollgCr,VinylSd,VinylSd,208500
1,Veenker,MetalSd,MetalSd,181500
2,CollgCr,VinylSd,VinylSd,223500
3,Crawfor,Wd Sdng,Wd Shng,140000
4,NoRidge,VinylSd,VinylSd,250000


In [62]:
# let's have a look at how many labels each variable has

for col in data.columns:
    print(col, ": ", len(data[col].unique()), " labels")

Neighborhood :  25  labels
Exterior1st :  15  labels
Exterior2nd :  16  labels
SalePrice :  663  labels


In [63]:
# let's explore the unique categories
data["Neighborhood"].unique()

array(['CollgCr', 'Veenker', 'Crawfor', 'NoRidge', 'Mitchel', 'Somerst',
       'NWAmes', 'OldTown', 'BrkSide', 'Sawyer', 'NridgHt', 'NAmes',
       'SawyerW', 'IDOTRR', 'MeadowV', 'Edwards', 'Timber', 'Gilbert',
       'StoneBr', 'ClearCr', 'NPkVill', 'Blmngtn', 'BrDale', 'SWISU',
       'Blueste'], dtype=object)

In [64]:
data["Exterior1st"].unique()

array(['VinylSd', 'MetalSd', 'Wd Sdng', 'HdBoard', 'BrkFace', 'WdShing',
       'CemntBd', 'Plywood', 'AsbShng', 'Stucco', 'BrkComm', 'AsphShn',
       'Stone', 'ImStucc', 'CBlock'], dtype=object)

In [65]:
data["Exterior2nd"].unique()

array(['VinylSd', 'MetalSd', 'Wd Shng', 'HdBoard', 'Plywood', 'Wd Sdng',
       'CmentBd', 'BrkFace', 'Stucco', 'AsbShng', 'Brk Cmn', 'ImStucc',
       'AsphShn', 'Stone', 'Other', 'CBlock'], dtype=object)

### Important

We select which digit to assign to each category using the train set, and then use those mappings in the test set.

In [66]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),  # predictors
    data["SalePrice"],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

## Ordinal encoding with pandas


### Advantages

- quick
- returns pandas dataframe

### Limitations of pandas:

- it does not preserve information from train data to propagate to test data

We need to capture and save the mappings manually if we are planning to use them in production.
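
One simple way to save such a mapping is to serialise the dictionary as JSON. The sketch below uses a toy series and keeps the JSON in a string to stay self-contained; in practice it could be written to a file with `json.dump`. The variable names are illustrative only.

```python
import json

import pandas as pd

# toy train data standing in for the Neighborhood column
train = pd.Series(["CollgCr", "Veenker", "CollgCr", "Crawfor"], name="Neighborhood")

# learn the mapping from the train data, in order of first appearance
ordinal_mapping = {k: i for i, k in enumerate(train.unique())}

# serialise the mapping so it can be stored alongside the model
saved = json.dumps(ordinal_mapping)

# later, for example in production: reload and apply to new data
loaded = json.loads(saved)
new_data = pd.Series(["Crawfor", "CollgCr"], name="Neighborhood")
encoded = new_data.map(loaded)

print(encoded.tolist())
```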

In [67]:
# first let's create a dictionary with the mappings of categories to numbers

ordinal_mapping = {k: i for i, k in enumerate(X_train["Neighborhood"].unique(), 0)}

ordinal_mapping

{'CollgCr': 0,
 'ClearCr': 1,
 'BrkSide': 2,
 'Edwards': 3,
 'SWISU': 4,
 'Sawyer': 5,
 'Crawfor': 6,
 'NAmes': 7,
 'Mitchel': 8,
 'Timber': 9,
 'Gilbert': 10,
 'Somerst': 11,
 'MeadowV': 12,
 'OldTown': 13,
 'BrDale': 14,
 'NWAmes': 15,
 'NridgHt': 16,
 'SawyerW': 17,
 'NoRidge': 18,
 'IDOTRR': 19,
 'NPkVill': 20,
 'StoneBr': 21,
 'Blmngtn': 22,
 'Veenker': 23,
 'Blueste': 24}

The dictionary indicates which number will replace each category. Numbers were assigned arbitrarily from 0 to n - 1 where n is the number of distinct categories.

In [68]:
# replace the labels with the integers

X_train["Neighborhood"] = X_train["Neighborhood"].map(ordinal_mapping)
X_test["Neighborhood"] = X_test["Neighborhood"].map(ordinal_mapping)
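
Note that `map` returns NaN for any category that is not a key of the dictionary, which happens when the test set contains a label never seen in the train set. A minimal sketch of how to detect this on toy data follows; filling unseen labels with -1 is an assumption made for illustration, not a recommendation from this demo.

```python
import pandas as pd

train = pd.Series(["CollgCr", "Veenker", "CollgCr"], name="Neighborhood")
test = pd.Series(["Veenker", "Blueste"], name="Neighborhood")  # "Blueste" is unseen

# mapping learned from the train set only
mapping = {k: i for i, k in enumerate(train.unique())}

# unseen categories are not in the mapping, so map() returns NaN for them
test_enc = test.map(mapping)
n_unseen = int(test_enc.isna().sum())

# one possible convention: reserve -1 for unseen categories
test_enc_filled = test_enc.fillna(-1).astype(int)

print(n_unseen, test_enc_filled.tolist())
```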

In [69]:
# let's explore the result

X_train.head(10)

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
64,0,VinylSd,VinylSd
682,1,Wd Sdng,Wd Sdng
960,2,Wd Sdng,Plywood
1384,3,WdShing,Wd Shng
1100,4,Wd Sdng,Wd Sdng
416,5,HdBoard,HdBoard
1034,6,MetalSd,MetalSd
853,7,MetalSd,HdBoard
472,3,VinylSd,VinylSd
1011,3,AsphShn,AsphShn


In [70]:
# we can turn the previous commands into 2 functions


def find_category_mappings(df, variable):
    # map each category to an integer, in order of first appearance
    return {k: i for i, k in enumerate(df[variable].unique(), 0)}


def integer_encode(train, test, variable, ordinal_mapping):
    # replace the categories in place, in both train and test
    train[variable] = train[variable].map(ordinal_mapping)
    test[variable] = test[variable].map(ordinal_mapping)

In [71]:
# and now we run a loop over the remaining categorical variables

for variable in ["Exterior1st", "Exterior2nd"]:
    mappings = find_category_mappings(X_train, variable)
    integer_encode(X_train, X_test, variable, mappings)

In [72]:
# let's see the result

X_train.head()

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
64,0,0,0
682,1,1,1
960,2,1,2
1384,3,2,3
1100,4,1,1



## Count or frequency encoding - pandas

In count encoding we replace each category by the number of observations that show that category in the dataset. Similarly, we can replace the category by the frequency (or percentage) of observations in the dataset. That is, if 10 of our 100 observations show the colour blue, we would replace blue by 10 if doing count encoding, or by 0.1 if replacing by the frequency.

These techniques capture the representation of each label in a dataset, but the encoding may not necessarily be predictive of the outcome. They are, however, very popular encoding methods in Kaggle competitions.

The assumption of this technique is that the number of observations shown by each category is somewhat informative of the predictive power of the category.
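
The blue example can be reproduced with a few lines of pandas. This is a minimal sketch on a made-up `colours` series of 100 observations, 10 of which are blue.

```python
import pandas as pd

# 100 toy observations: 10 blue, 30 red, 60 green
colours = pd.Series(["blue"] * 10 + ["red"] * 30 + ["green"] * 60, name="colour")

# count encoding: category -> number of observations
count_map = colours.value_counts().to_dict()

# frequency encoding: category -> fraction of observations
frequency_map = colours.value_counts(normalize=True).to_dict()

count_encoded = colours.map(count_map)
freq_encoded = colours.map(frequency_map)

print(count_map["blue"], frequency_map["blue"])
```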


### Advantages

- Simple
- Does not expand the feature space

### Disadvantages

- If 2 different categories appear the same number of times in the dataset, they will be replaced by the same number, so the encoding may lose valuable information.

For example, if there are 10 observations for the category blue and 10 observations for the category red, both will be replaced by 10, and therefore, after the encoding, will appear to be the same thing. 
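
Before applying count encoding, it can be worth checking which categories would collide. The sketch below, on toy data, lists the categories that share a count by using `Series.duplicated`.

```python
import pandas as pd

# toy data: blue and red both appear 10 times
colours = pd.Series(["blue"] * 10 + ["red"] * 10 + ["green"] * 5, name="colour")

counts = colours.value_counts()

# categories whose count is shared with at least one other category
collisions = counts[counts.duplicated(keep=False)]

# after encoding, blue and red can no longer be told apart
encoded = colours.map(counts.to_dict())

print(sorted(collisions.index))
print(encoded.nunique())
```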


Follow this [thread in Kaggle](https://www.kaggle.com/general/16927) for more information.



## In this demo:

We will see how to perform count or frequency encoding with pandas using the House Prices dataset.

In [73]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [74]:
# load dataset

data = pd.read_csv(
    "./data/houseprice.csv",
    usecols=["Neighborhood", "Exterior1st", "Exterior2nd", "SalePrice"],
)

data.head()

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd,SalePrice
0,CollgCr,VinylSd,VinylSd,208500
1,Veenker,MetalSd,MetalSd,181500
2,CollgCr,VinylSd,VinylSd,223500
3,Crawfor,Wd Sdng,Wd Shng,140000
4,NoRidge,VinylSd,VinylSd,250000


In [75]:
# let's have a look at how many labels each variable has

for col in data.columns:
    print(col, ": ", len(data[col].unique()), " labels")

Neighborhood :  25  labels
Exterior1st :  15  labels
Exterior2nd :  16  labels
SalePrice :  663  labels


### Important

When doing count transformation of categorical variables, it is important to calculate the count (or frequency = count / total observations) **over the training set**, and then use those numbers to replace the labels in the test set.

In [76]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data[["Neighborhood", "Exterior1st", "Exterior2nd"]],  # predictors
    data["SalePrice"],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

## Count and Frequency encoding with pandas

In [77]:
# let's obtain the counts for each one of the labels
# in the variable Neighborhood

count_map = X_train["Neighborhood"].value_counts().to_dict()

count_map

{'NAmes': 151,
 'CollgCr': 105,
 'OldTown': 73,
 'Edwards': 71,
 'Sawyer': 61,
 'Somerst': 56,
 'Gilbert': 55,
 'NridgHt': 51,
 'NWAmes': 51,
 'SawyerW': 45,
 'BrkSide': 41,
 'Mitchel': 36,
 'Crawfor': 35,
 'NoRidge': 30,
 'Timber': 30,
 'ClearCr': 24,
 'IDOTRR': 24,
 'SWISU': 18,
 'StoneBr': 16,
 'Blmngtn': 12,
 'MeadowV': 12,
 'BrDale': 10,
 'NPkVill': 7,
 'Veenker': 6,
 'Blueste': 2}

The dictionary contains the number of observations per category in Neighborhood.

In [78]:
# replace the labels with the counts

X_train["Neighborhood"] = X_train["Neighborhood"].map(count_map)
X_test["Neighborhood"] = X_test["Neighborhood"].map(count_map)

In [79]:
# let's explore the result

X_train.head(10)

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
64,105,VinylSd,VinylSd
682,24,Wd Sdng,Wd Sdng
960,41,Wd Sdng,Plywood
1384,71,WdShing,Wd Shng
1100,18,Wd Sdng,Wd Sdng
416,61,HdBoard,HdBoard
1034,35,MetalSd,MetalSd
853,151,MetalSd,HdBoard
472,71,VinylSd,VinylSd
1011,71,AsphShn,AsphShn


In [80]:
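# if instead of the count we would like the frequency, we divide the
# count by the total number of observations (normalize=True does this).
# Note: "Neighborhood" already holds counts here, so the keys below are
# counts, not labels, and labels with tied counts (51, 30, 24 and 12)
# share one pooled frequency.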

frequency_map = (X_train["Neighborhood"].value_counts(normalize=True)).to_dict()
frequency_map

{151: 0.14774951076320939,
 105: 0.10273972602739725,
 51: 0.09980430528375733,
 73: 0.07142857142857142,
 71: 0.06947162426614481,
 61: 0.05968688845401174,
 30: 0.05870841487279843,
 56: 0.0547945205479452,
 55: 0.053816046966731895,
 24: 0.046966731898238745,
 45: 0.04403131115459882,
 41: 0.040117416829745595,
 36: 0.03522504892367906,
 35: 0.03424657534246575,
 12: 0.023483365949119372,
 18: 0.01761252446183953,
 16: 0.015655577299412915,
 10: 0.009784735812133072,
 7: 0.00684931506849315,
 6: 0.005870841487279843,
 2: 0.0019569471624266144}
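
The map printed above is keyed by counts because the Neighborhood column had already been replaced by counts, so categories with equal counts share one pooled frequency. To obtain one frequency per label, the map can be built from the original labels instead. The sketch below shows this on toy data; filling unseen test categories with 0.0 is an assumption made for illustration.

```python
import pandas as pd

# toy data with the original labels still in place
train = pd.DataFrame({"Neighborhood": ["NAmes", "NAmes", "CollgCr", "OldTown"]})
test = pd.DataFrame({"Neighborhood": ["CollgCr", "NAmes", "Blueste"]})

# frequency per label, learned from the train set only
frequency_map = train["Neighborhood"].value_counts(normalize=True).to_dict()

# apply the same map to train and test
train_freq = train["Neighborhood"].map(frequency_map)
# "Blueste" is not in the train set, so map() gives NaN; here we use 0.0
test_freq = test["Neighborhood"].map(frequency_map).fillna(0.0)

print(frequency_map)
print(test_freq.tolist())
```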

In [81]:
# let's explore the result
X_train["Neighborhood"] = X_train["Neighborhood"].map(frequency_map)

X_train.head(10)

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
64,0.10274,VinylSd,VinylSd
682,0.046967,Wd Sdng,Wd Sdng
960,0.040117,Wd Sdng,Plywood
1384,0.069472,WdShing,Wd Shng
1100,0.017613,Wd Sdng,Wd Sdng
416,0.059687,HdBoard,HdBoard
1034,0.034247,MetalSd,MetalSd
853,0.14775,MetalSd,HdBoard
472,0.069472,VinylSd,VinylSd
1011,0.069472,AsphShn,AsphShn


We can do the same for the other columns as well. Try it yourself for the Exterior1st and Exterior2nd columns.