## Engineer labels of categorical variables

In this section, I will describe a variety of methods to transform the strings of categorical variables into numbers, so that we can feed these variables into machine learning algorithms using sklearn.

### One Hot Encoding

One hot encoding consists of replacing the categorical variable with a set of boolean variables, each taking the value 0 or 1, to indicate whether or not a certain category / label of the variable was present for that observation.

Each of these boolean variables is also known as a <b>dummy variable or binary variable</b>.

For example, from the categorical variable "Gender", with labels 'female' and 'male', we can generate the boolean variable "female", which takes 1 if the person is female and 0 otherwise. We can also generate the variable "male", which takes 1 if the person is male and 0 otherwise.

See below:

In [8]:
import pandas as pd
dataGender=pd.read_csv("C:\\Sunil Profile All In one\\1.upgradMaterials\\2.All in one DataSet\\02.Titanic\\titanic.csv",usecols=['Sex'])
dataGender.head(5)

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [10]:
print(pd.get_dummies(dataGender).head())


   Sex_female  Sex_male
0           0         1
1           1         0
2           1         0
3           1         0
4           0         1



In [11]:
# for better visualisation
pd.concat([dataGender, pd.get_dummies(dataGender)], axis=1).head()

Unnamed: 0,Sex,Sex_female,Sex_male
0,male,0,1
1,female,1,0
2,female,1,0
3,female,1,0
4,male,0,1


As you may have noticed, we only need 1 of the 2 dummy variables to represent the original categorical variable Sex. Any of the 2 will suffice, and it doesn't matter which one we select, since they are equivalent.

Therefore, to encode a categorical variable with 2 labels, we need 1 dummy variable.

#### <b>Extending this concept: to encode a categorical variable with k labels, we need k-1 dummy variables.</b>

How can we get this using pandas?

In [12]:
# obtaining k-1 labels
pd.get_dummies(dataGender, drop_first=True).head()

Unnamed: 0,Sex_male
0,1
1,0
2,0
3,0
4,1


In [18]:
# Let's now look at an example with more than 2 labels

dataEmbarked=pd.read_csv('C:\\Sunil Profile All In one\\1.upgradMaterials\\2.All in one DataSet\\02.Titanic\\titanic.csv', usecols=['Embarked'])
dataEmbarked.head()

Unnamed: 0,Embarked
0,S
1,C
2,S
3,S
4,S


In [20]:
# check the number of different labels
dataEmbarked.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [22]:
# get k-1 dummy variables

pd.get_dummies(dataEmbarked, drop_first=True).head()

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1


In [24]:
# we can also add an additional dummy variable to indicate whether there was missing data
pd.get_dummies(dataEmbarked, drop_first=True, dummy_na=True).head()

Unnamed: 0,Embarked_Q,Embarked_S,Embarked_nan
0,0,1,0
1,0,0,0
2,0,1,0
3,0,1,0
4,0,1,0


In [26]:
# by summing the number of 1s per boolean variable over the rows of the dataset, we get to know how
# many observations we have for each variable (i.e., each category)

pd.get_dummies(dataEmbarked, drop_first=True, dummy_na=True).sum(axis=0)

Embarked_Q       77
Embarked_S      644
Embarked_nan      2
dtype: int64

### Notes
By default, both pandas and sklearn produce the whole set of dummy variables from a categorical variable. That is, instead of returning k-1 binary variables, they return k. In pandas, the drop_first=True argument drops the first binary variable to obtain k-1; the OneHotEncoder in sklearn offers a drop parameter for the same purpose.
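
As a rough illustration of the sklearn side, the sketch below encodes a small made-up column into k and then k-1 binary variables. The toy array `X` is an assumption standing in for a column like Embarked without missing values; it is not taken from the Titanic file.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical), one categorical column with k = 3 labels
X = np.array([["S"], ["C"], ["S"], ["Q"]], dtype=object)

# Default behaviour: k binary columns, one per category
enc_k = OneHotEncoder()
X_k = enc_k.fit_transform(X).toarray()

# drop="first" removes the first category of each feature, leaving k-1 columns
enc_k1 = OneHotEncoder(drop="first")
X_k1 = enc_k1.fit_transform(X).toarray()

print(enc_k.categories_)      # categories learned from the data
print(X_k.shape, X_k1.shape)  # number of columns with k and with k-1
```

The encoder returns a sparse matrix by default, which is why `.toarray()` is called before inspecting the shape.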

#### When should you use k and when k-1?
When the original variable is binary, that is, when the original variable has only 2 labels, then you should create one and only one binary variable.
When the original variable has more than 2 labels, the following is important:

#### <b><u> One hot encoding into k-1:</u></b> 
- One hot encoding into k-1 binary variables takes into account that we can use 1 less dimension and still represent the whole information: if the observation is 0 in all the binary variables, then it must be 1 in the final (removed) binary variable. As an example, for the variable gender encoded into male, if the observation is 0, then it has to be female. We do not need the additional female variable to explain that.


- One hot encoding into k-1 binary variables should be used in linear regression, to keep the correct number of degrees of freedom (k-1). Linear regression has access to all of the features while it is being trained, and therefore examines the whole set of dummy variables together. This means that k-1 binary variables carry the complete information about the original categorical variable for the linear regression.<br>


- The same is true for all machine learning algorithms that look at ALL the features at the same time during training, for example support vector machines, neural networks and clustering algorithms.<br>
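
One way to see why k-1 is enough for a linear model is to look at the rank of the design matrix. The sketch below is a minimal illustration on made-up data (the values in `df` are hypothetical): with an intercept column plus all k dummies, the dummies sum to the intercept, so one column is redundant.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with k = 3 labels
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C", "Q"]})

full = pd.get_dummies(df, dtype=float)                      # k columns
reduced = pd.get_dummies(df, drop_first=True, dtype=float)  # k-1 columns

# Add the intercept column that linear regression normally includes
intercept = np.ones((len(df), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# With k dummies the columns sum to the intercept, so the matrix is rank deficient
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # 4 columns, rank 3
print(X_reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

Both matrices have the same rank, so the extra dummy column adds no information that the linear model can use.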

#### One hot encoding into k dummy variables
- However, tree based models select only a subset of the features at each split, in order to separate the data at each node. Therefore, the last category, the one that was removed in the one hot encoding into k-1 variables, would only be taken into account by those splits, or even trees, that use the entire set of binary variables at a time.
<br>

- And this would rarely happen, because each split usually uses 1-3 features to make a decision. So, tree based methods will rarely, if ever, consider that additional label, the one that was dropped. Thus, if a categorical variable will be used in a tree based learning algorithm, it is good practice to encode it into k binary variables instead of k-1.


- Finally, if you are planning to do feature selection, you will also need the entire set of binary variables (k) to let the machine learning model select which ones have the most predictive power.<br>
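
The point about the dropped label can be sketched on a small made-up column (the values in `df` are hypothetical). With k-1 dummies, the dropped category is only identifiable by inspecting all remaining dummies at once, whereas with k dummies a single column is enough.

```python
import pandas as pd

# Toy column (hypothetical values) with k = 3 labels
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C"]})

k_minus_1 = pd.get_dummies(df, drop_first=True, dtype=int)  # Embarked_Q, Embarked_S
k_full = pd.get_dummies(df, dtype=int)                      # Embarked_C, Embarked_Q, Embarked_S

# With k-1 dummies, the dropped label "C" is only recoverable by checking
# that ALL remaining dummies are 0 at the same time
is_c_from_k_minus_1 = k_minus_1.sum(axis=1) == 0

# With k dummies, a single column identifies the same rows
is_c_from_k = k_full["Embarked_C"] == 1

print(is_c_from_k_minus_1.tolist())
print(is_c_from_k.tolist())
```

Both lines print the same mask, but a split that looks at one feature at a time can only use the second form directly.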


### Notes
- If our datasets contain a few categorical variables with many labels, one hot encoding will quickly produce datasets with thousands of columns or more. This may make the training of our algorithms slow.


- In addition, many of these dummy variables may be similar to each other, since it is not unusual for 2 or more variables to share the same combinations of 1s and 0s.
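
Both points can be checked before committing to one hot encoding. The sketch below uses a made-up dataset (the column names `cabin` and `ticket` and their values are hypothetical): it counts how many dummy columns the encoding will create and lists the dummy columns that repeat the 0/1 pattern of an earlier column.

```python
import pandas as pd

# Toy dataset (hypothetical): two categorical variables with several labels each
df = pd.DataFrame({
    "cabin": ["A1", "B2", "C3", "A1", "D4"],
    "ticket": ["t1", "t2", "t3", "t1", "t4"],
})

# Distinct labels per variable = number of dummy columns each one creates (k encoding)
labels_per_variable = df.nunique()
n_dummy_columns = int(labels_per_variable.sum())

dummies = pd.get_dummies(df, dtype=int)

# Dummy columns whose 0/1 pattern repeats that of an earlier column
duplicated_mask = dummies.T.duplicated().to_numpy()
duplicated_columns = dummies.columns[duplicated_mask].tolist()

print(n_dummy_columns, dummies.shape[1])
print(duplicated_columns)
```

On real data with thousands of labels, transposing the full dummy matrix may be costly, so this check is probably best run on a sample.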