# Encoding categorical variables

## Categorical dataset

**Import of the libraries useful for the analysis**

In [1]:
# pandas -> read input file and data manipulation
import pandas as pd
pd.set_option("float_format", "{:.2f}".format)

# numpy -> array manipulations
import numpy as np
np.set_printoptions(suppress=True)

# scikit-learn variables encoding
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, MinMaxScaler, LabelEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer

# show sklearn objects in diagram
from sklearn import set_config
set_config(print_changed_only=False, display="diagram")

# warnings -> to silence warnings
from warnings import filterwarnings
filterwarnings("ignore")

### Encoding categorical variables
Until now, we have dealt only with numerical values in our analysis, but it is very common to have categorical variables in the data. This kind of data describes a trait or an event using a string of words rather than numbers. Since most machine learning models can interpret only numerical values, we need to encode this kind of data before feeding it to an estimator.

In [2]:
df = pd.read_csv(filepath_or_buffer="data/categorical.csv")

**Show the first 10 rows**  
There are only categorical values (type ```object```)

In [3]:
df.head(10)

Unnamed: 0,Gender,Student,Married,Ethnicity,Rating
0,Male,No,Yes,Caucasian,Low
1,Female,Yes,Yes,Asian,Premium
2,Male,No,No,Asian,Premium
3,Female,No,No,Asian,Premium
4,Male,No,Yes,Caucasian,Medium
5,Male,No,No,Caucasian,Premium
6,Female,No,No,African American,Low
7,Male,No,No,Asian,Premium
8,Female,No,No,Caucasian,Low
9,Female,Yes,Yes,African American,Premium


**To deal with categorical variables, we first have to inspect their meaning and structure. We can have:**
* **binary variables**: can have exactly two values (categories)
* **polytomous variables**: have more than two possible categories
>* **nominal variables**: there is no intrinsic ordering to the categories
>* **ordinal variables**: there is a clear ordering of the variables
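
As a quick first check, the number of distinct values per column separates binary from polytomous variables. The sketch below uses a small hypothetical dataframe (`toy` is made up for illustration and is not `categorical.csv`). Whether a polytomous variable is nominal or ordinal still has to be decided from its meaning.

```python
import pandas as pd

# hypothetical toy data, mimicking the structure of the dataset above
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Ethnicity": ["Caucasian", "Asian", "African American", "Asian"],
    "Rating": ["Low", "Premium", "Medium", "Low"],
})

# number of distinct categories in each column
n_categories = toy.nunique()

# a column with exactly two categories is binary, otherwise polytomous
kind = n_categories.map(lambda n: "binary" if n == 2 else "polytomous")
print(kind)
```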

### Binary variables
This kind of variable can be easily encoded by creating a dummy variable for each category, which indicates with a value of one or zero the presence or absence of the attribute. This technique is called one hot encoding and is valid for binary and nominal variables.  
We can see a practical example using the variable ```Gender``` of the dataset, which can have only two values: _'Female'_ or _'Male'_.  

The objective is to transform the variable in order to obtain the following result:  


|    | Gender   |   Gender_Male |
|---:|:---------|--------------:|
|  0 | Male     |             1 |
|  1 | Female   |             0 |
|  2 | Male     |             1 |
|  3 | Female   |             0 |
|  4 | Male     |             1 |

**Definition of the variable gender containing the values to encode**

In [4]:
gender = df[["Gender"]].values

**Encode binary categorical variables using scikit-learn**  
The operation of encoding this kind of variable can be done using the ```OneHotEncoder``` class object.  
documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

**Definition of the encoder**

In [5]:
# note: from scikit-learn 1.2 the parameter sparse is named sparse_output
ohe = OneHotEncoder(sparse=False, dtype=int)

**Fit of the encoder to the data**  
With this operation, the encoder reads and sorts the unique values of the variable and creates a rule to convert them to numeric values.

In [6]:
ohe.fit(X=gender)

**Encoding of the categorical variable**  
We use the method ```transform``` to encode the variable and assign the result to the ```gender_enc``` variable. The object returned by the method is a numpy array with 2 columns, one for the Female category and one for the Male category.

In [7]:
gender_enc = ohe.transform(X=gender)

print(f"gender encoded -> type object: {type(gender_enc)}, shape: {gender_enc.shape}")

gender encoded -> type object: <class 'numpy.ndarray'>, shape: (400, 2)


**We can use the fitted encoder to obtain the original values from the encoded data**  
To do this, we can apply the method ```inverse_transform``` of the fitted encoder to the encoded data.

In [8]:
gender_inv = ohe.inverse_transform(X=gender_enc)

**Comparing original, encoded data and inversed data**  
As we can see from the results, the encoded variable contains two columns, ordered alphabetically by category: the first column indicates with 1 or 0 the presence or absence of the Female value, while the second column is for the Male category. However, we know that one of the assumptions of the models studied so far is the lack of perfect multicollinearity in the predictors, and the encoded columns now have this problem: each column is exactly the complement of the other, so one can always be derived from the other.

In [9]:
print("original \t encoded \t inversed")

for original, encoded, inversed in zip(gender[:10, :], gender_enc[:10, :], gender_inv[:10, :]):
    print(f"{original} \t {encoded} \t\t {inversed}")

original 	 encoded 	 inversed
['Male'] 	 [0 1] 		 ['Male']
['Female'] 	 [1 0] 		 ['Female']
['Male'] 	 [0 1] 		 ['Male']
['Female'] 	 [1 0] 		 ['Female']
['Male'] 	 [0 1] 		 ['Male']
['Male'] 	 [0 1] 		 ['Male']
['Female'] 	 [1 0] 		 ['Female']
['Male'] 	 [0 1] 		 ['Male']
['Female'] 	 [1 0] 		 ['Female']
['Female'] 	 [1 0] 		 ['Female']


**Face the multicollinearity problem**  
To face this problem, we can drop one of the two columns created by the encoder. To do this simply, we can instantiate the encoder setting the parameter ```drop="first"```.  
Performing this operation has two benefits:
* it avoids the multicollinearity problem
* it produces an array with one column less

As we can see from the operation below, the array contains only the dummy variable for the Male category.

In [10]:
ohe = OneHotEncoder(drop="first", sparse=False, dtype=int)
gender_enc = ohe.fit_transform(X=gender)
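# note: from scikit-learn 1.0 this method is named get_feature_names_out (get_feature_names was removed in 1.2)
print(ohe.get_feature_names())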
print(gender_enc[:10, :])

['x0_Male']
[[1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]]
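
The multicollinearity problem described above can also be checked numerically. The sketch below uses a hypothetical array of dummies (not the dataset itself) and assumes a linear model with an intercept: with both dummy columns, the design matrix has three columns but only two are linearly independent, while dropping one dummy restores full column rank.

```python
import numpy as np

# hypothetical one hot encoding of a binary variable: columns are [Female, Male]
dummies = np.array([[0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

# every row sums to one, exactly like the intercept column of a linear model
row_sums = dummies.sum(axis=1)

# design matrix with the intercept: 3 columns, but they are not independent
X_full = np.column_stack([np.ones(len(dummies)), dummies])
rank_full = np.linalg.matrix_rank(X_full)

# dropping one dummy column restores full column rank
X_drop = np.column_stack([np.ones(len(dummies)), dummies[:, 1]])
rank_drop = np.linalg.matrix_rank(X_drop)

print(row_sums, rank_full, rank_drop)
```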


### Nominal variables
Nominal variables are treated with the same technique seen before for binary variables; the main difference is that the encoder (with ```drop="first"```) creates *n-1* dummy variables from the original variable with *n* categories. We can see a practical example using the variable Ethnicity of the dataset, which can have only three values: 'African American', 'Asian' and 'Caucasian'.  

The objective is to transform the variable in order to obtain the following result:  


|    | Ethnicity   |   Ethnicity_Asian |   Ethnicity_Caucasian |
|---:|:------------|------------------:|----------------------:|
|  0 | Caucasian   |                 0 |                     1 |
|  1 | Asian       |                 1 |                     0 |
|  2 | Asian       |                 1 |                     0 |
|  3 | Asian       |                 1 |                     0 |
|  4 | Caucasian   |                 0 |                     1 |

**Definition of the variable ethnicity containing the values to encode**

In [11]:
ethnicity = df[["Ethnicity"]].values

**Definition of the encoder and fit-transform of the variable**

In [12]:
ohe = OneHotEncoder(drop="first", sparse=False, dtype=int)
ethnicity_enc = ohe.fit_transform(X=ethnicity)

**Comparing original, encoded data and inversed data**  
As we can see from the results, the encoded variable contains two columns: the first column indicates with 1 or 0 the presence or absence of the Asian value, while the second column is for the Caucasian category. The presence of the attribute African American can be derived from the other two columns: in practice, the value is present where both columns have a value of 0.

In [13]:
print("original \t\t\t encoded")

for original, encoded in zip(ethnicity[:10, :], ethnicity_enc[:10, :]):
    print(f"{original} \t\t\t {encoded}")

original 			 encoded
['Caucasian'] 			 [0 1]
['Asian'] 			 [1 0]
['Asian'] 			 [1 0]
['Asian'] 			 [1 0]
['Caucasian'] 			 [0 1]
['Caucasian'] 			 [0 1]
['African American'] 			 [0 0]
['Asian'] 			 [1 0]
['Caucasian'] 			 [0 1]
['African American'] 			 [0 0]
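
The way the dropped category is recovered from the remaining dummies can be sketched as follows. The sample is hypothetical, and the sparse output is converted with `toarray()` only to keep the example independent of the scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hypothetical sample with the three categories of the example above
ethnicity = np.array(
    [["Caucasian"], ["Asian"], ["African American"], ["Asian"]], dtype=object
)

# the default output is a sparse matrix, converted here to a dense array
ohe = OneHotEncoder(drop="first", dtype=int)
encoded = ohe.fit_transform(ethnicity).toarray()

# the dropped category is the first one in alphabetical order
dropped = ohe.categories_[0][0]

# its indicator is one exactly where all the remaining dummies are zero
dropped_indicator = 1 - encoded.sum(axis=1)
print(dropped, dropped_indicator)
```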


**Encode binary and nominal categorical variables using pandas**  
It's possible to perform the same operations seen above using the function ```get_dummies``` of pandas, indicating the dataframe and the columns to transform. This function is very useful if we want to encode the categorical columns and keep the pandas dataframe structure, but we have to remember that operating on the whole dataset, before splitting it into training and test sets, could lead to data leakage.  

Definition of the function:
```python 
pandas.get_dummies(
    data,
    prefix=None,
    prefix_sep='_',
    dummy_na=False,
    columns=None,
    sparse=False,
    drop_first=False,
    dtype=None,
) -> 'DataFrame'
```

In [14]:
df_enc = pd.get_dummies(data=df, columns=["Gender", "Student", "Married", "Ethnicity"], drop_first=True)

df_enc

Unnamed: 0,Rating,Gender_Male,Student_Yes,Married_Yes,Ethnicity_Asian,Ethnicity_Caucasian
0,Low,1,0,1,0,1
1,Premium,0,1,1,1,0
2,Premium,1,0,0,1,0
3,Premium,0,0,0,1,0
4,Medium,1,0,1,0,1
...,...,...,...,...,...,...
395,Medium,1,0,1,0,1
396,Low,1,0,0,0,0
397,Medium,0,0,1,0,1
398,Very low,1,0,1,0,1
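
A related practical point is that ```get_dummies``` does not store the encoding rule, so it only sees the categories of the dataframe it receives. The sketch below uses a hypothetical train/test split (not taken from the dataset) to compare it with an encoder fitted on the training split only. `handle_unknown="ignore"` is an optional setting chosen here so that unseen categories do not raise an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical train and test splits: 'Asian' never appears in the test split
train = pd.DataFrame({"Ethnicity": ["Asian", "Caucasian", "African American"]})
test = pd.DataFrame({"Ethnicity": ["Caucasian", "Caucasian"]})

# get_dummies only sees the categories of the dataframe it receives,
# so the two splits end up with a different number of columns
train_dummies = pd.get_dummies(train, columns=["Ethnicity"])
test_dummies = pd.get_dummies(test, columns=["Ethnicity"])
print(train_dummies.shape[1], test_dummies.shape[1])

# an encoder fitted on the training split reuses the same categories
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train[["Ethnicity"]])
test_encoded = ohe.transform(test[["Ethnicity"]]).toarray()
print(test_encoded.shape[1])
```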


### Ordinal variables  
Ordinal variables are among the easiest to encode because they have a logical order that can be used to map the categories to numbers. The advantage of this kind of variable is that we can encode all the categories in a single dimension without losing information.  

The objective is to transform the variable in order to obtain the following result:  

|    | Rating   |   Rating Encoded |
|---:|:---------|-----------------:|
|  0 | Low      |                1 |
|  1 | Premium  |                4 |
|  2 | Premium  |                4 |
|  3 | Premium  |                4 |
|  4 | Medium   |                2 |

**Definition of the variable rating containing the values to encode**

In [15]:
rating = df[["Rating"]].values

**Check the order of the categories**  
The alphabetical order of the categories does not follow the logical order.

In [16]:
print(np.unique(ar=rating))

['High' 'Low' 'Medium' 'Premium' 'Very low']


**Encode ordinal categorical variables using scikit-learn**  
The encoding of this kind of variable can be done using the ```OrdinalEncoder``` class. By default, the encoder sorts the categories alphabetically; this means that if the logical order of the categories differs from the alphabetical one, the encoding will be wrong. To encode the variable correctly, we instantiate the encoder passing the categories in the desired order to the parameter ```categories```.  

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
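
The difference between the default alphabetical sorting and a custom order can be seen in a minimal sketch on a hypothetical sample that contains each level once:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# hypothetical sample containing the five levels of the Rating variable
rating = np.array(
    [["Very low"], ["Low"], ["Medium"], ["High"], ["Premium"]], dtype=object
)

# default behaviour: categories are sorted, so 'High' gets 0 and 'Very low' gets 4
default_codes = OrdinalEncoder().fit_transform(rating).ravel()

# explicit order: the codes follow the logical ranking of the levels
order = [["Very low", "Low", "Medium", "High", "Premium"]]
custom_codes = OrdinalEncoder(categories=order).fit_transform(rating).ravel()

print(default_codes)
print(custom_codes)
```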

**Definition of the encoder**

In [17]:
categories = [['Very low', 'Low', 'Medium', 'High', 'Premium']]

oe = OrdinalEncoder(categories=categories, dtype=int)
oe

**Fit and transform of the encoder to the data**

In [18]:
rating_enc = oe.fit_transform(X=rating)

**We can check if the categories follow the desired order calling the ```categories_``` encoder attribute**

In [19]:
print(oe.categories_)

[array(['Very low', 'Low', 'Medium', 'High', 'Premium'], dtype=object)]


**We can use the fitted encoder to obtain the original values from the encoded data**  
To do this, we can apply the method ```inverse_transform``` of the fitted encoder to the encoded data.

In [20]:
rating_inv = oe.inverse_transform(rating_enc)

**Compare original, encoded and inversed data**

In [21]:
print("original \t encoded \t inversed")

for original, encoded, inversed in zip(rating[:10, :], rating_enc[:10, :], rating_inv[:10, :]):
    print(f"{original} \t {encoded} \t\t {inversed}")

original 	 encoded 	 inversed
['Low'] 	 [1] 		 ['Low']
['Premium'] 	 [4] 		 ['Premium']
['Premium'] 	 [4] 		 ['Premium']
['Premium'] 	 [4] 		 ['Premium']
['Medium'] 	 [2] 		 ['Medium']
['Premium'] 	 [4] 		 ['Premium']
['Low'] 	 [1] 		 ['Low']
['Premium'] 	 [4] 		 ['Premium']
['Low'] 	 [1] 		 ['Low']
['Premium'] 	 [4] 		 ['Premium']


**Encode ordinal categorical variables using pandas**  
A simple way to perform this operation in pandas is to create a conversion dictionary and apply the encoding to the variable with the ```map``` method.

In [22]:
map_rating = {
    'Very low': 0, 
    'Low': 1,
    'Medium': 2,
    'High': 3,
    'Premium': 4
}

df_enc["Rating"] = df_enc["Rating"].map(map_rating)

**Show the pandas dataframe completely encoded**

In [23]:
df_enc.head(10)

Unnamed: 0,Rating,Gender_Male,Student_Yes,Married_Yes,Ethnicity_Asian,Ethnicity_Caucasian
0,1,1,0,1,0,1
1,4,0,1,1,1,0
2,4,1,0,0,1,0
3,4,0,0,0,1,0
4,2,1,0,1,0,1
5,4,1,0,0,0,1
6,1,0,0,0,0,0
7,4,1,0,0,1,0
8,1,0,0,0,0,1
9,4,0,1,1,0,0
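
As an alternative to the conversion dictionary, pandas also offers an ordered categorical type. The sketch below uses a hypothetical sample of the Rating column. It is shown only as another option, not as the approach followed in this notebook.

```python
import pandas as pd

# hypothetical sample of the Rating column
rating = pd.Series(["Low", "Premium", "Medium", "Very low", "High"])

# ordered categorical with the logical order of the levels
levels = ["Very low", "Low", "Medium", "High", "Premium"]
rating_cat = pd.Categorical(rating, categories=levels, ordered=True)

# the integer codes follow the position of each level in the list
codes = rating_cat.codes
print(codes)
```

One difference to keep in mind: values that are not in `levels` receive the code -1, whereas ```map``` returns a missing value for keys that are not in the dictionary.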


### **Transform heterogeneous data using scikit-learn**  
When we have to encode different kinds of variables, it can be difficult to encode them properly. For these situations, scikit-learn provides two tools, ```ColumnTransformer``` and ```make_column_transformer```, that allow us to transform different columns simultaneously. They accept any imputer, transformer or encoder (and other objects too) that implements the methods ```fit``` and ```transform```.  

* documentation
>* ColumnTransformer: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
>* make_column_transformer: https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html

**Load the new dataframe**  
One of the advantages of this composer is that it applies different transformers (also scalers or imputers) to the columns of a numpy array or a pandas DataFrame. For this example, we transform the dataframe columns directly, without converting the data to a numpy array.

In [24]:
df = pd.read_csv("data/eterogeneus_data.csv")

In [25]:
df.head(10)

Unnamed: 0,Gender,Student,Married,Ethnicity,Rating,Age,Income,Education Years
0,Male,No,Yes,Caucasian,Low,34,14.89,11
1,Female,Yes,Yes,Asian,Premium,82,106.03,15
2,Male,No,No,Asian,Premium,71,104.59,11
3,Female,No,No,Asian,Premium,36,148.92,11
4,Male,No,Yes,Caucasian,Medium,68,55.88,16
5,Male,No,No,Caucasian,Premium,77,80.18,10
6,Female,No,No,African American,Low,37,21.0,12
7,Male,No,No,Asian,Premium,87,71.41,9
8,Female,No,No,Caucasian,Low,66,15.12,13
9,Female,Yes,Yes,African American,Premium,41,71.06,19


**Using ```ColumnTransformer```**  
When defining the composer, the parameter ```transformers``` takes a list of tuples, each containing the name of the transformer, the transformer object and the columns to transform.  

Definition of the object:
```python 
ColumnTransformer(
    transformers,
    remainder='drop',
    sparse_threshold=0.3,
    n_jobs=None,
    transformer_weights=None,
    verbose=False,
)
```

In [26]:
categories = [['Very low', 'Low', 'Medium', 'High', 'Premium']]

ct = ColumnTransformer(transformers=[
    ("MinMax", MinMaxScaler(), ["Age", "Income", "Education Years"]),
    ("OneHot", OneHotEncoder(drop="first", sparse=False, dtype=int), ["Gender", "Student", "Married", "Ethnicity"]),
    ("Ordinal", OrdinalEncoder(categories=categories, dtype=int), ["Rating"])
])
ct

**Fit and transform of the categorical data**  
Like the other scikit-learn objects, the ```fit_transform``` method of the composer returns a numpy array.

In [27]:
array_enc = ct.fit_transform(df)

print(array_enc[:10, :])

[[0.14666667 0.02573746 0.4        1.         0.         1.
  0.         1.         1.        ]
 [0.78666667 0.54272181 0.66666667 0.         1.         1.
  1.         0.         4.        ]
 [0.64       0.53459837 0.4        1.         0.         0.
  1.         0.         4.        ]
 [0.17333333 0.78607897 0.4        0.         0.         0.
  1.         0.         4.        ]
 [0.6        0.25827093 0.73333333 1.         0.         1.
  0.         1.         2.        ]
 [0.72       0.39610846 0.33333333 1.         0.         0.
  0.         1.         4.        ]
 [0.18666667 0.06036987 0.46666667 0.         0.         0.
  0.         0.         1.        ]
 [0.85333333 0.34634672 0.26666667 1.         0.         0.
  1.         0.         4.        ]
 [0.57333333 0.0270649  0.53333333 0.         0.         0.
  0.         1.         1.        ]
 [0.24       0.34437826 0.93333333 0.         1.         1.
  0.         0.         4.        ]]
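
The parameter ```remainder``` listed in the definition decides what happens to the columns that are not assigned to any transformer. A minimal sketch on a hypothetical dataframe with two numerical columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# hypothetical dataframe with two numerical columns
toy = pd.DataFrame({"Age": [20, 40, 60], "Income": [10.0, 20.0, 30.0]})

# remainder="drop" (default): the columns not listed are discarded
ct_drop = ColumnTransformer(transformers=[("MinMax", MinMaxScaler(), ["Age"])])
out_drop = ct_drop.fit_transform(toy)

# remainder="passthrough": the columns not listed are appended unchanged
ct_pass = ColumnTransformer(
    transformers=[("MinMax", MinMaxScaler(), ["Age"])],
    remainder="passthrough",
)
out_pass = ct_pass.fit_transform(toy)

print(out_drop.shape, out_pass.shape)
```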


**Using ```make_column_transformer```**  
The main difference between the two is that the function ```make_column_transformer``` takes the tuples directly as positional arguments and does not need a name for each transformer, since the names are generated automatically.  

Definition of the function:
```python
make_column_transformer(*transformers, **kwargs)
```

In [28]:
categories = [['Very low', 'Low', 'Medium', 'High', 'Premium']]

mct = make_column_transformer(
    (MinMaxScaler(), ["Age", "Income", "Education Years"]),
    (OneHotEncoder(drop="first", sparse=False, dtype=int), ["Gender", "Student", "Married", "Ethnicity"]),
    (OrdinalEncoder(categories=categories, dtype=int), ["Rating"])
)
mct

In [29]:
array_enc = mct.fit_transform(df)

print(array_enc[:10, :])

[[0.14666667 0.02573746 0.4        1.         0.         1.
  0.         1.         1.        ]
 [0.78666667 0.54272181 0.66666667 0.         1.         1.
  1.         0.         4.        ]
 [0.64       0.53459837 0.4        1.         0.         0.
  1.         0.         4.        ]
 [0.17333333 0.78607897 0.4        0.         0.         0.
  1.         0.         4.        ]
 [0.6        0.25827093 0.73333333 1.         0.         1.
  0.         1.         2.        ]
 [0.72       0.39610846 0.33333333 1.         0.         0.
  0.         1.         4.        ]
 [0.18666667 0.06036987 0.46666667 0.         0.         0.
  0.         0.         1.        ]
 [0.85333333 0.34634672 0.26666667 1.         0.         0.
  1.         0.         4.        ]
 [0.57333333 0.0270649  0.53333333 0.         0.         0.
  0.         1.         1.        ]
 [0.24       0.34437826 0.93333333 0.         1.         1.
  0.         0.         4.        ]]
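
The names that ```make_column_transformer``` assigns automatically can be inspected through the attribute ```transformers```. In this short sketch the column selections are hypothetical and no data is needed, because the composer is not fitted:

```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# same structure as above, with hypothetical column selections
mct = make_column_transformer(
    (MinMaxScaler(), ["Age"]),
    (OneHotEncoder(), ["Gender"]),
)

# the names are generated from the lowercased class names
names = [name for name, transformer, columns in mct.transformers]
print(names)
```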


**How to interpret the coefficients of categorical variables**  
The logic behind the interpretation of categorical variables is very similar to that used for numerical variables:
* the category dropped during the encoding (the one with all dummies at zero) is the reference category and is absorbed by the intercept
* the coefficient of each other category shows how the prediction (for example, the tendency towards $y=1$) changes, compared with the reference category, when an observation belongs to that category
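
The two points above can be verified on a small example. The sketch below assumes an ordinary linear regression with a single dummy variable and made-up values of `y`; the text does not specify the model, so this is only one possible illustration. For a classifier such as logistic regression the same reading applies on the model's linear (log-odds) scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data: dummy Gender_Male (0 = Female, the reference category)
gender_male = np.array([[0], [0], [0], [1], [1], [1]])
y = np.array([10.0, 12.0, 14.0, 20.0, 22.0, 24.0])

model = LinearRegression().fit(gender_male, y)

# the intercept is the mean outcome of the reference category
mean_female = y[gender_male.ravel() == 0].mean()
# the coefficient is the difference between the two group means
mean_male = y[gender_male.ravel() == 1].mean()

print(model.intercept_, mean_female)
print(model.coef_[0], mean_male - mean_female)
```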

### **EXTRA: Handle missing values**  
Let's introduce some missing values in the columns Age and Rating

In [30]:
for column in ["Age", "Rating"]:
    df.loc[df.sample(frac=0.05).index, column] = np.nan

**Check for missing values in our dataframe**

In [31]:
df.isnull().sum()

Gender              0
Student             0
Married             0
Ethnicity           0
Rating             20
Age                20
Income              0
Education Years     0
dtype: int64

**When there are missing values in the dataset, two strategies can be followed:**
* drop the rows or columns with the missing values
* impute the missing values, replacing them with another value

**To drop the rows that contain missing values we can use the method ```dropna``` provided by pandas**  
Definition of the function:
```python
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
```

In [32]:
df.dropna()

Unnamed: 0,Gender,Student,Married,Ethnicity,Rating,Age,Income,Education Years
0,Male,No,Yes,Caucasian,Low,34.00,14.89,11
1,Female,Yes,Yes,Asian,Premium,82.00,106.03,15
2,Male,No,No,Asian,Premium,71.00,104.59,11
3,Female,No,No,Asian,Premium,36.00,148.92,11
4,Male,No,Yes,Caucasian,Medium,68.00,55.88,16
...,...,...,...,...,...,...,...,...
395,Male,No,Yes,Caucasian,Medium,32.00,12.10,13
396,Male,No,No,African American,Low,65.00,13.36,17
397,Female,No,Yes,Caucasian,Medium,67.00,57.87,12
398,Male,No,Yes,Caucasian,Very low,44.00,37.73,13


**To impute the missing values in pandas we can use the method ```fillna``` studied in the previous module**  
In this case, we use the mode to impute the Rating column and the mean to impute the Age column.

In [33]:
rating_mode = df["Rating"].mode()[0]
age_mean = df["Age"].mean()

print(f"Rating mode: '{rating_mode}', Age mean: {age_mean:.4f}")

Rating mode: 'Low', Age mean: 56.0079


In [34]:
df.fillna({"Rating": rating_mode, "Age": age_mean})

Unnamed: 0,Gender,Student,Married,Ethnicity,Rating,Age,Income,Education Years
0,Male,No,Yes,Caucasian,Low,34.00,14.89,11
1,Female,Yes,Yes,Asian,Premium,82.00,106.03,15
2,Male,No,No,Asian,Premium,71.00,104.59,11
3,Female,No,No,Asian,Premium,36.00,148.92,11
4,Male,No,Yes,Caucasian,Medium,68.00,55.88,16
...,...,...,...,...,...,...,...,...
395,Male,No,Yes,Caucasian,Medium,32.00,12.10,13
396,Male,No,No,African American,Low,65.00,13.36,17
397,Female,No,Yes,Caucasian,Medium,67.00,57.87,12
398,Male,No,Yes,Caucasian,Very low,44.00,37.73,13


**Impute the missing values using scikit-learn**  
Scikit-learn provides an imputer object to handle missing values in numerical and categorical data.  
documentation: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

**Definition of the variable age containing the values to impute**

In [35]:
age = df[["Age"]].values

**Definition of the imputer**  
Imputation of the missing values in the column Age using the mean value, which is the default strategy of the imputer.

In [36]:
si = SimpleImputer()
si

**Fit and transform of the imputer to the data**

In [37]:
age_imputed = si.fit_transform(age)
age_imputed[:10, :]

array([[34.],
       [82.],
       [71.],
       [36.],
       [68.],
       [77.],
       [37.],
       [87.],
       [66.],
       [41.]])

**Using the attribute ```statistics_``` it is possible to see the value used to impute the missing values**

In [38]:
si.statistics_

array([56.00789474])
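
Because the value is stored in the fitted imputer, it can be reused to fill the gaps in data that were not seen during the fit. A minimal sketch with hypothetical arrays for the Age column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical training and new data for the Age column
age_train = np.array([[30.0], [50.0], [np.nan], [70.0]])
age_new = np.array([[np.nan], [25.0]])

# the mean is computed on the training data only (missing values are ignored)
si = SimpleImputer(strategy="mean")
si.fit(age_train)

# the stored statistic is reused to fill the gaps in the new data
age_new_imputed = si.transform(age_new)
print(si.statistics_, age_new_imputed.ravel())
```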

**Imputation of the missing values in the Rating column**  
We can impute the missing values in categorical columns by setting ```strategy="most_frequent"```

In [39]:
rating = df[["Rating"]].values
si = SimpleImputer(strategy="most_frequent")
rating_imputed = si.fit_transform(rating)
rating_imputed[:10, :]

array([['Low'],
       ['Premium'],
       ['Premium'],
       ['Premium'],
       ['Medium'],
       ['Premium'],
       ['Low'],
       ['Premium'],
       ['Low'],
       ['Premium']], dtype=object)

## <font color= "red"> Exercise </font>

**Read the eterogeneus_data.csv file in the data folder, introduce missing values in two columns (one numerical and one categorical) and apply all the required transformations using one column transformer.**