In [1]:
# imports
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer

# Data set

The data set contains demographic data and a target feature that indicates whether the person earns more than $50k a year.

In [2]:
df = pd.read_csv(
    "https://www.openml.org/data/get_csv/1595261/phpMawTba", na_values=' ?'
)

## Missing values

In this dataset, missing values are marked with a question mark preceded by a space, ` ?`.
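
As a small illustration of how `na_values` behaves, the sketch below parses a made-up two-column CSV (the rows are invented for the example, not taken from the data set). It assumes the same leading-space convention as the real file and shows that only cells matching ` ?` exactly become `NaN`.

```python
import io

import pandas as pd

# Hypothetical miniature CSV; every value after a comma starts with a space,
# mimicking the formatting of the real file.
raw = "age,workclass\n25, Private\n38, ?\n28, Local-gov\n"

# Cells equal to ' ?' (space + question mark) are parsed as NaN.
toy = pd.read_csv(io.StringIO(raw), na_values=" ?")

na_counts = toy.isna().sum()
print(na_counts)
```

Note that the remaining string values keep their leading space, since `read_csv` does not strip it unless `skipinitialspace=True` is passed.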

Check the non-null counts to confirm that some values have been identified as `NA` (Not Available).

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      46033 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  47985 non-null  object
 14  class           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [4]:
df.isna().sum()  # equivalent check: number of NA values per column

age                  0
workclass         2799
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     857
class                0
dtype: int64

Even though Bayesian networks **can** handle missing values (via the EM algorithm, for example), in this case the missing values are imputed in order to simplify learning both the _structure_ and the _parameters_.

Instead of performing a rather simplistic imputation based only on the observed (non-null) values of the variable being imputed, a more realistic approach would use the **whole _data points_**, that is, the other features of each instance, to estimate the missing values.
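
One way to impute from whole data points, for numeric features only, is nearest-neighbour imputation. The following is a minimal sketch using scikit-learn's `KNNImputer` on an invented toy array; it is not applied to the data set in this notebook, and the choice of `n_neighbors=1` is an assumption made to keep the result easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Invented toy data: two well-separated groups, one missing entry.
X = np.array([
    [1.0, 2.0],
    [1.0, np.nan],   # value to be imputed
    [10.0, 20.0],
    [10.0, 22.0],
])

# With a single neighbour, the missing entry is copied from the closest row,
# where closeness is measured on the features that are present.
knn_imp = KNNImputer(n_neighbors=1)
X_filled = knn_imp.fit_transform(X)
print(X_filled)
```

With more neighbours the imputed value is an average over the donors, which is why this approach does not carry over directly to categorical variables.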

Nevertheless, bear in mind that this kind of imputation, which usually entails computing an average, is only feasible for numeric variables. Unfortunately, all the missing values in this data set belong to categorical variables. Imputation for discrete values is not as well supported, so we resort to a simplified imputation: the most frequent value of each variable is used as the imputation value. According to scikit-learn's documentation:

> If there is more than one such value, only the smallest is returned.
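
The tie-breaking rule quoted above can be seen on a small invented frame. This is only a sketch: the column names mirror the data set, but the rows are made up so that `sex` has a tie between its two values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented toy data: 'workclass' has a clear mode, 'sex' has a tie.
toy = pd.DataFrame({
    "workclass": ["Private", "Local-gov", "Private", np.nan],
    "sex": ["Male", "Female", np.nan, np.nan],
})

toy_imp = SimpleImputer(strategy="most_frequent")
filled = toy_imp.fit_transform(toy)

# statistics_ holds the fill value learned for each column.
print(toy_imp.statistics_)
print(filled)
```

For strings, "smallest" means first in sort order, so the tie in `sex` is resolved in favour of `Female`.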

In [5]:
mask = df.dtypes == object
df.loc[:, mask].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   workclass       46043 non-null  object
 1   education       48842 non-null  object
 2   marital-status  48842 non-null  object
 3   occupation      46033 non-null  object
 4   relationship    48842 non-null  object
 5   race            48842 non-null  object
 6   sex             48842 non-null  object
 7   native-country  47985 non-null  object
 8   class           48842 non-null  object
dtypes: object(9)
memory usage: 3.4+ MB


In [6]:
# No need to set missing_values explicitly: read_csv already mapped ' ?' to NaN,
# which is the default value of SimpleImputer's missing_values parameter.
imp = SimpleImputer(strategy='most_frequent')
df.loc[:, mask] = imp.fit_transform(df.loc[:, mask]) # copy=False is invalid

The total number of missing values is now:

In [7]:
df.isna().sum().sum()

0

## Continuous features

Bayesian networks can handle both discrete and continuous variables. However, we will focus on discrete Bayesian networks.

Therefore, continuous features need to be discretized. Continuous features usually coincide with numerical features, but not always: the `education-num` variable is categorical, even though it is encoded with numbers.

In [8]:
len(df['education-num'].value_counts()) # number of different values

16
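
A quick way to spot numeric columns that may actually be categorical is to compare the number of distinct values per column. The sketch below does this on an invented toy frame (the values are made up); what counts as "few" distinct values is a judgment call, not a rule.

```python
import pandas as pd

# Invented toy data with one continuous-looking and one code-like numeric column.
toy = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 52],
    "education-num": [7, 9, 12, 10, 10, 9],
    "workclass": ["Private", "Private", "Local-gov", "Private", "Private", "State-gov"],
})

# Keep only numeric columns and count the distinct values in each one.
cardinality = toy.select_dtypes(include="number").nunique()
print(cardinality)
```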

In [9]:
mask = df.dtypes == np.int64
mask['education-num'] = False # categorical despite being numeric, so exclude it by label
df.loc[:, mask]

Unnamed: 0,age,fnlwgt,capital-gain,capital-loss,hours-per-week
0,25,226802,0,0,40
1,38,89814,0,0,50
2,28,336951,0,0,40
3,44,160323,7688,0,40
4,18,103497,0,0,30
...,...,...,...,...,...
48837,27,257302,0,0,38
48838,40,154374,0,0,40
48839,58,151910,0,0,40
48840,22,201490,0,0,20


There are basically two options for discretizing the continuous variables.
One option is to create an individual transformation for each variable.
```
from sklearn.preprocessing import FunctionTransformer

# Hand-picked bin edges and one label per interval for `age`
bins = [0, 1, 13, 20, 60, np.inf]
labels = ['infant', 'kid', 'teen', 'adult', 'senior citizen']
age_tf = FunctionTransformer(
    pd.cut, kw_args={'bins': bins, 'labels': labels, 'retbins': False}
)
age_tf.fit_transform(df['age'])
```
The alternative is to use `scikit-learn`'s `KBinsDiscretizer`.
One advantage of this technique is that the bins can be selected according to various strategies. The strategy chosen here, `kmeans`, runs a 1-D k-means clustering on each variable separately and uses the resulting clusters as bins.
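
To make the effect of the strategy concrete, here is a minimal sketch on an invented 1-D array with three obvious groups. It compares the `uniform` and `kmeans` strategies and collects the bin edges each one learns; the data and the choice of `n_bins=3` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Invented toy feature with three well-separated groups (one column).
x = np.array([1, 2, 3, 20, 21, 22, 50, 51, 52], dtype=float).reshape(-1, 1)

edges = {}
codes = {}
for strategy in ("uniform", "kmeans"):
    kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    codes[strategy] = kbd.fit_transform(x).ravel()
    edges[strategy] = kbd.bin_edges_[0]  # edges learned for the single column

for strategy in edges:
    print(strategy, edges[strategy], codes[strategy])
```

Both strategies assign the same codes here, but the inner edges differ: `uniform` splits the range into equal widths, while `kmeans` places each edge halfway between neighbouring cluster centres.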

In [10]:
# n_bins=5 (default)
ct = ColumnTransformer([
    ('age',  KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans'), [0]),
    ('fnlwgt',         KBinsDiscretizer(encode='ordinal', strategy='kmeans'), [1]),
    ('capital-gain',   KBinsDiscretizer(encode='ordinal', strategy='kmeans'), [2]),
    ('capital-loss',   KBinsDiscretizer(encode='ordinal', strategy='kmeans'), [3]),
    ('hours-per-week', KBinsDiscretizer(encode='ordinal', strategy='kmeans'), [4])
])
ct.fit_transform(df.loc[:, mask])

array([[0., 2., 0., 0., 1.],
       [1., 0., 0., 0., 2.],
       [0., 3., 0., 0., 1.],
       ...,
       [2., 1., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       [2., 2., 2., 0., 1.]])
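
The ordinal codes alone do not say which value ranges they stand for. The sketch below, on an invented two-column frame, shows one way to recover the learned bin edges from a fitted `ColumnTransformer` through `named_transformers_`. Selecting columns by name instead of by position is a choice made for this example, and the names `toy` and `ct_toy` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer

# Invented toy data; the column names mirror two of the data set's features.
toy = pd.DataFrame({
    "age": [18, 22, 25, 40, 45, 50, 70, 75, 80],
    "hours-per-week": [10, 12, 15, 38, 40, 42, 60, 65, 70],
})

ct_toy = ColumnTransformer([
    ("age", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans"), ["age"]),
    ("hours", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans"), ["hours-per-week"]),
])
toy_codes = ct_toy.fit_transform(toy)

# Each fitted discretizer exposes the edges it learned for its column.
age_edges = ct_toy.named_transformers_["age"].bin_edges_[0]
print(toy_codes)
print(age_edges)
```

The output columns follow the order in which the transformers are listed, so the first column of `toy_codes` corresponds to `age`.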

In [11]:
df.loc[:, mask] = ct.fit_transform(df.loc[:, mask])
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,0.0,Private,2.0,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,1.0,United-States,<=50K
1,1.0,Private,0.0,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,2.0,United-States,<=50K
2,0.0,Local-gov,3.0,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,1.0,United-States,>50K
3,1.0,Private,1.0,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,1.0,0.0,1.0,United-States,>50K
4,0.0,Private,0.0,Some-college,10,Never-married,Prof-specialty,Own-child,White,Female,0.0,0.0,1.0,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,0.0,Private,2.0,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0.0,0.0,1.0,United-States,<=50K
48838,1.0,Private,1.0,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0.0,0.0,1.0,United-States,>50K
48839,2.0,Private,1.0,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0.0,0.0,1.0,United-States,<=50K
48840,0.0,Private,1.0,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0.0,0.0,0.0,United-States,<=50K
