# Random Forest Handling Missing Values
Many of our previous approaches handle missing data during training by imputing the values. However, as we've seen earlier, HistGradientBoostingClassifier and XGBoost are among the best-performing models, and both handle missing values directly, without imputation. So let's see whether we can adapt a random forest model to do the same.

[This paper](https://link.springer.com/article/10.1007/s00521-009-0295-6) describes three approaches in its section on decision trees: ID3, C4.5 and CN2. ID3 generates additional edges for unknown attribute values. C4.5 extends this with a probabilistic approach: during training, each value of an attribute is assigned a weight. If the value is known, the weight is one; otherwise the weight is the relative frequency of that value. During testing, all available branches are explored and the class is decided by the most probable value. CN2 uses imputation: it fills in the attribute's most common value before computing the entropy measure. According to [this Stack Exchange post](https://stats.stackexchange.com/questions/96025/how-do-decision-tree-learning-algorithms-deal-with-missing-values-under-the-hoo), CART forwards missing values to the branch with the largest number of instances during training.
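To make the C4.5 weighting during training more concrete, here is a minimal sketch of how samples could be distributed over the branches of a nominal attribute. The function name `split_with_fractional_weights` and the use of `None` as the missing-value marker are my own assumptions for illustration; this is not code from the paper or from any C4.5 library.

```python
from collections import Counter


def split_with_fractional_weights(values, weights=None):
    """Distribute samples over the branches of a nominal attribute, C4.5 style.

    values: list of attribute values, where None marks a missing value
            (an assumption of this sketch).
    Returns {branch_value: [(sample_index, weight), ...]}.
    """
    if weights is None:
        weights = [1.0] * len(values)

    # Weighted frequency of each known value.
    known_mass = Counter()
    for v, w in zip(values, weights):
        if v is not None:
            known_mass[v] += w
    total_known = sum(known_mass.values())
    if total_known == 0:
        raise ValueError("attribute has no known values to split on")

    branches = {v: [] for v in known_mass}
    for i, (v, w) in enumerate(zip(values, weights)):
        if v is not None:
            # Known value: the sample goes down one branch with its full weight.
            branches[v].append((i, w))
        else:
            # Missing value: a fraction of the sample goes down every branch,
            # proportional to the relative frequency of that branch's value.
            for branch_value, mass in known_mass.items():
                branches[branch_value].append((i, w * mass / total_known))
    return branches


branches = split_with_fractional_weights(["a", "a", "b", None])
for value, members in branches.items():
    print(value, members)
```

In this example the sample with the missing value (index 3) ends up in both branches: with weight 2/3 in branch `a` and 1/3 in branch `b`, so its total weight is still one.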

The sklearn HistGradientBoostingClassifier uses another approach, which is described [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html):

> This estimator has native support for missing values (NaNs). During training, the tree grower learns at each split point whether samples with missing values should go to the left or right child, based on the potential gain. When predicting, samples with missing values are assigned to the left or right child consequently. If no missing values were encountered for a given feature during training, then samples with missing values are mapped to whichever child has the most samples.

I found conflicting descriptions of how XGBoost handles this.

The [documentation FAQ](https://xgboost.readthedocs.io/en/stable/faq.html) says:

> XGBoost supports missing values by default. In tree algorithms, branch directions for missing values are learned during training. Note that the gblinear booster treats missing values as zeros.

However, [this Medium article](https://towardsdatascience.com/xgboost-is-not-black-magic-56ca013144b4) (from 2018) mentions that it learns during training what the best way is to impute the missing value:

>  One of the coolest characteristics of XGBoost is how it deals with missing values: deciding for each sample which is the best way to impute them.

A default direction for missing values is learned. To find a new split, all possible splits are enumerated, but the loss is computed twice: once for each default direction the missing values can take. The better of the two becomes the default direction. This is called sparsity-aware split finding.
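The idea of scoring each candidate split twice can be sketched for a single numeric feature as follows. Note the assumptions: I use weighted Gini impurity as the score because it keeps the example short, whereas XGBoost uses a gradient-based gain, and the function names are made up for this sketch.

```python
import math


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_with_default_direction(xs, ys):
    """Return (threshold, default_direction, weighted_impurity) for one feature.

    xs: feature values, where float('nan') marks a missing value.
    ys: 0/1 labels.
    Returns None if there are fewer than two distinct known values.
    """
    known = [(x, y) for x, y in zip(xs, ys) if not math.isnan(x)]
    missing_labels = [y for x, y in zip(xs, ys) if math.isnan(x)]
    thresholds = sorted({x for x, _ in known})

    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for t in thresholds[:-1]:
        left = [y for x, y in known if x <= t]
        right = [y for x, y in known if x > t]
        # Score the split twice: missing values sent left, then sent right.
        for direction in ("left", "right"):
            l = left + missing_labels if direction == "left" else left
            r = right + missing_labels if direction == "right" else right
            score = (len(l) * gini(l) + len(r) * gini(r)) / len(ys)
            if best is None or score < best[2]:
                best = (t, direction, score)
    return best


nan = float("nan")
print(best_split_with_default_direction(
    [1.0, 2.0, 3.0, 4.0, nan, nan],
    [0, 0, 1, 1, 1, 1],
))  # -> (2.0, 'right', 0.0)
```

Here both samples with a missing value are positive, so sending them to the right child (together with the other positives) gives a pure split, and `right` becomes the default direction.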

The article continues with an attempt to outperform XGBoost using KNN with a distance metric that takes missing values into account. It seems to give better results on the article's datasets, so it might be worth a try. My current KNN approaches just used the default distance metric, and I don't know whether that is sparsity aware.
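As a rough idea of what a missing-value-aware distance could look like, here is a small pure-Python sketch. It is not necessarily the metric from the article: I assume a Euclidean distance computed over the coordinates present in both vectors and rescaled by the ratio of total to present coordinates, and `knn_predict` is a hypothetical helper, not an sklearn API.

```python
import math


def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both vectors,
    rescaled by (total coordinates / present coordinates)."""
    present = [(x, y) for x, y in zip(a, b)
               if not (math.isnan(x) or math.isnan(y))]
    if not present:
        return math.nan  # no shared coordinates: distance is undefined
    sq = sum((x - y) ** 2 for x, y in present)
    return math.sqrt(len(a) / len(present) * sq)


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training rows.

    Rows that share no known coordinate with the query are skipped.
    """
    dists = []
    for row, label in zip(train_x, train_y):
        d = nan_euclidean(row, query)
        if not math.isnan(d):
            dists.append((d, label))
    if not dists:
        raise ValueError("query shares no known coordinates with any training row")
    dists.sort(key=lambda pair: pair[0])
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)


nan = math.nan
train_x = [[0.0, 0.0], [0.1, nan], [5.0, 5.0], [nan, 5.2]]
train_y = [0, 0, 1, 1]
query = [0.2, nan]
print(knn_predict(train_x, train_y, query, k=3))  # -> 0
```

The last training row shares no known coordinate with the query, so it is ignored; the remaining three neighbours vote 0, 0 and 1.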



Reasoning about how we would expect a sample with a missing value to be classified, exploring all branches (and, for example, averaging the results) might give the best result. This is similar to the C4.5 approach (which also adds weights). There are some [implementations of C4.5 trees on GitHub](https://github.com/topics/c45-trees?l=python). The most popular one by stars under that link (the c45-trees topic, filtered to Python) is [chefboost](https://github.com/serengil/chefboost), so let's try that one.
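The "explore all branches" idea at prediction time can be sketched like this. The nested-dict tree format, the `branch_weights` key and the attribute name `chain` are all hypothetical choices for this example; they do not correspond to chefboost or any other library.

```python
def predict_proba(node, sample):
    """Class distribution for `sample` in a tree of nested dicts.

    When the split attribute is missing (or has a value the tree has not
    seen), every branch is followed and the results are blended using the
    training frequency of each branch.
    """
    if "leaf" in node:
        return node["leaf"]

    value = sample.get(node["attribute"])
    if value is not None and value in node["children"]:
        # Known value: follow a single branch.
        return predict_proba(node["children"][value], sample)

    # Missing value: weighted average over all children.
    blended = {}
    for branch_value, child in node["children"].items():
        weight = node["branch_weights"][branch_value]
        for label, p in predict_proba(child, sample).items():
            blended[label] = blended.get(label, 0.0) + weight * p
    return blended


# Hypothetical one-split tree: 75% of the training samples went down "alpha".
tree = {
    "attribute": "chain",
    "branch_weights": {"alpha": 0.75, "beta": 0.25},
    "children": {
        "alpha": {"leaf": {"positive": 0.9, "negative": 0.1}},
        "beta": {"leaf": {"positive": 0.2, "negative": 0.8}},
    },
}

print(predict_proba(tree, {"chain": "beta"}))
print(predict_proba(tree, {"chain": None}))
```

For the sample with a missing `chain`, the probability of `positive` is 0.75 * 0.9 + 0.25 * 0.2 = 0.725, so the more common branch dominates the prediction without fully deciding it.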

In [1]:
%%bash
pip install chefboost

-bash: line 1: pip: command not found


CalledProcessError: Command 'b'pip install chefboost\n'' returned non-zero exit status 127.

The [GitHub README](https://github.com/serengil/chefboost) contains the following comment:
> Chefboost handles the both numeric and nominal features and target values in contrast to its alternatives. So, you don't have to apply any pre-processing to build trees.

So let's start by trying it without any preprocessing/feature engineering.

In [2]:
from sklearn.model_selection import train_test_split
from util import get_train_dataset, get_test_dataset

df = get_train_dataset().sample(200)
# Change 'reaction' column from 0/1 to 'negative'/'positive' (since chefboost needs an object column)
df['reaction'] = df['reaction'].apply(lambda x: 'negative' if x == 0 else 'positive')
train, test = train_test_split(df, test_size=0.2, random_state=42)

In [3]:
from util import get_features, fix_test

train_y = train['reaction']
test_y = test['reaction']

train_x = get_features(train)
test_x = get_features(test, test=True)
test_x = fix_test(test_x, train_x.columns)

  x_test[col] = 0 #np.nan  # TODO: NaN of 0? currently kept it at 0 to make clear it's not because the chain was missing


In [4]:
# add train_y column to train_x
train_xy = train_x.copy()
train_xy['reaction'] = train_y

In [5]:
from chefboost import Chefboost as chef

config = {'algorithm': 'C4.5'}
model = chef.fit(train_xy, config=config, target_label='reaction')

[INFO]:  6 CPU cores will be allocated in parallel running
C4.5  tree is going to be built...


IndexError: single positional indexer is out-of-bounds

This crashes, so let's try another implementation. The second on [the list](https://github.com/topics/c45-trees?l=python) is [scikit-learn-C4.5-tree-classifier](https://github.com/RaczeQ/scikit-learn-C4.5-tree-classifier).

In [1]:
from sklearn.model_selection import train_test_split
from util import get_train_dataset

df = get_train_dataset().sample(200)
train, test = train_test_split(df, test_size=0.2, random_state=42)

In [2]:
from util import get_features, fix_test

train_y = train['reaction']
test_y = test['reaction']

train_x = get_features(train)
test_x = get_features(test, test=True)
test_x = fix_test(test_x, train_x.columns)

  x_test[col] = 0 #np.nan  # TODO: NaN of 0? currently kept it at 0 to make clear it's not because the chain was missing


In [3]:
from c45 import C45

clf = C45()
clf.fit(train_x, train_y)
print(f'Accuracy: {clf.score(test_x, test_y)}')

ValueError: Input X contains NaN.

Apparently, this implementation doesn't support NaNs (while C4.5 should support them in the way we want).

The search for a correct implementation continues. [This Stack Exchange user](https://datascience.stackexchange.com/questions/65917/is-there-a-real-c4-5-implementation-in-python-handling-missing-value) mentions that so-called C4.5 implementations often don't support missing values, even though they should. Two specific options are mentioned: either using [this implementation](https://github.com/geerk/C45algorithm) or [using an R implementation in Python](https://stackoverflow.com/questions/41070087/calling-c5-0-in-python).

I started with the Python implementation (although I have some concerns because of the strange input format and the lack of clear examples). I cloned the GitHub repo and added a file `__init__.py` with the following content:
```python
from .c45 import *
from .mine import *
from .utils import *
```
I also added a `.` in front of the local imports in those files (changing `import utils` to `from .utils import *`, and slightly adjusting the code to support this import change).
This allows us to import the package in this notebook.

In [1]:
from util import get_train_dataset
from sklearn.model_selection import train_test_split
df = get_train_dataset().sample(200)
train, test = train_test_split(df, test_size=0.2, random_state=42)

In [2]:
from util import get_features, fix_test

train_y = train['reaction']
test_y = test['reaction']

train_x = get_features(train)
test_x = get_features(test, test=True)
test_x = fix_test(test_x, train_x.columns)

# add train_y column to train_x
train_xy = train_x.copy()
train_xy['reaction'] = train_y

  x_test[col] = 0 #np.nan  # TODO: NaN of 0? currently kept it at 0 to make clear it's not because the chain was missing


In [3]:
# Change the df to a dict of following format:
# {column_name: [list of values], column_name: [list of values], ...}
train_dict = train_xy.to_dict(orient='list')

In [4]:
from C45algorithm import mine_c45
mine_c45(train_dict, 'reaction')

KeyboardInterrupt: 

This takes way too long for our small sample of 200 rows. Let's try the [R implementation](https://stackoverflow.com/questions/41070087/calling-c5-0-in-python).

In [5]:
from rpy2.robjects.packages import importr
from rpy2 import robjects
C50 = importr('C50')
C5_0 = robjects.r('C5.0')

ValueError: r_home is None. Try python -m rpy2.situation

That fails as well, so let's just implement it ourselves. We start with a simple random forest to see whether it's fast enough for the size of our dataset.

In [14]:
from util import get_train_dataset
from sklearn.model_selection import train_test_split
df = get_train_dataset().sample(200)
# df = df.dropna()
train, test = train_test_split(df, test_size=0.2, random_state=42)
test = test.dropna()

In [15]:
from util import get_features, fix_test
train_x = get_features(train)
test_x = get_features(test, test=True)
test_x = fix_test(test_x, train_x.columns)

train_y = train['reaction']
test_y = test['reaction']


  x_test[col] = 0 #np.nan  # TODO: NaN of 0? currently kept it at 0 to make clear it's not because the chain was missing


In [16]:
from util import evaluate_no_cv
from CustomForest import RandomForest
rf = RandomForest()

In [17]:
# rf.fit(train_x, train_y)

In [18]:
evaluate_no_cv(rf, train_x, train_y, test_x, test_y)

  return np.array([X_1, X_2])
Training: 100% [------------------------------------------------] Time: 0:00:17

ROC AUC: 0.923





0.9230769230769231

I currently have no idea how NaNs are handled in this implementation. However, it seems to work fine at a decent speed (training is still quite slow, but acceptable for experiments).