# Overview: Test Set Without NaN
Overview of model results on a test set from which all rows containing NaN were removed.

## Overview

| Model                              | AUC   | Note | Notebook |
|------------------------------------|-------|------|----------|
| Random Forest (Zero imputed)       | 0.950 |      | 12       |
| Random Forest (Mean imputed)       | 0.944 |      | 12       |
| Random Forest (MissForest imputed) |       |      |          |
| GradientBoosting                   | 0.950 |      | 13       |
| HistGradientBoosting               | 0.968 |      | 12       |
| XGBoost                            | 0.962 |      | 12       |

## Data

In [10]:
from sklearn.model_selection import train_test_split
from util import get_train_dataset, get_features, fix_test, evaluate_no_cv, calculate_auc_and_plot, get_columns_starting_with
import numpy as np

df = get_train_dataset()
# df = df.sample(n=1000) # for faster debugging

train, test = train_test_split(df, test_size=0.2, random_state=42)

test = test.dropna()  # reassign instead of inplace=True to avoid a possible SettingWithCopyWarning
x = get_features(train)
y = train['reaction']

x_test = get_features(test, test=True)
x_test = fix_test(x_test, x.columns)
y_test = test['reaction']
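
The `fix_test` helper comes from `util` and is not shown in this notebook. Assuming its job is to give the test matrix the same column layout as the training matrix, a minimal sketch with pandas `reindex` could look like the following. The name `align_columns` and the demo frames are hypothetical and are not the project's implementation.

```python
import numpy as np
import pandas as pd


def align_columns(x_test: pd.DataFrame, train_columns: pd.Index) -> pd.DataFrame:
    """Hypothetical stand-in for fix_test: match the training column layout."""
    # reindex drops columns the training set does not know, adds the missing
    # ones filled with NaN, and returns them in the training column order.
    return x_test.reindex(columns=train_columns)


x_train_demo = pd.DataFrame({"a": [1.0], "b": [2.0], "c": [3.0]})
x_test_demo = pd.DataFrame({"c": [9.0], "a": [7.0], "extra": [0.0]})

aligned = align_columns(x_test_demo, x_train_demo.columns)
print(aligned)
```

Columns that exist only in the training data end up as NaN in the test matrix, which is why the test matrix is still passed through the imputers below.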


## Imputations

In [11]:
from sklearn import impute

### Zero

In [12]:
zero_imputer = impute.SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
x_zero_imputed = zero_imputer.fit_transform(x)
x_test_zero_imputed = zero_imputer.transform(x_test)
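
To illustrate what the constant strategy does, here is a small self-contained sketch on toy arrays (the `toy_*` names are made up for this example). Every NaN is replaced by `fill_value`, both in the data used for fitting and in data transformed afterwards.

```python
import numpy as np
from sklearn.impute import SimpleImputer

toy_train = np.array([[1.0, np.nan], [3.0, 4.0]])
toy_test = np.array([[np.nan, 5.0]])

toy_zero_imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
toy_train_filled = toy_zero_imputer.fit_transform(toy_train)  # NaN -> 0 in the training data
toy_test_filled = toy_zero_imputer.transform(toy_test)        # same rule applied to the test data

print(toy_train_filled)
print(toy_test_filled)
```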

### Mean

In [13]:
mean_imputer = impute.SimpleImputer(missing_values=np.nan, strategy='mean')
x_mean_imputed = mean_imputer.fit_transform(x)
x_test_mean_imputed = mean_imputer.transform(x_test)
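
The point of fitting on `x` and only transforming `x_test` is that the column means come from the training data alone. A toy sketch (names are illustrative) makes this visible through the fitted `statistics_` attribute:

```python
import numpy as np
from sklearn.impute import SimpleImputer

toy_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
toy_test = np.array([[np.nan, np.nan]])

toy_mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_mean_imputer.fit(toy_train)

# Per-column means learned from the training rows, ignoring NaN.
print(toy_mean_imputer.statistics_)

# The test row is filled with the training means, not with its own statistics.
toy_test_filled = toy_mean_imputer.transform(toy_test)
print(toy_test_filled)
```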

### MissForest
MissForest is an iterative imputation method based on random forests: https://towardsdatascience.com/missforest-the-best-missing-data-imputation-algorithm-4d01182aed3
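
As a rough illustration of the idea, the sketch below uses scikit-learn's `IterativeImputer` with a `RandomForestRegressor` as the per-column estimator. This is a swapped-in approximation and not `missingpy`'s `MissForest`: it only handles continuous features, and the toy data, tree count, and iteration count are arbitrary choices for the example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 3))
toy[:, 2] = toy[:, 0] + toy[:, 1]  # third column depends on the first two

# Knock out roughly 10% of the entries at random.
mask = rng.random(toy.shape) < 0.1
toy_missing = toy.copy()
toy_missing[mask] = np.nan

# Each column with missing values is repeatedly predicted from the other
# columns with a random forest, starting from a simple mean fill.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
toy_imputed = rf_imputer.fit_transform(toy_missing)

print(np.isnan(toy_imputed).sum())
```

Observed entries are passed through unchanged; only the masked entries are replaced by predictions.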

In [14]:
# Fix the neighbors base import

# Credit: AMLoucas, from https://gist.github.com/betterdatascience/c455473d7445c0e7e279efe31a896e17
# missingpy imports 'sklearn.neighbors.base', which newer scikit-learn versions renamed to
# 'sklearn.neighbors._base', so the import fails with
# "ModuleNotFoundError: No module named 'sklearn.neighbors.base'".
# Aliasing the module manually before importing missingpy works around this:
#
# import sys
# import sklearn.neighbors._base
# sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base

In [15]:
# Drop columns in which every row is missing (required by MissForest)
# x_no_empty_columns = x.drop(columns=x.columns[x.isnull().all()])

In [16]:
# from missingpy import MissForest
# miss_forest_imputer = MissForest()
# x_miss_forest_imputed = miss_forest_imputer.fit_transform(x_no_empty_columns)
# x_test_miss_forest_imputed = miss_forest_imputer.transform(x_test[x_no_empty_columns.columns])  # same columns as used for fitting

## Random Forest

In [17]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)

### Zero imputed

In [18]:
evaluate_no_cv(clf, x_zero_imputed, y, x_test_zero_imputed, y_test)

ROC AUC: 0.950
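
`evaluate_no_cv` is a helper from `util` whose source is not shown here. Assuming it fits the classifier on the training data once (no cross-validation) and reports ROC AUC on the held-out set from predicted probabilities, a minimal sketch might look like this. The function name `evaluate_holdout` and the synthetic data are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_holdout(model, x_train, y_train, x_eval, y_eval):
    """Hypothetical stand-in for evaluate_no_cv: single fit, ROC AUC on held-out data."""
    model.fit(x_train, y_train)
    # Use the probability of the positive class as the ranking score.
    scores = model.predict_proba(x_eval)[:, 1]
    auc = roc_auc_score(y_eval, scores)
    print(f"ROC AUC: {auc:.3f}")
    return auc


x_demo, y_demo = make_classification(n_samples=300, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)
demo_auc = evaluate_holdout(RandomForestClassifier(random_state=0), x_tr, y_tr, x_te, y_te)
```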


### Mean imputed

In [19]:
evaluate_no_cv(clf, x_mean_imputed, y, x_test_mean_imputed, y_test)

ROC AUC: 0.944


### MissForest imputed

In [20]:
# evaluate_no_cv(clf, x_miss_forest_imputed, y, x_test_miss_forest_imputed, y_test)

## GradientBoosting

In [25]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=0)

In [26]:
evaluate_no_cv(clf, x_zero_imputed, y, x_test_zero_imputed, y_test)

ROC AUC: 0.958


## HistGradientBoosting

In [21]:
from sklearn.ensemble import HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier(random_state=0)

In [22]:
evaluate_no_cv(clf, x, y, x_test, y_test)

ROC AUC: 0.968


## XGBoost

In [23]:
from xgboost import XGBClassifier
clf = XGBClassifier(random_state=0)

In [24]:
evaluate_no_cv(clf, x, y, x_test, y_test)

ROC AUC: 0.962
