# Overview
I'll try to keep an up-to-date overview here of all the methods I've tried so far.

## Table
Note: the perfect data problem still applies.
The training set always contains NaNs (possibly imputed), unless stated otherwise in the Notes column.

### Test set containing NaNs
Scores are reported as mean (+/-x), where x is twice the standard deviation of the cross-validation scores. A question mark in place of x means cross-validation wasn't used.
Note: about 60% of the rows contain NaNs.
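
The share of rows with at least one NaN can be checked with a one-liner. The sketch below uses a small made-up DataFrame (the column names `alpha` and `beta` are placeholders, not the real feature names); on the real data the same expression would presumably be applied to the feature frame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature frame; values and column names are made up.
toy = pd.DataFrame({
    "alpha": [0.1, np.nan, 0.3, np.nan, 0.5],
    "beta": [1.0, 2.0, np.nan, 4.0, 5.0],
})

# A row counts as "containing NaNs" if any of its columns is missing.
nan_row_fraction = toy.isna().any(axis=1).mean()
print(f"{nan_row_fraction:.0%} of the rows contain NaNs")
```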

| Method                                                                           | ROC              | Parameters                    | Notes                                                                             | Notebook |
|----------------------------------------------------------------------------------|------------------|-------------------------------|-----------------------------------------------------------------------------------|----------|
| Random Forest                                                                    | 0.999 (+/-0.001) |                               |                                                                                   | 9        |
| HistGradient                                                                     | 1.000 (+/-0.000) |                               |                                                                                   | 9        |
| Random Forest (alpha-chain only)                                                 | 0.767 (+/-?)     | n_estimators=200              | Graph contains straight line (because of a lot of rows with the same value (NaN)) | 5        |
| Random Forest (beta-chain only)                                                  | 1.000 (+/-?)     | n_estimators=200              |                                                                                   | 5        |
| Random Forest (alpha and beta separate RF models, combined with arithmetic mean) | 1.000 (+/-?)     | n_estimators=200              |                                                                                   | 5        |
| Random Forest (alpha and beta separate RF models, combined with maximum)         | 0.995 (+/-?)     | n_estimators=200              |                                                                                   | 5        |
| Random Forest (alpha and beta separate RF models, combined with minimum)         | 0.998 (+/-?)     | n_estimators=200              |                                                                                   | 5        |
| Random Forest (alpha and beta separate RF models, combined with multiplication)  | 1.000 (+/-?)     | n_estimators=200              |                                                                                   | 5        |
| Random Forest (K-Means imputed)                                                  | 0.996 (+/-?)     |                               |                                                                                   | 7        |
| KNeighbors (K-Means imputed)                                                     | 0.752 (+/-?)     | n_neighbors=5                 | Kinked ROC graph                                                                  | 7        |
| KNeighbors (Zero imputed)                                                        | 0.782 (+/-?)     | n_neighbors=5                 | Kinked ROC graph                                                                  | 7        |
| KNeighbors (Mean imputed)                                                        | 0.778 (+/-?)     | n_neighbors=5                 | Kinked ROC graph                                                                  | 7        |

### Test set without NaNs
| Method                                                                           | ROC          | Parameters       | Notes                          | Notebook |
|----------------------------------------------------------------------------------|--------------|------------------|--------------------------------|----------|
| Random Forest (alpha-chain only)                                                 | 0.981 (+/-?) | n_estimators=200 | NaNs also dropped in train set | 5        |
| Random Forest (beta-chain only)                                                  | 0.999 (+/-?) | n_estimators=200 | NaNs also dropped in train set | 5        |
| Random Forest (alpha and beta separate RF models, combined with arithmetic mean) | 1.000 (+/-?) | n_estimators=200 | NaNs also dropped in train set | 5        |
| Random Forest (both-chains)                                                      | 0.998 (+/-?) | n_estimators=200 | NaNs also dropped in train set | 5        |
| Random Forest (alpha-chain only)                                                 | 0.987 (+/-?) | n_estimators=200 | NaNs not dropped in train set  | 5        |
| Random Forest (beta-chain only)                                                  | 1.000 (+/-?) | n_estimators=200 | NaNs not dropped in train set  | 5        |
| Random Forest (alpha and beta separate RF models, combined with arithmetic mean) | 1.000 (+/-?) | n_estimators=200 | NaNs not dropped in train set  | 5        |
| Random Forest (both-chains)                                                      | 0.999 (+/-?) | n_estimators=200 | NaNs not dropped in train set  | 5        |

## Separately
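In notebook 5, we trained one Random Forest on the alpha-chain and a separate Random Forest on the beta-chain. The tables above list the results of combining their predictions with the arithmetic mean, maximum, minimum, and multiplication.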
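
The combination rules listed in the tables (arithmetic mean, maximum, minimum, multiplication) can be sketched as element-wise operations on the two models' predicted probabilities. This is a minimal illustration and not the code from notebook 5: the helper `combine_probabilities` and the example probabilities are hypothetical.

```python
import numpy as np

def combine_probabilities(p_alpha, p_beta, how="mean"):
    """Combine two arrays of positive-class probabilities element-wise."""
    p_alpha = np.asarray(p_alpha, dtype=float)
    p_beta = np.asarray(p_beta, dtype=float)
    if how == "mean":
        return (p_alpha + p_beta) / 2
    if how == "max":
        return np.maximum(p_alpha, p_beta)
    if how == "min":
        return np.minimum(p_alpha, p_beta)
    if how == "product":
        return p_alpha * p_beta
    raise ValueError(f"unknown combination rule: {how}")

# Made-up probabilities, standing in for each model's predict_proba(...)[:, 1].
p_alpha = [0.9, 0.2, 0.5]
p_beta = [0.7, 0.4, 0.5]

for rule in ("mean", "max", "min", "product"):
    print(rule, combine_probabilities(p_alpha, p_beta, how=rule))
```

The combined scores could then be passed to `sklearn.metrics.roc_auc_score` together with the true labels to get one ROC AUC per rule.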

## Data and Features

In [27]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from util import get_train_dataset, get_features, fix_test
import numpy as np

df = get_train_dataset()
# df = df.sample(n=1000)

df_reaction_column = df['reaction']
df_features = get_features(df)

df_features.reset_index(drop=True, inplace=True)
df_reaction_column.reset_index(drop=True, inplace=True)

# Features and labels must line up row by row, and no label may be missing.
assert df_features.shape[0] == df_reaction_column.shape[0]
assert df_reaction_column.isna().sum() == 0

# add the reaction column to the features (will be dropped later on)
df_features['reaction'] = df_reaction_column

In [28]:
y = df_features['reaction']
x = df_features.drop(['reaction'], axis=1)

assert 'reaction' not in x.columns
assert np.isnan(y).sum() == 0

## Edited data
Some models require preprocessed input, for example with the NaNs replaced. That preprocessing is done here.

In [29]:
# replace nan values with 0
x_zero_filled = x.fillna(0)
assert np.isnan(x_zero_filled).sum().sum() == 0
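
Zero filling is only one option; the overview table also lists mean imputation. Below is a minimal sketch of that alternative with scikit-learn's `SimpleImputer`, using a made-up frame rather than the real features. The names `toy_x` and `toy_x_mean_filled` are placeholders, and this is not necessarily how notebook 7 imputes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for x; the column names are placeholders.
toy_x = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [np.nan, 4.0, 8.0]})

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
toy_x_mean_filled = pd.DataFrame(
    imputer.fit_transform(toy_x), columns=toy_x.columns
)
assert toy_x_mean_filled.isna().sum().sum() == 0
print(toy_x_mean_filled)
```

When cross-validating, fitting the imputer inside a `sklearn.pipeline.Pipeline` keeps the held-out fold from influencing the imputed values.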

## Util

In [30]:
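def evaluate(clf):
    """Print the 5-fold cross-validated ROC AUC of clf on the zero-filled features."""
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(clf, x_zero_filled, y, cv=kf, scoring='roc_auc')
    print(scores)
    # Report the mean and twice the standard deviation across the folds.
    print(f"ROC: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")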

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
evaluate(clf)

## Hist Gradient Boosting

In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier(random_state=0)
evaluate(clf)