# Installation

In [None]:
!pip install git+https://github.com/VolodymyrVozniak/universal-trainer.git

# Checker

In [None]:
from croatoan_trainer.constants import DEVICE

In [None]:
DEVICE

device(type='cuda')

# Load Dataset

As an example of a binary classification problem, let's use the `breast cancer` dataset from `sklearn`.

Check this [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) for more details.

In [None]:
import numpy as np

from sklearn.datasets import load_breast_cancer


data = load_breast_cancer()
x = data['data']
y = data['target']

print(f"Number of samples : {x.shape[0]}")
print(f"Number of features: {x.shape[1]}")
print(f"Number of classes : {len(np.unique(y))}\n")

print(f"Features shape: {x.shape}")
print(f"Targets  shape: {y.shape}\n")

print("First 2 entries, their features and targets")
print(f"Features:\n{x[:2]}")
print(f"Targets:\n{y[:2]}")

Number of samples : 569
Number of features: 30
Number of classes : 2

Features shape: (569, 30)
Targets  shape: (569,)

First 2 entries, their features and targets
Features:
[[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
  6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
  1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
  4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
  7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
  5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
  2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
  2.750e-01 8.902e-02]]
Targets:
[0 0]


As we can see, the features are on very different scales, so let's standardize them with `StandardScaler`.

In [None]:
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
x = scaler.fit_transform(x)

print("First 2 entries, their features and targets")
print(f"Features:\n{x[:2]}")
print(f"Targets:\n{y[:2]}")

First 2 entries, their features and targets
Features:
[[ 1.09706398e+00 -2.07333501e+00  1.26993369e+00  9.84374905e-01
   1.56846633e+00  3.28351467e+00  2.65287398e+00  2.53247522e+00
   2.21751501e+00  2.25574689e+00  2.48973393e+00 -5.65265059e-01
   2.83303087e+00  2.48757756e+00 -2.14001647e-01  1.31686157e+00
   7.24026158e-01  6.60819941e-01  1.14875667e+00  9.07083081e-01
   1.88668963e+00 -1.35929347e+00  2.30360062e+00  2.00123749e+00
   1.30768627e+00  2.61666502e+00  2.10952635e+00  2.29607613e+00
   2.75062224e+00  1.93701461e+00]
 [ 1.82982061e+00 -3.53632408e-01  1.68595471e+00  1.90870825e+00
  -8.26962447e-01 -4.87071673e-01 -2.38458552e-02  5.48144156e-01
   1.39236330e-03 -8.68652457e-01  4.99254601e-01 -8.76243603e-01
   2.63326966e-01  7.42401948e-01 -6.05350847e-01 -6.92926270e-01
  -4.40780058e-01  2.60162067e-01 -8.05450380e-01 -9.94437403e-02
   1.80592744e+00 -3.69203222e-01  1.53512599e+00  1.89048899e+00
  -3.75611957e-01 -4.30444219e-01 -1.46748968e-01  1.
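
To make the scaling step concrete, here is a minimal pure-Python sketch of per-feature standardization: subtract the column mean and divide by the population standard deviation, which is what `StandardScaler` does with its default settings. The helper name `standardize` and the toy data are illustrative assumptions, not part of `sklearn`.

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Scale every column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 instead,
    # which mirrors how StandardScaler treats zero-variance features.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]


# Toy data: two features on very different scales.
toy = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize(toy)
for row in scaled:
    print([round(value, 4) for value in row])
```

After scaling, both toy columns hold the same values even though the raw features differed by a factor of 100, which is the property the model benefits from.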

Now we can move to preprocessing.

# Preprocessing

To initialize the `BinaryPreproc` class, you need two dicts:
* `ids_to_features`: dict with unique ids as keys and features as values that will be used for training;
* `ids_to_targets`: dict with the same unique ids and targets as values.

In [None]:
from croatoan_trainer.preprocess import BinaryPreproc


# Make dict with unique ids as keys and features as values
ids_to_features = dict(zip(np.arange(len(y)), x))

# Make dict with unique ids as keys and targets as values
ids_to_targets = dict(zip(np.arange(len(y)), y))

preproc = BinaryPreproc(ids_to_features, ids_to_targets)

## Preparing Targets

First, let's plot the histogram (distribution) of the targets.

In [None]:
preproc.plot_targets(prepared=False)

As we can see, class 1 has more samples than class 0. It is useful to swap the classes, because for a binary problem we usually optimize the F1 metric, which is class-specific (computed with respect to class 1). After the swap, the minority class becomes class 1.

In [None]:
preproc.prepare_targets(reverse=True)

[INFO] Reverse targets will be used!
[INFO] Prepared targets were successfully saved to `self.targets`!
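
As an illustration of what reversing the targets means, here is a minimal sketch, assuming `reverse=True` simply swaps the two labels (the library's internals are not shown here). The helper `reverse_targets` and the toy dict are hypothetical.

```python
from collections import Counter


def reverse_targets(ids_to_targets):
    """Swap binary labels: 0 becomes 1 and 1 becomes 0."""
    return {id_: 1 - target for id_, target in ids_to_targets.items()}


# Toy targets where class 1 is the majority.
toy_targets = {0: 1, 1: 1, 2: 0, 3: 1, 4: 0}
reversed_targets = reverse_targets(toy_targets)

print(Counter(toy_targets.values()))       # class 1 is the majority
print(Counter(reversed_targets.values()))  # class 1 is now the minority
```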


Let's plot the histogram (distribution) of the prepared targets.

In [None]:
preproc.plot_targets(prepared=True)

We are ready to split our data.

## Splitting Data

When splitting the data, you define the following arguments:
* `test_size`: fraction of data for test (default is `0.2`);
* `n_folds`: number of CV folds (default is `5`). For now it must be >= `1`. A value of `1` means a single validation set, in which case `val_size` must be specified;
* `val_size`: fraction of data for validation set. Must be specified if `n_folds` == `1` (default is `None`);
* `seed`: seed for splitting (default is `51983`).

Let's split the data into train and test sets with `0.2` for test, and into `5` CV folds.

In [None]:
preproc.random_split(
    test_size=0.2,
    n_folds=5,
    val_size=None,
    seed=51983
)

[INFO] Train-test split was successfully saved to `self.split['train_test']`!
[INFO] CV split was successfully saved to `self.split['cv']`!


Let's plot targets histograms (distributions) for all sets and folds for input and prepared targets.

For input targets:

In [None]:
preproc.plot_split_targets(prepared=False)

For prepared targets:

In [None]:
preproc.plot_split_targets(prepared=True)

Everything looks good! Let's get some split info.

In [None]:
preproc.get_split_info()

Unnamed: 0,count,%
All,569,100.0
Train,455,80.0
Test,114,20.0
Train_0,364,64.0
Val_0,91,16.0
Train_1,364,64.0
Val_1,91,16.0
Train_2,364,64.0
Val_2,91,16.0
Train_3,364,64.0
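
The counts in this table follow from simple arithmetic. Below is a minimal sketch, assuming a plain shuffled split where the test size is rounded up and the remaining ids are dealt evenly into folds. The helper `sketch_random_split` is hypothetical; the library may select ids differently, so only the set sizes are expected to match.

```python
import math
import random


def sketch_random_split(ids, test_size=0.2, n_folds=5, seed=51983):
    """Shuffle ids, hold out a test set, then build CV folds from the rest."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)

    n_test = math.ceil(len(shuffled) * test_size)
    test, train = shuffled[:n_test], shuffled[n_test:]

    folds = []
    for k in range(n_folds):
        # Every n_folds-th id (starting at k) goes to the val set of fold k.
        val = train[k::n_folds]
        val_set = set(val)
        fold_train = [id_ for id_ in train if id_ not in val_set]
        folds.append({"train": fold_train, "val": val})
    return {"train_test": {"train": train, "test": test}, "cv": folds}


split = sketch_random_split(range(569))
print(len(split["train_test"]["train"]), len(split["train_test"]["test"]))
print(len(split["cv"][0]["train"]), len(split["cv"][0]["val"]))
```

With 569 samples this gives 455 train and 114 test ids, and each fold has 364 train and 91 val ids, matching the `count` column.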


We are ready to train our model!

# Training

For initializing `Trainer` you must provide these arguments:
* `preprocessed_data`: `BinaryPreproc` class object that we have already defined and prepared;
* `dataset_class`: `CroatoanDataset` class or a class that inherits from `CroatoanDataset` (this class must include an `ids` attribute; ideally, override only the `process_features` method, which processes the features passed to the model for training);
* `loader_class`: class for the DataLoader. Can be the standard `torch.utils.data.DataLoader`, `torch_geometric.loader.DataLoader` or your custom DataLoader that inherits from `torch.utils.data.DataLoader`;
* `model_class`: class for torch model that will be trained. Must inherit `torch.nn.Module`;
* `optimizer_class`: any `torch.optim.Optimizer` optimizer;
* `criterion`: any loss from `torch`;
* `get_metrics`: function that takes two torch tensors (real and predicted values), computes some metrics and returns them as a dict with metric names as keys and metric values as values;
* `main_metric`: main metric (must be one of metrics defined in `get_metrics` function). This metric will be used to choose the best epoch on CV;
* `direction`: direction in which we want to optimize `main_metric`. For example, `"minimize"` for regression problem and `mae` as `main_metric` or `"maximize"` for binary problem and `f1` as `main_metric`.

## Dataset

As mentioned above, it is recommended to override only the `process_features` method, which processes the features passed to the model for training.

Here is what `CroatoanDataset` looks like:

```python
class CroatoanDataset(Dataset):
    """
    Dataset that will be used for training.

    Attributes:
        `ids` (list): List with unique ids.
        `features` (dict): Dictionary with unique ids as keys
        and features as values.
        `targets` (dict): Dictionary with unique ids as keys
        and targets as values.

    Methods:
        `process_features(features)`: Processes features for one entry.
    """
    def __init__(
        self,
        ids: List[Union[int, str]],
        features: Dict[Union[int, str], List[float]],
        targets: Dict[Union[int, str], float]
    ):
        """
        Args:
            `ids` (list): List with unique ids.
            `features` (dict): Dictionary with unique ids as keys
            and features as values.
            `targets` (dict): Dictionary with unique ids as keys
            and targets as values.
        """
        self.ids = ids
        self.features = features
        self.targets = targets

    def process_features(self, features: List[float]) -> torch.Tensor:
        """
        Processes features for one entry.

        Args:
            features (list): Features for one entry.

        Returns:
            torch.Tensor: New features that will go to torch model.
        """
        return torch.Tensor(features)

    def __getitem__(self, index: int) -> Tuple[torch.Tensor, float]:
        id_ = self.ids[index]
        features = self.process_features(self.features[id_])
        return features, self.targets[id_]

    def __len__(self) -> int:
        return len(self.ids)
```

## Model

Usually, you want to define your own `torch` model with your own architecture and pass it to `Trainer`.

But there are already defined simple models that you can import and use.

Here is what `BinarySimpleMLP` looks like:

```python
class BinarySimpleMLP(Module):
    """
    Simple MLP for binary problem
    (2 hidden layers; dropout layers; ReLU activation and
    Sigmoid activation at the end).
    """

    def __init__(
        self,
        in_features: int,
        hidden_features: int,
        dropout: float
    ):
        """
        Args:
            `in_features` (int): Number of features for input.
            `hidden_features` (int): Number of features for hidden layers.
            `dropout` (float): Dropout for dropout layers.
        """
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(hidden_features, hidden_features),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(hidden_features, 1),
            nn.Sigmoid()
        )

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        return self.model(data).reshape(-1)
```

## Metrics Function

It is also important to define a function that computes metrics on each epoch for both the train and test (val) sets.

There are already defined metrics functions that you can import and use.

Here is what `get_metrics_binary` looks like:

```python
def get_metrics_binary(
    y_true: torch.Tensor,
    y_pred: torch.Tensor
) -> Dict[str, float]:
    """
    Computes metrics for binary problem
    (`roc_auc`, `accuracy`, `recall`, `precision`, `f1`).

    Args:
        `y_true` (torch.Tensor): Torch tensor with real values.
        `y_pred` (torch.Tensor): Torch tensor with predicted values
        (must be values after sigmoid).

    Returns:
        dict: Dictionary with metric's name as keys and
        metric's values as values.
    """
    scores = {}
    try:
        scores["roc_auc"] = float(roc_auc_score(y_true, y_pred))
    except ValueError:
        scores["roc_auc"] = -np.inf
    y_pred = torch.round(y_pred)
    scores["accuracy"] = float(accuracy_score(y_true, y_pred))
    scores["recall"] = float(recall_score(y_true, y_pred, zero_division=0))
    scores["precision"] = float(precision_score(y_true, y_pred,
                                                zero_division=0))
    scores["f1"] = float(f1_score(y_true, y_pred, zero_division=0))
    return scores
```
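
If you want to write your own metrics function, it only needs to follow the same contract: take true and predicted values and return a dict of metric names to floats. Below is a dependency-free sketch of that idea; `get_metrics_sketch` is a hypothetical name, the example runs on plain lists, and it is assumed (not verified against `Trainer`) that 1-D torch tensors would work the same way through the `float()` conversions.

```python
from typing import Dict, Iterable


def get_metrics_sketch(
    y_true: Iterable[float],
    y_pred: Iterable[float]
) -> Dict[str, float]:
    """Computes accuracy, precision, recall and F1 with respect to class 1."""
    true = [int(float(t)) for t in y_true]
    # For probabilities in [0, 1], `> 0.5` yields the same labels as
    # torch.round, which rounds the tie 0.5 down to 0.
    pred = [1 if float(p) > 0.5 else 0 for p in y_pred]

    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(true, pred) if t == p)

    # Mirror zero_division=0: an undefined ratio counts as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    accuracy = correct / len(true) if true else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


scores = get_metrics_sketch([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.8])
print(scores)
```

Precision, recall and F1 here only count hits and misses on class 1, which is why the choice of the positive class matters for these metrics.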

## Train

Let's define a `Trainer` object using `croatoan_trainer.train.dataset.CroatoanDataset`, `torch.utils.data.DataLoader`, `croatoan_trainer.train.model.BinarySimpleMLP`, `torch.optim.Adam`, `torch.nn.BCELoss()`, `croatoan_trainer.train.metrics.get_metrics_binary`, `"f1"` as the main metric and `"maximize"` as the direction.

In [None]:
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam

from croatoan_trainer.train import Trainer
from croatoan_trainer.train.dataset import CroatoanDataset
from croatoan_trainer.train.model import BinarySimpleMLP
from croatoan_trainer.train.metrics import get_metrics_binary


trainer = Trainer(
    preprocessed_data=preproc,
    dataset_class=CroatoanDataset,
    loader_class=DataLoader,
    model_class=BinarySimpleMLP,
    optimizer_class=Adam,
    criterion=torch.nn.BCELoss(),
    get_metrics=get_metrics_binary,
    main_metric="f1",
    direction="maximize"
)

To train, call the `train(params)` method. It uses all the arguments passed at initialization under the hood and additionally requires `params`: a dict with the following keys: `model` (kwargs for your model), `optimizer` (kwargs for the optimizer), `batch_size` and `epochs`.

In [None]:
params = {
    "model": {
        "in_features": x.shape[1],
        "hidden_features": 20,
        "dropout": 0.25
    },
    "optimizer": {
        "lr": 1e-3,
        "weight_decay": 5*1e-5
    },
    "batch_size": 32,
    "epochs": 100
}
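
As a sanity check on the model size these values imply, the sketch below counts the trainable parameters of the three `nn.Linear` layers in `BinarySimpleMLP` as shown earlier (dropout, ReLU and sigmoid layers have no parameters). The helper names `linear_params` and `mlp_params` are illustrative.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def mlp_params(in_features, hidden_features):
    """Parameter count for input -> hidden -> hidden -> 1 output."""
    return (
        linear_params(in_features, hidden_features)
        + linear_params(hidden_features, hidden_features)
        + linear_params(hidden_features, 1)
    )


# 30 input features and 20 hidden features, as in `params` above.
print(mlp_params(30, 20))  # 620 + 420 + 21 = 1061
```

A model this small trains quickly even on CPU, which fits a dataset with only 569 samples.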

Now we are ready to train our model. But first let me explain how this method works under the hood.

Training pipeline

1. Trains in CV mode: for each fold, trains the model on the fold's train set and evaluates it on the fold's val set for the passed number of epochs, then averages the scores over all folds for each epoch, chooses the best epoch and saves all results (losses, metrics on each epoch for train and val sets, best result, training time, unique ids, true values and predicted values on each epoch for val set). Results for each fold are also saved.
2. Trains in test mode: trains the model on the train set and evaluates it on the test set, using the number of epochs chosen at the CV stage, and saves all results (losses, metrics on each epoch for train and test sets, best result, training time, unique ids, true values and predicted values on each epoch for test set).
3. Trains in final mode: trains the model on all data, using the number of epochs chosen at the CV stage, and saves all results (losses, metrics on each epoch for train and test sets, best result, training time, unique ids, true values and predicted values on each epoch for test set). Here the train and test sets are the same (all available data), but the metrics can differ, because the train set is always shuffled, while the test set isn't.
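
The epoch selection in the CV step can be sketched as follows. This is a simplified illustration, assuming the main metric is averaged over folds for each epoch and the best mean is picked according to `direction`; the function `choose_best_epoch` and the toy scores are hypothetical, not the library's code.

```python
from statistics import fmean


def choose_best_epoch(fold_scores, direction):
    """Return (epoch index, mean score) of the best epoch across CV folds.

    fold_scores[f][e] is the main metric on the val set of fold f at epoch e.
    """
    if direction not in ("maximize", "minimize"):
        raise ValueError(f"Unknown direction: {direction!r}")
    # Average the metric over folds, epoch by epoch.
    mean_scores = [fmean(epoch_scores) for epoch_scores in zip(*fold_scores)]
    pick = max if direction == "maximize" else min
    best_epoch = pick(range(len(mean_scores)), key=mean_scores.__getitem__)
    return best_epoch, mean_scores[best_epoch]


# Toy val F1 for 2 folds over 3 epochs.
val_f1 = [
    [0.80, 0.95, 0.90],
    [0.70, 0.85, 0.80],
]
best_epoch, best_score = choose_best_epoch(val_f1, "maximize")
print(best_epoch, round(best_score, 2))
```

Choosing the epoch from the fold average rather than from a single fold makes the choice less sensitive to one lucky or unlucky validation set.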

In [None]:
results, model_weights = trainer.train(params)

2023-01-29 20:42:43,517 [INFO] 'cuda' is being used!
2023-01-29 20:42:43,520 [INFO] Training with cv...
2023-01-29 20:42:43,522 [INFO] Fold 0 is being trained...
2023-01-29 20:42:50,105 [INFO] Epoch 000 train_loss: 0.6741 val_loss: 0.6452 train_f1: 0.5481 val_f1: 0.8182
2023-01-29 20:42:50,927 [INFO] Epoch 020 train_loss: 0.0931 val_loss: 0.0322 train_f1: 0.9668 val_f1: 1.0000
2023-01-29 20:42:51,754 [INFO] Epoch 040 train_loss: 0.0565 val_loss: 0.0132 train_f1: 0.9851 val_f1: 1.0000
2023-01-29 20:42:52,564 [INFO] Epoch 060 train_loss: 0.0466 val_loss: 0.0116 train_f1: 0.9851 val_f1: 1.0000
2023-01-29 20:42:53,410 [INFO] Epoch 080 train_loss: 0.0289 val_loss: 0.0326 train_f1: 0.9852 val_f1: 0.9851
2023-01-29 20:42:54,173 [INFO] Epoch 099 train_loss: 0.0220 val_loss: 0.0147 train_f1: 0.9926 val_f1: 0.9851
2023-01-29 20:42:54,175 [INFO] Fold 1 is being trained...
2023-01-29 20:42:54,222 [INFO] Epoch 000 train_loss: 0.6705 val_loss: 0.6429 train_f1: 0.5704 val_f1: 0.8125
2023-01-29 20:42:

After training you can always check your training process by calling `print_logs()` method, which will just print logs from `logs.log` file.

In [None]:
trainer.print_logs()

# Analyzing

To initialize the `BinaryAnalyzer` class you need:
* `results` - dictionary with results for each stage after training. Keys are `cv`, `test` and `final`, and values are dicts with results for each stage (which contain losses for each epoch inside `losses`, dict with metrics for each epoch inside `metrics`, best epoch and best metrics inside `best_result`, training time inside `time`, list with unique ids inside `ids`, list with true values inside `true` and list with model outputs for each epoch inside `pred`);
* `postprocess_fn` - function that takes the list with model outputs from the `pred` key for each stage in `results` and processes them. For a binary problem it is important to have the probability of belonging to class 1 as the final output. So, for example, if your model outputs logits, define a function that converts logits into probabilities (a simple sigmoid function). If your model already outputs probabilities, keep this argument `None` to use the model outputs as they are (default is `None`).

<ins>**REMINDER!**</ins> The main stage is always `test`, not `final` (`test` is how your model performs on data that it didn't see; `final` is how your model performs on data that it used for training).

In [None]:
from croatoan_trainer.analyze import BinaryAnalyzer


analyzer = BinaryAnalyzer(results, None)
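
If your model returned logits instead of probabilities, a postprocessing function could look like the sketch below. It assumes `postprocess_fn` receives a flat list of floats, which is not confirmed here, and the name `logits_to_probs` is hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid for a single logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative inputs use exp(z) to avoid overflow in exp(-z).
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def logits_to_probs(logits):
    """Convert a list of logits into probabilities of class 1."""
    return [sigmoid(z) for z in logits]


probs = logits_to_probs([-1000.0, -2.0, 0.0, 2.0, 1000.0])
print([round(p, 4) for p in probs])
```

Under that assumption it would be passed as the second argument, in place of `None`. The two-branch form matters for extreme logits, where a naive `1 / (1 + exp(-z))` would overflow.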

## Getters

In this section some simple getters are explained:
* `get_stages()`: gets list of stages;
* `get_metrics()`: gets list of metrics used in training;
* `get_folds()`: gets number of folds used in training;
* `get_epochs(stage)`: gets number of epochs for stage;
* `get_time()`: gets train time in seconds for all stages;
* `get_df_pred(stage)`: gets dataframe with predictions.

In [None]:
analyzer.get_stages()

['cv', 'test', 'final']

In [None]:
analyzer.get_metrics()

['roc_auc', 'accuracy', 'recall', 'precision', 'f1']

In [None]:
analyzer.get_folds()

5

In [None]:
analyzer.get_epochs("test")

65

In [None]:
analyzer.get_time()

{'cv': 27.817191123962402,
 'test': 3.131624221801758,
 'final': 4.338812828063965}

In [None]:
analyzer.get_df_pred("test")

Unnamed: 0,ID,True,Pred
0,214,1.0,0.999999
1,280,1.0,1.000000
2,222,0.0,0.000030
3,503,1.0,1.000000
4,231,0.0,0.000427
...,...,...,...
109,350,0.0,0.000081
110,334,0.0,0.000224
111,57,1.0,0.999990
112,75,1.0,1.000000


## Metrics

You can get dataframe with final metrics for defined stages by calling `get_df_metrics(stages)` method.

In [None]:
analyzer.get_df_metrics(stages=["cv", "test"])

metric,roc_auc,roc_auc,accuracy,accuracy,recall,recall,precision,precision,f1,f1
stage,cv,test,cv,test,cv,test,cv,test,cv,test
0,0.995459,0.995701,0.984615,0.982456,0.976471,0.952381,0.982353,1.0,0.979368,0.97561


In [None]:
analyzer.get_df_metrics(stages=["cv", "test", "final"])

metric,roc_auc,roc_auc,roc_auc,accuracy,accuracy,accuracy,recall,recall,recall,precision,precision,precision,f1,f1,f1
stage,cv,test,final,cv,test,final,cv,test,final,cv,test,final,cv,test,final
0,0.995459,0.995701,0.9993,0.984615,0.982456,0.99297,0.976471,0.952381,0.981132,0.982353,1.0,1.0,0.979368,0.97561,0.990476


You can also compute other metrics with the `get_metric_result(stage, metric, round, **kwargs)` method.

Arguments:
* `stage`: one of the stages returned by the `get_stages()` method;
* `metric`: function that takes `y_true` and `y_pred` in this order and returns a float;
* `round`: flag to work with binary values (if `True`) or predictions (if `False`) (default is `True`);
* `**kwargs`: extra arguments for the `metric` function.

In [None]:
from sklearn.metrics import jaccard_score


analyzer.get_metric_result("test", jaccard_score, True, zero_division=0)

0.9523809523809523
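
This value is consistent with the test metrics reported earlier. The worked check below assumes the confusion counts TP=40, FP=0, FN=2, TN=72; these counts are inferred from the reported accuracy, recall and precision on the 114 test samples, not read from the library output.

```python
# Confusion counts inferred from the reported test metrics (an assumption).
tp, fp, fn, tn = 40, 0, 2, 72

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
# Jaccard for class 1: true positives over everything labelled
# or predicted as positive.
jaccard = tp / (tp + fp + fn)

print(f"accuracy={accuracy:.6f} precision={precision:.1f} recall={recall:.6f}")
print(f"f1={f1:.5f} jaccard={jaccard:.6f}")
```

With no false positives, the Jaccard score reduces to TP / (TP + FN), so it equals recall here.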

## Charts

### Plotly Args

You can customize the `plotly` settings used for charts by calling the `set_plotly_args(**kwargs)` method.

Arguments:
* `**kwargs`: named arguments for the plotly `update_layout()` method (argument names must match the arguments of that method).

In [None]:
analyzer.set_plotly_args(font_size=14, template="plotly_dark", bargap=0.2)

### Losses

You can plot losses for each stage and each fold by calling `plot_losses(stage, fold)` method.

Arguments:
* `stage`: one of the stages returned by the `get_stages()` method;
* `fold`: number of CV fold. Specify this parameter only when `stage` == `'cv'`. If not specified and `stage` == `'cv'` plots mean results for all CV folds (default is `None`).

In [None]:
analyzer.plot_losses("test")

In [None]:
analyzer.plot_losses("cv")

In [None]:
analyzer.plot_losses("cv", 2)

### Metrics

You can plot metrics for each stage and each fold by calling the `plot_metrics(stage, metrics, fold)` method.

Arguments:
* `stage`: one of the stages returned by the `get_stages()` method;
* `metrics`: list of metrics for plotting;
* `fold`: number of CV fold. Specify this parameter only when `stage` == `'cv'`. If not specified and `stage` == `'cv'` plots mean results for all CV folds (default is `None`).

In [None]:
analyzer.plot_metrics("test", ["f1"])

In [None]:
analyzer.plot_metrics("cv", ["accuracy"], 2)

In [None]:
# plot all possible metrics
analyzer.plot_metrics("cv", analyzer.get_metrics())

### Sample Predictions

You can plot predictions over epochs for a specific `id` from the unique ids by calling the `plot_pred_sample(stage, id)` method.

Arguments:
* `stage`: one of the stages returned by the `get_stages()` method;
* `id`: unique id to identify specific entry.

In [None]:
analyzer.plot_pred_sample("test", 214)

In [None]:
analyzer.plot_pred_sample("cv", 256)

### Confusion Matrix

You can plot confusion matrix for each stage by calling `plot_confusion_matrix(stage)` method.

In [None]:
analyzer.plot_confusion_matrix("test")

In [None]:
analyzer.plot_confusion_matrix("cv")

You can also plot the confusion matrix at specific epochs for each stage by calling the `plot_confusion_matrix_per_epoch(stage, epochs)` method.

Arguments:
* `stage`: one of the stages returned by the `get_stages()` method;
* `epochs`: list of epochs to plot (epoch counting starts from 0).

In [None]:
# Plot 1, 25, 50, 75 and 100 epochs
analyzer.plot_confusion_matrix_per_epoch(
    stage="cv",
    epochs=[0, 24, 49, 74, 99]
)

In [None]:
# Plot every 10th epoch
analyzer.plot_confusion_matrix_per_epoch(
    stage="test",
    epochs=range(9, analyzer.get_epochs("test"), 10)
)

### ROC-AUC

You can plot ROC-AUC curve for each stage by calling `plot_roc_auc(stage)` method.

In [None]:
analyzer.plot_roc_auc("cv")

In [None]:
analyzer.plot_roc_auc("test")

### Precision-Recall AUC

You can plot Precision-Recall AUC curve for each stage by calling `plot_precision_recall_auc(stage)` method.

In [None]:
analyzer.plot_precision_recall_auc("cv")

In [None]:
analyzer.plot_precision_recall_auc("test")

### Enrichment

You can plot enrichment for each stage by calling `plot_enrichment(stage)` method.

In [None]:
analyzer.plot_enrichment("cv")

In [None]:
analyzer.plot_enrichment("test")

### All

You can plot all main charts (losses, all metrics, confusion matrix, ROC-AUC, Precision-Recall AUC, enrichment) for each stage by calling `plot_all(stage)` method.

In [None]:
analyzer.plot_all("test")