# Installation

In [None]:
!pip install git+https://github.com/VolodymyrVozniak/universal-trainer.git

# Checker

In [None]:
from croatoan_trainer.constants import DEVICE

In [None]:
DEVICE

device(type='cuda')

# Load Dataset

As an example of a multiclass classification problem, let's use the `iris` dataset from `sklearn`.

Check this [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) for more details.

In [None]:
import numpy as np

from sklearn.datasets import load_iris


data = load_iris()
x = data['data']
y = data['target']

print(f"Number of samples : {x.shape[0]}")
print(f"Number of features: {x.shape[1]}")
print(f"Number of classes : {len(np.unique(y))}\n")

print(f"Features shape: {x.shape}")
print(f"Targets  shape: {y.shape}\n")

print("First 2 entries, their features and targets")
print(f"Features:\n{x[:2]}")
print(f"Targets:\n{y[:2]}")

Number of samples : 150
Number of features: 4
Number of classes : 3

Features shape: (150, 4)
Targets  shape: (150,)

First 2 entries, their features and targets
Features:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]]
Targets:
[0 0]


As we can see, the features are not scaled, so let's scale them with `StandardScaler`.

In [None]:
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
x = scaler.fit_transform(x)

print("First 2 entries, their features and targets")
print(f"Features:\n{x[:2]}")
print(f"Targets:\n{y[:2]}")

First 2 entries, their features and targets
Features:
[[-0.90068117  1.01900435 -1.34022653 -1.3154443 ]
 [-1.14301691 -0.13197948 -1.34022653 -1.3154443 ]]
Targets:
[0 0]
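As a rough sketch of what this scaling does, each feature column has its mean subtracted and is then divided by its population standard deviation. The `standardize` helper below is hypothetical and only illustrates the arithmetic. The first two rows of `sample` are the entries printed above; the third row is made up so that every column has a non-zero spread.

```python
import numpy as np


def standardize(features):
    """Hypothetical helper: column-wise (x - mean) / std."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)  # population standard deviation (ddof=0)
    std[std == 0.0] = 1.0       # guard against division by zero for constant columns
    return (features - mean) / std


sample = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],  # illustrative row, not taken from the output above
])
scaled = standardize(sample)
print(scaled.mean(axis=0))  # close to 0 for every column
print(scaled.std(axis=0))   # close to 1 for every column
```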


Now we can move to preprocessing.

# Preprocessing

To initialize the `MulticlassPreproc` class, you need two dicts:
* `ids_to_features`: dict with unique ids as keys and features as values that will be used for training;
* `ids_to_targets`: dict with the same unique ids and targets as values.

In [None]:
import numpy as np

from croatoan_trainer.preprocess import MulticlassPreproc


# Make dict with unique ids as keys and features as values
ids_to_features = dict(zip(np.arange(len(y)), x))

# Make dict with unique ids as keys and targets as values
ids_to_targets = dict(zip(np.arange(len(y)), y))

preproc = MulticlassPreproc(ids_to_features, ids_to_targets)

## Preparing Targets

First of all, let's plot targets histogram (distribution).

In [None]:
preproc.plot_targets(prepared=False)

Let's prepare our targets. For a multiclass classification problem, targets cannot currently be preprocessed in any way.

In [None]:
preproc.prepare_targets()

[INFO] Prepared targets were successfully saved to `self.targets`!


Let's plot prepared targets histogram (distribution).

In [None]:
preproc.plot_targets(prepared=True)

We are ready to split our data.

## Splitting Data

When splitting data, you can set the following arguments:
* `test_size`: fraction of data for the test set (default is `0.2`);
* `n_folds`: number of CV folds (default is `5`). Currently must be >= `1`. A value of `1` means a single validation set, in which case `val_size` must be specified;
* `val_size`: fraction of data for the validation set. Must be specified if `n_folds` == `1` (default is `None`);
* `seed`: seed for splitting (default is `51983`).
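How these arguments interact can be sketched as follows. This is a minimal illustration, not the library's implementation: `random_split_ids` is a hypothetical helper, and it assumes that `val_size` is a fraction of the train set and that folds are built by striding over the shuffled train ids.

```python
import random


def random_split_ids(ids, test_size=0.2, n_folds=5, val_size=None, seed=51983):
    """Hypothetical helper: train-test split followed by a CV split."""
    if n_folds < 1:
        raise ValueError("`n_folds` must be >= 1")
    if n_folds == 1 and val_size is None:
        raise ValueError("`val_size` must be specified when `n_folds` == 1")

    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)

    # Train-test split
    n_test = round(len(shuffled) * test_size)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # CV split of the train set
    cv = []
    if n_folds == 1:
        # Assumption: `val_size` is a fraction of the train set
        n_val = round(len(train) * val_size)
        cv.append({"train": train[n_val:], "val": train[:n_val]})
    else:
        for k in range(n_folds):
            val = train[k::n_folds]
            val_set = set(val)
            cv.append({
                "train": [i for i in train if i not in val_set],
                "val": val,
            })
    return {"train_test": {"train": train, "test": test}, "cv": cv}


split = random_split_ids(range(150), test_size=0.2, n_folds=5)
print(len(split["train_test"]["train"]), len(split["train_test"]["test"]))
print([(len(f["train"]), len(f["val"])) for f in split["cv"]])
```

With 150 ids this gives 120 train and 30 test ids, and each of the 5 folds has 96 train and 24 val ids.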

Let's split the data into train and test sets with `0.2` for test, and into `5` CV folds.

In [None]:
preproc.random_split(
    test_size=0.2,
    n_folds=5,
    val_size=None,
    seed=51983
)

[INFO] Train-test split was successfully saved to `self.split['train_test']`!
[INFO] CV split was successfully saved to `self.split['cv']`!


Let's plot targets histograms (distributions) for all sets and folds for input and prepared targets.

For input targets:

In [None]:
preproc.plot_split_targets(prepared=False)

For prepared targets:

In [None]:
preproc.plot_split_targets(prepared=True)

Everything looks good! Let's get some split info.

In [None]:
preproc.get_split_info()

| | count | % |
|---|---|---|
| All | 150 | 100.0 |
| Train | 120 | 80.0 |
| Test | 30 | 20.0 |
| Train_0 | 96 | 64.0 |
| Val_0 | 24 | 16.0 |
| Train_1 | 96 | 64.0 |
| Val_1 | 24 | 16.0 |
| Train_2 | 96 | 64.0 |
| Val_2 | 24 | 16.0 |
| Train_3 | 96 | 64.0 |


We are ready to train our model!

# Training

For initializing `Trainer` you must provide these arguments:
* `preprocessed_data`: `MulticlassPreproc` class object that we have already defined and prepared;
* `dataset_class`: the `CroatoanDataset` class or a class that inherits from `CroatoanDataset` (this class must include the `ids` attribute; ideally, override only the `process_features` function, which processes the features that are passed to the model for training);
* `loader_class`: class for the DataLoader. Can be the standard `torch.utils.data.DataLoader`, `torch_geometric.loader.DataLoader` or your custom DataLoader that inherits from `torch.utils.data.DataLoader`;
* `model_class`: class for torch model that will be trained. Must inherit `torch.nn.Module`;
* `optimizer_class`: any `torch.optim.Optimizer` optimizer;
* `criterion`: any loss from `torch`;
* `get_metrics`: function that takes two torch tensors (real and predicted values), computes some metrics and returns them as a dict with metric names as keys and metric values as values;
* `main_metric`: main metric (must be one of metrics defined in `get_metrics` function). This metric will be used to choose the best epoch on CV;
* `direction`: direction in which we want to optimize `main_metric`. For example, `"minimize"` for regression problem and `mae` as `main_metric` or `"maximize"` for binary problem and `f1` as `main_metric`.

## Dataset

As mentioned above, it is recommended to override only the `process_features` function, which processes the features that are passed to the model for training.

Here is what `CroatoanDataset` looks like:

```python
class CroatoanDataset(Dataset):
    """
    Dataset that will be used for training.

    Attributes:
        `ids` (list): List with unique ids.
        `features` (dict): Dictionary with unique ids as keys
        and features as values.
        `targets` (dict): Dictionary with unique ids as keys
        and targets as values.

    Methods:
        `process_features(features)`: Processes features for one entry.
    """
    def __init__(
        self,
        ids: List[Union[int, str]],
        features: Dict[Union[int, str], List[float]],
        targets: Dict[Union[int, str], float]
    ):
        """
        Args:
            `ids` (list): List with unique ids.
            `features` (dict): Dictionary with unique ids as keys
            and features as values.
            `targets` (dict): Dictionary with unique ids as keys
            and targets as values.
        """
        self.ids = ids
        self.features = features
        self.targets = targets

    def process_features(self, features: List[float]) -> torch.Tensor:
        """
        Processes features for one entry.

        Args:
            features (list): Features for one entry.

        Returns:
            torch.Tensor: New features that will go to torch model.
        """
        return torch.Tensor(features)

    def __getitem__(self, index: int) -> Tuple[torch.Tensor, float]:
        id_ = self.ids[index]
        features = self.process_features(self.features[id_])
        return features, self.targets[id_]

    def __len__(self) -> int:
        return len(self.ids)
```
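As a sketch of the recommended customization, the example below overrides only `process_features`. To stay runnable on its own it uses `CroatoanDatasetStandIn`, a trimmed copy of the listing above, instead of importing the library class. `ClippedDataset` and its clipping range are hypothetical and chosen purely for illustration.

```python
import torch
from torch.utils.data import Dataset


class CroatoanDatasetStandIn(Dataset):
    """Stand-in that mirrors the listing above (not the library class)."""

    def __init__(self, ids, features, targets):
        self.ids = ids
        self.features = features
        self.targets = targets

    def process_features(self, features):
        return torch.Tensor(features)

    def __getitem__(self, index):
        id_ = self.ids[index]
        return self.process_features(self.features[id_]), self.targets[id_]

    def __len__(self):
        return len(self.ids)


class ClippedDataset(CroatoanDatasetStandIn):
    """Hypothetical subclass: clips every feature to [-3, 3]."""

    def process_features(self, features):
        tensor = torch.as_tensor(features, dtype=torch.float32)
        return torch.clamp(tensor, min=-3.0, max=3.0)


dataset = ClippedDataset(
    ids=[0, 1],
    features={0: [0.5, -4.0], 1: [7.0, 1.0]},
    targets={0: 0, 1: 2},
)
features, target = dataset[0]
print(features, target)
```

With the real library, such a subclass would presumably inherit from `CroatoanDataset` and be passed to `Trainer` as `dataset_class`.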

## Model

Usually, you will want to define your own `torch` model with your own architecture and pass it to `Trainer`.

However, there are some simple predefined models that you can import and use.

Here is what `MulticlassSimpleMLP` looks like:

```python
class MulticlassSimpleMLP(Module):
    """
    Simple MLP for multiclassification problem
    (2 hidden layers; dropout layers; ReLU activation).
    """

    def __init__(
        self,
        in_features: int,
        hidden_features: int,
        output_features: int,
        dropout: float
    ):
        """
        Args:
            `in_features` (int): Number of features for input.
            `hidden_features` (int): Number of features for hidden layers.
            `output_features` (int): Number of features for output.
            `dropout` (float): Dropout for dropout layers.
        """
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(hidden_features, hidden_features),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(hidden_features, output_features)
        )

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        return self.model(data)
```
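For comparison, a custom model could look like the sketch below. `TinyMLP` is a hypothetical class, not part of the library. The point it illustrates is that the constructor arguments are the kwargs supplied under the `model` key of the training `params`, and that the model returns one raw logit per class, which is the input `torch.nn.CrossEntropyLoss` expects.

```python
import torch
from torch import nn


class TinyMLP(nn.Module):
    """Hypothetical custom model: one hidden layer, no dropout."""

    def __init__(self, in_features: int, hidden_features: int, output_features: int):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, output_features),
        )

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        return self.model(data)


# Constructor kwargs in the same shape as the `model` entry of `params`
model_kwargs = {"in_features": 4, "hidden_features": 20, "output_features": 3}
model = TinyMLP(**model_kwargs)

batch = torch.zeros(8, 4)                    # 8 entries with 4 features each
logits = model(batch)                        # raw scores, one column per class
targets = torch.zeros(8, dtype=torch.long)   # integer class labels
loss = nn.CrossEntropyLoss()(logits, targets)
print(logits.shape, loss.item())
```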

## Metrics Function

It is also important to define a function that computes some metrics on each epoch for both the train and test (val) sets.

There are some predefined metrics functions that you can import and use.

Here is what `get_metrics_multiclass` looks like:

```python
def get_metrics_multiclass(
    y_true: torch.Tensor,
    y_pred: torch.Tensor
) -> Dict[str, float]:
    """
    Computes metrics for multiclassification problem
    (`accuracy`, `recall`, `precision`, `f1` with `'macro'`).

    Args:
        `y_true` (torch.Tensor): Torch tensor with real values.
        `y_pred` (torch.Tensor): Torch tensor with predicted values
        (logits for each class).

    Returns:
        dict: Dictionary with metric's name as keys and
        metric's values as values.
    """
    y_pred = torch.argmax(y_pred, dim=1)
    scores = {}
    scores["accuracy"] = float(accuracy_score(y_true, y_pred))
    scores["recall"] = float(recall_score(
        y_true=y_true,
        y_pred=y_pred,
        zero_division=0,
        average="macro"
    ))
    scores["precision"] = float(precision_score(
        y_true=y_true,
        y_pred=y_pred,
        zero_division=0,
        average="macro"
    ))
    scores["f1"] = float(f1_score(
        y_true=y_true,
        y_pred=y_pred,
        zero_division=0,
        average="macro"
    ))
    return scores
```
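To make the `'macro'` averaging concrete, here is a minimal pure-Python sketch of macro F1: the F1 score is computed for each class separately and the per-class scores are then averaged with equal weight. `macro_f1` is a hypothetical helper written for illustration, and the labels are made up.

```python
def macro_f1(y_true, y_pred):
    """Hypothetical helper: unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # per-class F1 is 1.0, 2/3 and 0.8, so the mean is 0.8222
```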

## Train

Let's define a `Trainer` object using `croatoan_trainer.train.dataset.CroatoanDataset`, `torch.utils.data.DataLoader`, `croatoan_trainer.train.model.MulticlassSimpleMLP`, `torch.optim.Adam`, `torch.nn.CrossEntropyLoss()`, `croatoan_trainer.train.metrics.get_metrics_multiclass`, `"f1"` as the main metric and `"maximize"` as the direction.

In [None]:
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam

from croatoan_trainer.train import Trainer
from croatoan_trainer.train.dataset import CroatoanDataset
from croatoan_trainer.train.model import MulticlassSimpleMLP
from croatoan_trainer.train.metrics import get_metrics_multiclass


trainer = Trainer(
    preprocessed_data=preproc,
    dataset_class=CroatoanDataset,
    loader_class=DataLoader,
    model_class=MulticlassSimpleMLP,
    optimizer_class=Adam,
    criterion=torch.nn.CrossEntropyLoss(),
    get_metrics=get_metrics_multiclass,
    main_metric="f1",
    direction="maximize"
)

To train, call the `train(params)` method. Under the hood it uses all the arguments passed when the class was initialized, but you must also provide the `params` argument: a dict with the following keys: `model` (kwargs for your model), `optimizer` (kwargs for the optimizer), `batch_size` and `epochs`.

In [None]:
params = {
    "model": {
        "in_features": x.shape[1],
        "hidden_features": 20,
        "output_features": 3,
        "dropout": 0.25
    },
    "optimizer": {
        "lr": 1e-3,
        "weight_decay": 5*1e-5
    },
    "batch_size": 32,
    "epochs": 100
}

Now we are ready to train our model. But first, here is how this method works under the hood.

Training pipeline

1. Trains in CV mode: trains the model on the train set of each fold for the given number of epochs, checks its performance on that fold's val set, and averages the scores across all folds for each epoch. It then chooses the best epoch and saves all results (losses, metrics on each epoch for train and val sets, best result, training time, unique ids, true values and predicted values on each epoch for val set). Results for each fold are also saved.
2. Trains in test mode: trains the model on the train set and checks its performance on the test set, using the number of epochs chosen at the CV stage, and saves all results (losses, metrics on each epoch for train and test sets, best result, training time, unique ids, true values and predicted values on each epoch for test set).
3. Trains in final mode: trains the model on all data, using the number of epochs chosen at the CV stage, and saves all results (losses, metrics on each epoch for train and test sets, best result, training time, unique ids, true values and predicted values on each epoch for test set). Here the train and test sets are the same (all available data), but the metrics can differ because the train set is always shuffled, while the test set isn't.
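The epoch selection in step 1 can be sketched as follows. This is only an illustration of the idea: `choose_best_epoch` is a hypothetical helper, the scores are made up, and the library's own selection logic may differ in details such as tie-breaking.

```python
from statistics import fmean


def choose_best_epoch(fold_scores, direction="maximize"):
    """Hypothetical helper: average the main metric over folds per epoch
    and return the index of the best epoch and its mean score."""
    if direction not in ("maximize", "minimize"):
        raise ValueError("`direction` must be 'maximize' or 'minimize'")
    n_epochs = len(fold_scores[0])
    mean_per_epoch = [
        fmean(fold[epoch] for fold in fold_scores)
        for epoch in range(n_epochs)
    ]
    pick = max if direction == "maximize" else min
    best_epoch = pick(range(n_epochs), key=mean_per_epoch.__getitem__)
    return best_epoch, mean_per_epoch[best_epoch]


# Made-up val `f1` scores: 3 folds, 4 epochs each
fold_f1 = [
    [0.2, 0.7, 0.9, 0.8],
    [0.1, 0.6, 0.8, 0.9],
    [0.3, 0.8, 0.9, 0.7],
]
best_epoch, best_score = choose_best_epoch(fold_f1, direction="maximize")
print(best_epoch, round(best_score, 4))  # epoch index 2 has the highest mean
```

The test and final stages would then train for `best_epoch + 1` epochs, assuming epochs are counted from 0.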

In [None]:
results, model_weights = trainer.train(params)

2023-01-29 20:26:49,760 [INFO] 'cuda' is being used!
2023-01-29 20:26:49,762 [INFO] Training with cv...
2023-01-29 20:26:49,768 [INFO] Fold 0 is being trained...
2023-01-29 20:26:56,263 [INFO] Epoch 000 train_loss: 1.0990 val_loss: 1.1031 train_f1: 0.2388 val_f1: 0.1667
2023-01-29 20:26:57,395 [INFO] Epoch 020 train_loss: 0.8334 val_loss: 0.8368 train_f1: 0.7093 val_f1: 0.7091
2023-01-29 20:26:58,531 [INFO] Epoch 040 train_loss: 0.5112 val_loss: 0.5325 train_f1: 0.8230 val_f1: 0.6393
2023-01-29 20:26:59,696 [INFO] Epoch 060 train_loss: 0.3805 val_loss: 0.3975 train_f1: 0.9266 val_f1: 0.7691
2023-01-29 20:27:00,826 [INFO] Epoch 080 train_loss: 0.2814 val_loss: 0.2991 train_f1: 0.9470 val_f1: 0.8704
2023-01-29 20:27:01,911 [INFO] Epoch 099 train_loss: 0.1992 val_loss: 0.2277 train_f1: 0.9160 val_f1: 0.9582
2023-01-29 20:27:01,912 [INFO] Fold 1 is being trained...
2023-01-29 20:27:01,976 [INFO] Epoch 000 train_loss: 1.1013 val_loss: 1.0973 train_f1: 0.2670 val_f1: 0.1667
2023-01-29 20:27:

After training you can always check your training process by calling `print_logs()` method, which will just print logs from `logs.log` file.

In [None]:
trainer.print_logs()

# Analyzing

To initialize the `MulticlassAnalyzer` class, you need:
* `results`: dictionary with results for each stage after training. Keys are `cv`, `test` and `final`, and values are dicts with results for each stage (losses for each epoch inside `losses`, dict with metrics for each epoch inside `metrics`, best epoch and best metrics inside `best_result`, training time inside `time`, list of unique ids inside `ids`, list of true values inside `true` and list of model outputs for each epoch inside `pred`);
* `postprocess_fn`: function that takes the list of model outputs from the `pred` key for each stage in `results` and processes them. For a multiclass classification problem, the final output must be an exact class. So, for example, if your model outputs a list of logits, define a function that converts the logits into a class (the index of the maximum logit) (default is `None`).

<ins>**REMINDER!**</ins> The main stage is always `test`, not `final` (`test` is how your model performs on data that it didn't see; `final` is how your model performs on data that it used for training).

In [None]:
from croatoan_trainer.analyze import MulticlassAnalyzer


def postprocess_fn(model_output):
    return np.argmax(model_output, axis=1)


analyzer = MulticlassAnalyzer(results, postprocess_fn)

## Getters

In this section some simple getters are explained:
* `get_stages()`: gets list of stages;
* `get_metrics()`: gets list of metrics used in training;
* `get_folds()`: gets number of folds used in training;
* `get_epochs(stage)`: gets number of epochs for stage;
* `get_time()`: gets train time in seconds for all stages;
* `get_df_pred(stage)`: gets dataframe with predictions.

In [None]:
analyzer.get_stages()

['cv', 'test', 'final']

In [None]:
analyzer.get_metrics()

['accuracy', 'recall', 'precision', 'f1']

In [None]:
analyzer.get_folds()

5

In [None]:
analyzer.get_epochs("test")

93

In [None]:
analyzer.get_time()

{'cv': 36.05851912498474,
 'test': 6.5041632652282715,
 'final': 12.00890302658081}

In [None]:
analyzer.get_df_pred("test")

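| | ID | True | Pred |
|---|---|---|---|
| 0 | 37 | 0 | 0 |
| 1 | 133 | 2 | 2 |
| 2 | 141 | 2 | 2 |
| 3 | 44 | 0 | 0 |
| 4 | 91 | 1 | 1 |
| 5 | 60 | 1 | 1 |
| 6 | 106 | 2 | 1 |
| 7 | 92 | 1 | 1 |
| 8 | 90 | 1 | 1 |
| 9 | 31 | 0 | 0 |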


## Metrics

You can get dataframe with final metrics for defined stages by calling `get_df_metrics(stages)` method.

In [None]:
analyzer.get_df_metrics(stages=["cv", "test"])

| | accuracy (cv) | accuracy (test) | recall (cv) | recall (test) | precision (cv) | precision (test) | f1 (cv) | f1 (test) |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.966667 | 0.933333 | 0.966667 | 0.933333 | 0.971852 | 0.933333 | 0.966337 | 0.933333 |


In [None]:
analyzer.get_df_metrics(stages=["cv", "test", "final"])

| | accuracy (cv) | accuracy (test) | accuracy (final) | recall (cv) | recall (test) | recall (final) | precision (cv) | precision (test) | precision (final) | f1 (cv) | f1 (test) | f1 (final) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.966667 | 0.933333 | 0.973333 | 0.966667 | 0.933333 | 0.973333 | 0.971852 | 0.933333 | 0.973825 | 0.966337 | 0.933333 | 0.973323 |


You can also compute other metrics using the `get_metric_result(stage, metric, round, **kwargs)` method.

Arguments:
* `stage`: one of the stages returned by the `get_stages()` method;
* `metric`: function that takes `y_true` and `y_pred`, in this order, and returns a float;
* `**kwargs`: extra arguments for the `metric` function.
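Any function with this signature should work as `metric`, not only the ones from `sklearn`. As a sketch, `error_rate` below is a hypothetical custom metric with one extra keyword argument. The labels are the ten rows shown in the `get_df_pred("test")` output above, and the helper is called directly here rather than through the analyzer.

```python
def error_rate(y_true, y_pred, normalize=True):
    """Hypothetical metric: share (or count) of misclassified entries."""
    errors = sum(t != p for t, p in zip(y_true, y_pred))
    return errors / len(y_true) if normalize else float(errors)


# The ten (True, Pred) rows shown in the `get_df_pred("test")` output
y_true = [0, 2, 2, 0, 1, 1, 2, 1, 1, 0]
y_pred = [0, 2, 2, 0, 1, 1, 1, 1, 1, 0]

print(error_rate(y_true, y_pred))                   # 0.1
print(error_rate(y_true, y_pred, normalize=False))  # 1.0

# With the analyzer, the extra keyword would go through `**kwargs`:
# analyzer.get_metric_result("test", error_rate, normalize=False)
```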

In [None]:
from sklearn.metrics import jaccard_score


analyzer.get_metric_result("test", jaccard_score, average="macro")

0.8787878787878789

## Charts

### Plotly Args

You can customize the `plotly` layout of the charts by calling the `set_plotly_args(**kwargs)` method.

Arguments:
* `**kwargs`: named arguments for the plotly `update_layout()` method (argument names must match those accepted by this method).

In [None]:
analyzer.set_plotly_args(font_size=14, template="plotly_dark", bargap=0.2)

### Losses

You can plot losses for each stage and each fold by calling the `plot_losses(stage, fold)` method.

Arguments:
* `stage`: one of the stages returned by the `get_stages()` method;
* `fold`: number of the CV fold. Specify this parameter only when `stage` == `'cv'`. If it is not specified and `stage` == `'cv'`, mean results for all CV folds are plotted (default is `None`).

In [None]:
analyzer.plot_losses("test")

In [None]:
analyzer.plot_losses("cv")

In [None]:
analyzer.plot_losses("cv", 2)

### Metrics

You can plot metrics for each stage and each fold by calling the `plot_metrics(stage, metrics, fold)` method.

Arguments:
* `stage`: one of the stages returned by the `get_stages()` method;
* `metrics`: list of metrics for plotting;
* `fold`: number of the CV fold. Specify this parameter only when `stage` == `'cv'`. If it is not specified and `stage` == `'cv'`, mean results for all CV folds are plotted (default is `None`).

In [None]:
analyzer.plot_metrics("test", ["f1"])

In [None]:
analyzer.plot_metrics("cv", ["accuracy"], 2)

In [None]:
# plot all possible metrics
analyzer.plot_metrics("cv", analyzer.get_metrics())

### Sample Predictions

You can plot predictions over epochs for a specific `id` from the unique ids by calling the `plot_pred_sample(stage, id)` method.

Arguments:
* `stage`: one of the stages returned by the `get_stages()` method;
* `id`: unique id that identifies a specific entry.

In [None]:
analyzer.plot_pred_sample("test", 77)

In [None]:
analyzer.plot_pred_sample("cv", 4)

### Confusion Matrix

You can plot a confusion matrix for each stage by calling the `plot_confusion_matrix(stage)` method.

In [None]:
analyzer.plot_confusion_matrix("test")

In [None]:
analyzer.plot_confusion_matrix("cv")
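As a reminder of what such a chart encodes, the sketch below counts a confusion matrix by hand for the ten rows shown in the `get_df_pred("test")` output above. The `confusion_matrix` helper is hypothetical, and the orientation (rows for true classes, columns for predicted classes) is an assumption that follows the common `sklearn` convention. The plotted chart may be laid out differently.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Hypothetical helper: matrix[t][p] counts entries of true class `t`
    that were predicted as class `p`."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# The ten (True, Pred) rows shown in the `get_df_pred("test")` output
y_true = [0, 2, 2, 0, 1, 1, 2, 1, 1, 0]
y_pred = [0, 2, 2, 0, 1, 1, 1, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
for row in matrix:
    print(row)
# [3, 0, 0]
# [0, 4, 0]
# [0, 1, 2]
```

The single off-diagonal count corresponds to the entry with ID `106`, whose true class `2` was predicted as class `1`.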

You can also plot confusion matrices for selected epochs of each stage by calling the `plot_confusion_matrix_per_epoch(stage, epochs)` method.

Arguments:
* `stage`: one of the stages returned by the `get_stages()` method;
* `epochs`: list of epochs for plotting (epochs are counted from 0).

In [None]:
# Plot 1, 25, 50, 75 and 100 epochs
analyzer.plot_confusion_matrix_per_epoch(
    stage="cv",
    epochs=[0, 24, 49, 74, 99]
)

In [None]:
# Plot every 10th epoch
analyzer.plot_confusion_matrix_per_epoch(
    stage="test",
    epochs=range(9, analyzer.get_epochs("test"), 10)
)

### All

You can plot all main charts (losses, all metrics, confusion matrix) for each stage by calling `plot_all(stage)` method.

In [None]:
analyzer.plot_all("test")