# Installation

In [None]:
!pip install git+https://github.com/VolodymyrVozniak/universal-trainer.git

# Checker

In [None]:
from croatoan_trainer.constants import DEVICE

In [None]:
DEVICE

device(type='cuda')
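How the package defines `DEVICE` is not shown in this notebook. A common pattern, given here as an assumption rather than the package's actual code, is to pick CUDA when it is available and fall back to CPU otherwise:

```python
import torch

# Assumed pattern (not taken from croatoan_trainer): prefer the GPU if present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
```

If this prints `cpu` on a machine that has a GPU, the installed `torch` build most likely lacks CUDA support.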

# Load Dataset

As a regression example, let's use the `diabetes` dataset from `sklearn`.

Check this [link](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html) for more details.

In [None]:
import numpy as np

from sklearn.datasets import load_diabetes


data = load_diabetes()
x = data['data']
y = data['target']

print(f"Number of samples : {x.shape[0]}")
print(f"Number of features: {x.shape[1]}\n")

print(f"Features shape: {x.shape}")
print(f"Targets  shape: {y.shape}\n")

print("First 2 entries, their features and targets")
print(f"Features:\n{x[:2]}")
print(f"Targets:\n{y[:2]}")

Number of samples : 442
Number of features: 10

Features shape: (442, 10)
Targets  shape: (442,)

First 2 entries, their features and targets
Features:
[[ 0.03807591  0.05068012  0.06169621  0.02187239 -0.0442235  -0.03482076
  -0.04340085 -0.00259226  0.01990749 -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 -0.02632753 -0.00844872 -0.01916334
   0.07441156 -0.03949338 -0.06833155 -0.09220405]]
Targets:
[151.  75.]


As we can see, the features are already scaled, so there is no need to apply `StandardScaler`.
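The scaling can also be checked numerically instead of by eye. According to the scikit-learn documentation, each feature of this dataset is mean-centered and scaled so that the sum of squares of each column equals 1. A minimal check, separate from the trainer package:

```python
import numpy as np
from sklearn.datasets import load_diabetes

x = load_diabetes()["data"]

col_means = x.mean(axis=0)          # should be ~0 for every feature
col_sq_sums = (x ** 2).sum(axis=0)  # should be ~1 for every feature

print(np.allclose(col_means, 0.0, atol=1e-8))
print(np.allclose(col_sq_sums, 1.0))
```

Note that this is not the unit variance `StandardScaler` would produce, but all features do share the same scale.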

Now we can move to preprocessing.

# Preprocessing

First of all, to initialize the `RegressionPreproc` class you need two dicts:
* `ids_to_features`: dict with unique ids as keys and features as values that will be used for training;
* `ids_to_targets`: dict with the same unique ids and targets as values.

In [None]:
from croatoan_trainer.preprocess import RegressionPreproc


# Make dict with unique ids as keys and features as values
ids_to_features = dict(zip(np.arange(len(y)), x))

# Make dict with unique ids as keys and targets as values
ids_to_targets = dict(zip(np.arange(len(y)), y))

preproc = RegressionPreproc(ids_to_features, ids_to_targets)

## Preparing Targets

First of all, let's plot targets histogram (distribution).

In [None]:
preproc.plot_targets(prepared=False)

As we can see, there are no outliers and the distribution is close to normal, so we need neither the log transform nor quantiles.
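The visual impression can be backed with a couple of numbers. The sketch below is independent of the package: it computes the skewness of the targets and compares the 1% and 99% quantiles with the extremes. Treating an absolute skewness below 1 as "not strongly skewed" is a rule of thumb assumed here, not something the package prescribes.

```python
import numpy as np
from scipy.stats import skew
from sklearn.datasets import load_diabetes

y = load_diabetes()["target"]

skewness = float(skew(y))                  # 0 for a perfectly symmetric distribution
low, high = np.quantile(y, [0.01, 0.99])   # bulk of the distribution

print(f"Skewness: {skewness:.2f}")
print(f"1% / 99% quantiles: {low:.1f} / {high:.1f}")
print(f"Min / max: {y.min():.1f} / {y.max():.1f}")
```

If the extremes sat far outside the quantile range, or the skewness were large, the `log` and `quantiles` options would be worth a second look.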

In [None]:
preproc.prepare_targets(log=False, quantiles=None)

[INFO] Prepared targets were successfully saved to `self.targets`!


Let's plot prepared targets histogram (distribution).

In [None]:
preproc.plot_targets(prepared=True)

We are ready to split our data.

## Splitting Data

When splitting data, you define the following arguments:
* `test_size`: fraction of data for the test set (default is `0.2`);
* `n_folds`: number of CV folds (default is `5`). For now must be >= `1`. A value of `1` means a single validation set, in which case `val_size` must be specified;
* `val_size`: fraction of data for the validation set. Must be specified if `n_folds` == `1` (default is `None`);
* `seed`: seed for splitting (default is `51983`).

Let's split the data into train and test sets with `0.2` for test, and split the train set into `5` CV folds.

In [None]:
preproc.random_split(
    test_size=0.2,
    n_folds=5,
    val_size=None,
    seed=51983
)

[INFO] Train-test split was successfully saved to `self.split['train_test']`!
[INFO] CV split was successfully saved to `self.split['cv']`!
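The resulting set sizes can be reproduced with plain `scikit-learn`. The sketch below uses `train_test_split` and `KFold` as an illustration; it is not the package's implementation, so the ids that land in each set will likely differ from those stored in `preproc.split`.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

ids = np.arange(442)  # one id per sample in the diabetes dataset

# Hold out 20% of the ids for the test set.
train_ids, test_ids = train_test_split(ids, test_size=0.2, random_state=51983)

# Split the remaining train ids into 5 CV folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=51983)
folds = [
    (train_ids[train_idx], train_ids[val_idx])
    for train_idx, val_idx in kfold.split(train_ids)
]

print(f"Train: {len(train_ids)}, Test: {len(test_ids)}")
for i, (fold_train, fold_val) in enumerate(folds):
    print(f"Fold {i}: train={len(fold_train)}, val={len(fold_val)}")
```

Every sample of the train set appears in exactly one validation fold, which is what makes the averaged CV score an estimate over the whole train set.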


Let's plot targets histograms (distributions) for all sets and folds for input and prepared targets.

For input targets:

In [None]:
preproc.plot_split_targets(prepared=False)

For prepared targets:

In [None]:
preproc.plot_split_targets(prepared=True)

Everything looks good! Let's get some split info.

In [None]:
preproc.get_split_info()

| | count | % |
| --- | --- | --- |
| All | 442 | 100.0 |
| Train | 353 | 79.9 |
| Test | 89 | 20.1 |
| Train_0 | 282 | 63.8 |
| Val_0 | 71 | 16.1 |
| Train_1 | 282 | 63.8 |
| Val_1 | 71 | 16.1 |
| Train_2 | 282 | 63.8 |
| Val_2 | 71 | 16.1 |
| Train_3 | 283 | 64.0 |


We are ready to train our model!

# Training

For initializing `Trainer` you must provide these arguments:
* `preprocessed_data`: `RegressionPreproc` class object that we have already defined and prepared;
* `dataset_class`: `CroatoanDataset` class or a class that inherits from `CroatoanDataset` (the class must include the `ids` attribute; ideally, override only the `process_features` method, which processes the features passed to the model for training);
* `loader_class`: class for the DataLoader. Can be the standard `torch.utils.data.DataLoader`, `torch_geometric.loader.DataLoader`, or a custom DataLoader that inherits from `torch.utils.data.DataLoader`;
* `model_class`: class for torch model that will be trained. Must inherit `torch.nn.Module`;
* `optimizer_class`: any `torch.optim.Optimizer` optimizer;
* `criterion`: any loss from `torch`;
* `get_metrics`: function that takes two torch tensors (real and predicted values), computes some metrics and returns them as a dict with metric names as keys and metric values as values;
* `main_metric`: main metric (must be one of metrics defined in `get_metrics` function). This metric will be used to choose the best epoch on CV;
* `direction`: direction in which we want to optimize `main_metric`. For example, `"minimize"` for regression problem and `mae` as `main_metric` or `"maximize"` for binary problem and `f1` as `main_metric`.

## Dataset

As mentioned above, it is recommended to override only the `process_features` method, which processes the features passed to the model for training.

Here is what `CroatoanDataset` looks like:

```python
class CroatoanDataset(Dataset):
    """
    Dataset that will be used for training.

    Attributes:
        `ids` (list): List with unique ids.
        `features` (dict): Dictionary with unique ids as keys
        and features as values.
        `targets` (dict): Dictionary with unique ids as keys
        and targets as values.

    Methods:
        `process_features(features)`: Processes features for one entry.
    """
    def __init__(
        self,
        ids: List[Union[int, str]],
        features: Dict[Union[int, str], List[float]],
        targets: Dict[Union[int, str], float]
    ):
        """
        Args:
            `ids` (list): List with unique ids.
            `features` (dict): Dictionary with unique ids as keys
            and features as values.
            `targets` (dict): Dictionary with unique ids as keys
            and targets as values.
        """
        self.ids = ids
        self.features = features
        self.targets = targets

    def process_features(self, features: List[float]) -> torch.Tensor:
        """
        Processes features for one entry.

        Args:
            features (list): Features for one entry.

        Returns:
            torch.Tensor: New features that will go to torch model.
        """
        return torch.Tensor(features)

    def __getitem__(self, index: int) -> Tuple[torch.Tensor, float]:
        id_ = self.ids[index]
        features = self.process_features(self.features[id_])
        return features, self.targets[id_]

    def __len__(self) -> int:
        return len(self.ids)
```
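Overriding `process_features` might look like the sketch below. To keep it runnable without the package, `MinimalDataset` is a stand-in that mirrors the listing above, and `SquaredFeaturesDataset` is a hypothetical subclass that appends squared features. In real code you would inherit from `CroatoanDataset` instead.

```python
import torch
from torch.utils.data import Dataset


class MinimalDataset(Dataset):
    """Stand-in mirroring the CroatoanDataset listing above."""

    def __init__(self, ids, features, targets):
        self.ids = ids
        self.features = features
        self.targets = targets

    def process_features(self, features):
        return torch.Tensor(features)

    def __getitem__(self, index):
        id_ = self.ids[index]
        return self.process_features(self.features[id_]), self.targets[id_]

    def __len__(self):
        return len(self.ids)


class SquaredFeaturesDataset(MinimalDataset):
    """Hypothetical subclass: only `process_features` is overridden."""

    def process_features(self, features):
        base = torch.as_tensor(features, dtype=torch.float32)
        # Append squared features, doubling the feature count.
        return torch.cat([base, base ** 2])


dataset = SquaredFeaturesDataset(
    ids=[0, 1],
    features={0: [1.0, 2.0], 1: [3.0, 4.0]},
    targets={0: 10.0, 1: 20.0},
)
features, target = dataset[1]
print(features, target)
```

With a transformation like this one, the model's `in_features` has to be doubled accordingly.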

## Model

Usually, you want to define your own `torch` model with your own architecture and pass it to `Trainer`.

But there are already defined simple models that you can import and use.

Here is what `RegressionSimpleMLP` looks like:

```python
class RegressionSimpleMLP(Module):
    """
    Simple MLP for regression problem
    (2 hidden layers; dropout layers; ReLU activation).
    """

    def __init__(
        self,
        in_features: int,
        hidden_features: int,
        dropout: float
    ):
        """
        Args:
            `in_features` (int): Number of features for input.
            `hidden_features` (int): Number of features for hidden layers.
            `dropout` (float): Dropout for dropout layers.
        """
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(hidden_features, hidden_features),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(hidden_features, 1)
        )

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        return self.model(data).reshape(-1)
```
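The `reshape(-1)` in `forward` flattens the `(batch, 1)` output of the last linear layer to `(batch,)`, so that predictions have the same shape as the targets when the loss is computed. The sketch below rebuilds the same layer stack directly with `torch.nn` (so it runs without the package) and checks the output shape for a random batch:

```python
import torch
from torch import nn

in_features, hidden_features, dropout = 10, 20, 0.25

# Same layer stack as in the listing above.
model = nn.Sequential(
    nn.Linear(in_features, hidden_features),
    nn.Dropout(dropout),
    nn.ReLU(),
    nn.Linear(hidden_features, hidden_features),
    nn.Dropout(dropout),
    nn.ReLU(),
    nn.Linear(hidden_features, 1),
)
model.eval()  # disables dropout, so repeated calls give identical outputs

batch = torch.randn(32, in_features)
with torch.no_grad():
    raw = model(batch)        # shape (32, 1)
    out = raw.reshape(-1)     # shape (32,)

print(raw.shape, out.shape)
```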

## Metrics Function

Also, it is important to define a function that computes metrics on each epoch for both the train and test (val) sets.

There are predefined metrics functions that you can import and use.

Here is what `get_metrics_regression` looks like:

```python
def get_metrics_regression(
    y_true: torch.Tensor,
    y_pred: torch.Tensor
) -> Dict[str, float]:
    """
    Computes metrics for regression problem
    (`mse`, `mae`, `mape`, `r2`).

    Args:
        `y_true` (torch.Tensor): Torch tensor with real values.
        `y_pred` (torch.Tensor): Torch tensor with predicted values.

    Returns:
        dict: Dictionary with metric's name as keys and
        metric's values as values.
    """
    scores = {}
    if np.isfinite(y_pred).all():
        scores["mse"] = float(mean_squared_error(y_true, y_pred))
        scores["mae"] = float(mean_absolute_error(y_true, y_pred))
        scores["mape"] = float(mean_absolute_percentage_error(y_true, y_pred))
        scores["r2"] = float(r2_score(y_true, y_pred))
    else:
        scores["mse"], scores["mae"], scores["mape"], scores["r2"] \
            = [-np.inf] * 4
    return scores
```
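If the predefined function does not cover your needs, a custom one with the same contract (two tensors in, a dict of floats out) could look like the sketch below. The name `get_metrics_custom` and the choice of metrics are illustrative only.

```python
from typing import Dict

import torch


def get_metrics_custom(
    y_true: torch.Tensor,
    y_pred: torch.Tensor
) -> Dict[str, float]:
    """Computes `mae` and `rmse` with plain torch operations."""
    errors = y_pred.float() - y_true.float()
    return {
        "mae": float(errors.abs().mean()),
        "rmse": float(errors.pow(2).mean().sqrt()),
    }


scores = get_metrics_custom(
    torch.tensor([1.0, 2.0, 3.0]),
    torch.tensor([1.0, 2.0, 5.0]),
)
print(scores)
```

Whatever keys the function returns, `main_metric` must be one of them (here `"mae"` or `"rmse"`).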

## Train

Let's define a `Trainer` object using `croatoan_trainer.train.dataset.CroatoanDataset`, `torch.utils.data.DataLoader`, `croatoan_trainer.train.model.RegressionSimpleMLP`, `torch.optim.Adam`, `torch.nn.MSELoss()`, `croatoan_trainer.train.metrics.get_metrics_regression`, `"mae"` as the main metric and `"minimize"` as the direction.

In [None]:
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam

from croatoan_trainer.train import Trainer
from croatoan_trainer.train.dataset import CroatoanDataset
from croatoan_trainer.train.model import RegressionSimpleMLP
from croatoan_trainer.train.metrics import get_metrics_regression


trainer = Trainer(
    preprocessed_data=preproc,
    dataset_class=CroatoanDataset,
    loader_class=DataLoader,
    model_class=RegressionSimpleMLP,
    optimizer_class=Adam,
    criterion=torch.nn.MSELoss(),
    get_metrics=get_metrics_regression,
    main_metric="mae",
    direction="minimize"
)

To train, call the `train(params)` method. Under the hood it uses the arguments passed at initialization, and it additionally requires `params`: a dict with the following keys: `model` (kwargs for your model), `optimizer` (kwargs for the optimizer), `batch_size` and `epochs`.

In [None]:
params = {
    "model": {
        "in_features": x.shape[1],
        "hidden_features": 20,
        "dropout": 0.25
    },
    "optimizer": {
        "lr": 1e-3,
        "weight_decay": 5*1e-5
    },
    "batch_size": 32,
    "epochs": 100
}

Now we are ready to train our model. But first let me explain how this method works under the hood.

Training pipeline:

1. Trains in CV mode: for each fold, trains the model on the fold's train set for the given number of epochs and checks its performance on the fold's val set. The scores are then averaged over all folds for each epoch, the best epoch is chosen, and all results are saved (losses, metrics on each epoch for train and val sets, best result, training time, unique ids, true values and predicted values on each epoch for the val set). Results for each fold are also saved.
2. Trains in test mode: trains the model on the train set with the number of epochs chosen at the CV stage and checks its performance on the test set. All results are saved (losses, metrics on each epoch for train and test sets, best result, training time, unique ids, true values and predicted values on each epoch for the test set).
3. Trains in final mode: trains the model on all data with the number of epochs chosen at the CV stage and saves all results (losses, metrics on each epoch for train and test sets, best result, training time, unique ids, true values and predicted values on each epoch for the test set). Here the train and test sets are the same (all available data), but the metrics can differ, because the train set is always shuffled while the test set isn't.
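The best-epoch selection in step 1 can be sketched with `numpy`. The validation scores below are made up for illustration, and the code is a simplified reading of the description above, not the package's implementation:

```python
import numpy as np

# Hypothetical val_mae values: one row per fold, one column per epoch.
val_mae = np.array([
    [150.0, 90.0, 55.0, 57.0],
    [148.0, 95.0, 50.0, 53.0],
    [152.0, 91.0, 54.0, 52.0],
])
direction = "minimize"

# Average the metric over folds for every epoch.
mean_per_epoch = val_mae.mean(axis=0)

# Pick the epoch with the best averaged score for the given direction.
if direction == "minimize":
    best_epoch = int(mean_per_epoch.argmin())
else:
    best_epoch = int(mean_per_epoch.argmax())

print(mean_per_epoch)
print(f"Best epoch (0-based): {best_epoch}")
```

Note that the best epoch of the averaged curve need not be the best epoch of any single fold: here fold 2 alone would prefer the last epoch.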

In [None]:
results, model_weights = trainer.train(params)

2023-01-29 18:33:06,620 [INFO] 'cuda' is being used!
2023-01-29 18:33:06,621 [INFO] Training with cv...
2023-01-29 18:33:06,626 [INFO] Fold 0 is being trained...
2023-01-29 18:33:15,148 [INFO] Epoch 000 train_loss: 28772.5446 val_loss: 30009.5952 train_mae: 151.4379 val_mae: 154.8196
2023-01-29 18:33:16,032 [INFO] Epoch 020 train_loss: 26350.0942 val_loss: 27294.9011 train_mae: 143.4382 val_mae: 146.2197
2023-01-29 18:33:16,852 [INFO] Epoch 040 train_loss: 13013.2233 val_loss: 13239.2457 train_mae: 90.6531 val_mae: 92.6150
2023-01-29 18:33:17,755 [INFO] Epoch 060 train_loss: 5336.8615 val_loss: 4632.1470 train_mae: 58.5801 val_mae: 54.9845
2023-01-29 18:33:18,967 [INFO] Epoch 080 train_loss: 4861.1130 val_loss: 3961.9017 train_mae: 57.3771 val_mae: 51.8525
2023-01-29 18:33:19,661 [INFO] Epoch 099 train_loss: 4864.4130 val_loss: 3793.6777 train_mae: 56.8589 val_mae: 50.5621
2023-01-29 18:33:19,663 [INFO] Fold 1 is being trained...
2023-01-29 18:33:19,690 [INFO] Epoch 000 train_loss: 286

After training you can always check your training process by calling `print_logs()` method, which will just print logs from `logs.log` file.

In [None]:
trainer.print_logs()

# Analyzing

For initializing `RegressionAnalyzer` class you will need:
* `results` - dictionary with results for each stage after training. Keys are `cv`, `test` and `final`, and values are dicts with the results of each stage (containing losses for each epoch inside `losses`, a dict with metrics for each epoch inside `metrics`, best epoch and best metrics inside `best_result`, training time inside `time`, a list with unique ids inside `ids`, a list with true values inside `true` and a list with model outputs for each epoch inside `pred`);
* `postprocess_fn` - function that takes list with model outputs from `pred` key for each stage in `results` and somehow processes them (default is `None`).

<ins>**REMINDER!**</ins> The main stage is always `test`, not `final` (`test` is how your model performs on data that it didn't see; `final` is how your model performs on data that it used for training).

In [None]:
from croatoan_trainer.analyze import RegressionAnalyzer


analyzer = RegressionAnalyzer(results, None)
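Instead of `None`, a `postprocess_fn` can be supplied. The exact shape of the input it receives is not spelled out here, so the sketch below assumes a flat list of model outputs for one epoch; the function name is hypothetical. It clips predictions to the range of the diabetes targets, which scikit-learn documents as 25 to 346:

```python
from typing import List

import numpy as np


def clip_to_target_range(outputs: List[float]) -> List[float]:
    """Clips model outputs to the known target range of the dataset."""
    # Assumption: `outputs` is a flat list of predictions for one epoch.
    clipped = np.clip(np.asarray(outputs, dtype=float), 25.0, 346.0)
    return clipped.tolist()


print(clip_to_target_range([-3.2, 138.27, 400.0]))
```

Under that assumption it would be passed as the second argument, in place of `None`.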

## Getters

In this section some simple getters are explained:
* `get_stages()`: gets list of stages;
* `get_metrics()`: gets list of metrics used in training;
* `get_folds()`: gets number of folds used in training;
* `get_epochs(stage)`: gets number of epochs for stage;
* `get_time()`: gets train time in seconds for all stages;
* `get_df_pred(stage)`: gets dataframe with predictions.

In [None]:
analyzer.get_stages()

['cv', 'test', 'final']

In [None]:
analyzer.get_metrics()

['mse', 'mae', 'mape', 'r2']

In [None]:
analyzer.get_folds()

5

In [None]:
analyzer.get_epochs("test")

100

In [None]:
analyzer.get_time()

{'cv': 22.12667965888977,
 'test': 2.852672576904297,
 'final': 3.979445219039917}

In [None]:
analyzer.get_df_pred("test")

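| | ID | True | Pred |
| --- | --- | --- | --- |
| 0 | 299 | 83.0 | 138.271149 |
| 1 | 368 | 123.0 | 181.972992 |
| 2 | 321 | 230.0 | 231.060577 |
| 3 | 224 | 77.0 | 101.455818 |
| 4 | 270 | 202.0 | 156.245773 |
| ... | ... | ... | ... |
| 84 | 21 | 49.0 | 111.287781 |
| 85 | 435 | 64.0 | 125.641182 |
| 86 | 165 | 59.0 | 99.559120 |
| 87 | 102 | 302.0 | 131.084854 |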


## Metrics

You can get dataframe with final metrics for defined stages by calling `get_df_metrics(stages)` method.

In [None]:
analyzer.get_df_metrics(stages=["cv", "test"])

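| metric | mse | mse | mae | mae | mape | mape | r2 | r2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| stage | cv | test | cv | test | cv | test | cv | test |
| 0 | 3823.220605 | 3539.823975 | 51.727971 | 48.4669 | 0.45838 | 0.414309 | 0.347965 | 0.421734 |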


In [None]:
analyzer.get_df_metrics(stages=["cv", "test", "final"])

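| metric | mse | mse | mse | mae | mae | mae | mape | mape | mape | r2 | r2 | r2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| stage | cv | test | final | cv | test | final | cv | test | final | cv | test | final |
| 0 | 3823.220605 | 3539.823975 | 3308.001709 | 51.727971 | 48.4669 | 47.288002 | 0.45838 | 0.414309 | 0.411224 | 0.347965 | 0.421734 | 0.442147 |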


Also, you can compute some other metrics if you want using `get_metric_result(stage, metric, round, **kwargs)` method.

Arguments:
* `stage`: one of the stages from the `get_stages()` method;
* `metric`: function that takes `y_true` and `y_pred` in this order and gives float as output;
* `round`: ... or predictions (if `False`) (default is `True`);
* `**kwargs`: extra arguments for `metric` function.

In [None]:
from sklearn.metrics import median_absolute_error


analyzer.get_metric_result("test", median_absolute_error)

44.81414031982422
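Any function with the `(y_true, y_pred, **kwargs)` shape can serve as `metric`. As an illustration, the hypothetical `within_tolerance` below returns the fraction of predictions that fall within `tol` of the true value. It is applied here to the first five rows of the `test` predictions shown earlier:

```python
import numpy as np


def within_tolerance(y_true, y_pred, tol: float = 25.0) -> float:
    """Fraction of predictions whose absolute error is at most `tol`."""
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(np.mean(errors <= tol))


# First five rows of the `test` predictions dataframe above.
y_true = [83.0, 123.0, 230.0, 77.0, 202.0]
y_pred = [138.271149, 181.972992, 231.060577, 101.455818, 156.245773]

fraction = within_tolerance(y_true, y_pred, tol=25.0)
print(fraction)
```

Since extra keyword arguments are forwarded to `metric`, a call such as `analyzer.get_metric_result("test", within_tolerance, tol=25.0)` should work the same way on the full set.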

## Charts

### Plotly Args

You can customize the `plotly` layout of the charts by calling the `set_plotly_args(**kwargs)` method.

Arguments:
* `**kwargs`: named arguments for the plotly `update_layout()` method (argument names must match those of that method).

In [None]:
analyzer.set_plotly_args(font_size=14, template="plotly_dark", bargap=0.2)

### Losses

You can plot losses for each stage and each fold by calling `plot_losses(stage, fold)` method.

Arguments:
* `stage`: one of the stages from the `get_stages()` method;
* `fold`: number of CV fold. Specify this parameter only when `stage` == `'cv'`. If not specified and `stage` == `'cv'` plots mean results for all CV folds (default is `None`).

In [None]:
analyzer.plot_losses("test")

In [None]:
analyzer.plot_losses("cv")

In [None]:
analyzer.plot_losses("cv", 2)

### Metrics

You can plot metrics for each stage and each fold by calling the `plot_metrics(stage, metrics, fold)` method.

Arguments:
* `stage`: one of the stages from the `get_stages()` method;
* `metrics`: list of metrics for plotting;
* `fold`: number of CV fold. Specify this parameter only when `stage` == `'cv'`. If not specified and `stage` == `'cv'`, plots mean results for all CV folds (default is `None`).

In [None]:
analyzer.plot_metrics("test", ["mae"])

In [None]:
analyzer.plot_metrics("cv", ["r2"], 2)

In [None]:
# plot all possible metrics
analyzer.plot_metrics("cv", analyzer.get_metrics())

### Sample Predictions

You can plot predictions over epochs for a specific `id` from the unique ids by calling the `plot_pred_sample(stage, id)` method.

Arguments:
* `stage`: one of the stages from the `get_stages()` method;
* `id`: unique id to identify a specific entry.

In [None]:
analyzer.plot_pred_sample("test", 6)

In [None]:
analyzer.plot_pred_sample("cv", 8)

### True-Predict Dependency

You can plot True-Predict dependency for each stage by calling `plot_pred(stage)` method.

In [None]:
analyzer.plot_pred("test")

In [None]:
analyzer.plot_pred("cv")

### Target Histograms

You can plot histograms (distributions) for true and final predicted values for each stage by calling `plot_hist(stage)` method.

In [None]:
analyzer.plot_hist("test")

In [None]:
analyzer.plot_hist("cv")

Also, you can plot histograms (distributions) for true and predicted values per epochs for each stage by calling `plot_hist_per_epoch(stage, epochs)` method.

Arguments:
* `stage`: one of the stages from the `get_stages()` method;
* `epochs`: list of epochs for plotting (epoch counting starts at 0).

In [None]:
# Plot epochs 1, 25, 50, 75 and 100 (0-based indices)
analyzer.plot_hist_per_epoch(
    stage="cv",
    epochs=[0, 24, 49, 74, 99]
)

In [None]:
# Plot every 10th epoch
analyzer.plot_hist_per_epoch(
    stage="test",
    epochs=range(9, analyzer.get_epochs("test"), 10)
)
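Because epochs are counted from 0, `range(9, n_epochs, 10)` selects the 10th, 20th, and so on up to the 100th epoch. A quick check of what that range contains, using the `100` returned by `get_epochs("test")` in this run:

```python
n_epochs = 100  # value returned by analyzer.get_epochs("test") above

indices = list(range(9, n_epochs, 10))      # 0-based epoch indices
epoch_numbers = [i + 1 for i in indices]    # 1-based epoch numbers

print(indices)
print(epoch_numbers)
```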

### Kernel Density Estimation (KDE)

You can plot the kernel density estimation (KDE) for true and final predicted values for each stage by calling the `plot_kde(stage)` method.

In [None]:
analyzer.plot_kde("cv")

In [None]:
analyzer.plot_kde("test")
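For readers unfamiliar with KDE: it is a smoothed alternative to a histogram that estimates a continuous density from the samples. How the package computes it is not shown here; the sketch below uses `scipy.stats.gaussian_kde` on the dataset targets purely as an illustration of the idea.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.datasets import load_diabetes

y = load_diabetes()["target"]

# Fit a Gaussian KDE with scipy's default bandwidth.
kde = gaussian_kde(y)

# Evaluate the estimated density on a grid spanning the observed targets.
grid = np.linspace(y.min(), y.max(), 200)
density = kde(grid)

# Trapezoidal approximation of the area under the curve on this grid.
area = float(((density[:-1] + density[1:]) / 2 * np.diff(grid)).sum())
print(f"Area under the KDE between min and max: {area:.3f}")
```

The area is slightly below 1 because the Gaussian kernels extend beyond the smallest and largest observed targets.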

Also, you can plot the kernel density estimation for true values, final predicted values and predicted values per epoch for each stage by calling the `plot_kde_per_epoch(stage, epochs)` method.

Arguments:
* `stage`: one of the stages from the `get_stages()` method;
* `epochs`: list of epochs for plotting (epoch counting starts at 0).

In [None]:
# Plot epochs 50 and 75 (0-based indices)
analyzer.plot_kde_per_epoch(
    stage="cv",
    epochs=[49, 74]
)

In [None]:
# Plot every 10th epoch
analyzer.plot_kde_per_epoch(
    stage="test",
    epochs=range(9, analyzer.get_epochs("test"), 10)
)

### All

You can plot all main charts (losses, all metrics, true-predict dependency, histogram for true and final predicted values, kernel density estimation for true and final predicted values) for each stage by calling `plot_all(stage)` method.

In [None]:
analyzer.plot_all("test")