# ML Recipes - From `numpy`

On this page, we will show you how to solve your own ML tasks on top of a `numpy` dataset.

> It is recommended to go through some basic recipes ([`models`](../models.ipynb), [`losses`](../losses.ipynb), [`metrics`](../metrics.ipynb)) first, but if you just want to utilize `carefree-learn` as soon as possible, this page is pretty self-contained as well!

# Table of Contents

- [No Data Processing](#No-Data-Processing)
  - [Use Existing Modules](#Use-Existing-Modules)
  - [Customize Modules](#Customize-Modules)
    - [Customize Models](#Customize-Models)
    - [Customize Losses](#Customize-Losses)
    - [Customize Metrics](#Customize-Metrics)
- [With Data Processing](#With-Data-Processing)
  - [Out of the Loop](#Out-of-the-Loop)
  - [Utilize the Integrated `IMLData` System](#Utilize-the-Integrated-IMLData-System)
    - [Example](#Example)
    - [Explanations](#Explanations)
      - [General](#General)
      - [`build_with`](#build_with)
      - [`preprocess`](#preprocess)
      - [`dumps` / `loads`](#dumps-/-loads)
- [Processing Labels](#Processing-Labels)
- [Optional Callbacks](#Optional-Callbacks)
  - [`get_num_samples`](#get_num_samples)
  - [`fetch_batch`](#fetch_batch)
  - [`postprocess_batch`](#postprocess_batch)
  - [`postprocess_results`](#postprocess_results)

# Preparations

In [1]:
import torch
import cflearn

import numpy as np
import torch.nn as nn

from torch import Tensor
from typing import Dict

np.random.seed(142857)
torch.manual_seed(142857)

<torch._C.Generator at 0x1e5797cf4d0>

# No Data Processing

We will first jump into the typical situation where our data is already 'nice and clean', so we can use it as-is without any data processing procedure.

Here's the toy dataset that we'll use throughout this case:

In [2]:
dim = 10
num_sample = 1000

x = np.random.random([num_sample, dim])
w = np.random.random([dim, 1])
y = x.dot(w)

x.shape, y.shape

((1000, 10), (1000, 1))

## Use Existing Modules

`carefree-learn` already provides a bunch of modules that can be used directly in your ML tasks. We will show you how to specify each of them in this section, and illustrate how to customize each of them in the next section.

In [3]:
m = cflearn.api.fit_ml(
    x,
    y,
    # specify the `model`
    core_name="fcnn",
    # specify the `loss`
    loss_name="mae",
    # specify the `metrics`
    metric_names=["mae", "mse"],
    # some common settings
    output_dim=1,
    is_classification=False,
)

                                Internal Default Configurations Used by `carefree-learn`                                
------------------------------------------------------------------------------------------------------------------------
                             train_samples   |   1000
                             valid_samples   |   None
                         max_snapshot_file   |   25
                                 input_dim   |   10
                                 workplace   |   _logs\2022-08-18_19-45-53-344718
                             monitor_names   |   ['mean_std', 'plateau']
                      additional_callbacks   |   ['_log_metrics_msg', '_inject_loader_name']
                   log_metrics_msg_verbose   |   True
------------------------------------------------------------------------------------------------------------------------
Layer (type)                             Input Shape                             Output Shape    Trainable Param #
--------

And that's it! We can see that everything works like a charm and the final result is pretty decent.

We can inspect all the modules that `carefree-learn` supports with `supported_ml_models`, `supported_losses` and `supported_metrics`:

In [4]:
", ".join(cflearn.api.supported_ml_models())

'ddr, fcnn, fnet, linear, mixer, mixer_bake, mixer_r_dropout, nbm, ndt, pool_former, rnn, rnn_bake, transformer, wnd'

In [5]:
", ".join(cflearn.api.supported_losses())

'[PLACEHOLDER], adain, bce, corr, cross_entropy, ddr, focal, iou, label_smooth_cross_entropy, mae, mse, quantile, recon, sigmoid_mae, siren_vae, style_vae, vae, vae1d, vae2d, vq_vae'

In [6]:
", ".join(cflearn.api.supported_metrics())

'acc, auc, aux, ber, corr, f1, iou, mae, mse, quantile, r2'

## Customize Modules

`carefree-learn` also supports replacing any module in the workflow. In this section, we will show you how to customize your own models, losses and metrics.

### Customize Models

We can utilize `register_ml_module` to register any `nn.Module` into `carefree-learn`:

In [7]:
@cflearn.register_ml_module("my_fancy_linear")
class MyFancyLinear(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
    
    def forward(self, net):
        return self.linear(net)

m = cflearn.api.fit_ml(
    x,
    y,
    # THIS LINE IS CHANGED!
    core_name="my_fancy_linear",
    loss_name="mae",
    metric_names=["mae", "mse"],
    # some common settings
    output_dim=1,
    is_classification=False,
)

                                Internal Default Configurations Used by `carefree-learn`                                
------------------------------------------------------------------------------------------------------------------------
                             train_samples   |   1000
                             valid_samples   |   None
                         max_snapshot_file   |   25
                                 input_dim   |   10
                                 workplace   |   _logs\2022-08-18_19-46-00-931779
                             monitor_names   |   ['mean_std', 'plateau']
                      additional_callbacks   |   ['_log_metrics_msg', '_inject_loader_name']
                   log_metrics_msg_verbose   |   True
------------------------------------------------------------------------------------------------------------------------
Layer (type)                             Input Shape                             Output Shape    Trainable Param #
--------

As you can see, the result is almost perfect now!

> It is recommended to go through the [`models`](../models.ipynb) recipe for more details!

### Customize Losses

We can utilize `register_loss_module` to register any `nn.Module` into `carefree-learn`:

In [8]:
@cflearn.register_loss_module("my_fancy_loss")
class MyFancyLoss(nn.Module):
    def forward(self, predictions, target):
        return (predictions - target).abs()

m = cflearn.api.fit_ml(
    x,
    y,
    core_name="my_fancy_linear",
    # THIS LINE IS CHANGED!
    loss_name="my_fancy_loss",
    metric_names=["mae", "mse"],
    # some common settings
    output_dim=1,
    is_classification=False,
)

                                Internal Default Configurations Used by `carefree-learn`                                
------------------------------------------------------------------------------------------------------------------------
                             train_samples   |   1000
                             valid_samples   |   None
                         max_snapshot_file   |   25
                                 input_dim   |   10
                                 workplace   |   _logs\2022-08-18_19-46-10-486108
                             monitor_names   |   ['mean_std', 'plateau']
                      additional_callbacks   |   ['_log_metrics_msg', '_inject_loader_name']
                   log_metrics_msg_verbose   |   True
------------------------------------------------------------------------------------------------------------------------
Layer (type)                             Input Shape                             Output Shape    Trainable Param #
--------

As you can see, the result is also perfect! This is because `my_fancy_loss` is just a hand-written `mae` loss, so the result should not change.

> It is recommended to go through the [`losses`](../losses.ipynb) recipe for more details!

### Customize Metrics

We can utilize `register_metric` & `IMetric` to register any `IMetric`-like object into `carefree-learn`:

In [9]:
@cflearn.register_metric("my_mae")
class MyMAE(cflearn.IMetric):
    # False means that the smaller this metric is, the better.
    @property
    def is_positive(self) -> bool:
        return False

    # predictions : [N, 1]
    # labels      : [N, 1]
    def forward(self, predictions: np.ndarray, labels: np.ndarray) -> float:
        return np.abs(predictions - labels).mean().item()

m = cflearn.api.fit_ml(
    x,
    y,
    core_name="my_fancy_linear",
    loss_name="my_fancy_loss",
    # THIS LINE IS CHANGED!
    metric_names=["my_mae", "mae"],
    # some common settings
    output_dim=1,
    is_classification=False,
)

                                Internal Default Configurations Used by `carefree-learn`                                
------------------------------------------------------------------------------------------------------------------------
                             train_samples   |   1000
                             valid_samples   |   None
                         max_snapshot_file   |   25
                                 input_dim   |   10
                                 workplace   |   _logs\2022-08-18_19-46-21-400019
                             monitor_names   |   ['mean_std', 'plateau']
                      additional_callbacks   |   ['_log_metrics_msg', '_inject_loader_name']
                   log_metrics_msg_verbose   |   True
------------------------------------------------------------------------------------------------------------------------
Layer (type)                             Input Shape                             Output Shape    Trainable Param #
--------

As you can see, `my_mae` always equals `mae`, which is expected.

> It is recommended to go through the [`metrics`](../metrics.ipynb) recipe for more details!

# With Data Processing

Although it would be nice to have the dataset already processed, in most real-world cases we need to handle the data processing ourselves. In `carefree-learn`, there are two ways to handle data processing: out of the loop, or with the integrated `IMLData` system.

Here's the toy dataset that we'll use throughout this case:

In [10]:
dim = 10
num_sample = 1000

x = np.random.random([num_sample, dim])
_x_mean = x.mean(0)
_x_std  = x.std(0)
_x_normalized = (x - _x_mean) / _x_std
w = np.random.random([dim, 1])
y = _x_normalized.dot(w)

x.shape, y.shape

((1000, 10), (1000, 1))

You might notice that if we normalize the dataset correctly, this task will be exactly the same as the task we've encountered in the [No Data Processing](#No-Data-Processing) case!

## Run Without Data Processing

In order to demonstrate why data processing is important, let's run an experiment without it:

In [11]:
m = cflearn.api.fit_ml(
    x,
    y,
    core_name="my_fancy_linear",
    loss_name="my_fancy_loss",
    metric_names=["my_mae", "mae"],
    # some common settings
    output_dim=1,
    is_classification=False,
)

                                Internal Default Configurations Used by `carefree-learn`                                
------------------------------------------------------------------------------------------------------------------------
                             train_samples   |   1000
                             valid_samples   |   None
                         max_snapshot_file   |   25
                                 input_dim   |   10
                                 workplace   |   _logs\2022-08-18_19-46-31-637018
                             monitor_names   |   ['mean_std', 'plateau']
                      additional_callbacks   |   ['_log_metrics_msg', '_inject_loader_name']
                   log_metrics_msg_verbose   |   True
------------------------------------------------------------------------------------------------------------------------
Layer (type)                             Input Shape                             Output Shape    Trainable Param #
--------

As you can see, the result is much worse than before, although their 'internal' tasks are the same.
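
One way to convince ourselves that the 'internal' task has not changed: the targets built from normalized features are still an affine function of the raw features, only with rescaled weights and an extra bias. The following NumPy-only sketch checks this identity on synthetic data. The names `w_raw` and `b_raw` are introduced here for illustration and are not part of `carefree-learn`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1000, 10))
w = rng.random((10, 1))

# Targets are generated from the normalized features, as in the toy dataset.
mean, std = x.mean(0), x.std(0)
y = ((x - mean) / std).dot(w)

# The same targets, rewritten as an affine function of the raw features.
w_raw = w / std[:, None]
b_raw = -(mean / std).dot(w)
y_from_raw = x.dot(w_raw) + b_raw

print(np.allclose(y, y_from_raw))
```

This does not explain the training behavior by itself, but it suggests that the gap comes from how the problem is scaled rather than from what a linear model can represent.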

## Out of the Loop

The most intuitive way is to process the dataset 'out of the loop', that is, to process it 'somewhere else' before we start using `carefree-learn`. The disadvantage is that the final solution can hardly be 'out of the box'. One of the powerful features of `carefree-learn` is that we can save everything into a single `zip` file, which makes reproduction / deployment very easy. But if we process the dataset 'out of the loop', we have to save / load the data processing logic ourselves, which makes the solution harder to manage.
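
To make the bookkeeping cost concrete, here is a minimal sketch of what 'out of the loop' normalization can look like. It uses only NumPy and the standard library. The file name `normalization_stats.json` and the variable names are hypothetical, chosen for this illustration.

```python
import json
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.random((1000, 10))

# Training time: compute the statistics and normalize by hand.
stats = {"mean": x_train.mean(0).tolist(), "std": x_train.std(0).tolist()}
x_train_normalized = (x_train - np.array(stats["mean"])) / np.array(stats["std"])

# The statistics live in their own file, separate from any saved model.
stats_path = os.path.join(tempfile.mkdtemp(), "normalization_stats.json")
with open(stats_path, "w") as f:
    json.dump(stats, f)

# Inference time: the file must be found and loaded again before predicting.
with open(stats_path) as f:
    loaded = json.load(f)
x_new = rng.random((5, 10))
x_new_normalized = (x_new - np.array(loaded["mean"])) / np.array(loaded["std"])
```

Nothing here is difficult, but the statistics file has to travel together with the model, and every consumer has to remember to apply it.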

## Utilize the Integrated `IMLData` System

To make things easier, `carefree-learn` introduces the `IMLData` system. By following some protocols, we can integrate the data processing logic into `carefree-learn`, so it can be saved into the same `zip` file.

### Example

Here's a minimal example where we can integrate the normalization process with this system:

In [12]:
from cflearn import IMLData
from cflearn import IMLDataProcessor
from cflearn import IMLPreProcessedData
from cflearn import register_ml_data
from cflearn import register_ml_data_processor


@register_ml_data_processor("my_fancy_processor", allow_duplicate=True)
class MyFancyProcessor(IMLDataProcessor):
    mean: np.ndarray
    std: np.ndarray
    
    def build_with(self, x_train) -> None:
        self.mean = x_train.mean(0)
        self.std = x_train.std(0)

    def preprocess(self, x_train, y_train, x_valid, y_valid) -> IMLPreProcessedData:
        x_train = (x_train - self.mean) / self.std
        if x_valid is not None:
            x_valid = (x_valid - self.mean) / self.std
        return IMLPreProcessedData(x_train, y_train, x_valid, y_valid)

    def dumps(self):
        return {
            "mean": self.mean.tolist(),
            "std": self.std.tolist(),
        }

    def loads(self, dumped) -> None:
        self.mean = np.array(dumped["mean"], np.float32)
        self.std = np.array(dumped["std"], np.float32)


@register_ml_data("my_fancy_data")
class MyFancyData(IMLData):
    processor_type = "my_fancy_processor"

Although it seems like a lot of code, it should be pretty straightforward to understand. We'll dive into the details in the [Explanations](#Explanations) section; for now, let's try it out and see if everything works well:

In [13]:
m = cflearn.api.fit_ml(
    # THIS LINE IS CHANGED!
    MyFancyData(x, y),
    core_name="my_fancy_linear",
    loss_name="my_fancy_loss",
    metric_names=["my_mae", "mae"],
    # some common settings
    output_dim=1,
    is_classification=False,
)

                                Internal Default Configurations Used by `carefree-learn`                                
------------------------------------------------------------------------------------------------------------------------
                             train_samples   |   1000
                             valid_samples   |   None
                         max_snapshot_file   |   25
                                 input_dim   |   10
                                 workplace   |   _logs\2022-08-18_19-46-57-109509
                             monitor_names   |   ['mean_std', 'plateau']
                      additional_callbacks   |   ['_log_metrics_msg', '_inject_loader_name']
                   log_metrics_msg_verbose   |   True
------------------------------------------------------------------------------------------------------------------------
Layer (type)                             Input Shape                             Output Shape    Trainable Param #
--------

Great! As we can see, the result becomes perfect again.

As we mentioned before, the biggest advantage of utilizing the `IMLData` system is that we can save the data processing procedure into the same `zip` file that we use to save our models. For example, by running:

In [14]:
m.save("./test")

<cflearn.api.ml.pipeline.MLPipeline at 0x1e54471bbe0>

This will generate a `test.zip` file which can be loaded:

In [15]:
m2 = cflearn.api.load("./test")

We can use it to make predictions and see if the data processing procedure is preserved:

In [16]:
idata = m2.make_inference_data(x)
predictions = m2.predict(idata)[cflearn.PREDICTIONS_KEY]
# calculate the mae
np.abs(y - predictions).mean()

1.419603043329165e-07

As we can see, the data processing procedure is perfectly preserved! We can also inspect the processed data to consolidate our confidence:

In [17]:
x_processed = idata.train_data.x
x_processed.mean(0), x_processed.std(0)

(array([-1.04624677e-07,  8.98624838e-08,  9.77361483e-08,  2.68471631e-08,
        -9.40245746e-08, -3.79181705e-08,  5.08937793e-08, -7.58817618e-09,
         3.30258817e-08,  3.82606370e-08]),
 array([0.99999997, 0.99999996, 1.00000002, 1.00000001, 0.99999998,
        1.00000001, 0.99999995, 0.99999996, 1.00000002, 1.00000002]))

### Explanations

To get started, we can first write down these two parts:

```python
# The name, `xxx_processor`, can be an arbitrary name. Just make sure it is 'unique' so it will not collide with others.
@register_ml_data_processor("xxx_processor", allow_duplicate=True)
class Processor(IMLDataProcessor):
    ...

# The name, `xxx_data`, can be an arbitrary name. Just make sure it is 'unique' so it will not collide with others.
@register_ml_data("xxx_data")
class Data(IMLData):
    # make sure that the `processor_type` matches the name you used in the `register_ml_data_processor` above
    processor_type = "xxx_processor"
```

These will register a new `IMLDataProcessor`, and a new `IMLData` that uses it, into `carefree-learn`. After that, we can use them by constructing the training data with the new `IMLData`:

```python
data = Data(x, y)
```

A validation dataset is supported as well:

```python
data = Data(x_train, y_train, x_valid, y_valid)
```

With the `fit_ml` API, we can utilize the `data` easily:

```python
m = cflearn.api.fit_ml(
    data,
    ...
)
```

So the key question is: how do we define the details in `IMLDataProcessor`? We'll walk through each of its abstract methods in the following sections.

#### General

In order to reduce the amount of boilerplate code, `carefree-learn` allows you to 'choose' the arguments that you want for your implementations. For example, the `build_with` defined in `IMLDataProcessor` is:

```python
@abstractmethod
def build_with(
    self,
    config: Dict[str, Any],
    x_train: Union[np.ndarray, str],
    y_train: Optional[Union[np.ndarray, str]],
    x_valid: Optional[Union[np.ndarray, str]],
    y_valid: Optional[Union[np.ndarray, str]],
) -> None:
    pass
```

But in the `MyFancyProcessor`, we actually wrote:

```python
def build_with(self, x_train) -> None:
    self.mean = x_train.mean(0)
    self.std = x_train.std(0)
```

As you can see, we only 'chose' `x_train` for our implementations. In fact, you can 'choose' any number of arguments based on your actual requirements, without having to write down all five arguments defined in the `IMLDataProcessor` interface!

> In the following sections, we will introduce all the arguments defined in the interface, but keep in mind that you don't need to write down all of them in your own implementations, as you can choose what you need!
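
For intuition, this kind of argument 'choosing' can be built with signature inspection. The sketch below is one plausible mechanism and not necessarily how `carefree-learn` implements it. The helper `call_with_chosen_args` is a hypothetical name introduced for this illustration.

```python
import inspect
from typing import Any, Callable, Dict


def call_with_chosen_args(fn: Callable[..., Any], available: Dict[str, Any]) -> Any:
    """Call `fn` with only the keyword arguments that its signature declares."""
    wanted = inspect.signature(fn).parameters
    chosen = {k: v for k, v in available.items() if k in wanted}
    return fn(**chosen)


# An implementation that only 'chooses' `x_train` out of the five arguments.
def build_with(x_train):
    return f"built with x_train={x_train}"


available = dict(config={}, x_train="x", y_train="y", x_valid=None, y_valid=None)
result = call_with_chosen_args(build_with, available)
print(result)
```

The caller always prepares the full set of arguments, and each implementation receives only the ones it names.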

#### `build_with`

```python
@abstractmethod
def build_with(
    self,
    config: Dict[str, Any],
    x_train: Union[np.ndarray, str],
    y_train: Optional[Union[np.ndarray, str]],
    x_valid: Optional[Union[np.ndarray, str]],
    y_valid: Optional[Union[np.ndarray, str]],
) -> None:
    pass
```

- `config`: configurations specified by the corresponding `IMLData`.
  - It will be defined in the `processor_build_config` property, as explained below.
- `x_train`: training data, could be `str` when we need to handle file datasets.
  - See [`from_file`](./from_file.ipynb) recipe for more details.
- `y_train`: training labels, could be `None` if `x_train` is a file or not provided.
  - It is common that labels are not provided at inference time. 
- `x_valid`: validation data, could be `str` when we need to handle file datasets, could be `None` if not provided.
- `y_valid`: validation labels, could be `None` if not provided.

This method will only be called when we are instantiating an `IMLData` instance. Here's the pseudocode of the `__init__` process:

```python
class IMLData:
    def __init__(
        self,
        x_train,
        y_train,
        x_valid,
        y_valid,
        ...,
    ):
        ...
        # here, we should define `processor_build_config` to pass some configs to our processor
        kw = dict(
            config=self.processor_build_config,
            x_train=x_train,
            y_train=y_train,
            x_valid=x_valid,
            y_valid=y_valid,
        )
        # `build_with` will be called here
        processor.build_with(**kw)
    
    @property
    def processor_build_config(self) -> Dict[str, Any]:
        ...
        
```

So if our processor needs to be configured, we need to pass the configurations to our own `IMLData` instance, and then define them in the `processor_build_config` property:

In [18]:
@register_ml_data_processor("foo_processor", allow_duplicate=True)
class FooProcessor(IMLDataProcessor):
    def build_with(self, config, x_train):
        print("> x_train   ", x_train)
        print("> config.foo", config["foo"])
    
    def preprocess(self):
        pass
    
    def dumps(self):
        pass
    
    def loads(self):
        pass

@register_ml_data("foo_data", allow_duplicate=True)
class FooData(IMLData):
    processor_type = "foo_processor"
    
    def __init__(self, x_train, foo):
        # the extra assignments should take place before the `super` call, because
        # `processor_build_config` will be used in the `super` call
        self.foo = foo
        super().__init__(x_train)
    
    @property
    def processor_build_config(self):
        return dict(
            foo=self.foo,
        )

data = FooData("foo.train", 1.2345)

> x_train    foo.train
> config.foo 1.2345
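
The ordering constraint mentioned in the comments of `FooData.__init__` is a general Python matter, and can be reproduced without `carefree-learn`. In the sketch below, `Base`, `Good` and `Bad` are hypothetical stand-ins: `Base` reads a property during construction, just like the pseudocode of `IMLData.__init__` above.

```python
class Base:
    def __init__(self, x):
        self.x = x
        # The base class reads the property while it is being constructed.
        self.seen_config = self.build_config

    @property
    def build_config(self):
        return {}


class Good(Base):
    def __init__(self, x, foo):
        self.foo = foo  # assigned BEFORE the `super` call
        super().__init__(x)

    @property
    def build_config(self):
        return dict(foo=self.foo)


class Bad(Base):
    def __init__(self, x, foo):
        super().__init__(x)  # the property is read here, `foo` does not exist yet
        self.foo = foo

    @property
    def build_config(self):
        return dict(foo=self.foo)


print(Good("foo.train", 1.2345).seen_config)
try:
    Bad("foo.train", 1.2345)
except AttributeError as err:
    print("failed:", err)
```

`Good` works because `self.foo` already exists when the base class asks for the config, while `Bad` fails with an `AttributeError`.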


#### `preprocess`

```python
class IMLPreProcessedData(NamedTuple):
    x_train: np.ndarray
    y_train: Optional[np.ndarray] = None
    x_valid: Optional[np.ndarray] = None
    y_valid: Optional[np.ndarray] = None
    # if input_dim is not specified, `x_train.shape[-1]` will be used
    input_dim: Optional[int] = None
    num_history: Optional[int] = None
    num_classes: Optional[int] = None
    is_classification: Optional[bool] = None

@abstractmethod
def preprocess(
    self,
    config: Dict[str, Any],
    x_train: Union[np.ndarray, str],
    y_train: Optional[Union[np.ndarray, str]],
    x_valid: Optional[Union[np.ndarray, str]],
    y_valid: Optional[Union[np.ndarray, str]],
    *,
    for_inference: bool,
) -> IMLPreProcessedData:
    pass
```

- `config`: configurations specified by the corresponding `IMLData`.
  - It will be defined in the `processor_preprocess_config` property, as explained below.
- `x_train`: original training data, could be `str` when we need to handle file datasets.
  - See [`from_file`](./from_file.ipynb) recipe for more details.
- `y_train`: original training labels, could be `None` if `x_train` is a file or not provided.
- `x_valid`: original validation data, could be `None` if not provided.
- `y_valid`: original validation labels, could be `None` if not provided.

The `preprocess` method should return an `IMLPreProcessedData` namedtuple:
- `x_train`: preprocessed training features.
- `y_train`: preprocessed training labels, could be `None` if not provided.
  - It is common that labels are not provided at inference time.
- `x_valid`: preprocessed validation features, could be `None` if not provided.
- `y_valid`: preprocessed validation labels, could be `None` if not provided.
- `input_dim`: input feature dim that the model will receive.
  - If not provided, `x_train.shape[-1]` will be used.
  - If `encoder` is used, this setting will not represent the final input dim that your model will receive, because the `encoder` might 'expand' the dimension with some encoding methods.
- `num_history`: number of history, useful in time series tasks.
  - If not provided, we will use the default value defined in the pipeline.
- `num_classes`: number of classes, will be used as `output_dim` if `is_classification` is True & `output_dim` is not specified.
  - If not provided, we will use the default value defined in the pipeline.
- `is_classification`: whether current task is a classification task.
  - If not provided, we will use the default value defined in the pipeline.
  
This method, as its name suggests, will be called before `IMLData` constructs its datasets and dataloaders.
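
The optional fields and the `input_dim` fallback described above can be sketched with a local namedtuple. `PreProcessedSketch` and `resolve_input_dim` are hypothetical names used only for this illustration. They mirror the first five fields of `IMLPreProcessedData`, and are not taken from the library's own code.

```python
from typing import NamedTuple, Optional

import numpy as np


class PreProcessedSketch(NamedTuple):
    # Local stand-in mirroring the first five fields listed above.
    x_train: np.ndarray
    y_train: Optional[np.ndarray] = None
    x_valid: Optional[np.ndarray] = None
    y_valid: Optional[np.ndarray] = None
    input_dim: Optional[int] = None


def resolve_input_dim(data: PreProcessedSketch) -> int:
    # Documented fallback: use `x_train.shape[-1]` when `input_dim` is None.
    if data.input_dim is not None:
        return data.input_dim
    return data.x_train.shape[-1]


x = np.zeros((1000, 10))
implicit = PreProcessedSketch(x, None, None, None)
explicit = PreProcessedSketch(x, input_dim=4)  # 4 is an arbitrary value
print(resolve_input_dim(implicit), resolve_input_dim(explicit))
```

Because every field except `x_train` has a default, a processor can return only what it actually computed.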

#### `dumps` / `loads`

```python
@abstractmethod
def dumps(self) -> Any:
    pass

@abstractmethod
def loads(self, dumped: Any) -> None:
    pass
```

- `dumps`: return an object that holds the necessary information for the `loads` method.
- `loads`: set up the processor with the object (`dumped`) returned by the `dumps` method.

These two methods are the key parts of the serialization process. When we save / load our pipeline via `m.save` / `cflearn.api.load`, these methods will be called respectively at the proper places.

Although almost anything can be handled by `carefree-learn`, following these best practices can make your processor more lightweight and performant:

- Return a JSON-serializable object in the `dumps` method.
- Turn small `np.ndarray` into a python `list` (as we've done in the `MyFancyProcessor`).
- Save large `np.ndarray` to a certain path, and then dump the path instead of the `np.ndarray` itself.
  - This practice applies to other large objects as well.

> For the last suggestion, we should keep in mind that this situation can hardly happen if our processor is defined properly. Because in most cases, a processor should only hold the necessary information for processing the data, so it hardly needs to contain large `np.ndarray`s / objects!
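
The first two suggestions can be checked with a quick round trip through a JSON string. `StatsHolder` below is a hypothetical stand-in for a processor, not a `cflearn` class. It only shows that the list-based `dumps` / `loads` pair survives JSON, while a raw `np.ndarray` does not.

```python
import json

import numpy as np


class StatsHolder:
    """Hypothetical stand-in for a processor; not a `cflearn` class."""

    def __init__(self, mean=None, std=None):
        self.mean = mean
        self.std = std

    def dumps(self):
        # Small arrays are turned into plain python lists.
        return {"mean": self.mean.tolist(), "std": self.std.tolist()}

    def loads(self, dumped):
        self.mean = np.array(dumped["mean"])
        self.std = np.array(dumped["std"])


x = np.random.default_rng(0).random((1000, 10))
original = StatsHolder(x.mean(0), x.std(0))

# Simulate a save / load cycle through a JSON string.
payload = json.dumps(original.dumps())
restored = StatsHolder()
restored.loads(json.loads(payload))
print(np.array_equal(original.mean, restored.mean))

# A raw `np.ndarray` is rejected by the `json` module.
try:
    json.dumps({"mean": original.mean})
except TypeError as err:
    print("not serializable:", err)
```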

# Processing Labels

You might have noticed that the `preprocess` method takes in `y_train` and `y_valid`, and the returned `IMLPreProcessedData` also contains `y_train` and `y_valid`. That's because `carefree-learn` also supports processing labels with the `IMLData` system.

Here's the toy dataset that we'll use throughout this case:

In [19]:
dim = 10
num_sample = 1000

x = np.random.random([num_sample, dim])
w = np.random.random([dim, 1])
y = x.dot(w)
y = np.exp(y)

x.shape, y.shape

((1000, 10), (1000, 1))

Again, let's run an experiment without processing it:

In [20]:
m = cflearn.api.fit_ml(
    x,
    y,
    core_name="my_fancy_linear",
    loss_name="my_fancy_loss",
    metric_names=["my_mae", "mae"],
    # some common settings
    output_dim=1,
    is_classification=False,
)

                                Internal Default Configurations Used by `carefree-learn`                                
------------------------------------------------------------------------------------------------------------------------
                             train_samples   |   1000
                             valid_samples   |   None
                         max_snapshot_file   |   25
                                 input_dim   |   10
                                 workplace   |   _logs\2022-08-18_19-47-01-602958
                             monitor_names   |   ['mean_std', 'plateau']
                      additional_callbacks   |   ['_log_metrics_msg', '_inject_loader_name']
                   log_metrics_msg_verbose   |   True
------------------------------------------------------------------------------------------------------------------------
Layer (type)                             Input Shape                             Output Shape    Trainable Param #
--------

Although the internal task never changes, the result is not satisfying. Let's define a processor to overcome this issue:

In [21]:
@register_ml_data_processor("my_fancy_label_processor", allow_duplicate=True)
class MyFancyLabelProcessor(IMLDataProcessor):
    def build_with(self):
        pass

    def preprocess(self, x_train, y_train, x_valid, y_valid):
        if y_train is not None:
            y_train = np.log(y_train)
        if y_valid is not None:
            y_valid = np.log(y_valid)
        return IMLPreProcessedData(x_train, y_train, x_valid, y_valid)
    
    def dumps(self):
        pass
    
    def loads(self):
        pass

@register_ml_data("my_fancy_label_data", allow_duplicate=True)
class MyFancyLabelData(IMLData):
    processor_type = "my_fancy_label_processor"

m = cflearn.api.fit_ml(
    # THIS LINE IS CHANGED!
    MyFancyLabelData(x, y),
    core_name="my_fancy_linear",
    loss_name="my_fancy_loss",
    metric_names=["my_mae", "mae"],
    # some common settings
    output_dim=1,
    is_classification=False,
)

                                Internal Default Configurations Used by `carefree-learn`                                
------------------------------------------------------------------------------------------------------------------------
                             train_samples   |   1000
                             valid_samples   |   None
                         max_snapshot_file   |   25
                                 input_dim   |   10
                                 workplace   |   _logs\2022-08-18_19-47-23-857891
                             monitor_names   |   ['mean_std', 'plateau']
                      additional_callbacks   |   ['_log_metrics_msg', '_inject_loader_name']
                   log_metrics_msg_verbose   |   True
------------------------------------------------------------------------------------------------------------------------
Layer (type)                             Input Shape                             Output Shape    Trainable Param #
--------

Great! As we can see, the result becomes perfect again. Let's make an inference to ensure everything works well:

In [22]:
idata = m.make_inference_data(x)
predictions = m.predict(idata)[cflearn.PREDICTIONS_KEY]
# calculate the mae
np.abs(y - predictions).mean()

38.12670601237174

Oops! The result looks problematic! That's because we only defined the processing in the `preprocess` method, but we never told our processor that we need to 'postprocess' the inference results!

In order to tackle this problem, `carefree-learn` provides a `postprocess_results` callback:

In [23]:
@register_ml_data_processor("my_fancy_label_processor", allow_duplicate=True)
class MyFancyLabelProcessor(IMLDataProcessor):
    def build_with(self):
        pass

    def preprocess(self, x_train, y_train, x_valid, y_valid):
        if y_train is not None:
            y_train = np.log(y_train)
        if y_valid is not None:
            y_valid = np.log(y_valid)
        return IMLPreProcessedData(x_train, y_train, x_valid, y_valid)
    
    def dumps(self):
        pass
    
    def loads(self):
        pass
    
    def postprocess_results(self, forward):
        y = forward[cflearn.PREDICTIONS_KEY]
        y = np.exp(y)
        forward[cflearn.PREDICTIONS_KEY] = y
        return forward

m = cflearn.api.fit_ml(
    # this call is unchanged; only the processor above gained `postprocess_results`
    MyFancyLabelData(x, y),
    core_name="my_fancy_linear",
    loss_name="my_fancy_loss",
    metric_names=["my_mae", "mae"],
    # some common settings
    output_dim=1,
    is_classification=False,
)

                                Internal Default Configurations Used by `carefree-learn`                                
------------------------------------------------------------------------------------------------------------------------
                             train_samples   |   1000
                             valid_samples   |   None
                         max_snapshot_file   |   25
                                 input_dim   |   10
                                 workplace   |   _logs\2022-08-18_19-47-34-369372
                             monitor_names   |   ['mean_std', 'plateau']
                      additional_callbacks   |   ['_log_metrics_msg', '_inject_loader_name']
                   log_metrics_msg_verbose   |   True
------------------------------------------------------------------------------------------------------------------------
Layer (type)                             Input Shape                             Output Shape    Trainable Param #
--------

Let's make an inference again:

In [24]:
idata = m.make_inference_data(x)
predictions = m.predict(idata)[cflearn.PREDICTIONS_KEY]
# calculate the mae
np.abs(y - predictions).mean()

5.831679950732749e-06

Great! Now our predictions are 'postprocessed' as expected!

> More details of `postprocess_results` will be covered in the [Optional Callbacks](#postprocess_results) section.
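
To see why the inverse transform matters, here is a minimal NumPy-only sketch that does not touch `carefree-learn` at all. It assumes a hypothetical model whose log-space predictions are perfect, and the label values are made up for illustration:

```python
import numpy as np

# hypothetical positive labels, for illustration only
y_true = np.array([[10.0], [50.0], [120.0]])

# the processor trains the model on log-transformed labels,
# so a (hypothetically perfect) model predicts in log space
log_predictions = np.log(y_true)

# comparing log-space predictions with the raw labels gives a large error
mae_without_postprocess = np.abs(y_true - log_predictions).mean()

# applying the inverse transform (the role of `postprocess_results`) removes it
mae_with_postprocess = np.abs(y_true - np.exp(log_predictions)).mean()

print(mae_without_postprocess, mae_with_postprocess)
```

The first error is large even though the model is 'perfect', which is the same symptom we saw before `postprocess_results` was defined.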

Let's try serialization as well:

In [25]:
m.save("./test")
m2 = cflearn.api.load("./test")
idata = m2.make_inference_data(x)
predictions = m2.predict(idata)[cflearn.PREDICTIONS_KEY]
# calculate the mae
np.abs(y - predictions).mean()

5.831679950732749e-06

As we can see, the data processing procedure is perfectly preserved!

# Optional Callbacks

Besides the above processing methods, the `IMLData` system also provides several useful callbacks that give you full control over the data flow. We will introduce these callbacks in the following sections.

## `get_num_samples`

```python
def get_num_samples(self, x: np.ndarray) -> Optional[int]:
    return None
```

This method can override how the number of samples is calculated. `len(x)` will be used if `None` is returned.
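
As a rough illustration of the fallback rule, here is a standalone sketch. `WindowedSamples`, `DefaultSamples` and `resolve_num_samples` are hypothetical names invented for this example and are not part of `carefree-learn`; the sketch only mimics the 'use `len(x)` when `None` is returned' behavior described above:

```python
from typing import Optional

import numpy as np


class DefaultSamples:
    """Hypothetical stand-in that keeps the default behavior."""

    def get_num_samples(self, x: np.ndarray) -> Optional[int]:
        return None


class WindowedSamples:
    """Hypothetical stand-in where `window` consecutive rows form one sample."""

    def __init__(self, window: int):
        self.window = window

    def get_num_samples(self, x: np.ndarray) -> Optional[int]:
        return len(x) // self.window


def resolve_num_samples(processor, x: np.ndarray) -> int:
    # mirror the documented fallback: use `len(x)` when `None` is returned
    n = processor.get_num_samples(x)
    return len(x) if n is None else n


x_rows = np.zeros([12, 4])
default_count = resolve_num_samples(DefaultSamples(), x_rows)
windowed_count = resolve_num_samples(WindowedSamples(window=3), x_rows)
```

An override like this would presumably need a matching `fetch_batch`, so that an index refers to a whole window rather than a single row.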

## `fetch_batch`

```python
class IMLBatch(NamedTuple):
    input: np.ndarray
    labels: Optional[np.ndarray]
    others: Optional[np_dict_type] = None

def fetch_batch(
    self,
    x: np.ndarray,
    y: Optional[np.ndarray],
    indices: Union[int, List[int], np.ndarray],
) -> IMLBatch:
    return IMLBatch(x[indices], None if y is None else y[indices])
```

This method defines how a batch is fetched. The `others` field allows you to inject additional data into your batch, for example:

In [26]:
from cflearn import IMLBatch

@register_ml_data_processor("inject_foo_processor", allow_duplicate=True)
class InjectFooProcessor(IMLDataProcessor):
    def build_with(self):
        pass

    def preprocess(self, x_train, y_train):
        return IMLPreProcessedData(x_train, y_train)
    
    def dumps(self):
        pass
    
    def loads(self):
        pass
    
    def fetch_batch(self, x, y, indices):
        foo = np.full([len(indices), 1], 1.234)
        return IMLBatch(
            x[indices],
            None if y is None else y[indices],
            others={"foo": foo},
        )

@register_ml_data("inject_foo_data", allow_duplicate=True)
class InjectFooData(IMLData):
    processor_type = "inject_foo_processor"

@cflearn.register_ml_module("inject_foo_model", allow_duplicate=True)
class InjectFooModel(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
    
    def forward(self, batch):
        print("> foo")
        print(batch["foo"][:3])
        return self.linear(batch[cflearn.INPUT_KEY])

m = cflearn.api.fit_ml(
    InjectFooData(x, y),
    core_name="inject_foo_model",
    loss_name="my_fancy_loss",
    metric_names=["my_mae", "mae"],
    # some common settings
    output_dim=1,
    is_classification=False,
    # debug
    debug=True,
)

                                Internal Default Configurations Used by `carefree-learn`                                
------------------------------------------------------------------------------------------------------------------------
                             train_samples   |   1000
                             valid_samples   |   None
                         max_snapshot_file   |   25
                                 input_dim   |   10
                                 workplace   |   _logs\2022-08-18_19-47-45-463111
                             monitor_names   |   ['mean_std', 'plateau']
                      additional_callbacks   |   ['_log_metrics_msg', '_inject_loader_name']
                   log_metrics_msg_verbose   |   True
------------------------------------------------------------------------------------------------------------------------
> foo
tensor([[1.2340]])
Layer (type)                             Input Shape                             Output Shape    T
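
Since the signature allows `indices` to be a single `int`, a custom `fetch_batch` that calls `len(indices)` may want to normalize it first. The sketch below is standalone: `Batch` is a local mirror of the `IMLBatch` definition shown above, `np_dict_type` is assumed to be `Dict[str, np.ndarray]`, and `fetch_batch_with_foo` is a hypothetical helper. Whether a leading batch axis is desired for an `int` index is also an assumption:

```python
from typing import Dict, NamedTuple, Optional

import numpy as np


# local mirror of `IMLBatch`, so that the sketch runs without `cflearn`
class Batch(NamedTuple):
    input: np.ndarray
    labels: Optional[np.ndarray]
    others: Optional[Dict[str, np.ndarray]] = None


def fetch_batch_with_foo(x, y, indices) -> Batch:
    # turn an `int` into a 1D array so that `len(indices)` always works
    indices = np.atleast_1d(indices)
    foo = np.full([len(indices), 1], 1.234)
    return Batch(
        x[indices],
        None if y is None else y[indices],
        others={"foo": foo},
    )


data = np.random.random([20, 10])
labels = np.random.random([20, 1])
single = fetch_batch_with_foo(data, labels, 3)
multiple = fetch_batch_with_foo(data, None, [0, 1, 2])
```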

## `postprocess_batch`

```python
# changes can happen inplace
def postprocess_batch(self, batch: np_dict_type) -> np_dict_type:
    return batch
```

This method allows you to postprocess the batch.
> In most cases, specifying `fetch_batch` is already enough.
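
For illustration, here is a minimal standalone sketch of an in-place batch postprocess. The `INPUT_KEY` string below is a hypothetical placeholder (in real code you would likely use `cflearn.INPUT_KEY`), and `np_dict_type` is assumed to be `Dict[str, np.ndarray]`:

```python
from typing import Dict

import numpy as np

# hypothetical key name, used only to keep this sketch self-contained
INPUT_KEY = "input"


def postprocess_batch(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    # replace the input entry with a float32 copy; the dict itself is modified in place
    batch[INPUT_KEY] = batch[INPUT_KEY].astype(np.float32)
    return batch


batch = {INPUT_KEY: np.zeros([2, 3])}
processed = postprocess_batch(batch)
```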

## `postprocess_results`

```python
# changes can happen inplace
def postprocess_results(
    self,
    forward: np_dict_type,
    *,
    return_classes: bool,
    binary_threshold: float,
    return_probabilities: bool,
) -> np_dict_type:
    return forward
```

This method allows you to postprocess the inference results.
> Notice that:
> - the keyword arguments of this method are 'optional': you can omit them from your own signature, as `MyFancyLabelProcessor` does above.
> - this will only affect the inference methods (e.g. `predict`) and will not affect the training process.

- `forward`: the 'raw' inference results.
- `return_classes`: whether we need to return class labels instead of the 'raw' results (e.g. logits).
- `binary_threshold`: threshold used in binary classification tasks.
- `return_probabilities`: whether we need to return the probability predictions.

Although this method receives some flags (`return_classes`, etc.), they only tell you what kind of data `forward` holds. You do not need to handle them yourself, because they have already been applied. For example:

In [27]:
@register_ml_data_processor("inspect_processor", allow_duplicate=True)
class InspectProcessor(IMLDataProcessor):
    def build_with(self):
        pass

    def preprocess(self, x_train, y_train):
        return IMLPreProcessedData(x_train, y_train)
    
    def dumps(self):
        pass
    
    def loads(self):
        pass
    
    def postprocess_results(self, forward, *, return_classes, binary_threshold, return_probabilities):
        print("> return_classes", return_classes)
        print("> binary_threshold", binary_threshold)
        print("> return_probabilities", return_probabilities)
        print(forward[cflearn.PREDICTIONS_KEY][:3])
        return forward

@register_ml_data("inspect_data", allow_duplicate=True)
class InspectData(IMLData):
    processor_type = "inspect_processor"


x = np.random.randn(100, 5)
w = np.random.randn(5, 1)
y = (x.dot(w) > 0.0).astype(int)

m = cflearn.api.fit_ml(
    InspectData(x, y),
    core_name="linear",
    # some common settings
    output_dim=2,
    is_classification=True,
)

                                Internal Default Configurations Used by `carefree-learn`                                
------------------------------------------------------------------------------------------------------------------------
                             train_samples   |   100
                             valid_samples   |   None
                         max_snapshot_file   |   25
                                 input_dim   |   5
                                 loss_name   |   focal
                                 workplace   |   _logs\2022-08-18_19-47-45-508937
                             monitor_names   |   ['mean_std', 'plateau']
                      additional_callbacks   |   ['_log_metrics_msg', '_inject_loader_name']
                   log_metrics_msg_verbose   |   True
                              metric_names   |   ['acc', 'auc']
------------------------------------------------------------------------------------------------------------------------
Layer 

In [28]:
idata = m.make_inference_data(x)
predictions = m.predict(idata)

> return_classes False
> binary_threshold 0.5
> return_probabilities False
[[0.3514592  0.4325828 ]
 [0.78386474 0.3538649 ]
 [0.46501547 0.35370737]]


In [29]:
predictions = m.predict(idata, return_classes=True)

> return_classes True
> binary_threshold 0.5
> return_probabilities False
[[1]
 [0]
 [0]]


In [30]:
predictions = m.predict(idata, return_probabilities=True)

> return_classes False
> binary_threshold 0.5
> return_probabilities True
[[0.47973022 0.52026975]
 [0.60587364 0.3941264 ]
 [0.5277983  0.47220165]]


As shown above, by the time `postprocess_results` is called, `forward` has already been processed according to those flags, so we can focus on the 'real' postprocess procedure in the `postprocess_results` method.
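
The printed values above appear consistent with a softmax over the raw outputs (for `return_probabilities=True`) and an argmax over the classes (for `return_classes=True`). The NumPy sketch below reproduces them from the printed raw outputs. It is only a sanity check under that assumption, not a description of how `carefree-learn` implements these flags internally (for instance, a thresholded probability with `binary_threshold=0.5` would give the same classes here):

```python
import numpy as np

# raw outputs printed by the plain `predict(idata)` call above
logits = np.array(
    [
        [0.3514592, 0.4325828],
        [0.78386474, 0.3538649],
        [0.46501547, 0.35370737],
    ]
)

# numerically stable softmax over the class axis
shifted = logits - logits.max(axis=1, keepdims=True)
exp = np.exp(shifted)
probabilities = exp / exp.sum(axis=1, keepdims=True)

# class labels as the index of the largest raw output
classes = logits.argmax(axis=1).reshape(-1, 1)

print(probabilities)
print(classes)
```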