## Iris

Here is some of the information provided by the official website:

```text
This is perhaps the best known database to be found in the pattern recognition literature.
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
Predicted attribute: class of iris plant.
```

And here's the pandas-view of the raw data:

```text
      f0   f1   f2   f3           label
0    5.1  3.5  1.4  0.2     Iris-setosa
1    4.9  3.0  1.4  0.2     Iris-setosa
2    4.7  3.2  1.3  0.2     Iris-setosa
3    4.6  3.1  1.5  0.2     Iris-setosa
4    5.0  3.6  1.4  0.2     Iris-setosa
..   ...  ...  ...  ...             ...
145  6.7  3.0  5.2  2.3  Iris-virginica
146  6.3  2.5  5.0  1.9  Iris-virginica
147  6.5  3.0  5.2  2.0  Iris-virginica
148  6.2  3.4  5.4  2.3  Iris-virginica
149  5.9  3.0  5.1  1.8  Iris-virginica

[150 rows x 5 columns]
```
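
A view like the one above could be produced along these lines. This is a minimal sketch: the column names `f0` to `f3` and `label` are copied from the view, the three sample rows are taken from it, and the use of `pd.read_csv` with `header=None` is an assumption about how the view was built (the raw file has no header row).

```python
import io

import pandas as pd

# Three rows in the same comma-separated layout as `iris.data`
# (values copied from the view above; the real file has 150 rows).
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "6.7,3.0,5.2,2.3,Iris-virginica\n"
)

# The file has no header row, so the column names are supplied explicitly.
columns = ["f0", "f1", "f2", "f3", "label"]
df = pd.read_csv(sample, header=None, names=columns)
print(df)
```

Replacing `sample` with the path to a downloaded `iris.data` should give the full 150-row view.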

> We don't use pandas in our code, but it is a convenient way to visualize the data 🤣
>
> You can download the raw data (`iris.data`) with [this link](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data).

In [1]:
# preparations

import os
import torch
import pickle
import cflearn
import numpy as np
from cflearn.toolkit import seed_everything

seed_everything(123)

123

### Basic Usages

Traditionally, we need to process the raw data before we feed them into our machine learning models (e.g. encode the label column, which is a string column, into an ordinal column). In `carefree-learn`, however, we can train neural networks directly on files without worrying about the rest:

In [2]:
processor_config = cflearn.MLBundledProcessorConfig(has_header=False, num_split=25)
data = cflearn.MLData.init(processor_config=processor_config).fit("iris.data")
config = cflearn.MLConfig(
    module_name="fcnn",
    module_config=dict(input_dim=data.num_features, output_dim=data.num_labels),
    loss_name="focal",
    metric_names=["acc", "auc"],
)
m = cflearn.api.fit_ml(data, config=config)

                                    Internal Default Configurations Used by `carefree-learn`                                    
--------------------------------------------------------------------------------------------------------------------------------
                                                   train_samples   |   125
                                                   valid_samples   |   25
                                               max_snapshot_file   |   25
                                                encoder_settings   |   {}
                                                       workspace   |   _logs/2023-12-03_09-48-13-631981
                                  module_config.encoder_settings   |   {}
                                                 index_mapping.0   |   0
                                                 index_mapping.1   |   1
                                                 index_mapping.2   |   2
                                                

Under the hood, `carefree-learn` parses `iris.data` automatically and splits it into a training set (125 samples) and a validation set (25 samples), then trains a fully connected neural network (`fcnn`) on them.
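
For comparison, the manual preprocessing that is being skipped might look roughly like this. It is a sketch under stated assumptions, not `carefree-learn`'s actual implementation: the helper names `parse_rows` and `split_train_valid` are hypothetical, and only six rows from the view above are used.

```python
import numpy as np

# Six raw rows in the `iris.data` layout: 4 numeric columns + a string label.
raw_rows = [
    "5.1,3.5,1.4,0.2,Iris-setosa",
    "4.9,3.0,1.4,0.2,Iris-setosa",
    "4.7,3.2,1.3,0.2,Iris-setosa",
    "6.7,3.0,5.2,2.3,Iris-virginica",
    "6.3,2.5,5.0,1.9,Iris-virginica",
    "6.5,3.0,5.2,2.0,Iris-virginica",
]


def parse_rows(lines):
    """Split each line into float features and an ordinal-encoded label."""
    cells = [line.split(",") for line in lines]
    x = np.array([row[:-1] for row in cells], dtype=np.float64)
    names = np.array([row[-1] for row in cells])
    # np.unique sorts the class names; `y` holds each row's index into `classes`.
    classes, y = np.unique(names, return_inverse=True)
    return x, y, classes


def split_train_valid(x, y, num_valid, seed=123):
    """Shuffle the rows, then hold out `num_valid` of them for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    valid_idx, train_idx = order[:num_valid], order[num_valid:]
    return (x[train_idx], y[train_idx]), (x[valid_idx], y[valid_idx])


x, y, classes = parse_rows(raw_rows)
(train_x, train_y), (valid_x, valid_y) = split_train_valid(x, y, num_valid=2)
print(classes.tolist())
print(y.tolist())
print(train_x.shape, valid_x.shape)
```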

We can further inspect the processed data if we want to know how `carefree-learn` actually parsed the input data:

In [3]:
data = m.data
x_train = data.train_dataset.x
print("> mean", x_train.mean(0))
print("> std ", x_train.std(0))

> mean [ 3.65485420e-16  1.25832678e-15  3.65929509e-16 -4.60076421e-16]
> std  [1. 1. 1. 1.]


This shows that the raw data has been normalized to `mean=0.0` and `std=1.0` (the tiny nonzero means are floating-point round-off), a form that neural networks can readily accept.

We can also inspect the validation dataset:

In [4]:
data = m.data
x_valid = data.valid_dataset.x
print("> mean", x_valid.mean(0))
print("> std ", x_valid.std(0))

> mean [0.13588234 0.00552686 0.08309051 0.07267612]
> std  [1.04254617 0.97065931 1.01830191 1.09129383]


These results show that, by default, the dataset is split before it is normalized, so the validation set ends up with a mean close to, but not exactly, 0 and a std close to, but not exactly, 1. This avoids data leakage and ensures that the validation set has a distribution closer to that of a real-world test set.
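
The effect can be reproduced with plain NumPy. The sketch below uses synthetic data as a stand-in for the iris features, and the helpers `fit_stats` and `normalize` are hypothetical names, not part of `carefree-learn`. It assumes the statistics are computed on the training rows only and then reused for the validation rows.

```python
import numpy as np


def fit_stats(x):
    """Compute per-column mean and std from the given rows only."""
    return x.mean(axis=0), x.std(axis=0)


def normalize(x, mean, std):
    """Apply previously computed statistics to any split."""
    return (x - mean) / std


# Synthetic stand-in for the 150 x 4 feature matrix (not the real iris values).
rng = np.random.default_rng(123)
x = rng.normal(loc=5.0, scale=2.0, size=(150, 4))

# Split first: 125 training rows, 25 validation rows.
x_train, x_valid = x[:125], x[125:]

# Statistics come from the training rows only, then are reused for validation.
mean, std = fit_stats(x_train)
x_train_norm = normalize(x_train, mean, std)
x_valid_norm = normalize(x_valid, mean, std)

print("> train mean", x_train_norm.mean(0))
print("> train std ", x_train_norm.std(0))
print("> valid mean", x_valid_norm.mean(0))
print("> valid std ", x_valid_norm.std(0))
```

The training split comes out with mean 0 and std 1 up to round-off, while the validation split is only approximately standardized, which matches the pattern in the two outputs above.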

> On the other hand, if you want to calculate the statistics on both the training and validation sets, you can simply set the `split_before_preprocess` argument to `False` in `MLBundledProcessorConfig`.

After training on files, `carefree-learn` can predict & evaluate on files directly as well. We'll handle the data parsing and normalization for you automatically:

In [5]:
loader = data.build_loader("iris.data")
predictions = m.predict(loader)
# evaluations could be achieved easily with cflearn.api.evaluate
cflearn.api.evaluate(loader, dict(m=m))

|        metrics         |                       acc                        |                       auc                        |
--------------------------------------------------------------------------------------------------------------------------------
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
--------------------------------------------------------------------------------------------------------------------------------
|           m            |    0.846666    |    0.000000    |    0.846666    |    0.971666    |    0.000000    |    0.971666    |


{'acc': {'m': Statistics(sign=1.0, mean=0.8466666666666667, std=0.0, score=0.8466666666666667)},
 'auc': {'m': Statistics(sign=1.0, mean=0.9716666666666667, std=0.0, score=0.9716666666666667)}}

### Benchmarking

As we know, neural networks are trained with **_stochastic_** gradient descent (and its variants), which introduces some randomness into the final result even when we train on the same dataset. We therefore need to repeat the same task several times to estimate the bias & variance of our neural networks.

Fortunately, `carefree-learn` provides the `repeat_ml` API, which achieves this with only a few lines of code:

In [6]:
# With num_repeat=3 specified, we'll train 3 models on `iris.data`.
results = cflearn.api.repeat_ml(data, m.config, num_repeat=3)
pipelines = cflearn.api.load_pipelines(results)
cflearn.api.evaluate(loader, pipelines)

  0%|                                                                                                                       | 0/3 [00:00<?, ?it/s]

                                    Internal Default Configurations Used by `carefree-learn`                                    
--------------------------------------------------------------------------------------------------------------------------------
                                                   train_samples   |   125
                                                   valid_samples   |   25
                                  module_config.encoder_settings   |   {}
                                                 index_mapping.0   |   0
                                                 index_mapping.1   |   1
                                                 index_mapping.2   |   2
                                                 index_mapping.3   |   3
--------------------------------------------------------------------------------------------------------------------------------
                                                    External Configurations                       

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:13<00:00,  4.66s/it]

|        metrics         |                       acc                        |                       auc                        |
--------------------------------------------------------------------------------------------------------------------------------
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
--------------------------------------------------------------------------------------------------------------------------------
|          fcnn          |    0.837777    |    0.011331    |    0.826446    |    0.976911    |    0.004713    |    0.972197    |





{'acc': {'fcnn': Statistics(sign=1.0, mean=0.8377777777777778, std=0.011331154474650655, score=0.8264466233031272)},
 'auc': {'fcnn': Statistics(sign=1.0, mean=0.9769111111111112, std=0.00471320708092209, score=0.9721979040301891)}}
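
The `Statistics` values printed above are consistent with `score = sign * mean - std`, where `std` is the population standard deviation. This relationship is inferred from the printed numbers, not taken from the library's source. The sketch below illustrates it with hypothetical per-run accuracies, chosen so that the summary lines up with the `fcnn` row.

```python
import statistics

# Hypothetical per-run accuracies on the 150 samples (illustrative only,
# not values read back from the trained pipelines).
accuracies = [128 / 150, 124 / 150, 125 / 150]

# sign=1.0 is what the output above prints for both `acc` and `auc`.
sign = 1.0

mean = statistics.fmean(accuracies)
std = statistics.pstdev(accuracies)  # population standard deviation
score = sign * mean - std  # assumed relationship, inferred from the output

print(f"mean={mean:.6f} std={std:.6f} score={score:.6f}")
```

Under this reading, the score penalizes a model whose results vary a lot between runs, so a stable model can outrank one with a slightly higher mean.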

We can also compare performance across different models:

In [7]:
# With modules=["linear", "fcnn"], we'll train both linear module and fcnn module.
modules = ["linear", "fcnn"]
results = cflearn.api.repeat_ml(data, m.config, modules=modules, num_repeat=3)
pipelines = cflearn.api.load_pipelines(results)
cflearn.api.evaluate(loader, pipelines)

  0%|                                                                                                                       | 0/6 [00:00<?, ?it/s]

                                    Internal Default Configurations Used by `carefree-learn`                                    
--------------------------------------------------------------------------------------------------------------------------------
                                                   train_samples   |   125
                                                   valid_samples   |   25
                                  module_config.encoder_settings   |   {}
                                                 index_mapping.0   |   0
                                                 index_mapping.1   |   1
                                                 index_mapping.2   |   2
                                                 index_mapping.3   |   3
--------------------------------------------------------------------------------------------------------------------------------
                                                    External Configurations                       

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:27<00:00,  4.58s/it]

|        metrics         |                       acc                        |                       auc                        |
--------------------------------------------------------------------------------------------------------------------------------
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
--------------------------------------------------------------------------------------------------------------------------------
|          fcnn          | -- 0.880000 -- | -- 0.053610 -- | -- 0.826389 -- | -- 0.971755 -- | -- 0.002305 -- | -- 0.969449 -- |
--------------------------------------------------------------------------------------------------------------------------------
|         linear         |    0.235555    |    0.333125    |    -0.09757    |    0.387800    |    0.347282    |    0.040517    |





{'acc': {'fcnn': Statistics(sign=1.0, mean=0.88, std=0.053610391474732545, score=0.8263896085252674),
  'linear': Statistics(sign=1.0, mean=0.23555555555555555, std=0.33312586135899575, score=-0.0975703058034402)},
 'auc': {'fcnn': Statistics(sign=1.0, mean=0.9717555555555556, std=0.0023057630428725473, score=0.969449792512683),
  'linear': Statistics(sign=1.0, mean=0.3878000000000001, std=0.3472822525885292, score=0.040517747411470906)}}

It is worth mentioning that `carefree-learn` supports distributed training, which means that when we need to perform large-scale benchmarking (e.g. train 100 models), we can accelerate the process through multiprocessing:

> In `carefree-learn`, Distributed Training in Machine Learning tasks sometimes doesn't mean training your model on multiple GPUs or multiple machines. Instead, it may mean training multiple models at the same time.

In [8]:
# With num_jobs=2, we will launch 2 processes to run the tasks in a distributed way.
results = cflearn.api.repeat_ml(data, m.config, modules=modules, num_repeat=3, num_jobs=2)

  0%|                                                                                                                       | 0/6 [00:00<?, ?it/s]

                                    Internal Default Configurations Used by `carefree-learn`                                    
--------------------------------------------------------------------------------------------------------------------------------
                                                   train_samples   |   125
                                                   valid_samples   |   25
                                  module_config.encoder_settings   |   {}
                                                 index_mapping.0   |   0
                                                 index_mapping.1   |   1
                                                 index_mapping.2   |   2
                                                 index_mapping.3   |   3
--------------------------------------------------------------------------------------------------------------------------------
                                                    External Configurations                       

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:15<00:00,  2.58s/it]


On the iris dataset, however, launching distributed training may actually hurt the speed: the dataset contains only 150 samples, so the overhead of distributed training can outweigh the gains.

### Advanced Benchmarking

This is not enough, however, because we also want to know whether other models (e.g., `scikit-learn` models) can achieve better performance than `carefree-learn` models. In this case, we can perform advanced benchmarking with the `Experiment` helper class.

In [9]:
experiment = cflearn.dist.ml.Experiment()
data_folder = experiment.dump_data(data)

# Add carefree-learn tasks
experiment.add_task(module="fcnn", config=config, data_folder=data_folder)
experiment.add_task(module="linear", config=config, data_folder=data_folder)
# Add scikit-learn tasks
run_command = "python run_sklearn.py"
common_kwargs = {"run_command": run_command, "data_folder": data_folder}
experiment.add_task(module="decision_tree", **common_kwargs)
experiment.add_task(module="random_forest", **common_kwargs)

'/Users/heyujian/Documents/GitHub/carefree-learn-v0.5.x/examples/ml/iris/_experiment/random_forest/0'

Notice that we specified `run_command="python run_sklearn.py"` for the `scikit-learn` tasks, which means `Experiment` will try to execute this command in the current working directory to train the `scikit-learn` models. The good news is that we do not need to specify any command line arguments, because `Experiment` will handle those for us.

Here is basically what a `run_sklearn.py` should look like ([source code](run_sklearn.py)):

```python
import os
import pickle

import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from cflearn.constants import INPUT_KEY
from cflearn.constants import LABEL_KEY
from cflearn.dist.ml.runs._utils import get_info


if __name__ == "__main__":
    info = get_info()
    meta = info.meta
    # data
    data = info.data
    assert data is not None
    data.prepare(None)
    loader = data.initialize()[0]
    dataset = loader.get_full_batch()
    x, y = dataset[INPUT_KEY], dataset[LABEL_KEY]
    assert isinstance(x, np.ndarray)
    assert isinstance(y, np.ndarray)
    # model
    model = meta["model"]
    if model == "decision_tree":
        base = DecisionTreeClassifier
    elif model == "random_forest":
        base = RandomForestClassifier
    else:
        raise NotImplementedError(f"unsupported model: {model}")
    sk_model = base()
    # train & save
    sk_model.fit(x, y.ravel())
    with open(os.path.join(info.workplace, "sk_model.pkl"), "wb") as f:
        pickle.dump(sk_model, f)

```

With `run_sklearn.py` defined, we could run those tasks with one line of code:

In [10]:
results = experiment.run_tasks()

  0%|                                                                                                                       | 0/4 [00:00<?, ?it/s]

                                    Internal Default Configurations Used by `carefree-learn`                                    
--------------------------------------------------------------------------------------------------------------------------------
                                                   train_samples   |   125
                                                   valid_samples   |   25
                                               max_snapshot_file   |   25
                                                encoder_settings   |   {}
                                  module_config.encoder_settings   |   {}
                                                 index_mapping.0   |   0
                                                 index_mapping.1   |   1
                                                 index_mapping.2   |   2
                                                 index_mapping.3   |   3
                                                   monitor_names   |   ['mean_s

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:18<00:00,  4.52s/it]


After the tasks finish running, we should be able to see the following file structure in the current working directory:

```text
|--- _experiment
   |--- __data__
      |-- npd
      |-- id.txt
      |-- info.json
   |--- fcnn/0
      |-- __meta__.json
      |-- __dl_config__
      |-- pipeline
   |--- linear/0
      |-- ...
   |--- decision_tree/0
      |-- __meta__.json
      |-- sk_model.pkl
   |--- random_forest/0
      |-- ...
```

As expected, `carefree-learn` pipelines are saved into the `pipeline` folder, while `scikit-learn` models are saved into `sk_model.pkl` files. Since these models are not yet loaded, we need to load them into our environment manually:

In [11]:
pipelines = cflearn.api.load_pipelines(results)
for workspace, workspace_key in zip(results.workspaces, results.workspace_keys):
    module = workspace_key[0]
    if module in ["decision_tree", "random_forest"]:
        model_file = os.path.join(workspace, "sk_model.pkl")
        with open(model_file, "rb") as f:
            predictor = cflearn.SKLearnClassifier(pickle.load(f))
            pipelines[module] = cflearn.GeneralEvaluationPipeline(config, predictor)

After that, we can finally benchmark these models:

In [12]:
cflearn.api.evaluate(loader, pipelines)

|        metrics         |                       acc                        |                       auc                        |
--------------------------------------------------------------------------------------------------------------------------------
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
--------------------------------------------------------------------------------------------------------------------------------
|     decision_tree      | -- 1.000000 -- | -- 0.000000 -- | -- 1.000000 -- | -- 1.000000 -- | -- 0.000000 -- | -- 1.000000 -- |
--------------------------------------------------------------------------------------------------------------------------------
|          fcnn          |    0.660000    | -- 0.000000 -- |    0.660000    |    0.984299    | -- 0.000000 -- |    0.984299    |
-------------------------------------------------------------------------------------------------



{'acc': {'decision_tree': Statistics(sign=1.0, mean=1.0, std=0.0, score=1.0),
  'fcnn': Statistics(sign=1.0, mean=0.66, std=0.0, score=0.66),
  'linear': Statistics(sign=1.0, mean=0.7, std=0.0, score=0.7),
  'random_forest': Statistics(sign=1.0, mean=1.0, std=0.0, score=1.0)},
 'auc': {'decision_tree': Statistics(sign=1.0, mean=1.0, std=0.0, score=1.0),
  'fcnn': Statistics(sign=1.0, mean=0.9842999999999998, std=0.0, score=0.9842999999999998),
  'linear': Statistics(sign=1.0, mean=0.8833333333333333, std=0.0, score=0.8833333333333333),
  'random_forest': Statistics(sign=1.0, mean=1.0, std=0.0, score=1.0)}}

### Conclusions

This notebook covers only a subset of the features that `carefree-learn` offers, but we've already walked through many of the basic, common steps we'll encounter in real-life machine learning tasks.