# PyTorch Custom Dataset Example (Wine Quality)

## Introduction

The goal of this example is to show how to train with a locally defined `CustomDataset` by embedding it in a `TorchTrainingPlan`, then launching a small federated run from the Researcher with `Experiment`. We use a single node for clarity and the Wine Quality (red) CSV as a toy dataset.

Setup assumptions:

- You have at least one Fed‑BioMed node running locally
- You added a dataset entry that points to a directory holding or receiving the CSV (see the dataset add note below)

### 1) Custom Dataset class

`CustomDataset` lets you make any file or folder usable for training, as long as you follow these rules:

- Implement:

    - `read(self)`: load and preprocess data once (e.g., read CSV, build arrays) and store the result in instance fields.
    - `__len__(self)`: return the number of samples.
    - `get_item(self, index)`: return exactly a tuple `(data, target)` for the given index.

- Do not override `__init__` or `__getitem__`. These are managed by Fed‑BioMed so it can apply format checks and safety validations.
- `target` can be `None` in unsupervised settings. In supervised tasks (like this one), return a proper target (float, int, tensor, etc.).

**Good practice:**

- Prepare features as float arrays/tensors.
- Keep `get_item` focused and fast; heavy work belongs in `read`.
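
The division of labour described above can be pictured with a small stand-in. The sketch below does not use Fed‑BioMed: `BaseDatasetSketch` is a hypothetical base class that only imitates the idea that the framework owns `__init__` and `__getitem__` while the subclass supplies `read`, `__len__` and `get_item`. How the real `CustomDataset` performs its checks is not shown in this example and may differ.

```python
class BaseDatasetSketch:
    """Hypothetical stand-in for a framework-managed base class (not Fed-BioMed code)."""

    def __init__(self):
        # The "framework" triggers the one-time loading step.
        self.read()

    def __getitem__(self, index):
        # The "framework" wraps get_item so it can validate what comes back.
        item = self.get_item(index)
        if not (isinstance(item, tuple) and len(item) == 2):
            raise TypeError("get_item must return exactly a (data, target) tuple")
        return item


class SquaresDataset(BaseDatasetSketch):
    """Toy subclass: implements only the three user-facing methods."""

    def read(self):
        # Heavy work happens once: build and cache the arrays.
        self._x = [float(i) for i in range(5)]
        self._y = [v * v for v in self._x]

    def __len__(self):
        return len(self._y)

    def get_item(self, index):
        # Cheap lookup only; no parsing or I/O here.
        return self._x[index], self._y[index]


ds = SquaresDataset()
print(len(ds), ds[3])  # 5 (3.0, 9.0)
```

Because the subclass never touches `__init__` or `__getitem__`, the base class stays free to add validation around `get_item` without any change to user code.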

### 2) Wine Dataset Example

The example dataset downloads Wine Quality (red) from UCI if it’s not already present, parses it with `csv.DictReader(delimiter=';')`, then caches two arrays:

- `self._X`: standardized continuous features (all columns except quality)
- `self._y`: the quality score as the regression target

`__len__` returns `len(self._y)` and `get_item(i)` returns a torch tensor pair: `(X[i], y[i])`.
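
The parsing and standardization steps can be tried in isolation, without a node or a download. The sketch below is an illustration only: it feeds a few illustrative rows in the same `;`-delimited layout through `csv.DictReader` and applies column-wise standardization. The in-memory `sample` and the two-column layout are assumptions made for brevity; the real file has 11 feature columns.

```python
import csv
import io

import numpy as np

# A few illustrative rows in the same ';'-delimited layout (only two feature columns here).
sample = io.StringIO(
    '"alcohol";"pH";"quality"\n'
    "9.4;3.51;5\n"
    "9.8;3.20;5\n"
    "11.0;3.16;6\n"
)

rows = list(csv.DictReader(sample, delimiter=";"))
feats = [c for c in rows[0].keys() if c != "quality"]

X = np.array([[float(r[c]) for c in feats] for r in rows], dtype=np.float32)
y = np.array([float(r["quality"]) for r in rows], dtype=np.float32)

# Column-wise standardization; the small epsilon guards against zero variance.
X_std = (X - X.mean(0, keepdims=True)) / (X.std(0, keepdims=True) + 1e-8)

print(feats)        # ['alcohol', 'pH']
print(X_std.shape)  # (3, 2)
```

After this step each feature column has approximately zero mean and unit standard deviation, while `y` keeps the raw quality scores.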

### 3) Register a dataset on the node (once)

Open the node dataset tool and add a Custom Dataset entry tagged `wine`:

```
$ fedbiomed node dataset add
```

Choose Custom Dataset and set the folder that will hold `winequality-red.csv` (or where it will be downloaded). Use the tag `wine` (that’s what the Researcher will look for).

Start the node:

```
$ fedbiomed node start
```

### 4) Training plan (how the plan is wired)

`WineTrainingPlan` inherits from `TorchTrainingPlan` and provides:

- `init_model(model_args)`: builds a tiny MLP for regression. The input size `in_features` is passed via `model_args` (set to 11 for this dataset).

- Inner `WineCSV` dataset class (defined inside the plan in your code): implements `read`, `__len__`, `get_item` and returns Tensors.

- `training_data()`: constructs a `DataManager(dataset=WineCSV())`. 

- `training_step(data, target)`: one standard step using `MSELoss` for regression.

```python
import torch
from torch import nn

from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.datamanager import DataManager
from fedbiomed.common.dataset_types import DataReturnFormat
from fedbiomed.common.dataset import CustomDataset

class WineTrainingPlan(TorchTrainingPlan):

    def init_model(self, model_args):
        # input dimension is supplied by the researcher through model_args
        in_dim = model_args['in_features']
        hidden = model_args.get("hidden", 64)
        return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def init_optimizer(self, optimizer_args):
        lr = optimizer_args.get("lr", 1e-3)
        return torch.optim.Adam(self.model().parameters(), lr=lr)

    def init_dependencies(self):
        # imports the node needs in order to run the code of this plan
        deps = [
            'from fedbiomed.common.dataset import CustomDataset',
            'import os',
            'import urllib.request',
            'import csv',
            'import numpy as np'
        ]
        return deps

    class WineCSV(CustomDataset):
        URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
        FNAME = "winequality-red.csv"
        TARGET = "quality"

        def read(self) -> None:
            # Fed‑BioMed provides self.path
            os.makedirs(self.path, exist_ok=True)
            csv_fp = os.path.join(self.path, self.FNAME)

            # download the CSV only if it is not already in the dataset folder
            if not os.path.exists(csv_fp):
                with urllib.request.urlopen(self.URL, timeout=30) as r:
                    data = r.read()
                with open(csv_fp, "wb") as f:
                    f.write(data)

            with open(csv_fp, "r", encoding="utf-8", newline="") as f:
                rows = list(csv.DictReader(f, delimiter=";"))
            if not rows:
                # fail early: rows[0] below would otherwise raise an IndexError
                raise ValueError(f"Empty CSV at {csv_fp}")

            cols = list(rows[0].keys())
            if self.TARGET not in cols:
                raise ValueError(f"Target '{self.TARGET}' not found")

            feats = [c for c in cols if c != self.TARGET]
            X = np.array([[float(r[c]) for c in feats] for r in rows], dtype=np.float32)
            y = np.array([float(r[self.TARGET]) for r in rows], dtype=np.float32)

            # (optional) simple standardization
            X = (X - X.mean(0, keepdims=True)) / (X.std(0, keepdims=True) + 1e-8)

            self._X, self._y = X, y

        def __len__(self) -> int:
            return len(self._y)

        def get_item(self, index: int):
            return torch.tensor(self._X[index]), torch.tensor(self._y[index])

    def training_data(self):
        wine_dataset = self.WineCSV()
        return DataManager(dataset=wine_dataset)

    def training_step(self, data, target):
        predictions = self.model().forward(data)
        # align target with the (batch, 1) model output to avoid unintended broadcasting
        target = target.reshape(predictions.shape)
        loss = torch.nn.functional.mse_loss(predictions, target)
        return loss
```
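
One detail worth checking in a regression `training_step` is shape agreement. The final `nn.Linear(hidden, 1)` layer produces outputs of shape `(batch, 1)`, while scalar targets are typically batched to shape `(batch,)`. PyTorch follows NumPy-style broadcasting, and `mse_loss` broadcasts mismatched shapes (with a warning) instead of failing. The sketch below uses NumPy only to show the effect on toy values; the numbers are made up, and the batching behaviour of the actual data loader is an assumption.

```python
import numpy as np

batch = 4
# Shape (4, 1), like the output of nn.Linear(hidden, 1)
predictions = np.arange(batch, dtype=np.float32).reshape(batch, 1)
# Shape (4,), like scalar targets stacked into a batch
target = np.arange(batch, dtype=np.float32)

# Mismatched shapes broadcast to a (4, 4) grid of pairwise differences.
naive_diff = predictions - target
print(naive_diff.shape)    # (4, 4)

# Aligning the shapes gives one difference per sample.
aligned_diff = predictions - target.reshape(predictions.shape)
print(aligned_diff.shape)  # (4, 1)

naive_mse = float((naive_diff ** 2).mean())
aligned_mse = float((aligned_diff ** 2).mean())
print(naive_mse, aligned_mse)  # 2.5 0.0
```

Here the predictions equal the targets exactly, yet the broadcast version reports a non-zero error, which is why the two tensors should have the same shape before the loss is computed.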

```python
# model parameters
model_args = {
    'in_features': 11,  # number of feature columns
}

# training parameters
training_args = {
    'loader_args': {
        'batch_size': 64,
        'shuffle': True
    },
    'optimizer_args': {
        'lr': 1e-3
    },
    'epochs': 3,
}
```

```python
from fedbiomed.researcher.federated_workflows import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags = ['wine']
rounds = 5

exp = Experiment(tags=tags,
                 training_plan_class=WineTrainingPlan,
                 model_args=model_args,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)
```
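
To get a feel for how much training these settings imply, the arithmetic below estimates the number of optimizer steps on the single node. It is a rough sketch under stated assumptions: the node holds the full red-wine file (1,599 rows), the final partial batch is kept (the PyTorch `DataLoader` default), and each batch triggers one optimizer step. Whether the node's loader behaves exactly this way is not confirmed by this example.

```python
import math

n_samples = 1599   # rows in winequality-red.csv (assumes the node holds the full file)
batch_size = 64    # training_args['loader_args']['batch_size']
epochs = 3         # training_args['epochs']
rounds = 5         # round_limit

# The last batch of each epoch holds the remaining 63 samples.
batches_per_epoch = math.ceil(n_samples / batch_size)
steps_per_round = batches_per_epoch * epochs
total_steps = steps_per_round * rounds

print(batches_per_epoch, steps_per_round, total_steps)  # 25 75 375
```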

```python
exp.run()
```

### 5) Conclusion

We have successfully trained a model using our own `read` and `get_item` methods. This customization lets us control both which data is loaded and what each sample looks like at every training step. Try changing the `read` method to select specific columns and/or fewer data samples.
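
As a starting point for that exercise, the sketch below shows one way to keep a subset of feature columns and a limited number of rows once the arrays are built. `select_columns_and_rows` is a hypothetical helper written for this illustration (it is not part of Fed‑BioMed), and the toy arrays stand in for the `X`, `y` and `feats` variables computed inside `read`.

```python
import numpy as np

def select_columns_and_rows(X, y, feats, keep, max_rows=None):
    """Hypothetical helper: keep the named feature columns and the first max_rows samples."""
    idx = [feats.index(name) for name in keep]  # raises ValueError on an unknown column
    X_sel = X[:, idx]
    if max_rows is not None:
        X_sel, y = X_sel[:max_rows], y[:max_rows]
    return X_sel, y

# Toy stand-ins for the arrays built in read()
feats = ["alcohol", "pH", "sulphates"]
X = np.arange(12, dtype=np.float32).reshape(4, 3)
y = np.array([5, 6, 5, 7], dtype=np.float32)

X_small, y_small = select_columns_and_rows(
    X, y, feats, keep=["alcohol", "sulphates"], max_rows=2
)
print(X_small.shape, y_small.shape)  # (2, 2) (2,)
print(X_small.tolist())              # [[0.0, 2.0], [3.0, 5.0]]
```

If you reduce the number of feature columns this way, remember to set `in_features` in `model_args` to the new column count so the first linear layer matches the data.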