# Experiment output directory

In this tutorial, you will familiarize yourself with the structure of the experiment's output folder. Some of the files can be used directly to draw quick conclusions about the experiment's results. Others can be used to create custom visualizations with the Analysis module.

After you run the experiments from the Hyperparameter Optimization (HPO) tutorial, the results are cached in the directory `/tmp/experiments`, as specified by the `experiment_dir` configuration option. The results directory has the following structure:

```
- /tmp/experiments
    - experiment_<experiment_id>
        - <trial1_id>
          - best_checkpoints/
          - checkpoints/
          - dashboard/
          - config.yaml
          - metadata.json
          - results.json
          - train.log
        - <trial2_id>
        - <trial3_id>
        - ...
        - <experiment_id>_optuna.db
        - <experiment_id>_state.db
        - default_config.yaml
        - mp.log
```

As you can see, there are two levels of directories: one for the experiment, and one for the trials.
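
The two-level layout can be explored with a few lines of Python. The following is a minimal sketch that uses only the standard library; the helper names `list_trial_dirs` and `describe_trial` are hypothetical and not part of the framework. It assumes that every subdirectory of the experiment directory is a trial, as in the tree above.

```python
from pathlib import Path

# Entries that the tree above lists inside each trial directory.
EXPECTED_TRIAL_ENTRIES = [
    "best_checkpoints", "checkpoints", "dashboard",
    "config.yaml", "metadata.json", "results.json", "train.log",
]

def list_trial_dirs(experiment_dir):
    """Return the trial subdirectories of an experiment directory, sorted by name.

    Assumption: trials are the only subdirectories; the .db, .yaml, and
    .log entries at the experiment level are plain files.
    """
    root = Path(experiment_dir)
    return sorted(p for p in root.iterdir() if p.is_dir())

def describe_trial(trial_dir):
    """Map each expected trial entry to whether it exists on disk."""
    trial = Path(trial_dir)
    return {name: (trial / name).exists() for name in EXPECTED_TRIAL_ENTRIES}
```

For example, `describe_trial(list_trial_dirs("/tmp/experiments/experiment_<experiment_id>")[0])` reports which of the expected files the first trial has produced so far.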

- The `experiment_<experiment_id>` directory:
  - `<experiment_id>_optuna.db`: after each trial finishes training, its metrics (those being optimized) are added to this database (via `optuna_study.tell()`). Optuna uses the results collected so far to decide where to explore the search space next. The content of this database is outside the scope of this tutorial, since it is used by `optuna` to perform hyperparameter exploration.
  - `<experiment_id>_state.db`: after each trial finishes training, its metrics (those being optimized), configuration, and training state (RUNNING, WAITING, etc.) are added to the experiment state database. (This database is probably accessed by the dashboard API.)
  - `default_config.yaml`: the overall configuration for the model, training, and hyperparameter tuning. Note that the hyperparameter values differ in each trial's `config.yaml` file, which is specific to that trial.
  - `mp.log`: console output from the running experiment. It reports which trials are running in parallel, how many are running, and how many have terminated.
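
If you are curious about what the two `.db` files hold, you can list their tables without modifying them. The sketch below assumes that the files are SQLite databases, which this tutorial does not state explicitly; `list_tables` is a hypothetical helper. The schema itself is not documented here, so treat whatever you find as an implementation detail.

```python
import sqlite3
from contextlib import closing
from pathlib import Path

def list_tables(db_path):
    """Return the table names in a database file, opened read-only.

    Assumption: the file is a SQLite database.
    """
    # mode=ro prevents SQLite from creating an empty file when the path is wrong.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Opening the file read-only keeps the inspection from interfering with the stored experiment state.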

- Trial directories, each of which contains:
  - `train.log`: console output from the training process of the trial. This log reports metrics from both training and evaluation. These include static metrics (e.g. best loss value, current iteration, current epoch, best iteration so far, total steps, current learning rate) and moving-average metrics (e.g. training loss, validation loss, and user-defined metrics such as F1 score and precision).
  - `results.json`: helps keep track of the running trial. It contains one JSON object per epoch and stores all the metrics reported in `train.log`.
    ```
    {
    "train_loss": 4.234887647596995,
    "val_loss": NaN,
    "train_accuracy": 0.463046875,
    "val_accuracy": NaN,
    "best_iteration": 0,
    "best_loss": Infinity,
    "current_epoch": 1,
    "current_iteration": 1875,
    "epochs": 10,
    "learning_rate": 0.0025929597250393165,
    "total_steps": 18750
    },
    {
    "train_loss": 1.5882160221735637,
    "val_loss": 1.4797034080797873,
    "train_accuracy": 0.4984375,
    "val_accuracy": 0.5128,
    "best_iteration": 1875,
    "best_loss": 1.4797034080797873,
    "current_epoch": 2,
    "current_iteration": 3750,
    "epochs": 10,
    "learning_rate": 0.0025929597250393165,
    "total_steps": 18750
    },
    ...
    {
    "train_loss": 1.4190918445428213,
    "val_loss": 1.670326127942403,
    "train_accuracy": 0.59286328125,
    "val_accuracy": 0.5895444444444444,
    "best_iteration": 3750,
    "best_loss": 1.461920569594295,
    "current_epoch": 10,
    "current_iteration": 18750,
    "epochs": 10,
    "learning_rate": 0.0025929597250393165,
    "total_steps": 18750
    }

    ```
    As you can see from this sample `results.json` file from the HPO tutorial, there are 10 JSON objects, each representing the metrics of one epoch. The `best_iteration` and `best_loss` values identify the best-performing iteration.
  - `config.yaml`: the configuration details that are specific to this trial.
  - `checkpoints/`: this directory stores checkpoints of the trained model. Each checkpoint includes the model parameters, the optimizer (and/or scheduler) state, and all the metrics computed with this model. The config parameter `keep_n_checkpoints` controls the number of checkpoints kept in this folder. You can explore a checkpoint by loading its `.pt` file with PyTorch. For example, the following code loads a checkpoint, which is a dictionary holding the run configuration, metrics, and model parameters:
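    ```python
    import torch

    # Replace the <...> placeholders with your experiment, trial, and checkpoint names.
    checkpoint_path = "/tmp/experiments/experiment_<experiment_id>/<trial1_id>/checkpoints/<ckpt_name>.pt"
    # Note: PyTorch 2.6+ defaults to weights_only=True, which may reject the non-tensor
    # objects stored here; pass weights_only=False only for checkpoints you trust.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    ```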
    ```
    {'run_config': {'model_config': {'input_size': 784,
    'hidden_size': 453,
    'num_classes': 10},
    'experiment_dir': ...,
    'random_seed': 123,
    'train_config': {...},
    'scheduler_config': None,
    'total_trials': 1,
    'concurrent_trials': 1,
    'search_space': {...},
    'optim_metrics': {'val_loss': <Optim.min: 'min'>},
    ...
    'metrics': {'train_loss': 1.2681676818847656,
      'val_loss': 1.2884625773460339,
      'train_accuracy': 0.5673828125,
      ...},
    'model': OrderedDict([('model.fc1.weight',
                  tensor([[-0.0115, -0.0191,  0.0091,  ...,  0.0055,  0.0072, -0.0133],
                          [ 0.0038, -0.0020,  0.0061,  ..., -0.1115,  0.0072,  0.0010],
                          [ 0.0108, -0.0148,  0.0069,  ...,  0.1052, -0.0180, -0.0165],
                          ...,
                          [-0.0234,  0.0091, -0.0088,  ...,  0.0090,  0.0136, -0.0035],
                          [ 0.0115,  0.0197, -0.0017,  ..., -0.1034,  0.0064, -0.0030],
                          [-0.0083,  0.0073,  0.0204,  ..., -0.0111,  0.0014,  0.0003]]))
                          ...
                          ])
      ...
      }}
    ```
  - `best_checkpoints/`: this directory stores the best checkpoints of the trained model.
  - `dashboard/`: this directory caches the metric data for TensorBoard, which you can use to visualize how the metrics change during training. Install `tensorboard`, load the extension with `%load_ext tensorboard` if you are using a notebook, then run `%tensorboard --logdir /tmp/experiments/experiment_<experiment_id> --port [port]`. For example:
    ```
    %load_ext tensorboard
    %tensorboard --logdir /tmp/experiments/experiment_1901_aa90 --port 6008
    ```
    ![TensorBoard-Output](./Images/tensorboard-output.jpg)
  - `metadata.json`: keeps track of the training progress: the log iteration (the latest iteration at which results were logged to files), the checkpoint iteration (the iteration at which a checkpoint was saved), and the best iteration.
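
The per-epoch records in `results.json` are convenient for quick checks, such as finding the epoch with the lowest validation loss. The sketch below is a hedged example, not the framework's own loader: `iter_json_objects` and `best_epoch` are hypothetical helpers. Because the exact on-disk layout is not specified here (a JSON array, comma-separated objects, or one object per line), the parser accepts all three. Python's `json` module reads the non-standard `NaN` and `Infinity` values by default.

```python
import json
import math

def iter_json_objects(text):
    """Yield each top-level JSON object found in `text`."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace, comma separators, and optional array brackets.
        if text[pos] in " \t\r\n,[]":
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

def best_epoch(records, metric="val_loss"):
    """Return the record with the lowest finite `metric`, or None if there is none."""
    valid = [r for r in records if math.isfinite(r.get(metric, math.nan))]
    return min(valid, key=lambda r: r[metric]) if valid else None

# Abridged records taken from the sample above (only two fields kept).
sample = """
{"val_loss": NaN, "current_epoch": 1},
{"val_loss": 1.4797034080797873, "current_epoch": 2},
{"val_loss": 1.670326127942403, "current_epoch": 10}
"""
records = list(iter_json_objects(sample))
best = best_epoch(records)
print(best["current_epoch"], best["val_loss"])
```

To run this on a real trial, read the file with `pathlib.Path(...).read_text()` and pass the text to `iter_json_objects`. Epochs without a validation pass (`NaN`) are skipped rather than treated as the minimum.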

In the following tutorial, we will use the Analysis module to visualize these results so that we can draw better conclusions about the experiment's results.