# Lab 04: Experiment Management

### What You Will Learn

- How experiment management brings observability to ML model development
- Which features of experiment management we use in developing the Text Recognizer
- Workflows for using Weights & Biases in experiment management, including metric logging, artifact versioning, and hyperparameter optimization

## Setup

In [2]:
if "bootstrap" not in locals() or bootstrap.run:
    # path management for Python
    pythonpath, = !echo $PYTHONPATH
    if "." not in pythonpath.split(":"):
        pythonpath = ".:" + pythonpath
        %env PYTHONPATH={pythonpath}
        !echo $PYTHONPATH

    # get both Colab and local notebooks into the same state
    !wget --quiet https://fsdl.me/gist-bootstrap -O bootstrap.py
    import bootstrap

    # change into the lab directory
    # bootstrap.change_to_lab_dir(lab_idx=lab_idx)

    # allow "hot-reloading" of modules
    %load_ext autoreload
    %autoreload 2
    # needed for inline plots in some contexts
    %matplotlib inline

    bootstrap.run = False  # change to True to re-run setup
    
!pwd
%ls

/home/amazingguni/git/fsdl-text-recognizer-2022-labs
Makefile   data/            lightning_logs/  requirements/     training/
README.md  environment.yml  notebooks/       text_recognizer/


In [3]:
from IPython.display import display, HTML, IFrame

full_width = True
frame_height = 720  # adjust for your screen

if full_width:  # if we want the notebook to take up the whole width
    # add styling to the notebook's HTML directly
    display(HTML("<style>.container { width:100% !important; }</style>"))
    display(HTML("<style>.output_result { max-width:100% !important; }</style>"))

## Why experiment management?

In [4]:
from text_recognizer.data.iam import IAM  # base dataset of images of handwritten text
from text_recognizer.data import IAMLines  # processed version split into individual lines
from text_recognizer.models import LineCNNTransformer  # simple CNN encoder / Transformer decoder


print(IAM.__doc__)

# uncomment a line below for details on either class
# IAMLines??  
# LineCNNTransformer??

A dataset of images of handwritten text written on a form underneath a typewritten prompt.

    "The IAM Lines dataset, first published at the ICDAR 1999, contains forms of unconstrained handwritten text,
    which were scanned at a resolution of 300dpi and saved as PNG images with 256 gray levels."
    From http://www.fki.inf.unibe.ch/databases/iam-handwriting-database

    Images are identified by their "form ID". These IDs are used to separate train, validation and test splits,
    as keys for dictionaries returning label and image crop region data, and more.

    The data split we will use is
    IAM lines Large Writer Independent Text Line Recognition Task (LWITLRT): 9,862 text lines.
        The validation set has been merged into the train set.
        The train set has 7,101 lines from 326 writers.
        The test set has 1,861 lines from 128 writers.
        The text lines of all data sets are mutually exclusive, thus each writer has contributed to one set only.
    


In [5]:
%%time
import torch


gpus = int(torch.cuda.is_available()) 

%run training/run_experiment.py --model_class LineCNNTransformer --data_class IAMLines \
  --loss transformer --batch_size 32 --gpus {gpus} --max_epochs 2 \
  --limit_train_batches 0.1 --limit_val_batches 0.1 --limit_test_batches 0.1 --log_every_n_steps 10

Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

   | Name                      | Type               | Params
------------------------------------------------------------------
0  | model                     | LineCNNTransformer | 4.3 M 
1  | model.line_cnn            | LineCNN            | 1.6 M 
2  | model.embedding           | Embedding          | 21.2 K
3  | model.fc                  | Linear             | 21.3 K
4  | model.pos_encoder         | PositionalEncoding | 0     
5  | model.transformer_decoder | TransformerDecoder | 2.6 M 
6  | train_acc                 | Accuracy           | 0     
7  | val_acc                   | Accuracy           | 0     
8  | test_acc      

Model State Dict Disk Size: 17.23 MB


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Best model saved at: /home/amazingguni/git/fsdl-text-recognizer-2022-labs/training/logs/lightning_logs/version_12/epoch=0000-validation.loss=3.139-validation.cer=1.809.ckpt
Restoring states from the checkpoint path at /home/amazingguni/git/fsdl-text-recognizer-2022-labs/training/logs/lightning_logs/version_12/epoch=0000-validation.loss=3.139-validation.cer=1.809.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from checkpoint at /home/amazingguni/git/fsdl-text-recognizer-2022-labs/training/logs/lightning_logs/version_12/epoch=0000-validation.loss=3.139-validation.cer=1.809.ckpt


Testing: 0it [00:00, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test/cer             2.13433837890625
        test/loss            3.17672061920166
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
CPU times: user 55.6 s, sys: 9.86 s, total: 1min 5s
Wall time: 1min 11s
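
The `test/cer` value above is greater than 1, which can look like a bug. It need not be one. Assuming the character error rate is computed in the conventional way, as the Levenshtein edit distance between prediction and target divided by the target length (the lab's own metric code is not shown here and may differ in details), a prediction that is much longer than its target scores above 1. A minimal sketch under that assumption:

```python
def edit_distance(pred: str, target: str) -> int:
    """Levenshtein distance, computed with the standard two-row dynamic program."""
    prev = list(range(len(target) + 1))  # distances from the empty prediction
    for i, p in enumerate(pred, start=1):
        curr = [i]  # distance from pred[:i] to the empty target
        for j, t in enumerate(target, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # delete p from the prediction
                    curr[j - 1] + 1,  # insert t into the prediction
                    prev[j - 1] + (p != t),  # substitute, free when characters match
                )
            )
        prev = curr
    return prev[-1]


def character_error_rate(pred: str, target: str) -> float:
    """Edit distance normalized by target length (assumed definition of CER)."""
    return edit_distance(pred, target) / max(len(target), 1)
```

For example, predicting `"abcdef"` for the target `"ab"` takes four deletions to fix, so the error rate is 4 / 2 = 2.0. An undertrained model that emits long, repetitive strings can therefore land well above 1.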


## Local Experiment Tracking with TensorBoard

In [6]:
# we use a sequence of bash commands to get the latest experiment's directory
#  when working by hand, you can just copy and paste the path from the terminal

list_all_log_files = "find training/logs/lightning_logs/"  # find avoids issues ls has with \n in filenames
filter_to_folders = "grep '_[0-9]*$'"  # regex match on end of line
sort_version_descending = "sort -Vr"  # uses "version" sorting (-V) and reverses (-r)
take_first = "head -n 1"  # the first n elements, n=1

In [7]:
latest_log, = ! {list_all_log_files} | {filter_to_folders} | {sort_version_descending} | {take_first}
latest_log

'training/logs/lightning_logs/version_12'
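
The same lookup can be done without leaving Python. The sketch below is an alternative to the shell pipeline, not what the lab uses; `latest_version_dir` is a hypothetical helper name. It assumes the `version_N` folders sit directly inside the log directory, whereas the `find` command above searches recursively.

```python
import re
from pathlib import Path
from typing import Optional


def latest_version_dir(log_root) -> Optional[Path]:
    """Return the `version_N` subdirectory with the largest N, or None if there is none."""
    candidates = []
    for path in Path(log_root).iterdir():
        match = re.fullmatch(r"version_(\d+)", path.name)
        if path.is_dir() and match:
            # compare as integers so that version_10 sorts after version_9
            candidates.append((int(match.group(1)), path))
    return max(candidates)[1] if candidates else None
```

Calling `latest_version_dir("training/logs/lightning_logs")` plays the role of the pipeline above. Comparing the version numbers as integers is the counterpart of the `-V` flag to `sort`.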

In [8]:
!ls -lh {latest_log}

total 99M
-rw-r--r-- 1 amazingguni amazingguni  50M Jan 25 23:51 'epoch=0000-validation.loss=3.139-validation.cer=1.809.ckpt'
-rw-r--r-- 1 amazingguni amazingguni  50M Jan 25 23:51 'epoch=0001-validation.loss=3.126-validation.cer=1.809.ckpt'
-rw-r--r-- 1 amazingguni amazingguni 1.3K Jan 25 23:51  events.out.tfevents.1674658240.DESKTOP-2KM3TFJ.19492.0
-rw-r--r-- 1 amazingguni amazingguni  176 Jan 25 23:51  events.out.tfevents.1674658303.DESKTOP-2KM3TFJ.19492.1
-rw-r--r-- 1 amazingguni amazingguni    3 Jan 25 23:50  hparams.yaml
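
Note that the only record of each checkpoint's metrics in this folder is its filename. As a sketch of what local bookkeeping then looks like, the helper below pulls the `key=value` pairs back out of names shaped like the ones listed above. The function names are hypothetical, and the filename pattern is assumed from this listing rather than from the training script.

```python
import re
from pathlib import Path


def parse_checkpoint_name(filename: str) -> dict:
    """Extract key=value pairs (epoch, validation metrics) from a checkpoint filename."""
    stem = Path(filename).stem  # drop the trailing ".ckpt"
    pairs = re.findall(r"([A-Za-z_.]+)=(\d+(?:\.\d+)?)", stem)
    return {key: float(value) for key, value in pairs}


def lowest_by(filenames, metric: str) -> str:
    """Return the filename whose parsed `metric` is smallest (first one wins ties)."""
    return min(filenames, key=lambda name: parse_checkpoint_name(name)[metric])
```

Parsing metrics out of filenames works for a handful of runs but breaks as soon as the naming scheme changes, which is part of the motivation for a dedicated experiment tracker.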


In [9]:
%load_ext tensorboard

In [10]:
# same command works in terminal, with "{arguments}" replaced with values or "$VARIABLES"

port = 11717  # pick an open port on your machine
host = "0.0.0.0"  # listen on all network interfaces, which can expose TensorBoard to the internet
                 #   watch out! make sure you shut TensorBoard down when you are done

%tensorboard --logdir {latest_log} --port {port} --host {host}

ERROR: Failed to launch TensorBoard (exited with 255).
Contents of stderr:
TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

E0125 23:51:46.371752 139667197018496 program.py:298] TensorBoard could not bind to port 11717, it was already in use
ERROR: TensorBoard could not bind to port 11717, it was already in use
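
The launch failed because port 11717 was already taken. One way to avoid guessing is to probe for a free port before launching. The helpers below are a hypothetical sketch using only the standard library; they are not part of the lab code or of TensorBoard.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:  # typically "address already in use"
            return False
        return True


def first_free_port(start: int, attempts: int = 20, host: str = "127.0.0.1") -> int:
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if port_is_free(port, host):
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")
```

Something like `port = first_free_port(11717)` could then feed the `%tensorboard` command. The check is only a snapshot: another process can still claim the port between the probe and the launch, so treat the result as a good guess rather than a guarantee.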

In [11]:
%tensorboard --logdir training/logs/lightning_logs --port {port + 1} --host "0.0.0.0"

Reusing TensorBoard on port 11718 (pid 18714), started 0:04:39 ago. (Use '!kill 18714' to kill it.)

In [12]:
import tensorboard.manager

# get the process IDs for all tensorboard instances
pids = [tb.pid for tb in tensorboard.manager.get_all()]

done_with_tensorboard = False

if done_with_tensorboard:
    # kill processes
    for pid in pids:
        !kill {pid} 2> /dev/null
        
    # remove the temporary files that sometimes persist, see https://stackoverflow.com/a/59582163
    !rm -rf {tensorboard.manager._get_info_dir()}

## Experiment Management with Weights & Biases

In [13]:
import wandb

print(wandb.__doc__)

Use wandb to track machine learning work.

The most commonly used functions/objects are:
  - wandb.init — initialize a new run at the top of your training script
  - wandb.config — track hyperparameters and metadata
  - wandb.log — log metrics and media over time within your training loop

For guides and examples, see https://docs.wandb.com/guides.

For scripts and interactive notebooks, see https://github.com/wandb/examples.

For reference documentation, see https://docs.wandb.com/ref/python.
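
To make the three functions named in that docstring concrete, here is a minimal sketch of how they fit together outside of PyTorch Lightning. The project name `fsdl-sketch`, the helper names, and the made-up loss curve are all assumptions for illustration; the lab itself goes through Lightning's `WandbLogger` rather than calling these functions directly.

```python
import math


def toy_losses(n_steps: int, start: float = 3.0, decay: float = 0.1) -> list:
    """Made-up, exponentially decaying loss values standing in for a real training loop."""
    return [start * math.exp(-decay * step) for step in range(n_steps)]


def log_toy_run(config: dict, losses: list, mode: str = "disabled") -> None:
    """Send `losses` to W&B one step at a time.

    mode="disabled" turns the wandb calls into no-ops, so nothing is uploaded;
    "offline" or "online" would record a real run.
    """
    import wandb  # imported lazily so toy_losses works without wandb installed

    # hypothetical project name; config is stored as the run's hyperparameters
    wandb.init(project="fsdl-sketch", config=config, mode=mode)
    for step, loss in enumerate(losses):
        wandb.log({"train/loss": loss}, step=step)  # one point on the metric chart
    wandb.finish()  # close the run, as needed when working inside a notebook
```

A call such as `log_toy_run({"lr": 1e-3, "batch_size": 16}, toy_losses(50))` exercises the full init, log, finish cycle without touching the network. Switching `mode` to `"online"` would create a real run in your account.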



In [14]:
!grep "args.wandb" -A 5 training/run_experiment.py | head -n 6

    if args.wandb:
        logger = pl.loggers.WandbLogger(log_model="all", save_dir=str(log_dir), job_type="train")
        logger.watch(model, log_freq=max(100, args.log_every_n_steps))
        logger.log_hyperparams(vars(args))
        experiment_dir = logger.experiment.dir
    callbacks += [cb.ModelSizeLogger(), cb.LearningRateMonitor()]


In [15]:
from pytorch_lightning.loggers import WandbLogger


WandbLogger??

Init signature:
WandbLogger(
    name: Union[str, NoneType] = None,
    save_dir: Union[str, NoneType] = None,
    offline: Union[bool, NoneType] = False,
    id: Union[str, NoneType] = None,
    anonymous: Union[bool, NoneType] = None,
    version: Union[

In [16]:
!wandb login

[34m[1mwandb[0m: Currently logged in as: [33mamazingguni[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [17]:
%%time
%run training/run_experiment.py --model_class LineCNNTransformer --data_class IAMLines \
  --loss transformer --batch_size 16 --gpus {gpus} --max_epochs 10 \
  --log_every_n_steps 10 --wandb --limit_test_batches 0.1 \
  --limit_train_batches 0.1 --limit_val_batches 0.1
    
last_expt = wandb.run

wandb.finish()  # needed when running experiments inside a notebook like this; not needed from the CLI

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mamazingguni[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

   | Name                      | Type               | Params
------------------------------------------------------------------
0  | model                     | LineCNNTransformer | 4.3 M 
1  | model.line_cnn            | LineCNN            | 1.6 M 
2  | model.embedding           | Embedding          | 21.2 K
3  | model.fc                  | Linear             | 21.3 K
4  | model.pos_encoder         | PositionalEncoding | 0     
5  | model.transformer_decoder | TransformerDecoder | 2.6 M 
6  | train_acc                 | Accuracy           | 0     

Model State Dict Disk Size: 17.23 MB


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Best model saved at: /home/amazingguni/git/fsdl-text-recognizer-2022-labs/training/logs/lightning_logs/version_13/epoch=0007-validation.loss=2.331-validation.cer=0.778.ckpt
Best model also uploaded to W&B 
Restoring states from the checkpoint path at /home/amazingguni/git/fsdl-text-recognizer-2022-labs/training/logs/lightning_logs/version_13/epoch=0007-validation.loss=2.331-validation.cer=0.778.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from checkpoint at /home/amazingguni/git/fsdl-text-recognizer-2022-labs/training/logs/lightning_logs/version_13/epoch=0007-validation.loss=2.331-validation.cer=0.778.ckpt


Testing: 0it [00:00, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test/cer            0.8246400356292725
        test/loss            2.328057289123535
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────



Run history:
epoch                ▁▁▁▁▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇███▆
optimizer/lr-Adam    ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
size/mb_disk         ▁
size/nparams         ▁
test/cer             ▁
test/loss            ▁
train/loss           ▇▇▆█▆▇▆▄▄▃▃▃▃▃▃▂▃▂▃▃▂▂▃▂▂▃▂▂▂▂▂▂▂▂▂▁▂▁▁▂
trainer/global_step  ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
validation/cer       █▇▂▆▇▁▁▁▁▇
validation/loss      █▄▃▃▂▂▂▁▁▁

Run summary:
epoch                7.0
optimizer/lr-Adam    0.001
size/mb_disk         17.22701
size/nparams         4297331.0
test/cer             0.82464
test/loss            2.32806
train/loss           2.51362
trainer/global_step  590.0
validation/cer       1.59753
validation/loss      2.28742


CPU times: user 3min 48s, sys: 32.1 s, total: 4min 20s
Wall time: 5min 42s
