<a href="https://colab.research.google.com/github/wandb/edu/blob/main/mlops-001/lesson1/03_Baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{course-lesson1} -->

# Baseline solution

In this notebook we will create a baseline solution to our image classification problem. A notebook is handy for iterating quickly. We will later refactor this code into a script so that we can run hyperparameter sweeps.

In [1]:
import wandb
import pandas as pd
from fastai.vision.all import *
from fastai.callback.wandb import WandbCallback
from utils import get_predictions, create_predictions_table
from sklearn.metrics import f1_score, balanced_accuracy_score 
import params


Again, we're importing some global configuration parameters from the `params.py` file. We have also defined some helper functions in `utils.py`, for example the metrics we will track during our experiments.

Let's now create a `train_config` that we'll pass to W&B `run` to control training hyperparameters. 

In [2]:
train_config = SimpleNamespace(
    framework="fastai",
    img_size=(18, 32),
    batch_size=1024,
    augment=True, # use data augmentation
    epochs=5, 
    lr=2e-3,
    pretrained=True,  # whether to use pretrained encoder
    seed=42,
)
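
Because `train_config` is a plain `SimpleNamespace`, its fields can be read as attributes or viewed as a dictionary with `vars()`, which is handy for checking the hyperparameters before starting a run. The sketch below uses only the standard library and repeats a few of the values from the cell above; it does not call W&B.

```python
from types import SimpleNamespace

# A reduced copy of the config above, for illustration only.
cfg = SimpleNamespace(batch_size=1024, epochs=5, lr=2e-3, seed=42)

# Attribute access, as used later in the notebook (e.g. train_config.epochs).
print(cfg.epochs)

# vars() exposes the same fields as a plain dict of hyperparameters.
cfg_dict = vars(cfg)
print(cfg_dict)
```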

We set the random seed for reproducibility.

In [3]:
set_seed(train_config.seed, reproducible=True)
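
As a minimal illustration of why seeding matters, the sketch below seeds Python's standard `random` module twice with the same value and gets identical draws. fastai's `set_seed` applies the same idea to the random number generators used during training. The helper name `draw` is hypothetical.

```python
import random

def draw(seed, n=3):
    """Return n pseudo-random floats after seeding the generator."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(42)
second = draw(42)
other = draw(7)

print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```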

In [4]:
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="training", config=train_config)

[34m[1mwandb[0m: Currently logged in as: [33mchrisgjarrett[0m. Use [1m`wandb login --relogin`[0m to force relogin


As usual, we will use W&B Artifacts to track the lineage of our models. 

In [5]:
processed_data_at = run.use_artifact(f'{params.PROCESSED_DATA_AT}:latest')
processed_dataset_dir = Path(processed_data_at.download())
df = pd.read_csv(processed_dataset_dir / 'data_split.csv')

[34m[1mwandb[0m: Downloading large artifact processed_data_at:latest, 556.40MB. 8619 files... 
[34m[1mwandb[0m:   8619 of 8619 files downloaded.  
Done. 0:0:0.6


In [6]:
processed_dataset_dir.ls()

(#6) [Path('artifacts/processed_data_at:v6/t2_11115475104.table.json'),Path('artifacts/processed_data_at:v6/images'),Path('artifacts/processed_data_at:v6/data_split.csv'),Path('artifacts/processed_data_at:v6/eda_table_data_split.joined-table.json'),Path('artifacts/processed_data_at:v6/eda_table.table.json'),Path('artifacts/processed_data_at:v6/media')]

We will not use the hold-out (test) split for now. The `is_valid` column tells our trainer how to split the data between training and validation.

In [7]:
df = df[df.Stage != 'test'].reset_index(drop=True)
df['is_valid'] = df.Stage == 'valid'
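
To make the split logic concrete, here is a small sketch on a made-up DataFrame. The file names are hypothetical, and `train` as the third `Stage` value is an assumption; only the `Stage` column and its `valid` and `test` values come from the cell above.

```python
import pandas as pd

# Hypothetical rows; 'train' as a Stage value is an assumption.
toy = pd.DataFrame({
    "Filename": ["a.png", "b.png", "c.png", "d.png"],
    "Stage": ["train", "valid", "test", "train"],
})

# Drop the hold-out rows, then flag the validation rows.
toy = toy[toy.Stage != "test"].reset_index(drop=True)
toy["is_valid"] = toy.Stage == "valid"

print(toy)
```

The remaining rows keep their order, and only the `valid` row is flagged for validation.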

In [8]:
def label_func(fname):
    return "Not Cancer" in fname

We will use `fastai`'s `DataBlock` API to feed data into model training and validation. 

In [9]:
# assign paths
df["image_fname"] = [processed_dataset_dir/f'images/{f}' for f in df.Filename.values]

In [10]:
df = pd.DataFrame(df)

In [12]:
def get_data(df: pd.DataFrame, bs=1, img_size=(180, 320), augment=True):
    # Build image-classification DataLoaders from the split DataFrame.
    block = DataBlock(blocks=(ImageBlock, CategoryBlock),
                      get_x=ColReader(0, pref=processed_dataset_dir/"images"),
                      get_y=ColReader("Class"),
                      splitter=ColSplitter(),  # uses the `is_valid` column
                      item_tfms=Resize(img_size),
                      # Augmentation is disabled for now, so `augment` has no effect.
                      # batch_tfms=aug_transforms() if augment else None,
                      )
    return block.dataloaders(df, bs=bs)

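W&B tracks our training hyperparameters in `wandb.config`, which was populated from `train_config` when we called `wandb.init`. In the cells below we read the values from `train_config` directly.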

In [13]:
# config = wandb.config

In [14]:
dls = get_data(df, bs=train_config.batch_size, img_size=train_config.img_size, augment=None)

In [15]:
metrics = [F1Score(), BalancedAccuracy()]

learn = vision_learner(dls, models.resnet34, metrics = metrics)



`fastai` already has a callback that integrates tightly with W&B: we only need to pass `WandbCallback` when training and we are ready to go. The callback logs the useful variables for us. For example, any metric we pass to the learner is tracked by the callback.

In [16]:
callbacks = [
    SaveModelCallback(monitor='f1_score'),
    WandbCallback(log_preds=False, log_model=True)
]

Let's train our model!

In [17]:
learn.fit_one_cycle(train_config.epochs, train_config.lr, cbs=callbacks)

Better model found at epoch 0 with f1_score value: 0.7554076539101497.
Better model found at epoch 4 with f1_score value: 0.7724425887265136.
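
The "Better model found" messages come from `SaveModelCallback`, which keeps the checkpoint with the best monitored value seen so far. The sketch below imitates that bookkeeping in plain Python. The per-epoch scores are made up, and `best_epochs` is a hypothetical helper, not part of fastai.

```python
def best_epochs(scores):
    """Return (epoch, score) pairs where the score beat every earlier epoch."""
    best = float("-inf")
    improvements = []
    for epoch, score in enumerate(scores):
        if score > best:  # higher is better for a metric such as F1
            best = score
            improvements.append((epoch, score))
    return improvements

# Made-up validation F1 values for five epochs.
history = [0.60, 0.58, 0.65, 0.64, 0.70]

for epoch, score in best_epochs(history):
    print(f"Better model found at epoch {epoch} with value: {score}")
```

The last pair in the list corresponds to the checkpoint that would be kept as the best one.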


We will log a table with model predictions and ground truth to W&B, so that we can do error analysis in the W&B dashboard. 

In [18]:
samples, predictions = get_predictions(learn)
table = create_predictions_table(samples, predictions)
wandb.log({"pred_table":table})

ValueError: This table expects 2 columns: ['Image', 'Label'], found 1

At the end of training, the model is reloaded from the best checkpoint and saved. To make sure we track the final metrics correctly, we validate the model again and save the final loss and metrics to `wandb.summary`.

In [None]:
scores = learn.validate()
metric_names = ['final_loss'] + [f'final_{x.name}' for x in metrics]
final_results = {metric_names[i] : scores[i] for i in range(len(scores))}
for k,v in final_results.items(): 
    wandb.summary[k] = v
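
The dictionary built above pairs each score returned by `learn.validate()` with a `final_`-prefixed name. Here is a standalone sketch of that pairing, with made-up scores and hypothetical metric names standing in for the fastai metric objects:

```python
# Hypothetical stand-ins: metric names and the list validate() would return
# (loss first, then one value per metric). The numbers are made up.
names = ["f1_score", "balanced_accuracy_score"]
scores = [0.50, 0.75, 0.80]

metric_names = ["final_loss"] + [f"final_{n}" for n in names]

# zip pairs names and scores by position.
final_results = dict(zip(metric_names, scores))
print(final_results)
```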

In [None]:
wandb.finish()