# Base Notebook

The first cell extends Python's module search path (`sys.path`) with several directories whose packages are not installed in the environment's default package locations. Here's what each line does:

1. `import sys`: Imports Python's built-in `sys` module, which provides access to some variables used or maintained by the Python interpreter and to functions that interact strongly with the interpreter. This is typically used for manipulating the Python runtime environment.

2. `sys.path.append("../input/pretrained-models-pytorch")`: Adds the relative path `../input/pretrained-models-pytorch` to the system path. This allows Python to find and load the modules in this directory that are not in Python's default module search path.

3. `sys.path.append("../input/efficientnet-pytorch")`: Adds the relative path `../input/efficientnet-pytorch` to the system path. This is a library for the EfficientNet model in PyTorch.

4. `sys.path.append("/kaggle/input/smp-github/segmentation_models.pytorch-master")`: Adds the absolute path `/kaggle/input/smp-github/segmentation_models.pytorch-master` to the system path. This path refers to the `segmentation_models.pytorch` library.

5. `sys.path.append("/kaggle/input/timm-pretrained-resnest/resnest/")`: Adds the absolute path `/kaggle/input/timm-pretrained-resnest/resnest/` to the system path. This is a directory for pretrained ResNeSt models, a variant of the ResNet architecture.

6. `import segmentation_models_pytorch as smp`: Imports the `segmentation_models_pytorch` module with the alias `smp` after its path has been appended to the system path. This module provides implementations of various deep learning segmentation models in PyTorch.

This pattern is useful when modules live outside the default search directories, such as custom modules or libraries that were downloaded but not installed with `pip`.


In [1]:
import sys
sys.path.append("../input/pretrained-models-pytorch")
sys.path.append("../input/efficientnet-pytorch")
sys.path.append("/kaggle/input/smp-github/segmentation_models.pytorch-master")
sys.path.append("/kaggle/input/timm-pretrained-resnest/resnest/")
import segmentation_models_pytorch as smp
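
As a minimal sketch of the same mechanism outside Kaggle, the snippet below creates a throwaway directory containing a one-line module, appends that directory to `sys.path`, and imports the module by name. The module name `demo_vendored_module` and the temporary directory are hypothetical stand-ins for the dataset folders used above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway directory holding a tiny module (hypothetical name).
package_dir = Path(tempfile.mkdtemp())
(package_dir / "demo_vendored_module.py").write_text("ANSWER = 42\n")

# append() puts the directory at the END of the search order, so an
# installed package with the same name would still be found first.
sys.path.append(str(package_dir))
importlib.invalidate_caches()

demo = importlib.import_module("demo_vendored_module")
print(demo.ANSWER)
```

Because `append` adds to the end of the list, a copy of the same library already installed in the environment takes precedence; `sys.path.insert(0, ...)` would reverse that priority.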



The next cell uses UNIX shell commands to perform some file system operations:

1. `!mkdir -p /root/.cache/torch/hub/checkpoints/`: The `mkdir` command is used to create a directory. The `-p` option tells `mkdir` to create the full directory path if it doesn't exist, essentially creating any necessary intermediate directories. In this case, it's ensuring the `/root/.cache/torch/hub/checkpoints/` directory exists.

2. `!cp /kaggle/input/timm-pretrained-resnest/resnest/gluon_resnest26-50eb607c.pth /root/.cache/torch/hub/checkpoints/gluon_resnest26-50eb607c.pth`: The `cp` command is used to copy files. This line copies the `gluon_resnest26-50eb607c.pth` file from the `/kaggle/input/timm-pretrained-resnest/resnest/` directory to the `/root/.cache/torch/hub/checkpoints/` directory. The `.pth` file is a PyTorch model file, storing trained model weights. 

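This pattern pre-populates the location where PyTorch caches model checkpoints. With the default settings and the notebook running as root, `/root/.cache/torch/hub/checkpoints/` is the directory that `torch.hub.load_state_dict_from_url` checks before downloading weights, so a file already present there is reused instead of fetched. Note that the copied file is for `gluon_resnest26`, while the configuration further down selects the `timm-resnest50d` encoder; the training output at the end of the notebook shows `resnest50-528c19ca.pth` being downloaded anyway, so this copy does not prevent the download for this run.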

In [2]:
!mkdir -p /root/.cache/torch/hub/checkpoints/
!cp /kaggle/input/timm-pretrained-resnest/resnest/gluon_resnest26-50eb607c.pth /root/.cache/torch/hub/checkpoints/gluon_resnest26-50eb607c.pth
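
The two shell commands can also be expressed in plain Python. The sketch below is an illustrative equivalent, not part of the original notebook: `stage_checkpoint` is a hypothetical helper, and the demonstration uses temporary paths and a dummy file rather than real weights.

```python
import shutil
import tempfile
from pathlib import Path


def stage_checkpoint(src: Path, cache_dir: Path) -> Path:
    """Copy a weights file into cache_dir, creating the directory if needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    dst = cache_dir / src.name
    shutil.copyfile(src, dst)  # same effect as `cp src dst`
    return dst


# Demonstration with temporary paths and a dummy file instead of real weights.
work = Path(tempfile.mkdtemp())
dummy = work / "dummy_weights.pth"
dummy.write_bytes(b"not real weights")

staged = stage_checkpoint(dummy, work / "hub" / "checkpoints")
print(staged.exists())
```

In a real notebook the source would be the dataset path and the destination the checkpoint cache directory shown in the cell above.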

The cell below uses the IPython cell magic `%%writefile` to create a YAML configuration file named `config.yaml`. The training cell at the end of the notebook loads this file and reads its parameters and settings.

1. `data_path: "/kaggle/input/contrails-images-ash-color"`: Specifies the path to the dataset.
2. `output_dir: "models"`: Defines the directory where model files will be saved.
3. `seed: 42`: Sets the random seed to ensure reproducibility.
4. `train_bs: 48`, `valid_bs: 128`, `workers: 2`: Specifies the batch sizes for training and validation, and the number of workers for data loading.
5. `progress_bar_refresh_rate: 1`: Determines how often the progress bar should be updated.

6. `early_stop`: Configures early stopping parameters:
   - `monitor: "val_loss"`: Determines which metric to monitor for early stopping.
   - `mode: "min"`: Stops training when the monitored metric stops decreasing.
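   - `patience: 999`: Determines how many validation checks without improvement are tolerated before stopping. Because this far exceeds `max_epochs: 14`, early stopping is effectively disabled for this run.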
   - `verbose: 1`: Turns on verbosity, i.e., print messages during execution.

7. `trainer`: Configures training parameters:
   - `max_epochs: 14`, `min_epochs: 1`: Defines the maximum and minimum number of epochs.
   - `enable_progress_bar: True`: Turns on the progress bar.
   - `precision: "16-mixed"`: Sets the precision for training (16-bit mixed precision).
   - `devices: 2`: Specifies the number of devices to be used for training.

8. `model`: Configures the model and optimization parameters:
   - `seg_model: "FPN"`: Specifies the segmentation model to be used (Feature Pyramid Network).
   - `encoder_name: "timm-resnest50d"`: Sets the encoder model.
   - `loss_smooth: 1.0`: Sets the smoothing factor for the loss function.
   - `image_size: 384`: Sets the size of input images.
   - `optimizer_params`: Sets parameters for the optimizer (`lr: 0.0005`, `weight_decay: 0.0`).
   - `scheduler`: Selects one learning rate scheduler by `name` (here `CosineAnnealingLR`) and stores parameters for both supported options:
      - `CosineAnnealingLR`: Sets parameters for the cosine annealing learning rate scheduler.
      - `ReduceLROnPlateau`: Sets parameters for the learning rate scheduler that reduces the learning rate when a metric has stopped improving.

This configuration file is an efficient way to manage hyperparameters and settings for your machine learning experiments. It allows easy tweaking of parameters without having to modify the source code directly.

In [3]:
%%writefile config.yaml

data_path: "/kaggle/input/contrails-images-ash-color"
output_dir: "models"

seed: 42

train_bs: 48
valid_bs: 128
workers: 2

progress_bar_refresh_rate: 1

early_stop:
    monitor: "val_loss"
    mode: "min"
    patience: 999
    verbose: 1

trainer:
    max_epochs: 14
    min_epochs: 1
    enable_progress_bar: True
    precision: "16-mixed"
    devices: 2

model:
    seg_model: "FPN"
    encoder_name: "timm-resnest50d"
    loss_smooth: 1.0
    image_size: 384
    optimizer_params:
        lr: 0.0005
        weight_decay: 0.0
    scheduler:
        name: "CosineAnnealingLR"
        params:
            CosineAnnealingLR:
                T_max: 2
                eta_min: 1.0e-6
                last_epoch: -1
            ReduceLROnPlateau:
                mode: "min"
                factor: 0.31622776601
                patience: 4
                verbose: True

Writing config.yaml
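
One value in the file that is easy to misread is the `ReduceLROnPlateau` factor `0.31622776601`, which is approximately `10 ** -0.5`. The sketch below, using a hypothetical helper `lr_after_reductions`, shows that two consecutive reductions lower the learning rate by a factor of ten. This scheduler is not the one selected by `name` in the configuration above, so the arithmetic only applies if `name` is switched to `ReduceLROnPlateau`.

```python
def lr_after_reductions(lr: float, factor: float, n: int) -> float:
    """Learning rate after n multiplicative reductions by `factor`."""
    return lr * factor ** n


base_lr = 0.0005        # model.optimizer_params.lr from config.yaml
factor = 0.31622776601  # ReduceLROnPlateau factor, roughly 10 ** -0.5

for n in range(3):
    print(n, f"{lr_after_reductions(base_lr, factor, n):.3e}")
```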


The next code defines a custom PyTorch `Dataset` class called `ContrailsDataset`. The `Dataset` class is an abstract class representing a dataset in PyTorch and the custom class must override the `__getitem__` and `__len__` methods. 

1. `def __init__(self, df, image_size=256, train=True)`: This is the constructor of the `ContrailsDataset` class. It takes three parameters: a pandas DataFrame `df` with a `path` column pointing to one `.npy` file per record (each file holds both the image channels and the mask), the desired `image_size`, and a `train` flag that indicates whether the dataset is for training. It initializes `self.df`, `self.trn`, `self.normalize_image`, `self.image_size`, and, if the image size is not 256, `self.resize_image`. The `self.trn` flag is stored but not used elsewhere in the class.

2. `self.normalize_image = T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))`: This line sets up a normalization transformation for the images using the means and standard deviations of the RGB channels respectively. These values are standard for models pre-trained on ImageNet.

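3. `def __getitem__(self, index)`: This method is called when an element is accessed via indexing. It returns a tuple `(img, label)` where `img` is a normalized, potentially resized, tensor version of the image and `label` is a tensor version of the corresponding mask. It looks up the file path for the row, loads the array, and splits it into image channels (all but the last) and label (the last channel). The image is reshaped, converted to a channels-first float tensor, possibly resized, and then normalized. Only the image is resized; the label stays at 256×256, which is why the model's predictions are interpolated back to 256 before the loss is computed.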

4. `def __len__(self)`: This method returns the number of items in the dataset, which corresponds to the number of rows in the DataFrame `df`.

By creating a custom `Dataset` class, you can easily load and preprocess your data in a way that is optimized for training a PyTorch model.


In [4]:
# Dataset

import torch
import numpy as np
import torchvision.transforms as T

class ContrailsDataset(torch.utils.data.Dataset):
    def __init__(self, df, image_size=256, train=True):

        self.df = df
        self.trn = train
        self.normalize_image = T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
        self.image_size = image_size
        if image_size != 256:
            self.resize_image = T.Resize(image_size)

    def __getitem__(self, index):
        row = self.df.iloc[index]
        con_path = row.path
        con = np.load(str(con_path))

        img = con[..., :-1]
        label = con[..., -1]

        label = torch.tensor(label)

        img = torch.tensor(np.reshape(img, (256, 256, 3))).to(torch.float32).permute(2, 0, 1)

        if self.image_size != 256:
            img = self.resize_image(img)

        img = self.normalize_image(img)

        return img.float(), label.float()

    def __len__(self):
        return len(self.df)
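
The array handling inside `__getitem__` can be illustrated with NumPy alone. The sketch below assumes, as the indexing in the class suggests, that each record is an `(H, W, 4)` array whose last channel is the mask; `split_and_normalize` is a hypothetical helper, and the random 8×8 record stands in for a real 256×256 file.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def split_and_normalize(record: np.ndarray):
    """Split an (H, W, 4) record into a normalized (3, H, W) image and an (H, W) mask."""
    img = record[..., :-1].astype(np.float32)   # first three channels: image
    label = record[..., -1].astype(np.float32)  # last channel: mask
    img = (img - IMAGENET_MEAN) / IMAGENET_STD  # broadcasts over the channel axis
    return img.transpose(2, 0, 1), label        # channels-first, like permute(2, 0, 1)


# Synthetic record standing in for one .npy file (real files are 256 x 256).
rng = np.random.default_rng(0)
record = rng.random((8, 8, 4)).astype(np.float32)

img, label = split_and_normalize(record)
print(img.shape, label.shape)
```

The normalization constants assume pixel values scaled to the range 0 to 1, which is the convention for ImageNet-pretrained encoders.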

The next cell defines a custom PyTorch Lightning module, `LightningModule`, which organizes the training code so that it is readable and reusable. The class bundles the core parts of the training setup: the model architecture, the loss, the optimizer and scheduler, and the training and validation steps. Data loading is handled separately by the `DataLoader` objects created in the final cell.

Here is what each part of the code does:

1. `seg_models`: This is a dictionary that maps the name of a model to its corresponding class in the `segmentation_models_pytorch` (smp) library.

2. `class LightningModule(pl.LightningModule)`: This starts the definition of the custom LightningModule class.

3. `def __init__(self, config)`: This is the constructor of the class. It takes a configuration dictionary `config` as input, sets up the model according to the configuration, and defines the loss function.

4. `def forward(self, batch)`: This is the forward pass of the model which takes an input `batch` and returns the predictions.

5. `def configure_optimizers(self)`: This method returns the optimizer and learning rate scheduler based on the configuration. The optimizer is `AdamW` with parameters specified in `config["optimizer_params"]`. Depending on the `config["scheduler"]["name"]`, it selects either a `CosineAnnealingLR` or `ReduceLROnPlateau` scheduler.

6. `def training_step(self, batch, batch_idx)`: This method defines what happens in one step of training. It calculates the loss, logs it and the learning rate, and returns the loss.

7. `def validation_step(self, batch, batch_idx)`: This method defines what happens in one step of validation. It calculates and logs the validation loss, and appends the predictions and labels to `self.val_step_outputs` and `self.val_step_labels` respectively for later use.

8. `def on_validation_epoch_end(self)`: This method is called at the end of each validation epoch. It concatenates all validation predictions and labels, calculates the Dice coefficient between them, logs it, and clears `self.val_step_outputs` and `self.val_step_labels`. It also prints the current epoch number if the current process is the main one (when using distributed computing).

By using a PyTorch Lightning Module, you can easily manage your PyTorch code, making it much easier to read, write, and debug.

In [5]:
# Lightning module

import torch
import pytorch_lightning as pl
import segmentation_models_pytorch as smp
from torch.optim.lr_scheduler import CosineAnnealingLR, ReduceLROnPlateau
from torch.optim import AdamW
import torch.nn as nn
from torchmetrics.functional import dice

seg_models = {
    "Unet": smp.Unet,
    "Unet++": smp.UnetPlusPlus,
    "MAnet": smp.MAnet,
    "Linknet": smp.Linknet,
    "FPN": smp.FPN,
    "PSPNet": smp.PSPNet,
    "PAN": smp.PAN,
    "DeepLabV3": smp.DeepLabV3,
    "DeepLabV3+": smp.DeepLabV3Plus,
}


class LightningModule(pl.LightningModule):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.model = seg_models[config["seg_model"]](
            encoder_name=config["encoder_name"],
            encoder_weights="imagenet",
            in_channels=3,
            classes=1,
            activation=None,
        )
        self.loss_module = smp.losses.DiceLoss(mode="binary", smooth=config["loss_smooth"])
        self.val_step_outputs = []
        self.val_step_labels = []

    def forward(self, batch):
        imgs = batch
        preds = self.model(imgs)
        return preds

    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), **self.config["optimizer_params"])

        if self.config["scheduler"]["name"] == "CosineAnnealingLR":
            scheduler = CosineAnnealingLR(
                optimizer,
                **self.config["scheduler"]["params"]["CosineAnnealingLR"],
            )
            lr_scheduler_dict = {"scheduler": scheduler, "interval": "step"}
            return {"optimizer": optimizer, "lr_scheduler": lr_scheduler_dict}
        elif self.config["scheduler"]["name"] == "ReduceLROnPlateau":
            scheduler = ReduceLROnPlateau(
                optimizer,
                **self.config["scheduler"]["params"]["ReduceLROnPlateau"],
            )
            lr_scheduler = {"scheduler": scheduler, "monitor": "val_loss"}
            return {"optimizer": optimizer, "lr_scheduler": lr_scheduler}

    def training_step(self, batch, batch_idx):
        imgs, labels = batch
        preds = self.model(imgs)
        if self.config["image_size"] != 256:
            preds = torch.nn.functional.interpolate(preds, size=256, mode='bilinear')
        loss = self.loss_module(preds, labels)
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, batch_size=imgs.size(0))

        for param_group in self.trainer.optimizers[0].param_groups:
            lr = param_group["lr"]
        self.log("lr", lr, on_step=True, on_epoch=False, prog_bar=True)

        return loss

    def validation_step(self, batch, batch_idx):
        imgs, labels = batch
        preds = self.model(imgs)
        if self.config["image_size"] != 256:
            preds = torch.nn.functional.interpolate(preds, size=256, mode='bilinear')
        loss = self.loss_module(preds, labels)
        self.log("val_loss", loss, on_step=False, on_epoch=True, prog_bar=True)
        self.val_step_outputs.append(preds)
        self.val_step_labels.append(labels)

    def on_validation_epoch_end(self):
        all_preds = torch.cat(self.val_step_outputs)
        all_labels = torch.cat(self.val_step_labels)
        all_preds = torch.sigmoid(all_preds)
        self.val_step_outputs.clear()
        self.val_step_labels.clear()
        val_dice = dice(all_preds, all_labels.long())
        self.log("val_dice", val_dice, on_step=False, on_epoch=True, prog_bar=True)
        if self.trainer.global_rank == 0:
            print(f"\nEpoch: {self.current_epoch}", flush=True)
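
To make the metric and the loss concrete, here is a small pure-Python sketch of the Dice coefficient and of a commonly used smoothed soft-Dice loss. It illustrates the idea only and is not the exact implementation inside `torchmetrics` or `segmentation_models_pytorch`; both function names are hypothetical.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flat 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def smoothed_dice_loss(probs, target, smooth=1.0):
    """A common smoothed soft-Dice loss; `smooth` avoids division by zero."""
    intersection = sum(p * t for p, t in zip(probs, target))
    total = sum(probs) + sum(target)
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(dice_coefficient(pred, target))
print(smoothed_dice_loss([0.9, 0.6, 0.1, 0.0], target))
```

With a smoothing term, an empty mask predicted as empty yields a loss of zero instead of an undefined ratio, which matters for contrail data where many images contain no positive pixels.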



The last block of code describes the actual process of training the model:

1. Import necessary libraries and modules: `warnings`, `os`, `torch`, `yaml`, `pandas`, `pytorch_lightning`, `pprint`, `ModelCheckpoint`, `EarlyStopping`, `TQDMProgressBar`, and `DataLoader` are imported from their respective libraries.

2. Suppresses warnings: `warnings.filterwarnings("ignore")` is used to ignore any warnings that might be raised during the execution of the code.

3. Load Configuration: The `config.yaml` file is loaded, and the configurations defined in the file are stored in the `config` dictionary.

4. Prepare Dataset: File paths for the training and validation datasets are created. The `ContrailsDataset` class defined earlier is used to create the training and validation datasets. These datasets are then wrapped into DataLoader instances for efficient loading and batching of data during training.

5. Setup Callbacks: Several callback functions are defined. `ModelCheckpoint` saves the weights of the model with the highest validation Dice coefficient. `TQDMProgressBar` is a progress bar. `EarlyStopping` stops the training early if the validation loss doesn't improve for a certain number of epochs (patience) as defined in the `config` file.

6. Instantiate Trainer: An instance of `pl.Trainer` is created with the specified callbacks and other configurations from the `config` dictionary.

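7. Adjust Scheduler: `T_max` for the `CosineAnnealingLR` scheduler is converted from epochs to optimizer steps. The configured value (2) is multiplied by the number of training batches per epoch divided by the number of devices, because the scheduler is stepped after every batch (`"interval": "step"`) and, assuming the batches are split across devices, each device sees only its share per epoch.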

8. Instantiate Model: An instance of the `LightningModule` (defined in the previous block) is created with the configurations defined in `config["model"]`.

9. Train Model: The `fit` method of the `pl.Trainer` instance is called to train the model on the training data and validate it on the validation data.
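
The effect of step 7 can be sketched with the closed-form cosine annealing expression given in the PyTorch documentation for `CosineAnnealingLR`. The value `steps_per_epoch = 100` is a hypothetical placeholder for `len(data_loader_train) / devices`; the learning rate and `eta_min` are taken from `config.yaml`.

```python
import math


def cosine_annealing_lr(step: int, base_lr: float, eta_min: float, t_max: float) -> float:
    """Closed-form cosine annealing: starts at base_lr, reaches eta_min at t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))


base_lr = 0.0005
eta_min = 1.0e-6
steps_per_epoch = 100        # hypothetical: len(data_loader_train) / devices
t_max = 2 * steps_per_epoch  # configured T_max of 2 epochs, expressed in steps

for epoch in (0, 1, 2, 3, 4, 14):
    lr = cosine_annealing_lr(epoch * steps_per_epoch, base_lr, eta_min, t_max)
    print(f"after epoch {epoch:2d}: lr = {lr:.2e}")
```

Under this expression the learning rate falls to `eta_min` after 2 epochs and climbs back to the initial value after 4, so the schedule has a period of 4 epochs. With `max_epochs: 14` the run would end at a minimum of the cosine, assuming the scheduler keeps following the cosine past `T_max` rather than holding at `eta_min`.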

In [6]:
# Actual training

import warnings

warnings.filterwarnings("ignore")

import os
import torch
import yaml
import pandas as pd
import pytorch_lightning as pl
from pprint import pprint
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping, TQDMProgressBar
from torch.utils.data import DataLoader

with open("config.yaml", "r") as file_obj:
    config = yaml.safe_load(file_obj)

contrails = os.path.join(config["data_path"], "contrails/")
train_path = os.path.join(config["data_path"], "train_df.csv")
valid_path = os.path.join(config["data_path"], "valid_df.csv")

train_df = pd.read_csv(train_path)
valid_df = pd.read_csv(valid_path)

train_df["path"] = contrails + train_df["record_id"].astype(str) + ".npy"
valid_df["path"] = contrails + valid_df["record_id"].astype(str) + ".npy"

dataset_train = ContrailsDataset(train_df, config["model"]["image_size"], train=True)
dataset_validation = ContrailsDataset(valid_df, config["model"]["image_size"], train=False)

data_loader_train = DataLoader(
    dataset_train,
    batch_size=config["train_bs"],
    shuffle=True,
    num_workers=config["workers"],
)
data_loader_validation = DataLoader(
    dataset_validation,
    batch_size=config["valid_bs"],
    shuffle=False,
    num_workers=config["workers"],
)

checkpoint_callback = ModelCheckpoint(
    save_weights_only=True,
    monitor="val_dice",
    dirpath=config["output_dir"],
    mode="max",
    filename="model",
    save_top_k=1,
    verbose=1,
)

progress_bar_callback = TQDMProgressBar(
    refresh_rate=config["progress_bar_refresh_rate"]
)

early_stop_callback = EarlyStopping(**config["early_stop"])

trainer = pl.Trainer(
    callbacks=[checkpoint_callback, early_stop_callback, progress_bar_callback],
    **config["trainer"],
)

config["model"]["scheduler"]["params"]["CosineAnnealingLR"]["T_max"] *= len(data_loader_train)/config["trainer"]["devices"]
model = LightningModule(config["model"])

trainer.fit(model, data_loader_train, data_loader_validation)

Downloading: "https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-resnest/resnest50-528c19ca.pth" to /root/.cache/torch/hub/checkpoints/resnest50-528c19ca.pth
100%|██████████| 105M/105M [00:00<00:00, 134MB/s]



Epoch: 0
Epoch: 0
Epoch: 1
Epoch: 2
Epoch: 3
Epoch: 4
Epoch: 5
Epoch: 6
Epoch: 7
Epoch: 8
Epoch: 9
Epoch: 10
Epoch: 11
Epoch: 12
Epoch: 13
