In this notebook you will learn how to get started with PytorchEO by training a baseline model for the [On Cloud N: Cloud Cover Detection Challenge](https://www.drivendata.org/competitions/83/cloud-cover/), following the [baseline](https://www.drivendata.co/blog/cloud-cover-benchmark/) published by the challenge authors.

> 🚧 PytorchEO is in the early stages of development: documentation is sparse and the API may change. If you find it useful, please let us know so we can push its development forward.

## The Dataset

PytorchEO is built around `Datasets` and `Tasks`. This challenge asks us to solve an image segmentation task on a dataset of Sentinel-2 images with corresponding cloud masks.

> Before continuing, please join the challenge and download the [data](https://www.drivendata.org/competitions/83/cloud-cover/data/).

In [None]:
# imports
from pathlib import Path
import pandas as pd
import numpy as np

# Pytorch EO imports
from pytorch_eo.datasets.BaseDataset import BaseDataset
from pytorch_eo.utils.datasets.ConcatDataset import ConcatDataset
from pytorch_eo.utils.datasets.SingleBandImageDataset import SingleBandImageDataset
from pytorch_eo.utils.datasets.RGBImageDataset import RGBImageDataset
from pytorch_eo.utils.sensors import Sensors

# The BaseDataset will handle data splitting and loading with pytorch dataloaders
# https://github.com/earthpulse/pytorch_eo/blob/main/pytorch_eo/datasets/BaseDataset.py

class OnCloudNDataset(BaseDataset): 

    def __init__(self,
                 batch_size=32,
                 path='data', 
                 data_folder='train_features', 
                 labels_folder='train_labels',
                 metadata_path='train_metadata.csv',
                 test_size=0.2,
                 val_size=0.2,
                 train_trans=None,
                 val_trans=None,
                 test_trans=None,
                 num_workers=0,
                 pin_memory=False,
                 seed=42,
                 verbose=False,
                 bands=None,
                 ):
        super().__init__(batch_size, test_size, val_size,
                         verbose, num_workers, pin_memory, seed)
        self.path = Path(path)
        self.metadata_path = metadata_path
        self.data_folder = data_folder
        self.labels_folder = labels_folder
        self.train_trans = train_trans
        self.val_trans = val_trans
        self.test_trans = test_trans
        self.bands = bands
        self.num_classes = 2

    def setup(self, stage=None):

        # read csv with metadata
        train_metadata = pd.read_csv(self.path / self.metadata_path)

        # generate paths to images and masks
        images = train_metadata.chip_id.apply(
            lambda cid: f'{self.path}/{self.data_folder}/{cid}')
        masks = train_metadata.chip_id.apply(
            lambda cid: f'{self.path}/{self.labels_folder}/{cid}.tif')

        # build dataframe and splits
        self.df = pd.DataFrame({'image': images, 'mask': masks})
        self.make_splits()

        # generate datasets
        self.build_datasets()

    def build_dataset(self, df, trans):
        return ConcatDataset({
            'image': SingleBandImageDataset(df.image.values, Sensors.S2, self.bands),
            'mask': RGBImageDataset(df['mask'].values, dtype=np.float32)
        }, trans)
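
To make the folder layout concrete, the following minimal sketch rebuilds the paths that `setup` generates for a single chip. The helper `chip_paths` and the chip id `abcd` are hypothetical and only for illustration; the per-band file names (`B02.tif`, ...) are assumed to match the ones read by the submission script later in this notebook.

```python
from pathlib import Path


def chip_paths(root, chip_id, bands,
               data_folder="train_features", labels_folder="train_labels"):
    """Return the band file paths and the mask path for one chip (hypothetical helper)."""
    root = Path(root)
    image_dir = root / data_folder / chip_id          # one folder per chip
    band_files = [image_dir / f"{band}.tif" for band in bands]
    mask_file = root / labels_folder / f"{chip_id}.tif"  # one mask file per chip
    return band_files, mask_file


# 'abcd' is a made-up chip id
band_files, mask_file = chip_paths("data", "abcd", ["B02", "B03", "B04", "B08"])
print(band_files[0])
print(mask_file)
```

Each image is a folder with one GeoTIFF per band, while each mask is a single file, which is why the dataset class combines two different image datasets under the `image` and `mask` keys.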


## The Model

You can use any model you want with PytorchEO. In this example we use the [Segmentation Models PyTorch](https://github.com/qubvel/segmentation_models.pytorch) library to build a simple U-Net.

In [None]:
import segmentation_models_pytorch as smp
from einops import rearrange
import torch

class Model(torch.nn.Module):

	def __init__(self, in_chans=3, backbone='resnet34', num_classes=1, max_value=4000, pretrained=None):
		super().__init__()
		self.model = smp.Unet(
			encoder_name=backbone,
			encoder_weights=pretrained, # imagenet
			in_channels=in_chans,
			classes=num_classes,
		)
		self.max_value = max_value

	def forward(self, x):
		x = rearrange(x, 'b h w c -> b c h w')
		x = x / self.max_value
		x = x.clip(0., 1.)
		y = self.model(x).squeeze(1) # remove channels dim
		return y
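
To see what the network actually receives, the preprocessing at the top of `forward` can be mirrored in NumPy. This is only an illustrative sketch (the function name `preprocess` is not part of the library): it moves channels first, scales by `max_value` and clips to `[0, 1]`, so any reflectance above `max_value` saturates at 1.

```python
import numpy as np


def preprocess(x, max_value=4000):
    """NumPy mirror of the first three lines of Model.forward."""
    x = np.transpose(x, (0, 3, 1, 2))  # 'b h w c -> b c h w'
    x = x / max_value                  # scale raw values
    return np.clip(x, 0.0, 1.0)        # saturate values above max_value


# one 1x1 "image" with four bands: 0, 2000, 4000 and 8000
batch = np.array([0, 2000, 4000, 8000], dtype=np.float32).reshape(1, 1, 1, 4)
out = preprocess(batch)
print(out.shape)    # (1, 4, 1, 1)
print(out.ravel())  # values 0, 0.5, 1 and 1 (8000 is clipped)
```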

## The Task

PytorchEO comes with several built-in tasks, with more to come. In this case, we use the `ImageSegmentation` task.

In [None]:
from pytorch_eo.tasks.segmentation import ImageSegmentation

model = Model()

task = ImageSegmentation(model)

out = task(torch.randn(32, 512, 512, 3))

out.shape, out.dtype

## Training

We use [PyTorch Lightning](https://pytorch-lightning.readthedocs.io/) for training.


In [None]:
import pytorch_lightning as pl 
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_eo.utils.sensors import S2

bands = [S2.blue, S2.green, S2.red, S2.nir1]

ds = OnCloudNDataset(
    batch_size=10, 
    bands=bands,
)

model = Model(in_chans=len(bands))

task = ImageSegmentation(model)

# overfit batches to check if model is working

trainer = pl.Trainer(
    max_epochs=30,
    overfit_batches=1,
    checkpoint_callback=False,
    logger=None,
    gpus=1,
    precision=16,
)

trainer.fit(task, ds)

Feel free to add your favourite callbacks, accelerators, data augmentation and so on. The following is a more complete example.

In [None]:
import albumentations as A 

pl.seed_everything(42, workers=True) # make results reproducible

trans = A.Compose([ # some data augmentation (we use albumentations)
	A.RandomRotate90(),
	A.HorizontalFlip(),
	A.VerticalFlip(),
])

bands = [S2.blue, S2.green, S2.red, S2.nir1] # choose any bands combination

ds = OnCloudNDataset(
    batch_size=64, # increase the batch size to fully use your GPU
    bands=bands,
    num_workers=20, # faster data loading (set this to your CPU core count)
    pin_memory=True, # faster data loading
    train_trans=trans,
)

model = Model(in_chans=len(bands))

hparams = { # customize optimizer
    'optimizer': 'Adam',
    'optim_params': {
        'lr': 1e-3
    },
    'scheduler': 'MultiStepLR',
    'scheduler_params': {
        'milestones': [3, 6],
        'verbose': True
    } # add anything you want to save as hparams with your model
}

# customize your metrics (use as many as you want)

def iou(pr, gt, th=0.5, eps=1e-7):
    mask = gt.ne(255) # ignore value 255 in mask
    gt = gt.masked_select(mask)
    pr = pr.masked_select(mask)
    pr = torch.sigmoid(pr) > th
    gt = gt > th
    intersection = torch.sum(gt & pr)
    union = torch.sum(gt | pr)
    return intersection / (union + eps)

metrics = {'iou': iou} 

# customize your loss function

loss_fn = smp.losses.SoftBCEWithLogitsLoss(ignore_index=255) # ignore value 255 in mask

# train the model

task = ImageSegmentation(model, hparams=hparams, metrics=metrics, loss_fn=loss_fn)

trainer = pl.Trainer(
    max_epochs=10,
    gpus=1,
    precision=16,
    callbacks=[ # save best model during training
        ModelCheckpoint(
            dirpath='./',
            filename="unet-baseline-{val_iou:.4f}",
            save_top_k=1,
            monitor='val_iou',
            mode='max'
        )
    ],
    deterministic=True, # make results reproducible
)

trainer.fit(task, ds)
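
As a sanity check on the `iou` metric defined above, here is a small worked example. It is a NumPy re-implementation written for illustration (the name `iou_np` is not part of the library), applied to five made-up pixels, one of which carries the ignore value 255.

```python
import numpy as np


def iou_np(logits, gt, th=0.5, eps=1e-7):
    """NumPy mirror of the torch `iou` metric, ignoring pixels labelled 255."""
    valid = gt != 255                           # drop ignored pixels
    gt = gt[valid]
    logits = logits[valid]
    pr = 1.0 / (1.0 + np.exp(-logits)) > th     # sigmoid, then threshold
    gt = gt > th
    intersection = np.sum(gt & pr)
    union = np.sum(gt | pr)
    return intersection / (union + eps)


logits = np.array([3.0, 2.0, -1.0, -4.0, 5.0])  # raw model outputs
gt = np.array([1, 0, 1, 0, 255])                # last pixel is ignored

# valid predictions: [1, 1, 0, 0], valid labels: [1, 0, 1, 0]
# intersection = 1, union = 3
score = iou_np(logits, gt)
print(round(float(score), 4))  # 0.3333
```

The last pixel has a confident prediction, but it does not affect the score because its label is 255.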

## Submission

This is a code submission challenge; you can find an example of how to submit [here](https://www.drivendata.co/blog/cloud-cover-benchmark/). First, we have to export our model, in this case using TorchScript.

In [None]:
# optionally, load from checkpoint
# trace in eval mode: layers such as BatchNorm keep the mode they had during tracing
cpu_model = model.cpu().eval()
sample_input_cpu = torch.randn(32, 512, 512, len(bands))
traced_cpu = torch.jit.trace(cpu_model, sample_input_cpu)
torch.jit.save(traced_cpu, "model.pt")
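
Before packaging the export, it can be worth checking that the traced model reproduces the original outputs after a save/load round trip. The sketch below uses a tiny stand-in network (`TinyNet` is hypothetical, not the U-Net above) and an in-memory buffer instead of a file. It puts the network in eval mode first, because a traced module keeps the training or eval behaviour it had when it was traced.

```python
import io

import torch


class TinyNet(torch.nn.Module):
    """Hypothetical stand-in for the trained model (channels-last input)."""

    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(4, 1, kernel_size=3, padding=1)
        self.bn = torch.nn.BatchNorm2d(1)

    def forward(self, x):
        x = x.permute(0, 3, 1, 2)                # b h w c -> b c h w
        return self.bn(self.conv(x)).squeeze(1)  # remove channels dim


torch.manual_seed(0)
net = TinyNet().eval()                 # eval mode before tracing
sample = torch.randn(2, 16, 16, 4)
traced = torch.jit.trace(net, sample)

# round trip through an in-memory buffer instead of "model.pt"
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)
reloaded = torch.jit.load(buffer)

with torch.no_grad():
    same = torch.allclose(net(sample), reloaded(sample), atol=1e-6)
print(same)
```

The same comparison can be run with the real model and `model.pt` to catch export problems before submitting.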

Finally, write a `main.py` file with the following content and zip it together with the exported model as `submission.zip` to upload to the challenge platform.

In [None]:
from pathlib import Path
import os
from tifffile import imsave, imread
import numpy as np
import torch
from tqdm import tqdm
from torch.utils.data import DataLoader, Dataset

images_path = Path('/codeexecution/data/test_features')
predictions_path = Path('/codeexecution/predictions')
chip_ids = os.listdir(images_path)
bands = ['B02', 'B03', 'B04', 'B08']

class MyDataset(Dataset):
    def __init__(self, images):
        super().__init__()
        self.images = images

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        chip_id = self.images[idx]
        img = [imread(images_path / chip_id / f'{band}.tif') for band in bands]
        img = np.stack(img, axis=-1).astype(np.float32)
        return torch.from_numpy(img), chip_id


ds = MyDataset(chip_ids)

dl = DataLoader(ds, batch_size=64, shuffle=False,
                num_workers=4, pin_memory=True)

model = torch.jit.load("model.pt")
model.cuda()
model.eval()

for batch in dl:
    imgs, chips = batch
    with torch.no_grad():
        pred = model(imgs.cuda())
    masks = torch.sigmoid(pred) > 0.5
    masks = masks.cpu().numpy().astype(np.uint8)
    for i, chip in enumerate(chips):
        imsave(predictions_path / f'{chip}.tif', masks[i, ...])
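
Once `main.py` and `model.pt` exist, the archive can be built with Python's standard `zipfile` module. This is a minimal sketch that assumes the platform expects both files at the root of `submission.zip`; please check the challenge's submission instructions. The helper `build_submission` is hypothetical, and the demo uses placeholder files in a temporary directory.

```python
import tempfile
import zipfile
from pathlib import Path


def build_submission(main_py, model_pt, out_zip):
    """Write main.py and model.pt at the root of the archive (assumed layout)."""
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(main_py, arcname="main.py")
        zf.write(model_pt, arcname="model.pt")
    return Path(out_zip)


# demo with placeholder files; use your real main.py and model.pt instead
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "main.py").write_text("print('placeholder')\n")
    (tmp / "model.pt").write_bytes(b"placeholder")
    archive = build_submission(tmp / "main.py", tmp / "model.pt",
                               tmp / "submission.zip")
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())

print(names)  # ['main.py', 'model.pt']
```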



> PyTorch 1.8 is required at the time of writing. This is the version used by the scoring platform, and other versions may cause conflicts with TorchScript.

I hope you like the library. We are planning to include more datasets and tasks in the future. If you find it useful, please get in touch!