# **Image Processing and Computer Vision (PIVA)**
2022 - Javier Ruiz Hidalgo - [GPI @ IDEAI](https://imatge.upc.edu/web/) Research group // [ETSETB – UPC.TelecosBCN](https://telecos.upc.edu/ca)



# Lab - Instance Segmentation
The goal of this laboratory is to fine-tune a pre-trained [Mask R-CNN](https://arxiv.org/abs/1703.06870) model on the [*Penn-Fudan Database for Pedestrian Detection and Segmentation*](https://www.cis.upenn.edu/~jshi/ped_html/). We will evaluate the model before and after fine-tuning to see the improvement on an instance segmentation task.


## 1. Penn-Fudan dataset

We will use this dataset to detect and segment people. 

First, let's download and extract the data, present in a zip file at https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip

In [None]:
# download the Penn-Fudan dataset
!wget https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip
# extract it in the current folder
!unzip -q PennFudanPed.zip
!pip install opencv-contrib-python==4.7.0.72

Let's have a look at the dataset and how it is laid out.

The data is structured as follows:
```
PennFudanPed/
  PedMasks/
    FudanPed00001_mask.png
    FudanPed00002_mask.png
    FudanPed00003_mask.png
    FudanPed00004_mask.png
    ...
  PNGImages/
    FudanPed00001.png
    FudanPed00002.png
    FudanPed00003.png
    FudanPed00004.png
```

Here is one example of an image in the dataset, with its corresponding instance segmentation mask:

In [None]:
from PIL import Image
Image.open('PennFudanPed/PNGImages/FudanPed00012.png')

In [None]:
mask = Image.open('PennFudanPed/PedMasks/FudanPed00012_mask.png').convert('P')
# each mask instance has a different color, from zero to N, where
# N is the number of instances. In order to make visualization easier,
# let's add a color palette to the mask.
mask.putpalette([
    0, 0, 0, # black background
    255, 0, 0, # index 1 is red
    255, 255, 0, # index 2 is yellow
    255, 153, 0, # index 3 is orange
])
mask

## 2. Training and evaluation functions
In `references/detection/`, the TorchVision repository provides a number of helper functions to simplify training and evaluating detection models.
Here, we will use `references/detection/engine.py`, `references/detection/utils.py` and `references/detection/transforms.py`.

Let's copy those files (and their dependencies) here so that they are available in the notebook:



In [None]:
# Download TorchVision repo to use some files from
# references/detection
!git clone https://github.com/pytorch/vision.git
# each "!" command runs in its own shell, so "cd" and "checkout" must share one line
!cd vision && git checkout v0.15.1

!cp /kaggle/working/vision/references/detection/utils.py /kaggle/working
!cp /kaggle/working/vision/references/detection/transforms.py /kaggle/working
!cp /kaggle/working/vision/references/detection/coco_eval.py /kaggle/working
!cp /kaggle/working/vision/references/detection/engine.py /kaggle/working
!cp /kaggle/working/vision/references/detection/coco_utils.py /kaggle/working

Let's write some helper functions for data augmentation / transformation, which leverage the functions in `references/detection` that we have just copied:


In [None]:
!pip install pycocotools

In [None]:
import torch  # needed by get_transform (torch.float)

from engine import train_one_epoch, evaluate
import utils
import transforms as T


def get_transform(train):
    transforms = []
    transforms.append(T.PILToTensor())
    transforms.append(T.ConvertImageDtype(torch.float))
    if train:
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)

## 3. Writing a custom dataset for Penn-Fudan

Let's write a dataset for the Penn-Fudan dataset. Each image has a corresponding segmentation mask, where each color corresponds to a different instance. We will write a `torch.utils.data.Dataset` class for this dataset.
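Before looking at the full class, here is a minimal sketch of its core step on a made-up 4x6 mask (the array values are purely illustrative, not taken from the dataset): the color-encoded mask is split into one boolean mask per instance, and a bounding box is derived from each of them.

```python
import numpy as np

# Toy instance mask: 0 is background, 1 and 2 are two made-up instances.
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
])

# Instance ids, dropping the first one (background).
obj_ids = np.unique(mask)[1:]

# Broadcasting compares the (H, W) mask against each id,
# giving one boolean mask per instance with shape (N, H, W).
masks = mask == obj_ids[:, None, None]

# Bounding box [xmin, ymin, xmax, ymax] of each boolean mask.
boxes = []
for m in masks:
    ys, xs = np.where(m)
    boxes.append([int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())])

print(masks.shape)  # (2, 4, 6)
print(boxes)        # [[1, 1, 2, 2], [4, 2, 5, 3]]
```

Note that dropping the first id assumes the background value 0 is present in the mask, which is the case for the masks described here.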


In [None]:
import os
import numpy as np

import torch
import torch.utils.data

from torchvision import transforms
from torchvision import utils as tutils


import skimage.transform as sktf
import skimage.io as skio

from PIL import Image


class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)

        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]

        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]

        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])

        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)

        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)


In [None]:
import os
import numpy as np

import torch
import torch.utils.data

from torchvision import transforms
from torchvision import utils as tutils


import skimage.transform as sktf
import skimage.io as skio

from PIL import Image


class DatasetPerson(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "images"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "masks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "images", self.imgs[idx])
        mask_path = os.path.join(self.root, "masks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)

        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]

        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]

        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])

        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)

        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)


In [None]:
dataset = DatasetPerson('/kaggle/input/personespiva/dataset_person/data/',get_transform(train=False))
(img,target) = dataset[11]

# Let's see the shape of the tensors
print(img.shape)
print(target["masks"].shape)
print(target["labels"].shape)
print(target["boxes"].shape)


That's all for the dataset. Let's see how the outputs are structured for this dataset:

In [None]:
dataset = PennFudanDataset('PennFudanPed/',get_transform(train=False))
(img,target) = dataset[11]

# Let's see the shape of the tensors
print(img.shape)
print(target["masks"].shape)
print(target["labels"].shape)
print(target["boxes"].shape)


Let's look again at an example, with the masks and bounding boxes drawn on top of the image. We will use a helper function to draw the masks and detections:

In [None]:
COCO_NAMES = ['__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']

COLORS = np.random.uniform(0, 255, size=(len(COCO_NAMES), 3)).astype(int)
import cv2
import random

def draw_segmentation_map(image, target, score_thres=0.8):
    
    # Convert back to numpy arrays
    _image = np.copy(image.cpu().detach().numpy().transpose(1,2,0)*255)
    _masks = np.copy(target['masks'].cpu().detach().numpy().astype(np.float32))
    _boxes = np.copy(target['boxes'].cpu().detach().numpy().astype(int))
    _labels = np.copy(target['labels'].cpu().detach().numpy().astype(int))
    if "scores" in target:
      _scores = np.copy(target["scores"].cpu().detach().numpy())
    else:
      _scores = np.ones(len(_masks),dtype=np.float32)

    alpha = 0.3
    
    label_names = [COCO_NAMES[i] for i in _labels]

    # Add mask if _scores
    m = np.zeros_like(_masks[0].squeeze())
    for i in range(len(_masks)):
      if _scores[i] > score_thres:
        m = m + _masks[i]
    
    # Make sure m is the right shape
    m = m.squeeze()

    # dark pixel outside masks
    _image[m<0.5] = 0.3*_image[m<0.5]

    # convert from RGB to OpenCV BGR and back (cv2.rectangle is just too picky)
    _image = cv2.cvtColor(_image, cv2.COLOR_RGB2BGR)
    _image = cv2.cvtColor(_image, cv2.COLOR_BGR2RGB)

    for i in range(len(_masks)):
      if _scores[i] > score_thres:         
        # apply a random color to each object
        color = COLORS[random.randrange(0, len(COLORS))].tolist()
                
        # draw the bounding boxes around the objects
        cv2.rectangle(_image, _boxes[i][0:2], _boxes[i][2:4], color=color, thickness=2)
        # put the label text above the objects
        cv2.putText(_image , label_names[i], (_boxes[i][0], _boxes[i][1]-10), 
                    cv2.FONT_HERSHEY_SIMPLEX, 1, color, 
                    thickness=1, lineType=cv2.LINE_AA)
    
    return _image/255

In [None]:
import plotly.express as px
px.imshow(draw_segmentation_map(img, target))

## 4. Using a pre-trained Mask R-CNN model

We will be using [Mask R-CNN](https://arxiv.org/abs/1703.06870), which is based on top of [Faster R-CNN](https://arxiv.org/abs/1506.01497). Faster R-CNN is a model that predicts both bounding boxes and class scores for potential objects in the image. Mask R-CNN adds an extra branch into Faster R-CNN, which also predicts segmentation masks for each instance.
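In evaluation mode, the torchvision Mask R-CNN returns one dictionary per image with `boxes`, `labels`, `scores` and `masks`, where `masks` has shape `[N, 1, H, W]` and holds per-pixel probabilities rather than binary values. The following is a minimal sketch of how such an output can be filtered by score and binarized. It uses made-up NumPy arrays as a stand-in for a real model output (which would be torch tensors), and the thresholds 0.8 and 0.5 are only illustrative choices.

```python
import numpy as np

# Made-up stand-in for one element of the model output: 3 detections on a 4x5 image.
soft_masks = np.zeros((3, 1, 4, 5), dtype=np.float32)  # [N, 1, H, W]
soft_masks[0, 0, :2, :2] = 0.9   # confident detection, top-left corner
soft_masks[1, 0, 2:, 3:] = 0.7   # confident detection, bottom-right corner
soft_masks[2, 0, :, :] = 0.6     # low-score detection covering the whole image
out = {
    "scores": np.array([0.95, 0.85, 0.30]),
    "labels": np.array([1, 1, 3]),
    "masks": soft_masks,
}

score_thres = 0.8   # keep only confident detections
mask_thres = 0.5    # binarize the per-pixel probabilities

keep = out["scores"] > score_thres
binary_masks = out["masks"][keep, 0] > mask_thres  # [K, H, W] boolean
union_mask = binary_masks.any(axis=0)              # all kept instances merged

print(binary_masks.shape)     # (2, 4, 5)
print(int(union_mask.sum()))  # 8
```

The third detection is discarded by the score threshold, so its mask does not contribute to the merged result even though its probabilities are above 0.5.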

In [None]:
import torchvision 

model = torchvision.models.detection.maskrcnn_resnet50_fpn_v2(weights='DEFAULT', progress=True)

Set the model to GPU and evaluation mode:

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using {device}.")
model.to(device).eval()

Let us show the prediction of the pre-trained model. As a reminder, we use this input image with the following ground truth:

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=1, cols=2, subplot_titles=("Input", "Ground Truth"))
fig.add_trace(go.Image(z=img.numpy().transpose(1,2,0)*255), 1, 1)
fig.add_trace(go.Image(z=draw_segmentation_map(img, target)*255), 1, 2)

Pass the image (as a batch) through the model:

In [None]:
# add a batch dimension
imgs = img.unsqueeze(0).to(device) #torch.stack((img,img))
outs = model(imgs.to(device))

Let's check how many predictions, labels and scores are found for this image (see the COCO_NAMES list for the correspondence between label numbers and semantic meaning):

In [None]:
print(f"Number of predictions = {len(outs[0]['labels'])}")
print(f"  labels = {outs[0]['labels'].cpu().numpy()}")
print(f"  scores = {outs[0]['scores'].detach().cpu().numpy()}")

Let's show the results:

In [None]:
fig = make_subplots(rows=1, cols=2, subplot_titles=("Prediction (all scores)", "Prediction (scores>0.8)"))
fig.add_trace(go.Image(z=draw_segmentation_map(img, outs[0], score_thres=0.0)*255), 1, 1)
fig.add_trace(go.Image(z=draw_segmentation_map(img, outs[0], score_thres=0.8)*255), 1, 2)

## 5. Testing the pre-trained Mask R-CNN model

First, we split the dataset into training and test sets: after a random permutation of the indices, the last 1500 images are held out for testing and all the remaining images are used for training:

In [None]:
# use our dataset and defined transformations
dataset = DatasetPerson('/kaggle/input/personespiva/dataset_person/data/',get_transform(train=False))
dataset_test = DatasetPerson('/kaggle/input/personespiva/dataset_person/data/',get_transform(train=False))

# split the dataset in train and test set (30%)
torch.manual_seed(1)
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-1500])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-1500:])

# define training and validation data loaders
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, num_workers=2,
    collate_fn=utils.collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=2,
    collate_fn=utils.collate_fn)

Now we evaluate the pre-trained model on the test set:

In [None]:
evaluate(model, data_loader_test, device=device)

We can focus on the Average Precision for IoU=0.50:0.95 (the first row, which averages the AP over IoU thresholds from 0.50 to 0.95): it reaches 0.843 for bounding boxes and 0.734 for masks (quite good already!).
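To make the IoU thresholds concrete, here is a small worked example of the IoU between two boxes in the `[xmin, ymin, xmax, ymax]` format used by the dataset. The helper `box_iou` and the coordinates are hypothetical, written only for this illustration; the actual evaluation is done by the COCO tools.

```python
def box_iou(a, b):
    """IoU of two boxes given as [xmin, ymin, xmax, ymax]."""
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


gt = [10, 10, 50, 90]    # made-up ground-truth box
pred = [20, 10, 60, 90]  # made-up prediction, shifted 10 pixels to the right

iou = box_iou(gt, pred)
print(f"IoU = {iou:.2f}")  # IoU = 0.60
```

With an IoU of 0.60, this prediction would count as a correct detection at the 0.50 threshold but as a miss at the 0.75 threshold, which is why the averaged metric rewards precise localization.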

## 6. Train Mask-RCNN on the Penn-Fudan dataset

We can change the classifier of the Mask-RCNN to train it with this dataset and compare the results:

In [None]:
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2

# load an instance segmentation model pre-trained on COCO
model2 = torchvision.models.detection.maskrcnn_resnet50_fpn_v2(weights='DEFAULT', progress=True)

# get the number of input features for the classifier
in_features = model2.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model2.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# now get the number of input features for the mask classifier
in_features_mask = model2.roi_heads.mask_predictor.conv5_mask.in_channels
hidden_layer = 256
# and replace the mask predictor with a new one
model2.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                    hidden_layer,
                                                    num_classes)

model2.to(device)


In [None]:
# construct an optimizer
params = [p for p in model2.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)

# and a learning rate scheduler which decreases the learning rate by
# 10x every 3 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=3,
                                               gamma=0.1)

In [None]:
# let's train it for 20 epochs
from torch.optim.lr_scheduler import StepLR
num_epochs = 20

for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model2, optimizer, data_loader, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
torch.save(model2, 'model2_train_test.pth')

In [None]:
# evaluate on the test dataset
#load the model trained in Kaggle
model2 = torch.load("/kaggle/input/model2/model2.pth")
#evaluate(model2, data_loader_test, device=device)


Now we can compare the test results with those of the pre-trained model. As we have a small dataset with little variability, this short training helps achieve slightly better results.
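One detail of the training setup above worth checking is the learning-rate schedule. `StepLR` multiplies the learning rate by `gamma` every `step_size` epochs, so the rate used in each epoch follows a simple closed form. The sketch below evaluates that formula in plain Python with the values from the training cells, assuming the scheduler is stepped once per epoch as in the loop above:

```python
base_lr, step_size, gamma, num_epochs = 0.005, 3, 0.1, 20

# Learning rate in force during each epoch: base_lr * gamma^(epoch // step_size).
lrs = [base_lr * gamma ** (epoch // step_size) for epoch in range(num_epochs)]

for epoch in (0, 3, 6, 18):
    print(f"epoch {epoch}: lr = {lrs[epoch]:.1e}")
```

With these settings the rate is divided by 10 six times over the 20 epochs and ends at roughly 5e-9, so the last epochs change the weights very little. A larger `step_size` is one option to consider if longer training is meant to keep improving the model.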

In [None]:
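(img,target) = dataset_test[6]
imgs = img.unsqueeze(0).to(device) # add a batch dimension
# the model must be in evaluation mode to return predictions instead of losses
model2.eval()
with torch.no_grad():
    outs = model2(imgs)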

fig = make_subplots(rows=1, cols=2, subplot_titles=("Prediction (all scores)", "Prediction (scores>0.8)"))
fig.add_trace(go.Image(z=draw_segmentation_map(img, outs[0], score_thres=0.0)*255), 1, 1)
fig.add_trace(go.Image(z=draw_segmentation_map(img, outs[0], score_thres=0.8)*255), 1, 2)

Congratulations, you have finished the lab session!


## 7. 1st PIVA Person Segmentation Challenge (PPSC-2023)

In this lab, we are going to ask you to enter the **1st PIVA Person Segmentation Challenge (PPSC-2023)**. We are going to provide you with a [dataset](https://drive.google.com/file/d/1zChno9LQBV8rpxRTOes4W6Kg156KsCTY/view?usp=sharing) of images (with groundtruth) and a set of test images (without groundtruth). You will have to generate a model and send the test results to the challenge in order to evaluate your performance. 

You will have to submit:

1. A report as a PDF file (no more than 3 pages) with an explanation of what you did to obtain, train and test the model (see further comments).
2. A .zip file with the predictions (as images, see further comments) of all test images.
3. A notebook python file with your model, training, pre- and post- processing code.
4. Submit all files (separately, not in another zip file) in the submission task in Atenea.

The rules of the competition are:

1. You can use any architecture you want. You can use the one in this lab session, a modified version of it or something different. But always explain in the report your selection.
2. Explain any training, fine-tuning or hyper-parameter search you do to your selected model.
3. Explain all datasets used by your model (with the splits performed).
4. Explain any pre-processing/post-processing you do to the images/predictions.
5. Explain the losses you use to train or fine-tune your system.
6. As most of the images in the challenge dataset contain only a single person, the metric to evaluate the results will be the mean of the global IoU of all images in the test split. Comment in your report on the advantages/disadvantages of using this metric in the test evaluation.
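As an illustration of this metric, the following sketch computes the global IoU of one image and the mean over a tiny made-up test set. The function name `global_iou` and the toy arrays are hypothetical, and returning 1.0 when both masks are empty is an assumption of this sketch, not a rule stated by the challenge.

```python
import numpy as np


def global_iou(gt, pred):
    """IoU between all person pixels of the ground truth and of the prediction."""
    gtb = gt > 0      # any instance id counts as person
    predb = pred > 0  # any non-zero value counts as person
    union = np.logical_or(gtb, predb).sum()
    if union == 0:
        return 1.0    # assumption: two empty masks are treated as a perfect match
    return float(np.logical_and(gtb, predb).sum() / union)


# Ground truth with two instances (ids 1 and 2) on a 4x4 image.
gt = np.zeros((4, 4), dtype=np.uint8)
gt[:2, :2] = 1
gt[2:, 2:] = 2

# First prediction covers the two top rows; second prediction is perfect.
pred_a = np.zeros((4, 4), dtype=np.uint8)
pred_a[:2, :] = 255
pred_b = (gt > 0).astype(np.uint8) * 255

ious = [global_iou(gt, pred_a), global_iou(gt, pred_b)]
mean_iou = sum(ious) / len(ious)
print(f"IoUs = {ious[0]:.3f}, {ious[1]:.3f}; mIoU = {mean_iou:.3f}")
```

In the first case the intersection has 4 pixels and the union 12, giving an IoU of 1/3. Because instance ids are merged before the comparison, the metric does not penalize a prediction that fuses two people into one region.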

The prediction format should be:

1. A zip file with a number of PNG files, one for each test image (same filename as the test images but with `.png` extension).
2. The PNG images must be grayscale images with a value of `0` for the background and `>0` for person masks. As IoU is employed as the metric, the prediction should contain only objects of the class person, and all pixels `>0` will be considered person mask.
3. There is example code to generate a fake submission `.zip` with a made-up segmentation.
4. There is also example code to check your submission format. It is **mandatory** to use it. Make sure that your zip file passes it without errors or warnings before your submission.
5. After all submissions are received, a leaderboard will be published.

Good luck!


The model above (model2), trained for 20 epochs on the Persons dataset, is the one we used for this study, so we have not copied it again below.

In [None]:
# Make sure to change this for the path of the "dataset_person/test" folder and your group/team name
ZIPFILE = 'G12_E12'
TEST_FOLDER = "/kaggle/input/personespiva/dataset_person/test/"

In [None]:
# Let's create the predictions for the test images with our model
from PIL import Image
import shutil
import random
import os
import plotly.graph_objects as go

import torch
from torchvision import transforms
from PIL import Image

# Use a seed to control random generation
r = random.Random(5)

# Create the directory
out_path= os.path.join('/kaggle/working/','MASKS_PREDICTION')
os.makedirs(out_path, exist_ok=True)

score_threshold = 0.75

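# model2 is the fine-tuned model described above; it must be in
# evaluation mode to return predictions
model2.eval()

for f in open(os.path.join(TEST_FOLDER,'test_names.txt'),'r'):
    b = f.strip()
    # read test image (force 3 channels, as the model expects RGB input)
    img = Image.open(os.path.join(TEST_FOLDER,'images',b)+'.jpg').convert('RGB')
    
    transform = transforms.ToTensor()
    img = transform(img)
    
    # add a batch dimension and run inference without tracking gradients
    img = img.unsqueeze(0).to(device)
    with torch.no_grad():
        outs = model2(img)
    
    scores = outs[0]["scores"].cpu().numpy()
    mask = outs[0]["masks"].cpu().numpy()
    
    wdth, hght = mask.shape[-1], mask.shape[-2]
    
    # accumulate the soft masks of all detections above the score threshold
    m = np.zeros((hght, wdth), dtype=np.float32)
    
    for i in range(len(mask)):
        if scores[i] > score_threshold:
            m = m + mask[i]
            
    preds = m.squeeze()
    min_p, max_p = preds.min(), preds.max()
    
    # rescale to 0..255; an image without detections stays all background
    # (this also avoids a division by zero)
    if max_p > min_p:
        preds = ((preds - min_p) / (max_p - min_p) * 255).astype(np.uint8)
    else:
        preds = np.zeros(preds.shape, dtype=np.uint8)
    pred = Image.fromarray(preds, mode='L') 
    
    pred.save(os.path.join(str(out_path),b)+'.png')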
# zip all predictions
shutil.make_archive(ZIPFILE, 'zip', out_path)

In [None]:
# Sample script to compute mean IoU of predictions in the TEST. MANDATORY to check this on your file before the submission
TEST_FOLDER2 = "/kaggle/working/"
# unzip the file into a tmp folder
os.makedirs(os.path.join(TEST_FOLDER2,'tmp3'), exist_ok=True)  # do not fail if the cell is re-run
shutil.unpack_archive(ZIPFILE + '.zip',os.path.join(TEST_FOLDER2,'tmp3'))

# Use a seed to control random generation
r = random.Random(7)

mIoU = 0.0

for cnt,f in enumerate(open(os.path.join(TEST_FOLDER,'test_names.txt'),'r')):

    b = f.strip()

    # read test image (we do not need it, but here we use it to get the size of the GT)
    im = Image.open(os.path.join(TEST_FOLDER,'images',b)+'.jpg')

    # read GT image (as a test we can generate the fake predictions)
    #gt = np.asarray(Image.open(os.path.join(TEST_FOLDER,'masks',b)+'.png'))
    gt = np.zeros((im.height,im.width),np.uint8)
    for o in range(r.randint(1,3)):
        sx = r.randint(im.width//8,im.width//2)
        x = r.randint(0,im.width-sx)
        sy = r.randint(im.height//8,im.height//2)
        y = r.randint(0,im.height-sy)
        gt[y:y+sy,x:x+sx] = o+1
    gtb = (gt>0) # we'll consider a global IoU for all objects

    # read prediction image
    pred = np.asarray(Image.open(os.path.join(TEST_FOLDER2,'tmp3',b)+'.png'))
    predb = (pred>0) # we'll consider a global IoU for all objects
    
    # Compute the global IoU in the image
    overlap = gtb * predb
    union   = gtb + predb    
    IoU = overlap.sum()/union.sum()
        
    mIoU += IoU

# Report results
# mIoU = mIoU / (cnt+1)
# print(f"mIoU = {mIoU:0.4f}")