# Generate a COCO-style Dataset for the Supervisely Persons Dataset

A COCO-style dataset is required to train Facebook's Mask R-CNN implementation. This notebook explores and tests an approach for converting the Supervisely Persons dataset to COCO format using the [pycococreatortools](https://github.com/waspinator/pycococreator/) package.

A standalone script `supervisely_to_coco.py` will be created based on this notebook.

## Import Libraries

In [1]:
import sys
import datetime
import json
import os
import re
import random
import fnmatch
import math
import zlib
import io
import base64

from PIL import Image
import numpy as np
from pycococreatortools import pycococreatortools
import cv2 as cv

## 1. Define Dataset Metadata

In [2]:
INFO = {
    "description": "Supervisely Persons Dataset",
    "url": "https://supervise.ly/",
    "version": "1.0",
    "year": 2018,
    "contributor": "Supervisely",
    "date_created": datetime.datetime.utcnow().isoformat(' ')
}

LICENSES = [
    {
        "id": 1,
        "name": "",
        "url": ""
    }
]

CATEGORIES = [
    {
        'id': 1,
        'name': 'person',
        'supercategory': 'object',
    }
]
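
For orientation, these three blocks become top-level keys of the final annotation file, alongside an `images` list and an `annotations` list. The sketch below builds that skeleton and adds one image and one annotation. The helper name `empty_coco` and all field values are hypothetical placeholders; only the field names follow the COCO instance format.

```python
import json

def empty_coco(info, licenses, categories):
    """Return a COCO-style skeleton with no images or annotations yet."""
    return {
        "info": info,
        "licenses": licenses,
        "categories": categories,
        "images": [],
        "annotations": [],
    }

coco = empty_coco(
    {"description": "demo"},
    [{"id": 1, "name": "", "url": ""}],
    [{"id": 1, "name": "person", "supercategory": "object"}],
)

# Placeholder entries (hypothetical values) showing the core COCO fields.
coco["images"].append(
    {"id": 0, "file_name": "ds1_example.jpg", "width": 4, "height": 3}
)
coco["annotations"].append({
    "id": 1,
    "image_id": 0,         # Links the annotation to the image above
    "category_id": 1,      # Matches the 'person' category id
    "iscrowd": 0,
    "area": 2,
    "bbox": [1, 1, 2, 1],  # [x, y, width, height]
    "segmentation": [[1, 1, 3, 1, 3, 2, 1, 2]],  # Polygon as flat x, y pairs
})

print(json.dumps(coco, indent=2))
```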

## 2. Required Inputs

To generate a COCO-style dataset, we require:
    
- The path to the parent directory of the Supervisely Persons dataset (this is where the COCO-style dataset will be saved).
- The name of the directory containing the Supervisely Persons dataset (in the same structure as downloaded).
- A plain text file containing the `<subset>/<filename>` of each example (generated by `create_supervisely_examples_list.py`).
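
The script `create_supervisely_examples_list.py` is not shown in this notebook. As a rough sketch of what such a list could look like, the hypothetical helper `list_examples` below scans each `<subset>/ann/` directory for `.json` files and emits one `<subset>/<filename>` entry per annotation. It assumes the `<subset>/ann/<filename>.json` layout used later in this notebook and is not the actual script.

```python
import os
import tempfile

def list_examples(dataset_path):
    """Collect '<subset>/<filename>' entries from every '<subset>/ann' directory."""
    examples = []
    for subset in sorted(os.listdir(dataset_path)):
        ann_dir = os.path.join(dataset_path, subset, 'ann')
        if not os.path.isdir(ann_dir):
            continue  # Skip plain files and directories without annotations
        for ann_file in sorted(os.listdir(ann_dir)):
            name, ext = os.path.splitext(ann_file)
            if ext == '.json':
                examples.append(f'{subset}/{name}')
    return examples

# Demo on a throwaway directory tree (hypothetical subset and file names).
with tempfile.TemporaryDirectory() as root:
    for subset, name in [('ds1', 'a'), ('ds1', 'b'), ('ds2', 'c')]:
        ann_dir = os.path.join(root, subset, 'ann')
        os.makedirs(ann_dir, exist_ok=True)
        with open(os.path.join(ann_dir, name + '.json'), 'w') as f:
            f.write('{}')
    demo_examples = list_examples(root)

print(demo_examples)
```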

In [3]:
parent_dir = '/media/adam/HDD Storage/Datasets'
dataset_dir = 'supervisely-persons' # Relative to parent_dir
example_file = 'trainval.txt' # Relative to dataset directory

## 3. Create Train and Test Sets

Shuffle the examples with a fixed seed, then split them 70%/30% into training and test sets.

In [4]:
# Read in list of examples
with open(os.path.join(parent_dir, dataset_dir, example_file)) as f:
    examples = [x.strip() for x in f.readlines()]

# Split examples into train and test sets
random.seed(1)
random.shuffle(examples)
num_train_examples = math.floor(0.7 * len(examples))
num_test_examples = len(examples) - num_train_examples
train_examples = examples[:num_train_examples]
test_examples = examples[num_train_examples:]
print(f'{num_train_examples} train images. {num_test_examples} test images.')

3997 train images. 1714 test images.
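
The same split can be wrapped in a reusable function. This is a minimal sketch: the name `split_examples` and its parameters are hypothetical, and it uses a local `random.Random` instance so the global random state is left untouched. The placeholder list has 5711 entries, which is the total implied by the counts above.

```python
import math
import random

def split_examples(examples, train_fraction=0.7, seed=1):
    """Shuffle a copy of `examples` deterministically and split it in two."""
    shuffled = list(examples)              # Leave the caller's list unchanged
    random.Random(seed).shuffle(shuffled)  # Local RNG, reproducible for a given seed
    num_train = math.floor(train_fraction * len(shuffled))
    return shuffled[:num_train], shuffled[num_train:]

# Placeholder example names, same total count as in this notebook.
all_examples = [f'ds1/img_{i}' for i in range(5711)]
train, test = split_examples(all_examples)
print(f'{len(train)} train images. {len(test)} test images.')
```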


## 4. Create Output Directories for the COCO-style Dataset

In [5]:
output_dir = os.path.join(parent_dir, dataset_dir + '-coco')
# exist_ok=True lets this cell be re-run without raising FileExistsError
for subdir in ('annotations', 'image-train', 'image-test'):
    os.makedirs(os.path.join(output_dir, subdir), exist_ok=True)

## 5. Generate a COCO-style Dataset

The helper functions below convert Supervisely bitmap masks to and from their base64 encoding. `generate_coco_dataset` uses them to build both the train and test sets.

In [6]:
def base64_2_mask(s):
    """
    Convert from a base64 encoded string to numpy mask.
    
    Provided by Supervisely.
    """
    z = zlib.decompress(base64.b64decode(s))
    n = np.frombuffer(z, np.uint8)
    mask = cv.imdecode(n, cv.IMREAD_UNCHANGED)[:, :, 3].astype(bool)
    
    return mask

def mask_2_base64(mask):
    """
    Convert from a numpy mask to a base64 encoded string.
    
    Provided by Supervisely.
    """
    img_pil = Image.fromarray(np.array(mask, dtype=np.uint8))
    img_pil.putpalette([0,0,0,255,255,255])
    bytes_io = io.BytesIO()
    img_pil.save(bytes_io, format='PNG', transparency=0, optimize=0)
    png_bytes = bytes_io.getvalue()  # Named to avoid shadowing the built-in `bytes`
    
    return base64.b64encode(zlib.compress(png_bytes)).decode('utf-8')
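
Both helpers wrap PNG bytes in the same two layers: zlib compression, then base64 text encoding. The standard-library sketch below isolates those two layers, using placeholder bytes instead of a real PNG. The names `wrap` and `unwrap` are hypothetical.

```python
import base64
import zlib

def wrap(raw):
    """Compress raw bytes with zlib, then encode them as a base64 string."""
    return base64.b64encode(zlib.compress(raw)).decode('utf-8')

def unwrap(s):
    """Reverse of wrap: base64-decode, then zlib-decompress."""
    return zlib.decompress(base64.b64decode(s))

payload = b'placeholder bytes standing in for PNG data'
encoded = wrap(payload)
decoded = unwrap(encoded)
print(type(encoded).__name__, decoded == payload)
```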

In [7]:
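def generate_coco_dataset(info, licenses, categories, examples, dataset_path, output_path, train=True):
    """
    Build a COCO-style dictionary from a list of Supervisely examples.

    Each example is a `<subset>/<filename>` string. Only `person_bmp`
    instances are converted, and an image is copied to the output
    directory only if it yields at least one annotation.
    """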
    
    # Define COCO output template
    coco_output = {
        "info": info,
        "licenses": licenses,
        "categories": categories,
        "images": [],
        "annotations": []
    }
    
    # Add examples to the COCO output
    instance_id = 1
    for example_id, example in enumerate(examples):
        mask_found = False
        subset, filename = example.split('/')
        
        annotation_file_path = os.path.join(dataset_path, subset, 'ann', filename + '.json')
        with open(annotation_file_path) as f:
            annotations = json.load(f)

        image_height = annotations['size']['height']
        image_width = annotations['size']['width']
        
        for instance in annotations['objects']:
            if instance['classTitle'] == 'person_bmp': # Only BMP masks are compatible with pycococreatortools

                # Create whole image mask
                mask = base64_2_mask(instance['bitmap']['data'])
                mask_origin = instance['bitmap']['origin']
                mask_height, mask_width = mask.shape

                left_pad = np.zeros((mask_height, mask_origin[0]))
                right_pad = np.zeros((mask_height, image_width - mask_width - mask_origin[0]))
                top_pad = np.zeros((mask_origin[1], image_width))
                bottom_pad = np.zeros((image_height - mask_height - mask_origin[1], image_width))

                mask = np.hstack((left_pad, mask))
                mask = np.hstack((mask, right_pad))
                mask = np.vstack((top_pad, mask))
                mask = np.vstack((mask, bottom_pad))

                # Specify instance as a person
                category_info = {'id': 1, 'is_crowd': 0}
                
                annotation_info = pycococreatortools.create_annotation_info(
                    instance_id,
                    example_id,
                    category_info,
                    mask,
                    (image_width, image_height),
                    tolerance=0
                )

                if annotation_info is not None:
                    coco_output['annotations'].append(annotation_info)
                    instance_id += 1
                    mask_found = True

        if mask_found:
            try:
                image = Image.open(os.path.join(dataset_path, subset, 'img', filename + '.png'))
            except FileNotFoundError:  # Fall back to JPEG when no PNG exists
                image = Image.open(os.path.join(dataset_path, subset, 'img', filename + '.jpg'))

            # Save image as JPEG with name <subset>_<filename> to the image-train or image-test directory
            if train:
                save_path = os.path.join(output_path, 'image-train', subset + '_' + filename + '.jpg')
            else:
                save_path = os.path.join(output_path, 'image-test', subset + '_' + filename + '.jpg')
            image.save(save_path, format="JPEG")

            # Create image info
            image_info = pycococreatortools.create_image_info(
                example_id,
                subset + '_' + filename + '.jpg',
                image.size
            )

            coco_output['images'].append(image_info)
        
        if (example_id % 100) == 0:
            print(f'On example {example_id} of {len(examples)}.')
            
    return coco_output
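
The padding step in `generate_coco_dataset` builds the full-image mask from four `np.zeros` blocks joined with `hstack` and `vstack`. An alternative is to allocate an empty canvas and copy the cropped mask into it with a slice. The sketch below uses a hypothetical helper, `place_mask`, and assumes the origin is given as `(x, y)` of the crop's top-left corner, as the padding code implies. It returns a `uint8` array instead of floats, which assumes downstream code only needs 0/1 values.

```python
import numpy as np

def place_mask(mask, origin, image_size):
    """Copy a cropped mask into a full-size canvas.

    origin is (x, y) of the crop's top-left corner;
    image_size is (width, height) of the full image.
    """
    x, y = origin
    width, height = image_size
    mask_height, mask_width = mask.shape
    full = np.zeros((height, width), dtype=np.uint8)
    full[y:y + mask_height, x:x + mask_width] = mask
    return full

crop = np.array([[1, 0], [1, 1]], dtype=bool)
full = place_mask(crop, origin=(1, 2), image_size=(4, 5))

# The same mask built with the hstack/vstack padding used in the notebook.
row = np.hstack((np.zeros((2, 1)), crop, np.zeros((2, 1))))
padded = np.vstack((np.zeros((2, 4)), row, np.zeros((1, 4))))

print(full)
print(np.array_equal(full, padded))
```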

In [8]:
dataset_path = os.path.join(parent_dir, dataset_dir)

coco_train_output = generate_coco_dataset(INFO, LICENSES, CATEGORIES,
                                          train_examples,
                                          dataset_path, output_dir,
                                          train=True)

with open(os.path.join(output_dir, 'annotations', 'instances_supervisely_train.json'), 'w') as f:
    json.dump(coco_train_output, f)

On example 0 of 3997.
On example 100 of 3997.
On example 200 of 3997.
On example 300 of 3997.
On example 400 of 3997.
On example 500 of 3997.
On example 600 of 3997.
On example 700 of 3997.
On example 800 of 3997.
On example 900 of 3997.
On example 1000 of 3997.
On example 1100 of 3997.
On example 1200 of 3997.
On example 1300 of 3997.
On example 1400 of 3997.
On example 1500 of 3997.
On example 1600 of 3997.
On example 1700 of 3997.
On example 1800 of 3997.
On example 1900 of 3997.
On example 2000 of 3997.
On example 2100 of 3997.
On example 2200 of 3997.
On example 2300 of 3997.
On example 2400 of 3997.
On example 2500 of 3997.
On example 2600 of 3997.
On example 2700 of 3997.
On example 2800 of 3997.
On example 2900 of 3997.
On example 3000 of 3997.
On example 3100 of 3997.
On example 3200 of 3997.
On example 3300 of 3997.
On example 3400 of 3997.
On example 3500 of 3997.
On example 3600 of 3997.
On example 3700 of 3997.
On example 3800 of 3997.
On example 3900 of 3997.


In [9]:
coco_test_output = generate_coco_dataset(INFO, LICENSES, CATEGORIES,
                                         test_examples,
                                         dataset_path, output_dir,
                                         train=False)

with open(os.path.join(output_dir, 'annotations', 'instances_supervisely_test.json'), 'w') as f:
    json.dump(coco_test_output, f)

On example 0 of 1714.
On example 100 of 1714.
On example 200 of 1714.
On example 300 of 1714.
On example 400 of 1714.
On example 500 of 1714.
On example 600 of 1714.
On example 700 of 1714.
On example 800 of 1714.
On example 900 of 1714.
On example 1000 of 1714.
On example 1100 of 1714.
On example 1200 of 1714.
On example 1300 of 1714.
On example 1400 of 1714.
On example 1500 of 1714.
On example 1600 of 1714.
On example 1700 of 1714.


In [10]:
print(f"{len(coco_train_output['images'])} train images with {len(coco_train_output['annotations'])} annotations.")
print(f"{len(coco_test_output['images'])} test images with {len(coco_test_output['annotations'])} annotations.")

864 train images with 1101 annotations.
384 test images with 485 annotations.
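
The image counts are far lower than the 3997 and 1714 examples processed because, as `generate_coco_dataset` is written, an image is kept only when at least one `person_bmp` instance yields an annotation. A quick consistency check on the generated dictionaries can catch ID mismatches before training. The helper `check_coco` below is a hypothetical sketch that relies only on the `id` and `image_id` fields, and it is demonstrated on toy data rather than the real output.

```python
def check_coco(coco):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    image_ids = [image['id'] for image in coco['images']]
    annotation_ids = [ann['id'] for ann in coco['annotations']]
    known_images = set(image_ids)

    if len(known_images) != len(image_ids):
        problems.append('duplicate image ids')
    if len(set(annotation_ids)) != len(annotation_ids):
        problems.append('duplicate annotation ids')

    # Every annotation should reference an image that exists.
    orphans = [ann['id'] for ann in coco['annotations']
               if ann['image_id'] not in known_images]
    if orphans:
        problems.append(f'annotations pointing at missing images: {orphans}')

    # Every kept image should have at least one annotation.
    annotated = {ann['image_id'] for ann in coco['annotations']}
    unannotated = sorted(known_images - annotated)
    if unannotated:
        problems.append(f'images without annotations: {unannotated}')

    return problems

# Toy data (hypothetical ids) showing a clean and a broken dictionary.
good = {
    'images': [{'id': 0}, {'id': 3}],
    'annotations': [{'id': 1, 'image_id': 0}, {'id': 2, 'image_id': 3}],
}
bad = {
    'images': [{'id': 0}, {'id': 3}],
    'annotations': [{'id': 1, 'image_id': 0}, {'id': 1, 'image_id': 7}],
}
print(check_coco(good))
print(check_coco(bad))
```

With the real outputs, the call would be `check_coco(coco_train_output)` and `check_coco(coco_test_output)`.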
