# Starter kit

This notebook covers the basics: 1) how to handle the data, 2) good practices and pitfalls to avoid, and 3) how to generate the submission file for CodaLab.

In [None]:
%%capture
!pip install pycocotools
!pip install mxnet-cu110 autogluon.vision
!pip install -U gluoncv==0.10.3.post0

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import logging
import sys
import os
import json  # for dumping json serialized results
import zipfile  # for creating submission zip file
from pycocotools.coco import COCO
root = '../input/cowboyoutfits'
logger = logging.getLogger()
logger.addHandler(logging.StreamHandler(sys.stderr))

## Load the training data

In [None]:
coco = COCO(os.path.join(root, 'train.json'))

Dataset info and categories

In [None]:
print('Data info:')
coco.info()  # prints the dataset info itself and returns None
categories = {cat_info['name']:cat_info['id'] for cat_info in coco.loadCats(coco.getCatIds())}
print('Categories:', categories)

## Example training module using AutoGluon

In [None]:
from autogluon.vision import ObjectDetector

In [None]:
train = ObjectDetector.Dataset.from_coco(os.path.join(root, 'train.json'), root=os.path.join(root, 'images'))
train

In [None]:
train.show_images()

### Split train/valid data with caution

Since the distribution of categories is very imbalanced, we should split the data carefully by category to make sure we have enough samples of each category for evaluation.
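
One way to see the imbalance before splitting is to count how many images contain each category. The following is a minimal sketch in plain Python, assuming the annotation file follows the standard COCO layout (`categories` and `annotations` lists). The helper name `count_images_per_category` and the toy data are made up for illustration and are not part of the competition kit.

```python
from collections import Counter

def count_images_per_category(coco_dict):
    """Return {category name: number of distinct images containing it}."""
    id_to_name = {c['id']: c['name'] for c in coco_dict['categories']}
    # Use a set of (image_id, category_id) pairs so that an image with
    # several boxes of the same category is counted only once.
    pairs = {(a['image_id'], a['category_id']) for a in coco_dict['annotations']}
    counts = Counter(id_to_name[cat_id] for _, cat_id in pairs)
    return dict(counts)

# Toy COCO-style dictionary (made-up data, not the competition's).
toy = {
    'categories': [{'id': 1, 'name': 'belt'}, {'id': 2, 'name': 'jacket'}],
    'annotations': [
        {'image_id': 10, 'category_id': 2},
        {'image_id': 10, 'category_id': 2},
        {'image_id': 11, 'category_id': 2},
        {'image_id': 11, 'category_id': 1},
    ],
}
print(count_images_per_category(toy))
```

With the real data, the same function could be applied to the dictionary obtained by loading `train.json` with `json.load`. Categories with a count below `sample_n_per_cat` would need a smaller validation sample.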

In [None]:
# Randomly select 10 images for each category as validation data
sample_n_per_cat = 10
valid_ids = pd.Index([], dtype='int64')  # pd.Int64Index was removed in pandas 2.0
for cat_name in categories:
    # keep the images that contain at least one box of this category
    has_cat = train.apply(lambda x: any(y['class'] == cat_name for y in x['rois']), axis=1)
    df = train[has_cat].sample(sample_n_per_cat)
    valid_ids = valid_ids.append(df.index)
# an image can be drawn for several categories, so drop repeated ids
valid_ids = valid_ids.unique()
train_ids = train.index.drop(valid_ids)
train_data = train.loc[train_ids]
valid_data = train.loc[valid_ids]
print('train split:', len(train_data), 'valid split:', len(valid_data))

## Training starter code

We provide a basic training example using the autogluon.vision package with default settings. To achieve higher scores, there are several details you need to take care of:

- Imbalanced training samples: some categories, e.g. belt, have very few training samples. You can try methods like class-aware sampling to oversample the rare classes.
- Noisy annotations: the training data might contain noise in the annotations. There are many custom losses designed to handle this issue.
- Others?
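
The class-aware sampling idea from the list above can be sketched in plain Python: first pick a category uniformly at random, then pick an image that contains it, so rare categories are drawn as often as common ones. This is only an illustration under assumptions. The function name `class_aware_sample` and the toy data are hypothetical, and how the sampled indices are fed into training is left open.

```python
import random
from collections import defaultdict

def class_aware_sample(image_classes, k, seed=0):
    """Draw k image indices with class-aware sampling.

    image_classes: list with one set of category names per image.
    """
    rng = random.Random(seed)
    # Map each category to the indices of the images that contain it.
    by_class = defaultdict(list)
    for idx, classes in enumerate(image_classes):
        for c in classes:
            by_class[c].append(idx)
    class_names = sorted(by_class)
    picks = []
    for _ in range(k):
        c = rng.choice(class_names)            # every category is equally likely
        picks.append(rng.choice(by_class[c]))  # then any image of that category
    return picks

# Toy data: only the last image contains the rare 'belt' category.
toy_images = [{'jacket'}, {'jacket'}, {'jacket'}, {'jacket', 'belt'}]
picks = class_aware_sample(toy_images, k=1000)
print('share of draws that hit the belt image:', picks.count(3) / len(picks))
```

In this toy case the belt image is one of four images but is drawn in well over half of the samples. With a pandas-based dataset, the sampled positions could be used with `.iloc` to build an oversampled training frame.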

In [None]:
detector = ObjectDetector(verbosity=2).fit(train_data, valid_data, hyperparameters={'batch_size': 8, 'epochs': 3, 'transfer': 'ssd_512_resnet50_v1_coco'})

## Generate submission

Use `valid.csv` for the public-phase submission and `test.csv` for the final-phase submission. Note that you only have 3 chances to submit in the final phase, so be careful not to submit wrong results on the last day.

In [None]:
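from PIL import Image

def create_submission(df, detector, score_thresh=0.1):
    """Run the detector on every image in df and collect COCO-style predictions."""
    results = []
    for index, row in df.iterrows():
        img_id = row['id']
        file_name = row['file_name']
        # only the image size is needed, so close the file right away
        with Image.open(file_name) as img:
            width, height = img.size
        output = detector.predict(file_name)
        for _, p in output.iterrows():
            # keep only predictions above the score threshold
            if p['predict_score'] > score_thresh:
                roi = p['predict_rois']
                # Predicted coordinates are relative to the image size, so scale them to pixels.
                # Boxes are written as [xmin, ymin, xmax, ymax]; the standard COCO results
                # format is [x, y, width, height], so check which one the evaluator expects.
                pred = {'image_id': img_id,
                        'category_id': categories[p['predict_class']],
                        'bbox': [roi['xmin'] * width, roi['ymin'] * height, roi['xmax'] * width, roi['ymax'] * height],
                        'score': p['predict_score']}
                results.append(pred)
    return results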
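
The standard COCO results format stores boxes as `[x, y, width, height]` in pixels, while the function above writes pixel corners. If the evaluator expects the standard layout, the conversion could look like the sketch below. The helper name `corners_to_coco_xywh` is hypothetical, and the assumption that `predict_rois` holds coordinates normalized to the image size is taken from the scaling done in `create_submission`.

```python
def corners_to_coco_xywh(roi, width, height):
    """Convert a normalized corner box to pixel [x, y, w, h].

    roi: dict with 'xmin', 'ymin', 'xmax', 'ymax' in the range [0, 1].
    width, height: image size in pixels.
    """
    x = roi['xmin'] * width
    y = roi['ymin'] * height
    w = (roi['xmax'] - roi['xmin']) * width
    h = (roi['ymax'] - roi['ymin']) * height
    return [x, y, w, h]

# A box covering the lower middle of a 200x100 image.
box = corners_to_coco_xywh({'xmin': 0.25, 'ymin': 0.5, 'xmax': 0.75, 'ymax': 1.0}, 200, 100)
print(box)  # [50.0, 50.0, 100.0, 50.0]
```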

In [None]:
submission_df = pd.read_csv(os.path.join(root, 'valid.csv'))  # replace with test.csv on the last day
submission_df['file_name'] = submission_df.apply(lambda x: os.path.join(root, 'images', x['file_name']), axis=1)
submission = create_submission(submission_df, detector)

In [None]:
# Write the predictions to a JSON file and pack it into a zip archive
submission_name = '/kaggle/working/answer.json'
with open(submission_name, 'w') as f:
    json.dump(submission, f)
# the context manager closes the archive even if writing fails
with zipfile.ZipFile('/kaggle/working/sample_answer.zip', 'w') as zf:
    zf.write(submission_name, 'answer.json')
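
Because the number of final-phase submissions is limited, it may be worth checking the archive before uploading it. The sketch below reads `answer.json` back from a zip file and verifies that every prediction carries the four keys written by `create_submission`. The function name `check_submission_zip` is hypothetical, and the checks are only structural: they do not replace the competition's own validation.

```python
import json
import os
import tempfile
import zipfile

REQUIRED_KEYS = {'image_id', 'category_id', 'bbox', 'score'}

def check_submission_zip(zip_path, member='answer.json'):
    """Return the number of predictions after basic structural checks."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as f:
            preds = json.load(f)
    for p in preds:
        missing = REQUIRED_KEYS - set(p)
        if missing:
            raise ValueError(f'prediction is missing keys: {sorted(missing)}')
        if len(p['bbox']) != 4:
            raise ValueError('bbox must contain four numbers')
    return len(preds)

# Demo on a throwaway archive with one made-up prediction.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'sample_answer.zip')
    toy_preds = [{'image_id': 1, 'category_id': 2, 'bbox': [0, 0, 10, 10], 'score': 0.9}]
    with zipfile.ZipFile(demo_zip, 'w') as zf:
        zf.writestr('answer.json', json.dumps(toy_preds))
    n_checked = check_submission_zip(demo_zip)
print('predictions checked:', n_checked)
```

In the notebook, the same check could be pointed at `/kaggle/working/sample_answer.zip`.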

## Submit to the CodaLab competition to get the evaluation score

https://competitions.codalab.org/competitions/33573#participate-submit_results

You have to submit your solution file together with your submission file to be eligible for the awards!