Note: this is a copied and slightly modified version of code from Google's TensorFlow Model Garden. The original can be found at https://github.com/tensorflow/models/tree/master/official/projects/waste_identification_ml/pre_processing

# Conversion of COCO annotation JSON file to TFRecords

Given a COCO-annotated JSON file, the goal is to convert it into the TFRecord files needed to train a Mask R-CNN model.
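
Before converting, it can help to check that the annotation file has the expected shape. The following is a minimal sketch, assuming the standard COCO top-level keys (`images`, `annotations`, `categories`); the function name `summarize_coco` and the tiny in-memory example are hypothetical and not part of the Model Garden code.

```python
import json

# Top-level keys that a COCO object-detection annotation file is expected to have.
REQUIRED_KEYS = ("images", "annotations", "categories")


def summarize_coco(coco: dict) -> dict:
    """Return basic counts, raising ValueError if a top-level key is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in coco]
    if missing:
        raise ValueError(f"missing COCO keys: {missing}")
    image_ids = {image["id"] for image in coco["images"]}
    # Annotations that point at an image id not listed under "images".
    orphans = [a["id"] for a in coco["annotations"] if a["image_id"] not in image_ids]
    # Annotations that carry a non-empty segmentation (needed for mask training).
    with_masks = sum(1 for a in coco["annotations"] if a.get("segmentation"))
    return {
        "num_images": len(coco["images"]),
        "num_annotations": len(coco["annotations"]),
        "num_categories": len(coco["categories"]),
        "orphan_annotation_ids": orphans,
        "annotations_with_masks": with_masks,
    }


# Hypothetical in-memory example; for a real file use json.load(open(path)).
example = json.loads("""
{
  "images": [{"id": 1, "file_name": "a.jpg", "width": 4, "height": 4}],
  "categories": [{"id": 1, "name": "plastic"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [0, 0, 2, 2],
     "area": 4, "iscrowd": 0, "segmentation": [[0, 0, 2, 0, 2, 2, 0, 2]]},
    {"id": 11, "image_id": 99, "category_id": 1, "bbox": [0, 0, 1, 1],
     "area": 1, "iscrowd": 0, "segmentation": []}
  ]
}
""")
summary = summarize_coco(example)
print(summary)
```

Annotation `11` is reported as an orphan because image id `99` is not listed under `images`.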

To accomplish this task, you will clone the TensorFlow Model Garden repo. The TensorFlow Model Garden is a repository with a number of different implementations of state-of-the-art (SOTA) models and modeling solutions for TensorFlow users.

This notebook is an end-to-end example. When you run it, it takes COCO-annotated JSON train and test files as input and converts them into TFRecord files. You can also write sharded TFRecord files when your training and validation data is large, so that the data is read from several smaller files instead of one very large one.
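
To illustrate what sharding means here, the sketch below distributes record indices across shards round-robin and builds shard file names. Both the round-robin assignment and the `prefix-00000-of-00100.tfrecord` naming pattern are assumptions about how the conversion script behaves, not something this notebook states; the helper names are hypothetical.

```python
from collections import Counter


def shard_filename(prefix: str, shard: int, num_shards: int) -> str:
    """Build a shard file name (assumed naming pattern, see the note above)."""
    return f"{prefix}-{shard:05d}-of-{num_shards:05d}.tfrecord"


def assign_shards(num_records: int, num_shards: int) -> list:
    """Assign record i to shard i % num_shards (round-robin, an assumption)."""
    return [i % num_shards for i in range(num_records)]


# 10 records spread over 4 shards: shard sizes differ by at most one record.
counts = Counter(assign_shards(10, 4))
print(dict(counts))
print(shard_filename("train", 0, 100))
```

With round-robin assignment every shard ends up with nearly the same number of records, which is why a large dataset can be split into many similarly sized files.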

**Note** - This example assumes that all data is stored on Google Drive, that outputs are written back to Google Drive, and that the script runs as a Google Colab notebook. None of this is required: adjust the paths if you work on a local workstation, a remote server, or any other storage, and the notebook also runs as a regular Jupyter notebook on a local machine. The path variables in the TFRecord cells below currently point to a local workstation and must be changed to match your setup.
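
One way to keep a single notebook usable in both environments is to pick the data root depending on where it runs. This is a minimal sketch: checking `sys.modules` for `google.colab` is a common heuristic rather than an official API, and the function names and the `"data"` default are hypothetical.

```python
import sys


def running_in_colab() -> bool:
    """Heuristic: the google.colab package is already loaded inside Colab kernels."""
    return "google.colab" in sys.modules


def data_root(in_colab: bool, local_root: str = "data") -> str:
    """Return the Drive mount path in Colab, otherwise a local directory."""
    return "/content/gdrive/My Drive" if in_colab else local_root


root = data_root(running_in_colab())
print(root)
```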

## Run the commands below to install dependencies and connect to your Google Drive

In [None]:
!pip install -q tf-nightly
!pip install -q tensorflow-addons

In [None]:
# import libraries
from google.colab import drive
import sys

In [None]:
# the "opencv-python-headless" version should match the installed "opencv-python" version
import pkg_resources
version_number = pkg_resources.get_distribution("opencv-python").version

!pip install -q opencv-python-headless==$version_number

In [None]:
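# connect to Google Drive
drive.mount('/content/gdrive')

# make an alias for the root path
# (a failing "!" shell command does not raise a Python exception,
# so the link is created in Python to make the try/except meaningful)
import os
try:
  os.symlink('/content/gdrive/My Drive/', '/mydrive')
  print('Successful')
except OSError as e:
  print(e)
  print('Not successful')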

## Clone TensorFlow Model Garden repository

In [None]:
# clone the Model Garden directory for Tensorflow where all the config files and scripts are located for this project.
# project folder name is - 'waste_identification_ml'
!git clone https://github.com/tensorflow/models.git

In [None]:
# Go to the model folder
%cd models

## Create TFRecord for training data

In [None]:
training_images_folder = '/home/rbe07/Documents/Google/zerowaste-f-final/splits_final_deblurred/train/data/'  #@param {type:"string"}
training_annotation_file = '/home/rbe07/Documents/Google/zerowaste-f-final/splits_final_deblurred/train/labels_material.json'  #@param {type:"string"}
output_folder = '/home/rbe07/Documents/Google/zerowaste-f-final/tf_data/train/material/'  #@param {type:"string"}
# NOTE: the three assignments below override the three above; keep only the set you need
training_images_folder = '/home/rbe07/Documents/Google/data/sequences/'
training_annotation_file = '/home/rbe07/Downloads/temp.json'
output_folder = '/home/rbe07/Documents/Google/data/tf_records/'  #@param {type:"string"}


import sys
import os

# run from the Model Garden checkout so that `python3 -m official...` can be resolved
os.chdir("/home/rbe07/Documents/Google/models")
# sys.path.append('/home/rbe07/Documents/Google/models')

# print(sys.path)

In [None]:
# run the script to convert your json file to TFRecord file
# --num_shards (how many TFRecord sharded files you want)

# sources = ['a2_oc_0.0', 'a2_oc_50.0', 'a2_oc_90.0', 'a2_oc_99.0']
sources = ['hard_50.0', 'hard_90.0', 'hard_99.0']

for source in sources:
  training_annotation_file = '/home/rbe07/Downloads/' + source + '.json'
  output_folder = os.path.join('/home/rbe07/Documents/Google/data', source) + "/"
  # training_annotation_file = "/home/rbe07/Documents/DEVA_rep/Tracking-Anything-with-DEVA/example/output/pred.json"
  os.makedirs(output_folder, exist_ok=True)

  !python3 -m official.vision.data.create_coco_tf_record \
    --logtostderr \
    --image_dir=$training_images_folder \
    --object_annotations_file=$training_annotation_file \
    --output_file_prefix=$output_folder \
    --num_shards=100 \
    --include_masks=True \
    --num_processes=0
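
Outside a notebook, where the `!` shell syntax is not available, the same invocation can be assembled as an argument list. This is a sketch only: the flags are the ones used in the cell above, while the function name `build_command` and the example paths are hypothetical.

```python
import sys


def build_command(image_dir, annotation_file, output_prefix,
                  num_shards=100, include_masks=True, num_processes=0):
    """Build the argument list for the conversion module used in this notebook."""
    return [
        sys.executable, "-m", "official.vision.data.create_coco_tf_record",
        "--logtostderr",
        f"--image_dir={image_dir}",
        f"--object_annotations_file={annotation_file}",
        f"--output_file_prefix={output_prefix}",
        f"--num_shards={num_shards}",
        f"--include_masks={include_masks}",
        f"--num_processes={num_processes}",
    ]


# Hypothetical paths, for illustration only.
cmd = build_command("images/", "labels.json", "out/train", num_shards=10)
print(" ".join(cmd))
```

The list can then be passed to `subprocess.run(cmd, check=True, cwd=...)`, with `cwd` set to the cloned `models` directory so that the `official` package can be found.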

## Create TFRecord for validation data

In [None]:
validation_data_folder = '/home/rbe07/Documents/Google/zerowaste-f-final/splits_final_deblurred/val/data/'  #@param {type:"string"}
validation_annotation_file = '/home/rbe07/Documents/Google/zerowaste-f-final/splits_final_deblurred/val/labels_material.json'  #@param {type:"string"}
output_folder = '/home/rbe07/Documents/Google/zerowaste-f-final/tf_data/val/material/'  #@param {type:"string"}

# NOTE: the three assignments below override the three above; keep only the set you need
validation_data_folder = '/home/rbe07/Documents/Google/data/sequences/hand_labeled_corrected'
validation_annotation_file = '/home/rbe07/Documents/Google/data/sequences/Labels/hand_labeled_corrected_labels.json'
output_folder = '/home/rbe07/Documents/Google/data/tf_records_test/'  #@param {type:"string"}

In [None]:
# run the script to convert your json file to TFRecord file
# --num_shards (how many TFRecord sharded files you want)
!python3 -m official.vision.data.create_coco_tf_record --logtostderr \
      --image_dir=$validation_data_folder \
      --object_annotations_file=$validation_annotation_file \
      --output_file_prefix=$output_folder \
      --num_shards=10 \
      --include_masks=True \
      --num_processes=0
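
After the conversion finishes, a quick sanity check is to list the shard files that share the output prefix and compare their number with `--num_shards`. The sketch below assumes the output files end in `.tfrecord`, which this notebook does not state; `list_shards` is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory instead of real output.

```python
import glob
import os
import tempfile


def list_shards(output_prefix: str) -> list:
    """Return the sorted files whose path starts with output_prefix and ends in .tfrecord."""
    return sorted(glob.glob(glob.escape(output_prefix) + "*.tfrecord"))


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "val")
    for i in range(3):
        open(f"{prefix}-{i:05d}-of-00003.tfrecord", "w").close()
    open(os.path.join(tmp, "notes.txt"), "w").close()  # not matched: wrong suffix
    found = [os.path.basename(path) for path in list_shards(prefix)]

print(found)
```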