# Make TFRecords from VOC XML & jpgs

THIS IS REDUNDANT with UnderstandingObjectDetectionExample.  
Use the other notebook for a full understanding.

This was taken from the ssd-dag repo.

## Prerequisites
### Input
1. You have JPEG images.
2. You have annotations in XML, following the Pascal VOC format standard.
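For orientation, here is a minimal sketch of reading one Pascal VOC annotation with the standard library. The sample XML, the file name and the function name `parse_voc_annotation` are made up for illustration. The real parsing is done inside voc_to_tfrecord_file and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation, written inline so the sketch is self-contained.
SAMPLE_VOC_XML = """<annotation verified="yes">
  <filename>example_0001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>person</name>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox><xmin>64</xmin><ymin>48</ymin><xmax>320</xmax><ymax>240</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc_annotation(xml_text):
    """Return filename, image size and object boxes from one VOC XML string."""
    root = ET.fromstring(xml_text)
    size = root.find('size')
    objects = []
    for obj in root.findall('object'):
        box = obj.find('bndbox')
        objects.append({
            'name': obj.findtext('name'),
            'xmin': int(box.findtext('xmin')),
            'ymin': int(box.findtext('ymin')),
            'xmax': int(box.findtext('xmax')),
            'ymax': int(box.findtext('ymax')),
        })
    return {
        'filename': root.findtext('filename'),
        # labelImg marks a reviewed annotation with verified="yes" on the root
        'verified': root.get('verified') == 'yes',
        'width': int(size.findtext('width')),
        'height': int(size.findtext('height')),
        'objects': objects,
    }


annotation = parse_voc_annotation(SAMPLE_VOC_XML)
print(annotation['filename'], annotation['width'], annotation['height'])
print(annotation['objects'])
```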

### Output
The TFRecord output directory (TFRECORD_DIR in the Globals below) needs to have 3 subdirectories:
- /train
- /val
- /test
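If the subdirectories do not exist yet, something like the following sketch would create them. The helper name `make_tfrecord_dirs` is hypothetical, and the demo uses a temporary directory rather than the real output location.

```python
import os
import tempfile


def make_tfrecord_dirs(tfrecords_dir):
    """Create train/val/test under tfrecords_dir and return their paths."""
    paths = []
    for split in ('train', 'val', 'test'):
        path = os.path.join(tfrecords_dir, split)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        paths.append(path)
    return paths


# Demo against a throwaway directory; pass your own TFRECORD_DIR instead.
demo_root = tempfile.mkdtemp()
created = make_tfrecord_dirs(demo_root)
print(sorted(os.listdir(demo_root)))
```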

This code leverages the standards and templates as much as possible.

## [OPTIONAL] Testing / Visualizing
This notebook also includes testing your tfrecord files by visualization.  Two methods:
- matplotlib
- (tensorflow/models)  object_detection.utils  (This is the preferred method)

Remember: you don't have a Linux desktop, so you can't use a GTK-based solution like OpenCV for the display.


In [None]:
import os
import sys
import re
import IPython.display as display
from PIL import Image
import contextlib2

import matplotlib.pyplot as plt
# from matplotlib.patches import Rectangle
import matplotlib.patches as patches
# This is needed since we cloned tensorflow/models under code.
# - if you don't know what this means,
#   look at the notebook TrainModel_Step1_Local;
#      that notebook sets up the project, which includes cloning
#      and compiling the tensorflow/models repo
#   we are using the utilities found in that repo

cwd = os.getcwd()
# this path is different in this project
models = os.path.abspath(os.path.join(cwd, '..', 'models/research/'))
slim = os.path.abspath(os.path.join(cwd, '..', 'models/research/slim'))
sys.path.append(models)
sys.path.append(slim)

import tensorflow as tf

from object_detection.utils import ops as utils_ops
from object_detection.utils.visualization_utils import STANDARD_COLORS
from object_detection.utils.visualization_utils import draw_bounding_box_on_image

# this is the standard feature dict
from example_utils import feature_obj_detect, get_all_tfrecords

In [None]:
from example_utils import voc_to_tfrecord_file

In [None]:
# tf.enable_eager_execution() only exists in TensorFlow 1.x
# in TF 2.x eager execution is forced on, so only enable it when it is off

print ('TensorFlow Version:', tf.__version__)
AUTOTUNE = tf.data.experimental.AUTOTUNE
if not tf.executing_eagerly():
    tf.compat.v1.enable_eager_execution()

## Globals

This has been simplified from the ssd-dag project; I am going back and forth between the two projects. Here there is no separate CODE and DATA: everything is together.

In [None]:
# Globals
# project directories
PROJECT = os.getcwd()
HSDATA = '/hsdata'

IMAGE_DIR_ROOT = os.path.join(HSDATA, "jpeg_images")
ANNOTATION_DIR_ROOT = (os.path.join(HSDATA, "annotation"))

LABEL_MAP_FILE = os.path.join(PROJECT, 'model', 'security_label_map.pbtxt')
TFRECORD_DIR = os.path.join(HSDATA, 'tfrecord')
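TRAINING_SPLIT_TUPLE = (69, 30, 1)  # split weights, sum to 100 (presumably train, val, test)
INCLUDE_CLASSES = 'all'
EXCLUDE_TRUNCATED = False  # no trailing comma: (False,) is a tuple and would be truthy
EXCLUDE_DIFFICULT = False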

SSD_PROJECT = os.path.abspath(os.path.join(cwd, '..', 'ssd-dag'))
SSD_TFRECORDS = os.path.join(SSD_PROJECT, 'code/tfrecords')

DATA_DATE = '20200530'
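To make the split tuple concrete, here is one way a percentage tuple such as (69, 30, 1) could be turned into train/val/test lists. This is only a sketch: `split_by_percent` and the image ids are hypothetical, and voc_to_tfrecord_file may split differently.

```python
import random


def split_by_percent(items, split_tuple, seed=0):
    """Shuffle items, then cut them into (train, val, test) by percentage."""
    train_pct, val_pct, test_pct = split_tuple
    total = train_pct + val_pct + test_pct
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n = len(shuffled)
    n_train = n * train_pct // total
    n_val = n * val_pct // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left goes to test
    return train, val, test


image_ids = ['img_{:04d}'.format(i) for i in range(1000)]
train, val, test = split_by_percent(image_ids, (69, 30, 1))
print(len(train), len(val), len(test))
```

Giving the remainder to the last split keeps every image in exactly one split even when the percentages do not divide the image count evenly.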

## Move snapshot -> /hsdata

In [None]:
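# NOTE: ANNOTATION_DIR and IMAGE_DIR are not set in the Globals cell above
# (only ANNOTATION_DIR_ROOT and IMAGE_DIR_ROOT are); define them before running
# count before, move the snapshot files, count after
! ls {ANNOTATION_DIR}/*.xml | wc
! ls {IMAGE_DIR}/*.jpg | wc
! mv snapshot/*.xml {ANNOTATION_DIR}
! mv snapshot/*.jpg {IMAGE_DIR}
! ls {ANNOTATION_DIR}/*.xml | wc
! ls {IMAGE_DIR}/*.jpg | wc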

## Retrieve snapshot tarball from S3

In [None]:
# what's available
! aws s3 ls s3://jmduff.security-system/training_data/ --profile=jmduff

In [None]:
# transfer to ~/Downloads/snapshot
snapshot_file = 'snapshot_{}_8100.tar.gz'.format(DATA_DATE)
! aws s3 cp s3://jmduff.security-system/training_data/{snapshot_file} ~/Downloads --profile=jmduff
! rm -rf ~/Downloads/snapshot
! tar -xvf ~/Downloads/{snapshot_file} -C ~/Downloads

In [None]:
# move to High Speed Data
! mv ~/Downloads/snapshot/*.xml {ANNOTATION_DIR}
! mv ~/Downloads/snapshot/*.jpg {IMAGE_DIR}

# Verify
! ls ~/Downloads/snapshot
! ls {ANNOTATION_DIR}| wc

# Generating .tfrecord files from XML annotations & jpeg images

## Fix Labels

If you get an error like:  
!!! label map error: 20190625_polySauce_spicyBag_1561494037 smallSauce  skipped

This is telling you that the image_id:  20190625_polySauce_spicyBag_1561494037  
has a class label:  smallSauce  
which is not defined in the label map.  (don't be fooled!  'smallSauce' is the problem, not polySauce in the filename)

If there are only a few, you can ignore them.   To fix the data locally:
1. Review the label map ($ cat code/cfa_prod_label_map.pbtxt); you'll see 7 == cfaSauce, 10 == polySauce.
2. Change any 'smallSauce' to one of the labels in the label map; we will choose cfaSauce.
3. Add a sed command that makes the replacement in the XML files.
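Before writing sed commands, it can help to list every label that is missing from the label map. The sketch below does that over in-memory XML strings. The function name `find_unknown_labels` and the two sample annotations are hypothetical, and the label map is a two-entry excerpt.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_unknown_labels(annotations, label_map):
    """Count object names that are not keys of label_map.

    annotations: dict of image_id -> VOC XML string.
    """
    unknown = Counter()
    for image_id, xml_text in annotations.items():
        root = ET.fromstring(xml_text)
        for obj in root.findall('object'):
            name = obj.findtext('name')
            # exact match on purpose: a trailing space is also a bad label
            if name not in label_map:
                unknown[name] += 1
    return unknown


label_map = {'cfaSauce': 7, 'polySauce': 10}  # excerpt, for illustration
annotations = {
    'image_a': '<annotation><object><name>smallSauce</name></object>'
               '<object><name>polySauce</name></object></annotation>',
    'image_b': '<annotation><object><name>polySauce </name></object></annotation>',
}
unknown = find_unknown_labels(annotations, label_map)
for name, count in unknown.items():
    print(repr(name), count)
```

Printing with repr() makes stray whitespace visible, which is the kind of problem several of the sed commands in the next cell clean up.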

In [None]:
# you need to fix the labels
# it's explained in the other notebook:  UnderstandingObjectDetectionExample

# this is the label_map as a dict
#    {'smallHotDrink': 2, 'nuggBox': 5, 'sandBag': 6, 'smallFry': 8, 
#     'largeFry': 9, 'cfaSauce': 7, 'mediumColdDrink': 3, 'sandBox': 4, 
#   'hand': 1, 'spicyBag': 11, 'polySauce': 10}

# you might have to replace some names due to inconsistencies in labeling and the map
os.chdir(ANNOTATION_DIR)
! pwd
! sed -i 's/motorcyclew/motorcycle/g' *.xml
! sed -i 's/bagw/bag/g' *.xml
! sed -i 's/personww/person/g' *.xml
! sed -i 's/stroller /stroller/g' *.xml
! sed -i 's/motorcycle\t /motorcycle/g' *.xml
! sed -i 's/ mail/mail/g' *.xml
! sed -i 's/umbrella /umbrella/g' *.xml
os.chdir(PROJECT)

## voc_to_tfrecord_file()
This program is in code/cfa_utils/example_utils.py    (Reminder: "example" here means tf.Example, not a sample program. Ugh!) 

This program leverages as much of the standard TensorFlow code as possible.  That means:
- annotations are based on the Pascal VOC data standard, which is an XML annotation format.   There are hundreds of program examples that use this data.
- the tf.Example(Feature) format is based on the format used in the MobileNet model.    I lifted it out of the /models code and placed it here where you can import it.  It is important that you have a consistent format through tfrecord generation, training, and prediction.   

I used a pattern where I imported the SSD Feature dictionary, copied it to a dict, then used that dict in the serialization.   You'll see that in the program.   The point is, the feature (dict) format is defined only once in one place.    (Look at the code.)  Odd side effect:   it seems that you must define every element of the dict.  If you don't, you'll get an error:  
--> 214             features = tf.train.Features(feature=feature)
    215 
    216             tf_example = tf.train.Example(features=features)

TypeError: MergeFrom() takes exactly one argument (3 given)
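One way to catch this before serialization is to check the copied dict for entries that were never filled in. The sketch below illustrates the idea in plain Python. `UNSET`, `FEATURE_TEMPLATE` and `unfilled_keys` are hypothetical names, the key list is only a subset, and the real template in example_utils.py holds TensorFlow parsing specs rather than a sentinel.

```python
# Sentinel standing in for "still the template value, not yet a real feature".
UNSET = object()

# Hypothetical, shortened template; the real one is feature_obj_detect.
FEATURE_TEMPLATE = {
    'image/encoded': UNSET,
    'image/height': UNSET,
    'image/width': UNSET,
    'image/object/class/text': UNSET,
    'image/object/class/label': UNSET,
}


def unfilled_keys(feature):
    """Return the keys whose value was never replaced after the copy."""
    return sorted(key for key, value in feature.items() if value is UNSET)


feature = dict(FEATURE_TEMPLATE)  # copy, so the template stays untouched
feature['image/encoded'] = b'jpeg bytes would go here'
feature['image/height'] = 480
feature['image/object/class/text'] = [b'person']
feature['image/object/class/label'] = [1]
# 'image/width' is deliberately left out to show the check firing

print(unfilled_keys(feature))
```

Running a check like this just before building tf.train.Features would name the forgotten key directly instead of surfacing as the MergeFrom() TypeError.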

This program will tell you if it skips images/annotations due to bad labels (explained above).

### Result:

This is telling you that 3149 annotations had the 'verified' XML attribute and 22 were not verified.   That is normal.  When you label (using labelImg for example), you can skip a questionable training image by simply not verifying it.

The dict also shows the label map, e.g. hand == class_id 1

  verified: 3149   not: 22
{'hand': 1, 'smallHotDrink': 2, 'mediumColdDrink': 3, 'sandBox': 4, 'nuggBox': 5, 'sandBag': 6, 'cfaSauce': 7, 'smallFry': 8, 'largeFry': 9, 'polySauce': 10, 'spicyBag': 11}

This is telling you 1889 images were written to the train.tfrecord file.  (not sharded)  
169 objects were class_id = 6 (sandBag)  
568 objects were class_id = 4 (sandBox)  
The per-class totals sum to at least 1889 because there may be multiple objects per image.

 -- images 1889  writing to: /home/ec2-user/SageMaker/ssd-dag/tmp/train/train.tfrecord
     image count: 1889   class_count: {6: 169, 9: 286, 5: 563, 11: 178, 2: 441, 4: 568, 8: 291, 1: 927, 3: 572, 10: 412, 7: 157}
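The relationship between image count and class counts can be checked with a few lines of Python. The numbers and the label map are copied from the output above; only the variable names are new.

```python
# Values copied from the sample output above.
image_count = 1889
class_count = {6: 169, 9: 286, 5: 563, 11: 178, 2: 441, 4: 568,
               8: 291, 1: 927, 3: 572, 10: 412, 7: 157}
label_map = {'hand': 1, 'smallHotDrink': 2, 'mediumColdDrink': 3,
             'sandBox': 4, 'nuggBox': 5, 'sandBag': 6, 'cfaSauce': 7,
             'smallFry': 8, 'largeFry': 9, 'polySauce': 10, 'spicyBag': 11}

# Invert the label map so class ids can be printed with their names.
id_to_name = {class_id: name for name, class_id in label_map.items()}

total_objects = sum(class_count.values())
print('objects:', total_objects, ' images:', image_count)
print('at least one object per image on average:', total_objects >= image_count)

for class_id in sorted(class_count):
    print(class_id, id_to_name[class_id], class_count[class_id])
```

For this run the total is 4564 objects across 1889 images.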
     
### file output
NOTE: depending on your Globals, these files were written to /tmp.    Write to tmp first, then promote to S3 if you want to use them.    Look at the training notebook to see where it pulls tfrecords from (hint: it won't be /tmp).

In [None]:
# clear out the (*.record-*-of-*) output directory
! rm {TFRECORD_DIR}/train/*.*
! rm {TFRECORD_DIR}/val/*.*
! rm {TFRECORD_DIR}/test/*.*

In [None]:
voc_to_tfrecord_file(IMAGE_DIR_ROOT,
                    ANNOTATION_DIR_ROOT,
                    LABEL_MAP_FILE,
                    TFRECORD_DIR,
                    TRAINING_SPLIT_TUPLE)


## [Optional] Test your TFRecords 
Select your source of tfrecords:
- data/tfrecords is the source used in training
- tmp is the source you just created

In [None]:
# TFRECORD_DIR = '/home/ec2-user/SageMaker/ssd-dag/data/tfrecords'
# TFRECORD_DIR = '/home/ec2-user/SageMaker/ssd-dag/tmp'
print (TFRECORD_DIR)
! ls {TFRECORD_DIR}/train

In [None]:
# This will read a list of files
# let's combine train, val & test


tfrecord_file_list_input = get_all_tfrecords([os.path.join(TFRECORD_DIR, 'train'),
                            os.path.join(TFRECORD_DIR, 'val'),
                            os.path.join(TFRECORD_DIR, 'test')])
print ("reading:", tfrecord_file_list_input)
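raw_dataset = tf.data.TFRecordDataset(tfrecord_file_list_input)
# cache() and shuffle() return new datasets, so the result must be reassigned
raw_dataset = raw_dataset.cache()  # cache to memory
raw_dataset = raw_dataset.shuffle(buffer_size=5000)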
print (type(raw_dataset))

### Reading the TFRecords
- The file is read into a dataset (TFRecordDatasetV1 to be exact)
- Iterating through the records:
  - each record is an EagerTensor (you must have Eager Execution enabled)
  - This tensor has a serialized tf.Example
      - byte string
      - get the value (byte string) with .numpy()
  - parse the serialized byte string into an Example
      - tf.Example is made of Features
          - feature[key] == each part of the observation or data point
          
So, use the next cell to verify that the features were mapped correctly.
      

In [None]:
# this iteration will show you:
# - each record
# - each tf.Example
# BUT - it is not a parsed tf.Example,  it looks readable, but it's not yet consumable
#   look at the next code block for that

# VERIFY your mapping using this loop

for raw_record in raw_dataset.take(1):
    print("raw record type:", type(raw_record))  # serialized Example
    print("Tensor.dtype:", raw_record.dtype)
    print("       value:", raw_record.numpy()[:50], '\n')
    
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())  # Parse will de-serialize it
    # review this to verify the features were mapped correctly
    print(type(example), '\n', example)


### Parsing each tf.Example record

This parses the serialized tf.Example into the feature (dict).   This is where we use that common feature definition to make sure the format is good.   feature_obj_detect is imported from:
code.cfa_utils.example_utils.py  

This isn't something I defined - I lifted it out of the code in tensorflow/models

In [None]:
print ("feature_obj_detect:", type(feature_obj_detect), '\n', feature_obj_detect)
def _parse_function(example_proto):
    # Parse the input using the standard dictionary
    feature = tf.io.parse_single_example(example_proto, feature_obj_detect)
    return feature

In [None]:
parsed_dataset = raw_dataset.map(_parse_function)
print (parsed_dataset)

## tensorflow/models/research/object_detection/utils/visualization_utils.py

THIS is the way to display - much easier!

In [None]:
# tensorflow/models/object_detection

for n,i in enumerate(parsed_dataset.take(10)):
    print ("record type:", type(i))
    print ("image/encoded type:", type(i['image/encoded']))
    image_tensor = i['image/encoded'].numpy()  # bytes
    print ("image/encoded EagerTensor.numpy():", type(image_tensor))
    print("is jpeg:", tf.io.is_jpeg(image_tensor))
    
    jpeg_decoded_tensor = tf.image.decode_jpeg(image_tensor)
    jpeg_numpy = jpeg_decoded_tensor.numpy()
    print ("tf.image.decode_jpeg(image_tensor):", jpeg_numpy.shape)
    
    # get height/width
    height = i['image/height'].numpy()
    width =  i['image/width'].numpy()
    
    # get object classes
    obj_class_names = i['image/object/class/text'].values.numpy()
    obj_class_ids = i['image/object/class/label'].values.numpy()
    obj_count = len(obj_class_ids)
    
    print (type(obj_class_names), obj_class_names)
    # get the bounding box coordinates
    xmins = i['image/object/bbox/xmin'].values.numpy()
    xmaxs = i['image/object/bbox/xmax'].values.numpy()
    ymins = i['image/object/bbox/ymin'].values.numpy()
    ymaxs = i['image/object/bbox/ymax'].values.numpy()
    print ('xmins:', type(xmins), xmins)
    xmins_pixel = xmins * width
    xmaxs_pixel = xmaxs * width
    ymins_pixel = ymins * height
    ymaxs_pixel = ymaxs * height
   
    pil_image = Image.fromarray(jpeg_numpy)    
    for idx in range(obj_count):
        draw_bounding_box_on_image(pil_image,ymins[idx],xmins[idx], ymaxs[idx], xmaxs[idx],
                                  color=STANDARD_COLORS[obj_class_ids[idx]], 
                                  thickness=4, display_str_list=[obj_class_names[idx]],
                                  use_normalized_coordinates=True)
        
    display.display(pil_image)
    print ()
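The loop above multiplies the normalized box coordinates by the image width and height to get pixel values, although draw_bounding_box_on_image is given the normalized values directly. As a standalone sketch of that conversion (the helper name `to_pixel_box` and the sample numbers are made up):

```python
def to_pixel_box(xmin, ymin, xmax, ymax, width, height):
    """Convert a box in normalized [0, 1] coordinates to integer pixels."""
    return (int(round(xmin * width)),
            int(round(ymin * height)),
            int(round(xmax * width)),
            int(round(ymax * height)))


# A box covering the region from 10% to 50% of a 640x480 image.
pixel_box = to_pixel_box(0.1, 0.1, 0.5, 0.5, width=640, height=480)
print(pixel_box)
```

Pixel coordinates like these are what matplotlib.patches.Rectangle would need if you draw the boxes with matplotlib instead.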

# Backup and Move to Training Locations

!!! Change DATA_DATE !!!

In [None]:
tfrecord_backup = os.path.join(PROJECT, DATA_DATE + "_tfrecords")
tfrecord_source = TFRECORD_DIR
os.chdir(PROJECT)  # "! cd" runs in a subshell and does not change the notebook's directory
! mkdir -p {tfrecord_backup}

In [None]:
! cp {tfrecord_source}/train/train.* {tfrecord_backup}
! cp {tfrecord_source}/val/val.* {tfrecord_backup}
! cp {tfrecord_source}/test/test.* {tfrecord_backup}

In [None]:
tarball_name = DATA_DATE + "_tfrecords.tar.gz"
# ! tar czvf $tarball_name $tfrecord_backup
! tar cf - $tfrecord_backup | pigz > $tarball_name

In [None]:
# backup tfrecords
! aws s3 cp $tarball_name s3://jmduff.security-system/tfrecords/ --profile=jmduff

### OPTIONAL Move to ssd-dag for additional  training

In [None]:
! tar cf - /hsdata/annotation/202001 | pigz > annotation_202001.tar.gz
! tar cf - /hsdata/annotation/202002 | pigz > annotation_202002.tar.gz
! tar cf - /hsdata/annotation/202003 | pigz > annotation_202003.tar.gz
! tar cf - /hsdata/annotation/202004 | pigz > annotation_202004.tar.gz
! tar cf - /hsdata/annotation/202005 | pigz > annotation_202005.tar.gz
! tar cf - /hsdata/annotation/202006 | pigz > annotation_202006.tar.gz
! tar cf - /hsdata/annotation/202007 | pigz > annotation_202007.tar.gz
! tar cf - /hsdata/annotation/202008 | pigz > annotation_202008.tar.gz
! tar cf - /hsdata/annotation/202009 | pigz > annotation_202009.tar.gz
! tar cf - /hsdata/annotation/202010 | pigz > annotation_202010.tar.gz
! tar cf - /hsdata/annotation/202011 | pigz > annotation_202011.tar.gz
! tar cf - /hsdata/annotation/202012 | pigz > annotation_202012.tar.gz

In [None]:
! tar cf - /hsdata/jpeg_images/202001 | pigz > jpeg_images_202001.tar.gz
! tar cf - /hsdata/jpeg_images/202002 | pigz > jpeg_images_202002.tar.gz
! tar cf - /hsdata/jpeg_images/202003 | pigz > jpeg_images_202003.tar.gz
! tar cf - /hsdata/jpeg_images/202004 | pigz > jpeg_images_202004.tar.gz
! tar cf - /hsdata/jpeg_images/202005 | pigz > jpeg_images_202005.tar.gz
! tar cf - /hsdata/jpeg_images/202006 | pigz > jpeg_images_202006.tar.gz
! tar cf - /hsdata/jpeg_images/202007 | pigz > jpeg_images_202007.tar.gz
! tar cf - /hsdata/jpeg_images/202008 | pigz > jpeg_images_202008.tar.gz
! tar cf - /hsdata/jpeg_images/202009 | pigz > jpeg_images_202009.tar.gz
! tar cf - /hsdata/jpeg_images/202010 | pigz > jpeg_images_202010.tar.gz
! tar cf - /hsdata/jpeg_images/202011 | pigz > jpeg_images_202011.tar.gz
! tar cf - /hsdata/jpeg_images/202012 | pigz > jpeg_images_202012.tar.gz

In [None]:
! aws s3 cp ./ s3://jmduff.security-system/sharded_training_data/ --profile=jmduff --recursive --exclude="*" --include="annotation_2020??.tar.gz"
! aws s3 cp ./ s3://jmduff.security-system/sharded_training_data/ --profile=jmduff --recursive --exclude="*" --include="jpeg_images_2020??.tar.gz"

## Copy tfrecords to ssd-usb0
The destination is a flat tfrecord directory:  
- no train/val/test subdirectories

Assumes the device is mounted.  
Use group permissions.

In [None]:
! rm /media/security/ssd-usb0/tfrecord/*.record-?????-of-?????
! cp {tfrecord_source}/train/train.* /media/security/ssd-usb0/tfrecord
! cp {tfrecord_source}/val/val.* /media/security/ssd-usb0/tfrecord
! cp {tfrecord_source}/test/test.* /media/security/ssd-usb0/tfrecord

In [None]:
! cp annotation_2020??.tar.gz /media/security/ssd-usb0
! cp jpeg_images_2020??.tar.gz /media/security/ssd-usb0