## Prepare FDDB Dataset

By: Alexander Comerford (alexanderjcomerford@gmail.com)

In this notebook we use the `tensorflow_datasets` package to build a `tf.data.Dataset` of images and face annotations from the [FDDB dataset](http://vis-www.cs.umass.edu/fddb/), which contains several thousand images.

### FDDB from UMass
![image](http://vis-www.cs.umass.edu/fddb/umasslogo.gif)

## Prepare environments and dependencies

Below we will import and prepare libraries and configuration via environment variables.

In [1]:
import os
import re
import io
import json
import glob
import enum
import errno
import tarfile
from pathlib import Path

import cv2
import numpy as np
from PIL import Image
from minio import Minio
from minio.error import ResponseError
import tensorflow_datasets.public_api as tfds

get_ipython().run_line_magic("env", "AWS_ACCESS_KEY_ID=self2face")
get_ipython().run_line_magic("env", "AWS_SECRET_ACCESS_KEY=self2face")
get_ipython().run_line_magic("env", "S3_ENDPOINT=minio.default.svc.cluster.local:9000")
get_ipython().run_line_magic("env", "AWS_ENDPOINT_URL=http://minio.default.svc.cluster.local:9000")
get_ipython().run_line_magic("env", "S3_USE_HTTPS=0")
get_ipython().run_line_magic("env", "S3_VERIFY_SSL=0")
get_ipython().run_line_magic("env", "BUCKET_NAME=test-data")

env: AWS_ACCESS_KEY_ID=self2face
env: AWS_SECRET_ACCESS_KEY=self2face
env: S3_ENDPOINT=minio.default.svc.cluster.local:9000
env: AWS_ENDPOINT_URL=http://minio.default.svc.cluster.local:9000
env: S3_USE_HTTPS=0
env: S3_VERIFY_SSL=0
env: BUCKET_NAME=test-data


In [2]:
############# SHIM, REMOVE FROM UPDATED DOCKERFILE
import sys
sys.path.append("/home/jovyan/data-vol-1/notebooks/")

In [None]:
import ipynb.fs
# Parentheses are required for an import list that spans several lines
from .defs.UtilityNotebooks.UtilityFunctions import (
    dotdict,
    camel_to_snake,
    search_path_by_url,
    write_file_to_filepath,
    latest_dir,
)

## FDDB specific functions

The cells below define utility functions specific to this notebook: string normalization, image reading, and a few small wrappers.

In [None]:
## Decode a bytes line as UTF-8 and strip surrounding whitespace
normalize_string_line = lambda line:str(line.decode('utf-8').strip())

In [None]:
## Use tarfile object to extract file body from tarfile
def get_file_from_tar(filepath, tarpath):
    '''Given a tarpath extract a file from tar'''
    return tarfile.open(tarpath).extractfile(filepath)
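
To see how `extractfile` and `getmember` behave without downloading the archives, the sketch below builds a tiny tar archive in memory and reads one member back. The member name and payload are made up for illustration, and `build_demo_tar` is a hypothetical helper, not part of the notebook; only standard-library calls are used.

```python
import io
import tarfile

def build_demo_tar(members):
    """Return an in-memory tar archive built from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer

# Hypothetical member name and payload, for illustration only
member_name = "2002/07/19/big/img_423.jpg"
demo_buffer = build_demo_tar({member_name: b"not-a-real-jpeg"})

with tarfile.open(fileobj=demo_buffer) as archive:
    # The member size is what an upload call needs as its length argument
    member_size = archive.getmember(member_name).size
    member_body = archive.extractfile(member_name).read()

print(member_size, member_body)
```

Opening the archive once and reusing the handle, as here, avoids rescanning the tar file for every member.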

In [None]:
## Transform python file object to PIL Image 
def file_to_image(fileobj):
    '''Given a file obj, attempt to create an Image'''
    nparr = np.frombuffer(fileobj.read(), np.uint8)
    img_np = cv2.cvtColor(cv2.imdecode(nparr, 1), cv2.COLOR_BGR2RGB)
    return Image.fromarray(img_np)

In [None]:
## Custom transformation logic to turn
## ellipse coordinates to a four coordinate
## definition of a box
def image_ellipse_to_box(image, major_axis_radius, minor_axis_radius, angle, center_x, center_y, detection_score):
    '''Given a PIL image and ellipse information, return a dict with bounding box information'''
    
    # PIL's Image.size is (width, height)
    imagew=image.size[0]
    imageh=image.size[1]
    box = dotdict(
        x=1.0*center_x/imagew,
        y=1.0*center_y/imageh,
        w=1.0*minor_axis_radius*2/imagew,
        h=1.0*major_axis_radius*2/imageh,
        category=0
    )
    
    if box.w>0 and box.h>0 and box.x-box.w/2>=0 and\
       box.y-box.h/2>=0 and box.x+box.w/2<=1 and box.y+box.h/2<=1:
        return box
    else:
        return False
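
The conversion can be checked by hand on a small case. The sketch below is a stand-alone variant that takes the image width and height directly and returns a plain `dict` (or `None`) instead of a `dotdict`; the function name and the 400x300 image are assumptions made for illustration. As in the function above, the ellipse angle is ignored, so the box is only an upright approximation of the face region.

```python
def ellipse_to_normalized_box(width, height, major_axis_radius, minor_axis_radius,
                              center_x, center_y):
    """Approximate an upright ellipse with a centre-format box scaled to [0, 1]."""
    box = {
        "x": center_x / width,
        "y": center_y / height,
        "w": 2.0 * minor_axis_radius / width,
        "h": 2.0 * major_axis_radius / height,
    }
    # Keep the box only if it lies fully inside the image
    inside = (
        box["w"] > 0 and box["h"] > 0
        and box["x"] - box["w"] / 2 >= 0 and box["y"] - box["h"] / 2 >= 0
        and box["x"] + box["w"] / 2 <= 1 and box["y"] + box["h"] / 2 <= 1
    )
    return box if inside else None

# Hypothetical 400x300 image with a face ellipse centred in the middle
box = ellipse_to_normalized_box(400, 300, major_axis_radius=60, minor_axis_radius=40,
                                center_x=200, center_y=150)
# Same ellipse pushed against the left edge, so the box spills outside the image
clipped = ellipse_to_normalized_box(400, 300, major_axis_radius=60, minor_axis_radius=40,
                                    center_x=10, center_y=150)
print(box)
print(clipped)
```

Boxes that cross the image border are dropped rather than clipped, which matches the filtering in the function above.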

## FDDBDataset

Below we define a `tfds.core.GeneratorBasedBuilder` that downloads the images and labels for the `FDDB` dataset, a collection of human faces "in the wild" in unconstrained positions, orientations, and locations. With this dataset we can train machine learning models to detect faces in generic images.

### Example images with added annotation
<img src="http://vis-www.cs.umass.edu/fddb/samples/2002_08_02_big_img_275.jpg" data-canonical-src="http://vis-www.cs.umass.edu/fddb/samples/2002_08_02_big_img_275.jpg" width="200" height="400" />

In [12]:
class FDDBDataset(tfds.core.GeneratorBasedBuilder):
    """FDDB dataset"""
    
    class FDDB_Parse_State(enum.Enum):
        '''States of fddb dataset parsing'''
        FILEPATH = 1
        NUMFACES = 2
        FACELOCATION = 3
        
    FDDB_BUCKET_NAME = "fddb"
        
    FDDB_DOWNLOAD_URLS={
        "images":"http://tamaraberg.com/faceDataset/originalPics.tar.gz",
        "annotations":"http://vis-www.cs.umass.edu/fddb/FDDB-folds.tgz"
    }

    VERSION = tfds.core.Version('0.0.0')
    
    def __init__(self, download_local, download_local_root="", log_every=10,
                 *args, **kwargs):
        super(FDDBDataset, self).__init__(*args, **kwargs)
        self.download_local = download_local
        
        if download_local: assert download_local_root, "Must provide download_local_root path"
        self.download_local_root = download_local_root
        
        self.minio_client = self.make_minio_client()
        self.data_count = 0
        self.log_every = log_every

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                "image": tfds.features.Image(),
                "bbox": tfds.features.BBoxFeature()
            }),
            supervised_keys=("image", "bbox")
        )

    def _split_generators(self, dl_manager):
        try:
            dl_paths = dl_manager.download(self.FDDB_DOWNLOAD_URLS)
        except tfds.download.download_manager.NonMatchingChecksumError:
            pass
        
        return self.extract_and_upload()
        
    def _generate_examples(self):
        # Yields examples from the dataset
        pass  # TODO        

    def make_minio_client(self, **kwargs):
        return Minio(os.environ["S3_ENDPOINT"],
                     access_key=os.environ["AWS_ACCESS_KEY_ID"],
                     secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
                     secure=False,
                     **kwargs)
    
    def upload_file_to_minio(self, bucket_name, objname, fileobj, size):
        '''Wrapper for uploading a file object to minio'''
        try:
            if not self.minio_client.bucket_exists(bucket_name):
                self.minio_client.make_bucket(bucket_name)
        except ResponseError:
            pass        
        
        return self.minio_client.put_object(bucket_name, objname, fileobj, size)
    
    def generate_fddb_json_annotations(self, annotations_file_content, images_tarfile, image_file_extension=".jpg"):
        '''Generator of json annotations for fddb dataset'''

        ## Define base dict, state variable, and first line
        base_json_annotation = dotdict()
        current_state = self.FDDB_Parse_State.FILEPATH
        line = normalize_string_line(annotations_file_content.readline())

        ## Iter line until empty file
        while line:

            if current_state == self.FDDB_Parse_State.FILEPATH:
                image = file_to_image(images_tarfile.extractfile(line + image_file_extension))
                base_json_annotation['file'] = line + image_file_extension
                # PIL's Image.size is (width, height)
                base_json_annotation['imagew'] = image.size[0]
                base_json_annotation['imageh'] = image.size[1]
                current_state = self.FDDB_Parse_State.NUMFACES

            elif current_state == self.FDDB_Parse_State.NUMFACES:
                face_locations = []
                num_faces = int(line)
                current_state = self.FDDB_Parse_State.FACELOCATION

            elif current_state == self.FDDB_Parse_State.FACELOCATION:
                if num_faces > 0:
                    face_location_args = map(float,line.split())
                    bbox = image_ellipse_to_box(image, *face_location_args)
                    if bbox: face_locations.append(bbox)
                    num_faces -= 1
                else:
                    if len(face_locations):
                        yield dotdict({
                            **base_json_annotation, 
                            **{"face_locations":face_locations}})                
                    current_state = self.FDDB_Parse_State.FILEPATH
                    continue

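            line = normalize_string_line(annotations_file_content.readline())

        ## The loop exits at end of file before the last record is yielded, so emit it here
        if current_state == self.FDDB_Parse_State.FACELOCATION and num_faces == 0 and len(face_locations):
            yield dotdict({
                **base_json_annotation,
                **{"face_locations":face_locations}})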
            
    def extract_and_upload(self):

        ## Download paths of images and annotations
        images_tarfile_path = search_path_by_url(self.FDDB_DOWNLOAD_URLS["images"])
        images_tarfile = tarfile.open(images_tarfile_path)
        annotations_tarfile_path = search_path_by_url(self.FDDB_DOWNLOAD_URLS["annotations"])
        annotations_tarfile = tarfile.open(annotations_tarfile_path)

        ## Iterate through 'ellipse' files extracting facial annotations
        for annotation_file in filter(lambda tfn:"ellipse" in tfn.name, annotations_tarfile):
            for fddb_json_annotations in self.generate_fddb_json_annotations(annotations_tarfile.extractfile(annotation_file), images_tarfile):
                
                image_fileobj = get_file_from_tar(fddb_json_annotations["file"], images_tarfile_path)
                image_filepath = fddb_json_annotations["file"]
                annotation_fileobj = io.BytesIO(json.dumps(fddb_json_annotations).encode())
                annotation_filepath = str(Path(fddb_json_annotations["file"]).with_suffix(".json"))
                
                if not self.download_local:
                    image_response = self.upload_file_to_minio(bucket_name = self.FDDB_BUCKET_NAME, 
                                                               objname = image_filepath,
                                                               fileobj = image_fileobj,
                                                               size = images_tarfile.getmember(image_filepath).size)
                    annotation_response = self.upload_file_to_minio(bucket_name = self.FDDB_BUCKET_NAME, 
                                                                    objname = annotation_filepath,
                                                                    fileobj = annotation_fileobj,
                                                                    size = len(annotation_fileobj.getvalue()))  # byte length of the JSON actually uploaded
                    
                else:
                    image_response = write_file_to_filepath(fileobj = image_fileobj,
                                                            filepath = os.path.join(self.download_local_root, self.FDDB_BUCKET_NAME, image_filepath))

                    annotation_response = write_file_to_filepath(fileobj = annotation_fileobj,
                                                                 filepath = os.path.join(self.download_local_root, self.FDDB_BUCKET_NAME, annotation_filepath))
                    
                    
                if self.data_count % self.log_every == 0:
                    if self.download_local:
                        print("Wrote %d images&annotations to %s"%(self.data_count,self.download_local_root))
                    else:
                        print("Uploaded %d images&annotations to %s"%(self.data_count,self.minio_client._endpoint_url))
                        
                self.data_count += 1
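
The three-state parse used by `generate_fddb_json_annotations` can be exercised without the archives by feeding it a few lines of text. The sketch below is a simplified stand-alone variant that yields raw ellipse values rather than boxes; `ParseState` and `parse_ellipse_list` are hypothetical names, and the sample lines are illustrative values laid out in the ellipse-list format (image path, face count, then one line per face).

```python
import enum
import io

class ParseState(enum.Enum):
    FILEPATH = 1
    NUMFACES = 2
    FACELOCATION = 3

def parse_ellipse_list(lines):
    """Yield (image_path, [ellipse floats, ...]) records from an iterable of text lines."""
    state = ParseState.FILEPATH
    path, remaining, faces = None, 0, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if state is ParseState.FILEPATH:
            path, faces = line, []
            state = ParseState.NUMFACES
        elif state is ParseState.NUMFACES:
            remaining = int(line)
            if remaining:
                state = ParseState.FACELOCATION
            else:
                # An image with no faces is complete as soon as its count is read
                yield path, faces
                state = ParseState.FILEPATH
        else:
            faces.append([float(value) for value in line.split()])
            remaining -= 1
            if remaining == 0:
                # Emit once the last face is read, so the final record survives end of file
                yield path, faces
                state = ParseState.FILEPATH

# Illustrative sample in the ellipse-list layout
sample = io.StringIO(
    "2002/08/11/big/img_591\n"
    "1\n"
    "123.58 85.55 1.27 269.69 161.78 1\n"
    "2002/08/26/big/img_265\n"
    "2\n"
    "67.36 44.51 -1.48 105.25 87.21 1\n"
    "41.94 27.06 1.47 184.07 129.35 1\n"
)
records = list(parse_ellipse_list(sample))
for path, faces in records:
    print(path, len(faces))
```

Yielding as soon as the face count reaches zero, instead of waiting for the next path line, is a design choice of this sketch: it means the last image in a file needs no special handling.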


## Run dataset collection

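The builder supports two modes of operation: with `download_local=False` it uploads images and annotations to the Minio bucket, and with `download_local=True` it writes them under `download_local_root`.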
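
As a rough sketch of where each file ends up in the two modes, the helper below mirrors the path handling in `extract_and_upload`. The `destinations` function and the `/tmp/data` root are hypothetical and exist only for illustration; `posixpath` and `PurePosixPath` are used so the result is the same on every platform.

```python
import posixpath
from pathlib import PurePosixPath

BUCKET_NAME = "fddb"  # mirrors FDDB_BUCKET_NAME in the builder class

def destinations(image_member, download_local, download_local_root=""):
    """Return (image_target, annotation_target) for one archive member."""
    annotation_member = str(PurePosixPath(image_member).with_suffix(".json"))
    if download_local:
        # Local mode: files are written under <root>/<bucket name>/
        base = posixpath.join(download_local_root, BUCKET_NAME)
        return (posixpath.join(base, image_member),
                posixpath.join(base, annotation_member))
    # Upload mode: the archive paths are used directly as object names in the bucket
    return image_member, annotation_member

local_targets = destinations("2002/07/19/big/img_423.jpg", True, "/tmp/data")
upload_targets = destinations("2002/07/19/big/img_423.jpg", False)
print(local_targets)
print(upload_targets)
```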

In [13]:
fddb_dataset_builder = tfds.builder(camel_to_snake(FDDBDataset.__name__), download_local=False)
fddb_dataset_builder.download_and_prepare(
    download_dir=os.path.join(os.getcwd(),"data/"))

Downloading and preparing dataset fddb_dataset (?? GiB) to /home/jovyan/tensorflow_datasets/fddb_dataset/0.0.0...



Uploaded 0 images&annotations to http://minio.default.svc.cluster.local:9000
Uploaded 10 images&annotations to http://minio.default.svc.cluster.local:9000
Uploaded 20 images&annotations to http://minio.default.svc.cluster.local:9000
Uploaded 30 images&annotations to http://minio.default.svc.cluster.local:9000
Uploaded 40 images&annotations to http://minio.default.svc.cluster.local:9000
Uploaded 50 images&annotations to http://minio.default.svc.cluster.local:9000
Uploaded 60 images&annotations to http://minio.default.svc.cluster.local:9000
Uploaded 70 images&annotations to http://minio.default.svc.cluster.local:9000
Uploaded 80 images&annotations to http://minio.default.svc.cluster.local:9000
Uploaded 90 images&annotations to http://minio.default.svc.cluster.local:9000
Uploaded 100 images&annotations to http://minio.default.svc.cluster.local:9000
Uploaded 110 images&annotations to http://minio.default.svc.cluster.local:9000
Uploaded 120 images&annotations to http://minio.default.svc.cl

ResponseError: ResponseError: code: XMinioStorageFull, message: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed., bucket_name: fddb, object_name: 2002/08/09/big/img_431.jpg, request_id: 15A53BEEB9227509, host_id: 3L137, region: 

In [19]:
# Minio rejects endpoints that include a scheme (InvalidEndpointError: "Hostname cannot
# have a scheme"), so pass the bare host:port from S3_ENDPOINT as make_minio_client does
minio_client = Minio(os.environ["S3_ENDPOINT"],
                     access_key=os.environ["AWS_ACCESS_KEY_ID"],
                     secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
                     secure=False)

In [17]:
[i.object_name for i in list(minio_client.list_objects_v2('fddb', recursive=True))][:10]

['2002/07/19/big/img_423.jpg',
 '2002/07/19/big/img_423.json',
 '2002/07/19/big/img_581.jpg',
 '2002/07/19/big/img_581.json',
 '2002/07/23/big/img_474.jpg',
 '2002/07/24/big/img_402.jpg',
 '2002/07/24/big/img_402.json',
 '2002/07/24/big/img_518.jpg',
 '2002/07/27/big/img_970.jpg',
 '2002/07/31/big/img_228.jpg']
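
Some `.jpg` objects in the listing above have no matching `.json` next to them, which is consistent with the run stopping partway once storage filled up. A small check like the sketch below can flag such images; `images_missing_annotations` is a hypothetical helper, and the input is the ten names shown above. Because that listing is cut to ten entries, the annotation for the last image may simply fall outside the slice.

```python
from pathlib import PurePosixPath

def images_missing_annotations(object_names):
    """Return the .jpg object names that have no sibling .json object."""
    names = set(object_names)
    return sorted(
        name for name in names
        if name.endswith(".jpg")
        and str(PurePosixPath(name).with_suffix(".json")) not in names
    )

# The ten object names from the listing above
listed = [
    "2002/07/19/big/img_423.jpg",
    "2002/07/19/big/img_423.json",
    "2002/07/19/big/img_581.jpg",
    "2002/07/19/big/img_581.json",
    "2002/07/23/big/img_474.jpg",
    "2002/07/24/big/img_402.jpg",
    "2002/07/24/big/img_402.json",
    "2002/07/24/big/img_518.jpg",
    "2002/07/27/big/img_970.jpg",
    "2002/07/31/big/img_228.jpg",
]
missing = images_missing_annotations(listed)
print(missing)
```

Running the same check over the full, untruncated listing would give a reliable picture of which uploads need to be repeated.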

In [33]:
minio_client.fget_object('fddb', '2002/07/19/big/img_423.jpg', "./a.jpg")

<urllib3.response.HTTPResponse at 0x7fc5c503eeb8>