# Adding a Dataset of Your Own to TFDS

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/lmoroney/dlaicourse/blob/master/TensorFlow%20Deployment/Course%203%20-%20TensorFlow%20Datasets/Week%204/Exercises/TFDS_Week4_Exercise.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/lmoroney/dlaicourse/blob/master/TensorFlow%20Deployment/Course%203%20-%20TensorFlow%20Datasets/Week%204/Exercises/TFDS_Week4_Exercise.ipynb">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
</table>

## Setup

In [None]:
try:
    %tensorflow_version 2.x
except:
    pass

In [1]:
import os
import textwrap
import scipy.io
import pandas as pd

import tensorflow as tf

print("\u2022 Using TensorFlow Version:", tf.__version__)

• Using TensorFlow Version: 2.0.0


## IMDB Faces Dataset

Its authors describe it as the largest publicly available dataset of face images with gender and age labels for training.

Source: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/

The IMDb Faces dataset ships a separate `.mat` (MATLAB) file containing all the meta information. It can be loaded in MATLAB or, as in this notebook, with `scipy.io.loadmat`. The format is as follows:  
**dob**: date of birth (Matlab serial date number)  
**photo_taken**: year when the photo was taken  
**full_path**: path to file  
**gender**: 0 for female and 1 for male, NaN if unknown  
**name**: name of the celebrity  
**face_location**: location of the face (bounding box)  
**face_score**: detector score (the higher the better). A score of -Inf (printed as `-inf` in the output further below) implies that no face was found in the image; face_location then just returns the entire image  
**second_face_score**: detector score of the face with the second highest score. This is useful to ignore images with more than one face. second_face_score is NaN if no second face was detected.  
**celeb_names**: list of all celebrity names  
**celeb_id**: index of celebrity name
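
The `dob` field stores a MATLAB serial date number, which counts days from a different origin than Python's date ordinals. A minimal sketch of the conversion, assuming integer day counts and the commonly used 366-day offset between the two conventions; the helper name `matlab_datenum_to_date` is illustrative and not part of the dataset's tooling:

```python
from datetime import date

# Offset between MATLAB serial date numbers and Python ordinals
# (MATLAB counts from year 0, Python from year 1).
MATLAB_TO_PYTHON_ORDINAL_OFFSET = 366


def matlab_datenum_to_date(datenum):
    """Convert an integer MATLAB serial date number to a datetime.date."""
    return date.fromordinal(int(datenum) - MATLAB_TO_PYTHON_ORDINAL_OFFSET)


# 693726 is the `dob` value of the first record printed later in this notebook.
birth = matlab_datenum_to_date(693726)
print(birth)  # 1899-05-10, the date embedded in that record's file name

# Rough age at the time of the photo, ignoring month and day.
photo_taken = 1968
print(photo_taken - birth.year)  # 69
```

The year difference is only an approximation of age, since `photo_taken` holds a year rather than a full date.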

The source site offers both the raw images and the metadata, as well as a much smaller version containing only the cropped faces (with a 40% margin). This notebook uses the cropped version.

In [2]:
# Download and extract the IMDB Faces dataset
!wget https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/imdb_crop.tar
!tar xf imdb_crop.tar

--2020-03-01 11:06:06--  https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/imdb_crop.tar
Resolving data.vision.ee.ethz.ch (data.vision.ee.ethz.ch)... 129.132.52.162
Connecting to data.vision.ee.ethz.ch (data.vision.ee.ethz.ch)|129.132.52.162|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7012157440 (6.5G) [application/x-tar]
Saving to: ‘imdb_crop.tar’


2020-03-01 11:29:06 (4.85 MB/s) - ‘imdb_crop.tar’ saved [7012157440/7012157440]



Next, let's inspect the dataset.

## Exploring the Data

In [2]:
# Inspect the directory structure
files = os.listdir('imdb_crop')
print(textwrap.fill(' '.join(sorted(files)), 80))

00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 imdb.mat


**NOTE:** The code below loads the metadata from the relative path `./imdb_crop/imdb.mat`. This works in Google's Colab environment without any modifications, because the archive was extracted into the working directory (`/content/`). However, if you are running this notebook locally, make sure the path points to the `imdb_crop/imdb.mat` file on your computer.

In [4]:
# Inspect the meta data
meta = scipy.io.loadmat('./imdb_crop/imdb.mat')

In [5]:
meta

{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Jan 17 11:30:27 2016',
 '__version__': '1.0',
 '__globals__': [],
 'imdb': array([[(array([[693726, 693726, 693726, ..., 726831, 726831, 726831]], dtype=int32), array([[1968, 1970, 1968, ..., 2011, 2011, 2011]], dtype=uint16), array([[array(['01/nm0000001_rm124825600_1899-5-10_1968.jpg'], dtype='<U43'),
         array(['01/nm0000001_rm3343756032_1899-5-10_1970.jpg'], dtype='<U44'),
         array(['01/nm0000001_rm577153792_1899-5-10_1968.jpg'], dtype='<U43'),
         ...,
         array(['08/nm3994408_rm926592512_1989-12-29_2011.jpg'], dtype='<U44'),
         array(['08/nm3994408_rm943369728_1989-12-29_2011.jpg'], dtype='<U44'),
         array(['08/nm3994408_rm976924160_1989-12-29_2011.jpg'], dtype='<U44')]],
       dtype=object), array([[1., 1., 1., ..., 0., 0., 0.]]), array([[array(['Fred Astaire'], dtype='<U12'),
         array(['Fred Astaire'], dtype='<U12'),
         array(['Fred Astaire'], dtype='<U12'), ..

## Extraction

Let's cut through the clutter by going to the metadata's most useful key (`imdb`) and exploring the fields inside it.

In [14]:
root = meta['imdb'][0, 0]

In [7]:
desc = root.dtype.descr
desc

[('dob', '|O'),
 ('photo_taken', '|O'),
 ('full_path', '|O'),
 ('gender', '|O'),
 ('name', '|O'),
 ('face_location', '|O'),
 ('face_score', '|O'),
 ('second_face_score', '|O'),
 ('celeb_names', '|O'),
 ('celeb_id', '|O')]

In [15]:
# EXERCISE: Fill in the missing code below.

full_path = root["full_path"][0]

# Do the same for other attributes
names = root['name'][0]
dob = root['dob'][0]
gender = root['gender'][0]
photo_taken = root['photo_taken'][0]
face_score = root['face_score'][0]
face_locations = root['face_location'][0]
second_face_score = root['second_face_score'][0]
celeb_names = root['celeb_names'][0]
celeb_ids = root['celeb_id'][0]

print('Filepaths: {}\n\n'
      'Names: {}\n\n'
      'Dates of birth: {}\n\n'
      'Genders: {}\n\n'
      'Years when the photos were taken: {}\n\n'
      'Face scores: {}\n\n'
      'Face locations: {}\n\n'
      'Second face scores: {}\n\n'
      'Celeb IDs: {}\n\n'
      .format(full_path, names, dob, gender, photo_taken, face_score, face_locations, second_face_score, celeb_ids))

Filepaths: [array(['01/nm0000001_rm124825600_1899-5-10_1968.jpg'], dtype='<U43')
 array(['01/nm0000001_rm3343756032_1899-5-10_1970.jpg'], dtype='<U44')
 array(['01/nm0000001_rm577153792_1899-5-10_1968.jpg'], dtype='<U43') ...
 array(['08/nm3994408_rm926592512_1989-12-29_2011.jpg'], dtype='<U44')
 array(['08/nm3994408_rm943369728_1989-12-29_2011.jpg'], dtype='<U44')
 array(['08/nm3994408_rm976924160_1989-12-29_2011.jpg'], dtype='<U44')]

Names: [array(['Fred Astaire'], dtype='<U12')
 array(['Fred Astaire'], dtype='<U12')
 array(['Fred Astaire'], dtype='<U12') ...
 array(['Jane Levy'], dtype='<U9') array(['Jane Levy'], dtype='<U9')
 array(['Jane Levy'], dtype='<U9')]

Dates of birth: [693726 693726 693726 ... 726831 726831 726831]

Genders: [1. 1. 1. ... 0. 0. 0.]

Years when the photos were taken: [1968 1970 1968 ... 2011 2011 2011]

Face scores: [1.45969291 2.5431976  3.45557949 ...       -inf 4.45072452 2.13350269]

Face locations: [array([[1072.926,  161.838, 1214.784,  303.696]])
 a

In [18]:
print('Celeb names: {}\n\n'.format(celeb_names))

Celeb names: [array(["'Lee' George Quinones"], dtype='<U21')
 array(["'Weird Al' Yankovic"], dtype='<U19')
 array(['2 Chainz'], dtype='<U8') ...
 array(['Éric Caravaca'], dtype='<U13')
 array(['Ólafur Darri Ólafsson'], dtype='<U21')
 array(['Óscar Jaenada'], dtype='<U13')]




Display all the distinct keys of the dataset and their corresponding values.

In [19]:
names = [x[0] for x in desc]
names

['dob',
 'photo_taken',
 'full_path',
 'gender',
 'name',
 'face_location',
 'face_score',
 'second_face_score',
 'celeb_names',
 'celeb_id']

In [20]:
values = {key: root[key][0] for key in names}
values

{'dob': array([693726, 693726, 693726, ..., 726831, 726831, 726831], dtype=int32),
 'photo_taken': array([1968, 1970, 1968, ..., 2011, 2011, 2011], dtype=uint16),
 'full_path': array([array(['01/nm0000001_rm124825600_1899-5-10_1968.jpg'], dtype='<U43'),
        array(['01/nm0000001_rm3343756032_1899-5-10_1970.jpg'], dtype='<U44'),
        array(['01/nm0000001_rm577153792_1899-5-10_1968.jpg'], dtype='<U43'),
        ...,
        array(['08/nm3994408_rm926592512_1989-12-29_2011.jpg'], dtype='<U44'),
        array(['08/nm3994408_rm943369728_1989-12-29_2011.jpg'], dtype='<U44'),
        array(['08/nm3994408_rm976924160_1989-12-29_2011.jpg'], dtype='<U44')],
       dtype=object),
 'gender': array([1., 1., 1., ..., 0., 0., 0.]),
 'name': array([array(['Fred Astaire'], dtype='<U12'),
        array(['Fred Astaire'], dtype='<U12'),
        array(['Fred Astaire'], dtype='<U12'), ...,
        array(['Jane Levy'], dtype='<U9'),
        array(['Jane Levy'], dtype='<U9'),
        array(['Jane Levy']

## Cleanup

Remove `celeb_names`: it is a lookup list of all celebrity names rather than a per-image attribute, so it is not needed for creating the records.

In [21]:
del values['celeb_names']
names.pop(names.index('celeb_names'))

'celeb_names'

Let's see how many values are present for each key.

In [22]:
for key, value in values.items():
    print(key, len(value))

dob 460723
photo_taken 460723
full_path 460723
gender 460723
name 460723
face_location 460723
face_score 460723
second_face_score 460723
celeb_id 460723
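
Equal lengths matter because a table needs exactly one value per row in every column. A small sketch of that check in plain Python, using made-up toy columns rather than the real metadata (`columns_have_equal_length` is a hypothetical helper):

```python
def columns_have_equal_length(columns):
    """Return True if every column in the mapping has the same number of values."""
    lengths = {len(values) for values in columns.values()}
    return len(lengths) <= 1


# Toy stand-ins for per-image attributes (three images).
per_image = {
    "photo_taken": [1968, 1970, 1968],
    "gender": [1.0, 1.0, 0.0],
}
print(columns_have_equal_length(per_image))  # True

# A lookup list such as celeb_names has one entry per celebrity, not per image,
# so adding it would break the one-value-per-row property.
with_lookup = dict(per_image, celeb_names=["A", "B"])
print(columns_have_equal_length(with_lookup))  # False
```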


## Dataframe

Now, let's examine a few examples from the dataset. To do this, we load all the attributes that we've just extracted into a Pandas dataframe.

In [23]:
df = pd.DataFrame(values, columns=names)
df.head()

| | dob | photo_taken | full_path | gender | name | face_location | face_score | second_face_score | celeb_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 693726 | 1968 | [01/nm0000001_rm124825600_1899-5-10_1968.jpg] | 1.0 | [Fred Astaire] | [[1072.926, 161.838, 1214.7839999999999, 303.6... | 1.459693 | 1.118973 | 6488 |
| 1 | 693726 | 1970 | [01/nm0000001_rm3343756032_1899-5-10_1970.jpg] | 1.0 | [Fred Astaire] | [[477.184, 100.352, 622.592, 245.76]] | 2.543198 | 1.852008 | 6488 |
| 2 | 693726 | 1968 | [01/nm0000001_rm577153792_1899-5-10_1968.jpg] | 1.0 | [Fred Astaire] | [[114.96964308962852, 114.96964308962852, 451.... | 3.455579 | 2.98566 | 6488 |
| 3 | 693726 | 1968 | [01/nm0000001_rm946909184_1899-5-10_1968.jpg] | 1.0 | [Fred Astaire] | [[622.8855056426588, 424.21750383700805, 844.3... | 1.872117 | NaN | 6488 |
| 4 | 693726 | 1968 | [01/nm0000001_rm980463616_1899-5-10_1968.jpg] | 1.0 | [Fred Astaire] | [[1013.8590023603723, 233.8820422075853, 1201.... | 1.158766 | NaN | 6488 |


The Pandas dataframe may contain missing values (NaN). We will have to filter them out later on.

In [24]:
df.isna().sum()

dob                       0
photo_taken               0
full_path                 0
gender                 8462
name                      0
face_location             0
face_score                0
second_face_score    246926
celeb_id                  0
dtype: int64
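
The NaN filtering mentioned above can be sketched without pandas. This toy version, assuming records are plain dictionaries with made-up values, keeps a record only if neither `gender` nor `second_face_score` is NaN; `has_no_nan` is an illustrative name:

```python
import math


def has_no_nan(record, keys):
    """Return True if none of the given keys holds a NaN value in the record."""
    return not any(math.isnan(record[key]) for key in keys)


# Toy records; the values are illustrative, not taken from the dataset.
records = [
    {"gender": 1.0, "second_face_score": 1.1},
    {"gender": float("nan"), "second_face_score": 2.0},
    {"gender": 0.0, "second_face_score": float("nan")},
]

clean = [r for r in records if has_no_nan(r, ["gender", "second_face_score"])]
print(len(clean))  # 1
```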

# TensorFlow Datasets

TFDS provides a way to transform datasets like this one into a standard format, performs the preprocessing necessary to make them ready for a machine learning pipeline, and exposes them through a standard input pipeline built on `tf.data`.

To enable this, each dataset implements a subclass of `DatasetBuilder`, which specifies:

* Where the data is coming from (i.e. its URL). 
* What the dataset looks like (i.e. its features).  
* How the data should be split (e.g. TRAIN and TEST). 
* The individual records in the dataset.

The first time a dataset is used, the dataset is downloaded, prepared, and written to disk in a standard format. Subsequent access will read from those pre-processed files directly.
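
The prepare-once, read-many behaviour described above can be illustrated with a plain-Python analogy. This is only a sketch of the caching idea, not how TFDS is implemented: it stores JSON rather than TFDS's on-disk format, and `load_or_prepare` and `expensive_prepare` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def load_or_prepare(cache_file, prepare_fn):
    """Return cached records if present; otherwise prepare them and cache them."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return json.loads(cache_file.read_text()), "cache"
    records = prepare_fn()
    cache_file.write_text(json.dumps(records))
    return records, "prepared"


def expensive_prepare():
    # Stand-in for downloading and preprocessing the raw data.
    return [{"id": 0, "label": 1}, {"id": 1, "label": 0}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_dataset.json"
    first_records, first_source = load_or_prepare(cache, expensive_prepare)
    second_records, second_source = load_or_prepare(cache, expensive_prepare)

print(first_source, second_source)  # prepared cache
```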

## Clone the TFDS Repository

The next step is to clone the TFDS GitHub repository. This notebook pins a specific version, the `v1.2.0` tag, by running the following command:

In [25]:
!git clone https://github.com/tensorflow/datasets.git -b v1.2.0

Cloning into 'datasets'...
remote: Enumerating objects: 123, done.
remote: Counting objects: 100% (123/123), done.
remote: Compressing objects: 100% (83/83), done.
remote: Total 29169 (delta 48), reused 107 (delta 35), pack-reused 29046
Receiving objects: 100% (29169/29169), 469.24 MiB | 4.79 MiB/s, done.
Resolving deltas: 100% (20689/20689), done.
Note: checking out 'dc4b79f5c4dbdb0db1c5b614f8d790bdc0013de1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>



First, we set the current working directory to `/content/datasets`, the root of the cloned repository, so that the relative paths used in the rest of the notebook resolve.

**NOTE:** Here we have set `/content/` as the path to the `/datasets/` directory. This will work in Google's Colab environment without any modifications. However, if you are running this notebook locally, you should change `/content/` to the appropriate path to the `/datasets/` directory on your computer.

In [None]:
cd /content/datasets

Next, run the script that generates the starter files for the new dataset:

In [None]:
%%bash
python tensorflow_datasets/scripts/create_new_dataset.py \
  --dataset imdb_faces \
  --type image

Now we will use IPython's built-in `%%writefile` cell magic, which writes the contents of the current cell to a file. To create or overwrite a file, use:
```
%%writefile filename
```

Let's see an example:

In [None]:
%%writefile something.py
x = 10

Now that the file has been written, let's inspect its contents.

In [None]:
!cat something.py
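
Outside a notebook, roughly the same effect can be had with the standard library. A minimal sketch, assuming a temporary directory as the target location so nothing is left behind:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "something.py"
    # Like %%writefile, write_text creates the file or overwrites it.
    target.write_text("x = 10\n")
    # Reading it back plays the role of `!cat something.py`.
    contents = target.read_text()

print(contents, end="")  # x = 10
```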

## Define the Dataset with `GeneratorBasedBuilder`

Most datasets subclass `tfds.core.GeneratorBasedBuilder`, which is a subclass of `tfds.core.DatasetBuilder` that simplifies defining a dataset. It works well for datasets that can be generated on a single machine. Its subclasses implement:

* `_info`: builds the DatasetInfo object describing the dataset


* `_split_generators`: downloads the source data and defines the dataset splits


* `_generate_examples`: yields (key, example) tuples in the dataset from the source data

In this exercise, you will use the `GeneratorBasedBuilder`.
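
To make the division of labour between the three methods concrete, here is a plain-Python analogy. It deliberately does not import TFDS: `ToyBuilder` and its `build` driver are hypothetical, and the real base class does considerably more (downloading, serialization, and so on).

```python
class ToyBuilder:
    """Plain-Python analogy of the three hooks; this is not the TFDS API."""

    def _info(self):
        # Describe the dataset: here, just its feature names.
        return {"features": ["text", "label"]}

    def _split_generators(self):
        # Map each split name to the arguments its example generator needs.
        return {"train": {"rows": [("good", 1), ("bad", 0)]},
                "test": {"rows": [("fine", 1)]}}

    def _generate_examples(self, rows):
        # Yield (key, example) pairs; the dict built below relies on unique keys.
        for index, (text, label) in enumerate(rows):
            yield index, {"text": text, "label": label}

    def build(self):
        """Drive the hooks the way a base class might, collecting every split."""
        return {split: dict(self._generate_examples(**kwargs))
                for split, kwargs in self._split_generators().items()}


dataset = ToyBuilder().build()
print(sorted(dataset))      # ['test', 'train']
print(dataset["train"][1])  # {'text': 'bad', 'label': 0}
```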

### EXERCISE: Fill in the missing code below.

In [None]:
%%writefile tensorflow_datasets/image/imdb_faces.py

# coding=utf-8
# Copyright 2019 The TensorFlow Datasets Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""IMDB Faces dataset."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import os
import re

import tensorflow as tf
import tensorflow_datasets.public_api as tfds

_DESCRIPTION = """\
Since the publicly available face image datasets are often of small to medium size, 
rarely exceeding tens of thousands of images, and often without age information we decided to collect 
a large dataset of celebrities. For this purpose, we took the list of the most popular 100,000 actors 
as listed on the IMDb website and (automatically) crawled from their profiles date of birth, 
name, gender and all images related to that person. Additionally we crawled all profile images 
from pages of people from Wikipedia with the same meta information. We removed the images without 
timestamp (the date when the photo was taken). Assuming that the images with single faces are likely 
to show the actor and that the timestamp and date of birth are correct, we were able to assign to each 
such image the biological (real) age. Of course, we can not vouch for the accuracy of the assigned age information. 
"""

_URL = ("https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/")
_DATASET_ROOT_DIR = "imdb_crop" # Put the name of the dataset root directory here
_ANNOTATION_FILE = "imdb.mat" # Put the name of annotation file here (.mat file)


_CITATION = """\
@article{Rothe-IJCV-2018,
  author = {Rasmus Rothe and Radu Timofte and Luc Van Gool},
  title = {Deep expectation of real and apparent age from a single image without facial landmarks},
  journal = {International Journal of Computer Vision},
  volume={126},
  number={2-4},
  pages={144--157},
  year={2018},
  publisher={Springer}
}
"""

# Source URL of the IMDB faces dataset
_TARBALL_URL = "https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/imdb_crop.tar"

class ImdbFaces(tfds.core.GeneratorBasedBuilder):
  """IMDB Faces dataset."""
  VERSION = tfds.core.Version("0.1.0")
  


  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        # Describe the features of the dataset by following this url
        # https://www.tensorflow.org/datasets/api_docs/python/tfds/features
        features=tfds.features.FeaturesDict({
            "image": tfds.features.Image(), # Create a tfds Image feature here
            "gender": tfds.features.ClassLabel(num_classes=2), # Create a tfds Class Label feature here for the two classes (Female, Male)
            # "gender": tfds.features.ClassLabel(num_classes=2, names=["Female", "Male"])
            "dob": tf.int32,
            "photo_taken": tf.int32,
            "face_location": tfds.features.BBoxFeature(), # Create a tfds Bounding box feature here
            "face_score": tf.float32,
            "second_face_score": tf.float32,
            "celeb_id": tf.int32
        }),
        supervised_keys=("image", "gender"),
        urls=[_URL],
        citation=_CITATION)


  def _split_generators(self, dl_manager):
    # Download the archive and extract it. The URL is passed as a one-element
    # list, so a list of extracted paths is returned.
    extracted_path = dl_manager.download_and_extract([_TARBALL_URL])

    # Parse the .mat annotation file with scipy's loadmat.
    def parse_mat_file(file_name):
      with tf.io.gfile.GFile(file_name, "rb") as f:
        # scipy is imported lazily through tfds and reads from the open
        # file handle rather than from the path.
        dataset = tfds.core.lazy_imports.scipy.io.loadmat(f)['imdb']
      return dataset

    # Parsing the mat file by using scipy's loadmat method
    # Pass the path to the annotation file using the downloaded/extracted paths above
    meta = parse_mat_file(os.path.join(extracted_path[0], _DATASET_ROOT_DIR, _ANNOTATION_FILE) )
    #meta = parse_mat_file(os.path.join(extracted_path[0], "imdb.mat") )
    # Get the names of celebrities from the metadata
    celeb_names = meta[0, 0]["celeb_names"][0]

    # Create tuples out of the distinct set of genders and celeb names
    self.info.features['gender'].names = ("Female", "Male")
    self.info.features['celeb_id'].names = tuple([x[0] for x in celeb_names])

    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={
                "image_dir": extracted_path[0],
                "metadata": meta,
            })
    ]

  def _get_bounding_box_values(self, bbox_annotations, img_width, img_height):
    """Function to get normalized bounding box values.

    Args:
      bbox_annotations: four pixel coordinates, indexed below as
        (ymin, xmin, ymax, xmax)
      img_width: image width
      img_height: image height

    Returns:
      Normalized bounding box ymin, xmin, ymax, xmax values
    """
    ymin = bbox_annotations[0] / img_height
    xmin = bbox_annotations[1] / img_width
    ymax = bbox_annotations[2] / img_height
    xmax = bbox_annotations[3] / img_width
    return ymin, xmin, ymax, xmax
  
  def _get_image_shape(self, image_path):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_image(image, channels=3)
    shape = image.shape[:2]
    return shape



  def _generate_examples(self, image_dir, metadata):
    # Add a lazy import for pandas here (pd)
    pd = tfds.core.lazy_imports.pandas

    # Extract the root dictionary from the metadata so that you can query all the keys inside it
    root = metadata[0, 0]

    # Extract image names, dobs, genders, face locations, the years when the
    # photos were taken, face scores (second face score too) and celeb ids.
    image_names = root["full_path"][0]
        
    # Do the same for other attributes (dob, genders etc)
    dobs = root['dob'][0]
    genders = root['gender'][0]
    face_locations = root['face_location'][0]
    photo_taken_years = root['photo_taken'][0]
    face_scores = root['face_score'][0]
    second_face_scores = root['second_face_score'][0]
    celeb_id = root['celeb_id'][0]

        
    # Now create a dataframe out of all the features like you've seen before
    df = pd.DataFrame(
     list(zip(image_names,
             dobs,
             genders,
             face_locations,
             photo_taken_years,
             face_scores,
             second_face_scores,
             celeb_id)),
    columns=["image_names",
             "dobs",
             "genders",
             "face_locations",
             "photo_taken_years",
             "face_scores",
             "second_face_scores",
             "celeb_ids"])

    # Filter dataframe by only having the rows with face_scores > 1.0
    df = df[df["face_scores"] > 1.0]


    # Remove any records that contain Nulls/NaNs by checking for NaN with .isna()
    df = df[~df['genders'].isna()]
    df = df[~df['second_face_scores'].isna()]

    # Cast genders to integers so that mapping can take place
    df.genders = df.genders.astype(int)

    # Iterate over all the rows in the dataframe and map each feature
    for _, row in df.iterrows():
      # Extract filename, gender, dob, photo_taken, 
      # face_score, second_face_score and celeb_id
      filename = os.path.join(image_dir, _DATASET_ROOT_DIR, row['image_names'][0])
      gender = row['genders']
      dob = row['dobs']
      photo_taken = row['photo_taken_years']
      face_score = row['face_scores']
      second_face_score = row['second_face_scores']
      celeb_id = row['celeb_ids']

      # Get the image shape. The helper returns (height, width), in that order.
      image_height, image_width = self._get_image_shape(filename)
      # Normalize the bounding boxes by using the face coordinates and the image shape
      bbox = self._get_bounding_box_values(row['face_locations'][0], 
                                           image_width, image_height)

      # Yield a feature dictionary 
      yield filename, {
          "image": filename,
          "gender": gender,
          "dob": dob,
          "photo_taken": photo_taken,
          "face_location": tfds.features.BBox( # Create a bounding box (BBox) object out of the coordinates extracted
              ymin=min(bbox[0], 1.0),
              xmin=min(bbox[1], 1.0),
              ymax=min(bbox[2], 1.0),
              xmax=min(bbox[3], 1.0)
          ),
          "face_score": face_score,
          "second_face_score": second_face_score,
          "celeb_id": celeb_id
      }
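
The bounding-box step in the builder divides pixel coordinates by the image size and caps the results at 1.0. A standalone sketch of that idea follows, assuming the box is given as `(ymin, xmin, ymax, xmax)` in pixels, which is how the helper in the builder indexes it. `normalize_bbox` is an illustrative name and, unlike the builder code, it also clips at 0.0.

```python
def normalize_bbox(box, height, width):
    """Scale a pixel-space (ymin, xmin, ymax, xmax) box into the range [0, 1]."""
    ymin, xmin, ymax, xmax = box

    def clip(value):
        # Keep coordinates that fall outside the image within [0, 1].
        return max(0.0, min(1.0, value))

    return (clip(ymin / height), clip(xmin / width),
            clip(ymax / height), clip(xmax / width))


print(normalize_bbox((10, 20, 30, 40), height=100, width=200))
# (0.1, 0.1, 0.3, 0.2)
print(normalize_bbox((-5, 0, 150, 100), height=100, width=200))
# (0.0, 0.0, 1.0, 0.5)
```

Vertical coordinates are divided by the height and horizontal ones by the width; mixing the two up silently produces boxes that look plausible but are wrong for non-square images.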


## Add an Import for Registration

All subclasses of `tfds.core.DatasetBuilder` are automatically registered when their module is imported such that they can be accessed through `tfds.builder` and `tfds.load`.
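
Registration on import can be pictured with a small plain-Python analogy. This sketch uses `__init_subclass__` to record each subclass under a snake_case name; TFDS's own mechanism may differ, and `RegisteredBuilder` is a hypothetical class, not part of the library.

```python
class RegisteredBuilder:
    """Toy base class that records every subclass under a snake_case name."""

    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Convert CamelCase to snake_case, e.g. ImdbFaces -> imdb_faces.
        name = "".join(
            "_" + ch.lower() if ch.isupper() and i else ch.lower()
            for i, ch in enumerate(cls.__name__))
        RegisteredBuilder.registry[name] = cls


class ImdbFaces(RegisteredBuilder):
    """Defining (or importing) this class is enough to register it."""


print(sorted(RegisteredBuilder.registry))  # ['imdb_faces']
print(RegisteredBuilder.registry["imdb_faces"] is ImdbFaces)  # True
```

The class body runs when its module is imported, which is why a missing import leaves the dataset name unknown to the lookup.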

If you're contributing the dataset to `tensorflow/datasets`, you must add the module import to its subdirectory's `__init__.py` (e.g. `image/__init__.py`), as shown below:

In [None]:
%%writefile tensorflow_datasets/image/__init__.py
# coding=utf-8
# Copyright 2019 The TensorFlow Datasets Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Image datasets."""

from tensorflow_datasets.image.abstract_reasoning import AbstractReasoning
from tensorflow_datasets.image.aflw2k3d import Aflw2k3d
from tensorflow_datasets.image.bigearthnet import Bigearthnet
from tensorflow_datasets.image.binarized_mnist import BinarizedMNIST
from tensorflow_datasets.image.binary_alpha_digits import BinaryAlphaDigits
from tensorflow_datasets.image.caltech import Caltech101
from tensorflow_datasets.image.caltech_birds import CaltechBirds2010
from tensorflow_datasets.image.cats_vs_dogs import CatsVsDogs
from tensorflow_datasets.image.cbis_ddsm import CuratedBreastImagingDDSM
from tensorflow_datasets.image.celeba import CelebA
from tensorflow_datasets.image.celebahq import CelebAHq
from tensorflow_datasets.image.chexpert import Chexpert
from tensorflow_datasets.image.cifar import Cifar10
from tensorflow_datasets.image.cifar import Cifar100
from tensorflow_datasets.image.cifar10_corrupted import Cifar10Corrupted
from tensorflow_datasets.image.clevr import CLEVR
from tensorflow_datasets.image.coco import Coco
from tensorflow_datasets.image.coco2014_legacy import Coco2014
from tensorflow_datasets.image.coil100 import Coil100
from tensorflow_datasets.image.colorectal_histology import ColorectalHistology
from tensorflow_datasets.image.colorectal_histology import ColorectalHistologyLarge
from tensorflow_datasets.image.cycle_gan import CycleGAN
from tensorflow_datasets.image.deep_weeds import DeepWeeds
from tensorflow_datasets.image.diabetic_retinopathy_detection import DiabeticRetinopathyDetection
from tensorflow_datasets.image.downsampled_imagenet import DownsampledImagenet
from tensorflow_datasets.image.dsprites import Dsprites
from tensorflow_datasets.image.dtd import Dtd
from tensorflow_datasets.image.eurosat import Eurosat
from tensorflow_datasets.image.flowers import TFFlowers
from tensorflow_datasets.image.food101 import Food101
from tensorflow_datasets.image.horses_or_humans import HorsesOrHumans
from tensorflow_datasets.image.image_folder import ImageLabelFolder
from tensorflow_datasets.image.imagenet import Imagenet2012
from tensorflow_datasets.image.imagenet2012_corrupted import Imagenet2012Corrupted
from tensorflow_datasets.image.kitti import Kitti
from tensorflow_datasets.image.lfw import LFW
from tensorflow_datasets.image.lsun import Lsun
from tensorflow_datasets.image.mnist import EMNIST
from tensorflow_datasets.image.mnist import FashionMNIST
from tensorflow_datasets.image.mnist import KMNIST
from tensorflow_datasets.image.mnist import MNIST
from tensorflow_datasets.image.mnist_corrupted import MNISTCorrupted
from tensorflow_datasets.image.omniglot import Omniglot
from tensorflow_datasets.image.open_images import OpenImagesV4
from tensorflow_datasets.image.oxford_flowers102 import OxfordFlowers102
from tensorflow_datasets.image.oxford_iiit_pet import OxfordIIITPet
from tensorflow_datasets.image.patch_camelyon import PatchCamelyon
from tensorflow_datasets.image.pet_finder import PetFinder
from tensorflow_datasets.image.quickdraw import QuickdrawBitmap
from tensorflow_datasets.image.resisc45 import Resisc45
from tensorflow_datasets.image.rock_paper_scissors import RockPaperScissors
from tensorflow_datasets.image.scene_parse_150 import SceneParse150
from tensorflow_datasets.image.shapes3d import Shapes3d
from tensorflow_datasets.image.smallnorb import Smallnorb
from tensorflow_datasets.image.so2sat import So2sat
from tensorflow_datasets.image.stanford_dogs import StanfordDogs
from tensorflow_datasets.image.stanford_online_products import StanfordOnlineProducts
from tensorflow_datasets.image.sun import Sun397
from tensorflow_datasets.image.svhn import SvhnCropped
from tensorflow_datasets.image.uc_merced import UcMerced
from tensorflow_datasets.image.visual_domain_decathlon import VisualDomainDecathlon
from tensorflow_datasets.image.voc import Voc2007

# EXERCISE: Import your dataset module here
from tensorflow_datasets.image.imdb_faces import ImdbFaces

## URL Checksums

If you're contributing the dataset to `tensorflow/datasets`, add a checksums file for the dataset. On first download, the DownloadManager will automatically add the sizes and checksums for all downloaded URLs to that file. This ensures that on subsequent data generation, the downloaded files are as expected.

In [29]:
!touch tensorflow_datasets/url_checksums/imdb_faces.txt
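
To see what "size and checksum" means in practice, the sketch below computes both for a small file using only the standard library. SHA-256 is an assumption made for illustration, as is the helper name `size_and_sha256`; the exact file format TFDS writes is not shown here.

```python
import hashlib
import tempfile
from pathlib import Path


def size_and_sha256(path, chunk_size=1 << 20):
    """Return (size in bytes, hex SHA-256 digest) of a file, read in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    size, checksum = size_and_sha256(sample)

print(size)  # 5
print(checksum == hashlib.sha256(b"hello").hexdigest())  # True
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte archives like the one used in this notebook.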

## Build the Dataset

In [30]:
# EXERCISE: Fill in the name of your dataset.
# The name must be a string.
DATASET_NAME = 'imdb_faces'

Build the dataset by running the `download_and_prepare` script from the repository root. The `--register_checksums` flag records the size and checksum of each downloaded file in the checksums file created above.

In [None]:
%%bash -s $DATASET_NAME
python -m tensorflow_datasets.scripts.download_and_prepare \
  --register_checksums \
  --datasets=$1

**NOTE:** It may take more than 30 minutes to download the dataset and write all the preprocessed files as TFRecords. Because of the size of the data involved (the archive alone is about 6.5 GB), we do not run the above code here. However, if you have enough disk space, you are welcome to run it locally or in Colab.

## Load the Dataset

Once the dataset is built you can load it in the usual way, by using `tfds.load`, as shown below:

```python
import tensorflow_datasets as tfds
dataset, info = tfds.load('imdb_faces', with_info=True)
```

**Note:** This only works once the `imdb_faces` dataset has been built. If you skipped the build because of its size, run the build step locally or in Colab first. The outputs shown in the cells below come from a run in which the dataset had been built.

In [2]:
import tensorflow_datasets as tfds
dataset, info = tfds.load('imdb_faces', with_info=True)



## Explore the Dataset

Once the dataset is loaded, you can explore it by using the following loop:

```python
for feature in tfds.as_numpy(dataset['train']):
  for key, value in feature.items():
    if key == 'image':
      value = value.shape
    print(key, value)
  break
```


The output from the code block shown above should look similar to the following; the exact values depend on which example is read first:

```
celeb_id 12387
dob 722957
face_location [1.         0.56327355 1.         1.        ]
face_score 4.0612864
gender 0
image (96, 97, 3)
photo_taken 2007
second_face_score 3.6680346
```

In [8]:
for feature in tfds.as_numpy(dataset['train']):
  for key, value in feature.items():
    if key == 'image':
      value = value.shape
    print(key, value)
  break

celeb_id 2571
dob 717363
face_location [1. 1. 1. 1.]
face_score 3.4040062
gender 0
image (81, 82, 3)
photo_taken 1999
second_face_score 2.3661942


In [9]:
for feature in dataset['train'].take(2):
  print(feature)

{'celeb_id': <tf.Tensor: id=281, shape=(), dtype=int32, numpy=7580>, 'dob': <tf.Tensor: id=282, shape=(), dtype=int32, numpy=726008>, 'face_location': <tf.Tensor: id=283, shape=(4,), dtype=float32, numpy=array([0.37121794, 0.49405247, 0.9212591 , 1.        ], dtype=float32)>, 'face_score': <tf.Tensor: id=284, shape=(), dtype=float32, numpy=4.873932>, 'gender': <tf.Tensor: id=285, shape=(), dtype=int64, numpy=0>, 'image': <tf.Tensor: id=286, shape=(196, 196, 3), dtype=uint8, numpy=
array([[[ 79,  82,  87],
        [ 80,  83,  88],
        [ 82,  86,  89],
        ...,
        [ 17,  12,   6],
        [ 15,  10,   4],
        [ 12,   7,   1]],

       [[ 70,  73,  78],
        [ 71,  74,  79],
        [ 74,  78,  81],
        ...,
        [ 16,  11,   5],
        [ 14,   9,   3],
        [ 13,   8,   2]],

       [[ 67,  71,  74],
        [ 68,  72,  75],
        [ 71,  75,  78],
        ...,
        [ 15,  10,   4],
        [ 14,   9,   3],
        [ 13,   8,   2]],

       ...,

      

In [10]:
for feature in dataset['train'].take(2):
  print('\n')
  for key, value in feature.items():
    if key == 'image':
      value = value.shape
    print(key, value)
  



celeb_id tf.Tensor(2571, shape=(), dtype=int32)
dob tf.Tensor(717363, shape=(), dtype=int32)
face_location tf.Tensor([1. 1. 1. 1.], shape=(4,), dtype=float32)
face_score tf.Tensor(3.4040062, shape=(), dtype=float32)
gender tf.Tensor(0, shape=(), dtype=int64)
image (81, 82, 3)
photo_taken tf.Tensor(1999, shape=(), dtype=int32)
second_face_score tf.Tensor(2.3661942, shape=(), dtype=float32)


celeb_id tf.Tensor(4451, shape=(), dtype=int32)
dob tf.Tensor(721146, shape=(), dtype=int32)
face_location tf.Tensor([1.        0.5003457 1.        1.       ], shape=(4,), dtype=float32)
face_score tf.Tensor(2.458625, shape=(), dtype=float32)
gender tf.Tensor(1, shape=(), dtype=int64)
image (190, 190, 3)
photo_taken tf.Tensor(1997, shape=(), dtype=int32)
second_face_score tf.Tensor(1.9733487, shape=(), dtype=float32)


# Next steps for publishing

**Double-check the citation**  

It's important that DatasetInfo.citation includes a good citation for the dataset. It's hard and important work contributing a dataset to the community and we want to make it easy for dataset users to cite the work.

If the dataset's website has a specifically requested citation, use that (in BibTeX format).

If the paper is on arXiv, find it there and click the bibtex link on the right-hand side.

If the paper is not on arXiv, find the paper on Google Scholar and click the double-quotation mark underneath the title and on the popup, click BibTeX.

If there is no associated paper (for example, there's just a website), you can use the BibTeX Online Editor to create a custom BibTeX entry (the drop-down menu has an Online entry type).
  

**Add a test**

Most datasets in TFDS should have a unit test, and your reviewer may ask you to add one if you haven't already. See the testing section of the guide linked at the end of this page.

**Check your code style**

Follow the PEP 8 Python style guide, except that TensorFlow uses 2 spaces instead of 4. Please conform to the Google Python Style Guide.

Most importantly, use `tensorflow_datasets/oss_scripts/lint.sh` to ensure your code is properly formatted, for example to lint the `image` directory. See the TensorFlow code style guide for more information.

**Add release notes**

Add the dataset to the release notes. The release notes will be published with the next release.

**Send for review!**

Send the pull request for review.

For more information, visit https://www.tensorflow.org/datasets/add_dataset