# Adding a Dataset of Your Own to TFDS

In [1]:
import os
import textwrap
import scipy.io
import pandas as pd
from datetime import datetime

from os import getcwd

## IMDB Faces Dataset

According to its authors, IMDB-WIKI is the largest publicly available dataset of face images with gender and age labels for training.

Source: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/

The IMDb Faces dataset provides a separate .mat file containing all the meta information. It can be loaded with MATLAB, or with `scipy.io.loadmat` as done below. The format is as follows:  
**dob**: date of birth (Matlab serial date number)  
**photo_taken**: year when the photo was taken  
**full_path**: path to file  
**gender**: 0 for female and 1 for male, NaN if unknown  
**name**: name of the celebrity  
**face_location**: location of the face (bounding box)  
**face_score**: detector score (the higher the better). A score of -Inf means that no face was found in the image; face_location then just covers the entire image  
**second_face_score**: detector score of the face with the second highest score. This is useful to ignore images with more than one face. second_face_score is NaN if no second face was detected.  
**celeb_names**: list of all celebrity names  
**celeb_id**: index of celebrity name  
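
As a rough illustration of how these fields combine, the sketch below builds a boolean mask that keeps images with one confidently detected face and a known gender. The arrays are toy values, and the helper name `usable_mask` and the 1.0 threshold are assumptions made for this example, not part of the dataset's tooling.

```python
import numpy as np

def usable_mask(face_score, second_face_score, gender, min_score=1.0):
    """Return True where an image has one confident face and a known gender."""
    face_score = np.asarray(face_score, dtype=float)
    second_face_score = np.asarray(second_face_score, dtype=float)
    gender = np.asarray(gender, dtype=float)
    has_face = np.isfinite(face_score) & (face_score > min_score)
    single_face = np.isnan(second_face_score)  # NaN means no second face
    known_gender = ~np.isnan(gender)
    return has_face & single_face & known_gender

# Toy values: row 1 has no detected face, row 2 has a second face,
# row 3 has an unknown gender.
mask = usable_mask(
    face_score=[2.5, -np.inf, 3.1, 4.0],
    second_face_score=[np.nan, np.nan, 1.2, np.nan],
    gender=[1.0, 0.0, 1.0, np.nan],
)
print(mask.tolist())
```

Only the first toy row passes all three checks.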

Next, let's inspect the dataset.

## Exploring the Data

In [3]:
# Inspect the directory structure
indir = "C:\\Users\\nick\\tensorflow_datasets\\downloads\\manual\\imdb_wiki"
imdb_crop_file_path = os.path.join(indir, 'imdb_crop')
wiki_crop_file_path = os.path.join(indir, 'wiki_crop')
imdb_files = os.listdir(imdb_crop_file_path)
wiki_files = os.listdir(wiki_crop_file_path)
print("IMDB:\n", textwrap.fill(' '.join(sorted(imdb_files)), 80), '\n\n')

print("Wiki:\n", textwrap.fill(' '.join(sorted(wiki_files)), 80))

IMDB:
 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 imdb.mat 


Wiki:
 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 wiki.mat


In [5]:
# Inspect the metadata
imdb_mat_file_path = os.path.join(indir, 'imdb_crop', 'imdb.mat')
meta_imdb = scipy.io.loadmat(imdb_mat_file_path)

wiki_mat_file_path = os.path.join(indir, 'wiki_crop', 'wiki.mat')
meta_wiki = scipy.io.loadmat(wiki_mat_file_path)

In [7]:
meta_imdb

{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Jan 17 11:30:27 2016',
 '__version__': '1.0',
 '__globals__': [],
 'imdb': array([[(array([[693726, 693726, 693726, ..., 726831, 726831, 726831]]), array([[1968, 1970, 1968, ..., 2011, 2011, 2011]], dtype=uint16), array([[array(['01/nm0000001_rm124825600_1899-5-10_1968.jpg'], dtype='<U43'),
         array(['01/nm0000001_rm3343756032_1899-5-10_1970.jpg'], dtype='<U44'),
         array(['01/nm0000001_rm577153792_1899-5-10_1968.jpg'], dtype='<U43'),
         ...,
         array(['08/nm3994408_rm926592512_1989-12-29_2011.jpg'], dtype='<U44'),
         array(['08/nm3994408_rm943369728_1989-12-29_2011.jpg'], dtype='<U44'),
         array(['08/nm3994408_rm976924160_1989-12-29_2011.jpg'], dtype='<U44')]],
       dtype=object), array([[1., 1., 1., ..., 0., 0., 0.]]), array([[array(['Fred Astaire'], dtype='<U12'),
         array(['Fred Astaire'], dtype='<U12'),
         array(['Fred Astaire'], dtype='<U12'), ...,
         a

## Extraction

Let's cut through the clutter by selecting the metadata's most useful key (`imdb`) and exploring the keys inside it.

In [8]:
root = meta_imdb['imdb'][0, 0]

In [9]:
desc = root.dtype.descr
desc

[('dob', '|O'),
 ('photo_taken', '|O'),
 ('full_path', '|O'),
 ('gender', '|O'),
 ('name', '|O'),
 ('face_location', '|O'),
 ('face_score', '|O'),
 ('second_face_score', '|O'),
 ('celeb_names', '|O'),
 ('celeb_id', '|O')]
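
The record returned by `loadmat` is a NumPy structured value, so its field names can also be read from `dtype.names`. The sketch below shows the same access pattern on a toy record built by hand; the two-field layout and the values are assumptions chosen to mimic the shape seen above, not the real metadata.

```python
import numpy as np

# Toy record with two object fields, mimicking the shape loadmat returns.
record_dtype = np.dtype([("dob", "O"), ("gender", "O")])
toy = np.empty((1, 1), dtype=record_dtype)
toy["dob"][0, 0] = np.array([[693726, 720083]])
toy["gender"][0, 0] = np.array([[1.0, 1.0]])

root_toy = toy[0, 0]
field_names = list(root_toy.dtype.names)
print(field_names)

# Each field holds a 1 x N array, so [0] yields the flat row of values.
dob_values = root_toy["dob"][0]
print(dob_values.tolist())
```

This is why every extraction in the following cells ends in `[0]`.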

In [10]:
# EXERCISE: Fill in the missing code below.

full_path = root["full_path"][0]

# Do the same for other attributes
names = root["name"][0]
dob = root["dob"][0]
gender = root["gender"][0]
photo_taken = root["photo_taken"][0]
face_score = root["face_score"][0]
face_locations = root["face_location"][0]
second_face_score = root["second_face_score"][0]
celeb_names = root["celeb_names"][0]
celeb_ids = root["celeb_id"][0]

print('Filepaths: {}\n\n'
      'Names: {}\n\n'
      'Dates of birth: {}\n\n'
      'Genders: {}\n\n'
      'Years when the photos were taken: {}\n\n'
      'Face scores: {}\n\n'
      'Face locations: {}\n\n'
      'Second face scores: {}\n\n'
      'Celeb IDs: {}\n\n'
      .format(full_path, names, dob, gender, photo_taken, face_score, face_locations, second_face_score, celeb_ids))

Filepaths: [array(['01/nm0000001_rm124825600_1899-5-10_1968.jpg'], dtype='<U43')
 array(['01/nm0000001_rm3343756032_1899-5-10_1970.jpg'], dtype='<U44')
 array(['01/nm0000001_rm577153792_1899-5-10_1968.jpg'], dtype='<U43') ...
 array(['08/nm3994408_rm926592512_1989-12-29_2011.jpg'], dtype='<U44')
 array(['08/nm3994408_rm943369728_1989-12-29_2011.jpg'], dtype='<U44')
 array(['08/nm3994408_rm976924160_1989-12-29_2011.jpg'], dtype='<U44')]

Names: [array(['Fred Astaire'], dtype='<U12')
 array(['Fred Astaire'], dtype='<U12')
 array(['Fred Astaire'], dtype='<U12') ...
 array(['Jane Levy'], dtype='<U9') array(['Jane Levy'], dtype='<U9')
 array(['Jane Levy'], dtype='<U9')]

Dates of birth: [693726 693726 693726 ... 726831 726831 726831]

Genders: [1. 1. 1. ... 0. 0. 0.]

Years when the photos were taken: [1968 1970 1968 ... 2011 2011 2011]

Face scores: [1.45969291 2.5431976  3.45557949 ...       -inf 4.45072452 2.13350269]

Face locations: [array([[1072.926,  161.838, 1214.784,  303.696]])
 a

In [11]:
print('Celeb names: {}\n\n'.format(celeb_names))

Celeb names: [array(["'Lee' George Quinones"], dtype='<U21')
 array(["'Weird Al' Yankovic"], dtype='<U19')
 array(['2 Chainz'], dtype='<U8') ...
 array(['Éric Caravaca'], dtype='<U13')
 array(['Ólafur Darri Ólafsson'], dtype='<U21')
 array(['Óscar Jaenada'], dtype='<U13')]




Display all the distinct keys and their corresponding values

In [12]:
names = [x[0] for x in desc]
names

['dob',
 'photo_taken',
 'full_path',
 'gender',
 'name',
 'face_location',
 'face_score',
 'second_face_score',
 'celeb_names',
 'celeb_id']

In [13]:
imdb_values = {key: root[key][0] for key in names}
imdb_values

{'dob': array([693726, 693726, 693726, ..., 726831, 726831, 726831]),
 'photo_taken': array([1968, 1970, 1968, ..., 2011, 2011, 2011], dtype=uint16),
 'full_path': array([array(['01/nm0000001_rm124825600_1899-5-10_1968.jpg'], dtype='<U43'),
        array(['01/nm0000001_rm3343756032_1899-5-10_1970.jpg'], dtype='<U44'),
        array(['01/nm0000001_rm577153792_1899-5-10_1968.jpg'], dtype='<U43'),
        ...,
        array(['08/nm3994408_rm926592512_1989-12-29_2011.jpg'], dtype='<U44'),
        array(['08/nm3994408_rm943369728_1989-12-29_2011.jpg'], dtype='<U44'),
        array(['08/nm3994408_rm976924160_1989-12-29_2011.jpg'], dtype='<U44')],
       dtype=object),
 'gender': array([1., 1., 1., ..., 0., 0., 0.]),
 'name': array([array(['Fred Astaire'], dtype='<U12'),
        array(['Fred Astaire'], dtype='<U12'),
        array(['Fred Astaire'], dtype='<U12'), ...,
        array(['Jane Levy'], dtype='<U9'),
        array(['Jane Levy'], dtype='<U9'),
        array(['Jane Levy'], dtype='<U9'

Repeat for Wiki

In [14]:
root = meta_wiki['wiki'][0, 0]
desc = root.dtype.descr
names = [x[0] for x in desc]
wiki_values = {key: root[key][0] for key in names}
wiki_values

{'dob': array([723671, 703186, 711677, ..., 720620, 723893, 713846]),
 'photo_taken': array([2009, 1964, 2008, ..., 2013, 2011, 2008], dtype=uint16),
 'full_path': array([array(['17/10000217_1981-05-05_2009.jpg'], dtype='<U31'),
        array(['48/10000548_1925-04-04_1964.jpg'], dtype='<U31'),
        array(['12/100012_1948-07-03_2008.jpg'], dtype='<U29'), ...,
        array(['09/9998109_1972-12-27_2013.jpg'], dtype='<U30'),
        array(['00/9999400_1981-12-13_2011.jpg'], dtype='<U30'),
        array(['80/999980_1954-06-11_2008.jpg'], dtype='<U29')],
       dtype=object),
 'gender': array([1., 1., 1., ..., 1., 1., 0.]),
 'name': array([array(['Sami Jauhojärvi'], dtype='<U15'),
        array(['Dettmar Cramer'], dtype='<U14'),
        array(['Marc Okrand'], dtype='<U11'), ...,
        array(['Michael Wiesinger'], dtype='<U17'),
        array(['Johann Grugger'], dtype='<U14'),
        array(['Greta Van Susteren'], dtype='<U18')], dtype=object),
 'face_location': array([array([[111.29109

## Cleanup

Pop out the celeb names as they are not relevant for creating the records.

In [12]:
# del values['celeb_names']
# names.pop(names.index('celeb_names'))

Let's see how many values are present in each key

In [13]:
# for key, value in values.items():
#     print(key, len(value))

## Dataframe

Now, let's examine a few examples from the dataset. To do this, we load the attributes we just extracted into a pandas DataFrame.

In [15]:
df_imdb = pd.DataFrame(imdb_values, columns=names)
df_wiki = pd.DataFrame(wiki_values, columns=names)
df_imdb.head()

Unnamed: 0,dob,photo_taken,full_path,gender,name,face_location,face_score,second_face_score
0,693726,1968,[01/nm0000001_rm124825600_1899-5-10_1968.jpg],1.0,[Fred Astaire],"[[1072.926, 161.838, 1214.7839999999999, 303.6...",1.459693,1.118973
1,693726,1970,[01/nm0000001_rm3343756032_1899-5-10_1970.jpg],1.0,[Fred Astaire],"[[477.184, 100.352, 622.592, 245.76]]",2.543198,1.852008
2,693726,1968,[01/nm0000001_rm577153792_1899-5-10_1968.jpg],1.0,[Fred Astaire],"[[114.96964308962852, 114.96964308962852, 451....",3.455579,2.98566
3,693726,1968,[01/nm0000001_rm946909184_1899-5-10_1968.jpg],1.0,[Fred Astaire],"[[622.8855056426588, 424.21750383700805, 844.3...",1.872117,
4,693726,1968,[01/nm0000001_rm980463616_1899-5-10_1968.jpg],1.0,[Fred Astaire],"[[1013.8590023603723, 233.8820422075853, 1201....",1.158766,


In [16]:
df_imdb[df_imdb['name']=='Scott Grimes']

Unnamed: 0,dob,photo_taken,full_path,gender,name,face_location,face_score,second_face_score
310525,720083,2010,[41/nm0342241_rm1022358272_1971-7-9_2010.jpg],1.0,[Scott Grimes],"[[294.00498926749765, 168.6457081528558, 668.5...",4.548166,
310526,720083,1994,[41/nm0342241_rm1557565952_1971-7-9_1994.jpg],1.0,[Scott Grimes],"[[1, 1, 298, 450]]",-inf,
310527,720083,2003,[41/nm0342241_rm1724627200_1971-7-9_2003.jpg],1.0,[Scott Grimes],"[[911.7226839189836, 287.91284351739483, 1143....",5.092657,4.542781
310528,720083,1994,[41/nm0342241_rm1936167936_1971-7-9_1994.jpg],1.0,[Scott Grimes],"[[70.32421938479912, 64.50386776939919, 122.22...",3.343579,2.611601
310529,720083,2010,[41/nm0342241_rm2388233728_1971-7-9_2010.jpg],1.0,[Scott Grimes],"[[575.488, 116.736, 720.896, 262.144]]",2.858895,
310530,720083,2011,[41/nm0342241_rm2647042816_1971-7-9_2011.jpg],1.0,[Scott Grimes],"[[383.0868125430132, 118.42671155169637, 448.4...",4.144725,2.727795
310531,720083,2011,[41/nm0342241_rm2680597248_1971-7-9_2011.jpg],1.0,[Scott Grimes],"[[323.9483690955212, 89.9443776815231, 423.435...",4.239688,
310532,720083,1994,[41/nm0342241_rm3477179136_1971-7-9_1994.jpg],1.0,[Scott Grimes],"[[43.29, 53.946000000000005, 138.5280000000000...",2.287371,1.382159
310533,720083,1994,[41/nm0342241_rm3706427392_1971-7-9_1994.jpg],1.0,[Scott Grimes],"[[72.32877735181498, 72.32877735181498, 286.71...",4.628412,
310534,720083,1994,[41/nm0342241_rm3792738304_1971-7-9_1994.jpg],1.0,[Scott Grimes],"[[159.29999999999998, 101.69999999999999, 288....",4.098548,


Clean nulls

In [17]:
# Keep only the rows with face_score > 1.0
df_imdb = df_imdb[df_imdb['face_score']>1.0]
df_wiki = df_wiki[df_wiki['face_score']>1.0]

# Remove any records that contain Nulls/NaNs by checking for NaN with .isna()
df_imdb = df_imdb[~df_imdb['gender'].isna()].reset_index(drop=True)
df_wiki = df_wiki[~df_wiki['gender'].isna()].reset_index(drop=True)
#df = df[~df['second_face_scores'].isna()]

# Cast genders to integers so that mapping can take place
df_imdb.gender = df_imdb.gender.astype(int)
df_wiki.gender = df_wiki.gender.astype(int)
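
The same three steps (threshold on the face score, drop unknown genders, cast to integer) can be sketched on a toy frame to make the effect of each filter visible. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "face_score": [2.5, -np.inf, 0.4, 3.0],
    "gender": [1.0, 0.0, 1.0, np.nan],
})

cleaned = (
    toy[toy["face_score"] > 1.0]   # drops the -inf and 0.4 rows
    .dropna(subset=["gender"])     # drops the row with unknown gender
    .reset_index(drop=True)
    .astype({"gender": int})       # safe only once the NaNs are gone
)
print(cleaned)
```

The cast has to come last, since a column that still contains NaN cannot be converted to `int`.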

Get the age

In [18]:
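# Note: dob is a MATLAB serial date number, which runs 366 days ahead of
# Python's ordinal. fromordinal(dob) therefore lands about a year after the
# true birth date, so the ages computed here are typically one year low.
ages=[]
for dob, photo_taken in zip(df_imdb.dob, df_imdb.photo_taken):
    ages.append(int(photo_taken) - int(datetime.fromordinal(int(dob)).year))
df_imdb['age'] = ages

ages=[]
for dob, photo_taken in zip(df_wiki.dob, df_wiki.photo_taken):
    ages.append(int(photo_taken) - int(datetime.fromordinal(int(dob)).year))
df_wiki['age'] = ages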
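
MATLAB serial date numbers count days from year 0, whereas Python's `fromordinal` counts from 1 January of year 1, so the two differ by 366 days. Below is a minimal sketch of a conversion that accounts for this offset; the helper names are hypothetical. For the first IMDB row it yields the birth date embedded in the file name (1899-5-10) and a year-difference age of 69.

```python
from datetime import date

# MATLAB counts days from year 0, Python from year 1.
MATLAB_TO_PYTHON_ORDINAL_OFFSET = 366

def matlab_datenum_to_date(datenum):
    """Convert a MATLAB serial date number to a Python date."""
    return date.fromordinal(max(int(datenum) - MATLAB_TO_PYTHON_ORDINAL_OFFSET, 1))

def year_difference_age(datenum, photo_taken):
    """Age as the difference between the photo year and the birth year."""
    return int(photo_taken) - matlab_datenum_to_date(datenum).year

print(matlab_datenum_to_date(693726))     # first IMDB row
print(year_difference_age(693726, 1968))
```

The `max(..., 1)` guard keeps `fromordinal` from raising on implausibly small date numbers.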

Break out the image paths

In [19]:
#IMDB
maindir='imdb_crop'
subdirs=[]
filenames=[]
for full_path in df_imdb.full_path:
    subdir, filename = full_path[0].split('/')
    subdirs.append(subdir)
    filenames.append(filename)
    
df_imdb['maindir'] = maindir
df_imdb['subdir'] = subdirs
df_imdb['filename'] = filenames

#make sure they're strings
df_imdb['maindir'] = df_imdb['maindir'].astype(str)
df_imdb['subdir'] = df_imdb['subdir'].astype(str)
df_imdb['filename'] = df_imdb['filename'].astype(str)
    
#Wiki
maindir='wiki_crop'
subdirs=[]
filenames=[]
for full_path in df_wiki.full_path:
    subdir, filename = full_path[0].split('/')
    subdirs.append(subdir)
    filenames.append(filename)
    
df_wiki['maindir'] = maindir
df_wiki['subdir'] = subdirs
df_wiki['filename'] = filenames

#make sure they're strings
df_wiki['maindir'] = df_wiki['maindir'].astype(str)
df_wiki['subdir'] = df_wiki['subdir'].astype(str)
df_wiki['filename'] = df_wiki['filename'].astype(str)
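
The per-row split can be factored into a small helper, sketched here with the standard library. The function name is hypothetical; the input is one of the relative paths shown earlier.

```python
import posixpath

def split_relative_path(full_path, maindir):
    """Split 'NN/file.jpg' into (maindir, subdir, filename)."""
    subdir, filename = posixpath.split(full_path)
    return maindir, subdir, filename

parts = split_relative_path(
    "01/nm0000001_rm124825600_1899-5-10_1968.jpg", "imdb_crop"
)
print(parts)
```

Keeping `subdir` as a string preserves the leading zero in directory names such as `01`, which is needed to rebuild a valid path later.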

Convert names to strings

In [20]:
df_imdb['name'] = [x[0] for x in df_imdb['name']]
df_wiki['name'] = [x[0] for x in df_wiki['name']]

In [21]:
df_imdb.head()

Unnamed: 0,dob,photo_taken,full_path,gender,name,face_location,face_score,second_face_score,age,maindir,subdir,filename
0,693726,1968,[01/nm0000001_rm124825600_1899-5-10_1968.jpg],1,Fred Astaire,"[[1072.926, 161.838, 1214.7839999999999, 303.6...",1.459693,1.118973,68,imdb_crop,1,nm0000001_rm124825600_1899-5-10_1968.jpg
1,693726,1970,[01/nm0000001_rm3343756032_1899-5-10_1970.jpg],1,Fred Astaire,"[[477.184, 100.352, 622.592, 245.76]]",2.543198,1.852008,70,imdb_crop,1,nm0000001_rm3343756032_1899-5-10_1970.jpg
2,693726,1968,[01/nm0000001_rm577153792_1899-5-10_1968.jpg],1,Fred Astaire,"[[114.96964308962852, 114.96964308962852, 451....",3.455579,2.98566,68,imdb_crop,1,nm0000001_rm577153792_1899-5-10_1968.jpg
3,693726,1968,[01/nm0000001_rm946909184_1899-5-10_1968.jpg],1,Fred Astaire,"[[622.8855056426588, 424.21750383700805, 844.3...",1.872117,,68,imdb_crop,1,nm0000001_rm946909184_1899-5-10_1968.jpg
4,693726,1968,[01/nm0000001_rm980463616_1899-5-10_1968.jpg],1,Fred Astaire,"[[1013.8590023603723, 233.8820422075853, 1201....",1.158766,,68,imdb_crop,1,nm0000001_rm980463616_1899-5-10_1968.jpg


In [22]:
df_imdb[df_imdb['name']=='Scott Grimes']

Unnamed: 0,dob,photo_taken,full_path,gender,name,face_location,face_score,second_face_score,age,maindir,subdir,filename
260030,720083,2010,[41/nm0342241_rm1022358272_1971-7-9_2010.jpg],1,Scott Grimes,"[[294.00498926749765, 168.6457081528558, 668.5...",4.548166,,38,imdb_crop,41,nm0342241_rm1022358272_1971-7-9_2010.jpg
260031,720083,2003,[41/nm0342241_rm1724627200_1971-7-9_2003.jpg],1,Scott Grimes,"[[911.7226839189836, 287.91284351739483, 1143....",5.092657,4.542781,31,imdb_crop,41,nm0342241_rm1724627200_1971-7-9_2003.jpg
260032,720083,1994,[41/nm0342241_rm1936167936_1971-7-9_1994.jpg],1,Scott Grimes,"[[70.32421938479912, 64.50386776939919, 122.22...",3.343579,2.611601,22,imdb_crop,41,nm0342241_rm1936167936_1971-7-9_1994.jpg
260033,720083,2010,[41/nm0342241_rm2388233728_1971-7-9_2010.jpg],1,Scott Grimes,"[[575.488, 116.736, 720.896, 262.144]]",2.858895,,38,imdb_crop,41,nm0342241_rm2388233728_1971-7-9_2010.jpg
260034,720083,2011,[41/nm0342241_rm2647042816_1971-7-9_2011.jpg],1,Scott Grimes,"[[383.0868125430132, 118.42671155169637, 448.4...",4.144725,2.727795,39,imdb_crop,41,nm0342241_rm2647042816_1971-7-9_2011.jpg
260035,720083,2011,[41/nm0342241_rm2680597248_1971-7-9_2011.jpg],1,Scott Grimes,"[[323.9483690955212, 89.9443776815231, 423.435...",4.239688,,39,imdb_crop,41,nm0342241_rm2680597248_1971-7-9_2011.jpg
260036,720083,1994,[41/nm0342241_rm3477179136_1971-7-9_1994.jpg],1,Scott Grimes,"[[43.29, 53.946000000000005, 138.5280000000000...",2.287371,1.382159,22,imdb_crop,41,nm0342241_rm3477179136_1971-7-9_1994.jpg
260037,720083,1994,[41/nm0342241_rm3706427392_1971-7-9_1994.jpg],1,Scott Grimes,"[[72.32877735181498, 72.32877735181498, 286.71...",4.628412,,22,imdb_crop,41,nm0342241_rm3706427392_1971-7-9_1994.jpg
260038,720083,1994,[41/nm0342241_rm3792738304_1971-7-9_1994.jpg],1,Scott Grimes,"[[159.29999999999998, 101.69999999999999, 288....",4.098548,,22,imdb_crop,41,nm0342241_rm3792738304_1971-7-9_1994.jpg
260039,720083,1994,[41/nm0342241_rm4128282624_1971-7-9_1994.jpg],1,Scott Grimes,"[[128.1250527756009, 43.0283509252003, 170.193...",2.541669,1.700773,22,imdb_crop,41,nm0342241_rm4128282624_1971-7-9_1994.jpg


In [55]:
# I'm going to ignore the extraneous information since the images are already cropped
df_imdb = df_imdb[['maindir', 'subdir', 'filename', 'name', 'gender', 'age']]
df_wiki = df_wiki[['maindir', 'subdir', 'filename', 'name', 'gender', 'age']]

In [56]:
from sklearn.model_selection import train_test_split

# The one drawback here is that the same people appear in both train and test.
# You can avoid this, but then the gender and age variation is a lot lower,
# so treat the test split as a validation dataset.
df_imdb_train, df_imdb_test = train_test_split(df_imdb, train_size=0.7, test_size=0.3, stratify=df_imdb['gender'], random_state=42, shuffle=True)
df_wiki_train, df_wiki_test = train_test_split(df_wiki, train_size=0.7, test_size=0.3, stratify=df_wiki['gender'], random_state=42, shuffle=True)

df_train = pd.concat([df_imdb_train, df_wiki_train], axis=0)
df_test = pd.concat([df_imdb_test, df_wiki_test], axis=0)

#shuffle
df_train = df_train.sample(frac=1).reset_index(drop=True)
df_test = df_test.sample(frac=1).reset_index(drop=True)

print(f'train: {df_train.shape}')
print(f'test: {df_test.shape}')

df_test.head()

train: (296016, 6)
test: (126865, 6)


Unnamed: 0,maindir,subdir,filename,name,gender,age
0,wiki_crop,6,7297606_1980-07-25_2014.jpg,Sven Järve,1,33
1,imdb_crop,14,nm0001814_rm146321408_1952-7-24_2008.jpg,Gus Van Sant,1,55
2,imdb_crop,99,nm0001099_rm3767262976_1955-2-19_2013.jpg,Jeff Daniels,1,57
3,imdb_crop,38,nm0001938_rm681807104_1944-8-4_2010.jpg,Richard Belzer,1,65
4,imdb_crop,16,nm1143816_rm2716065024_1982-6-29_2011.jpg,Lily Rabe,0,28
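
If keeping each person in exactly one split matters more than stratifying by gender, one option is to assign the split from a hash of the name. This is a sketch of that alternative, not the approach used in this notebook; the function name and the 100-bucket scheme are assumptions.

```python
import hashlib

def split_for_name(name, test_fraction=0.3):
    """Assign every image of one person to the same split."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "test" if bucket < test_fraction * 100 else "train"

names = ["Fred Astaire", "Scott Grimes", "Jane Levy", "Fred Astaire"]
assignments = [split_for_name(n) for n in names]
print(assignments)
```

Because the bucket depends only on the name, repeated names always land in the same split, at the cost of split sizes that only approximate the requested fraction.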


Output annotation file

In [58]:
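# Note: this writes the per-source frames (IMDB as train, Wiki as test),
# not the stratified df_train/df_test split built above.
df_imdb.to_csv('C:\\Users\\nick\\tensorflow_datasets\\downloads\\manual\\imdb_wiki\\train_ann.csv', index=False)
df_wiki.to_csv('C:\\Users\\nick\\tensorflow_datasets\\downloads\\manual\\imdb_wiki\\test_ann.csv', index=False)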

Wiki has broken images so search for them and remove them

*Looks like they just removed them from the annotation file which is fine*

In [4]:
import pandas as pd
import os

df_imdb = pd.read_csv('C:\\Users\\nick\\tensorflow_datasets\\downloads\\manual\\imdb_wiki\\train_ann.csv')
df_wiki = pd.read_csv('C:\\Users\\nick\\tensorflow_datasets\\downloads\\manual\\imdb_wiki\\test_ann.csv')

In [12]:
indir = 'C:\\Users\\nick\\tensorflow_datasets\\downloads\\manual\\imdb_wiki'
def get_paths(df):
    # Restore the leading zero that single-digit subdirectories need
    subdirs = ['0'+x if len(x)==1 else x for x in df.subdir.astype(str)]
    return [os.path.join(indir, maindir, subdir, filename) for maindir, subdir, filename in zip(df.maindir, subdirs, df.filename)]
img_paths = list(set(get_paths(df_imdb) + get_paths(df_wiki)))
img_paths[:3]

['C:\\Users\\nick\\tensorflow_datasets\\downloads\\manual\\imdb_wiki\\imdb_crop\\53\\nm0005453_rm2397869312_1981-12-2_2010.jpg',
 'C:\\Users\\nick\\tensorflow_datasets\\downloads\\manual\\imdb_wiki\\imdb_crop\\70\\nm1692270_rm1524542976_1986-11-1_2011.jpg',
 'C:\\Users\\nick\\tensorflow_datasets\\downloads\\manual\\imdb_wiki\\imdb_crop\\05\\nm0425005_rm1812570368_1972-5-2_2013.jpg']

In [14]:
empty_img_ex = 'C:\\Users\\nick\\tensorflow_datasets\\downloads\\manual\\imdb_wiki\\wiki_crop\\00\\2658600_1980-11-03_2012.jpg'
print(f'Normal img size: {os.path.getsize(img_paths[0])}')
#notice not zero since the file is still named
print(f'Bad img size: {os.path.getsize(empty_img_ex)}')

Normal img size: 2000
Bad img size: 333


In [15]:
bad_imgs = [x for x in img_paths if os.path.getsize(x)<400]
print(len(bad_imgs))
bad_imgs[:5]

0


[]
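
The size check can be wrapped in a reusable helper and tried on throwaway files. In this sketch the helper name is hypothetical and the 400-byte default mirrors the threshold used above; the two temporary files stand in for a normal and a broken image.

```python
import os
import tempfile

def find_small_files(paths, min_bytes=400):
    """Return the paths whose size on disk is below min_bytes."""
    return [p for p in paths if os.path.getsize(p) < min_bytes]

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.jpg")
    bad = os.path.join(tmp, "bad.jpg")
    with open(good, "wb") as f:
        f.write(b"\0" * 2000)  # same size as the normal image above
    with open(bad, "wb") as f:
        f.write(b"\0" * 333)   # same size as the broken image above
    flagged = [os.path.basename(p) for p in find_small_files([good, bad])]

print(flagged)
```

A size threshold is only a heuristic; decoding each file would be a stricter check, at a much higher cost.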

In [21]:
# #make sure some of the names dont overlap
# import jellyfish
# from tqdm import tqdm

# imdb_names = list(set(df_imdb.name))
# wiki_names = list(set(df_wiki.name))

# for imdb in tqdm(imdb_names):
#     for wiki in wiki_names:
#         distance = jellyfish.jaro_distance(imdb, wiki)
#         if 0.9 < distance < 1.0:
#             print(f'Wiki: {wiki},\t IMDB: {imdb},\t {distance}')
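
A similar near-duplicate check can be sketched with the standard library alone. Note that this swaps in `difflib.SequenceMatcher` ratios, which are not the same measure as the Jaro distance in the commented-out cell; the function name and the sample names are assumptions.

```python
from difflib import SequenceMatcher

def near_duplicates(names_a, names_b, low=0.9):
    """Return (a, b, ratio) for pairs that are similar but not identical."""
    hits = []
    for a in names_a:
        for b in names_b:
            ratio = SequenceMatcher(None, a, b).ratio()
            if low < ratio < 1.0:
                hits.append((a, b, round(ratio, 3)))
    return hits

hits = near_duplicates(
    ["Jane Levy", "Fred Astaire"],
    ["Fred Astair", "Marc Okrand"],
)
print(hits)
```

The nested loop is quadratic in the number of names, so on the full name lists it will be slow.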

# TensorFlow Datasets

TFDS provides a way to transform datasets into a standard format, do the preprocessing necessary to make them ready for a machine learning pipeline, and feed them through a standard input pipeline using `tf.data`.

To enable this, each dataset implements a subclass of `DatasetBuilder`, which specifies:

* Where the data is coming from (i.e. its URL). 
* What the dataset looks like (i.e. its features).  
* How the data should be split (e.g. TRAIN and TEST). 
* The individual records in the dataset.

The first time a dataset is used, the dataset is downloaded, prepared, and written to disk in a standard format. Subsequent access will read from those pre-processed files directly.

## Clone the TFDS Repository

The next step will be to clone the GitHub TFDS Repository. For this particular notebook, we will clone a particular version of the repository. You can clone the repository by running the following command:

```
!git clone https://github.com/tensorflow/datasets.git -b v1.2.0
```

However, for simplicity, we have already cloned this repository for you and placed the files locally. Therefore, there is no need to run the above command if you are running this notebook in the Coursera environment.

Next, we set the current working directory to `/datasets/`.

In [None]:
cd datasets

If you want to contribute to the TFDS repo and add a new dataset, you can use the following script to generate a template of the required Python file. To use it, you must first clone the TFDS repository and then run the following command:

In [None]:
%%bash

python tensorflow_datasets/scripts/create_new_dataset.py \
  --dataset my_dataset \
  --type image

If you wish to see the template generated by the `create_new_dataset.py` file, navigate to the folder indicated in the above cell output. Then go to the `/image/` folder and look for a file called `my_dataset.py`. Feel free to open the file and inspect it. You will see a template with placeholders, marked with the word `TODO`, where you have to fill in the information.

Now we will use IPython's built-in `%%writefile` magic command to write the contents of the current cell into a file. To create or overwrite a file you can use:
```
%%writefile filename
```

Let's see an example:

In [None]:
%%writefile something.py
x = 10

Now that the file has been written, let's inspect its contents.

In [None]:
!cat something.py

## Define the Dataset with `GeneratorBasedBuilder`

Most datasets subclass `tfds.core.GeneratorBasedBuilder`, which is a subclass of `tfds.core.DatasetBuilder` that simplifies defining a dataset. It works well for datasets that can be generated on a single machine. Its subclasses implement:

* `_info`: builds the DatasetInfo object describing the dataset


* `_split_generators`: downloads the source data and defines the dataset splits


* `_generate_examples`: yields (key, example) tuples in the dataset from the source data

In this exercise, you will use the `GeneratorBasedBuilder`.
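
To make the `(key, example)` contract concrete, the sketch below is a plain-Python generator with the same shape as `_generate_examples`. It does not use TFDS; the row layout and the choice of the file name as key are assumptions for illustration.

```python
def generate_examples(rows):
    """Yield (key, example) pairs, one per annotation row."""
    for row in rows:
        key = row["filename"]  # keys are expected to be unique per example
        example = {"image": row["filename"], "gender": row["gender"]}
        yield key, example

rows = [
    {"filename": "a.jpg", "gender": 1},
    {"filename": "b.jpg", "gender": 0},
]
examples = list(generate_examples(rows))
print(examples)
```

In a real builder, the keys of each example dictionary would need to match the features declared in `_info`.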

### EXERCISE: Fill in the missing code below.

In [None]:
%%writefile tensorflow_datasets/image/imdb_faces.py

# coding=utf-8
# Copyright 2019 The TensorFlow Datasets Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""IMDB Faces dataset."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import os
import re

import tensorflow as tf
import tensorflow_datasets.public_api as tfds

_DESCRIPTION = """\
Since the publicly available face image datasets are often of small to medium size, rarely exceeding tens of thousands of images, and often without age information we decided to collect a large dataset of celebrities. For this purpose, we took the list of the most popular 100,000 actors as listed on the IMDb website and (automatically) crawled from their profiles date of birth, name, gender and all images related to that person. Additionally we crawled all profile images from pages of people from Wikipedia with the same meta information. We removed the images without timestamp (the date when the photo was taken). Assuming that the images with single faces are likely to show the actor and that the timestamp and date of birth are correct, we were able to assign to each such image the biological (real) age. Of course, we can not vouch for the accuracy of the assigned age information. Besides wrong timestamps, many images are stills from movies - movies that can have extended production times. In total we obtained 460,723 face images from 20,284 celebrities from IMDb and 62,328 from Wikipedia, thus 523,051 in total.

As some of the images (especially from IMDb) contain several people we only use the photos where the second strongest face detection is below a threshold. For the network to be equally discriminative for all ages, we equalize the age distribution for training. For more details please see the paper.
"""

_URL = ("https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/")
_IMDB_DATASET_ROOT_DIR = "imdb_crop"
_IMDB_ANNOTATION_FILE = "imdb.mat"
_WIKI_DATASET_ROOT_DIR = "wiki_crop"
_WIKI_META_DATASET_ROOT_DIR = "wiki"
_WIKI_ANNOTATION_FILE = "wiki.mat"


_CITATION = """\
@article{Rothe-IJCV-2016,
  author = {Rasmus Rothe and Radu Timofte and Luc Van Gool},
  title = {Deep expectation of real and apparent age from a single image without facial landmarks},
  journal = {International Journal of Computer Vision},
  volume={126},
  number={2-4},
  pages={144--157},
  year={2018},
  publisher={Springer}
}
@InProceedings{Rothe-ICCVW-2015,
  author = {Rasmus Rothe and Radu Timofte and Luc Van Gool},
  title = {DEX: Deep EXpectation of apparent age from a single image},
  booktitle = {IEEE International Conference on Computer Vision Workshops (ICCVW)},
  year = {2015},
  month = {December},
}
"""

# Source URL of the IMDB faces dataset
_IMDB_TARBALL_URL = "https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/imdb_crop.tar"
_WIKI_TARBALL_URL = "https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/wiki_crop.tar"
_WIKI_META_TARBALL_URL = "https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/wiki.tar.gz"



class ImdbWikiFaces(tfds.core.GeneratorBasedBuilder):
    """IMDB-Wiki Faces dataset."""

    VERSION = tfds.core.Version("0.1.0")

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            description=_DESCRIPTION,
            # Describe the features of the dataset by following this url
            # https://www.tensorflow.org/datasets/api_docs/python/tfds/features
            features=tfds.features.FeaturesDict({
                "image": tfds.features.Image(),
                "gender": tfds.features.ClassLabel(num_classes=2),
                "dob": tf.int32,
                "photo_taken": tf.int32,
                "face_location": tfds.features.BBoxFeature(),
                "face_score": tf.float32,
                "second_face_score": tf.float32,
                "celeb_id": tf.int32
            }),
            supervised_keys=("image", "gender"),
            urls=[_URL],
            citation=_CITATION)

    def _split_generators(self, dl_manager):
        # Download and extract the three archives in a single call.
        # The returned dict maps each key to its extraction directory.
        data_dirs = dl_manager.download_and_extract({
            'imdb_crop': _IMDB_TARBALL_URL,
            'wiki_crop': _WIKI_TARBALL_URL,
            'wiki': _WIKI_META_TARBALL_URL
        })

        # Parse a .mat annotation file and return the array stored under `key`
        def parse_mat_file(file_name, key):
            with tf.io.gfile.GFile(file_name, "rb") as f:
                # Lazily import scipy.io and use its loadmat method to
                # load the annotation file
                return tfds.core.lazy_imports.scipy.io.loadmat(f)[key]

        # Parsing the mat file by using scipy's loadmat method
        # Pass the path to the annotation file using the downloaded/extracted paths above
        imdb_meta = parse_mat_file(os.path.join(data_dirs['imdb_crop'], _IMDB_DATASET_ROOT_DIR, _IMDB_ANNOTATION_FILE), 'imdb')
        wiki_meta = parse_mat_file(os.path.join(data_dirs['wiki_crop'], _WIKI_DATASET_ROOT_DIR, _WIKI_ANNOTATION_FILE), 'wiki')

        # Get the names of celebrities from the IMDB metadata
        celeb_names = imdb_meta[0, 0]["celeb_names"][0]

        # Create tuples out of the distinct set of genders and celeb names
        self.info.features['gender'].names = ['Female', 'Male']
        self.info.features['celeb_id'].names = tuple([x[0] for x in celeb_names])

        return [
            tfds.core.SplitGenerator(
                name=tfds.Split.TRAIN,
                gen_kwargs={
                    "image_dir": data_dirs['imdb_crop'],
                    "metadata": imdb_meta,
                })
        ]

    def _get_bounding_box_values(self, bbox_annotations, img_width, img_height):
        """Function to get normalized bounding box values.

        Args:
          bbox_annotations: list of bbox values in kitti format
          img_width: image width
          img_height: image height

        Returns:
          Normalized bounding box ymin, xmin, ymax, xmax values
        """

        ymin = bbox_annotations[0] / img_height
        xmin = bbox_annotations[1] / img_width
        ymax = bbox_annotations[2] / img_height
        xmax = bbox_annotations[3] / img_width
        return ymin, xmin, ymax, xmax
  
    def _get_image_shape(self, image_path):
        image = tf.io.read_file(image_path)
        image = tf.image.decode_image(image, channels=3)
        shape = image.shape[:2]
        return shape

    def _generate_examples(self, image_dir, metadata):
        # Add a lazy import for pandas here (pd)
        pd = tfds.core.lazy_imports.pandas

        # Extract the root dictionary from the metadata so that you can query all the keys inside it
        root = metadata[0, 0]

        # Extract image names, dobs, genders, face locations,
        # the year when the photos were taken,
        # face scores (second face score too) and celeb ids
        image_names = root["full_path"][0]
        # Do the same for other attributes (dob, genders etc)
        dobs = root["dob"][0]
        genders = root["gender"][0]
        photo_taken_years = root["photo_taken"][0]
        face_scores = root["face_score"][0]
        face_locations = root["face_location"][0]
        second_face_scores = root["second_face_score"][0]
        celeb_id = root["celeb_id"][0]

        # Now create a dataframe out of all the features like you've seen before
        df = pd.DataFrame(list(zip(image_names,
                                   dobs,
                                   genders,
                                   photo_taken_years,
                                   face_scores,
                                   face_locations,
                                   second_face_scores,
                                   celeb_id
                                  )),
                          columns=['image_names', 'dobs', 'genders', 'photo_taken_years',
                                   'face_scores', 'face_locations', 'second_face_scores',
                                   'celeb_ids'])

        # Keep only the rows whose face_scores value is greater than 1.0
        df = df[df['face_scores'] > 1.0]

        # Remove any records that contain Nulls/NaNs by checking for NaN with .isna()
        df = df[~df['genders'].isna()]
        df = df[~df['second_face_scores'].isna()]

        # Cast genders to integers so that mapping can take place
        df.genders = df.genders.astype(int)

        # Iterate over all the rows in the dataframe and map each feature
        for _, row in df.iterrows():
            # Extract filename, gender, dob, photo_taken, 
            # face_score, second_face_score and celeb_id
            filename = os.path.join(image_dir, _DATASET_ROOT_DIR, row['image_names'][0])
            gender = row['genders']
            dob = row['dobs']
            photo_taken = row['photo_taken_years']
            face_score = row['face_scores']
            second_face_score = row['second_face_scores']
            celeb_id = row['celeb_ids']

            # Get the image shape; decoded images are laid out as (height, width)
            image_height, image_width = self._get_image_shape(filename)
            # Normalize the bounding boxes by using the face coordinates and the image shape
            bbox = self._get_bounding_box_values(row['face_locations'][0],
                                               image_width, image_height)

            # Yield a feature dictionary 
            yield filename, {
              "image": filename,
              "gender": gender,
              "dob": dob,
              "photo_taken": photo_taken,
              "face_location": tfds.features.BBox(ymin=min(bbox[0], 1.0),
                                                  xmin=min(bbox[1], 1.0),
                                                  ymax=min(bbox[2], 1.0),
                                                  xmax=min(bbox[3], 1.0)),
              "face_score": face_score,
              "second_face_score": second_face_score,
              "celeb_id": celeb_id
            }
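
The bounding box handling in the builder can be checked in isolation. The snippet below is a minimal sketch in plain Python, assuming the same `(ymin, xmin, ymax, xmax)` ordering that `_get_bounding_box_values` uses; the helper name `normalize_bbox` and the sample numbers are made up for illustration. Unlike the builder, it also clamps at 0.0, which is an extra safety step and not something the code above does.

```python
def normalize_bbox(bbox, img_height, img_width):
    """Scale pixel coordinates (ymin, xmin, ymax, xmax) into the [0, 1] range."""
    ymin, xmin, ymax, xmax = bbox
    scaled = (ymin / img_height, xmin / img_width,
              ymax / img_height, xmax / img_width)
    # Clamp so that boxes reaching past the image edge stay valid.
    return tuple(min(max(v, 0.0), 1.0) for v in scaled)


# A decoded image has shape (height, width, channels),
# so the height comes first when unpacking.
shape = (200, 400, 3)
img_height, img_width = shape[:2]

box = normalize_bbox((50, 100, 150, 420), img_height, img_width)
print(box)  # (0.25, 0.25, 0.75, 1.0)
```

The last coordinate (420 pixels in a 400-pixel-wide image) would normalize to 1.05, so it is clamped to 1.0. This is why the feature dictionary wraps every value in `min(..., 1.0)`.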


## Add an Import for Registration

All subclasses of `tfds.core.DatasetBuilder` are automatically registered when their module is imported such that they can be accessed through `tfds.builder` and `tfds.load`.

If you're contributing the dataset to `tensorflow/datasets`, you must add the module import to its subdirectory's `__init__.py` (e.g. `image/__init__.py`), as shown below:

In [None]:
%%writefile tensorflow_datasets/image/__init__.py
# coding=utf-8
# Copyright 2019 The TensorFlow Datasets Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Image datasets."""

from tensorflow_datasets.image.abstract_reasoning import AbstractReasoning
from tensorflow_datasets.image.aflw2k3d import Aflw2k3d
from tensorflow_datasets.image.bigearthnet import Bigearthnet
from tensorflow_datasets.image.binarized_mnist import BinarizedMNIST
from tensorflow_datasets.image.binary_alpha_digits import BinaryAlphaDigits
from tensorflow_datasets.image.caltech import Caltech101
from tensorflow_datasets.image.caltech_birds import CaltechBirds2010
from tensorflow_datasets.image.cats_vs_dogs import CatsVsDogs
from tensorflow_datasets.image.cbis_ddsm import CuratedBreastImagingDDSM
from tensorflow_datasets.image.celeba import CelebA
from tensorflow_datasets.image.celebahq import CelebAHq
from tensorflow_datasets.image.chexpert import Chexpert
from tensorflow_datasets.image.cifar import Cifar10
from tensorflow_datasets.image.cifar import Cifar100
from tensorflow_datasets.image.cifar10_corrupted import Cifar10Corrupted
from tensorflow_datasets.image.clevr import CLEVR
from tensorflow_datasets.image.coco import Coco
from tensorflow_datasets.image.coco2014_legacy import Coco2014
from tensorflow_datasets.image.coil100 import Coil100
from tensorflow_datasets.image.colorectal_histology import ColorectalHistology
from tensorflow_datasets.image.colorectal_histology import ColorectalHistologyLarge
from tensorflow_datasets.image.cycle_gan import CycleGAN
from tensorflow_datasets.image.deep_weeds import DeepWeeds
from tensorflow_datasets.image.diabetic_retinopathy_detection import DiabeticRetinopathyDetection
from tensorflow_datasets.image.downsampled_imagenet import DownsampledImagenet
from tensorflow_datasets.image.dsprites import Dsprites
from tensorflow_datasets.image.dtd import Dtd
from tensorflow_datasets.image.eurosat import Eurosat
from tensorflow_datasets.image.flowers import TFFlowers
from tensorflow_datasets.image.food101 import Food101
from tensorflow_datasets.image.horses_or_humans import HorsesOrHumans
from tensorflow_datasets.image.image_folder import ImageLabelFolder
from tensorflow_datasets.image.imagenet import Imagenet2012
from tensorflow_datasets.image.imagenet2012_corrupted import Imagenet2012Corrupted
from tensorflow_datasets.image.kitti import Kitti
from tensorflow_datasets.image.lfw import LFW
from tensorflow_datasets.image.lsun import Lsun
from tensorflow_datasets.image.mnist import EMNIST
from tensorflow_datasets.image.mnist import FashionMNIST
from tensorflow_datasets.image.mnist import KMNIST
from tensorflow_datasets.image.mnist import MNIST
from tensorflow_datasets.image.mnist_corrupted import MNISTCorrupted
from tensorflow_datasets.image.omniglot import Omniglot
from tensorflow_datasets.image.open_images import OpenImagesV4
from tensorflow_datasets.image.oxford_flowers102 import OxfordFlowers102
from tensorflow_datasets.image.oxford_iiit_pet import OxfordIIITPet
from tensorflow_datasets.image.patch_camelyon import PatchCamelyon
from tensorflow_datasets.image.pet_finder import PetFinder
from tensorflow_datasets.image.quickdraw import QuickdrawBitmap
from tensorflow_datasets.image.resisc45 import Resisc45
from tensorflow_datasets.image.rock_paper_scissors import RockPaperScissors
from tensorflow_datasets.image.scene_parse_150 import SceneParse150
from tensorflow_datasets.image.shapes3d import Shapes3d
from tensorflow_datasets.image.smallnorb import Smallnorb
from tensorflow_datasets.image.so2sat import So2sat
from tensorflow_datasets.image.stanford_dogs import StanfordDogs
from tensorflow_datasets.image.stanford_online_products import StanfordOnlineProducts
from tensorflow_datasets.image.sun import Sun397
from tensorflow_datasets.image.svhn import SvhnCropped
from tensorflow_datasets.image.uc_merced import UcMerced
from tensorflow_datasets.image.visual_domain_decathlon import VisualDomainDecathlon

# EXERCISE: Import your dataset module here

from tensorflow_datasets.image.imdb_faces import ImdbFaces

## URL Checksums

If you're contributing the dataset to `tensorflow/datasets`, add a checksums file for the dataset. On first download, the DownloadManager will automatically add the sizes and checksums for all downloaded URLs to that file. This ensures that on subsequent data generation, the downloaded files are as expected.

In [None]:
!touch tensorflow_datasets/url_checksums/imdb_faces.txt
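
To see what kind of information ends up in that file, the sketch below computes a size and a SHA-256 checksum for a file using only the standard library. It illustrates the idea and is not the DownloadManager's own code; the helper name `size_and_sha256` is made up, and a small temporary file stands in for a downloaded archive.

```python
import hashlib
import os
import tempfile


def size_and_sha256(path, chunk_size=1 << 20):
    """Return (size_in_bytes, hex SHA-256 digest) for the file at `path`."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


# A small temporary file stands in for a downloaded archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
size, checksum = size_and_sha256(tmp.name)
os.remove(tmp.name)

print(size)      # 5
print(checksum)  # 64-character hexadecimal digest
```

If a later download produces a different size or digest, the file has changed or was corrupted in transit, which is the kind of mismatch the checksums file is meant to catch.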

## Build the Dataset

In [None]:
# EXERCISE: Fill in the name of your dataset.
# The name must be a string.
DATASET_NAME = "imdb_faces"

We then run the `download_and_prepare` script locally to build it, using the following command:

```
%%bash -s $DATASET_NAME
python -m tensorflow_datasets.scripts.download_and_prepare \
  --register_checksums \
  --datasets=$1
```

**NOTE:** It may take more than 30 minutes to download the dataset and then write all the preprocessed files as TFRecords. Due to the enormous size of the data involved, we are unable to run the above script in the Coursera environment. 

## Load the Dataset

Once the dataset is built you can load it in the usual way, by using `tfds.load`, as shown below:

```python
import tensorflow_datasets as tfds
dataset, info = tfds.load('imdb_faces', with_info=True)
```

**Note:** Since we couldn't build the `imdb_faces` dataset due to its size, we are unable to run the above code in the Coursera environment.

## Explore the Dataset

Once the dataset is loaded, you can explore it by using the following loop:

```python
for feature in tfds.as_numpy(dataset['train']):
  for key, value in feature.items():
    if key == 'image':
      value = value.shape
    print(key, value)
  break
```

**Note:** Since we couldn't build the `imdb_faces` dataset due to its size, we are unable to run the above code in the Coursera environment.

The expected output from the code block shown above should be:

```python
>>>
celeb_id 12387
dob 722957
face_location [1.         0.56327355 1.         1.        ]
face_score 4.0612864
gender 0
image (96, 97, 3)
photo_taken 2007
second_face_score 3.6680346
```
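
The `dob` value is a Matlab serial date number, so it is not directly readable. As a minimal sketch, it can be converted with the standard library; the helper name `matlab_datenum_to_date` is made up for this example, and the age estimate at the end is a rough assumption that ignores the day and month.

```python
from datetime import date, timedelta


def matlab_datenum_to_date(datenum):
    """Convert a Matlab serial date number to a Python date.

    Matlab counts days from year 0, whereas Python ordinals start at
    0001-01-01, so the two scales differ by 366 days.
    """
    return date.fromordinal(int(datenum)) - timedelta(days=366)


dob = matlab_datenum_to_date(722957)
print(dob)              # 1979-05-22
print(2007 - dob.year)  # 28 (rough age when the photo was taken)
```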

# Next steps for publishing

**Double-check the citation**  

It's important that DatasetInfo.citation includes a good citation for the dataset. It's hard and important work contributing a dataset to the community and we want to make it easy for dataset users to cite the work.

If the dataset's website has a specifically requested citation, use that (in BibTeX format).

If the paper is on arXiv, find it there and click the BibTeX link on the right-hand side.

If the paper is not on arXiv, find the paper on Google Scholar and click the double-quotation mark underneath the title and on the popup, click BibTeX.

If there is no associated paper (for example, there's just a website), you can use the BibTeX Online Editor to create a custom BibTeX entry (the drop-down menu has an Online entry type).
  

**Add a test**  

Most datasets in TFDS should have a unit test, and your reviewer may ask you to add one if you haven't already. See the testing section of the TFDS guide linked below.

**Check your code style**  

Follow the PEP 8 Python style guide, except that TensorFlow uses 2 spaces instead of 4 for indentation. Please conform to the Google Python Style Guide.

Most importantly, use `tensorflow_datasets/oss_scripts/lint.sh` to ensure your code is properly formatted; for example, you can use it to lint the `image` directory. See the TensorFlow code style guide for more information.

**Add release notes**  

Add the dataset to the release notes. The release note will be published for the next release.

**Send for review!**  

Send the pull request for review.

For more information, visit https://www.tensorflow.org/datasets/add_dataset

# Submission Instructions

In [None]:
# Now click the 'Submit Assignment' button above.

# When you're done or would like to take a break, please run the two cells below to save your work and close the Notebook. This frees up resources for your fellow learners.

In [None]:
%%javascript
<!-- Save the notebook -->
IPython.notebook.save_checkpoint();

In [None]:
%%javascript
<!-- Shutdown and close the notebook -->
window.onbeforeunload = null
window.close();
IPython.notebook.session.delete();