# IMDB Metadata Cleanup Notebook

This notebook cleans up the IMDB metadata file obtained from ETH Zürich's Computer Vision Lab. The metadata describes 460,723 cropped images of celebrity faces. It was collected in 2015 by Rasmus Rothe, Radu Timofte, and Luc Van Gool for their research on age detection ([Deep expectation of real and apparent age from a single image without facial landmarks](https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/)).

In [1]:
#Import the necessary libraries

import pandas as pd
import numpy as np
import scipy.io
#Import the datetime class directly; a separate 'import datetime' would be shadowed by this line
from datetime import tzinfo, timedelta, datetime

import warnings
warnings.filterwarnings('ignore')

In [2]:
#The .mat file is located in the same folder as the image subfolders

mat = scipy.io.loadmat('image_data/imdb_data/imdb.mat')

In [3]:
#loadmat returns a dictionary; the key 'imdb' holds the image metadata

mat.keys()

dict_keys(['__header__', '__version__', '__globals__', 'imdb'])

In [4]:
mat_data = mat['imdb']

In [5]:
mtype = mat_data.dtype

In [6]:
#There are ten fields within the imdb metadata key; only the first eight are relevant here.
#'celeb_names' and 'celeb_id' are much shorter series that correspond to IMDb profile indexes.
#These indexes are obsolete because IMDb moved the hosting of its site to AWS, which uses
#a different system. The eight relevant keys match the keys in the wiki file.

mtype.names

('dob',
 'photo_taken',
 'full_path',
 'gender',
 'name',
 'face_location',
 'face_score',
 'second_face_score',
 'celeb_names',
 'celeb_id')

In [7]:
len(mat_data[0][0])

10

In [8]:
column_titles = list(mtype.names)

In [9]:
#I turn the imdb fields into a dictionary: the keys are the column titles and the values are
#the series of information that correspond to the 460,723 images

image_info = {}

for i, title in enumerate(column_titles):
    image_info[title] = mat_data[0][0][i]

In [10]:
image_info['dob'][0]

array([693726, 693726, 693726, ..., 726831, 726831, 726831], dtype=int32)
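The `dob` values are MATLAB serial date numbers rather than calendar dates. The sketch below shows one common way to convert them with the standard library, assuming whole-day values as in the `int32` array above. The helper name `datenum_to_date` is my own and is not part of the notebook. The offset of 366 days accounts for MATLAB counting from year 0 while Python's ordinal 1 is 0001-01-01.

```python
from datetime import date

# MATLAB serial day 1 is 0000-01-01, Python ordinal 1 is 0001-01-01,
# so the two counts differ by 366 days.
MATLAB_ORDINAL_OFFSET = 366


def datenum_to_date(datenum):
    """Convert a whole-day MATLAB serial date number to a datetime.date.

    Returns None when the value falls before Python's earliest
    representable date (hypothetical handling, chosen for this sketch).
    """
    ordinal = int(datenum) - MATLAB_ORDINAL_OFFSET
    if ordinal < 1:
        return None
    return date.fromordinal(ordinal)


first_dob = datenum_to_date(693726)  # first 'dob' value in the output above
print(first_dob)  # 1899-05-10
```

Returning `None` for out-of-range values is only one option; raising an error or mapping them to `pd.NaT` would work equally well, depending on how later cells treat missing birth dates.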

In [11]:
#Some of the data is stored in formats that need adjusting: nested lists need to be
#flattened and file paths need the image folder prefix. I make those changes here

birth_date = image_info['dob'][0]
year_taken = image_info['photo_taken'][0]
file_path = ['image_data/imdb_data/' + item for sublist in image_info['full_path'][0] for item in sublist]
gender = image_info['gender'][0]
name = [item for sublist in image_info['name'][0] for item in sublist]
face_location = [item for sublist in image_info['face_location'][0] for item in sublist]
face_score = image_info['face_score'][0]
second_face_score = image_info['second_face_score'][0]
columns = column_titles

image_data_dictionary = {
    columns[4]: name,
    columns[0]: birth_date,
    columns[3]: gender,
    columns[1]: year_taken,
    'file_path': file_path,
    columns[5]: face_location,
    columns[6]: face_score,
    columns[7]: second_face_score,
}
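The nested comprehension used above for `name`, `file_path`, and `face_location` can be hard to read at first glance. Here is a minimal illustration on stand-in data (the sample values and the `nested_*` variable names are made up for this sketch, and plain lists stand in for the arrays that `loadmat` returns). Each entry is wrapped in its own one-element container, and the comprehension unwraps exactly one level.

```python
from itertools import chain

# Stand-in for image_info['name'][0]: every name sits in its own one-element list
nested_names = [["Fred Astaire"], ["Fred Astaire"], ["Example Name"]]

# Read left to right: for each sublist, for each item in that sublist, keep the item
flat_names = [item for sublist in nested_names for item in sublist]

# chain.from_iterable performs the same one-level flattening
flat_names_alt = list(chain.from_iterable(nested_names))

# Stand-in for face_location: each entry wraps one 4-value bounding box,
# so one level of flattening leaves the boxes themselves intact
nested_boxes = [[[477.184, 100.352, 622.592, 245.76]], [[1.0, 2.0, 3.0, 4.0]]]
flat_boxes = [item for sublist in nested_boxes for item in sublist]

print(flat_names)
print(flat_boxes[0])
```

Because only one level is removed, each `face_location` entry remains a four-element box, which matches what the dataframe preview shows.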


In [12]:
#Convert the dictionary to a dataframe

photo_info = pd.DataFrame(image_data_dictionary)

In [13]:
photo_info.head()

,name,dob,gender,photo_taken,file_path,face_location,face_score,second_face_score
0,Fred Astaire,693726,1.0,1968,image_data/imdb_data/01/nm0000001_rm124825600_...,"[1072.926, 161.838, 1214.7839999999999, 303.69...",1.459693,1.118973
1,Fred Astaire,693726,1.0,1970,image_data/imdb_data/01/nm0000001_rm3343756032...,"[477.184, 100.352, 622.592, 245.76]",2.543198,1.852008
2,Fred Astaire,693726,1.0,1968,image_data/imdb_data/01/nm0000001_rm577153792_...,"[114.96964308962852, 114.96964308962852, 451.6...",3.455579,2.98566
3,Fred Astaire,693726,1.0,1968,image_data/imdb_data/01/nm0000001_rm946909184_...,"[622.8855056426588, 424.21750383700805, 844.33...",1.872117,
4,Fred Astaire,693726,1.0,1968,image_data/imdb_data/01/nm0000001_rm980463616_...,"[1013.8590023603723, 233.8820422075853, 1201.5...",1.158766,
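Since `photo_taken` only records a year, an age for each image has to be approximated. The sketch below is one way to do it, assuming the photo was taken on 1 July of that year (a mid-year convention I am adopting for illustration) and that `dob` is a whole-day MATLAB serial date number. The function name `approximate_age` is hypothetical and does not appear elsewhere in this notebook.

```python
from datetime import date


def approximate_age(dob_datenum, photo_year):
    """Estimate age in whole years, assuming the photo was taken on 1 July.

    dob_datenum: MATLAB serial date number (366 days ahead of Python's ordinal).
    photo_year:  the year stored in 'photo_taken'.
    """
    birth = date.fromordinal(int(dob_datenum) - 366)
    reference = date(int(photo_year), 7, 1)  # assumed mid-year photo date
    age = reference.year - birth.year
    # Subtract a year if the birthday had not yet occurred by the reference date
    if (reference.month, reference.day) < (birth.month, birth.day):
        age -= 1
    return age


# First two rows of the preview above: dob 693726, photos from 1968 and 1970
print(approximate_age(693726, 1968))  # 69
print(approximate_age(693726, 1970))  # 71
```

The result can be off by one year for people born in the second half of the year, so it is best treated as an estimate rather than an exact label.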


In [14]:
photo_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 460723 entries, 0 to 460722
Data columns (total 8 columns):
name                 460723 non-null object
dob                  460723 non-null int32
gender               452261 non-null float64
photo_taken          460723 non-null uint16
file_path            460723 non-null object
face_location        460723 non-null object
face_score           460723 non-null float64
second_face_score    213797 non-null float64
dtypes: float64(3), int32(1), object(3), uint16(1)
memory usage: 23.7+ MB
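The summary above shows that `gender` is missing for some rows and that `second_face_score` is populated for fewer than half of them. A possible next filtering step is sketched below on a small stand-in dataframe (the rows are invented for illustration). It assumes that a missing `second_face_score` means no second face was detected and that a non-finite `face_score` means no face was found. Both assumptions reflect my reading of the dataset description and should be checked against it before use.

```python
import numpy as np
import pandas as pd

# Stand-in rows; names and values are hypothetical, not taken from the dataset
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "gender": [1.0, np.nan, 0.0, 1.0],
    "face_score": [1.46, 2.54, -np.inf, 3.46],
    "second_face_score": [np.nan, np.nan, np.nan, 2.99],
})

has_face = np.isfinite(sample["face_score"])       # assumed: non-finite score means no face
single_face = sample["second_face_score"].isna()   # assumed: NaN means no second face
known_gender = sample["gender"].notna()            # drop rows with missing gender

# Keep only rows that pass all three checks
clean = sample[has_face & single_face & known_gender].reset_index(drop=True)
print(clean)
```

On the real data the same masks could be built from `photo_info` directly. Whether to drop or keep multi-face images depends on the downstream task.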


In [15]:
#Save the dataframe as a .csv file

photo_info.to_csv('Photo_Dataframes/imdb_photo_metadata.csv', index = False)