# Wikipedia Metadata Cleanup Notebook

This notebook cleans up the Wikipedia metadata file obtained from ETH Zürich's Computer Vision Lab. The metadata describes 62,328 images of cropped faces. It was collected in 2015 by Rasmus Rothe, Radu Timofte, and Luc Van Gool for their work on age estimation ([Deep expectation of real and apparent age from a single image without facial landmarks](https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/)).

In [1]:
#Import the necessary libraries here

import pandas as pd
import numpy as np
import scipy.io
#Import the datetime classes directly; a separate "import datetime" would be shadowed by this line
from datetime import tzinfo, timedelta, datetime

import warnings
warnings.filterwarnings('ignore')

In [2]:
#The .mat file is located in the same folder as the image subfolders

mat = scipy.io.loadmat('image_data/wiki_data/wiki.mat')

In [3]:
#loadmat returns the data as a dictionary; the key 'wiki' contains the image metadata

mat.keys()

dict_keys(['__header__', '__version__', '__globals__', 'wiki'])

In [4]:
mat_data = mat['wiki']

In [5]:
mtype = mat_data.dtype

In [6]:
#There are eight series within the wiki metadata key, all of which correspond to the
#keys from the imdb file

mtype.names

('dob',
 'photo_taken',
 'full_path',
 'gender',
 'name',
 'face_location',
 'face_score',
 'second_face_score')

In [7]:
len(mat_data[0][0])

8

In [8]:
column_titles = list(mtype.names)

In [9]:
#I turn the wiki series into a dictionary: the keys are the column titles and the values
#are the series of information that correspond to the 62,328 images

image_info = {}

for i, title in enumerate(column_titles):
    image_info[title] = mat_data[0][0][i]
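
As a possible alternative to indexing `mat_data[0][0][i]`, `scipy.io.loadmat` accepts `squeeze_me=True` and `struct_as_record=False`, which expose the struct fields as attributes. The sketch below is only an illustration: it builds a tiny in-memory stand-in for `wiki.mat` (the `toy` dictionary and its two fields are assumptions, not the real file) and reads it back.

```python
import io

import numpy as np
import scipy.io

# Toy stand-in for wiki.mat, written to memory instead of disk.
# Only two of the eight fields are included, with values taken from the output above.
toy = {
    'wiki': {
        'dob': np.array([[723671, 703186]]),
        'photo_taken': np.array([[2009, 1964]]),
    }
}
buffer = io.BytesIO()
scipy.io.savemat(buffer, toy)
buffer.seek(0)

# squeeze_me drops singleton dimensions; struct_as_record=False makes the
# struct fields available as attributes rather than a nested record array.
loaded = scipy.io.loadmat(buffer, squeeze_me=True, struct_as_record=False)
wiki = loaded['wiki']

toy_info = {'dob': wiki.dob, 'photo_taken': wiki.photo_taken}
print(toy_info['dob'])
```

With the real file, the squeezed fields would likely have different shapes from the ones used in the rest of this notebook, so the later `[0]` indexing would need to be adjusted.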

In [10]:
image_info['dob'][0]

array([723671, 703186, 711677, ..., 720620, 723893, 713846], dtype=int32)
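
The `dob` values are MATLAB serial date numbers, not calendar dates. This notebook leaves them as integers, but a minimal sketch of a conversion could look like the following. The helper name `matlab_datenum_to_date` is hypothetical. The 366-day offset comes from MATLAB counting days from year 0, while Python ordinals start at year 1.

```python
from datetime import date, timedelta

# MATLAB's day 1 is 1 January of year 0; Python's ordinal 1 is 1 January of year 1.
# Year 0 is treated as a leap year, so the two scales differ by 366 days.
MATLAB_TO_PYTHON_OFFSET = 366


def matlab_datenum_to_date(datenum):
    """Convert a whole-day MATLAB serial date number to a datetime.date."""
    return date.fromordinal(int(datenum)) - timedelta(days=MATLAB_TO_PYTHON_OFFSET)


first_dob = matlab_datenum_to_date(723671)
second_dob = matlab_datenum_to_date(703186)
print(first_dob)   # 1981-05-05
print(second_dob)  # 1925-04-04
```

Both results match the birth dates embedded in the corresponding image file names. Values outside Python's supported date range would raise an error, so the real column may need a guard before conversion.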

In [11]:
#On its own, the following code creates a list of names with a length of 62,204, while all
#other lists have a length of 62,328. This is because of blank values in the 'name' column.

#After checking the length of the array itself with the code "len(image_info['name'][0])",
#I build the name list separately with the for loop in the next cell to obtain the correct
#length of 62,328. The blank values are kept so that each row still corresponds to the
#correct person.

len([item[0] for sublist in image_info['name'][0] for item in sublist])

62204

In [12]:
name_items = []
count_np = 0
count_else = 0

count_inner_array = 0
count_not_inner_array = 0

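for i in image_info['name'][0]:
    if isinstance(i, np.ndarray):
        count_np += 1

        try:
            #A populated entry holds the name as its first element
            name_items.append(i[0])
            count_inner_array += 1
        except IndexError:
            #A blank entry is an empty array; keep it so the list stays aligned
            name_items.append(i)
            count_not_inner_array += 1
    else:
        print(type(i))
        count_else += 1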

print("The number of elements that are type 'ndarray' is: {}".format(count_np))
print("The number of elements that are another type is: {}\n\n".format(count_else))

print("The number of inner arrays are: {}".format(count_inner_array))
print("The number that aren't inner arrays are {}".format(count_not_inner_array))

The number of elements that are type 'ndarray' is: 62328
The number of elements that are another type is: 0


The number of inner arrays are: 62204
The number that aren't inner arrays are 124
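
The length mismatch above can be reproduced on a small scale. The sketch below uses a toy object array (an assumption standing in for `image_info['name'][0]`) to show why the nested comprehension loses blank entries. It also shows one hypothetical alternative, which stores `None` for blanks instead of the empty array this notebook keeps.

```python
import numpy as np

# Toy stand-in for image_info['name'][0]: an object array of small arrays,
# where the middle entry is empty (a blank name).
raw_names = np.empty(3, dtype=object)
raw_names[0] = np.array(['Tove Styrke'])
raw_names[1] = np.array([], dtype='<U1')
raw_names[2] = np.array(['Marc Okrand'])

# The nested comprehension iterates over each inner array, so an empty
# inner array contributes nothing and the result is one element short.
flattened = [item for sublist in raw_names for item in sublist]

# Checking the size first keeps exactly one output per input.
aligned = [str(entry[0]) if entry.size else None for entry in raw_names]

print(len(flattened), len(aligned))
print(aligned)
```

Storing `None` would let pandas treat the blanks as missing values, so they could later be dropped with `dropna(subset=['name'])` rather than a type check.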


In [13]:
#Some of the data is stored in formats that need to be adjusted, including lists that need
#to be flattened and file paths that need a folder prefix. I make those changes here

birth_date = image_info['dob'][0]
year_taken = image_info['photo_taken'][0]
file_path = ['image_data/wiki_data/' + item for sublist in image_info['full_path'][0] for item in sublist]
gender = image_info['gender'][0]
name = name_items
face_location = [item for sublist in image_info['face_location'][0] for item in sublist]
face_score = image_info['face_score'][0]
second_face_score = image_info['second_face_score'][0]
columns = column_titles

image_data_dictionary = {
    columns[4]: name,
    columns[0]: birth_date,
    columns[3]: gender,
    columns[1]: year_taken,
    'file_path': file_path,
    columns[5]: face_location,
    columns[6]: face_score,
    columns[7]: second_face_score,
}


In [14]:
#A check to view the data types in the names column. An entry with a name is
#data type: numpy.str_
#A blank entry indicated by a pair of brackets is data type: numpy.ndarray

print(image_data_dictionary['name'][416])
print(type(image_data_dictionary['name'][416]))

print(image_data_dictionary['name'][418])
print(type(image_data_dictionary['name'][418]))

Tove Styrke
<class 'numpy.str_'>
[]
<class 'numpy.ndarray'>


In [15]:
#Convert the dictionary to a dataframe

photo_info = pd.DataFrame(image_data_dictionary)
photo_info.head()

| | name | dob | gender | photo_taken | file_path | face_location | face_score | second_face_score |
|---|---|---|---|---|---|---|---|---|
| 0 | Sami Jauhojärvi | 723671 | 1.0 | 2009 | image_data/wiki_data/17/10000217_1981-05-05_20... | [111.29109473290997, 111.29109473290997, 252.6... | 4.300962 | NaN |
| 1 | Dettmar Cramer | 703186 | 1.0 | 1964 | image_data/wiki_data/48/10000548_1925-04-04_19... | [252.48330229530742, 126.68165114765371, 354.5... | 2.645639 | 1.949248 |
| 2 | Marc Okrand | 711677 | 1.0 | 2008 | image_data/wiki_data/12/100012_1948-07-03_2008... | [113.52, 169.83999999999997, 366.08, 422.4] | 4.329329 | NaN |
| 3 | Aleksandar Matanović | 705061 | 1.0 | 1961 | image_data/wiki_data/65/10001965_1930-05-23_19... | [1, 1, 634, 440] | -inf | NaN |
| 4 | Diana Damrau | 720044 | 0.0 | 2012 | image_data/wiki_data/16/10002116_1971-05-31_20... | [171.61031405173117, 75.57451239763239, 266.76... | 3.408442 | NaN |


In [16]:
photo_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62328 entries, 0 to 62327
Data columns (total 8 columns):
name                 62328 non-null object
dob                  62328 non-null int32
gender               59685 non-null float64
photo_taken          62328 non-null uint16
file_path            62328 non-null object
face_location        62328 non-null object
face_score           62328 non-null float64
second_face_score    4096 non-null float64
dtypes: float64(3), int32(1), object(3), uint16(1)
memory usage: 3.2+ MB
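
The `info()` summary counts only null values, so it does not reveal entries such as the `-inf` face score visible in the preview. As a rough sketch, missing and non-finite values could be counted separately. The `toy` frame below is an assumption that reuses the five previewed rows, not the full dataset.

```python
import numpy as np
import pandas as pd

# Toy frame built from the five rows shown in the preview above.
toy = pd.DataFrame({
    'gender': [1.0, 1.0, 1.0, 1.0, 0.0],
    'face_score': [4.300962, 2.645639, 4.329329, -np.inf, 3.408442],
    'second_face_score': [np.nan, 1.949248, np.nan, np.nan, np.nan],
})

# isna() counts NaN values only; infinite values are not treated as missing.
missing_per_column = toy.isna().sum()

# Infinite face scores need their own check.
non_finite_scores = int(np.isinf(toy['face_score']).sum())

print(missing_per_column)
print(non_finite_scores)
```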


In [17]:
#I eliminate all 124 entries with empty brackets in the 'name' column

photo_info = photo_info[photo_info['name'].apply(lambda x: not isinstance(x, np.ndarray))].reset_index(drop=True)

In [18]:
photo_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62204 entries, 0 to 62203
Data columns (total 8 columns):
name                 62204 non-null object
dob                  62204 non-null int32
gender               59685 non-null float64
photo_taken          62204 non-null uint16
file_path            62204 non-null object
face_location        62204 non-null object
face_score           62204 non-null float64
second_face_score    4088 non-null float64
dtypes: float64(3), int32(1), object(3), uint16(1)
memory usage: 3.2+ MB


In [19]:
#Save the dataframe as a .csv file

photo_info.to_csv('Photo_Dataframes/wiki_photo_metadata.csv', index=False)
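
One thing to keep in mind when saving: each `face_location` entry is an array, so it would be written to the .csv file as text rather than as numbers. A possible workaround, sketched below on a toy frame, is to expand the four values into separate numeric columns before saving. The column names `face_loc_0` to `face_loc_3` are placeholders, since the meaning of each coordinate is not stated in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with array-valued face_location entries (values from the preview above).
toy = pd.DataFrame({
    'name': ['Marc Okrand', 'Aleksandar Matanović'],
    'face_location': [
        np.array([113.52, 169.84, 366.08, 422.4]),
        np.array([1.0, 1.0, 634.0, 440.0]),
    ],
})

# Placeholder column names; the coordinate order is an assumption left unlabeled.
location_columns = ['face_loc_0', 'face_loc_1', 'face_loc_2', 'face_loc_3']

# Each array becomes one row of four numeric columns.
locations = pd.DataFrame(
    toy['face_location'].tolist(),
    columns=location_columns,
    index=toy.index,
)
flat = pd.concat([toy.drop(columns='face_location'), locations], axis=1)
print(flat)
```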