# Preparing the Wiki dataset

## Introduction

The Wiki dataset is a subset of the IMDB-Wiki dataset, which was the largest age and gender classification dataset freely available online as of 2017. More information can be found on the IMDB-Wiki dataset website: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/

## Goal of this notebook

The goal of this notebook is to prepare the Wiki dataset for gender and age classification using Keras. Namely, we clean the dataset of implausible and missing values, which thankfully were few, and create 10 stratified train/validation splits for model evaluation.

## Part 1 - Basic setup

In [1]:
# Necessary modules
import numpy    as np
import pandas   as pd
import datetime as dt

from scipy.io                import loadmat 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from datetime                import datetime, timedelta

In [2]:
# Ensures reproducibility
np.random.seed(0)

In [3]:
# Constants
WIKI_DATA_PATH    = "../data/wiki/"
WIKI_META_PATH    = "../data/wiki/wiki.mat"
WIKI_ALL_HEADERS  = ["dob","photo_taken","full_path","gender","name","face_loc","face_score","second_face_score"]
RELEVANT_HEADERS  = ["full_path","gender","dob","photo_taken"]
FINAL_HEADERS     = ["full_path","gender","age"]

## Part 2 - Initial loading of the metadata

The Wiki dataset metadata is stored in a '.mat' (MATLAB) file, so we need a few workarounds to get it into a Python-readable format. That happens below. The end goal is to construct a pandas DataFrame with the proper types.

In [4]:
# Load dataset
wiki_mat = loadmat(WIKI_META_PATH)
wiki_mat = wiki_mat['wiki'][0][0]  # loadmat wraps the MATLAB struct in two singleton dimensions, so unwrap both

In [5]:
# Dimensions of the Pandas Dataframe
num_rows = wiki_mat[0].shape[1]
num_cols = len(WIKI_ALL_HEADERS)

In [6]:
# Dataframe placeholder
data_matrix = np.zeros(shape=(num_rows,num_cols))
wiki_frame  = pd.DataFrame(data=data_matrix, columns=WIKI_ALL_HEADERS)

In [7]:
# Loading dataset into placeholder
for col in range(num_cols):
    curr_values = wiki_mat[col][0]
    curr_header = WIKI_ALL_HEADERS[col]
    wiki_frame[curr_header] = curr_values

wiki_frame.head()

Unnamed: 0,dob,photo_taken,full_path,gender,name,face_loc,face_score,second_face_score
0,723671,2009,[17/10000217_1981-05-05_2009.jpg],1.0,[Sami Jauhojärvi],"[[111.291094733, 111.291094733, 252.669930818,...",4.300962,
1,703186,1964,[48/10000548_1925-04-04_1964.jpg],1.0,[Dettmar Cramer],"[[252.483302295, 126.681651148, 354.531925962,...",2.645639,1.949248
2,711677,2008,[12/100012_1948-07-03_2008.jpg],1.0,[Marc Okrand],"[[113.52, 169.84, 366.08, 422.4]]",4.329329,
3,705061,1961,[65/10001965_1930-05-23_1961.jpg],1.0,[Aleksandar Matanović],"[[1, 1, 634, 440]]",-inf,
4,720044,2012,[16/10002116_1971-05-31_2012.jpg],0.0,[Diana Damrau],"[[171.610314052, 75.5745123976, 266.766115706,...",3.408442,


The data clearly needs some fixing. The dob (date of birth) column does not show a date (it holds a serialized MATLAB date number which needs to be converted), the full_path and name columns hold 1-d arrays wrapping a string instead of the plain string and, lastly, we have unnecessary columns.

In [8]:
# Dropping unnecessary columns
wiki_frame = wiki_frame[RELEVANT_HEADERS]
wiki_frame.head()

Unnamed: 0,full_path,gender,dob,photo_taken
0,[17/10000217_1981-05-05_2009.jpg],1.0,723671,2009
1,[48/10000548_1925-04-04_1964.jpg],1.0,703186,1964
2,[12/100012_1948-07-03_2008.jpg],1.0,711677,2008
3,[65/10001965_1930-05-23_1961.jpg],1.0,705061,1961
4,[16/10002116_1971-05-31_2012.jpg],0.0,720044,2012


In [9]:
# Converting matlab time to a python datetime object

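# Not our function, found on Stack Overflow
# MATLAB date numbers count days from year 0, Python ordinals from year 1, hence the 366-day offset
matlab2datetime = lambda dob: datetime.fromordinal(int(dob)) + timedelta(days=dob%1) - timedelta(days = 366)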

wiki_frame["dob"] = wiki_frame["dob"].apply(lambda x: matlab2datetime(x).year)
wiki_frame.head()

Unnamed: 0,full_path,gender,dob,photo_taken
0,[17/10000217_1981-05-05_2009.jpg],1.0,1981,2009
1,[48/10000548_1925-04-04_1964.jpg],1.0,1925,1964
2,[12/100012_1948-07-03_2008.jpg],1.0,1948,2008
3,[65/10001965_1930-05-23_1961.jpg],1.0,1930,1961
4,[16/10002116_1971-05-31_2012.jpg],0.0,1971,2012
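
The conversion can be checked against the birth dates embedded in the file names. The sketch below restates the same arithmetic as a named function; `matlab_datenum_to_datetime` and `MATLAB_TO_PYTHON_OFFSET_DAYS` are names made up for this illustration, and the two date numbers are taken from the first two rows of the table above.

```python
from datetime import date, datetime, timedelta

# MATLAB date numbers count days from year 0, Python ordinals from year 1.
# Year 0 is a leap year in that calendar, which gives the 366-day offset.
MATLAB_TO_PYTHON_OFFSET_DAYS = 366

def matlab_datenum_to_datetime(datenum):
    """Convert a MATLAB date number (days, possibly fractional) to a datetime."""
    whole_days = int(datenum)
    day_fraction = datenum % 1
    return (datetime.fromordinal(whole_days)
            + timedelta(days=day_fraction)
            - timedelta(days=MATLAB_TO_PYTHON_OFFSET_DAYS))

# Date numbers from the first two rows of the table above
first_dob = matlab_datenum_to_datetime(723671)
second_dob = matlab_datenum_to_datetime(703186)
```

Both results agree with the dates in the corresponding file names (1981-05-05 and 1925-04-04), which is a cheap way to confirm that the offset is right.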


In [10]:
# Creating age feature out of dob and photo_taken (column-wise subtraction, no row-wise apply needed)
wiki_frame["age"] = wiki_frame["photo_taken"] - wiki_frame["dob"]
wiki_frame.head()

Unnamed: 0,full_path,gender,dob,photo_taken,age
0,[17/10000217_1981-05-05_2009.jpg],1.0,1981,2009,28
1,[48/10000548_1925-04-04_1964.jpg],1.0,1925,1964,39
2,[12/100012_1948-07-03_2008.jpg],1.0,1948,2008,60
3,[65/10001965_1930-05-23_1961.jpg],1.0,1930,1961,31
4,[16/10002116_1971-05-31_2012.jpg],0.0,1971,2012,41
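
Subtracting years ignores where in the year the birthday falls, so the age can be off by one. Since photo_taken is only a year, an exact age is not recoverable, but one possible refinement is to assume a fixed photo date. The sketch below assumes July 1, which is an arbitrary choice made for this illustration and not something the notebook does; `age_at_mid_year` is a hypothetical helper, and the second birth date is invented.

```python
from datetime import date

def age_at_mid_year(dob, photo_year):
    """Completed years between `dob` and July 1 of `photo_year` (assumed photo date)."""
    photo_date = date(photo_year, 7, 1)
    # Has the birthday already happened by the assumed photo date?
    had_birthday = (photo_date.month, photo_date.day) >= (dob.month, dob.day)
    return photo_date.year - dob.year - (0 if had_birthday else 1)

# Birth date from the first row of the table (taken from its file name)
early_birthday = age_at_mid_year(date(1981, 5, 5), 2009)

# Hypothetical December birthday: the plain year difference would give 29
late_birthday = age_at_mid_year(date(1980, 12, 15), 2009)
```

For the first row both approaches give 28. For the invented December birthday the year difference gives 29 while the mid-year assumption gives 28.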


In [11]:
# Converting full_path from 1-d array containing string to just the string
wiki_frame["full_path"] = wiki_frame["full_path"].apply(lambda x: x[0])
wiki_frame.head()

Unnamed: 0,full_path,gender,dob,photo_taken,age
0,17/10000217_1981-05-05_2009.jpg,1.0,1981,2009,28
1,48/10000548_1925-04-04_1964.jpg,1.0,1925,1964,39
2,12/100012_1948-07-03_2008.jpg,1.0,1948,2008,60
3,65/10001965_1930-05-23_1961.jpg,1.0,1930,1961,31
4,16/10002116_1971-05-31_2012.jpg,0.0,1971,2012,41


## Part 2 - Initial values sanity check

In this part of the notebook, we ensure that the current columns in our dataframe all have the types of values that we'd expect them to have.

### 2.1 Age problems

In [12]:
# Nonsensical conditions in the dataset
negative_ages   = wiki_frame["photo_taken"] < wiki_frame["dob"] 
ridiculous_ages = wiki_frame["age"] > 100

# Calculating count of bad and total data
bad_rows_count    = wiki_frame[negative_ages | ridiculous_ages].shape[0]
total_rows_count = wiki_frame.shape[0]

# Quick report
bad_percentage = ( bad_rows_count / float(total_rows_count) ) * 100
print("A total of %s percent of the data is faulty" % bad_percentage)

A total of 2.72429726608 percent of the data is faulty


That's not a problem. We can just drop these rows without losing a significant amount of data. We expected these problems to be there because Wikipedia can be unreliable at times.

In [13]:
# Dropping bad data
# Note: the strict inequalities also drop rows with an age of exactly 0 or exactly 100
positive_ages   = wiki_frame["photo_taken"] > wiki_frame["dob"] 
reasonable_ages = wiki_frame["age"] < 100

wiki_frame = wiki_frame[positive_ages & reasonable_ages]
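
The keep-mask in this cell is written by hand rather than derived from the bad-row mask of the previous cell, so the two are not exact complements at the boundaries. The plain-Python sketch below mirrors both rules on a handful of invented (dob, photo_taken) year pairs; `is_bad` and `is_kept_strict` are names made up for this illustration.

```python
# Hypothetical (dob_year, photo_year) pairs, chosen to hit the boundary cases
records = [(1981, 2009), (2000, 2000), (1900, 2000), (1990, 1985), (1850, 2000)]

def is_bad(dob, photo_taken):
    """Mirror of the bad-row rule: negative age or age above 100."""
    age = photo_taken - dob
    return photo_taken < dob or age > 100

def is_kept_strict(dob, photo_taken):
    """Mirror of the keep-mask used for dropping (strict inequalities)."""
    age = photo_taken - dob
    return photo_taken > dob and age < 100

# Rows kept by the hand-written mask
kept_strict = [r for r in records if is_kept_strict(*r)]

# Rows kept when the bad-row rule is simply negated
kept_complement = [r for r in records if not is_bad(*r)]
```

Only the negated rule keeps the rows with an age of exactly 0 or exactly 100. In pandas, the same negation could be written as `wiki_frame[~(negative_ages | ridiculous_ages)]` if those boundary rows should be kept.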

In [14]:
# Sanity check

# Nonsensical conditions in the dataset
negative_ages   = wiki_frame["photo_taken"] < wiki_frame["dob"] 
ridiculous_ages = wiki_frame["age"] > 100

# Calculating count of bad data
bad_rows_count    = wiki_frame[negative_ages | ridiculous_ages].shape[0]
print("There are now %s bad age values" % bad_rows_count)

There are now 0 bad age values


### 2.2 Gender problems

In [15]:
# Checking for NaNs
wiki_frame.isnull().sum()

full_path         0
gender         2476
dob               0
photo_taken       0
age               0
dtype: int64

The gender column has NaN values, which is problematic. Let's see how much of the dataset we'd lose if we dropped them all.

In [16]:
num_missing = wiki_frame[wiki_frame["gender"].isnull()].shape[0]
num_rows    = wiki_frame.shape[0]

percentage_missing = (num_missing / float(num_rows)) * 100
print("In total, %s percent of the data is NaN" % percentage_missing)

In total, 4.09879486161 percent of the data is NaN


In [17]:
# Preserving only rows without NaNs
good_rows  = wiki_frame["gender"].notnull()
wiki_frame = wiki_frame[good_rows]

In [18]:
# Sanity check (num_missing is a row count, not a percentage)
num_missing = wiki_frame[wiki_frame["gender"].isnull()].shape[0]
print("There are now %d rows with a NaN gender" % num_missing)

There are now 0 rows with a NaN gender


The dataframe was successfully loaded and type-corrected!

In [19]:
# Note: num_rows was reassigned in cell 16, so the baseline here is the row count after the age filter
rows_count = wiki_frame.shape[0]
working_percentage = ( rows_count / float(num_rows) ) * 100
print("We are now working with %d percent of the age-filtered dataset" % working_percentage)

We are now working with 95 percent of the age-filtered dataset


In [20]:
# Dropping unnecessary headers one last time
wiki_frame = wiki_frame[FINAL_HEADERS]
wiki_frame.head()

Unnamed: 0,full_path,gender,age
0,17/10000217_1981-05-05_2009.jpg,1.0,28
1,48/10000548_1925-04-04_1964.jpg,1.0,39
2,12/100012_1948-07-03_2008.jpg,1.0,60
3,65/10001965_1930-05-23_1961.jpg,1.0,31
4,16/10002116_1971-05-31_2012.jpg,0.0,41


## Part 3 - Generating stratified cross-validation folds

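Now that the dataset has been curated, we can finally create our stratified train/validation splits. We first hold out 20% of the data as a test set, then repeatedly split the remaining 80% into 80% training and 20% validation. Overall, that gives a train/validation/test split of roughly 64/16/20.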
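
Since the validation share is taken from what remains after the test split, the effective proportions can be checked with a little arithmetic. The variable names below are made up for this illustration; the two 0.2 values are the `test_size` arguments used in the cells that follow.

```python
test_share = 0.2            # test_size passed to train_test_split
valid_share_of_rest = 0.2   # test_size passed to StratifiedShuffleSplit

# Fraction of the full dataset left after holding out the test set
rest_share = 1.0 - test_share

# Validation and training fractions, relative to the full dataset
valid_share = rest_share * valid_share_of_rest
train_share = rest_share - valid_share

# Rounded percentages in train/validation/test order
shares = [round(100 * s) for s in (train_share, valid_share, test_share)]
```

The rounded shares come out as 64, 16 and 20 percent of the full dataset.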

In [21]:
train_validation, test = train_test_split(wiki_frame, test_size = 0.2)

In [22]:
train_validation.reset_index(inplace=True)
test.reset_index(inplace=True)

In [23]:
Y_train_validation_age    = train_validation["age"]
Y_train_validation_gender = train_validation["gender"]

In [24]:
num_folds = 10
skf       = StratifiedShuffleSplit(n_splits=num_folds,random_state=0,test_size=0.2)

In [25]:
age_folds    = []
gender_folds = []

In [26]:
for train_validation_tuple in skf.split(train_validation, Y_train_validation_age):
    age_folds.append(train_validation_tuple) 

In [27]:
for train_validation_tuple in skf.split(train_validation, Y_train_validation_gender):
    gender_folds.append(train_validation_tuple) 

We now have our fold indices! All that remains is to save them in an easy-to-transport format. That happens in the next and last section.
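
Before saving, it can be worth confirming that a split really preserves the label proportions. The helper below is a minimal sketch in plain Python; `class_shares` is a hypothetical name, and the toy labels and hand-written index lists are invented for the illustration rather than produced by `StratifiedShuffleSplit`.

```python
from collections import Counter

def class_shares(labels, indices):
    """Fraction of each class among the rows selected by `indices`."""
    counts = Counter(labels[i] for i in indices)
    total = float(len(indices))
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels: 6 rows of class 1.0 and 4 rows of class 0.0
labels = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]

# A hand-written split that keeps the 60/40 ratio on both sides
train_idx = [0, 1, 2, 6, 7]
valid_idx = [3, 4, 5, 8, 9]

train_shares = class_shares(labels, train_idx)
valid_shares = class_shares(labels, valid_idx)
```

With the notebook's objects, a call along the lines of `class_shares(Y_train_validation_gender.values, gender_folds[0][1])` should give shares close to those of the full label column, under the assumption that the fold indices are positional.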

## Part 4 - Saving folds and testing set as DataFrames

### 4.1 Age and gender folds

In [28]:
age_dataframes    = []
gender_dataframes = []

In [29]:
# Fold indices are positional, so rows are selected with .iloc (.ix was removed in pandas 1.0)
for fold in age_folds:
    train_df = train_validation.iloc[fold[0]].reset_index(inplace=False)
    valid_df = train_validation.iloc[fold[1]].reset_index(inplace=False)
    fold_tupl = (train_df, valid_df)
    age_dataframes.append(fold_tupl)

# Gender dataframes must be built from the gender folds, not the age folds
for fold in gender_folds:
    train_df = train_validation.iloc[fold[0]].reset_index(inplace=False)
    valid_df = train_validation.iloc[fold[1]].reset_index(inplace=False)
    fold_tupl = (train_df, valid_df)
    gender_dataframes.append(fold_tupl)

In [30]:
AGE_SAVE_PATH   = "../data/wiki_folds/age/%s"
GENDER_SAVE_PATH = "../data/wiki_folds/gender/%s"

In [31]:
count = 1
for train, valid in age_dataframes:
    fname_t = "train_%d.csv" % count
    fname_v = "valid_%d.csv" % count
    train.to_csv(AGE_SAVE_PATH % fname_t)
    valid.to_csv(AGE_SAVE_PATH % fname_v)
    count += 1  # Without this, every age fold would overwrite train_1.csv and valid_1.csv
    
count = 1
for train, valid in gender_dataframes:
    fname_t = "train_%d.csv" % count
    fname_v = "valid_%d.csv" % count
    train.to_csv(GENDER_SAVE_PATH % fname_t)
    valid.to_csv(GENDER_SAVE_PATH % fname_v)
    count += 1
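
A quick way to confirm that the fold numbering yields a distinct pair of files per fold is to generate the paths up front. The sketch below uses only string formatting; `fold_paths` is a hypothetical helper, while the save pattern and the fold count of 10 are taken from the cells above.

```python
def fold_paths(save_pattern, num_folds):
    """Return (train_path, valid_path) pairs for folds numbered 1..num_folds."""
    paths = []
    for fold_number in range(1, num_folds + 1):
        train_path = save_pattern % ("train_%d.csv" % fold_number)
        valid_path = save_pattern % ("valid_%d.csv" % fold_number)
        paths.append((train_path, valid_path))
    return paths

AGE_SAVE_PATH = "../data/wiki_folds/age/%s"
age_paths = fold_paths(AGE_SAVE_PATH, 10)

# Flatten the pairs to check that no path is used twice
all_files = [path for pair in age_paths for path in pair]
```

Each pair could then be matched with its dataframes, for example by iterating over `zip(age_dataframes, age_paths)` and calling `to_csv` on each side. Note that the cells above write only the folds; the held-out test frame is not saved anywhere in this notebook.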

We're done! Wohoo :)