<center><h1>1. Cross Validation Prep</h1></center>

## About this notebook

To evaluate our model, we need to split the available data into training, validation and testing portions. That way, we can use the training split to learn parameters, the validation split to decide on hyperparameters and the testing split to measure the final performance of our models on unseen data. This method of model evaluation is called <b>Cross Validation</b>.

However, it is often difficult to decide which portions of the dataset should be used for training, testing and validation, because the choice of splits has a non-trivial effect on the measured performance of the model. Luckily, the question of which portion to use for testing is already answered by the Adience Benchmark guidelines. More precisely, the 5th fold (fold 4, because the folds are indexed starting at 0) is to be used as the testing set.

With that question out of the way, we're still concerned with how to split the remaining data into training and validation splits. The technique of <b>K-Fold Cross Validation</b> answers this question by creating K different training and validation splits out of the remaining data, then evaluating our model on all of them and using the average performance across all splits as our measure of accuracy / 'goodness'. We'll be using a specific variant of K-Fold Cross Validation called <b>Stratified K-Fold Cross Validation</b>.

Stratified K-Fold Cross Validation forces the K different training-validation splits to have roughly the same distribution of classes in each of them. The idea here is to prevent any fold from having a non-trivial excess of a given class that would then bias the classifier created on it.
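To make the idea concrete, here is a minimal sketch of a stratified random split written with the standard library only. It is an illustration of the principle, not scikit-learn's implementation; the function name `stratified_shuffle_split` and the toy labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_shuffle_split(labels, valid_fraction, seed):
    """Return (train_idx, valid_idx) with per-class proportions preserved."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Reserve the same fraction of every class for validation
        n_valid = int(round(len(indices) * valid_fraction))
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return train_idx, valid_idx

# Toy, imbalanced label set (hypothetical data, not Adience)
labels = ["f"] * 80 + ["m"] * 16 + ["u"] * 4

# Five independent splits, one per seed
splits = [stratified_shuffle_split(labels, 0.25, seed) for seed in range(5)]

for train_idx, valid_idx in splits:
    print(Counter(labels[i] for i in valid_idx))
```

With these toy labels, every split reserves 20 "f", 4 "m" and 1 "u" rows for validation, so the 80/16/4 class ratio of the full set is kept in both parts of each split.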

Lastly, after our folds have been created, we'll organize them into directories in such a way that Keras's image processing tools can make use of them without extra, often hacky, workarounds.

## Suggested Sources

If the explanation above didn't quite make sense, I recommend reviewing the sources below.

1. https://www.youtube.com/watch?v=TIgfjmp-4BA
2. https://stats.stackexchange.com/questions/117643/why-use-stratified-cross-validation-why-does-this-not-damage-variance-related-b

## Creating the foundational splits on the Adience Benchmark

Adience Benchmark source: http://www.openu.ac.il/home/hassner/Adience/data.html

In [314]:
# Necessary imports
import os
import pandas as pd
import numpy  as np
from sklearn.model_selection import StratifiedShuffleSplit

In [315]:
# Useful constants
num_train_folds   = 4
ind_test_fold     = 4
validation_splits = 4

metadata_path  = "../data/face_image_project/fold_%s_data.txt"
metadata_test  = metadata_path % ind_test_fold

img_path         = "../data/face_image_project/aligned/%s/landmark_aligned_face.%d.%s"
keras_train_path = "../data/face_image_project/keras_format/train/%s/%d.jpg"
keras_valid_path = "../data/face_image_project/keras_format/vali/%s/%d.jpg"
keras_test_path  = "../data/face_image_project/keras_format/test/%s/%d.jpg"

relevant_cols = ["user_id","face_id","original_image","gender","age"]

In [316]:
# Extracting the test partition and creating a combined train_validation (trvl) superset

folds = []
for index in range(num_train_folds):
    path = metadata_path % index
    folds.append(pd.read_csv(filepath_or_buffer=path, sep="\t"))
    
trvl_meta = pd.concat(folds, ignore_index=True)
test_meta   = pd.read_csv(filepath_or_buffer=metadata_test, sep="\t")

trvl_meta = trvl_meta[relevant_cols]
test_meta = test_meta[relevant_cols]

At this point, we've created an initial partition of our dataset into a testing split and its complement. Both splits still need more processing before they're ready for primetime.

## Overview of current splits

### Testing split

In [317]:
test_meta.head()

Unnamed: 0,user_id,face_id,original_image,gender,age
0,115321157@N03,1744,12111738395_a7f715aa4e_o.jpg,m,"(4, 6)"
1,115321157@N03,1745,12112413505_0aea8e17c6_o.jpg,m,"(48, 53)"
2,115321157@N03,1744,12112392255_995532c2f0_o.jpg,m,"(4, 6)"
3,115321157@N03,1746,12112392255_995532c2f0_o.jpg,m,"(25, 32)"
4,115321157@N03,1747,12112392255_995532c2f0_o.jpg,m,"(25, 32)"


In [318]:
test_meta.shape

(3816, 5)

In [319]:
test_meta["gender"].value_counts()

f    1848
m    1597
u     286
Name: gender, dtype: int64

In [320]:
test_meta["age"].value_counts()

(25, 32)     1056
(4, 6)        570
(38, 43)      502
(0, 2)        483
(8, 12)       340
(60, 100)     257
(48, 53)      241
(15, 20)      227
None           62
35             36
57             17
55             11
45              6
(38, 48)        5
32              3
Name: age, dtype: int64

Aha! There are 'None'-valued ages. That's problematic. We'll need to do something about those. Thankfully, there are apparently only a few.

In [321]:
test_meta.isnull().values.any()

True

This is unexpected: there are null / NaN values in this split. Let us inspect that further.

In [322]:
test_meta[test_meta.isnull().any(axis=1)].head()

Unnamed: 0,user_id,face_id,original_image,gender,age
2805,7285955@N06,2059,9489513876_86d04ff460_o.jpg,,
3091,8007224@N07,2118,8917875562_c7925a4e2b_o.jpg,,"(25, 32)"
3092,8007224@N07,2118,8755673180_d6945bff9f_o.jpg,,"(25, 32)"
3096,8007224@N07,2118,11866643475_3a8d5ef09f_o.jpg,,"(25, 32)"
3103,8007224@N07,2118,8917875226_98976f714e_o.jpg,,"(25, 32)"


### Test complement set

In [323]:
trvl_meta.head()

Unnamed: 0,user_id,face_id,original_image,gender,age
0,30601258@N03,1,10399646885_67c7d20df9_o.jpg,f,"(25, 32)"
1,30601258@N03,2,10424815813_e94629b1ec_o.jpg,m,"(25, 32)"
2,30601258@N03,1,10437979845_5985be4b26_o.jpg,f,"(25, 32)"
3,30601258@N03,3,10437979845_5985be4b26_o.jpg,m,"(25, 32)"
4,30601258@N03,2,11816644924_075c3d8d59_o.jpg,m,"(25, 32)"


In [324]:
trvl_meta.shape

(15554, 5)

In [325]:
trvl_meta["gender"].value_counts()

f    7524
m    6523
u     813
Name: gender, dtype: int64

In [326]:
trvl_meta["age"].value_counts()

(25, 32)     3948
(0, 2)       2005
(38, 43)     1791
(8, 12)      1784
(4, 6)       1570
(15, 20)     1415
None          686
(60, 100)     615
(48, 53)      589
35            257
13            168
22            149
34            105
23             96
45             82
(27, 32)       77
55             65
36             56
(38, 42)       46
3              18
29             11
57              7
58              5
2               3
56              2
42              1
(38, 48)        1
46              1
(8, 23)         1
Name: age, dtype: int64

This is weird. There should not be ages outside the predefined ranges. Let's look into this as well.

In [327]:
trvl_meta.isnull().values.any()

True

It is true that there are NaN values in this set as well. We'll have to fix those too.

## Fixing dataset inconsistencies and filling missing values

The summaries above show problems with the dataset. Namely, NaNs as gender values, None as age values and inconsistent / overlapping labels for age. We address those issues right here.

### Dropping rows with NaN gender and None age simultaneously

Rows with both a NaN gender and a None age are the most problematic because we cannot average any values in order to 'guesstimate' their real values. A possible solution would be to fill in those values by running a state-of-the-art model such as Face++ or Microsoft's Facial Features model, but we think that'd be unnecessary given that there aren't that many rows with these two characteristics at the same time.

In [328]:
# Conditions for dropping
NaN_gender_trvl = trvl_meta["gender"].isnull()
None_age_trvl   = trvl_meta["age"] == "None"
trvl_bad_rows    = NaN_gender_trvl & None_age_trvl
trvl_bad_indices = trvl_meta[trvl_bad_rows].index.values

# Dropping rows when two conditions are present
trvl_meta.drop(labels=trvl_bad_indices, inplace=True)

In [329]:
# Conditions for dropping
NaN_gender_test = test_meta["gender"].isnull()
None_age_test   = test_meta["age"] == "None"
test_bad_rows   = NaN_gender_test & None_age_test
test_bad_indices = test_meta[test_bad_rows].index.values

# Dropping rows when two conditions are present
test_meta.drop(labels=test_bad_indices, inplace=True)

### Fixing bad age ranges

We already know that some age labels are declared as a single number instead of a range, unlike most other labels. This is problematic. The next section maps those single ages to their matching ranges.

#### Testing complement

In [330]:
trvl_meta["age"].value_counts()

(25, 32)     3948
(0, 2)       2005
(38, 43)     1791
(8, 12)      1784
(4, 6)       1570
(15, 20)     1415
(60, 100)     615
(48, 53)      589
35            257
13            168
22            149
34            105
23             96
45             82
(27, 32)       77
55             65
36             56
(38, 42)       46
None           40
3              18
29             11
57              7
58              5
2               3
56              2
42              1
(38, 48)        1
46              1
(8, 23)         1
Name: age, dtype: int64

So we see that there are a lot of ages that are not in range format. Since the number of such occurrences is rather limited, we'll fix them manually in the cell below.

In [331]:
trvl_meta["age"] = trvl_meta["age"].replace("13","(8, 13)")
trvl_meta["age"] = trvl_meta["age"].replace("(8, 12)","(8, 13)")
trvl_meta["age"] = trvl_meta["age"].replace("42","(38, 42)")
trvl_meta["age"] = trvl_meta["age"].replace("2","(0, 2)")
trvl_meta["age"] = trvl_meta["age"].replace("29","(25, 32)")

What follows is a non-trivial change. For some reason, some of the ages don't fall inside any of the labeled ranges in the dataset. So we're going to take those ages and group them under a nearby label, even though that label's range does not actually contain them.

In [332]:
trvl_meta["age"] = trvl_meta["age"].replace("35","(25, 32)")
trvl_meta["age"] = trvl_meta["age"].replace("22","(15, 20)")
trvl_meta["age"] = trvl_meta["age"].replace("34","(25, 32)")
trvl_meta["age"] = trvl_meta["age"].replace("23","(25, 32)")
trvl_meta["age"] = trvl_meta["age"].replace("45","(48, 53)")
trvl_meta["age"] = trvl_meta["age"].replace("55","(48, 53)")
trvl_meta["age"] = trvl_meta["age"].replace("36","(38, 43)")
trvl_meta["age"] = trvl_meta["age"].replace("3","(0, 2)")


trvl_meta["age"] = trvl_meta["age"].replace("57","(60, 100)")
trvl_meta["age"] = trvl_meta["age"].replace("58","(60, 100)")
trvl_meta["age"] = trvl_meta["age"].replace("56","(60, 100)")
trvl_meta["age"] = trvl_meta["age"].replace("46","(48, 53)")
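The manual replacements above could also be expressed as a rule. The sketch below maps a bare age to the bucket whose range is nearest, under the assumption that the eight range labels seen in the value counts are the target buckets. The names `AGE_BUCKETS`, `closest_bucket` and `normalize_age_label` are hypothetical. The rule does not reproduce every hand-written choice: the cell above sends "45" to "(48, 53)", whereas the nearest bucket is "(38, 43)".

```python
# Target buckets, assumed from the range labels visible in the value counts above
AGE_BUCKETS = [(0, 2), (4, 6), (8, 13), (15, 20),
               (25, 32), (38, 43), (48, 53), (60, 100)]

def distance_to_bucket(age, bucket):
    """Distance from an age to a bucket; 0 if the age lies inside it."""
    low, high = bucket
    if age < low:
        return low - age
    if age > high:
        return age - high
    return 0

def closest_bucket(age):
    # min() keeps the first bucket on ties, i.e. the younger one
    return min(AGE_BUCKETS, key=lambda bucket: distance_to_bucket(age, bucket))

def normalize_age_label(label):
    """Map a bare age such as "36" to a range label; leave other labels alone."""
    if label.isdigit():
        return str(closest_bucket(int(label)))
    return label

for raw in ["29", "36", "57", "(4, 6)", "None"]:
    print("%s -> %s" % (raw, normalize_age_label(raw)))
```

Applied to a pandas column, a function like this could be passed to `Series.map`, which would replace the long list of individual `replace` calls.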

In [333]:
trvl_meta["age"].value_counts()

(25, 32)     4417
(0, 2)       2026
(8, 13)      1952
(38, 43)     1847
(4, 6)       1570
(15, 20)     1564
(48, 53)      737
(60, 100)     629
(27, 32)       77
(38, 42)       47
None           40
(38, 48)        1
(8, 23)         1
Name: age, dtype: int64

In [334]:
# Dropping None ages. Not worth manually tagging them
trvl_meta = trvl_meta[trvl_meta["age"] != "None"]

In [335]:
trvl_meta["age"].value_counts()

(25, 32)     4417
(0, 2)       2026
(8, 13)      1952
(38, 43)     1847
(4, 6)       1570
(15, 20)     1564
(48, 53)      737
(60, 100)     629
(27, 32)       77
(38, 42)       47
(38, 48)        1
(8, 23)         1
Name: age, dtype: int64

Now we repeat the same procedure on the testing set.

#### Testing set

In [336]:
test_meta["age"].value_counts()

(25, 32)     1056
(4, 6)        570
(38, 43)      502
(0, 2)        483
(8, 12)       340
(60, 100)     257
(48, 53)      241
(15, 20)      227
35             36
57             17
55             11
45              6
(38, 48)        5
32              3
Name: age, dtype: int64

In [337]:
test_meta["age"] = test_meta["age"].replace("35", "(25, 32)")
test_meta["age"] = test_meta["age"].replace("57", "(60, 100)")
test_meta["age"] = test_meta["age"].replace("55", "(48, 53)")
test_meta["age"] = test_meta["age"].replace("45", "(38, 43)")
test_meta["age"] = test_meta["age"].replace("32", "(25, 32)")

In [338]:
trvl_meta["gender"].value_counts()
test_meta["gender"].value_counts()

f    1848
m    1597
u     286
Name: gender, dtype: int64

In [339]:
test_meta["age"].value_counts()

(25, 32)     1095
(4, 6)        570
(38, 43)      508
(0, 2)        483
(8, 12)       340
(60, 100)     274
(48, 53)      252
(15, 20)      227
(38, 48)        5
Name: age, dtype: int64

### NaN genders

In [340]:
NaN_gender_trvl = trvl_meta["gender"].notnull()
NaN_gender_trvl.value_counts()

True     14820
False       48
Name: gender, dtype: int64

In [341]:
NaN_gender_test = test_meta["gender"].notnull()
NaN_gender_test.value_counts()

True     3731
False      23
Name: gender, dtype: int64

Since the number of NaN genders is quite small, there's no problem with dropping those as well.

In [342]:
test_meta = test_meta[NaN_gender_test]
trvl_meta = trvl_meta[NaN_gender_trvl]

Checking the value counts again as a sanity check.

In [343]:
NaN_gender_trvl = trvl_meta["gender"].notnull()
NaN_gender_trvl.value_counts()

True    14820
Name: gender, dtype: int64

In [344]:
NaN_gender_test = test_meta["gender"].notnull()
NaN_gender_test.value_counts()

True    3731
Name: gender, dtype: int64

Perfect! The dataset has been fully curated.

## Generating k-folds of train and validation splits

In this section we create K stratified training and validation splits out of the complement of the testing set. We set K to 5, as that is the number of splits suggested by the Adience Benchmark README file. The splits are drawn with scikit-learn's StratifiedShuffleSplit, which produces K independent stratified random splits, so the validation sets of different splits may overlap.

Let us take this moment to discuss proportions. Each fold of the original, unprocessed dataset contains roughly the same amount of data. Given that there are a total of 5 folds and one of them is reserved for testing, roughly 20% of the dataset is used for testing. Out of the remaining 80%, we reserve 25% of each split for validation, which amounts to 20% of the full dataset. This means that the dataset is divided roughly according to the following proportions.

<ul>
<li><b>Testing:</b> 20%</li>
<li><b>Validation:</b> 20%</li>
<li><b>Training:</b> 60%</li>
</ul>
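The arithmetic behind these percentages can be checked with a few lines. This is only a worked example; the variable names are chosen for illustration, and exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction

total_folds = 5                     # Adience ships five folds
test_folds = 1                      # one fold is held out for testing
validation_prop = Fraction(1, 4)    # share of the remaining data kept for validation

test_share = Fraction(test_folds, total_folds)
remaining = 1 - test_share
valid_share = remaining * validation_prop
train_share = remaining - valid_share

print("test=%s validation=%s training=%s" % (test_share, valid_share, train_share))
```

The three shares come out as 1/5, 1/5 and 3/5 of the full dataset, which matches the 20% / 20% / 60% breakdown listed above.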

This configuration is not accidental: we've chosen these proportions because they're common practice.

In [345]:
num_splits = 5
validation_prop = 0.25

In [346]:
sss = StratifiedShuffleSplit(n_splits=num_splits, test_size=validation_prop, random_state=0)

### Prepping the test complement set

For the stratified splitter to work, every class needs to appear at least twice in the data being split. That is currently not the case for every age label. Observe below the current frequency of targets for age.

In [347]:
trvl_meta["age"].value_counts()

(25, 32)     4386
(0, 2)       2026
(8, 13)      1947
(38, 43)     1847
(4, 6)       1570
(15, 20)     1564
(48, 53)      732
(60, 100)     622
(27, 32)       77
(38, 42)       47
(38, 48)        1
(8, 23)         1
Name: age, dtype: int64

As one can observe, (8, 23) and (38, 48) appear only once. We'll artificially force them to appear twice by literally duplicating their entries.

In [348]:
query1 = trvl_meta["age"] == "(8, 23)"
query2 = trvl_meta["age"] == "(38, 48)"
trvl_meta[query1 | query2]

Unnamed: 0,user_id,face_id,original_image,gender,age
2711,9017386@N06,206,11793675354_e1761c4c06_o.jpg,m,"(38, 48)"
8207,35953373@N04,905,9494119869_174fb904e7_o.jpg,m,"(8, 23)"


In [349]:
trvl_meta = pd.concat([trvl_meta, trvl_meta[query1 | query2]], ignore_index=True)
trvl_meta["age"].value_counts()

(25, 32)     4386
(0, 2)       2026
(8, 13)      1947
(38, 43)     1847
(4, 6)       1570
(15, 20)     1564
(48, 53)      732
(60, 100)     622
(27, 32)       77
(38, 42)       47
(38, 48)        2
(8, 23)         2
Name: age, dtype: int64

Now we're good to go! 
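The duplication step can be generalized so that the rare labels do not have to be named by hand. The following is a minimal sketch on plain Python tuples; the function name `duplicate_singletons` and the toy rows are assumptions made for this example, not part of the notebook's pipeline.

```python
from collections import Counter

def duplicate_singletons(rows, label_of):
    """Append a copy of every row whose label occurs exactly once."""
    counts = Counter(label_of(row) for row in rows)
    extras = [row for row in rows if counts[label_of(row)] == 1]
    return rows + extras

# Toy rows of (image, age label); hypothetical data
rows = [
    ("a.jpg", "(25, 32)"),
    ("b.jpg", "(25, 32)"),
    ("c.jpg", "(8, 23)"),
]

fixed = duplicate_singletons(rows, label_of=lambda row: row[1])
print(Counter(label for _, label in fixed))
```

One thing to keep in mind with this trick is that the two copies of a duplicated row can land on opposite sides of a training-validation split.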

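Here I'll introduce a minor but necessary fix to the dataset. The age labels will be used as directory names, and parentheses, spaces and commas in a path have to be escaped when the path is passed to a shell command. To avoid that, I'll replace the parentheses in the age column with '_', replace the spaces with '-' and drop the commas.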

In [350]:
trvl_meta["age"] = trvl_meta["age"].str.replace("(", "_")
trvl_meta["age"] = trvl_meta["age"].str.replace(")", "_")
trvl_meta["age"] = trvl_meta["age"].str.replace(" ", "-")
trvl_meta["age"] = trvl_meta["age"].str.replace(",", "")
trvl_meta["age"].value_counts()

_25-32_     4386
_0-2_       2026
_8-13_      1947
_38-43_     1847
_4-6_       1570
_15-20_     1564
_48-53_      732
_60-100_     622
_27-32_       77
_38-42_       47
_38-48_        2
_8-23_         2
Name: age, dtype: int64
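The four chained replacements can be read as a single label-cleaning function. Below is a small sketch of that transformation on plain strings; the name `sanitize_label` is hypothetical and the function is only meant to show what the cell does to one label.

```python
def sanitize_label(label):
    """Turn a label such as "(25, 32)" into a shell-friendly directory name."""
    for old, new in (("(", "_"), (")", "_"), (" ", "-"), (",", "")):
        label = label.replace(old, new)
    return label

for raw in ["(25, 32)", "(0, 2)", "(60, 100)"]:
    print("%s -> %s" % (raw, sanitize_label(raw)))
```

Note the order of the steps: the space is turned into '-' before the comma is removed, which is why "(25, 32)" ends up as "_25-32_".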

In [351]:
test_meta["age"] = test_meta["age"].str.replace("(", "_")
test_meta["age"] = test_meta["age"].str.replace(")", "_")
test_meta["age"] = test_meta["age"].str.replace(" ", "-")
test_meta["age"] = test_meta["age"].str.replace(",", "")
test_meta["age"].value_counts()

_25-32_     1072
_4-6_        570
_38-43_      508
_0-2_        483
_8-12_       340
_60-100_     274
_48-53_      252
_15-20_      227
_38-48_        5
Name: age, dtype: int64

Next up, since we'll be creating models for various classification tasks (predicting gender, age and gender+age), we need to make sure that our dataset has all those targets as columns. We're missing the column for gender+age. That column is created below.

In [352]:
trvl_meta["gender_age"] = trvl_meta[["gender","age"]].apply(lambda x: x[0]+"_"+x[1], axis=1)
trvl_meta.head()

Unnamed: 0,user_id,face_id,original_image,gender,age,gender_age
0,30601258@N03,1,10399646885_67c7d20df9_o.jpg,f,_25-32_,f__25-32_
1,30601258@N03,2,10424815813_e94629b1ec_o.jpg,m,_25-32_,m__25-32_
2,30601258@N03,1,10437979845_5985be4b26_o.jpg,f,_25-32_,f__25-32_
3,30601258@N03,3,10437979845_5985be4b26_o.jpg,m,_25-32_,m__25-32_
4,30601258@N03,2,11816644924_075c3d8d59_o.jpg,m,_25-32_,m__25-32_


This column will suffer from the same problem as age: too few instances of a given class. We fix that below.

In [353]:
query1 = trvl_meta["gender_age"] == "u__4-6_"
query2 = trvl_meta["gender_age"] == "u__60-100_"
# Duplicate the rows of the under-represented gender_age classes
trvl_meta = pd.concat([trvl_meta, trvl_meta[query1 | query2]], ignore_index=True)

In [354]:
trvl_meta_X     = trvl_meta
trvl_meta_X_arr = trvl_meta_X.values

In [355]:
trvl_meta_Y_gender  = trvl_meta["gender"].values
trvl_meta_Y_age     = trvl_meta["age"].values
trvl_meta_Y_both    = trvl_meta["gender_age"].values

### Splits for gender-only classification

In [356]:
from sklearn.model_selection import StratifiedShuffleSplit
X = trvl_meta_X_arr
y = trvl_meta_Y_gender
sss = StratifiedShuffleSplit(n_splits=num_splits, test_size=validation_prop, random_state=0)
sss.get_n_splits(X, y)

fold_count   = 1
fold_container_gender = []
for train_index, test_index in sss.split(X, y):
    print("FOLD #: %d" % fold_count)
    print("TRAIN      : %s" % (train_index,))
    print("VALIDATION : %s" % (test_index,))
    print("=========================================================")
    fold_count += 1
    X_train, X_valid = X[train_index], X[test_index]
    fold_container_gender.append([X_train,X_valid])

FOLD #: 1
TRAIN      : [14058 14148  7913 ...,  9402  1012 10739]
VALIDATION : [ 4944 11747  8268 ..., 11517  5573 12257]
FOLD #: 2
TRAIN      : [9111 3769 7718 ..., 5112 7374 8025]
VALIDATION : [10198  9480  2627 ...,   682  6724    83]
FOLD #: 3
TRAIN      : [ 4346 13689  2784 ...,  8527 12635  2697]
VALIDATION : [13686 11767 10055 ..., 12792  2932  9481]
FOLD #: 4
TRAIN      : [ 3050  1821  3412 ..., 10572  6473  8054]
VALIDATION : [12649  8025  4613 ...,  6274 12443  5326]
FOLD #: 5
TRAIN      : [13269  1789 11673 ..., 11193  3038 10356]
VALIDATION : [ 7568 11181 10662 ..., 11991  5513 14254]


### Splits for age-only classification

In [357]:
from sklearn.model_selection import StratifiedShuffleSplit
X = trvl_meta_X_arr
y = trvl_meta_Y_age
sss = StratifiedShuffleSplit(n_splits=num_splits, test_size=validation_prop, random_state=0)
sss.get_n_splits(X, y)

fold_count   = 1
fold_container_age = []
for train_index, test_index in sss.split(X, y):
    print("FOLD #: %d" % fold_count)
    print("TRAIN      : %s" % (train_index,))
    print("VALIDATION : %s" % (test_index,))
    print("=========================================================")
    fold_count += 1
    X_train, X_valid = X[train_index], X[test_index]
    fold_container_age.append([X_train,X_valid])

FOLD #: 1
TRAIN      : [ 5225 12740   242 ..., 10132  9003   349]
VALIDATION : [11401  7229 13112 ..., 13433  8833 13396]
FOLD #: 2
TRAIN      : [ 6221  8654 13436 ...,  8207  6387  1063]
VALIDATION : [ 7379  9455 13714 ...,  6238  1943  4456]
FOLD #: 3
TRAIN      : [ 1207 10318 10719 ...,  9518  1574 12373]
VALIDATION : [10324 12737  1321 ..., 14414  8921 10820]
FOLD #: 4
TRAIN      : [ 7749   565 10917 ..., 14541 12886  5930]
VALIDATION : [  187 12588  9030 ...,   985  4767 10622]
FOLD #: 5
TRAIN      : [ 5947  9990  8735 ..., 14081  9244 11987]
VALIDATION : [ 3346 13372  1992 ...,  8364  3751  3497]


### Splits for age and gender classification

In [358]:
from sklearn.model_selection import StratifiedShuffleSplit
X = trvl_meta_X_arr
y = trvl_meta_Y_both
sss = StratifiedShuffleSplit(n_splits=num_splits, test_size=validation_prop, random_state=0)
sss.get_n_splits(X, y)

fold_count   = 1
fold_container_both = []
for train_index, test_index in sss.split(X, y):
    print("FOLD #: %d" % fold_count)
    print("TRAIN      : %s" % (train_index,))
    print("VALIDATION : %s" % (test_index,))
    print("=========================================================")
    fold_count += 1
    X_train, X_valid = X[train_index], X[test_index]
    fold_container_both.append([X_train, X_valid])

FOLD #: 1
TRAIN      : [ 6143  5890 11993 ...,  3810 12742  2735]
VALIDATION : [12208  7006 13771 ..., 10191  2029  6094]
FOLD #: 2
TRAIN      : [ 6114 14444  5450 ...,  6394 12621 11314]
VALIDATION : [8201 5276 2142 ..., 6858 3188 9070]
FOLD #: 3
TRAIN      : [ 1749  9266  4266 ..., 12381  8570 10812]
VALIDATION : [ 2793  8145  7833 ..., 12421  7752  6372]
FOLD #: 4
TRAIN      : [11124 10215  1761 ...,  1389  6741  9917]
VALIDATION : [ 1766  3151  8043 ..., 13527  8929  1764]
FOLD #: 5
TRAIN      : [4974 2204 5790 ..., 8007 7012 2252]
VALIDATION : [  489 11032  1414 ...,  1818  3856  4556]


## Generating dataframes for each classification task

In [359]:
headers = ["user_id","face_id","original_image","gender","age","gender_age"]

In [360]:
gender_fold1_train = pd.DataFrame(fold_container_gender[0][0], columns=headers)
gender_fold1_valid = pd.DataFrame(fold_container_gender[0][1], columns=headers)

gender_fold2_train = pd.DataFrame(fold_container_gender[1][0], columns=headers)
gender_fold2_valid = pd.DataFrame(fold_container_gender[1][1], columns=headers)

gender_fold3_train = pd.DataFrame(fold_container_gender[2][0], columns=headers)
gender_fold3_valid = pd.DataFrame(fold_container_gender[2][1], columns=headers)

gender_fold4_train = pd.DataFrame(fold_container_gender[3][0], columns=headers)
gender_fold4_valid = pd.DataFrame(fold_container_gender[3][1], columns=headers)

gender_fold5_train = pd.DataFrame(fold_container_gender[4][0], columns=headers)
gender_fold5_valid = pd.DataFrame(fold_container_gender[4][1], columns=headers)

In [361]:
age_fold1_train = pd.DataFrame(fold_container_age[0][0], columns=headers)
age_fold1_valid = pd.DataFrame(fold_container_age[0][1], columns=headers)

age_fold2_train = pd.DataFrame(fold_container_age[1][0], columns=headers)
age_fold2_valid = pd.DataFrame(fold_container_age[1][1], columns=headers)

age_fold3_train = pd.DataFrame(fold_container_age[2][0], columns=headers)
age_fold3_valid = pd.DataFrame(fold_container_age[2][1], columns=headers)

age_fold4_train = pd.DataFrame(fold_container_age[3][0], columns=headers)
age_fold4_valid = pd.DataFrame(fold_container_age[3][1], columns=headers)

age_fold5_train = pd.DataFrame(fold_container_age[4][0], columns=headers)
age_fold5_valid = pd.DataFrame(fold_container_age[4][1], columns=headers)

In [362]:
both_fold1_train = pd.DataFrame(fold_container_both[0][0], columns=headers)
both_fold1_valid = pd.DataFrame(fold_container_both[0][1], columns=headers)

both_fold2_train = pd.DataFrame(fold_container_both[1][0], columns=headers)
both_fold2_valid = pd.DataFrame(fold_container_both[1][1], columns=headers)

both_fold3_train = pd.DataFrame(fold_container_both[2][0], columns=headers)
both_fold3_valid = pd.DataFrame(fold_container_both[2][1], columns=headers)

both_fold4_train = pd.DataFrame(fold_container_both[3][0], columns=headers)
both_fold4_valid = pd.DataFrame(fold_container_both[3][1], columns=headers)

both_fold5_train = pd.DataFrame(fold_container_both[4][0], columns=headers)
both_fold5_valid = pd.DataFrame(fold_container_both[4][1], columns=headers)

## Augment Dataframes to contain the current image path and the image path to generate

In [363]:
img_path         = "../data/face_image_project/aligned/%s/landmark_aligned_face.%d.%s"

keras_gender_train_path = "../data/face_image_project/keras_format/gender/%d/train/%s/%d.jpg"
keras_gender_valid_path = "../data/face_image_project/keras_format/gender/%d/valid/%s/%d.jpg"

keras_age_train_path = "../data/face_image_project/keras_format/age/%d/train/%s/%d.jpg"
keras_age_valid_path = "../data/face_image_project/keras_format/age/%d/valid/%s/%d.jpg"

keras_both_train_path = "../data/face_image_project/keras_format/both/%d/train/%s/%d.jpg"
keras_both_valid_path = "../data/face_image_project/keras_format/both/%d/valid/%s/%d.jpg"

relevant_cols = ["user_id","face_id","original_image","gender","age","gender_age"]

### Image path

In [364]:
gender_fold1_train["img_path"] = gender_fold1_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
gender_fold1_valid["img_path"] = gender_fold1_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

gender_fold2_train["img_path"] = gender_fold2_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
gender_fold2_valid["img_path"] = gender_fold2_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

gender_fold3_train["img_path"] = gender_fold3_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
gender_fold3_valid["img_path"] = gender_fold3_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

gender_fold4_train["img_path"] = gender_fold4_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
gender_fold4_valid["img_path"] = gender_fold4_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

gender_fold5_train["img_path"] = gender_fold5_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
gender_fold5_valid["img_path"] = gender_fold5_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

In [365]:
age_fold1_train["img_path"] = age_fold1_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
age_fold1_valid["img_path"] = age_fold1_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

age_fold2_train["img_path"] = age_fold2_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
age_fold2_valid["img_path"] = age_fold2_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

age_fold3_train["img_path"] = age_fold3_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
age_fold3_valid["img_path"] = age_fold3_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

age_fold4_train["img_path"] = age_fold4_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
age_fold4_valid["img_path"] = age_fold4_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

age_fold5_train["img_path"] = age_fold5_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
age_fold5_valid["img_path"] = age_fold5_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

In [366]:
both_fold1_train["img_path"] = both_fold1_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
both_fold1_valid["img_path"] = both_fold1_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

both_fold2_train["img_path"] = both_fold2_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
both_fold2_valid["img_path"] = both_fold2_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

both_fold3_train["img_path"] = both_fold3_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
both_fold3_valid["img_path"] = both_fold3_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

both_fold4_train["img_path"] = both_fold4_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
both_fold4_valid["img_path"] = both_fold4_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

both_fold5_train["img_path"] = both_fold5_train[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)
both_fold5_valid["img_path"] = both_fold5_valid[relevant_cols].apply(lambda x: img_path % (x[0],x[1],x[2]), axis=1)

### Keras path

In [367]:
gender_fold1_train["keras_path"] = gender_fold1_train[relevant_cols].apply(lambda x: keras_gender_train_path % (1, x[3],x.name), axis=1)
gender_fold1_valid["keras_path"] = gender_fold1_valid[relevant_cols].apply(lambda x: keras_gender_valid_path % (1, x[3],x.name), axis=1)

gender_fold2_train["keras_path"] = gender_fold2_train[relevant_cols].apply(lambda x: keras_gender_train_path % (2, x[3],x.name), axis=1)
gender_fold2_valid["keras_path"] = gender_fold2_valid[relevant_cols].apply(lambda x: keras_gender_valid_path % (2, x[3],x.name), axis=1)

gender_fold3_train["keras_path"] = gender_fold3_train[relevant_cols].apply(lambda x: keras_gender_train_path % (3, x[3],x.name), axis=1)
gender_fold3_valid["keras_path"] = gender_fold3_valid[relevant_cols].apply(lambda x: keras_gender_valid_path % (3, x[3],x.name), axis=1)

gender_fold4_train["keras_path"] = gender_fold4_train[relevant_cols].apply(lambda x: keras_gender_train_path % (4, x[3],x.name), axis=1)
gender_fold4_valid["keras_path"] = gender_fold4_valid[relevant_cols].apply(lambda x: keras_gender_valid_path % (4, x[3],x.name), axis=1)

gender_fold5_train["keras_path"] = gender_fold5_train[relevant_cols].apply(lambda x: keras_gender_train_path % (5, x[3],x.name), axis=1)
gender_fold5_valid["keras_path"] = gender_fold5_valid[relevant_cols].apply(lambda x: keras_gender_valid_path % (5, x[3],x.name), axis=1)

In [368]:
age_fold1_train["keras_path"] = age_fold1_train[relevant_cols].apply(lambda x: keras_age_train_path % (1, x[4],x.name), axis=1)
age_fold1_valid["keras_path"] = age_fold1_valid[relevant_cols].apply(lambda x: keras_age_valid_path % (1, x[4],x.name), axis=1)

age_fold2_train["keras_path"] = age_fold2_train[relevant_cols].apply(lambda x: keras_age_train_path % (2, x[4],x.name), axis=1)
age_fold2_valid["keras_path"] = age_fold2_valid[relevant_cols].apply(lambda x: keras_age_valid_path % (2, x[4],x.name), axis=1)

age_fold3_train["keras_path"] = age_fold3_train[relevant_cols].apply(lambda x: keras_age_train_path % (3, x[4],x.name), axis=1)
age_fold3_valid["keras_path"] = age_fold3_valid[relevant_cols].apply(lambda x: keras_age_valid_path % (3, x[4],x.name), axis=1)

age_fold4_train["keras_path"] = age_fold4_train[relevant_cols].apply(lambda x: keras_age_train_path % (4, x[4],x.name), axis=1)
age_fold4_valid["keras_path"] = age_fold4_valid[relevant_cols].apply(lambda x: keras_age_valid_path % (4, x[4],x.name), axis=1)

age_fold5_train["keras_path"] = age_fold5_train[relevant_cols].apply(lambda x: keras_age_train_path % (5, x[4],x.name), axis=1)
age_fold5_valid["keras_path"] = age_fold5_valid[relevant_cols].apply(lambda x: keras_age_valid_path % (5, x[4],x.name), axis=1)

In [369]:
both_fold1_train["keras_path"] = both_fold1_train[relevant_cols].apply(lambda x: keras_both_train_path % (1, x[5],x.name), axis=1)
both_fold1_valid["keras_path"] = both_fold1_valid[relevant_cols].apply(lambda x: keras_both_valid_path % (1, x[5],x.name), axis=1)

both_fold2_train["keras_path"] = both_fold2_train[relevant_cols].apply(lambda x: keras_both_train_path % (2, x[5],x.name), axis=1)
both_fold2_valid["keras_path"] = both_fold2_valid[relevant_cols].apply(lambda x: keras_both_valid_path % (2, x[5],x.name), axis=1)

both_fold3_train["keras_path"] = both_fold3_train[relevant_cols].apply(lambda x: keras_both_train_path % (3, x[5],x.name), axis=1)
both_fold3_valid["keras_path"] = both_fold3_valid[relevant_cols].apply(lambda x: keras_both_valid_path % (3, x[5],x.name), axis=1)

both_fold4_train["keras_path"] = both_fold4_train[relevant_cols].apply(lambda x: keras_both_train_path % (4, x[5],x.name), axis=1)
both_fold4_valid["keras_path"] = both_fold4_valid[relevant_cols].apply(lambda x: keras_both_valid_path % (4, x[5],x.name), axis=1)

both_fold5_train["keras_path"] = both_fold5_train[relevant_cols].apply(lambda x: keras_both_train_path % (5, x[5],x.name), axis=1)
both_fold5_valid["keras_path"] = both_fold5_valid[relevant_cols].apply(lambda x: keras_both_valid_path % (5, x[5],x.name), axis=1)
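The path templates above all follow the same pattern: task, fold number, split name, class label and row index. A single helper makes that layout explicit. This is a sketch only; `KERAS_ROOT` and `keras_path` are names introduced for illustration and are not used elsewhere in the notebook.

```python
KERAS_ROOT = "../data/face_image_project/keras_format"

def keras_path(task, fold, split, label, row_id):
    """Build <root>/<task>/<fold>/<split>/<label>/<row_id>.jpg"""
    return "%s/%s/%d/%s/%s/%d.jpg" % (KERAS_ROOT, task, fold, split, label, row_id)

# One directory per class label, which is the layout Keras expects
print(keras_path("gender", 1, "train", "m", 9))
print(keras_path("age", 3, "valid", "_25-32_", 120))
print(keras_path("both", 5, "train", "f__25-32_", 0))
```

Because each class label becomes its own subdirectory under `train` or `valid`, a directory-based loader can infer the class of an image from the folder it sits in.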

In [395]:
def prep_adience():
    all_gender_train = [gender_fold1_train,gender_fold2_train,gender_fold3_train,gender_fold4_train,gender_fold5_train]
    all_gender_valid = [gender_fold1_valid,gender_fold2_valid,gender_fold3_valid,gender_fold4_valid,gender_fold5_valid]
    
    all_age_train = [age_fold1_train,age_fold2_train,age_fold3_train,age_fold4_train,age_fold5_train]
    all_age_valid = [age_fold1_valid,age_fold2_valid,age_fold3_valid,age_fold4_valid,age_fold5_valid]

    all_both_train = [both_fold1_train,both_fold2_train,both_fold3_train,both_fold4_train,both_fold5_train]
    all_both_valid = [both_fold1_valid,both_fold2_valid,both_fold3_valid,both_fold4_valid,both_fold5_valid]
    
    full_meta = pd.concat(all_gender_train + all_gender_valid + all_age_train + all_age_valid + all_both_train + all_both_valid)

    for index, row in full_meta.iterrows():

        mkdir_template = "mkdir -p %s"
        command_template = "cp -p %s %s"
        # Create the parent (class) directory, not the file path itself,
        # so that cp writes a regular image file at keras_path
        wkspFldr = os.path.dirname(row["keras_path"])
        command0 = mkdir_template % wkspFldr
        command1 = command_template % (row["img_path"], row["keras_path"])

        os.system(command0)
        os.system(command1)


In [None]:
prep_adience()
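The same copy step can be done without shelling out, which avoids any quoting issues in paths. The sketch below uses only `os` and `shutil` and runs in a throwaway directory with a fake image file; `copy_into_layout` is a hypothetical helper name, not something defined earlier in this notebook.

```python
import os
import shutil
import tempfile

def copy_into_layout(src_path, dst_path):
    """Copy one file, creating the destination class directory if needed."""
    dst_dir = os.path.dirname(dst_path)
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    # copy2 also preserves timestamps, similar to `cp -p`
    shutil.copy2(src_path, dst_path)

# Demonstration in a temporary directory with a placeholder "image"
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "landmark_aligned_face.1.example.jpg")
with open(src, "wb") as handle:
    handle.write(b"not a real image")

dst = os.path.join(workdir, "keras_format", "gender", "1", "train", "f", "0.jpg")
copy_into_layout(src, dst)
print(os.path.isfile(dst))
```

The temporary directory is left in place so the result can be inspected; `shutil.rmtree(workdir)` removes it afterwards.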

## 2. Testing Image Pre-Processing tools

In [None]:
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

datagen = ImageDataGenerator(
        rotation_range=40,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest')

img = load_img("../data/face_image_project/keras_format/train/m/9.jpg")
img

In [None]:
x = img_to_array(img)
x = x.reshape((1,) + x.shape)

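# The preview directory must exist before the generator can save into it
if not os.path.isdir("../data/preview"):
    os.makedirs("../data/preview")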

i = 0
for batch in datagen.flow(x, batch_size=1,
                          save_to_dir='../data/preview', save_prefix='augmented_', save_format='jpeg'):
    i += 1
    if i > 5:
        break  # otherwise the generator would loop indefinitely

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

from IPython.display import display, Image

from glob import glob
import PIL.Image
images = [ PIL.Image.open(f) for f in glob('../data/preview/*') ]

def img2array(im):
    if im.mode != 'RGB':
        im = im.convert(mode='RGB')
    return np.frombuffer(im.tobytes(), dtype='uint8').reshape((im.size[1], im.size[0], 3))

np_images = [ img2array(im) for im in images ]



for img in np_images:
    plt.figure()
    plt.imshow(img)

