# DATA PRE-PROCESSING

## First visual contact with the data

To get started, we do a first visual inspection of the data. This gives us a feel for the type of outliers we are dealing with and, eventually, for the best methods to eliminate them.

It is clear from the data that most images contain faces (at least the ones we are interested in). The images are all the same size (256x256) and all have three RGB channels (256x256x3). This is actually quite large: if we include all channels, we have a total of 196608 features per example (per image), which, thinking ahead, will lead to a large number of parameters. So we will definitely have to be careful when designing our future Neural Networks or Convolutional Neural Networks.
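To make the size argument concrete, the short calculation below reproduces the feature count quoted above and compares it with a single-channel (grayscale) version of the same image. The variable names are illustrative only.

```python
# Feature count per image for the sizes quoted above.
height, width, channels = 256, 256, 3

rgb_features = height * width * channels   # all three colour channels
gray_features = height * width             # one grayscale channel

print(rgb_features)   # 196608
print(gray_features)  # 65536
```

Working in grayscale, as the detection code further down does, cuts the input size by a factor of three.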

From the dataset it is clear that we have a few interesting outliers: mainly grey, occasionally mountainous scenery and blue/grey sky. Some faces are also corrupted by watermarks. The main outliers and strange cases spotted at first glance are the following:

- 600.png --> lake, mountain with clouds
- 639.png --> very orangey
- 596.png --> black, weird stuff going on here
- 663.png --> trickier case, has quite a few orangey colours
- 665.png --> might be eliminated if not careful (perhaps grey coding might solve this)
- 650.png --> empty image
- 574.png --> partially filled
- 535.png --> distorted data (w/ face)
- 499.png --> highly pixelated data
- 72.png --> destroyed nasal feature
- 390.png, 118.png --> Watermarks (right on the face)
- 206.png --> slightly tilted (might make us wonder if there are any upside down faces)
- 291.png --> hand perturbation
- 302.png --> noisy RGB background

At this point I'm starting to think about data augmentation, since I'm noticing quite a few peculiar cases and especially quite a few outliers, which might reduce the size of our data considerably.



Honestly, the first thing that comes to mind when looking for a way to eliminate outliers in this particular image-based dataset is to use basic face detectors, which are widely available in libraries such as OpenCV or dlib. I was also fortunate enough to have played quite a bit with face and emotion recognition in general; my code for both Fisherface-based emotion recognition and landmark-based emotion recognition can be found on my GitHub: https://github.com/brunocalogero/individual_study and https://github.com/brunocalogero/ECE420FinalProject

It is also our lucky day, since at first glance not many images are upside down or have bizarre orientations, and not too many faces are tilted. In fact most of them are purely frontal, which is ideal for a HOG-based detector: it does not cope well with unusual orientations but does a great job with frontal faces.

- Hence one obvious approach is to go for the classic HOG (Histogram of Oriented Gradients) + SVM combo.
- Other statistical methods could be envisioned but might not give as good a result.
- We could also consider the well-known `cvlib`, which detects faces at most angles and in real time in a more efficient manner.
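To give an idea of what a HOG descriptor measures, here is a minimal NumPy sketch of the orientation histogram for a single cell. It is a simplification written for illustration: the function name `cell_orientation_histogram` and the choice of 9 bins over 0-180 degrees are assumptions, and it is not dlib's implementation. A full HOG pipeline additionally normalises neighbouring cells and feeds the resulting descriptor to a classifier.

```python
import numpy as np

def cell_orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    cell = np.asarray(cell, dtype=np.float64)
    gy, gx = np.gradient(cell)                 # gradients along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    # unsigned orientation in [0, 180) degrees
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bin_width = 180.0 / n_bins
    # clamp guards against a rounding result of exactly 180.0
    bins = np.minimum((angle // bin_width).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())
    return hist

# an 8x8 cell whose intensity increases from left to right
ramp = np.tile(np.arange(8.0), (8, 1))
print(cell_orientation_histogram(ramp))
```

For the left-to-right ramp all of the gradient energy lands in the first bin. A frontal face produces a fairly stable pattern of such histograms, which is what the classifier learns to recognise.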

In [18]:
import os
import cv2
import dlib
import time
import PIL.Image

import numpy as np
import datetime as dt
import pandas as pd

from keras.preprocessing import image

# Path to all images and name of the labels file
images_dir = './dataset'
labels_filename = 'attribute_list.csv'

hog_face_detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

In [19]:
def run_dlib_hog(image):
    """
    Return True if the dlib HOG detector finds no face (i.e. the image is an outlier).

    image: grayscale image array holding 256*256 values (e.g. shape (256, 256, 1))
    """

    # cast to uint8 and reshape to a 2-D 256x256 array
    resized_image = image.astype('uint8')
    resized_image_256 = resized_image.reshape(256, 256)

    # Optional: CLAHE histogram equalisation to increase contrast and reveal more
    # features on over-exposed or saturated pictures (left disabled here)
#     clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
#     resized_image_256 = clahe.apply(resized_image_256)

    # detect faces in the grayscale image, upsampling it once
    rects = hog_face_detector(resized_image_256, 1)

    # no detection means the image is flagged as an outlier
    return len(rects) == 0

In [28]:
# populating array of outlier images
outliers = []

# go through the data and collect the images in which no face was detected
for (root, dirs, dat_files) in os.walk('{0}'.format(images_dir)):
    # sanity check to see that we will indeed iterate over all the files
    print(len(dat_files))
    start_time = dt.datetime.now()
    print('Start learning at {}'.format(str(start_time)))
    for file in dat_files:
        # Gray scale image to avoid colour messing up our detections (used with 0 parameter to avoid error)
        img = cv2.imread('{0}/{1}'.format(images_dir, file), 0)
        img_array = image.img_to_array(img)
        outlier = run_dlib_hog(img_array)
        if outlier:
            print('outlier detected: {0}'.format(file))
            # store detected outliers filename as int in list 
            outliers.append(int(file[:-4]))
            
end_time = dt.datetime.now()
print('Stop learning {}'.format(str(end_time)))
elapsed_time= end_time - start_time
print('Elapsed learning {}'.format(str(elapsed_time)))

print(outliers)        

5000
Start learning at 2018-12-13 15:10:02.317813
outlier detected: 1029.png
outlier detected: 1031.png
outlier detected: 1036.png
outlier detected: 1060.png
outlier detected: 1064.png
outlier detected: 1065.png
outlier detected: 1080.png
outlier detected: 1085.png
outlier detected: 1094.png
outlier detected: 11.png
outlier detected: 1100.png
outlier detected: 1114.png
outlier detected: 1124.png
outlier detected: 1133.png
outlier detected: 1135.png
outlier detected: 1136.png
outlier detected: 1140.png
outlier detected: 1147.png
outlier detected: 1148.png
outlier detected: 1149.png
outlier detected: 1153.png
outlier detected: 1164.png
outlier detected: 1167.png
outlier detected: 1183.png
outlier detected: 1205.png
outlier detected: 1208.png
outlier detected: 1222.png
outlier detected: 1229.png
outlier detected: 123.png
outlier detected: 1235.png
outlier detected: 1237.png
outlier detected: 1239.png
outlier detected: 1241.png
outlier detected: 1248.png
outlier detected: 125.png
[output truncated]

outlier detected: 3448.png
outlier detected: 3452.png
outlier detected: 3458.png
outlier detected: 3461.png
outlier detected: 3489.png
outlier detected: 3503.png
outlier detected: 3516.png
outlier detected: 3518.png
outlier detected: 3525.png
outlier detected: 3533.png
outlier detected: 3545.png
outlier detected: 3564.png
outlier detected: 3566.png
outlier detected: 3574.png
outlier detected: 358.png
outlier detected: 3580.png
outlier detected: 3587.png
outlier detected: 359.png
outlier detected: 3596.png
outlier detected: 3609.png
outlier detected: 362.png
outlier detected: 3620.png
outlier detected: 3621.png
outlier detected: 3622.png
outlier detected: 3627.png
outlier detected: 364.png
outlier detected: 3652.png
outlier detected: 3654.png
outlier detected: 3660.png
outlier detected: 3667.png
outlier detected: 3669.png
outlier detected: 368.png
outlier detected: 3682.png
outlier detected: 3696.png
outlier detected: 3712.png
outlier detected: 3732.png
outlier detected: 3739.png
[output truncated]

In [29]:
print(len(outliers)) 

578
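The loop above turns a file name into an integer ID by slicing off the last four characters (`file[:-4]`), which silently assumes a three-letter extension. The printed order (1029, 1031, ..., 11, ...) also suggests that the files are visited in string order rather than numeric order. A slightly more defensive variant, sketched here with a hypothetical helper `image_id`, uses `os.path.splitext` and sorts numerically. This is an optional tidy-up, not something the results depend on.

```python
import os

def image_id(file_name):
    """Return the integer ID encoded in a file name such as '1029.png'."""
    stem, _ext = os.path.splitext(file_name)
    return int(stem)

file_names = ['1029.png', '11.png', '358.png', '3580.png']

# sort by numeric ID instead of by string
ordered = sorted(file_names, key=image_id)
print(ordered)  # ['11.png', '358.png', '1029.png', '3580.png']
```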


As a first estimate we have 578 outliers; let's see how accurate that actually is. The data is labelled, and if an image is an outlier (not a face), all of its attributes should be set to -1 in the labels CSV file. Since we also know the total size of our dataset (5000 images), this allows us to estimate the accuracy of our HOG classifier. Let's use pandas to import the CSV as a nice little DataFrame, because who likes dicts and numpy anyway, right? (We also need to make sure that the DataFrame is indexed by file_name, hence the parameters I am passing.)

I originally also used an equaliser, as can be seen in the code. With the equaliser we get 609 outliers, so more outliers are detected, but as can be seen later the equalisation has just added more cartoons to the outliers. In some sense it does help the classifier, since the latter is trained to detect human faces (not cartoons): equalisation would be useful if our dataset consisted solely of real human faces, where it would classify a lot better and eliminate more noisy, unwanted images such as cartoons. The idea behind histogram equalisation is to compensate for lighting issues in images. For example, a very brightly lit face may lose most of its contrast, so that facial features are not as apparent; equalisation in this case increases the contrast and reveals useful facial information to the classifier.
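To illustrate the contrast argument, here is a minimal NumPy sketch of plain global histogram equalisation. Note that this is not what the notebook uses: the code above relies on OpenCV's CLAHE, which equalises tile by tile with a clip limit. The function name `equalise_histogram` is made up for this example.

```python
import numpy as np

def equalise_histogram(gray):
    """Global histogram equalisation of a uint8 grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest level present
    denom = cdf[-1] - cdf_min
    if denom == 0:                     # constant image: nothing to stretch
        return gray.copy()
    # look-up table mapping each input level to its equalised level
    lut = np.round((cdf - cdf_min) / denom * 255).clip(0, 255).astype(np.uint8)
    return lut[gray]

# a low-contrast patch: four nearly identical grey levels
flat = np.array([[100, 101], [102, 103]], dtype=np.uint8)
print(equalise_histogram(flat))
```

The four nearly identical levels are spread over the full 0-255 range, which is the effect that can bring out facial features in washed-out pictures.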

In [20]:
df = pd.read_csv(labels_filename, skiprows=1, index_col='file_name')

In [21]:
print(df)

           hair_color  eyeglasses  smiling  young  human
file_name                                               
1                   1          -1        1      1     -1
2                   4          -1        1      1      1
3                   5          -1        1     -1     -1
4                  -1          -1       -1     -1     -1
5                  -1          -1       -1     -1     -1
6                  -1          -1       -1     -1     -1
7                   2          -1        1      1     -1
8                   3          -1        1      1     -1
9                   1           1        1      1     -1
10                  5          -1        1     -1     -1
11                 -1          -1       -1     -1     -1
12                  1          -1        1      1     -1
13                  3          -1       -1      1      1
14                  4          -1       -1      1      1
15                 -1          -1       -1     -1     -1
16                  5          

In [22]:
list_of_outliers = df.index[(df['hair_color'] == -1) & (df['eyeglasses'] == -1) & (df['smiling'] == -1) & (df['young'] == -1) & (df['human'] == -1)].tolist()
print(list_of_outliers)
print(len(list_of_outliers))

[4, 5, 6, 11, 15, 21, 27, 58, 67, 125, 129, 151, 167, 175, 193, 203, 207, 220, 222, 227, 248, 251, 253, 266, 289, 301, 305, 316, 324, 326, 341, 358, 359, 364, 368, 386, 387, 393, 415, 427, 432, 440, 449, 452, 466, 471, 503, 511, 512, 517, 539, 542, 548, 574, 575, 596, 600, 610, 625, 638, 639, 650, 663, 669, 692, 695, 711, 714, 718, 728, 731, 741, 748, 754, 762, 778, 779, 805, 813, 821, 824, 843, 862, 865, 868, 875, 876, 879, 893, 915, 931, 939, 952, 982, 983, 985, 989, 1031, 1036, 1064, 1065, 1080, 1094, 1100, 1114, 1124, 1133, 1135, 1140, 1147, 1148, 1149, 1164, 1167, 1183, 1205, 1222, 1229, 1235, 1237, 1239, 1248, 1285, 1300, 1312, 1318, 1319, 1335, 1336, 1337, 1338, 1345, 1370, 1381, 1393, 1400, 1403, 1421, 1451, 1506, 1508, 1529, 1530, 1533, 1539, 1545, 1546, 1568, 1572, 1580, 1603, 1613, 1626, 1629, 1645, 1649, 1671, 1682, 1702, 1716, 1723, 1729, 1738, 1783, 1792, 1811, 1815, 1873, 1903, 1909, 1913, 1933, 1935, 1955, 1965, 1966, 1978, 1989, 1992, 2011, 2037, 2040, 2053, 2056, 2058
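Since an outlier is defined as a row whose attributes are all -1, the same selection can be written without naming each column. The sketch below uses a tiny hand-made DataFrame (values copied from the first five rows printed above) rather than the real CSV.

```python
import pandas as pd

# first five rows of the label table shown above
toy = pd.DataFrame(
    {
        'hair_color': [1, 4, 5, -1, -1],
        'eyeglasses': [-1, -1, -1, -1, -1],
        'smiling':    [1, 1, 1, -1, -1],
        'young':      [1, 1, -1, -1, -1],
        'human':      [-1, 1, -1, -1, -1],
    },
    index=pd.Index([1, 2, 3, 4, 5], name='file_name'),
)

# a row is an outlier when every attribute equals -1
toy_outliers = toy.index[(toy == -1).all(axis=1)].tolist()
print(toy_outliers)  # [4, 5]
```

This form keeps working unchanged if attribute columns are added to or removed from the CSV.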

In [33]:
# calculate accuracies and images that were misclassified and consider using another facedetector
# list of real outliers not detected:
real_outliers_n_detect = list(set(list_of_outliers) - set(outliers))
print(real_outliers_n_detect)


[]


Great, so at least we detected all the real outliers. However, we also flagged images as outliers that actually are not; let's have a look:

In [34]:
non_real_outliers_detect = list(set(outliers) - set(list_of_outliers))
print(non_real_outliers_detect)
print(len(non_real_outliers_detect))

[1029, 3089, 3609, 540, 3102, 4130, 4642, 1060, 3621, 3622, 2604, 562, 1589, 4151, 567, 2107, 2108, 1085, 3136, 1615, 4177, 82, 3154, 2647, 3682, 3171, 2662, 2665, 4715, 623, 1136, 3696, 4724, 123, 1153, 2694, 4242, 4761, 161, 2210, 4260, 2213, 4263, 1703, 2732, 2739, 1208, 3257, 188, 3261, 707, 196, 1733, 4297, 4298, 3789, 4814, 720, 1241, 730, 734, 2271, 3807, 3298, 4836, 1256, 4841, 1771, 240, 241, 2802, 4852, 4341, 1795, 4872, 4881, 4884, 1813, 2329, 793, 2331, 797, 2334, 4382, 3873, 3362, 2854, 2343, 1833, 810, 1322, 309, 3897, 3911, 2892, 4948, 3420, 3421, 1890, 1892, 362, 4974, 3951, 4977, 2421, 3448, 377, 4478, 1409, 1412, 3979, 1423, 912, 913, 1940, 1429, 1944, 3992, 4507, 421, 1447, 1449, 1962, 2475, 1964, 444, 3518, 4545, 3014, 1479, 4553, 4042, 1485, 3533, 3545, 476, 3036, 2017, 1511, 3052, 496, 1524, 2559]
143
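With the two set differences above we have everything needed for the accuracy estimate mentioned earlier. The helper below is a sketch (the name `detector_metrics` is hypothetical) that treats "outlier" as the positive class. The example call uses synthetic IDs that only reproduce the counts of this run: 578 flagged images, no real outlier missed and 143 faces flagged wrongly, which implies 435 real outliers among the 5000 images.

```python
def detector_metrics(detected, true_outliers, n_total):
    """Confusion counts and summary scores, with 'outlier' as the positive class."""
    detected, true_outliers = set(detected), set(true_outliers)
    tp = len(detected & true_outliers)      # real outliers that were flagged
    fp = len(detected - true_outliers)      # faces wrongly flagged as outliers
    fn = len(true_outliers - detected)      # real outliers that were missed
    tn = n_total - tp - fp - fn             # faces correctly kept
    return {
        'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
        'accuracy': (tp + tn) / n_total,
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }

# synthetic IDs that only reproduce the counts of the HOG run
true_ids = range(435)
flagged_ids = range(578)
print(detector_metrics(flagged_ids, true_ids, n_total=5000))
```

On these counts the accuracy comes out at about 97.1%, with every real outlier caught and roughly three quarters of the flagged images being genuine outliers. With the real data, `outliers` and `list_of_outliers` can be passed in directly.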


In [35]:
# let's see some of these images (first 45):
list_im = []
# use image_id as the loop variable so the keras `image` module is not overwritten
for image_id in non_real_outliers_detect[:45]:
    list_im.append('{0}/{1}.png'.format(images_dir, str(image_id)))

imgs = [ PIL.Image.open(i) for i in list_im ]
# pick the image which is the smallest, and resize the others to match it (can be arbitrary image shape here)
min_shape = sorted( [(np.sum(i.size), i.size ) for i in imgs])[0][1]
# np.hstack expects a sequence of arrays, so build a list rather than a generator
imgs_comb = np.hstack( [np.asarray( i.resize(min_shape) ) for i in imgs] )

# save that beautiful picture
imgs_comb = PIL.Image.fromarray( imgs_comb)
imgs_comb.save( 'subplot_no_eq.jpg' )  
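As an aside on the plotting cell above: stacking 45 images side by side gives a very wide strip (45 x 256 pixels across if the images keep their original size). If a more compact picture is preferred, the same arrays can be arranged in rows and columns. The helper `make_grid` below is a hypothetical sketch that assumes all images share one shape; unused cells in the last row are left black.

```python
import numpy as np

def make_grid(images, n_cols):
    """Tile equally-shaped H x W x C arrays into a grid with n_cols columns."""
    images = [np.asarray(im) for im in images]
    h, w, c = images[0].shape
    n_rows = -(-len(images) // n_cols)          # ceiling division
    grid = np.zeros((n_rows * h, n_cols * w, c), dtype=images[0].dtype)
    for idx, im in enumerate(images):
        row, col = divmod(idx, n_cols)
        grid[row * h:(row + 1) * h, col * w:(col + 1) * w] = im
    return grid

# 45 dummy RGB tiles standing in for the real images
tiles = [np.full((256, 256, 3), i, dtype=np.uint8) for i in range(45)]
print(make_grid(tiles, n_cols=9).shape)  # (1280, 2304, 3)
```

The resulting array can be passed to `PIL.Image.fromarray` and saved in the same way as the horizontal strip.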

It is clear from the subplot that most of the faces our HOG classifier failed to detect are cartoon pictures. Moreover, these are mainly dark-skinned characters with no clear jawline, especially the bearded ones, which might have confused our HOG classifier, since it was trained on real people. (This face detector is made using a Histogram of Oriented Gradients (HOG) feature combined with a linear classifier, an image pyramid, and a sliding window detection scheme; it was trained on about 5000 real face images, not cartoons.) This explains why the undetected faces are mainly cartoons.

That said, we now have a few different options. The first would be to call it a day and just use what we have, as all the real outliers are gone and only 143 faces (mainly cartoons) have been eliminated wrongly; we could then reinsert these wrongly eliminated images, since we know exactly which ones they are. But that would be too easy, and there are still other things we can try: chain another HOG face detector on the images that were treated as outliers, or chain another type of face detector on that subset (Fisherfaces, Haar cascade face detectors, or even the pre-trained CNN face detector from dlib). These are also widely available and are not based on specific coordinate-based feature extraction but rather on the full image itself, so they might make our face detector a little less biased.

# Using a pre-trained CNN Face-detector


Let us run a pre-trained CNN face detector from dlib. It is 'pre-trained' since we are importing the weights of an already trained CNN model. In our case, these correspond to: http://dlib.net/files/mmod_human_face_detector.dat.bz2
It is known that the HOG classifier has a hard time with odd face angles and is much better at basic "frontal" detection. The CNN classifier should give better detections but, being a CNN, it is a more demanding model, and training it ourselves would be a much slower process. Let's see how it performs and whether it captures more of the dark-skinned cartoons that were not detected by our HOG classifier.

In [25]:
# initialize cnn based face detector with the weights
weights = 'mmod_human_face_detector.dat'
cnn_face_detector = dlib.cnn_face_detection_model_v1(weights)

In [26]:
def run_dlib_cnn(image):
    """
    Return True if the dlib CNN detector finds no face (i.e. the image is an outlier).

    image: grayscale image array holding 256*256 values (e.g. shape (256, 256, 1))
    """

    # cast to uint8 and reshape to a 2-D 256x256 array
    resized_image = image.astype('uint8')
    resized_image_256 = resized_image.reshape(256, 256)

    # Optional: CLAHE histogram equalisation to increase contrast and reveal more
    # features on over-exposed or saturated pictures (left disabled here)
#     clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
#     resized_image_256 = clahe.apply(resized_image_256)

    # detect faces in the grayscale image, upsampling it once
    rects = cnn_face_detector(resized_image_256, 1)

    # no detection means the image is flagged as an outlier
    return len(rects) == 0

In [4]:
# populating array of outlier images
outliers = []

# go through the data and collect the images in which no face was detected
for (root, dirs, dat_files) in os.walk('{0}'.format(images_dir)):
    # sanity check to see that we will indeed iterate over all the files
    print(len(dat_files))
    start_time = dt.datetime.now()
    print('Start learning at {}'.format(str(start_time)))
    for file in dat_files:
        # Gray scale image to avoid colour messing up our detections (used with 0 parameter to avoid error)
        img = cv2.imread('{0}/{1}'.format(images_dir, file), 0)
        img_array = image.img_to_array(img)
        outlier = run_dlib_cnn(img_array)
        if outlier:
            print('outlier detected: {0}'.format(file))
            # store detected outliers filename as int in list 
            outliers.append(int(file[:-4]))
            
end_time = dt.datetime.now()
print('Stop learning {}'.format(str(end_time)))
elapsed_time= end_time - start_time
print('Elapsed learning {}'.format(str(elapsed_time)))

print(outliers)        

5000
Start learning at 2018-12-13 16:02:49.526497
outlier detected: 1028.png
outlier detected: 1031.png
outlier detected: 1036.png
outlier detected: 1046.png
outlier detected: 1064.png
outlier detected: 1065.png
outlier detected: 1066.png
outlier detected: 1080.png
outlier detected: 1090.png
outlier detected: 1094.png
outlier detected: 11.png
outlier detected: 1100.png
outlier detected: 1104.png
outlier detected: 1114.png
outlier detected: 112.png
outlier detected: 1124.png
outlier detected: 1133.png
outlier detected: 1135.png
outlier detected: 1140.png
outlier detected: 1147.png
outlier detected: 1148.png
outlier detected: 1149.png
outlier detected: 1153.png
outlier detected: 1164.png
outlier detected: 1167.png
outlier detected: 1183.png
outlier detected: 1205.png
outlier detected: 1222.png
outlier detected: 1229.png
outlier detected: 1235.png
outlier detected: 1237.png
outlier detected: 1239.png
outlier detected: 1248.png
outlier detected: 125.png
outlier detected: 1285.png
[output truncated]

outlier detected: 3580.png
outlier detected: 3587.png
outlier detected: 359.png
outlier detected: 3596.png
outlier detected: 362.png
outlier detected: 3620.png
outlier detected: 3623.png
outlier detected: 3627.png
outlier detected: 364.png
outlier detected: 3652.png
outlier detected: 3654.png
outlier detected: 3660.png
outlier detected: 3667.png
outlier detected: 3669.png
outlier detected: 368.png
outlier detected: 3712.png
outlier detected: 3717.png
outlier detected: 3732.png
outlier detected: 3739.png
outlier detected: 3755.png
outlier detected: 3776.png
outlier detected: 3782.png
outlier detected: 3783.png
outlier detected: 3789.png
outlier detected: 3792.png
outlier detected: 3795.png
outlier detected: 3802.png
outlier detected: 3806.png
outlier detected: 3808.png
outlier detected: 3825.png
outlier detected: 3835.png
outlier detected: 3841.png
outlier detected: 3843.png
outlier detected: 3844.png
outlier detected: 385.png
outlier detected: 3851.png
outlier detected: 3858.png
[output truncated]

In [5]:
print(len(outliers))

543


We obtain a very similar number of outliers with the CNN, although slightly fewer than with the HOG+SVM classifier (543 versus 578); this is because the HOG+SVM classifier probably detected fewer "cartoon" faces than the CNN. It is important to notice, though, that running the CNN took a total of 55 minutes, which is much longer than the 4 minutes the HOG+SVM needed to detect faces with similar results. Let's now look at some results:
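Put per image, the two run times quoted above work out as follows. This is a back-of-the-envelope calculation from the rounded 55-minute and 4-minute figures, so it is only indicative.

```python
n_images = 5000
cnn_seconds = 55 * 60   # CNN run, from the 55 minutes quoted above
hog_seconds = 4 * 60    # HOG+SVM run, from the 4 minutes quoted above

cnn_per_image = cnn_seconds / n_images   # about 0.66 s per image
hog_per_image = hog_seconds / n_images   # about 0.048 s per image
slowdown = cnn_seconds / hog_seconds     # 13.75

print(cnn_per_image, hog_per_image, slowdown)
```

So on this machine the CNN is roughly 14 times slower per image, which is why the later cascade stages only run it on small subsets.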

In [10]:
# calculate accuracies and images that were misclassified and consider using another facedetector
# list of real outliers not detected:
real_outliers_n_detect = list(set(list_of_outliers) - set(outliers))
print(real_outliers_n_detect)

[]


Our CNN has also detected all of the outliers!

In [11]:
non_real_outliers_detect = list(set(outliers) - set(list_of_outliers))
print(non_real_outliers_detect)
print(len(non_real_outliers_detect))

[1028, 3091, 1046, 4631, 4130, 3623, 1066, 3117, 1090, 3145, 3149, 1615, 1104, 2640, 4182, 2648, 604, 605, 4205, 112, 4726, 3191, 4734, 1153, 4228, 3717, 137, 1674, 4239, 4247, 2226, 3257, 2746, 2751, 2244, 204, 3789, 2257, 4817, 2775, 2778, 3806, 3295, 2277, 1771, 3825, 3314, 2814, 3841, 2312, 3851, 4881, 4884, 1813, 2329, 2841, 793, 1308, 797, 4894, 3365, 2854, 810, 1328, 2865, 3380, 4411, 3902, 3392, 3396, 4424, 844, 337, 2899, 3412, 4950, 3414, 858, 2907, 3420, 349, 4957, 2399, 362, 1391, 880, 2929, 2421, 385, 4999, 4489, 3467, 1424, 406, 925, 421, 2981, 2479, 3504, 2492, 3529, 975, 3544, 2009, 3036, 478, 488, 2544]
108


We get 108 faces that were not detected, which is better than the HOG+SVM, where 143 faces weren't detected. Let's have a look at what these look like:

In [12]:
# let's see some of these images (first 45):
list_im = []
# use image_id as the loop variable so the keras `image` module is not overwritten
for image_id in non_real_outliers_detect[:45]:
    list_im.append('{0}/{1}.png'.format(images_dir, str(image_id)))

imgs = [ PIL.Image.open(i) for i in list_im ]
# pick the image which is the smallest, and resize the others to match it (can be arbitrary image shape here)
min_shape = sorted( [(np.sum(i.size), i.size ) for i in imgs])[0][1]
# np.hstack expects a sequence of arrays, so build a list rather than a generator
imgs_comb = np.hstack( [np.asarray( i.resize(min_shape) ) for i in imgs] )

# save that beautiful picture
imgs_comb = PIL.Image.fromarray( imgs_comb)
imgs_comb.save( 'subplot_cnn.jpg' )

It is EXTREMELY clear from 'subplot_cnn.jpg' that our CNN has a real problem with bearded cartoons and, once again, mainly with dark-skinned cartoons. This does make sense: first, the CNN weights come from training on real-life faces; second, the training set was probably made up mainly of white faces with small beards or none at all (presumably because bearded people were harder to find for the dataset). It seems that our HOG+SVM was slightly less sensitive to the beard problem, but still sensitive to the skin colour problem. In some sense the CNN is actually doing what it was trained to do: it singled out cartoon images with considerable beards, which it was never trained to recognise, so its weights never learnt those particular representations.

In [1]:
list_cartoons = [1028, 3091, 1046, 4631, 4130, 3623, 1066, 3117, 1090, 3145, 3149, 1615, 1104, 2640, 4182, 2648, 604, 605, 4205, 112, 4726, 3191, 4734, 1153, 4228, 3717, 137, 1674, 4239, 4247, 2226, 3257, 2746, 2751, 2244, 204, 3789, 2257, 4817, 2775, 2778, 3806, 3295, 2277, 1771, 3825, 3314, 2814, 3841, 2312, 3851, 4881, 4884, 1813, 2329, 2841, 793, 1308, 797, 4894, 3365, 2854, 810, 1328, 2865, 3380, 4411, 3902, 3392, 3396, 4424, 844, 337, 2899, 3412, 4950, 3414, 858, 2907, 3420, 349, 4957, 2399, 362, 1391, 880, 2929, 2421, 385, 4999, 4489, 3467, 1424, 406, 925, 421, 2981, 2479, 3504, 2492, 3529, 975, 3544, 2009, 3036, 478, 488, 2544]

In [7]:
# populating array of outlier images
new_outliers = []

# go through the data and collect the images in which no face was detected
for (root, dirs, dat_files) in os.walk('{0}'.format(images_dir)):
    # sanity check to see that we will indeed iterate over all the files
    print(len(dat_files))
    start_time = dt.datetime.now()
    print('Start learning at {}'.format(str(start_time)))
    for file in dat_files:
        int_file = int(file[:-4])
        if int_file in list_cartoons:
            # Gray scale image to avoid colour messing up our detections (used with 0 parameter to avoid error)
            img = cv2.imread('{0}/{1}'.format(images_dir, file), 0)
            img_array = image.img_to_array(img)
            outlier = run_dlib_hog(img_array)
            if outlier:
                print('outlier detected: {0}'.format(file))
                # store detected outliers filename as int in list 
                new_outliers.append(int(file[:-4]))
            
end_time = dt.datetime.now()
print('Stop learning {}'.format(str(end_time)))
elapsed_time= end_time - start_time
print('Elapsed learning {}'.format(str(elapsed_time)))

print(new_outliers)        

5000
Start learning at 2018-12-23 17:03:59.551319
outlier detected: 1153.png
outlier detected: 1615.png
outlier detected: 1771.png
outlier detected: 1813.png
outlier detected: 2421.png
outlier detected: 2854.png
outlier detected: 3036.png
outlier detected: 3257.png
outlier detected: 3420.png
outlier detected: 362.png
outlier detected: 3789.png
outlier detected: 3902.png
outlier detected: 4130.png
outlier detected: 421.png
outlier detected: 4881.png
outlier detected: 4884.png
outlier detected: 793.png
outlier detected: 797.png
outlier detected: 810.png
Stop learning 2018-12-23 17:04:03.173909
Elapsed learning 0:00:03.622590
[1153, 1615, 1771, 1813, 2421, 2854, 3036, 3257, 3420, 362, 3789, 3902, 4130, 421, 4881, 4884, 793, 797, 810]


Here we have essentially cascaded a CNN-based face detector and a HOG-based face detector. This gives us better results, as we have reduced the number of false positives (in other words, the number of cartoon faces that were flagged as outliers) from 108 to 19.

In [8]:
print(len(new_outliers))

19
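The cascading logic itself is independent of the detectors used: an image stays an outlier only if every stage fails to find a face, and each stage only needs to look at what the previous stage rejected. A minimal sketch with stand-in detectors follows; the function `cascade_outliers` and the lambda "detectors" are purely illustrative and are not dlib calls.

```python
def cascade_outliers(image_ids, detectors):
    """Return the IDs for which no detector in the cascade finds a face.

    Each detector is a callable taking an ID and returning True when it
    finds a face; later stages only see what earlier stages rejected.
    """
    remaining = list(image_ids)
    for finds_face in detectors:
        remaining = [i for i in remaining if not finds_face(i)]
    return remaining

# stand-in detectors for illustration only
first_stage = lambda i: i % 2 == 0    # "finds a face" in even IDs
second_stage = lambda i: i % 3 == 0   # "finds a face" in multiples of 3

print(cascade_outliers(range(1, 13), [first_stage, second_stage]))
# [1, 5, 7, 11]
```

Putting the slowest detector first, as done here with the CNN, is a matter of convenience; running the fast detector first and the slow one only on its rejects would give the same final set at lower cost.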


We can plot these to see which faces were still not detected:


In [14]:
non_real_outliers_detect = list(set(new_outliers) - set(list_of_outliers))
print(non_real_outliers_detect)
print(len(non_real_outliers_detect))

[1153, 4130, 3257, 421, 2854, 362, 1771, 810, 3789, 3036, 1615, 4881, 4884, 2421, 1813, 793, 3420, 797, 3902]
19


In [15]:
# let's see some of these images:
list_im = []
# use image_id as the loop variable so the keras `image` module is not overwritten
for image_id in non_real_outliers_detect:
    list_im.append('{0}/{1}.png'.format(images_dir, str(image_id)))

imgs = [ PIL.Image.open(i) for i in list_im ]
# pick the image which is the smallest, and resize the others to match it (can be arbitrary image shape here)
min_shape = sorted( [(np.sum(i.size), i.size ) for i in imgs])[0][1]
# np.hstack expects a sequence of arrays, so build a list rather than a generator
imgs_comb = np.hstack( [np.asarray( i.resize(min_shape) ) for i in imgs] )

# save that beautiful picture
imgs_comb = PIL.Image.fromarray( imgs_comb)
imgs_comb.save( 'subplot_cnn_hog.jpg' )

Once again, we observe that the cartoons treated as outliers are mainly the ones with occluding elements on their faces, such as glasses or beards; these hide the key features that our classifiers look at, which in our case might be the jawline. We also notice a large proportion of dark-skinned cartoons (mainly black skin). These darker colours are probably not representations or features familiar to our HOG and CNN detectors, because of the way they were initially trained. We can cascade a third stage, another HOG classifier, and observe whether or not we can get closer to zero cartoons being treated as outliers:

In [23]:
new_list_cartoons = [1153, 4130, 3257, 421, 2854, 362, 1771, 810, 3789, 3036, 1615, 4881, 4884, 2421, 1813, 793, 3420, 797, 3902]

In [24]:
# populating array of outlier images
new_outliers_2 = []

# go through the data and collect the images in which no face was detected
for (root, dirs, dat_files) in os.walk('{0}'.format(images_dir)):
    # start the timer
    start_time = dt.datetime.now()
    print('Start learning at {}'.format(str(start_time)))
    for file in dat_files:
        int_file = int(file[:-4])
        if int_file in new_list_cartoons:
            # Gray scale image to avoid colour messing up our detections (used with 0 parameter to avoid error)
            img = cv2.imread('{0}/{1}'.format(images_dir, file), 0)
            img_array = image.img_to_array(img)
            outlier = run_dlib_hog(img_array)
            if outlier:
                print('outlier detected: {0}'.format(file))
                # store detected outliers filename as int in list 
                new_outliers_2.append(int(file[:-4]))
            
end_time = dt.datetime.now()
print('Stop learning {}'.format(str(end_time)))
elapsed_time= end_time - start_time
print('Elapsed learning {}'.format(str(elapsed_time)))

print(new_outliers_2)    
print(len(new_outliers_2))

Start learning at 2018-12-23 17:19:17.010686
outlier detected: 1153.png
outlier detected: 1615.png
outlier detected: 1771.png
outlier detected: 1813.png
outlier detected: 2421.png
outlier detected: 2854.png
outlier detected: 3036.png
outlier detected: 3257.png
outlier detected: 3420.png
outlier detected: 362.png
outlier detected: 3789.png
outlier detected: 3902.png
outlier detected: 4130.png
outlier detected: 421.png
outlier detected: 4881.png
outlier detected: 4884.png
outlier detected: 793.png
outlier detected: 797.png
outlier detected: 810.png
Stop learning 2018-12-23 17:19:17.689153
Elapsed learning 0:00:00.678467
[1153, 1615, 1771, 1813, 2421, 2854, 3036, 3257, 3420, 362, 3789, 3902, 4130, 421, 4881, 4884, 793, 797, 810]
19
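A quick way to confirm that this pass changed nothing is to compare the two ID lists as sets. Both lists below are copied from the cells above.

```python
new_list_cartoons = [1153, 4130, 3257, 421, 2854, 362, 1771, 810, 3789, 3036,
                     1615, 4881, 4884, 2421, 1813, 793, 3420, 797, 3902]
new_outliers_2 = [1153, 1615, 1771, 1813, 2421, 2854, 3036, 3257, 3420, 362,
                  3789, 3902, 4130, 421, 4881, 4884, 793, 797, 810]

# images recovered by the extra pass (empty means no improvement)
recovered = sorted(set(new_list_cartoons) - set(new_outliers_2))
print(recovered)                                      # []
print(set(new_list_cartoons) == set(new_outliers_2))  # True
```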


Cascading another HOG classifier doesn't help detect more cartoons, which is not surprising since it is the same detector run again on images it has already missed. However, we might look into other well-known face detectors such as Fisherfaces or Haar cascade based face detection. The latter works on the whole image more than on individual features, hence it won't work as well for emotion recognition but will work just fine for frontal face detection. Tilted faces and other occluding elements, however, will always be handled best by the more state-of-the-art CNN-based method above. We can try to cascade another CNN; since we only have 19 images to run detection on, it shouldn't take an hour like last time:

In [27]:
# populating array of outlier images
new_outliers_2 = []

# go through the data and collect the images in which no face was detected
for (root, dirs, dat_files) in os.walk('{0}'.format(images_dir)):
    # start the timer
    start_time = dt.datetime.now()
    print('Start learning at {}'.format(str(start_time)))
    for file in dat_files:
        int_file = int(file[:-4])
        if int_file in new_list_cartoons:
            # Gray scale image to avoid colour messing up our detections (used with 0 parameter to avoid error)
            img = cv2.imread('{0}/{1}'.format(images_dir, file), 0)
            img_array = image.img_to_array(img)
            outlier = run_dlib_cnn(img_array)
            if outlier:
                print('outlier detected: {0}'.format(file))
                # store detected outliers filename as int in list 
                new_outliers_2.append(int(file[:-4]))
            
end_time = dt.datetime.now()
print('Stop learning {}'.format(str(end_time)))
elapsed_time= end_time - start_time
print('Elapsed learning {}'.format(str(elapsed_time)))

print(new_outliers_2)    
print(len(new_outliers_2))

Start learning at 2018-12-23 17:36:31.514420
outlier detected: 1153.png
outlier detected: 1615.png
outlier detected: 1771.png
outlier detected: 1813.png
outlier detected: 2421.png
outlier detected: 2854.png
outlier detected: 3036.png
outlier detected: 3257.png
outlier detected: 3420.png
outlier detected: 362.png
outlier detected: 3789.png
outlier detected: 3902.png
outlier detected: 4130.png
outlier detected: 421.png
outlier detected: 4881.png
outlier detected: 4884.png
outlier detected: 793.png
outlier detected: 797.png
outlier detected: 810.png
Stop learning 2018-12-23 17:36:45.093359
Elapsed learning 0:00:13.578939
[1153, 1615, 1771, 1813, 2421, 2854, 3036, 3257, 3420, 362, 3789, 3902, 4130, 421, 4881, 4884, 793, 797, 810]
19


From here, we can tell that cascading a CNN instead of the HOG has the same effect (these 19 images were already among the CNN's misses): we still cannot get those 19 cartoons classified as valid faces. We can now try Fisherfaces (more suited to recognising facial traits, such as emotion, than to plain face detection) or a Haar cascade based approach. In this particular case we will choose the cascade-based classifier over the Fisherface approach because of the large number of pre-trained models available in the OpenCV pre-trained model library. We will cascade the four Haar cascade models to make sure we don't miss any images because of basic occlusion, tilts, or other noise in the images:

In [28]:
faceDet = cv2.CascadeClassifier("opencv_data/haarcascade_frontalface_default.xml")
faceDet2 = cv2.CascadeClassifier("opencv_data/haarcascade_frontalface_alt2.xml")
faceDet3 = cv2.CascadeClassifier("opencv_data/haarcascade_frontalface_alt.xml")
faceDet4 = cv2.CascadeClassifier("opencv_data/haarcascade_frontalface_alt_tree.xml")

In [29]:
def run_haarcascade(image):
    """
    Return True if the image is an outlier, i.e. none of the four Haar
    cascade classifiers detects exactly one face in it.

    image: grayscale image array that reshapes to 256x256
    """

    # Cast to uint8 (expected by detectMultiScale) and drop the channel axis
    gray_image = image.astype('uint8').reshape(256, 256)

    # Try the four classifiers in turn, stop at the first one that finds exactly one face
    for detector in (faceDet, faceDet2, faceDet3, faceDet4):
        faces = detector.detectMultiScale(gray_image, scaleFactor=1.1, minNeighbors=10,
                                          minSize=(5, 5), flags=cv2.CASCADE_SCALE_IMAGE)
        if len(faces) == 1:
            return False

    # No classifier found a single face: treat the image as an outlier
    return True
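
Note that `run_haarcascade` only accepts an image when some classifier reports exactly one detection, so an image where every classifier reports two or more detections is also flagged. The sketch below isolates that decision rule from OpenCV so it can be checked on plain numbers. The function name `is_outlier` and the `require_single` switch are hypothetical and not part of the notebook.

```python
def is_outlier(detection_counts, require_single=True):
    """Decide from per-classifier face counts whether an image is an outlier.

    detection_counts: number of faces each classifier reported, in cascade order.
    require_single: True mirrors the `== 1` test; False accepts one or more faces.
    """
    for count in detection_counts:
        accepted = (count == 1) if require_single else (count >= 1)
        if accepted:
            return False  # some classifier found an acceptable face
    return True  # no classifier did: flag the image


print(is_outlier([0, 0, 0, 0]))                        # True: no face anywhere
print(is_outlier([0, 1, 0, 0]))                        # False: second classifier finds one face
print(is_outlier([2, 0, 0, 0]))                        # True: two detections, strict rule
print(is_outlier([2, 0, 0, 0], require_single=False))  # False: two detections, relaxed rule
```

Whether the strict or the relaxed rule is preferable depends on how often the cascades produce duplicate detections on this dataset, which is not measured here.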

In [30]:
# populating array of outlier images
new_outliers_3 = []

# go through the data and collect the images in which no face was detected
for (root, dirs, dat_files) in os.walk('{0}'.format(images_dir)):
    # sanity check to see that we will indeed iterate over all the files
    start_time = dt.datetime.now()
    print('Start learning at {}'.format(str(start_time)))
    for file in dat_files:
        int_file = int(file[:-4])
        if int_file in new_list_cartoons:
            # Load as grayscale (flag 0) so that colour does not interfere with the detection
            img = cv2.imread('{0}/{1}'.format(images_dir, file), 0)
            img_array = image.img_to_array(img)
            outlier = run_haarcascade(img_array)
            if outlier:
                print('outlier detected: {0}'.format(file))
                # store detected outliers filename as int in list 
                new_outliers_3.append(int(file[:-4]))
            
end_time = dt.datetime.now()
print('Stop learning {}'.format(str(end_time)))
elapsed_time= end_time - start_time
print('Elapsed learning {}'.format(str(elapsed_time)))

print(new_outliers_3)    
print(len(new_outliers_3))

Start learning at 2018-12-23 18:56:25.831553
outlier detected: 1153.png
outlier detected: 1615.png
outlier detected: 1771.png
outlier detected: 1813.png
outlier detected: 2421.png
outlier detected: 2854.png
outlier detected: 3036.png
outlier detected: 3257.png
outlier detected: 3420.png
outlier detected: 362.png
outlier detected: 3789.png
outlier detected: 3902.png
outlier detected: 4130.png
outlier detected: 421.png
outlier detected: 4881.png
outlier detected: 4884.png
outlier detected: 793.png
outlier detected: 797.png
outlier detected: 810.png
Stop learning 2018-12-23 18:56:27.719257
Elapsed learning 0:00:01.887704
[1153, 1615, 1771, 1813, 2421, 2854, 3036, 3257, 3420, 362, 3789, 3902, 4130, 421, 4881, 4884, 793, 797, 810]
19
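
The list printed above looks identical to the one obtained from the dlib CNN run. A set comparison confirms this without checking by eye. The sketch below copies the two printed lists by hand; `cnn_outliers` and `haar_outliers` are placeholder names standing in for `new_outliers_2` and `new_outliers_3`.

```python
cnn_outliers = [1153, 1615, 1771, 1813, 2421, 2854, 3036, 3257, 3420, 362,
                3789, 3902, 4130, 421, 4881, 4884, 793, 797, 810]
haar_outliers = [1153, 1615, 1771, 1813, 2421, 2854, 3036, 3257, 3420, 362,
                 3789, 3902, 4130, 421, 4881, 4884, 793, 797, 810]

# images flagged by only one of the two detectors, and by both
only_cnn = sorted(set(cnn_outliers) - set(haar_outliers))
only_haar = sorted(set(haar_outliers) - set(cnn_outliers))
common = sorted(set(cnn_outliers) & set(haar_outliers))

print(len(common), only_cnn, only_haar)  # 19 [] []
```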


WOW! So even four back-to-back Haar cascade classifiers weren't able to recognise these 19 pictures as faces. They are all cartoons with beards, glasses (which can hide the eyes at times) or dark skin, so they are not real outliers. The pre-trained models were most likely trained on real humans, so this is a similar story to the previous tests. 19 images misclassified as outliers should not affect our dataset too much, so let us create the new dataset with all the useful information we need:

In [35]:
# building final outlier list
final_outlier_list = list_of_outliers + new_outliers_3
print(len(final_outlier_list))

454
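
Concatenating the two lists with `+` keeps any id that appears in both, so the count of 454 would overstate the number of distinct outliers if the lists overlap. Whether they overlap is not visible here, so the following is only a sketch on made-up lists (`first_pass` and `second_pass` are hypothetical stand-ins for `list_of_outliers` and `new_outliers_3`) showing a merge that removes duplicates and sorts the ids numerically.

```python
first_pass = [4, 5, 6, 11, 362]   # made-up ids for illustration
second_pass = [362, 421, 793]     # made-up ids, one of them shared

# the set union drops the duplicate, sorted() restores numeric order
merged = sorted(set(first_pass) | set(second_pass))
print(merged)       # [4, 5, 6, 11, 362, 421, 793]
print(len(merged))  # 7
```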


In [36]:
import shutil

In [37]:
# build list of outliers (filename format)
outliers_filename = [] 
for int_file in final_outlier_list:
    string_file = '{0}.png'.format(int_file)
    outliers_filename.append(string_file)
print(outliers_filename)
print(len(outliers_filename))

['4.png', '5.png', '6.png', '11.png', '15.png', '21.png', '27.png', '58.png', '67.png', '125.png', '129.png', '151.png', '167.png', '175.png', '193.png', '203.png', '207.png', '220.png', '222.png', '227.png', '248.png', '251.png', '253.png', '266.png', '289.png', '301.png', '305.png', '316.png', '324.png', '326.png', '341.png', '358.png', '359.png', '364.png', '368.png', '386.png', '387.png', '393.png', '415.png', '427.png', '432.png', '440.png', '449.png', '452.png', '466.png', '471.png', '503.png', '511.png', '512.png', '517.png', '539.png', '542.png', '548.png', '574.png', '575.png', '596.png', '600.png', '610.png', '625.png', '638.png', '639.png', '650.png', '663.png', '669.png', '692.png', '695.png', '711.png', '714.png', '718.png', '728.png', '731.png', '741.png', '748.png', '754.png', '762.png', '778.png', '779.png', '805.png', '813.png', '821.png', '824.png', '843.png', '862.png', '865.png', '868.png', '875.png', '876.png', '879.png', '893.png', '915.png', '931.png', '939.png',

In [44]:
# create new dataset with no outliers (the destination folder is created if missing)
dest_dir = 'new_dataset'
os.makedirs(dest_dir, exist_ok=True)
# go through data and collect all images, if in outlier list do not copy over
for (root, dirs, dat_files) in os.walk('{0}'.format(images_dir)):
    for file in dat_files:
        if file not in outliers_filename:
            print(file)
            # build the source path from the walked folder instead of a hard-coded one
            shutil.copy(os.path.join(root, file), dest_dir)

1.png
10.png
100.png
1000.png
1001.png
[remaining output omitted: one line per copied file]
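
The copy step can also be written as a small function that can be tried on a throwaway folder before touching the real data. This is a sketch under assumptions, not the notebook's code: the function name `copy_without_outliers` is hypothetical, every file is assumed to be named `<integer>.png`, and the demo uses empty placeholder files in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def copy_without_outliers(src_dir, dest_dir, outlier_ids):
    """Copy every '<int>.png' in src_dir to dest_dir unless its id is in outlier_ids."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    outliers = set(outlier_ids)  # set lookup instead of scanning a list per file
    copied = []
    # sort numerically so that 2.png comes before 10.png
    for path in sorted(src.glob('*.png'), key=lambda p: int(p.stem)):
        if int(path.stem) not in outliers:
            shutil.copy(path, dest / path.name)
            copied.append(path.name)
    return copied


# Demo on a temporary directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'dataset'
    src.mkdir()
    for i in (1, 2, 3, 10):
        (src / '{0}.png'.format(i)).touch()
    copied = copy_without_outliers(src, Path(tmp) / 'new_dataset', [2])
    print(copied)  # ['1.png', '3.png', '10.png']
```

Returning the list of copied names instead of printing each one keeps the cell output short and makes the result easy to count or inspect.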