# DATA PRE-PROCESSING

## First visual contact with the data

To get started, we proceed with a first visual inspection of the data. This gives us a feel for the kind of outliers we are dealing with and, eventually, for the best methods to eliminate them.

It is clear from the data that the images we are interested in contain faces. All images have the same size (256x256) and all have RGB channels (256x256x3). This is actually quite large: thinking ahead, if we include all channels we have a total of 196608 features per example (per image), which will lead to a large number of parameters. So we will definitely have to be careful when dealing with our future Neural Networks or Convolutional Neural Networks.
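
To make the feature count concrete, the arithmetic can be checked in a couple of lines. This is only a sanity check; the grayscale figure is an assumed alternative (one channel instead of three), not something the dataset dictates.

```python
# Flattened input size for one image, from the dimensions stated above.
height, width, channels = 256, 256, 3

rgb_features = height * width * channels   # all three colour channels
gray_features = height * width             # assumption: a single grayscale channel

print(rgb_features)   # 196608
print(gray_features)  # 65536
```

Dropping to one channel would cut the input size by a factor of three, which is one reason grayscale conversion is worth keeping in mind.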

From the dataset it is clear that we have a few interesting outliers: mainly grey images, occasionally mountainous scenery, and blue/grey sky. Faces are also occasionally obscured by watermarks. The main outliers and strange cases spotted at first glance are the following:

- 600.png --> lake, mountain with clouds
- 639.png --> very orange
- 596.png --> black, weird artefacts
- 663.png --> trickier case, has quite a few orange colours
- 665.png --> might be eliminated if not careful (perhaps grayscale conversion might solve this)
- 650.png --> empty image
- 574.png --> partially filled
- 535.png --> distorted data (w/ face)
- 499.png --> highly pixelated data
- 72.png --> destroyed nasal feature
- 390.png, 118.png --> Watermarks (right on the face)
- 206.png --> slightly tilted (might make us wonder if there are any upside down faces)
- 291.png --> hand perturbation
- 302.png --> noisy RGB background

At this point I'm starting to think about data augmentation, since I'm noticing quite a few peculiar cases and especially quite a few outliers, which might reduce the size of our data considerably.
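
One cheap augmentation that could fit a face dataset is horizontal mirroring, since a mirrored face should keep the same attributes. The sketch below is only an assumption about which augmentation might be used (none has been chosen at this point). It works on a plain nested list so that it stays independent of any image library, and `flip_horizontal` is a hypothetical helper.

```python
def flip_horizontal(image):
    """Return a left-right mirrored copy of an image given as rows of pixels."""
    return [row[::-1] for row in image]

# Tiny 2x3 "image" with one value per pixel, for illustration only.
tiny = [[1, 2, 3],
        [4, 5, 6]]
mirrored = flip_horizontal(tiny)
print(mirrored)  # [[3, 2, 1], [6, 5, 4]]
```

With images loaded as NumPy arrays, `np.fliplr` performs the same operation.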



Honestly, the first thing that comes to mind when looking for a way to eliminate outliers in this image-based dataset is to use basic face detectors, which are widely available in libraries such as OpenCV or dlib. I was also fortunate enough to have played quite a bit with face and emotion recognition in general; my code for both Fisherface-based and landmark-based emotion recognition can be found on my GitHub: https://github.com/brunocalogero/individual_study and https://github.com/brunocalogero/ECE420FinalProject

It is also our lucky day: at first glance not many images are upside down or have bizarre orientations, and not too many faces are tilted. In fact most of them are purely frontal, which is ideal for a HOG detector: it is not very orientation friendly, but it does a great job with frontal faces.

- Hence one obvious approach is to go for the classic HOG (Histogram of Oriented Gradients) + SVM combo.
- Other statistical methods could be envisioned but might not give as good a result.
- We could also consider `cvlib`, which reportedly detects faces at most angles, in real time and more efficiently.
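
To illustrate why HOG features are sensitive to orientation, here is a deliberately simplified sketch of the core idea: a magnitude-weighted histogram of gradient orientations for a single cell. It is not dlib's implementation (no block normalisation, no bin interpolation, no SVM), and `orientation_histogram` and the toy edges are hypothetical names and data.

```python
import math

def orientation_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]   # vertical central difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist

# A vertical edge, and the same edge rotated by 90 degrees (toy data).
vertical_edge = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
horizontal_edge = [list(col) for col in zip(*vertical_edge)]

hist_v = orientation_histogram(vertical_edge)
hist_h = orientation_histogram(horizontal_edge)
print(hist_v.index(max(hist_v)))  # 0
print(hist_h.index(max(hist_h)))  # 4
```

The same edge lands in a different bin once it is rotated, so a detector trained on upright frontal faces sees a different descriptor for a tilted one.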

In [83]:
import os
import cv2
import dlib
import time
import PIL

import numpy as np
import datetime as dt
import pandas as pd

# path to all images and name of the labels file
images_dir = './dataset'
labels_filename = 'attribute_list.csv'

hog_face_detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

In [72]:
def run_dlib_hog(image):
    """
    Return True if dlib's HOG detector finds no face (the image is an outlier).

    image: grayscale image array
    """

    # Declare the adaptive histogram equalizer (CLAHE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

    # Cast the pixel values to uint8, as expected by CLAHE and dlib
    gray = image.astype('uint8')

    # Apply histogram equalization to increase contrast and reveal more features
    # on brightened up or saturated pics
    equalised_gray_img = clahe.apply(gray)

    # Detect faces in the grayscale image (1 = upsample the image once)
    rects = hog_face_detector(equalised_gray_img, 1)

    # No face found means the image is treated as an outlier
    return len(rects) == 0

In [73]:
# populating array of outlier images
outliers = []
start_time = dt.datetime.now()

# go through the data and collect the images in which no face was detected
for (root, dirs, dat_files) in os.walk(images_dir):
    # sanity check to see that we will indeed iterate over all the files
    print(len(dat_files))
    print('Start learning at {}'.format(str(start_time)))
    for file in dat_files:
        # read directly as grayscale (flag 0) to avoid colour messing up our detections
        img = cv2.imread(os.path.join(root, file), 0)
        if img is None:
            # not a readable image file, skip it
            continue
        # cv2.imread already returns a numpy array, no further conversion needed
        if run_dlib_hog(img):
            print('outlier detected: {0}'.format(file))
            # store the detected outlier's filename as int in the list
            outliers.append(int(os.path.splitext(file)[0]))

end_time = dt.datetime.now()
print('Stop learning {}'.format(str(end_time)))
elapsed_time = end_time - start_time
print('Elapsed learning {}'.format(str(elapsed_time)))

print(outliers)

5000
Start learning at 2018-12-12 01:15:11.337195
outlier detected: 1029.png
outlier detected: 1031.png
outlier detected: 1036.png
outlier detected: 1060.png
outlier detected: 1064.png
outlier detected: 1065.png
outlier detected: 1071.png
outlier detected: 1075.png
outlier detected: 1080.png
outlier detected: 1085.png
outlier detected: 1094.png
outlier detected: 11.png
outlier detected: 1100.png
outlier detected: 1114.png
outlier detected: 1124.png
outlier detected: 1133.png
outlier detected: 1135.png
outlier detected: 1140.png
outlier detected: 1147.png
outlier detected: 1148.png
outlier detected: 1149.png
outlier detected: 1153.png
outlier detected: 1161.png
outlier detected: 1164.png
outlier detected: 1167.png
outlier detected: 1183.png
outlier detected: 1205.png
outlier detected: 1208.png
outlier detected: 1222.png
outlier detected: 1229.png
outlier detected: 123.png
outlier detected: 1235.png
outlier detected: 1237.png
outlier detected: 1239.png
outlier detected: 1241.png
outlier 

outlier detected: 3322.png
outlier detected: 3324.png
outlier detected: 3326.png
outlier detected: 3335.png
outlier detected: 3338.png
outlier detected: 3339.png
outlier detected: 3349.png
outlier detected: 3350.png
outlier detected: 3357.png
outlier detected: 3360.png
outlier detected: 3364.png
outlier detected: 3375.png
outlier detected: 338.png
outlier detected: 3380.png
outlier detected: 3389.png
outlier detected: 3398.png
outlier detected: 3402.png
outlier detected: 341.png
outlier detected: 3416.png
outlier detected: 3420.png
outlier detected: 3421.png
outlier detected: 3425.png
outlier detected: 3438.png
outlier detected: 3441.png
outlier detected: 3443.png
outlier detected: 3448.png
outlier detected: 3452.png
outlier detected: 3458.png
outlier detected: 3461.png
outlier detected: 3466.png
outlier detected: 3489.png
outlier detected: 3503.png
outlier detected: 3504.png
outlier detected: 3516.png
outlier detected: 3518.png
outlier detected: 3525.png
outlier detected: 3533.png
out

In [74]:
print(len(outliers)) 

609


As a first estimate we have 609 outliers; let's see how accurate that actually is. The data is labelled, and if an entry isn't a valid image, all of its attributes should be set to -1 in the labels CSV file. Since we also know the total size of our dataset (5000 images), this lets us estimate the accuracy of our HOG detector. Let's use pandas to import the CSV as a nice little DataFrame, because who likes dicts and numpy anyway, right? (We also need the DataFrame to be indexed by `file_name`, hence the parameters I am passing.)
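
The accuracy estimate described above boils down to comparing two sets of image IDs. The following is a minimal sketch of that comparison, assuming both the detected and the labelled outliers are available as collections of integer IDs. `detection_metrics` is a hypothetical helper and the IDs below are toy values, not taken from the dataset.

```python
def detection_metrics(predicted, actual, total):
    """Confusion counts and summary scores for a set-based outlier detector."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)      # true outliers that were flagged
    fp = len(predicted - actual)      # faces wrongly flagged as outliers
    fn = len(actual - predicted)      # true outliers that were missed
    tn = total - tp - fp - fn         # faces correctly kept
    return {
        'accuracy': (tp + tn) / total,
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical toy IDs, for illustration only.
metrics = detection_metrics(predicted=[1, 2, 3, 4], actual=[1, 2, 5], total=10)
print(metrics)
```

In this notebook the call would look like `detection_metrics(outliers, list_of_outliers, 5000)` once both lists exist.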

In [75]:
df = pd.read_csv(labels_filename, skiprows=1, index_col='file_name')

In [77]:
print(df)

           hair_color  eyeglasses  smiling  young  human
file_name                                               
1                   1          -1        1      1     -1
2                   4          -1        1      1      1
3                   5          -1        1     -1     -1
4                  -1          -1       -1     -1     -1
5                  -1          -1       -1     -1     -1
6                  -1          -1       -1     -1     -1
7                   2          -1        1      1     -1
8                   3          -1        1      1     -1
9                   1           1        1      1     -1
10                  5          -1        1     -1     -1
11                 -1          -1       -1     -1     -1
12                  1          -1        1      1     -1
13                  3          -1       -1      1      1
14                  4          -1       -1      1      1
15                 -1          -1       -1     -1     -1
16                  5          

In [78]:
# rows where every attribute is -1 are the labelled outliers
label_cols = ['hair_color', 'eyeglasses', 'smiling', 'young', 'human']
list_of_outliers = df.index[(df[label_cols] == -1).all(axis=1)].tolist()
print(list_of_outliers)
print(len(list_of_outliers))

[4, 5, 6, 11, 15, 21, 27, 58, 67, 125, 129, 151, 167, 175, 193, 203, 207, 220, 222, 227, 248, 251, 253, 266, 289, 301, 305, 316, 324, 326, 341, 358, 359, 364, 368, 386, 387, 393, 415, 427, 432, 440, 449, 452, 466, 471, 503, 511, 512, 517, 539, 542, 548, 574, 575, 596, 600, 610, 625, 638, 639, 650, 663, 669, 692, 695, 711, 714, 718, 728, 731, 741, 748, 754, 762, 778, 779, 805, 813, 821, 824, 843, 862, 865, 868, 875, 876, 879, 893, 915, 931, 939, 952, 982, 983, 985, 989, 1031, 1036, 1064, 1065, 1080, 1094, 1100, 1114, 1124, 1133, 1135, 1140, 1147, 1148, 1149, 1164, 1167, 1183, 1205, 1222, 1229, 1235, 1237, 1239, 1248, 1285, 1300, 1312, 1318, 1319, 1335, 1336, 1337, 1338, 1345, 1370, 1381, 1393, 1400, 1403, 1421, 1451, 1506, 1508, 1529, 1530, 1533, 1539, 1545, 1546, 1568, 1572, 1580, 1603, 1613, 1626, 1629, 1645, 1649, 1671, 1682, 1702, 1716, 1723, 1729, 1738, 1783, 1792, 1811, 1815, 1873, 1903, 1909, 1913, 1933, 1935, 1955, 1965, 1966, 1978, 1989, 1992, 2011, 2037, 2040, 2053, 2056, 2058

In [79]:
# calculate accuracies and images that were misclassified and consider using another facedetector
# list of real outliers not detected:
real_outliers_n_detect = list(set(list_of_outliers) - set(outliers))
print(real_outliers_n_detect)


[]


Great, so at least we detected all the true outliers. However, we also flagged images that are not actually outliers (false positives); let's have a look:

In [88]:
non_real_outliers_detect = list(set(outliers) - set(list_of_outliers))
print(non_real_outliers_detect)
print(len(non_real_outliers_detect))

[4100, 2564, 1029, 1550, 2577, 3089, 3092, 3609, 2586, 540, 3102, 4130, 1060, 2604, 1071, 562, 1075, 4151, 567, 1594, 2108, 572, 1085, 3136, 3138, 3140, 2118, 3147, 77, 1615, 4177, 82, 3154, 2647, 599, 3682, 3171, 2662, 4200, 4715, 623, 3696, 3698, 632, 123, 4735, 1153, 2694, 1161, 4761, 161, 2210, 4260, 4263, 1703, 2732, 2735, 2739, 2740, 1208, 3257, 188, 3261, 196, 4297, 4298, 3789, 3278, 720, 4819, 2774, 1241, 730, 3292, 734, 2271, 4831, 3807, 3298, 1760, 4836, 2792, 4841, 1256, 1771, 238, 240, 241, 4852, 4341, 1795, 4872, 1290, 3338, 3851, 4881, 785, 4884, 3349, 793, 2329, 2331, 797, 2334, 4382, 1310, 3873, 2854, 2343, 809, 810, 1322, 1833, 4398, 3380, 309, 1844, 3897, 3911, 3912, 2892, 338, 4952, 3420, 3421, 1890, 1892, 4969, 362, 4974, 3951, 2417, 4977, 2421, 3448, 377, 4478, 1409, 1412, 3466, 1424, 913, 1940, 1429, 1944, 4507, 2460, 2465, 1449, 1962, 1964, 3504, 1457, 1458, 444, 3518, 4545, 3014, 1479, 4553, 2506, 1485, 3533, 4561, 3545, 476, 3036, 3040, 2017, 1511, 3052, 496, 1

In [90]:
# let's see some of these images (first 45):
from PIL import Image

list_im = []
for image_id in non_real_outliers_detect[:45]:
    list_im.append('{0}/{1}.png'.format(images_dir, image_id))

imgs = [Image.open(i).convert('RGB') for i in list_im]
# pick the image which is the smallest, and resize the others to match it (can be arbitrary image shape here)
min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1]
# np.hstack needs a sequence (list), not a generator
imgs_comb = np.hstack([np.asarray(i.resize(min_shape)) for i in imgs])

# save that beautiful picture
imgs_comb = Image.fromarray(imgs_comb)
imgs_comb.save('subplot.jpg')