# Data exploration on AffectNet raw dataset
First we declare the necessary variables and import the necessary libraries.

In [1]:
import numpy as np
import os

try: 
    from src import RAW_AFFECTNET_DIR
except ModuleNotFoundError:
    print("Ensure that src is added to PATH and restart the kernel")

In [2]:
train_path = os.path.join(RAW_AFFECTNET_DIR, "train_set")
val_path = os.path.join(RAW_AFFECTNET_DIR, "val_set")
print("The training path is:", train_path)
print("The validation path is:", val_path)

The training path is: /mnt/gpid08/datasets/affectnet/raw/train_set
The validation path is: /mnt/gpid08/datasets/affectnet/raw/val_set


Check how many annotation files and how many photos the dataset contains.


In [3]:
train_path_annotation = os.path.join(train_path, "annotations")
val_path_annotation = os.path.join(val_path, "annotations")
train_path_images = os.path.join(train_path, "images")
val_path_images = os.path.join(val_path, "images")

print("There are", len(os.listdir(train_path_annotation)), "annotation files in the training set")
print("There are", len(os.listdir(val_path_annotation)), "annotation files in the validation set")
print("There are", len(os.listdir(train_path_images)), "images in the training set")
print("There are", len(os.listdir(val_path_images)), "images in the validation set")

There are 1150604 annotation files in the training set
There are 15996 annotation files in the validation set
There are 287651 images in the training set
There are 3999 images in the validation set


We can observe that the number of annotation files and the number of photos differ. This is because each photo has several annotation files, one per kind of annotation. Next we will see which kinds of annotations exist.

Check the kinds of annotations in the files and how many there are of each kind.

In [4]:
# Count the annotation files of each kind in the training set

file_type_set = dict()
for file in os.listdir(train_path_annotation):
    file_type = file.split("_")[1].split(".")[0]
    if file.endswith(".npy"):
        if file_type not in file_type_set:
            file_type_set[file_type] = 1
        else:
            file_type_set[file_type] += 1
print("The file types for train are:", file_type_set)

# Check now the validation set
val_path_annotation = os.path.join(val_path, "annotations")
file_type_set = dict()
for file in os.listdir(val_path_annotation):
    file_type = file.split("_")[1].split(".")[0]
    if file.endswith(".npy"):
        if file_type not in file_type_set:
            file_type_set[file_type] = 1
        else:
            file_type_set[file_type] += 1
print("The file types for val are:", file_type_set)

The file types for train are: {'aro': 287651, 'exp': 287651, 'lnd': 287651, 'val': 287651}
The file types for val are: {'val': 3999, 'exp': 3999, 'lnd': 3999, 'aro': 3999}
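
As a side note, the counting loop can be written more compactly with `collections.Counter`. The following is only a minimal sketch, assuming file names of the form `<id>_<kind>.npy`; the helper name `count_annotation_kinds` and the example file names are hypothetical. Unlike the cell above, it checks the extension before splitting the name, so a stray file without an underscore does not raise an `IndexError`.

```python
from collections import Counter
from pathlib import Path

def count_annotation_kinds(filenames):
    """Count .npy annotation files per kind, assuming names like '<id>_<kind>.npy'."""
    counts = Counter()
    for name in filenames:
        path = Path(name)
        if path.suffix != ".npy":
            continue  # skip anything that is not an annotation file
        # rpartition splits on the last underscore, so the id part may itself contain "_"
        _, sep, kind = path.stem.rpartition("_")
        if sep:  # ignore names without an underscore
            counts[kind] += 1
    return counts

# Hypothetical file names, for illustration only
example_files = ["0_aro.npy", "0_exp.npy", "0_lnd.npy", "0_val.npy", "1_exp.npy", "notes.txt"]
kind_counts = count_annotation_kinds(example_files)
print(dict(kind_counts))
```

On the real data the function would be called with `os.listdir(train_path_annotation)` or `os.listdir(val_path_annotation)`.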


We can observe that every kind of annotation has exactly as many files as there are images, and that both sets contain the same four kinds. Now we will check that every id has the same number of annotations:

In [5]:
# Count the annotation files per id in the training set
id_dict_train = dict()
for file in os.listdir(train_path_annotation):
    id = file.split("_")[0]
    if file.endswith(".npy"):
        if id not in id_dict_train:
            id_dict_train[id] = 1
        else:
            id_dict_train[id] += 1
print("There are", len(id_dict_train), "id in the training set")

# Check now the validation set
id_dict_val = dict()
for file in os.listdir(val_path_annotation):
    id = file.split("_")[0]
    if file.endswith(".npy"):
        if id not in id_dict_val:
            id_dict_val[id] = 1
        else:
            id_dict_val[id] += 1
print("There are", len(id_dict_val), "id in the validation set")

There are 287651 id in the training set
There are 3999 id in the validation set


In [6]:
if any(value != 4 for value in id_dict_train.values()):
    print("There is at least one id in 'train' set with less/more than 4 annotations")
else:
    print("All ids have 4 annotations in 'train' set")

if any(value != 4 for value in id_dict_val.values()):
    print("There is at least one id in 'val' set with less/more than 4 annotations")
else:
    print("All ids have 4 annotations in 'val' set")

All ids have 4 annotations in 'train' set
All ids have 4 annotations in 'val' set
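
A count of 4 per id does not by itself say which kinds are present. A possible complement, sketched here under the assumption that file names follow `<id>_<kind>.npy`, is to collect the set of kinds per id and report the missing ones. The names `EXPECTED_KINDS` and `find_incomplete_ids` and the example file names are hypothetical.

```python
from collections import defaultdict

EXPECTED_KINDS = {"aro", "exp", "lnd", "val"}  # the four kinds found in the previous cells

def find_incomplete_ids(filenames, expected=EXPECTED_KINDS):
    """Map each id to the set of expected kinds it lacks (only ids with gaps are returned)."""
    kinds_by_id = defaultdict(set)
    for name in filenames:
        if not name.endswith(".npy"):
            continue
        stem = name[: -len(".npy")]
        sample_id, sep, kind = stem.rpartition("_")
        if sep:
            kinds_by_id[sample_id].add(kind)
    return {
        sample_id: expected - kinds
        for sample_id, kinds in kinds_by_id.items()
        if expected - kinds
    }

# Hypothetical file names, for illustration only
example_files = ["0_aro.npy", "0_exp.npy", "0_lnd.npy", "0_val.npy", "1_exp.npy", "1_val.npy"]
missing = find_incomplete_ids(example_files)
print(missing)
```

An empty dictionary would mean that every id has all four kinds.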


Now we will check, for each annotation file, that the number of values inside and the data type are the expected ones:

* `val`: Valence. The expected length of the numpy vector is 1. The dtype is `<Ux` (a little-endian unicode string of length x). The value represented is a float. The accepted range is [-1,1].
* `aro`: Arousal. The expected length of the numpy vector is 1. The dtype is `<Ux` (a little-endian unicode string of length x). The value represented is a float. The accepted range is [-1,1].
* `exp`: Expression (categorical emotion). The expected length of the numpy vector is 1. The dtype is `<Ux` (a little-endian unicode string of length x). The value represented is an integer. The accepted values are [0, 1, 2, 3, 4, 5, 6, 7], as listed in the citation below.
* `lnd`: Landmarks of the face. The expected length of the numpy vector is 136 (68 points with successive x and y coordinates). The data type is float for each coordinate.

`val`, `aro` and `exp` are stored as unicode strings, whereas `lnd` uses the float type (since I have little interest in the landmarks, I will not be using them).

If nothing is printed apart from the progress messages, the rules are followed.

**Cited from the AffectNet8**

* The x and y coordination and the annotations are stored in .npy files. Use Python numpy to read the files separated with a semi-colon. 
* Expression: expression ID of the face (0: Neutral, 1: Happy, 2: Sad, 3:Surprise, 4: Fear, 5: Disgust, 6: Anger, 7: Contempt) 
* Valence: valence value of the expression in interval [-1,+1] (for Uncertain and No-face categories the value is -2) 
* Arousal: arousal value of the expression in interval [-1,+1] (for Uncertain and No-face categories the value is -2)
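
To illustrate what reading such a string-encoded annotation looks like, here is a minimal sketch. It writes two hypothetical files to a temporary directory only to be runnable on its own; the helper `load_scalar_annotation` and the example values are assumptions, not part of the dataset tooling.

```python
import os
import tempfile

import numpy as np

# Expression ids in the order given by the description cited above
EXPRESSION_NAMES = ["Neutral", "Happy", "Sad", "Surprise", "Fear", "Disgust", "Anger", "Contempt"]

def load_scalar_annotation(path, cast):
    """Load a one-element .npy annotation stored as a unicode string and cast it."""
    data = np.load(path)
    if data.size != 1:
        raise ValueError(f"{path} holds {data.size} values, expected 1")
    # .item() returns the single element as a Python str, whatever the array shape
    return cast(data.item())

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical files, written here only to make the example self-contained
    val_file = os.path.join(tmp_dir, "0_val.npy")
    exp_file = os.path.join(tmp_dir, "0_exp.npy")
    np.save(val_file, np.array("-0.25"))
    np.save(exp_file, np.array("1"))

    stored_dtype = np.load(val_file).dtype
    valence = load_scalar_annotation(val_file, float)
    expression = load_scalar_annotation(exp_file, int)

print(stored_dtype.kind, valence, expression, EXPRESSION_NAMES[expression])
```

The `kind` attribute of the dtype is `"U"` for unicode strings of any length, which makes it a convenient way to test for the `<Ux` family.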

In [7]:
print("Analyzing the annotations in the training set")
analyzed_ids = 0
for id in id_dict_train.keys():
    for filename in file_type_set.keys():
        data = np.load(os.path.join(train_path_annotation, id + "_" + filename + ".npy"))
        if filename == 'lnd':
            if data.size != 136:
                print("The file", id + "_" + filename + ".npy", "has", data.size, "annotations from the standard 136")
            if not np.issubdtype(data.dtype, np.floating):
                print("The file", id + "_" + filename + ".npy", "has no float values")
        else:
            if data.size != 1:
                print("The file", id + "_" + filename + ".npy", "has", data.size, "annotations")
            if not np.issubdtype(data.dtype, np.str_):  # np.unicode_ was removed in NumPy 2.0
                print("The file", id + "_" + filename + ".npy", "has been stored using other type than string")
        
        if filename == 'exp':
            if int(data) not in [0, 1, 2, 3, 4, 5, 6, 7]: # Available expressions
                print("The file", id + "_" + filename + ".npy", "has an expression value of", int(data))
        elif filename == 'val' or filename == 'aro':
            if float(data) < -1.0 or float(data) > 1.0:
                print("The file", id + "_" + filename + ".npy", "has a value of", float(data), "for", filename)
    if analyzed_ids % 25000 == 0:
        print("Analyzed", analyzed_ids, "ids from", len(id_dict_train))
    analyzed_ids += 1

print("Analyzing the annotations in the validation set")
analyzed_ids = 0
for id in id_dict_val.keys():
    for filename in file_type_set.keys():
        data = np.load(os.path.join(val_path_annotation, id + "_" + filename + ".npy"))
        if filename == 'lnd':
            if data.size != 136:
                print("The file", id + "_" + filename + ".npy", "has", data.size, "annotations from the standard 136")
            if not np.issubdtype(data.dtype, np.floating):
                print("The file", id + "_" + filename + ".npy", "has no float values")
        else:
            if data.size != 1:
                print("The file", id + "_" + filename + ".npy", "has", data.size, "annotations")
                
            if not np.issubdtype(data.dtype, np.str_):  # np.unicode_ was removed in NumPy 2.0
                print("The file", id + "_" + filename + ".npy", "has been stored using other type than string")

        if filename == 'exp':
            if int(data) not in [0, 1, 2, 3, 4, 5, 6, 7]: # Available expressions
                print("The file", id + "_" + filename + ".npy", "has an expression value of", int(data))
        elif filename == 'val' or filename == 'aro':
            if float(data) < -1.0 or float(data) > 1.0:
                print("The file", id + "_" + filename + ".npy", "has a value of", float(data), "for", filename)
    if analyzed_ids % 10000 == 0:
        print("Analyzed", analyzed_ids, "ids from", len(id_dict_val))
    analyzed_ids += 1

Analyzing the annotations in the training set
Analyzed 0 ids from 287651
Analyzed 25000 ids from 287651
Analyzed 50000 ids from 287651
Analyzed 75000 ids from 287651
Analyzed 100000 ids from 287651
Analyzed 125000 ids from 287651
Analyzed 150000 ids from 287651
Analyzed 175000 ids from 287651
Analyzed 200000 ids from 287651
Analyzed 225000 ids from 287651
Analyzed 250000 ids from 287651
Analyzed 275000 ids from 287651
Analyzing the annotations in the validation set
Analyzed 0 ids from 3999


Finally, we check that every known annotation id has its own image file:

In [8]:
# Convert the dictionary to a set
id_set_train = set(id_dict_train)
for image_path in os.listdir(train_path_images):
    id = image_path.split(".")[0]
    if id in id_set_train:
        id_set_train.remove(id)
    else:
        print("The image", image_path, "has no annotations in the training set")

if (len(id_set_train) == 0):
    print("All images have an annotation on 'train' set, and all annotations have an image")

id_set_val = set(id_dict_val)
for image_path in os.listdir(val_path_images):
    id = image_path.split(".")[0]
    if id in id_set_val:
        id_set_val.remove(id)
    else:
        print("The image", image_path, "has no annotations in the validation set")

if (len(id_set_val) == 0):
    print("All images have an annotation on 'val' set, and all annotations have an image")

All images have an annotation on 'train' set, and all annotations have an image
All images have an annotation on 'val' set, and all annotations have an image
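
The same cross-check can be expressed with set differences, which report both directions at once: images without annotations and annotation ids without an image. This is a sketch only; `cross_check` is a hypothetical helper, and the `.jpg` extension in the example names is an assumption.

```python
import os

def cross_check(image_files, annotation_ids):
    """Return (image ids without annotations, annotation ids without an image)."""
    image_ids = {os.path.splitext(name)[0] for name in image_files}
    annotation_ids = set(annotation_ids)
    return image_ids - annotation_ids, annotation_ids - image_ids

# Hypothetical names, for illustration only
images_only, annotations_only = cross_check(["0.jpg", "1.jpg", "5.jpg"], ["0", "1", "2"])
print(sorted(images_only), sorted(annotations_only))
```

On the real data the call would be `cross_check(os.listdir(train_path_images), id_dict_train)`, and two empty sets would mean that images and annotations match one to one.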


# Conclusion
We have checked the dataset structure to see whether the data can be loaded quickly and easily when creating the interim dataset. We have seen that the dataset is well structured and can be loaded easily. The lnd files are not necessary for our project, so we will not be using them (we do not expect to have a landmark detection model in our project). Furthermore, we have confirmed that each photo has the same number of annotations, and that each annotation holds the expected number of values. We have also checked that the values are in the expected range and that the data type is correct. Finally, we have checked that each annotation id has its own image file.

There are only train and validation sets, so we will have to remake the partitions for data preprocessing.
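
One possible way to remake the partitions is a stratified split on the expression label, so that each partition keeps roughly the same class mix. The sketch below is an illustration under assumptions and not a decision taken in this notebook: `stratified_split`, the `test_fraction` parameter, the fixed seed and the example labels are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels_by_id, test_fraction=0.2, seed=0):
    """Split ids into (train, test) lists, keeping roughly the same label mix in both."""
    ids_by_label = defaultdict(list)
    for sample_id, label in labels_by_id.items():
        ids_by_label[label].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_ids, test_ids = [], []
    for label in sorted(ids_by_label):
        ids = sorted(ids_by_label[label])  # sort first so the shuffle depends only on the seed
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return train_ids, test_ids

# Hypothetical ids and expression labels, for illustration only
example_labels = {str(i): i % 2 for i in range(20)}
train_ids, test_ids = stratified_split(example_labels, test_fraction=0.2)
print(len(train_ids), len(test_ids))
```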

Steps to do in interim:
* Check if data is well distributed and if we have to do some data augmentation.
* Check that images can be loaded and are not corrupt, and that their size is 224x224.
* The val/aro annotations do not appear to contain -2 values for the no-face or uncertain categories (no value is lower than -1), so we will have to check whether there are any images that belong to those categories.
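
The first and last steps can be sketched as follows, assuming the `exp` and `val`/`aro` annotations have already been loaded into plain Python lists. The helpers `class_shares` and `count_sentinels` and the example values are hypothetical.

```python
from collections import Counter

def class_shares(expression_ids):
    """Return the fraction of samples per expression id."""
    counts = Counter(expression_ids)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

def count_sentinels(values, sentinel=-2.0):
    """Count values equal to the marker used for the Uncertain and No-face categories."""
    return sum(1 for value in values if value == sentinel)

# Hypothetical values, standing in for the loaded exp and val annotations
example_exp = [0, 1, 1, 1, 2, 6, 1, 0]
example_val = [0.0, 0.8, 0.6, 0.7, -0.5, -0.6, 0.9, -2.0]

shares = class_shares(example_exp)
n_sentinels = count_sentinels(example_val)
print(shares, n_sentinels)
```

A strongly skewed `shares` dictionary would suggest that augmentation or re-weighting is worth considering, and a non-zero sentinel count would point to images that need to be filtered out.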
