# Deep Learning homework

<a href="https://colab.research.google.com/github/floriensk/deep_learning_homework/blob/main/src/model.ipynb">
<button>
Open in Colab
</button>
</a>

## Installing dependencies

In [None]:
%pip install requests
%pip install tqdm
%pip install scikit-learn

## Data fetching
We use the *FairFace* dataset to train our model.

We fetch the data with a streaming download so that we can track progress.

In [7]:
from tqdm import tqdm
import requests
import os

def download_file(uri, target_path):
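    # Create the directory of the target file if it does not exist yet
    target_dir = os.path.dirname(target_path)
    if target_dir: # An empty dirname means the current directory, nothing to create
        os.makedirs(target_dir, exist_ok=True)

    # Download file using streaming, so we can iterate over the response
    response = requests.get(uri, stream=True)
    response.raise_for_status() # Fail early on HTTP errors instead of saving an error page
    total_size_in_bytes = int(response.headers.get('content-length', 0)) # Total size of data to download (0 if unknown)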
    block_size = 1024 # Download in chunks for progress tracking
    progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True) # Use a progress bar to track progress

    with open(target_path, 'wb') as file:
        for data in response.iter_content(block_size):
            progress_bar.update(len(data))
            file.write(data) # Write downloaded chunk to file
    progress_bar.close()

    if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
        print(f"Error during download of {target_path}")
    else:
        print(f"Downloading {target_path} finished successfully")

First, we fetch the images from the corresponding Google Drive folder.

In [2]:
dir_path = "../data" # Path of directory to extract the downloaded data into

In [None]:
uri_images = "https://drive.google.com/uc?export=download&id=1g7qNOZz9wC7OfOhcPqH1EZ5bk1UFGmlL&confirm=t&uuid=729c215d-4fa4-4799-b03f-aea00a016230&at=ALAFpqx7EciTPuBT0YNhhbYsVpML:1666561770553"
images_file_path = os.path.join(dir_path, "fairface.zip") # Path of downloaded ZIP file

download_file(uri_images, images_file_path)

Then we fetch the CSV files containing the labels for the images.

In [3]:
uri_labels_train = "https://drive.google.com/uc?export=download&id=1i1L3Yqwaio7YSOCj7ftgk8ZZchPG7dmH"
labels_train_valid_file_path = os.path.join(dir_path, "labels_train_valid.csv") # Will be split into train and valid, so already naming it that way

uri_labels_val = "https://drive.google.com/uc?export=download&id=1wOdja-ezstMEp81tX1a-EYkFebev4h7D"
labels_test_file_path = os.path.join(dir_path, "labels_test.csv") # Will be used as test set, so already naming it that way

In [None]:
download_file(uri_labels_train, labels_train_valid_file_path) # Download train and valid data sets
download_file(uri_labels_val, labels_test_file_path) # Download test data set

### Data extraction
The data needs to be uncompressed.

In [None]:
from zipfile import ZipFile

with ZipFile(images_file_path) as zip_file: # Avoid shadowing the built-in zip()
    zip_file.extractall(dir_path)

In [None]:
# Delete ZIP after extracting
os.remove(images_file_path)

Next, we read the labels into memory.

In [4]:
import numpy as np

labels_train_valid = np.loadtxt(labels_train_valid_file_path, delimiter=",", skiprows=1, dtype="str") # Read while skipping header
labels_test = np.loadtxt(labels_test_file_path, delimiter=",", skiprows=1, dtype="str")

## Data segmentation
Finally, we split the data into train, validation and test datasets for further use by our model.

Data in the downloaded dataset is already split into *train* and *val* subsets (the latter makes up about 10% of all images). Since we need to split the dataset into train, validation and test subsets, we will turn the specified *val* subset into the test subset and split the specified *train* subset into train and validation subsets.

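The target split ratios over the full dataset are as follows (the cells below currently load only a subset of the train data, so the ratios printed later differ):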
+ train: ~74%
+ validation: ~15%
+ test: ~11%
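
As a quick check of how these ratios relate to the split parameter, the sketch below derives the fraction of the original *train* subset that would have to become validation data to reach the targets listed above. The variable names are illustrative and not part of the notebook.

```python
# Target ratios over the full dataset, taken from the list above
target_ratios = {"train": 0.74, "valid": 0.15, "test": 0.11}

# The test subset is fixed by the dataset, so only the train/valid split is ours to choose
train_valid_share = target_ratios["train"] + target_ratios["valid"]

# Fraction of the original train subset that must become validation data
valid_fraction = target_ratios["valid"] / train_valid_share

print(f"train + valid share of all images: {train_valid_share:.2f}")
print(f"validation fraction of the train subset: {valid_fraction:.3f}")
```

Under these targets, a `test_size` of roughly 0.17 would be passed to `train_test_split`. The value of 0.2 used later in this notebook would shift the validation share to about 18% of all images.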

In [5]:
dir_train_valid_path = os.path.join(dir_path, "train_valid")
dir_test_path = os.path.join(dir_path, "test")

Rename the extracted folders accordingly.

In [None]:
# Turn "train" into "train_valid"
os.rename(os.path.join(dir_path, "train"), dir_train_valid_path)
# Turn "val" into "test"
os.rename(os.path.join(dir_path, "val"), dir_test_path)

## Preprocessing

In [6]:
import keras
from sklearn import preprocessing
import tensorflow
from sklearn.model_selection import train_test_split

Initially, we wanted to use a generator to feed data into the network, but the popular generator API provided by Keras seemed to be incompatible with our goal of classifying according to multiple labels at the same time. Loading the whole data set at once does not fit into memory either, so until we further improve our model, we only use a subset of the train data: 10,000 samples.

For the same reason, we do not load the test data set into memory yet.

In [7]:
train_data_count = 10000

labels_train_valid = labels_train_valid[:train_data_count]

images_train_valid = [
    keras.utils.img_to_array(keras.utils.load_img(os.path.join(dir_train_valid_path, file[0].split("/")[1]))) # read the file name from the labels and load the image
    for file in labels_train_valid]
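
As a possible alternative to loading everything at once, a plain Python generator can yield batches with arbitrary label arrays, including concatenated multi-label targets. The sketch below is a minimal illustration under assumptions: `batch_generator` and `load_image` are hypothetical names, and the demonstration uses fake arrays instead of real image files. Keras's `model.fit` can usually consume such a generator, but we have not tried this in the notebook.

```python
import numpy as np

def batch_generator(file_names, labels, batch_size, load_image):
    """Yield (images, labels) batches, loading images only when they are needed.

    `load_image` is a hypothetical callable mapping a file name to an array;
    in the notebook it could wrap keras.utils.load_img and img_to_array.
    """
    sample_count = len(file_names)
    while True:  # Loop forever, as training generators are usually expected to
        for start in range(0, sample_count, batch_size):
            end = min(start + batch_size, sample_count)
            images = np.stack([load_image(name) for name in file_names[start:end]])
            yield images, labels[start:end]


def fake_loader(name):
    """Stand-in for real image loading: a 2x2 RGB array filled with the file index."""
    return np.full((2, 2, 3), float(name.split(".")[0]))


# Tiny demonstration with 5 fake images and 3-column label rows
fake_names = [f"{i}.jpg" for i in range(5)]
fake_labels = np.eye(3)[[0, 1, 2, 0, 1]]

generator = batch_generator(fake_names, fake_labels, batch_size=2, load_image=fake_loader)
first_images, first_labels = next(generator)
print(first_images.shape, first_labels.shape)
```

Only one batch of images is held in memory at a time, so the subset limit above would no longer be dictated by memory size.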

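Preprocess the input according to the needs of the VGG16 model. For VGG16, `preprocess_input` converts the images from RGB to BGR and zero-centers each color channel with respect to the ImageNet dataset, without scaling the pixel values.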

In [None]:
from tensorflow.keras.applications.vgg16 import preprocess_input

images_train_valid = np.array(images_train_valid)
images_train_valid = preprocess_input(images_train_valid)
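
To make the effect of this step concrete, the sketch below reimplements the VGG16-style preprocessing in plain NumPy. It is an illustration only: `vgg16_style_preprocess` is a hypothetical helper, not the Keras function, and it assumes channels-last RGB input with values in the 0 to 255 range.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used by this style of preprocessing
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68])

def vgg16_style_preprocess(images_rgb):
    """Convert RGB images (values in 0..255) to zero-centered BGR, without scaling."""
    images_bgr = images_rgb[..., ::-1].astype("float64")  # Reverse channel order: RGB -> BGR
    return images_bgr - IMAGENET_BGR_MEANS

# A pixel whose RGB value equals the ImageNet mean maps to (approximately) zero
mean_pixel_rgb = np.array([[[[123.68, 116.779, 103.939]]]])
print(vgg16_style_preprocess(mean_pixel_rgb))

# A white pixel stays far outside [0, 1], since no scaling is applied
white_pixel = np.full((1, 1, 1, 3), 255.0)
print(vgg16_style_preprocess(white_pixel))
```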

We one-hot encode all labels separately, then construct the desired output of the model by concatenating these arrays.

In [8]:
# one-hot encode labels
# Note: `sparse_output` requires scikit-learn >= 1.2; older versions call this parameter `sparse`
age_encoder = preprocessing.OneHotEncoder(sparse_output=False)
age_labels = age_encoder.fit_transform(labels_train_valid[:, 1].reshape(-1, 1))

gender_encoder = preprocessing.OneHotEncoder(sparse_output=False)
gender_labels = gender_encoder.fit_transform(labels_train_valid[:, 2].reshape(-1, 1))

race_encoder = preprocessing.OneHotEncoder(sparse_output=False)
race_labels = race_encoder.fit_transform(labels_train_valid[:, 3].reshape(-1, 1))

# Concatenate the three encodings column-wise: [age | gender | race]
labels_train_valid_encoded = np.concatenate([age_labels, gender_labels, race_labels], axis=1)
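
The layout of the concatenated label vector can be sketched as follows. The category counts below are assumptions chosen to add up to the 18 output units shown in the model summary; in the notebook they would come from `len(encoder.categories_[0])` for each fitted encoder.

```python
import numpy as np

# Hypothetical category counts per attribute (assumed, see the note above)
category_counts = {"age": 9, "gender": 2, "race": 7}

# Column boundaries of each attribute inside the concatenated vector
boundaries = np.cumsum([0] + list(category_counts.values()))
slices = {
    name: slice(int(start), int(end))
    for name, start, end in zip(category_counts, boundaries[:-1], boundaries[1:])
}

for name, column_slice in slices.items():
    print(f"{name}: columns {column_slice.start} to {column_slice.stop - 1}")
```

Keeping these slices in one place makes it easier to cut a model output back into its age, gender and race parts later.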

Finally, split the first data set into train and validation data sets.

In [10]:
# split into train and valid
images_train, images_valid, labels_train, labels_valid = train_test_split(images_train_valid, labels_train_valid_encoded, test_size=0.2, random_state=42, shuffle=True)

We have successfully created the three subsets: *train*, *valid* and *test*. (Note that in the file system, only the test subset is in a separate directory, as it was in the original dataset. Physically separating the other two subsets would be an unnecessary operation.)

In [23]:
train_images_count = len(labels_train)
valid_images_count = len(labels_valid)
test_images_count = len(os.listdir(dir_test_path))
images_count = train_images_count + valid_images_count + test_images_count

print(f"train: {train_images_count} images ({train_images_count / images_count * 100:.1f}%)")
print(f"valid: {valid_images_count} images ({valid_images_count / images_count * 100:.1f}%)")
print(f"test:  {test_images_count} images ({test_images_count / images_count * 100:.1f}%)")

train: 8000 images (38.2%)
valid: 2000 images (9.5%)
test:  10954 images (52.3%)


## Model

In [13]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Flatten, Conv2D, MaxPooling2D, GlobalAveragePooling2D

# from tensorflow.keras.optimizers import SGD

Below is a model based on the VGG16 architecture. Its training became unstable for reasons we could not determine, so we stopped using it.

In [248]:
# image_shape = images_train.shape[1:]
# kernel_size = (3, 3)
# convolution_activation = None
# convolution_padding = "same"
# kernel_counts = [32, 64, 128, 128, 128]

# base_model = Sequential()
# base_model.add(Conv2D(kernel_counts[0], kernel_size, input_shape=image_shape, padding=convolution_padding, activation=convolution_activation))
# base_model.add(Conv2D(kernel_counts[0], kernel_size, padding=convolution_padding))
# base_model.add(MaxPooling2D(pool_size=(2, 2)))

# base_model.add(Conv2D(kernel_counts[1], kernel_size, padding=convolution_padding, activation=convolution_activation))
# base_model.add(Conv2D(kernel_counts[1], kernel_size, padding=convolution_padding))
# base_model.add(MaxPooling2D(pool_size=(2, 2)))

# base_model.add(Conv2D(kernel_counts[2], kernel_size, padding=convolution_padding, activation=convolution_activation))
# base_model.add(Conv2D(kernel_counts[2], kernel_size, padding=convolution_padding))
# base_model.add(Conv2D(kernel_counts[2], kernel_size, padding=convolution_padding))
# base_model.add(MaxPooling2D(pool_size=(2, 2)))

# base_model.add(Conv2D(kernel_counts[3], kernel_size, padding=convolution_padding, activation=convolution_activation))
# base_model.add(Conv2D(kernel_counts[3], kernel_size, padding=convolution_padding))
# base_model.add(Conv2D(kernel_counts[3], kernel_size, padding=convolution_padding))
# base_model.add(MaxPooling2D(pool_size=(2, 2)))

# base_model.add(Conv2D(kernel_counts[4], kernel_size, padding=convolution_padding, activation=convolution_activation))
# base_model.add(Conv2D(kernel_counts[4], kernel_size, padding=convolution_padding))
# base_model.add(Conv2D(kernel_counts[4], kernel_size, padding=convolution_padding))
# base_model.add(MaxPooling2D(pool_size=(2, 2)))

# base_model.add(Flatten())
# base_model.add(Dense(1000, activation="relu"))
# base_model.add(Dropout(0.5))
# base_model.add(Dense(500, activation="relu"))
# base_model.add(Dropout(0.5))
# base_model.add(Dense(labels_train.shape[1], activation="softmax"))

# base_model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])

We then decided to use the original VGG16 model, but we could not get it to train on our dataset either.

In [14]:
from tensorflow.keras.applications.vgg16 import VGG16

In [20]:
image_shape = images_train.shape[1:]

base_model = VGG16(include_top=False, input_shape=image_shape) # use VGG16 as base model for transfer learning

base_model.trainable = False # freeze the base model

# Create new model on top
inputs = keras.Input(shape=image_shape)
x = base_model(inputs, training=False)

x = GlobalAveragePooling2D()(x)
x = Dense(100, activation="relu")(x)
x = Dropout(0.5)(x)
x = Dense(labels_train.shape[1], activation="softmax")(x)

model = keras.Model(inputs, x)

model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])

In [21]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 224, 224, 3)]     0         
                                                                 
 vgg16 (Functional)          (None, 7, 7, 512)         14714688  
                                                                 
 global_average_pooling2d_1   (None, 512)              0         
 (GlobalAveragePooling2D)                                        
                                                                 
 dense_2 (Dense)             (None, 100)               51300     
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense_3 (Dense)             (None, 18)                1818      
                                                             

We tried various batch sizes, optimizers, and numbers of attached final dense layers with various sizes and activations, but the accuracy would barely exceed 0.06, even on the training data set. We were unable to tell why.

In [22]:
batch_size = 30
epochs = 10

model.fit(
    images_train,
    labels_train,
    epochs=epochs,
    validation_data=(images_valid, labels_valid),
    batch_size=batch_size,
    verbose=1
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
  3/267 [..............................] - ETA: 16:55 - loss: 3879.5803 - accuracy: 0.0333  

KeyboardInterrupt: 
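
One possible contributor to the behavior above, which we have not verified against an actual training run, is the combination of a single softmax output with a target vector that has three active classes (one each for age, gender and race). A softmax output sums to 1, so it can never match such a target exactly. The NumPy sketch below illustrates the lowest loss a single softmax could reach in that case; the column indices are arbitrary examples.

```python
import numpy as np

def categorical_crossentropy(target, predicted):
    """Cross-entropy for one sample, evaluated only where the target is active."""
    active = target > 0
    return float(-np.sum(target[active] * np.log(predicted[active])))

# Multi-hot target with one active class per attribute (18 columns in total)
active_columns = [0, 9, 11]
target = np.zeros(18)
target[active_columns] = 1.0

# Best case for a single softmax: probability mass split evenly over the active classes
best_single_softmax = np.zeros(18)
best_single_softmax[active_columns] = 1.0 / 3.0

best_loss = categorical_crossentropy(target, best_single_softmax)
print(f"lowest reachable loss: {best_loss:.4f}")
print(f"3 * ln(3):             {3 * np.log(3):.4f}")
```

A design that avoids this limit would give each attribute its own softmax output and loss, or use sigmoid outputs with a binary cross-entropy loss. Either option is a suggestion to try, not a confirmed fix.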

## Evaluation

Since our model did not learn as expected, evaluating its performance is not meaningful yet. Once we figure out how to keep the training process from failing (which we definitely will!), we will add the following steps to our notebook:
1. Load the test data set into memory and preprocess it in the same way as the train and validation data sets.
1. Run our model on the test data set to see its final performance.
1. Compute the confusion matrix of our model for each attribute (age, gender and race) to see its performance broken down by attribute.
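
The confusion matrix step could look roughly like the sketch below for a single attribute. It uses plain NumPy and made-up values; `confusion_matrix` here is an illustrative helper, and `sklearn.metrics.confusion_matrix` would be a ready-made alternative.

```python
import numpy as np

def confusion_matrix(true_indices, predicted_indices, class_count):
    """Count how often each true class (rows) is predicted as each class (columns)."""
    matrix = np.zeros((class_count, class_count), dtype=int)
    np.add.at(matrix, (true_indices, predicted_indices), 1)
    return matrix

# Made-up example for one attribute with 2 classes
true_one_hot = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
predicted_scores = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.8, 0.2]])

matrix = confusion_matrix(
    true_one_hot.argmax(axis=1),      # Index of the true class per sample
    predicted_scores.argmax(axis=1),  # Index of the highest-scoring class per sample
    class_count=2,
)
print(matrix)
```

Repeating this on the age, gender and race columns of the model output would give one matrix per attribute.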

Then we would further experiment with how transfer learning could be done efficiently, as a final goal of our project.

### Model use

Functions for interpreting the output of the model:

In [None]:
def get_age(model_output):
    model_output = np.atleast_2d(model_output)
    age_count = age_encoder.categories_[0].shape[0]
    # Age occupies the first columns of the output
    return age_encoder.inverse_transform(model_output[:, :age_count]).reshape(-1)

def get_gender(model_output):
    model_output = np.atleast_2d(model_output)
    age_count = age_encoder.categories_[0].shape[0]
    gender_count = gender_encoder.categories_[0].shape[0]
    # Gender columns follow directly after the age columns
    return gender_encoder.inverse_transform(model_output[:, age_count:age_count + gender_count]).reshape(-1)

def get_race(model_output):
    model_output = np.atleast_2d(model_output)
    race_count = race_encoder.categories_[0].shape[0]
    # Race occupies the last columns of the output
    return race_encoder.inverse_transform(model_output[:, -race_count:]).reshape(-1)