# Deep Learning homework

<a href="https://colab.research.google.com/github/floriensk/deep_learning_homework/blob/main/src/model.ipynb">
<button>
Open in Colab
</button>
</a>

## Installing dependencies

In [None]:
%pip install requests
%pip install tqdm
%pip install scikit-learn

## Data fetching
We use the *fairface* dataset to train our model.

The switch below selects whether the **dataset** is read from the **local** filesystem or **downloaded** from the internet.

In [2]:
USE_LOCAL_DATASET = False

We stream the download in chunks, which lets us track its progress.

In [3]:
from tqdm import tqdm
import requests
import os

def download_file(uri, target_path):
    # Create the directory of the target file if it does not exist yet
    os.makedirs(os.path.dirname(target_path), exist_ok=True)

    # Download file using streaming, so we can iterate over the response
    response = requests.get(uri, stream=True)
    response.raise_for_status() # Stop on HTTP errors instead of saving an error page
    total_size_in_bytes = int(response.headers.get('content-length', 0)) # Total size of data to download
    block_size = 1024 # Download in chunks for progress tracking
    progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True) # Use a progress bar to track progress

    with open(target_path, 'wb') as file:
        for data in response.iter_content(block_size):
            progress_bar.update(len(data))
            file.write(data) # Write downloaded chunk to file
    progress_bar.close()

    if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
        print(f"Error during download of {target_path}")
    else:
        print(f"Downloading {target_path} finished successfully")

### Downloading the images

First, we fetch the images from the corresponding Google Drive folder.

In [4]:
dir_path = "../data" # Path of directory to extract the downloaded data into

In [5]:
if not USE_LOCAL_DATASET:
    uri_images = "https://drive.google.com/uc?export=download&id=1g7qNOZz9wC7OfOhcPqH1EZ5bk1UFGmlL&confirm=t&uuid=729c215d-4fa4-4799-b03f-aea00a016230&at=ALAFpqx7EciTPuBT0YNhhbYsVpML:1666561770553"
    images_file_path = "../data/fairface.zip" # Path of downloaded ZIP file

    download_file(uri_images, images_file_path)

### Uncompressing the images

The images need to be uncompressed.

In [6]:
if not USE_LOCAL_DATASET:
    from zipfile import ZipFile

    with ZipFile(images_file_path) as zip_file: # avoid shadowing the built-in zip()
        zip_file.extractall(dir_path)

In [7]:
# Delete ZIP after extracting
if not USE_LOCAL_DATASET:
    os.remove(images_file_path)

### Downloading the labels

Then we fetch the CSV files containing the labels for the images.

In [8]:
labels_train_valid_file_path = os.path.join(dir_path, "labels_train_valid.csv") # Will be split into train and valid, so already naming it that way

labels_test_file_path = os.path.join(dir_path, "labels_test.csv") # Will be used as test set, so already naming it that way

In [9]:
if not USE_LOCAL_DATASET:
    uri_labels_train = "https://drive.google.com/uc?export=download&id=1i1L3Yqwaio7YSOCj7ftgk8ZZchPG7dmH"
    download_file(uri_labels_train, labels_train_valid_file_path) # Download train and valid data sets

    uri_labels_val = "https://drive.google.com/uc?export=download&id=1wOdja-ezstMEp81tX1a-EYkFebev4h7D"
    download_file(uri_labels_val, labels_test_file_path) # Download test data set

Next, read the labels into memory.

In [10]:
import numpy as np

labels_train_valid = np.loadtxt(labels_train_valid_file_path, delimiter=",", skiprows=1, dtype="str") # Read while skipping header
labels_test = np.loadtxt(labels_test_file_path, delimiter=",", skiprows=1, dtype="str")

## Data segmentation
Finally, we split the data into train, validation and test datasets for further use by our model.

Data in the downloaded dataset is already split into *train* and *val* subsets (the latter makes up about 10% of all images). Since we need to split the dataset into train, validation and test subsets, we will turn the specified *val* subset into the test subset and split the specified *train* subset into train and validation subsets.

In [11]:
dir_train_valid_path = os.path.join(dir_path, "train_valid")
dir_test_path = os.path.join(dir_path, "test")

Rename the extracted folders accordingly.

In [12]:
if not USE_LOCAL_DATASET:
    # Turn "train" into "train_valid"
    os.rename(os.path.join(dir_path, "train"), dir_train_valid_path)
    # Turn "val" into "test"
    os.rename(os.path.join(dir_path, "val"), dir_test_path)

## Preprocessing

In [13]:
from sklearn import preprocessing
import tensorflow
from tensorflow.data import Dataset

In [14]:
## Extracts the file name from a file path
def get_file_name(file_path):
    split_path = tensorflow.strings.split(file_path, "\\").numpy()
    split_path = np.atleast_2d(split_path)[:,-1]
    split_path = tensorflow.strings.split(split_path, "/").numpy()
    split_path = np.atleast_2d(split_path)[:,-1]
    return split_path

## Returns the image as a numpy array from a file path
def get_image(file_path):
    return tensorflow.keras.utils.img_to_array(
        tensorflow.keras.utils.load_img(file_path)
    )

## Returns the labels for a file path
## @param test_dataset: if True, the test dataset will be used, otherwise the train and valid dataset
def get_labels(file_path, test_dataset=False):
    file_name = get_file_name(file_path) # get only the file name, without a path

    labels = labels_test if test_dataset else labels_train_valid # get the desired set of labels
    
    return labels[get_file_name(labels[:, 0]) == file_name].ravel()[1:] # get the labels for the file name
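
`get_labels` compares every row of the label array against the file name each time it is called, which gets costly when it runs once per image. A possible alternative, sketched below under assumptions, is to build a dictionary from file name to label row once and look paths up in it. The sample rows, `base_name` and `build_label_index` are hypothetical and only illustrate the idea; they are not part of the notebook.

```python
def base_name(path):
    """Return the last path component, accepting both "/" and "\\" separators."""
    return path.replace("\\", "/").rsplit("/", 1)[-1]


def build_label_index(label_rows):
    """Map each file name to its labels (all columns after the path)."""
    return {base_name(row[0]): list(row[1:]) for row in label_rows}


# Hypothetical rows in the assumed layout: path, age, gender, race
sample_rows = [
    ["train/1.jpg", "50-59", "Male", "East Asian"],
    ["train/2.jpg", "30-39", "Female", "Indian"],
]

label_index = build_label_index(sample_rows)  # built once, reused for every image
print(label_index[base_name("..\\data\\train_valid\\2.jpg")])
```

Since the train and test sets may reuse file names, one such index per label file would be needed.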

We one-hot encode all labels separately, then construct the desired output of the model by concatenating these arrays.

In [15]:
# one-hot encode labels
labels_all = np.append(labels_train_valid, labels_test, axis=0)

# sparse_output replaces the older sparse argument (scikit-learn 1.2 and later)
age_encoder = preprocessing.OneHotEncoder(sparse_output=False, dtype="float32")
age_encoder.fit(labels_all[:, 1].reshape(-1, 1))

gender_encoder = preprocessing.OneHotEncoder(sparse_output=False, dtype="float32")
gender_encoder.fit(labels_all[:, 2].reshape(-1, 1))

race_encoder = preprocessing.OneHotEncoder(sparse_output=False, dtype="float32")
race_encoder.fit(labels_all[:, 3].reshape(-1, 1))

## Converts the labels to the output of the model
def encode_labels(labels):
    labels = np.atleast_2d(labels)

    # Each encoder expects one sample per row, i.e. a column of shape (n_samples, 1)
    age_labels = age_encoder.transform(labels[:, 0].reshape(-1, 1))
    gender_labels = gender_encoder.transform(labels[:, 1].reshape(-1, 1))
    race_labels = race_encoder.transform(labels[:, 2].reshape(-1, 1))
    
    return np.append( # append age_labels + gender_labels + race_labels
        np.append(
            age_labels,
            gender_labels,
            axis=1
            ),
        race_labels,
        axis=1
        ).reshape(-1)

## Converts the output of the model to the decoded labels
def decode_labels(labels):
    labels = np.atleast_2d(labels)

    age_labels = age_encoder.inverse_transform(labels[:, :9])
    gender_labels = gender_encoder.inverse_transform(labels[:, 9:11])
    race_labels = race_encoder.inverse_transform(labels[:, 11:])

    return np.append( # append the decoded labels
        np.append(
            age_labels,
            gender_labels,
            axis=1
        ),
        race_labels,
        axis=1
    )
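
The layout of the concatenated vector can be illustrated without the encoders. The sketch below is a simplified stand-in that works on class indices instead of label strings; the class counts (9, 2 and 7) are taken from the slices in `decode_labels`, while `one_hot`, `encode` and `decode` are hypothetical helper names.

```python
# Class counts taken from the slices used in decode_labels (9 + 2 + 7 = 18)
AGE_CLASSES, GENDER_CLASSES, RACE_CLASSES = 9, 2, 7


def one_hot(index, size):
    """Return a list of `size` floats with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def encode(age_index, gender_index, race_index):
    """Concatenate the three one-hot vectors into one 18-element vector."""
    return (one_hot(age_index, AGE_CLASSES)
            + one_hot(gender_index, GENDER_CLASSES)
            + one_hot(race_index, RACE_CLASSES))


def decode(vector):
    """Split the vector at the same offsets and take the arg-max of each part."""
    age_end = AGE_CLASSES
    gender_end = age_end + GENDER_CLASSES
    parts = (vector[:age_end], vector[age_end:gender_end], vector[gender_end:])
    return tuple(part.index(max(part)) for part in parts)


encoded = encode(3, 1, 5)
print(len(encoded), decode(encoded))
```

Deriving the offsets from the class counts, rather than hard-coding them, keeps encoding and decoding consistent if the label set changes.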

In [16]:
## Maps a file path to the image and the labels
def get_datapoint(file_path, test_dataset=False):
    # file_path = file_path.numpy() # convert tensor to value
    image = get_image(file_path)
    labels = encode_labels(get_labels(file_path, test_dataset))
    
    return (
        image,
        labels
    )

In [17]:
# Read the file paths from the train and valid dataset
dataset_train_valid = Dataset.list_files(f"{dir_train_valid_path}/*", shuffle=True, name="dataset_train_valid")

In [18]:
# Map the file paths in the dataset to the actual image and label data
dataset_train_valid = dataset_train_valid.map(
    # using numpy_function so that the map function can run plain Python and numpy code
    lambda file_path: tensorflow.numpy_function(
        get_datapoint,
        inp=[file_path],
        Tout=(tensorflow.float32, tensorflow.float32),
        stateful=False
    ),
    num_parallel_calls=tensorflow.data.experimental.AUTOTUNE,
)

In [19]:
## Returns the shapes of the input and output of the model
def get_datapoint_shapes():
    sample = dataset_train_valid.as_numpy_iterator().next()
    return sample[0].shape, sample[1].shape # shapes of a single (unbatched) image and label vector

# Save the shapes of the input and output of the model
input_shape, output_shape = get_datapoint_shapes()

In [20]:
# Because of the use of numpy_function, the shapes of the input and output are not known to the tensorflow graph builder,
# so we have to reshape the input and output to the correct shape
dataset_train_valid = dataset_train_valid.map(
    lambda image, labels: (
        tensorflow.reshape(image, input_shape),
        tensorflow.reshape(labels, output_shape)
    ),
    num_parallel_calls=tensorflow.data.experimental.AUTOTUNE,
)

Split the combined dataset into train and validation datasets.

In [21]:
train_data_ratio = 0.8
batch_size = 32

data_count_train = int(len(labels_train_valid) * train_data_ratio)
data_count_valid = len(labels_train_valid) - data_count_train

# split the dataset into train and validation data
dataset_train = dataset_train_valid.take(data_count_train, name="dataset_train")
dataset_valid = dataset_train_valid.skip(data_count_train, name="dataset_valid")
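
One caveat with splitting by `take` and `skip`: as far as we know, `Dataset.list_files(..., shuffle=True)` reshuffles the file order on every iteration by default, so the two subsets may not stay disjoint across epochs. Below is a minimal sketch of a split that stays fixed, assuming the file list is shuffled once with a fixed seed before slicing. The file names, `split_once` and the seed are hypothetical.

```python
import random


def split_once(file_names, train_ratio, seed=0):
    """Shuffle a copy of the list once, then cut it into train and validation parts."""
    shuffled = sorted(file_names)          # start from a deterministic order
    random.Random(seed).shuffle(shuffled)  # one fixed shuffle, never repeated
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


files = [f"{i}.jpg" for i in range(10)]  # hypothetical file names
train_files, valid_files = split_once(files, train_ratio=0.8)

print(len(train_files), len(valid_files))
```

In the `tf.data` pipeline, the equivalent would presumably be to list the files with `shuffle=False` and then call `shuffle` with `reshuffle_each_iteration=False` before `take` and `skip`.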

Finally, prepare the datasets to be an appropriate source for the model.

In [22]:
def prepare_dataset(dataset):
    return (dataset
        .repeat()
        .batch(
            batch_size,
            num_parallel_calls=tensorflow.data.experimental.AUTOTUNE,
            deterministic=False,
            name="dataset_batch"
        ).prefetch(tensorflow.data.experimental.AUTOTUNE, name="dataset_prefetch"))


dataset_train = prepare_dataset(dataset_train)
dataset_valid = prepare_dataset(dataset_valid)

We have successfully created the three subsets: *train*, *valid* and *test*. (Note that in the file system only *test* is in a separate directory, as it was in the original dataset. Physically separating the other two subsets would be unnecessary.)

In [23]:
data_count_test = 0 # len(os.listdir(dir_test_path))
images_count = data_count_train + data_count_valid + data_count_test

print(f"train: {data_count_train} images ({data_count_train / images_count * 100:.1f}%)")
print(f"valid: {data_count_valid} images ({data_count_valid / images_count * 100:.1f}%)")
print(f"test:  {data_count_test } images ({data_count_test  / images_count * 100:.1f}%)")

train: 69395 images (80.0%)
valid: 17349 images (20.0%)
test:  0 images (0.0%)


## Model

In [46]:
from tensorflow.keras import Model, Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Flatten, Conv2D, MaxPooling2D, GlobalAveragePooling2D, Rescaling
# from tensorflow.keras.optimizers import SGD

Below is a custom model based on the VGG16 architecture. Its training became unstable for reasons we could not determine, so we stopped using it.

In [47]:
# image_shape = images_train.shape[1:]
# kernel_size = (3, 3)
# convolution_activation = None
# convolution_padding = "same"
# kernel_counts = [32, 64, 128, 128, 128]

# base_model = Sequential()
# base_model.add(Conv2D(kernel_counts[0], kernel_size, input_shape=image_shape, padding=convolution_padding, activation=convolution_activation))
# base_model.add(Conv2D(kernel_counts[0], kernel_size, padding=convolution_padding))
# base_model.add(MaxPooling2D(pool_size=(2, 2)))

# base_model.add(Conv2D(kernel_counts[1], kernel_size, padding=convolution_padding, activation=convolution_activation))
# base_model.add(Conv2D(kernel_counts[1], kernel_size, padding=convolution_padding))
# base_model.add(MaxPooling2D(pool_size=(2, 2)))

# base_model.add(Conv2D(kernel_counts[2], kernel_size, padding=convolution_padding, activation=convolution_activation))
# base_model.add(Conv2D(kernel_counts[2], kernel_size, padding=convolution_padding))
# base_model.add(Conv2D(kernel_counts[2], kernel_size, padding=convolution_padding))
# base_model.add(MaxPooling2D(pool_size=(2, 2)))

# base_model.add(Conv2D(kernel_counts[3], kernel_size, padding=convolution_padding, activation=convolution_activation))
# base_model.add(Conv2D(kernel_counts[3], kernel_size, padding=convolution_padding))
# base_model.add(Conv2D(kernel_counts[3], kernel_size, padding=convolution_padding))
# base_model.add(MaxPooling2D(pool_size=(2, 2)))

# base_model.add(Conv2D(kernel_counts[4], kernel_size, padding=convolution_padding, activation=convolution_activation))
# base_model.add(Conv2D(kernel_counts[4], kernel_size, padding=convolution_padding))
# base_model.add(Conv2D(kernel_counts[4], kernel_size, padding=convolution_padding))
# base_model.add(MaxPooling2D(pool_size=(2, 2)))

# base_model.add(Flatten())
# base_model.add(Dense(1000, activation="relu"))
# base_model.add(Dropout(0.5))
# base_model.add(Dense(500, activation="relu"))
# base_model.add(Dropout(0.5))
# base_model.add(Dense(labels_train.shape[1], activation="softmax"))

# base_model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])

We then switched to the original VGG16 model as a base for transfer learning, but we could not get it to train on our dataset either.

In [48]:
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

In [49]:
# use VGG16 as base model for transfer learning
base_model = VGG16(
    include_top=False,
    input_shape=input_shape,
    input_tensor=Input(shape=input_shape),
)
base_model.trainable = False # freeze the base model

# Create a separate model for the appended layers, so that the trainable property can be set separately
appended_model = Sequential(name="appended_model")
appended_model.add(GlobalAveragePooling2D(input_shape=base_model.output_shape[1:]))
appended_model.add(Dense(1000, activation="relu"))
appended_model.add(Dropout(0.2))
appended_model.add(Dense(100, activation="relu"))
appended_model.add(Dropout(0.2))
appended_model.add(Dense(output_shape[0], activation="softmax"))

# Connect the base and appended models
inputs = Input(shape=input_shape, name="main_model_input")
x = preprocess_input(inputs)
x = base_model(x, training=False)
x = appended_model(x)

model = Model(inputs, x, name="main_model")

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

In [51]:
model.summary()

Model: "main_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 main_model_input (InputLaye  [(None, 448, 448, 3)]    0         
 r)                                                              
                                                                 
 tf.__operators__.getitem (S  (None, 448, 448, 3)      0         
 licingOpLambda)                                                 
                                                                 
 tf.nn.bias_add (TFOpLambda)  (None, 448, 448, 3)      0         
                                                                 
 vgg16 (Functional)          (None, 14, 14, 512)       14714688  
                                                                 
 appended_model (Sequential)  (None, 18)               614918    
                                                                 
Total params: 15,329,606
Trainable params: 614,918
Non-t

We tried various batch sizes, optimizers, and numbers of final dense layers with different sizes and activations, but the accuracy barely exceeded 0.06 even on the training dataset. We were unable to tell why.
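
One thing that may be worth checking, offered as an assumption rather than a diagnosis: the target vector built by `encode_labels` contains three ones (one per attribute), while a single softmax output always sums to one, so the output can never match the target exactly. The sketch below uses only the standard library, and the logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(target, probabilities):
    """Categorical cross-entropy, skipping the zero entries of the target."""
    return -sum(t * math.log(p) for t, p in zip(target, probabilities) if t > 0)


# Concatenated target: 9 age + 2 gender + 7 race classes, one active class each
target = [0.0] * 18
for active_index in (3, 10, 16):
    target[active_index] = 1.0

# Made-up logits that strongly favour exactly the three active classes
logits = [20.0 if t == 1.0 else 0.0 for t in target]
probabilities = softmax(logits)

print(sum(target), round(sum(probabilities), 6))
print(round(cross_entropy(target, probabilities), 4))
```

Even in this best case the loss stays near 3·ln 3 ≈ 3.30 instead of approaching zero. A possible remedy, which we have not tested, would be separate softmax outputs for age, gender and race.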

In [54]:
epochs = 10

model.fit(
    dataset_train,
    epochs=epochs,
    validation_data=dataset_valid,
    steps_per_epoch=data_count_train // batch_size,
    validation_steps=data_count_valid // batch_size,
    verbose=1
)

Epoch 1/10
  51/2168 [..............................] - ETA: 12:33:45 - loss: 2284.6646 - accuracy: 0.0619
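
The step count shown in the progress bar follows from integer division of the subset sizes by the batch size. A short check with the numbers printed earlier in this notebook:

```python
batch_size = 32
train_data_ratio = 0.8

data_count_train = 69395  # from the printed split above
data_count_valid = 17349

total = data_count_train + data_count_valid
steps_per_epoch = data_count_train // batch_size
validation_steps = data_count_valid // batch_size
leftover_train = data_count_train - steps_per_epoch * batch_size

print(total, int(total * train_data_ratio))  # total images and the resulting train count
print(steps_per_epoch, validation_steps)     # full batches per epoch
print(leftover_train)                        # images beyond the last full batch
```

The leftover images are not part of a full batch within one pass over the data; because the datasets are repeated before batching, they roll over into the next epoch instead of being dropped.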

## Evaluation

As our model did not learn as expected, evaluating its performance became moot. Once we work out how to keep training from failing (which we definitely will!), we would add the following steps to our notebook:
1. Load the test data set into memory and preprocess it similarly to the train and validation data sets.
1. Execute our model on the test data set to see its final performance.
1. Compute the confusion matrix of our model for each attribute (age, gender and race) to see its performance broken down by attribute.
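
The per-attribute confusion matrix from the last step could be computed roughly as follows. This is a minimal standard-library sketch with made-up gender labels; `confusion_matrix` here is a hypothetical helper, and scikit-learn's `sklearn.metrics.confusion_matrix` offers the same functionality.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Return matrix[i][j] = number of samples of class i predicted as class j."""
    position = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[position[truth]][position[prediction]] += 1
    return matrix


# Made-up labels for one attribute; the same call would be repeated for age and race
classes = ["Female", "Male"]
true_gender = ["Female", "Male", "Male", "Female", "Male"]
predicted_gender = ["Female", "Male", "Female", "Female", "Male"]

gender_matrix = confusion_matrix(true_gender, predicted_gender, classes)
for row in gender_matrix:
    print(row)
```

Rows correspond to the true class and columns to the predicted class, so off-diagonal entries count the misclassifications.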

Then we would further experiment with how transfer learning could be done efficiently, as a final goal of our project.

### Model use

Functions for interpreting the output of the model:

In [None]:
def get_age(model_output):
    model_output = np.atleast_2d(model_output)
    return age_encoder.inverse_transform(model_output[:, :age_encoder.categories_[0].shape[0]]).reshape(-1)

def get_gender(model_output):
    model_output = np.atleast_2d(model_output)
    # Drop the age columns, then keep only the gender columns (a slice, not a single column)
    return gender_encoder.inverse_transform(model_output[:, age_encoder.categories_[0].shape[0]:][:, :gender_encoder.categories_[0].shape[0]]).reshape(-1)

def get_race(model_output):
    model_output = np.atleast_2d(model_output)
    return race_encoder.inverse_transform(model_output[:, -race_encoder.categories_[0].shape[0]:]).reshape(-1)