# Predicting Ethnicities, Age, and Gender with Neural Networks

## Table of Contents:
1. *Introduction*
2. *Data Preparation*
    * 2.1 Data Cleaning/Preparation
    * 2.2 Data Loading
    * 2.3 Data Preprocessing
    * 2.4 Data Exploration
3. *Model Architecture*
    * 3.1 Neural Network Design
    * 3.2 Model Compilation
4. *Model Training*
    * 4.1 Training Process
5. *Model Evaluation*
    * 5.1 Model Performance
    * 5.2 Confusion Matrix
6. *Model Deployment*
    * 6.1 Model Saving
7. *Conclusion*
8. *References*


## 1. Introduction
In this notebook, we will build a neural network model that predicts ethnicity, age, and gender from the face images in the FairFace dataset. Such predictions have applications in demographic analysis, social studies, and related fields. We will follow a step-by-step approach covering data preparation, model architecture design, training, evaluation, and deployment.

## 2. Data Preparation

### 2.1 Data Cleaning/Preparation

The FairFace dataset ships with only a training set and a validation set; it has no test set. In this section, I will:
- Combine all the images into one folder
- Rename the images starting from 1 to the total number of images
- Compile all the csv files into one with the new names
- Split the images and csv into training, validation, and test sets

In [84]:
import os
import shutil
import csv
import random

In [24]:
# Moving the training images to the all images folder
temp_train_path = "C:\\Users\\mashe\\Downloads\\temp\\train"
all_images = "C:\\Users\\mashe\\Downloads\\temp\\all"

train_files = os.listdir(temp_train_path)

# moving each file individually into the combined folder
for file_name in train_files:
    train_path = os.path.join(temp_train_path, file_name)
    all_path = os.path.join(all_images, file_name)
    shutil.move(train_path, all_path)

In [43]:
# Moving the contents of the train csv to a general csv file
# Changing the file column to only include the name of the file
temp_train_csv = "C:\\Users\\mashe\\Downloads\\temp_csv\\images_train.csv"
general_csv = "C:\\Users\\mashe\\Downloads\\temp_csv\\all.csv"

# getting the data from the csv without the extra path on the filename
new_rows = []
with open(temp_train_csv, 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        modified_row = [row[0].split('/')[-1]] + row[1:]
        new_rows.append(modified_row)

# writing the data back into the csv
with open(general_csv, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(new_rows)

In [51]:
# Setting a variable for the number of the first validation image
# (one past the number of training images already in the folder)
img_num = len(os.listdir(all_images)) + 1
img_num

86745

Since the validation image names also start from 1, we have to renumber the validation images so they continue right after the last training image, and update the file column in the val csv file to match.
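The renumbering described above can be sketched on toy data. This is only a minimal illustration, not the notebook's actual cell: the helper name `renumber_rows` and the sample rows are hypothetical, and it assumes the first column of each row holds a path such as `val/1.jpg`.

```python
def renumber_rows(rows, start):
    """Rename the first column of each row to '<start>.jpg', '<start+1>.jpg', ...

    Returns (renamed_rows, mapping), where mapping goes from each old bare
    filename to its new filename. The input rows are left unchanged.
    """
    renamed = []
    mapping = {}
    for index, row in enumerate(rows, start=start):
        old_name = row[0].split('/')[-1]  # drop any folder prefix
        new_name = str(index) + ".jpg"
        mapping[old_name] = new_name
        renamed.append([new_name] + row[1:])
    return renamed, mapping


# toy example: three training images already exist, so validation starts at 4
val_rows = [["val/1.jpg", "a"], ["val/2.jpg", "b"]]
renamed_rows, name_map = renumber_rows(val_rows, start=4)
print(renamed_rows)  # [['4.jpg', 'a'], ['5.jpg', 'b']]
print(name_map)      # {'1.jpg': '4.jpg', '2.jpg': '5.jpg'}
```

Building the mapping first also makes it possible to check for name collisions before any file is moved on disk.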

In [74]:
# Moving the validation images
temp_val_path = "C:\\Users\\mashe\\Downloads\\temp\\val"
all_images = "C:\\Users\\mashe\\Downloads\\temp\\all"
temp_val_csv = "C:\\Users\\mashe\\Downloads\\temp_csv\\images_val.csv"

# getting our images without the extra path
image_data = []
with open(temp_val_csv, 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        modified_row = [row[0].split('/')[-1]] + row[1:]
        image_data.append(modified_row)

# get rid of the header row
image_data = image_data[1:]

# changing the name of the images and the csv data and moving them
for index, image in enumerate(image_data, start=img_num):
    new_image_name = str(index) + ".jpg"
    new_image_path = os.path.join(all_images, new_image_name)
    old_image_path = os.path.join(temp_val_path, image[0])
    shutil.move(old_image_path, new_image_path)

    # image is the same row object stored in image_data, so updating it
    # in place updates the csv data without searching the list again
    image[0] = new_image_name

# writing the new data into the csv file
with open(temp_val_csv, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(image_data)

In [83]:
# Combining the val csv with the general csv
general_csv = "C:\\Users\\mashe\\Downloads\\temp_csv\\all.csv"
temp_val_csv = "C:\\Users\\mashe\\Downloads\\temp_csv\\images_val.csv"

# getting the data from general csv
rows = []
with open(general_csv, 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        rows.append(row)

# adding the val data to the data we already have
with open(temp_val_csv, 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        rows.append(row)

# writing the data to the general csv
with open(general_csv, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(rows)

We now need to split the images and csv into a 70-15-15 split for training, validation, and testing data.
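As a quick check of the arithmetic behind a 70-15-15 split, the sketch below computes the three sizes for a given total. The helper name `split_sizes` is hypothetical; it assumes the sizes are obtained by truncating with `int()`, so the test set absorbs the rounding remainder.

```python
def split_sizes(total, train_frac=0.70, val_frac=0.15):
    """Return (train, val, test) sizes; test takes whatever is left over."""
    train_size = int(train_frac * total)
    val_size = int(val_frac * total)
    test_size = total - train_size - val_size
    return train_size, val_size, test_size


# 97698 is the combined number of images in this notebook
sizes = split_sizes(97698)
print(sizes)  # (68388, 14654, 14656)
```

Because both fractions are truncated, the test set ends up two images larger than the validation set here.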

In [92]:
image_source = "C:\\Users\\mashe\\Downloads\\all"
csv_source = "C:\\Users\\mashe\\Downloads\\all.csv"
image_destination = "C:\\Users\\mashe\\Downloads\\images"
csv_destination = "C:\\Users\\mashe\\Downloads\\csv"

# creating the directories (no error if they already exist)
splits = ["train", "val", "test"]
for split in splits:
    os.makedirs(os.path.join(image_destination, split), exist_ok=True)
os.makedirs(csv_destination, exist_ok=True)

# getting the data from the csv file
csv_data = []
header = []
with open(csv_source, 'r') as csvfile:
    reader = csv.reader(csvfile)
    header = next(reader) # getting the header and skipping it
    for row in reader:
        csv_data.append(row)

random.shuffle(csv_data)

# splitting the data into train, val, test
total_images = len(csv_data)
train_size = int(0.70*total_images)
val_size = int(0.15*total_images)

train_data = csv_data[ : train_size]
val_data = csv_data[train_size : train_size + val_size]
test_data = csv_data[train_size + val_size : ]

# moving the images and creating the csv files
for split, data in zip(splits, [train_data, val_data, test_data]):
    for row in data:
        image_path = os.path.join(image_source, row[0])
        split_destination = os.path.join(image_destination, split)
        shutil.move(image_path, split_destination)

    csv_path = os.path.join(csv_destination, split + ".csv")
    with open(csv_path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(header)
        for row in data:
            modified_row = [os.path.join(split, row[0])] + row[1:]
            writer.writerow(modified_row)

In [93]:
# making sure the data was moved properly
train_dir = "C:\\Users\\mashe\\Downloads\\images\\train"
val_dir = "C:\\Users\\mashe\\Downloads\\images\\val"
test_dir = "C:\\Users\\mashe\\Downloads\\images\\test"
image_source = "C:\\Users\\mashe\\Downloads\\all"
train_csv = "C:\\Users\\mashe\\Downloads\\csv\\train.csv"
val_csv = "C:\\Users\\mashe\\Downloads\\csv\\val.csv"
test_csv = "C:\\Users\\mashe\\Downloads\\csv\\test.csv"

train_len = len(os.listdir(train_dir))
val_len = len(os.listdir(val_dir))
test_len = len(os.listdir(test_dir))

print(train_len, val_len, test_len)
print(len(os.listdir(image_source))) # confirms data was moved
print(train_len + val_len + test_len)

# counting the data rows (excluding the header) in each split csv
def count_rows(csv_path):
    with open(csv_path, 'r') as csvfile:
        reader = csv.reader(csvfile)
        next(reader) # skipping the header
        return sum(1 for _ in reader)

csv_counts = [count_rows(path) for path in (train_csv, val_csv, test_csv)]
for count in csv_counts:
    print(count)

print(sum(csv_counts))

68388 14654 14656
0
97698
68388
14654
14656
97698


### 2.2 Data Loading

Now that we've cleaned up our data, we can start loading it in.
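The notebook does not show its loading code yet, so the following is only one possible sketch of reading a split csv back in. The helper name `load_labels` is hypothetical, and the column names in the demo file are placeholders; the function takes whatever header the file has and makes no assumption about it.

```python
import csv
import os
import tempfile


def load_labels(csv_path):
    """Read a split csv into a list of dicts keyed by the header row."""
    with open(csv_path, 'r', newline='') as csvfile:
        return list(csv.DictReader(csvfile))


# demo on a throwaway file; the column names here are placeholders
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_csv = os.path.join(tmp_dir, "train.csv")
    with open(demo_csv, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["file", "label"])
        writer.writerow(["train/1.jpg", "a"])
        writer.writerow(["train/2.jpg", "b"])

    demo_rows = load_labels(demo_csv)

print(len(demo_rows))        # 2
print(demo_rows[0]["file"])  # train/1.jpg
```

Keying each row by column name avoids relying on column positions such as `row[0]` in later cells.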

### 2.3 Data Preprocessing

Now we will use the dlib library to detect and crop the faces, and store these cropped faces for training, validation, and testing later on.

In [94]:
import dlib
import cv2

In [97]:
face_detector = dlib.cnn_face_detection_model_v1('dlib_models/mmod_human_face_detector.dat')
face_detector

<_dlib_pybind11.cnn_face_detection_model_v1 at 0x21a626266f0>
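Once the detector returns a bounding box for a face, the crop itself is a matter of slicing the image. The sketch below shows one way to pad a box and clip it to the image bounds before slicing. The helper name `clamp_box` and the `margin` parameter are hypothetical, and the function works on plain integers, so it does not depend on dlib or OpenCV.

```python
def clamp_box(left, top, right, bottom, width, height, margin=0):
    """Expand a face box by `margin` pixels and clip it to the image bounds.

    Returns (left, top, right, bottom), suitable for slicing an image array
    as image[top:bottom, left:right].
    """
    left = max(0, left - margin)
    top = max(0, top - margin)
    right = min(width, right + margin)
    bottom = min(height, bottom + margin)
    return left, top, right, bottom


# toy example: a box that spills past the top-left corner of a 200x200 image
box = clamp_box(-10, -5, 100, 120, width=200, height=200, margin=8)
print(box)  # (0, 0, 108, 128)
```

With dlib's CNN detector, each detection exposes its box as `detection.rect`, with `left()`, `top()`, `right()`, and `bottom()` accessors. Those coordinates may extend beyond the image, which is why clipping before slicing is useful.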

### 2.4 Data Exploration

## 3. Model Architecture

### 3.1 Neural Network Design

### 3.2 Model Compilation

## 4. Model Training

### 4.1 Training Process

## 5. Model Evaluation

### 5.1 Model Performance

### 5.2 Confusion Matrix

## 6. Model Deployment

### 6.1 Model Saving

## 7. Conclusion

## 8. References

The dataset was retrieved from the FairFace study.

Kärkkäinen, Kimmo and Joo, Jungseock. (2021). FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1548-1558. doi:10.1109/WACV48630.2021.00159