# Predicting Ethnicities, Age, and Gender with Neural Networks

## Table of Contents:
1. *Introduction*
2. *Data Preperation*
    * 2.1 Data Cleaning/Preparation
    * 2.2 Data Loading
    * 2.3 Data Exploration
    * 2.4 Data Preprocessing
3. *Model Architecture*
    * 3.1 Neural Network Design
    * 3.2 Model Compilation
4. *Model Training*
    * 4.1 Training Process
    * 4.2 Improving the Model
5. *Model Evaluation*
    * 5.1 Model Performance
    * 5.2 Confusion Matrix
    * 5.3 Model Prediction Graphing
6. *Model Deployment*
    * 6.1 Model Saving
7. *Conclusion*
8. *References*


## 1. Introduction:
In this notebook, we will explore the process of building a neural network model to predict ethnicities, age, and gender based on certain features. Predicting ethnicities, age, and gender can have various applications, including demographic analysis, social studies, and more. We will follow a step-by-step approach, covering data preprocessing, model architecture design, training, evaluation, and interpretation.

## 2. Data Preparation

### 2.1 Data Cleaning/Preparation

The dataset from FairFace is split between a training set and a validation set, lacking the test set. In this section, I will:
- Combine all the images into one folder
- Rename the images starting from 1 to the total number of images
- Compile all the csv files into one with the new names

In [1]:
import os
import shutil
import csv
import random

In [24]:
# Moving the training images to the all images folder
temp_train_path = "C:\\Users\\mashe\\Downloads\\temp\\train"
all_images = "C:\\Users\\mashe\\Downloads\\temp\\all"

train_files = os.listdir(temp_train_path)

# Moving the training images to the all images folder
for file_name in train_files:
    train_path = os.path.join(temp_train_path, file_name)
    all_path = os.path.join(all_images, file_name)
    shutil.move(train_path, all_path)

In [43]:
# Moving the contents of the train csv to a general csv file
# Changing the file column to only include the name of the file
temp_train_csv = "C:\\Users\\mashe\\Downloads\\temp_csv\\images_train.csv"
general_csv = "C:\\Users\\mashe\\Downloads\\temp_csv\\all.csv"

# getting the data from the csv without the extra path on the filename
new_rows = []
with open(temp_train_csv, 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        modified_row = [row[0].split('/')[-1]] + row[1:]
        new_rows.append(modified_row)

# writing the data back into the csv
with open(general_csv, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(new_rows)

In [51]:
# Setting a variable for the name of the last image in train
img_num = len(os.listdir(all_images)) + 1
img_num

86745

Since the validation images and csv also start from 1, we will have to rename the images starting from the end of the training images, while also renaming the file column in the val csv file.

In [74]:
# Moving the validation images
temp_val_path = "C:\\Users\\mashe\\Downloads\\temp\\val"
all_images = "C:\\Users\\mashe\\Downloads\\temp\\all"
temp_val_csv = "C:\\Users\\mashe\\Downloads\\temp_csv\\images_val.csv"

# getting our images without the extra path
image_data = []
with open(temp_val_csv, 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        modified_row = [row[0].split('/')[-1]] + row[1:]
        image_data.append(modified_row)

# get rid of the header row
image_data = image_data[1:]

# changing the name of the images and the csv data and moving them
for index, image in enumerate(image_data, start=img_num):
    new_image_name = str(index) + ".jpg"
    new_image_path = os.path.join(all_images, new_image_name)
    old_image_path = os.path.join(temp_val_path, image[0])
    shutil.move(old_image_path, new_image_path)

    for i, row in enumerate(image_data):
        if (row[0] == image[0]):
            image_data[i][0] = new_image_name
            break

# writing the new data into the csv file
with open(temp_val_csv, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(image_data)

In [83]:
# Combining the val csv with the general csv
general_csv = "C:\\Users\\mashe\\Downloads\\temp_csv\\all.csv"
temp_val_csv = "C:\\Users\\mashe\\Downloads\\temp_csv\\images_val.csv"

# getting the data from general csv
rows = []
with open(general_csv, 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        rows.append(row)

# adding the val data to the data we already have
with open(temp_val_csv, 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        rows.append(row)

# writing the data to the general csv
with open(general_csv, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(rows)

### 2.2 Data Loading

Now that we've cleaned our data, we can start loading it in.

In [3]:
import pandas as pd

In [None]:
general_csv = "C:\\Users\\mashe\\Downloads\\all.csv"
df = pd.read_csv(general_csv)
df.head()

### 2.3 Data Exploration

In [None]:
import matplotlib.pyplot as plt

In [None]:
df.describe()

In [None]:
# graph distribution of ages

In [None]:
# graph genders of each ethnicity compared to age

In [None]:
# graph distribution of gender and race to amount

In [None]:
df['age'].describe()

In [None]:
# Graphing example images
import random
image_source = "C:\\Users\\mashe\\Downloads\\all"
plt.figure(figsize=(10,10))
for i in range(4):
    ax = plt.subplot(2, 2, i+1)
    rand = random.randint(0, len(image_source))
    img = df.iloc[rand, 0]
    image_path = os.path.join(image_source, img)
    plt.imshow(cv2.imread(image_source))
    plt.title(f"age: {df.iloc[rand, 1]}, gender: {df.iloc[rand, 2]}, ethnicity: {df.iloc[rand, 3]}")

### 2.4 Data Preprocessing

We now need to split the images and csv into a 70-15-15 split for training, validation, and testing data.

In [92]:
image_source = "C:\\Users\\mashe\\Downloads\\all"
csv_source = "C:\\Users\\mashe\\Downloads\\all.csv"
image_destination = "C:\\Users\\mashe\\Downloads\\images"
csv_destination = "C:\\Users\\mashe\\Downloads\\csv"

# creating the directories
splits = ["train", "val", "test"]
for split in splits:
    os.mkdir(os.path.join(image_destination, split))

# getting the data from the csv file
csv_data = []
header = []
with open(csv_source, 'r') as csvfile:
    reader = csv.reader(csvfile)
    header = reader.__next__() # getting the header and skipping it
    for row in reader:
        csv_data.append(row)

random.shuffle(csv_data)

# splitting the data into train, val, test
total_images = len(csv_data)
train_size = int(0.70*total_images)
val_size = int(0.15*total_images)

train_data = csv_data[ : train_size]
val_data = csv_data[train_size : train_size + val_size]
test_data = csv_data[train_size + val_size : ]

# moving the images and creating the csv files
for split, data in zip(splits, [train_data, val_data, test_data]):
    for row in data:
        image_path = os.path.join(image_source, row[0])
        split_destination = os.path.join(image_destination, split)
        shutil.move(image_path, split_destination)

    csv_path = os.path.join(csv_destination, split + ".csv")
    with open(csv_path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(header)
        for row in data:
            modified_row = [os.path.join(split, row[0])] + row[1:]
            writer.writerow(modified_row)

In [2]:
# making sure the data was moved properly
train_dir = "C:\\Users\\mashe\\Downloads\\images\\train"
val_dir = "C:\\Users\\mashe\\Downloads\\images\\val"
test_dir = "C:\\Users\\mashe\\Downloads\\images\\test"
train_csv = "C:\\Users\\mashe\\Downloads\\csv\\train.csv"
val_csv = "C:\\Users\\mashe\\Downloads\\csv\\val.csv"
test_csv = "C:\\Users\\mashe\\Downloads\\csv\\test.csv"

train_len = len(os.listdir(train_dir))
val_len = len(os.listdir(val_dir))
test_len = len(os.listdir(test_dir))

print(train_len, val_len, test_len)
print(train_len + val_len + test_len)

counter = 0
with open(train_csv, 'r') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)
    for row in reader:
        counter += 1
print(counter)

counter1 = 0
with open(val_csv, 'r') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)
    for row in reader:
        counter1 += 1
print(counter1)

counter2 = 0
with open(test_csv, 'r') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)
    for row in reader:
        counter2 += 1
print(counter2)

print(counter + counter1 + counter2)

68388 14654 14656
97698
68388
14654
14656
97698


In [4]:
train_df = pd.read_csv(train_csv)
val_df = pd.read_csv(val_csv)
test_df = pd.read_csv(test_csv)

In [65]:
train_df.head()

Unnamed: 0,file,age,gender,race,service_test
0,train\34113.jpg,10-19,Female,Latino_Hispanic,True
1,train\58446.jpg,3-9,Male,Latino_Hispanic,False
2,train\27338.jpg,30-39,Female,East Asian,True
3,train\60223.jpg,50-59,Female,Latino_Hispanic,True
4,train\94498.jpg,30-39,Male,Latino_Hispanic,False


In [38]:
val_df.head()

Unnamed: 0,file,age,gender,race,service_test
0,val\72512.jpg,20-29,Female,White,False
1,val\60392.jpg,50-59,Male,East Asian,False
2,val\65764.jpg,30-39,Male,Black,True
3,val\15221.jpg,40-49,Female,White,False
4,val\12071.jpg,10-19,Female,White,True


In [39]:
test_df.head()

Unnamed: 0,file,age,gender,race,service_test
0,test\53352.jpg,3-9,Female,Latino_Hispanic,True
1,test\95368.jpg,20-29,Female,Southeast Asian,True
2,test\331.jpg,20-29,Female,White,False
3,test\18069.jpg,30-39,Female,Latino_Hispanic,False
4,test\25704.jpg,50-59,Male,White,False


Now we will use the dlib library to detect and crop the faces, and store these cropped faces for training, validation, and testing later on.

In [5]:
import dlib
import cv2
import os
import concurrent.futures

In [6]:
face_detector = dlib.cnn_face_detection_model_v1('dlib_models/mmod_human_face_detector.dat')
face_detector

<_dlib_pybind11.cnn_face_detection_model_v1 at 0x2183d8fb7b0>

Creating a function to detect a face, crop the face, and save the face to a folder.

In [11]:
def crop_face_and_save(image_name, image_path, output_folder):
    # Load the image, convert to grayscale, and detect
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Resize the image to speed up detection
    original_height, original_width = image.shape[:2]
    aspect_ratio = original_width / original_height
    target_width = 128
    target_height = int(target_width / aspect_ratio)
    resized_image = cv2.resize(gray, (target_width, target_height))

    faces = face_detector(resized_image, 1)

    if len(faces) == 0:
        print("No faces found in the image")
        return

    # Process the first detected face, since we only have data for the main face in each image
    for idx, face in enumerate(faces):
        # Get the coordinates of the face rectangle
        x, y, w, h = face.rect.left(), face.rect.top(), face.rect.width(), face.rect.height()
        
        # Calculate the corresponding coordinates on the original image
        x = int(x * (original_width / target_width))
        y = int(y * (original_height / target_height))
        w = int(w * (original_width / target_width))
        h = int(h * (original_height / target_height))
        face_img = image[y:y+h, x:x+w]

        # Save the cropped face
        output_path = os.path.join(output_folder, image_name)
        cv2.imwrite(output_path, face_img)
        break

In [None]:
image_source = "C:\\Users\\mashe\\Downloads\\images"
image_destination = "C:\\Users\\mashe\\Downloads\\cropped_images"

Looping through the training, validation and test images using the csv's to crop the images and save them.

In [15]:
train_files = train_df["file"]
val_files = val_df["file"]
test_files = test_df["file"]

In [16]:
def process_image(image):
    image_path = os.path.join(image_source, image)
    crop_face_and_save(image, image_path, image_destination)

Using parallel processing to compute the images faster

In [17]:
with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    executor.map(process_image, train_files)

with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    executor.map(process_image, val_files)

with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    executor.map(process_image, test_files)

No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in the image
No faces found in th

Now that we have detected and cropped most faces, we will find and remove the faces that weren't cropped from our dataframes since they are not useful for our model.

In [None]:
# cleaning the training csv
files = train_df['file']
for image in files:
    image_path = os.path.join(image_source, image)
    if not os.path.exists(image_path):
        train_df.drop(train_df[train_df['file'] == image].index, inplace=True)

In [None]:
# cleaning the training csv
files = val_df['file']
for image in files:
    image_path = os.path.join(image_source, image)
    if not os.path.exists(image_path):
        val_df.drop(val_df[val_df['file'] == image].index, inplace=True)

In [None]:
# cleaning the training csv
files = test_df['file']
for image in files:
    image_path = os.path.join(image_source, image)
    if not os.path.exists(image_path):
        test_df.drop(test_df[test_df['file'] == image].index, inplace=True)

Now that we have our csv and cropped images, we can begin to prepare our data for training by converting the images in each set into tensors.

In [None]:
import tensorflow as tf
import numpy as np

In [None]:
# Turning the training, validation, and testing sets into tensors


In [None]:
# Turning our data into numbers
def data_transform(df):
    # one-hot encode the age categories
    # turn gender into 0 or 1
    # one hot encode ethnicities

In [None]:
# Normalizing data

## 3. Model Architecture

### 3.1 Neural Network Design

### 3.2 Model Compilation

## 4. Model Training

### 4.1 Training Process

### 4.2 Improving the Model

improving a model:
1. Creating a model - we can add more layers, increase the number of neurons within each of the hidden layers, or change the activation functions of each layer
2. Compiling a model - we can change the optimization function, or the **learning rate** of the optimization function, or perhaps the **loss** function
3. Fitting a model - we can fit a model for more **epochs** (leave it training for longer) or on more data (give the model more examples to learn from)
- add layers
- increase the number of hidden units
- change the activation functions (ReLU, tanh, softmax/sigmoid, etc) (combine linear and non-linear functions)
- change the optimization function (SGD, Adam, etc)
- change the loss function (MAE, Huber, BinaryCrossentropy, CategoricalCrossentropy, etc)
- find the ideal/change the learning rate (most important hyperparameter) (how much the model updates the patterns it learns)
- fit on more data (more examples to learn patterns)
- train for longer (more epochs)
- keep experimenting with each hyperparameter

## 5. Model Evaluation

evaluating a model:
- visualizing (data, model, predictions, training, etc)
- comparing (predictions to ground truth labels)
- evaluation metrics (MAE, Crossentropy, Precision/Recall/F1(classification report), confusion matrix, etc)
- tweaking/experiment (small experiments, then compare, then tweak, then compare, etc)
(plot_model, summary, plotting data/learning rate, history, decision boundary, etc)

### 5.1 Model Performance

In [None]:
#using classification evaluation metrics
from sklearn.metrics import confusion_matrix, f1_score, classification_report

In [None]:
# make predictions
y_preds = tf.round(model.predict(X_test))

In [None]:
# precision recall and f1
m = tf.keras.metrics.Precision()
m.update_state(y_test, y_preds)
print(m.result())
n = tf.keras.metrics.Recall()
n.update_state(y_test, y_preds)
print(n.result())
print(f1_score(y_test, y_preds))

In [None]:
#classification report
classification_report(y_test, y_preds)

### 5.2 Confusion Matrix

In [None]:
# remix of sci-kit learns plot_confusion_matrix
# https://scikit-learn.org/1.0/modules/generated/sklearn.metrics.plot_confusion_matrix.html

import itertools
figsize = (10, 10)

# create the confusion matrix
cm = confusion_matrix(y_test, y_preds)
cm_norm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis] #normalize our confusion matrix
n_classes = cm.shape[0] #gets the number of classes

# prettify the matrix
fig, ax = plt.subplots(figsize=figsize)

# create a matrix plot
cax = ax.matshow(cm, cmap=plt.cm.Blues)
fig.colorbar(cax)

# create classes
if classes:
    labels = classes
else:
    labels = np.arange(cm.shape[0])

# label the axises
ax.set(
    title="Confusion matrix",
    xlabel="predicted label",
    ylabel="true label",
    xticks=np.arange(n_classes),
    yticks=np.arange(n_classes),
    xticklabels=labels,
    yticklabels=labels
    )

# set threshold for different colours
threshold = (cm.max() + cm.min()) / 2

# plot the text on each cell
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, f"{cm[i, j]} ({cm_norm[i, j]*100:.1f}%)",
    horizontalalignment="center",
    color="white" if cm[i, j] > threshold else "black",
    size=15)

### 5.3 Model Prediction Graphing

In [None]:
# plot a random image and the prediction/truth
import random
def plot_random_image(model, images, true_labels, classes):
    """
    Picks a random image, plots it and labels it with prediction and truth label
    """
    # set up random integer
    i = random.randint(0, len(images))
    # create predictions and targets
    target_image = images[i]
    pred_probs = model.predict(target_image.reshape(1, 28, 28))
    # need to reshape because we predict on array of images
    pred_label = classes[pred_probs.argmax()]
    true_label = classes[true_labels[i]]
    # plot the image
    plt.imshow(target_image, cmap=plt.cm.binary)
    # change the color of the titles depending on if the prediction is right or wrong
    if (pred_label == true_label):
        color = "green"
    else:
        color = "red"
    # add xlabel information(prediction/true label)
    plt.xlabel("pred: {} {:2.0f}% (True: {})".format(pred_label,
                                                    100*tf.reduce_max(pred_probs),
                                                    true_label),
                                                    color=color)

In [None]:
plot_random_image(model, X_test, y_test, class_names)

## 6. Model Deployment

### 6.1 Model Saving

In [None]:
model.save('models')

## 7. Conclusion

In this notebook, we successfully developed a machine learning model for ethnicity, age, and gender predictions from facial images. Key highlights include:

- Thoughtful data collection and preprocessing contributed to model accuracy.
- Convolutional Neural Networks effectively captured facial features.
- Rigorous validation and testing ensured reliable performance.
- The model was seamlessly integrated into our full-stack website.
- Opportunities for ongoing improvement and expansion remain.

If you made it this far, thanks for joining me on this journey exploring the power of modern machine learning API's and frameworks. I hope you enjoyed it as much as I did!

## 8. References

The dataset was retrieved from the FairFace study.

Karkkainen, Kimmo and Joo, Jungseock. (2021). FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1548-1558. 10.1109/WACV48630.2021.00159