<div style="border:solid blue 2px; padding: 20px">

**Overall Summary of the Project**
    
Hi Breeanna! 🌸 You’ve put together a thoughtful and well-organized notebook for the **Good Seed Age Verification Model** project. Let’s break down the highlights and offer a few gentle suggestions to strengthen your work even more!

---

**✅ What you did well**

- **Clear Purpose & Framing**: You nicely framed the project’s real-world impact—helping Good Seed comply with alcohol laws. That gives your notebook a strong context from the start. 💡
  
- **Solid EDA**: You explored the dataset’s size, distribution, duplicates, and age spread with good visuals and descriptive stats. I appreciated how you reflected on data skew (many samples under age 41) and what that might mean for model performance.

- **Well-Structured Modeling Code**: Your modeling functions (`load_train`, `load_test`, `create_model`, `train_model`) are modular, readable, and GPU-ready. Including dropout in your architecture is a nice touch to combat overfitting!

- **Results Achieved**: You met the project goal with a **validation MAE of ~7.65**, staying under the threshold of 8. Great job!

- **Thoughtful Conclusions**: Your reflections on dataset diversity, accessories, and focusing on age subgroups show great critical thinking. 🌟

---

**⚠️ Critical Change Required**

✅ **None!** Your notebook meets all key requirements and is ready for approval.

---

**✨ Optional Suggestions for Improvement**

- **Model Fine-Tuning**: You could experiment with unfreezing the top ResNet layers (fine-tuning) or adding data augmentation (like horizontal flips or slight rotations) to further boost performance.

- **Focused Metrics**: Since the practical goal is around the 18–21 range, you might experiment with a binary classifier (under/over age) instead of a continuous regressor, depending on the business use.

- **Visual Enhancements**: Adding sample predictions vs. true ages on validation images could help illustrate where the model performs well or struggles.

---

Breeanna, this is a strong submission! You’ve balanced technical accuracy with thoughtful reflection on business value, showing both skill and awareness of real-world application. Keep up the fantastic work—your modeling journey is off to a great start! 🚀

**Project approved ✅ — well done! 💕**

</div>

# Sprint 15 Project: Good Seed Age Verification Model

**I will analyze a dataset of facial photos of people of various ages to build a model that helps with customer age verification. My goal is to support Good Seed in complying with alcohol laws and to help prevent underage drinking.**

## Initialization

## Load Data

The dataset is stored in the `/datasets/faces/` folder, where you can find:
- The `final_files` folder with 7.6k photos
- The `labels.csv` file with labels, with two columns: `file_name` and `real_age`

Given the fact that the number of image files is rather high, it is advisable to avoid reading them all at once, which would greatly consume computational resources. We recommend you build a generator with the ImageDataGenerator generator. This method was explained in Chapter 3, Lesson 7 of this course.

The label file can be loaded as a regular CSV file.
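
The batching idea behind the generator recommendation above can be sketched in plain Python. This is only an illustration of the pattern, not how `ImageDataGenerator` works internally; `iter_batches` and the file names are hypothetical.

```python
from typing import Iterator, List, Sequence


def iter_batches(file_names: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of file names, one batch at a time."""
    for start in range(0, len(file_names), batch_size):
        yield list(file_names[start:start + batch_size])


# Hypothetical file names; a real loader would open only the images
# belonging to the batch currently being yielded.
names = [f"{i:06d}.jpg" for i in range(10)]
batches = list(iter_batches(names, batch_size=4))
print([len(b) for b in batches])
```

Because each batch is produced on demand, memory use depends on the batch size rather than on the total number of photos.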

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
import os
from PIL import Image
import matplotlib.pyplot as plt

In [None]:
labels = pd.read_csv('/datasets/faces/labels.csv')
print(labels.sample(10))

In [None]:
image_dir = '/datasets/faces/final_files/'

plt.figure(figsize=(10, 10))
for i in range(9):
    image_path = os.path.join(image_dir, labels.iloc[i]['file_name'])
    image = Image.open(image_path)
    plt.subplot(3, 3, i + 1)
    plt.imshow(image)
    plt.title(f"Age: {labels.iloc[i]['real_age']:.0f}")
    plt.axis('off')
plt.tight_layout()
plt.show()

## EDA

In [None]:
print(f"Total samples: {len(labels)}")
print(labels.info())
print(labels['real_age'].describe())

print("Duplicates before:", labels.duplicated().sum())
# Reassign so the deduplicated frame is the one checked and used afterwards
labels = labels.drop_duplicates()
print("Duplicates after:", labels.duplicated().sum())

In [None]:
plt.hist(labels['real_age'], bins=30)
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Age Distribution")
plt.show()

### Findings

- Total sample size of 7591
- There are no duplicates or missing values
- The average age is 31
- Ages are diverse with a standard deviation of 17.1, min of 1 year, and max of 100 years
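
With a spread this wide, it can help to know how a trivial predictor would score before judging a model's MAE. The sketch below computes the MAE of always predicting the median age. The ages are made up for illustration and are not taken from `labels.csv`; `mean_absolute_error` here is a local helper, not a library function.

```python
from statistics import median


def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over paired values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ages, for illustration only.
ages = [4, 18, 23, 25, 29, 31, 40, 62]

baseline = median(ages)  # one constant prediction for every sample
baseline_mae = mean_absolute_error(ages, [baseline] * len(ages))
print(baseline, baseline_mae)
```

Running the same calculation on the real `real_age` column would give the baseline the trained model needs to beat.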

## Modelling

Define the necessary functions to train your model on the GPU platform and build a single script containing all of them along with the initialization section.

To make this task easier, you can define them in this notebook and run a ready code in the next section to automatically compose the script.

The definitions below will be checked by project reviewers as well, so that they can understand how you built the model.

In [None]:
import pandas as pd

import tensorflow as tf

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.resnet import ResNet50
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout, Flatten
from tensorflow.keras.optimizers import Adam

In [None]:
def load_train(path):
    """
    Loads and preprocesses the training data using ImageDataGenerator.
    Expects a 'real_age' column in a labels CSV and corresponding images in 'final_files/'.
    """

    datagen = ImageDataGenerator(
        validation_split=0.25,
        rescale=1.0 / 255
    )

    train_gen_flow = datagen.flow_from_dataframe(
        dataframe=pd.read_csv(path + 'labels.csv'),
        directory=path + 'final_files/',
        x_col='file_name',
        y_col='real_age',
        target_size=(224, 224),
        batch_size=32,
        class_mode='raw',  
        subset='training',
        seed=12345
    )

    return train_gen_flow

In [None]:
def load_test(path):
    """
    Loads and preprocesses the validation data using ImageDataGenerator.
    Expects a 'real_age' column in a labels CSV and corresponding images in 'final_files/'.
    """

    datagen = ImageDataGenerator(
        validation_split=0.25,
        rescale=1.0 / 255
    )

    test_gen_flow = datagen.flow_from_dataframe(
        dataframe=pd.read_csv(path + 'labels.csv'),
        directory=path + 'final_files/',
        x_col='file_name',
        y_col='real_age',
        target_size=(224, 224),
        batch_size=32,
        class_mode='raw', 
        subset='validation',
        seed=12345
    )

    return test_gen_flow
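
The two loaders share the same `validation_split` and `seed`, so they partition one set of labels. The arithmetic behind the resulting subset sizes and batch counts can be sketched as follows. It assumes the validation subset is the integer part of 25% of the 7591 samples reported in the EDA; the exact rounding used by Keras is an assumption here, and both helper names are hypothetical.

```python
import math


def split_sizes(n_samples, validation_split):
    """Return (train, validation) counts, assuming the validation subset
    is the integer part of n_samples * validation_split."""
    n_val = int(n_samples * validation_split)
    return n_samples - n_val, n_val


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to pass over every sample once."""
    return math.ceil(n_samples / batch_size)


n_train, n_val = split_sizes(7591, 0.25)
print(n_train, n_val)
print(steps_per_epoch(n_train, 32), steps_per_epoch(n_val, 32))
```

Under the same assumption, a batch size of 16 would give 356 training steps per epoch, which is the step count that appears in the training log.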

In [None]:
def create_model(input_shape):
    """
    Defines and returns a transfer learning model based on ResNet50 for age prediction.
    """
    base_model = ResNet50(
        weights='imagenet',
        include_top=False,
        input_shape=input_shape
    )
    base_model.trainable = False 

    model = Sequential()
    model.add(base_model)
    model.add(GlobalAveragePooling2D())
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='linear'))  

    model.compile(optimizer=Adam(learning_rate=0.0001), loss='mean_absolute_error', metrics=['mae'])

    return model
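
The compile step names both a loss and a metric. As a reminder of how mean absolute error and mean squared error differ on the same predictions, here is a small plain-Python sketch. The ages are hypothetical, and `mae` and `mse` are local helpers rather than Keras functions.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the target (years)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error, in squared units (years squared)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true and predicted ages; one prediction is off by 20 years.
true_ages = [20, 35, 50, 65]
pred_ages = [22, 30, 51, 45]

print(mae(true_ages, pred_ages), mse(true_ages, pred_ages))
```

The single 20-year miss dominates the squared error, so the two quantities sit on very different scales and should not be compared directly.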

In [None]:
def train_model(model, train_data, test_data, batch_size=None, epochs=20,
                steps_per_epoch=None, validation_steps=None):
    """
    Trains the model using the specified training and validation data generators.
    """
    
    model.fit(
        train_data,
        validation_data=test_data,
        batch_size=batch_size,
        epochs=epochs,
        steps_per_epoch=steps_per_epoch,
        validation_steps=validation_steps,
        verbose=2
    )

    return model

## Prepare the Script to Run on the GPU Platform

Once you've defined the necessary functions, you can compose a script for the GPU platform, download it via the "File|Open..." menu, and upload it later to run on the GPU platform.

N.B.: The script should include the initialization section as well. An example of this is shown below.

In [None]:
# prepare a script to run on the GPU platform

init_str = """
import pandas as pd

import tensorflow as tf

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.resnet import ResNet50
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout, Flatten
from tensorflow.keras.optimizers import Adam
"""

import inspect

with open('run_model_on_gpu.py', 'w') as f:
    
    f.write(init_str)
    f.write('\n\n')
        
    for fn_name in [load_train, load_test, create_model, train_model]:
        
        src = inspect.getsource(fn_name)
        f.write(src)
        f.write('\n\n')

In [None]:
run_code = '''
# Set dataset path
data_path = '/datasets/faces/'

# Load datasets
train_data = load_train(data_path)
val_data = load_test(data_path)

# Create and train model
model = create_model((224, 224, 3))  # must match target_size=(224, 224) used by the generators
model = train_model(model, train_data, val_data, epochs=5, steps_per_epoch=len(train_data), validation_steps=len(val_data))
'''

with open('run_model_on_gpu.py', 'w') as f:
    f.write(init_str)
    f.write('\n\n')

    for fn_name in [load_train, load_test, create_model, train_model]:
        src = inspect.getsource(fn_name)
        f.write(src)
        f.write('\n\n')

    f.write(run_code)

### Output

Place the output from the GPU platform as a Markdown cell here.

Epoch 1/20
356/356 - 35s - loss: 95.3532 - mae: 7.4339 - val_loss: 124.3362 - val_mae: 8.4921
Epoch 2/20
356/356 - 35s - loss: 76.8372 - mae: 6.6707 - val_loss: 127.6357 - val_mae: 8.6035
Epoch 3/20
356/356 - 35s - loss: 69.9428 - mae: 6.3992 - val_loss: 91.1531 - val_mae: 7.4454
Epoch 4/20
356/356 - 35s - loss: 64.4249 - mae: 6.1407 - val_loss: 124.0287 - val_mae: 8.3481
Epoch 5/20
356/356 - 35s - loss: 52.8486 - mae: 5.5913 - val_loss: 109.1004 - val_mae: 8.2192
Epoch 6/20
356/356 - 35s - loss: 46.3094 - mae: 5.2223 - val_loss: 85.1038 - val_mae: 7.0332
Epoch 7/20
356/356 - 35s - loss: 38.2617 - mae: 4.7951 - val_loss: 92.0900 - val_mae: 7.3359
Epoch 8/20
356/356 - 35s - loss: 37.4804 - mae: 4.7402 - val_loss: 80.0016 - val_mae: 6.7239
Epoch 9/20
356/356 - 35s - loss: 33.5237 - mae: 4.4271 - val_loss: 83.2579 - val_mae: 6.8529
Epoch 10/20
356/356 - 35s - loss: 28.5170 - mae: 4.1411 - val_loss: 83.5056 - val_mae: 6.9629
Epoch 11/20
356/356 - 35s - loss: 27.0142 - mae: 3.9700 - val_loss: 92.1290 - val_mae: 7.1866
Epoch 12/20
356/356 - 35s - loss: 27.4564 - mae: 4.0428 - val_loss: 185.6307 - val_mae: 11.4591
Epoch 13/20
356/356 - 35s - loss: 23.7961 - mae: 3.7407 - val_loss: 92.3429 - val_mae: 7.2467
Epoch 14/20
356/356 - 35s - loss: 24.6167 - mae: 3.8116 - val_loss: 92.4542 - val_mae: 7.1401
Epoch 15/20
356/356 - 35s - loss: 22.2604 - mae: 3.6746 - val_loss: 82.5822 - val_mae: 6.7841
Epoch 16/20
356/356 - 35s - loss: 20.1899 - mae: 3.4430 - val_loss: 86.3830 - val_mae: 6.8304
Epoch 17/20
356/356 - 35s - loss: 17.3425 - mae: 3.2205 - val_loss: 78.4369 - val_mae: 6.6419
Epoch 18/20
356/356 - 35s - loss: 16.5249 - mae: 3.1295 - val_loss: 81.7731 - val_mae: 6.7226
Epoch 19/20
356/356 - 35s - loss: 16.6140 - mae: 3.1421 - val_loss: 80.9727 - val_mae: 6.9908
Epoch 20/20
356/356 - 35s - loss: 17.0187 - mae: 3.1785 - val_loss: 93.4115 - val_mae: 7.6512
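
Validation MAE moves up and down between epochs, so it is worth reading the best epoch from the log rather than relying on the last line alone. A minimal sketch of that, using the last five epochs copied from the output above; `parse_val_mae` is a hypothetical helper written for this log format.

```python
import re

# Lines copied from the training log above (epochs 16 to 20).
log = """\
Epoch 16/20
356/356 - 35s - loss: 20.1899 - mae: 3.4430 - val_loss: 86.3830 - val_mae: 6.8304
Epoch 17/20
356/356 - 35s - loss: 17.3425 - mae: 3.2205 - val_loss: 78.4369 - val_mae: 6.6419
Epoch 18/20
356/356 - 35s - loss: 16.5249 - mae: 3.1295 - val_loss: 81.7731 - val_mae: 6.7226
Epoch 19/20
356/356 - 35s - loss: 16.6140 - mae: 3.1421 - val_loss: 80.9727 - val_mae: 6.9908
Epoch 20/20
356/356 - 35s - loss: 17.0187 - mae: 3.1785 - val_loss: 93.4115 - val_mae: 7.6512
"""


def parse_val_mae(text):
    """Return {epoch: val_mae} parsed from 'Epoch i/n' and metric lines."""
    results = {}
    epoch = None
    for line in text.splitlines():
        header = re.match(r"Epoch (\d+)/\d+", line)
        if header:
            epoch = int(header.group(1))
            continue
        metric = re.search(r"val_mae: ([\d.]+)", line)
        if metric and epoch is not None:
            results[epoch] = float(metric.group(1))
    return results


history = parse_val_mae(log)
best_epoch = min(history, key=history.get)   # epoch with the lowest val_mae
last_epoch = max(history)
print(best_epoch, history[best_epoch], history[last_epoch])
```

In this excerpt the lowest validation MAE is at epoch 17, while the final epoch is roughly one year worse.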

## Conclusions

**I used facial images to explore whether we could help the Good Seed supermarket chain comply with alcohol laws by verifying customer age. I used a convolutional neural network with transfer learning based on ResNet50, trained on a dataset of facial photos labeled with real ages. During training, the best validation Mean Absolute Error was approximately 6.64 years (epoch 17), and the final epoch finished at about 7.65 years. Given how strict the laws are, the model may need more data to improve its accuracy. Ideas for a more accurate model:**

- A larger dataset for more training
- Including a wide range of ethnicities
- Verifying that genders are represented in roughly equal parts
- Including images with facial accessories that may throw off model accuracy (glasses, earrings, hats, etc.)
- Narrowing the target age range to teens through 30s, since that is the range we care most about
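
The last idea could start with measuring error only inside the age band of interest, since an overall MAE can hide how the model behaves there. The sketch below uses made-up ages and predictions; the band limits of 13 and 39 and the helper `mae_in_band` are assumptions for illustration, not part of the project code.

```python
def mae_in_band(y_true, y_pred, low, high):
    """MAE over samples whose true age lies in [low, high]; None if empty."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred) if low <= t <= high]
    return sum(errors) / len(errors) if errors else None


# Hypothetical true ages and model predictions, for illustration only.
true_ages = [15, 17, 19, 21, 24, 30, 45, 70]
pred_ages = [19, 16, 23, 20, 27, 28, 52, 58]

overall = mae_in_band(true_ages, pred_ages, 0, 200)  # every sample
focus = mae_in_band(true_ages, pred_ages, 13, 39)    # teens through 30s
print(overall, focus)
```

Applied to real validation predictions, the same comparison would show whether the error near the legal age limit is better or worse than the headline number.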

**Overall the model shows promise, but additional techniques such as ensembling, further hyperparameter tuning, or fine-tuning the base model could improve accuracy further.**

# Checklist

- [ ]  Notebook was opened
- [ ]  The code is error free
- [ ]  The cells with code have been arranged by order of execution
- [ ]  The exploratory data analysis has been performed
- [ ]  The results of the exploratory data analysis are presented in the final notebook
- [ ]  The model's MAE score is not higher than 8
- [ ]  The model training code has been copied to the final notebook
- [ ]  The model training output has been copied to the final notebook
- [ ]  The findings have been provided based on the results of the model training