# Generating marketing slogans for product images

This notebook shows how to fine-tune a generative AI model to generate marketing slogans for product images. 

We start with a foundation model, BLIP, available through Hugging Face. We fine-tune it with an Amazon SageMaker training job. Then we compare the slogans generated by our fine-tuned model with slogans generated by an "out of the box" model. 

TL;DR - the fine-tuned model scores better than the baseline on all three metrics.

| Metric | Baseline model | Fine-tuned model |
| -- | -- | -- |
| BERTScore (F1, higher is better) | 0.82 | 0.85 |
| WER (lower is better) | 2.06 | 1.24 |
| ROUGE (higher is better) | 0.05 | 0.09 |

## License

    Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
    SPDX-License-Identifier: MIT-0

## Data set

We use the [Automatic Understanding of Image and Video Advertisements](https://people.cs.pitt.edu/~kovashka/ads/) image dataset. The citation for this data set is:

    Automatic Understanding of Image and Video Advertisements. Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, Adriana Kovashka. To appear, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

## Prerequisites

This notebook was built in Amazon SageMaker Studio. It uses an `ml.g4dn.xlarge` instance with the `PyTorch 1.13 Python 3.9 GPU Optimized` image.

Download the data set from the public links. It comes as a set of 11 zip files with images, `subfolder-0.zip` through `subfolder-10.zip`, plus a zip file with metadata, `annotations_images.zip`.

Create a directory called `data` in the same directory as this notebook and unzip all of the zip files there. You should end up with one subdirectory for each of the zip files.

Make sure your account's service quota allows you to use an `ml.p4d.24xlarge` instance for training, and request a quota increase if it does not.

## Install libraries

Install the Hugging Face libraries used in this notebook.

In [None]:
!pip install transformers datasets evaluate -q

In [None]:
from datasets import load_dataset 

## Prepare data

In this section we need to create a dataset in the standard format for images. We need a folder with all of the images, and a metadata file that maps images to ground-truth captions (slogans).

We'll read the mapping of images to slogans from the `Slogans.json` file and build a new metadata list. Since many of the images have multiple slogans, we will create multiple copies of each image, one for each slogan.

In [None]:
import json

with open('data/image/Slogans.json', 'r') as S:
    slogans = json.load(S)

In [None]:
import os
image_folder = 'image_folder_blip'

if not os.path.exists(image_folder):
    os.mkdir(image_folder)

In [None]:
import shutil
captions = []
for image_file_name in slogans:
    path_parts = os.path.split(image_file_name)
    base_name = path_parts[-1]
    for idx, slogan in enumerate(slogans[image_file_name]):
        s_file_name = f"{idx}-{base_name}"
        captions.append({"file_name": s_file_name, "text": slogan})
        shutil.copyfile(os.path.join('data', image_file_name), os.path.join(image_folder, s_file_name))

In [None]:
with open(os.path.join(image_folder, "metadata.jsonl"), 'w') as f:
    for item in captions:
        f.write(json.dumps(item) + "\n")
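
As an optional sanity check before loading the dataset, the sketch below reads `metadata.jsonl` back and reports entries whose image file is missing or whose caption is empty. The helper name `check_metadata` is hypothetical and not part of the original notebook; it assumes the `file_name`/`text` record layout written above.

```python
import json
import os


def check_metadata(folder, metadata_name="metadata.jsonl"):
    """Return (record_count, problems) for an imagefolder-style directory.

    Hypothetical helper: assumes one JSON object per line with
    "file_name" and "text" keys, as written by the cell above.
    """
    problems = []
    count = 0
    with open(os.path.join(folder, metadata_name), "r") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            file_name = record.get("file_name", "")
            # Every record should point at an image inside the same folder.
            if not os.path.isfile(os.path.join(folder, file_name)):
                problems.append((line_no, f"missing image: {file_name}"))
            # An empty caption gives the model nothing to learn from.
            if not str(record.get("text", "")).strip():
                problems.append((line_no, "empty caption"))
    return count, problems
```

Calling `check_metadata(image_folder)` should return an empty problem list if the copy step completed for every slogan.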

In [None]:
from datasets import load_dataset 

ds = load_dataset("imagefolder", data_dir=image_folder, split="train")

In [None]:
ds = ds.train_test_split(test_size=0.1)
train_ds = ds["train"]
test_ds = ds["test"]

In [None]:
import sagemaker
sess = sagemaker.Session()
s3_bucket = sess.default_bucket() 
print(s3_bucket)

In [None]:
train_path = 'ads/blip/train'
test_path = 'ads/blip/test'
train_ds.save_to_disk(dataset_path=f"s3://{s3_bucket}/{train_path}")
test_ds.save_to_disk(dataset_path=f"s3://{s3_bucket}/{test_path}")

## Run training job

Next we'll run a training job on Amazon SageMaker using the HuggingFace classes in the Python SDK.

In [None]:
import sagemaker
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

In [None]:
import sagemaker.huggingface

In [None]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={
    'epochs': 10,
    'model_name': 'Salesforce/blip-image-captioning-base',
    'learning_rate': 5e-5,
    'train_batch_size': 8,
    'output_dir': '/opt/ml/model'
}

# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

# instance configurations
instance_type='ml.p4d.24xlarge'
instance_count=1
volume_size=500

# create the Estimator
huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='./scripts',
        instance_type=instance_type,
        instance_count=instance_count,
        volume_size=volume_size,
        role=role,
        image_uri='763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04',
        py_version='py39',
        distribution= distribution,
        hyperparameters = hyperparameters
)

In [None]:
huggingface_estimator.fit(
  {'train': f"s3://{s3_bucket}/{train_path}"}
)
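
The `train.py` entry point in `./scripts` is not shown in this notebook. In SageMaker script mode, each key in `hyperparameters` is passed to the entry point as a command-line argument (for example `--epochs 10`), and the local path of each input channel is exposed through an environment variable such as `SM_CHANNEL_TRAIN`. A minimal sketch of how a script might read these values follows. The function name `parse_args`, the `--train_dir` argument, and the fallback defaults are assumptions for illustration, not the author's actual script.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a SageMaker entry point could.

    Illustrative only: argument names mirror the `hyperparameters`
    dict above; the defaults are assumptions.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--model_name", type=str,
                        default="Salesforce/blip-image-captioning-base")
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--train_batch_size", type=int, default=8)
    parser.add_argument("--output_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    # SageMaker sets SM_CHANNEL_TRAIN to the local path of the "train" channel.
    parser.add_argument("--train_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN",
                                               "/opt/ml/input/data/train"))
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _ = parser.parse_known_args(argv)
    return args
```

Because the training split was written with `save_to_disk`, a script along these lines could then read it with `datasets.load_from_disk(args.train_dir)`.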

## Evaluate

We'll evaluate the predictions from both the fine-tuned model and the base model against the ground truth slogans. We'll calculate several metrics including WER, BERTScore, and ROUGE.

### Download model artifact from S3

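In the next cell, specify the training job name. You can find it in the output of the `fit` call in the previous code cell, or read it from `huggingface_estimator.latest_training_job.name` if the estimator is still in memory.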

In [None]:
training_job_name = 'huggingface-pytorch-training-2023-03-30-01-58-11-853'

In [None]:
model_artifact = sess.describe_training_job(training_job_name)['ModelArtifacts']['S3ModelArtifacts']
print(model_artifact)

In [None]:
model_folder = 'model_folder_blip'

if not os.path.exists(model_folder):
    os.mkdir(model_folder)
    
sagemaker.s3.S3Downloader.download(model_artifact, model_folder)

In [None]:
!tar zxf model_folder_blip/model.tar.gz -C model_folder_blip

### Load model and preview results

In [None]:
from transformers import BlipForConditionalGeneration
model = BlipForConditionalGeneration.from_pretrained(model_folder)

In [None]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

In [None]:
b_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
b_model.to(device)

In [None]:
from matplotlib import pyplot as plt
import random
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

fig = plt.figure(figsize=(18, 35))

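# show 12 random test examples with captions from both models
for cnt in range(12):
    # randrange excludes the upper bound; randint(0, len(test_ds)) could index past the end
    idx = random.randrange(len(test_ds))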
    example = test_ds[idx]
    image = example["image"]
    orig_caption = example["text"]
    inputs = processor(images=image, return_tensors="pt").to(device)
    pixel_values = inputs.pixel_values

    generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
    generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    b_ids = b_model.generate(pixel_values=pixel_values, max_length=50)
    b_caption = processor.batch_decode(b_ids, skip_special_tokens=True)[0]
    
    fig.add_subplot(6, 2, cnt+1)
    plt.imshow(image)
    plt.axis("off")
    plt.title(f"Original: {orig_caption}\nGenerated: {generated_caption}\nBaseline: {b_caption}")

### Get predictions from test set

In [None]:
!pip install bert-score

In [None]:
from evaluate import load
bertscore = load("bertscore")

In [None]:
predictions = []
b_predictions = []
references = []

for idx in range(len(test_ds)):
    example = test_ds[idx]
    image = example["image"]
    orig_caption = example["text"]
    inputs = processor(images=image, return_tensors="pt").to(device)
    pixel_values = inputs.pixel_values

    generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
    generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    b_ids = b_model.generate(pixel_values=pixel_values, max_length=50)
    b_caption = processor.batch_decode(b_ids, skip_special_tokens=True)[0]
    
    references.append(orig_caption)
    predictions.append(generated_caption)
    b_predictions.append(b_caption)

### BERTScore (higher is better)

In [None]:
results = bertscore.compute(predictions=predictions, references=references, lang="en")
b_results = bertscore.compute(predictions=b_predictions, references=references, lang="en")

In [None]:
import numpy as np
print(f"F1 - tuned: {np.mean(results['f1'])}, baseline: {np.mean(b_results['f1'])}")
print(f"Precision - tuned: {np.mean(results['precision'])}, baseline: {np.mean(b_results['precision'])}")
print(f"Recall - tuned: {np.mean(results['recall'])}, baseline: {np.mean(b_results['recall'])}")

### WER (lower is better)

In [None]:
!pip install jiwer

In [None]:
from evaluate import load
wer = load("wer")
wer_score = wer.compute(predictions=predictions, references=references)
b_wer_score = wer.compute(predictions=b_predictions, references=references)

In [None]:
print(f"WER: {wer_score}, baseline: {b_wer_score}")
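
A WER above 1.0, as in the summary table, can look surprising. WER counts word-level substitutions, deletions, and insertions and divides the total by the number of reference words, so a prediction much longer than a short reference slogan can contain more errors than the reference has words. The sketch below is a simplified pure-Python version for a single pair (the `evaluate` metric aggregates over the whole test set). `simple_wer` is a hypothetical name, and whitespace tokenization is an assumption.

```python
def simple_wer(reference, prediction):
    """Word error rate for one pair: word edit distance / reference length."""
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Inside the loop, prev[j] is the edit distance between the first
    # i-1 reference words and the first j predicted words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# An exact match scores 0; a long, unrelated caption scores well above 1.
print(simple_wer("just do it", "just do it"))
print(simple_wer("just do it", "a man running on a road in shoes"))
```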

### ROUGE (higher is better)

In [None]:
!pip install rouge-score nltk

In [None]:
rouge = load('rouge')

In [None]:
rouge_result = rouge.compute(predictions=predictions,
                             references=references,
                             use_aggregator=True)
b_rouge_result = rouge.compute(predictions=b_predictions,
                             references=references,
                             use_aggregator=True)

In [None]:
rouge_result

In [None]:
b_rouge_result
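
To make the ROUGE numbers easier to interpret: ROUGE-1 measures unigram overlap between a prediction and its reference, reported as an F-measure of precision and recall. The sketch below is a simplified single-pair version that assumes lowercased whitespace tokens. The `rouge` metric loaded above uses its own tokenizer and also reports ROUGE-2 and ROUGE-L variants. `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(reference, prediction):
    """Simplified ROUGE-1 F-measure for one reference/prediction pair."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Scores near zero indicate little word overlap with the ground truth, which is not unusual for short creative text where many different slogans could suit the same image.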

## Next steps

Next steps might include trying different foundation models, training for more epochs, or adding human feedback to improve the results.