# ECE 5700 Final Project - Image Captioning

This notebook trains and evaluates an instance of the Global Enhanced Transformer (Ji et al.). Please consult the repository's README for setup requirements and information on how to customize the run.

## Environment Setup

First, we mount Google Drive, since the image features and captions are expected to be stored there before running.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
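
Once the drive is mounted, it can be worth confirming that the expected inputs are actually there before starting a long run. The following is only a convenience sketch and not part of the repository: `missing_paths` is a hypothetical helper, and the two paths are the ones passed to `run_project.py` later in this notebook, so they are assumptions about your Drive layout.

```python
import os

def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Paths copied from the run_project.py commands in this notebook; change
# them if your features and annotations are stored elsewhere.
DRIVE_ROOT = "/content/drive/MyDrive/Colab Notebooks/ece570_project"
required = [
    os.path.join(DRIVE_ROOT, "meshed-memory-transformer", "coco_detections.hdf5"),
    os.path.join(DRIVE_ROOT, "meshed-memory-transformer", "annotations"),
]

missing = missing_paths(required)
if missing:
    print("Missing inputs:")
    for path in missing:
        print(f"  {path}")
else:
    print("All expected inputs were found.")
```

If anything is reported missing, fix the Drive layout or the paths before running the training or evaluation cells.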


Then we install the required dependencies.

In [21]:
%cd '/content/image_captioning'

! sudo pip install -r requirements.txt

/content/image_captioning
Collecting cudf-cu12@ https://pypi.nvidia.com/cudf-cu12/cudf_cu12-24.10.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (from -r requirements.txt (line 60))
  Downloading https://pypi.nvidia.com/cudf-cu12/cudf_cu12-24.10.1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (24.9 MB)
Collecting en-core-web-sm@ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl#sha256=86cc141f63942d4b2c5fcee06630fd6f904788d2f0ab005cce45aadb8fb73889 (from -r requirements.txt (line 91))
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Collecting libcu

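After installation, a quick import check can confirm that key packages are visible to the current Python environment. This is a minimal sketch rather than anything the repository provides: `missing_modules` is a hypothetical helper, and the module list is a hand-picked subset (`pycocoevalcap` is imported later in this notebook, and `en_core_web_sm` appears in the install log above), not the full contents of `requirements.txt`.

```python
import importlib.util

def missing_modules(names):
    """Return the module names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Hand-picked subset of top-level module names; extend as needed.
to_check = ["pycocoevalcap", "en_core_web_sm"]

not_found = missing_modules(to_check)
if not_found:
    print("Not importable:", ", ".join(not_found))
else:
    print("All checked modules are importable.")
```

`importlib.util.find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.
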
## Training

This cell kicks off a training session with a batch size of 50. When training completes, it proceeds to test and score the model.

In [8]:
%cd '/content/image_captioning'

! python run_project.py --batch_size 50 \
                     --features_path "/content/drive/MyDrive/Colab Notebooks/ece570_project/meshed-memory-transformer/coco_detections.hdf5" \
                     --annotation_folder "/content/drive/MyDrive/Colab Notebooks/ece570_project/meshed-memory-transformer/annotations" \
                     --save_path "/content/drive/MyDrive/Colab Notebooks/ece570_project"

/content/image_captioning
2024-11-10 20:39:32.402899: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-10 20:39:32.420913: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-10 20:39:32.442254: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-10 20:39:32.448759: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-10 20:39:32.464389: I tensorflow/core

## Evaluation

The previous cell both trains and tests the model. To skip training and test a pre-trained model instead, skip the previous cell and run only this one, which passes a saved checkpoint via `--load_model`.

This cell will produce the average loss of the model and the CIDEr score for 100 images in the test dataset.

**Note that the first two batches take a very long time while captions are generated; the remaining batches run much faster.**

In [14]:
%cd '/content/image_captioning'

! python run_project.py --batch_size 50 \
                     --features_path "/content/drive/MyDrive/Colab Notebooks/ece570_project/meshed-memory-transformer/coco_detections.hdf5" \
                     --annotation_folder "/content/drive/MyDrive/Colab Notebooks/ece570_project/meshed-memory-transformer/annotations" \
                     --save_path "/content/drive/MyDrive/Colab Notebooks/ece570_project" \
                     --load_model "/content/image_captioning/saved_models/GET.pth"

/content/image_captioning
2024-11-10 23:31:50.756138: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-10 23:31:50.774500: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-10 23:31:50.795846: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-10 23:31:50.802376: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-10 23:31:50.817903: I tensorflow/core
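
The only difference between the training and evaluation commands is the `--load_model` flag. The argument handling inside `run_project.py` is not shown in this notebook, so the following is purely a hypothetical sketch of how a script could use such an optional flag to decide whether to train. The flag names come from the commands above; `build_parser`, `should_train`, the default values, and the placeholder argument values are assumptions made for illustration.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the flags used in this notebook."""
    parser = argparse.ArgumentParser(description="Sketch of a train/test entry point")
    parser.add_argument("--batch_size", type=int, default=50)
    parser.add_argument("--features_path", type=str, required=True)
    parser.add_argument("--annotation_folder", type=str, required=True)
    parser.add_argument("--save_path", type=str, required=True)
    # Optional: when given, this sketch skips training and only evaluates.
    parser.add_argument("--load_model", type=str, default=None)
    return parser

def should_train(args):
    """Train only when no pre-trained checkpoint was supplied."""
    return args.load_model is None

# Placeholder values, not real paths.
common = ["--features_path", "f.hdf5", "--annotation_folder", "ann", "--save_path", "out"]
train_args = build_parser().parse_args(common)
eval_args = build_parser().parse_args(common + ["--load_model", "GET.pth"])

print(should_train(train_args))  # True: no checkpoint given
print(should_train(eval_args))   # False: checkpoint given, evaluate only
```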

Generating the actual caption strings when testing with the `run_project.py` script takes an extremely long time. However, the repo contains pre-generated captions from the best model created. Run the following cell to skip the lengthy generation process and compute the CIDEr score directly from the existing JSON files.

In [23]:
%cd '/content/image_captioning'

import json
from pycocoevalcap.cider.cider import Cider

def evaluate(generations, references):
    """
    Scores generated captions against reference captions using CIDEr
        generations: dict mapping each key to its generated caption string
        references: dict mapping each key to its original caption string
    """

    print("Evaluating CIDEr...")

    # Score only the keys present in both dictionaries so that every
    # generated caption is paired with its own reference
    shared_keys = [key for key in generations if key in references]

    # Cider expects captions as lists so we convert the strings to
    # lists of strings here
    cider_refs = {key: [references[key]] for key in shared_keys}
    cider_preds = {key: [generations[key]] for key in shared_keys}

    # Compute score
    cider_eval = Cider()
    cider_score, _ = cider_eval.compute_score(cider_refs, cider_preds)

    print("\n==============================")
    print(f" CIDEr Score: {cider_score}")
    print("==============================\n")

    # Print up to 5 example captions, pairing each generation with its
    # reference by key rather than by dictionary order
    print("Example generations:")
    for key in shared_keys[:5]:
        print(f"\nExpected Caption: {references[key]}")
        print(f"Generated Caption: {generations[key]}")

with open('/content/image_captioning/saved_models/generations_GET.json') as f_in:
    gen = json.load(f_in)

with open('/content/image_captioning/saved_models/references_GET.json') as f_in:
    ref = json.load(f_in)

evaluate(gen, ref)

/content/image_captioning
Evaluating CIDEr...

 CIDEr Score: 0.0

Example generations:

Expected Caption: A clock with the appearance of the wheel of a bicycle 
Generated Caption: [unused0] [unused0] [unused0] [unused2] [unused2] [unused4] [unused0] [unused4] [unused2] [unused5] [unused11] [unused5] [unused1] [unused2] [unused6] [unused17] [unused1] [unused2] [unused19] [unused10] [unused6] [unused23] [unused4] [unused18] [unused26] [unused17] [unused5] [unused16] [unused15] [unused4] [unused6] [unused3] [unused22] [unused26] [unused24] [unused12] [unused39] [unused22] [unused24] [unused39] [unused22] [unused30] [unused4] [unused27] [unused15] [unused2]

Expected Caption: A motorcycle with its brake extended standing outside
Generated Caption: [unused0] [unused1] [unused3] [unused1] [unused0] [unused1] [unused2] [unused5] [unused7] [unused3] [unused8] [unused2] [unused9] [unused4] [unused2] [unused7] [unused5] [unused7] [unused10] [unused16] [unused23] [unused7] [unused24] [unused16] [