# Assignment 4: Image Captioning

This assignment is somewhat short.  We want you to spend your time on the project instead!

This assignment explores models that connect different modalities, specifically images and text.  By the time you're done with this assignment, you'll have:

* investigated a few captioning techniques
* worked with CLIP embeddings for images and captions
* worked with the BLIP image captioning system

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-spring-main/blob/master/assignment/a4/image_captioning.ipynb)


# Foundational image captioning papers

## Show & Tell

[Show and Tell: A Neural Image Caption Generator](https://arxiv.org/pdf/1411.4555.pdf) was the first step towards neural image captioning.  Fundamentally, it is an encoder-decoder scheme similar to what we've seen in class.  Concretely, the encoder is an image classification CNN that was state of the art at the time, and the decoder is an LSTM.  As in the generation models in class, the decoder continues to generate text until a special "stop" token is emitted.  After **reading** the paper, answer the following questions:

### Questions (Part A)

1.  What parts of the CNN were fine-tuned during the image caption generation training process?
2.  What was the biggest concern when deciding how to train the model?
3.  How was the encoded image representation input into the decoder?
4.  Given we are "translating" from an image to a caption (without a length constraint), which evaluation metric did the authors determine was reasonable for a top line metric?
5.  What beam width is equivalent to one where you select the highest probability word in each decoding step?
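
The stop-token decoding loop described for Show and Tell can be illustrated with a toy greedy decoder. This is only a sketch: `toy_next_token_probs` is a hypothetical lookup table standing in for the LSTM decoder, and the vocabulary and probabilities are made up for illustration.

```python
STOP = "<stop>"

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a decoder: maps a token prefix to next-token probabilities."""
    table = {
        (): {"a": 0.7, "the": 0.3},
        ("a",): {"dog": 0.6, "cat": 0.4},
        ("a", "dog"): {"runs": 0.55, STOP: 0.45},
        ("a", "dog", "runs"): {STOP: 0.9, "fast": 0.1},
    }
    # Any prefix not in the table ends the caption immediately.
    return table.get(tuple(prefix), {STOP: 1.0})

def greedy_decode(next_token_probs, max_len=20):
    """Pick the most probable token at each step until the stop token appears."""
    tokens = []
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)
        if token == STOP:
            break
        tokens.append(token)
    return tokens

print(" ".join(greedy_decode(toy_next_token_probs)))
```

The `max_len` cap is a safety net so the loop terminates even if the stop token is never the most probable choice.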


## Deep Visual Alignment

[Deep Visual-Semantic Alignments for Generating Image Descriptions](https://cs.stanford.edu/people/karpathy/cvpr2015.pdf) is a fun read for which we will ask no questions.  Its critical insights are around understanding an image as a composition of regions, and building upon that understanding to construct both a caption for the whole image and labels for its constituent parts.

## Show, Attend & Tell

[Show, Attend & Tell](https://arxiv.org/pdf/1502.03044.pdf) applies the same "provide the decoder more context, as directly as possible" trick we've seen over the course: adding attention.  After **skimming** the paper, answer the following questions:

### Questions (Part B)

1. What is the model paying attention to?
2. What do the figures with highlight shading represent in Figures 2, 3 and 5?

# Exploring an MS COCO captioner

There are many examples of image captioners ML engineers have built on the MS COCO dataset you explored. [This one](https://replicate.com/rmokady/clip_prefix_caption) uses a (more) modern large language model as its decoder, GPT-2.  

* **Explore** the samples and play with using beam search and not.  What do you notice?

This is an example from the Show & Tell paper of a low-quality caption (see figure 5).  The GPT-2 model proposes "the car that person drove to the hospital." vs. "A yellow school bus parked in a parking lot" from the original paper. ![Misclassified](littlecar.png)

# CLIP Embeddings and Image Classification

The [CLIP paper](https://arxiv.org/pdf/2103.00020.pdf)  describes a system that emits encodings that represent both images and text captions. The system learns to match a picture with its caption so the encoding for the image and the encoding for an associated caption should have a very high cosine similarity.  Systems like DALL-E use CLIP embeddings to generate images based on a text description by using the text encoding to get the image encoding and then processing the image encoding to generate the final image.  We're going to use CLIP in the opposite direction.  Namely we're going to use CLIP embeddings to classify images, that is to score a set of captions for an image based on the image's content.
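
To make the cosine similarity idea concrete, here is a minimal NumPy sketch. The vectors are made-up 4-dimensional stand-ins, not real CLIP embeddings, and `cosine_similarity` is a helper defined here for illustration rather than part of any CLIP API.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up "embeddings" for illustration only (real CLIP vectors are much longer).
image_vec = np.array([0.9, 0.1, 0.3, 0.0])
matching_caption_vec = np.array([0.8, 0.2, 0.4, 0.1])
unrelated_caption_vec = np.array([0.0, 0.9, 0.0, 0.7])

sim_match = cosine_similarity(image_vec, matching_caption_vec)
sim_unrelated = cosine_similarity(image_vec, unrelated_caption_vec)
print(f"matching: {sim_match:.3f}, unrelated: {sim_unrelated:.3f}")
```

A caption whose vector points in nearly the same direction as the image vector scores close to 1, while an unrelated caption scores much lower.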


We can use the HuggingFace implementation of CLIP to experiment with this multimodal capability. Since we are not fine-tuning it we do not need access to a GPU.

In [1]:
!pip install -q transformers

In [2]:
!pip install -q diffusers --upgrade

In [3]:
!pip install -q invisible_watermark accelerate safetensors

In [4]:
import tensorflow as tf
from PIL import Image
import requests
from transformers import CLIPProcessor, TFCLIPModel

In [5]:
model = TFCLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


All model checkpoint layers were used when initializing TFCLIPModel.

All the layers of TFCLIPModel were initialized from the model checkpoint at openai/clip-vit-base-patch32.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFCLIPModel for predictions without further training.



Now let's begin our experiment.  We're going to select two images that contain both zebras and cars.  They may contain other things as well.  We're also going to generate a set of captions that we will score.  Specifically, we'll pass the output for the captions through a softmax to give us a probability distribution over the four captions.

In [6]:
# Example tags: animal = zebra, transport = car

urls = ["http://farm1.staticflickr.com/9/15631288_605abb3096_z.jpg", #zebras foreground, cars background
        "http://farm4.staticflickr.com/3057/3033996041_11293469b7_z.jpg"]  #zebra foreground, tiny car background
captions = ["a photo of cars",
            "a photo of a giraffe",
            "a photo of zebras in a field",
            "a photo of some zebras and cars"]

for url in urls:
    image = Image.open(requests.get(url, stream=True).raw)

    inputs = processor(
        text=captions, images=image, return_tensors="tf", padding=True
    )

    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
    probs = tf.nn.softmax(logits_per_image, axis=1)  # we can take the softmax to get the label probabilities

    print()
    print(url)
    for i, caption in enumerate(captions):
        print('%40s - %.4f' % (caption, probs[0, i]))


http://farm1.staticflickr.com/9/15631288_605abb3096_z.jpg
                         a photo of cars - 0.0014
                    a photo of a giraffe - 0.0475
            a photo of zebras in a field - 0.0151
         a photo of some zebras and cars - 0.9360

http://farm4.staticflickr.com/3057/3033996041_11293469b7_z.jpg
                         a photo of cars - 0.0000
                    a photo of a giraffe - 0.0000
            a photo of zebras in a field - 0.9660
         a photo of some zebras and cars - 0.0339


The CLIP embeddings allow us to associate captions with images.  Specifically, we can build a classifier that assigns probabilities to each of the captions.  We want the highest probability to go to the most descriptive caption out of the four captions for the given image.  Both images contain zebras, but the first features a line of clearly visible cars while the second has only one small car off in the distance.  Accordingly, the first image scores highest for ```a photo of some zebras and cars``` because both the zebras and the cars are very visible.  The second image scores highest for ```a photo of zebras in a field```; its small car is much less prominent, yet the caption that mentions cars still scores above zero (0.0339).
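
The softmax step that turns similarity scores into caption probabilities can be sketched in plain NumPy. The logits below are hypothetical values chosen for illustration, not outputs of the CLIP model.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=float)
    # Subtracting the row maximum avoids overflow without changing the result.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical similarity scores for one image against four captions.
example_logits = np.array([[18.0, 21.5, 20.4, 24.5]])
example_probs = softmax(example_logits)
print(example_probs.round(4))
```

Because softmax exponentiates the scores, a gap of a few points between logits is enough for one caption to take most of the probability mass, which is why the winning captions in the notebook often score above 0.9.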

In [7]:
# Example tags: two dogs in bike, human bike tiny dog

urls = ["http://farm1.staticflickr.com/8/10896131_6a184b48cb_z.jpg",  #2 dogs in bike basket
        "http://farm4.staticflickr.com/3082/2797293301_dd26fd613f_z.jpg"] #human and bike with tiny dog
captions = ["a photo of a dog",
            "a photo of some dogs in a basket",
            "a photo of a bike",
            "a photo of some dogs with a bike"]

for url in urls:
    image = Image.open(requests.get(url, stream=True).raw)

    inputs = processor(
        text=captions, images=image, return_tensors="tf", padding=True
    )

    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
    probs = tf.nn.softmax(logits_per_image, axis=1)  # we can take the softmax to get the label probabilities

    print()
    print(url)
    for i, caption in enumerate(captions):
        print('%40s - %.4f' % (caption, probs[0, i]))


http://farm1.staticflickr.com/8/10896131_6a184b48cb_z.jpg
                        a photo of a dog - 0.0001
        a photo of some dogs in a basket - 0.0378
                       a photo of a bike - 0.0007
        a photo of some dogs with a bike - 0.9614

http://farm4.staticflickr.com/3082/2797293301_dd26fd613f_z.jpg
                        a photo of a dog - 0.0005
        a photo of some dogs in a basket - 0.0000
                       a photo of a bike - 0.9586
        a photo of some dogs with a bike - 0.0408


Again, these two images both contain bicycles and dogs.  The first image is two dogs in a basket on the front of a bike.  While the bike is visible, the two dogs are the focus of the image.  The second image features a person with their bike.  The bike happens to contain a small dog.  We would expect the embeddings to reflect the different emphases of the photos and indeed they do.

In [8]:
# Example tags: animal = dog, transport = bike

urls = ["http://farm1.staticflickr.com/124/405495389_d4316b1224_z.jpg",   #dog foreground and tiny bikes background
        "http://farm8.staticflickr.com/7194/6991675037_3c298541c0_z.jpg"] #motorbike foreground, many bikes and tiny dog background
captions = ["a photo of a dog",
            "a photo of a motorbike",
            "a photo of a plane",
            "a photo of some bikes"]

for url in urls:
    image = Image.open(requests.get(url, stream=True).raw)

    inputs = processor(
        text=captions, images=image, return_tensors="tf", padding=True
    )

    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
    probs = tf.nn.softmax(logits_per_image, axis=1)  # we can take the softmax to get the label probabilities

    print()
    print(url)
    for i, caption in enumerate(captions):
        print('%40s - %.4f' % (caption, probs[0, i]))


http://farm1.staticflickr.com/124/405495389_d4316b1224_z.jpg
                        a photo of a dog - 0.9990
                  a photo of a motorbike - 0.0002
                      a photo of a plane - 0.0008
                   a photo of some bikes - 0.0000

http://farm8.staticflickr.com/7194/6991675037_3c298541c0_z.jpg
                        a photo of a dog - 0.0013
                  a photo of a motorbike - 0.8967
                      a photo of a plane - 0.0000
                   a photo of some bikes - 0.1020


For the third example, the first image includes a dog in the foreground and a number of small bikes in the distant background.  You can look at the annotations associated with the image to see where these objects are located. The second image includes a motorbike/motorcycle in the foreground but a number of bikes and a tiny dog in the background.  Again, we're hand-crafting these captions to include the items in the image, but we want the score for each caption to reflect what's in the foreground of the image.

Now it is your turn.  You will essentially replicate the examples above, but with images **you** select.  First, select *two* images for processing. Go to [the COCO Explorer](https://cocodataset.org/#explore), click on two tag icons, an animal (see the icon column of animals) and a mode of transportation (see the corresponding icon column), and search. (You pick which; you might have to try a few combinations until you get multiple image results.)

Find two different images that each contain your animal and your mode of transportation.  It's okay if they contain other things as well.  If you click on the URL icon above each image, you'll see a link to the annotated image and the original (unlabeled) image. Put the original image links in the code cell below in place of *your image 1 url* and *your image 2 url*, then create four captions, some mentioning only one of the objects and others mentioning both together. You can see the captions we created for the three examples above.  The goal is to get a probability above 0.85 for the caption that best describes the first image and for the caption that best describes the second image.

As in the examples above, you must find a pair of images with the same two objects tagged in them, but which get different results for which caption has the highest probability according to the CLIP model.

Note which object tags you used, and give a brief explanation of what looks different about the two images that you think made them get different CLIP results for the most likely caption.  Enter that explanation in the cell below.  You **do not need to enter it in the answers sheet**.  Just leave it in the notebook that you submit.
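
To check whether a set of scores meets the 0.85 goal, a small helper along these lines may be useful. This is a sketch: `best_caption` and its `threshold` parameter are names introduced here for illustration, and the probabilities are copied from the first zebra example above.

```python
def best_caption(captions, probs, threshold=0.85):
    """Return (caption, probability, meets_threshold) for the top-scoring caption."""
    if len(captions) != len(probs):
        raise ValueError("captions and probs must have the same length")
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    best_prob = float(probs[best_index])
    return captions[best_index], best_prob, best_prob >= threshold

# Probabilities copied from the first zebra example.
zebra_captions = ["a photo of cars",
                  "a photo of a giraffe",
                  "a photo of zebras in a field",
                  "a photo of some zebras and cars"]
zebra_probs = [0.0014, 0.0475, 0.0151, 0.9360]

caption, prob, ok = best_caption(zebra_captions, zebra_probs)
print(f"{caption!r} wins with p={prob:.4f}; above threshold: {ok}")
```

In the notebook, `probs` is a TensorFlow tensor of shape `(1, 4)`, so you would pass something like `probs[0].numpy().tolist()` as the second argument.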

In [24]:
# Example tags: animal = ???, transportation = ???

### YOUR CODE HERE
# Tags used: animal = dog, transportation = truck
urls = [
    "http://farm6.staticflickr.com/5223/5854934356_c2ae138fd2_z.jpg",
    "http://farm1.staticflickr.com/149/338655676_b5fade2afb_z.jpg"
]
captions = [
    "a photo of a truck",
    "a photo of a dog",
    "a photo of a truck and a dog",
    "a photo of two dogs"
]
### END YOUR CODE

for url in urls:
    image = Image.open(requests.get(url, stream=True).raw)

    inputs = processor(
        text=captions, images=image, return_tensors="tf", padding=True
    )

    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
    probs = tf.nn.softmax(logits_per_image, axis=1)  # we can take the softmax to get the label probabilities

    print()
    print(url)
    for i, caption in enumerate(captions):
        print('%40s - %.4f' % (caption, probs[0, i]))


http://farm6.staticflickr.com/5223/5854934356_c2ae138fd2_z.jpg
                      a photo of a truck - 0.0009
                        a photo of a dog - 0.0001
            a photo of a truck and a dog - 0.9989
                     a photo of two dogs - 0.0002

http://farm1.staticflickr.com/149/338655676_b5fade2afb_z.jpg
                      a photo of a truck - 0.0002
                        a photo of a dog - 0.0465
            a photo of a truck and a dog - 0.0133
                     a photo of two dogs - 0.9401


### Questions (Part C)

1. What is the animal tag you selected? Dog

2. What is the transportation tag you selected? Truck

3. What is the probability associated with the most likely caption for image 1? 0.9989

4. What is the probability associated with the most likely caption for image 2? 0.9401

**(Answer 5 below but do NOT enter your sentences in the answers file)**

5. Why do you think the differences between your two images are reflected in the 4 captions you produced?

Please answer in two to four sentences right here:

*BEGIN Q 5 ANSWER HERE*
In the first image, the dog and truck are both clearly visible, so “a photo of a truck and a dog” is the most accurate. In the second image, there are two dogs together and no obvious truck, so “a photo of two dogs” is the best match.

*END Q 5 ANSWER HERE*


We used CLIP to evaluate the captions and to select the best caption given a choice from four.  Now let's use a model named [BLIP](https://huggingface.co/docs/transformers/en/model_doc/blip) to generate the caption for an image.

In [25]:
!pip install -q invisible_watermark transformers accelerate safetensors

In [26]:

from transformers import AutoProcessor, TFBlipForConditionalGeneration

In [27]:
bl_processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

bl_model = TFBlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")



All model checkpoint layers were used when initializing TFBlipForConditionalGeneration.

All the layers of TFBlipForConditionalGeneration were initialized from the model checkpoint at Salesforce/blip-image-captioning-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBlipForConditionalGeneration for predictions without further training.


Now let's begin our experiment.  We're going to re-use the two images you used in the previous CLIP exercise. Your images contain both the animal and the type of transportation you selected.  They may contain other things as well.  We're also going to generate a caption for each one that we will score.  Specifically, we'll pass the output for the captions through a softmax to give us a probability distribution over the four captions.

First, let's generate a caption for your first image, the one in C3.  Paste the image URL into the spot below.

In [28]:
#image one URL

### YOUR CODE HERE
url = "http://farm4.staticflickr.com/3498/4022398227_6b375d1ae3_z.jpg"
### END YOUR CODE

image = Image.open(requests.get(url, stream=True).raw)

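# `text` is a conditioning prefix that BLIP continues; a full sentence here tends to be echoed back as the caption
text = "A picture of cars and a stop sign"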

inputs = bl_processor(images=image, text=text, return_tensors="tf")

#outputs = bl_model(**inputs)
outputs = bl_model.generate(**inputs, max_new_tokens=25)

print(bl_processor.decode(outputs[0], skip_special_tokens=True))


a picture of cars and a stop sign


Next, let's generate a caption for your second image, the one in C4.  Paste the image URL into the spot below.

In [29]:
#image two URL

### YOUR CODE HERE
url = "http://farm9.staticflickr.com/8158/7518186846_2cd85af966_z.jpg"
### END YOUR CODE

image = Image.open(requests.get(url, stream=True).raw)

text = "A picture of two people with a umbrella"

inputs = bl_processor(images=image, text=text, return_tensors="tf")

#outputs = bl_model(**inputs)
outputs = bl_model.generate(**inputs, max_new_tokens=25)

print(bl_processor.decode(outputs[0], skip_special_tokens=True))

a picture of two people with a umbrella


Now let's see how well the captions you just generated describe your images.  We're going to use CLIP to evaluate them.  Fill out the cell below by copying the URLs for the images you selected with the animal and the transportation. Copy the BLIP caption for your first image and paste it into caption #1.  Copy the BLIP caption for your second image and paste it into caption #3. Then take the highest-scoring caption for image #1 from question C3 and paste it into slot 2, and take the highest-scoring caption for image #2 from question C4 and paste it into slot 4. Now rerun CLIP and look at the scores.

In [35]:
# Example tags from section C: animal = ???, transportation = ???

### YOUR CODE HERE
urls = ["http://farm4.staticflickr.com/3498/4022398227_6b375d1ae3_z.jpg",   #
        "http://farm9.staticflickr.com/8158/7518186846_2cd85af966_z.jpg"] #
captions = ["A photo of some cars",
            "A photo of a stop sign",
            "A photo of cars and a stop sign",
            "A picture of two people with an umbrella"]
### END YOUR CODE
            #
for url in urls:
    image = Image.open(requests.get(url, stream=True).raw)

    inputs = processor(
        text=captions, images=image, return_tensors="tf", padding=True
    )

    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
    probs = tf.nn.softmax(logits_per_image, axis=1)  # we can take the softmax to get the label probabilities

    print()
    print(url)
    for i, caption in enumerate(captions):
        print('%40s - %.4f' % (caption, probs[0, i]))



http://farm4.staticflickr.com/3498/4022398227_6b375d1ae3_z.jpg
                    A photo of some cars - 0.1169
                  A photo of a stop sign - 0.0832
         A photo of cars and a stop sign - 0.7011
A picture of two people with an umbrella - 0.0988

http://farm9.staticflickr.com/8158/7518186846_2cd85af966_z.jpg
                    A photo of some cars - 0.0000
                  A photo of a stop sign - 0.0001
         A photo of cars and a stop sign - 0.0000
A picture of two people with an umbrella - 0.9998


### Questions (Part D)

1. Does the BLIP caption win or do other captions win for image #1? The best caption for image #1 is "A photo of cars and a stop sign" with probability 0.7011, which beats the other three captions.

2. Does the BLIP caption win or do other captions win for image #2? For image #2, "A picture of two people with an umbrella" wins overwhelmingly, at probability 0.9998.

3. What is the probability associated with the most likely caption for image #1? 0.7011

4. What is the probability associated with the most likely caption for image #2? 0.9998

**(Answer 5 below but do NOT enter your sentences in the answers file)**

5. Why do you think the winning caption scored higher than the 3 others?

Please answer Q 5 in two to four sentences right here:

BEGIN Q 5 ANSWER HERE
     The first image clearly shows cars and a stop sign, so “A photo of cars and a stop sign” best matches what CLIP sees. In the second image, two people with an umbrella are prominent, so the caption mentioning them is far more accurate than any caption about cars or stop signs. Hence, CLIP assigns a very high probability (0.9998) to “A picture of two people with an umbrella.”

END Q 5 ANSWER HERE

## Yay, you're done with your 266 homework.  Now focus on your project!