# FROMAGe Inference Example

This notebook shows how to run some of the tasks FROMAGe supports. It reproduces several examples from our paper, [Grounding Language Models to Images for Multimodal Generation](https://arxiv.org/abs/2301.13823).

For reproducibility, all examples in this notebook use greedy (deterministic) decoding. To get more diverse and higher-quality outputs, as used for some of the figures in the paper, switch to nucleus sampling by changing the `temperature` and `top_p` arguments of the `generate()` function.
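To illustrate what `temperature` and `top_p` control, here is a minimal, library-free sketch of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. The function names and the toy logits are hypothetical and are not taken from FROMAGe's implementation.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token id from the temperature-scaled, top-p-filtered distribution."""
    nucleus = nucleus_filter(softmax(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
greedy_id = max(range(len(toy_logits)), key=lambda i: toy_logits[i])  # greedy decoding
nucleus = nucleus_filter(softmax(toy_logits), top_p=0.8)  # tokens eligible for sampling
```

Greedy decoding always picks the single most likely token, while nucleus sampling draws from the filtered set, so a lower `top_p` or `temperature` moves the behavior closer to greedy.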

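At least 18GB of GPU memory is required to run the full model (OPT-6.7B is the main memory consumer), and it has only been tested on A6000, V100, and 3090 GPUs. Note that the loading log further down in this notebook reports `facebook/opt-125m` as the language model, which needs far less memory.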
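As a rough sanity check on the memory figure, the weights alone can be estimated from the parameter count. This back-of-the-envelope sketch assumes 16-bit weights (2 bytes per parameter); the source does not state the precision used, and the helper name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


opt_6_7b_16bit = weight_memory_gb(6.7e9)      # assumed 16-bit weights
opt_6_7b_32bit = weight_memory_gb(6.7e9, 4)   # 32-bit weights, for comparison
print(f'16-bit: {opt_6_7b_16bit:.1f} GB, 32-bit: {opt_6_7b_32bit:.1f} GB')
```

Under the 16-bit assumption the language model weights come to about 13.4 GB; the remainder of the budget would then go to the visual encoder, activations, and any embeddings held on the GPU.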

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1


In [3]:
!pip install matplotlib

import copy

import matplotlib.pyplot as plt
import numpy as np
import torch
from PIL import Image
from transformers import logging

# Show only error-level messages from transformers.
logging.set_verbosity_error()

# Project-local modules.
import models
import utils

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Define some helper methods for displaying outputs:
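The helper cell itself is not included here. As a stand-in, the following is a minimal sketch of a display helper, assuming the model outputs form a list whose items are either generated text (strings) or lists of retrieved images. That output structure and both function names are assumptions, not the notebook's original helpers.

```python
def describe_outputs(model_outputs):
    """Return one printable line per output item.

    Assumes items are either strings of generated text or lists/tuples of
    retrieved images (an assumption about the output structure).
    """
    lines = []
    for item in model_outputs:
        if isinstance(item, str):
            lines.append(item)
        elif isinstance(item, (list, tuple)):
            # In a notebook, this is where each image could be plotted instead.
            lines.append(f'[{len(item)} retrieved image(s)]')
        else:
            lines.append(f'[unrecognized output of type {type(item).__name__}]')
    return lines


def display_outputs(model_outputs):
    """Print the summary produced by describe_outputs, one item per line."""
    for line in describe_outputs(model_outputs):
        print(line)
```

In a notebook, the image branch could call `plt.imshow` on each retrieved image rather than printing a placeholder.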

### Load Model and Embedding Matrix

Only about 2.6M images from CC3M still have valid URLs, so outputs may differ slightly from those in the paper. This limited set somewhat restricts the model's ability to produce good outputs for certain prompts, which could be alleviated by collecting more images (e.g., from [LAION](https://laion.ai/blog/laion-400-open-dataset/)).
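To make the role of the embedding matrix concrete, the sketch below ranks candidate image embeddings against a query embedding and returns the best matches. It assumes cosine similarity as the scoring function and uses toy 2-D vectors; the function names and values are hypothetical and this is not the repository's retrieval code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, embedding_matrix, k=3):
    """Return the indices of the k rows most similar to the query, best first."""
    scores = [cosine_similarity(query, row) for row in embedding_matrix]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy embeddings standing in for precomputed image embeddings (hypothetical values).
toy_matrix = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
top = retrieve_top_k([0.9, 0.1], toy_matrix, k=2)
```

Because retrieval can only return images that are present in the matrix, a smaller candidate set means the nearest match for some prompts may still be a poor one.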

In [4]:
# Load model used in the paper.
model_dir = '/content/drive/MyDrive/fromage/fromage/'
model = models.load_fromage(model_dir)

Using HuggingFace AutoFeatureExtractor for openai/clip-vit-base-patch32.
Using facebook/opt-125m for the language model.
Using openai/clip-vit-base-patch32 for the visual model with 1 visual tokens.
Freezing the LM.
Initializing embedding for the retrieval token [RET] (id = 50266).
Restoring pretrained weights for the visual model.
Freezing the VM.


In [5]:
%load_ext autoreload
%autoreload 2

### Multi-Modal Concept Composition Examples

FROMAGe can seamlessly composite image and text data to retrieve images with the desired style or content. Note that the object ("cat") is never explicitly mentioned in text. This reproduces one of the examples in Fig. 3 of our paper.

In [11]:
# Load an image of a cat.
inp_image = utils.get_image_from_url('https://www.alleycat.org/wp-content/uploads/2019/03/FELV-cat.jpg')
torch.manual_seed(0)
torch.cuda.manual_seed(0)
# Get FROMAGe to retrieve images of cats in other styles.
for inp_text in ['watercolor drawing [RET]']:
    # Batched input is passed as [[image_1, image_2], [text_1, text_2]]:
    # one list of images and one list of texts, aligned by example.
    # This two-example batch is a dummy input that only checks whether batching works.
    prompt = [[np.array(inp_image), np.array(inp_image)], [inp_text, inp_text + ' good to']]
    print('Prompt:')
    print('=' * 30)
    model_outputs = model.generate_for_images_and_texts(prompt,
                                                        max_img_per_ret=3,num_words=30)

    print(model_outputs)

Prompt:
get_pixel_values_for_model
torch.Size([2, 3, 224, 224])  pixel
torch.Size([2, 1, 768])  visemb
torch.Size([1, 5])
torch.Size([1, 4, 768])  text_emb
torch.Size([1, 7])
torch.Size([1, 6, 768])  text_emb
----end of loop-----
--------
torch.Size([1, 1, 768])
torch.Size([1, 4, 768])
CUR-EX  torch.Size([1, 5, 768])
--------
torch.Size([1, 1, 768])
torch.Size([1, 6, 768])
CUR-EX  torch.Size([1, 7, 768])
pad  torch.Size([1, 1, 768])
before padding  torch.Size([1, 5, 768])
attention final  torch.Size([2, 7])
embedding final  torch.Size([2, 7, 768])
before generate
expanded_attn_mask  torch.Size([2, 1, 7, 7])  combined_attention_mask  torch.Size([2, 1, 7, 7])
expanded_attn_mask  torch.Size([2, 1, 8, 7])  combined_attention_mask  torch.Size([2, 1, 8, 8])


RuntimeError: ignored
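
The shape trace above shows two examples of lengths 5 and 7 being padded to a common length of 7 with a `[2, 7]` attention mask, and its last line pairs a `[2, 1, 8, 7]` expanded mask with a `[2, 1, 8, 8]` combined mask. One possible reading, which the source does not confirm, is that the sequence grew to 8 positions after the first generated token while the padding mask still covered only 7. The sketch below shows the bookkeeping involved using plain lists, with integers standing in for embedding vectors. The function names and the choice of left padding are assumptions.

```python
def pad_batch(sequences, pad_value=0):
    """Left-pad sequences to a common length and build a 0/1 attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append([pad_value] * n_pad + list(seq))
        mask.append([0] * n_pad + [1] * len(seq))
    return padded, mask


def extend_mask(mask):
    """Append a 1 to every row so the mask also covers a newly generated position."""
    return [row + [1] for row in mask]


# Two examples of lengths 5 and 7, matching the lengths in the trace.
batch = [list(range(1, 6)), list(range(1, 8))]
padded, mask = pad_batch(batch)
mask_after_one_step = extend_mask(mask)  # sequence length 8 needs a mask of width 8
```

If the mask is not extended alongside the inputs at each decoding step, the two mask shapes stop matching, which would be consistent with the 7-versus-8 mismatch in the trace.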