# Guide Generation Using FLAN

Load a kitchen video from Ego4D and extract a short clip.

In [None]:
from pytorchvideo.data.video import VideoPathHandler

video_path_handler = VideoPathHandler()
video = video_path_handler.video_from_path(
    "../../ego4d/v2/full_scale/7f9f75fd-a660-4635-8890-239c6ad82023.mp4"
)
# Extract the segment between second 27 and second 35.
clip = video.get_clip(27, 35)

Load `Salesforce/blip2-flan-t5-xxl` and preprocess the sampled frames.

In [None]:
import torch
from transformers import Blip2Processor

from video_blip2.model import VideoBlip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = VideoBlip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16
).to(device)
print(model.generation_config)

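# Keep every 30th frame of the clip, which is laid out as (C, T, H, W).
sampled_frames = clip["video"][:, ::30, ...]
pixel_values = (
    processor.image_processor(
        # Pass the frames as a batch of images: (T, C, H, W).
        sampled_frames.permute(1, 0, 2, 3),
        return_tensors="pt",
    )
    .to(device, torch.float16)
    # Back to (C, T, H, W), then add a batch dimension: (1, C, T, H, W).
    .pixel_values.permute(1, 0, 2, 3)
    .unsqueeze(0)
)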
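The slicing and permutes above are easier to follow with the shapes written out. The sketch below tracks them with plain tuples instead of tensors. The source resolution (480x640), the 30 fps frame rate, and the 224x224 output size of the image processor are assumptions made for illustration, and `permute` and `strided_length` are hypothetical helpers, not part of any library.

```python
def permute(shape, order):
    """Reorder the dimensions of a shape tuple, like torch.Tensor.permute."""
    return tuple(shape[i] for i in order)


def strided_length(length, stride):
    """Number of elements kept by the slice 0:length:stride."""
    return len(range(0, length, stride))


# Assumed: an 8-second clip at 30 fps, 480x640 RGB frames, laid out (C, T, H, W).
clip_shape = (3, 8 * 30, 480, 640)

c, t, h, w = clip_shape
sampled = (c, strided_length(t, 30), h, w)      # every 30th frame
frames_first = permute(sampled, (1, 0, 2, 3))   # (T, C, H, W): a batch of images
processed = frames_first[:2] + (224, 224)       # assumed resize by the image processor
channels_first = permute(processed, (1, 0, 2, 3))  # back to (C, T, H, W)
batched = (1,) + channels_first                 # add the batch dimension

print(sampled)
print(batched)
```

Under these assumptions a stride of 30 keeps roughly one frame per second of video, so the model sees 8 frames for the 8-second clip.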

Perform a conversational-style reasoning task.

Note that `flan-t5-xxl`, even with its 11B parameters, is not large enough to perform reliable zero-shot inference on unseen recipes, so the quality of the answers is limited.

In [None]:
# flake8: noqa
from transformers import GenerationConfig

prompts = [
    ("What is the camera wearer currently doing?", pixel_values),
    (
        """Answer questions based on context below:

The camera wearer is trying to make a mug cake following the recipe below:

Recipe C: Mug Cake
Ingredients
2 Tablespoons all-purpose flour
1.5 Tablespoons granulated sugar
1/4 teaspoon baking powder
Pinch salt
2 teaspoons canola or vegetable oil
2 Tablespoons water
1/4 teaspoon vanilla extract
Container of chocolate frosting (premade)

Tools and Utensils
measuring spoons small mixing bowl whisk
paper cupcake liner 12-ounce coffee mug plate
microwave
zip-top bag, snack or sandwich size scissors
spoon
toothpick

Steps
1. Place the paper cupcake liner inside the mug. Set aside.
2. Measure and add the flour, sugar, baking powder, and salt to the mixing bowl.
3. Whisk to combine.
4. Measure and add the oil, water, and vanilla to the bowl.
5. Whisk batter until no lumps remain.
6. Pour batter into prepared mug.
7. Microwave the mug and batter on high power for 60 seconds.
8. Check if the cake is done by inserting a toothpick into the center of the cake and then
removing. If wet batter clings to the toothpick, microwave for an additional 5 seconds. If the
toothpick comes out clean, continue.
9. Invert the mug to release the cake onto a plate. Allow to cool until it is no longer hot to the
touch, then carefully remove paper liner.
10. While the cake is cooling, prepare to pipe the frosting. Scoop 4 spoonfuls of chocolate frosting
into a zip-top bag and seal, removing as much air as possible.
11. Use scissors to cut one corner from the bag to create a small opening 1⁄4-inch in diameter.
12. Squeeze the frosting through the opening to apply small dollops of frosting to the plate in a
circle around the base of the cake.

Here's the description of the current situation:
Recipe steps the camera wearer has completed: none.
The camera wearer is supposed to work on step 1 - Place the paper cupcake liner inside the mug. Set aside.

Which step in the recipe is the camera wearer actually performing?""",
        None,
    ),
    ("Has the camera wearer followed the recipe correctly so far?", None),
    (
        "What instruction would you give to the camera wearer to correct their mistakes? If no instruction is necessary, answer with <None>.",
        None,
    ),
]
generation_config = GenerationConfig(max_new_tokens=128, num_beams=4)

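# Running transcript of all questions and answers so far.
context = ""
for text, frames in prompts:
    print(text + "\n")
    context += text + "\n"
    # Condition on the whole conversation, not just the latest question,
    # so that follow-up questions can refer to earlier turns.
    inputs = processor(text=context, return_tensors="pt").to(device)
    with torch.no_grad():
        if frames is None:
            # Text-only turn: query the FLAN-T5 language model directly.
            generated_ids = model.language_model.generate(
                **inputs, generation_config=generation_config
            )
        else:
            # Turn with video: let the full model attend to the frames.
            inputs["pixel_values"] = frames
            generated_ids = model.generate(
                **inputs, generation_config=generation_config
            )
    answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
    print(answer + "\n")
    context += answer + "\n"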
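The loop above keeps a running `context` string of questions and answers. A minimal, model-free sketch of that pattern follows, assuming each turn should see the whole transcript so far. `run_conversation` and `count_lines_seen` are hypothetical names introduced here; `count_lines_seen` only stands in for the model's `generate` call so the accumulation logic can be checked without loading any weights.

```python
from typing import Callable, List, Tuple


def run_conversation(
    questions: List[str], answer_fn: Callable[[str], str]
) -> Tuple[str, List[str]]:
    """Ask each question in turn, feeding the full transcript to answer_fn."""
    context = ""
    answers = []
    for question in questions:
        context += question + "\n"
        # The answer function sees every earlier question and answer.
        answer = answer_fn(context).strip()
        answers.append(answer)
        context += answer + "\n"
    return context, answers


def count_lines_seen(prompt: str) -> str:
    """Hypothetical stand-in for a model: reports how many lines it was given."""
    return f"(saw {len(prompt.splitlines())} lines)"


transcript, answers = run_conversation(["Q1", "Q2"], count_lines_seen)
print(transcript)
```

With this stand-in, the second answer reports three lines (the first question, the first answer, and the second question), which shows that earlier turns reach later ones.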