<a href="https://colab.research.google.com/github/donbr/visionary_storytelling/blob/main/notebooks/ms_phi3_vision_flashattention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FLASH ATTENTION... and the Phi3 Vision Language Model

- See the [model card](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) for additional detail.
- By default, the current Phi-3 Vision model loads with Flash Attention, which runs only on recent GPU architectures:
  - on Google Colab, it runs on A100 or L4
  - it also runs on T4 if you pass `_attn_implementation="eager"` when loading the model
  - refer to the [model card README](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct#running-on-windows-or-without-flash-attention) for additional detail on this workaround
- Currently bleeding edge: the Phi-3 Vision model card directs developers to install transformers from the latest source on GitHub.
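
The GPU notes above can be turned into a small check. This is a minimal sketch, not part of the model card: `pick_attn_implementation` is a hypothetical helper, and the 8.0 threshold assumes FlashAttention-2 needs an Ampere-class GPU or newer.

```python
def pick_attn_implementation(major: int, minor: int = 0) -> str:
    """Map a CUDA compute capability to an attention backend name (hypothetical helper)."""
    # Assumption: FlashAttention-2 kernels need compute capability 8.0 (Ampere) or newer.
    if (major, minor) >= (8, 0):
        return "flash_attention_2"
    return "eager"


# Compute capabilities of the Colab GPUs mentioned above: T4 = 7.5, L4 = 8.9, A100 = 8.0.
for gpu, capability in {"T4": (7, 5), "L4": (8, 9), "A100": (8, 0)}.items():
    print(f"{gpu}: {pick_attn_implementation(*capability)}")
```

Inside the notebook, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the active GPU, so `pick_attn_implementation(*torch.cuda.get_device_capability())` could supply the value for `_attn_implementation`.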

## Install Python Dependencies

In [None]:
!pip install git+https://github.com/huggingface/transformers

In [None]:
!pip install -qU flash_attn accelerate

## Import necessary libraries

In [None]:
from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor


## Hugging Face model configuration
- Load the Phi-3 Vision model and processor from Hugging Face

In [None]:
# Hugging Face model configuration

model_id = "microsoft/Phi-3-vision-128k-instruct"

# standard model config
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto")

# Workaround for GPUs without flash_attention_2 support (e.g. T4): load the model with eager attention instead.
# See https://huggingface.co/microsoft/Phi-3-vision-128k-instruct#running-on-windows-or-without-flash-attention
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto", _attn_implementation="eager")

# processor configuration
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

## Visualize the Model Architecture

- Print the model architecture to understand its structure

In [None]:
# Visualize the Model Architecture

print(model)

## Setup the input for the Vision model
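
The prompt in this section marks the image position with an `<|image_1|>` placeholder ahead of the question. As a rough sketch of how such a message could be assembled, the hypothetical helper below builds the same dictionary; numbering placeholders from 1 for several images is an assumption extrapolated from that single-image pattern, not something this notebook verifies.

```python
def build_user_message(question: str, num_images: int = 1) -> dict:
    """Build one chat message with numbered image placeholders (hypothetical helper)."""
    # Assumption: placeholders are numbered from 1 and each sits on its own line,
    # following the "<|image_1|>\n..." pattern used in this notebook.
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return {"role": "user", "content": placeholders + question}


messages = [build_user_message("What is shown in this image?")]
print(messages)
```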

In [None]:
# Define the messages to be processed by the model

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
]

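# Load the image from a URL; fail early on HTTP errors or a stalled connection
url = "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/15b17bf0-fb1a-4fb2-b952-beee07706068/width=832/00088-3178799381.jpeg"
http_response = requests.get(url, stream=True, timeout=30)
http_response.raise_for_status()
image = Image.open(http_response.raw)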

# Process the messages and image to create input tensors
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")


## Setup and Run Vision model inference
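
The generation arguments in this section set `do_sample` to `False`, which selects greedy decoding: at every step the single most probable token is chosen, so the temperature setting does not change the result. The toy sketch below illustrates the difference between greedy choice and sampling. The `pick_next_token` helper and the probabilities are made up for illustration and are not part of the transformers API.

```python
import random


def pick_next_token(scores: dict, do_sample: bool = False, seed: int = 0) -> str:
    """Choose the next token from a toy {token: probability} table."""
    if not do_sample:
        # Greedy decoding: always take the most probable token.
        return max(scores, key=scores.get)
    # Sampling: draw a token in proportion to its probability.
    rng = random.Random(seed)
    tokens = list(scores)
    return rng.choices(tokens, weights=[scores[t] for t in tokens], k=1)[0]


toy_scores = {"cat": 0.6, "dog": 0.3, "car": 0.1}  # made-up probabilities
print(pick_next_token(toy_scores))                  # greedy choice
print(pick_next_token(toy_scores, do_sample=True))  # sampled choice
```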


In [None]:
# Define generation arguments for the model's response
# do_sample=False selects greedy decoding, so temperature has no effect on the output
generation_args = {
    "max_new_tokens": 500,
    "temperature": 0.0,
    "do_sample": False,
}

# Generate the response from the Vision model
generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# Remove input tokens from the generated response to get the actual output tokens
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

# Decode the response to get the final text output
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
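
The slicing step above works because `generate` returns the prompt tokens followed by the newly generated ones, so dropping the first `inputs['input_ids'].shape[1]` positions leaves only the answer. A minimal sketch of the same idea on plain Python lists, using made-up token ids and a hypothetical `strip_prompt` helper:

```python
def strip_prompt(sequences: list, prompt_len: int) -> list:
    """Drop the first `prompt_len` token ids from every sequence in a batch."""
    return [seq[prompt_len:] for seq in sequences]


# Made-up token ids: the first three stand for the prompt, the rest for the answer.
batch = [[101, 7, 8, 42, 43, 2]]
print(strip_prompt(batch, prompt_len=3))
```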


## Discover the Model's Insights!

In [None]:
print(response)

In [None]:
display(image)