# Phi-4 MultiModal - Tested! 

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/path/to/your/notebook.ipynb)

> Note: Phi-4 multimodal is released with a specific, pinned dependency list. Install the packages in the order shown.

```bash
torch==2.6.0
flash_attn==2.7.4.post1
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
```

Using a virtual environment is also recommended; either conda or venv will work.
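
One quick way to confirm that the active environment matches the pinned list is to compare the installed versions using `importlib.metadata` from the standard library. This is only a sketch and not part of the original notebook: `PINNED` and `check_pins` are names introduced here for illustration, and local build tags such as `+cu124` are ignored in the comparison.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the dependency list above.
PINNED = {
    "torch": "2.6.0",
    "flash_attn": "2.7.4.post1",
    "transformers": "4.48.2",
    "accelerate": "1.3.0",
    "soundfile": "0.13.1",
    "pillow": "11.1.0",
    "scipy": "1.15.2",
    "torchvision": "0.21.0",
    "backoff": "2.2.1",
    "peft": "0.13.2",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Strip a local build tag (for example "+cu124") before comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is installed at the expected version.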

We will cover:

- Text generation 
- Image description/captioning
- Audio transcription, text translation 
- Function Calling 
- OCR 



In [None]:
# !pip install torch==2.6.0 flash_attn==2.7.4.post1 transformers==4.48.2 accelerate==1.3.0 soundfile==0.13.1 pillow==11.1.0 scipy==1.15.2 torchvision==0.21.0 backoff==2.2.1 peft==0.13.2

In [1]:
import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen

# Load the model and processor
model_path = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True, 
    attn_implementation='flash_attention_2',
).cuda()
generation_config = GenerationConfig.from_pretrained(model_path)



In [9]:
def phi4_multimodal_interface(
    processor, 
    model, 
    system_prompt, 
    content_list, 
    generation_config,
    max_new_tokens=8000
):
    """
    Creates the appropriate prompt and processes inputs for Phi-4-multimodal-instruct model.
    
    Parameters:
    -----------
    processor : transformers.AutoProcessor
        The processor for the Phi-4-multimodal model
    model : transformers.AutoModelForCausalLM
        The loaded Phi-4-multimodal model
    system_prompt : str
        The system prompt to use (can include tool definitions)
    content_list : list
        List of content items, where each item is a dict with the following possible keys:
        - 'type': str, one of 'text', 'image', 'audio', 'tool_response'
        - 'content': depends on type:
            - for 'text': str
            - for 'image': PIL.Image object
            - for 'audio': tuple of (audio_array, sample_rate)
            - for 'tool_response': dict containing tool response
        - 'role': str, one of 'user', 'assistant' (default: 'user')
    generation_config : transformers.GenerationConfig
        The generation configuration
    max_new_tokens : int
        Maximum number of new tokens to generate
        
    Returns:
    --------
    tuple:
        - The complete formatted prompt
        - The model's response
        - A dictionary with timing information for each step
    """
    import time
    # Define prompt structure
    system_token = '<|system|>'
    user_token = '<|user|>'
    assistant_token = '<|assistant|>'
    end_token = '<|end|>'
    
    # Initialize the complete prompt with system message
    complete_prompt = f"{system_token}{system_prompt}{end_token}"
    
    # Initialize lists to hold images and audios
    images = []
    audios = []
    
    # Process each content item and build the prompt
    current_role = None
    role_content = ""
    
    for i, item in enumerate(content_list):
        item_type = item['type']
        item_role = item.get('role', 'user')
        
        # If we're switching roles, add the previous role's content to the prompt
        if current_role is not None and current_role != item_role:
            role_token = user_token if current_role == 'user' else assistant_token
            complete_prompt += f"{role_token}{role_content}{end_token}"
            role_content = ""
        
        current_role = item_role
        
        # Process different content types
        if item_type == 'text':
            role_content += item['content']
        
        elif item_type == 'image':
            image_index = len(images) + 1
            role_content += f"<|image_{image_index}|>"
            images.append(item['content'])
        
        elif item_type == 'audio':
            audio_index = len(audios) + 1
            role_content += f"<|audio_{audio_index}|>"
            audios.append(item['content'])
        
        elif item_type == 'tool_response':
            # Format tool response appropriately
            tool_response = item['content']
            role_content += f"<|tool_response|>{tool_response}<|/tool_response|>"
    
    # Add the final role's content
    if current_role is not None:
        role_token = user_token if current_role == 'user' else assistant_token
        complete_prompt += f"{role_token}{role_content}{end_token}"
    
    # Add assistant token to prompt an answer
    if current_role != 'assistant':
        complete_prompt += f"{assistant_token}"
    
    # Track timing information
    timing_info = {}
    
    # Process inputs with the model
    timing_info['start_processor'] = time.time()
    inputs = processor(
        text=complete_prompt,
        images=images if images else None,
        audios=audios if audios else None,
        return_tensors='pt'
    ).to(model.device)  # follow the model's device instead of hard-coding 'cuda:0'
    timing_info['end_processor'] = time.time()
    timing_info['processor_time'] = timing_info['end_processor'] - timing_info['start_processor']
    
    # Generate response
    timing_info['start_generation'] = time.time()
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
    )
    timing_info['end_generation'] = time.time()
    timing_info['generation_time'] = timing_info['end_generation'] - timing_info['start_generation']
    
    # Extract only the newly generated tokens
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    
    # Decode the response
    timing_info['start_decode'] = time.time()
    response = processor.batch_decode(
        generate_ids, 
        skip_special_tokens=True, 
        clean_up_tokenization_spaces=False
    )[0]
    timing_info['end_decode'] = time.time()
    timing_info['decode_time'] = timing_info['end_decode'] - timing_info['start_decode']
    
    # Calculate total time
    timing_info['total_time'] = timing_info['processor_time'] + timing_info['generation_time'] + timing_info['decode_time']
    
    return complete_prompt, response, timing_info
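
To see the prompt layout without loading the model, the role-grouping logic of `phi4_multimodal_interface` can be reproduced on its own. The sketch below is an illustration only: `preview_prompt` is a hypothetical helper introduced here, it ignores `tool_response` items, and it never touches the image or audio payloads.

```python
def preview_prompt(system_prompt, content_list):
    """Build only the prompt string, mirroring the role grouping used above."""
    prompt = f"<|system|>{system_prompt}<|end|>"
    counts = {"image": 0, "audio": 0}  # placeholders are numbered from 1
    current_role, turn = None, ""

    for item in content_list:
        role = item.get("role", "user")
        # Close the previous turn whenever the speaker changes.
        if current_role is not None and role != current_role:
            token = "<|user|>" if current_role == "user" else "<|assistant|>"
            prompt += f"{token}{turn}<|end|>"
            turn = ""
        current_role = role

        kind = item["type"]
        if kind == "text":
            turn += item["content"]
        elif kind in counts:
            counts[kind] += 1
            turn += f"<|{kind}_{counts[kind]}|>"

    # Close the final turn, then cue the assistant to answer.
    if current_role is not None:
        token = "<|user|>" if current_role == "user" else "<|assistant|>"
        prompt += f"{token}{turn}<|end|>"
    if current_role != "assistant":
        prompt += "<|assistant|>"
    return prompt


example = [
    {"type": "image", "content": None, "role": "user"},
    {"type": "text", "content": "What is shown in this image?", "role": "user"},
]
print(preview_prompt("You are a helpful assistant.", example))
```

Consecutive items with the same role are concatenated into a single turn with no separator, so an image placeholder and the question that follows it end up in one `<|user|>` block.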

In [3]:
from IPython.display import display, HTML, Markdown

def run_and_print_results(system_prompt, content_list, name="Example"):
    # Set up CSS for better text wrapping
    display(HTML("""
    <style>
    .jp-OutputArea-output {
        overflow-x: auto;
        white-space: pre-wrap;
        word-wrap: break-word;
    }
    .output_text {
        white-space: pre-wrap !important;
    }
    </style>
    """))
    
    display(Markdown(f"## {name}"))

    display(Markdown("### Query:")) # content of last item in content_list
    display(Markdown(f"```\n{content_list[-1]['content']}\n```")) 
    
    # Run the model
    prompt, response, timing = phi4_multimodal_interface(
        processor, model, system_prompt, content_list, generation_config
    )
    
    display(Markdown("### Response:"))
    display(Markdown(f"```\n{response}\n```"))
    
    display(Markdown("### Timing Information:"))
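    # Build the list without leading indentation so Markdown renders bullets
    # rather than an indented code block.
    timing_md = "\n".join([
        f"- **Processor time:** {timing['processor_time']:.4f} seconds",
        f"- **Generation time:** {timing['generation_time']:.4f} seconds",
        f"- **Decode time:** {timing['decode_time']:.4f} seconds",
        f"- **Total time:** {timing['total_time']:.4f} seconds",
    ])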
    display(Markdown(timing_md))
    
    return prompt, response, timing

In [4]:
# Example 1: Text-only conversation
system_prompt = "You are a helpful assistant."
content_list = [
    {
        'type': 'text',
        'content': 'Who is Victor Dibia, PhD and what is he known for?',
        'role': 'user'
    }
]

run_and_print_results(system_prompt, content_list, "Example 1.1: Text-only conversation")


## Example 1.1: Text-only conversation

### Query:

```
Who is Victor Dibia, PhD and what is he known for?
```

### Response:

```
Victor Dibia, PhD, is a Nigerian-born American scientist and entrepreneur, known for his work in the field of nanotechnology. He is the founder and CEO of Nanosys, a company that specializes in the development of nanotechnology-based solutions for various industries, including electronics, energy, and healthcare.

Dibia earned his PhD in materials science and engineering from the University of California, Berkeley, and has held various positions in academia and industry throughout his career. He is also a member of the National Academy of Inventors and has been recognized for his contributions to the field of nanotechnology.

Nanosys, under Dibia's leadership, has developed a range of nanotechnology-based products, including organic light-emitting diodes (OLEDs) and organic photovoltaics (OPVs). These products have the potential to revolutionize the electronics and energy industries by providing more efficient and sustainable alternatives to traditional technologies.

In addition to his work at Nanosys, Dibia has also been involved in various philanthropic efforts, including the establishment of the Victor Dibia Foundation, which aims to support education and research in Nigeria. He has also been a mentor and advisor to various startups and entrepreneurs in the technology sector.
```

### Timing Information:


    - **Processor time:** 0.0090 seconds
    - **Generation time:** 14.6262 seconds
    - **Decode time:** 0.0003 seconds
    - **Total time:** 14.6355 seconds
    


In [None]:
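# Example 1.2: Text-only conversation with context
# The answer in Example 1.1 was fabricated, so this example supplies reference
# text (saved from the author's website) as the first user message.
with open('downloads/vd.txt', 'r', encoding='utf-8') as file:
    website_content = file.read()
system_prompt = "You are a helpful assistant."
content_list = [
    {
        'type': 'text',
        'content': website_content,
        'role': 'user'
    },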
    {
        'type': 'text',
        'content': 'Who is Victor Dibia, PhD and what is he known for?',
        'role': 'user'
    }
]

run_and_print_results(system_prompt, content_list, "Example 1.2: Text-only conversation (with Context)")


In [None]:
# Example 2.1: Image description
# Download an image
image_url = "https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb71d0075-5e75-41a3-88e3-90b2292de534_1901x1163.png"

image = Image.open(requests.get(image_url, stream=True).raw)
display(image)

system_prompt = "You are a helpful assistant that can analyze images."
content_list = [
    {
        'type': 'image',
        'content': image,
        'role': 'user'
    },
    {
        'type': 'text',
        'content': 'What is shown in this image?',
        'role': 'user'
    }
]

run_and_print_results(system_prompt, content_list, "Example 2.1: Image description")


In [None]:
# Example 2.2: Image description (second image)
# Download an image
image_url = 'https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa0f402f-7438-4620-9708-2fe9e220bce6_1901x1163.png' 

image = Image.open(requests.get(image_url, stream=True).raw)
display(image)

system_prompt = "You are a helpful assistant that can analyze images."
content_list = [
    {
        'type': 'image',
        'content': image,
        'role': 'user'
    },
    {
        'type': 'text',
        'content': 'What is shown in this image?',
        'role': 'user'
    }
]

run_and_print_results(system_prompt, content_list, "Example 2.2: Image description")


In [None]:

# Example 2.3: OCR 
phone_image_path = os.path.join('downloads', 'phone_specs.png') 
# Load image
phone_image = Image.open(phone_image_path)
display(phone_image)


system_prompt = "You are a helpful assistant that performs OCR and carefully extracts all text from an image."
content_list = [
    {
        'type': 'image',
        'content': phone_image,
        'role': 'user'
    },
    {
        'type': 'text',
        'content': 'Extract ALL the text (OCR) from this image and render in a neat markdown format',
        'role': 'user'
    }
]

run_and_print_results(system_prompt, content_list, "Example 2.3: OCR")


In [None]:

# Example 3: Audio transcription and translation
# Download an audio file
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))

system_prompt = "You are a helpful assistant that can process audio."
content_list = [
    {
        'type': 'audio',
        'content': (audio, samplerate),
        'role': 'user'
    },
    {
        'type': 'text',
        'content': 'Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation.',
        'role': 'user'
    }
]

run_and_print_results(system_prompt, content_list, "Example 3: Audio transcription and translation")
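
Because the query asks for `<sep>` between the transcript and the translation, the response can be split into its two parts afterwards. The helper below is a minimal sketch: `split_on_sep` is a name introduced here, and it assumes the model emits the literal string `<sep>` as requested, which is not guaranteed.

```python
def split_on_sep(response, sep="<sep>"):
    """Split a response into (transcript, translation).

    The translation is None when the separator does not appear in the response.
    """
    head, found, tail = response.partition(sep)
    return head.strip(), (tail.strip() if found else None)


transcript, translation = split_on_sep("Hello there. <sep> Bonjour.")
print(transcript)
print(translation)
```

Checking for `None` before using the second value avoids silently treating the whole response as a transcript when the model ignores the separator instruction.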


In [None]:
# Example 4: Combined image and audio (multimodal)
system_prompt = "You are a helpful assistant that can process both images and audio."
content_list = [
    {
        'type': 'image',
        'content': image,
        'role': 'user'
    },
    {
        'type': 'audio',
        'content': (audio, samplerate),
        'role': 'user'
    },
    {
        'type': 'text',
        'content': 'Describe the image and summarize what you heard in the audio. Your entire response should be first in English and then in German. Use <sep> as a separator between the two.',
        'role': 'user'
    }
]

run_and_print_results(system_prompt, content_list, "Example 4: Combined image and audio (multimodal)")


In [None]:
# Example 5: Tool-enabled function calling
tools_json = '''[{
    "name": "get_weather_updates", 
    "description": "Fetches weather updates for a given city using the RapidAPI Weather API.", 
    "parameters": {
        "city": {
            "description": "The name of the city for which to retrieve weather information.", 
            "type": "str", 
            "default": "London"
        }
    }
}]'''

system_prompt = f"You are a helpful assistant with some tools.<|tool|>{tools_json}<|/tool|>"
content_list = [
    {
        'type': 'text',
        'content': 'What is the weather like in Paris today?',
        'role': 'user'
    }
]

run_and_print_results(system_prompt, content_list, "Example 5: Tool-enabled function calling")
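
The model only proposes a tool call; running the tool is left to the caller. The sketch below shows one way to parse and dispatch such a call. It rests on assumptions not confirmed by this notebook: that the decoded response contains a JSON array of objects with `name` and `arguments` keys, and that a local function stands in for the weather API. `extract_tool_calls`, `dispatch`, and the stub `get_weather_updates` are all hypothetical.

```python
import json
import re


def get_weather_updates(city="London"):
    """Hypothetical stand-in for the real weather API call."""
    return {"city": city, "forecast": "placeholder"}


# Map tool names from the system prompt to local callables.
TOOLS = {"get_weather_updates": get_weather_updates}


def extract_tool_calls(response):
    """Return the tool calls found in the first JSON array in the text, or []."""
    match = re.search(r"\[.*\]", response, flags=re.DOTALL)
    if not match:
        return []
    try:
        calls = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only well-formed entries that name a tool.
    return [c for c in calls if isinstance(c, dict) and "name" in c]


def dispatch(calls, tools):
    """Run each requested tool and collect results; unknown tools are reported."""
    results = []
    for call in calls:
        func = tools.get(call["name"])
        if func is None:
            results.append({"name": call["name"], "error": "unknown tool"})
            continue
        results.append({"name": call["name"], "result": func(**call.get("arguments", {}))})
    return results


sample = '[{"name": "get_weather_updates", "arguments": {"city": "Paris"}}]'
print(dispatch(extract_tool_calls(sample), TOOLS))
```

A result produced this way could then be passed back to the model as a `tool_response` item in `content_list`, which `phi4_multimodal_interface` accepts.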


## Some More Real-World Examples

The following examples use audio from an MKBHD video: https://www.youtube.com/watch?v=FmmUAhE6MxU&ab_channel=MarquesBrownlee

In [None]:
# Example 6.1: Combined Multimodal Understanding (Audio + Vision + Text)
import os
from PIL import Image
import soundfile as sf

# Load local files
phone_image_path = os.path.join('downloads', 'phone_specs.png')
review_audio_path = os.path.join('downloads', 'mkbhd_phone_review.mp3')
original_transcript_path = os.path.join('downloads', 'mkbhd_phone_review.txt')
# Use a context manager so the file handle is closed after reading
with open(original_transcript_path, 'r', encoding='utf-8') as file:
    original_transcript_text = file.read()

# Load image
phone_image = Image.open(phone_image_path)
display(phone_image)
# Load audio
audio_data, sample_rate = sf.read(review_audio_path)

# Set up task
system_prompt = "You are a helpful tech assistant who can analyze product reviews across multiple types of content."
content_list = [
    {
        'type': 'audio',
        'content': (audio_data, sample_rate),
        'role': 'user'
    },
    {
        'type': 'image',
        'content': phone_image,
        'role': 'user'
    },
    {
        'type': 'text',
        'content': "Compare what was mentioned in the audio review with the specifications shown in the image. Then answer these questions: 1) Does the reviewer mention all the key specs? 2) Are there any discrepancies between what the reviewer claims and the actual specs? 3) What feature does the reviewer seem most impressed by?",
        'role': 'user'
    }
]

run_and_print_results(system_prompt, content_list, "Example 6.1: Combined Multimodal Understanding (Audio + Vision + Text)") 


In [None]:
# Example 6.2: Speech Transcription
import os
import soundfile as sf

# Load local audio file
review_audio_path = os.path.join('downloads', 'mkbhd_phone_review.mp3')
audio_data, sample_rate = sf.read(review_audio_path)

# Set up transcription task
system_prompt = "You are a helpful assistant specializing in precise audio transcription."
content_list = [
    {
        'type': 'audio',
        'content': (audio_data, sample_rate),
        'role': 'user'
    },
    {
        'type': 'text',
        'content': "Please provide a verbatim transcript of this audio review.",
        'role': 'user'
    }
]

# run_and_print_results runs inference itself, so no separate call to
# phi4_multimodal_interface is needed here.
run_and_print_results(system_prompt, content_list, "Example 6.2: Speech Transcription") 
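
Example 6.1 loads a reference transcript into `original_transcript_text` but never uses it. One way to put it to work is to score the model's transcript against it with word error rate (WER). The function below is a plain-Python sketch, not part of the original notebook: `word_error_rate` is a name introduced here, and it tokenizes by lowercasing and splitting on whitespace, so punctuation differences count as errors.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Dynamic-programming edit distance over words, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

With the notebook's variables, the call would be `word_error_rate(original_transcript_text, response)`, where `response` is the second value returned by `run_and_print_results`.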


In [None]:
# Example 6.3: Speech Summarization
import os
import soundfile as sf

# Load local audio file - reusing the same file
review_audio_path = os.path.join('downloads', 'mkbhd_phone_review.mp3')
audio_data, sample_rate = sf.read(review_audio_path)

# Set up summarization task
system_prompt = "You are a helpful assistant specializing in concise summaries."
content_list = [
    {
        'type': 'audio',
        'content': (audio_data, sample_rate),
        'role': 'user'
    },
    {
        'type': 'text',
        'content': "Provide a 2-3 sentence summary of this review.",
        'role': 'user'
    }
]

# run_and_print_results runs inference itself, so no separate call to
# phi4_multimodal_interface is needed here.
run_and_print_results(system_prompt, content_list, "Example 6.3: Speech Summarization") 
 

In [None]:
# Example 6.4: Speech Translation  
import os
import soundfile as sf

# Load local audio file - reusing the same file
review_audio_path = os.path.join('downloads', 'mkbhd_phone_review.mp3')
audio_data, sample_rate = sf.read(review_audio_path)

# Set up translation task
system_prompt = "You are a helpful assistant specializing in audio translation."
content_list = [
    {
        'type': 'audio',
        'content': (audio_data, sample_rate),
        'role': 'user'
    },
    {
        'type': 'text',
        'content': "Translate the key points of this review into French.",
        'role': 'user'
    }
]

# run_and_print_results runs inference itself, so no separate call to
# phi4_multimodal_interface is needed here.
run_and_print_results(system_prompt, content_list, "Example 6.4: Speech Translation")

In [None]:
# Example 6.5: Speech Q&A
import os
import soundfile as sf

# Load local audio file - reusing the same file
review_audio_path = os.path.join('downloads', 'mkbhd_phone_review.mp3')
audio_data, sample_rate = sf.read(review_audio_path)

# Set up Q&A task
system_prompt = "You are a helpful assistant specializing in detailed analysis of product reviews."
content_list = [
    {
        'type': 'audio',
        'content': (audio_data, sample_rate),
        'role': 'user'
    },
    {
        'type': 'text',
        'content': "What features did the reviewer praise? What were the criticisms?",
        'role': 'user'
    }
]

# run_and_print_results runs inference itself, so no separate call to
# phi4_multimodal_interface is needed here.
run_and_print_results(system_prompt, content_list, "Example 6.5: Speech Q&A")

In [10]:
# Example 6.2: Speech Transcription
import os
import soundfile as sf

# Load local audio file
review_audio_path = os.path.join('downloads', 'ph14.mp3')
audio_data, sample_rate = sf.read(review_audio_path)

# Set up transcription task
system_prompt = "You are a helpful assistant specializing in precise audio transcription."
content_list = [
    {
        'type': 'audio',
        'content': (audio_data, sample_rate),
        'role': 'user'
    },
    {
        'type': 'text',
        'content': "Please provide a verbatim transcript of this audio review.",
        'role': 'user'
    }
]

# run_and_print_results runs inference itself, so no separate call to
# phi4_multimodal_interface is needed here.
run_and_print_results(system_prompt, content_list, "Example 6.2: Speech Transcription") 


## Example 6.2: Speech Transcription

### Query:

```
Please provide a verbatim transcript of this audio review.
```

### Response:

```
Hey folks, so Microsoft just released the 5.4 multimodal model earlier today, and I spent the last two hours just sort of putting it through its paces, and I think it's really, really good. So right off the bat, the first thing to know is that it's fully multimodal on the input side. So what that means is that the same model can take in text, images, audio tokens, reason about them, and sort of generate text as conditions and all of that. The second important thing is that it's quite small, so it's 5.4 billion parameters. The implication here is that it can be optimized around the world. So in the blog post here, it talks about how it delivers highly efficient low latency inference, all while optimizing for on-device execution and reduce computational overhead. So overall, the model can do a lot of things. So you can do text generation, image description, captioning, audio transcription, text transcription, function calling, OCR, and in this video, we're going to test out all of that. So let's head over to a Jupyter notebook and sort of take a look. Okay, so here I have a Jupyter notebook that I've put together to illustrate all of the core capabilities of this model. And I'll put a link to this in the description. So the first thing to know is that like 5.4 is released with a specific set of dependencies, so a specific torch, flash attention, all the stuff. So the first thing you want to do is to create a virtual environment and sort of install all these things in order. And then once we have a bunch of inputs, and I've written some code scaffolding just to make it just a little easier to run inference. I'm doing all of these on a local A6000 GPU, but it probably work really well on any consumer GPU that you have. So the first function I have is just a generic interface that can take a persistent prompt, a content list, and it will do all of the inference. 
So essentially construct the prompt correctly, figure out images, audio tokens, put them in the right order, pass them through the model, and sort of give us a response. And then we have some utilities to sort of print out results, not very important. So let's let's head over to the first task. And so here we just going to look at text-only conversation, and we're going to ask it a funny question, who is Victor DBA and what is he known for? So if we run that, can look at the response. So here it says like Victor DBA is a Nigerian-born American scientist, says a lot of things, makes up all kinds of stuff. None of this is correct. This is just a reminder that models at this size, just because how small they are, you probably should not be using them for factual queries or factual use cases. And the actual the right thing to do here would be to instead give it context. So in this case, what I've done is that I went to my website and I took in all of the text and I added that as context in the first message. Then I asked it who is Victor DBA, what is he known for? And we can look at the response. So here is like, Victor DBA is an expert in applied machine learning, he's a researcher at Microsoft, he's here, all of his papers, all of that. And all of this is all correct. I think it sort of looked at the important tidy parts of that text, and actually the text there is actually a message, a bunch of stuff, really unorganized, but the model is able to sort of look through all of that, come up with a sensible summary, which makes sense. And all of that took about 11 seconds on this particular machine. Okay, all good. We give it a win there. Next we head over to image description. And so here I have a couple of images. And so this is an image from a blog post I wrote a while ago, and we'll see what the image looks so far like. And what we just do is that we give it as the first message in the prompt, and then we tell it what is shown in this image. 
And this is what the image looks like. It's just like an illustration from an article I wrote about like AI fatigue. And this is actually not particularly like a naturally occurring image, but let's see how the model describes that. And so here it says the image depicts a man sitting on the floor with a large battery pack on his head, which kind of makes sense, connected to a power outlet, give her symbols and numbers floating around him and it reads, your search capacity is depleted, depleted. That makes sense. Pretty decent OCR, not very hard, but all fairly enough for a fairly small model, and this took about 4 seconds. Let's see. The next thing we head over to another type of image. And so here again, an abstract kind of image. And let's see how the model does. It says the image depicts a man running on a treadmill, while papers labeled new model updates, which is kind of correct, new model, updates, show, vome, all of that. It's kind of flying off the treadmill and onto the floor. Demands appears to be in a hurry or stressed, which kind of makes sense. It's exactly in the spirit of the illustration here. I gave it another pass, pretty good. The next thing would be just a little bit more challenging. And so here we take a snapshot from a website with just a bunch of text. And so this is a technical specification for a smartphone. And essentially we want the model to sort of extract all of this text. And the query here is extract all the text from this image and render it in a neat markdown file. And this is what comes out of the model, just a lot of text in a markdown file. So just a good instruction following here. It's doing markdown. The next thing would be just a little bit more challenging. And so here we take a snapshot from a website with just a bunch of text. And so this is a technical specification for a smartphone. And essentially we want the model to sort of extract all of this text. 
And the query here is extract all the text from this image and render it in a neat markdown file. And this is what comes out of the model, just a lot of text in a markdown file. So just a good instruction following here. It's doing markdown. The next thing would be just a little bit more challenging. And so here we take a snapshot from a website with just a bunch of text. And so this is a technical specification for a smartphone. And essentially we want the model to sort of extract all of this text.
[... the passage above repeats verbatim until the output is cut off mid-sentence: "And this is what comes out of the"]
```
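
The transcript above falls into a loop: the same handful of sentences recurs until generation stops. A minimal sketch of how such a loop could be flagged automatically is shown below. The helper name `repeated_sentences` and its `min_repeats` threshold are hypothetical and not part of the notebook; the sentence splitting is deliberately naive.

```python
import re
from collections import Counter


def repeated_sentences(text, min_repeats=3):
    """Return the sentences that occur at least `min_repeats` times in `text`."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter(sentences)
    return {s: n for s, n in counts.items() if n >= min_repeats}


# Three sentences taken from the transcript, repeated to imitate the loop.
sample = (
    "So just a good instruction following here. It's doing markdown. "
    "The next thing would be just a little bit more challenging. "
) * 4

loops = repeated_sentences(sample)
for sentence, count in loops.items():
    print(f"{count}x  {sentence}")
```

If a check like this fires, it may be worth rerunning with a lower `max_new_tokens`, or with the `repetition_penalty` or `no_repeat_ngram_size` generation parameters from `transformers`. Whether those settings help for this model on long audio is an assumption that was not tested here.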

### Timing Information:


    - **Processor time:** 1.4375 seconds
    - **Generation time:** 623.0434 seconds
    - **Decode time:** 0.0034 seconds
    - **Total time:** 624.4844 seconds
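
A breakdown like the one above can be collected by timing each stage separately. The following is a minimal sketch, not the notebook's own timing code: the `timed` helper and the `timings` dictionary are hypothetical names, and `time.sleep` stands in for the real processor, generation, and decode calls.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


# Placeholders: in the notebook these would wrap the processor call,
# model.generate(...), and the decode step respectively.
with timed("Processor time"):
    time.sleep(0.01)
with timed("Generation time"):
    time.sleep(0.02)
with timed("Decode time"):
    time.sleep(0.001)

total = sum(timings.values())
for stage, seconds in timings.items():
    print(f"- **{stage}:** {seconds:.4f} seconds")
print(f"- **Total time:** {total:.4f} seconds")
```

In the run reported above, generation accounts for almost all of the roughly 624 seconds, which fits the repeated output: the model appears to have kept producing tokens until it hit its length limit.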
    

('<|system|>You are a helpful assistant specializing in precise audio transcription.<|end|><|user|><|audio_1|>Please provide a verbatim transcript of this audio review.<|end|><|assistant|>',
 "Hey folks, so Microsoft just released the 5.4 multimodal model earlier today, and I spent the last two hours just sort of putting it through its paces, and I think it's really, really good. So right off the bat, the first thing to know is that it's fully multimodal on the input side. So what that means is that the same model can take in text, images, audio tokens, reason about them, and sort of generate text as conditions and all of that. The second important thing is that it's quite small, so it's 5.4 billion parameters. The implication here is that it can be optimized around the world. So in the blog post here, it talks about how it delivers highly efficient low latency inference, all while optimizing for on-device execution and reduce computational overhead. So overall, the model can do a lot 