<a href="https://colab.research.google.com/github/atul-ai/prompt-engineering-class/blob/main/MultiModalPrompting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multimodal Prompting and Use Case

In this notebook we read a few images (local files or web URLs) and generate a description of each one with a multimodal model. We then use those descriptions to generate a story.

The basic idea and code come from https://mer.vin/2024/09/groq-multi-modal/

In [1]:
!pip install groq

Collecting groq
  Downloading groq-0.11.0-py3-none-any.whl.metadata (13 kB)
Downloading groq-0.11.0-py3-none-any.whl (106 kB)
Installing collected packages: groq
Successfully installed groq-0.11.0


## Image to Description

We read the image as base64-encoded data and pass it to a multimodal model, which interprets the image and generates a description.

In [11]:
#image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
#image_url = "https://images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com/f/e8f976a1-e1dc-440a-9236-487140f0bb22/dg4tch2-eda868b1-ab03-439c-a1fa-dc0ce969d6fc.png/v1/fit/w_828,h_1174,q_70,strp/aftermath_of_the_something_soemthing_by_fr0z3n_f3nn3k_dg4tch2-414w-2x.jpg?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJ1cm46YXBwOjdlMGQxODg5ODIyNjQzNzNhNWYwZDQxNWVhMGQyNmUwIiwiaXNzIjoidXJuOmFwcDo3ZTBkMTg4OTgyMjY0MzczYTVmMGQ0MTVlYTBkMjZlMCIsIm9iaiI6W1t7ImhlaWdodCI6Ijw9MTI0MCIsInBhdGgiOiJcL2ZcL2U4Zjk3NmExLWUxZGMtNDQwYS05MjM2LTQ4NzE0MGYwYmIyMlwvZGc0dGNoMi1lZGE4NjhiMS1hYjAzLTQzOWMtYTFmYS1kYzBjZTk2OWQ2ZmMucG5nIiwid2lkdGgiOiI8PTg3NCJ9XV0sImF1ZCI6WyJ1cm46c2VydmljZTppbWFnZS5vcGVyYXRpb25zIl19.UC_M9c08YpPjYm0PbtCDTDyZHLuvLu7aoNmzfUyD6tA"
image_url = "image_path.jpeg"

import base64
import httpx
from pathlib import Path

def encode_image(image_source):
    """
    Encode an image from either a URL or local file path to base64.

    Args:
        image_source (str): Either a URL starting with 'http'/'https' or a local file path

    Returns:
        str: Base64 encoded image data
    """
    try:
        # Check if the source is a URL
        if image_source.lower().startswith(('http://', 'https://')):
            # Handle web image
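            response = httpx.get(image_source, follow_redirects=True)
            # Fail loudly on 4xx/5xx responses instead of encoding an error page
            response.raise_for_status()
            image_data = base64.b64encode(response.content).decode('utf-8')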
        else:
            # Handle local file
            path = Path(image_source)
            if not path.exists():
                raise FileNotFoundError(f"Image file not found: {image_source}")

            with open(path, 'rb') as image_file:
                image_data = base64.b64encode(image_file.read()).decode('utf-8')

        return image_data

    except Exception as e:
        # Chain the original exception so its type and traceback stay visible
        raise RuntimeError(f"Error encoding image: {e}") from e

image_data = encode_image(image_url)
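
`encode_image` returns only the base64 text, and the request built later in this notebook labels every image as `image/jpeg`. If you feed in PNG or WebP files, you may want the label to match the file. The sketch below is one way to do that, assuming the file extension is trustworthy; `to_data_url` is a hypothetical helper, not part of the original code or of the Groq SDK.

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Build a data URL, guessing the MIME type from the file name (hypothetical helper)."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        # Assumption: fall back to JPEG, which is what the notebook hard-codes
        mime_type = "image/jpeg"
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Round trip on placeholder bytes: decoding the payload returns the input.
sample_bytes = b"\x89PNG placeholder bytes"
url = to_data_url(sample_bytes, "diagram.png")
header, payload = url.split(",", 1)
print(header)
assert base64.b64decode(payload) == sample_bytes
```

The guess is based on the extension only, so a mislabeled file still gets the wrong type; inspecting the file's leading bytes would be more robust but is beyond this sketch.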


In [None]:
import os
from groq import Groq

os.environ["GROQ_API_KEY"] = "<INSERT YOUR GROQ KEY>"

client = Groq()
llava_model = 'llava-v1.5-7b-4096-preview'
llama31_model = 'llama-3.1-70b-versatile'

## Code copied with gratitude from: https://mer.vin/2024/09/groq-multi-modal/
def image_to_text(client, model, base64_image, prompt):
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                        },
                    },
                ],
            }
        ],
        model=model
    )

    return chat_completion.choices[0].message.content

prompt = "Describe the image"
image_description = image_to_text(client=client, model=llava_model, base64_image=image_data, prompt=prompt)
print(image_description)
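
The cell above writes the API key straight into the notebook, which is easy to leak when the notebook is shared. A minimal alternative, assuming you are happy to type the key once per session, is to read it from the environment and prompt only when it is missing. `load_groq_key` is a hypothetical helper name; the cell above already relies on `Groq()` picking the key up from `GROQ_API_KEY`.

```python
import os
from getpass import getpass


def load_groq_key(env_var: str = "GROQ_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is not set."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value, so the key never appears in cell output
        key = getpass(f"Enter {env_var}: ").strip()
        os.environ[env_var] = key
    return key


# Usage in the notebook (commented out so this sketch does not prompt on import):
# load_groq_key()
# client = Groq()
```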


## Story Generation

Now we take this image description and ask Llama to generate a story for the image.

In [None]:
## Code copied with gratitude from: https://mer.vin/2024/09/groq-multi-modal/
def short_story_generation(client, image_description, topic):
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": f"You are a children's book author. Write a short story about the scene depicted in this image or images. The story is about Topic - {topic}",
            },
            {
                "role": "user",
                "content": image_description,
            }
        ],
        model=llama31_model
    )

    return chat_completion.choices[0].message.content

prompt = '''
Describe this image in detail, including the appearance of the people and any notable actions or behaviors.
'''
image_description = image_to_text(client, llava_model, image_data, prompt)

topic = "dad going away for work"

print("\n\n--- Image Description ---\n")
print(image_description)

print("\n\n--- Short Story (Based on image description) ---\n")
print(short_story_generation(client, image_description, topic))
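
The story step sends two messages: a system prompt that carries the topic, and a user message that carries the image description. A small sketch that builds the same message list as a plain function makes it possible to inspect the prompt before spending an API call. `build_story_messages` is a hypothetical name, and the empty-description check is an added assumption, not something the original code does.

```python
def build_story_messages(image_description: str, topic: str) -> list:
    """Assemble the system and user messages used for story generation."""
    if not image_description.strip():
        # Assumption: an empty description is a mistake worth stopping on
        raise ValueError("image_description is empty")
    system_prompt = (
        "You are a children's book author. Write a short story about the scene "
        f"depicted in this image or images. The story is about Topic - {topic}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": image_description},
    ]


# Inspect the prompt locally; no API call is made here.
example_messages = build_story_messages("A man waves from a train window.", "dad going away for work")
for message in example_messages:
    print(f"[{message['role']}] {message['content']}")
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`, exactly as `short_story_generation` does inline.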


## Can we generate a story with multiple images?

Following the pattern at https://mer.vin/2024/09/groq-multi-modal/, we read multiple image files, generate a description for each image, and then generate one story from the combined descriptions.

In [None]:
first_image = "first_image.jpeg"
second_image = "second_image.jpeg"
third_image = "third_image.jpeg"

image_data1 = encode_image(first_image)
image_data2 = encode_image(second_image)
image_data3 = encode_image(third_image)

image_description1 = image_to_text(client, llava_model, image_data1, prompt)
image_description2 = image_to_text(client, llava_model, image_data2, prompt)
image_description3 = image_to_text(client, llava_model, image_data3, prompt)

total_description = image_description1 + "\n\n" + image_description2 + "\n\n" + image_description3
print(total_description)

print("\n\n--- Short Story (Based on combined description) ---\n")
print(short_story_generation(client, total_description, topic=topic))
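
The three-image cell repeats the same encode and describe steps for each file. For more images, a loop may be easier to maintain. The sketch below is one possible shape, assuming the describing step is passed in as a function so the loop itself can run without an API key; `describe_images` and `fake_describe` are hypothetical names, and the numbered labels are an added convention that lets the story model tell the scenes apart.

```python
from typing import Callable, Iterable


def describe_images(sources: Iterable[str], describe: Callable[[str], str]) -> str:
    """Describe each image in order and join the results with numbered labels."""
    sections = []
    for index, source in enumerate(sources, start=1):
        sections.append(f"Image {index}:\n{describe(source).strip()}")
    return "\n\n".join(sections)


def fake_describe(source: str) -> str:
    """Stand-in describer so the sketch runs offline."""
    return f"a description of {source}"


combined = describe_images(["first_image.jpeg", "second_image.jpeg"], fake_describe)
print(combined)
```

In this notebook the real describer could be `lambda src: image_to_text(client, llava_model, encode_image(src), prompt)`, and `combined` would take the place of `total_description`.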