# OCI Multimodal Vision LLM Step-by-step

### What this file does:
Demonstrates multimodal (image+text) prompts with Oracle Cloud's Generative AI, using only the OCI Python SDK (no LangChain needed).

**Documentation to reference:**
- OCI Gen AI: https://docs.oracle.com/en-us/iaas/Content/generative-ai/home.htm
- OCI Python SDK: https://github.com/oracle/oci-python-sdk/tree/master/src/oci/generative_ai_inference

**Relevant Slack channels:**
- #generative-ai-users: *for questions on OCI Gen AI*
- #igiu-innovation-lab: *general discussions on your project*
- #igiu-ai-learning: *help with sandbox environment or help with running this code*

**Env setup:**
- sandbox.yaml: Contains OCI config and compartment.
- .env: Load environment variables if needed.
- To make Jupyter's working directory match the one your workspace Python code uses:
    - In VS Code, open Settings > Extensions > Jupyter > Notebook File Root.
    - Change the value from `${fileDirname}` to `${workspaceFolder}`.


**How to run in notebook:**
- Make sure your runtime environment has all dependencies and access to required config files.
- Run the notebook cells in order.

---

## Step 1: Setup and Requirements

**Key Concepts:**
- **Environment Setup:** Before interacting with OCI services, you need to configure your environment with credentials, compartment IDs, and dependencies. This is typically done via a YAML config file (sandbox.yaml) and environment variables.
- **Dependencies:** Install necessary libraries like the OCI SDK and configuration loaders.
- **Libraries:** Use packages for config management, environment variables, and OCI interactions.

In this step, we'll import the required libraries and load environment variables. Install the dependencies first if your environment does not already have them.

In [None]:
# Import necessary libraries
import os
import base64
from dotenv import load_dotenv
from envyaml import EnvYAML
import oci
import time

# Load environment variables
load_dotenv()

print("Libraries imported and environment loaded.")
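
The cell above assumes the packages are already installed. As a minimal sketch, the check below reports which modules cannot be imported before you run the rest of the notebook. The helper name `missing_modules` is hypothetical, and the pip package names in the comment are the usual PyPI names for these imports, so verify them against your own environment.

```python
from importlib.util import find_spec

def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]

# Import names used by this notebook. They are typically installed with:
#   pip install oci python-dotenv envyaml
REQUIRED_MODULES = ["oci", "dotenv", "envyaml"]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing modules: {missing}")
else:
    print("All required modules are available.")
```

Running this first gives a clearer message than an `ImportError` halfway through the import cell.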

## Step 2: Load OCI Configuration

**Key Concepts:**
- **Configuration Loading:** Securely load settings from a config file to authenticate and specify resources without hardcoding credentials.
- **OCI Config:** Includes paths to config files, profiles, compartments, and regions.
- **Error Handling:** Validate that the config is loaded correctly to avoid runtime errors.

In this step, we'll load and validate the configuration.

In [None]:
# Define paths and load configuration
# Make sure your sandbox.yaml file is set up for your environment. You might have to specify the full path depending on your `cwd`.
# You can also try making your cwd for jupyter match your workspace python code:
# vscode menu -> Settings > Extensions > Jupyter > Notebook File Root
# change from ${fileDirname} to ${workspaceFolder}

SANDBOX_CONFIG_FILE = "sandbox.yaml"

def load_config(config_path):
    try:
        return EnvYAML(config_path)
    except FileNotFoundError:
        print(f"Error: Configuration file '{config_path}' not found.")
        return None
    except Exception as e:
        print(f"Error loading config: {e}")
        return None

scfg = load_config(SANDBOX_CONFIG_FILE)
required_keys = ("configFile", "profile", "compartment")
assert (
    scfg is not None
    and "oci" in scfg
    and all(key in scfg["oci"] for key in required_keys)
), "Check your sandbox.yaml config!"
config = oci.config.from_file(os.path.expanduser(scfg["oci"]["configFile"]), scfg["oci"]["profile"])
compartment_id = scfg["oci"]["compartment"]

print("Configuration loaded successfully.")
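
The validation above implies a particular layout for `sandbox.yaml`. The sketch below shows that layout as a comment and reports which required keys are missing, which can be friendlier than a bare assertion. The helper name `missing_oci_keys` is hypothetical and every value shown is a placeholder, not a real credential or OCID.

```python
# Shape of sandbox.yaml implied by the checks above (placeholder values only):
#
# oci:
#   configFile: ~/.oci/config
#   profile: DEFAULT
#   compartment: ocid1.compartment.oc1..<your-compartment-ocid>

REQUIRED_OCI_KEYS = ("configFile", "profile", "compartment")

def missing_oci_keys(cfg):
    """Return the required keys that are absent from cfg['oci']."""
    if cfg is None or "oci" not in cfg:
        return list(REQUIRED_OCI_KEYS)
    oci_section = cfg["oci"]
    return [key for key in REQUIRED_OCI_KEYS if key not in oci_section]

# A plain dict stands in for the loaded config in this example.
example = {"oci": {"configFile": "~/.oci/config", "profile": "DEFAULT"}}
print(missing_oci_keys(example))  # ['compartment']
```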

## Step 3: Model List & Endpoint

**Key Concepts:**
- **Model Selection:** Choose from available OCI Gen AI models for multimodal tasks, including text and image understanding.
- **Service Endpoint:** The regional endpoint for the Generative AI service, which handles inference requests.
- **Model Variety:** Different models may perform better on certain tasks, so it is worth testing several and comparing their answers.

In this step, we'll define the models to test and the service endpoint.

In [None]:
# Define models to test
MODEL_LIST = [
    "meta.llama-4-scout-17b-16e-instruct",
    "openai.gpt-4.1",
    "xai.grok-4",
]

# Define the service endpoint (region-specific)
llm_service_endpoint = "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com"

print("Models and endpoint configured.")
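
Because the endpoint is region-specific, it can help to build it from a region identifier instead of hardcoding the full URL. This is a sketch that assumes other regions follow the same hostname pattern as the Chicago endpoint above. The function name `build_inference_endpoint` is hypothetical, and model availability differs by region, so check the OCI Gen AI documentation before switching.

```python
def build_inference_endpoint(region):
    """Build the Generative AI inference endpoint URL for an OCI region identifier."""
    if not region or not region.strip():
        raise ValueError("region must be a non-empty string such as 'us-chicago-1'")
    return f"https://inference.generativeai.{region.strip()}.oci.oraclecloud.com"

print(build_inference_endpoint("us-chicago-1"))
# https://inference.generativeai.us-chicago-1.oci.oraclecloud.com
```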

## Step 4: Select and Encode Your Image

**Key Concepts:**
- **Image Selection:** Choose the image file to analyze. Ensure it's in a supported format (e.g., JPEG).
- **Base64 Encoding:** The image travels inside the request body, so its bytes are encoded as a base64 string and later wrapped in a `data:` URL.
- **Display:** Visualize the image to confirm it's the correct one before processing.

In this step, we'll select the image, display it, and encode it.

In [None]:
# Import display library
from IPython.display import Image, display

# Select the image
IMAGE_PATH = 'vision/dussera-b.jpg'   # Change if you use a different image!

print(f"Selected image: {IMAGE_PATH}")

# Display the image
display(Image(filename=IMAGE_PATH))

In [None]:
# Function to encode image to base64
def encode_image(path):
    with open(path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Encode the image
IMAGE_B64 = encode_image(IMAGE_PATH)

print("Image encoded successfully.")
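
The encoded string is later wrapped in a data URL whose MIME type must match the image format. As a sketch, the helper below guesses the MIME type from the filename with the standard-library `mimetypes` module and builds the data URL in one step. The name `to_data_url` is hypothetical, and whether a given model accepts formats other than JPEG is an assumption to verify per model.

```python
import base64
import mimetypes

def to_data_url(image_bytes, filename):
    """Encode raw image bytes as a data URL, guessing the MIME type from the filename."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for '{filename}'")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

# Three dummy bytes stand in for real image content here.
print(to_data_url(b"abc", "example.png"))  # data:image/png;base64,YWJj
```

In the notebook you would pass the bytes read from `IMAGE_PATH` together with the path itself.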

## Step 5: Build the Multimodal User Message

**Key Concepts:**
- **Multimodal Input:** Combine text and image in a single message for the LLM.
- **Message Structure:** Use OCI SDK classes to create text and image content objects.
- **User Message:** Encapsulate the content in a UserMessage for the chat API.

In this step, we'll construct the message with text and image.

In [None]:
# Define the user text prompt
USER_TEXT = "Tell me about this image"  # You can change this!

# Function to build user message
def build_user_message(img_b64, text):
    content1 = oci.generative_ai_inference.models.TextContent()
    content1.text = text
    content2 = oci.generative_ai_inference.models.ImageContent()
    image_url = oci.generative_ai_inference.models.ImageUrl()
    # The MIME type is hardcoded to JPEG; adjust it if your image is PNG, WebP, etc.
    image_url.url = f"data:image/jpeg;base64,{img_b64}"
    content2.image_url = image_url
    message = oci.generative_ai_inference.models.UserMessage()
    message.content = [content1, content2]
    return message

print("User message structure defined.")
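
The same message can be written more compactly with keyword-argument constructors. This is a sketch that assumes the content classes accept keyword arguments the same way `OnDemandServingMode(model_id=...)` does elsewhere in this notebook. The function name `build_user_message_kw`, the `models` parameter, and the `mime_type` parameter are hypothetical additions. The demo uses stand-in classes, not the real SDK, so that it runs without OCI installed.

```python
from types import SimpleNamespace

def build_user_message_kw(models, img_b64, text, mime_type="image/jpeg"):
    """Build a text+image user message using keyword-argument constructors.

    `models` is expected to be oci.generative_ai_inference.models; it is passed
    in explicitly so the function can also be exercised with stand-in classes.
    """
    text_part = models.TextContent(text=text)
    image_part = models.ImageContent(
        image_url=models.ImageUrl(url=f"data:{mime_type};base64,{img_b64}")
    )
    return models.UserMessage(content=[text_part, image_part])

# Stand-in classes (NOT the real SDK) that simply store their keyword arguments.
fake_models = SimpleNamespace(
    TextContent=SimpleNamespace,
    ImageContent=SimpleNamespace,
    ImageUrl=SimpleNamespace,
    UserMessage=SimpleNamespace,
)
demo = build_user_message_kw(fake_models, "QUJD", "Describe this image", "image/png")
print(demo.content[1].image_url.url)  # data:image/png;base64,QUJD
```

In the notebook, the call would look like `build_user_message_kw(oci.generative_ai_inference.models, IMAGE_B64, USER_TEXT)`.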

## Step 6: Chat Request Utilities

**Key Concepts:**
- **Request Parameters:** Configure the chat request with options like number of generations, streaming, token limits, and temperature.
- **Chat Details:** Wrap the request with serving mode, compartment, and model details for the API call.
- **API Format:** Use the generic format for multimodal support.

In this step, we'll define utilities for building the chat request.

In [None]:
# Function to create chat request
def get_chat_request(message):
    chat_request = oci.generative_ai_inference.models.GenericChatRequest()
    chat_request.messages = [message]
    chat_request.api_format = oci.generative_ai_inference.models.BaseChatRequest.API_FORMAT_GENERIC
    chat_request.num_generations = 1
    chat_request.is_stream = False
    chat_request.max_tokens = 500
    chat_request.temperature = 0.75
    return chat_request

# Function to create chat details
def get_chat_detail(llm_request, compartment_id, model_id):
    chat_detail = oci.generative_ai_inference.models.ChatDetails()
    chat_detail.serving_mode = oci.generative_ai_inference.models.OnDemandServingMode(model_id=model_id)
    chat_detail.compartment_id = compartment_id
    chat_detail.chat_request = llm_request
    return chat_detail

print("Chat request utilities defined.")
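
The request parameters above are hardcoded inside `get_chat_request`. One possible refinement, sketched here, is to group them in a small settings object so they can be varied per experiment. The names `GenerationSettings` and `apply_settings` are hypothetical, the validation bounds are conservative assumptions and not documented service limits, and the demo uses a stand-in object instead of a real `GenericChatRequest`.

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass(frozen=True)
class GenerationSettings:
    """Tunable request parameters; defaults mirror the values used in this notebook."""
    max_tokens: int = 500
    temperature: float = 0.75
    num_generations: int = 1
    is_stream: bool = False

    def __post_init__(self):
        # Basic sanity checks only; real limits depend on the model.
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")

def apply_settings(chat_request, settings):
    """Copy the settings onto a chat request object and return it."""
    chat_request.max_tokens = settings.max_tokens
    chat_request.temperature = settings.temperature
    chat_request.num_generations = settings.num_generations
    chat_request.is_stream = settings.is_stream
    return chat_request

# Stand-in for GenericChatRequest so the sketch runs without the SDK.
request = apply_settings(SimpleNamespace(), GenerationSettings(temperature=0.2))
print(request.max_tokens, request.temperature)  # 500 0.2
```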

## Step 7: Initialize the LLM Client

**Key Concepts:**
- **Client Initialization:** Create a client instance with config, endpoint, and retry/timeout settings.
- **Generative AI Inference:** This client handles multimodal requests to OCI's Gen AI service.
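- **Timeouts:** Set a connection timeout and a read timeout; `timeout=(10, 240)` below allows 10 seconds to connect and 240 seconds to wait for a response, which leaves room for slow image inference.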

In this step, we'll set up the client for inference.

In [None]:
# Initialize the Generative AI Inference client
llm_client = oci.generative_ai_inference.GenerativeAiInferenceClient(
    config=config,
    service_endpoint=llm_service_endpoint,
    retry_strategy=oci.retry.NoneRetryStrategy(),
    timeout=(10, 240)
)

print("LLM client initialized.")
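
The client above disables SDK retries with `NoneRetryStrategy`, so a transient failure surfaces immediately. If you want retries, one option is a small wrapper with exponential backoff, sketched below. The name `call_with_retries` and its parameters are hypothetical, and retrying on every exception is a simplification: in practice you would retry only errors you know to be transient. The SDK also ships its own retry strategies in `oci.retry`, which this sketch does not use.

```python
import time

def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff.

    The delay doubles after each failure: base_delay, 2*base_delay, 4*base_delay, ...
    The last failure is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Example use in the notebook (requires the client and request defined in the notebook):
# llm_response = call_with_retries(lambda: llm_client.chat(chat_detail))
```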

## Step 8: Run Inference – Compare Results!

**Key Concepts:**
- **Loop Through Models:** Test the multimodal prompt across different models to compare responses.
- **Inference Process:** Send the message, receive the response, and extract the text content.
- **Performance Timing:** Measure response time for each model to understand efficiency.

In this step, we'll run the inference for each model and display results.

In [None]:
# Loop through models and run inference
for model_id in MODEL_LIST:
    print("\n" + "="*80)
    print(f"RESULTS FOR MODEL: {model_id}\n" + "="*80)
    start_time = time.time()
    
    # Build the message
    user_msg = build_user_message(IMAGE_B64, USER_TEXT)
    
    # Prepare the request
    llm_payload = get_chat_request(user_msg)
    chat_detail = get_chat_detail(llm_payload, compartment_id, model_id)
    
    # Send the request
    llm_response = llm_client.chat(chat_detail)
    
    # Process the response, guarding against missing fields
    chat_response = getattr(getattr(llm_response, "data", None), "chat_response", None)
    choices = getattr(chat_response, "choices", None)
    if choices:
        llm_text = choices[0].message.content[0].text
        print(llm_text)
    else:
        print("Error: Invalid response from LLM.")
    
    end_time = time.time()
    print(f"\nTime taken: {end_time - start_time:.2f} seconds\n")
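
The loop reaches through several nested attributes to find the reply text, and it assumes the first content part is text. As a sketch, the helper below walks the same path defensively and returns the first text part it finds. The name `extract_text` is hypothetical, and the demo uses stand-in objects that mimic the response shape used in this notebook, not a real API response.

```python
from types import SimpleNamespace

def extract_text(llm_response):
    """Return the first text part of the first choice, or None if the shape is unexpected."""
    data = getattr(llm_response, "data", None)
    chat_response = getattr(data, "chat_response", None)
    choices = getattr(chat_response, "choices", None) or []
    if not choices:
        return None
    message = getattr(choices[0], "message", None)
    for part in getattr(message, "content", None) or []:
        text = getattr(part, "text", None)
        if text:
            return text
    return None

# Stand-in response mimicking data.chat_response.choices[0].message.content[0].text
fake_response = SimpleNamespace(
    data=SimpleNamespace(
        chat_response=SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=[SimpleNamespace(text="A festival scene.")]))]
        )
    )
)
print(extract_text(fake_response))  # A festival scene.
```

Wrapping the `llm_client.chat(...)` call in a `try`/`except` inside the loop would similarly keep one failing model from stopping the comparison.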

## Play and Explore

- Change `USER_TEXT` to ask any question about your picture.
- Swap in a different image.
- Compare models easily!

## 🧑‍💻 Project Ideas for Practice

Below are some fun project prompts. Try one (or all) after you run a basic image through the models!

1. **Business Card → vCard**
   - Upload a photo of a business card.
   - Write a prompt such as: "Extract all the contact details from this business card and output as a vCard file."
   - Post-process the model's output to save/download the .vcf file.

2. **Agenda/Schedule → Calendar File**
   - Try an image of a handwritten or printed agenda.
   - Prompt: "Read and convert this agenda into an iCalendar (.ics) file."
   - Save and import to your calendar app!

3. **Driver's License → CRM/Create Record**
   - Upload an image of a driver's license (redact sensitive fields if needed).
   - Prompt: "Extract all key customer information for a CRM record."
   - Map the LLM's result into your database or spreadsheet.
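
For the post-processing step in these projects, models often surround the requested file content with extra commentary. As a starting point for the business card idea, the sketch below pulls the first vCard block out of a reply and writes it to disk. The function names and the sample reply are hypothetical, and the regular expression only assumes the standard `BEGIN:VCARD` and `END:VCARD` markers.

```python
import re

# Match the first BEGIN:VCARD ... END:VCARD block, across multiple lines.
VCARD_PATTERN = re.compile(r"BEGIN:VCARD.*?END:VCARD", re.DOTALL | re.IGNORECASE)

def extract_vcard(model_output):
    """Return the first vCard block found in the model's text, or None."""
    match = VCARD_PATTERN.search(model_output)
    return match.group(0).strip() if match else None

def save_vcard(model_output, path):
    """Write the extracted vCard to `path`; return True if a vCard was found."""
    vcard = extract_vcard(model_output)
    if vcard is None:
        return False
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(vcard + "\n")
    return True

# Made-up reply text for illustration only.
sample_reply = (
    "Sure! Here is the card:\n"
    "BEGIN:VCARD\nVERSION:3.0\nFN:Example Person\nEND:VCARD\n"
    "Hope that helps."
)
print(extract_vcard(sample_reply))
```

The same pattern, with `BEGIN:VCALENDAR` and `END:VCALENDAR`, would apply to the calendar project.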

If you see errors, double-check credentials or configurations. Refer to comments or docs for help.

---
**Happy building!**