## Multimodal agent

This tutorial explores how LLM agents can call external tools, and how those tools give an LLM multimodal capabilities.

- Tool: An abstraction around a function that makes it easy for a language model to interact with it. Specifically, a tool's interface takes a single text input and returns a single text output.

- Agent: The language model that drives decision making.
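To make the "single text input, single text output" idea concrete, here is a minimal sketch in plain Python. `SimpleTool` and `shout` are hypothetical names used only for illustration; this is not LangChain's `BaseTool`, just the shape of the interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SimpleTool:
    """Hypothetical stand-in for a tool: a named, described text-to-text function."""
    name: str          # what the language model calls the tool
    description: str   # tells the language model when the tool is useful
    func: Callable[[str], str]

    def run(self, text: str) -> str:
        # Single text input, single text output.
        return self.func(text)


# A toy tool; a real one would wrap a model call instead of str.upper.
shout = SimpleTool(
    name="Shouter",
    description="Returns the input text in upper case.",
    func=str.upper,
)

print(shout.run("hello"))
```

The name and description matter as much as the function: they are the only information the language model has when deciding which tool to call.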

For this tutorial, we create two classes: one for detecting objects in an image and one for captioning an image as text. These classes are then passed to the LLM agent as tools it can use to answer the user's query. The agent decides whether it needs a tool to answer the query. For example, if the user asks, "Generate a caption for this image," the LLM should understand that it needs the image captioning tool to turn the image into a text description and return that description to the user.

Since we are using image captioning models from Hugging Face, we need torch for this tutorial. If you have an Nvidia GPU, we recommend installing a CUDA-enabled build of torch. This tutorial uses CUDA 11.8. If you have a different version (check with `nvidia-smi`), refer to the official PyTorch installation page: https://pytorch.org/get-started/locally/. The packages are large, so make sure you have more than 30 GB of free disk space.

If you do not have an Nvidia GPU, please change `device = "cuda"` to `device = "cpu"`.
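As an alternative to editing the string by hand, the device could be chosen at runtime. The sketch below is one possible approach, not part of the original notebook; `pick_device` is a hypothetical helper, and it falls back to the CPU if torch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-capable GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # torch is not installed yet; assume CPU so the caller can still proceed.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

With this in place, `device = pick_device()` could replace the hard-coded `device = "cuda"` lines in the cells that follow.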

## Install required packages

In [1]:
# pip install -qU transformers tabulate timm ipywidgets

In [2]:
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

## Import required packages

In [3]:
from langchain.tools import BaseTool
from transformers import BlipProcessor, BlipForConditionalGeneration, DetrImageProcessor, DetrForObjectDetection
from PIL import Image
import torch

import os
from tempfile import NamedTemporaryFile
from langchain.agents import initialize_agent
from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory

## Build custom tools

In [4]:
class ImageCaptionTool(BaseTool):
    name = "Image Captioner"
    
    description = "Generate captions for images"

    def _run(self, img_path):
        image = Image.open(img_path).convert('RGB')

        model_name = "Salesforce/blip-image-captioning-large"
        device = "cuda"

        processor = BlipProcessor.from_pretrained(model_name)
        model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)

        inputs = processor(image, return_tensors='pt').to(device)
        outputs = model.generate(**inputs, max_new_tokens=20)

        caption = processor.decode(outputs[0], skip_special_tokens=True)

        return caption
    
    def _arun(self, query: str):
        raise NotImplementedError("This tool does not support async")


In [5]:
class ObjectDetectionTool(BaseTool):
    name = "Object detector"
    description = "Use this tool when given the path to an image in which you would like to detect objects. " \
                  "It will return a list of all detected objects. Each element in the list is in the format: " \
                  "[x1, y1, x2, y2] class_name confidence_score."

    def _run(self, img_path):
        image = Image.open(img_path).convert('RGB')

        processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
        model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

        inputs = processor(images=image, return_tensors="pt")
        outputs = model(**inputs)

        # convert outputs (bounding boxes and class logits) to COCO API
        # let's only keep detections with score > 0.9
        target_sizes = torch.tensor([image.size[::-1]])
        results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]

        detections = ""
        for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
            detections += '[{}, {}, {}, {}]'.format(int(box[0]), int(box[1]), int(box[2]), int(box[3]))
            detections += ' {}'.format(model.config.id2label[int(label)])
            detections += ' {}\n'.format(float(score))

        return detections

    def _arun(self, query: str):
        raise NotImplementedError("This tool does not support async")
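Both `_run` methods above call `from_pretrained` on every invocation, so the model is reloaded each time the agent uses a tool. One possible way to avoid this is to cache the loaded objects. The sketch below uses `functools.lru_cache`; `load_blip` is a hypothetical helper name, and the same pattern would apply to the DETR model.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_blip(model_name: str, device: str):
    """Load the BLIP processor and model once per (model_name, device) pair."""
    # Imported lazily so that defining this helper stays cheap.
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)
    return processor, model
```

Inside `_run`, the two `from_pretrained` lines could then become `processor, model = load_blip(model_name, device)`. Repeated calls with the same arguments return the cached objects.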

## Define helper functions

In [6]:
def get_image_caption(image_path):
    """
    Generates a short caption for the provided image.

    Args:
        image_path (str): The path to the image file.

    Returns:
        str: A string representing the caption for the image.
    """
    image = Image.open(image_path).convert('RGB')

    model_name = "Salesforce/blip-image-captioning-large"
    device = "cuda"  # change to "cpu" if you do not have an Nvidia GPU

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)

    inputs = processor(image, return_tensors='pt').to(device)
    output = model.generate(**inputs, max_new_tokens=20)

    caption = processor.decode(output[0], skip_special_tokens=True)

    return caption


def detect_objects(image_path):
    """
    Detects objects in the provided image.

    Args:
        image_path (str): The path to the image file.

    Returns:
        str: A string with all the detected objects, one per line. Each object as '[x1, y1, x2, y2] class_name confidence_score'.
    """
    image = Image.open(image_path).convert('RGB')

    processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)

    # convert outputs (bounding boxes and class logits) to COCO API
    # let's only keep detections with score > 0.9
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]

    detections = ""
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        detections += '[{}, {}, {}, {}]'.format(int(box[0]), int(box[1]), int(box[2]), int(box[3]))
        detections += ' {}'.format(model.config.id2label[int(label)])
        detections += ' {}\n'.format(float(score))

    return detections
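Because the detections come back as one text string, downstream code that needs numbers has to parse them again. The following sketch shows one way to do that for the `[x1, y1, x2, y2] class_name confidence_score` line format produced above. `parse_detections` is a hypothetical helper, and the sample string is made up for illustration rather than real model output.

```python
import re
from typing import List, Tuple

# One detection per line: "[x1, y1, x2, y2] class_name confidence_score".
# The class name may contain spaces (for example "traffic light").
_LINE = re.compile(r"^\[(-?\d+), (-?\d+), (-?\d+), (-?\d+)\] (.+) (\S+)$")

Detection = Tuple[Tuple[int, int, int, int], str, float]


def parse_detections(text: str) -> List[Detection]:
    """Turn the detection string back into (box, label, score) tuples."""
    parsed = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        x1, y1, x2, y2, label, score = match.groups()
        parsed.append(((int(x1), int(y1), int(x2), int(y2)), label, float(score)))
    return parsed


# Made-up sample in the same format, not real model output.
sample = "[10, 20, 110, 220] traffic light 0.9987\n[0, 5, 50, 60] car 0.95\n"
detections = parse_detections(sample)
print(detections)
```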

## Tool use

In [7]:
# Initialize the agent
tools = [ImageCaptionTool(), ObjectDetectionTool()]

conversational_memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=5,
    return_messages=True
)

llm = ChatOpenAI(
    openai_api_key=os.environ.get("OPENAI_API_KEY"),
    temperature=0,
    model_name="gpt-3.5-turbo"
)

agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=llm,
    max_iterations=5,
    verbose=True,
    memory=conversational_memory,
    early_stopping_method='generate'
)
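`os.environ.get` returns `None` rather than raising when a variable is missing or its name is misspelled. One way to make such a problem explicit is to check for the key before constructing the LLM. `require_env` below is a hypothetical helper, shown only as a sketch.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

For example, `openai_api_key=require_env("OPENAI_API_KEY")` would raise immediately if the key were not exported in the shell that started the notebook.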


In [8]:
image_path = "../docs/images/traffic.jpg"
user_question = "generate a caption for this image"
response = agent.run(f'{user_question}, this is the image path: {image_path}')
print(response)

> Entering new AgentExecutor chain...
```json
{
    "action": "Image Captioner",
    "action_input": "../docs/images/traffic.jpg"
}
```
Observation: cars are driving down the street in traffic at a green light
Thought:```json
{
    "action": "Final Answer",
    "action_input": "cars are driving down the street in traffic at a green light"
}
```

> Finished chain.
cars are driving down the street in traffic at a green light


As you can see, the LLM is able to decide that the Image Captioner tool is needed to answer the user's query. The implementation of the tool can be found in agent/tool.
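The trace above shows the pattern the agent follows: the LLM emits a JSON object naming an action and its input, the matching tool is run, and the result is fed back as an observation. The sketch below imitates a single step of that loop in plain Python. It is a simplified illustration, not LangChain's actual implementation; `dispatch` and `fake_tools` are hypothetical names, and the fake tool only echoes its input.

```python
import json
from typing import Callable, Dict


def dispatch(action_json: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """Run one agent step: either return the final answer or call the named tool."""
    step = json.loads(action_json)
    if step["action"] == "Final Answer":
        return step["action_input"]
    tool = tools[step["action"]]  # raises KeyError for an unknown tool name
    return tool(step["action_input"])


# Stand-in for the real captioning tool; it does not look at the image.
fake_tools = {"Image Captioner": lambda path: f"(caption for {path})"}

step = '{"action": "Image Captioner", "action_input": "../docs/images/traffic.jpg"}'
observation = dispatch(step, fake_tools)
print(observation)
```

In the real agent, the observation is appended to the conversation and the LLM is called again, which is why the trace shows a second JSON object with the `Final Answer` action.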