# Basic usage of the inference module for image-based inference
In this tutorial, we use the `inference` module from `urban-worm` to run inference on single and multiple images with InternVL3. The module supports three frameworks for running MLLMs; two of them are demonstrated here: Ollama (built on top of llama.cpp) and Llama.cpp.

Three types of output schema are demonstrated:

- plain text generation
- multiple questions with binary answers
- multiple choices

In [1]:
from urbanworm.inference.llama import InferenceLlamacpp, InferenceOllama

First, let's define the output schemas and the prompts used in the inference tasks.

In [2]:
# define the schema for model output

# this is the default built-in schema for plain text generation
normal_format = {
    "questions": (str, ...),
    "answer": (str, ...),
}

# binary answer
bool_format = {
    "questions": (str, ...),
    "answer": (bool, ...),
}

# multiple choice
from typing import Literal
multiple_choice_format = {
    "questions": (str, ...),
    "answer": (Literal['occupied', 'unoccupied'], ...),
    "explanation": (str, ...),
}

# define the inference task and emphasize the output format in the prompt
multi_questions_prompt =  '''
    Question 1 - Is there any damage on the roof?
    Question 2 - Is any window broken or boarded?
    Question 3 - Is any door broken, missing, or boarded?

    For each question, you have to respond in the following format:
    yes (true) / no (false)
'''

multi_choice_prompt = '''
    Does the house look occupied?
    For each question, you have to respond in the following format:
    'occupied' / 'unoccupied'
'''
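
Each schema entry above is a `(type, ...)` pair, the same field-definition shape that pydantic's `create_model` accepts, where `...` marks a required field. Whether `urban-worm` builds its output model this way is an assumption. As a rough, standard-library-only illustration of what the schemas constrain, the hypothetical helper `check_record` below checks a plain dict against a schema:

```python
from typing import Literal, get_args, get_origin

# Schemas copied from the cell above.
bool_format = {"questions": (str, ...), "answer": (bool, ...)}
multiple_choice_format = {
    "questions": (str, ...),
    "answer": (Literal["occupied", "unoccupied"], ...),
    "explanation": (str, ...),
}

def check_record(record, schema):
    """Return a list of problems; an empty list means the record fits the schema.

    Hypothetical helper for illustration only, not part of urban-worm.
    """
    problems = []
    for name, (expected, default) in schema.items():
        if name not in record:
            if default is ...:  # `...` marks a required field
                problems.append(f"missing required field: {name}")
            continue
        value = record[name]
        if get_origin(expected) is Literal:
            # Literal[...] restricts the value to a fixed set of choices
            if value not in get_args(expected):
                problems.append(f"{name}: {value!r} is not one of {get_args(expected)}")
        elif not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems

ok = check_record(
    {"questions": "Is there any damage on the roof?", "answer": False},
    bool_format,
)
bad = check_record(
    {"questions": "Does the house look occupied?", "answer": "maybe"},
    multiple_choice_format,
)
print(ok)   # no problems for a well-formed binary answer
print(bad)  # reports the invalid choice and the missing explanation
```

A real validation layer would also coerce values (for example the string "false" to a boolean); this sketch only checks exact types.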

We will use three street-view images that capture a single residential property from different angles:

<div style="display: flex; justify-content: center;">
  <img src="./data/img_3.jpg" alt="Description 1" style="width: 33%; padding: 5px;">
  <img src="./data/img_1.jpg" alt="Description 2" style="width: 33%; padding: 5px;">
  <img src="./data/img_2.jpg" alt="Description 3" style="width: 33%; padding: 5px;">
</div>

## 1 One-time inference

### 1.1 Ollama

In [3]:
# build the inference object
# all three images passed to the constructor are used together in a single inference
data = InferenceOllama(llm='hf.co/ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0',
                       image=["./data/img_1.jpg",
                              "./data/img_2.jpg",
                              "./data/img_3.jpg",],
                       schema=normal_format)
# inference
result = data.one_inference(prompt='what is the color of the house?')
result

|  | questions1 | answer1 | data |
|---|---|---|---|
| 0 | What is the color of the house? | The house in each image appears to be light-co... | [./data/img_1.jpg, ./data/img_2.jpg, ./data/im... |


In [4]:
result['answer1'][0]

"The images depict a two-story house with white siding and multiple windows. The yard appears to be fenced, and there's an assortment of items near the entrance such as trash bins and possibly gardening tools. There is also a sidewalk leading up to the front door."

In [5]:
# an image can also be passed directly to a single inference call
data.schema = bool_format # replace the output format
result = data.one_inference(prompt=multi_questions_prompt,
                            image="./data/img_1.jpg")
result

|  | questions1 | answer1 | questions2 | answer2 | questions3 | answer3 | data |
|---|---|---|---|---|---|---|---|
| 0 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | True | [./data/img_1.jpg] |
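
The numbered-question layout of `multi_questions_prompt` can also be generated from a list, which helps when the question set changes between runs. The sketch below uses a hypothetical helper, `build_questions_prompt`, that is not part of `urban-worm`; it only reproduces the text layout used above.

```python
def build_questions_prompt(questions, answer_format="yes (true) / no (false)"):
    """Format questions as 'Question N - ...' lines followed by the answer format."""
    lines = [f"Question {i} - {q}" for i, q in enumerate(questions, start=1)]
    lines.append("")
    lines.append("For each question, you have to respond in the following format:")
    lines.append(answer_format)
    return "\n".join(lines)

prompt = build_questions_prompt([
    "Is there any damage on the roof?",
    "Is any window broken or boarded?",
    "Is any door broken, missing, or boarded?",
])
print(prompt)
```

The resulting string could then be passed as `prompt=` to `one_inference` in place of the hand-written prompt.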


In [16]:
# multiple choice
data.schema = multiple_choice_format # replace the output format
result = data.one_inference(prompt=multi_choice_prompt,
                            image="./data/img_1.jpg")
result

|  | questions1 | answer1 | explanation1 | data |
|---|---|---|---|---|
| 0 | Does the house look occupied? | unoccupied | The porch area appears empty and there are no ... | [./data/img_1.jpg] |
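
In the multiple-choice case the allowed answers appear twice, once in the `Literal` type of the schema and once in the prompt text. One way to keep the two in sync is to read the choices back from the schema with `typing.get_args`. This is a small sketch; `choices_line` is a hypothetical helper name.

```python
from typing import Literal, get_args

multiple_choice_format = {
    "questions": (str, ...),
    "answer": (Literal["occupied", "unoccupied"], ...),
    "explanation": (str, ...),
}

def choices_line(schema, field="answer"):
    """Render the Literal choices of a schema field as a quoted, slash-separated line."""
    field_type = schema[field][0]  # first tuple element is the type annotation
    return " / ".join(repr(choice) for choice in get_args(field_type))

multi_choice_prompt = (
    "Does the house look occupied?\n"
    "For each question, you have to respond in the following format:\n"
    + choices_line(multiple_choice_format)
)
print(multi_choice_prompt)
```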


### 1.2 Llama.cpp

In [10]:
# build constructor
data = InferenceLlamacpp(
    # if the model and mmproj files are already downloaded,
    # you can pass their local paths to the constructor instead, for example:
    # llm = "model/InternVL3-8B-Instruct-Q8_0.gguf"
    # mp = "model/mmproj-InternVL3-8B-Instruct-Q8_0.gguf"

    # alternatively, provide the model's Hugging Face repo id and quantization tag:
    llm='ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0',
    image=["./data/img_1.jpg",
           "./data/img_2.jpg",
           "./data/img_3.jpg",],  # all three images are used together in one inference
    # schema=normal_format
)
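
The two constructors in this tutorial take slightly different model identifiers: the Ollama example prefixes the Hugging Face repo with `hf.co/`, while the Llama.cpp example passes the bare `repo:quant` string. The sketch below shows how such a string splits into a repo id and a quantization tag. How `urban-worm` resolves the identifier internally is not covered here, so `split_model_ref` is purely illustrative.

```python
def split_model_ref(ref):
    """Split 'org/repo:QUANT' (optionally prefixed with 'hf.co/') into (repo_id, quant)."""
    if ref.startswith("hf.co/"):
        ref = ref[len("hf.co/"):]
    repo_id, _, quant = ref.partition(":")
    return repo_id, quant or None  # quant is None when no ':TAG' suffix is present

print(split_model_ref("ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0"))
print(split_model_ref("hf.co/ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0"))
```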

In [14]:
# inference
result = data.one_inference(prompt='what is the color of the house?')
result

|  | questions1 | answer1 | data |
|---|---|---|---|
| 0 | What is the color of the house? | The house in each image appears to be light-co... | [./data/img_1.jpg, ./data/img_2.jpg, ./data/im... |


In [18]:
# single image inference
data.schema = bool_format
result = data.one_inference(prompt=multi_questions_prompt, image="./data/img_1.jpg")
result

|  | questions1 | answer1 | questions2 | answer2 | questions3 | answer3 | data |
|---|---|---|---|---|---|---|---|
| 0 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | True | [./data/img_1.jpg] |


In [17]:
# multiple choice
data.schema = multiple_choice_format # replace the output format
result = data.one_inference(prompt=multi_choice_prompt,
                            image="./data/img_1.jpg")
result

|  | questions1 | answer1 | explanation1 | data |
|---|---|---|---|---|
| 0 | Does the house look occupied? | unoccupied | The porch area appears empty and there are no ... | [./data/img_1.jpg] |


## 2 Batched inference with multiple-image input
To run batched inference with multiple images per request, pack the image paths into a nested list or tuple. Each inner group is sent to the model as one multi-image inference.
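
The nested list can be written by hand, as in the cells below, or built from a flat list of paths. The following sketch shows two common groupings: overlapping windows (the layout used in this section) and non-overlapping chunks. Both helpers are hypothetical and use only the standard library.

```python
def sliding_groups(paths, size):
    """Overlapping groups: [a, b, c] with size=2 gives [[a, b], [b, c]]."""
    return [list(paths[i:i + size]) for i in range(len(paths) - size + 1)]

def chunked_groups(paths, size):
    """Non-overlapping groups; the last group may be shorter than `size`."""
    return [list(paths[i:i + size]) for i in range(0, len(paths), size)]

paths = ["./data/img_1.jpg", "./data/img_2.jpg", "./data/img_3.jpg"]
pairs = sliding_groups(paths, 2)
chunks = chunked_groups(paths, 2)
print(pairs)
print(chunks)
```

Here `pairs` has the same shape and contents as the value assigned to `data.imgs` in the examples that follow.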

### 2.1 Ollama

In [4]:
data = InferenceOllama(llm='hf.co/ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0',
                       schema=bool_format)
data.imgs = [
    ["./data/img_1.jpg",
     "./data/img_2.jpg",],
    ["./data/img_2.jpg",
     "./data/img_3.jpg",]
]

# uncomment the code below to do batched single-image inference
# data.imgs = [
#     ["./data/img_1.jpg",
#      "./data/img_2.jpg",
#      "./data/img_3.jpg",]
# ]

data.batch_inference(prompt=multi_questions_prompt)

Processing...: 100%|█████████████████████████| 2/2 [00:23<00:00, 11.56s/it]


|  | questions1 | answer1 | questions2 | answer2 | questions3 | answer3 | data |
|---|---|---|---|---|---|---|---|
| 0 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | True | [./data/img_1.jpg, ./data/img_2.jpg] |
| 1 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | True | [./data/img_2.jpg, ./data/img_3.jpg] |


In [5]:
data.results

{'responses': [[QnA(questions='Is there any damage on the roof?', answer=False),
   QnA(questions='Is any window broken or boarded?', answer=False),
   QnA(questions='Is any door broken, missing, or boarded?', answer=True)],
  [QnA(questions='Is there any damage on the roof?', answer=False),
   QnA(questions='Is any window broken or boarded?', answer=False),
   QnA(questions='Is any door broken, missing, or boarded?', answer=True)]],
 'data': [['./data/img_1.jpg', './data/img_2.jpg'],
  ['./data/img_2.jpg', './data/img_3.jpg']]}
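
`data.results` keeps the parsed responses and the image groups side by side, one entry per batch item. The sketch below shows one way such a structure could be flattened into one row per image group with numbered columns, similar to the tabular output shown above. The `QnA` dataclass is a stand-in for the response objects printed in the output, and `flatten_results` is a hypothetical helper, not the library's implementation.

```python
from dataclasses import asdict, dataclass

@dataclass
class QnA:
    # Stand-in for the response objects printed above
    questions: str
    answer: bool

def flatten_results(results):
    """Turn {'responses': [[QnA, ...], ...], 'data': [[path, ...], ...]} into row dicts."""
    rows = []
    for responses, images in zip(results["responses"], results["data"]):
        row = {}
        for i, item in enumerate(responses, start=1):
            for key, value in asdict(item).items():
                row[f"{key}{i}"] = value  # e.g. questions1, answer1
        row["data"] = images
        rows.append(row)
    return rows

results = {
    "responses": [[QnA("Is there any damage on the roof?", False),
                   QnA("Is any window broken or boarded?", False)]],
    "data": [["./data/img_1.jpg", "./data/img_2.jpg"]],
}
rows = flatten_results(results)
print(rows[0])
```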

In [6]:
data.df

|  | questions1 | answer1 | questions2 | answer2 | questions3 | answer3 | data |
|---|---|---|---|---|---|---|---|
| 0 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | True | [./data/img_1.jpg, ./data/img_2.jpg] |
| 1 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | True | [./data/img_2.jpg, ./data/img_3.jpg] |


### 2.2 Llama.cpp

In [3]:
data = InferenceLlamacpp(llm='ggml-org/InternVL3-8B-Instruct-GGUF:Q8_0', schema=bool_format)
# pack images in a nested list to batch multiple-image inference
data.imgs = [
    ["./data/img_1.jpg",
     "./data/img_2.jpg",],
    ["./data/img_2.jpg",
     "./data/img_3.jpg",]
]

# uncomment the code below to batch single-image inference
# data.imgs = [
#     ["./data/img_1.jpg",
#      "./data/img_2.jpg",
#      "./data/img_3.jpg",]
# ]

data.batch_inference(prompt=multi_questions_prompt)

Processing...: 100%|█████████████████████████| 2/2 [00:16<00:00,  8.16s/it]


|  | questions_1 | answer_1 | questions_2 | answer_2 | questions_3 | answer_3 | data_1 | data_2 |
|---|---|---|---|---|---|---|---|---|
| 0 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | False | ./data/img_1.jpg | ./data/img_2.jpg |
| 1 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | False | ./data/img_2.jpg | ./data/img_3.jpg |


In [4]:
data.df

|  | questions_1 | answer_1 | questions_2 | answer_2 | questions_3 | answer_3 | data_1 | data_2 |
|---|---|---|---|---|---|---|---|---|
| 0 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | False | ./data/img_1.jpg | ./data/img_2.jpg |
| 1 | Is there any damage on the roof? | False | Is any window broken or boarded? | False | Is any door broken, missing, or boarded? | False | ./data/img_2.jpg | ./data/img_3.jpg |
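
In the outputs shown in this tutorial, the two backends use slightly different column layouts: the Ollama run has `questions1`-style names and a single `data` column holding a list, while the Llama.cpp run has `questions_1`-style names and one `data_N` column per image. If results from both are to be combined, the names can be brought to one convention. A minimal sketch on plain row dicts, using the hypothetical helper `normalize_row`:

```python
import re

def normalize_row(row):
    """Rename 'questions_1'-style keys to 'questions1' and merge data_N columns into a list."""
    out, images = {}, []
    for key, value in row.items():
        match = re.fullmatch(r"data_(\d+)", key)
        if match:
            # collect image columns together with their index so order is preserved
            images.append((int(match.group(1)), value))
            continue
        out[re.sub(r"_(\d+)$", r"\1", key)] = value
    if images:
        out["data"] = [path for _, path in sorted(images)]
    return out

row = {
    "questions_1": "Is there any damage on the roof?",
    "answer_1": False,
    "data_1": "./data/img_1.jpg",
    "data_2": "./data/img_2.jpg",
}
normalized = normalize_row(row)
print(normalized)
```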
