# Foundation models for zero-shot detection and segmentation

This notebook uses the [Ollama](https://github.com/ollama/ollama) project to run the language models locally.

In [None]:
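!curl -L https://ollama.com/download/ollama-linux-amd64 -o ollama
!chmod +x ollama
#!cp ./ollama /usr/bin/ollama

In [None]:
import subprocess
import time

# Start the Ollama server in the background; `ollama pull` needs it running.
subprocess.Popen(["./ollama", "serve"])
time.sleep(3)  # Give the server a few seconds to start.

In [None]:
# Download both models used below: llava (image description) and llama3.1 (list cleanup).
!./ollama pull llava
!./ollama pull llama3.1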

In [None]:
!wget -q -O xxx.jpg "https://github.com/ant-nik/neural_network_course/blob/main/practice_2_data/video_1_fixed/image_001.jpg?raw=true"

In [None]:
%%writefile prompt.txt
Find the entities in the image.
Split the answer into two sections: LIST and EXPLANATION.
Put only the detected object names in the LIST section.
Put an explanation of the answer in the EXPLANATION section.

In [None]:
!echo '{ "model": "llava", "prompt": "'`cat prompt.txt`'", "images": ["'`base64 -w 0 xxx.jpg`'"], "stream": false}' > body.json
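
The shell one-liner above splices the prompt and the base64-encoded image into a JSON string by hand, which breaks as soon as the prompt contains a double quote or a backslash. A minimal sketch of the same request body built in Python is shown here; the helper name `build_generate_body` is hypothetical, and the field names (`model`, `prompt`, `images`, `stream`) are the ones used in the cell above.

```python
import base64
import json


def build_generate_body(model, prompt, image_bytes=None):
    """Return the JSON text for an Ollama /api/generate request (hypothetical helper)."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if image_bytes is not None:
        # Images are sent as base64-encoded strings, as in the shell version.
        body["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    # json.dumps escapes quotes and newlines in the prompt.
    return json.dumps(body)
```

In this notebook it could be called with the contents of `prompt.txt` and the bytes of `xxx.jpg`, with the returned string written to `body.json`. Unlike the shell version, it keeps the line breaks of the prompt instead of collapsing them into spaces.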

In [None]:
!curl http://localhost:11434/api/generate --data-binary "@body.json"

In [None]:
%%writefile step-2-prompt.txt
Extract the text between the LIST and EXPLANATION sections and treat it as TEXT in the instructions below.
Split the answer into two parts: OUTPUT and INFO.
Remove any enumeration symbols from TEXT and place exactly one list entry per line in the OUTPUT section, between START and END markers.
Put any explanation of the answer in the INFO section.

LIST:
1. Bottle
2. Man
3. Water bottle
4. Rocks
5. Dirt
6. Trash bag
7. Grass
8. River
9. Dogs (if any)
10. Mountain

EXPLANATION:
The image shows a man outside in a natural environment. He appears to be bending over, possibly interacting with the ground or some kind of litter in his hands. There is a bottle near him, and it seems like he might be picking up trash from the area. The landscape suggests a rural or semi-rural setting with rocks, dirt, grass, and what could be a small river or stream visible in the background. Additionally, there appears to be a trash bag nearby, which supports the idea that the man is cleaning up litter.

OUTPUT:


In [None]:
!echo '{ "model": "llama3.1", "prompt": "'`cat step-2-prompt.txt`'", "stream": false}' > step-2-body.json

In [None]:
!curl --data-binary "@step-2-body.json" -o step-2-result.txt http://localhost:11434/api/generate

In [None]:
import json

# The file holds the JSON response of /api/generate despite its .txt extension.
with open("step-2-result.txt", "r") as file:
    step2_response = json.load(file)
print(step2_response["response"])

In [None]:
# Keep the non-empty lines between the START and END markers.
section = step2_response["response"].split("START")[1].split("END")[0]
objects = [item.strip() for item in section.split("\n") if item.strip()]
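
The parsing step above raises an `IndexError` when the model omits the START marker, which can happen because the output format is only requested in the prompt. A more defensive variant might look like the sketch below; the function name `extract_objects` is hypothetical, and returning an empty list for a missing marker is an assumed policy, not something the original notebook does.

```python
def extract_objects(response_text, start="START", end="END"):
    """Return the non-empty, stripped lines between the start and end markers."""
    _, found, rest = response_text.partition(start)
    if not found:
        return []  # The model did not emit the start marker.
    # If the end marker is missing, everything after the start marker is used.
    section, _, _ = rest.partition(end)
    return [line.strip() for line in section.splitlines() if line.strip()]
```

With the response loaded earlier, `objects = extract_objects(step2_response["response"])` would replace the chained `split` calls.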

In [None]:
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

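# Alternative image source (not used below).
image_url = "https://drive.usercontent.google.com/u/0/uc?id=1Abxa12JrIk-R2iupQL0nEH5MWPWtD2H1&export=download"
# Load the image downloaded earlier and make sure it has three color channels.
image = Image.open("xxx.jpg").convert("RGB")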

In [None]:
# Important: Grounding DINO expects text queries in lowercase, with each phrase terminated by a dot.
text = " . ".join([f"all {item}" for item in objects]).lower() + '.'
print(text)
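
The query construction above can be wrapped in a small function so that stray whitespace and empty entries from the language model do not end up in the text query. This is a sketch only; the name `build_grounding_prompt` and the `prefix` parameter are hypothetical, and the `"all "` prefix simply mirrors the cell above.

```python
def build_grounding_prompt(objects, prefix="all "):
    """Join object names into one lowercase, dot-terminated text query."""
    phrases = [f"{prefix}{name.strip()}".lower() for name in objects if name.strip()]
    # An empty object list yields an empty query rather than a lone dot.
    return " . ".join(phrases) + "." if phrases else ""
```

For example, `build_grounding_prompt(["Bottle", " Man "])` produces the same format as the expression in the cell above.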

In [None]:
inputs = processor(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.2,
    text_threshold=0.2,
    target_sizes=[image.size[::-1]]
)
results
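
The raw `results` structure is hard to read at a glance. The sketch below pairs each label with its score and box, drops weak detections, and sorts the rest by confidence. It works on plain Python lists; the function name `summarize_detections` and the `min_score` parameter are hypothetical, and it assumes the result dictionary exposes `labels`, `scores`, and `boxes`, which may differ between `transformers` versions.

```python
def summarize_detections(labels, scores, boxes, min_score=0.2):
    """Return (label, score, box) tuples with score >= min_score, best first."""
    rows = [
        (label, float(score), [round(float(v), 1) for v in box])
        for label, score, box in zip(labels, scores, boxes)
        if float(score) >= min_score
    ]
    # Highest-confidence detections come first.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Under those assumptions it could be called as `summarize_detections(results[0]["labels"], results[0]["scores"].tolist(), results[0]["boxes"].tolist())`.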

In [None]:
!pip install supervision

In [None]:
import numpy
import supervision

# Move the tensor to the CPU before converting; .numpy() fails on CUDA tensors.
boxes = results[0]["boxes"].cpu().numpy()

box_annotator = supervision.BoxAnnotator()
detections = supervision.Detections(
    xyxy=boxes,
    class_id=numpy.ones(boxes.shape[0], dtype=int),  # one placeholder class for all boxes
)

# Per-box captions could be built like this (they are not drawn here):
# labels = [
#     f"{label} {score:0.2f}"
#     for label, score in zip(results[0]["labels"], results[0]["scores"])
# ]
annotated_frame = box_annotator.annotate(scene=image.copy(),
                                         detections=detections)

%matplotlib inline
supervision.plot_image(annotated_frame, (16, 16))