# Object recognition pipeline

The pipeline is based on a few models and frameworks:


**BLIP-2** - image captioning ([paper](https://arxiv.org/pdf/2301.12597), [HuggingFace](https://huggingface.co/docs/transformers/model_doc/blip-2)).


**spaCy** - English text analyser ([official site](https://spacy.io/)); see [dependency parsing](https://spacy.io/usage/linguistic-features#dependency-parse).


**GroundingDINO/SAM** - open set object detection and segmentation ([SAM official site](https://segment-anything.com/), [SAM demo](https://segment-anything.com/demo#), [SAM Github](https://github.com/facebookresearch/segment-anything), [GroundingDINO Github](https://github.com/IDEA-Research/GroundingDINO), [HF Grounding DINO demo](https://huggingface.co/spaces/merve/Grounding_DINO_demo), [GroundingSAM GitHub](https://github.com/IDEA-Research/Grounded-Segment-Anything), [GroundingSAM example in Colab](https://colab.research.google.com/github/betogaona7/Grounded-Segment-Anything/blob/main/grounded_sam_colab_demo.ipynb)).

**LLaVA** - general-purpose multimodal model fine-tuned on GPT-generated instruction-following data to solve visual understanding tasks ([official repo](https://github.com/haotian-liu/LLaVA), [demo](https://llava-vl.github.io/)).




## Set up the environment

In [None]:
!pip install transformers

## Load an image

In [None]:
import requests
from PIL import Image

url = 'https://raw.githubusercontent.com/ant-nik/semares/master/data/stereo-camera-cyl/image68_r.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert('RGB')
display(image.resize((596, 437)))

## BLIP 2

### Loading BLIP 2 model

BLIP-2 models and checkpoints are available on the [Hugging Face Hub](https://huggingface.co/models?other=blip-2). Both the model and its processor must be loaded. The checkpoint used here pairs BLIP-2 with Meta AI's pre-trained OPT language model, which has 2.7 billion parameters.

In [None]:
from transformers import AutoProcessor, Blip2ForConditionalGeneration
import torch

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
# optimize RAM by using float16
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)

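A GPU speeds up inference considerably. The float16 weights loaded above are intended for GPU execution and may be slow or unsupported on CPU.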

In [None]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

### List objects on an image with BLIP 2

No prompt is required to simply caption an image. To list the objects instead, pass a question-style prompt:

In [None]:
prompt = "Question: What objects are in the image? Answer:"

inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)

A chat-like context can be built by prepending the previous question and answer to a follow-up question about the objects found in the previous step:

In [None]:
# Example answer from the previous step: "a ball, a checkerboard, a person, and a ball."
context = f"{prompt} {generated_text}"
chat_prompt = f"{context}. Question: Which color is a person's pants? Answer:"
inputs = processor(image, text=chat_prompt, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
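
The context string above is assembled by hand for a single follow-up question. For longer dialogues, the same `Question: ... Answer: ...` pattern can be generated from a list of earlier turns. The sketch below is only an illustration: `build_chat_prompt` and the `history` list are hypothetical names that are not part of BLIP-2 or `transformers`, and the first answer simply reuses the example output shown above.

```python
def build_chat_prompt(history, question):
    """Format earlier (question, answer) pairs plus a new question as one prompt."""
    turns = [f"Question: {q} Answer: {a}" for q, a in history]
    # The final turn is left open so the model completes the answer.
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)


# Hypothetical dialogue history; the answer mirrors the example output above.
history = [
    ("What objects are in the image?",
     "a ball, a checkerboard, a person, and a ball."),
]
chat_prompt = build_chat_prompt(history, "Which color is a person's pants?")
print(chat_prompt)
```

The resulting string can be passed as `text=chat_prompt` to the processor in the same way as in the cell above.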

## GroundingSAM

## LLaVA

[GitHub](https://github.com/haotian-liu/LLaVA), [Demo](https://llava-vl.github.io/)

In [None]:
!git clone https://github.com/ant-nik/LLaVA.git

In [None]:
# !pip install accelerate

In [None]:
%cd LLaVA

In [None]:
!mkdir offload

In [None]:
!pip install -e .

In [None]:
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "liuhaotian/llava-v1.5-7b"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

In [None]:
model_path = "liuhaotian/llava-v1.5-7b"
prompt = "Provide bounding boxes for objects."
image_file = "https://raw.githubusercontent.com/ant-nik/semares/master/data/stereo-camera-cyl/image68_r.jpg"

args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
    "offload_folder": "./offload"
})()

eval_model(args)
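
The `type('Args', (), {...})()` call above creates an ad-hoc object whose attributes hold the settings. A possibly more readable equivalent is `types.SimpleNamespace` from the standard library. This sketch assumes that `eval_model` only reads its argument through attribute access (for example `args.query`), which has not been verified here. The model name is derived with a plain string split as a stand-in for `get_model_name_from_path`, so that the snippet runs without LLaVA installed.

```python
from types import SimpleNamespace

model_path = "liuhaotian/llava-v1.5-7b"

args = SimpleNamespace(
    model_path=model_path,
    model_base=None,
    # Stand-in for get_model_name_from_path(model_path) (assumption).
    model_name=model_path.split("/")[-1],
    query="Provide bounding boxes for objects.",
    conv_mode=None,
    image_file=(
        "https://raw.githubusercontent.com/ant-nik/semares/master/"
        "data/stereo-camera-cyl/image68_r.jpg"
    ),
    sep=",",
    temperature=0,
    top_p=None,
    num_beams=1,
    max_new_tokens=512,
    offload_folder="./offload",
)

# Settings are available as attributes, as with the class-based version.
print(args.model_name, args.max_new_tokens)
```

A namespace built this way also prints all of its fields when displayed, which helps when checking the settings before a long run.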

## Semantic segmentation

[demo](https://replicate.com/cjwbw/semantic-segment-anything)

## Semantic Segment Anything

[SSA Github](https://github.com/fudan-zvg/Semantic-Segment-Anything)

[SSA Demo](https://replicate.com/cjwbw/semantic-segment-anything)

## SAM -> BLIP 2 tool

[Colab example](https://colab.research.google.com/github/ttengwang/Caption-Anything/blob/main/notebooks/tutorial.ipynb)

## SAM based annotation

[SegDrawer home page](https://github.com/lujiazho/SegDrawer)

[SegDrawer in colab with ngrok proxy](https://github.com/lujiazho/SegDrawer/blob/main/SegDrawer.ipynb)

## spaCy

Dependency parsing is needed to process the models' text output, for example to extract the list of objects (nouns) and the relations between them.

See [installation](https://github.com/explosion/spaCy?tab=readme-ov-file#-install-spacy), [model loading](https://github.com/explosion/spaCy?tab=readme-ov-file#-install-spacy) and [dependency parsing](https://spacy.io/usage/linguistic-features#dependency-parse).


In [None]:
!pip install spacy

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")

In [None]:
import pandas

table = {
    "text": [],  "dep": [], "head_text": [], "head_pos": [],
    "children": []
}
for token in doc:
    table["text"].append(token.text)
    table["head_text"].append(token.head.text)
    table["head_pos"].append(token.head.pos_)
    table["dep"].append(token.dep_)
    table["children"].append([child for child in token.children])

table = pandas.DataFrame(table)
table
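
To get from parsed tokens to a list of objects, one option is to keep only the noun tokens and de-duplicate their lemmas. The sketch below relies on the `pos_` and `lemma_` attributes that spaCy tokens provide. `unique_nouns` is a hypothetical helper, and the stand-in tokens are hand-written illustration data (not real spaCy output) so that the snippet runs without a language model installed. With spaCy available, `nlp(caption)` would be passed instead.

```python
from types import SimpleNamespace


def unique_nouns(tokens):
    """Return lemmas of NOUN/PROPN tokens, de-duplicated, in order of appearance."""
    nouns = []
    for token in tokens:
        if token.pos_ in ("NOUN", "PROPN") and token.lemma_ not in nouns:
            nouns.append(token.lemma_)
    return nouns


# Hand-written stand-ins that mimic spaCy Token attributes (assumed tags)
# for the caption "a ball, a checkerboard, a person, and a ball."
tagged = [
    ("a", "DET"), ("ball", "NOUN"), (",", "PUNCT"),
    ("a", "DET"), ("checkerboard", "NOUN"), (",", "PUNCT"),
    ("a", "DET"), ("person", "NOUN"), (",", "PUNCT"),
    ("and", "CCONJ"), ("a", "DET"), ("ball", "NOUN"), (".", "PUNCT"),
]
fake_doc = [SimpleNamespace(text=t, lemma_=t, pos_=p) for t, p in tagged]

objects = unique_nouns(fake_doc)
print(objects)
```

The resulting names could then serve as text queries for an open-set detector, although how to join them into a prompt depends on the detector used.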