Possible models to test for:

- PaliGemma 3B - Google
- Phi3.5-Vision 4B - Microsoft
- Lllama 3.2 7B - Meta
- Molmo 7B-D, 7B-O - AllenAI
- Qwen VL2 7B - Qwen


Possible features:

- Style:

    Overall design style (e.g., modern, traditional, rustic, industrial, mid-century modern)
    Specific style elements (e.g., tufted, skirted, wingback for chairs)

- Color:

    Primary color
    Secondary colors or color combinations
    Finish type (e.g., matte, glossy, distressed)

- Material:

    Main material (e.g., wood, metal, leather, fabric, glass)
    Secondary materials
    For fabrics: texture or pattern (e.g., smooth, woven, floral print)

- Shape and Form:

    Overall shape (e.g., rectangular, curved, L-shaped for sofas)
    Distinctive features (e.g., high back, rolled arms, tapered legs)

- Size:

    Approximate dimensions or size category (e.g., small, medium, large)
    Number of seats for seating furniture

- Function:

    Type of furniture (e.g., chair, sofa, table, bed)
    Specific subcategory (e.g., dining chair, lounge chair, accent chair)

- Details and Embellishments:

    Decorative elements (e.g., nailhead trim, carved details, button tufting)
    Hardware style (e.g., brass knobs, stainless steel legs)


Use bigger model to generate catpions for images in dataset and then fine-tune smaller model with that?

## Paligemma Experiments

https://huggingface.co/google/paligemma-3b-mix-448

In [1]:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
import os
import time

model_id = "google/paligemma-3b-mix-448"
device = "cuda:0"
dtype = torch.bfloat16

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16",
).eval()
processor = AutoProcessor.from_pretrained(model_id)

  from .autonotebook import tqdm as notebook_tqdm
Downloading shards: 100%|██████████| 2/2 [08:04<00:00, 242.42s/it]
`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.23s/it]


Prompts:

- caption color one word - returns color of object
- caption design style - returns style of object
- caption material that object is made of  - returns material (kinda bad)


In [32]:
##url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image_folder = "/home/s464915/future-designer/experiments/images"

for filename in os.listdir(image_folder):
    if filename.endswith((".png", ".jpg", ".jpeg")):
        image_path = os.path.join(image_folder, filename)
        image = Image.open(image_path)
        
        prompt = "caption material that object is made of"
    
        start_time = time.time()
        
        model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
        input_len = model_inputs["input_ids"].shape[-1]
        
        with torch.inference_mode():
            generation = model.generate(**model_inputs, max_new_tokens=30, do_sample=False)
            generation = generation[0][input_len:]
            decoded = processor.decode(generation, skip_special_tokens=True)
        
        # End timing
        end_time = time.time()
        
        # Calculate and print the processing time
        processing_time = end_time - start_time
        
        print(f"Image: {filename}")
        print(f"Caption: {decoded}")
        print(f"Processing time: {processing_time:.2f} seconds")
        print("--------------------")

Image: Bainton 110 Upholstered Sofa_1.jpg
Caption: leather
Processing time: 0.14 seconds
--------------------
Image: Clifford Upholstered Armchair_1.jpg
Caption: wood
Processing time: 0.12 seconds
--------------------
Image: Offline Outdoor Lounge Chair_1.jpg
Caption: wood
Processing time: 0.13 seconds
--------------------
Image: Miller Upholstered Armchair_1.jpg
Caption: wood
Processing time: 0.13 seconds
--------------------
Image: Cache Lounge Chair_5.jpg
Caption: metal
Processing time: 0.13 seconds
--------------------
Image: Bobbie 98 Upholstered Sofa_3.jpg
Caption: wood
Processing time: 0.12 seconds
--------------------


## Phi-3.5-vision Experiments

In [33]:
from PIL import Image 
import requests 
from transformers import AutoModelForCausalLM 
from transformers import AutoProcessor 

model_id = "microsoft/Phi-3.5-vision-instruct" 

# Note: set _attn_implementation='eager' if you don't have flash_attn installed
model = AutoModelForCausalLM.from_pretrained(
  model_id, 
  device_map="cuda", 
  trust_remote_code=True, 
  torch_dtype="auto", 
  _attn_implementation='eager'    
)

# for best performance, use num_crops=4 for multi-frame, num_crops=16 for single-frame.
processor = AutoProcessor.from_pretrained(model_id, 
  trust_remote_code=True, 
  num_crops=16
) 

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-vision-instruct:
- configuration_phi3_v.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-vision-instruct:
- modeling_phi3_v.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading shards: 100%|██████████| 2/2 [01:22<00:00, 41.29s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.53s/it]
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-vision-instruct:
- processing_phi3_v.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revisi

In [43]:
for filename in os.listdir(image_folder):
    if filename.endswith((".png", ".jpg", ".jpeg")):
        image_path = os.path.join(image_folder, filename)
        image = Image.open(image_path)
        
        messages = [
    {
        "role": "system",
        "content": "You are a furniture expert. Analyze the image and provide a detailed description in JSON format."
    },
    {
        "role": "user",
        "content": """
<|image_1|>
Describe the furniture in the image using the following JSON structure. Use only one word be really specific. If any field is not applicable or cannot be determined, use "N/A".

{
    "type": "Main furniture type (e.g., chair, table, sofa)",
    "style": "Overall style (e.g., modern, traditional, rustic)",
    "color": "Main color",
    "material": "Primary material",
    "shape": "General shape",
    "size": "Size category (small, medium, large)",
    "details": "Any decorative features"
    "condition": "Apparent condition if relevant"
}
"""
    }
]

        prompt = processor.tokenizer.apply_chat_template(
            messages, 
            tokenize=False, 
            add_generation_prompt=True
        )
    
        start_time = time.time()
        
        inputs = processor(prompt, image, return_tensors="pt").to("cuda:0") 

        generation_args = { 
            "max_new_tokens": 1000, 
            "temperature": 0.0, 
            "do_sample": False, 
        } 

        generate_ids = model.generate(**inputs, 
            eos_token_id=processor.tokenizer.eos_token_id, 
            **generation_args
        )

        # remove input tokens 
        generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
        response = processor.batch_decode(generate_ids, 
        skip_special_tokens=True, 
        clean_up_tokenization_spaces=False)[0] 
        # End timing
        end_time = time.time()
        
        # Calculate and print the processing time
        processing_time = end_time - start_time
        
        print(f"Image: {filename}")
        print(f"Caption: {response}")
        print(f"Processing time: {processing_time:.2f} seconds")
        print("--------------------")

Image: Bainton 110 Upholstered Sofa_1.jpg
Caption: {
    "type": "Sofa",
    "style": "Modern",
    "color": "White",
    "material": "Leather",
    "shape": "L",
    "size": "Medium",
    "details": "N/A",
    "condition": "New"
}
Processing time: 4.19 seconds
--------------------
Image: Clifford Upholstered Armchair_1.jpg
Caption: {
    "type": "Chair",
    "style": "Modern",
    "color": "Multicolored",
    "material": "Upholstered",
    "shape": "Armchair",
    "size": "Medium",
    "details": "Striped pattern",
    "condition": "New"
}
Processing time: 4.57 seconds
--------------------
Image: Offline Outdoor Lounge Chair_1.jpg
Caption: {
    "type": "Chair",
    "style": "Modern",
    "color": "Black",
    "material": "Metal",
    "shape": "Rectangular",
    "size": "Medium",
    "details": "Wooden slats",
    "condition": "New"
}
Processing time: 4.46 seconds
--------------------
Image: Miller Upholstered Armchair_1.jpg
Caption: {
    "type": "Chair",
    "style": "Modern",
    "

## Molmo Experiments


In [1]:
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests
import os
import time
import torch

# load the processor
processor = AutoProcessor.from_pretrained(
    'allenai/Molmo-7B-D-0924',
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

# load the model
model = AutoModelForCausalLM.from_pretrained(
    'allenai/Molmo-7B-D-0924',
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

  from .autonotebook import tqdm as notebook_tqdm
Loading checkpoint shards: 100%|██████████| 7/7 [00:06<00:00,  1.02it/s]
Some parameters are on the meta device because they were offloaded to the cpu.


In [None]:
image_folder = "/home/s464915/future-designer/experiments/images"
for filename in os.listdir(image_folder):
    if filename.endswith((".png", ".jpg", ".jpeg")):
        image_path = os.path.join(image_folder, filename)
        image = Image.open(image_path)

        # process the image and text
        inputs = processor.process(
            images=[image],
            text="Describe this image."
        )

        # move inputs to the correct device and make a batch of size 1
        inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
        
    
        start_time = time.time()
        
        # generate output; maximum 200 new tokens; stop generation when <|endoftext|> is generated
        output = model.generate_from_batch(
            inputs,
            GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
            tokenizer=processor.tokenizer
        )

        # only get generated tokens; decode them to text
        generated_tokens = output[0,inputs['input_ids'].size(1):]
        generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
        
        # End timing
        end_time = time.time()
        
        # Calculate and print the processing time
        processing_time = end_time - start_time
        
        print(f"Image: {filename}")
        print(f"Caption: {generated_text}")
        print(f"Processing time: {processing_time:.2f} seconds")
        print("--------------------")

## Qwen VL2 7b Experiments

In [1]:
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
import time
import os

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
     attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")


`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

In [2]:
image_folder = "/home/s464915/future-designer/experiments/images"

for filename in os.listdir(image_folder):
    if filename.endswith((".png", ".jpg", ".jpeg")):
        image_path = os.path.join(image_folder, filename)
        image = Image.open(image_path)

        conversation = [
             {
                "role": "system",
                "content": "You are a furniture expert. Analyze the image and provide a detailed description in JSON format."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                    },
                    {
                        "type": "text", "text": """Describe the furniture in the image using the following JSON structure. 
                                                Use only one word be really specific. If any field is not applicable or cannot be determined, use "N/A".
                            {
                                "type": "Main furniture type (e.g., chair, table, sofa)",
                                "style": "Overall style (e.g., modern, traditional, rustic)",
                                "color": "Main color",
                                "material": "Primary material",
                                "shape": "General shape",
                                "size": "Size category (small, medium, large)",
                                "details": "Any decorative features"
                                "condition": "Apparent condition if relevant"
                            }
                    """},
                ],
            }
        ]
        # Preprocess the inputs
        text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
        # Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

        inputs = processor(
            text=[text_prompt], images=[image], padding=True, return_tensors="pt"
        )
        inputs = inputs.to("cuda")

        start_time = time.time()

        # Inference: Generation of the output
        output_ids = model.generate(**inputs, max_new_tokens=128)
        generated_ids = [
            output_ids[len(input_ids) :]
            for input_ids, output_ids in zip(inputs.input_ids, output_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        
        # End timing
        end_time = time.time()
        
        # Calculate and print the processing time
        processing_time = end_time - start_time
        
        print(f"Image: {filename}")
        print(f"Caption: {output_text}")
        print(f"Processing time: {processing_time:.2f} seconds")
        print("--------------------")

Image: Bainton 110 Upholstered Sofa_1.jpg
Caption: ['{\n    "type": "Sofa",\n    "style": "Modern",\n    "color": "Light gray",\n    "material": "Leather",\n    "shape": "Boxy",\n    "size": "Large",\n    "details": "Clean lines, minimalistic design",\n    "condition": "New"\n}']
Processing time: 3.00 seconds
--------------------
Image: Clifford Upholstered Armchair_1.jpg
Caption: ['{\n    "type": "Chair",\n    "style": "Modern",\n    "color": "Multicolored",\n    "material": "Fabric",\n    "shape": "Square",\n    "size": "Medium",\n    "details": "Striped pattern",\n    "condition": "New"\n}']
Processing time: 2.39 seconds
--------------------
Image: Offline Outdoor Lounge Chair_1.jpg
Caption: ['{\n    "type": "Chair",\n    "style": "Modern",\n    "color": "Dark gray",\n    "material": "Metal",\n    "shape": "U-shaped",\n    "size": "Medium",\n    "details": "Wooden slats",\n    "condition": "New"\n}']
Processing time: 2.42 seconds
--------------------
Image: orange_sofa.jpg
Caption: 

## Qwen VL2 2b Experiments

In [1]:
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
import time
import os

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
     attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [2]:
image_folder = "/home/s464915/future-designer/experiments/images"
times = []

for filename in os.listdir(image_folder):
    if filename.endswith((".png", ".jpg", ".jpeg")):
        image_path = os.path.join(image_folder, filename)
        image = Image.open(image_path)

        conversation = [
             {
                "role": "system",
                "content": "You are a furniture expert. Analyze the image and provide a detailed description in JSON format."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "resized_height": 280,
                        "resized_width": 420
                    },
                    {
                        "type": "text", "text": """Describe the furniture in the image using the following JSON structure. 
                                                Use only one word be really specific. If any field is not applicable or cannot be determined, use "N/A".
                            {
                                "type": "Main furniture type (e.g., chair, table, sofa)",
                                "style": "Overall style (e.g., modern, traditional, rustic)",
                                "color": "Main color",
                                "material": "Primary material",
                                "shape": "General shape",
                                "size": "Size category (small, medium, large)",
                                "details": "Any decorative features"
                                "condition": "Apparent condition if relevant"
                            }
                    """},
                ],
            }
        ]
        # Preprocess the inputs
        text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
        # Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

        inputs = processor(
            text=[text_prompt], images=[image], padding=True, return_tensors="pt"
        )
        inputs = inputs.to("cuda")

        start_time = time.time()

        # Inference: Generation of the output
        output_ids = model.generate(**inputs, max_new_tokens=128)
        generated_ids = [
            output_ids[len(input_ids) :]
            for input_ids, output_ids in zip(inputs.input_ids, output_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        
        # End timing
        end_time = time.time()
        
        # Calculate and print the processing time
        processing_time = end_time - start_time
        times.append(processing_time)
        
        print(f"Image: {filename}")
        print(f"Caption: {output_text}")
        print(f"Processing time: {processing_time:.2f} seconds")
        print("--------------------")

print(f'Avg time of interface: {sum(times)/len(times):.2f} s')

Image: Bainton 110 Upholstered Sofa_1.jpg
Caption: ['```json\n{\n  "type": "sofa",\n  "style": "modern",\n  "color": "light grey",\n  "material": "leather",\n  "shape": "two-seater",\n  "size": "medium",\n  "details": "no visible decorative features"\n}\n```']
Processing time: 2.55 seconds
--------------------
Image: Clifford Upholstered Armchair_1.jpg
Caption: ['```json\n{\n  "type": "Chair",\n  "style": "Modern",\n  "color": "Multicolored",\n  "material": "Fabric",\n  "shape": "Modern",\n  "size": "Medium",\n  "details": "No specific decorative features"\n}\n```']
Processing time: 2.02 seconds
--------------------
Image: Offline Outdoor Lounge Chair_1.jpg
Caption: ['```json\n{\n  "type": "Chair",\n  "style": "Modern",\n  "color": "N/A",\n  "material": "Wood and metal",\n  "shape": "Lounge chair",\n  "size": "Medium",\n  "details": "No specific decorative features",\n  "condition": "N/A"\n}\n```']
Processing time: 2.30 seconds
--------------------
Image: orange_sofa.jpg
Caption: ['```

## Qwen VL2 AWQ 2b Experiments

In [1]:
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
import time
import os

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct-AWQ",
    torch_dtype=torch.float16,
     attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct-AWQ")

model.safetensors:  13%|#3        | 388M/2.95G [00:00<?, ?B/s]

`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46


generation_config.json:   0%|          | 0.00/249 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/594 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/4.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/392 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

In [3]:
image_folder = "/home/s464915/future-designer/experiments/images"
times = []

for filename in os.listdir(image_folder):
    if filename.endswith((".png", ".jpg", ".jpeg")):
        image_path = os.path.join(image_folder, filename)
        image = Image.open(image_path)

        conversation = [
             {
                "role": "system",
                "content": "You are a furniture expert. Analyze the image and provide a detailed description in JSON format."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "resized_height": 280,
                        "resized_width": 420
                    },
                    {
                        "type": "text", "text": """Describe the furniture in the image using the following JSON structure. 
                                                Use only one word be really specific. If any field is not applicable or cannot be determined, use "N/A".
                            {
                                "type": "Main furniture type (e.g., chair, table, sofa)",
                                "style": "Overall style (e.g., modern, traditional, rustic)",
                                "color": "Main color",
                                "material": "Primary material",
                                "shape": "General shape",
                                "size": "Size category (small, medium, large)",
                                "details": "Any decorative features"
                                "condition": "Apparent condition if relevant"
                            }
                    """},
                ],
            }
        ]
        # Preprocess the inputs
        text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
        # Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

        inputs = processor(
            text=[text_prompt], images=[image], padding=True, return_tensors="pt"
        )
        inputs = inputs.to("cuda")

        start_time = time.time()

        # Inference: Generation of the output
        output_ids = model.generate(**inputs, max_new_tokens=128)
        generated_ids = [
            output_ids[len(input_ids) :]
            for input_ids, output_ids in zip(inputs.input_ids, output_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        
        # End timing
        end_time = time.time()
        
        # Calculate and print the processing time
        processing_time = end_time - start_time
        times.append(processing_time)
        
        print(f"Image: {filename}")
        print(f"Caption: {output_text}")
        print(f"Processing time: {processing_time:.2f} seconds")
        print("--------------------")

print(f'Avg time of interface: {sum(times)/len(times):.2f} s')

Image: Bainton 110 Upholstered Sofa_1.jpg
Caption: ['```json\n{\n  "type": "Main furniture type (e.g., chair, table, sofa)",\n  "style": "Modern",\n  "color": "White",\n  "material": "Fabric",\n  "shape": "Modern",\n  "size": "Large",\n  "details": "No decorative features",\n  "condition": "Apparent condition"\n}\n```']
Processing time: 2.65 seconds
--------------------
Image: Clifford Upholstered Armchair_1.jpg
Caption: ['```json\n{\n  "type": "Chair",\n  "style": "Modern",\n  "color": "Multicolored",\n  "material": "Fabric",\n  "shape": "Modern",\n  "size": "Small",\n  "details": "Decorative features"\n}\n```']
Processing time: 2.10 seconds
--------------------
Image: Offline Outdoor Lounge Chair_1.jpg
Caption: ['```json\n{\n  "type": "Chair",\n  "style": "Modern",\n  "color": "Natural wood",\n  "material": "Teak",\n  "shape": "Lounge",\n  "size": "Large",\n  "details": "No decorative features"\n}\n```']
Processing time: 2.11 seconds
--------------------
Image: orange_sofa.jpg
Captio

## InternVL2 2B Experiments

In [1]:
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you want to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
path = 'OpenGVLab/InternVL2-2B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

conversation.py:   0%|          | 0.00/15.0k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2-2B:
- conversation.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2-2B:
- modeling_intern_vit.py
- conversation.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors:   0%|          | 0.00/4.41G [00:00<?, ?B/s]

InternLM2ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.


generation_config.json:   0%|          | 0.00/115 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/4.00k [00:00<?, ?B/s]

tokenization_internlm2.py:   0%|          | 0.00/8.79k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2-2B:
- tokenization_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


tokenizer.model:   0%|          | 0.00/1.48M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/179 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/844 [00:00<?, ?B/s]

User: <image>
Please describe the image in detail.
Assistant: The image features an armchair with a bold, striped upholstery. The fabric consists of horizontal stripes in multiple colors including blue, green, red, orange, and beige. The design is reminiscent of a chevron pattern, characteristic of classic American décor. The frame of the chair has dark wooden legs, providing a solid base that contrasts with the vibrant stripes. The chair is placed against a neutral, solid-colored background that highlights its design and colors. The overall aesthetic suggests a blend of traditional and modern styles, making it suitable for various interior design purposes.


In [19]:
import time
import os
image_folder = "/home/s464915/future-designer/experiments/images"

times = []

for filename in os.listdir(image_folder):
    if filename.endswith((".png", ".jpg", ".jpeg")):
        image_path = os.path.join(image_folder, filename)
        # set the max number of tiles in `max_num`
        pixel_values = load_image(image_path, max_num=12).to(torch.bfloat16).cuda()
        generation_config = dict(max_new_tokens=100, do_sample=True)

        # single-image multi-round conversation (单图多轮对话)
        question = '''<image>\nDescribe the furniture in the image using the following JSON structure. 
                                                        Use only one word be really specific. If any field is not applicable or cannot be determined, use "N/A".
                                    {
                                        "type": "Main furniture type (e.g., chair, table, sofa)",
                                        "style": "Overall style (e.g., modern, traditional, rustic)",
                                        "color": "Main color",
                                        "material": "Primary material",
                                        "shape": "General shape",
                                        "size": "Size category (small, medium, large)",
                                        "details": "Any decorative features"
                                        "condition": "Apparent condition if relevant"
                                    }'''

        start_time = time.time()
        response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)

        end_time = time.time()

        # Calculate and print the processing time
        processing_time = end_time - start_time
        times.append(processing_time)

        print(f"Processing time: {processing_time:.2f} seconds")

        print(f'User: {question}\nAssistant: {response}')

print(f'Avg time of interface: {sum(times)/len(times):.2f} s')

Processing time: 1.29 seconds
User: <image>
Describe the furniture in the image using the following JSON structure. 
                                                        Use only one word be really specific. If any field is not applicable or cannot be determined, use "N/A".
                                    {
                                        "type": "Main furniture type (e.g., chair, table, sofa)",
                                        "style": "Overall style (e.g., modern, traditional, rustic)",
                                        "color": "Main color",
                                        "material": "Primary material",
                                        "shape": "General shape",
                                        "size": "Size category (small, medium, large)",
                                        "details": "Any decorative features"
                                        "condition": "Apparent condition if relevant"
                                