<center>
<img src="https://supportvectors.ai/logo-poster-transparent.png" width="400px" style="opacity:0.7">
</center>

In [None]:
%run supportvectors-common.ipynb

# **DeepSeek-VL Demo**

In this notebook, we use a DeepSeek-VL model to describe images. To learn more about DeepSeek-VL, see the [DeepSeek-VL GitHub repository](https://github.com/deepseek-ai/DeepSeek-VL) and the paper [DeepSeek-VL: Towards Real-World Vision-Language Understanding](https://arxiv.org/pdf/2403.05525).

## **Setup**

Follow these steps to set up the DeepSeek-VL environment.

```bash
    # Open a new terminal with no Python environment active.
    # 1 Clone the repository (this creates a new directory called DeepSeek-VL)
    git clone https://github.com/deepseek-ai/DeepSeek-VL

    # 2 Change directory to the cloned repo
    cd DeepSeek-VL

    # 3 Create a virtual environment with uv
    uv sync --upgrade

    # If step 3 fails because of the sentencepiece dependency, update the
    # sentencepiece version in DeepSeek-VL's pyproject.toml as shown below, then rerun step 3.
        dependencies = [
            ...,
            "sentencepiece>=0.1.96",
            ...
        ]


        [project.optional-dependencies]
        gradio = [
            ...
            "SentencePiece>=0.1.96"
        ]



    # 4 Now return to the vision_language_understanding directory and activate the project environment
    cd /path/to/vision_language_understanding
    source .venv/bin/activate

    # 5 Add the cloned directory as a dependency
    uv pip install -e "deepseek-vl @ ../DeepSeek-VL"


```
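
A quick way to confirm the setup worked is to check that the packages imported later in this notebook can be found in the active environment. The snippet below is a minimal sketch and not part of the DeepSeek-VL repository; the helper name `is_installed` is hypothetical, and it only checks top-level module names.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located in the current environment."""
    # find_spec returns None when no such top-level module is found.
    return importlib.util.find_spec(module_name) is not None


# Module names taken from the imports used in the inference cell of this notebook.
for name in ("torch", "transformers", "deepseek_vl"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If `deepseek_vl` is reported as missing, step 5 most likely ran in a different environment than the one this notebook's kernel uses.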

In [None]:
import torch
from transformers import AutoModelForCausalLM

from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images


# specify the path to the model
model_path = "deepseek-ai/deepseek-vl-1.3b-base"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>Describe each stage of this image.",
        "images": ["../images/mlops-phasen.jpg"]
    },
    {
        "role": "Assistant",
        "content": ""
    }
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
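
The `conversation` list above follows a fixed pattern: a `User` turn whose content starts with `<image_placeholder>` and lists the image paths, followed by an empty `Assistant` turn for the model to complete. When trying several prompts or images, a small helper can build that structure. This is a sketch only: `build_conversation` is a hypothetical name, and prepending one placeholder per image is an assumption about how to extend the single-image case shown above.

```python
from typing import Dict, List, Union

# Placeholder token as used in the conversation above.
IMAGE_PLACEHOLDER = "<image_placeholder>"

Message = Dict[str, Union[str, List[str]]]


def build_conversation(prompt: str, image_paths: List[str]) -> List[Message]:
    """Build a single-turn conversation in the same shape as the one above.

    Assumption: one placeholder per image is prepended to the prompt.
    """
    placeholders = IMAGE_PLACEHOLDER * len(image_paths)
    return [
        {
            "role": "User",
            "content": f"{placeholders}{prompt}",
            "images": list(image_paths),
        },
        # Empty assistant turn: the model generates its content.
        {"role": "Assistant", "content": ""},
    ]


# Reproduces the conversation used in the cell above.
example_conversation = build_conversation(
    "Describe each stage of this image.",
    ["../images/mlops-phasen.jpg"],
)
print(example_conversation[0]["content"])
```

The returned list can be passed to `load_pil_images` and `vl_chat_processor` in place of the hand-written `conversation`.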


In [3]:
from svlearn_vlu import config
from svlearn_vlu.utils.display_captions_table import CaptionDisplay

image_dir = config['datasets']["unsplash"]
captions_file = f"{image_dir}/captions.json"

caption_display = CaptionDisplay(image_dir, captions_file)
caption_display.display(num_samples=25, model_names=['deepseek'])