<a href="https://colab.research.google.com/github/VridhiJ/finetune-llma4/blob/main/Copy_of_JanusFlow_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# JanusFlow Demo
This Google Colab notebook walks through a simple multimodal example in which the model receives an image and a text prompt. The original paper is available at https://arxiv.org/abs/2411.07975.


## Step 1: Install Dependencies
We need to install the JanusFlow code from its GitHub repository, along with `diffusers[torch]` (which pulls in torch for the model) and requests to fetch images from URLs. Be sure to run this notebook on a GPU runtime (the T4 is available for free in Colab) so that CUDA is enabled.
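
Before running the heavier cells, it can help to confirm that the runtime actually exposes a GPU. The sketch below is not part of the original notebook: `gpu_runtime_report` is a hypothetical helper name, and looking for `nvidia-smi` on the PATH is only a heuristic. If torch is already installed, it also reports `torch.cuda.is_available()`.

```python
import shutil


def gpu_runtime_report():
    """Collect a few hints about whether a CUDA runtime is usable."""
    report = {
        # nvidia-smi ships with the NVIDIA driver, so finding it on PATH
        # suggests (but does not prove) that a GPU is attached.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "torch_installed": False,
        "cuda_available": None,  # stays None when torch cannot be imported
    }
    try:
        import torch
    except ImportError:
        return report
    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    return report


print(gpu_runtime_report())
```

If `cuda_available` comes back `False` in Colab, switching the runtime type to a GPU and restarting the session is the usual remedy.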

In [2]:
!pip install git+https://github.com/deepseek-ai/Janus.git
!pip install diffusers[torch]
!pip install requests

Collecting git+https://github.com/deepseek-ai/Janus.git
  Cloning https://github.com/deepseek-ai/Janus.git to /tmp/pip-req-build-euhjm324
  Running command git clone --filter=blob:none --quiet https://github.com/deepseek-ai/Janus.git /tmp/pip-req-build-euhjm324
  Resolved https://github.com/deepseek-ai/Janus.git to commit 146668eafecabdc6dd9f36206281d01df6a96c05
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting attrdict (from janus==1.0.0)
  Downloading attrdict-2.0.1-py2.py3-none-any.whl.metadata (6.7 kB)
Downloading attrdict-2.0.1-py2.py3-none-any.whl (9.9 kB)
Building wheels for collected packages: janus
  Building wheel for janus (pyproject.toml) ... done
  Created wheel for janus: filename=janus-1.0.0-py3-none-any.whl size=81083 sha256=050910ff7df17a102c755a0303d2127192c5301d40592ff1180089620ac4c8c3
  Stored in directory: /tmp/pip-ephem-wh

We upgrade torchvision here to resolve dependency conflicts that arose with torch.

In [3]:
!pip install --upgrade torchvision

Collecting torch==2.5.1 (from torchvision)
  Downloading torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.5.1->torchvision)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.5.1->torchvision)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch==2.5.1->torchvision)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch==2.5.1->torchvision)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch==2.5.1->torchvision)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-c
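
To see whether the upgrade left torch and torchvision in a compatible state, one option is to print the installed versions. This is a minimal sketch using only the standard library. `installed_versions` is a hypothetical helper, and the package list is an assumption based on the installs above.

```python
from importlib import metadata


def installed_versions(packages=("torch", "torchvision", "diffusers")):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions())
```

After an upgrade like this one, restarting the Colab session before importing torch helps ensure the new versions are the ones actually loaded.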

## Step 2: Import Libraries

In [4]:
import torch
from janus.janusflow.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
import requests
from PIL import Image
from io import BytesIO

Python version is above 3.10, patching the collections module.

## Step 3: Load Model and Processor

In [5]:
model_path = "deepseek-ai/JanusFlow-1.3B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = MultiModalityCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

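# Download an image over HTTP and return it as an RGB PIL image
def load_image_from_url(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    img = Image.open(BytesIO(response.content)).convert("RGB")
    return img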

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


preprocessor_config.json:   0%|          | 0.00/439 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.94k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/4.61M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/525 [00:00<?, ?B/s]

processor_config.json:   0%|          | 0.00/365 [00:00<?, ?B/s]

Some kwargs in processor config are unused and will not have any effect: image_start_tag, image_gen_tag, mask_prompt, image_end_tag, image_tag, add_special_token, ignore_id, sft_format, num_image_tokens. 
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


config.json:   0%|          | 0.00/1.60k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/4.09G [00:00<?, ?B/s]
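
The `load_image_from_url` helper above hands whatever bytes the server returns straight to PIL. If a URL serves an HTML error page instead of an image, the failure can be confusing. The sketch below shows one way to check the leading bytes first. `sniff_image_format` is a hypothetical helper, not part of JanusFlow or PIL, and it only recognizes a few common formats.

```python
def sniff_image_format(data):
    """Guess an image format from its leading bytes, or return None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # WebP files are RIFF containers with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


# Example: an HTML error page is not recognized as an image.
print(sniff_image_format(b"<html><body>Not found</body></html>"))
```

Inside `load_image_from_url`, this could be called on `response.content`, raising a `ValueError` when the result is `None`.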

## Step 4: Prepare Inputs

In [6]:
# Example picture of a dog
image_url = "https://images.unsplash.com/photo-1632351459705-22a52c7a3d1d"

# The task is a simple QA demo: ask the model to describe the picture
conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nDescribe the image in detail.",
        "images": [image_url],
    },
    {"role": "Assistant", "content": ""},
]

# Loading image from URL
pil_images = [load_image_from_url(image_url)]

# Processing inputs
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to(vl_gpt.device)
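
Each `<image_placeholder>` tag in the conversation marks where an image is spliced into the prompt, so it seems reasonable to assume the processor expects one image per tag. The sketch below checks that assumption before calling the processor. `check_conversation` is a hypothetical helper, and the one-to-one rule is an assumption rather than documented JanusFlow behavior.

```python
def check_conversation(conversation, images, tag="<image_placeholder>"):
    """Return the placeholder count, raising if it differs from len(images)."""
    n_tags = sum(turn.get("content", "").count(tag) for turn in conversation)
    if n_tags != len(images):
        raise ValueError(
            f"found {n_tags} placeholder(s) but {len(images)} image(s)"
        )
    return n_tags


example_conversation = [
    {"role": "User", "content": "<image_placeholder>\nDescribe the image in detail."},
    {"role": "Assistant", "content": ""},
]
# A string stands in for the PIL image here; only the count matters.
print(check_conversation(example_conversation, ["dog.jpg"]))
```

In the notebook, the call would be `check_conversation(conversation, pil_images)` just before `vl_chat_processor(...)`.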

## Step 5: Generate Output

In [7]:
# Generate a response with the model's language model component
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

# Decoding and printing answer
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"User: {conversation[0]['content']}")
print(f"Assistant: {answer}")

User: <image_placeholder>
Describe the image in detail.
Assistant: A brown and white dog with a brown and white coat is holding a stick in its mouth, standing on a dirt ground with a blurred background.
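
The `generate` call above uses `do_sample=False`, which means greedy decoding: at each step the highest-scoring next token is chosen, until the end-of-sequence token appears or `max_new_tokens` is reached. The toy sketch below illustrates that loop with a hand-written score table. `greedy_decode` and `TOY_SCORES` are hypothetical and have nothing to do with the real model's vocabulary or scores.

```python
def greedy_decode(next_token_scores, start_token, eos_token, max_new_tokens):
    """Repeatedly pick the highest-scoring next token (no sampling)."""
    tokens = []
    current = start_token
    for _ in range(max_new_tokens):
        scores = next_token_scores[current]
        current = max(scores, key=scores.get)  # argmax over candidate tokens
        if current == eos_token:
            break  # stop at end-of-sequence; this toy version drops the eos token
        tokens.append(current)
    return tokens


# Toy "model": maps the previous token to scores for the next token.
TOY_SCORES = {
    "<bos>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.7, "cat": 0.3},
    "dog": {"<eos>": 0.9, "a": 0.1},
}

print(greedy_decode(TOY_SCORES, "<bos>", "<eos>", max_new_tokens=512))
# -> ['a', 'dog']
```

Because no sampling is involved, rerunning the cell with the same image and prompt should generally give the same answer.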
