# LLaVA Demo: Large Language and Vision Model

This notebook demonstrates the LLaVA (Large Language and Vision Assistant) model, presented at the NeurIPS 2023 conference. LLaVA combines image and text processing capabilities using an architecture consisting of a CLIP visual encoder, a projection layer, and the Vicuna language model.

## Key Components of LLaVA:
1. **CLIP-ViT-L/14 Visual Encoder** for converting images into vector representations
2. **Projection Layer** mapping visual embeddings into the language model's space
3. **Vicuna Language Model** (based on LLaMA), available with 7B or 13B parameters and responsible for text generation; this notebook loads the 7B variant
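
To make the data flow through these three components concrete, the sketch below computes how many visual tokens the language model receives for one image. It assumes the 336-pixel input resolution of the CLIP-ViT-L/14 checkpoint used by LLaVA-1.5 and the 4096-dimensional hidden size of Vicuna-7B. The function name `visual_token_shape` is made up for this illustration and is not part of the LLaVA code base.

```python
def visual_token_shape(image_size=336, patch_size=14, lm_hidden_size=4096):
    """Return (num_visual_tokens, lm_hidden_size) after the projection layer.

    Assumed defaults: 336 px input (LLaVA-1.5), 14 px patches (ViT-L/14),
    4096-dimensional token embeddings (Vicuna-7B).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    # The visual encoder emits one feature vector per image patch, and the
    # projection layer maps each of them to the language model's embedding size.
    num_visual_tokens = patches_per_side ** 2
    return num_visual_tokens, lm_hidden_size


print(visual_token_shape())  # (576, 4096)
```

Under these assumptions, every image occupies 576 positions of the language model's context window in addition to the text prompt.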

Let's install the necessary dependencies and load the model to demonstrate its capabilities.

## 1. Installing Required Libraries

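First, install the packages that the notebook uses directly.

In [None]:
# transformers is installed from PyPI only. The LLaVA package installed in the
# next section declares its own transformers requirement, so an additional
# install from the GitHub main branch would be redundant and may conflict with it.
!pip install torch torchvision transformers accelerate sentencepiece protobuf==3.20.3 gradio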

## 2. Cloning the LLaVA Repository

Clone the official LLaVA repository and install it in editable mode so that the `llava` package can be imported.

In [None]:
!git clone https://github.com/haotian-liu/LLaVA.git
%cd LLaVA
!pip install -e .
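
Before loading the model, it can help to confirm that the installation steps succeeded. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` and the list of package names are assumptions chosen for this notebook.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages this notebook imports later (assumed list, adjust as needed).
required = ["torch", "transformers", "gradio", "llava"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If `llava` is reported as missing in Colab right after installation, restarting the runtime and re-running the check is a common remedy.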

## 3. Loading the LLaVA Model

Let's load the pre-trained LLaVA model from Hugging Face. We use LLaVA-1.5, an improved successor to the original model.

In [None]:
from io import BytesIO

import requests
import torch
from PIL import Image

from llava.conversation import conv_templates
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init

disable_torch_init()

# Load the LLaVA-1.5 7B model (smaller version for quick use in Colab).
# The loader returns the tokenizer, the model, the CLIP image processor
# (stored here as `processor`), and the maximum context length.
model_path = "liuhaotian/llava-v1.5-7b"
model_name = get_model_name_from_path(model_path)
tokenizer, model, processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=model_name
)

print("LLaVA model successfully loaded!")
print(f"Model name: {model_name}")
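
The choice of the 7B checkpoint is mostly a question of GPU memory. The rough estimate below covers the weights only and ignores activations and the key-value cache; the parameter counts are rounded, and `weight_memory_gib` is a hypothetical helper written for this notebook.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Rounded parameter counts; the vision encoder and projector add a little more.
for label, params in [("7B", 7e9), ("13B", 13e9)]:
    for bits in (16, 8, 4):
        print(f"{label} @ {bits}-bit: ~{weight_memory_gib(params, bits):.1f} GiB")
```

If the 16-bit weights do not fit on the available GPU, the loader in the LLaVA repository accepts `load_8bit=True` or `load_4bit=True` at the time of writing. Quantized loading trades some output quality for memory and is not used elsewhere in this notebook.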

## 4. Function for Image Processing and Response Generation

Let's create a function that will take an image and a question, and then generate a response using the LLaVA model.

In [None]:
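from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.mm_utils import process_images, tokenizer_image_token


def process_image_and_generate_response(image, prompt, temperature=0.2, max_new_tokens=512):
    """
    Processes an image and generates a text response based on the given question.

    Args:
        image: PIL image, image URL, or path to a local image file
        prompt: Text question or instruction
        temperature: Sampling temperature (0.0-1.0); 0.0 selects greedy decoding
        max_new_tokens: Maximum number of new tokens for generation

    Returns:
        Text response from the model
    """
    # Accept a URL, a local path, or an already loaded PIL image.
    if isinstance(image, str) and image.startswith(('http://', 'https://')):
        response = requests.get(image, timeout=30)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content)).convert('RGB')
    elif not isinstance(image, Image.Image):
        image = Image.open(image).convert('RGB')
    else:
        image = image.convert('RGB')

    # LLaVA-1.5 checkpoints use the 'llava_v1' conversation template. The user
    # message must contain the image placeholder so the model knows where to
    # insert the visual embeddings.
    conv = conv_templates['llava_v1'].copy()
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + '\n' + prompt)
    conv.append_message(conv.roles[1], None)
    full_prompt = conv.get_prompt()

    # Preprocess the image with the CLIP image processor returned by the loader.
    # float16 matches the loader's default weight precision.
    image_tensor = process_images([image], processor, model.config)
    image_tensor = image_tensor.to(model.device, dtype=torch.float16)

    # Tokenize the prompt, replacing the placeholder with IMAGE_TOKEN_INDEX.
    input_ids = tokenizer_image_token(
        full_prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt'
    ).unsqueeze(0).to(model.device)

    # Sampling requires a strictly positive temperature, so fall back to
    # greedy decoding when the temperature is 0.
    gen_kwargs = {'do_sample': False}
    if temperature > 0:
        gen_kwargs = {'do_sample': True, 'temperature': temperature}

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor,
            image_sizes=[image.size],
            max_new_tokens=max_new_tokens,
            use_cache=True,
            **gen_kwargs
        )

    # This assumes the current main branch of the LLaVA repository, whose
    # generate() returns only the newly generated tokens, so the prompt is
    # not sliced off before decoding.
    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    return outputs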
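
The conversation template turns the question into a single prompt string before tokenization, and LLaVA marks the position of the image inside that string with an `<image>` placeholder. The sketch below approximates the single-turn prompt of a Vicuna-v1-style template in plain Python. The system prompt and separators are assumptions based on the public template; the authoritative definitions live in `llava/conversation.py`. `build_v1_style_prompt` is a hypothetical helper, not a LLaVA function.

```python
# Assumed system prompt and placeholder; check llava/conversation.py and
# llava/constants.py for the values used by the installed version.
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)
IMAGE_PLACEHOLDER = "<image>"


def build_v1_style_prompt(question, system=SYSTEM_PROMPT):
    """Approximate the single-turn prompt string of a Vicuna-v1-style template."""
    user_turn = f"USER: {IMAGE_PLACEHOLDER}\n{question}"
    # The assistant turn is left open so the model continues from this point.
    return f"{system} {user_turn} ASSISTANT:"


print(build_v1_style_prompt("What is shown in this image?"))
```

Printing the real prompt returned by `conv.get_prompt()` next to this approximation is a quick way to verify which template a checkpoint expects.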

## 5. Creating an Interactive Interface with Gradio

Let's create a simple web interface to interact with the LLaVA model using the Gradio library.

In [None]:
import gradio as gr

def llava_interface(image, prompt, temperature=0.2, max_tokens=512):
    if image is None:
        return "Please upload an image."
    if not prompt:
        return "Please enter a question or instruction."
    
    try:
        response = process_image_and_generate_response(
            image, prompt, temperature=temperature, max_new_tokens=max_tokens
        )
        return response
    except Exception as e:
        return f"An error occurred: {str(e)}"

demo = gr.Interface(
    fn=llava_interface,
    inputs=[
        gr.Image(type="pil", label="Upload an Image"),
        gr.Textbox(lines=2, placeholder="Enter a question or instruction...", label="Question/Instruction"),
        gr.Slider(minimum=0.0, maximum=1.0, value=0.2, step=0.1, label="Temperature"),
        gr.Slider(minimum=64, maximum=1024, value=512, step=64, label="Maximum Number of Tokens")
    ],
    outputs=gr.Textbox(label="Model Response"),
    title="LLaVA Demo: Large Language and Vision Model",
    description="Upload an image and ask a question or provide an instruction. The LLaVA model will analyze the image and generate a text response."
)

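# share=True creates a temporary public link. debug=True keeps this cell
# running and prints errors in the notebook, so stop the cell before
# executing the sections below.
demo.launch(share=True, debug=True)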
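
The temperature slider above starts at 0.0, but Hugging Face Transformers rejects sampling with a temperature of 0. One way to handle this, sketched below, is to switch to greedy decoding whenever the temperature is not positive. `sampling_kwargs` is a hypothetical helper; only the keyword names `do_sample`, `temperature`, and `max_new_tokens` come from the `generate` API.

```python
def sampling_kwargs(temperature, max_new_tokens=512):
    """Build keyword arguments for `model.generate` from the UI settings."""
    kwargs = {"max_new_tokens": int(max_new_tokens)}
    if temperature > 0:
        # Sampling is only valid with a strictly positive temperature.
        kwargs.update(do_sample=True, temperature=float(temperature))
    else:
        # Temperature 0 is interpreted as deterministic greedy decoding.
        kwargs["do_sample"] = False
    return kwargs


print(sampling_kwargs(0.2))
print(sampling_kwargs(0.0, max_new_tokens=128))
```

The returned dictionary can be unpacked into the generation call, for example `model.generate(input_ids, images=image_tensor, **sampling_kwargs(temperature))`.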

## 6. LLaVA Usage Examples

Let's look at a few examples of using the LLaVA model with different types of images and prompts.

In [None]:
image_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
example_image = Image.open(BytesIO(response.content)).convert('RGB')
example_image.save('example_image.jpg')
example_image  # display the image in the notebook

In [None]:
prompt = "Describe in detail what you see in this image."
response = process_image_and_generate_response(example_image, prompt)
print(response)

In [None]:
prompt = "What is to the left of the person in the image? Describe this object."
response = process_image_and_generate_response(example_image, prompt)
print(response)

In [None]:
prompt = "Guess the season depicted in the photo and explain why you think so."
response = process_image_and_generate_response(example_image, prompt)
print(response)
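
The three cells above repeat the same call with different prompts. A small loop can collect the answers in one place, as in the sketch below. `run_prompts` and `fake_generate` are hypothetical names, and the stand-in generator only exists so the example runs without a GPU.

```python
def run_prompts(image, prompts, generate_fn):
    """Apply `generate_fn(image, prompt)` to each prompt and collect the answers."""
    results = {}
    for prompt in prompts:
        results[prompt] = generate_fn(image, prompt)
    return results


# Stand-in for process_image_and_generate_response (no model required).
def fake_generate(image, prompt):
    return f"[answer to: {prompt}]"


answers = run_prompts(
    "example_image.jpg",
    ["Describe the image.", "What season is it?"],
    fake_generate,
)
for question, answer in answers.items():
    print(f"Q: {question}\nA: {answer}\n")
```

In the notebook, passing `process_image_and_generate_response` as `generate_fn` and `example_image` as the image runs the real model on every prompt.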

## 7. Uploading Your Own Image

You can upload your own image and ask a question or provide an instruction. Use the interactive interface created above or the following code, which relies on the `google.colab` module and therefore runs only in Google Colab:

In [None]:
from google.colab import files
import matplotlib.pyplot as plt

uploaded = files.upload()
image_path = list(uploaded.keys())[0]
user_image = Image.open(image_path).convert('RGB')

plt.figure(figsize=(10, 10))
plt.imshow(user_image)
plt.axis('off')
plt.show()

user_prompt = input("Enter a question or instruction: ")

response = process_image_and_generate_response(user_image, user_prompt)
print("\nModel Response:")
print(response)

## 8. Conclusion

In this notebook, we demonstrated the LLaVA model, which combines image and text processing capabilities. The model is capable of:

1. Describing image content in detail
2. Answering questions about the spatial relationships of objects
3. Performing complex reasoning based on visual information
4. Following user instructions when analyzing images

LLaVA represents a significant step in creating versatile multimodal assistants capable of understanding both textual and visual information.

### Links and Resources

- [Official LLaVA Repository on GitHub](https://github.com/haotian-liu/LLaVA)
- [LLaVA Paper at NeurIPS 2023](https://arxiv.org/abs/2304.08485)
- [LLaVA-1.5 7B Model on Hugging Face](https://huggingface.co/liuhaotian/llava-v1.5-7b)