# notes 2/1

W is a very simple (linear) projection that takes vision embeddings into the language embedding space, but what other options are there? an entire network?

the LLaVA authors cite "faster iteration of data centric experiments" as the reason for keeping it simple: a lightweight projection let them iterate on and refine the instruction-following dataset that they produced
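
To make the "other options" question concrete, here is a minimal PyTorch sketch comparing a single linear W with a two-layer MLP projector (the style LLaVA-1.5 moved to). The dimensions 768 and 1024 are assumptions taken from the SigLIP2 and LFM2 printouts further down, and the names `linear_proj`, `mlp_proj` and `count_params` are made up for illustration.

```python
import torch
from torch import nn

VISION_DIM = 768   # assumed: SigLIP2 base hidden size
TEXT_DIM = 1024    # assumed: LFM2-350M hidden size

# Option 1: a single linear map W, as in the original LLaVA.
linear_proj = nn.Linear(VISION_DIM, TEXT_DIM)

# Option 2: a small two-layer MLP with a GELU in between.
mlp_proj = nn.Sequential(
    nn.Linear(VISION_DIM, TEXT_DIM),
    nn.GELU(),
    nn.Linear(TEXT_DIM, TEXT_DIM),
)


def count_params(module: nn.Module) -> int:
    """Total number of trainable and non-trainable parameters."""
    return sum(p.numel() for p in module.parameters())


patch_features = torch.randn(1, 64, VISION_DIM)  # (batch, patches, vision dim)
linear_out = linear_proj(patch_features)
mlp_out = mlp_proj(patch_features)

print(linear_out.shape, count_params(linear_proj))
print(mlp_out.shape, count_params(mlp_proj))
```

Both options produce the same output shape; the MLP just has a bit more capacity (and more parameters to train).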

questions:
- does LLaVA also split images into patches?
- what if the chosen vision embedding space and text embedding space naturally happened to be the same? would any learning still need to happen? why?
- are we redefining everything from scratch (doubtful), or do we import the llava model and tweak it? are there places where we can specify the llava vision tower (to be siglip) and the language model?

# dims of language and vision

### explore siglip2

In [5]:
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

In [None]:
# this is taken directly from the HF SigLIP2 page.

# load the model and processor
ckpt = "google/siglip2-base-patch32-256"
# device_map is not relevant on CPU.
# we use AutoModel (not AutoModelFor...) because it directly outputs the hidden states from the encoder?
siglip_encoder = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
siglip_processor = AutoProcessor.from_pretrained(ckpt)

# load the image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
# move the inputs to the model's device (needed when device_map places the model on a GPU)
inputs = siglip_processor(images=[image], return_tensors="pt").to(siglip_encoder.device)

# run inference
with torch.no_grad():
    image_embeddings = siglip_encoder.get_image_features(**inputs)

print(image_embeddings.shape)

torch.Size([1, 768])


notes on nn.Embedding (from docs):
- A simple lookup table that stores embeddings of a fixed dictionary and size.
    - dictionary size 256,000? (matches `token_embedding`: Embedding(256000, 768) in the SigLIP2 text model printed below)
- This module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings.
- Input: (∗), IntTensor or LongTensor of arbitrary shape containing the indices to extract
- Output: (∗,H), where * is the input shape and H=embedding_dim
- https://rahullokurte.com/understanding-token-and-positional-embeddings-in-transformers 
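
A tiny sketch of the lookup behaviour described in these notes. The sizes (10 tokens, 4 dims) are small stand-ins chosen for illustration; the SigLIP2 text model uses `Embedding(256000, 768)`.

```python
import torch
from torch import nn

# Small stand-in sizes so the table is easy to inspect.
vocab_size, embedding_dim = 10, 4
embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([[1, 2, 3], [4, 5, 6]])  # input shape (2, 3), integer indices
vectors = embedding(token_ids)                     # output shape (2, 3, 4) = (*, H)

# The lookup is just row indexing into the weight matrix.
same_as_indexing = torch.equal(vectors, embedding.weight[token_ids])
print(vectors.shape, same_as_indexing)
```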

In [None]:
"""
To summarize: SigLIP/CLIP models in general have an image encoder and a text encoder.
Vision Encoder: CLIP paper states that it chooses either (1) ViT or (2) ResNet-50
Text Encoder: a modified Transformer, where "the text sequence is bracketed with [SOS] and [EOS] tokens and the activations of the
              highest layer of the transformer at the [EOS] token are treated as the feature representation of the text",
              which is then normalized
              and then linearly projected into the multimodal embedding space.


SiglipModel: notes
- text model
    - embeddings: specifies positional + token (semantic) embeddings
        - Token Embeddings are (Vocab size, dims)
        - positional embeddings are (Sequence Length, dims)
    - encoder: specifies the model/encoder structure
        - 12 layers
        - out dimension: 768 (out_proj)
- vision model
    - vision embeddings: patch embeddings + positional embeddings
        - (patch_embedding): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), padding=valid)
            - 3 in channels, 768 out channels
        - (position_embedding): Embedding(64, 768)
            - 64 positions = (256 / 32)^2 patches for a 256x256 input
"""
# print(siglip_encoder)
# ------- output of print(siglip_encoder)-------
"""
SiglipModel(
  (text_model): SiglipTextTransformer(
    (embeddings): SiglipTextEmbeddings(
      (token_embedding): Embedding(256000, 768) # vocab size, dims
      (position_embedding): Embedding(64, 768) # sequence length, dims
    )
    (encoder): SiglipEncoder(
      (layers): ModuleList(
        (0-11): 12 x SiglipEncoderLayer(
          (layer_norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (self_attn): SiglipAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (layer_norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): SiglipMLP(
            (activation_fn): GELUTanh()
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
          )
        )
      )
    )
    (final_layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
    (head): Linear(in_features=768, out_features=768, bias=True)
  )
  (vision_model): SiglipVisionTransformer(
    (embeddings): SiglipVisionEmbeddings(
      (patch_embedding): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), padding=valid)
      (position_embedding): Embedding(64, 768)
    )
    (encoder): SiglipEncoder(
      (layers): ModuleList(
        (0-11): 12 x SiglipEncoderLayer(
          (layer_norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (self_attn): SiglipAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (layer_norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): SiglipMLP(
            (activation_fn): GELUTanh()
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
          )
        )
      )
    )
    (post_layernorm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
    (head): SiglipMultiheadAttentionPoolingHead(
      (attention): MultiheadAttention(
        (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
      )
      (layernorm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (mlp): SiglipMLP(
        (activation_fn): GELUTanh()
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (fc2): Linear(in_features=3072, out_features=768, bias=True)
      )
    )
  )
)

"""

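
The `patch_embedding` and `position_embedding` entries in the printout can be reproduced in a few lines. This is only a sketch: a random tensor stands in for a real image, the layers are freshly initialized rather than loaded from the checkpoint, and the real forward pass also runs the 12 encoder layers and the pooling head.

```python
import torch
from torch import nn

image_size, patch_size, hidden = 256, 32, 768

# Same shapes as the printed vision embeddings:
# Conv2d(3, 768, kernel_size=32, stride=32) and Embedding(64, 768).
patch_embedding = nn.Conv2d(3, hidden, kernel_size=patch_size, stride=patch_size)
position_embedding = nn.Embedding((image_size // patch_size) ** 2, hidden)

pixel_values = torch.randn(1, 3, image_size, image_size)  # random stand-in image

grid = patch_embedding(pixel_values)       # (1, 768, 8, 8): one vector per 32x32 patch
tokens = grid.flatten(2).transpose(1, 2)   # (1, 64, 768): a sequence of patch tokens
position_ids = torch.arange(tokens.shape[1])
tokens = tokens + position_embedding(position_ids)  # broadcasts over the batch

print(grid.shape, tokens.shape)
```

So the vision tower itself does the patching: a stride-32 convolution turns a 256x256 image into an 8x8 grid, i.e. 64 tokens.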

In [11]:
# print(siglip_processor)

### explore LFM2

In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "LiquidAI/LFM2-350M"
lfm2 = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",  # `torch_dtype` is deprecated in favor of `dtype`
#    attn_implementation="flash_attention_2" <- uncomment on compatible GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)


In [13]:
# Generate answer
prompt = "What is C. elegans?"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(lfm2.device)

output = lfm2.generate(
    input_ids,
    do_sample=True,
    temperature=0.3,
    min_p=0.15,
    repetition_penalty=1.05,
    max_new_tokens=512,
)

print(tokenizer.decode(output[0], skip_special_tokens=False))


<|startoftext|><|im_start|>user
What is C. elegans?<|im_end|>
<|im_start|>assistant
C. elegans, commonly known as the razorback slug or zebra mussel, is a small, transparent, and fast-moving nematode worm (roundworm) species from the genus C. elegans. It was first discovered in 1856 by German entomologist Hans Christian Ørsted and is named after Danish mathematician Carl Friedrich Wilhelm Emanuel Elyseus E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E. E

In [17]:
print(lfm2)

""" 
Since the input (embedding) dimension of the lfm2 model is 1024, perhaps that is the dimension the projection W needs to map the 768-dim siglip features to
"""

Lfm2ForCausalLM(
  (model): Lfm2Model(
    (embed_tokens): Embedding(65536, 1024, padding_idx=0)
    (layers): ModuleList(
      (0-1): 2 x Lfm2DecoderLayer(
        (conv): Lfm2ShortConv(
          (conv): Conv1d(1024, 1024, kernel_size=(3,), stride=(1,), padding=(2,), groups=1024, bias=False)
          (in_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (feed_forward): Lfm2MLP(
          (w1): Linear(in_features=1024, out_features=4608, bias=False)
          (w3): Linear(in_features=1024, out_features=4608, bias=False)
          (w2): Linear(in_features=4608, out_features=1024, bias=False)
        )
        (operator_norm): Lfm2RMSNorm((1024,), eps=1e-05)
        (ffn_norm): Lfm2RMSNorm((1024,), eps=1e-05)
      )
      (2): Lfm2DecoderLayer(
        (self_attn): Lfm2Attention(
          (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (k_proj): Li

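
A rough sketch of why the 1024 matters: once the vision features are projected to the language model's hidden size, they can sit in the same sequence as the text token embeddings. Everything here is a stand-in (random features, a shrunk vocabulary, a hypothetical `projector`), and simply prepending the image tokens is a simplification; where they actually go depends on the prompt template.

```python
import torch
from torch import nn

vision_dim, text_dim = 768, 1024   # from the SigLIP2 and LFM2 printouts
vocab_size = 1000                  # shrunk stand-in; LFM2 prints Embedding(65536, 1024)

projector = nn.Linear(vision_dim, text_dim)        # hypothetical W
embed_tokens = nn.Embedding(vocab_size, text_dim)  # stand-in for lfm2's embed_tokens

patch_features = torch.randn(1, 64, vision_dim)    # stand-in vision encoder output
input_ids = torch.randint(0, vocab_size, (1, 12))  # stand-in tokenized prompt

image_tokens = projector(patch_features)           # (1, 64, 1024)
text_tokens = embed_tokens(input_ids)              # (1, 12, 1024)

# Both live in the same 1024-dim space now, so they can be concatenated along the sequence axis.
inputs_embeds = torch.cat([image_tokens, text_tokens], dim=1)  # (1, 76, 1024)
print(inputs_embeds.shape)
```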

### explore llava

# structure

1. dataset loading - do we define a custom class?
    - also note that the images have to be joined with the dataset records; judging from the link to the instruction-following (IF) dataset, at least, they are not bundled with it.
2. dataloader
3. define model architecture


input: image + text prompt (question)
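
A possible shape for step 1, as a sketch only. The record layout (`"image"` plus a `"conversations"` list with `"from"`/`"value"` entries) is an assumption about how the instruction-following data is stored, and `ImageInstructionDataset` is a made-up name. The demo builds one image and one record on the fly so nothing has to be downloaded.

```python
import tempfile
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class ImageInstructionDataset(Dataset):
    """Pairs each instruction record with its image file (hypothetical layout)."""

    def __init__(self, records, image_root):
        self.records = records
        self.image_root = Path(image_root)

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record = self.records[idx]
        # convert() loads the pixel data, so the file can be closed right away
        with Image.open(self.image_root / record["image"]) as img:
            image = img.convert("RGB")
        question = record["conversations"][0]["value"]
        answer = record["conversations"][1]["value"]
        return {"image": image, "question": question, "answer": answer}


# Tiny self-contained demo with one generated image and one made-up record.
with tempfile.TemporaryDirectory() as tmp:
    Image.new("RGB", (64, 48), color="red").save(Path(tmp) / "demo.jpg")
    records = [{
        "image": "demo.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nWhat color is this?"},
            {"from": "gpt", "value": "Red."},
        ],
    }]
    dataset = ImageInstructionDataset(records, tmp)
    sample = dataset[0]

print(len(dataset), sample["image"].size, sample["question"])
```

For step 2, the default DataLoader collation cannot batch PIL images, so a custom `collate_fn` that runs the image processor and tokenizer would likely be needed.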

# examine llava docs

Running the example code from https://huggingface.co/docs/transformers/en/model_doc/llava to see what it does


In [None]:
# https://huggingface.co/docs/transformers/en/model_doc/llava
# the below code snippet comes directly from the llava docs above

# LlavaConfig
from transformers import LlavaForConditionalGeneration, LlavaConfig, CLIPVisionConfig, LlamaConfig

# Initializing a CLIP-vision config
vision_config = CLIPVisionConfig()

# Initializing a Llama config
text_config = LlamaConfig()

# Initializing a Llava llava-1.5-7b style configuration
configuration = LlavaConfig(vision_config, text_config)

# Initializing a model from the llava-1.5-7b style configuration
model = LlavaForConditionalGeneration(configuration)

# Accessing the model configuration
configuration = model.config

^ this takes forever
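
A likely reason for the wait: the default `LlamaConfig` describes a 7B-scale model, and building the model from a config means allocating and randomly initializing all of those weights. The back-of-the-envelope count below assumes the transformers defaults for `LlamaConfig` (vocab 32000, hidden 4096, intermediate 11008, 32 layers, untied output head) and ignores the much smaller vision tower.

```python
# Default LlamaConfig values (assumed from the transformers defaults).
vocab_size = 32_000
hidden = 4_096
intermediate = 11_008
num_layers = 32

attention = 4 * hidden * hidden      # q, k, v, o projections (no biases)
mlp = 3 * hidden * intermediate      # gate, up, down projections
norms = 2 * hidden                   # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

embeddings = vocab_size * hidden     # input embedding table
lm_head = vocab_size * hidden        # output head (not tied by default)
final_norm = hidden

total = num_layers * per_layer + embeddings + lm_head + final_norm

print(f"{total:,} parameters")
print(f"~{total * 4 / 1e9:.1f} GB in float32")
```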

In [27]:
model.vision_tower

CLIPVisionModel(
  (vision_model): CLIPVisionTransformer(
    (embeddings): CLIPVisionEmbeddings(
      (patch_embedding): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
      (position_embedding): Embedding(50, 768)
    )
    (pre_layrnorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (encoder): CLIPEncoder(
      (layers): ModuleList(
        (0-11): 12 x CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
        

In [30]:
lfm2_config_dict = lfm2.config.to_dict()

In [31]:
lfm2_config_dict

{'vocab_size': 65536,
 'hidden_size': 1024,
 'num_hidden_layers': 16,
 'rope_theta': 1000000.0,
 'max_position_embeddings': 128000,
 'use_cache': True,
 'norm_eps': 1e-05,
 'initializer_range': 0.02,
 'num_attention_heads': 16,
 'num_key_value_heads': 8,
 'conv_bias': False,
 'conv_L_cache': 3,
 'intermediate_size': 6656,
 'block_multiple_of': 256,
 'block_ffn_dim_multiplier': 1.0,
 'block_auto_adjust_ff_dim': True,
 'layer_types': ['conv',
  'conv',
  'full_attention',
  'conv',
  'conv',
  'full_attention',
  'conv',
  'conv',
  'full_attention',
  'conv',
  'full_attention',
  'conv',
  'full_attention',
  'conv',
  'full_attention',
  'conv'],
 'return_dict': True,
 'output_hidden_states': False,
 'torchscript': False,
 'dtype': 'bfloat16',
 'pruned_heads': {},
 'tie_word_embeddings': True,
 'chunk_size_feed_forward': 0,
 'is_encoder_decoder': False,
 'is_decoder': False,
 'cross_attention_hidden_size': None,
 'add_cross_attention': False,
 'tie_encoder_decoder': False,
 'architect
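
A quick tally of the `layer_types` list from the config dict, to see how the conv and attention blocks are interleaved. The list is copied by hand from the output above.

```python
from collections import Counter

# Copied from the 'layer_types' entry of the config dict.
layer_types = [
    "conv", "conv", "full_attention",
    "conv", "conv", "full_attention",
    "conv", "conv", "full_attention",
    "conv", "full_attention",
    "conv", "full_attention",
    "conv", "full_attention",
    "conv",
]

counts = Counter(layer_types)
attention_layers = [i for i, kind in enumerate(layer_types) if kind == "full_attention"]

print(len(layer_types), dict(counts))
print("attention at layer indices:", attention_layers)
```

The total agrees with `num_hidden_layers: 16`, and the first attention index (2) agrees with the `(2): Lfm2DecoderLayer` entry holding `self_attn` in the `print(lfm2)` output.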

In [38]:
# Use lfm2 config and siglip

from transformers import LlavaForConditionalGeneration, LlavaConfig
from transformers import Siglip2VisionConfig, AutoConfig

# Initializing a siglip vision config (from_pretrained is a classmethod, so no instance is needed).
# NOTE: per the warning below, this checkpoint's vision config is of type `siglip_vision_model`,
# so `SiglipVisionConfig` may be the better match here.
vision_config = Siglip2VisionConfig.from_pretrained("google/siglip2-base-patch32-256")

# Initializing a lfm2 350m config hopefully
text_config = AutoConfig.from_pretrained("LiquidAI/LFM2-350M")

# Initializing a Llava-style configuration with the siglip vision tower and the lfm2 language model
configuration = LlavaConfig(vision_config, text_config)

# Initializing a (randomly initialized) model from that configuration
llava_model = LlavaForConditionalGeneration(configuration)

# Accessing the model configuration
configuration = llava_model.config

You are using a model of type siglip_vision_model to instantiate a model of type siglip2_vision_model. This is not supported for all configurations of models and can yield errors.


In [39]:
llava_model

LlavaForConditionalGeneration(
  (model): LlavaModel(
    (vision_tower): Siglip2VisionModel(
      (vision_model): Siglip2VisionTransformer(
        (embeddings): Siglip2VisionEmbeddings(
          (patch_embedding): Linear(in_features=3072, out_features=768, bias=True)
          (position_embedding): Embedding(256, 768)
        )
        (encoder): Siglip2Encoder(
          (layers): ModuleList(
            (0-11): 12 x Siglip2EncoderLayer(
              (layer_norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
              (self_attn): Siglip2Attention(
                (k_proj): Linear(in_features=768, out_features=768, bias=True)
                (v_proj): Linear(in_features=768, out_features=768, bias=True)
                (q_proj): Linear(in_features=768, out_features=768, bias=True)
                (out_proj): Linear(in_features=768, out_features=768, bias=True)
              )
              (layer_norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        

In [40]:
llava_model.model

LlavaModel(
  (vision_tower): Siglip2VisionModel(
    (vision_model): Siglip2VisionTransformer(
      (embeddings): Siglip2VisionEmbeddings(
        (patch_embedding): Linear(in_features=3072, out_features=768, bias=True)
        (position_embedding): Embedding(256, 768)
      )
      (encoder): Siglip2Encoder(
        (layers): ModuleList(
          (0-11): 12 x Siglip2EncoderLayer(
            (layer_norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
            (self_attn): Siglip2Attention(
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (layer_norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
            (mlp): Siglip2MLP(
              (activation_fn): GELUTanh()
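
Note that the patch embedding here is `Linear(in_features=3072, ...)` rather than the `Conv2d(3, 768, kernel_size=32, stride=32)` seen in the `SiglipModel` printout. 3072 = 3 * 32 * 32, which suggests this variant expects already-flattened patches. The sketch below (random weights and a random stand-in image, nothing loaded from the checkpoint) checks that the two formulations compute the same thing once the weights are shared.

```python
import torch
from torch import nn

patch, channels, hidden = 32, 3, 768
conv = nn.Conv2d(channels, hidden, kernel_size=patch, stride=patch)
linear = nn.Linear(channels * patch * patch, hidden)  # in_features = 3072

# Copy the conv kernels into the linear layer so both compute the same map.
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(hidden, -1))
    linear.bias.copy_(conv.bias)

pixels = torch.randn(1, channels, 256, 256)

# Cut the image into non-overlapping 32x32 patches and flatten each one.
patches = nn.functional.unfold(pixels, kernel_size=patch, stride=patch)  # (1, 3072, 64)
patches = patches.transpose(1, 2)                                        # (1, 64, 3072)

with torch.no_grad():
    from_linear = linear(patches)                        # (1, 64, 768)
    from_conv = conv(pixels).flatten(2).transpose(1, 2)  # (1, 64, 768)

outputs_match = torch.allclose(from_linear, from_conv, atol=1e-4)
print(patches.shape, from_linear.shape, outputs_match)
```

The `Embedding(256, 768)` position table is a separate difference: the `SiglipModel` printout has 64 positions for this checkpoint, so the 256 here may just be a config default rather than something read from the checkpoint.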
           

# Transformers Documentation: Notes

- Every model is implemented from only three main classes (configuration, model, and preprocessor).
- The number of user-facing abstractions is limited to only three classes for instantiating a model, and two APIs for inference or training.
- Model gets initialized from a config

- customizing models: https://huggingface.co/docs/transformers/en/how_to_hack_models
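
A small illustration of "model gets initialized from a config". The sizes are deliberately tiny and made up so that the random initialization is instant; `LlamaConfig` and `LlamaForCausalLM` are used only because they already appear in the LLaVA docs example, not because they are part of the planned model.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Deliberately tiny, made-up sizes so random initialization is instant.
config = LlamaConfig(
    vocab_size=128,
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=4,
)

# Building from a config gives randomly initialized weights; nothing is downloaded.
model = LlamaForCausalLM(config)

num_params = sum(p.numel() for p in model.parameters())
print(type(model.config).__name__, num_params)
```

Loading pretrained weights is the other path (`from_pretrained`), which reads the config and the weights from a checkpoint together.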

# identifying the complete workflow

In [4]:
""" 
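X_v image: .png/.jpg, local or remote
|
| jpg -> Image: load the image with PIL (e.g. via load_image) to get an Image
| Image -> Tensor: siglip_processor(images=[image], return_tensors="pt")
|
| do we use the Siglip Processor here? the AutoProcessor loaded in the siglip2 cell seems to cover this step
V
Vision Encoder: input needs to be a tensor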
"""

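
To see what the Image -> Tensor step involves, here is a hand-rolled approximation. It is not the SigLIP processor: the 256x256 target size comes from the checkpoint name, the mean = std = 0.5 normalization is an assumption, and `preprocess` is a made-up helper. A generated solid-color image stands in for a loaded .jpg.

```python
import numpy as np
import torch
from PIL import Image


def preprocess(image: Image.Image, size: int = 256) -> torch.Tensor:
    """Resize, scale to [0, 1], normalize to [-1, 1], return (1, 3, size, size)."""
    image = image.convert("RGB").resize((size, size))
    array = np.asarray(image, dtype=np.float32) / 255.0   # (H, W, 3) in [0, 1]
    array = (array - 0.5) / 0.5                           # assumed mean = std = 0.5
    tensor = torch.from_numpy(array).permute(2, 0, 1)     # (3, H, W), channels first
    return tensor.unsqueeze(0)                            # add the batch dimension


demo_image = Image.new("RGB", (640, 480), color=(255, 0, 0))  # stand-in for a loaded .jpg
pixel_values = preprocess(demo_image)

print(pixel_values.shape, pixel_values.min().item(), pixel_values.max().item())
```

In practice the processor is the safer choice, since it carries the exact resize and normalization settings saved with the checkpoint.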