The vision encoder doesn't return hidden states causing the model.generate() method to fail. How to fix it? #262

@beingdutta

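The stack trace shows that the failure is in the forward pass of the vision model, which is a SigLIP encoder. Even though it is called with output_hidden_states=True, the returned hidden_states is None, so indexing hidden_states[-2] raises a TypeError and model.generate() fails. How can this be fixed?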

TypeError                                 Traceback (most recent call last)
Cell In[14], line 1
----> 1 g = model.generate(**inputs)
      2 print(g)

File ~/.cache/huggingface/modules/transformers_modules/mPLUG/mPLUG_hyphen_Owl3_hyphen_7B_hyphen_240728/eff25bcdc02ff1b513c25f376d761ec1ab6dfa1b/modeling_mplugowl3.py:152, in mPLUGOwl3Model.generate(self, input_ids, pixel_values, media_offset, attention_mask, tokenizer, stream, decode_text, **kwargs)
    149 assert input_ids is not None
    151 with torch.inference_mode():
--> 152     image_embeds = self.forward_image(pixel_values)
    154     if stream:
    155         result = self._decode_stream(input_ids=input_ids, image_embeds=image_embeds, media_offset=media_offset, tokenizer=tokenizer, **kwargs)

File ~/.cache/huggingface/modules/transformers_modules/mPLUG/mPLUG_hyphen_Owl3_hyphen_7B_hyphen_240728/eff25bcdc02ff1b513c25f376d761ec1ab6dfa1b/modeling_mplugowl3.py:70, in mPLUGOwl3Model.forward_image(self, pixel_values)
     68 dtype = self.language_model.model.embed_tokens.weight.dtype
     69 with torch.inference_mode():
---> 70     image_embeds = self.vision_model(pixel_values.to(dtype), output_hidden_states=True).hidden_states[-2]
     72 if self.vision2text_model is not None:
     73     image_embeds = self.vision2text_model(image_embeds)

TypeError: 'NoneType' object is not subscriptable
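
To make the failure mode concrete, here is a minimal sketch that reproduces the same TypeError without loading the model. It is not the mPLUG-Owl3 code: `FakeEncoderOutput` and `penultimate_hidden_state` are hypothetical names used only for illustration. The only assumption is that the encoder output has a `hidden_states` attribute, as line 70 of the traceback suggests. The sketch shows that indexing `[-2]` on a `None` value gives this exact message, and that an explicit check makes the problem easier to spot. It does not say why the encoder returns None.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FakeEncoderOutput:
    """Hypothetical stand-in for the vision encoder's output object."""
    last_hidden_state: object
    hidden_states: Optional[Sequence[object]] = None


def penultimate_hidden_state(output):
    """Return hidden_states[-2], or raise a descriptive error if it is None."""
    hidden_states = getattr(output, "hidden_states", None)
    if hidden_states is None:
        raise RuntimeError(
            "vision encoder returned hidden_states=None even though "
            "output_hidden_states=True was requested"
        )
    return hidden_states[-2]


# Same failure as in the traceback: subscripting None raises TypeError.
broken = FakeEncoderOutput(last_hidden_state="h3")
try:
    broken.hidden_states[-2]
except TypeError as exc:
    reproduced_message = str(exc)

# When hidden states are present, the second-to-last entry is selected.
healthy = FakeEncoderOutput(
    last_hidden_state="h3",
    hidden_states=("h0", "h1", "h2", "h3"),
)
selected = penultimate_hidden_state(healthy)
```

Running the same kind of check on the real `self.vision_model(...)` output would confirm whether `hidden_states` is None before the `[-2]` indexing happens.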
