
ViltTransformerSS usage clarification #40

Open
JoanFM opened this issue Dec 3, 2021 · 10 comments

@JoanFM

JoanFM commented Dec 3, 2021

Hello @dandelin,

I am trying to understand the interface expected by ViLTransformerSS.

As I see it, the infer signature is as follows:

    def infer(
        self,
        batch,
        mask_text=False,
        mask_image=False,
        image_token_type_idx=1,
        image_embeds=None,
        image_masks=None,
    ):

If my understanding is correct, if I want to do retrieval and let ViLT compute the embeddings before they enter the co-attention layers, I should leave the default values untouched.

But as for the batch parameter, what is its exact structure? Reading the code, it seems that batch is a dictionary of lists with the following keys, which I would like to clarify:

        text_ids = batch[f"text_ids"] (I guess these are the ids after tokenization; what is their type?)
        text_labels = batch[f"text_labels"] (What if I do not have labels? Is it okay for this to be None?)
        text_masks = batch[f"text_masks"] (Is it okay if this is None?)
        img = batch["image"][0] (I guess this is the image, but in what format and with what preprocessing?)

Another thing I observed is that this inference method seems to work with a single image at a time, so I guess it works with a single text and a single image at a time. Is there an inference mode where this can be run with a batch larger than 1?

Also, the output of the function is not entirely clear to me:

       ret = {
            "text_feats": text_feats,
            "image_feats": image_feats,
            "cls_feats": cls_feats,
            "raw_cls_feats": x[:, 0],
            "image_labels": image_labels,
            "image_masks": image_masks,
            "text_labels": text_labels,
            "text_ids": text_ids,
            "text_masks": text_masks,
            "patch_index": patch_index,
        }

Which of these keys can be considered the similarity metric? I guess cls_feats, or what should I look for?

Thank you very much in advance

@dandelin
Owner

dandelin commented Dec 3, 2021

Hi, @JoanFM

vilt_module.py doesn't say much about its internal workings.

Yes, if you want to let ViLT's infer take care of the embedding part, the default values are for that.

The details on the text-related data can be found here and here.
text_ids are the token ids converted from the raw text by self.tokenizer in get_text().
text_labels are the labels for the MLM task, generated by the Hugging Face mlm_collator; they aren't used at all if the model doesn't perform the MLM task (the default value is [-100] * seq_len).
text_masks mask out the text padding that is used to batch sentences of various lengths (the default value is [1] * seq_len).

batch["image"] is a doubly nested list that also considers multiple views.
The details are here.
At the release of ViLT, we adopt a single view generation (augmentation) policy, so index the batch["image"] with 0 to get the first view of images.

You can do batch inference by generating the batch dictionary properly.
Follow the collate function to learn how to build a proper batch.

None of them is the similarity metric.
It can only be acquired by passing cls_feats to the itm_score head.
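
For reference, here is a rough sketch of a batch with B image-text pairs (shapes only; the keys and shapes follow the collate function, and the model calls at the end are just placeholders):

import torch

B, L = 2, 5  # batch size, tokenized sequence length

batch = {
    # one view of B images: a list holding a single (B, C, H, W) tensor
    "image": [torch.randn(B, 3, 224, 224)],
    # token ids produced by self.tokenizer in get_text()
    "text_ids": torch.randint(0, 1000, (B, L)),
    # 1 for real tokens, 0 for padding
    "text_masks": torch.ones(B, L, dtype=torch.long),
    # only used by the MLM objective; -100 means "ignore this position"
    "text_labels": torch.full((B, L), -100, dtype=torch.long),
}

# out = model.infer(batch)
# itm_logits = model.itm_score(out["cls_feats"])  # (B, 2) matched/mismatched logits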

@JoanFM
Author

JoanFM commented Dec 14, 2021

Hello @dandelin .

I have a question.

I have written this code, and I do not understand 2 things:

  • Why do I get a different score every time? Is this the right way to load ViLTransformerSS from a checkpoint?
  • Also, why do I get 2 scores in the output? Shouldn't there be only one, as a similarity metric? How do I get the similarity metric from this?
import torch
from vilt import config
from vilt.modules import ViLTransformerSS

conf = config.config()
conf.load_path =  'vilt_irtr_f30k.ckpt'
conf.test_only = True
image_vilt = torch.ones(3, 224, 224)
batch = {}
batch['image'] = [image_vilt.unsqueeze(0)]
batch['text_ids'] = torch.IntTensor([[1, 0, 16, 32, 55]]) # random sentence tokens
batch['text_masks'] = torch.IntTensor([[1, 1, 1, 1, 1]]) # no masking
batch['text_labels'] = None
with torch.no_grad():
    vilt = ViLTransformerSS(conf)
    vilt.train(mode=False)
    out = vilt(batch)
    score = vilt.itm_score(out['cls_feats'])
    print(f' score {score}')

@JoanFM
Author

JoanFM commented Dec 14, 2021

Another question I have about the batch processing you suggest using collate:

As I see it in the code, the inference only considers the first image (view) in batch[image_key], so I do not understand how batching can work in this case. Do you mean batching the same image against different texts?

I would be more interested in a batched inference of the same text with different images.

@dandelin
Owner

dandelin commented Dec 14, 2021

Hi @JoanFM

I wrote some pedagogical code with comments.
I hope it answers your questions.

import torch
import copy
from vilt import config
from vilt.modules import ViLTransformerSS

# Sacred config is an immutable object, so you need to deepcopy it.
conf = copy.deepcopy(config.config())
conf["load_path"] = "vilt_irtr_coco.ckpt"
conf["test_only"] = True

# You need to properly configure loss_names to initialize the heads (0.5 means the head is initialized, but its loss is ignored during training)
loss_names = {
    "itm": 0.5,
    "mlm": 0,
    "mpp": 0,
    "vqa": 0,
    "imgcls": 0,
    "nlvr2": 0,
    "irtr": 1,
    "arc": 0,
}
conf["loss_names"] = loss_names

# two different random images
image_vilt = torch.randn(2, 3, 224, 224)
batch = {}
batch["image"] = [image_vilt]
# repeated random sentence tokens
batch["text_ids"] = torch.IntTensor([[1, 0, 16, 32, 55], [1, 0, 16, 32, 55]])
# no masking
batch["text_masks"] = torch.IntTensor([[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]])
batch["text_labels"] = None

with torch.no_grad():
    vilt = ViLTransformerSS(conf)
    vilt.train(mode=False)
    out = vilt(batch)

    itm_logit = vilt.itm_score(out["cls_feats"]).squeeze()
    print(
        f"itm logits, two logits (logit_neg, logit_pos) for an image-text pair.\n{itm_logit}"
    )
    itm_score = itm_logit.softmax(dim=-1)[:, 1]
    print(f"itm score, one score for an image-text pair.\n{itm_score}")

    # You should see the "rank_output" head if loss_names["irtr"] > 0
    score = vilt.rank_output(out["cls_feats"]).squeeze()
    print(f"unnormalized irtr score, one score for an image-text pair.\n{score}")

    normalized_score = score.softmax(dim=0)
    print(
        f"normalized (relative) irtr score, one score for an image-text pair.\n{normalized_score}"
    )

@JoanFM
Author

JoanFM commented Dec 14, 2021

Hey @dandelin, thank you for the quick and clear response.

It answers them, but still leaves some doubts:

  • What is the logit about? What are logit_pos and logit_neg?
  • In the example, where the text is the same, shouldn't the logits be symmetric?
  • What is the difference between itm_score and score? If I do text-to-image retrieval and pass one text against 20 images to the model, which score should I look at?
  • Also, how can I batch this call of one query text against 20 images?

Thank you very much

@JoanFM
Author

JoanFM commented Dec 15, 2021

Another thing: when trying to run your script, I get:

```
Traceback (most recent call last):
  File "src/model/vilt_demo.py", line 50, in <module>
    vilt = ViLTransformerSS(conf)
  File "/home/joan/.local/lib/python3.7/site-packages/vilt/modules/vilt_module.py", line 106, in __init__
    ckpt = torch.load(self.hparams.config["load_path"], map_location="cpu")
  File "/home/joan/.local/lib/python3.7/site-packages/torch/serialization.py", line 593, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/joan/.local/lib/python3.7/site-packages/torch/serialization.py", line 764, in _legacy_load
    raise RuntimeError("Invalid magic number; corrupt file?")
RuntimeError: Invalid magic number; corrupt file?
```

I guess the problem is that I am using a PyTorch version newer than the one the model was saved with? Can you give more details about the torch version used during training?

Thank you very much,

Best regards

Joan

@JoanFM
Author

JoanFM commented Dec 16, 2021

Forget about this last message. This was my bad! Sorry for the inconvenience!

@dandelin
Owner

Hi @JoanFM

  • The ITM objective is a 2-way classification: the pair is either matched (pos) or mismatched (neg). The logits from vilt.itm_score are the logits for this 2-way classification.
  • I used the same text, [1, 0, 16, 32, 55]; however, since the two images are different random ones (image_vilt = torch.randn(2, 3, 224, 224)), we get different logits for each pair.
  • itm_score comes from vilt.itm_score as a result of the ITM objective (dimension = (#pair, 2)), while score comes from vilt.rank_output as a result of the IRTR objective (dimension = (#pair, 1)). So if you do T2I with one text against 20 images, score will have dimension (20, 1).
  • Put the images in the shape of (#images, #channel, H, W), as in image_vilt = torch.randn(2, 3, 224, 224) (2 images); see the sketch below.
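
For example, a minimal sketch (it follows the example above; the single query text is simply tiled so that the text and image batch sizes match):

import torch

N = 20  # number of candidate images for one query text
images = torch.randn(N, 3, 224, 224)

text_ids = torch.IntTensor([[1, 0, 16, 32, 55]])  # the single query
text_masks = torch.IntTensor([[1, 1, 1, 1, 1]])

batch = {
    "image": [images],
    "text_ids": text_ids.repeat(N, 1),      # repeat the query once per image
    "text_masks": text_masks.repeat(N, 1),
    "text_labels": None,
}

# out = vilt(batch)
# scores = vilt.rank_output(out["cls_feats"]).squeeze()  # (N,) one score per image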

@JoanFM
Author

JoanFM commented Dec 16, 2021

Hello @dandelin ,

So, to clarify:

  • On top of the cross-attention pooled features you add an ITM head that returns 2 logits, one for each class ("they match" and "they do not match"?). Then with the softmax you get a probability distribution over (pos, neg). Is this correct? Intuitively, would a single score not be enough to achieve the same?
  • The rank_output is just a layer where you do this:
            self.rank_output.weight.data = self.itm_score.fc.weight.data[1:, :]
            self.rank_output.bias.data = self.itm_score.fc.bias.data[1:]

So I guess you are using only one of the logits to compute the ranking?
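
In other words, something like this rough sketch (the hidden size and the bare fc layers are my assumptions; it just shows that rank_output returns the positive-class ITM logit):

import torch
import torch.nn as nn

hs = 768  # assumed hidden size
itm_fc = nn.Linear(hs, 2)       # stand-in for itm_score's final fc: (neg, pos) logits
rank_output = nn.Linear(hs, 1)  # IRTR head: a single ranking score

# copy only the "pos" row, as in the snippet above
rank_output.weight.data = itm_fc.weight.data[1:, :]
rank_output.bias.data = itm_fc.bias.data[1:]

x = torch.randn(4, hs)
assert torch.allclose(rank_output(x).squeeze(-1), itm_fc(x)[:, 1])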

As for the image-to-text retrieval, I am trying to run this:

import copy

import torch

from vilt import config
from vilt.modules import ViLTransformerSS

conf = copy.deepcopy(config.config())
conf["load_path"] = 'vilt_irtr_f30k.ckpt'
conf["test_only"] = True

# You need to properly configure loss_names to initialize the heads (0.5 means the head is initialized, but its loss is ignored during training)
loss_names = {
    "itm": 0.5,
    "mlm": 0,
    "mpp": 0,
    "vqa": 0,
    "imgcls": 0,
    "nlvr2": 0,
    "irtr": 1,
    "arc": 0,
}
conf["loss_names"] = loss_names

# two identical images (all ones) as the database
image_vilt = torch.ones(2, 3, 224, 224) # 2 images in database
batch = {}
batch["image"] = [image_vilt]
# a single query sentence (token ids)
batch["text_ids"] = torch.IntTensor([[1, 0, 16, 32, 55]]) # 1 single text query
# no masking
batch["text_masks"] = torch.IntTensor([[1, 1, 1, 1, 1]]) # 1 single text query
batch["text_labels"] = None

with torch.no_grad():
    vilt = ViLTransformerSS(conf)
    vilt.train(mode=False)
    out = vilt(batch)
    # You should see "rank_output" head if loss_names["irtr"] > 0
    score = vilt.rank_output(out["cls_feats"]).squeeze()
    print(f"unnormalized irtr score, one score for a image-text pair.\n{score}")

    normalized_score = score.softmax(dim=0)
    print(
        f"normalized (relative) irtr score, one score for an image-text pair.\n{normalized_score}"
    )

This is failing with:

Traceback (most recent call last):
  File "src/model/vilt_demo.py", line 52, in <module>
    out = vilt(batch)
  File "/home/joan/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/joan/.local/lib/python3.7/site-packages/vilt/modules/vilt_module.py", line 191, in forward
    ret.update(self.infer(batch))
  File "/home/joan/.local/lib/python3.7/site-packages/vilt/modules/vilt_module.py", line 158, in infer
    co_embeds = torch.cat([text_embeds, image_embeds], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Got 1 and 2 in dimension 0 (The offending index is 1)

Again, I highly appreciate your incredible support!

Regards,

Joan

@Elibeau

Elibeau commented Nov 5, 2022

What is the total number of Transformer encoder layers in the ViLT model?
