ViltTransformerSS usage clarification #40
Hi @JoanFM. vilt_module.py doesn't say much about its internal workings on its own. Yes, if you want to let ViLT compute the embeddings, you can leave the default values untouched. The details on the text-related data can be found here and here.

You can do a batch inference by properly generating the batch dictionary. None of the output keys is the similarity metric itself.
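To make the batch-dictionary layout concrete: a batch of N image-text pairs has one row per pair in `text_ids` / `text_masks`, while `image` wraps the whole image tensor in a one-element list. Below is a minimal pure-Python sketch of that layout — `make_batch` is a hypothetical helper, and plain lists stand in for the torch tensors the real model expects:

```python
def make_batch(images, token_ids_list, pad_id=0):
    """Build a ViLT-style batch dict from N images and N token-id lists.

    `images` would normally be a tensor of shape (N, 3, H, W); here we only
    model the dictionary structure, so any sequence works as a stand-in.
    """
    max_len = max(len(t) for t in token_ids_list)
    # Right-pad every token list to the longest one in the batch.
    text_ids = [t + [pad_id] * (max_len - len(t)) for t in token_ids_list]
    # Mask is 1 for real tokens, 0 for padding.
    text_masks = [[1] * len(t) + [0] * (max_len - len(t)) for t in token_ids_list]
    return {
        "image": [images],        # note: a one-element list wrapping the image batch
        "text_ids": text_ids,     # shape (N, max_len) after padding
        "text_masks": text_masks, # shape (N, max_len)
        "text_labels": None,      # unused at inference time
    }

batch = make_batch(["img0", "img1"], [[1, 0, 16], [1, 0, 16, 32, 55]])
```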
Hello @dandelin. I have a question: I have written this code, and there are two things I do not understand:
```python
import torch

from vilt import config
from vilt.modules import ViLTransformerSS

conf = config.config()
conf.load_path = 'vilt_irtr_f30k.ckpt'
conf.test_only = True

image_vilt = torch.ones(3, 224, 224)

batch = {}
batch['image'] = [image_vilt.unsqueeze(0)]
batch['text_ids'] = torch.IntTensor([[1, 0, 16, 32, 55]])  # random sentence tokens
batch['text_masks'] = torch.IntTensor([[1, 1, 1, 1, 1]])  # no masking
batch['text_labels'] = None

with torch.no_grad():
    vilt = ViLTransformerSS(conf)
    vilt.train(mode=False)
    out = vilt(batch)
    score = vilt.itm_score(out['cls_feats'])
    print(f'score {score}')
```
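As a side note on interpreting `itm_score`: it produces two logits (logit_neg, logit_pos) per image-text pair, and a softmax turns them into a match probability. A pure-Python sketch of that step (the real code uses `torch.softmax`; for two logits the result reduces to a sigmoid of the logit difference):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# (logit_neg, logit_pos) from the ITM head; index 1 is the match probability.
logit_neg, logit_pos = 0.0, 2.0
p_match = softmax([logit_neg, logit_pos])[1]
```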
Another question I have about the … As I see in the code, the … I would be more interested in having a match inference of the same …
Hi @JoanFM, I wrote pedagogical code with comments:

```python
import copy

import torch

from vilt import config
from vilt.modules import ViLTransformerSS

# Sacred config is an immutable object, so you need to deepcopy it.
conf = copy.deepcopy(config.config())
conf["load_path"] = "vilt_irtr_coco.ckpt"
conf["test_only"] = True

# You need to properly configure loss_names to initialize the heads
# (0.5 means it initializes the head but ignores the loss during training).
loss_names = {
    "itm": 0.5,
    "mlm": 0,
    "mpp": 0,
    "vqa": 0,
    "imgcls": 0,
    "nlvr2": 0,
    "irtr": 1,
    "arc": 0,
}
conf["loss_names"] = loss_names

# two different random images
image_vilt = torch.randn(2, 3, 224, 224)

batch = {}
batch["image"] = [image_vilt]
# repeated random sentence tokens
batch["text_ids"] = torch.IntTensor([[1, 0, 16, 32, 55], [1, 0, 16, 32, 55]])
# no masking
batch["text_masks"] = torch.IntTensor([[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]])
batch["text_labels"] = None

with torch.no_grad():
    vilt = ViLTransformerSS(conf)
    vilt.train(mode=False)
    out = vilt(batch)

    itm_logit = vilt.itm_score(out["cls_feats"]).squeeze()
    print(
        f"itm logit, two logits (logit_neg, logit_pos) per image-text pair.\n{itm_logit}"
    )
    itm_score = itm_logit.softmax(dim=-1)[:, 1]
    print(f"itm score, one score per image-text pair.\n{itm_score}")

    # You should see the "rank_output" head if loss_names["irtr"] > 0.
    score = vilt.rank_output(out["cls_feats"]).squeeze()
    print(f"unnormalized irtr score, one score per image-text pair.\n{score}")
    normalized_score = score.softmax(dim=0)
    print(
        f"normalized (relative) irtr score, one score per image-text pair.\n{normalized_score}"
    )
```
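Since softmax is monotonic, the normalized irtr scores preserve the ordering of the raw ones, so ranking the candidates is just an argsort over the raw scores. A pure-Python sketch of that idea (`rank_candidates` is a hypothetical helper, mirroring the `softmax(dim=0)` step above):

```python
import math

def rank_candidates(scores):
    """Return (normalized_scores, ranking) for a list of raw irtr scores.

    The softmax across candidates (dim=0 in the torch code) yields relative
    scores; the ranking is by descending raw score, which softmax preserves
    because it is a monotonic transform.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    normalized = [e / total for e in exps]
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return normalized, ranking

normalized, ranking = rank_candidates([0.3, 2.1, -0.5])
# ranking[0] is the index of the best-matching candidate
```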
Hey @dandelin, thank you for the quick and clear response. It answers my question but still leaves some doubts.

Thank you very much.

Forget about this last message; that was my bad! Sorry for the inconvenience!
Hi @JoanFM
Hello @dandelin, so for me to clarify:

```python
self.rank_output.weight.data = self.itm_score.fc.weight.data[1:, :]
self.rank_output.bias.data = self.itm_score.fc.bias.data[1:]
```

So I guess you are using only one of the … As for image-to-text, I am trying to run this:

```python
conf = copy.deepcopy(config.config())
conf["load_path"] = 'vilt_irtr_f30k.ckpt'
conf["test_only"] = True

# You need to properly configure loss_names to initialize the heads
# (0.5 means it initializes the head but ignores the loss during training).
loss_names = {
    "itm": 0.5,
    "mlm": 0,
    "mpp": 0,
    "vqa": 0,
    "imgcls": 0,
    "nlvr2": 0,
    "irtr": 1,
    "arc": 0,
}
conf["loss_names"] = loss_names

image_vilt = torch.ones(2, 3, 224, 224)  # 2 images in database

batch = {}
batch["image"] = [image_vilt]
batch["text_ids"] = torch.IntTensor([[1, 0, 16, 32, 55]])  # 1 single text query
batch["text_masks"] = torch.IntTensor([[1, 1, 1, 1, 1]])  # 1 single text query
batch["text_labels"] = None

with torch.no_grad():
    vilt = ViLTransformerSS(conf)
    vilt.train(mode=False)
    out = vilt(batch)

    # You should see the "rank_output" head if loss_names["irtr"] > 0.
    score = vilt.rank_output(out["cls_feats"]).squeeze()
    print(f"unnormalized irtr score, one score per image-text pair.\n{score}")
    normalized_score = score.softmax(dim=0)
    print(
        f"normalized (relative) irtr score, one score per image-text pair.\n{normalized_score}"
    )
```

This is failing with:
Again, I highly appreciate your incredible support! Regards, Joan
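One thing worth noting about the snippet above: it pairs two images with a single text row, whereas the earlier working example had one row in `text_ids` / `text_masks` per image. If the intent is one query against two database images, the query can be tiled to match — a pure-Python sketch below, where `tile_text` is a hypothetical helper (with real tensors you would use `torch.Tensor.repeat` instead):

```python
def tile_text(text_ids, text_masks, num_images):
    """Repeat a single text query so the batch has one text row per image.

    ViLT scores (image, text) pairs row by row, so 2 images need 2 text
    rows; a single-row text against a 2-image batch is a shape mismatch.
    """
    assert len(text_ids) == 1 and len(text_masks) == 1
    return text_ids * num_images, text_masks * num_images

ids, masks = tile_text([[1, 0, 16, 32, 55]], [[1, 1, 1, 1, 1]], num_images=2)
```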
Hello @dandelin,

I am trying to understand the interface expected by `ViltTransformerSS`. As I see, the `infer` signature is as follows: …

If my understanding is correct, if I want to do retrieval and let ViLT compute the embedding before entering the co-attention layers, I should leave the default values untouched. But as for the `batch` parameter, what is the exact signature? Reading the code, it seems that `batch` is a dictionary of lists with the following keys, which I would like to clarify: …

Another thing I observed is that this inference method seems to work with a single image at a time, so I guess it works with a single text and a single image at a time. Is there an inference mode where this can be run with a real batch larger than 1?

Also, the output of the function does not seem so clear to me: … Which of these keys can be considered the similarity metric? I guess `cls_feats`, or what should I look for?

Thank you very much in advance.