[FLAVA] Separate out text and image encoders #102
Conversation
[ghstack-poisoned]
ghstack-source-id: 5212d85775105b8176a13aedbcbf576fb3dc3291 Pull Request resolved: #102
[ghstack-poisoned]
ghstack-source-id: 278663179f3ff6a73ede11e714f7f8e5e9a6a8bb Pull Request resolved: #102
Codecov Report
@@               Coverage Diff                @@
##       gh/ankitade/3/base     #102      +/-   ##
===================================================
+ Coverage            88.85%   89.04%   +0.19%
===================================================
  Files                   33       35       +2
  Lines                 1722     1744      +22
===================================================
+ Hits                  1530     1553      +23
+ Misses                 192      191       -1
===================================================
Continue to review full report at Codecov.
Separate out the encoders into their own module without any logic changes (except fixing 2 minor bugs, see annotations by me) and add tests [ghstack-poisoned]
ghstack-source-id: cbef8d57b722b36a66fa0b4155d2a9f82c0b2fe0 Pull Request resolved: #102
class_pos_embed = self.position_embeddings[:, 0]
patch_pos_embed = self.position_embeddings[:, 1:]
dim = embeddings.shape[-1]
h0 = height // self.config.patch_size
changed to self.patch_embedding.patch_size
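That is, the line above becomes roughly the following (a sketch using the names already visible in the diff, presumably because the standalone encoder no longer carries a self.config):

h0 = height // self.patch_embedding.patch_size  # was: height // self.config.patch_size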
pooler_output=output.pooler_output,
hidden_states=output.hidden_states,
attentions=output.attentions,
image_labels=image_labels,
Removed this; image_labels is not a field.
@ankitade has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
set_rng_seed(0)
torch.manual_seed(0)
I think this line is redundant
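For context, a helper named set_rng_seed typically already seeds torch along these lines (a hypothetical sketch, not the actual test util in the repo):

import random

import torch


def set_rng_seed(seed: int) -> None:
    # Hypothetical helper: if it already calls torch.manual_seed, a separate
    # torch.manual_seed(0) right after it adds nothing.
    random.seed(seed)
    torch.manual_seed(seed)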
atol=1e-4,
rtol=0,
)
assert_expected(out.pooler_output, out.last_hidden_state)
nit: maybe a bit confusing to do the transitive thing here. Can you just set the expected result to a var and compare both last_hidden_state and pooler_output to that?
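For illustration, the suggested pattern would look something like this (the expected value here is a placeholder, not taken from the test):

expected = torch.full((1, 5), 0.2)  # placeholder expected tensor
assert_expected(out.last_hidden_state, expected, atol=1e-4, rtol=0)
assert_expected(out.pooler_output, expected, atol=1e-4, rtol=0)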
Isn't the transitive thing actually making it clear which values should line up?
Sure, just personal preference I guess
[
    [
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.1999, 0.2000, 0.2000, 0.2000, 0.2000],
0.1999 due to rounding error? Maybe just make them all 0.2 for readability?
Image to Patch Embedding.
"""

def __init__(self, image_size=224, patch_size=16, num_channels=3, embed_dim=768):
add types
typing will get handled separately in another PR (either Rafi or I will do it)
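For reference, the annotated signature would look roughly like this in the follow-up typing PR (the enclosing class name is assumed; only the docstring and __init__ appear in the diff, and the body is unchanged):

import torch.nn as nn


class PatchEmbeddings(nn.Module):  # class name assumed, not shown in the diff
    """Image to Patch Embedding."""

    def __init__(
        self,
        image_size: int = 224,
        patch_size: int = 16,
        num_channels: int = 3,
        embed_dim: int = 768,
    ) -> None:
        super().__init__()
        ...  # body unchanged from the PR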
patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed.reshape(
        1, int(math.sqrt(n)), int(math.sqrt(n)), dim
    ).permute(0, 3, 1, 2),
    scale_factor=(h0 / math.sqrt(n), w0 / math.sqrt(n)),
    mode="bicubic",
    align_corners=False,
)
What's with all the chaining here? Can we split it up a bit?
Generally I want to avoid touching core logic as part of this refactor. I have a feeling some of the image encoder is going to get deleted in the end
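For reference, the split-up version being asked about would look roughly like this (a readability-only sketch using the same variables as the diff; not applied in this PR, per the comment above):

grid_size = int(math.sqrt(n))
# NHWC -> NCHW so interpolate sees channels first
patch_pos_embed = patch_pos_embed.reshape(1, grid_size, grid_size, dim).permute(0, 3, 1, 2)
patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed,
    scale_factor=(h0 / math.sqrt(n), w0 / math.sqrt(n)),
    mode="bicubic",
    align_corners=False,
)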
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
# any TensorFlow checkpoint file
Not super important, but I'm seeing this comment a lot in our embeddings classes. Is it actually adding any value? If not, maybe remove it
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(
    attention_mask, input_shape, device
)
This doesn't need to be a class method. Katrina is adding it as a util in #99
Actually, this might need some more thought. Do we want to handle the attention mask HF style?
I would not touch FLAVA's core logic in general. Feel free to refactor stuff around the logic, but it would be good to avoid touching the logic itself.
I think we discussed this @apsdehal; it should be fine as long as tests are added, they pass, and the checkpoint is kept in sync.
In general we do need to refactor some things. Examples: the projection being part of only the pretraining model (but it's needed for zero shot), or trying to use a common implementation of transformers. Will add you to all the PRs so we can address any concerns you have.
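For context, the HF-style mask handling under discussion boils down to something like this (a minimal sketch, not necessarily the exact utility being added in #99):

import torch


def get_extended_attention_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    # attention_mask is 1.0 at positions to attend and 0.0 at masked positions.
    # Broadcast to [batch, 1, 1, seq_len] so it can be added to the raw attention
    # scores, then map 1.0 -> 0.0 and 0.0 -> -10000.0 so masked positions are
    # effectively removed by the softmax.
    extended_mask = attention_mask[:, None, None, :].to(dtype=torch.float)
    return (1.0 - extended_mask) * -10000.0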
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
Same thing here, can we remove this?
General comment: not going to introduce deeper refactoring changes in this PR since this is the first time we are adding tests for the encoders (will get handled as part of unification / cleanup).
Summary: Pull Request resolved: facebookresearch#102 Separate out the encoders into their own module without any logic changes (except fixing 2 minor bugs, see annotations by me) and add tests Test Plan: pytest Differential Revision: D37407717 Pulled By: ankitade fbshipit-source-id: 7ebacb969b864438372ff9304a46ed2f4be4c906
Summary: Pull Request resolved: facebookresearch#115 Pull Request resolved: facebookresearch#102 Separate out the encoders into their own module without any logic changes (except fixing 2 minor bugs, see annotations by me) and add tests Test Plan: pytest Reviewed By: ebsmothers Differential Revision: D37407717 Pulled By: ankitade fbshipit-source-id: cd9e120eea4890bb813cb8bbe77577f9e2c77c40
Summary: Pull Request resolved: facebookresearch#115 Pull Request resolved: facebookresearch#102 Separate out the encoders into their own module without any logic changes (except fixing 2 minor bugs, see annotations by me) and add tests Test Plan: pytest Reviewed By: ebsmothers Differential Revision: D37407717 Pulled By: ankitade fbshipit-source-id: bb56e29c798081e8fb8f04ff9307d0f8903628a8
Separate out the encoders into their own module without any logic changes (except fixing 2 minor bugs, see annotations by me) and add tests
Test plan:
pytest
Stack from ghstack (oldest at bottom):