Generalize CLIPArchitecture #89

Closed · sophiazhi wants to merge 13 commits into main from szhi-generalize_clip

Conversation

@sophiazhi (Contributor) commented Jun 14, 2022

Summary:
Generalizes CLIPArchitecture to allow two encoders of any modalities and adds a test suite for CLIPArchitecture. Ultimately, the goal is to support multimodal models beyond image/text, such as MUGEN, which uses audio/text/video.

Test plan:
Run pytest --cov=torchmultimodal/architectures/ test/architectures/test_clip.py::TestCLIPArchitecture -vv to run the unit test included in this PR.
(Screenshot of the passing test run, June 16, 2022.)

@facebook-github-bot added the "CLA Signed" label on Jun 14, 2022
@sophiazhi marked this pull request as draft on Jun 14, 2022
@ebsmothers (Contributor) left a comment

Overall this looks pretty good! Left a few comments, but other than the stuff about the forward outputs, they're all relatively minor

warnings.warn(f"Missing encoder for extra input {key}")

# Return a dataclass object instead of a dictionary
clip_output = make_dataclass(
Contributor:

Is there a specific reason we want to return a dataclass here? Imo one of the main advantages of dataclasses is that they follow a fixed schema, so returning one dynamically feels a bit unnatural.

sophiazhi (Contributor, Author):

I agree it feels unnatural (it took me a while to figure out how to make a dataclass dynamically). I used a dataclass to match the pattern set by other modules, but now I realize a lot of modules don't have one, so unless anyone is a strong proponent of output classes, I can return a dictionary instead.

Contributor:

The creation is dynamic, but once created the schema is fixed.

An advantage of dataclass is that we can use it for type hints.

The alternative to dataclass is NamedTuple, if we don't intend to support inheritance. No strong preference here.

Contributor:

I would prefer NamedTuple for consistency with all our other model outputs, unless there's a clear advantage of using dataclass over NamedTuple

sophiazhi (Contributor, Author):

Creating a NamedTuple dynamically causes issues with mypy, such that I have to include # type: ignore on the namedtuple creation line. Besides that, I don't see other advantages of dataclass relative to NamedTuple.
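
For illustration only, a minimal sketch of the two dynamic-output options being discussed; the field names here are hypothetical and not the PR's actual output schema:

from collections import namedtuple
from dataclasses import make_dataclass

import torch

embeddings = {"query": torch.randn(2, 8), "retrieval": torch.randn(2, 8)}

# Dynamically created dataclass: the schema is fixed once the class exists
CLIPOutput = make_dataclass("CLIPOutput", embeddings.keys())
out_dc = CLIPOutput(**embeddings)

# Dynamically created NamedTuple: works the same way, but mypy flags the
# dynamic field names, hence the `# type: ignore` mentioned above
CLIPOutputNT = namedtuple("CLIPOutputNT", embeddings.keys())  # type: ignore
out_nt = CLIPOutputNT(**embeddings)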

Two resolved review threads on torchmultimodal/architectures/clip.py (outdated)
Comment on lines 50 to 52
for key in modalities.keys():
    if key not in self.encoders:
        warnings.warn(f"Missing encoder for extra input {key}")
Contributor:

I think your choice to raise a warning here makes sense. We might also want to do the same in late_fusion for the sake of consistency (doesn't have to be done in this PR though)

Two resolved review threads on test/architectures/test_clip.py (outdated)
@ankitade (Contributor) left a comment

Thanks for the changes. I don't think we should change CLIP, which is a "standard" model, to make it play nice with MUGEN. Other options are to either have a different model, if we eventually want to get to another "standard" model like video CLIP, or to have a version in examples/mugen.

Resolved review threads on test/architectures/test_clip.py and torchmultimodal/architectures/clip.py (outdated)

@langong347 requested a review from RdoubleA on June 15, 2022
@langong347 (Contributor)

@ankitade This is not specific to MUGEN. We are generalizing just in the sense that CLIP can compare more than two modalities. This is a common use case you might find in other research work.

@ebsmothers (Contributor)

@langong347 I do see @ankitade's point here. At the very least this is no longer really CLIPArchitecture, since the "LI" in CLIP stands for language and image 🙂. A separate question is whether we want to keep CLIPArchitecture because it is a foundational model. It seems like the two options would be to either

(1) keep CLIPArchitecture as is and implement a separate generalized architecture to be used by MUGEN, or
(2) rename this to e.g. ContrastiveArchitecture and let both CLIP and MUGEN use it.

To me, the argument for (1) is that CLIP is a very important model and should be a first-class citizen with its own architecture, while the argument for (2) is better generality (I think we have said we should not have an architecture unless it is used by multiple models anyway). Personally, I lean slightly towards (2) but would like to hear others' thoughts as well.
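
For concreteness, a rough sketch of what option (2) could look like; the class name and signature are hypothetical and not part of this PR:

from typing import Dict

import torch
import torch.nn.functional as F
from torch import nn


class ContrastiveArchitecture(nn.Module):
    """Run one encoder per modality and L2-normalize each embedding."""

    def __init__(self, encoders: nn.ModuleDict):
        super().__init__()
        self.encoders = encoders

    def forward(self, modalities: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        # Encode only the modalities that have a matching encoder
        return {
            key: F.normalize(encoder(modalities[key]), dim=-1)
            for key, encoder in self.encoders.items()
            if key in modalities
        }

A CLIP builder could then pass an image and a text encoder, while a MUGEN-style builder could pass three encoders and reuse the same class.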

@langong347 (Contributor) commented Jun 15, 2022

Generalizing a SOTA model is not uncommon. This also relates to the discussion on "post-paper model optimization". For example:

  • The original transformer model from "Attention Is All You Need" has spun off a few variants, or been broken up into encoder-only and decoder-only versions.
  • The GPT model was originally proposed for text generation only but was later generalized to video generation (VideoGPT) and cross-modality generation (DALL-E).
  • Extending CLIP to text <> video retrieval is comparable in "market share" to other tasks that use CLIP (see Papers with Code).

An architecture just represents a class of similar models. Initially it may be based on a particular instance, but it doesn't have to be restricted to where it came from. Compared to model builders, architectures are lower level. What we want to keep our fidelity to are the instances/builders, while the architecture is just the layer of abstraction beneath them.

No strong opinion about naming here. "CLIPArchitecture" is probably better as a reminder of its origin than "ContrastiveArchitecture", which is a term that hasn't been coined publicly yet.


@RdoubleA (Contributor)

@ebsmothers I'm leaning towards option 1. CLIPArchitecture is just a convenient wrapper around encoder -> projection -> L2 norm -> output (on a separate note, where is the projection layer?). Since it has been used many times in different papers, I think that warrants its own architecture, even though it would be an image-text instance of a general ContrastiveArchitecture (much like Video VQVAE gets its own file even though it's an instance of VQVAE).

As for MUGEN, the linear projection layer after the encoder is slightly different from the one in the CLIP paper (which only uses a single linear layer, I believe?): https://github.com/mugen-org/MUGEN_baseline/blob/02c7058cd221f4b651d4ace2276b085cac1c5efd/lib/models/videoclip/modules.py#L15. So that leads me to believe MUGEN should have its own ContrastiveArchitecture.

As for supporting more than two encoders, I'm not convinced of the benefit of that over multiple CLIPs, other than the convenience of getting all three embeddings at once for training or inference. That seems MUGEN-specific, which warrants a separate contrastive architecture for MUGEN anyway.


def test_forward(self, start):
    clip, input_query, input_retrieval = start
    assert isinstance(clip, torch.nn.Module)
Contributor:

not sure if it's necessary to ensure that clip is a Module, I would remove this

Resolved review threads on test/architectures/test_clip.py and torchmultimodal/architectures/clip.py (outdated)

@codecov-commenter commented Jun 15, 2022

Codecov Report

Merging #89 (2df9dcd) into main (de4d037) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #89      +/-   ##
==========================================
+ Coverage   88.37%   88.39%   +0.01%     
==========================================
  Files          35       35              
  Lines        1850     1853       +3     
==========================================
+ Hits         1635     1638       +3     
  Misses        215      215              
Impacted Files Coverage Δ
torchmultimodal/architectures/clip.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ankitade (Contributor)

  1. In the final state, we should just have models/clip.py with the CLIPModel (some version of the current CLIPArch) plus the different standard instantiations, clip_vit16 etc. (we already have this, I think).

  2. For video CLIP (the standard one, not MUGEN-specific, if they don't line up), there can be a models/video_clip.py if we want to add it and it does something different, like Rafi's point about the projections being different.

  3. If we start making CLIP take in a dict of modalities, it's confusing for people for whom contrastive loss with in-batch negatives works over two dimensions (aka modalities for us). Actually, I don't understand how the loss and zero-shot will eventually work for more than two entries in the dict.

@sophiazhi (Contributor, Author)

> where is the projection layer?

Both the original CLIPArchitecture and this generalized CLIPArchitecture avoid explicitly including the projection layer(s), because users may want different types of projections and the projection logic can be folded into whatever encoder is passed in. We also can't guarantee that any projections passed in as arguments by the user have the same output size, so I don't see an advantage to including a projection argument. (Though this choice does assume that we want one general CLIP architecture and not two versions.)
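
As a concrete illustration of folding the projection into the encoder (hypothetical shapes and modules, not code from this PR):

from torch import nn

# Hypothetical 768-d backbone whose output is projected to a 512-d joint
# embedding space before the encoder is handed to the architecture
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768), nn.ReLU())
image_encoder = nn.Sequential(backbone, nn.Linear(768, 512))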

@langong347 (Contributor) commented Jun 15, 2022

> For video CLIP (the standard one, not MUGEN-specific, if they don't line up), there can be a models/video_clip.py if we want to add it and it does something different, like Rafi's point about the projections being different.

The projection can be absorbed into the encoders (see Sophia's post), so we can reuse the same CLIPArchitecture for an arbitrary pair of modalities --- that's how CLIP has been extended in research. For that, hard-coding "text" and "image" in the keys of the output will not be suitable.

> If we start making CLIP take in a dict of modalities, it's confusing for people for whom contrastive loss with in-batch negatives works over two dimensions (aka modalities for us). Actually, I don't understand how the loss and zero-shot will eventually work for more than two entries in the dict.

In MUGEN, the loss is computed pairwise for the 3 modalities and summed together. We could instantiate 3 CLIP instances, each yielding just the loss for its pair, and combine them later in the Lightning module. My main concern about generalization is supporting different pairs of modalities using the same architecture.
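
A rough sketch of the pairwise combination described above, using a generic CLIP-style symmetric cross-entropy; the helper and names are assumptions for illustration, not MUGEN's actual code:

from itertools import combinations

import torch
import torch.nn.functional as F


def clip_style_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # Symmetric cross-entropy over in-batch negatives for one pair of modalities
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Hypothetical L2-normalized embeddings for the three MUGEN modalities
embeddings = {k: F.normalize(torch.randn(4, 8), dim=-1) for k in ("text", "video", "audio")}
total_loss = sum(clip_style_loss(embeddings[x], embeddings[y]) for x, y in combinations(embeddings, 2))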

> CLIPArchitecture is just a convenient wrapper around encoder -> projection -> L2 norm -> output

The CLIPArchitecture only carries partial features from the CLIP model; for example, cosine similarity is a common feature used in many research works. I know we can grab the latter from contrastive_loss.py, but alternatively we could also think of returning a compound output from forward including embeddings + cosine similarity, since the two are closely related anyway. (MUGEN adds similarity computation as a method of its CLIP, but that is just a utility inside a class.)
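
For reference, the similarity in question reduces to a matrix product of the L2-normalized embeddings (a generic sketch; it deliberately ignores the gradient-handling differences mentioned below):

import torch
import torch.nn.functional as F

query_emb = F.normalize(torch.randn(4, 8), dim=-1)
retrieval_emb = F.normalize(torch.randn(4, 8), dim=-1)
similarity = query_emb @ retrieval_emb.t()  # (batch, batch) cosine similarities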

@ebsmothers (Contributor)

Agree with both @langong347's and @sophiazhi's points about keeping the projection layer out of the architecture. Even in CLIP the projection layer is not guaranteed to be present (I think they have one in the ViT version but not the ResNet one). A corollary to this is that MUGEN does not need its own architecture just because it has a different projection layer.

I wouldn't recommend returning the similarity as part of the architecture though. Then we are starting to integrate our loss into the architecture, which we don't want in general. This had to be done for ALBEF because of how the similarities get used in the multimodal encoder, but this is also part of the reason that class was implemented as a model and not an architecture (since it then becomes much more specific to that particular model). Also, even simple old cosine similarity can be implemented in different ways, with both FLAVA and CLIP handling propagation of gradients differently. So I would keep this out and leave it up to the user how to use the embeddings.

For @ankitade's third point, I agree that returning more than two modalities doesn't really make sense for zero-shot. Though hopefully, if the user is running zero-shot (or contrastive with in-batch negatives), they wouldn't pass more than two modalities anyway. However, these assumptions plus excessive generality could potentially cause confusion for users of the "flagship" instantiation of CLIPArchitecture (CLIP 😉).

So ultimately I agree with @RdoubleA: leaving CLIPArchitecture as is feels like the right path. But I do think we should generalize the MUGEN architecture to return a dict (as opposed to an arbitrary pair). Otherwise in a case like this, we would have to call each of the encoders multiple times.

(As an aside, this whole convo is yet another interesting test of our "don't generalize until you need to" principle...)

@sophiazhi marked this pull request as ready for review on June 17, 2022
@facebook-github-bot (Contributor)

@sophiazhi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@sophiazhi deleted the szhi-generalize_clip branch on June 21, 2022