
Thoughts on Dynamic Graph Transformers? #6

Closed
jonathan-laurent opened this issue May 12, 2021 · 3 comments

@jonathan-laurent

First of all, thanks for making your code available in such a way!

In this paper, the authors propose a generalization of the GREAT architecture where for each attention layer, a learned linear transform is used to compute a bias matrix of dimensions num_heads x num_edge_types for each node of the graph. Said differently, they make it possible for the added attention bias to depend on the current node and head number.
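
For concreteness, here is roughly how I picture their construction (the names and shapes below are my own guesses, not taken from the paper):

```python
import numpy as np

# Rough sketch of a per-node, per-head edge bias (my reading of the paper;
# N nodes, H heads, E edge types, D model dimension are illustrative).
N, H, E, D = 6, 4, 3, 16
rng = np.random.default_rng(0)

node_states = rng.normal(size=(N, D))        # current node representations
W_bias = 0.02 * rng.normal(size=(D, H * E))  # learned linear transform

# One (num_heads x num_edge_types) bias matrix per node, instead of
# GREAT's single scalar per edge type shared across nodes and heads.
edge_bias = (node_states @ W_bias).reshape(N, H, E)

# Toy edge structure: for an edge (i -> j) of type t, head h adds the
# logit edge_bias[i, h, t] (whether it should key on the source or the
# target node is one of the design choices).
edge_type = rng.integers(0, E, size=(N, N))
edge_mask = rng.random((N, N)) < 0.3

scores = rng.normal(size=(H, N, N))          # ordinary q.k attention logits
for h in range(H):
    scores[h] += edge_mask * edge_bias[np.arange(N)[:, None], h, edge_type]
```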

I was wondering if you had made experiments with similar generalizations on your dataset or if you had any thoughts about it.

(Please do not hesitate to close this issue if you think this is not an appropriate forum for such a discussion.)

@VHellendoorn (Owner)

That is an interesting paper. A disclaimer: I found it a bit difficult to get the full picture of how their model works from section 3.2; maybe when they release their code things will be clearer. That aside, I definitely considered making the edge bias layer- and head-(in)dependent; I think almost any permutation of these makes sense, but much as in our last discussion, I found no noticeable gain in splitting it up by heads (although, similarly, it costs nearly nothing to do so). It doesn't look like this paper ablates that decision either.

It occurs to me that a lot of these small, seemingly reasonable decisions are hard to evaluate because the inherent randomness of initialization and the sheer capacity of these models make the actual performance impact hard to trace. One solution could be to compile a richer benchmark of tasks to evaluate the impact of these smaller factors across a statistically robust set of examples. I have something in the works that might(?) be suitable, but nothing expected soon. Happy to hear ideas about this too; there really should be a more reliable way of figuring this out.

The second innovation here pertains to conditioning the edge bias on the node state. Now this sort of makes sense, and I believe I've seen it suggested in other papers too. I say "sort of" because the bias term currently already interacts with the query and key through the (q+b)k (or vice versa) attention formulation, but it's less direct than it could be. It's also not quite clear how they instantiated it here: it seems to depend on the "target" node's state (they mention a bias per "receiving edge type"), which is curious -- I wonder if they tailored the choice of (q+b)k vs. (k+b)q to this decision. In any case, from what I've seen, this is definitely a good way to improve performance further.

In fact, my relational attention is really the simplest possible form, being just one scalar per edge type. Upgrading it to an N-dim state (as in Shaw et al.'s work, though it makes the computation quite a bit more expensive), adding various types of "payload" to the edges, or maybe enriching the edge vocabulary (supernodes anyone?) are all bound to help, though presumably with diminishing returns for the added model complexity and training cost. So it definitely pays to monitor citations of the original paper for innovations like the one in this paper :)
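
For reference, the scalar-per-edge-type bias I'm describing boils down to something like this (a simplified sketch, not the exact code in this repo):

```python
import numpy as np

# Simplified sketch of the (q+b)k relational attention with one learned
# scalar per edge type (illustrative shapes, single head, no masking).
rng = np.random.default_rng(1)
N, D, E = 5, 8, 3                      # tokens, head dim, edge types

q = rng.normal(size=(N, D))
k = rng.normal(size=(N, D))
b = 0.1 * rng.normal(size=(E,))        # learned scalar bias per edge type
edge_type = rng.integers(-1, E, size=(N, N))   # -1 means "no edge"

# Bias the query by the edge's scalar before the dot product:
# (q_i + b_t) . k_j  =  q_i . k_j  +  b_t * sum(k_j)
bias = np.where(edge_type >= 0, b[np.clip(edge_type, 0, None)], 0.0)
scores = (q @ k.T + bias * k.sum(axis=-1)[None, :]) / np.sqrt(D)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)      # softmax over keys
```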

Hope this helps!

Vincent

@jonathan-laurent (Author)

Thanks a lot for this thoughtful answer.

> One solution could be to compile a richer benchmark of tasks to evaluate the impact of these smaller factors across a statistically robust set of examples.

This is a great point, and I hope to be able to contribute to this effort.

I say "sort of" because the bias terms currently already interacts with the query and key through the (q+b)k (or vice versa) attention formulation, but it's less direct than it could be.

The interaction is really minimal here though. If b is a scalar, (q+b)k is equal to q k + b sum(k), and so the only interaction is through the sum of all key components (with no dependency on the query). This makes me wonder if just using qk + b instead would result in worse performance.
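
Just to spell out the algebra I mean:

```python
import numpy as np

# For scalar b, (q + b) . k == q . k + b * sum(k): the bias only ever meets
# the key through its component sum, never the query.
rng = np.random.default_rng(2)
q, k, b = rng.normal(size=8), rng.normal(size=8), 0.3
assert np.isclose((q + b) @ k, q @ k + b * k.sum())
```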

> Upgrading it to an N-dim state (as in Shaw et al.'s work, though it makes the computation quite a bit more expensive), adding various types of "payload" to the edges, or maybe enriching the edge vocabulary (supernodes anyone?) are all bound to help, though presumably with diminishing returns for the added model complexity and training cost.

I agree that benchmarking an N-dim version would be interesting too. I suspect it would be especially interesting in the kind of applications where data-efficiency becomes more important relative to inference cost (such as in more RL-ish applications).

Finally, I was also wondering: did you run any benchmarks without positional encodings? My guess is that GREAT would not perform well, as the edge biases are probably not enough to compensate for the absence of positional information. However, the situation may very well change as one considers richer forms of relational attention.

@VHellendoorn (Owner)

The q/k/b interaction is definitely quite limited in GREAT; the intuition was that we are unlikely to want to attend more or less to, e.g., incoming data-flow for some specific query, but we do want to be able to prioritize some keys over others when such edges are present. Still, it's just one of many possible instantiations; I definitely like the idea of doing q k + w(q) sum(k) (as the paper you linked likely does). While I didn't experiment much with qk + b directly, I recall it not being quite as effective; I guess it's easy enough to try in this repo. That would be the equivalent of the T5 relative positional encoding, in the same way that GREAT is like Shaw et al. but specifically for (multiple) edge types.
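
As a toy comparison of the variants floating around in this thread (my own shorthand, not code from GREAT or from the linked paper):

```python
import numpy as np

# Three placements of the edge bias for a single query/key pair q, k of an
# edge with learned scalar b; W is a hypothetical learned map for w(q).
rng = np.random.default_rng(3)
D = 8
q, k = rng.normal(size=D), rng.normal(size=D)
b = 0.3
W = 0.1 * rng.normal(size=D)

great_score   = (q + b) @ k               # GREAT:  q.k + b * sum(k)
t5_style      = q @ k + b                 # pure additive bias, T5-like
query_dynamic = q @ k + (W @ q) * k.sum() # bias conditioned on the query state
```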

> I agree that benchmarking an N-dim version would be interesting too. I suspect it would be especially interesting in the kind of applications where data-efficiency becomes more important relative to inference cost (such as in more RL-ish applications).

That's definitely possible. It's worth noting, though, just how massively an N-dim bias can impact your memory footprint: needing to instantiate a bqkd (batch × query × key × dim) bias matrix alone can really eat into your memory when your inputs are long -- I saw my max batch size drop 5x at times. Maybe there's a simpler reformulation that sidesteps explicitly creating that matrix? That would certainly make sense when edges are sparse (roughly O(q) of them). Of course one could rewrite this to effectively imitate the GNN logic of message passing, but that eats up a lot of compute, which wouldn't be as much of a concern in RL settings, as you point out.
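
To make that footprint concrete, some back-of-the-envelope numbers (purely illustrative):

```python
# A d-dimensional bias per query/key pair means materializing a
# (batch, query, key, dim) tensor; with illustrative sizes:
batch, seq, dim = 16, 1000, 64
floats = batch * seq * seq * dim
print(f"{floats * 4 / 2**30:.1f} GiB in fp32")   # ~3.8 GiB for a single such tensor
```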

I did not benchmark this without positional encodings; I very much believe you're right that it would not do terribly well using just next/previous syntax (and, to some extent, other) edges to convey positional information. In fact, even the GGNN benefits from adding positional encodings to the base embeddings, presumably for the same reason. I imagine the "richer forms" of relational attention needed to overcome this would have to involve some notion of longer, but still sequential, distance. For code, statement-level boundaries seem like an obvious fit; they'd allow you to jump 5-15 tokens at a time, which is plenty for most inputs across 8-12 layers. More generally, I would imagine transitive edges would help (which I experimented with at the time, with some success, but eventually left out for simplicity). And maybe also bringing in T5-style relative positional encoding, as just another edge type.

-Vincent
