Thoughts on Dynamic Graph Transformers? #6
Comments
That is an interesting paper. A disclaimer: I found it a bit difficult to get the full picture of how their model works from Section 3.2; maybe things will become clearer once they release their code.

That aside, I definitely considered making the edge bias both layer- and head-(in)dependent; almost any permutation of these makes sense, but much like in our last discussion, I found no noticeable gain in splitting it up by heads (although, similarly, it costs nearly nothing to do so). It doesn't look like this paper ablates that decision either.

It occurs to me that a lot of these small, seemingly reasonable decisions are hard to evaluate, because the inherent randomness of initialization and the sheer capacity of these models make the actual performance impact hard to trace. One solution could be to compile a richer benchmark of tasks to evaluate the impact of these smaller factors across a statistically robust set of examples. I have something in the works that might(?) be suitable, but nothing expected soon. Happy to hear ideas about this too; there really should be a more reliable way of figuring this out.

The second innovation here pertains to conditioning the edge bias on the node state. This sort of makes sense, and I believe I've seen it suggested in other papers too. I say "sort of" because the bias term already interacts with the query and key through the

Hope this helps!
Vincent
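To make the head-(in)dependent variants concrete, here is a minimal NumPy sketch of an edge-type attention bias with and without a per-head split. All names, shapes, and values are illustrative assumptions, not taken from either codebase:

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, num_edge_types, n = 4, 3, 5

# GREAT-style scalar bias: one learned value per edge type,
# shared across heads (and, optionally, across layers).
bias_shared = rng.normal(size=num_edge_types)                  # (E,)
# Per-head variant: a value per (head, edge type) pair --
# almost free in parameters (only H * E extra scalars).
bias_per_head = rng.normal(size=(num_heads, num_edge_types))   # (H, E)

# edge_type[i, j]: type of the edge between nodes i and j, or -1 for none.
edge_type = rng.integers(-1, num_edge_types, size=(n, n))
mask = edge_type >= 0

logits = rng.normal(size=(num_heads, n, n))  # pre-bias attention logits

# The shared bias broadcasts identically over all heads...
logits_shared = logits + np.where(mask, bias_shared[edge_type], 0.0)
# ...while the per-head bias lets each head weight edge types differently.
logits_per_head = logits + np.where(mask, bias_per_head[:, edge_type], 0.0)
```

The parameter count difference (`E` vs. `H * E` scalars per layer) is why splitting by heads "costs nearly nothing" either way.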
Thanks a lot for this thoughtful answer.
This is a great point and hopefully I should be able to contribute to this effort.
The interaction is really minimal here though. If
I agree that benchmarking an N-dim version would be interesting too. I suspect it would be especially interesting in the kinds of applications where data efficiency becomes more important relative to inference cost (such as more RL-ish applications).

Finally, I was also wondering: did you run any benchmarks without positional encodings? My guess is that it would not perform well on GREAT, as the edge biases are probably not enough to compensate for the absence of positional info. However, the situation may very well change as one considers richer forms of relational attention.
The
That's definitely possible. It's worth noting, though, just how massively an N-dim bias can impact your memory footprint; just needing to instantiate a

I did not benchmark this without positional encodings; I very much believe that you're right: it would not do terribly well using just next/previous syntax (and, to some extent, other) edges to confer positioning. In fact, even the GGNN benefits from adding positional encoding to the base embeddings, presumably for the same reason.

I imagine the "richer forms" of relational attention needed to overcome this would have to involve some notion of longer, but still sequential, distance. For code, statement-level boundaries seem like an obvious fit; they'd allow you to jump 5-15 tokens at a time, which is plenty for most inputs across 8-12 layers. More generally, I would imagine transitive edges would help (which I experimented with at the time, with some success, but eventually left out for simplicity). And maybe also bringing in T5-style relative positional encoding, as just another edge type.

-Vincent
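On the memory point, a back-of-the-envelope sketch may help. All sizes below are made-up illustrative numbers, not figures from the experiments:

```python
def bias_bytes(batch, heads, n, dim, dtype_bytes=4):
    """Memory for a dense per-pair bias tensor of shape (batch, heads, n, n, dim)."""
    return batch * heads * n * n * dim * dtype_bytes

# Scalar bias per head and token pair (dim = 1) vs. a 64-dim bias,
# at batch 16, 8 heads, 512 tokens, fp32 -- per layer, before gradients.
scalar_mib = bias_bytes(16, 8, 512, 1) / 2**20    # 128 MiB
vector_mib = bias_bytes(16, 8, 512, 64) / 2**20   # 8192 MiB (8 GiB)
```

The N-dim version multiplies an already-quadratic tensor by `dim`, which is what makes it so easy to blow past GPU memory at realistic sequence lengths.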
First of all, thanks for making your code available in such a way!
In this paper, the authors propose a generalization of the GREAT architecture where, for each attention layer, a learned linear transform is used to compute a bias matrix of dimensions `num_heads x num_edge_types` for each node of the graph. Said differently, they make it possible for the added attention bias to depend on the current node and head number. I was wondering if you had run experiments with similar generalizations on your dataset, or if you had any thoughts about it.
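If I am reading the construction correctly, a minimal sketch of such a node-conditioned bias might look like the following. This is hypothetical NumPy code under my own assumptions (`W`, all shapes, and the choice to condition on the query node are mine, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_heads, num_edge_types, n = 16, 4, 3, 6

# Hypothetical learned projection: node state -> per-head, per-edge-type bias.
W = 0.1 * rng.normal(size=(d_model, num_heads * num_edge_types))
nodes = rng.normal(size=(n, d_model))  # current node states at this layer

# One (num_heads x num_edge_types) bias matrix per node.
node_bias = (nodes @ W).reshape(n, num_heads, num_edge_types)

# edge_type[i, j]: type of the edge from node i to node j, or -1 for no edge.
edge_type = rng.integers(-1, num_edge_types, size=(n, n))
mask = edge_type >= 0

# Gather the bias for attention logit (h, i, j) from the *query* node i
# (which node to condition on is itself a design choice).
rows = np.arange(n)[:, None]              # (n, 1), broadcasts with edge_type
gathered = node_bias[rows, :, edge_type]  # (n, n, num_heads)
attn_bias = np.where(mask[:, :, None], gathered, 0.0)
attn_bias = np.moveaxis(attn_bias, 2, 0)  # (num_heads, n, n), added to logits
```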
(Please do not hesitate to close this issue if you think this is not an appropriate forum for such a discussion.)