Implement relative positional encoding #139
Conversation
bfa5269 to 05f1d1d
Some basic ablations here: https://wandb.ai/entity-neural-network/enn-ppo/reports/Relative-positional-encoding-ablations--VmlldzoxNDM0MzIx A big caveat is that, at least on this task, we seem to need per-entity relative positional values to get good performance.
While per-entity relative positional values are a good solution for this environment, they are also quite limited. If, instead of separate food and snake entities, there were a single entity type with a feature identifying whether it is "food" or a "snake segment", per-entity relpos values wouldn't apply. This seems wrong; we ought to have a solution that works just as well in that case. More generally, the relevant property might not be the entity type at all, but some arbitrary feature of the entity learned by the network.

I believe we can come up with a new type of relative positional encoding that is fully general by allowing a non-linear combination of (a projection of) the entity embeddings and the positional features. Since there are N^2 relative positional values, we probably can't afford a full matmul, but there are some cheap element-wise operations that I think could work well. In particular, a good approach could be to perform an element-wise multiplication of the relative positional values and a projection of the corresponding entity embedding, using one of the GLU variants described in Shazeer (2020). This would effectively allow entities to apply an arbitrary gating function to any of the relative positional values, and should be strictly more powerful than per-entity positional encodings.

An important related question is whether all of this is even necessary. In principle, a multi-layer or multi-head attention network ought to be able to perform the same operation. E.g., a two-head attention layer could retrieve all
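To make the gating idea concrete, here is a minimal NumPy sketch of GEGLU-style gating of relative positional values. All names (`gate_relpos_values`, `w_gate`) are hypothetical, not code from this PR, and whether the gate should come from the querying or the attended-to entity is left open:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (Hendrycks & Gimpel)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gate_relpos_values(embeddings, relpos_values, w_gate):
    """GEGLU-style gating of relative positional values (sketch).

    embeddings:    (s, d_model) entity embeddings
    relpos_values: (s, s, d_head) per-pair relative positional values
    w_gate:        (d_model, d_head) learned gate projection (hypothetical)
    """
    gate = gelu(embeddings @ w_gate)          # (s, d_head)
    # Each querying entity i applies an element-wise gate to the relpos
    # values of every entity j it attends to; the gate could equally be
    # derived from the attended-to entity instead.
    return relpos_values * gate[:, None, :]   # (s, s, d_head)
```

Because the gate is computed per entity and applied element-wise, this costs O(s * d_model * d_head) for the projection plus O(s^2 * d_head) for the broadcasted multiply, avoiding any per-pair matmul.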
Implements a version of relative positional encoding for n-dimensional grids. Relative positional encoding covering e.g. an 11 × 13 neighborhood in an environment with a 2D grid can be enabled by passing
`--relpos-encoding='{"extent": [5, 6], "position_features": ["x", "y"], "per_entity_values": true}'`
to `enn_ppo/train.py`.

There are many variations and refinements of relative positional encoding. This implementation mostly follows the original formulation described in Shaw et al. (2018). In particular, here is a non-exhaustive list of somewhat arbitrary design choices that we may want to revisit once we have some good benchmarks to test against:
The current implementation requires d * s^2 memory, where d is the head dimension and s is the sequence length. Since our sequences are relatively short so far, this does not present a major issue. The usual trick that reduces memory usage by a factor of s only works for regular sequences, not for our more general version where entities can sit at arbitrary grid points. We could still achieve the same savings with a custom GPU kernel, though.
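For concreteness, here is a minimal NumPy sketch (the function name is hypothetical, not this PR's code) of computing flattened relative-position indices for entities at arbitrary n-dimensional grid points, with offsets clipped to the configured extent in the spirit of Shaw et al. (2018). The resulting (s, s) index table is what makes the materialized per-pair values d * s^2 in memory:

```python
import numpy as np

def relpos_indices(positions, extent):
    """Flattened relative-position indices for entities on an n-d grid.

    positions: (s, k) integer grid coordinates (k position features,
               e.g. x and y)
    extent:    length-k list; offsets are clipped to [-extent, extent]
               per axis
    Returns an (s, s) array indexing into a table of
    prod(2 * extent + 1) learned relpos embeddings.
    """
    extent = np.asarray(extent)
    rel = positions[None, :, :] - positions[:, None, :]  # (s, s, k)
    rel = np.clip(rel, -extent, extent) + extent         # shift into [0, 2*extent]
    sizes = 2 * extent + 1
    # Row-major strides for flattening the k-dimensional offset
    strides = np.concatenate([np.cumprod(sizes[::-1])[::-1][1:], [1]])
    return rel @ strides                                 # (s, s)

# Looking up per-pair values with these indices materializes an
# (s, s, d_head) tensor, hence the d * s^2 memory: e.g. d_head = 64
# and s = 256 entities gives 64 * 256 * 256 fp32 values = 16 MiB
# per head, before batching.
```

With `extent = [5, 6]` this yields indices into an 11 × 13 = 143-entry embedding table, matching the `--relpos-encoding` example above.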