This anonymous repo contains an introduction to and code for the paper "Cached Transformers: Improving Transformers with Differentiable Memory Cache".
In this work, we propose a novel family of Transformer models, called Cached Transformers, equipped with a Gated Recurrent Cache (GRC), a lightweight and flexible module that enables Transformers to access historical knowledge.
We study this caching behavior in image classification and find that GRC separates features into two parts: attending over the cache yields instance-invariant features, while attending over the inputs themselves yields instance-specific features (see the visualizations below).
We conduct extensive experiments on more than ten representative Transformer networks from both vision and language tasks, including Long Range Arena, image classification, object detection, instance segmentation, and machine translation. The results demonstrate that our approach significantly improves the performance of recent Transformers.
Illustration of the proposed GRC-Attention in Cached Transformers.
(a) Details of the updating process of the Gated Recurrent Cache. The updated cache is obtained by gating between the previous cache and the current inputs.
(b) Overall pipeline of GRC-Attention. Inputs attend over the cache and over themselves, respectively, and the outputs are formed as an interpolation of the two attention results.
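To make the pipeline above concrete, here is a minimal, self-contained PyTorch sketch of a GRC-style attention layer. It is not the implementation in "core": the names (`GRCAttentionSketch`, `cache_len`, `mix_gate`, etc.), the fixed cache length, the scalar mixing gate, and the batch-averaged, detached cache persistence are all simplifying assumptions made for illustration.

```python
import torch
import torch.nn as nn


class GRCAttentionSketch(nn.Module):
    """Illustrative GRC-style attention: self-attention plus attention over a
    persistent, gated cache, mixed by a learnable gate (simplified to a scalar)."""

    def __init__(self, dim, num_heads=8, cache_len=10):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cache_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Persistent cache tokens shared across samples (instance-invariant part).
        self.register_buffer("cache", torch.zeros(1, cache_len, dim))
        self.to_cache = nn.Linear(dim, dim)      # compresses current tokens
        self.update_gate = nn.Linear(dim, dim)   # GRU-like gate for the cache update
        self.mix_gate = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

    def forward(self, x):
        # x: (batch, seq_len, dim)
        b, cache_len = x.size(0), self.cache.size(1)
        old_cache = self.cache.expand(b, -1, -1).contiguous()

        # (a) Gated cache update: blend the old cache with a compressed summary
        # of the current tokens (differentiable within this forward pass).
        summary = self.to_cache(x).mean(dim=1, keepdim=True).expand(-1, cache_len, -1)
        g = torch.sigmoid(self.update_gate(old_cache))
        new_cache = g * old_cache + (1.0 - g) * summary

        # (b) Two attention branches: over the inputs themselves and over the cache.
        o_self, _ = self.self_attn(x, x, x)                   # instance-specific
        o_mem, _ = self.cache_attn(x, new_cache, new_cache)   # instance-invariant

        # Persist a detached, batch-averaged cache for the next iteration.
        with torch.no_grad():
            self.cache.copy_(new_cache.mean(dim=0, keepdim=True))

        # Output is an interpolation of the two attention results.
        lam = torch.sigmoid(self.mix_gate)
        return lam * o_mem + (1.0 - lam) * o_self


# Quick shape check
layer = GRCAttentionSketch(dim=64)
out = layer(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```

In this sketch the cache update is differentiable within each forward pass, while the persistent buffer is detached across iterations; the actual GRC formulation may parameterize the gates and cache tokens differently.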
To verify that the above performance gains mainly come from attending over caches, we analyze the contribution of the two attention branches to the final outputs, i.e., the weight that the learned gates assign to the cache branch.
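As a hedged illustration of such an analysis, the snippet below (again based on the `GRCAttentionSketch` above, not the released code) reads out the learned mixing gate of every cached layer in a model; the helper name `cache_branch_ratios` is hypothetical.

```python
import torch
import torch.nn as nn


def cache_branch_ratios(model: nn.Module):
    # Collect sigmoid(mix_gate) per layer: the weight given to the cache branch
    # in the interpolated output of each GRCAttentionSketch layer defined above.
    return {
        name: torch.sigmoid(module.mix_gate).item()
        for name, module in model.named_modules()
        if isinstance(module, GRCAttentionSketch)
    }
```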
We investigate the function of GRC-Attention by visualizing its internal feature maps.
We choose the middle layers of the cached ViT-S and average the outputs of the self-attention branch and the cache branch, respectively, to obtain one feature map per branch.
In GRC-Attention, attending over the cache yields instance-invariant features shared within a class, while attending over the inputs themselves yields instance-specific features, as illustrated by the visualizations.
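The snippet below is a minimal sketch of how such per-branch feature maps could be produced for a ViT-S-style model (14x14 patch grid, embedding dimension 384); the function name `branch_feature_map` and the capture of `o_self` / `o_mem` (e.g., via forward hooks) are assumptions for illustration, not the paper's exact procedure.

```python
import torch


def branch_feature_map(tokens, grid=14):
    # tokens: (batch, 1 + grid*grid, dim) -- class token followed by patch tokens
    patch = tokens[:, 1:, :]               # drop the class token
    fmap = patch.mean(dim=-1)              # average over channels
    return fmap.reshape(-1, grid, grid)    # (batch, grid, grid) heatmaps


# Example with dummy tensors standing in for a middle layer's branch outputs.
o_self = torch.randn(4, 1 + 14 * 14, 384)
o_mem = torch.randn(4, 1 + 14 * 14, 384)
maps_self, maps_mem = branch_feature_map(o_self), branch_feature_map(o_mem)
print(maps_self.shape, maps_mem.shape)     # torch.Size([4, 14, 14]) each
```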
The PyTorch implementation of the GRC-Attention module is provided in the "core" directory. Full training and testing code will be released later.