
Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences

Diffuser combines the efficiency of sparse Transformers with the expressiveness of full-attention Transformers. The key idea is to expand the receptive field of sparse attention with Attention Diffusion, which computes multi-hop token correlations over all paths between otherwise disconnected tokens, in addition to the attention among neighboring tokens.
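As a rough illustration (not the repository's actual implementation), multi-hop attention diffusion can be sketched as a weighted sum of powers of a sparse one-hop attention matrix, so that tokens with no direct attention edge still influence each other through intermediate hops. The hop count and decay weight below are illustrative assumptions, not the values used in this code.

```python
# Toy sketch of multi-hop attention diffusion over a sparse attention matrix.
# Hyperparameters (num_hops, alpha) are illustrative only.
import torch

def attention_diffusion(sparse_attn: torch.Tensor, num_hops: int = 3, alpha: float = 0.5) -> torch.Tensor:
    """Expand a (seq_len x seq_len) row-stochastic sparse attention matrix.

    The k-th power of sparse_attn aggregates attention over all k-step paths,
    so disconnected token pairs acquire a correlation through intermediate tokens.
    """
    diffused = torch.zeros_like(sparse_attn)
    hop = torch.eye(sparse_attn.size(0), device=sparse_attn.device)  # 0-hop term
    weight, total = 1.0, 0.0
    for _ in range(num_hops + 1):
        diffused = diffused + weight * hop
        total += weight
        hop = hop @ sparse_attn   # add one more hop along every path
        weight *= alpha           # geometrically decaying hop weights
    return diffused / total       # keep rows summing to 1

# Usage: a banded (local-window) attention pattern over 8 tokens
seq_len = 8
scores = torch.randn(seq_len, seq_len)
local = (torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]).abs() <= 1
sparse_attn = torch.softmax(scores.masked_fill(~local, float("-inf")), dim=-1)
multi_hop_attn = attention_diffusion(sparse_attn)  # distant tokens now get nonzero weight
```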

Code

Important files in the code are:

Installation

Install PyTorch following the instructions on the [official website](https://pytorch.org/). The code has been tested with PyTorch 1.8.0.

Other dependencies are listed in requirements.txt.
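For example, assuming a pip-based environment (exact commands may differ for your CUDA setup):

pip install torch==1.8.0
pip install -r requirements.txt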

Running Classification

To run the IMDB review classification task on a single GPU:

CUDA_VISIBLE_DEVICES=0 python train_classification_imdb.py 

Multi-GPU training must be launched with DistributedDataParallel (DDP) for both PyTorch and DGL:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train_classification_imdb.py

Model configurations are listed in config.json, and training arguments can be changed in train_classification_imdb.py.
