
Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.0.27] - TBD

Added

Improved

Removed

[0.0.26] - 2024-04-29

Added

  • [2:4 sparsity] Added support for Straight-Through Estimator for sparsify24 gradient (GRADIENT_STE)
  • [2:4 sparsity] sparsify24_like now supports the cuSparseLt backend, and the STE gradient
  • Basic support for torch.compile for the memory_efficient_attention operator. This currently only covers the Flash-Attention backend, and only when no bias is provided; we want to expand this coverage progressively (see the sketch after this list).
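
A minimal sketch of what the torch.compile support covers, assuming the Flash-Attention backend is available and no attention bias is passed (shapes and dtypes below are illustrative):

```python
import torch
import xformers.ops as xops

@torch.compile
def attention(q, k, v):
    # No attn_bias: compile coverage currently targets the Flash-Attention
    # backend without any bias.
    return xops.memory_efficient_attention(q, k, v)

B, M, H, K = 2, 1024, 8, 64  # batch, sequence length, heads, head dim
q, k, v = (
    torch.randn(B, M, H, K, device="cuda", dtype=torch.float16) for _ in range(3)
)
out = attention(q, k, v)  # [B, M, H, K]
```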

Improved

  • merge_attentions no longer needs inputs to be stacked.
  • fMHA: triton_splitk now supports additive bias
  • fMHA: benchmark cleanup

Removed

[0.0.25.post1] - 2024-03-29

Pre-built binary wheels require PyTorch 2.2.2

[0.0.25] - 2024-03-14

Pre-built binary wheels require PyTorch 2.2.1

Added

  • New merge_attentions function
  • fMHA: New gappy attention biases.

Improved

  • fMHA: Updated Flash-Attention to v2.5.6: this has a performance improvement for multiquery.
  • fMHA: triton_splitk changed and expanded: partial results are now merged using the LSE (log-sum-exp), the kernel can autotune, and causal attention is supported for a small number of queries (not just 1). Experimental support for paged attention.
  • rope_padded: Fixed CUDA error with many queries (more than 65k)
  • rmsnorm: Fixed CUDA error with large inputs (enables 512k+ sequence length on Llama2 70B)

Removed

  • fMHA: Removed triton operator (fmha.triton.*, xformers.ops.MemoryEfficientAttentionTritonFwdFlashBwOp, xformers.ops.TritonFlashAttentionOp), as it has correctness issues under some conditions, and is slower than other implementations.

[0.0.24] - 2024-01-31

Pre-built binary wheels require PyTorch 2.2.0

Added

  • Added components for model/sequence parallelism, as near-drop-in replacements for the FairScale/Megatron ColumnParallelLinear and RowParallelLinear modules. They support fusing communication and computation for sequence parallelism, making the communication effectively free.
  • Added kernels for training models with 2:4 sparsity. We introduced a very fast kernel for converting a matrix A into a 2:4-sparse format, which can be used during training to dynamically sparsify weights, activations, etc. xFormers also provides an API that is compatible with torch.compile, see xformers.ops.sparsify24 (usage sketched after this list).
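
A hedged sketch of how the 2:4 sparsification API might be used during training. The call below assumes xformers.ops.sparsify24 accepts a dense half-precision tensor with default arguments and returns a 2:4-sparse tensor that can stand in for the dense weight; backend and gradient selection arguments are omitted.

```python
import torch
import torch.nn.functional as F
import xformers.ops as xops

# Dense weight on GPU in half precision (2:4 sparsity targets tensor-core dtypes).
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16, requires_grad=True)

# Dynamically convert to the 2:4-sparse format (assumption: defaults suffice here).
w_sparse = xops.sparsify24(w)

x = torch.randn(16, 4096, device="cuda", dtype=torch.float16)
# Assumption: the sparse result can be consumed by F.linear like a dense weight.
y = F.linear(x, w_sparse)
```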

Improved

  • Made selective activation checkpointing compatible with torch.compile.

Removed

  • Triton kernels now require a GPU with compute capability 8.0 or higher (A100 or newer), because newer versions of Triton do not correctly support older GPUs
  • Removed support for PyTorch versions older than 2.1.0

[0.0.23] - 2023-12-05

Pre-built binary wheels require PyTorch 2.1.1 (xFormers 0.0.23) or PyTorch 2.1.2 (xFormers 0.0.23.post1).

Fixed

  • fMHA: Fixed a bug in the cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the backward pass. This would happen with MQA when one sequence has a query with length % 64 == 1
  • fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes and adds support for BlockDiagonalCausalWithOffsetPaddedKeysMask

Added

  • fMHA: Added LocalAttentionFromBottomRightMask (local)
  • fMHA: Added LowerTriangularFromBottomRightMask (causal); see the usage sketch after this list
  • fMHA: Added LowerTriangularFromBottomRightLocalAttentionMask (local + causal)
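
A short sketch of using one of the new biases with memory_efficient_attention; LowerTriangularFromBottomRightMask aligns the causal diagonal to the bottom-right corner of the attention matrix, which is the usual convention when there are fewer queries than keys (e.g. decoding on top of a prompt). Shapes below are illustrative.

```python
import torch
import xformers.ops as xops
from xformers.ops.fmha.attn_bias import LowerTriangularFromBottomRightMask

B, H, K = 2, 8, 64
Mq, Mk = 16, 1024  # 16 new queries attending to a 1024-token context
q = torch.randn(B, Mq, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, Mk, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, Mk, H, K, device="cuda", dtype=torch.float16)

# Causal mask whose diagonal ends in the bottom-right corner of the [Mq, Mk] matrix.
bias = LowerTriangularFromBottomRightMask()
out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)  # [B, Mq, H, K]
```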

Removed

  • Removed xformers.triton.sum_strided

[0.0.22] - 2023-09-27

Fixed

  • fMHA: Backward pass now works in PyTorch deterministic mode (although slower)

Added

  • fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to memory_efficient_attention; see the documentation for more details and the sketch after this list
  • fMHA: Added experimental support for Local Attention biases to memory_efficient_attention
  • Added an example of efficient LLaMa decoding using xformers operators
  • Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
  • Added an efficient RoPE (rotary position embedding) implementation in Triton, to be used in LLM decoding
  • Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
  • xformers.info now indicates the Flash-Attention version used
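
A hedged sketch of the 5-dimensional MQA/GQA layout mentioned above, assuming inputs of shape [batch, seqlen, groups, heads-per-group, dim] where the key/value heads are shared within a group and broadcast with expand (stride 0, no copy); the documentation remains the authoritative reference for this layout.

```python
import torch
import xformers.ops as xops

B, M, G, Hq, K = 2, 1024, 4, 8, 64  # 4 KV groups, 8 query heads per group
q = torch.randn(B, M, G, Hq, K, device="cuda", dtype=torch.float16)
# One key/value head per group, broadcast across the query heads of that group.
k = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, Hq, K)
v = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, Hq, K)

out = xops.memory_efficient_attention(q, k, v)  # [B, M, G, Hq, K]
```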

Removed

  • fMHA: Removed smallK backend support for CPU. memory_efficient_attention only works for CUDA/GPU tensors now
  • DEPRECATION: Many classes in xformers.factory, xformers.triton and xformers.components have been or will be deprecated soon (see tracking issue #848)

[0.0.21] - 2023-08-18

Improved

  • fMHA: Updated flash-attention to v2, with massive performance improvements for both the forward pass and backward pass. This implementation is now used by default when it's available

Bug fixes

  • fMHA/cutlass: Fix potential race condition in the FW/BW passes
  • fMHA/cutlass: Fix attn_bias stride overflow for very long sequences (>32k)
  • LowerTriangularMask is now backward compatible with older xformers versions

Breaking changes

  • memory_efficient_attention now expects the attn_bias argument to have a head dimension
  • memory_efficient_attention no longer broadcasts the batch/head dimensions of attn_bias. Please use .expand if you need to broadcast the bias (see the sketch after this list)
  • Removed the causal_diagonal argument from BlockDiagonalCausalWithOffsetPaddedKeysMask
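
A minimal sketch of adapting existing code to these changes, with illustrative shapes: materialize the head dimension on the bias and broadcast it explicitly with expand instead of relying on implicit broadcasting.

```python
import torch
import xformers.ops as xops

B, M, H, K = 2, 128, 8, 64
q, k, v = (
    torch.randn(B, M, H, K, device="cuda", dtype=torch.float16) for _ in range(3)
)

# One bias matrix per batch element, shared across all heads.
bias = torch.randn(B, 1, M, M, device="cuda", dtype=torch.float16)
# The head dimension must now be present and is no longer broadcast implicitly:
bias = bias.expand(B, H, M, M)

out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
```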

Added

  • Binary wheels on pypi/conda now contain H100 kernels
  • fMHA: Added a backend specialized for decoding that does not use Tensor Cores - useful when not using multiquery

NOTE: Binary wheels are now provided only for PyTorch 2 with CUDA 11.8. It is still possible to use xFormers with older versions of PyTorch by building from source or using conda.

[0.0.20] - 2023-05-23

Improved

  • fMHA/cutlass (backward): Massive performance improvements when batch_size * num_heads is low (10x+)
  • fMHA/cutlass: Further performance improvements for both the forward & backward kernels
  • fMHA (backward): Now dispatching to cutlass when embed_dim>64
  • fMHA: Updated Flash-Attention to v1.0.5

Added

  • fMHA now runs on H100 (support is experimental)

[0.0.19] - 2023-04-28

Added

  • Display the nvcc version used to compile xFormers in python -m xformers.info

Fixed

  • Fixed performance regression with nvcc>11.6 (#712)
  • fMHA/cutlass: Fixed nan in the output when using a torch.Tensor with -inf prefixes as attn_bias (#722)
  • fMHA/cutlass: Fixed nan in the output when the sequence length is larger than 2 ** 15 (#719)
  • fMHA/cutlass: Significant performance improvements (up to 2x) for both the forward pass and backward pass
  • fMHA/cutlass: The kernels are now deterministic
  • fMHA/cutlass: Fixed backward pass correctness when using dropout (#724)

[0.0.18] - 2023-03-31

Added

  • Added xformers.ops.index_select_cat and xformers.ops.scaled_index_add - these are experimental functions that only work for a few shapes and can be used, for instance, to implement efficient stochastic depth in transformer architectures

Fixed

  • fMHA: memory_efficient_attention now accepts torch.Tensor as attention bias for any seqlen, although there are still requirements on the alignment of the bias tensor (see #683)

[0.0.17] - 2023-03-28

Fixed

  • fMHA: Fixed BW pass on Sm86/Sm89 GPUs when K > 64 (RTX 3090, RTX 4090, A6000, ...) [#631]

Added

  • fMHA/CUTLASS: Added tensor attn bias support [#587] - contribution from @jfc4050
  • fMHA/CUTLASS: Added tensor attn bias grad support [#587] - contribution from @jfc4050
  • fMHA/CUTLASS: Added dropout support [#587] - contribution from @jfc4050
  • fMHA: Added support for varying sequence lengths [#500]

[0.0.16] - 2023-01-31

Fixed

  • Updated triton dependency [#418]
  • Stripped lineinfo from binaries, reducing the binary size [#549]
  • Added support for pip wheels [#588, #573, #534, #523, ...], big thanks to @AbdBarho!
  • Fixed compatibility with Python 3.7 [#541] - thanks to @susumuota
  • fMHA: Fixed strides for QKV gradients for cutlass attention [#535]
  • fMHA: Stricter inputs validation to avoid CUDA errors for unsupported inputs [#592]
  • fMHA/Flash-Attention: Updated to https://github.com/HazyResearch/flash-attention/commit/a1f49a2b92b6fa022379bbebafed9d7f5e96a675 with multiple changes from @TriDao that make the operator up to 20% faster
  • fMHA/Flash-Attention: Fixed backward pass wrapper, where non-contiguous gradients could give the wrong result [#548]
  • fMHA: Separated each operator into forward and backward operators. It's now possible to use any combination of forward+backward (for instance Triton forward and Flash-Attention backward) [#560]

Added

[0.0.15] - Skipped

[0.0.14] - 2022-11-10

Fixed

  • fMHA/CUTLASS: The current CUDA stream is now used by the kernel [#491]
  • fMHA/CUTLASS: Improved overall performance

Added

  • SwiGLU: Added xformers.ops.SwiGLU and its functional counterpart (xformers.ops.swiglu) [#490]; see the sketch after this list
  • fMHA: It is now possible to combine CUTLASS's forward pass with Flash-Attention's backward pass [#469], which improves performance on A100 for K = 128
  • fMHA: Added a custom xformers.ops.unbind operator to avoid a cat in the attention block [#458]
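
A hedged sketch of the new SwiGLU module, assuming a constructor of the form SwiGLU(in_features, hidden_features) and that the output dimension defaults to in_features; any additional arguments (e.g. out_features, bias) are not relied on here.

```python
import torch
import xformers.ops as xops

# SwiGLU feed-forward block: roughly silu(x @ W1) * (x @ W2), projected back by W3.
swiglu = xops.SwiGLU(in_features=512, hidden_features=1024).cuda().half()

x = torch.randn(16, 512, device="cuda", dtype=torch.float16)
y = swiglu(x)  # [16, 512], assuming out_features defaults to in_features
```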

[0.0.13] - 2022-09-26

Added

  • fMHA: Added a CUTLASS-based kernel for xformers.ops.memory_efficient_attention. This kernel is selected automatically depending on the inputs, and works on any GPU after the P100 [#362]; basic usage is sketched below
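
For reference, a minimal call to the operator this kernel backs; inputs are [batch, seqlen, heads, head-dim] and the values below are illustrative.

```python
import torch
import xformers.ops as xops

B, M, H, K = 2, 512, 8, 64
q, k, v = (
    torch.randn(B, M, H, K, device="cuda", dtype=torch.float16) for _ in range(3)
)

# A backend (e.g. the CUTLASS-based kernel) is selected automatically
# based on the inputs and the GPU.
out = xops.memory_efficient_attention(q, k, v)  # [B, M, H, K]
```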

[0.0.12] - 2022-08-08

Fixed

  • Removed duplicated biases in the FusedMLP layers [#317]
  • Rotary embeddings now respect input types [#326]
  • Fixed the Poolformer style instantiating useless projection layers [#349]
  • Fixed layer position not being properly tracked, which caused extra layernorms for programmatic xformers [#348]
  • Pass the use_triton flag to the LayerNorm module [#336]

Added

  • Four blocksparsity layouts from DeepSpeed [#320]
  • Support several initialization options [#312]
  • Conv2DFeedforward feedforward part [#321]
  • VisualAttention [#329]
  • Automatic blocksparse for causal attention [#334]
  • Better hierarchical transformer generation [#345]
  • Fused operations with AOTAutograd/NVFuser, integration into MLP [#357]
  • Refactored the LRA code to use PyTorch Lightning [#343]

[0.0.11] - 2022-05-30

Fixed

  • Fixed some torchscriptability issues [#246]
  • Fixed FourierMix compatibility with AMP [#258]
  • Better asserts on QKV dimensions [#264]
  • Better performance for FusedMLP and FusedLinearLayer [#283]
  • Fixed DeepNorm init missing self-attention [#284]

Added

  • Simplicial Embeddings [#259]
  • Mem efficient attention, FW pass [#267]
  • MHA benchmark
  • MLP benchmark
  • Move all triton kernels to triton v2 [#272]
  • Mem efficient attention, BW pass [#281]
  • Metaformer support [#294]

[0.0.10] - 2022-03-14

Fixed

  • Expose bias flag for feedforwards, same default as Timm [#220]
  • Update eps value for layernorm, same default as torch [#221]
  • PreNorm bugfix, only one input was normalized [#233]
  • Fix bug where embedding dimensions that did not match model dim would lead to a crash [#244]

Added

  • Add DeepNet (DeepNorm) residual path and init [#227]

[0.0.9] - 2022-02-09

Added

  • Compositional Attention [#41]
  • Experimental Ragged attention [#189]
  • Mixture of Experts [#181]
  • BlockSparseTensor [#202]
  • Nd-tensor support for triton softmax [#210]

Fixed

  • Bugfix Favor, single feature map [#183]
  • Sanity check blocksparse settings [#207]
  • Fixed some picklability [#204]

[0.0.8] - 2022-01-07

Fixed

  • Much faster fused dropout [#164]
  • Fused dropout repeatability [#173]

Added

  • Embedding weight tying option [#172]

[0.0.7] - 2021-11-30

Fixed

  • Dropout setting not properly passed in many attentions [#123]

[0.0.6] - 2021-11-24

Fixed

  • Fixed the self-attention optimization not being triggered and a broken residual path [#119]
  • Improved speed by not using contiguous tensors when not needed [#119]

Added

  • Attention mask wrapper [#113]
  • ViT comparison benchmark [#117]

[0.0.4] - 2021-11-16

Fixed

  • Homogenizing the masks, additive or bool [#79][#85][#86]
  • Fix causality flag not being respected [#103]
  • Enabling FusedLayerNorm by default in the factory if Triton is available
  • Fixing Favor with fp16
  • Fixing Favor trainability

Added

  • Fused dropout/bias/activation layer [#58]
  • Fused layernorm used by default in the factory [#92]

[0.0.3] - 2021-11-01

Fixed

  • Nystrom causal attention [#75]

[0.0.2] - 2021-11-01

Fixed

  • More robust blocksparse [#24]

Added

  • Rotary embeddings [#32]
  • More flexible layernorm [#50]