SwiGLU optimized fw/bw #490

Merged: 36 commits, merged on Nov 10, 2022

Commits on Oct 24, 2022

  1. SwiGLU optimized fw/bw

    [ghstack-poisoned]
    danthe3rd committed Oct 24, 2022 (commit 069405e)
  2. Update on "SwiGLU optimized fw/bw"

    [ghstack-poisoned]
    danthe3rd committed Oct 24, 2022 (commit 4b317c6)

Commits on Oct 25, 2022

  1. Update on "SwiGLU optimized fw/bw"

    [ghstack-poisoned]
    danthe3rd committed Oct 25, 2022 (commit 11bad90)
  2. Update on "SwiGLU optimized fw/bw"

    [ghstack-poisoned]
    danthe3rd committed Oct 25, 2022 (commit 8b2f688)
  3. Update on "SwiGLU optimized fw/bw"

    [ghstack-poisoned]
    danthe3rd committed Oct 25, 2022 (commit e1609de)
  4. Update on "SwiGLU optimized fw/bw"

    [ghstack-poisoned]
    danthe3rd committed Oct 25, 2022 (commit 30ca17c)
  5. Update on "SwiGLU optimized fw/bw"

    [ghstack-poisoned]
    danthe3rd committed Oct 25, 2022 (commit eb9c553)
  6. Update on "SwiGLU optimized fw/bw"

    [ghstack-poisoned]
    danthe3rd committed Oct 25, 2022 (commit ed2b7c2)
  7. Update on "SwiGLU optimized fw/bw"

    
    **USAGE**
    
    ```python
    import xformers.ops as xops
    
    # NOTE: Important to use `unbind` from xformers for the bw pass!
    w1, w2 = xops.unbind(
        w1w2.view([2, w1w2.shape[0] // 2, w1w2.shape[1]]),
        dim=0,
    )
    b1, b2 = xops.unbind(b1b2.view([2, b1b2.shape[0] // 2]), dim=0)
    y = xops.functional_swiglu(x,
        w1, b1, w2, b2, w3, b3,
        op=xops.SwiGLUFusedOp)
    ```
    
    
    [ghstack-poisoned]
    danthe3rd committed Oct 25, 2022 (commit e758435)
  8. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 25, 2022 (commit 3207254)
  9. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 25, 2022 (commit dbf6092)
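
For context on what the fused kernels compute: `functional_swiglu` in the commit message above is a SwiGLU MLP (a SiLU-gated two-branch projection followed by an output projection). Below is a minimal unfused sketch in plain PyTorch, assuming the shapes implied by the usage snippet (`x`: `[B, H]`, `w1`/`w2`: `[I, H]`, `w3`: `[H, I]`); it is a reading aid, not code from this PR.

```python
import torch
import torch.nn.functional as F

def swiglu_reference(x, w1, b1, w2, b2, w3, b3):
    """Unfused SwiGLU: (silu(x @ w1.T + b1) * (x @ w2.T + b2)) @ w3.T + b3."""
    gate = F.linear(x, w1, b1)    # [B, I], gated branch
    value = F.linear(x, w2, b2)   # [B, I], linear branch
    return F.linear(F.silu(gate) * value, w3, b3)  # back to [B, H]
```

The `eager` column in the benchmark tables further down presumably measures an unfused computation of this form.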

Commits on Oct 26, 2022

  1. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 26, 2022 (commit acdf239)
  2. Update on "SwiGLU optimized fw/bw"

    
    **USAGE**
    
    ```python
    import xformers.ops as xops
    
    # NOTE: Important to use `unbind` from xformers for the bw pass!
    w1, w2 = xops.unbind(
        w1w2.view([2, w1w2.shape[0] // 2, w1w2.shape[1]]),
        dim=0,
    )
    b1, b2 = xops.unbind(b1b2.view([2, b1b2.shape[0] // 2]), dim=0)
    y = xops.functional_swiglu(x,
        w1, b1, w2, b2, w3, b3,
        op=xops.SwiGLUPackedFusedOp)
    ```
    
    
    [ghstack-poisoned]
    danthe3rd committed Oct 26, 2022 (commit bbdc00e)
  3. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 26, 2022 (commit 5fe54aa)
  4. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 26, 2022 (commit 44a6fbf)
  5. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 26, 2022 (commit d3e3089)
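
The `SwiGLUPackedFusedOp` introduced in the commits above is fed `w1`/`w2` and `b1`/`b2` that are `xops.unbind` views of single packed buffers (`w1w2`, `b1b2`). A hedged sketch of how such packed parameters might be created; the sizes, initialization, and variable names are illustrative, not taken from the PR.

```python
import torch
import xformers.ops as xops

H, I = 1024, 4096  # hypothetical model width and hidden width

# Both input projections live in one packed weight [2 * I, H] and one packed bias [2 * I].
w1w2 = torch.nn.Parameter(torch.empty(2 * I, H, dtype=torch.half, device="cuda"))
b1b2 = torch.nn.Parameter(torch.zeros(2 * I, dtype=torch.half, device="cuda"))
torch.nn.init.normal_(w1w2, std=0.02)

# xformers' unbind returns views into the packed storage; per the commit message,
# using it (rather than torch.unbind) matters for the backward pass, presumably so
# gradients are produced directly for the packed buffers.
w1, w2 = xops.unbind(w1w2.view([2, I, H]), dim=0)
b1, b2 = xops.unbind(b1b2.view([2, I]), dim=0)
```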

Commits on Oct 27, 2022

  1. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 27, 2022 (commit db5770d)

Commits on Oct 28, 2022

  1. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 28, 2022 (commit 4c2bfdc)
  2. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 28, 2022 (commit d2d0187)
  3. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 28, 2022 (commit e2d97d2)
  4. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 28, 2022 (commit 7224112)
  5. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 28, 2022 (commit 06c1487)
  6. Update on "SwiGLU optimized fw/bw"

    
    danthe3rd committed Oct 28, 2022 (commit 783a2ff)
  7. Update on "SwiGLU optimized fw/bw"

    **NOTE**
    We can improve a bit more once this is fixed - NVIDIA/cutlass#674
    
    **USAGE**
    
    ```python
    import xformers.ops as xops
    
    # NOTE: Important to use `unbind` from xformers for the bw pass!
    w1, w2 = xops.unbind(
        w1w2.view([2, w1w2.shape[0] // 2, w1w2.shape[1]]),
        dim=0,
    )
    b1, b2 = xops.unbind(b1b2.view([2, b1b2.shape[0] // 2]), dim=0)
    y = xops.functional_swiglu(x,
        w1, b1, w2, b2, w3, b3,
        op=xops.SwiGLUPackedFusedOp)
    ```
    
    **PERFORMANCE (A100 only)**
    
    *FW*
    ```
    [-------------------------------------------------------- swiglu_fw ---------------------------------------------------------]
                                         |  SwiGLUPackedFusedOp[fused.p.cpp]  |  eager   |  SwiGLUFusedOp[fused]
    1 threads: -------------------------------------------------------------------------------------------------------------------
          f16    B=9456, I=1536, H=4096  |               1377.7               |  1581.4  |         1339.1
          f16.ac B=9456, I=1536, H=4096  |               1449.3               |  1735.3  |         1462.9
          f16    B=4440, I=1536, H=4096  |                600.4               |   735.6  |          593.9
          f16.ac B=4440, I=1536, H=4096  |                709.0               |   843.7  |          717.6
          f16    B=4728, I=1536, H=4096  |                638.9               |   776.2  |          635.3
          f16.ac B=4728, I=1536, H=4096  |                748.9               |   892.2  |          756.7
          f16    B=4728, I=1536, H=1024  |                162.3               |   201.5  |          163.1
          f16.ac B=4728, I=1536, H=1024  |                235.2               |   277.4  |          245.5
    
    Times are in microseconds (us).
    ```
    
    *BW*
    ```
    [-------------------------------------------------------- swiglu_bw ---------------------------------------------------------]
                                         |  SwiGLUPackedFusedOp[fused.p.cpp]  |  eager   |  SwiGLUFusedOp[fused]
    1 threads: -------------------------------------------------------------------------------------------------------------------
          f16    B=9456, I=1536, H=4096  |               2333.1               |  2696.7  |         2336.1
          f16.ac B=9456, I=1536, H=4096  |               2620.8               |  2990.9  |         2840.0
          f16    B=4440, I=1536, H=4096  |               1243.2               |  1413.8  |         1240.3
          f16.ac B=4440, I=1536, H=4096  |               1448.6               |  1629.0  |         1637.3
          f16    B=4728, I=1536, H=4096  |               1298.4               |  1481.5  |         1301.1
          f16.ac B=4728, I=1536, H=4096  |               1511.8               |  1705.3  |         1705.4
          f16    B=4728, I=1536, H=1024  |                463.3               |   493.9  |          463.0
          f16.ac B=4728, I=1536, H=1024  |                582.4               |   614.9  |          672.7
    
    Times are in microseconds (us).
    ```
    
    [ghstack-poisoned]
    danthe3rd committed Oct 28, 2022 (commit 69e299f)
  8. Update on "SwiGLU optimized fw/bw"

    danthe3rd committed Oct 28, 2022 (commit f6e2ceb)
  9. Update on "SwiGLU optimized fw/bw"

    danthe3rd committed Oct 28, 2022 (commit 538d05c)
  10. Update on "SwiGLU optimized fw/bw"

    danthe3rd committed Oct 28, 2022 (commit 0ab305f)
  11. Update on "SwiGLU optimized fw/bw"

    danthe3rd committed Oct 28, 2022 (commit c67a0ad)
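
The performance tables in the commits above look like `torch.utils.benchmark` `Compare` output (note the `1 threads:` header and the microseconds footer). Below is a minimal sketch of how one forward-pass measurement could be taken; this is not the PR's benchmark script, and the tensors are assumed to be CUDA fp16 inputs prepared as in the usage snippet.

```python
import torch.utils.benchmark as benchmark
import xformers.ops as xops

def bench_swiglu_fw(x, w1, b1, w2, b2, w3, b3):
    # Timer handles CUDA synchronization around each measurement.
    timer = benchmark.Timer(
        stmt="xops.functional_swiglu(x, w1, b1, w2, b2, w3, b3, op=op)",
        globals=dict(xops=xops, x=x, w1=w1, b1=b1, w2=w2, b2=b2,
                     w3=w3, b3=b3, op=xops.SwiGLUPackedFusedOp),
        label="swiglu_fw",
        description="SwiGLUPackedFusedOp",
    )
    return timer.blocked_autorange(min_run_time=1.0)

# Collecting several such measurements and passing them to
# benchmark.Compare([...]).print() renders tables in the format shown above.
```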

Commits on Oct 31, 2022

  1. Update on "SwiGLU optimized fw/bw"

    danthe3rd committed Oct 31, 2022 (commit a77aeec)
  2. Update on "SwiGLU optimized fw/bw"

    **NOTE**
    We can improve a bit more once this is fixed - NVIDIA/cutlass#674
    
    **USAGE**
    
    ```python
    import xformers.ops as xops
    
    # NOTE: Important to use `unbind` from xformers for the bw pass!
    w1, w2 = xops.unbind(
        w1w2.view([2, w1w2.shape[0] // 2, w1w2.shape[1]]),
        dim=0,
    )
    b1, b2 = xops.unbind(b1b2.view([2, b1b2.shape[0] // 2]), dim=0)
    y = xops.functional_swiglu(x,
        w1, b1, w2, b2, w3, b3)
    ```
    
    **PERFORMANCE (A100 only)**
    
    *FW*
    ```
    [-------------------------------------------------------- swiglu_fw ---------------------------------------------------------]
                                         |  SwiGLUPackedFusedOp[fused.p.cpp]  |  eager   |  SwiGLUFusedOp[fused]
    1 threads: -------------------------------------------------------------------------------------------------------------------
          f16    B=9456, I=1536, H=4096  |               1377.7               |  1581.4  |         1339.1
          f16.ac B=9456, I=1536, H=4096  |               1449.3               |  1735.3  |         1462.9
          f16    B=4440, I=1536, H=4096  |                600.4               |   735.6  |          593.9
          f16.ac B=4440, I=1536, H=4096  |                709.0               |   843.7  |          717.6
          f16    B=4728, I=1536, H=4096  |                638.9               |   776.2  |          635.3
          f16.ac B=4728, I=1536, H=4096  |                748.9               |   892.2  |          756.7
          f16    B=4728, I=1536, H=1024  |                162.3               |   201.5  |          163.1
          f16.ac B=4728, I=1536, H=1024  |                235.2               |   277.4  |          245.5
    
    Times are in microseconds (us).
    ```
    
    *BW*
    ```
    [-------------------------------------------------------- swiglu_bw ---------------------------------------------------------]
                                         |  SwiGLUPackedFusedOp[fused.p.cpp]  |  eager   |  SwiGLUFusedOp[fused]
    1 threads: -------------------------------------------------------------------------------------------------------------------
          f16    B=9456, I=1536, H=4096  |               2333.1               |  2696.7  |         2336.1
          f16.ac B=9456, I=1536, H=4096  |               2620.8               |  2990.9  |         2840.0
          f16    B=4440, I=1536, H=4096  |               1243.2               |  1413.8  |         1240.3
          f16.ac B=4440, I=1536, H=4096  |               1448.6               |  1629.0  |         1637.3
          f16    B=4728, I=1536, H=4096  |               1298.4               |  1481.5  |         1301.1
          f16.ac B=4728, I=1536, H=4096  |               1511.8               |  1705.3  |         1705.4
          f16    B=4728, I=1536, H=1024  |                463.3               |   493.9  |          463.0
          f16.ac B=4728, I=1536, H=1024  |                582.4               |   614.9  |          672.7
    
    Times are in microseconds (us).
    ```
    
    [ghstack-poisoned]
    danthe3rd committed Oct 31, 2022 (commit 4b600bf)

Commits on Nov 3, 2022

  1. Update on "SwiGLU optimized fw/bw"

    danthe3rd committed Nov 3, 2022 (commit dd6a285)

Commits on Nov 4, 2022

  1. Update on "SwiGLU optimized fw/bw"

    danthe3rd committed Nov 4, 2022 (commit d825314)
  2. Update on "SwiGLU optimized fw/bw"

    danthe3rd committed Nov 4, 2022 (commit e2bfbb2)
  3. Update on "SwiGLU optimized fw/bw"

    danthe3rd committed Nov 4, 2022 (commit 07135b8)

Commits on Nov 7, 2022

  1. Update on "SwiGLU optimized fw/bw"

    danthe3rd committed Nov 7, 2022 (commit 3490242)

Commits on Nov 10, 2022

  1. Update on "SwiGLU optimized fw/bw"

    danthe3rd committed Nov 10, 2022 (commit a90fe49)
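
Finally, since the PR covers both the forward and the backward pass, here is a short end-to-end sketch that exercises the backward with packed parameters. Shapes are illustrative and this is not code from the PR; it only combines the packed-buffer setup above with the final usage snippet, in which the op is no longer passed explicitly.

```python
import torch
import xformers.ops as xops

B, H, I = 64, 1024, 1536  # illustrative sizes
dev, dtype = "cuda", torch.float16

x = torch.randn(B, H, device=dev, dtype=dtype, requires_grad=True)
w1w2 = torch.randn(2 * I, H, device=dev, dtype=dtype, requires_grad=True)
b1b2 = torch.randn(2 * I, device=dev, dtype=dtype, requires_grad=True)
w3 = torch.randn(H, I, device=dev, dtype=dtype, requires_grad=True)
b3 = torch.randn(H, device=dev, dtype=dtype, requires_grad=True)

# Split the packed buffers into views with xformers' unbind, as the usage note requires.
w1, w2 = xops.unbind(w1w2.view([2, I, H]), dim=0)
b1, b2 = xops.unbind(b1b2.view([2, I]), dim=0)

y = xops.functional_swiglu(x, w1, b1, w2, b2, w3, b3)  # no explicit op, as in the final snippet
y.sum().backward()

# w1/w2 are non-leaf views, so gradients accumulate on the packed leaves instead.
print(w1w2.grad.shape, b1b2.grad.shape, w3.grad.shape, x.grad.shape)
```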