Add flash attn for af2 #8

Xreki · 2023-04-24T05:17:58Z

PR types

Performance optimization

PR changes

OPs

Description

RT (as the title says)

* add OpTrait OpInterface ValueIterator TypeList * refine code * refine code * refine code * add opinfo * add typeid copy constructor * add trait interface construct method for opinfo_impl * add trait interface construct method for opinfo_impl * add trait interface construct method for opinfo_impl * add trait interface construct method for opinfo_impl * add trait interface construct method for opinfo_impl * add create * add member func for opinfo * fix compile bug * add op interface in ircontext * fix compile bug * fix compile bug * refine code * fix compile bug * add ut * refine ut * refine code of opinfo_impl * delete unused code * add dyncast for operation * refine comment * refine opinfo_impl * delete unused code * refine code by comment * refine code * refine code * refine code for registerOp * refine opfin create * refine code of search method of ircontext * refine op attribute * change opinfo_map key from type_id to string

* add mul double grad * add sub_double_grad * add add sub high test * add multiply test * modify other unsqueeze * delete api.yaml * only for make ci run * modify unsqueeze * modify unsqueeze * tmp * modify operants gen * review modify * modify review * debug * debug * modify ci cross boundary * delete log

…addle#53744) * optimize logsumexp in small data scale * fix * fix * add #pragma once * compile protobuf offline * add submodule gflags * check_submodules * check_submodules * add_submodule protobuf * add_submodule_protobuf * add_submodule * add .gitmodules * add_submodules * fix_compiler error * support offline compile * support offline compile * support offline_compile * remove cub * remove brpc * support offline compile * support offline compile * canning patching on cryptopp * modify .gitignore of cryptopp * test * offline compile * add_submodule zlib * modify .gitmodules * modify .gitmodules * fix setup.py bug * delete submodule cryptopp * fix windows compile bug * fix xxhash compile problem --------- Co-authored-by: Asthestarsfalll <1186454801@qq.com> Co-authored-by: Asthestarsfalll <72954905+Asthestarsfalll@users.noreply.github.com>

* add master gradients on static graph * add unit test for bf16 master grad static graph * use float16 as v100 test dtype * only skip GPU which do not support bf16 * use linear layer to test master grad * 1.push master grad creation before all optimizer ops; 2.remove useless unittest; 3.use a function to create master grad states

* simplify layer_norm_op.cc * support auto generate for op layer_norm * update unittest for composite_layer_norm * remove layer_norm_op.cc from scripts * replace layer_norm_op with generated_op * add get_expected_kernel for layer_norm * update cmake kernel register function for layer_norm_mkldnn_op

…ddle#52006) * [Dy2static-Fallback] add set_eval_frame function in pybind. 1. add set_eval_frame function in pybind. * add unittest for eval frame hooker. * [support py38] * fix-GeneratorExit error in eval frame hooker * support python == 3.9 * support 3.10 * fix some comments

* Fused elementwises kernels and ops * change fuse pass name * adjust .pbtxt files * adjust quantization attributes * add missing arguments and fix others, review fixed * simplify fused kernel registration * fix elementwise unit tests * reuse one fused elementwise op * adjust proto * Add supported datatypes * Change 'Scale' to 'scale' in tests, change some tests to onednn * Revert breaking changes * Fix unit tests * Delete obsolete test cases * Delete commented out code * Fix codestyle * delete temporary condition * fix conflicts and delete duplicate fusing * Fix code after merge * Move tests to new directory * fix tests volatility * Rename test_elementwise_add_onednn_op.py to test_elementwise_add_mkldnn_op.py * Update CMakeLists.txt add mkldnn op test --------- Co-authored-by: Silv3S <slawomir.siwek@intel.com>

Xreki force-pushed the add_flash_attn_for_af2 branch 2 times, most recently from de37e2f to cf4a1c8 on April 24, 2023 15:16

tianshuo78520a and others added 28 commits May 5, 2023 14:17

Mv cpp_extension test dir (PaddlePaddle#53330)

e85fbac

[Hackathon No.61] Improve FP16/BF16 unit tests for the uniform_random operator (PaddlePaddle#52949)

b02de1b

[XPU] Fix the out_max of the branch in xpu_conv2d op(PaddlePaddle#53343)

d27f15e

[XPU] Fusion of gather and assign operators to fused_mt op for reducing memory usage (PaddlePaddle#53262)

2039115

remove some [-Wunused-parameter]warning (PaddlePaddle#53397)

58435ae

* test,test=develop * test,test=develop * test,test=develop * test,test=develop * test,test=develop * test,test=develop * test,test=develop * test,test=develop

[Dygraph] Fix bugs in dp_pp_comm_overlap for HybridParallel (PaddlePaddle#53384)

0d9a23b

Revert "[Hackathon No.52] Add float16 data type support for the Paddle dist operator (PaddlePaddle#50915)" (PaddlePaddle#53527)

d463f8e

This reverts commit 9c40653.

move UniformRawKernel to legacy (PaddlePaddle#53158)

13e2e10

* move UniformRawKernel to legacy * Update uniform_kernel.cc * Update uniform_kernel.cu * Update uniform_kernel.cc * Update uniform_kernel.cu * Update uniform_kernel.h * Update uniform_kernel.cc * Empty Commit to setup deployments

rem npu in test (PaddlePaddle#53469)

a499731

* rem npu in test * restore some code

Add trt pow converter. (PaddlePaddle#53462)

5a44bf7

* Add trt pow converter. * update to use AddConstantLayer * add dims=0 ut

Rename randint_raw and move it to legacy (PaddlePaddle#53157)

3e7be9c

* Rename randint_raw and move it to legacy * Update fetch_v2_op.cc * Update randint_kernel.cc * Update randint_kernel.cu * Empty Commit to setup deployments

[inference][trt] add reduce_all and reduce_any (PaddlePaddle#53088)

12406ca

fix brpc double link (PaddlePaddle#53512)

03fe3ce

* polish * polish * polish * polish * polish * polish * polish * polish * polish * polish * polish

use int64 to calc dim for c softmax (PaddlePaddle#53541)

da963ea

* use int64 to calc dim for c softmax * fix compile bug

Rewrite the reshape of temp_mask and temp_bias.

08a8b75

Merge branch 'develop' into add_flash_attn_for_af2

6a65ee0

[XPU] substitute new api kernel for combinatorial adaptive avg_pool2d_grad kernel (PaddlePaddle#53528)

eda8df7

Update commit and fix reduce_dim.

4682c0d

XPU Support external stream (PaddlePaddle#53334)

99399f3

Add fused_gate_attention API. (PaddlePaddle#53432)

b729512

* Add fused_gate_attention API. * Implement FusedDropout API. * Fix doc and add unittest. * Skip for non-gpu device. * Add unittest.

Merge branch 'develop' into add_flash_attn_for_af2

165afab

API support use_flash_attn.

dd2860e

Use copy_if_different to avoid recompilation of generated cutlass (PaddlePaddle#53531)

f5476da

kernels.

fix strided_slice ut (PaddlePaddle#53553)

1d8c82b

* fix strided_slice ut * remove check_dygraph

fix conv1d_transpose insert quant node bug (PaddlePaddle#53320)

ca174ea

[inference][trt] add lookup_table op trt converter, use trt gather layer (PaddlePaddle#53554)

08b44e6

* add lookup_table op trt converter * update

Xreki and others added 30 commits May 17, 2023 18:06

Polish codes.

f6be954

[CustomDevice] support device_guard for custom device (PaddlePaddle#53808)

9e045ee

* support device_guard for npu * fix comment * fix typo

[CINN] extend cinn single test timeout from 150 to 200, test=document_fix (PaddlePaddle#53908)

4f1bf19

Fix typos in send_v2_op.cu.cc (PaddlePaddle#53904)

65ce688

Fix typos, test=document_fix (PaddlePaddle#53916)

92121d1

fix -Werror=format-security (PaddlePaddle#53886)

6d7076c

rm cmake npu (PaddlePaddle#53869)

79ce3fa

* rm cmake npu * Update generic.cmake * Update generic.cmake

rm tools npu (PaddlePaddle#53870)

d294eef

* rm tools npu * Update get_pr_ut.py * Update get_pr_ut.py

[XPU] do not call check_nccl_version_for_p2p under xpu (PaddlePaddle#53862)

5d638fe

* [XPU] do not call check_nccl_version_for_p2p under xpu * refine code.

[Fix Typo] Fix gpu_info.h, Wheter->Whether (PaddlePaddle#53564)

236e742

Fix typos in executor_statistics.cc (PaddlePaddle#53917)

1ac28b6

[CustomOp Unittest] Fix XPU unittest, discard static backward (PaddlePaddle#53899)

2d0c694

Del test_async_read_write in CPU (PaddlePaddle#53882)

acb5039

* fix * fix

Fix typos in elementwise dir (PaddlePaddle#53907)

2782b29

move sequence_mask op InferShape func (PaddlePaddle#53782)

a862deb

* move sequence_mask op InferShape func * add dtype infer

add fp16 and bf16 for trunc (PaddlePaddle#53876)

d8407c5

Fix typos, test=document_fix (PaddlePaddle#53927)

e916e80

Fix typos (PaddlePaddle#53912)

117e951

Add segment_pool tests (PaddlePaddle#53785)

0bed220

move fusion_group kernel to phi (PaddlePaddle#53781)

26da689

remove CopyWithContext limitation (PaddlePaddle#53771)

d53d8fd

Fix qkv_transpose_out's shape and scaling of Q * K.

ba84941

Add einsum tests (PaddlePaddle#53722)

c3c8579

Merge branch 'develop' into add_flash_attn_for_af2

bee8537

Update commit of flash-attention.

3747978
