
ARM Backend using ruy for fp32 and int8 #79

Merged: 70 commits into browsermt:master on Jan 18, 2023

Conversation

@jerinphilip commented Mar 9, 2022

Provides an ARM backend for matrix multiplies using google/ruy and math
functions through SIMDe (https://simd-everywhere.github.io/blog/about/),
effectively getting marian-decoder to run on ARM.

The following cmake flags are added:

  • USE_INTGEMM (switches intgemm on/off)
  • USE_RUY (switches ruy on/off)
  • USE_ONNX_SGEMM (uses the ONNX SGEMM added for WASM to provide the attention
    matrix multiply, which currently relies on a BLAS library; this was
    previously WASM_COMPATIBLE_BLAS)
  • USE_SIMDE (swaps out the existing Intel-intrinsics-based functions for
    SIMDe equivalents)

The built marian-decoder is tested on an Oracle Cloud ARM Machine with
the following specs:

Architecture   : aarch64
CPU op-mode(s) : 32-bit, 64-bit
Byte Order     : Little Endian
Vendor ID      : ARM
Model name     : Neoverse-N1
Flags          : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
                 asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

A CI check on GitHub Actions is added that uses the android-ndk to cross-compile
targeting arm64-v8a. The built binary is tested to work on an Android phone
(Samsung M30s) using termux.

A successful Android build additionally requires a patch (sentencepiece ->
protobuf). See opencv/opencv#17282 and
opencv/opencv#19049.

-Werror and related flags cause issues with ruy (-Wmulti-line-comment) and are
disabled.

The following minor changes are also applied:

  • Remove M32_BINARIES; use COMPILE_WASM for -m32.
  • Hide -msse4.1 on unknown platforms.
  • faiss was previously hardcoded for platforms with SSE available. This is
    mitigated by adding a reference standard C++ implementation of the missing
    function.
  • Exclude packed_gemm_....cpp from sources if USE_FBGEMM=off.
  • MSVC workaround following "Import matrix-multiply from a separate wasm
    module" #56 (comment).

Status

jerinphilip and others added 2 commits March 9, 2022 20:02
@jerinphilip (Author) commented:

The CMake flags, BLAS ifdefs, etc. are hard territory to navigate. WASM has not done much to help the situation either, but its markings do help searching when Android raises similar platform complaints about the usual missing pieces. I am afraid the additional branches I create (in CMake and ifdefs) only make the situation worse.

marian-nmt#762 indicates that a different approach, reusing the units here, will be undertaken for marian-nmt/marian-dev.

Requesting @XapaJIaMnu, @kpu, @graemenail for feedback on how to simplify the situation. I'm hoping devs who have awareness of the bigger picture can point me towards the appropriate things to do. I have gotten this to work on Oracle Machines and Android - I expect to have access to an M1 Macbook Air in May.

@jerinphilip marked this pull request as ready for review March 14, 2022 14:05
@kpu (Member) commented Mar 14, 2022

Don't use the ONNX GEMM. It should be using ruy.

Since ruy has sgemm (or can quickly implement sgemm in src/tensors/cpu/prod.cpp), use that for float32 matrix multiplies.

We would then have a library that does sgemm but not the other BLAS routines. As you note, FAISS depends on more BLAS routines than sgemm: https://github.com/jerinphilip/marian/runs/5468340444?check_suite_focus=true#step:10:360. So my suggestion here is to split BLAS_FOUND into HAVE_SGEMM and USE_FAISS, and to do it at the CMake level, including turning USE_FAISS off for ARM compiles.
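As a rough illustration of that split (a sketch only; HAVE_SGEMM, USE_RUY_SGEMM, sgemmImpl and ruySgemm are hypothetical names, not flags or functions from this PR):

    // Sketch: guard the float32 multiply behind an sgemm-capability macro and the
    // FAISS/LSH code behind its own flag, instead of a single BLAS_FOUND.
    #if defined(HAVE_SGEMM)
    #if !defined(USE_RUY_SGEMM)
    #include <cblas.h>
    #endif

    static void sgemmImpl(bool transA, bool transB, int m, int n, int k,
                          float alpha, const float* A, int lda,
                          const float* B, int ldb,
                          float beta, float* C, int ldc) {
    #if defined(USE_RUY_SGEMM)
      // ruy-backed fallback for ARM builds without a BLAS library (hypothetical).
      ruySgemm(transA, transB, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
    #else
      cblas_sgemm(CblasRowMajor,
                  transA ? CblasTrans : CblasNoTrans,
                  transB ? CblasTrans : CblasNoTrans,
                  m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
    #endif
    }
    #endif

    #if defined(USE_FAISS)
    // FAISS/LSH code, which needs more BLAS routines than sgemm, stays behind its
    // own flag and can simply be switched off for ARM compiles.
    #endif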

@jerinphilip (Author) replied:

Don't use the ONNX GEMM. It should be using ruy.

ONNX SGEMM is used only by WebAssembly and Android (for now). On Mac M1, Apple Accelerate or a BLAS library is available. OpenBLAS is used for the time being on the Oracle Cloud ARM machine. I have isolated enough to get sgemm through ruy in https://github.com/jerinphilip/sgemm/blob/28dc786d821d3abb2acf086e1fb145e58cc55372/src/main.cpp#L54-L75 for now. This looks like it will take longer; the optimized primitive I find in ruy is A*B, not sgemm(...). The ONNX JS SGEMM appears to call Eigen internally; I'm not sure how fast this is, but Eigen reports SIMD capability. The other arguments are used (https://cs.github.com/browsermt/marian-dev/blob/53c4f7e4537dbf7782c583e98f50e513f0a27541/src/graph/node_operators_binary.h?q=ProdBatched#L419-L506) and will need optimized implementations (alpha, beta, transpose via layout adjustments).

If possible, I'd like to focus on integration here (without adding more source and complicating review) and bring in ruy-based sgemm for Android in a follow-up PR.

What I find is that BLAS_FOUND is a more suitable flag than USE_FAISS in the existing places.

@kpu (Member) commented Mar 16, 2022

Ruy provides a bias term here:
https://github.com/google/ruy/blob/2d950b3bfa7ebfbe7a97ecb44b1cc4da5ac1d6f0/ruy/mul_params.h#L225
Example with bias term:
https://github.com/google/ruy/blob/2d950b3bfa7ebfbe7a97ecb44b1cc4da5ac1d6f0/example/example.cc#L51-L62

Regarding layout (transpose and stride arguments), these are supported: https://github.com/google/ruy/blob/2d950b3bfa7ebfbe7a97ecb44b1cc4da5ac1d6f0/example/parametrized_example.cc#L84-L91

There does not appear to be support for alpha and beta scaling parameters. However, those use cases are very rare in Marian (I think one could get a fully functional implementation just by asserting they are 1) and could be done with element-wise postprocessing.

And with that you would have the features of sgemm.
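For reference, a minimal float32 multiply with a bias term through ruy looks roughly like the following (modelled on ruy's example.cc; the fixed 2x2 shapes, function name and buffer names are made up for illustration):

    #include "ruy/ruy.h"

    // C (2x2, column-major) = A (2x2, row-major) * B (2x2, column-major), plus a
    // per-row bias, following ruy's ExampleMulFloatWithBiasAddAndClamp.
    void MulFloatWithBias(const float* a, const float* b, const float* bias, float* c) {
      ruy::Context context;

      ruy::Matrix<float> lhs;
      ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
      lhs.set_data(a);  // zero copy: only the pointer is stored

      ruy::Matrix<float> rhs;
      ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
      rhs.set_data(b);

      ruy::Matrix<float> dst;
      ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
      dst.set_data(c);

      ruy::MulParams<float, float> mul_params;
      mul_params.set_bias(bias);  // bias vector with one entry per destination row

      ruy::Mul(lhs, rhs, mul_params, &context, &dst);
    }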

Eigen's performance on x86 has been bad, not sure about ARM.

@jerinphilip (Author) commented Mar 16, 2022

Maybe I'm missing something: what do we need the bias term for? We multiply int8*int8 -> int32, then convert to float to add a float bias, which causes us to take a different path in MozIntGemm (which is imported here). sgemm does not have a bias term, does it?

The layout-based transpose is available; it looks like https://github.com/google/ruy/blob/8c3fd3f266b4a22d542d4aa41329b5018d6b87e1/ruy/test.h#L1157-L1164. I am currently looking into using this.

ProdBatched is used with different combinations (which is what the code-search permalink points to). I have traced this to bdot, which uses non-identity scale/alpha values in places. If these are asserted to be 1, it will be fragmented and won't sit well with the rest of the implementations. I will at least have to introduce ABORTs on those calls, because scale != 1.0f is effectively broken. I will work on the element-wise pre-processing (alpha * A) in the isolated (sgemm) repository for inlining here when ready.
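A possible shape for that element-wise pre-processing, as a sketch under the assumption that beta is zero (the name scaleLhs is illustrative):

    #include <cstddef>
    #include <vector>

    // Sketch: ruy::Mul computes C = A * B (+ bias) with no alpha/beta scaling, so
    // sgemm's alpha can be emulated by scaling A element-wise before the multiply.
    // beta != 0 would still need either an abort or a separate pass over C.
    std::vector<float> scaleLhs(const float* a, std::size_t elements, float alpha) {
      std::vector<float> scaled(a, a + elements);
      if (alpha != 1.0f) {
        for (float& v : scaled) {
          v *= alpha;
        }
      }
      return scaled;  // pass scaled.data() to ruy as the lhs
    }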

Awaiting feedback on the remaining parts meanwhile.

@kpu (Member) left a review:

Generally, use ruy for fp32 and https://github.com/JishinMaster/simd_utils/ for the functions

Review comments on: .github/workflows/macos.yml, .gitmodules, CMakeLists.txt, src/tensors/cpu/integer_common.h, src/3rd_party/sse_mathfun.h, src/tensors/cpu/ruy_adapter.h
@jerinphilip (Author) left a review:

Notes to self.

Review comments on: patches/01-spm-protobuf-android.patch, src/tensors/cpu/ruy_adapter.h, .github/workflows/arm.yml, CMakeLists.txt, src/tensors/cpu/prod_blas.h
@@ -218,11 +218,13 @@ class ExpressionGraphPackable : public ExpressionGraph {
cols(val));
//Put the quantMult at the back of the tensor
*(reinterpret_cast<float *>(paramMat->data<int8_t>() + val->shape().elements())) = quantMult;
#else
ABORT("Int8::PrepareA not implemented yet for ruy");
Author:
Compile error moved to runtime, but haven't managed to trigger it in runs. I wonder why.

Collaborator:
In Ruy PrepareA is the same as PrepareB

Review comments on: CMakeLists.txt
@XapaJIaMnu (Collaborator) left a review:
Some partial comments, from my (im)partial review.

Review comments on: src/tensors/cpu/ruy_adapter.h
return result;
}

static void PrepareBias(const float *input, float *output, Index rows, Index cols) {
Collaborator:
We should be using marian tensors here, as they allow for zero copy.

Collaborator:
To be more precise, when being inside the marian tensor ecosystem, you can choose to return the same unmodified tensor from prepareBias (identity operation that just returns the input tensor) and avoid the copy.

@jerinphilip (Author), Apr 11, 2022:
ruy_adapter.h sits at the same level as intgemm, so Tensor arguments are unsuitable: Marian provides Tensor, and Marian uses ruy_adapter / intgemm. My preferred solution is to capture the variations as callbacks/first-class functions and call f(args) to avoid duplicating code. These primitives should be supplied by this file/layer, without pulling in marian::Tensor.

Edit: the callback could be unquantize when we don't add a bias and unquantizeAddBias when we do; the switch can happen at the call site.
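A rough sketch of that callback idea (every name here is illustrative; unquantize and unquantizeAddBias are the two call-site variants mentioned in the edit above):

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    using Index = std::size_t;

    // The adapter performs the int8 * int8 -> int32 multiply and hands the raw
    // accumulator to whatever write-back the caller supplies; it never needs to
    // see marian::Tensor.
    using WriteBack = std::function<void(const std::int32_t* accum,
                                         Index rows, Index cols, float* out)>;

    void multiplyThenWrite(const std::int8_t* A, const std::int8_t* B,
                           Index rowsA, Index width, Index colsB,
                           float* out, const WriteBack& writeBack) {
      std::vector<std::int32_t> accum(rowsA * colsB, 0);
      // ... int8 multiply via ruy into accum ...
      writeBack(accum.data(), rowsA, colsB, out);
    }

    // Call site picks the variant:
    //   multiplyThenWrite(A, B, rowsA, width, colsB, out, unquantize);        // no bias
    //   multiplyThenWrite(A, B, rowsA, width, colsB, out, unquantizeAddBias); // with bias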

Collaborator:
The RUY backend should NEVER use prepareBias. Calling this function from inside the ARM backend should result in an abort and is definitely a programmer mistake. ARM supports proper int8_t * int8_t and therefore doesn't need the prepare-bias mumbo jumbo which we do in intgemm.

Author:

The RUY backend should NEVER use prepareBias.

I agree. I understand the objective here is to avoid the std::memcpy which can be done by changing the calling code. I will try to find a solution that achieves the intent at integration.

Collaborator:

What I mean is that this function should never be called from ruy as it will be slower than the other code path. It should abort. The "shifted" code path shouldn't be taken by this backend.

Member:

As discussed on today's call, since ARM has 8-bit signed * signed, the relevant code path to follow is int8 (not int8shifted).

@jerinphilip (Author), Apr 18, 2022:

49beb50 (#79).

@kpu suggested lifting the ruy path from intgemm_interface.h and adding it at the level of a gemm-provider. Much of the existing state is due to coming over from https://github.com/jerinphilip/MozIntGemm. This will involve writing some equivalent of intgemm_interface to complete the calls (i.e. duplication). Please let me know if it's a go on this one.

I agree. I understand the objective here is to avoid the std::memcpy which can be done by changing the calling code. I will try to find a solution that achieves the intent at integration.

The std::memcpy was probably useless; I got rid of it, and of a bunch of nullptrs in the args, in the process.

@jerinphilip (Author), Apr 19, 2022:

@kpu suggested lifting the ruy path from intgemm_interface.h and adding it at the level of a gemm-provider. Much of the existing state is due to coming over from jerinphilip/MozIntGemm. This will involve writing some equivalent of intgemm_interface to complete the calls (i.e. duplication). Please let me know if it's a go on this one.

Without documentation (for intgemm_interface.h, the integration to marian's graph system) or active help from someone who understands the area, I'm afraid I will not be able to do this in the immediate future.

Member:

I have read https://github.com/browsermt/marian-dev/blob/master/src/tensors/cpu/intgemm_interface.h and determined that, while it could use some comments, it is in a state where any software engineer working for me would be expected to read through it and understand it, or come up with questions.

@kpu changed the title from "ARM Backend for marian" to "ARM Backend using ruy for fp32 and int8" on Apr 16, 2022
@XapaJIaMnu (Collaborator) left a review:

Small review


target_architecture(CMAKE_TARGET_ARCHITECTURES)
list(LENGTH CMAKE_TARGET_ARCHITECTURES cmake_target_arch_len)
if(NOT "${cmake_target_arch_len}" STREQUAL "1")
Collaborator:

What does this mean, sorry? Is this 32bit vs 64bit? A small clarifying comment?

Member:

Does this catch the unknown arch condition, and is that desirable?



ruy::Matrix<float> lhs;
ruy::MakeSimpleLayout(M, K, orderA, lhs.mutable_layout());
lhs.set_data(A);
Collaborator:

This is zero copy, right, just pointer set?

@jerinphilip (Author), Jun 23, 2022:

If you mean this statement, yes, it's a pointer set. However, I think ruy might be allocating/deallocating under the hood based on its need for a layout change? https://github.com/google/ruy/blob/38a9266b832767a3f535a74a9e0cf39f7892e594/ruy/prepare_packed_matrices.cc#L69-L92

}
}

struct UnquantizeAndAddBiasAndWrite {
Collaborator:

Obligatory comment that it is unnecessary to code it in this way, but we have already covered that.

Preprocess<kHighestPath>::quantize(input, output, quant_mult, rows, cols);
}

static void SelectColumnsB(const Type *input,
Collaborator:

@jerinphilip did you try vanilla index_select?

@graemenail (Member) left a review:

A few comments while I catch up on the full discussion

CMakeLists.txt:
@@ -266,7 +308,8 @@ else(MSVC)
set(DISABLE_GLOBALLY "-Wno-unused-result ${CLANG_IGNORE_UNKNOWN_CUDA} ${CLANG_IGNORE_UNUSED_VALUES}") # This needs to appear here as well to appease clang11+ on linux

# These are used in src/CMakeLists.txt on a per-target basis
list(APPEND ALL_WARNINGS -Wall; -Werror; -Wextra; -Wno-unused-result; -Wno-deprecated;
list(APPEND ALL_WARNINGS -Wall; # -Werror;
Member:

What warning was introduced that made removal of -Werror necessary?

Author:

  1. Ruy has a lot of things that fire under -Werror; this is what made me remove it. jerinphilip/marian@e1c3f7a (#4)
  2. simd_utils has strict-aliasing/type-punning warnings which become errors.

Do I use https://stackoverflow.com/a/3308675/4565794 to get around this? I hope I can isolate it with something around the headers. Upstream appears to want to keep -Werror (marian-nmt#598 (comment)).

Member:

We should keep -Werror and the solution you linked is acceptable as it documents exactly which warnings have been disabled and ties them to a particular header. To me, this satisfies the "OK to disable warnings in 3rd-party code after checking them once" from upstream.
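For the record, the approach from that link applied to these headers would look roughly like this (a sketch; the warning names are the ones mentioned in this thread, and as noted later in the conversation GCC did not always honour the pragmas here):

    // Suppress specific warnings only around the offending third-party headers,
    // documenting which warnings are disabled and for which header.
    #if defined(__GNUC__)
    #pragma GCC diagnostic push
    #pragma GCC diagnostic ignored "-Wcomment"          // ruy: multi-line comments
    #pragma GCC diagnostic ignored "-Wstrict-aliasing"  // simd_utils: type punning
    #endif

    #include "ruy/ruy.h"
    #include "simd_utils.h"

    #if defined(__GNUC__)
    #pragma GCC diagnostic pop
    #endif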


Review comments on: CMakeLists.txt, src/functional/operators.h
@jerinphilip (Author) commented Jun 21, 2022

While Android builds appear happy with -Werror gone, builds on the Oracle Cloud machine are failing. I'm temporarily reverting -Werror for resolution later, once I address other concerns.

What remains is -Wstrict-aliasing on simd_utils and -Wcomment on ruy, which I am currently unable to get rid of via diagnostic pragmas.

@graemenail (Member) commented:

There's a #if BLAS_FOUND block in Prod of src/cpu/prod.cpp that needs removing. By removing this block, and disabling -Werror (needs to be addressed later, see: #79 (comment)), I am able to compile and run on a Raspberry Pi 4.

The implications of the CMake changes that put RUY_SGEMM above openblas/cblas need to be understood. BLAS_FOUND is used in LSH, node inits, and unit tests; we need to determine what BLAS features are required there.

Review comment on: CMakeLists.txt
@@ -10,11 +10,9 @@
#if MKL_FOUND
#include <mkl.h>
#elif BLAS_FOUND
Author:

We only need SGEMM here?

Member:

Almost everything here is handled in prod_blas.h, then there's the MKL batched gemm.

Author:

I meant we don't need a full BLAS flag; we only need an SGEMM implementation. For faiss (LSH), BLAS_FOUND is okay, because it uses a lot of BLAS (in some sense). The switch here can be narrowed to something GEMM-specific.

Member:

We can get rid of the includes here, leaving only

#if MKL_FOUND
#include <mkl.h>
#endif

since we directly call MKL functions here.

Agreed?

@jerinphilip (Author) commented:

While Android builds appear happy with -Werror gone, builds on Oracle Cloud machine are failing. I'm temporarily reverting -Werror for resolution later once I address other concerns.

To update the others: this has been isolated to a GCC bug(?). Clang (which is cross-compiling for Android) does not complain. A minimal reproducible example is https://godbolt.org/z/6Mzhc1Tqq, already linked in the review comments. Clang works fine on the Oracle ARM machine.

@XapaJIaMnu mentioned this pull request Sep 27, 2022
@XapaJIaMnu merged commit 861e31d into browsermt:master Jan 18, 2023