
ARM Backend using ruy for fp32 and int8 #79

Merged: 70 commits into browsermt:master on Jan 18, 2023

Conversation

@jerinphilip commented Mar 9, 2022

Provides an ARM backend for matrix multiplies using google/ruy and math
functions through SIMDe (https://simd-everywhere.github.io/blog/about/),
effectively getting marian-decoder to run on ARM.

The following cmake flags are added:

  • USE_INTGEMM (switches intgemm on/off)
  • USE_RUY (switches ruy on/off)
  • USE_ONNX_SGEMM (uses the ONNX SGEMM added for WASM to provide the attention
    matrix multiply, which currently relies on a BLAS library; this was
    previously WASM_COMPATIBLE_BLAS)
  • USE_SIMDE (swaps out the existing Intel-intrinsics-based functions for
    SIMDe equivalents)

The built marian-decoder is tested on an Oracle Cloud ARM Machine with
the following specs:

Architecture   : aarch64
CPU op-mode(s) : 32-bit, 64-bit
Byte Order     : Little Endian
Vendor ID      : ARM
Model name     : Neoverse-N1
Flags          : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
                 asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

A CI check on GitHub Actions is added that uses the android-ndk to cross-compile
targeting arm64-v8a. The built binary is tested to work on an Android phone
(Samsung M30s) using termux.

A successful Android build additionally requires a patch (sentencepiece ->
protobuf). See opencv/opencv#17282 and
opencv/opencv#19049.

-Werror and related flags cause issues with ruy (-Wmulti-line-comment) and are
disabled.

The following minor changes are also applied:

  • Remove M32_BINARIES; use COMPILE_WASM for -m32.
  • Hide -msse4.1 on unknown platforms.
  • faiss was previously hardcoded for platforms with SSE available. This is
    mitigated by adding a reference standard C++ implementation of the missing
    function.
  • Exclude packed_gemm_....cpp from sources if USE_FBGEMM=off.
  • MSVC workaround following "Import matrix-multiply from a separate wasm
    module" #56 (comment).

Status

jerinphilip and others added 2 commits March 9, 2022 20:02
@jerinphilip (Author) commented:

The CMake flags, BLAS ifdefs, etc. are hard territory to navigate. WASM has not done much to help the situation either, but its markings do help searching when Android raises similar platform complaints about the usual missing pieces. I am afraid the additional branches I create (in CMake and ifdefs) only make the situation worse.

marian-nmt#762 indicates that a different approach, reusing the units here, will be undertaken for marian-nmt/marian-dev.

Requesting @XapaJIaMnu, @kpu, @graemenail for feedback on how to simplify the situation. I'm hoping devs who have awareness of the bigger picture can point me towards the appropriate things to do. I have gotten this to work on Oracle Machines and Android - I expect to have access to an M1 Macbook Air in May.

@jerinphilip marked this pull request as ready for review March 14, 2022 14:05
@kpu (Member) commented Mar 14, 2022

Don't use the ONNX GEMM. It should be using ruy.

Since ruy has sgemm (or can quickly implement sgemm in src/tensors/cpu/prod.cpp), use that for float32 matrix multiplies.

We would then have a library that does sgemm but not the other BLAS routines. As you note, FAISS depends on more BLAS routines than sgemm: https://github.com/jerinphilip/marian/runs/5468340444?check_suite_focus=true#step:10:360. So my suggestion here is to split BLAS_FOUND into HAVE_SGEMM and USE_FAISS, and to do it at the CMake level, including turning USE_FAISS off for ARM compiles.
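As a rough illustration of that split (a sketch only; HAVE_SGEMM, USE_RUY_SGEMM, sgemmImpl and ruySgemm are hypothetical names, not flags or functions from this PR):

    // Sketch: guard the float32 multiply behind an sgemm-capability macro and the
    // FAISS/LSH code behind its own flag, instead of a single BLAS_FOUND.
    #if defined(HAVE_SGEMM)
    #if !defined(USE_RUY_SGEMM)
    #include <cblas.h>
    #endif

    static void sgemmImpl(bool transA, bool transB, int m, int n, int k,
                          float alpha, const float* A, int lda,
                          const float* B, int ldb,
                          float beta, float* C, int ldc) {
    #if defined(USE_RUY_SGEMM)
      // ruy-backed fallback for ARM builds without a BLAS library (hypothetical).
      ruySgemm(transA, transB, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
    #else
      cblas_sgemm(CblasRowMajor,
                  transA ? CblasTrans : CblasNoTrans,
                  transB ? CblasTrans : CblasNoTrans,
                  m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
    #endif
    }
    #endif

    #if defined(USE_FAISS)
    // FAISS/LSH code, which needs more BLAS routines than sgemm, stays behind its
    // own flag and can simply be switched off for ARM compiles.
    #endif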

@jerinphilip (Author) replied:

Don't use the ONNX GEMM. It should be using ruy.

ONNX SGEMM is used only by WebAssembly and Android (for now). On Mac M1, Apple Accelerate or a BLAS library is available. OpenBLAS is used for the time being on the Oracle Cloud ARM machine. I have isolated enough to get sgemm through ruy in https://github.com/jerinphilip/sgemm/blob/28dc786d821d3abb2acf086e1fb145e58cc55372/src/main.cpp#L54-L75 for now. This looks like it will take longer; the optimized primitive I find in ruy is A*B, not sgemm(...). The ONNX JS SGEMM appears to call Eigen internally; I'm not sure how fast this is, but Eigen reports SIMD capability. The other arguments are used (https://cs.github.com/browsermt/marian-dev/blob/53c4f7e4537dbf7782c583e98f50e513f0a27541/src/graph/node_operators_binary.h?q=ProdBatched#L419-L506) and will need optimized implementations (alpha, beta, transpose via layout adjustments).

If possible, I'd like to focus on integration here (without adding more source and complicating review) and bring in ruy-based sgemm for Android in a follow-up PR.

What I find is that BLAS_FOUND is a more suitable flag than USE_FAISS in the existing places.

@kpu (Member) commented Mar 16, 2022

Ruy provides a bias term here:
https://github.com/google/ruy/blob/2d950b3bfa7ebfbe7a97ecb44b1cc4da5ac1d6f0/ruy/mul_params.h#L225
Example with bias term:
https://github.com/google/ruy/blob/2d950b3bfa7ebfbe7a97ecb44b1cc4da5ac1d6f0/example/example.cc#L51-L62

Regarding layout (transpose and stride arguments), these are supported: https://github.com/google/ruy/blob/2d950b3bfa7ebfbe7a97ecb44b1cc4da5ac1d6f0/example/parametrized_example.cc#L84-L91

There does not appear to be support for alpha and beta scaling parameters. However, those use cases are very rare in Marian (I think one could get a fully functional implementation just by asserting they are 1) and could be done with element-wise postprocessing.

And with that you would have the features of sgemm.
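For reference, a minimal float32 multiply with a bias term through ruy looks roughly like the following (modelled on ruy's example.cc; the fixed 2x2 shapes, function name and buffer names are made up for illustration):

    #include "ruy/ruy.h"

    // C (2x2, column-major) = A (2x2, row-major) * B (2x2, column-major), plus a
    // per-row bias, following ruy's ExampleMulFloatWithBiasAddAndClamp.
    void MulFloatWithBias(const float* a, const float* b, const float* bias, float* c) {
      ruy::Context context;

      ruy::Matrix<float> lhs;
      ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
      lhs.set_data(a);  // zero copy: only the pointer is stored

      ruy::Matrix<float> rhs;
      ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
      rhs.set_data(b);

      ruy::Matrix<float> dst;
      ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
      dst.set_data(c);

      ruy::MulParams<float, float> mul_params;
      mul_params.set_bias(bias);  // bias vector with one entry per destination row

      ruy::Mul(lhs, rhs, mul_params, &context, &dst);
    }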

Eigen's performance on x86 has been bad, not sure about ARM.

@jerinphilip (Author) commented Mar 16, 2022

Maybe I'm missing something: what do we need the bias term for? We multiply int8*int8 -> int32, then convert to float to add a float bias, which causes us to take a different path in MozIntGemm (which is imported here). sgemm does not have a bias term, does it?

The layout-based transpose is available; it looks like https://github.com/google/ruy/blob/8c3fd3f266b4a22d542d4aa41329b5018d6b87e1/ruy/test.h#L1157-L1164. I am currently looking into using this.

ProdBatched is used with different combinations (which is what the code-search permalink points to). I have traced this to bdot, which uses non-identity scale/alpha values in places. If these are asserted to be 1, it will be fragmented and won't sit well with the rest of the implementations. I will at least have to introduce ABORTs on those calls, because scale != 1.0f is effectively broken. I will work on the element-wise pre-processing (alpha * A) in the isolated (sgemm) repository for inlining here when ready.
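A possible shape for that element-wise pre-processing, as a sketch under the assumption that beta is zero (the name scaleLhs is illustrative):

    #include <cstddef>
    #include <vector>

    // Sketch: ruy::Mul computes C = A * B (+ bias) with no alpha/beta scaling, so
    // sgemm's alpha can be emulated by scaling A element-wise before the multiply.
    // beta != 0 would still need either an abort or a separate pass over C.
    std::vector<float> scaleLhs(const float* a, std::size_t elements, float alpha) {
      std::vector<float> scaled(a, a + elements);
      if (alpha != 1.0f) {
        for (float& v : scaled) {
          v *= alpha;
        }
      }
      return scaled;  // pass scaled.data() to ruy as the lhs
    }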

Awaiting feedback on the remaining parts meanwhile.

@kpu (Member) left a review:

Generally, use ruy for fp32 and https://github.com/JishinMaster/simd_utils/ for the functions

Review comments on: .github/workflows/macos.yml, .gitmodules, CMakeLists.txt, src/tensors/cpu/integer_common.h, src/3rd_party/sse_mathfun.h, src/tensors/cpu/ruy_adapter.h
@jerinphilip (Author) left a review:

Notes to self.

Review comments on: patches/01-spm-protobuf-android.patch, src/tensors/cpu/ruy_adapter.h, .github/workflows/arm.yml, CMakeLists.txt, src/tensors/cpu/prod_blas.h
@@ -218,11 +218,13 @@ class ExpressionGraphPackable : public ExpressionGraph {
cols(val));
//Put the quantMult at the back of the tensor
*(reinterpret_cast<float *>(paramMat->data<int8_t>() + val->shape().elements())) = quantMult;
#else
ABORT("Int8::PrepareA not implemented yet for ruy");
Author:
Compile error moved to runtime, but haven't managed to trigger it in runs. I wonder why.

Collaborator:
In Ruy PrepareA is the same as PrepareB

Review comments on: CMakeLists.txt
@XapaJIaMnu (Collaborator) left a review:
Some partial comments, from my (im)partial review.

Review comments on: src/tensors/cpu/ruy_adapter.h
return result;
}

static void PrepareBias(const float *input, float *output, Index rows, Index cols) {
Collaborator:
We should be using marian tensors here, as they allow for zero copy.

Collaborator:
To be more precise, when being inside the marian tensor ecosystem, you can choose to return the same unmodified tensor from prepareBias (identity operation that just returns the input tensor) and avoid the copy.

@jerinphilip (Author), Apr 11, 2022:
ruy_adapter.h sits at the same level as intgemm, so Tensor arguments are unsuitable: Marian provides Tensor, and Marian uses ruy_adapter / intgemm. My preferred solution is to capture the variations as callbacks/first-class functions and call f(args) to avoid duplicating code. These primitives should be supplied by this file/layer, without pulling in marian::Tensor.

Edit: the callback could be unquantize when we don't add a bias and unquantizeAddBias when we do; the switch can happen at the call site.
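A rough sketch of that callback idea (every name here is illustrative; unquantize and unquantizeAddBias are the two call-site variants mentioned in the edit above):

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    using Index = std::size_t;

    // The adapter performs the int8 * int8 -> int32 multiply and hands the raw
    // accumulator to whatever write-back the caller supplies; it never needs to
    // see marian::Tensor.
    using WriteBack = std::function<void(const std::int32_t* accum,
                                         Index rows, Index cols, float* out)>;

    void multiplyThenWrite(const std::int8_t* A, const std::int8_t* B,
                           Index rowsA, Index width, Index colsB,
                           float* out, const WriteBack& writeBack) {
      std::vector<std::int32_t> accum(rowsA * colsB, 0);
      // ... int8 multiply via ruy into accum ...
      writeBack(accum.data(), rowsA, colsB, out);
    }

    // Call site picks the variant:
    //   multiplyThenWrite(A, B, rowsA, width, colsB, out, unquantize);        // no bias
    //   multiplyThenWrite(A, B, rowsA, width, colsB, out, unquantizeAddBias); // with bias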

Collaborator:
The RUY backend should NEVER use prepareBias. Calling this function from inside the ARM backend should result in an abort and is definitely a programmer mistake. ARM supports proper int8_t * int8_t and therefore doesn't need the prepare-bias mumbo jumbo which we do in intgemm.

Author:

The RUY backend should NEVER use prepareBias.

I agree. I understand the objective here is to avoid the std::memcpy which can be done by changing the calling code. I will try to find a solution that achieves the intent at integration.

Collaborator:

What I mean is that this function should never be called from ruy as it will be slower than the other code path. It should abort. The "shifted" code path shouldn't be taken by this backend.

Member:

As discussed on today's call, since ARM has 8-bit signed * signed, the relevant code path to follow is int8 (not int8shifted).

@jerinphilip (Author), Apr 18, 2022:

49beb50 (#79).

@kpu suggested lifting the ruy path from intgemm_interface.h and adding it at the level of a gemm-provider. Much of the existing state is due to coming over from https://github.com/jerinphilip/MozIntGemm. This will involve writing some equivalent of intgemm_interface to complete the calls (i.e. duplication). Please let me know if it's a go on this one.

I agree. I understand the objective here is to avoid the std::memcpy which can be done by changing the calling code. I will try to find a solution that achieves the intent at integration.

The std::memcpy was probably useless; I got rid of it, and of a bunch of nullptrs in the args, in the process.

@jerinphilip (Author), Apr 19, 2022:

@kpu suggested lifting the ruy path from intgemm_interface.h and adding it at the level of a gemm-provider. Much of the existing state is due to coming over from jerinphilip/MozIntGemm. This will involve writing some equivalent of intgemm_interface to complete the calls (i.e. duplication). Please let me know if it's a go on this one.

Without documentation (for intgemm_interface.h, the integration to marian's graph system) or active help from someone who understands the area, I'm afraid I will not be able to do this in the immediate future.

Member:

I have read https://github.com/browsermt/marian-dev/blob/master/src/tensors/cpu/intgemm_interface.h and determined that, while it could use some comments, it is in a state where any software engineer working for me would be expected to read through it and understand it, or come up with questions.

@kpu changed the title from "ARM Backend for marian" to "ARM Backend using ruy for fp32 and int8" on Apr 16, 2022
@XapaJIaMnu (Collaborator) left a review:

Small review


target_architecture(CMAKE_TARGET_ARCHITECTURES)
list(LENGTH CMAKE_TARGET_ARCHITECTURES cmake_target_arch_len)
if(NOT "${cmake_target_arch_len}" STREQUAL "1")
Collaborator:

What does this mean, sorry? Is this 32bit vs 64bit? A small clarifying comment?

Member:

Does this catch the unknown arch condition, and is that desirable?



ruy::Matrix<float> lhs;
ruy::MakeSimpleLayout(M, K, orderA, lhs.mutable_layout());
lhs.set_data(A);
Collaborator:

This is zero copy, right, just pointer set?

@jerinphilip (Author), Jun 23, 2022:

If you mean this statement, yes, it's a pointer set. However, I think ruy might be allocating/deallocating under the hood based on its need for a layout change? https://github.com/google/ruy/blob/38a9266b832767a3f535a74a9e0cf39f7892e594/ruy/prepare_packed_matrices.cc#L69-L92

}
}

struct UnquantizeAndAddBiasAndWrite {
Collaborator:

Obligatory comment that it is unnecessary to code it in this way, but we have already covered that.

Preprocess<kHighestPath>::quantize(input, output, quant_mult, rows, cols);
}

static void SelectColumnsB(const Type *input,
Collaborator:

@jerinphilip did you try vanilla index_select?

@graemenail (Member) left a review:

A few comments while I catch up on the full discussion

CMakeLists.txt:
@@ -266,7 +308,8 @@ else(MSVC)
set(DISABLE_GLOBALLY "-Wno-unused-result ${CLANG_IGNORE_UNKNOWN_CUDA} ${CLANG_IGNORE_UNUSED_VALUES}") # This needs to appear here as well to appease clang11+ on linux

# These are used in src/CMakeLists.txt on a per-target basis
list(APPEND ALL_WARNINGS -Wall; -Werror; -Wextra; -Wno-unused-result; -Wno-deprecated;
list(APPEND ALL_WARNINGS -Wall; # -Werror;
Member:

What warning was introduced that made removal of -Werror necessary?

Author:

  1. Ruy has a lot of things that fire under -Werror; this is what made me remove it. jerinphilip/marian@e1c3f7a (#4)
  2. simd_utils has strict-aliasing/type-punning warnings which become errors.

Do I use https://stackoverflow.com/a/3308675/4565794 to get around this? I hope I can isolate it with something around the headers. Upstream appears to want to keep -Werror (marian-nmt#598 (comment)).

Member:

We should keep -Werror and the solution you linked is acceptable as it documents exactly which warnings have been disabled and ties them to a particular header. To me, this satisfies the "OK to disable warnings in 3rd-party code after checking them once" from upstream.
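For the record, the approach from that link applied to these headers would look roughly like this (a sketch; the warning names are the ones mentioned in this thread, and as noted later in the conversation GCC did not always honour the pragmas here):

    // Suppress specific warnings only around the offending third-party headers,
    // documenting which warnings are disabled and for which header.
    #if defined(__GNUC__)
    #pragma GCC diagnostic push
    #pragma GCC diagnostic ignored "-Wcomment"          // ruy: multi-line comments
    #pragma GCC diagnostic ignored "-Wstrict-aliasing"  // simd_utils: type punning
    #endif

    #include "ruy/ruy.h"
    #include "simd_utils.h"

    #if defined(__GNUC__)
    #pragma GCC diagnostic pop
    #endif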


Review comments on: CMakeLists.txt, src/functional/operators.h
@jerinphilip (Author) commented Jun 21, 2022

While Android builds appear happy with -Werror gone, builds on the Oracle Cloud machine are failing. I'm temporarily reverting -Werror for resolution later, once I address other concerns.

What remains is -Wstrict-aliasing on simd_utils and -Wcomment on ruy, which I am currently unable to get rid of via diagnostic pragmas.

@graemenail (Member) commented:

There's a #if BLAS_FOUND block in Prod of src/cpu/prod.cpp that needs removing. By removing this block, and disabling -Werror (needs to be addressed later, see: #79 (comment)), I am able to compile and run on a Raspberry Pi 4.

The implications of the CMake changes that put RUY_SGEMM above openblas/cblas need to be understood. BLAS_FOUND is used in LSH, node inits, and unit tests; we need to determine what BLAS features are required there.

Review comment on: CMakeLists.txt
@@ -10,11 +10,9 @@
#if MKL_FOUND
#include <mkl.h>
#elif BLAS_FOUND
Author:

We only need SGEMM here?

Member:

Almost everything here is handled in prod_blas.h, then there's the MKL batched gemm.

Author:

I meant we don't need a full BLAS flag; we only need an SGEMM implementation. For faiss (LSH), BLAS_FOUND is okay, because it uses a lot of BLAS (in some sense). The switch here can be narrowed to something GEMM-specific.

Member:

We can get rid of the includes here, leaving only

#if MKL_FOUND
#include <mkl.h>
#endif

since we directly call MKL functions here.

Agreed?

@jerinphilip (Author) commented:

While Android builds appear happy with -Werror gone, builds on Oracle Cloud machine are failing. I'm temporarily reverting -Werror for resolution later once I address other concerns.

To update the others: this has been isolated to a GCC bug(?). Clang (which is cross-compiling for Android) does not complain. A minimal reproducible example is https://godbolt.org/z/6Mzhc1Tqq, already linked in the review comments. Clang works fine on the Oracle ARM machine.

@XapaJIaMnu mentioned this pull request Sep 27, 2022
@XapaJIaMnu merged commit 861e31d into browsermt:master Jan 18, 2023