# Metal Backend for Hardware Acceleration in Needle Framework



## Introduction
We have added a Metal backend to the Needle framework to enable GPU acceleration for deep learning computations on M1 Macs. Metal is Apple's low-level graphics and compute framework; its shaders and compute kernels are written in the Metal Shading Language (MSL). With the Metal backend, Needle can run its array operations on the GPU of Apple devices, which gives faster training and better overall performance for deep learning tasks.

The Needle framework also includes a high-level NN module library, which provides a set of neural network layers and components that can be combined to create complex model architectures. The library is designed to be flexible and modular, so users can experiment with different model configurations and architectures.

Overall, Needle aims to balance low-level acceleration with high-level convenience, making it easy to build and train deep neural networks for a variety of tasks.

## Prepare the codebase

- Install dependencies:

In [3]:
!python3 -m pip install pybind11
!python3 -m pip install numpy



- Append Needle library:

In [2]:
import sys
sys.path.append('./python')

- Build array backend:

In [4]:
!cmake . && make

-- The C compiler identification is AppleClang 14.0.0.14000029
-- The CXX compiler identification is AppleClang 14.0.0.14000029
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Python: /Users/amo/opt/anaconda3/bin/python3.9 (found version "3.9.12") found components: Development Interpreter Development.Module Development.Embed 
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Performing Test HAS_FLTO_THIN
-- Performing Test HAS_FLTO_THIN - Success
-- F

## Metal Shading Language
Metal Shading Language (MSL) is a low-level programming language designed for use with Apple's Metal graphics framework. It is intended to provide developers with fine-grained control over the rendering pipeline, allowing for the creation of highly optimized and performant shaders on M1 chips and other Apple devices.

Based on the C++ programming language, MSL offers a set of built-in functions and data types for graphics and compute tasks, as well as language extensions specific to Metal, such as support for vertex, fragment, and compute shaders, and resource and memory management.

### Metal command queue
In the Metal architecture, the `MTLDevice` protocol supports methods for encoding and queueing render and compute commands to be submitted to the GPU for execution.

A command queue consists of a queue of command buffers, and a command queue organizes the order of execution of those command buffers. A command buffer contains encoded commands that are intended for execution on a particular device. A command encoder appends rendering, computing, and blitting commands onto a command buffer, and those command buffers are eventually committed for execution on the device.

The `MTLCommandQueue` protocol defines an interface for command queues, primarily supporting methods for creating command buffer objects. The `MTLCommandBuffer` protocol defines an interface for command buffers and provides methods for creating command encoders, enqueueing command buffers for execution, checking status, and other operations.

![](https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Art/Cmd-Model-1_2x.png)
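The relationship between queue, buffer, and encoded commands can be illustrated with a toy model in plain Python. This is only a sketch of the ordering idea, not the Metal API: the class names `ToyCommandQueue` and `ToyCommandBuffer` and the `drain` method are hypothetical, and Python callables stand in for GPU commands. It assumes buffers run in the order they were committed, which matches Metal when buffers are not explicitly enqueued beforehand.

```python
from collections import deque


class ToyCommandBuffer:
    """Holds encoded commands until the buffer is committed (hypothetical model)."""

    def __init__(self, queue):
        self._queue = queue
        self.commands = []
        self.committed = False

    def encode(self, fn, *args):
        # A real encoder records GPU commands; here we record a Python callable.
        if self.committed:
            raise RuntimeError("cannot encode into a committed buffer")
        self.commands.append((fn, args))

    def commit(self):
        # Committing hands the buffer to the queue; it can no longer be modified.
        if self.committed:
            raise RuntimeError("buffer already committed")
        self.committed = True
        self._queue._pending.append(self)


class ToyCommandQueue:
    """Runs committed buffers in the order they were committed."""

    def __init__(self):
        self._pending = deque()

    def command_buffer(self):
        return ToyCommandBuffer(self)

    def drain(self):
        # Stand-in for the GPU executing the queued work; returns results in order.
        results = []
        while self._pending:
            buf = self._pending.popleft()
            for fn, args in buf.commands:
                results.append(fn(*args))
        return results


queue = ToyCommandQueue()
first = queue.command_buffer()
second = queue.command_buffer()
second.encode(pow, 2, 3)   # encoded earlier, but its buffer is committed later
first.encode(abs, -1)
first.commit()
second.commit()
print(queue.drain())  # [1, 8]
```

Execution order follows the order of commits, not the order in which individual commands were encoded across buffers.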

We wrap this machinery in a class called `MyMetal`, which owns the command queue and submits command buffers to the GPU. It provides helper methods such as `LoadKernelsFromFile`, which loads kernels from `.metal` files; `RegisterKernel`, which registers a kernel on the `MTLDevice`; and `GetComputePipelineState`, which returns the compute pipeline state for a named kernel.
We then define macros around the command encoder to cut boilerplate, which makes the implementation of the array operators more readable.

``` cpp
#define BEGIN_COMPUTE_COMMAND(command_kernel_name)                             \
  MyMetal* metal = MyMetal::GetInstance();                                     \
  MTL::CommandBuffer* command_buffer =                                         \
      metal->command_queue()->commandBuffer();                                 \
  MTL::ComputeCommandEncoder* command_encoder =                                \
      command_buffer->computeCommandEncoder();                                 \
  command_encoder->setComputePipelineState(                                    \
      metal->GetComputePipelineState(command_kernel_name));

#define END_COMPUTE_COMMAND                                                    \
  command_encoder->endEncoding();                                              \
  command_buffer->commit();                                                    \
  command_buffer->waitUntilCompleted();                                        \
  command_encoder->release();                                                  \
  command_buffer->release();

```

### Build Metal library 
One major difference between the build processes for CUDA and Metal is how they relate to the g++ compiler. CUDA's `nvcc` compiler driver hands host code to a host compiler such as g++, so CUDA sources integrate into a wide range of g++-based build systems and development environments.

Metal code, however, is not supported by g++ and must be compiled with the Metal toolchain that ships with Xcode. This does not require the Xcode IDE: the command-line utilities can be invoked from `CMakeLists.txt`.

``` bash
xcrun -sdk macosx metal src/metal/kernels.metal -c -o kernels.air
xcrun -sdk macosx metallib kernels.air -o kernels.metallib
```

As the image below shows, the `.metal` files are first compiled into a `.air` file, which stores an intermediate representation (IR) of the shader code. The `metallib` tool then builds a Metal `.metallib` library file from the `.air` files.


![](https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Art/library_2x.png)
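For scripting the same two steps outside CMake, a small Python wrapper might look like the sketch below. The function names `metallib_commands` and `build_metallib` are hypothetical; the command lines mirror the two `xcrun` invocations shown above, and running them requires a Mac with the Xcode command-line tools installed.

```python
import subprocess
from pathlib import Path


def metallib_commands(metal_source, output_dir="."):
    """Return the two xcrun command lines: .metal -> .air, then .air -> .metallib."""
    src = Path(metal_source)
    out = Path(output_dir)
    air = out / (src.stem + ".air")
    lib = out / (src.stem + ".metallib")
    compile_cmd = ["xcrun", "-sdk", "macosx", "metal", str(src), "-c", "-o", str(air)]
    link_cmd = ["xcrun", "-sdk", "macosx", "metallib", str(air), "-o", str(lib)]
    return [compile_cmd, link_cmd]


def build_metallib(metal_source, output_dir="."):
    """Run both steps in order; raises CalledProcessError if either tool fails."""
    for cmd in metallib_commands(metal_source, output_dir):
        subprocess.run(cmd, check=True)
```

Calling `build_metallib("src/metal/kernels.metal")` would produce `kernels.air` and `kernels.metallib` in the current directory. Keeping command construction separate from execution makes the command lines easy to inspect or log on machines without the Metal toolchain.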

## Multi-level Operators 
Take the tanh activation function as an example: this pipeline implements four levels of abstraction.

### nn.py
In the NN module, `Tanh` is an activation layer whose `forward` method calls `tanh` from `ops`.

``` python
class Tanh(Module):
    def forward(self, x: Tensor) -> Tensor:
        return ops.tanh(x)
```
### ops.py
In `ops`, the `Tanh` operator calls the low-level array API to compute tanh.
``` python
class Tanh(TensorOp):
    def compute(self, a):
        return array_api.tanh(a)

    def gradient(self, out_grad, node):
        # d/dx tanh(x) = 4 / (e^(2x) + 2 + e^(-2x)) = 1 - tanh(x)^2
        tmp = exp(node.inputs[0] * 2)
        return ((tmp + 2 + tmp ** -1) ** -1 * 4 * out_grad,)


def tanh(a):
    return Tanh()(a)
```
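The gradient expression in `Tanh.gradient` is an algebraic rewrite of the familiar derivative: `4 / (e^(2x) + 2 + e^(-2x)) = 4 / (e^x + e^(-x))^2 = 1 - tanh(x)^2`. A quick numerical check with plain Python floats, independent of Needle, can confirm this. The two function names below are hypothetical and exist only for this check.

```python
import math


def tanh_grad_exp_form(x):
    """Same arithmetic as Tanh.gradient, with out_grad fixed to 1."""
    tmp = math.exp(2 * x)
    return 4.0 / (tmp + 2.0 + 1.0 / tmp)


def tanh_grad_textbook(x):
    """The usual closed form of the tanh derivative."""
    return 1.0 - math.tanh(x) ** 2


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert math.isclose(tanh_grad_exp_form(x), tanh_grad_textbook(x), rel_tol=1e-9)
```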

### ndarray_backend_metal.cc
``` cpp
void EwiseTanh(const MetalArray<scalar_t>& a, MetalArray<scalar_t>* out) {
  BEGIN_COMPUTE_COMMAND("EwiseTanhKernel")

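  // Bind input and output to the kernel's [[buffer(0)]] and [[buffer(1)]] arguments.
  command_encoder->setBuffer(a.buffer, 0, 0);
  command_encoder->setBuffer(out->buffer, 0, 1);
  // Size a one-dimensional grid from the number of elements.
  MetalDims dim = MetalOneDim(a.size);
  command_encoder->dispatchThreads(dim.num_threads_per_grid, dim.num_threads_per_group);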

  END_COMPUTE_COMMAND
}
```
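The body of `MetalOneDim` is not shown here, so the following Python sketch only illustrates how a one-dimensional dispatch size could be chosen. The function name `one_dim_dispatch` and the default limit of 1024 threads per threadgroup are assumptions; a real implementation would more likely query the limit from the compute pipeline state (`maxTotalThreadsPerThreadgroup`).

```python
def one_dim_dispatch(num_elements, max_threads_per_group=1024):
    """Return (threads_per_grid, threads_per_group, num_groups) for a 1-D launch.

    Hypothetical helper: one thread per element, with the group size capped
    at an assumed hardware limit.
    """
    if num_elements <= 0:
        raise ValueError("num_elements must be positive")
    threads_per_group = min(num_elements, max_threads_per_group)
    # Integer ceiling division: the last group may be only partially filled.
    num_groups = (num_elements + threads_per_group - 1) // threads_per_group
    return (num_elements, 1, 1), (threads_per_group, 1, 1), num_groups


print(one_dim_dispatch(3000))  # ((3000, 1, 1), (1024, 1, 1), 3)
```

Because `dispatchThreads` takes the total grid size rather than a threadgroup count, the grid can match the element count exactly, which is consistent with the kernel indexing by `thread_position_in_grid` without a bounds check.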

### kernels.metal
``` glsl
kernel void EwiseTanhKernel(device const float* a [[buffer(0)]],
                            device float* out [[buffer(1)]],
                            uint index [[thread_position_in_grid]]) {
  float tmp = metal::tanh(a[index]);
  // Replace a NaN result with 1.0 before storing it.
  out[index] = metal::isnan(tmp) ? 1.0f : tmp;
}
```
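When testing the Metal backend, a CPU reference with the same element-wise semantics can be handy. The sketch below mirrors the kernel's logic in plain Python, including the NaN-to-1.0 substitution. The name `ewise_tanh_reference` is hypothetical, and Python floats are double precision, so results should be compared against the `float` kernel with a tolerance rather than for exact equality.

```python
import math


def ewise_tanh_reference(values):
    """Element-wise tanh that maps NaN results to 1.0, like EwiseTanhKernel."""
    out = []
    for x in values:
        tmp = math.tanh(x)
        out.append(1.0 if math.isnan(tmp) else tmp)
    return out
```

Note that, as in the kernel, a NaN input also comes out as 1.0.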
