graph: backend: elyzor: add a sketch of elyzor graph backend (oneapi-src#4)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
dchigarev committed May 2, 2024
1 parent 8d17abd commit cc1ed10
Showing 22 changed files with 6,789 additions and 0 deletions.
283 changes: 283 additions & 0 deletions examples/graph/cpu_elyzor_test.cpp
@@ -0,0 +1,283 @@
/*******************************************************************************
* Copyright 2023-2024 Intel Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*******************************************************************************/

/// @example cpu_elyzor_test.cpp
/// @copybrief graph_cpu_getting_started_cpp
/// > Annotated version: @ref graph_cpu_getting_started_cpp

/// @page graph_cpu_getting_started_cpp Getting started on CPU with Graph API
/// This example demonstrates how to build a simple graph and run it on
/// CPU.
///
/// > Example code: @ref cpu_elyzor_test.cpp
///
/// Some key takeaways from this example:
///
/// * how to build a graph and get partitions from it
/// * how to create an engine, allocator, and stream
/// * how to compile a partition
/// * how to execute a compiled partition
///
/// Some assumptions made in this example:
///
/// * only the workflow is demonstrated, without checking correctness
/// * unsupported partitions must be handled by users themselves
///

/// @page graph_cpu_getting_started_cpp
/// @section graph_cpu_getting_started_cpp_headers Public headers
///
/// To start using oneDNN Graph, we must include the @ref dnnl_graph.hpp header
/// file in the application. All the C++ APIs reside in namespace `dnnl::graph`.
///
/// @page graph_cpu_getting_started_cpp
/// @snippet cpu_elyzor_test.cpp Headers and namespace
//[Headers and namespace]
#include <iostream>
#include <memory>
#include <vector>
#include <unordered_map>
#include <unordered_set>

#include <assert.h>

#include "oneapi/dnnl/dnnl_graph.hpp"

#include "example_utils.hpp"
#include "graph_example_utils.hpp"

using namespace dnnl::graph;
using data_type = logical_tensor::data_type;
using layout_type = logical_tensor::layout_type;
using dim = logical_tensor::dim;
using dims = logical_tensor::dims;
//[Headers and namespace]

void cpu_elyzor_test_tutorial() {
/// First we create ops and add them into the graph. The graph internally
/// maintains a list to store all added ops. To create a graph,
/// #dnnl::engine::kind is needed because the returned partitions
/// may vary on different devices. For this example, we use the CPU
/// engine kind.
///
/// @note The order of adding ops doesn't matter. The connections
/// between ops are derived from the logical tensors they share.
///
/// Create a graph and add ops to it
/// @snippet cpu_elyzor_test.cpp Create graph and add ops
//[Create graph and add ops]
graph g(dnnl::engine::kind::cpu);

auto dtype = data_type::f32;
logical_tensor mul_in {0, dtype};
logical_tensor smooth_quant_scale {1, dtype};
logical_tensor mul_out {2, dtype};
logical_tensor quant_out {3, data_type::u8};


op mul {4, op::kind::Multiply, "mul"};
mul.add_input(mul_in);
mul.add_input(smooth_quant_scale);
mul.add_output(mul_out);

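// A note on the semantics (a sketch; see the oneDNN Graph Quantize op spec
// for the exact definition): with qtype "per_tensor", each element is
// quantized roughly as dst = saturate_cast<u8>(round(src / scale) + zp)
// using a single scale/zero-point pair; `axis` only matters for
// per-channel quantization.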
op quantize {5, op::kind::Quantize, "quantize"};
quantize.set_attr(op::attr::scales, std::vector<float>({0.12f}));
quantize.set_attr(op::attr::zps, std::vector<int64_t>({2}));
quantize.set_attr(op::attr::qtype, std::string("per_tensor"));
quantize.set_attr(op::attr::axis, (int64_t)0);

quantize.add_input(mul_out);
quantize.add_output(quant_out);

g.add_op(mul);
g.add_op(quantize);

//[Create graph and add ops]

/// After adding all ops into the graph, call
/// #dnnl::graph::graph::finalize() to indicate that the
/// graph building is over and the graph is ready for partitioning.
/// Adding new ops into a finalized graph or partitioning an
/// unfinalized graph will both lead to a failure.
///
/// @snippet cpu_elyzor_test.cpp Finalize graph
//[Finalize graph]
g.finalize();
//[Finalize graph]

/// After the graph is finalized, we can get partitions by calling
/// #dnnl::graph::graph::get_partitions().
///
/// In this example, the Multiply and Quantize ops may be fused into a
/// single partition or returned as separate partitions, depending on
/// the patterns supported by the enabled backends.
///
/// @snippet cpu_elyzor_test.cpp Get partition
//[Get partition]
auto partitions = g.get_partitions();
//[Get partition]

// Check the partitioning results to ensure the example works. Users do
// not need to follow this step.
std::cout << "part size: " << partitions.size() << std::endl;

/// @page graph_cpu_getting_started_cpp
/// @subsection graph_cpu_getting_started_cpp_compile Compile and Execute Partition
///
/// In a real integration, users (e.g., frameworks) should provide device
/// information at this stage. In this example, we just use a self-defined
/// device to simulate the real behavior.
///
/// Create a #dnnl::engine. Also, set a user-defined
/// #dnnl::graph::allocator to this engine.
///
/// @snippet cpu_elyzor_test.cpp Create engine
//[Create engine]
allocator alloc {};
dnnl::engine eng
= make_engine_with_allocator(dnnl::engine::kind::cpu, 0, alloc);
//[Create engine]

/// Create a #dnnl::stream on the given engine.
///
/// @snippet cpu_elyzor_test.cpp Create stream
//[Create stream]
dnnl::stream strm {eng};
//[Create stream]

// Mapping from logical tensor id to output tensors, used to represent the
// connections between partitions (e.g., partition 0's output tensor is
// fed into partition 1 as an input)
std::unordered_map<size_t, tensor> global_outputs_ts_map;

// Memory buffers bound to the partition input/output tensors,
// which help manage the lifetime of these tensors
std::vector<std::shared_ptr<void>> data_buffer;

// Mapping from id to the logical tensor queried from a compiled partition,
// used to record the logical tensors that were previously set with
// ANY layout
std::unordered_map<size_t, logical_tensor> id_to_queried_logical_tensors;

// set_any_layout() is a helper function that decides which logical tensors
// should be set with `dnnl::graph::logical_tensor::layout_type::any`.
// This function is not a part of the Graph API, but similar logic is
// essential for a Graph API integration to achieve the best performance.
// Typically, users need to implement similar logic in their own code.
std::unordered_set<size_t> ids_with_any_layout;
set_any_layout(partitions, ids_with_any_layout);
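// A sketch of the logic such a helper implements (hypothetical, for
// illustration only; the actual helper comes from graph_example_utils.hpp):
// mark an output tensor id for ANY layout only when every partition that
// consumes it is also supported, so the backend may choose an opaque
// (possibly blocked) layout without a user-visible reorder, e.g.:
//
//   for (const auto &p : partitions)
//       if (p.is_supported())
//           for (const auto &lt : p.get_output_ports())
//               if (consumed_only_by_supported_partitions(lt)) // hypothetical
//                   ids_with_any_layout.insert(lt.get_id());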

// Mapping from logical tensor id to concrete shapes.
// In practical usage, concrete shapes and layouts are not given until the
// compilation stage, hence this mapping is needed to mock that step.

dims ml1_dims {10};
dims ml2_dims {10};

std::unordered_map<size_t, dims> concrete_shapes {{0, ml1_dims}, {1, ml2_dims}};

// Compile and execute the partitions, including the following steps:
//
// 1. Update the input/output logical tensors with concrete shape and layout
// 2. Compile the partition
// 3. Update the output logical tensors with queried ones after compilation
// 4. Allocate memory and bind the data buffer for the partition
// 5. Execute the partition
//
// Although these steps are not part of the API, they are essential for
// a Graph API integration, hence users need to implement similar
// logic.
for (const auto &partition : partitions) {
if (!partition.is_supported()) {
std::cout
<< "cpu_elyzor_test: Got an unsupported partition; users need "
"to handle the operators by themselves."
<< std::endl;
continue;
}

std::vector<logical_tensor> inputs = partition.get_input_ports();
std::vector<logical_tensor> outputs = partition.get_output_ports();

// Update input logical tensors with concrete shape and layout
for (auto &input : inputs) {
const auto id = input.get_id();
// If the tensor is an output of another partition,
// use the cached logical tensor
if (id_to_queried_logical_tensors.find(id)
!= id_to_queried_logical_tensors.end())
input = id_to_queried_logical_tensors[id];
else
// Create logical tensor with strided layout
input = logical_tensor {id, input.get_data_type(),
concrete_shapes[id], layout_type::strided};
}

// Update output logical tensors with concrete shape and layout
for (auto &output : outputs) {
const auto id = output.get_id();
output = logical_tensor {id, output.get_data_type(),
DNNL_GRAPH_UNKNOWN_NDIMS, // set output dims to unknown
ids_with_any_layout.count(id) ? layout_type::any
: layout_type::strided};
}

/// Compile the partition to generate compiled partition with the
/// input and output logical tensors.
///
/// @snippet cpu_elyzor_test.cpp Compile partition
//[Compile partition]
compiled_partition cp = partition.compile(inputs, outputs, eng);
//[Compile partition]

// Update output logical tensors with the queried ones
for (auto &output : outputs) {
const auto id = output.get_id();
output = cp.query_logical_tensor(id);
id_to_queried_logical_tensors[id] = output;
}

// Allocate memory for the partition, and bind the data buffers with
// input and output logical tensors
std::vector<tensor> inputs_ts, outputs_ts;
allocate_graph_mem(inputs_ts, inputs, data_buffer,
global_outputs_ts_map, eng, /*is partition input=*/true);
allocate_graph_mem(outputs_ts, outputs, data_buffer,
global_outputs_ts_map, eng, /*is partition input=*/false);
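// What the helper does internally (a sketch; the actual implementation
// lives in graph_example_utils.hpp): for each logical tensor `lt`, it
// allocates lt.get_mem_size() bytes, keeps the buffer alive via
// `data_buffer`, wraps it as tensor {lt, eng, buffer}, and, for partition
// inputs, reuses tensors recorded in `global_outputs_ts_map` that were
// produced by earlier partitions.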

/// Execute the compiled partition on the specified stream.
///
/// @snippet cpu_elyzor_test.cpp Execute compiled partition
//[Execute compiled partition]
cp.execute(strm, inputs_ts, outputs_ts);
//[Execute compiled partition]
}

// Wait for the execution of all compiled partitions to finish
strm.wait();
}

int main(int argc, char **argv) {
return handle_example_errors(
{dnnl::engine::kind::cpu}, cpu_elyzor_test_tutorial);
}
3 changes: 3 additions & 0 deletions src/CMakeLists.txt
@@ -160,6 +160,9 @@ if(ONEDNN_BUILD_GRAPH)
if(ONEDNN_EXPERIMENTAL_GRAPH_COMPILER_BACKEND)
add_definitions_with_host_compiler(-DDNNL_ENABLE_COMPILER_BACKEND)
endif()
if(ONEDNN_EXPERIMENTAL_ELYZOR_BACKEND)
add_definitions_with_host_compiler(-DDNNL_ENABLE_ELYZOR_BACKEND)
endif()
if(ONEDNN_ENABLE_GRAPH_DUMP)
message(STATUS "Graph artifacts dump is enabled")
add_definitions_with_host_compiler(-DDNNL_ENABLE_GRAPH_DUMP)
1 change: 1 addition & 0 deletions src/graph/backend/CMakeLists.txt
@@ -17,3 +17,4 @@
add_subdirectory(fake)
add_subdirectory(dnnl)
add_subdirectory(graph_compiler)
add_subdirectory(elyzor)
65 changes: 65 additions & 0 deletions src/graph/backend/elyzor/CMakeLists.txt
@@ -0,0 +1,65 @@
#===============================================================================
# Copyright 2021-2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#===============================================================================

if(NOT ONEDNN_EXPERIMENTAL_ELYZOR_BACKEND)
message(STATUS "Elyzor backend is disabled.")
return()
endif()

message(STATUS "Elyzor backend is enabled.")

if(${CMAKE_CXX_COMPILER_ID} STREQUAL MSVC)
set(CCXX_NOWARN_FLAGS "")
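# nonstandard extension used: zero-sized array in struct/union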
append(CCXX_NOWARN_FLAGS "/wd4200")
# allow usage of "deprecated" functions
append(CCXX_NOWARN_FLAGS "/wd4996")
# inherits via dominance
append(CCXX_NOWARN_FLAGS "/wd4250")
# conversion from 'size_t' to 'uint16_t'
append(CCXX_NOWARN_FLAGS "/wd4267")
# function assumed not to throw an exception but does
append(CCXX_NOWARN_FLAGS "/wd4297")
# format string '%lu' requires an argument of type 'unsigned long'
append(CCXX_NOWARN_FLAGS "/wd4477")
# not enough arguments for function-like macro
append(CCXX_NOWARN_FLAGS "/wd4003")
# destructor was implicitly defined as deleted
append(CCXX_NOWARN_FLAGS "/wd4624")
# 'elem_type': unreferenced local variable
append(CCXX_NOWARN_FLAGS "/wd4101")
# unary minus operator applied to unsigned type
append(CCXX_NOWARN_FLAGS "/wd4146")
# destructor never returns, potential memory leak
append(CCXX_NOWARN_FLAGS "/wd4722")
# needs to have dll-interface to be used by clients of struct
append(CCXX_NOWARN_FLAGS "/wd4251")

append(CMAKE_CCXX_NOWARN_FLAGS ${CCXX_NOWARN_FLAGS})
set_property(GLOBAL PROPERTY ELYZOR_CCXX_NOWARN_FLAGS "${CCXX_NOWARN_FLAGS}")
endif()

append(CMAKE_CXX_FLAGS "${CMAKE_CCXX_NOWARN_FLAGS}")
append_host_compiler_options(CMAKE_CXX_FLAGS "${DPCPP_CXX_NOWARN_FLAGS}")

file(GLOB SOURCES
${CMAKE_CURRENT_SOURCE_DIR}/*.[ch]pp
)

set(OBJ_LIB dnnl_graph_backend_elyzor)
add_library(${OBJ_LIB} OBJECT ${SOURCES})

set_property(GLOBAL APPEND PROPERTY DNNL_LIB_DEPS
$<TARGET_OBJECTS:${OBJ_LIB}>)
21 changes: 21 additions & 0 deletions src/graph/backend/elyzor/README.md
@@ -0,0 +1,21 @@
A copy of the ['graph compiler' backend](https://github.com/dchigarev/oneDNN/tree/init_elyzor/src/graph/backend/graph_compiler) without an actual compiler.

#### How to enable:
Pass `-DONEDNN_EXPERIMENTAL_ELYZOR_BACKEND=ON` when configuring with CMake:
```
cd oneDNN
mkdir build && cd build
cmake ../ -DONEDNN_EXPERIMENTAL_ELYZOR_BACKEND=ON
```

#### How to test:
There's an example file that uses the Elyzor backend for compilation/execution ([examples/graph/cpu_elyzor_test.cpp](https://github.com/dchigarev/oneDNN/blob/init_elyzor/examples/graph/cpu_elyzor_test.cpp)).

Currently, it's only able to print "hello world" strings from the [compile](https://github.com/dchigarev/oneDNN/blob/c0a48558295dfcabf84c6ab68e6311ac95c98d6b/src/graph/backend/elyzor/compiler_partition_impl.cpp#L121) and [execute](https://github.com/dchigarev/oneDNN/blob/c0a48558295dfcabf84c6ab68e6311ac95c98d6b/src/graph/backend/elyzor/compiler_partition_impl.cpp#L185) methods.

#### Hacks:
1. The graph compiler's front-end [uses certain functionality](https://github.com/dchigarev/oneDNN/blob/c0a48558295dfcabf84c6ab68e6311ac95c98d6b/src/graph/backend/graph_compiler/target_machine.hpp#L19-L24)
from its core to detect which CPU instructions are available and to [define patterns accordingly](https://github.com/dchigarev/oneDNN/blob/c0a48558295dfcabf84c6ab68e6311ac95c98d6b/src/graph/backend/graph_compiler/compiler_backend.cpp#L54).
In Elyzor we don't have this functionality, so we [assume that all instructions are available](https://github.com/dchigarev/oneDNN/blob/c0a48558295dfcabf84c6ab68e6311ac95c98d6b/src/graph/backend/elyzor/target_machine.hpp#L19-L27) (see the sketch after this list).
2. The [compile](https://github.com/dchigarev/oneDNN/blob/c0a48558295dfcabf84c6ab68e6311ac95c98d6b/src/graph/backend/elyzor/compiler_partition_impl.cpp#L142-L146)
and [execute](https://github.com/dchigarev/oneDNN/blob/c0a48558295dfcabf84c6ab68e6311ac95c98d6b/src/graph/backend/elyzor/compiler_partition_impl.cpp#L185-L188) methods are dummies for now (also illustrated in the sketch below).
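
For illustration, here's a self-contained sketch of what these two hacks amount to (all names below are hypothetical stand-ins, not the actual Elyzor sources):

```
#include <iostream>

// Hack 1 stand-in: instead of probing CPUID through the graph compiler's
// core, every ISA feature is unconditionally reported as available.
struct dummy_target_machine {
    static bool has_feature(const char * /*feature_name*/) { return true; }
};

// Hack 2 stand-ins: the compile/execute entry points only report that they
// were reached (the real methods additionally return a success status).
void dummy_compile() {
    std::cout << "hello world from elyzor compile()" << std::endl;
}

void dummy_execute() {
    std::cout << "hello world from elyzor execute()" << std::endl;
}

int main() {
    if (dummy_target_machine::has_feature("avx512_core")) dummy_compile();
    dummy_execute();
    return 0;
}
```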
