
Execution of Inference

Mingyu Kim edited this page Jul 5, 2022 · 11 revisions

Network execution starts when the user calls inferRequest->infer() or inferRequest->start_async(). (src)

At a high level, all we need to do is enqueue OCL kernels with their buffers. To do that, we first need to find the cldnn::network instance, as it contains the buffers required for execution. (link) CPUStreamExecutor holds the streams, and each stream corresponds to a cldnn::network structure. (src)

The main body of network execution is cldnn::network::execute_impl. (src) In this function, set_arguments() is called to set the OpenCL kernel arguments, and execute_primitive is called to enqueue the kernels to the OCL queue. For a synchronous API call (i.e. inferRequest->infer()), we also need to wait for the kernels to complete. This wait is invoked from the cldnn::network_output::get_memory() function. (src)
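The flow described above can be sketched with a small mock. This is only an illustration under stated assumptions: MockPrimitive, MockStream, MockNetwork and wait_for_completion are hypothetical stand-ins invented for this example and are not the real cldnn classes or signatures. Only the ordering (set arguments, enqueue kernels, then wait in the synchronous case) follows the text.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a primitive instance (not the real cldnn type).
struct MockPrimitive {
    std::string id;
    bool arguments_set = false;
};

// Hypothetical stand-in for an OCL queue: kernels are first enqueued,
// and only become "completed" once finish() is called.
struct MockStream {
    std::vector<std::string> pending;
    std::vector<std::string> completed;

    void enqueue(const std::string& kernel_id) { pending.push_back(kernel_id); }

    void finish() {
        completed.insert(completed.end(), pending.begin(), pending.end());
        pending.clear();
    }
};

struct MockNetwork {
    std::vector<MockPrimitive> exec_order;
    MockStream stream;

    // Step 1: bind buffers to kernels (modeled here as a flag).
    void set_arguments() {
        for (auto& prim : exec_order) prim.arguments_set = true;
    }

    // Step 2: enqueue one kernel; arguments must already be set.
    void execute_primitive(const MockPrimitive& prim) {
        assert(prim.arguments_set);
        stream.enqueue(prim.id);
    }

    // Main body: set arguments, then enqueue everything. Nothing is
    // guaranteed to have completed when this returns.
    void execute_impl() {
        set_arguments();
        for (const auto& prim : exec_order) execute_primitive(prim);
    }

    // Synchronous path only: block until all enqueued kernels are done.
    void wait_for_completion() { stream.finish(); }
};
```

In this sketch an asynchronous caller would stop after execute_impl(), while a synchronous caller would additionally call wait_for_completion() before reading the outputs.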

Optimized-out node

During graph compilation (link), some nodes may be optimized out.

For example, a concat operation may be executed implicitly; in other words, the concat may be optimized out. Implicit concat is possible when the inputs of the concat can write their output tensors directly into the result tensor of the concat.

In such a case, we do not remove the node from the graph, in order to preserve the integrity of node connections. The concat layer is just marked as optimized-out and is not executed at runtime. (src)
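A minimal sketch of this idea, assuming a hypothetical MockNode type with an optimized_out flag (the names are invented for illustration and do not mirror the actual cldnn data structures): the node stays in the execution order so connections remain intact, but it is skipped when kernels are enqueued.

```cpp
#include <string>
#include <vector>

// Hypothetical node record; the real graph node carries far more state.
struct MockNode {
    std::string id;
    bool optimized_out;  // set during graph compilation in this sketch
};

// Returns the ids of the nodes that would actually be enqueued.
// Optimized-out nodes remain in `order` but produce no kernel launch.
std::vector<std::string> collect_executed(const std::vector<MockNode>& order) {
    std::vector<std::string> executed;
    for (const auto& node : order) {
        if (node.optimized_out) continue;  // e.g. an implicit concat
        executed.push_back(node.id);
    }
    return executed;
}
```

With an order such as conv_a, conv_b, concat (marked optimized-out), relu, only the two convolutions and the relu would be enqueued in this sketch.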

Dumping layer in/out buffer during execution

cldnn::network::execute_impl also contains some logic to dump layer in/out buffers for debugging purposes. As it is related to memory usage, it deserves some description, too.

In order to dump buffers, we need to wait for the moment when the kernel is about to be called (for the source buffer) or has just been called (for the destination buffer). At other moments, the layer's buffer is not available, as buffers are reused from the memory pool. (link)
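Why the timing matters can be shown with a toy example. This is a sketch under assumptions: a plain std::vector stands in for a pooled buffer, and run_layer and run_and_dump are hypothetical helpers, not part of the plugin. Two layers write to the same pooled block, so a snapshot is only meaningful if it is taken right after the layer that produced the data.

```cpp
#include <vector>

// Hypothetical layer: fills its output buffer with a constant value.
void run_layer(std::vector<float>& out, float value) {
    for (auto& v : out) v = value;
}

// Runs the layer and immediately takes a host-side snapshot of its
// destination buffer, before any later layer can reuse the block.
std::vector<float> run_and_dump(std::vector<float>& out, float value) {
    run_layer(out, value);
    return out;  // copy taken "just after the kernel was called"
}
```

If a later layer reuses the same block, the block no longer holds the earlier layer's output, while the snapshot returned by run_and_dump still does.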

get_stream().finish() is called first, as we need to be synchronized with kernel execution. (src) Then we can access the buffer. (src) How it is accessed depends on the kind of buffer. If it is usm_host or usm_shared, it is accessed directly. If it is usm_device, it is accessed after copying the data into host memory, because the host cannot access usm_device directly. (src) If it is ocl memory, we map it into host memory. (src)
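The per-buffer-type decision above can be summarized as a small dispatch. The enums and the choose_access function below are hypothetical names made up for this sketch; they only encode the rules stated in the text and do not reproduce the plugin's actual types or code.

```cpp
// Hypothetical buffer kinds, named after the kinds mentioned in the text.
enum class AllocType { usm_host, usm_shared, usm_device, ocl_mem };

// Hypothetical description of how the host gets at the data for a dump.
enum class AccessPath { direct, copy_to_host, map_to_host };

// Encodes the rules from the text: host-visible USM is read directly,
// device USM is copied to host memory first, OCL memory is mapped.
AccessPath choose_access(AllocType type) {
    switch (type) {
        case AllocType::usm_host:
        case AllocType::usm_shared:
            return AccessPath::direct;
        case AllocType::usm_device:
            return AccessPath::copy_to_host;
        case AllocType::ocl_mem:
            return AccessPath::map_to_host;
    }
    return AccessPath::map_to_host;  // not reached; keeps compilers quiet
}
```

In this sketch the synchronization step (finishing the stream) is assumed to have happened before choose_access is consulted.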

Typical network execution uses usm_host for network input and output, and usm_device for the buffers inside the network.

For usage of this dumping feature, please see link
