Is the output inefficient? #14

Closed
nomaddo opened this issue Dec 7, 2017 · 8 comments

nomaddo (Collaborator) commented Dec 7, 2017

Recently I read the output of VC4C, and to me it seems to contain some inefficient instructions.
I would like to understand the design decisions behind the generated output.

kernel void add (global float a[], global float b[]) {
  int id = get_global_id(0);
  float8 v1 = vload8(id, a);
  float8 v2 = vload8(id, b);
  v1 += v2;
  vstore8 (v1, id, a);
}

https://gist.github.com/nomaddo/728ffc2fa605ab5b87f316a6280246be

My questions are:

  • In lines 3, 6, 7, and so on, the code reads from a UNIFORM and drops the value. Why does it do that?
    I would like to know the layout of the UNIFORMs which the VC4CL runtime passes to the kernel.
  • The code reads values via the VPM, not the TMU.
    In general, loading through the TMU is better than through the VPM, because there are two TMUs and they do not require mutex locks. Why did you choose to load via the VPM?

And some optimizations seem to be missing:

  • Lines 30 and 31 can be fused:

    or r0, r2, r2
    shl r0, r0, 2

    is equal to

    shl r0, r2, 2

  • In line 22, or r2, r2, r2 has no meaningful effect.
  • Have you implemented instruction scheduling? It matters a lot on a simple architecture like VC4.
doe300 self-assigned this Dec 7, 2017
doe300 (Owner) commented Dec 7, 2017

In lines 3, 6, 7, and so on, the code reads from a UNIFORM and drops the value. Why does it do that?
I would like to know the layout of the UNIFORMs which the VC4CL runtime passes to the kernel.

The reason why I read and drop some (up to 13, I think) UNIFORMs lies in their layout:
The parameters are passed to the kernel execution by reading UNIFORM values. Besides the "explicit" parameters specified in the kernel code, some "hidden" parameters are passed too, like the work-item index, the work-group size, and so on.

The complete list of UNIFORMs is, in order (it can also be seen in src/asm/CodeGenerator.cpp#generateStartSegment):

  1. work-dimensions (get_work_dim())
  2. local-sizes (contains all 3 dimensions of get_local_size() packed into one 32-bit word) (*)
  3. local ids (contains all 3 dimensions of get_local_id() packed into one 32-bit word) (*)
  4. number of work-groups x (get_num_groups(0))
  5. number of work-groups y (get_num_groups(1))
  6. number of work-groups z (get_num_groups(2))
  7. group ID x (get_group_id(0))
  8. group ID y (get_group_id(1))
  9. group ID z (get_group_id(2))
  10. global offsets x (get_global_offset(0))
  11. global offsets y (get_global_offset(1))
  12. global offsets z (get_global_offset(2))
  13. global data address (the base pointer to the beginning of the global data segment)
  14. the actual "explicit" parameters. NOTE: for vector parameters (but not pointers to vectors!), every element is passed in its own UNIFORM and the vector is then assembled device-side
  15. (last UNIFORM) a flag indicating whether to re-run the whole kernel execution or to exit. This is used in an optimization which executes multiple work-groups without needing to restart the execution host-side, by simply re-running the whole code with modified UNIFORM values

(*) Packing the values in 2. and 3. is possible, since the local size/id cannot exceed 12 and therefore can fit into a single byte.
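
As an illustration of the packing described in (*), here is a minimal C sketch of how the packed local-size word (item 2) could be decoded; the byte order within the word is my assumption and not taken from the VC4C sources. The same scheme would apply to the packed local IDs (item 3).

#include <stdint.h>

/* Hypothetical decoding of the packed local-size UNIFORM (item 2 above).
 * Each dimension of get_local_size() is at most 12, so it fits in one byte.
 * Assumed byte order: x in the lowest byte, then y, then z. */
static void unpack_local_sizes(uint32_t packed, uint32_t sizes[3]) {
    sizes[0] = (packed >> 0)  & 0xFF;  /* x */
    sizes[1] = (packed >> 8)  & 0xFF;  /* y */
    sizes[2] = (packed >> 16) & 0xFF;  /* z */
}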

The uniforms 1. to 13. and 15. are always passed to the kernel. If the kernel does not use their values, they are read and discarded (they need to be read to be able to access the next uniform).

To avoid discarding any UNIFORMs, the kernel meta-data would need to contain fields indicating which of the work-item values are actually used, and the host library would then have to set only those fields, in a fixed order.

The code reads values via the VPM, not the TMU.
In general, loading through the TMU is better than through the VPM, because there are two TMUs and they do not require mutex locks. Why did you choose to load via the VPM?

Speed and configurability. You are right, reading from the TMU is much easier (and AFAIK doesn't need to be secured via a global mutex), but it is also much slower. Also, the TMU can only read single 32-bit values (up to 16 at a time, but still only single values and only 32-bit values). The VPM, on the other hand, can load several values at once, with sizes of 8, 16 or 32 bits as well as vectors of arbitrary size (between 1 and 16 elements, from consecutive memory), which models the ways OpenCL allows memory to be accessed much better.
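
As a concrete example of the kind of access the VPM path handles directly, consider a kernel working on 8-bit, 3-element vectors; serving this through 32-bit-only TMU reads would need extra loads and unpacking. The kernel below is only an illustration made up for this point, not code from the issue:

kernel void scale_rgb (global uchar *pixels, uchar gain) {
  int id = get_global_id(0);
  uchar3 px = vload3(id, pixels);  /* three consecutive bytes from memory */
  px *= gain;
  vstore3(px, id, pixels);
}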

Lines 30 and 31 can be fused.

The problem here lies with the register allocator (or, more precisely, its limitations). To enable the register allocator to find a correct variable-to-register mapping in many cases, variables are not eliminated if they are used. In some special cases this could be optimized away, but definitely not in the general case.

In line 22, or r2, r2, r2 has no meaningful effect.

This happens if the instruction is the last read of the operands and the first write of the output value. Originally this would have looked something like mov out, in. If the register allocator is able to assign both variables to the same register (here: accumulator 2), this unnecessary code is generated. The fix for this is related to the unnecessary code above.
The problem in short: by the time we know that both variables reside in the same register, we can no longer remove the instruction, since the labels have already been mapped to their positions in the code.

Have you implemented instruction scheduling? It matters a lot on a simple architecture like VC4.

There is one optimization that re-orders instructions, but only to replace nops with "useful" instructions. General instruction re-ordering is not implemented.

nomaddo (Collaborator, author) commented Dec 8, 2017

Thanks, that is very helpful.
Anyway, this code is slower than I expected (it takes 0.1 s to sum 10^6 elements, i.e. about 10 MFLOPS!).
I will try to add some optimizations to make it faster.

doe300 (Owner) commented Dec 8, 2017

Anyway, this code is slower than I expected (it takes 0.1 s to sum 10^6 elements, i.e. about 10 MFLOPS!).

The problem here is the fact that of the 5 lines of code, 3 access memory, which is slow (see here).

If you can use float16 instead of float8, the performance should go up a bit (I estimate to about 12 - 18 MFLOPS).
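
For reference, this is roughly what the kernel from the first comment would look like with float16 (a sketch, untested); the host would then have to enqueue half as many work-items, since each one now processes 16 elements instead of 8:

kernel void add (global float a[], global float b[]) {
  int id = get_global_id(0);
  float16 v1 = vload16(id, a);
  float16 v2 = vload16(id, b);
  v1 += v2;
  vstore16(v1, id, a);
}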

nomaddo (Collaborator, author) commented Dec 11, 2017

I wrote the same program in assembly.
In my experiment, the performance of the TMU version (reading from the TMUs and writing via the VPM) is about 80-100 MFLOPS.
The program is here (I use py-videocore as the assembler).
https://gist.github.com/nomaddo/6a89bd57e1d30a14518a12f9ffdc9812

My colleagues and I generally load from the TMUs and write using the VPM, so as not to stall the threads.
If this method is not effective, then after considering the memory layout we consider loading via the VPM and UNIFORMs.
So, first of all, all memory loads can be implemented using the TMUs.

That's our experience. I would be glad to hear if you have other knowledge or benchmark results about this.

nomaddo (Collaborator, author) commented Dec 12, 2017

One more thing: is it possible that only 8 elements are loaded via the VPM in one mutex block?
E.g. line 40 loads only one vector via the VPM and then releases the mutex, which means the VPM is not working at full power?

doe300 (Owner) commented Dec 12, 2017

is it possible that only 8 elements are loaded via the VPM in one mutex block?

Yes, they are. There is an optimization to load multiple successive values (used by the same QPU) at once (from RAM into the VPM), but it cannot be used here, since the two loads use different base addresses.

A more general optimization to use the VPM as a cache shared across all QPUs is planned, but currently there are a few problems to solve before I am able to implement it.

nomaddo (Collaborator, author) commented Dec 12, 2017

Thanks. I understand how it works.

A more general optimization to use the VPM as a cache shared across all QPUs is planned, but currently there are a few problems to solve before I am able to implement it.

Hmm, I don't think this method will perform well. You mean each worker deals with a small piece of data and then stores it to the VPM as a cache? I imagine that this does not give good performance, because the writes sit behind mutex locks, as you know.
My colleagues and I usually try to implement one physical thread as follows:

  • Load from the TMUs
  • Do the calculation using all the registers
  • Take the mutex lock, and store to memory using the VPM

Then each thread takes turns occupying the VPM.
If the calculation is heavy, such a program parallelizes well.
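
As a rough host-side analogy of this pattern (plain C with pthreads, not VideoCore assembly; all names below are made up for illustration): the load and compute phases run without any lock, and only the store phase is serialized, the way the mutex-guarded VPM store would be.

#include <pthread.h>
#include <stddef.h>

#define CHUNK 16

static float a[1024], b[1024];
static pthread_mutex_t vpm_lock = PTHREAD_MUTEX_INITIALIZER;

struct slice { size_t begin, end; };

static void *worker(void *arg) {
    struct slice *s = arg;
    float regs[CHUNK];                              /* stands in for the QPU register file */
    for (size_t base = s->begin; base < s->end; base += CHUNK) {
        for (size_t i = 0; i < CHUNK; ++i)          /* "load from the TMUs": no lock needed */
            regs[i] = a[base + i] + b[base + i];    /* "calculation using the registers" */
        pthread_mutex_lock(&vpm_lock);              /* only the store phase is serialized */
        for (size_t i = 0; i < CHUNK; ++i)          /* "store to memory via the VPM" */
            a[base + i] = regs[i];
        pthread_mutex_unlock(&vpm_lock);
    }
    return NULL;
}

int main(void) {
    struct slice halves[2] = { { 0, 512 }, { 512, 1024 } };
    pthread_t threads[2];
    for (int t = 0; t < 2; ++t)
        pthread_create(&threads[t], NULL, worker, &halves[t]);
    for (int t = 0; t < 2; ++t)
        pthread_join(threads[t], NULL);
    return 0;
}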

Anyway, in my understanding, we need to fuse worker threads to load or store 16x64 elements at a time.

nomaddo (Collaborator, author) commented Jan 18, 2018

This issue has now been split into separate individual issues.
