Is the output inefficient? #14

Closed
nomaddo opened this issue Dec 7, 2017 · 8 comments

nomaddo (Collaborator) commented Dec 7, 2017

Recently I read the output of VC4C, and to me it seems to contain some inefficient instructions.
I would like to understand the design decisions behind the generated output.

kernel void add (global float a[], global float b[]) {
  int id = get_global_id(0);
  float8 v1 = vload8(id, a);
  float8 v2 = vload8(id, b);
  v1 += v2;
  vstore8 (v1, id, a);
}

https://gist.github.com/nomaddo/728ffc2fa605ab5b87f316a6280246be

My questions are:

  • In lines 3, 6, 7, and so on, the code reads from a UNIFORM and drops the value. Why does it do that?
    I would like to know the layout of the UNIFORMs which the VC4CL runtime passes to the kernel.
  • The code reads values via the VPM, not the TMU.
    In general, loading through the TMU is better than through the VPM, because there are two TMUs and they do not require mutex locks. Why did you choose to load via the VPM?

And some optimizations seem to be missing:

  • Lines 30 and 31 can be fused:

    or r0, r2, r2
    shl r0, r0, 2

    is equal to

    shl r0, r2, 2

  • In line 22, or r2, r2, r2 has no meaningful effect.
  • Have you implemented instruction scheduling? It matters a lot on a simple architecture like VC4.
doe300 self-assigned this Dec 7, 2017
doe300 (Owner) commented Dec 7, 2017

In lines 3, 6, 7, and so on, the code reads from a UNIFORM and drops the value. Why does it do that?
I would like to know the layout of the UNIFORMs which the VC4CL runtime passes to the kernel.

The reason why I read and drop some (up to 13, I think) UNIFORMs lies in their layout:
The parameters are passed to the kernel execution by reading UNIFORM values. Besides the "explicit" parameters specified in the kernel code, some "hidden" parameters are passed too, like the work-item index, the work-group size, and so on.

The complete list of UNIFORMs is, in order (it can also be seen in src/asm/CodeGenerator.cpp#generateStartSegment):

  1. work-dimensions (get_work_dim())
  2. local-sizes (contains all 3 dimensions of get_local_size() packed into one 32-bit word) (*)
  3. local ids (contains all 3 dimensions of get_local_id() packed into one 32-bit word) (*)
  4. number of work-groups x (get_num_groups(0))
  5. number of work-groups y (get_num_groups(1))
  6. number of work-groups z (get_num_groups(2))
  7. group ID x (get_group_id(0))
  8. group ID y (get_group_id(1))
  9. group ID z (get_group_id(2))
  10. global offsets x (get_global_offset(0))
  11. global offsets y (get_global_offset(1))
  12. global offsets z (get_global_offset(2))
  13. global data address (the base pointer to the beginning of the global data segment)
  14. the actual "explicit" parameters. NOTE: for vector parameters (but not pointers to vectors!), every element is passed in its own UNIFORM and the vector is then assembled device-side
  15. (last UNIFORM) a flag indicating whether to re-run the whole kernel execution or to exit. This is used in an optimization which executes multiple work-groups without needing to restart the execution host-side, by simply re-running the whole code with modified UNIFORM values

(*) Packing the values in 2. and 3. is possible, since the local size/id cannot exceed 12 and therefore can fit into a single byte.
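
As an illustration of the packing described in (*), here is a minimal C sketch of how the packed local-size word (item 2) could be decoded; the byte order within the word is my assumption and not taken from the VC4C sources. The same scheme would apply to the packed local IDs (item 3).

#include <stdint.h>

/* Hypothetical decoding of the packed local-size UNIFORM (item 2 above).
 * Each dimension of get_local_size() is at most 12, so it fits in one byte.
 * Assumed byte order: x in the lowest byte, then y, then z. */
static void unpack_local_sizes(uint32_t packed, uint32_t sizes[3]) {
    sizes[0] = (packed >> 0)  & 0xFF;  /* x */
    sizes[1] = (packed >> 8)  & 0xFF;  /* y */
    sizes[2] = (packed >> 16) & 0xFF;  /* z */
}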

The uniforms 1. to 13. and 15. are always passed to the kernel. If the kernel does not use their values, they are read and discarded (they need to be read to be able to access the next uniform).

To avoid discarding any UNIFORMs, the kernel meta-data would need to contain fields indicating which of the work-item values are actually used, and the host library would then have to set only those fields, in a fixed order.

The code reads values via the VPM, not the TMU.
In general, loading through the TMU is better than through the VPM, because there are two TMUs and they do not require mutex locks. Why did you choose to load via the VPM?

Speed and configurability. You are right, reading from the TMU is much easier (and AFAIK doesn't need to be secured via a global mutex), but it is also much slower. Also, the TMU can only read single 32-bit values (up to 16 at a time, but still only single values and only 32-bit values). The VPM, on the other hand, can load several values at once, with sizes of 8, 16 or 32 bits as well as vectors of arbitrary size (between 1 and 16 elements, from consecutive memory), which models the ways OpenCL allows memory to be accessed much better.
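
As a concrete example of the kind of access the VPM path handles directly, consider a kernel working on 8-bit, 3-element vectors; serving this through 32-bit-only TMU reads would need extra loads and unpacking. The kernel below is only an illustration made up for this point, not code from the issue:

kernel void scale_rgb (global uchar *pixels, uchar gain) {
  int id = get_global_id(0);
  uchar3 px = vload3(id, pixels);  /* three consecutive bytes from memory */
  px *= gain;
  vstore3(px, id, pixels);
}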

Lines 30 and 31 can be fused.

The problem here lies with the register allocator (or, more precisely, its limitations). To enable the register allocator to find a correct variable-to-register mapping in many cases, variables are not eliminated if they are used. In some special cases this could be optimized away, but definitely not in the general case.

In line 22, or r2, r2, r2 has no meaningful effect.

This happens if the instruction is the last read of the operands and the first write of the output value. Originally this would have looked something like mov out, in. If the register allocator is able to assign both variables to the same register (here: accumulator 2), this unnecessary code is generated. The fix for this is related to the unnecessary code above.
The problem in short: by the time we know that both variables reside in the same register, we can no longer remove the instruction, since the labels have already been mapped to their positions in the code.

Have you implemented instruction scheduling? It matters a lot on a simple architecture like VC4.

There is one optimization that re-orders instructions, but only to replace nops with "useful" instructions. General instruction re-ordering is not implemented.

nomaddo (Collaborator, author) commented Dec 8, 2017

Thanks, that is very helpful.
Anyway, this code is slower than I expected (it takes 0.1 s to sum 10^6 elements, i.e. about 10 MFLOPS!).
I will try to add some optimizations to make it faster.

doe300 (Owner) commented Dec 8, 2017

Anyway, this code is slower than I expected (it takes 0.1 s to sum 10^6 elements, i.e. about 10 MFLOPS!).

The problem here is the fact that of the 5 lines of code, 3 access memory, which is slow (see here).

If you can use float16 instead of float8, the performance should go up a bit (I estimate to about 12 - 18 MFLOPS).
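
For reference, this is roughly what the kernel from the first comment would look like with float16 (a sketch, untested); the host would then have to enqueue half as many work-items, since each one now processes 16 elements instead of 8:

kernel void add (global float a[], global float b[]) {
  int id = get_global_id(0);
  float16 v1 = vload16(id, a);
  float16 v2 = vload16(id, b);
  v1 += v2;
  vstore16(v1, id, a);
}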

nomaddo (Collaborator, author) commented Dec 11, 2017

I wrote the same program in assembly.
In my experiment, the performance of the TMU version (reading from the TMUs and writing via the VPM) is about 80-100 MFLOPS.
The program is here (I use py-videocore as the assembler).
https://gist.github.com/nomaddo/6a89bd57e1d30a14518a12f9ffdc9812

My colleagues and I generally load from the TMUs and write using the VPM, so as not to stall the threads.
If this method is not effective, then after considering the memory layout we consider loading via the VPM and UNIFORMs.
So, first of all, all memory loads can be implemented using the TMUs.

That's our experience. I would be glad to hear if you have other knowledge or benchmark results about this.

nomaddo (Collaborator, author) commented Dec 12, 2017

One more thing: is it possible that only 8 elements are loaded via the VPM in one mutex block?
E.g. line 40 loads only one vector via the VPM and then releases the mutex, which means the VPM is not working at full power?

doe300 (Owner) commented Dec 12, 2017

is it possible that only 8 elements are loaded via the VPM in one mutex block?

Yes, they are. There is an optimization to load multiple successive values (used by the same QPU) at once (from RAM into the VPM), but it cannot be used here, since the two loads use different base addresses.

A more general optimization to use the VPM as a cache shared across all QPUs is planned, but currently there are a few problems to solve before I am able to implement it.

nomaddo (Collaborator, author) commented Dec 12, 2017

Thanks. I understand how it works.

A more general optimization to use the VPM as a cache shared across all QPUs is planned, but currently there are a few problems to solve before I am able to implement it.

Hmm, I don't think this method will perform well. You mean each worker deals with a small piece of data and then stores it to the VPM as a cache? I imagine that this does not give good performance, because the writes sit behind mutex locks, as you know.
My colleagues and I usually try to implement one physical thread as follows:

  • Load from the TMUs
  • Do the calculation using all the registers
  • Take the mutex lock, and store to memory using the VPM

Then each thread takes turns occupying the VPM.
If the calculation is heavy, such a program parallelizes well.
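
As a rough host-side analogy of this pattern (plain C with pthreads, not VideoCore assembly; all names below are made up for illustration): the load and compute phases run without any lock, and only the store phase is serialized, the way the mutex-guarded VPM store would be.

#include <pthread.h>
#include <stddef.h>

#define CHUNK 16

static float a[1024], b[1024];
static pthread_mutex_t vpm_lock = PTHREAD_MUTEX_INITIALIZER;

struct slice { size_t begin, end; };

static void *worker(void *arg) {
    struct slice *s = arg;
    float regs[CHUNK];                              /* stands in for the QPU register file */
    for (size_t base = s->begin; base < s->end; base += CHUNK) {
        for (size_t i = 0; i < CHUNK; ++i)          /* "load from the TMUs": no lock needed */
            regs[i] = a[base + i] + b[base + i];    /* "calculation using the registers" */
        pthread_mutex_lock(&vpm_lock);              /* only the store phase is serialized */
        for (size_t i = 0; i < CHUNK; ++i)          /* "store to memory via the VPM" */
            a[base + i] = regs[i];
        pthread_mutex_unlock(&vpm_lock);
    }
    return NULL;
}

int main(void) {
    struct slice halves[2] = { { 0, 512 }, { 512, 1024 } };
    pthread_t threads[2];
    for (int t = 0; t < 2; ++t)
        pthread_create(&threads[t], NULL, worker, &halves[t]);
    for (int t = 0; t < 2; ++t)
        pthread_join(threads[t], NULL);
    return 0;
}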

Anyway, in my understanding, we need to fuse worker threads to load or store 16x64 elements at a time.

nomaddo (Collaborator, author) commented Jan 18, 2018

This issue has now been split into separate individual issues.
