Is the output inefficient? #14
The reason why I read and drop some uniforms (up to 13, I think) lies in the layout of the uniforms. The complete list of uniforms reads (in order, can also be seen in
(*) Packing the values in 2. and 3. is possible since the local size/ID cannot exceed 12 and can therefore fit into a single byte. Uniforms 1. to 13. and 15. are always passed to the kernel. If the kernel does not use their values, they are read and discarded (they need to be read in order to access the next uniform). To avoid discarding any uniforms, the kernel metadata would need to contain fields indicating which of the work-item values are actually used, and the host library would need to set only the used fields, in a fixed order.
Speed and configurability. You are right, reading from TMU is much easier (and AFAIK doesn't need to be secured via a global mutex), but it is also much slower. Likewise, the TMU can only read single 32-bit integers (up to 16 at a time, but still only single values and only 32-bit values). The VPM on the other hand can load several values at once, with sizes of 8-, 16- or 32-bit as well as vectors of arbitrary size (between 1 and 16 elements, from consecutive memory), which models the possible ways of accessing memory OpenCL allows much better.
The problem here lies with the register allocator (or, more precisely, its limitations). To enable the register allocator to find a correct variable-to-register mapping in many cases, variables are not eliminated if they are used. In some special cases this could be optimized away, but definitely not in the general case.
This happens if the instruction is the last read of the operands and the first write of the output value. Originally, this would have looked something like
There is one optimization that re-orders instructions, but only to replace
Thanks, very helpful for me.
The problem here is that of 5 lines of code, 3 access memory, which is slow (see here). If you can use
I wrote the same program in assembly. My colleague and I generally load via the TMUs and write via the VPM so as not to stall threads. That's our experience. I'd be glad to hear if you have other knowledge or performance benchmark results about this.
One more thing: is it possible that only 8 elements are loaded via the VPM in one mutex block?
Yes, they are. There exists an optimization to load multiple successive values (used by the same QPU) at once (from RAM into the VPM), which cannot be used here, since the two loads use different base addresses. A more general optimization to use the VPM as a cache shared across all QPUs is planned, but there are currently a few problems to solve before I am able to implement it.
Thanks. I understand how it works.
Hmm, I don't think this method would perform well. You mean a worker deals with a small piece of data, then stores it to the VPM as a cache? I imagine that would not give good performance because of the writes inside mutex locks, as you know.
Then each thread takes turns occupying the VPM. Anyway, in my understanding, we need to fuse worker threads to load or store 16x64 elements.
This issue has already been split into individual issues.
Recently, I read the output of VC4C. To me, it includes inefficient instructions. I would like to know the design concept behind the output.
https://gist.github.com/nomaddo/728ffc2fa605ab5b87f316a6280246be
My question is:
I want to know the layout of the uniforms which the VC4CL runtime passes to the kernel. In general, loading from the TMU is better than from the VPM, because we have two TMUs and they are free from mutex locks. Why did you choose to load via the VPM?
And some optimizations seem to be lacking:
is equal to
or r2, r2, r2
has no meaningful effect.