A2XX Shader Instruction Set Architecture

Gabriele Svelto edited this page Feb 19, 2015 · 2 revisions

Note: parts of this page are somewhat outdated. Check instr-a2xx.h for the latest definitions until I have time to update the wiki page.

Instruction Set Architecture (ISA) Overview

The adreno GPU has a unified shader architecture, so the same instruction set and shader resources are used by both vertex (VS) and fragment/pixel (PS) shaders.

It resembles the r600 ISA in that it is a VLIW architecture that separates the control flow (CF) program, which controls the execution flow, from the arithmetic and logic (ALU) and FETCH instructions. But while an r600 ALU instruction consists of up to 5 scalar operations, an adreno ALU instruction consists of one vec4 operation and/or one scalar operation.

An adreno shader consists of 96-bit (3 dword) CF, ALU, and FETCH instructions.

Assembler syntax

The assembler syntax is loosely based on the r600 assembler syntax and on a single screenshot in optimize-adreno.pdf (pg 6). As with the r600 assembler syntax, the CF and ALU/FETCH instructions are interleaved for easier reading:

EXEC ADDR(0x3) CNT(0x1)
   (S)FETCH:    VERTEX  R1.xyz1 = R0.x FMT_32_32_32_FLOAT UNSIGNED STRIDE(12) CONST(10)
ALLOC COORD SIZE(0x0)
EXEC ADDR(0x4) CNT(0x1)
      ALU:  MAXv    export62 = R1, R1   ; gl_Position
ALLOC PARAM/PIXEL SIZE(0x0)
EXEC_END ADDR(0x5) CNT(0x0)

While in memory, the actual layout is:

EXEC ADDR(0x3) CNT(0x1)
ALLOC COORD SIZE(0x0)
EXEC ADDR(0x4) CNT(0x1)
ALLOC PARAM/PIXEL SIZE(0x0)
EXEC_END ADDR(0x5) CNT(0x0)
   (S)FETCH:    VERTEX  R1.xyz1 = R0.x FMT_32_32_32_FLOAT UNSIGNED STRIDE(12) CONST(10)
      ALU:  MAXv    export62 = R1, R1   ; gl_Position

The ADDR and CNT fields of the EXEC and EXEC_END CF clauses give the offset (in multiples of 96 bits) and the instruction count of the corresponding ALU/FETCH instructions. The shader assembler lets you omit these fields, since it can compute them itself.
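As a rough sketch of how an assembler could fill in ADDR and CNT, the following Python assigns offsets to EXEC clauses. It assumes (this is inferred from the example above, not documented) that two CF clauses share one 96-bit slot, so the ALU/FETCH instructions start right after ceil(clauses / 2) slots. The function name `assign_exec_addresses` is made up for illustration.

```python
def assign_exec_addresses(clauses):
    """clauses: list of (name, n_instrs); n_instrs is None for non-EXEC clauses.

    Returns a list of (name, addr, cnt), with addr/cnt None for non-EXEC clauses.
    """
    # Assumption: two CF clauses are packed per 96-bit CF instruction, so the
    # CF program takes ceil(len(clauses) / 2) slots before the first ALU/FETCH.
    next_addr = (len(clauses) + 1) // 2
    layout = []
    for name, n_instrs in clauses:
        if name in ("EXEC", "EXEC_END"):
            layout.append((name, next_addr, n_instrs))
            next_addr += n_instrs
        else:
            layout.append((name, None, None))
    return layout


# The small vertex shader shown above: five CF clauses, two ALU/FETCH instructions.
program = [
    ("EXEC", 1),
    ("ALLOC COORD", None),
    ("EXEC", 1),
    ("ALLOC PARAM/PIXEL", None),
    ("EXEC_END", 0),
]
layout = assign_exec_addresses(program)
for name, addr, cnt in layout:
    if addr is None:
        print(name)
    else:
        print(f"{name} ADDR({addr:#x}) CNT({cnt:#x})")
```

Under that packing assumption the computed values agree with the ADDR(0x3), ADDR(0x4) and ADDR(0x5) fields in the listing above.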

CF instructions

Each 96-bit (3 dword) CF instruction consists of two CF clauses. The instruction format is:

| dword | bit position | description |
|---|---|---|
| dword0 | 0..11 | addr/size 1 |
| dword0 | 12..15 | count 1 |
| dword0 | 16..31 | sequence 1: 2 bits per instruction in the EXEC clause. The low bit seems to select FETCH vs ALU instruction type; the high bit seems to be the (S) modifier on the instruction (which might explain the name SERIALIZE() in the optimize-for-adreno.pdf screenshot, although I don't quite understand its meaning yet) |
| dword1 | 0..7 | UNKNOWN |
| dword1 | 8..15? | CF opcode 1 |
| dword1 | 16..27 | addr/size 2 |
| dword1 | 28..31 | count 2 |
| dword2 | 0..15 | sequence 2: same as sequence 1, but for the 2nd CF clause |
| dword2 | 16..23 | UNKNOWN |
| dword2 | 24..31 | CF opcode 2 |
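The field layout above can be read off with a few shifts and masks. The sketch below is only a transcription of the table into Python; the helper names (`bits`, `decode_cf_pair`) are made up, the bit ranges marked with "?" in the table are taken at face value, and the opcode values in the example are arbitrary placeholders rather than real CF opcodes.

```python
def bits(value, lo, hi):
    """Extract the inclusive bit range lo..hi from a dword."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_cf_pair(dword0, dword1, dword2):
    """Split one 96-bit CF instruction into its two CF clauses."""
    first = {
        "addr_or_size": bits(dword0, 0, 11),
        "count": bits(dword0, 12, 15),
        "sequence": bits(dword0, 16, 31),
        "opcode": bits(dword1, 8, 15),  # range marked uncertain in the table
    }
    second = {
        "addr_or_size": bits(dword1, 16, 27),
        "count": bits(dword1, 28, 31),
        "sequence": bits(dword2, 0, 15),
        "opcode": bits(dword2, 24, 31),
    }
    return first, second


# Hand-assembled example words (not taken from a real shader binary);
# 0x12 and 0x34 are placeholder opcode values.
first, second = decode_cf_pair(0x00011003, 0x10041200, 0x34000002)
print(first)
print(second)
```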

EXEC / EXEC_END

These CF clauses specify a corresponding set of ALU/FETCH instructions to execute. The last group of ALU/FETCH instructions should be an EXEC_END clause.
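The per-instruction sequence bits of an EXEC clause could be unpacked roughly as follows. Two things here are assumptions rather than facts from this page: that the first instruction occupies the lowest two bits, and that a set low bit means FETCH (a clear one meaning ALU). The function name `decode_sequence` is hypothetical.

```python
def decode_sequence(sequence, count):
    """Return a label such as 'FETCH' or '(S)ALU' for each instruction."""
    labels = []
    for i in range(count):
        pair = (sequence >> (2 * i)) & 0x3  # assumption: instruction 0 in the low bits
        kind = "FETCH" if pair & 0x1 else "ALU"  # assumption: 1 = FETCH, 0 = ALU
        prefix = "(S)" if pair & 0x2 else ""     # high bit: (S) modifier
        labels.append(prefix + kind)
    return labels


# Three FETCHes, one (S)ALU, then two plain ALUs: pairs 01 01 01 10 00 00.
labels = decode_sequence(0x095, 6)
print(labels)
```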

ALLOC COORD

Allocate space for coordinate export. The SIZE parameter specifies the number of exports minus one.

ALLOC PARAM/PIXEL

Allocate space for a parameter (i.e. a varying in VS) or pixel (gl_FragColor in PS) export. The SIZE parameter specifies the number of exports minus one.

NOP

A no-op. It seems to be used mainly to pad the 2nd CF clause when it is not needed.

others..

Additional CF opcodes are seen in disassembly of shaders which contain loops, for managing the flow control.

ALU instructions

Each 96-bit ALU instruction can execute one vec4 operation and/or one scalar operation. Some instructions are only available as scalar or as vector instructions.

An example of a combined vec4+scalar instruction in assembler syntax:

ALU:    MULv    R2.xyz_ = R3.zzzw, C10
    RCP R4.x___ = R0

To perform only a scalar operation, the vector operation should mask every channel in the vector dest (i.e. R0.____).

Each ALU instruction can have up to 3 src registers. The 3rd src register is used either for a 3-operand vector instruction like MULADDv or for the paired scalar instruction.

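| dword | bit position | description |
|---|---|---|
| dword0 | 0..5? | vector dest register |
| dword0 | 6?..7 | UNKNOWN |
| dword0 | 8..13? | scalar dest register |
| dword0 | 14 | UNKNOWN |
| dword0 | 15 | export flag |
| dword0 | 16..19 | vector dest write mask (wxyz, 1 bit per channel) |
| dword0 | 20..23 | scalar dest write mask (same as above) |
| dword0 | 24..26 | UNKNOWN |
| dword0 | 27..31 | scalar operation |
| dword1 | 0..7 | src3 swizzle |
| dword1 | 8..15 | src2 swizzle |
| dword1 | 16..23 | src1 swizzle |
| dword1 | 24 | src3 negate |
| dword1 | 25 | src2 negate |
| dword1 | 26 | src1 negate |
| dword1 | 27 | predicate case (1 - execute if true, 0 - execute if false) |
| dword1 | 28 | predicate (conditional execution) |
| dword1 | 29..31 | UNKNOWN |
| dword2 | 0..5? | src3 register |
| dword2 | 6 | UNKNOWN |
| dword2 | 7 | src3 abs (assumed) |
| dword2 | 8..13? | src2 register |
| dword2 | 14 | UNKNOWN |
| dword2 | 15 | src2 abs |
| dword2 | 16..21? | src1 register |
| dword2 | 22 | UNKNOWN |
| dword2 | 23 | src1 abs |
| dword2 | 24..28 | vector operation |
| dword2 | 29 | src3 type/bank (1 - Register bank (R), varyings and locals; 0 - Constant bank (C), uniforms and consts) |
| dword2 | 30 | vector src2 type/bank (same as above) |
| dword2 | 31 | vector src1 type/bank (same as above) |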
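To make the ALU layout easier to follow, here is a hedged Python transcription of the table. The helper names are invented, the ranges marked "?" are used as written, and `format_mask` assumes that bit 0 of a write mask is x and bit 3 is w (the table only says "wxyz, 1 bit per channel"). The example words are assembled by hand from the table for the MULv/RCP pair shown earlier; they are not taken from a real shader binary.

```python
def bits(value, lo, hi):
    """Extract the inclusive bit range lo..hi from a dword."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_alu(dword0, dword1, dword2):
    """Decode the known fields; per-source lists are ordered src1, src2, src3."""
    return {
        "vector_dest": bits(dword0, 0, 5),
        "scalar_dest": bits(dword0, 8, 13),
        "export": bool(bits(dword0, 15, 15)),
        "vector_write_mask": bits(dword0, 16, 19),
        "scalar_write_mask": bits(dword0, 20, 23),
        "scalar_opcode": bits(dword0, 27, 31),
        "src_swizzle": [bits(dword1, 16, 23), bits(dword1, 8, 15), bits(dword1, 0, 7)],
        "src_negate": [bool(bits(dword1, b, b)) for b in (26, 25, 24)],
        "pred_case": bits(dword1, 27, 27),
        "predicated": bool(bits(dword1, 28, 28)),
        "src_reg": [bits(dword2, 16, 21), bits(dword2, 8, 13), bits(dword2, 0, 5)],
        "src_abs": [bool(bits(dword2, b, b)) for b in (23, 15, 7)],
        "vector_opcode": bits(dword2, 24, 28),
        "src_is_register": [bool(bits(dword2, b, b)) for b in (31, 30, 29)],
    }


def format_mask(mask):
    """Render a write mask as e.g. 'xyz_' (assumption: bit 0 = x, bit 3 = w)."""
    return "".join(c if mask & (1 << i) else "_" for i, c in enumerate("xyzw"))


# MULv R2.xyz_ = R3.zzzw, C10  paired with  RCP R4.x___ = R0
dword0 = 2 | (4 << 8) | (0b0111 << 16) | (0b0001 << 20) | (9 << 27)
dword1 = 0x06 << 16  # src1 swizzle zzzw, per the swizzle table
dword2 = 0 | (10 << 8) | (3 << 16) | (1 << 24) | (1 << 29) | (1 << 31)
alu = decode_alu(dword0, dword1, dword2)
print(alu["vector_opcode"], "R%d.%s" % (alu["vector_dest"], format_mask(alu["vector_write_mask"])))
```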

Interpretation of ALU src swizzle fields

| field value | bits 1..0: chan[0] (x) | bits 3..2: chan[1] (y) | bits 5..4: chan[2] (z) | bits 7..6: chan[3] (w) |
|---|---|---|---|---|
| 00 | x | y | z | w |
| 01 | y | z | w | x |
| 10 | z | w | x | y |
| 11 | w | x | y | z |
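Reading the table above, each 2-bit field looks like the source channel index minus the destination channel index, modulo 4 (so an identity swizzle encodes as 0). A small sketch of that interpretation, with made-up function names, and with the caveat that it is only derived from the table rather than from any hardware documentation:

```python
CHANNELS = "xyzw"


def encode_swizzle(swizzle):
    """Encode a 4-letter swizzle such as 'zzzw' into the 8-bit field."""
    field = 0
    for dst, name in enumerate(swizzle):
        src = CHANNELS.index(name)
        field |= ((src - dst) & 0x3) << (2 * dst)
    return field


def decode_swizzle(field):
    """Decode the 8-bit field back into a 4-letter swizzle."""
    return "".join(
        CHANNELS[(((field >> (2 * dst)) & 0x3) + dst) & 0x3] for dst in range(4)
    )


for s in ("xyzw", "zzzw", "wwww", "xxxx"):
    print(s, hex(encode_swizzle(s)))
```

Every row of the table is reproduced by this rule, for example `wwww` gives the per-channel values 11, 10, 01, 00.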

The known vec4 instruction opcodes:

| opcode name | opcode # | description |
|---|---|---|
| ADDv | 0 | Rdstv = Rsrc1 + Rsrc2 |
| MULv | 1 | Rdstv = Rsrc1 * Rsrc2 |
| MAXv | 2 | Rdstv = max(Rsrc1, Rsrc2) (also used as a MOV instruction) |
| MINv | 3 | Rdstv = min(Rsrc1, Rsrc2) |
| FLOORv | 10 | Rdstv = floor(Rsrc1) |
| MULADDv | 11 | Rdstv = Rsrc3 + (Rsrc1 * Rsrc2) |
| DOT4v | 15 | dot product of all 4 channels of Rsrc1, Rsrc2 |
| DOT3v | 16 | dot product of first 3 channels of Rsrc1, Rsrc2 |

The known scalar instruction opcodes:

| opcode name | opcode # | description |
|---|---|---|
| MOV | 2 | Rdsts = Rsrc3 |
| EXP2 | 7 | Rdsts = exp2(Rsrc3) |
| LOG2 | 8 | Rdsts = log2(Rsrc3) |
| RCP | 9 | Rdsts = 1 / Rsrc3 |
| RSQ | 11 | Rdsts = 1 / sqrt(Rsrc3) |
| PSETE | 13 | predicate = Rsrc3 == 0 (called PRED_SETE in r600isa.pdf) |
| SQRT | 20 | Rdsts = sqrt(Rsrc3) |
| MUL | 21 | Rdsts = Rsrc3 * ?? |
| ADD | 22 | Rdsts = Rsrc3 + ?? |
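For quick experiments, the two opcode tables can be written down as reference semantics in Python. This is only a sketch under assumptions: the dot products are assumed to broadcast their result to all four channels, EXP2 is treated as 2**x going by its name, and the entries whose second operand or result is unclear (PSETE, MUL, ADD) are left out. The dictionary names are made up.

```python
import math


def _dot(a, b, n):
    total = sum(x * y for x, y in zip(a[:n], b[:n]))
    return [total] * 4  # assumption: result is broadcast to every channel


# opcode number -> (name, function of src1, src2, src3 given as 4-element lists)
VECTOR_OPS = {
    0: ("ADDv", lambda s1, s2, s3: [a + b for a, b in zip(s1, s2)]),
    1: ("MULv", lambda s1, s2, s3: [a * b for a, b in zip(s1, s2)]),
    2: ("MAXv", lambda s1, s2, s3: [max(a, b) for a, b in zip(s1, s2)]),
    3: ("MINv", lambda s1, s2, s3: [min(a, b) for a, b in zip(s1, s2)]),
    10: ("FLOORv", lambda s1, s2, s3: [float(math.floor(a)) for a in s1]),
    11: ("MULADDv", lambda s1, s2, s3: [c + a * b for a, b, c in zip(s1, s2, s3)]),
    15: ("DOT4v", lambda s1, s2, s3: _dot(s1, s2, 4)),
    16: ("DOT3v", lambda s1, s2, s3: _dot(s1, s2, 3)),
}

# opcode number -> (name, function of the scalar taken from src3)
SCALAR_OPS = {
    2: ("MOV", lambda x: x),
    7: ("EXP2", lambda x: 2.0 ** x),  # assumption: base-2 exponent
    8: ("LOG2", math.log2),
    9: ("RCP", lambda x: 1.0 / x),
    11: ("RSQ", lambda x: 1.0 / math.sqrt(x)),
    20: ("SQRT", math.sqrt),
}

r = [1, 2, 3, 4]
print(VECTOR_OPS[2][1](r, r, None))  # MAXv with identical sources acts as a move
print(SCALAR_OPS[9][1](4.0))
```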

FETCH instructions

The FETCH instruction is also 96 bits, and can fetch one vec4 vertex value or one vec4 texture sample value.

...

Example

Here is a more complete example.

GLSL format:

uniform mat4 modelviewMatrix;
uniform mat4 modelviewprojectionMatrix;
uniform mat3 normalMatrix;

attribute vec4 in_position;
attribute vec3 in_normal;
attribute vec4 in_color;

vec4 lightSource = vec4(2.0, 2.0, 20.0, 0.0);

varying vec4 vVaryingColor;

void main()
{
    gl_Position = modelviewprojectionMatrix * in_position;
    vec3 vEyeNormal = normalMatrix * in_normal;
    vec4 vPosition4 = modelviewMatrix * in_position;
    vec3 vPosition3 = vPosition4.xyz / vPosition4.w;
    vec3 vLightDir = normalize(lightSource.xyz - vPosition3);
    float diff = max(0.0, dot(vEyeNormal, vLightDir));
    vVaryingColor = vec4(diff * in_color.rgb, 1.0);
}

and the corresponding commented shader assembly:

 ;;;; const/register assignment:
 ; R0: vVaryingColor
 ; R1, CONST(1): in_color
 ; R3, CONST(2): in_normal
 ; R2, CONST(3): in_position
 ; C0+: modelviewMatrix
 ; C4+: modelviewprojectionMatrix
 ; C8+: normalMatrix
 ; C11: 2.000000, 2.000000, 20.000000, 0.000000
 ; C12: 1.000000, 0.000000, 0.000000, 0.000000
EXEC
      FETCH:  VERTEX  R1.xyz_ = R0.z FMT_32_32_32_FLOAT SIGNED STRIDE(12) CONST(4)
      FETCH:  VERTEX  R2.xyz1 = R0.x FMT_32_32_32_FLOAT SIGNED STRIDE(12) CONST(4)
      FETCH:  VERTEX  R3.xyz_ = R0.y FMT_32_32_32_FLOAT SIGNED STRIDE(12) CONST(4)
   (S)ALU:    MULv    R0 = R2.wwww, C7           ; -> modelviewprojectionMatrix * in_position
      ALU:    MULADDv R0 = R0, R2.zzzz, C6       ; -> modelviewprojectionMatrix * in_position
      ALU:    MULADDv R0 = R0, R2.yyyy, C5       ; -> modelviewprojectionMatrix * in_position
ALLOC COORD SIZE(0x0)
EXEC
      ALU:    MULADDv export62 = R0, R2.xxxx, C4 ; gl_Position = modelviewprojectionMatrix * in_position
      ALU:    MULv    R0 = R2.wwww, C3           ; -> modelviewMatrix * in_position
      ALU:    MULADDv R0 = R0, R2.zzzz, C2       ; -> modelviewMatrix * in_position
      ALU:    MULADDv R0 = R0, R2.yyyy, C1       ; -> modelviewMatrix * in_position
      ALU:    MULADDv R0 = R0, R2.xxxx, C0       ; vec4 vPosition4 = modelviewMatrix * in_position
      ALU:    MULv    R2.xyz_ = R3.zzzw, C10     ; -> normalMatrix * in_normal
              RCP     R4.x___ = R0               ; -> 1 / vPosition4.w
EXEC
      ALU:    MULADDv R0.xyz_ = C11.xxzw, -R0, R4.xxxw ; -> lightSource - (vPosition4.xyz / vPosition4.w)
      ALU:    DOT3v   R4.x___ = R0, R0                 ; -> normalize(...)
      ALU:    MULADDv R2.xyz_ = R2, R3.yyyw, C9        ; -> normalMatrix * in_normal
      ALU:    MULADDv R2.xyz_ = R2, R3.xxxw, C8        ; vec3 vEyeNormal = normalMatrix * in_normal
ALLOC PARAM/PIXEL SIZE(0x0)
EXEC_END
      ALU:    MAXv    R0.____ = R0, R0
              RSQ     R0.___w = R4.xyzx          ; -> normalize(...)  (1 / sqrt(dot(..)))
      ALU:    MULv    R0.xyz_ = R0, R0.wwww      ; -> vec3 vLightDir = normalize(lightSource.xyz - vPosition3)
      ALU:    DOT3v   R0.x___ = R2, R0           ; -> dot(vEyeNormal, vLightDir)
      ALU:    MAXv    R0.x___ = R0, C11.wyzw     ; float diff = max(0.0, dot(vEyeNormal, vLightDir))
      ALU:    MULv    export0.xyz_ = R1, R0.xxxw ; vVaryingColor.xyz_= diff * in_color.rgb
              MOV     export0.___w = C12.xyzx    ; vVaryingColor.___w  = 1.0
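The mat4 * vec4 products in the listing are built from one MULv followed by three MULADDv instructions. The Python below mimics that sequence to show why it yields a matrix-vector product. It rests on two readings of the listing that are assumptions: each constant register (C4..C7 for modelviewprojectionMatrix) holds one matrix column, and the first MULADDv operand in the assembler syntax is the addend. All function names are illustrative only.

```python
def swizzle(reg, pattern):
    """Apply a swizzle such as 'wwww' to a 4-element register."""
    return [reg["xyzw".index(ch)] for ch in pattern]


def mulv(a, b):
    return [x * y for x, y in zip(a, b)]


def muladdv(addend, a, b):
    # Assumption: operands are listed as addend, factor, factor.
    return [c + x * y for c, x, y in zip(addend, a, b)]


def transform(columns, position):
    """columns[i] plays the role of C4+i; position plays the role of R2."""
    acc = mulv(swizzle(position, "wwww"), columns[3])            # MULv    R0 = R2.wwww, C7
    acc = muladdv(acc, swizzle(position, "zzzz"), columns[2])    # MULADDv R0 = R0, R2.zzzz, C6
    acc = muladdv(acc, swizzle(position, "yyyy"), columns[1])    # MULADDv R0 = R0, R2.yyyy, C5
    return muladdv(acc, swizzle(position, "xxxx"), columns[0])   # MULADDv export62 = R0, R2.xxxx, C4


# A scale-and-translate matrix, stored one column per "constant register".
columns = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [10, 20, 30, 1]]
position = [1, 1, 1, 1]
result = transform(columns, position)

# Reference: plain column-major matrix-vector product.
expected = [sum(columns[c][r] * position[c] for c in range(4)) for r in range(4)]
print(result, expected)
```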