Dimitar Dimitrov edited this page Feb 25, 2018 · 1 revision

The PRU Multiplier

The AM335x TRM has a detailed section explaining the PRU multiplier. It turned out, though, not to be detailed enough for me to write a working assembly program. Below I document how I got the PRU multiplier working.

Scroll to the bottom if you're in a hurry for a working example.

The Test Setup

First I needed a testbench to evaluate my assembly code. I took the HC-SR04 example and, instead of the ultrasonic range, returned the result of a PRU multiplication to the host.

Here is the full listing of the new assembly function that replaced the existing one from the C example:

    .text
    .section .text
    .global measure_distance_mm
measure_distance_mm:
    # Fill r14..r28 with 0xFF bytes, so that stale non-zero register
    # contents show up in the test result.
    fill r14, 4 * (29-14)

    # Load the MUL operands
    ldi r28, 1001
    ldi r29, 2002
    # Do the multiplication, per the TRM.
    xin 0, r26, 4

    # Move the MUL result to the function return value register.
    mov r14, r26

    ret

The Naive Implementation

Thinking that I fully understood the PRU multiplier, I wrote the following snippet. When I tested it, though, the result was quite different from what I expected.

    ldi r28, 1001
    ldi r29, 2002
    xin 0, r26, 4
    # Wrong result: -1001

The Missing Cycle

Re-reading the TRM, a few facts stood out:

  • MUL is single-cycle.
  • R28/R29 operands are sampled each cycle.
  • An XIN instruction is required to transfer the result back to R26.

Drawing the pipeline made my mistake obvious:

          .-- one cycle to execute multiplication
          |
          V
      |<----->|
| LDI |  MAC  | XIN |
      ^       ^
      |       |
      |        `-- result ready for transfer to CPU R26/R27 registers
       `-- sample R28/R29 operands

Adding a nop to account for the MUL/MAC cycle led to a correct result:

    ldi r28, 1001
    ldi r29, 2002
    nop
    xin 0, r26, 4
    # Result: 2004002
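
The timing described above can be mimicked with a small host-side Python model. This is only a sketch of the behaviour reported in this section, not of the real hardware: the `run` helper and its `in_flight`/`ready` bookkeeping are made up for illustration, and the stale `0xFFFFFFFF` in R29 is an assumption, chosen because it reproduces the -1001 seen in the naive attempt.

```python
MASK32 = 0xFFFFFFFF
STALE = 0xFFFFFFFF  # assumed leftover register content, not taken from the TRM


def to_signed32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def run(program, r28, r29):
    """Return the signed 32-bit value that each XIN would read into R26."""
    regs = {"r28": r28, "r29": r29}
    in_flight = regs["r28"] * regs["r29"]  # product being computed this cycle
    ready = in_flight                      # product that XIN can read this cycle
    reads = []
    for op, *args in program:
        if op == "xin":
            reads.append(to_signed32(ready))
        elif op == "ldi":
            reg, imm = args
            regs[reg] = imm & MASK32
        # End of cycle: the in-flight product becomes readable and the
        # multiplier samples R28/R29 again.
        ready = in_flight
        in_flight = regs["r28"] * regs["r29"]
    return reads


naive = [("ldi", "r28", 1001), ("ldi", "r29", 2002), ("xin",)]
fixed = [("ldi", "r28", 1001), ("ldi", "r29", 2002), ("nop",), ("xin",)]

print(run(naive, STALE, STALE))  # [-1001]
print(run(fixed, STALE, STALE))  # [2004002]
```

In this model the naive sequence reads a product that was sampled before R29 had been loaded, and the extra `nop` gives the multiplier the cycle it needs.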

Let's Accumulate

The PRU multiplier also has a MAC mode, in which results are accumulated. This time there were no hiccups when following the TRM-suggested sequence:

    ldi r25, 1
    xout 0, r25, 1
    ldi r25, 3
    xout 0, r25, 1

    ldi r25, 1
    ldi32 r28, 99787
    ldi r29, 3319
    xout 0, r25, 1
    ldi32 r28, 64663
    ldi r29, 9521
    xout 0, r25, 1

    xin 0, r26, 4
    # Success! Got the expected 946849476

The two consecutive writes to the MAC mode register seemed odd. Indeed, removing the following two lines still yielded a correct result:

    ldi r25, 1
    xout 0, r25, 1
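
A minimal sketch of why the first write may be redundant, assuming that bit 0 of the mode byte selects MAC mode, that bit 1 clears the accumulator, and that in MAC mode the operands are consumed on each XOUT of R25. These bit meanings are an interpretation and should be checked against the TRM. `MacModel` and `mac_sequence` are hypothetical helpers for illustration, not a TI API.

```python
MASK64 = (1 << 64) - 1


class MacModel:
    """Toy model of the MAC accumulator (illustrative only)."""

    def __init__(self):
        self.acc = 0xDEADBEEF  # pretend an earlier run left junk behind

    def xout_r25(self, mode, r28=0, r29=0):
        if mode & 0x2:    # assumed: bit 1 clears the accumulator
            self.acc = 0
        elif mode & 0x1:  # assumed: bit 0 alone means multiply-accumulate
            self.acc = (self.acc + r28 * r29) & MASK64

    def xin(self, nbytes):
        """Return the low nbytes of the accumulator, as XIN would."""
        return self.acc & ((1 << (8 * nbytes)) - 1)


def mac_sequence(with_first_write):
    mac = MacModel()
    if with_first_write:
        mac.xout_r25(1)           # the write that turned out to be optional
    mac.xout_r25(3)               # MAC mode bit plus clear bit in one write
    mac.xout_r25(1, 99787, 3319)
    mac.xout_r25(1, 64663, 9521)
    return mac.xin(4)


print(mac_sequence(True))   # 946849476
print(mac_sequence(False))  # 946849476
```

In this model the write of 3 already carries the MAC mode bit and wipes whatever the earlier write accumulated, so both sequences end with the same sum.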

Let's Restart And Accumulate Again

Finally, let's check that the MAC accumulator "reset" works, that is, that the accumulator can be set back to zero in order to start a new sequence of multiply-accumulate operations.

For a change, let's also test the full 64-bit result. This requires a trivial change to pass a 64-bit value to the host, which I'll not show here.

    # First MAC cycle that we'll ignore
    ldi r25, 3
    xout 0, r25, 1

    ldi r25, 1
    ldi32 r28, 99787
    ldi r29, 3319
    xout 0, r25, 1
    ldi32 r28, 64663
    ldi r29, 9521
    xout 0, r25, 1

    xin 0, r26, 8

    # Second MAC cycle "for real"
    ldi r25, 3
    xout 0, r25, 1

    ldi r25, 1
    ldi32 r28, 100931
    ldi32 r29, 1000033
    xout 0, r25, 1
    ldi32 r28, 104701
    ldi32 r29, 1000003
    xout 0, r25, 1

    xin 0, r26, 8
    # Success! Got the expected 0x2FE0D6ED9A (205635644826)
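
The expected value can be double-checked on the host with plain Python integers. The sketch below also splits the sum into the two 32-bit words that an 8-byte XIN delivers, assuming R26 holds the low word and R27 the high word. The variable names are illustrative only.

```python
# Sums of products from the two MAC passes above.
first_pass = 99787 * 3319 + 64663 * 9521
second_pass = 100931 * 1000033 + 104701 * 1000003

r26 = second_pass & 0xFFFFFFFF  # assumed: low word lands in R26
r27 = second_pass >> 32         # assumed: high word lands in R27

print(second_pass, hex(second_pass))  # 205635644826 0x2fe0d6ed9a
print(hex(r27), hex(r26))             # 0x2f 0xe0d6ed9a

# What the host would have seen if the accumulator had NOT been cleared
# between the two passes.
print(first_pass + second_pass)       # 206582494302
```

Since the value read back matches `second_pass` alone rather than the combined sum, the clear between the two passes did take effect.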

A Reset Is Not!

While playing with the above examples, I noticed something peculiar. My perfectly working MUL example gave wrong results when I ran it right after testing the MAC example, even though I was reloading the PRU remoteproc firmware between test sessions. For kernel 4.4.52-ti-r91 running on my BBG, the commands are:

    echo "4a338000.pru1" > /sys/bus/platform/drivers/pru-rproc/unbind
    echo "4a338000.pru1" > /sys/bus/platform/drivers/pru-rproc/bind

Evidently, the MAC mode register is not cleared on remoteproc reset. I therefore recommend explicitly initializing the MUL/MAC mode register at the beginning of your assembly program:

    ldi r25, 0
    xout 0, r25, 1

TLDR

In case you're in a hurry, here is how to multiply two 32-bit integers and get a 32-bit result:

    ldi r25, 0
    xout 0, r25, 1  # Reset the MAC mode register (one-time-initialization).

    ldi r28, 1001   # mov or ldi to load operand into R28.
    ldi r29, 4567   # mov or ldi to load second operand into R29.
    nop             # Delay one cycle before acquiring the result.
    xin 0, r26, 4   # Load the low 32 bits of the product (4571567) into R26.