Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs/DSP: Trivial adjustment to BLOOP{,I} sub-operation order #11106

Merged
merged 3 commits into from Oct 3, 2022

Conversation

vpelletier
Copy link
Contributor

Noticed while tracing in a BLOOP using DSPSpy with $st{0..3} unmasked. BLOOPI assumed to follow the pattern.

@AdmiralCurtiss
Copy link
Contributor

@Pokechu22 You probably know what to do with this?

@BhaaLseN
Copy link
Member

I checked the old docs, and it seems like this code snippet has always been like that (which doesn't necessarily have to mean it was correct before).

The change makes sense, but obviously a DSP Test to confirm this would be nice.

@Pokechu22
Copy link
Contributor

I assume by "using DSPSpy with $st{0..3} unmasked" you mean changing this code to eliminate the if (j != 1 || i < 4) case (which hides $st0 through $st3). I'm not sure why those are disabled by default - I guess they're a little noisy (since it would have the calling function's return address and that'll generally change) but it's still useful information. If you can visually see the correct behavior that way, that's probably good enough of a test (it'd be nicer to have more precise tests for the stacks in general, but I'm not going to require that).

The actual change makes sense. There are two differences in behavior: $st3 won't be changed from 0 to -1 (but since it's popped at the end, this is not observable) and more importantly $st3 is observed not observed as decremented until the loop jumps back (so BLOOPI #3, end would observe values of 3, 2, 1 instead of 2, 1, 0). Of course, reading the stack register pops from it, so you'd need to push the value back onto the stack after reading it (which is what DSPSpy's send_back already happens to do).

This is consistent with how we implement it in the interpreter and (I think) in the JIT.

I do have one change I'd like you to make: please update the version number and version history table in the manual.

@vpelletier
Copy link
Contributor Author

a DSP Test to confirm this would be nice.

While I am familiar-enough with DSPSpy to run code, I am not quite sure how such test should behave: should the DSP code detect a failure or should it just dump registers and let something else (visibly not DSPSpy itself, maybe something else checking a file with all dumps) check if the content is as expected ? Checking tests/loop_counter_test.ds it detects unexpected values and then conditionally dumps the registers.

Could you describe a bit how tests should be written (maybe DSPSpy is not even the correct place to look at) ?

In any case, I can implement such test. I'm thinking about a 2-iteration loop dumping registers right after BLOOP{,I}, and some code checking the value of $st3 on each dump.

I assume by "using DSPSpy with $st{0..3} unmasked" you mean changing this code to eliminate the if (j != 1 || i < 4) case (which hides $st0 through $st3).

Correct. I have a patch dropping this condition that I hesitated to submit. For reference, the condition comes from b0bb4e6 which indeed hides them because of the noise they cause. Personally (as a DSP neophyte, running DSPSpy interactively) I found them useful for telling quickly which dump call produced a given result (with the return stack tip) and in which loop I am (with the other registers).

more importantly $st3 is observed not observed as decremented until the loop jumps back (so BLOOPI #3, end would observe values of 3, 2, 1 instead of 2, 1, 0)

This is exactly what I observed.

I do have one change I'd like you to make: please update the version number and version history table in the manual.

Done.

Tangential topics:

  • thank you so much, @Pokechu22, for your ghidra sleigh, this helped me a lot getting into the DSP ASM syntax
  • having worked on the memory card unlocking function (with great results) I have found some new (?) behaviour of the DSP accelerator: FORMAT 0 writes to ACDATA1 with the current address MSb set seem to write the whole 16 bits, but reading from it with the MSb clear only reads a nibble at a time. I have done a lazy attempt at exericising this, but I only ended up triggering exception handler 3, which is undocumented AFAICS. Given that this work has likely no immediate practical value for emulation (...I guess this is only used to unlock memory cards, otherwise it would have been already investigated), would there be any interest in patches implementing this behaviour ? If so, I'll have to investigate more how the hardware behaves exactly, and what exception 3 means.

@Pokechu22
Copy link
Contributor

One other thing to note is that Dolphin doesn't handle the stacks quite right; it allows more to be pushed onto them than actually should be possible. This is compounded by the fact that DSPSpy loads initial values into the stack registers, i.e. they already have one thing pushed into them at the start (you can edit the initial register values, though I don't know if you can do that with the stack registers when they're hidden). But on real hardware with DSPSpy, you can only nest loops 3 deep (presumably 4 deep if DSPSpy didn't push initially), while on Dolphin it lets you nest further. I haven't investigated this further because nothing seems to rely on the stack overflow interrupts, though.

While I am familiar-enough with DSPSpy to run code, I am not quite sure how such test should behave: should the DSP code detect a failure or should it just dump registers and let something else (visibly not DSPSpy itself, maybe something else checking a file with all dumps) check if the content is as expected ? Checking tests/loop_counter_test.ds it detects unexpected values and then conditionally dumps the registers.

Could you describe a bit how tests should be written (maybe DSPSpy is not even the correct place to look at) ?

There isn't any consistency here :)

When I wrote tests, I tried to sometimes detect failures if it was easy, but if that seemed like it would be more difficult, I instead opted to always send back the results so that the correct results could be saved on real hardware using DSPSpy's save functionality, and then that could be (manually) compared with results saved from dolphin. DSPTool has a command that converts these to a human-readable file. There isn't any automation to the comparison process at this time though. (One other aspect is that DSPSpy's save functionality is kinda poorly implemented, so it's not well-suited to large gaps in runtime; see #10715 for the workaround.)

Correct. I have a patch dropping this condition that I hesitated to submit. For reference, the condition comes from b0bb4e6 which indeed hides them because of the noise they cause. Personally (as a DSP neophyte, running DSPSpy interactively) I found them useful for telling quickly which dump call produced a given result (with the return stack tip) and in which loop I am (with the other registers).

Probably they would be more useful if we included symbols of some sort with the assembled DSP binaries. There's some vague code in DSPTool to support this, but I don't think it ever was fully implemented (and I don't think DSPSpy uses it in any way). If you find them useful, it's probably fine to revert it. (I mainly use DSPSpy interactively too, but from what I've seen the result dumps still include the stack registers, so it's not like hiding them from the UI is helpful for automation.)

having worked on the memory card unlocking function (with great results) I have found some new (?) behaviour of the DSP accelerator: FORMAT 0 writes to ACDATA1 with the current address MSb set seem to write the whole 16 bits, but reading from it with the MSb clear only reads a nibble at a time. I have done a lazy attempt at exericising this, but I only ended up triggering exception handler 3, which is undocumented AFAICS. Given that this work has likely no immediate practical value for emulation (...I guess this is only used to unlock memory cards, otherwise it would have been already investigated), would there be any interest in patches implementing this behaviour ? If so, I'll have to investigate more how the hardware behaves exactly, and what exception 3 means.

That link seems to have gotten broken (I think it's supposed to be jamchamb/gc-memcard-adapter#5?). In any case, @xperia64 previously did some investigation into those formats in #10766 while reverse-engineering devolution. That PR never got fully finished, though.

Note that Dolphin currently cheats at implementing memory card unlocking; it pretends that cards are always unlocked, so the card uCode never actually runs. I have an incomplete branch where I partially implemented it, but I never finished it either.

@vpelletier
Copy link
Contributor Author

DSPSpy loads initial values into the stack registers, i.e. they already have one thing pushed into them at the start

Are they actually pushed ? My (unverified) mental model for the stack registers is that the instructions control the push/pop (so the address of the stack top) and the registers only show & modify that cell. I'll try to check this later (got to go to work):

  • write 4 times to the return address register, to see if it overflows
  • write to the loop counter to see if it affects the loop

That link seems to have gotten broken (I think it's supposed to be jamchamb/gc-memcard-adapter#5?).

Woops, forgot to check the link before posting. You are absolutely correct.

With that patch I can unlock official memory cards on a raspberry pi. My intent being to avoid having to purchase a memory card with an exploit and swiss on it, but I ended up purchasing a Wii to step through the DSP code and ended up creating the memory card this way... If maybe not the sanest path from a financial point of view, it was at least a lot of fun. And now I have a Wii.

Dolphin currently [...] pretends that cards are always unlocked, so the card uCode never actually runs.

I spotted this trick, yes. Which makes perfect sense, it seems very unlikely anything would depend on the actual unlocking process.

@Pokechu22
Copy link
Contributor

Are they actually pushed ? My (unverified) mental model for the stack registers is that the instructions control the push/pop (so the address of the stack top) and the registers only show & modify that cell.

The way Dolphin implements it is that reads pop and writes push, which of course doesn't mean that's how it actually works, but it matches what I remember seeing on real hardware a while ago. The manual doesn't mention which way it works, though.

@vpelletier
Copy link
Contributor Author

vpelletier commented Sep 29, 2022

but it matches what I remember seeing on real hardware a while ago

You are correct and my mental model is wrong:

  • writing enough times to $st0 triggers exception 1
  • reading enough times from $st0 triggers exception 1

So loads and stores have side effects on at least $st0.

Which brings me back to:

This is exactly what I observed.

...and what I observed may have been a mirage.

If I run:

        LRI $ax0.l, #5
        LRI $ac1.l, #0
        BLOOP $ax0.l, last_of_bloop
        INC $acc1
last_of_bloop:
        NOP
        CALL send_back

$ac1.l is, without surprise, 5.

But if I run:

        LRI $ax0.l, #5
        LRI $ac1.l, #0
        BLOOP $ax0.l, last_of_bloopc
        CALL loop_func
last_of_bloopc:
        NOP

loop_func:
        INC $acc1
        CALL send_back
        RET

then $ac1.l is 3. Which seems to be floor((loop_count + 1) / 2). And same with BLOOPI.

Modifying dsp_base_noirq.inc to never touch any $st? register does not change this.

So, somehow, calling within a BLOOP throws the counter off, which means what led me to this merge request was actually not a reliable evidence.

My plan for tomorrow:

  • check $st3 read/write side effects
  • inline send_back (and the few calls in it) and re-enable $st3 accesses in a way which undoes any side-effect

@Pokechu22
Copy link
Contributor

Pokechu22 commented Sep 29, 2022

A somewhat similar DSP (μPD77210, used by the Wii Speak, see this) requires that branch instructions not happen within 3 instructions of a loop's end (see this, page 75 and this, page 111 (these refer to the text page numbers; they're PDF pages 77 and 113 respectively)). I think that if you add an extra NOP before last_of_bloopc then calls will work correctly (though I'm not 100% sure of this). It's definitely possible to use calls with some restrictions, at least.

EDIT: The Zelda uCode (24B22038 specifically, used by the NTSC GameCube IPL) has a function at 0470 (CMD02?) which uses a BLOOP that has a lot of CALLs, including one at the almost-last instruction (followed by 2 NOPs). (There are also cases where it pads with several NOPs even though it's not using any calls (00ae and 00d0, which interact with the accelerator - this might be a timing thing of some sort, or just jank from an early revision).)

The screen real-estate is already reserved, the values are dumped and
restored by the on-DSP code, why not make something out of these values ?
Allows following:
- where exactly send_back was called from ($st1)
- the boundaries and progress of the innermost BLOOP{,I} ($st0, 2 and 3)
  up to send_back's call
Noticed while tracing in a BLOOP using DSPSpy with $st{0..3} unmasked.
BLOOPI assumed to follow the pattern.
@vpelletier
Copy link
Contributor Author

I think that if you add an extra NOP before last_of_bloopc then calls will work correctly (though I'm not 100% sure of this).

I added 2 NOPs to the existing one (taking the limit of 3 as inclusive) and I confirm that this fixed the loop count.

Then I checked $st3 for side effects, as I was beginning to suspect that maybe that register would not pop but maybe decrement on read, but I found nothing of the sort.

Then I re-enabled all $st? accesses in dsp_base_noirq.inc and ran:

test_main:
        BLOOPI #2, last_of_bloopi
        NOP
        CALL send_back
        NOP
        NOP
last_of_bloopi:
        NOP

        CALL send_back
        JMP end_of_test

which confirmed my assumed sub-operation order: $st3 successively has the values 2, 1, and 0x0b88 once the loop exited (so after the pop). So looks like it was no mirage.

I added a test for this, which in turn means that I included my change un-hiding $st? registers in DSPSpy UI.

@vpelletier
Copy link
Contributor Author

@xperia64 previously did some investigation into those formats in #10766

Ooh, so that is the meaning of setting the accelerator address's MSb. I wondered, but never got around to testing and instead just focused on replicating the code (in python) and then comparing the output (and intermediate steps) with the firmware's.

I am also guessing that I triggered exception 3 because I did not pay enough attention to the order in which I was configuring the accelerator before writing to/reading from D3. I ended up re-assembling the disassembled relevant IROM functions into IRAM, so that I could insert call send_back here and there (...which is how I ended up tracing inside a BLOOP and noticing the unexpected $st3).

I have an incomplete branch where I partially implemented it, but I never finished it either.

Oh, I also started assuming the accelerator reads were fetching bytes, but the output would only take so few possible values that I thought some bruteforce would be enough to unlock memory cards (as I found on the PS1/PS2 USB memory card reader for the PS3, almost).

About the length, I assumed that the length of the input value would always be 8 bytes (which I got reading libogc2), so I did not try with anything else. This simplifies the implementation, but this is only possible because I am in control of the caller, unlike in dolphin where games could do whatever.

About the memory cards being detected as corrupted, could this be because of the first 12 bytes read from the card during the unlocking procedure ? These are XOR'ed with a keystream computed on the PPC (...at least in the case of libogc2), and they are later used to produce the card header written on the card. Maybe on read the consistency is checked, as a way to prevent a card image from being transferred verbatim to another card ? That keystream depends on the first read operation done on the card during the unlocking process (address and length), which means the card must produce the same keystream on its side and produce the correct ciphertext.

@Pokechu22
Copy link
Contributor

I added 2 NOPs to the existing one (taking the limit of 3 as inclusive) and I confirm that this fixed the loop count.

Just to clarify, the limit of 3 was for an unrelated DSP (albeit one that has a lot of similar features, including a loop stack and extended opcodes of a sort), not the one on the GameCube/Wii.

I think the limit is 2 in this case, as in this would also be legal:


	LRI $ac0.l, #2
	BLOOP $ac0.l, last_of_bloop
	CALL send_back
	NOP
last_of_bloop:
	NOP
	CALL send_back

but I haven't actually tested that (this is just roughly what I saw on the Zelda uCode).

$st3 successively has the values 2, 1, and 0x0b88 once the loop exited (so after the pop).

Just as a note, the 0x0b88 value is something DSPSpy explicitly sets (or pushes):

u16 dspreg_in[32] = {
0x0410, 0x0510, 0x0610, 0x0710, 0x0810, 0x0910, 0x0a10, 0x0b10, 0xFFFF, 0xFFFF, 0xFFFF,
0xFFFF, 0x0855, 0x0966, 0x0a77, 0x0b88, 0x0014, 0xfff5, 0x00ff, 0x2200, 0x0000, 0x0000,
0x0000, 0x0000, 0x0003, 0x0004, 0x8000, 0x000C, 0x0007, 0x0008, 0x0009, 0x000a,
}; /// ax_h_1 ax_h_1

But good to see that this is consistent with my understanding of it.

I assumed that the length of the input value would always be 8 bytes

Yeah, I think this is true in practice for games unlocking memory cards (at least I haven't seen a counterexample), and I think that's the only place where that function is used in practice.

About the memory cards being detected as corrupted, could this be because of the first 12 bytes read from the card during the unlocking procedure ? These are XOR'ed with a keystream computed on the PPC (...at least in the case of libogc2), and they are later used to produce the card header written on the card. Maybe on read the consistency is checked, as a way to prevent a card image from being transferred verbatim to another card ? That keystream depends on the first read operation done on the card during the unlocking process (address and length), which means the card must produce the same keystream on its side and produce the correct ciphertext.

My WIP branch only implemented the DSP part of it, and doesn't do anything with it on the memory card itself; I haven't looked into what actually happens on the memory card or how it produces the same key, or where the PPC side of it gets its values. That part does sound plausible though.

@vpelletier
Copy link
Contributor Author

vpelletier commented Oct 1, 2022

I haven't looked into what actually happens on the memory card or how it produces the same key, or where the PPC side of it gets its values.

While it is still fresh in my mind, and in case you would be interested here are some pointers in libogc2:

There are two read operations from the card which influence the keystream, with 3 inputs:

The third parameter only influences the keystream after the card sends the 12 bytes which I suspect influence the console into declaring the card as corrupt: it is the data returned by the second read, before it knows how many bytes will be read beyond the first fixed 20 bytes (containing the 12 bytes I'm talking about, followed by the 8 bytes which get fed to the DSP).

Those 12 bytes, once decrypted, become sramex->flash_id for the current card, and this is then involved in producing the card header when formatting the card. The only other parameter involved in computing the value is the time, which is stored verbatim in the header.

From there, it is easy to decode the stored serial using the stored time and check that it matches the decrypted first 12 bytes... If it actually does this.

Copy link
Contributor

@Pokechu22 Pokechu22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. (I haven't tried the test out on my own console, but I expect it works correctly.)

@Pokechu22 Pokechu22 merged commit cb6d476 into dolphin-emu:master Oct 3, 2022
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants