Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

16-bit optimization error uses uninitialized carry? #895

Closed
bbbradsmith opened this issue May 14, 2019 · 10 comments
Closed

16-bit optimization error uses uninitialized carry? #895

bbbradsmith opened this issue May 14, 2019 · 10 comments
Labels

Comments

@bbbradsmith
Copy link
Contributor

@bbbradsmith bbbradsmith commented May 14, 2019

The following produces invalid code when -O is used:

unsigned int a;
unsigned int b;

void foo(void)
{
	a = 259;
	if (a < 7)
	{
		b = 473;
	}
}

Resulting assembly:

.proc	_foo: near

.segment	"CODE"

	ldx     #$01
	lda     #$03
	sta     _a
	stx     _a+1
	bcs     L0004
	lda     #$D9
	sta     _b
	stx     _b+1
L0004:	rts

Note that the comparison has entirely disappeared. With an unsigned char for a, the optimizer seems to be able to determine if the condition is always true or false, and can even remove the whole assignment to b. But, with unsigned int, it bungles the optimization, and simultaneously leaves the assignment in there, and branches over it (or fails to branch) with an uninitialized carry? Very odd.

Source: https://forums.nesdev.com/viewtopic.php?f=2&t=18843&p=238785

@greg-king5 greg-king5 added the bug label May 14, 2019
@bbbradsmith
Copy link
Contributor Author

@bbbradsmith bbbradsmith commented May 14, 2019

Now that I discovered --debug-opt-output I can see where it goes wrong:

=========================================================================
Initial code for function 'foo':
.segment	"CODE"

	ldx     #$01
	lda     #$03
	sta     _a
	stx     _a+1
	lda     _a
	ldx     _a+1
	cpx     #$00
	bne     L0006
	cmp     #$07
L0006:	jsr     boolult
	jeq     L0004
	ldx     #$01
	lda     #$D9
	sta     _b
	stx     _b+1
L0004:	rts
=========================================================================
Code after applying 'OptStore4':
.segment	"CODE"

	ldx     #$01
	lda     #$03
	sta     _a
	stx     _a+1
	cpx     #$00
	bne     L0006
	cmp     #$07
L0006:	jsr     boolult
	jeq     L0004
	ldx     #$01
	lda     #$D9
	sta     _b
	stx     _b+1
L0004:	rts
=========================================================================
Code after applying 'OptBoolTrans':
.segment	"CODE"

	ldx     #$01
	lda     #$03
	sta     _a
	stx     _a+1
	cpx     #$00
	bne     L0006
	cmp     #$07
L0006:	jcs     L0004
	ldx     #$01
	lda     #$D9
	sta     _b
	stx     _b+1
L0004:	rts
=========================================================================
Code after applying 'OptCmp8':
.segment	"CODE"

	ldx     #$01
	lda     #$03
	sta     _a
	stx     _a+1
	jmp     L0006
	cmp     #$07
L0006:	jcs     L0004
	ldx     #$01
	lda     #$D9
	sta     _b
	stx     _b+1
L0004:	rts
=========================================================================
...

So, the initial generated line at L0006 is processing the carry of two different comparison results.

OptCmp8 appears to assume carry is not needed after a branch, so it makes no attempt to preserve it and eliminates the carry incorrectly.

So... either the assumption made by OptCmp8 is invalid and it needs to be revised or removed, or the continued/combined use of the carry result produced by the generator is invalid?

Loading

@bbbradsmith
Copy link
Contributor Author

@bbbradsmith bbbradsmith commented May 15, 2019

Just as a test, commenting out the line that removes the comparison: https://github.com/cc65/cc65/blob/master/src/cc65/coptcmp.c#L890

	ldx     #$01
	lda     #$03
	sta     _a
	stx     _a+1
	cpx     #$00
	rts

The cpx ends up hanging around to the end, but with all the other optimizations in place it otherwise manages to clean up everything else very well.

So, even with OptCmp8 only replacing the branch, it seems to be useful. It feels like replacing the branch with a jmp is initially a de-optimization without removing the compare, but OptDeadCode and OptDeadJumps seem to be able to eliminate the intervening code anyway?

Not sure if this would still work as well with more complicated examples, but I think at least OptCmp8 by itself isn't able to produce invalid code if it doesn't eliminate the compare.

Loading

bbbradsmith added a commit to bbbradsmith/cc65 that referenced this issue May 15, 2019
Generates incorrect code for some 16-bit cases. See: cc65#895
@greg-king5
Copy link
Contributor

@greg-king5 greg-king5 commented May 16, 2019

The bug is that OptCmp8 optimizes only 8-bit comparisons of "known" values. It doesn't handle 32-bit and 16-bit comparisons. It optimizes the "cpx #" without optimizing its partner "cmp #". Naturally, that limited act ruins the target code.

Loading

bbbradsmith added a commit to bbbradsmith/cc65 that referenced this issue May 16, 2019
@bbbradsmith
Copy link
Contributor Author

@bbbradsmith bbbradsmith commented May 16, 2019

After writing a test for this against various types (signed/unsigned * char/int/long) it seems to only fail against unsigned int, and even then only in the < and <= cases, and would only visibly manifest at runtime if the carry happened to be clear as a pre-condition. All other types, and cases seem to be immune to the bug, as far as I've noticed.

So it seems that only the code generated for unsigned int < or <= does the kind of shared branch thing that OptCmp8 would fail on?

Disabling the removal of the compare instruction (as in #899) seems to be a simple and safe way to correct OptCmp8, but it does seem like its effectiveness is reduced a bit by that, since it was valid in other cases. It might be possible to re-instate the compare removal for the other cases if there was a way for OptCmp8 to know a little more information about the context?

On the other hand, maybe this was the wrong place to try to optimize branching on a constant. Could it be done instead at the higher level when the code is generated in the first place? That might be safer and more effective.

At any rate #899 appears to fix the invalid code, at the expense of weakening this optimization slightly. Not sure if it's important to preserve/restore that lost optimization but the test will probably help make that easier if one of us attempts it.

Loading

@oliverschmidt
Copy link
Contributor

@oliverschmidt oliverschmidt commented May 17, 2019

I don't know much it helps but I tried to quantify the impact of #899. I built the Contiki 80-column webbrowser (with -Ors) which is the largest program I'm aware of - and large means here lots of C code. See the numbers (with links to the full map files):

C64

Name                   Start     End    Size  Align
----------------------------------------------------
CODE                  00086C  009F22  0096B7  00001

C64 w. 899

Name                   Start     End    Size  Align
----------------------------------------------------
CODE                  00086C  009FAC  009741  00001

Apple II

Name                   Start     End    Size  Align
----------------------------------------------------
CODE                  00087F  008652  007DD4  00001
LC                    00D400  00DFDA  000BDB  00001

Apple II w. 899

Name                   Start     End    Size  Align
----------------------------------------------------
CODE                  00087F  0086D3  007E55  00001
LC                    00D400  00DFE0  000BE1  00001

ATARI

Name                   Start     End    Size  Align
----------------------------------------------------
CODE                  002BAF  0098EB  006D3D  00001
SHADOW_RAM2           00D800  00DFF4  0007F5  00001
SHADOW_RAM            00E400  00FF32  001B33  00001

ATARI w. 899

Name                   Start     End    Size  Align
----------------------------------------------------
CODE                  002BAF  009923  006D75  00001
SHADOW_RAM2           00D800  00DFF4  0007F5  00001
SHADOW_RAM            00E400  00FF76  001B77  00001

So the size increase is:

  • $8A bytes on the C64
  • $87 bytes on the Apple II
  • $7C bytes on the ATARI

Loading

@bbbradsmith
Copy link
Contributor Author

@bbbradsmith bbbradsmith commented May 17, 2019

Hmm, well if we want to think more specifically about the cases where it fails, each time it seemed to have jumped directly to either:

  1. A branch instruction. (e.g. BCS, JCS)
  2. A library function that operates on flags. (e.g. boolult)

The latter seems like it can be detected in the same way that OptBoolTrans detects it, I think, so it does appear possible to check for that.

I'm not confident that I've encountered all ways that this can fail, but I think suppressing the removal of the compare in only these two cases should solve the known problems at least, without hamstringing the other cases.

Loading

@oliverschmidt
Copy link
Contributor

@oliverschmidt oliverschmidt commented May 24, 2019

According to my understanding this seems like a viable approach. In case later on you/someone else discovers a 3.) we can still be optimistic that this 3.) will be detectable too. Or am I missing the point?

Loading

bbbradsmith added a commit to bbbradsmith/cc65 that referenced this issue May 25, 2019
@bbbradsmith
Copy link
Contributor Author

@bbbradsmith bbbradsmith commented May 25, 2019

Well, I tried this in bc3cc99 and it seems to pass the test I had created. On the example case, even though it retains the cpx on the first OptCmp8 pass, it actually gets removed in a later OptCmp8 pass once the intervening comparison's dead code has been removed.

I don't have extensive knowledge of the code generator or optimizer, but it seems to be working OK in the cases I know about at least.

Loading

@oliverschmidt
Copy link
Contributor

@oliverschmidt oliverschmidt commented May 25, 2019

Thanks for your quick reaction on my last comment :-)

I don't have extensive knowledge of the code generator or optimizer [...]

I'd say that's true for all participating at cc65@GitHub. Therefore at least I personally appreciate that you nevertheless try to fix bugs in that area!

I want to give other a little time to comment. If no one vetoes I'll merge...

Loading

oliverschmidt added a commit that referenced this issue May 27, 2019
Generates incorrect code for some 16-bit cases. See: #895
@oliverschmidt
Copy link
Contributor

@oliverschmidt oliverschmidt commented May 27, 2019

Fixed with #899 - thanks :-)

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants