
Question from notes on Apollo core forum #20

Closed
gregthecanuck opened this issue Feb 7, 2018 · 14 comments

Comments

@gregthecanuck

gregthecanuck commented Feb 7, 2018

In a discussion thread here: http://www.apollo-core.com/knowledge.php?b=2&note=12425

Gunnar makes a comment which I have pasted below. Does this still apply to the current gcc?

Thanks

======Comment============================================

GCC and other C compilers sometimes produce relatively bad results.

Let's give an example:

Below is part of the main render function of a NEO-GEO emulator.

L73:
         bfextu d1{#4:#4},d0
         tst.b d0
         jeq L74
         and.l #255,d0
         move.w 2(a2,d0.l*4),18(a0)
L74:
         bfextu d1{#8:#4},d0
         tst.b d0
         jeq L75
         and.l #255,d0
         move.w 2(a2,d0.l*4),20(a0)
L75:
         bfextu d1{#12:#4},d0
         tst.b d0
         jeq L76
         and.l #255,d0
         move.w 2(a2,d0.l*4),22(a0)

This code uses 5 instructions per pixel.
The BFEXTU does set the flags correctly.
GCC is not aware of this, and inserts an extra TST instruction to create the flags. This instruction is not needed and useless.
Then GCC adds an AND.L #$FF,D0 to limit the result to byte range. The value extracted by BFEXTU is already limited to byte range by the width parameter, which means this AND is totally unneeded.

In short: of the 5 instructions, 2 are useless.

That GCC throws a huge number of unneeded TST instructions at the 68k is a known bug in GCC, caused by the laziness of the GCC developers: they simply did not care to implement tracking of the flags created by instructions.

@bebbo
Owner

bebbo commented Feb 7, 2018

There are 2 issues.

  1. cmp/tst
    In gcc6 (and all other versions too) cmp/tst insns are eliminated during the last pass. If the cmp/tst does not modify the flags for the next flag-consuming insn, it is omitted.

Maybe the bfextu insn is not modelled correctly to set the flags.
"The BFEXTU does set the flags correctly." but the compiler is not aware of this feature.

  2. and
    The compiler does not check which part of the bitfield is read. But IMHO this code results from C source like this:
unsigned char c = (p>>4) & 0xf; // bfextu
if (c) // tst, jeq
  s[9] = i[c]; // and, move 

Changing the variable type to int solves the and issue (in gcc6), and surprisingly the tst insn too:

unsigned c = (p>>4) & 0xf; // bfextu
if (c) // jeq
  s[9] = i[c]; // move 

Now I also understand why the tst insn is there:

  • the flags are set by a 32 bit test
  • but the if requires an 8 bit test
  • the compiler is not aware which bit range was extracted, so d0 could contain 0x10000000, and tst.l / tst.b would yield different results.

BUT, if programmers understand how a compiler works, it could help.

void foo(int p, short * s, int * i) {
  unsigned  c = (p>>17) & 0xf;
  if (c) 
    s[9] = i[c];
}

m68k-amigaos-gcc -S -Os -m68040 -fomit-frame-pointer -mregparm=2 bfext.c

_foo:
        bfextu d0{#11:#4},d0
        jeq .L1
        move.w 2(a1,d0.l*4),18(a0)
.L1:
        rts

@bebbo
Owner

bebbo commented Feb 7, 2018

The gcc-6 compiler can yield very good results!
Some people create very bad C code!

@matthey7

matthey7 commented Feb 7, 2018

The 68k is not RISC where the register width is the only datatype width. Even RISC has realized this is inefficient as can be seen with the evolution from Alpha to AArch64 for example. It is not always a bad coding practice to use datatype sizes less than the registers size unless the compiler has poor support for them. Could GCC support smaller datatypes better?

bfextu

For a 32 bit datatype:
   CCR[N] and CCR[Z] are always valid for the result
For a 16 bit datatype and immediate bf width:
   if width > 16 error "data too big for datatype" and exit
   mark upper 16 bits of destination register as being zero
   CCR[Z] is valid for the result
   if width == 16, CCR[N] is valid for the result
For an 8 bit datatype and immediate bf width:
   if width > 8 error "data too big for datatype" and exit
   mark upper 24 bits of destination register as being zero
   CCR[Z] is valid for the result
   if width == 8, CCR[N] is valid for the result

I understand if other 68k compiler improvements are more important, but if GCC can't do a better job then perhaps the problem is as much a bad compiler as bad code. This is not a criticism of you, your work or your opinion. Certainly any improvements are better than none and much appreciated.

@bebbo
Owner

bebbo commented Feb 8, 2018

The current implementation tracks the CCR as it is. And since the CCR has no support for b/w/l flags, there is nothing to improve in the gcc model. gcc tracks the values plus the CCR; it does not track the unused area of registers.

For sure one could write an optimizer pass which tracks zero flags for the unused register parts and eliminates superfluous instructions based on that information.

And in this case, plus many other cases, it helps a lot if the developer reviews the asm code.
The developer makes the difference, not the compiler!

Here is an example comparing two implementations of strncmp.

int strncmp(const char *s1,const char *s2,size_t n)
{ unsigned char *p1=(unsigned char *)s1,*p2=(unsigned char *)s2;
  unsigned long r,c;

  if ((r=n))
    do;while(r=*p1++,c=*p2++,!(r-=c) && (char)c && --n);
  return r;
}

While this looks quite short, it's not efficient:

_strncmp:
        movem.l a2/d3/d2,-(sp)
        move.l 16(sp),a1
        move.l 24(sp),a0
        move.l a0,d0
        jeq .L4
        moveq #0,d1
.L3:
        moveq #0,d0
        move.b (a1,d1.l),d0
        move.l 20(sp),a2
        move.b (a2,d1.l),d2
        moveq #0,d3
        move.b d2,d3
        sub.l d3,d0
        jne .L2
        tst.b d2
        jeq .L2
        addq.l #1,d1
        cmp.l a0,d1
        jne .L3
.L2:
        movem.l (sp)+,d2/d3/a2
        rts
.L4:
        moveq #0,d0
        jra .L2

The code is not optimal, but the issue is: again there is a char to int extension. The programmer did not care enough, and since the result is an int, ints are used too early.

int strncmp(const char *p1,const char *p2, size_t n)
{
  for(;;) {
	  if (n--) {
		  char c = *p1++;
		  if (c) {
			if (c == *p2++)
			  continue;
		  } else
			++p2;
		  unsigned char a = *--p1;
		  unsigned char b = *--p2;
		  return (short)a - b;
	  }
	  return 0;
  }
}

Here the int conversion is performed only before the return.
Also note the nested if statements, which help the compiler to create an optimal loop:

_strncmp:
        move.l 4(sp),a1
        move.l 8(sp),a0
        move.l a1,d1
        add.l 12(sp),d1
.L4:
        cmp.l a1,d1
        jeq .L6
        move.b (a1)+,d0
        jeq .L3
        cmp.b (a0)+,d0
        jeq .L4
.L5:
        and.l #255,d0
        moveq #0,d1
        move.b -1(a0),d1
        sub.l d1,d0
        rts
.L3:
        addq.l #1,a0
        jra .L5
.L6:
        moveq #0,d0
        rts

It would certainly be great if a compiler could also generate optimal assembler from the above code.

And it's easy to blame the bad compilers.

There are just a lot of bad whiny developers.

PS: expect different results per platform - there is no best C-code for all platforms.

@MBeijer

MBeijer commented Feb 8, 2018 via email

@matthey7

matthey7 commented Feb 9, 2018

bebbo wrote:

The current implementation tracks the CCR as it is. And since the CCR has no support for b/w/l flags there is nothing to improve in the gcc model. gcc tracks the values plus the CCR. It does not track the unused area of registers.

Surely GCC tracks the condition codes for the current integer datatype size (as opposed to the whole register). The bfextu instruction sets the correct CCR[Z] for the datatype if the bf it grabs fits in the datatype. The fact that only the CCR[Z] flag is valid is not much different than the BTST/BCHG/BCLR/BSET instructions which only affect the CCR[Z] flag.

For sure one could write an optimizer pass which tracks zero flags for the unused register parts and eliminates superfluous instructions based on that information.

And in this case - plus many other cases - it helps a lot if the developer would review the asm code.
The developer makes the difference, not the compiler!

I can usually make a compiler generate good code by reviewing the code (a good idea), but I shouldn't have to. Wasn't C supposed to optimize for the target CPU, and isn't it good programming practice to avoid hardware-specific optimizations? Should I start reviewing the NetSurf 68k disassembly for places where the code could be rewritten to produce better code on the 68k?

Here is an example comparing two implementations of strncmp.

int strncmp(const char *s1,const char *s2,size_t n)
{ unsigned char *p1=(unsigned char *)s1,*p2=(unsigned char *)s2;
unsigned long r,c;

if ((r=n))
do;while(r=*p1++,c=*p2++,!(r-=c) && (char)c && --n);
return r;
}

While this looks quite short, it's not efficient:

The mismatched datatype sizes practically call for zero extensions, which is poor coding, but GCC also does a poor job of handling the integer promotion and partial register accesses. Is it because the 68k backend is shared with ColdFire, which wants to use mvz/mvs instructions?

_strncmp:
        movem.l a2/d3/d2,-(sp)
        move.l 16(sp),a1
        move.l 24(sp),a0
        move.l a0,d0
        jeq .L4
        moveq #0,d1
.L3:
        mvz.b (a1,d1.l),d0
        move.l 20(sp),a2
        mvz.b (a2,d1.l),d2
        move.l d2,d3
        sub.l d3,d0
        jne .L2
        tst.b d2
        jeq .L2
        addq.l #1,d1
        cmp.l a0,d1
        jne .L3
.L2:
        movem.l (sp)+,d2/d3/a2
        rts
.L4:
        moveq #0,d0
        jra .L2

If the GCC 68k backend generated reasonably good code for the "poor source", it would give something like this...

_strncmp:
        movem.l a2/d2,-(sp)
        move.l 20(sp),d0
        jeq .L2
        move.l d0,a0
        moveq #0,d1
        move.l 12(sp),a1
        moveq #0,d2
        move.l 16(sp),a2
.L3:
        moveq #0,d0
        move.b (a1,d1.l),d0
        move.b (a2,d1.l),d2
        sub.l d2,d0
        jne .L2
        tst.b d2
        jeq .L2
        addq.l #1,d1
        cmp.l a0,d1
        jne .L3
.L2:
        movem.l (sp)+,d2/a2
        rts

The good source and its resulting code are also less efficient because of integer promotion issues. Better would be the following...

_strncmp:
        move.l 4(sp),a1
        moveq #0,d0
        move.l 8(sp),a0
        move.l a1,d1
        add.l 12(sp),d1
.L4:
        cmp.l a1,d1
        jeq .L6
        move.b (a1)+,d0
        jeq .L3
        cmp.b (a0)+,d0
        jeq .L4
        subq.l #1,a0
.L3:
        sub.b (a0),d0
.L6:
        rts

I did not extend the return value in d0 at the end of the function. This allows for more optimal code but the question is whether the AT&T System V ABI for the 68k allows it. Most ABIs I have looked at do not specify what happens to the unused portion of the return value register when a datatype smaller than a register is used (one exception is the PPC ABI which defines the unused bits as trash). I could not find a copy of the 68k System V ABI anywhere so I do not know for certain. Some other compilers are also less efficient because they sign/zero extend the return value before returning from the function and sign/zero extend the return value after returning from the function. I believe it would be more efficient to sign/zero extend after the function (on a read from the stack), especially on the 68k without a MVS/MVZ instruction like the ColdFire.

And it's easy to blame the bad compilers.

There are just a lot of bad whiny developers.

PPC was supposed to replace CISC with its great compiler support, moving all the CPU complexity into the compiler. At the same time Intel was optimizing their CPUs for bad code. Maybe compilers can't handle bad source code. I'm sure there is no end to whiny developers writing bad code on other platforms. The few Amiga programmers left spend most of their time porting the bad code from other platforms. Oh well, there was little hope of a 68k revival without new mass-produced 68k CPUs anyway. The only people trying to design 68k CPUs do not care about making compiler design easier.

@bebbo
Owner

bebbo commented Feb 9, 2018

Right, the compiler could do a bit better.

And you can't omit the extension from byte to long, since the result type of this function is int. Plus, you have to extend before you do the subtraction, otherwise it will yield incorrect results.

@matthey7

matthey7 commented Feb 9, 2018

Right, the compiler could do a bit better.

I understand the difficulties of improving the 68k backend. It could take years for a good programmer and many testers to bring it up to the quality of the x86/x86_64 backend.

And you can't omit the extension from byte to long, since the result type of this function is int. Plus, you have to extend before you do the subtraction, otherwise it will yield incorrect results.

Doh! You are correct, and what strncmp() returns is a standard that can't be changed. Your better source code was strange with the (short) typecast, which called for an early integer promotion. Wouldn't it be better to have written...

int strncmp(const char *p1,const char *p2, size_t n)
{
  for(;;) {
	  if (n--) {
		  char c = *p1++;
		  if (c) {
			if (c == *p2++)
			  continue;
		  } else
			++p2;
		  unsigned char a = *--p1;
		  unsigned char b = *--p2;
		  return (int)(a - b);
	  }
	  return 0;
  }
}

An 8 bit subtraction would be better for the 68000-68040 (the 68060 is a wash without mvs/mvz instructions, even though it prefers 32 bit int operations, and CF requires integer promotion, lacking .b and .w operations). The resulting 68k code would still need a big fat and.l #$ff,d0 before the rts, lacking a CF mvz.b d0,d0. How much easier do you think it would be to write a good 68k backend if a modern 68k CPU had the mvs/mvz instructions?

@bebbo
Owner

bebbo commented Feb 9, 2018

Well, the cast can be omitted; it doesn't matter whether int or short is used, it's superfluous.
So the compiler's result is not far from the optimal result. And since the loop is optimal, it's OK for me for now.

Zero-extend move insns should be feasible, since ColdFire has those. Guess you simply have to define the cpu and set this config bit for the new cpu...

@bebbo
Owner

bebbo commented Feb 10, 2018

Guess it will work:

stefan@Thinky ~/a/tmp
$ m68k-amigaos-gcc -fomit-frame-pointer -Os -S movz.c

stefan@Thinky ~/a/tmp
$ cat movz.s
#NO_APP
        .text
        .align  2
        .globl  _foo
_foo:
        move.l 4(sp),a0
        move.b (a0),d0
        ext.w d0
        ext.l d0
        rts

$ m68k-amigaos-gcc -fomit-frame-pointer -Os -S movz.c -m68060

stefan@Thinky ~/a/tmp
$ cat movz.s
#NO_APP
        .text
        .align  2
        .globl  _foo
_foo:
        move.l 4(sp),a0
        move.b (a0),d0
        extb.l d0
        rts

stefan@Thinky ~/a/tmp
$ m68k-amigaos-gcc -fomit-frame-pointer -Os -S movz.c -m68080

stefan@Thinky ~/a/tmp
$ cat movz.s
#NO_APP
        .text
        .align  2
        .globl  _foo
_foo:
        move.l 4(sp),a0
        mvs.b (a0),d0
        rts

@matthey7

matthey7 commented Feb 10, 2018

I have mentioned before how easy it should be to turn on existing CF instructions in backends with shared 68k/CF support (which is common). The retro 68k FPGA CPU designers don't want anything new, and the Apollo Core designer would rather add shiny hypothetical go-faster bling. The AC68080 supported the CF mvs/mvz instructions (as encoded in CF) at one time, but they were removed. If you want to support the AC68080, you supposedly get integer unit Bn (global base relative) registers and En (extended shared SIMD unit) registers, which are non-orthogonal with the An and Dn registers and may require more instructions and give worse code density if used. A 64 bit 68k CPU really should have movs.l and movz.l instructions as well, and they do take considerable encoding space, even though most modern CPUs have them (the x86_64 versions are fat because of lack of encoding space, but the 68k has more free encoding space). Anyway, I do not represent, support or endorse the Apollo Core, although I was at one time part of the "team"; our philosophies differed.

The 68060 would have likely benefited significantly from the CF mvs/mvz instructions as it only has a 32 bit instruction fetch and prefers 32 bit operations for result forwarding. My analysis of existing 68k code only showed a 0%-4% code size reduction from using these instructions but that was only finding code sequences which could be replaced by the CF instructions and often the code analyzed was compiled for older 68k processors. I would be interested in hearing how much of a code reduction (compiled benchmarks and/or popular programs) comes from compiling for the 68060 but with mvs/mvz support added. I have been researching code density of which the 68k is one of the best.

https://docs.google.com/spreadsheets/d/e/2PACX-1vTyfDPXIN6i4thorNXak5hlP0FQpqpZFk2sgauXgYZPdtJX7FvVgabfbpCtHTkp5Yo9ai6MhiQqhgyG/pubhtml?gid=909588979&single=true

I have also been playing with more sane, dense and conservative 68k ISAs called 68k_32 and 68k_64. One of my focuses is providing easier and better compiler support which I view as necessary to leverage the 68k code density advantage. If it was that easy to turn on mvs/mvz support then some statistics of the results would be interesting.

@bebbo
Owner

bebbo commented Feb 11, 2018

Well, I'll wait and see if there are new insns in the 68080.

@bebbo bebbo closed this as completed Feb 11, 2018
@gregthecanuck
Author

Hello Stefan -

Had a quick discussion on the Apollo IRC channel. The Apollo 68080 core supports mvs/mvz, but in a backwards-compatible way.

It recognizes combinations of instructions and "fuses" them to the internal equivalents of mvs/mvz. These execute in 0.5 clocks.

MOVE.B (d16,An),Dn + EXTB.L Dn   (this is fused as MVS.B)
MOVE.W (d16,An),Dn + EXT.L Dn    (fused as MVS.W)

Similar idea for MVZ: MOVEQ #0,Dn + MOVE.

Examples of some (but not all) of the instruction fusing logic are here: http://apollo-accelerators.com/wiki/doku.php/cpu_fuse?s[]=fusing

@matthey7

gregthecanuck wrote:

Had a quick discussion on the Apollo IRC channel. The Apollo 68080 core supports mvs/mvz, but in a backwards-compatible way.

The Apollo Core does not support mvs/mvz compatible instructions at either the binary or instruction level. It appears bebbo could easily have these CF instructions generated in Amiga executables, but the Apollo Core would no longer recognize them (my understanding is that it would have, not long ago). The Apollo Core claims to recognize and fold/fuse the following combinations:

 MOVEQ #0,Dn + MOVE.W EA,Dn * similar to MVZ.W EA,Dn but 2x the instructions and 2-3x the size
 MOVEQ #0,Dn + MOVE.B EA,Dn * similar to MVZ.B EA,Dn but 2x the instructions and 2-3x the size
 MOVE.W EA,Dn + EXT.L Dn * similar to MVS.W EA,Dn but 2x the instructions and 2-3x the size
 MOVE.B EA,Dn + EXTB.L Dn * similar to MVS.B EA,Dn but 2x the instructions and 2x the size

The MVZ register to register is also useful.

 AND.L #$ff,Dn * similar to MVZ.B Dn,Dn but 3x the size
 AND.L #$ffff,Dn * similar to MVZ.W Dn,Dn but 3x the size

It would have been interesting to see how much of a size difference the MVS/MVZ instructions would have made. The performance gains from code density improvements are usually underestimated, and the gains from adding registers (especially non-orthogonal registers like the Apollo Core's) are usually overestimated. Not only do the non-orthogonal registers give no gains while they go unused (they would not be easy to support in compilers), but the percentage performance gains from more than 16 gp integer registers are likely in the low single digits, possibly less if code density deteriorates from using the new registers. Hardware engineers know best how to improve hypothetical performance!
