# Test: pipeline vs. branching

To learn a bit more about pipelining: https://fr.wikipedia.org/wiki/Pipeline_(architecture_des_processeurs)


To see the effects of conflicts between the pipeline and one or more branches, let's follow the computation of many Syracuse (Collatz) sequences, looking for the longest trajectories.

## Longest trajectory (Syracuse), first version
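
For reference, the Syracuse sequence used throughout can be written as follows. The notation $u_n$ is introduced here for illustration and does not appear in the notebook:

$$
u_{n+1} =
\begin{cases}
u_n / 2 & \text{if } u_n \text{ is even} \\
3u_n + 1 & \text{if } u_n \text{ is odd}
\end{cases}
$$

The length of a trajectory is the number of steps needed to go from the start $u_0$ down to 1. For example, $6 \to 3 \to 10 \to 5 \to 16 \to 8 \to 4 \to 2 \to 1$ takes 8 steps. The parity test at every step is the data-dependent branch that this notebook is interested in.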

In [1]:
%%writefile syracuse.c
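#include <stdio.h>
#include <stdlib.h>

/* Number of steps for the Syracuse sequence starting at a to reach 1.
   Note: 3*a+1 can exceed the int range for some large starts. */
int syracuse(int a) {
    int i;
    for (i = 0; a > 1; i++) {
        if (a & 1) { a = 3 * a + 1; }   /* odd: 3a+1 */
        else       { a = a / 2; }       /* even: halve */
    }
    return i;
}

/* Tries the starts i, i-1, ..., 1 and prints the one with the longest trajectory.
   im is only assigned when some start needs at least one step (argument >= 2). */
int main(int argc, char *argv[]) {
    int i, m, n, im;
    if (argc == 1) { i = 20; }                 /* default upper bound */
    else           { i = atoi(argv[1]); }
    for (m = 0; i > 0; i--) {
        n = syracuse(i);
        if (n > m) { m = n; im = i; }          /* strictly longer trajectory found */
    }
    printf("i=%d, m=%d\n", im, m);
    return 0;
}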

Overwriting syracuse.c


In [2]:
%%sh
arm-linux-gnueabi-gcc -S -static syracuse.c
cat syracuse.s

	.arch armv5t
	.fpu softvfp
	.eabi_attribute 20, 1
	.eabi_attribute 21, 1
	.eabi_attribute 23, 3
	.eabi_attribute 24, 1
	.eabi_attribute 25, 1
	.eabi_attribute 26, 2
	.eabi_attribute 30, 6
	.eabi_attribute 34, 0
	.eabi_attribute 18, 4
	.file	"syracuse.c"
	.text
	.align	2
	.global	syracuse
	.type	syracuse, %function
syracuse:
	@ args = 0, pretend = 0, frame = 16
	@ frame_needed = 1, uses_anonymous_args = 0
	@ link register save eliminated.
	str	fp, [sp, #-4]!
	add	fp, sp, #0
	sub	sp, sp, #20
	str	r0, [fp, #-16]
	mov	r3, #0
	str	r3, [fp, #-8]
	b	.L2
.L5:
	ldr	r3, [fp, #-16]
	and	r3, r3, #1
	cmp	r3, #0
	beq	.L3
	ldr	r2, [fp, #-16]
	mov	r3, r2
	mov	r3, r3, asl #1
	add	r3, r3, r2
	add	r3, r3, #1
	str	r3, [fp, #-16]
	b	.L4
.L3:
	ldr	r3, [fp, #-16]
	mov	r2, r3, lsr #31
	add	r3, r2, r3
	mov	r3, r3, asr #1
	str	r3, [fp, #-16]
.L4:
	ldr	r3, [fp, #-8]
	add	r3, r3, #1
	str	r3, [fp, #-8]
.L2:
	ldr	r3, [fp, #-16]
	cmp	r3, #1
	bgt	.L5
	ldr	r3, [fp, #-8]
	mov	r0, r3
	add	sp, fp, #0
	ldmfd	sp!, {fp}
	b

In [3]:
%%sh
arm-linux-gnueabi-gcc -static syracuse.c
qemu-arm a.out

i=19, m=20


In [29]:
%%sh
arm-linux-gnueabi-gcc -static -O syracuse.c
time qemu-arm a.out 1000000

i=910107, m=475


0.87user 0.00system 0:00.89elapsed 98%CPU (0avgtext+0avgdata 4124maxresident)k
0inputs+0outputs (0major+1113minor)pagefaults 0swaps
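
A caveat about the result printed above: for starts below one million, the longest Syracuse trajectory is usually reported as starting at 837799 and taking 524 steps, not 475. The difference most likely comes from `int` arithmetic: some trajectories climb above 2^31 - 1 (the one starting at 113383 is a well-known case), so `3*a+1` overflows a 32-bit `int` and the loop stops early. This does not invalidate the timing comparison, since every variant in this notebook does the same (truncated) work. The following is a minimal sketch of a 64-bit variant, not the author's code; the names `syracuse64` and `longest_trajectory` are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Number of Syracuse steps needed to reach 1, computed in 64 bits so that
   3*a+1 cannot overflow for starts below one million.
   (syracuse64 is a hypothetical name, not part of the original notebook.) */
int syracuse64(int64_t a) {
    int steps = 0;
    while (a > 1) {
        a = (a & 1) ? 3 * a + 1 : a / 2;
        steps++;
    }
    return steps;
}

/* Scans the starts limit, limit-1, ..., 1 like the original main().
   Returns the longest step count and stores the matching start in *best_start. */
int longest_trajectory(int limit, int *best_start) {
    int best = 0;
    *best_start = 1;
    for (int i = limit; i > 0; i--) {
        int n = syracuse64(i);
        if (n > best) {
            best = n;
            *best_start = i;
        }
    }
    return best;
}
```

Calling `longest_trajectory(20, &start)` reproduces the small run shown earlier (`i=19, m=20`), and `syracuse64(27)` gives the classic 111 steps.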


## Version optimized by the programmer

The programmer simply noticed that when a is odd, 3*a+1 is always even, so the halving step that necessarily follows can be done right away, in the same iteration.

In [12]:
%%writefile syracuseOptProg.c
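#include <stdio.h>
#include <stdlib.h>

/* Same step count as the first version, but an odd step is fused with
   the halving that always follows it. */
int syracuse(int a) {
    int i;
    for (i = 0; a > 1; i++) {
        if (a & 1) { a = (3 * a + 1) / 2; i++; }   /* two steps at once, hence the extra i++ */
        else       { a = a / 2; }
    }
    return i;
}

/* The upper bound is hard-coded here; command-line arguments are ignored. */
int main(void) {
    int i = 1000000, m, n, im;
    for (m = 0; i > 0; i--) {
        n = syracuse(i);
        if (n > m) { m = n; im = i; }
    }
    printf("i=%d, m=%d\n", im, m);
    return 0;
}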

Overwriting syracuseOptProg.c
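
As a sanity check on the fused step, the sketch below compares the step counts of the one-step-per-iteration loop and of the fused loop over a range of starts. It is an illustration under stated assumptions rather than part of the original notebook: the names `steps_plain`, `steps_fused` and `first_mismatch` are hypothetical, and 64-bit integers are used so that overflow does not interfere with the comparison.

```c
#include <stdint.h>

/* One Syracuse step per iteration (same logic as syracuse.c, in 64 bits). */
int steps_plain(int64_t a) {
    int i = 0;
    while (a > 1) {
        if (a & 1) { a = 3 * a + 1; }
        else       { a = a / 2; }
        i++;
    }
    return i;
}

/* Odd step fused with the halving that follows it, counted as two steps
   (same logic as syracuseOptProg.c, in 64 bits). */
int steps_fused(int64_t a) {
    int i = 0;
    while (a > 1) {
        if (a & 1) { a = (3 * a + 1) / 2; i += 2; }
        else       { a = a / 2;           i += 1; }
    }
    return i;
}

/* Returns the first start in [1, limit] for which the two counts differ,
   or 0 if they agree everywhere. */
int first_mismatch(int limit) {
    for (int s = 1; s <= limit; s++) {
        if (steps_plain(s) != steps_fused(s)) {
            return s;
        }
    }
    return 0;
}
```

The fused loop removes one iteration (and one loop-condition test) per odd value, but the parity branch itself is still there; only the loop back-edge is taken less often.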


In [36]:
%%sh
arm-linux-gnueabi-gcc -static -O syracuseOptProg.c
time qemu-arm a.out 1000000

i=910107, m=475


0.87user 0.00system 0:00.86elapsed 100%CPU (0avgtext+0avgdata 4112maxresident)k
0inputs+0outputs (0major+1111minor)pagefaults 0swaps


## Versions optimized by the compiler

The first version:  
note that the compiler has also avoided the branches, by using conditional instructions!

In [16]:
%%sh
arm-linux-gnueabi-gcc -O -S -static syracuse.c
cat syracuse.s

	.arch armv5t
	.fpu softvfp
	.eabi_attribute 20, 1
	.eabi_attribute 21, 1
	.eabi_attribute 23, 3
	.eabi_attribute 24, 1
	.eabi_attribute 25, 1
	.eabi_attribute 26, 2
	.eabi_attribute 30, 1
	.eabi_attribute 34, 0
	.eabi_attribute 18, 4
	.file	"syracuse.c"
	.text
	.align	2
	.global	syracuse
	.type	syracuse, %function
syracuse:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	mov	r3, r0
	cmp	r0, #1
	ble	.L6
	mov	r0, #0
.L5:
	tst	r3, #1
	addne	r3, r3, r3, asl #1
	addne	r3, r3, #1
	addeq	r3, r3, r3, lsr #31
	moveq	r3, r3, asr #1
	add	r0, r0, #1
	cmp	r3, #1
	bgt	.L5
	bx	lr
.L6:
	mov	r0, #0
	bx	lr
	.size	syracuse, .-syracuse
	.align	2
	.global	main
	.type	main, %function
main:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, lr}
	cmp	r0, #1
	bne	.L9
	mov	r4, #20
.L11:
	mov	r5, #0
	b	.L10
.L9:
	ldr	r0, [r1, #4]
	mov	r1, #0
	mov	r2, #10
	bl	strtol
	subs	r4, r0, #0
	movle	r

In [6]:
%%sh
arm-linux-gnueabi-gcc -O -static syracuse.c
time qemu-arm a.out 1000000

i=910107, m=475


1.07user 0.01system 0:01.09elapsed 100%CPU (0avgtext+0avgdata 4120maxresident)k
0inputs+0outputs (0major+1114minor)pagefaults 0swaps
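
In the ARM listing above, `tst r3, #1` sets the flags once; the two `addne` instructions then compute 3a+1 only when a is odd, and `addeq`/`moveq` compute the signed division by 2 only when a is even. The only branch left in the loop is the final `bgt`. The same idea can be imitated in C by selecting the result with a bit mask instead of an `if`. This is a hedged sketch: `syracuse_branchless` is a hypothetical name, unsigned 64-bit arithmetic is an assumption made here to keep the mask trick well defined, and the compiler remains free to emit whatever instructions it prefers.

```c
#include <stdint.h>

/* Syracuse step count with a branch-free loop body: both candidate results
   are computed and the parity bit selects one of them through a mask.
   (Hypothetical helper, not from the original notebook.) */
int syracuse_branchless(uint64_t a) {
    int steps = 0;
    while (a > 1) {                       /* the loop test is still a branch */
        uint64_t odd  = a & 1;            /* 1 if a is odd, 0 if even */
        uint64_t mask = 0 - odd;          /* all ones if odd, all zeros if even */
        uint64_t up   = 3 * a + 1;        /* result for the odd case */
        uint64_t down = a >> 1;           /* result for the even case */
        a = (up & mask) | (down & ~mask); /* pick one without branching */
        steps++;
    }
    return steps;
}
```

Like the predicated ARM code, this version always executes the instructions of both cases, which is the price paid for not having to predict the parity branch.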


The version optimized by the programmer:

In [8]:
%%sh
arm-linux-gnueabi-gcc -static syracuseOptProg.c
time qemu-arm a.out 1000000

i=910107, m=475


0.93user 0.00system 0:00.96elapsed 97%CPU (0avgtext+0avgdata 4116maxresident)k
0inputs+0outputs (0major+1112minor)pagefaults 0swaps


## ARM vs. x86?

But does qemu simulate the pipeline (or does it merely benefit from the host machine's pipeline, or generate code that can benefit from it)? In short, with qemu it is hard to be sure what is really happening. To confirm what we seem to observe, let's look at x86 (this machine is x86, not ARM) with a compiled language (Python or Java would probably not be good choices for seeing this).
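
Some background on the question above: user-mode qemu runs the ARM binary through dynamic binary translation to host instructions and is not a cycle-accurate model of an ARM pipeline, so the timings obtained under `qemu-arm` mostly reflect the translated code running on the host. For native measurements, the shell `time` command can be complemented by timing the search loop from inside the program. The sketch below uses the standard `clock()` function; the names `step_fn`, `steps_branchy` and `time_search` are hypothetical and not part of the original notebook.

```c
#include <stdint.h>
#include <time.h>

/* Any function mapping a start value to its number of Syracuse steps. */
typedef int (*step_fn)(int64_t);

/* Straightforward version with a parity branch (64-bit to avoid overflow). */
int steps_branchy(int64_t a) {
    int i = 0;
    while (a > 1) {
        if (a & 1) { a = 3 * a + 1; }
        else       { a = a / 2; }
        i++;
    }
    return i;
}

/* Runs fn on the starts limit, limit-1, ..., 1, stores the longest step
   count in *longest and returns the CPU time used, in seconds. */
double time_search(step_fn fn, int limit, int *longest) {
    clock_t t0 = clock();
    int best = 0;
    for (int i = limit; i > 0; i--) {
        int n = fn(i);
        if (n > best) {
            best = n;
        }
    }
    clock_t t1 = clock();
    *longest = best;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Passing different step functions to `time_search` with the same `limit` gives a like-for-like comparison inside a single process. Individual runs are noisy (the cells below show the same program taking between 0.31 s and 0.37 s), so several repetitions are advisable before drawing conclusions.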

In [37]:
%%sh
gcc -S -masm=intel syracuse.c
cat syracuse.s

	.file	"syracuse.c"
	.intel_syntax noprefix
	.text
	.globl	syracuse
	.type	syracuse, @function
syracuse:
.LFB2:
	.cfi_startproc
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	mov	rbp, rsp
	.cfi_def_cfa_register 6
	mov	DWORD PTR [rbp-20], edi
	mov	DWORD PTR [rbp-4], 0
	jmp	.L2
.L5:
	mov	eax, DWORD PTR [rbp-20]
	and	eax, 1
	test	eax, eax
	je	.L3
	mov	edx, DWORD PTR [rbp-20]
	mov	eax, edx
	add	eax, eax
	add	eax, edx
	add	eax, 1
	mov	DWORD PTR [rbp-20], eax
	jmp	.L4
.L3:
	mov	eax, DWORD PTR [rbp-20]
	mov	edx, eax
	shr	edx, 31
	add	eax, edx
	sar	eax
	mov	DWORD PTR [rbp-20], eax
.L4:
	add	DWORD PTR [rbp-4], 1
.L2:
	cmp	DWORD PTR [rbp-20], 1
	jg	.L5
	mov	eax, DWORD PTR [rbp-4]
	pop	rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE2:
	.size	syracuse, .-syracuse
	.section	.rodata
.LC0:
	.string	"i=%d, m=%d\n"
	.text
	.globl	main
	.type	main, @function
main:
.LFB3:
	.cfi_startproc
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	mov	rbp, rsp
	.cfi_def_cfa_register 6
	sub	rsp, 32
	mov	

In [49]:
%%sh
gcc -O -masm=intel syracuse.c
time a.out 1000000

i=910107, m=475


0.35user 0.01system 0:00.36elapsed 103%CPU (0avgtext+0avgdata 500maxresident)k
0inputs+0outputs (0major+155minor)pagefaults 0swaps


In [21]:
%%sh
gcc -O -S syracuseOptProg.c
cat syracuseOptProg.s

	.file	"syracuseOptProg.c"
	.text
	.globl	syracuse
	.type	syracuse, @function
syracuse:
.LFB39:
	.cfi_startproc
	cmpl	$1, %edi
	jle	.L6
	movl	$0, %eax
.L5:
	testb	$1, %dil
	je	.L3
	leal	1(%rdi,%rdi,2), %edx
	movl	%edx, %edi
	shrl	$31, %edi
	addl	%edx, %edi
	sarl	%edi
	addl	$1, %eax
	jmp	.L4
.L3:
	movl	%edi, %edx
	shrl	$31, %edx
	addl	%edx, %edi
	sarl	%edi
.L4:
	addl	$1, %eax
	cmpl	$1, %edi
	jg	.L5
	rep ret
.L6:
	movl	$0, %eax
	ret
	.cfi_endproc
.LFE39:
	.size	syracuse, .-syracuse
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC0:
	.string	"i=%d, m=%d\n"
	.text
	.globl	main
	.type	main, @function
main:
.LFB40:
	.cfi_startproc
	pushq	%r12
	.cfi_def_cfa_offset 16
	.cfi_offset 12, -16
	pushq	%rbp
	.cfi_def_cfa_offset 24
	.cfi_offset 6, -24
	pushq	%rbx
	.cfi_def_cfa_offset 32
	.cfi_offset 3, -32
	movl	$0, %ebp
	movl	$1000000, %ebx
.L10:
	movl	%ebx, %edi
	call	syracuse
	cmpl	%ebp, %eax
	jle	.L8
	movl	%ebx, %r12d
	movl	%eax, %ebp
.L8:
	subl	$1, %ebx
	jne	.L10
	movl	%ebp, %ecx
	movl	%r12d, %edx


In [58]:
%%sh
gcc -O syracuseOptProg.c
time a.out 1000000

i=910107, m=475


0.31user 0.01system 0:00.30elapsed 105%CPU (0avgtext+0avgdata 492maxresident)k
0inputs+0outputs (0major+153minor)pagefaults 0swaps


In [17]:
%%sh
gcc -O syracuse.c
time a.out 1000000

i=910107, m=475


0.45user 0.00system 0:00.46elapsed 98%CPU (0avgtext+0avgdata 500maxresident)k
0inputs+0outputs (0major+155minor)pagefaults 0swaps


In [16]:
%%sh
gcc -O syracuseOptProg.c
time a.out 1000000

i=910107, m=475


0.37user 0.00system 0:00.39elapsed 95%CPU (0avgtext+0avgdata 492maxresident)k
0inputs+0outputs (0major+153minor)pagefaults 0swaps


## Conclusion

It is (sometimes) better when the programmer and the compiler both optimize at the same time!