Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggested expansion for (/ a b) #357

Closed
gambiteer opened this issue May 15, 2018 · 16 comments
Closed

Suggested expansion for (/ a b) #357

gambiteer opened this issue May 15, 2018 · 16 comments

Comments

@gambiteer
Copy link
Collaborator

I see a lot of scheme code written in the style

(if (even? x)
  (... (/ x 2) ...)
  (... (/ (- x 1) 2) ...))

which bothers me, but, hey, I can't change other people's programming style.

So I tried to work out a better inline expansion for (/ x y) and came up with the following suggestions and tests:

(declare (standard-bindings)
         (extended-bindings)
         (block)
         (not safe))

(define (my/ a b)

  (define (safe/ a b)
    (declare (not inline-primitives))
    (/ a b))
  
  (cond ((and (##fixnum? a)
              (##fixnum? b))
         (cond ((##fx= b -1)
                (or (##fx-? a)
                    (safe/ a b)))
               ((##fxzero? b)
                (safe/ a b))
               ((##fxzero? (##fxremainder a b))
                (##fxquotient a b))
               (else
                (##/ a b))))
        ((and (##flonum? a)
              (##flonum? b))
         (##fl/ a b))
        (else
         (safe/ a b))))

(define reps #e1e7)

(define b 1234670)
(define a (* 2 b))

(time (do ((i 0 (fx+ i 1)))
          ((fx= i reps))
        (let ()
          (declare (not inline-primitives)
                   (not standard-bindings))
          (/ a b))))

(time (do ((i 0 (fx+ i 1)))
          ((fx= i reps))
        (my/ a b)))

(time (do ((i 0 (fx+ i 1)))
          ((fx= i reps))
        (let ()
          (declare (not inline-primitives)
                   (not standard-bindings))
          (/ a 2))))

(time (do ((i 0 (fx+ i 1)))
          ((fx= i reps))
        (my/ a 2)))

(time (do ((i 0 (fx+ i 1)))
          ((fx= i reps))
        (let ()
          (declare (not inline-primitives)
                   (not standard-bindings))
          (/ a -1))))

(time (do ((i 0 (fx+ i 1)))
          ((fx= i reps))
        (my/ a -1)))

with output

firefly:~/programs/gambit/gambit> gsc/gsc -:=.  test-sqrt
firefly:~/programs/gambit/gambit> gsi/gsi -:=.  test-sqrt
(time (do ((i 0 (fx+ i 1))) ((fx= i reps)) (let () (declare (not inline-primitives) (not standard-bindings)) (/ a b))))
    481 ms real time
    481 ms cpu time (481 user, 0 system)
    no collections
    no bytes allocated
    2 minor faults
    no major faults
(time (do ((i 0 (fx+ i 1))) ((fx= i reps)) (my/ a b)))
    174 ms real time
    174 ms cpu time (174 user, 0 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1))) ((fx= i reps)) (let () (declare (not inline-primitives) (not standard-bindings)) (/ a 2))))
    480 ms real time
    480 ms cpu time (480 user, 0 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1))) ((fx= i reps)) (my/ a 2)))
    171 ms real time
    171 ms cpu time (171 user, 0 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1))) ((fx= i reps)) (let () (declare (not inline-primitives) (not standard-bindings)) (/ a -1))))
    158 ms real time
    158 ms cpu time (158 user, 0 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1))) ((fx= i reps)) (my/ a -1)))
    25 ms real time
    25 ms cpu time (25 user, 0 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults

So this code makes things run 2.8 to 6 times faster.

Things will go faster when the constant 2 replaces y in the code source.

@gambiteer
Copy link
Collaborator Author

Using macros we get a better picture of the expected speedup, and it's much better, about a factor of 40 when the divisor is 2.

(declare (standard-bindings)
         (extended-bindings)
         (block)
         (not safe))

(define (safe/ a b)
  (declare (not inline-primitives))
  (/ a b))
  

(define-macro (my/ a b)
  `(let ((a ,a)
         (b ,b))
     (cond ((and (##fixnum? a)
                 (##fixnum? b))
            (cond ((##fx= b -1)
                   (or (##fx-? a)
                       (safe/ a b)))
                  ((##fxzero? b)
                   (safe/ a b))
                  ((##fxzero? (##fxremainder a b))
                   (##fxquotient a b))
                  (else
                   (##/ a b))))
           ((and (##flonum? a)
                 (##flonum? b))
            (##fl/ a b))
           (else
            (safe/ a b)))))

(define reps #e1e7)

(define b 1234670)
(define a (* 2 b))

(set! a a)
(set! b b)

(time (do ((i 0 (fx+ i 1)))
          ((fx= i reps))
        (let ()
          (declare (not inline-primitives)
                   (not standard-bindings))
          (/ a b))))

(time (do ((i 0 (fx+ i 1)))
          ((fx= i reps))
        (my/ a b)))

(time (do ((i 0 (fx+ i 1)))
          ((fx= i reps))
        (let ()
          (declare (not inline-primitives)
                   (not standard-bindings))
          (/ a 2))))

(time (do ((i 0 (fx+ i 1)))
          ((fx= i reps))
        (my/ a 2)))

(time (do ((i 0 (fx+ i 1)))
          ((fx= i reps))
        (let ()
          (declare (not inline-primitives)
                   (not standard-bindings))
          (/ a -1))))

(time (do ((i 0 (fx+ i 1)))
          ((fx= i reps))
        (my/ a -1)))

which expands to

firefly:~/programs/gambit/gambit> gsc/gsc -:=. -expansion test-sqrt
Expansion:

(define safe/ (lambda (a b) ('#<procedure #2 /> a b)))

(define reps 10000000)

(define b 1234670)

(define a
  (let ((temp.13 b))
    (if ('#<procedure #3 ##fixnum?> temp.13)
        (if ('#<procedure #4 ##eqv?> temp.13 0)
            0
            (let ((temp.14 (if ('#<procedure #4 ##eqv?> temp.13 -1)
                               -2
                               ('#<procedure #5 ##fx*?> 2 temp.13))))
              (if temp.14 temp.14 ('#<procedure #6 *> 2 temp.13))))
        ('#<procedure #6 *> 2 temp.13))))

(set! a a)

(set! b b)

(##time (lambda ()
          (letrec ((do-temp.0
                    (lambda (i)
                      (if ('#<procedure #7 ##fx=> i 10000000)
                          #!void
                          (begin
                            (/ a b)
                            (let ((i ('#<procedure #8 ##fx+> i 1)))
                              (if ('#<procedure #7 ##fx=> i 10000000)
                                  #!void
                                  (begin
                                    (/ a b)
                                    (do-temp.0
                                     ('#<procedure #8 ##fx+> i 1))))))))))
            (do-temp.0 0)))
        '(do ((i 0 (fx+ i 1)))
             ((fx= i reps))
           (let ()
             (declare (not inline-primitives) (not standard-bindings))
             (/ a b))))

(##time (lambda ()
          (letrec ((do-temp.2
                    (lambda (i)
                      (if ('#<procedure #7 ##fx=> i 10000000)
                          #!void
                          (begin
                            (let ((b b) (a a))
                              (if (and ('#<procedure #3 ##fixnum?> a)
                                       ('#<procedure #3 ##fixnum?> b))
                                  (if ('#<procedure #7 ##fx=> b -1)
                                      (or ('#<procedure #9 ##fx-?> a)
                                          ('#<procedure #2 /> a b))
                                      (if ('#<procedure #10 ##fxzero?> b)
                                          ('#<procedure #2 /> a b)
                                          (if ('#<procedure #10 ##fxzero?>
                                               ('#<procedure #11 ##fxremainder>
                                                a
                                                b))
                                              ('#<procedure #12 ##fxquotient>
                                               a
                                               b)
                                              (##/ a b))))
                                  (if (and ('#<procedure #13 ##flonum?> a)
                                           ('#<procedure #13 ##flonum?> b))
                                      ('#<procedure #14 ##fl/> a b)
                                      ('#<procedure #2 /> a b))))
                            (do-temp.2 ('#<procedure #8 ##fx+> i 1)))))))
            (do-temp.2 0)))
        '(do ((i 0 (fx+ i 1))) ((fx= i reps)) (my/ a b)))

(##time (lambda ()
          (letrec ((do-temp.4
                    (lambda (i)
                      (if ('#<procedure #7 ##fx=> i 10000000)
                          #!void
                          (begin
                            (/ a 2)
                            (let ((i ('#<procedure #8 ##fx+> i 1)))
                              (if ('#<procedure #7 ##fx=> i 10000000)
                                  #!void
                                  (begin
                                    (/ a 2)
                                    (do-temp.4
                                     ('#<procedure #8 ##fx+> i 1))))))))))
            (do-temp.4 0)))
        '(do ((i 0 (fx+ i 1)))
             ((fx= i reps))
           (let ()
             (declare (not inline-primitives) (not standard-bindings))
             (/ a 2))))

(##time (lambda ()
          (letrec ((do-temp.6
                    (lambda (i)
                      (if ('#<procedure #7 ##fx=> i 10000000)
                          #!void
                          (begin
                            (let ((a a))
                              (if ('#<procedure #3 ##fixnum?> a)
                                  (if ('#<procedure #10 ##fxzero?>
                                       ('#<procedure #11 ##fxremainder> a 2))
                                      ('#<procedure #12 ##fxquotient> a 2)
                                      (##/ a 2))
                                  ('#<procedure #2 /> a 2)))
                            (do-temp.6 ('#<procedure #8 ##fx+> i 1)))))))
            (do-temp.6 0)))
        '(do ((i 0 (fx+ i 1))) ((fx= i reps)) (my/ a 2)))

(##time (lambda ()
          (letrec ((do-temp.8
                    (lambda (i)
                      (if ('#<procedure #7 ##fx=> i 10000000)
                          #!void
                          (begin
                            (/ a -1)
                            (let ((i ('#<procedure #8 ##fx+> i 1)))
                              (if ('#<procedure #7 ##fx=> i 10000000)
                                  #!void
                                  (begin
                                    (/ a -1)
                                    (do-temp.8
                                     ('#<procedure #8 ##fx+> i 1))))))))))
            (do-temp.8 0)))
        '(do ((i 0 (fx+ i 1)))
             ((fx= i reps))
           (let ()
             (declare (not inline-primitives) (not standard-bindings))
             (/ a -1))))

(##time (lambda ()
          (letrec ((do-temp.10
                    (lambda (i)
                      (if ('#<procedure #7 ##fx=> i 10000000)
                          #!void
                          (begin
                            (let ((a a))
                              (if ('#<procedure #3 ##fixnum?> a)
                                  (or ('#<procedure #9 ##fx-?> a)
                                      ('#<procedure #2 /> a -1))
                                  ('#<procedure #2 /> a -1)))
                            (do-temp.10 ('#<procedure #8 ##fx+> i 1)))))))
            (do-temp.10 0)))
        '(do ((i 0 (fx+ i 1))) ((fx= i reps)) (my/ a -1)))

with timings

firefly:~/programs/gambit/gambit> gsi/gsi -:=. test-sqrt
(time (do ((i 0 (fx+ i 1))) ((fx= i reps)) (let () (declare (not inline-primitives) (not standard-bindings)) (/ a b))))
    521 ms real time
    521 ms cpu time (520 user, 1 system)
    no collections
    no bytes allocated
    1 minor fault
    no major faults
(time (do ((i 0 (fx+ i 1))) ((fx= i reps)) (my/ a b)))
    101 ms real time
    101 ms cpu time (97 user, 3 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1))) ((fx= i reps)) (let () (declare (not inline-primitives) (not standard-bindings)) (/ a 2))))
    496 ms real time
    496 ms cpu time (496 user, 0 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1))) ((fx= i reps)) (my/ a 2)))
    12 ms real time
    12 ms cpu time (12 user, 0 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1))) ((fx= i reps)) (let () (declare (not inline-primitives) (not standard-bindings)) (/ a -1))))
    161 ms real time
    161 ms cpu time (161 user, 0 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1))) ((fx= i reps)) (my/ a -1)))
    17 ms real time
    17 ms cpu time (13 user, 4 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults

@gambiteer
Copy link
Collaborator Author

Hey, Matthew implemented something very similar in Racket's fork of Chez:

racket/ChezScheme@e91430b

but it appears that he implemented it in the runtime library, not at expansion time.

This expansion will help Gambit a lot more.

@feeley
Copy link
Member

feeley commented Jan 8, 2020

Let's do it!

@alvatar
Copy link
Contributor

alvatar commented Jan 8, 2020

Awesome 👏

@gambiteer
Copy link
Collaborator Author

Microbenchmarking is hard. I added a loop to measure the overhead and got for #e1e8 reps

(time (do ((i 0 (fx+ i 1)) (result 0 (fxmax result (let () (declare (not inline-primitives) (not standard-bindings)) (/ a b))))) ((fx= i reps) result)))
    6.214013 secs real time
    6.211302 secs cpu time (6.210605 user, 0.000697 system)
    no collections
    no bytes allocated
    1 minor fault
    no major faults
(time (do ((i 0 (fx+ i 1)) (result 0 (fxmax result (my/ a b)))) ((fx= i reps) result)))
    1.861679 secs real time
    1.860780 secs cpu time (1.860780 user, 0.000000 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1)) (result 0 (fxmax result (let () (declare (not inline-primitives) (not standard-bindings)) (/ a 2))))) ((fx= i reps) result)))
    6.205070 secs real time
    6.202406 secs cpu time (6.202406 user, 0.000000 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1)) (result 0 (fxmax result (my/ a 2)))) ((fx= i reps) result)))
    0.209048 secs real time
    0.208159 secs cpu time (0.208159 user, 0.000000 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1)) (result 0 (fxmax result i))) ((fx= i reps) result)))
    0.078965 secs real time
    0.078955 secs cpu time (0.078955 user, 0.000000 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1)) (result 0 (fxmin result (let () (declare (not inline-primitives) (not standard-bindings)) (/ a -1))))) ((fx= i reps) result)))
    1.504262 secs real time
    1.504237 secs cpu time (1.504237 user, 0.000000 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1)) (result 0 (fxmin result (my/ a -1)))) ((fx= i reps) result)))
    0.157164 secs real time
    0.157138 secs cpu time (0.157138 user, 0.000000 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults

Then I tried this expansion, which turned out to be a bit faster:

(define-macro (my/ a b)
  `(let ((a ,a)
         (b ,b))
     (cond ((and (##fixnum? a)
                 (##fixnum? b))
            (cond ((##fx= b -1)
                   (or (##fx-? a)
                       (safe/ a b)))
                  ((##fxzero? b)
                   (safe/ a b))
                  (else
                   (let ((result? (##fxquotient a b)))
                     (if (fx= a (##fx* result? b))
                         result?
                         (##/ a b))))))
           ((and (##flonum? a)
                 (##flonum? b))
            (##fl/ a b))
           (else
            (safe/ a b)))))

with times

(time (do ((i 0 (fx+ i 1)) (result 0 (fxmax result (let () (declare (not inline-primitives) (not standard-bindings)) (/ a b))))) ((fx= i reps) result)))
    6.177677 secs real time
    6.177654 secs cpu time (6.177175 user, 0.000479 system)
    no collections
    no bytes allocated
    1 minor fault
    no major faults
(time (do ((i 0 (fx+ i 1)) (result 0 (fxmax result (my/ a b)))) ((fx= i reps) result)))
    1.035759 secs real time
    1.035750 secs cpu time (1.035750 user, 0.000000 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1)) (result 0 (fxmax result (let () (declare (not inline-primitives) (not standard-bindings)) (/ a 2))))) ((fx= i reps) result)))
    6.223743 secs real time
    6.223737 secs cpu time (6.220394 user, 0.003343 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1)) (result 0 (fxmax result (my/ a 2)))) ((fx= i reps) result)))
    0.156306 secs real time
    0.156308 secs cpu time (0.156308 user, 0.000000 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1)) (result 0 (fxmax result i))) ((fx= i reps) result)))
    0.078408 secs real time
    0.078409 secs cpu time (0.078409 user, 0.000000 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1)) (result 0 (fxmin result (let () (declare (not inline-primitives) (not standard-bindings)) (/ a -1))))) ((fx= i reps) result)))
    1.490614 secs real time
    1.490595 secs cpu time (1.490595 user, 0.000000 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults
(time (do ((i 0 (fx+ i 1)) (result 0 (fxmin result (my/ a -1)))) ((fx= i reps) result)))
    0.156762 secs real time
    0.156763 secs cpu time (0.156763 user, 0.000000 system)
    no collections
    no bytes allocated
    no minor faults
    no major faults

with speedups of: for dividing by constant -1

> (/ (- 1.490614 0.078408) (- 0.156762 0.078408))
18.023406590601628

for dividing by constant 2

> (/ (- 6.223743 0.078408) (- 0.156306 0.078408))
78.88950935839175

for dividing by variable that equals 2

> (/ (- 6.177677 0.078408) (- 1.035759 0.078408))
6.370985145469112

So a speedup of nearly 80.

@feeley
Copy link
Member

feeley commented Jan 9, 2020

I think the case for dividing by -1 should not be in the fast path... I expect it to be almost never the case, and it bloats the code. Implementation would be simple if it didn't have to support multi-argument division...

@gambiteer
Copy link
Collaborator Author

gambiteer commented Jan 9, 2020

But I don't think either test

                  ((##fxzero? (##fxremainder a b))
                   (##fxquotient a b))```
or
               (let ((result? (##fxquotient a b)))
                 (if (fx= a (##fx* result? b))
                     result?
                     (##/ a b))))))```

work if b is -1, i.e.

> (##fxquotient ##min-fixnum -1)
-2305843009213693952
> (##fx* -2305843009213693952 -1)
-2305843009213693952
> (##fxremainder ##min-fixnum -1)
0
> ##min-fixnum
-2305843009213693952

I suspect that most of the time b will be a constant and the tests for 0 and -1 will be elided.

@feeley
Copy link
Member

feeley commented Jan 9, 2020

I mean that when b is 0 or -1 a call should be made to (/ a b). This could also be optimized to test for 0 and -1 with a single conditional jump.

@gambiteer
Copy link
Collaborator Author

OK, so you're saving maybe a branch and an inlined fx-?.
But perhaps / is used infrequently enough that it doesn't really matter, here's some operation counts from the r7rs-benchmarks sources:

firefly:~/programs/r7rs-benchmarks/src> grep '(+' * | grep -v '^;' | wc
    819    6387   51092
firefly:~/programs/r7rs-benchmarks/src> grep '(-' * | grep -v '^;' | wc
    596    4619   36658
firefly:~/programs/r7rs-benchmarks/src> grep '(\*' * | grep -v '^;' | wc
    358    2752   21122
firefly:~/programs/r7rs-benchmarks/src> grep '(\/' * | grep -v '^;' | wc
     43     320    2922
firefly:~/programs/r7rs-benchmarks/src> grep '(quotient' * | grep -v '^;' | wc
     66     424    3662

+, -, and * are used an order of magnitude more often than / and quotient.

@feeley
Copy link
Member

feeley commented Jan 9, 2020

The Gambit RTS is not a good benchmark for what user programs do. Many programs don't use / to divide integers exactly. This optimization is for those infrequent programs that do that very often.

@gambiteer
Copy link
Collaborator Author

The problem with b being -1 occurs only when a is ##min-fixnum, so perhaps it would be better if one could test whether b is zero or a is ##min-fixnum.

Multiplying by -1 is quite common, but I can't think of much natural code that divides by -1.

@feeley
Copy link
Member

feeley commented Jan 22, 2020

A better expansion of / has been added with commit 1438a38 . It calls fxquotient and fxremainder but at -O2 the C compiler combines those two operations in a single division instruction. Perhaps a better divisibility test could be devised, perhaps based on https://arxiv.org/abs/1902.01961 .

@feeley feeley closed this as completed Jan 22, 2020
@gambiteer
Copy link
Collaborator Author

I'm sorry, but I don't see it. With this source file

(declare (standard-bindings)
         (extended-bindings)
         (block)
         (not safe)
         (inlining-limit 0))

(define b 1234670)
(define a (* 2 b))

(set! a a)
(set! b b)

(time (do ((i 0 (fx+ i 1))
           (result 0 (fxmax result
                            (/ a b))))
          ((fx= i reps) result)))

I get

firefly:~/programs/gambit> gsc -expansion test-div2.scm
Expansion:

(define b 1234670)

(define a ('#<procedure #2 *> 2 b))

(set! a a)

(set! b b)

(##time (lambda ()
          (letrec ((do-temp.0
                    (lambda (i result)
                      (if ('#<procedure #3 ##fx=> i reps)
                          result
                          (do-temp.0 ('#<procedure #4 ##fx+> i 1) ('#<procedure #5 ##fxmax> result ('#<procedure #6 /> a b)))))))
            (do-temp.0 0 0)))
        '(do ((i 0 (fx+ i 1)) (result 0 (fxmax result (/ a b)))) ((fx= i reps) result)))

Where's the expansion?

@feeley
Copy link
Member

feeley commented Jan 22, 2020

Remove the (not safe)... This is a misfeature I'm working on...

@gambiteer
Copy link
Collaborator Author

gambiteer commented Jan 22, 2020 via email

@feeley
Copy link
Member

feeley commented Jan 24, 2020

It works for your example now. On my machine, with two fixnums where the first is a multiple of the second, the improvement is about 5x when the divisor is not known at compile time, and about 50x when the divisor is known at compile time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants