don't (Lisp-)specialize Vector types #931

Closed
stylewarning opened this issue May 23, 2023 · 2 comments

@stylewarning (Member)

Currently, the Vector type is built on Common Lisp vectors, which are specialized on the element type. In principle this gives us speed and memory benefits, at the very small cost of carrying a RuntimeRepr constraint around in polymorphic code.

Since our vectors are adjustable with fill pointers by default, we don't actually realize the full speed benefit. The assembly code for the following two functions is identical:

(declaim (optimize speed (safety 0) (debug 0)))

(defun sum1 (a)
  (declare (type (vector t) a))
  (loop :with s :of-type double-float := 0.0d0
        :for x :of-type double-float :across a
        :do (incf s x)
        :finally (return s)))

(defun sum2 (a)
  (declare (type (vector double-float) a))
  (loop :with s :of-type double-float := 0.0d0
        :for x :of-type double-float :across a
        :do (incf s x)
        :finally (return s)))
; disassembly for SUM1
; Size: 163 bytes. Origin: #x536D0C5A                         ; SUM1
; 5A:       660F57D2         XORPD XMM2, XMM2
; 5E:       660F57C9         XORPD XMM1, XMM1
; 62:       31F6             XOR ESI, ESI
; 64:       4D8B48F9         MOV R9, [R8-7]
; 68:       EB4F             JMP L1
; 6A:       660F1F440000     NOP
; 70: L0:   488975F8         MOV [RBP-8], RSI
; 74:       F20F1155F0       MOVSD [RBP-16], XMM2
; 79:       4C8945E8         MOV [RBP-24], R8
; 7D:       4C894DE0         MOV [RBP-32], R9
; 81:       488BFE           MOV RDI, RSI
; 84:       4883EC10         SUB RSP, 16
; 88:       498BD0           MOV RDX, R8
; 8B:       48892C24         MOV [RSP], RBP
; 8F:       488BEC           MOV RBP, RSP
; 92:       E869C633FF       CALL #x52A0D300                  ; SB-KERNEL:HAIRY-DATA-VECTOR-REF
; 97:       4C8B4DE0         MOV R9, [RBP-32]
; 9B:       4C8B45E8         MOV R8, [RBP-24]
; 9F:       F20F1055F0       MOVSD XMM2, [RBP-16]
; A4:       488B75F8         MOV RSI, [RBP-8]
; A8:       F20F104AF9       MOVSD XMM1, [RDX-7]
; AD:       4883C602         ADD RSI, 2
; B1:       F20F58CA         ADDSD XMM1, XMM2
; B5:       660F28D1         MOVAPD XMM2, XMM1
; B9: L1:   4C39CE           CMP RSI, R9
; BC:       7CB2             JL L0
; BE:       4D896D28         MOV [R13+40], R13                ; thread.pseudo-atomic-bits
; C2:       498B5570         MOV RDX, [R13+112]               ; thread.mixed-tlab
; C6:       4883C210         ADD RDX, 16
; CA:       493B5578         CMP RDX, [R13+120]
; CE:       7720             JNBE L4
; D0:       49895570         MOV [R13+112], RDX               ; thread.mixed-tlab
; D4:       48FFCA           DEC RDX
; D7: L2:   66C742F11D01     MOV WORD PTR [RDX-15], 285
; DD:       4D316D28         XOR [R13+40], R13                ; thread.pseudo-atomic-bits
; E1:       7402             JEQ L3
; E3:       CC09             INT3 9                           ; pending interrupt trap
; E5: L3:   F20F1152F9       MOVSD [RDX-7], XMM2
; EA:       488BE5           MOV RSP, RBP
; ED:       F8               CLC
; EE:       5D               POP RBP
; EF:       C3               RET
; F0: L4:   6A10             PUSH 16
; F2:       E8F9F732FF       CALL #x52A004F0                  ; SB-VM::ALLOC-TRAMP
; F7:       5A               POP RDX
; F8:       80CA0F           OR DL, 15
; FB:       EBDA             JMP L2

Both of them have to go through a "hairy data-vector ref" (SB-KERNEL:HAIRY-DATA-VECTOR-REF above), an unoptimized, fully general reference into the vector.
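We can confirm at the REPL why the general accessor is required. A minimal sketch using only standard CL predicates (expected results noted in the comments; the element-type upgrading shown assumes SBCL):

;; A vector made with :ADJUSTABLE and :FILL-POINTER is never a SIMPLE-ARRAY,
;; even when its element type is specialized, so the fast inline accessor
;; can't be used.
(let ((v (make-array 4 :element-type 'double-float
                       :adjustable t
                       :fill-pointer 4
                       :initial-element 0.0d0)))
  (list (typep v 'simple-array)         ; => NIL
        (array-element-type v)          ; => DOUBLE-FLOAT
        (adjustable-array-p v)          ; => T
        (array-has-fill-pointer-p v)))  ; => T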

Compare this to a third definition, which operates on a simple array and is fully optimized.

(defun sum3 (a)
  (declare (type (simple-array double-float (*)) a))
  (loop :with s :of-type double-float := 0.0d0
        :for x :of-type double-float :across a
        :do (incf s x)
        :finally (return s)))
CL-USER> (disassemble #'sum3)
; disassembly for SUM3
; Size: 108 bytes. Origin: #x536D0E96                         ; SUM3
; E96:       660F57C9         XORPD XMM1, XMM1
; E9A:       660F57D2         XORPD XMM2, XMM2
; E9E:       31C9             XOR ECX, ECX
; EA0:       488B5AF9         MOV RBX, [RDX-7]
; EA4:       EB18             JMP L1
; EA6:       660F1F840000000000 NOP
; EAF:       90               NOP
; EB0: L0:   F20F10548A01     MOVSD XMM2, [RDX+RCX*4+1]
; EB6:       4883C102         ADD RCX, 2
; EBA:       F20F58CA         ADDSD XMM1, XMM2
; EBE: L1:   4839D9           CMP RCX, RBX
; EC1:       7CED             JL L0
; EC3:       4D896D28         MOV [R13+40], R13               ; thread.pseudo-atomic-bits
; EC7:       498B5570         MOV RDX, [R13+112]              ; thread.mixed-tlab
; ECB:       4883C210         ADD RDX, 16
; ECF:       493B5578         CMP RDX, [R13+120]
; ED3:       7720             JNBE L4
; ED5:       49895570         MOV [R13+112], RDX              ; thread.mixed-tlab
; ED9:       48FFCA           DEC RDX
; EDC: L2:   66C742F11D01     MOV WORD PTR [RDX-15], 285
; EE2:       4D316D28         XOR [R13+40], R13               ; thread.pseudo-atomic-bits
; EE6:       7402             JEQ L3
; EE8:       CC09             INT3 9                          ; pending interrupt trap
; EEA: L3:   F20F114AF9       MOVSD [RDX-7], XMM1
; EEF:       488BE5           MOV RSP, RBP
; EF2:       F8               CLC
; EF3:       5D               POP RBP
; EF4:       C3               RET
; EF5: L4:   6A10             PUSH 16
; EF7:       E8F4F532FF       CALL #x52A004F0                 ; SB-VM::ALLOC-TRAMP
; EFC:       5A               POP RDX
; EFD:       80CA0F           OR DL, 15
; F00:       EBDA             JMP L2

While we may not realize a speed benefit, we do get a storage benefit. Consider these three definitions:

(defun general-adjustable (n)
  (let ((x (make-array n :element-type t :initial-element 1.0d0 :adjustable t :fill-pointer n)))
    (declare (type (vector t) x))
    (map-into x #'random x)
    (length x)))

(defun special-adjustable (n)
  (let ((x (make-array n :element-type 'double-float :initial-element 1.0d0 :adjustable t :fill-pointer n)))
    (declare (type (vector (double-float (0.0d0))) x))
    (map-into x #'random x)
    (length x)))

(defun special-fixed (n)
  (let ((x (make-array n :element-type 'double-float :initial-element 1.0d0)))
    (declare (type (simple-array (double-float (0.0d0)) (*)) x))
    (map-into x #'random x)
    (length x)))

Timing them, we get:

CL-USER> (time (general-adjustable 10000000))
Evaluation took:
  0.219 seconds of real time
  0.219447 seconds of total run time (0.201021 user, 0.018426 system)
  100.00% CPU
  526,860,812 processor cycles
  239,985,216 bytes consed
  
10000000
CL-USER> (time (special-adjustable 10000000))
Evaluation took:
  0.130 seconds of real time
  0.130024 seconds of total run time (0.129971 user, 0.000053 system)
  100.00% CPU
  312,132,498 processor cycles
  80,000,016 bytes consed
  
10000000
CL-USER> (time (special-fixed 10000000))
Evaluation took:
  0.146 seconds of real time
  0.146113 seconds of total run time (0.128996 user, 0.017117 system)
  100.00% CPU
  350,772,302 processor cycles
  80,000,016 bytes consed
  
10000000

We see that the two specialized definitions cons about a third of the memory of the general one.
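The gap is consistent with how the elements are stored. A back-of-the-envelope sketch, assuming SBCL's 64-bit layout in which a vector slot is 8 bytes and a freshly boxed DOUBLE-FLOAT is 16 bytes (header word plus data word):

;; Rough accounting, not a measurement: MAP-INTO with #'RANDOM writes a
;; fresh double per element.
(let ((n 10000000))
  (list (* n (+ 8 16))  ; general: pointer slot + boxed double => 240,000,000
        (* n 8)))       ; specialized: raw double stored inline =>  80,000,000

which lines up with the ~240 MB and ~80 MB consed above.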

However, while the specialized vectors themselves occupy less memory, we still cons a lot at runtime in our sumN functions:

CL-USER> (let ((x (general-adjustable 10000000)))
           (time (sum1 x)))
Evaluation took:
  0.046 seconds of real time
  0.046030 seconds of total run time (0.046018 user, 0.000012 system)
  100.00% CPU
  110,467,390 processor cycles
  0 bytes consed
  
5000138.483784734d0
CL-USER> (let ((x (special-adjustable 10000000)))
           (time (sum2 x)))
Evaluation took:
  0.176 seconds of real time
  0.176989 seconds of total run time (0.164089 user, 0.012900 system)
  [ Run times consist of 0.116 seconds GC time, and 0.061 seconds non-GC time. ]
  100.57% CPU
  423,857,834 processor cycles
  160,006,144 bytes consed
  
5001112.489638757d0
CL-USER> (let ((x (special-fixed 10000000)))
           (time (sum3 x)))
Evaluation took:
  0.009 seconds of real time
  0.009601 seconds of total run time (0.009598 user, 0.000003 system)
  111.11% CPU
  23,036,176 processor cycles
  0 bytes consed
  
4999124.358925528d0

I don't understand why the general (vector t) case didn't cons during summation. So in practice, summing the general vector here is both several times faster and conses no memory at runtime, compared to summing the specialized adjustable one, despite the general vector costing more to store.
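One plausible explanation, offered as an assumption about SBCL's behavior rather than something verified here: reading from a (vector t) returns the already-boxed double-float stored in the slot, so sum1 allocates nothing, whereas reading from a non-simple (vector double-float) has to box each raw double it returns.

;; Assumed 16 bytes per freshly boxed DOUBLE-FLOAT (64-bit layout):
(* 10000000 16)  ; => 160,000,000, close to the 160,006,144 bytes consed by SUM2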

As such, I think for a standard library object, we should remove the specialization.

(In a follow-on, we should add APIs for specialized, fixed-sized arrays.)
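At the Lisp level, such an API would presumably bottom out in a plain MAKE-ARRAY call without :ADJUSTABLE or :FILL-POINTER, i.e. the same representation SPECIAL-FIXED and SUM3 exploit above. A minimal sketch using only standard CL; nothing Coalton-specific is assumed:

;; A specialized, fixed-size array is a SIMPLE-ARRAY, so the fast inline
;; accessors apply.
(let ((x (make-array 100 :element-type 'double-float :initial-element 0.0d0)))
  (typep x '(simple-array double-float (*))))  ; => T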

@stylewarning (Member Author)

In Coalton, we don't actually emit type-specialized declarations, as can be seen here:

  (repr :native (cl:and (cl:vector cl:*) (cl:not cl:simple-array)))
  (define-type (Vector :a))

If we wanted our arrays to be more efficient, we'd need a mechanism to emit (cl:and (cl:vector <TYPE>) ...) when the element type is known. Using * essentially forces Lisp to select the broadest possible accessor function.
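A hypothetical sketch of what that mechanism could emit; the helper name and shape are illustrative only, not an existing Coalton API:

;; ELEMENT-LISP-TYPE would be the already-translated Lisp type of :a,
;; or CL:* when the element type isn't statically known.
(defun vector-lisp-type (element-lisp-type)
  `(cl:and (cl:vector ,element-lisp-type)
           (cl:not cl:simple-array)))

;; (vector-lisp-type 'cl:double-float)
;; => (CL:AND (CL:VECTOR CL:DOUBLE-FLOAT) (CL:NOT CL:SIMPLE-ARRAY))
;; (vector-lisp-type 'cl:*) reproduces the current, unspecialized behavior.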

@eliaslfox (Collaborator)

It would be pretty easy to hack in specialized array types in https://github.com/coalton-lang/coalton/blob/c18e287af94b7e2dde93f1146108545db7d9d126/src/typechecker/lisp-type.lisp.
