@node Efficiency
@comment node-name, next, previous, up
@chapter Efficiency
@cindex Efficiency
@menu
* Slot access::
* Dynamic-extent allocation::
* Modular arithmetic::
* Miscellaneous Efficiency Issues::
@end menu
@node Slot access
@comment node-name, next, previous, up
@section Slot access
@cindex Slot access
@subsection Structure object slot access
Structure slot accessors are efficient only if the compiler is able to
open code them: compiling a call to a structure slot accessor before
the structure is defined, declaring one @code{notinline}, or passing
it as a functional argument to another function causes severe
performance degradation.
@subsection Standard object slot access
The most efficient way to access a slot of a @code{standard-object} is
by using @code{slot-value} with a constant slot name argument inside a
@code{defmethod} body, where the variable holding the instance is a
specializer parameter of the method and is never assigned to. The cost
is roughly 1.6 times that of an open coded structure slot accessor.
The second most efficient way is to use a CLOS slot accessor, or
@code{slot-value} with a constant slot name argument, in
circumstances other than those specified above. This may be up to 3
times as slow as the method described above.
Example:
@lisp
(defclass foo () ((bar)))

;; Fast: specializer and never assigned to
(defmethod quux ((foo foo) new)
  (let ((old (slot-value foo 'bar)))
    (setf (slot-value foo 'bar) new)
    old))

;; Slow: not a specializer
(defmethod quux ((foo foo) new)
  (let* ((temp foo)
         (old (slot-value temp 'bar)))
    (setf (slot-value temp 'bar) new)
    old))

;; Slow: assignment to FOO
(defmethod quux ((foo foo) new)
  (let ((old (slot-value foo 'bar)))
    (setf (slot-value foo 'bar) new)
    (setf foo new)
    old))
@end lisp
Note that when profiling code such as this, the first few calls to the
generic function are not representative, as the dispatch mechanism is
lazily set up during those calls.
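A rough timing sketch, using the @code{foo} class and @code{quux}
method from the example above (the call counts are arbitrary):

@lisp
;; Sketch: warm up generic dispatch before measuring.
(let ((instance (make-instance 'foo)))
  (setf (slot-value instance 'bar) 0)
  ;; The first few calls populate dispatch caches and are not
  ;; representative.
  (dotimes (i 10)
    (quux instance i))
  ;; Subsequent timing reflects steady-state slot access.
  (time (dotimes (i 1000000)
          (quux instance i))))
@end lisp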
@node Dynamic-extent allocation
@comment node-name, next, previous, up
@section Dynamic-extent allocation
@cindex Dynamic-extent declaration
SBCL has limited support for performing allocation on the stack when a
variable is declared @code{dynamic-extent}. The @code{dynamic-extent}
declarations are not verified, but are simply trusted as long as
@code{sb-ext:*stack-allocate-dynamic-extent*} is true.
If dynamic extent constraints specified in the Common Lisp standard
are violated, the best that can happen is for the program to have
garbage in variables and return values; more commonly, the system will
crash.
@include var-sb-ext-star-stack-allocate-dynamic-extent-star.texinfo
There are many cases when @code{dynamic-extent} declarations could be
useful. At present, SBCL implements stack allocation for
@itemize
@item
@code{&rest} lists, when these are declared @code{dynamic-extent}.
@item
@code{cons}, @code{list} and @code{list*}, when the result is bound to
a variable declared @code{dynamic-extent}.
@item
simple forms of @code{make-array}, whose result is bound to a variable
declared @code{dynamic-extent}: stack allocation is possible only if
the resulting array is one-dimensional, and the call has no keyword
arguments with the exception of @code{:element-type}.
@strong{Note}: stack space is limited, so allocation of a large vector
may cause stack overflow. For this reason potentially large vectors,
which might circumvent stack overflow detection, are stack allocated
only in zero @code{safety} policies.
@item
closures defined with @code{flet} or @code{labels}, with a bound
@code{dynamic-extent} declaration. Closed-over variables that are
assigned to (either inside or outside the closure) are still allocated
on the heap. Blocks and tags are also allocated on the heap, unless
all non-local control transfers to them are compiled with zero
@code{safety}.
@item
user-defined structures when the structure constructor defined using
@code{defstruct} has been declared @code{inline} and the result of the
call to the constructor is bound to a variable declared
@code{dynamic-extent}.
@strong{Note:} structures with ``raw'' slots can currently be
stack-allocated only on x86 and x86-64.
@item
all of the above, when they appear as initial parts of another
stack-allocated object.
@end itemize
Examples:
@lisp
;;; Declaiming a structure constructor inline before definition makes
;;; stack allocation possible.
(declaim (inline make-thing))
(defstruct thing obj next)

;;; Stack allocation of various objects bound to DYNAMIC-EXTENT
;;; variables.
(let* ((list (list 1 2 3))
       (nested (cons (list 1 2) (list* 3 4 (list 5))))
       (vector (make-array 3 :element-type 'single-float))
       (thing (make-thing :obj list
                          :next (make-thing :obj (make-array 3)))))
  (declare (dynamic-extent list nested vector thing))
  ...)

;;; Stack allocation of arguments to a local function is equivalent
;;; to stack allocation of local variable values.
(flet ((f (x)
         (declare (dynamic-extent x))
         ...))
  ...
  (f (list 1 2 3))
  (f (cons (cons 1 2) (cons 3 4)))
  ...)

;;; Stack allocation of &REST lists
(defun foo (&rest args)
  (declare (dynamic-extent args))
  ...)
@end lisp
Future plans include
@itemize
@item
Stack allocation of assigned-to closed-over variables, where these are
declared @code{dynamic-extent};
@item
Automatic detection of the common idiom of applying a function to some
defaults and a @code{&rest} list, even when this is not declared
@code{dynamic-extent};
@item
Automatic detection of the common idiom of calling quantifiers with a
closure, even when the closure is not declared @code{dynamic-extent}.
@end itemize
@node Modular arithmetic
@comment node-name, next, previous, up
@section Modular arithmetic
@cindex Modular arithmetic
@cindex Arithmetic, modular
@cindex Arithmetic, hardware
Some numeric functions have the following property: the @var{N} lowest
bits of the result depend only on the @var{N} lowest bits of (all or
some of) the arguments. If the compiler sees an expression of the form
@code{(logand @var{exp} @var{mask})}, where @var{exp} is a tree of
such ``good'' functions and @var{mask} is known to be of type
@code{(unsigned-byte @var{w})}, where @var{w} is a ``good'' width, all
intermediate results will be cut to @var{w} bits (but this is not done
for variables and constants!). This often makes it possible to use
simple machine instructions for these functions.
Consider an example.
@lisp
(defun i (x y)
(declare (type (unsigned-byte 32) x y))
(ldb (byte 32 0) (logxor x (lognot y))))
@end lisp
The result of @code{(lognot y)} will be negative and of type
@code{(signed-byte 33)}, so a naive implementation on a 32-bit
platform is unable to use 32-bit arithmetic here. But the modular
arithmetic optimizer is able to: because the result is cut down to 32
bits, the compiler replaces @code{logxor} and @code{lognot} with
versions that cut their results to 32 bits, and because the terminals
(here, the expressions @code{x} and @code{y}) are also of type
@code{(unsigned-byte 32)}, 32-bit machine arithmetic can be used.
As of SBCL 0.8.5 the ``good'' functions are @code{+}, @code{-};
@code{logand}, @code{logior}, @code{logxor}, @code{lognot} and their
combinations; and @code{ash} with a positive second
argument. ``Good'' widths are 32 on HPPA, MIPS, PPC, Sparc and x86,
and 64 on Alpha. While it would be possible to support smaller widths
as well, this is currently not implemented.
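For instance, a simple wrapping addition benefits from this
optimization (a sketch; the exact machine code depends on the
platform):

@lisp
;; 32-bit wrapping addition.  The (LDB (BYTE 32 0) ...) cut -- which
;; is equivalent to (LOGAND ... #xFFFFFFFF) -- tells the compiler
;; only the low 32 bits matter, so it can use untagged machine
;; addition instead of generic bignum arithmetic.
(defun add32 (x y)
  (declare (type (unsigned-byte 32) x y))
  (ldb (byte 32 0) (+ x y)))
@end lisp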
@node Miscellaneous Efficiency Issues
@comment node-name, next, previous, up
@section Miscellaneous Efficiency Issues
FIXME: The material in the CMUCL manual about getting good
performance from the compiler should be reviewed, reformatted in
Texinfo, lightly edited for SBCL, and substituted into this
manual. In the meantime, the original CMUCL manual is still 95+%
correct for the SBCL version of the Python compiler. See the
sections
@itemize
@item Advanced Compiler Use and Efficiency Hints
@item Advanced Compiler Introduction
@item More About Types in Python
@item Type Inference
@item Source Optimization
@item Tail Recursion
@item Local Call
@item Block Compilation
@item Inline Expansion
@item Object Representation
@item Numbers
@item General Efficiency Hints
@item Efficiency Notes
@end itemize
Besides this information from the CMUCL manual, there are a few other
points to keep in mind.
@itemize
@item
The CMUCL manual doesn't seem to state it explicitly, but Python has a
mental block about type inference when assignment is involved. Python
is very aggressive and clever about inferring the types of values
bound with @code{let}, @code{let*}, inline function call, and so
forth. However, it's much more passive and dumb about inferring the
types of values assigned with @code{setq}, @code{setf}, and
friends. It would be nice to fix this, but in the meantime don't
expect that just because it's very smart about types in most respects
it will be smart about types involved in assignments. (This doesn't
affect its ability to benefit from explicit type declarations
involving the assigned variables, only its ability to get by without
explicit type declarations.)
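For example, an explicit declaration can supply the type information
that assignment would otherwise lose (a sketch):

@lisp
;; Without the FIXNUM declaration, the compiler may lose track of N's
;; type across the SETQ; with it, the arithmetic stays fast.  The
;; vector is assumed to hold integers.
(defun count-odds (vector)
  (declare (type simple-vector vector))
  (let ((n 0))
    (declare (type fixnum n))
    (dotimes (i (length vector) n)
      (when (oddp (aref vector i))
        (setq n (1+ n))))))
@end lisp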
@c <!-- FIXME: Python dislikes assignments, but not in type
@c inference. The real problems are loop induction, closed over
@c variables and aliases. -->
@item
Since the time the CMUCL manual was written, CMUCL (and thus SBCL) has
gotten a generational garbage collector. This means that there are
some efficiency implications of various patterns of memory usage which
aren't discussed in the CMUCL manual. (Some new material should be
written about this.)
@item
SBCL has some important known efficiency problems. Perhaps the most
important are
@itemize @minus
@item
The garbage collector is not particularly efficient, at least on
platforms without the generational collector (as of SBCL 0.8.9, all
except x86).
@item
Various aspects of the PCL implementation of CLOS are more inefficient
than necessary.
@end itemize
@end itemize
Finally, note that Common Lisp defines many constructs which, in
the infamous phrase, ``could be compiled efficiently by a
sufficiently smart compiler''. The phrase is infamous because
making a compiler which actually is sufficiently smart to find all
these optimizations systematically is well beyond the state of the art
of current compiler technology. Instead, they're optimized on a
case-by-case basis by hand-written code, or not optimized at all if
the appropriate case hasn't been hand-coded. Some cases where no such
hand-coding has been done as of SBCL version 0.6.3 include
@itemize
@item
@code{(reduce #'f x)} where the type of @code{x} is known at compile
time
@item
various bit vector operations, e.g. @code{(position 0
some-bit-vector)}
@item
specialized sequence idioms, e.g. @code{(remove item list :count 1)}
@item
cases where local compilation policy does not require excessive type
checking, e.g. @code{(locally (declare (safety 1)) (assoc item
list))} (which currently performs safe @code{endp} checking internally
to @code{assoc}).
@end itemize
If your system's performance is suffering because of some construct
which could in principle be compiled efficiently, but which the SBCL
compiler can't in practice compile efficiently, consider writing a
patch to the compiler and submitting it for inclusion in the main
sources. Such code is often reasonably straightforward to write;
search the sources for the string ``@code{deftransform}'' to find many
examples (some straightforward, some less so).