- always use numpy array types,
array
ormatrix
as appropriate - let's generate a (10^3) random dataset and look at its dimensions
import numpy
data=numpy.random.random((10,10,10))
print data.shape
- accessing the array is easy: the object knows all the usual mathematical operations; all these operations are pointwise, so you can just
print data, data + 2*data/data**0.5 - data**2
-
this is also reasonably efficient!
-
we can easily access slices (subarrays) as well
array[a:b]
gives you elementsarray[a]
toarray[b-1]
- multidimensional arrays have dimensions separated by commas:
array[a:b,c:d,e:f]
- any missing endpoint is treated as the extreme point (inclusive): missing starting point becomes first element and missing endpoint becomes last element (note that this is inclusive)
- negative indices count from the end of the array
- note that there is no way to define a slice with a negative endpoint index which includes last element of the array
- slices can stride
array[a:b:c]
gives every cth point from a (inclusive) to b (exclusive) - or can slice backwards:
array[a:b:-c]
gives every cth point backwards from a (inclusive) to b (exclusive) - if any axis of your slice contains no points at all, you get an empty array: if you want to remove a dimension, use a trivial slice of one element
array=numpy.linspace(1,10,10)
print array, array[:], array[:4], array[4:]
print array[:-4], array[-4:]
print array[1:8:2], array[8:1:-2], array[1:8:-2]
- slices are also numpy arrays, so they know arithmetic
print data[3:6,:3,-2:]
print data[3:6,:3,-2:]**2
-
slice access can be slow or fast, depending on factors like whether numpy copies the data, what the striding pattern is like, how big the data is etc
- could even be faster to process the whole data instead of a slice
- no general rule, be prepared to experiment if copy-process-copy back is better than striding in place or something else
-
numpy also has a bunch of ufunc functions: they do the obvious thing point-wise:
print numpy.sin(data)
- they are implemented in C using numpy array's buffer interface, so are probably at least 100x faster than the equivalent from
math
(e.g.math.sin
) - numpy also knows of matrices (and tensors!)
mat=numpy.matrix(numpy.random.random((3,3)))
mat1=numpy.matrix(numpy.random.random((3,4)))
mat2=numpy.matrix(numpy.random.random((4,3)))
colvec=numpy.matrix(numpy.random.random((1,3)))
rowvec=numpy.matrix(numpy.random.random((3,1)))
- You can also check some details about a numpy array easily
print colvec.flags
print rowvec.nbytes, rowvec.size
print rowvec.shape, rowvec.reshape(1,3).shape
print mat, mat**2, rowvec, colvec
print colvec*mat1, mat2*colvec.T, mat2*rowvec
print mat1*colvec # this raises ValueError: (3,4) matrix cannot be multipied from the left by (1,3) matrix
- never think or call an array a matrix or vice versa: they obey different arithmetic
- but to a certain extent, arrays are vectors: they can be broadcast to 1-column and 1-row matrices, but do not have the usual transposes (they DO have a transpose, though), arithmetic is array arithmetic etc
arrayvec=numpy.random.random((3))
print arrayvec, arrayvec.T
print arrayvec*mat1
print mat2*arrayvec # gives an error
- Let's also take a sneak peek into next topic, profiling while we look at how to do numerics using Laplacian as an example
- Laplacian is obviously related to PDEs, but the arithmetic is very similar to e.g. a discrete low pass filter like (y_{i} = x_{i-1} + a
(y_{i} - x_{i-1})) or
output[i] :
output[i-1] + a * (input[i] - output[i-1])= withoutput[0]
input[0]= - in python we could do
import numpy
import cProfile
import time as timemod
def init_data(sizes):
return numpy.random.random(sizes)
def Laplacian(data, lapl, d):
for ii in range(1,data.shape[0]-1):
for jj in range(1,data.shape[1]-1):
for kk in range(1,data.shape[2]-1):
lapl[ii,jj,kk] = (
(data[ii-1,jj,kk] - 2*data[ii,jj,kk] + data[ii+1,jj,kk])/d[0]*d[1]*d[2] +
(data[ii,jj-1,kk] - 2*data[ii,jj,kk] + data[ii,jj+1,kk])/d[1]*d[0]*d[2] +
(data[ii,jj,kk-1] - 2*data[ii,jj,kk] + data[ii,jj,kk+1])/d[2]*d[0]*d[1])
return
def runone(func):
d=numpy.array([0.1,0.1,0.1])
data=init_data((100,100,100))
lapl=numpy.zeros_like(data)
cp=cProfile.Profile()
start = timemod.clock()
cp.runcall(func, data, lapl, d)
end = timemod.clock()
print("cProfile gave total time of {time} s and the following profile.".format(time=end-start))
cp.print_stats()
L=runone(Laplacian)
-
that took a while (9.4 s on my laptop)! Any ideas why?
-
let's try numpy-style without explicit loops
-
numpy converts operatins between sliced or whole numpy arrays into vectorised loops
- note that this can deceive you: how much memory does
array_A
arrayB + arrayC*arrayD= consume? How many memory accesses does it contain?
- note that this can deceive you: how much memory does
import numpy
import cProfile
import time as timemod
def init_data(sizes):
return numpy.random.random(sizes)
def Laplacian_numpyic(data, lapl, d):
lapl[1:-1, 1:-1, 1:-1] = (
(data[0:-2,1:-1,1:-1] - 2*data[1:-1,1:-1,1:-1] + data[2:,1:-1,1:-1])/d[0]*d[1]*d[2] +
(data[1:-1,0:-2,1:-1] - 2*data[1:-1,1:-1,1:-1] + data[1:-1,2:,1:-1])/d[1]*d[0]*d[2] +
(data[1:-1,1:-1,0:-2] - 2*data[1:-1,1:-1,1:-1] + data[1:-1,1:-1,2:])/d[2]*d[0]*d[1])
return
L=runone(Laplacian_numpyic)
-
that took 0.05 s on the same laptop!
-
conclusion: never write a for-loop in python
-
let's see how cython works and improves performance
- everything from
%%cython
to the next empty line will be saved to a tepmorary file, turned into a C code using cython and then compiled into a python module which is then imported - when cython runs, it does not see our current namespace (it is a separate process), so we need to import whatever we use
- there is also a special
cimport
command, which imports "into C code" - the
@cython
lines are decorators which affect how cython treats the following function: we want no bounds checking on our arrays and we want (1/0) to produce (\infty) instead of python'sZeroDivisionError
- this is more or less standard cython preamble
- notice also the type definitions in the function definition: always type everything in cython as if you do not, cython treats them as pytohn objects with all the performance penalty that implies
- everything from
%load_ext Cython
%%cython
import cython
import numpy
cimport numpy
@cython.boundscheck(False)
@cython.cdivision(True)
def Laplacian_cython1(object[double, ndim=3] data, object[double, ndim=3] lapl, object[double, ndim=1] d):
lapl[1:-1, 1:-1, 1:-1] = (
(data[0:-2,1:-1,1:-1] - 2*data[1:-1,1:-1,1:-1] + data[2:,1:-1,1:-1])/d[0]*d[1]*d[2] +
(data[1:-1,0:-2,1:-1] - 2*data[1:-1,1:-1,1:-1] + data[1:-1,2:,1:-1])/d[1]*d[0]*d[2] +
(data[1:-1,1:-1,0:-2] - 2*data[1:-1,1:-1,1:-1] + data[1:-1,1:-1,2:])/d[2]*d[0]*d[1])
return
L=runone(Laplacian_cython1)
- that took 0.05 s --- was cython not supposed to speed things up?
- unfortunately as much as numpy likes array-operations, cython dislikes them
- we'll also introduce the right datatypes: the
double
we used above just happens to be the same as an element of thenumpy.ndarray
we passed Laplacian
%%cython
import cython
import numpy
cimport numpy
DTYPE=numpy.float64
ctypedef numpy.float64_t DTYPE_t
@cython.boundscheck(False)
@cython.cdivision(True)
def Laplacian_cython2(numpy.ndarray[DTYPE_t, ndim=3] data, numpy.ndarray[DTYPE_t, ndim=3] lapl, numpy.ndarray[DTYPE_t, ndim=1] d):
cdef int xmax = data.shape[0]
cdef int ymax = data.shape[1]
cdef int zmax = data.shape[2]
cdef int ii, jj, kk
for ii in range(1,xmax-1):
for jj in range(1,ymax-1):
for kk in range(1,zmax-1):
lapl[ii,jj,kk] = (
(data[ii-1,jj,kk] - 2*data[ii,jj,kk] + data[ii+1,jj,kk])/d[0]*d[1]*d[2] +
(data[ii,jj-1,kk] - 2*data[ii,jj,kk] + data[ii,jj+1,kk])/d[1]*d[0]*d[2] +
(data[ii,jj,kk-1] - 2*data[ii,jj,kk] + data[ii,jj,kk+1])/d[2]*d[0]*d[1])
return
L=runone(Laplacian_cython2)
- there we go: 0.014 s on the laptop
- we can do still better: the gcc compiler used does not realise that the lattice constants do not change from lattice site to lattice site, so the
/d[0]*d[1]*d[2]
etc could be done just once and then multiplied (never divide if you can avoid it!) into the stencil:
%%cython
import cython
import numpy
cimport numpy
DTYPE=numpy.float64
ctypedef numpy.float64_t DTYPE_t
@cython.boundscheck(False)
@cython.cdivision(True)
def Laplacian_cython3(numpy.ndarray[DTYPE_t, ndim=3] data, numpy.ndarray[DTYPE_t, ndim=3] lapl, numpy.ndarray[DTYPE_t, ndim=1] d):
cdef int xmax = data.shape[0]
cdef int ymax = data.shape[1]
cdef int zmax = data.shape[2]
cdef int ii, jj, kk
cdef double d1d2bd0=1.0/d[0]*d[1]*d[2], d0d2bd1=1.0/d[1]*d[0]*d[2], d0d1bd2=1.0/d[2]*d[0]*d[1]
for ii in range(1,xmax-1):
for jj in range(1,ymax-1):
for kk in range(1,zmax-1):
lapl[ii,jj,kk] = (
(data[ii-1,jj,kk] - 2*data[ii,jj,kk] + data[ii+1,jj,kk])*d1d2bd0 +
(data[ii,jj-1,kk] - 2*data[ii,jj,kk] + data[ii,jj+1,kk])*d0d2bd1 +
(data[ii,jj,kk-1] - 2*data[ii,jj,kk] + data[ii,jj,kk+1])*d0d1bd2)
return
L=runone(Laplacian_cython3)
- and down to a healthy 0.005 s
- speedup compared to original code is now 1900x
- even compared to the vectorised pure python, it is 10x
- save the code into a file (complicated to arrange in a jupyter notebook, so get
profiling.pyx
andsetup.py
from the repo and place in the right directory - run
python setup.py build_ext --inplace
to get a module calledprofiling
you can import - Profiling
- we already know cProfile, but let's see what it gives in a more complicated example
more complicated cProfile
- cython's profiling capabilities are also of interes: in earlier examples, we saw just something like
_cython_magic_c63ab7889ce7cc65e5cd8f75df5d29ae.Laplacian_cython2
and that's all we would have seen even if the cython code would have had deeper call hierarchies: cProfile cannot see into cython without cython giving it a hand - this hand is =@cython.profile(True):
%%cython
import cython
import numpy
cimport numpy
DTYPE=numpy.float64
ctypedef numpy.float64_t DTYPE_t
@cython.boundscheck(False)
@cython.cdivision(True)
@cython.profile(True)
def Laplacian_cython3_profile(numpy.ndarray[DTYPE_t, ndim=3] data, numpy.ndarray[DTYPE_t, ndim=3] lapl, numpy.ndarray[DTYPE_t, ndim=1] d):
cdef int xmax = data.shape[0]
cdef int ymax = data.shape[1]
cdef int zmax = data.shape[2]
cdef int ii, jj, kk
cdef double d1d2bd0=1.0/d[0]*d[1]*d[2], d0d2bd1=1.0/d[1]*d[0]*d[2], d0d1bd2=1.0/d[2]*d[0]*d[1]
for ii in range(1,xmax-1):
for jj in range(1,ymax-1):
for kk in range(1,zmax-1):
lapl[ii,jj,kk] = (
(data[ii-1,jj,kk] - 2*data[ii,jj,kk] + data[ii+1,jj,kk])*d1d2bd0 +
(data[ii,jj-1,kk] - 2*data[ii,jj,kk] + data[ii,jj+1,kk])*d0d2bd1 +
(data[ii,jj,kk-1] - 2*data[ii,jj,kk] + data[ii,jj,kk+1])*d0d1bd2)
return
L=runone(Laplacian_cython3_profile)
- unfortunately, profiling creates overhead so now our code is now a bit slower
- for small functions, this overhead is enough to misguide you
def recip_square(i):
return 1./i**2
def approx_pi(n=10000000):
val = 0.
for k in range(1,n+1):
val += recip_square(k)
return (6 * val)**.5
cp=cProfile.Profile()
cp.runcall(approx_pi)
cp.print_stats(sort="time")
- note how the
cumtime
andtottime
work: thecumtime
of a function equals itstottime
plus thecumtime
of any of its callees - before cythoninsing, let's make one change
def recip_square(i):
return 1./i**2
def approx_pi(n=10000000):
val = 0.
for k in xrange(1,n+1):
val += recip_square(k)
return (6 * val)**.5
cp=cProfile.Profile()
cp.runcall(approx_pi)
cp.print_stats(sort="time")
- we got the
xrange
fall below the radar, whererange
took a significant amount of time! - now we cythonise
- cython will turn
**
into a call topow()
which is bad, so we remove that - this forces us to change
int i
intolong i
lest we get integer overflows!
%%cython
import cython
@cython.profile(True)
def recip_square(int i):
return 1./(i*i)
@cython.profile(True)
def approx_pi(int n=10000000):
cdef double val = 0.
cdef int k
for k in xrange(1,n+1):
val += recip_square(k)
return (6 * val)**.5
cp=cProfile.Profile()
cp.runcall(approx_pi)
cp.print_stats(sort="time")
- without
@cython.profile(True)
we'd only see the{_cython_magic_6246327bdc7da2785b99c8775b1bdbc3.approx_pi}
line - now the crucial point about small functions: the
tottime
ofapprox_pi
is "wrong" as it includes time spent setting up profiling for recipsquare! - so to see the real time
approx_pi
takes, we turn off profiling fromrecip_square
:
%%cython
import cython
@cython.profile(False)
def recip_square(long i):
return 1./(i*i)
@cython.profile(True)
def approx_pi(long n=10000000):
cdef double val = 0.
cdef long k
for k in xrange(1,n+1):
val += recip_square(k)
return (6 * val)**.5
cp=cProfile.Profile()
cp.runcall(approx_pi)
cp.print_stats(sort="time")
- This is a problem with profiling: the overheads of setting up profiling of an oft-called function will give the wrong impression of how much time the caller takes.
- two more useful tricks: inlining and defining a pure-C function (not directly callable from python)
%%cython
import cython
cimport cython
@cython.profile(False)
cdef inline double recip_square(long i):
return 1./(i*i)
@cython.profile(True)
def approx_pi(long n=10000000):
cdef double val = 0.
cdef long k
for k in xrange(1,n+1):
val += recip_square(k)
return (6 * val)**.5
cp=cProfile.Profile()
cp.runcall(approx_pi)
cp.print_stats(sort="time")
-
Profiling adds a generic performance penatly, so turn profiling off for production
-
Debugging
-
by far the best python debugger
-
interface not very good (pydb has better) but
- the only debugger capable of breakpointing inside a GUI mainloop
-
if you want a good interface, run interactively in ipython
-
won't do GUI mainloops interactively
-
hard to go inside modules you
import
-
very hard to use with MPI and more than one rank
- there is a way:
mpirun -np 1 ipython your_progran.py : -np 7 screen python your_program.py
- or replace
ipython
withpudb
- but you need to make sure your interactive thing does not cause timeouts or deadlocks on the others
- there is a way:
-
- pdb comes with python but is rather limited
- pydb is a slighly more useful but still loses to pudb by a fair margin
- you can get into the stack trace with
ipython --pdb
- qtcreator
- do you want QtQuick or Qt Proper?
- QtQuick uses javascript!