### Optimization workflow
1. Make it work: write the code in a simple, legible way.
2. Make it work reliably: write automated test cases, make sure that your algorithm is right, and make sure that if you break it, the tests will catch the breakage.
3. Optimize the code: profile simple use cases to find the bottlenecks, then speed up those bottlenecks by finding a better algorithm or implementation. Keep in mind that a trade-off must be found between profiling on a realistic example and keeping the profiling run simple and quick to execute. For efficient work, it is best to work with profiling runs lasting around 10 s.
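
Step 2 can be as small as a test that pins an optimized function to a slow, obviously correct reference. The sketch below is only an illustration: `sum_of_squares_reference` and `sum_of_squares_fast` are hypothetical names, not part of any library.

```python
def sum_of_squares_reference(n):
    """Slow but obviously correct: 0**2 + 1**2 + ... + (n - 1)**2."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_of_squares_fast(n):
    """Closed-form version, standing in for the optimized code."""
    return (n - 1) * n * (2 * n - 1) // 6


def test_fast_matches_reference():
    # If a later optimization breaks the fast version, this comparison fails.
    for n in range(50):
        assert sum_of_squares_fast(n) == sum_of_squares_reference(n)


test_fast_matches_reference()
```

Keeping the reference implementation around costs little and makes every later rewrite of the fast path safe to attempt.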

### Profiling Python code
- Measure: profiling, timing
- The fastest code is not always what you expect

#### Timeit

In [10]:
import numpy as np
a = np.arange(1000)
%timeit a ** 2

The slowest run took 21.69 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 5.83 µs per loop
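
Outside IPython, the standard-library `timeit` module gives a comparable measurement. This is a minimal sketch using a pure-Python stand-in (`square_all` is a hypothetical helper) so that it runs without NumPy:

```python
import timeit


def square_all(values):
    """Square every element with a plain list comprehension."""
    return [v ** 2 for v in values]


values = list(range(1000))
n_calls = 200

# Five independent measurements, each timing n_calls calls.
timings = timeit.repeat(lambda: square_all(values), repeat=5, number=n_calls)

# Report the best run, per call: the minimum is the measurement
# least disturbed by other activity on the machine.
best_per_call = min(timings) / n_calls
print(f"best of 5: {best_per_call * 1e6:.1f} us per call")
```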


#### Profiler

In [11]:
# %run -t demo.py

In [12]:
# %run -p demo.py

``` shell
python -m cProfile -o demo.prof demo.py

```
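
The collected statistics can be inspected with the standard-library `pstats` module. The sketch below profiles a small stand-in workload in-process rather than the real `demo.py`; `slow_part`, `fast_part` and `main` are hypothetical functions used only for illustration.

```python
import cProfile
import io
import pstats


def slow_part():
    return sum(i * i for i in range(200_000))


def fast_part():
    return sum(range(1_000))


def main():
    slow_part()
    fast_part()


# Profile main() in-process; this gathers the same kind of data that
# `python -m cProfile -o demo.prof demo.py` writes to disk.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Sort by cumulative time and show the five most expensive entries.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

A profile saved to disk can be loaded the same way with `pstats.Stats("demo.prof")`.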

#### Line-profiler

In [13]:
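import numpy as np
from scipy import linalg
# Assumption: fastica comes from scikit-learn; the original does not show its import.
from sklearn.decomposition import fastica

# @profile  # uncomment to profile this function with line_profiler
def test():
    data = np.random.random((5000, 100))
    u, s, v = linalg.svd(data)
    # Project onto the first 10 left singular vectors: (10, 5000) x (5000, 100)
    pca = np.dot(u[:, :10].T, data)
    results = fastica(pca.T, whiten=False)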

$ kernprof -l -v demo.py

### Making code go faster

#### Algorithmic optimization

In [14]:
data = np.random.random((500, 100))
%timeit np.linalg.svd(data)

1 loops, best of 3: 441 ms per loop


In [15]:
from scipy import linalg

In [16]:
%timeit linalg.svd(data)

1 loops, best of 3: 438 ms per loop


In [17]:
%timeit linalg.svd(data, full_matrices=False)

1 loops, best of 3: 347 ms per loop
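
The saving comes from computing less, not from computing the same thing faster. The sketch below uses `np.linalg.svd`, which accepts the same `full_matrices` argument, to show that the thin decomposition returns a much smaller `u` yet still reconstructs the data; the seeded random generator is an assumption made for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 100))

u_full, s_full, vt_full = np.linalg.svd(data)
u_thin, s_thin, vt_thin = np.linalg.svd(data, full_matrices=False)

print(u_full.shape)  # (500, 500)
print(u_thin.shape)  # (500, 100)

# The singular values agree, and the thin factors still reconstruct data.
assert np.allclose(s_full, s_thin)
assert np.allclose((u_thin * s_thin) @ vt_thin, data)
```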


### Writing faster numerical code
- Vectorizing for loops
- Broadcasting
- In place operations
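
The three techniques above can be sketched on small arrays as follows. This is an illustrative example only, with arbitrarily chosen array sizes:

```python
import numpy as np

x = np.arange(5, dtype=float)

# Explicit Python loop: one interpreter round trip per element.
squares_loop = np.empty_like(x)
for i in range(x.size):
    squares_loop[i] = x[i] ** 2

# Vectorized: the loop runs inside NumPy's compiled code.
squares_vec = x ** 2
assert np.array_equal(squares_loop, squares_vec)

# Broadcasting: a (3, 1) column plus a (4,) row gives a (3, 4) table
# without building tiled copies of either operand.
col = np.arange(3).reshape(3, 1)
row = np.arange(4)
table = col + row
print(table.shape)  # (3, 4)

# In-place operation: the existing buffer is reused.
original = x
x *= 2
assert x is original

# Out-of-place operation: a new array is allocated for the result.
x = x * 2
assert x is not original
```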

In [19]:
a = np.zeros(10**7)  # the shape must be an integer, not the float 1e7
%timeit global a ; a = 0*a

10 loops, best of 3: 98.7 ms per loop


In [20]:
%timeit global a ; a *= 0

100 loops, best of 3: 17.6 ms per loop


- Be easy on the memory: use views, and not copies

In [21]:
a = np.zeros(10**7)  # the shape must be an integer, not the float 1e7
%timeit a.copy()

10 loops, best of 3: 44.9 ms per loop


In [22]:
%timeit a + 1

10 loops, best of 3: 44 ms per loop
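
Whether an operation returned a view or a copy can be checked with `np.shares_memory`. A minimal sketch on a small array:

```python
import numpy as np

a = np.arange(10)

view = a[::2]          # basic slicing returns a view: no data is copied
copy = a[::2].copy()   # explicit copy: a new buffer is allocated

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False

# Writing through the view changes the original array; the copy is unaffected.
view[0] = 99
print(a[0])     # 99
print(copy[0])  # 0
```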


- Beware of cache effects

In [23]:
c = np.zeros((10**4, 10**4), order='C')  # integer shape; floats such as 1e4 are rejected

In [24]:
%timeit c.sum(axis=0)

1 loops, best of 3: 95.1 ms per loop


In [25]:
%timeit c.sum(axis=1)

1 loops, best of 3: 148 ms per loop


In [27]:
c.strides

(80000, 8)
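
Strides give the number of bytes to step in memory to move by one index along each axis, which is why the order of traversal matters. The sketch below uses small `float64` arrays (8 bytes per element) to show how memory layout and transposition change the strides; the shapes are chosen only for illustration.

```python
import numpy as np

c = np.zeros((4, 3), order='C')  # row-major: each row is contiguous
f = np.zeros((4, 3), order='F')  # column-major: each column is contiguous

print(c.strides)  # (24, 8)
print(f.strides)  # (8, 32)

# Transposing swaps the strides without moving any data, so the
# result is no longer C-contiguous.
t = c.T
print(t.strides)                # (8, 24)
print(t.flags['C_CONTIGUOUS'])  # False

# np.ascontiguousarray copies the data into C order.
t_copy = np.ascontiguousarray(t)
print(t_copy.strides)           # (32, 8)
```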

In [28]:
a = np.random.rand(20, 2**18)
b = np.random.rand(20, 2**18)
%timeit np.dot(b, a.T)

10 loops, best of 3: 27.4 ms per loop


In [29]:
c = np.ascontiguousarray(a.T)
%timeit np.dot(b, c)

10 loops, best of 3: 25.5 ms per loop


In [30]:
%timeit c = np.ascontiguousarray(a.T)

10 loops, best of 3: 148 ms per loop


- Use compiled code

For all the above: profile and time your choices. Don’t base your optimization on theoretical considerations.