# Chapter 1. Benchmarking and Profiling

**Profiling** is the technique that lets us pinpoint the most resource-intensive spots in an application.

A **profiler** is a program that runs an application and monitors how long each function takes to execute, thus detecting the functions in which the application spends most of its time.

**Benchmarks** are small scripts used to assess the total execution time of the application.

## 1.1 Designing your application

1. **Make it run**: We have to get the software in a working state, and ensure that it produces the correct results. This exploratory phase serves to better understand the application and to spot major design issues in the early stages.
2. **Make it right**: We want to ensure that the design of the program is solid. Refactoring should be done before attempting any performance optimization. This helps separate the application into independent and cohesive units that are easier to maintain.
3. **Make it fast**: Once our program is working and is well structured, we can focus on performance optimization. We may also want to optimize memory usage if that constitutes an issue.

**Calculate the particle position at time t**
1. Calculate the direction of motion (v_x and v_y).
2. Calculate the displacement (d_x and d_y), which is the product of time step, angular velocity, and direction of motion.
3. Repeat steps 1 and 2 as many times as needed to cover the total time t.
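
The three steps above can be sketched for a single particle as follows. This is only a minimal illustration: `euler_step` and `position_at` are hypothetical helper names, and the default time step is an assumption chosen to match the value used later in simul.py. The direction is undefined for a particle exactly at the origin, which this sketch does not handle.

```python
def euler_step(x, y, ang_speed, timestep):
    """Advance one particle by a single time step (steps 1 and 2)."""
    # 1. Direction of motion: unit vector perpendicular to the position vector
    norm = (x**2 + y**2) ** 0.5
    v_x = -y / norm
    v_y = x / norm
    # 2. Displacement: time step * angular velocity * direction
    d_x = timestep * ang_speed * v_x
    d_y = timestep * ang_speed * v_y
    return x + d_x, y + d_y


def position_at(x, y, ang_speed, t, timestep=1e-5):
    """Repeat the step enough times to cover the total time t (step 3)."""
    for _ in range(int(t / timestep)):
        x, y = euler_step(x, y, ang_speed, timestep)
    return x, y


x, y = position_at(0.3, 0.5, 1.0, 0.1)
print(x, y)
```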

**Visualize particle trajectory using matplotlib**
1. Set up the axes and use the plot function to display the particles. plot takes a list of x and y coordinates.
2. Write an initialization function, init, and a function, animate, that updates the x and y coordinates using the line.set_data method.
3. Create a FuncAnimation instance by passing the init and animate functions, plus the interval parameter, which specifies the update interval, and blit, which improves the update rate of the image.
4. Run the animation with plt.show().

The *test_visualize* function helps us understand graphically how the system evolves over time.

## 1.2 Writing tests and benchmarks

A benchmark is a simple and representative use case that can be run to assess the running time of an application. Benchmarks are very useful for keeping track of how fast the program is with each new version that is implemented.

**Timing the benchmark**

By default, the Unix *time* command displays three metrics:
1. **real**: The actual time spent running the process from start to finish, as if it were measured by a human with a stopwatch.
2. **user**: The cumulative time spent by all the CPUs during the computation.
3. **sys**: The cumulative time spent by all the CPUs during system-related tasks, such as memory allocation.
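
The difference between wall-clock time and CPU time can also be observed from inside Python. The sketch below assumes that `time.perf_counter` is an acceptable stand-in for **real** and that `time.process_time` (the sum of user and system CPU time of the current process) stands in for **user** plus **sys**; `busy_work` is a hypothetical workload.

```python
import time


def busy_work(n):
    """CPU-bound placeholder workload (hypothetical)."""
    return sum(i * i for i in range(n))


wall_start = time.perf_counter()   # wall-clock time, comparable to "real"
cpu_start = time.process_time()    # user + sys CPU time of this process

busy_work(200_000)
time.sleep(0.2)                    # sleeping adds wall-clock time but almost no CPU time

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
print(f"wall: {wall_elapsed:.3f} s, cpu: {cpu_elapsed:.3f} s")
```

On an otherwise idle machine the wall-clock figure should exceed the CPU figure by roughly the time spent sleeping.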

***simul.py***

In [None]:
from matplotlib import pyplot as plt
from matplotlib import animation
from random import uniform
import timeit

class Particle:

    __slots__ = ('x', 'y', 'ang_speed')

    def __init__(self, x, y, ang_speed):
        self.x = x
        self.y = y
        self.ang_speed = ang_speed


class ParticleSimulator:

    def __init__(self, particles):
        self.particles = particles

    def evolve(self, dt):
        timestep = 0.00001
        nsteps = int(dt/timestep)

        for i in range(nsteps):
            for p in self.particles:
                # 1. calculate the direction
                norm = (p.x**2 + p.y**2)**0.5
                v_x = (-p.y)/norm
                v_y = p.x/norm
                
                # 2. calculate the displacement
                d_x = timestep * p.ang_speed * v_x
                d_y = timestep * p.ang_speed * v_y

                p.x += d_x
                p.y += d_y
                # 3. repeat for all the time steps

    # def evolve(self, dt):
    #     timestep = 0.00001
    #     nsteps = int(dt/timestep)

    #     # First, change the loop order
    #     for p in self.particles:
    #         t_x_ang = timestep * p.ang_speed
    #         for i in range(nsteps):
    #             norm = (p.x**2 + p.y**2)**0.5
    #             p.x, p.y = p.x - t_x_ang*p.y/norm, p.y + t_x_ang * p.x/norm

def visualize(simulator):

    X = [p.x for p in simulator.particles]
    Y = [p.y for p in simulator.particles]

    fig = plt.figure()
    ax = plt.subplot(111, aspect='equal')
    line, = ax.plot(X, Y, 'ro')

    # Axis limits
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)

    # It will be run when the animation starts
    def init():
        line.set_data([], [])
        return line,

    def animate(i):
        # We let the particles evolve for 0.01 time units
        simulator.evolve(0.01)
        X = [p.x for p in simulator.particles]
        Y = [p.y for p in simulator.particles]

        line.set_data(X, Y)
        return line,

    # Call the animate function every 10 ms
    anim = animation.FuncAnimation(fig,
                                   animate,
                                   init_func=init,
                                   blit=True,
                                   interval=10)
    plt.show()


def test_visualize():
    particles = [Particle( 0.3, 0.5, +1),
                 Particle( 0.0, -0.5, -1),
                 Particle(-0.1, -0.4, +3)]

    simulator = ParticleSimulator(particles)
    visualize(simulator)

def test_evolve():
    particles = [Particle( 0.3,  0.5, +1),
                 Particle( 0.0, -0.5, -1),
                 Particle(-0.1, -0.4, +3)]

    simulator = ParticleSimulator(particles)

    simulator.evolve(0.1)

    p0, p1, p2 = particles

    def fequal(a, b):
        return abs(a - b) < 1e-5

    assert fequal(p0.x, 0.2102698450356825)
    assert fequal(p0.y, 0.5438635787296997)

    assert fequal(p1.x, -0.0993347660567358)
    assert fequal(p1.y, -0.4900342888538049)

    assert fequal(p2.x,  0.1913585038252641)
    assert fequal(p2.y, -0.3652272210744360)


def benchmark():
    particles = [Particle(uniform(-1.0, 1.0),
                          uniform(-1.0, 1.0),
                          uniform(-1.0, 1.0))
                  for i in range(100)]

    simulator = ParticleSimulator(particles)
    simulator.evolve(0.1)


def timing():
    result = timeit.timeit('benchmark()',
                           setup='from __main__ import benchmark',
                           number=10)
    # Result is the time it takes to run the whole loop
    print(result)

    result = timeit.repeat('benchmark()',
                           setup='from __main__ import benchmark',
                           number=10,
                           repeat=3)
    # Result is a list of times
    print(result)


def benchmark_memory():
    particles = [Particle(uniform(-1.0, 1.0),
                          uniform(-1.0, 1.0),
                          uniform(-1.0, 1.0))
                  for i in range(100000)]

    simulator = ParticleSimulator(particles)
    simulator.evolve(0.001)


if __name__ == '__main__':
    benchmark()


## 1.3 Better tests and benchmarks with pytest-benchmark

In [None]:
conda install pytest

In [None]:
conda install pytest-benchmark

***test_simul.py***

In [None]:
from simul import Particle, ParticleSimulator

def test_evolve(benchmark):
    particles = [Particle( 0.3,  0.5, +1),
                 Particle( 0.0, -0.5, -1),
                 Particle(-0.1, -0.4, +3)]

    simulator = ParticleSimulator(particles)

    simulator.evolve(0.1)

    p0, p1, p2 = particles

    def fequal(a, b):
        return abs(a - b) < 1e-5

    assert fequal(p0.x, 0.2102698450356825)
    assert fequal(p0.y, 0.5438635787296997)

    assert fequal(p1.x, -0.0993347660567358)
    assert fequal(p1.y, -0.4900342888538049)

    assert fequal(p2.x,  0.1913585038252641)
    assert fequal(p2.y, -0.3652272210744360)

    benchmark(simulator.evolve, 0.1)


In [None]:
!pytest test_simul.py::test_evolve

## 1.4 Finding bottlenecks with cProfile

Two profiling modules are available through the Python standard library:
1. **profile**: This module is written in pure Python and adds a significant overhead to the program execution. It remains in the standard library because of its broad platform support and because it is easier to extend.
2. **cProfile**: This is the main profiling module, with an interface equivalent to profile. It is written in C, has a small overhead, and is suitable as a general-purpose profiler.

The cProfile module can be used in three different ways:
1. From the command line
2. As a Python module
3. With IPython

In [None]:
!python -m cProfile simul.py

In [None]:
!python -m cProfile -s tottime simul.py

In [None]:
!python -m cProfile -o prof.out simul.py

In [None]:
from simul import benchmark
import cProfile
cProfile.run("benchmark()")

In [None]:
from simul import benchmark
import cProfile
pr = cProfile.Profile()
pr.enable()
benchmark()
pr.disable()
pr.print_stats()

In [None]:
from simul import benchmark
%prun benchmark()

The cProfile output is divided into the following columns:
1. ncalls: The number of times the function was called.
2. tottime: The total time spent in the function, excluding calls to other functions.
3. cumtime: The time spent in the function, including calls to other functions.
4. percall: The time spent for a single call of the function. It appears twice in the output: once as tottime divided by ncalls and once as cumtime divided by ncalls.
5. filename:lineno: The filename and corresponding line numbers. This information is not available when calling C extension modules.

The most important metric is tottime, the actual time spent in the function body excluding subcalls, which tells us exactly where the bottleneck is.
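
To focus on tottime when profiling from Python, the collected statistics can be sorted with the standard `pstats` module. The sketch below is self-contained so it can run without simul.py; `slow_square_sum` and `workload` are hypothetical functions invented for the illustration.

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    """Hypothetical hot spot: this is where the actual work happens."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    """Hypothetical caller: little work of its own, but a large cumulative time."""
    return [slow_square_sum(20_000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by tottime and print only the five most expensive entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime").print_stats(5)
report = stream.getvalue()
print(report)
```

In the printed report, `slow_square_sum` should rank above `workload` by tottime, even though `workload` has the larger cumtime.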

cProfile only provides information at the function level and does not tell us which specific statements are responsible for the bottleneck. Fortunately, the line_profiler tool can provide line-by-line information about the time spent in a function.

**KCachegrind** is a graphical user interface (GUI) for analyzing the profiling output emitted by cProfile.

In [None]:
conda install pyprof2calltree

***taylor.py***

In [None]:
def factorial(n): 
    if n == 0: 
        return 1.0 
    else: 
        return float(n) * factorial(n-1) 

def taylor_exp(n): 
    return [1.0/factorial(i) for i in range(n)] 

def taylor_sin(n): 
    res = [] 
    for i in range(n): 
        if i % 2 == 1: 
           res.append((-1)**((i-1)/2)/float(factorial(i))) 
        else: 
           res.append(0.0) 
    return res 

def benchmark(): 
    taylor_exp(500) 
    taylor_sin(500) 

if __name__ == '__main__': 
    benchmark() 


In [None]:
!python -m cProfile -o prof.out taylor.py

In [None]:
!pyprof2calltree -i prof.out -o prof.calltree

In [None]:
#!kcachegrind prof.calltree # or qcachegrind prof.calltree

Mac users can compile QCacheGrind using MacPorts (http://www.macports.org/).

In [None]:
!QCacheGrind prof.calltree

In [None]:
pip install gprof2dot

## 1.5 Profile line by line with line_profiler

Now that we know which function we have to optimize, we can use the line_profiler module, which shows line by line how the time is spent inside that function.

In [None]:
pip install line_profiler

To use line_profiler, we need to apply a @profile decorator to the functions we intend to monitor.

In [None]:
@profile
def evolve(self, dt):
    ...  # body of evolve goes here

The kernprof script (installed as kernprof.py by older line_profiler releases) will produce an output file and will print the result of the profiling on the standard output. We should run the script with two options:
1. -l to use the line_profiler function
2. -v to immediately print the results on screen

In [None]:
!kernprof -l -v simul.py

It is also possible to run the profiler in an IPython shell for interactive editing. You should first load the line_profiler extension, which provides the lprun magic command. Using that command, you can avoid adding the @profile decorator.

In [None]:
%load_ext line_profiler  

In [None]:
from simul import benchmark, ParticleSimulator 

In [None]:
%lprun -f ParticleSimulator.evolve benchmark() 

## 1.6 Optimizing our code

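Improve the algorithms used. Since each particle moves along a circle, its position can be expressed directly in terms of the radius r and the angle alpha: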

In [None]:
x = r * cos(alpha)
y = r * sin(alpha)
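
As a rough sketch of this idea, the position after a time dt can be computed in one step instead of thousands of small ones. The function name `evolve_closed_form` is hypothetical. It assumes, as the step-by-step `evolve` method does, that the particle travels an arc of length `ang_speed * dt`, so the angle advances by `ang_speed * dt / r`.

```python
from math import atan2, cos, hypot, sin


def evolve_closed_form(x, y, ang_speed, dt):
    """Rotate (x, y) around the origin in a single step."""
    r = hypot(x, y)        # the radius stays constant on a circle
    alpha = atan2(y, x)    # current angle of the particle
    # Assumption: the particle covers an arc of length ang_speed * dt,
    # which corresponds to an angle of ang_speed * dt / r.
    alpha += ang_speed * dt / r
    return r * cos(alpha), r * sin(alpha)


print(evolve_closed_form(0.3, 0.5, 1.0, 0.1))
```

The result agrees closely with the reference values asserted in `test_evolve`, which come from the iterative version and therefore carry a small discretization error.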

Improve the performance of the loop by reducing the number of assignment operations performed. To do that, we can avoid intermediate variables by rewriting the expression into a single, slightly more complex statement (note that the right-hand side gets evaluated completely before being assigned to the variables):

In [None]:
def evolve_fast(self, dt):
    timestep = 0.00001
    nsteps = int(dt/timestep)
    # Loop order is changed: particles outside, time steps inside
    for p in self.particles:
        # Hoist the loop-invariant product out of the inner loop
        t_x_ang = timestep * p.ang_speed
        for i in range(nsteps):
            norm = (p.x**2 + p.y**2)**0.5
            p.x, p.y = (p.x - t_x_ang * p.y/norm,
                        p.y + t_x_ang * p.x/norm)
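
The single-statement update is safe because Python evaluates the whole right-hand side before assigning anything. The small example below, with made-up values, contrasts it with two sequential statements, where the second one would already see the updated x.

```python
# Sequential updates: the second statement sees the already-updated x
x, y = 1.0, 2.0
x = x - y        # x is now -1.0
y = y + x        # uses the new x: 2.0 + (-1.0) = 1.0
sequential = (x, y)

# Simultaneous update: both right-hand sides use the old values
x, y = 1.0, 2.0
x, y = x - y, y + x   # old x and old y on the right: (-1.0, 3.0)
simultaneous = (x, y)

print(sequential, simultaneous)
```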

In [None]:
!time python simul.py # Performance Tuned

In [None]:
!time python simul.py # Original

## 1.7 The dis module

In the CPython interpreter, Python code is first compiled to an intermediate representation, the bytecode, which is then executed by the interpreter.

To inspect how the code is converted to bytecode, we can use the dis Python module (dis stands for disassemble).

In [None]:
import dis
from simul import ParticleSimulator
dis.dis(ParticleSimulator.evolve)

The dis module helps discover how the statements get converted and serves mainly as an exploration and learning tool of the Python bytecode representation.
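
As an exploratory sketch, `dis.get_instructions` can be used to compare how many bytecode instructions two equivalent functions compile to. The two functions below are hypothetical, simplified versions of the update with and without intermediate variables. The exact counts depend on the Python version, and fewer instructions do not necessarily mean faster code.

```python
import dis


def with_temporaries(x, y, k):
    d_x = k * -y
    d_y = k * x
    x += d_x
    y += d_y
    return x, y


def single_statement(x, y, k):
    x, y = x - k * y, y + k * x
    return x, y


def count_instructions(func):
    """Number of bytecode instructions CPython generates for func."""
    return len(list(dis.get_instructions(func)))


n_temp = count_instructions(with_temporaries)
n_single = count_instructions(single_statement)
print(n_temp, n_single)
```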

## 1.8 Profiling memory usage with memory_profiler

The memory_profiler module summarizes, in a way similar to line_profiler, the memory usage of the process.

In [None]:
pip install memory_profiler psutil

Just like line_profiler, memory_profiler also requires the instrumentation of the source code by placing a @profile decorator on the function we intend to monitor.

In [None]:
def benchmark_memory():
    particles = [Particle(uniform(-1.0, 1.0),
                          uniform(-1.0, 1.0),
                          uniform(-1.0, 1.0))
                    for i in range(100000)]
    simulator = ParticleSimulator(particles)
    simulator.evolve(0.001)

We can use memory_profiler from an IPython shell through the %mprun magic command.

In [None]:
%load_ext memory_profiler

In [None]:
from simul import benchmark_memory

In [None]:
%mprun -f benchmark_memory benchmark_memory()

In [None]:
class Particle:
    # __slots__ avoids a per-instance __dict__, reducing memory per particle
    __slots__ = ('x', 'y', 'ang_speed')

    def __init__(self, x, y, ang_speed):
        self.x = x
        self.y = y
        self.ang_speed = ang_speed
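
The effect of `__slots__` on memory can be checked with a small experiment. The sketch below uses the standard library `tracemalloc` module rather than memory_profiler, so that it runs without extra dependencies. `PlainParticle`, `SlottedParticle`, and `allocated_bytes` are hypothetical names, and the exact numbers depend on the Python version.

```python
import tracemalloc


class PlainParticle:
    def __init__(self, x, y, ang_speed):
        self.x = x
        self.y = y
        self.ang_speed = ang_speed


class SlottedParticle:
    __slots__ = ('x', 'y', 'ang_speed')

    def __init__(self, x, y, ang_speed):
        self.x = x
        self.y = y
        self.ang_speed = ang_speed


def allocated_bytes(cls, n=50_000):
    """Bytes allocated while building n instances of cls, as seen by tracemalloc."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    particles = [cls(0.1, 0.2, 0.3) for _ in range(n)]
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


plain = allocated_bytes(PlainParticle)
slotted = allocated_bytes(SlottedParticle)
print(f"plain: {plain} bytes, slotted: {slotted} bytes")
```

The slotted class should report a noticeably smaller figure, because its instances do not carry a per-instance attribute dictionary.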

## 1.9 Summary