# Profiling

<a href="https://colab.research.google.com/github/Ziaeemehr/workshop_hpcpy/blob/main/notebooks/profiling/mem_line_profiler.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np

In [2]:
def euclidean_broadcast(x, y):
    """Squared Euclidean distance matrix.

    Inputs:
    x: (N, m) numpy array
    y: (N, m) numpy array

    Output:
    (N, N) squared Euclidean distance matrix:
    r_ij = sum_k (x_ik - y_jk)^2
    """
    diff = x[:, np.newaxis, :] - y[np.newaxis, :, :]

    return (diff * diff).sum(axis=2)

In [3]:
def euclidean_trick(x, y):
    """Squared Euclidean distance matrix.

    Inputs:
    x: (N, m) numpy array
    y: (N, m) numpy array

    Output:
    (N, N) squared Euclidean distance matrix:
    r_ij = sum_k (x_ik - y_jk)^2
    """
    x2 = np.einsum('ij,ij->i', x, x)[:, np.newaxis]
    y2 = np.einsum('ij,ij->i', y, y)[np.newaxis, :]

    xy = x @ y.T

    return np.abs(x2 + y2 - 2. * xy)

In [4]:
nrows = 2000
ncols = 50

rng = np.random.default_rng()
x = 10. * rng.random((nrows, ncols))
y = 10. * rng.random((nrows, ncols))
print(np.allclose(euclidean_broadcast(x, y), euclidean_trick(x, y)))

True
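
The two functions agree because the squared distance between row $i$ of `x` and row $j$ of `y` can be expanded term by term. The index notation $x_{ik}$, $y_{jk}$ is introduced here only for illustration:

$$
r_{ij} = \sum_{k=1}^{m} (x_{ik} - y_{jk})^2 = \sum_{k=1}^{m} x_{ik}^2 + \sum_{k=1}^{m} y_{jk}^2 - 2 \sum_{k=1}^{m} x_{ik}\, y_{jk}
$$

The first two sums correspond to the two `einsum` calls (`x2` and `y2`), and the last sum is entry $(i, j)$ of the matrix product `x @ y.T`. The `np.abs` call presumably guards against small negative values produced by floating-point cancellation.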


### 1. `timeit`

In [5]:
%timeit euclidean_broadcast(x, y)
%timeit euclidean_trick(x, y)

978 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
41.2 ms ± 470 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
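
Outside IPython, the `%timeit` magic is not available, but a similar measurement can be approximated with the standard-library `timeit` module. The sketch below is an assumption-laden illustration, not what `%timeit` does internally: it uses a smaller array (200 rows instead of 2000) to keep the run short, fixes `repeat=7` and `number=10` by hand, and redefines `euclidean_trick` so the block is self-contained.

```python
import timeit

import numpy as np


def euclidean_trick(x, y):
    """Squared Euclidean distance matrix via the expansion trick."""
    x2 = np.einsum('ij,ij->i', x, x)[:, np.newaxis]
    y2 = np.einsum('ij,ij->i', y, y)[np.newaxis, :]
    return np.abs(x2 + y2 - 2.0 * (x @ y.T))


rng = np.random.default_rng(0)
x = 10.0 * rng.random((200, 50))  # smaller than the notebook's arrays
y = 10.0 * rng.random((200, 50))

# 7 runs of 10 calls each, mirroring "7 runs, 10 loops each".
times = timeit.repeat(lambda: euclidean_trick(x, y), repeat=7, number=10)
per_loop = np.array(times) / 10  # seconds per call within each run

print(f"{per_loop.mean() * 1e3:.3f} ms ± {per_loop.std() * 1e3:.3f} ms per loop")
```

The absolute numbers depend on the machine and on the array size, so they will not match the timings shown above.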


### 2. `line_profiler`

In [6]:
# ! pip install line_profiler -q

In [7]:
%load_ext line_profiler

In [8]:
%lprun -f euclidean_broadcast euclidean_broadcast(x, x)

Timer unit: 1e-09 s

Total time: 1.03329 s
File: /tmp/ipykernel_508912/3677175976.py
Function: euclidean_broadcast at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def euclidean_broadcast(x, y):
     2                                               """Euclidean square distance matrix.
     3                                           
     4                                               Inputs:
     5                                               x: (N, m) numpy array
     6                                               y: (N, m) numpy array
     7                                           
     8                                               Ouput:
     9                                               (N, N) Euclidean square distance matrix:
    10                                               r_ij = (x_ij - y_ij)^2
    11                                               """
    12         1  433999775.0 4.34e+08     

In [9]:
%lprun -f euclidean_trick euclidean_trick(x, x)

Timer unit: 1e-09 s

Total time: 0.0434415 s
File: /tmp/ipykernel_508912/2952558958.py
Function: euclidean_trick at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def euclidean_trick(x, y):
     2                                               """Euclidean square distance matrix.
     3                                           
     4                                               Inputs:
     5                                               x: (N, m) numpy array
     6                                               y: (N, m) numpy array
     7                                           
     8                                               Ouput:
     9                                               (N, N) Euclidean square distance matrix:
    10                                               r_ij = (x_ij - y_ij)^2
    11                                               """
    12         1     275873.0 275873.0      0.6  

#### Line Profiler Output Explanation

The `line_profiler` output shows detailed timing information for each line of code:

- **Timer unit**: The time measurement unit (1e-09 s = nanoseconds)
- **Total time**: Total execution time for the function
- **Line #**: Line number in the function
- **Hits**: Number of times that line was executed
- **Time**: Total time spent on that line (in nanoseconds)
- **Per Hit**: Average time per execution of that line
- **% Time**: Percentage of total function time spent on that line

In the `euclidean_broadcast` example:
- Line 12 (broadcasting operation): **about 42% of the time** (0.434 s of the 1.033 s total) - creating the broadcasted difference array
- Line 14 (multiplication and sum): **the remaining ~58% of the time** - computing the element-wise multiplication and summing over the last axis

This helps identify performance bottlenecks at the line level, which is more precise than function-level profiling.
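
As a worked check of how the columns relate, the `% Time` value for a line can be recomputed from the `Time` column, the timer unit, and the total time. The sketch below only uses the three numbers printed in the `euclidean_broadcast` output above; the variable names are chosen here for illustration.

```python
# Values copied from the %lprun output for euclidean_broadcast.
timer_unit = 1e-9             # seconds per tick ("Timer unit: 1e-09 s")
total_time_s = 1.03329        # "Total time"
line12_ticks = 433999775.0    # "Time" column for line 12

line12_s = line12_ticks * timer_unit
line12_pct = 100.0 * line12_s / total_time_s

print(f"line 12: {line12_s:.3f} s = {line12_pct:.1f}% of the function")
# prints "line 12: 0.434 s = 42.0% of the function"
```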

### 3. `cProfile`

In [10]:
%prun -r euclidean_trick(x, x)

 

<pstats.Stats at 0x7f5e259a2410>

         18 function calls in 0.044 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.043    0.043    0.043    0.043 2952558958.py:1(euclidean_trick)
        1    0.001    0.001    0.044    0.044 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 {built-in method numpy._core._multiarray_umath.c_einsum}
        1    0.000    0.000    0.044    0.044 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 einsumfunc.py:1057(einsum)
       10    0.000    0.000    0.000    0.000 einsumfunc.py:1049(_einsum_dispatcher)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
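
In this table, `tottime` is the time spent inside a function excluding the functions it calls, while `cumtime` includes those sub-calls. NumPy operators such as `@` are not Python function calls, so their cost is attributed to `euclidean_trick` itself. The same kind of report can be produced without the `%prun` magic by using `cProfile` and `pstats` directly. The following is a minimal sketch under a few assumptions: the array size (500 rows) is chosen arbitrarily, and `euclidean_trick` is redefined so the block runs on its own.

```python
import cProfile
import io
import pstats

import numpy as np


def euclidean_trick(x, y):
    """Squared Euclidean distance matrix via the expansion trick."""
    x2 = np.einsum('ij,ij->i', x, x)[:, np.newaxis]
    y2 = np.einsum('ij,ij->i', y, y)[np.newaxis, :]
    return np.abs(x2 + y2 - 2.0 * (x @ y.T))


rng = np.random.default_rng(0)
x = 10.0 * rng.random((500, 50))

# Profile a single call.
profiler = cProfile.Profile()
profiler.enable()
euclidean_trick(x, x)
profiler.disable()

# Sort by internal time ("tottime") and keep the five most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(5)
report = buffer.getvalue()
print(report)
```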

### 4. `memory_profiler`

In [11]:
# ! pip install memory_profiler -q

In [12]:
%%bash 
python memprofiler_euclidean_trick.py

Filename: /home/ziaee/git/workshops/workshop_hpcpy/notebooks/profiling/memprofiler_euclidean_trick.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     5     68.0 MiB     68.0 MiB           1   @profile
     6                                         def euclidean_trick(x, y):
     7                                             """Euclidean square distance matrix.
     8                                         
     9                                             Inputs:
    10                                             x: (N, m) numpy array
    11                                             y: (N, m) numpy array
    12                                         
    13                                             Ouput:
    14                                             (N, N) Euclidean square distance matrix:
    15                                             r_ij = (x_ij - y_ij)^2
    16                                             """
    17     68.0 MiB      0.0 MiB     

In [13]:
%%bash 
python memprofiler_euclidean_broadcast.py

Filename: /home/ziaee/git/workshops/workshop_hpcpy/notebooks/profiling/memprofiler_euclidean_broadcast.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     5     67.8 MiB     67.8 MiB           1   @profile
     6                                         def euclidean_broadcast(x, y):
     7                                             """Euclidean square distance matrix.
     8                                         
     9                                             Inputs:
    10                                             x: (N, m) numpy array
    11                                             y: (N, m) numpy array
    12                                         
    13                                             Ouput:
    14                                             (N, N) Euclidean square distance matrix:
    15                                             r_ij = (x_ij - y_ij)^2
    16                                             """
    17   1593.7 MiB   1525.9 
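
The two scripts above are not shown in this notebook. As a cross-check that needs only the standard library, peak allocations can also be compared with `tracemalloc`. This is a different technique from `memory_profiler`: it traces allocations made through Python's allocator hooks (which NumPy reports to) rather than the process's resident memory, so the numbers are not expected to match the tables above. The helper `peak_mib` and the reduced array size (300 rows) are assumptions made for this sketch.

```python
import tracemalloc

import numpy as np


def euclidean_broadcast(x, y):
    """Squared Euclidean distance matrix via explicit broadcasting."""
    diff = x[:, np.newaxis, :] - y[np.newaxis, :, :]
    return (diff * diff).sum(axis=2)


def euclidean_trick(x, y):
    """Squared Euclidean distance matrix via the expansion trick."""
    x2 = np.einsum('ij,ij->i', x, x)[:, np.newaxis]
    y2 = np.einsum('ij,ij->i', y, y)[np.newaxis, :]
    return np.abs(x2 + y2 - 2.0 * (x @ y.T))


def peak_mib(func, *args):
    """Return the peak traced allocation in MiB while func(*args) runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 2**20


rng = np.random.default_rng(0)
x = 10.0 * rng.random((300, 50))  # reduced from 2000 rows to keep the run small
y = 10.0 * rng.random((300, 50))

peak_broadcast = peak_mib(euclidean_broadcast, x, y)
peak_trick = peak_mib(euclidean_trick, x, y)

print(f"broadcast peak: {peak_broadcast:.1f} MiB")
print(f"trick peak:     {peak_trick:.1f} MiB")
```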

#### Memory Profiling Analysis & Discussion

**Observation Task:** Compare the memory profiles from the two scripts above.

**Questions:**
The `euclidean_broadcast` approach creates intermediate arrays through broadcasting operations, while the `euclidean_trick` approach uses `einsum` and matrix multiplication to compute the distance matrix.

1. **Which approach uses more peak memory?** Why do you think that is?
2. **Where is the memory peak occurring in each function?** (Hint: Look at the line-by-line memory increment/decrement)
3. **Can you explain the trade-off between memory usage and execution speed** based on what you observed from both `line_profiler` and `memory_profiler` outputs?

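**Key Insight:** While the broadcasting approach may be more intuitive, it creates a large temporary array of shape `(nrows, nrows, ncols)` with 2000×2000×50 = 200 million elements, which is about 1.5 GiB as float64 and matches the 1525.9 MiB increment reported by `memory_profiler`. The `einsum` trick avoids this by never forming the 3-D array: it only needs the per-row squared norms and the `(nrows, nrows)` matrix product `x @ y.T`.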
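
The size of the temporary array can be worked out directly from the shapes used in this notebook. The short calculation below assumes float64 data (8 bytes per element), which is what `rng.random` returns by default; the variable names are illustrative.

```python
nrows, ncols = 2000, 50
bytes_per_float64 = 8

# Temporary array created by broadcasting: shape (nrows, nrows, ncols).
diff_elements = nrows * nrows * ncols
diff_mib = diff_elements * bytes_per_float64 / 2**20

# Any (nrows, nrows) array, such as the final distance matrix.
result_mib = nrows * nrows * bytes_per_float64 / 2**20

print(f"diff: {diff_elements:,} elements, {diff_mib:.1f} MiB")
# prints "diff: 200,000,000 elements, 1525.9 MiB"
print(f"(N, N) result: {result_mib:.1f} MiB")
# prints "(N, N) result: 30.5 MiB"
```

Each `(nrows, nrows)` array is about 50 times smaller than the broadcasted temporary, which is the factor `ncols`.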