# Measuring resource usage

- CPU, memory, disk and network I/O
- Task Manager on Windows, Activity Monitor on Mac, `top` on Unix
- Every running program is a process (PID)
  - and its subprocesses
- Processes request memory from the OS to store data (variables) and use CPU time to process it
- Under Jupyter, every notebook starts a kernel, which is a Python subprocess
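
The process/subprocess relationship can be seen with a few lines of standard-library Python. This is a minimal sketch: it starts a child Python interpreter and asks it to report its own PID and its parent's PID. The names `child_code`, `child_pid` and `child_ppid` are illustrative only, and on some platforms a launcher may sit between parent and child, so the reported parent PID is not guaranteed to match.

```python
import os
import subprocess
import sys

# PID of the current Python process (the kernel, when run in a notebook).
parent_pid = os.getpid()

# Start a child Python process that prints its own PID and its parent's PID.
child_code = "import os; print(os.getpid(), os.getppid())"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)
child_pid, child_ppid = (int(s) for s in result.stdout.split())

print(f"this process PID: {parent_pid}")
print(f"child PID: {child_pid}, child's parent PID: {child_ppid}")
```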

# Execution time

IPython magics: `%time` and `%timeit`.
Python modules: `time` and `timeit`.

- Wall time: elapsed real time, as a clock on the wall would measure it.
- User time: CPU time spent executing your code.
- System time: CPU time spent executing lower-level OS code on behalf of your program.
- If `wall > user + sys`, the CPU is probably waiting, e.g. for an I/O operation such as a read from disk.
- If `wall < user + sys`, you are using multiple CPU cores.
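
The difference between wall time and CPU time can be reproduced with the `time` module alone. In this sketch `time.perf_counter()` stands in for wall time and `time.process_time()` for user + system CPU time of the current process; the sleep plays the role of waiting for I/O. The sleep duration and loop size are arbitrary choices.

```python
import time

wall_start = time.perf_counter()   # wall-clock timer
cpu_start = time.process_time()    # user + system CPU time of this process

time.sleep(0.5)                             # waiting: wall time passes, almost no CPU is used
total = sum(i * i for i in range(200_000))  # computing: CPU time is used

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"wall: {wall:.2f} s, CPU (user + sys): {cpu:.2f} s")
```

Here wall time should exceed CPU time by roughly the length of the sleep.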

I also highly recommend the [execute-time](https://github.com/deshaw/jupyterlab-execute-time) extension for JupyterLab.

In [None]:
%%time
1 + 1

In [None]:
%%timeit
1 + 1

In [None]:
%%timeit?

In [None]:
import time
# perf_counter() is a monotonic clock intended for measuring intervals
t0 = time.perf_counter()
time.sleep(3)
dt = time.perf_counter() - t0
print(f'Elapsed time: {dt:.1f} seconds')

In [None]:
import timeit

def add():
    return 1 + 1

t = timeit.timeit(add, number=100_000)
print(f'Total execution time: {t:.3f} seconds.')
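
A single `timeit.timeit` call can be noisy because other processes compete for the CPU. One common approach, sketched here, is to repeat the measurement with `timeit.repeat` and look at the fastest run. The repeat count of 5 is an arbitrary choice.

```python
import timeit

def add():
    return 1 + 1

# Five independent runs of 100_000 calls each; each entry is the total seconds for one run.
runs = timeit.repeat(add, number=100_000, repeat=5)
best = min(runs)
print(f"Best of {len(runs)} runs: {best / 100_000 * 1e9:.0f} ns per call")
```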

# OS monitoring utilities

- Windows: Task Manager
- macOS: Activity Monitor
- Unix terminal: `top`
- Binder: total memory at the bottom

Using `top`:
- find process by PID
- look at %CPU and RES (memory)
- `h` for help, `q` to quit
- `f` to customize columns and sorting
- `e` to change memory units
- `u` to filter single user, `o` for other filters, `=` to reset filters
- `V` for tree view, `c` for command format
- `L` to find and highlight a value, eg. PID

If you are in Binder:
- File -> New -> Terminal
- Arrange this notebook and terminal side by side
- Type `top` in terminal to start the utility
- Press `f` and leave highlighted only PID, RES, %CPU and COMMAND with `d`, then `q`.
- Press `e` until memory is in megabytes.

In [None]:
import os
os.getpid()

Use 100% CPU.

In [None]:
while True:
    cpu = 'hard at work'

Create big memory object.

In [None]:
x = [1] * 100_000_000

Memory usage goes up...

In [None]:
x = 1

... and then goes down.

### how Python allocates and releases memory: garbage collection

Do not leave "garbage" that Python cannot automatically "throw away".

In [None]:
x = [1] * 100_000_000
y = x

Memory goes up...

In [None]:
x = 1

... and stays high!

This is because `x = 1` merely rebinds the variable `x` to a new piece of data (a number). `y` still points to the big object (the list of ones), so its memory is not released.

Bind `y` to something new; then no variables reference the big list any more, and it will be garbage collected. In CPython this happens immediately, through reference counting.

In [None]:
y = 2

Memory usage falls again.
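
The same behaviour can be checked without watching `top`. The sketch below uses a weak reference, which observes an object without keeping it alive. `BigList` is a hypothetical helper class, needed only because plain lists cannot be weakly referenced. The immediate release shown here relies on CPython's reference counting; other Python implementations may free the object later.

```python
import weakref

class BigList(list):
    """Trivial list subclass; unlike plain lists, it supports weak references."""

x = BigList(range(1_000))
y = x                     # a second name for the same object
probe = weakref.ref(x)    # observes the object without keeping it alive

x = 1
alive_after_x = probe() is not None   # the object is still referenced by y

y = 2
alive_after_y = probe() is not None   # no references are left

print("alive after rebinding x:", alive_after_x)
print("alive after rebinding y:", alive_after_y)
```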

It might be obvious in this example, but think about real code where you have multiple dataframes holding temporary and intermediate results. This is why organizing your code into functions is useful: objects referenced only by a function's local variables are released when the function returns.

In [None]:
def use_memory(seconds):
    print('start')
    z = [1] * 100_000_000
    time.sleep(seconds)
    print('finish')

use_memory(5)

Note: Jupyter notebooks (more precisely, IPython) may hold references to your large objects where you don't expect them. For example, evaluating `df` as the last expression of a cell with execution count N stores a reference to that dataframe in `_N`, `_oh[N]` and `Out[N]` (see [IPython docs](https://ipython.readthedocs.io/en/stable/interactive/reference.html#output-caching-system)). Even if you later reassign `df` to something else, the dataframe will still hang in memory. Takeaway: if memory is a concern, never evaluate a bare `df` to view the dataframe (use `df.head()` or `df.sample()` instead), and restart the kernel to start afresh.

In [None]:
# remember execution count of this cell
x = [1] * 100_000_000
x

In [None]:
x = 1

Memory usage does not change, even though we reassigned `x`. The long list is still referenced by the output cache:

In [None]:
# replace with execution count from above
len(_oh['execution count of cell with large output'])

# numpy arrays and pandas DataFrames

In [None]:
import numpy as np
import pandas as pd

x = np.ones(1_000_000)
x.nbytes

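Always use `deep=True` with DataFrames that hold string data. Without it, object columns are counted by the size of their pointers only, not by the size of the strings they point to.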

In [None]:
df = pd.read_csv('data/synig/2020.csv')
print(df.shape)
shallow = df.memory_usage()
deep = df.memory_usage(deep=True)
(pd.DataFrame({'shallow': shallow, 'deep': deep}) / 10**6).round(1)

In [None]:
print('Total memory (shallow), MB:', shallow.sum() / 1_000_000)
print('Total memory (deep), MB:', deep.sum() / 1_000_000)
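
The shallow versus deep distinction is not specific to pandas. As a rough analogue using only the standard library, `sys.getsizeof` on a list reports the list object itself (header plus pointers), while the strings it points to have to be added separately. The `words` list is made-up example data.

```python
import sys

# Made-up example data: a list of 1000 short strings.
words = [f"word-{i}" for i in range(1_000)]

# Shallow: the list object only, i.e. its header and the pointers to the items.
shallow = sys.getsizeof(words)

# Deep: the list object plus every string it points to.
deep = shallow + sum(sys.getsizeof(w) for w in words)

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```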

# resources used by a process

[`psutil` docs](https://psutil.readthedocs.io)
> psutil (python system and process utilities) is a cross-platform library for retrieving information on running processes and system utilization

In [None]:
import psutil
from psutil._common import bytes2human

p = psutil.Process(pid=None)

CPU usage

In [None]:
# the first call returns a meaningless 0.0; later calls report usage since the previous call
p.cpu_percent()

Memory usage.

In [None]:
p.memory_info()

... in human-readable format.

In [None]:
bytes2human(p.memory_info().rss)

Disk I/O operations: reading and writing (not available on macOS).

In [None]:
p.io_counters()

In [None]:
print('Memory before:', bytes2human(p.memory_info().rss))

In [None]:
x = [1] * 100_000_000
print('Memory after:', bytes2human(p.memory_info().rss))

In [None]:
x = 1
print('And back again:', bytes2human(p.memory_info().rss))

#### exercise

We can use `psutil` to monitor another process if we know its PID.

Open a new notebook (Notebook B) and arrange it side by side with this one (Notebook A). Create a `psutil.Process` object in Notebook B that will monitor the kernel of Notebook A. Run the cells below (they use CPU and memory) and check resource usage using your `Process` object in Notebook B.

In [None]:
while True:
    cpu = 'hard at work'

In [None]:
def use_memory(seconds):
    print('start')
    z = [1] * 100_000_000
    time.sleep(seconds)
    print('finish')

use_memory(10)

# `tools.ResourceMonitor`

I have wrapped `psutil` functionality in a Python class that continuously records resource usage at a given interval and reports the results (see `tools.py`). We will be using it in this workshop, and you can use it in your own work. Or get inspired to make a better one. :)

Interface:
- `ResourceMonitor(pid, interval)`: create new monitor for a process
- `start()`: start monitoring
- `stop()`: stop monitoring
- `tag(label)`: tag a moment of execution
- `df`: dataframe with usage history
- `plot()`: visualize CPU, memory and disk I/O
- `dump(filepath)` and `load(filepath)`: save and open monitoring results on disk
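
To give an idea of how such a monitor can work, here is a minimal sketch of the general technique: a background thread that samples a measurement at a fixed interval. This is not the code from `tools.py`. `MiniMonitor` and all of its attributes are hypothetical, and it samples `time.process_time()` from the standard library instead of using `psutil`.

```python
import threading
import time

class MiniMonitor:
    """Hypothetical sketch: call `sample()` at a fixed interval in a background thread."""

    def __init__(self, sample, interval=0.1):
        self.sample = sample        # callable returning one measurement
        self.interval = interval    # seconds between samples
        self.history = []           # list of (timestamp, value)
        self.tags = []              # list of (timestamp, label)
        self._stop = threading.Event()
        self._thread = None

    def _run(self):
        # Record a sample, then wait for the interval or until stop() is called.
        while not self._stop.is_set():
            self.history.append((time.time(), self.sample()))
            self._stop.wait(self.interval)

    def start(self):
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def tag(self, label):
        self.tags.append((time.time(), label))


mon = MiniMonitor(sample=time.process_time, interval=0.05)
mon.start()
time.sleep(0.2)
mon.tag('busy')
total = sum(i * i for i in range(500_000))
mon.stop()
print(f'{len(mon.history)} samples, tags: {[label for _, label in mon.tags]}')
```

Running the sampler in a separate thread lets the main code run undisturbed while measurements accumulate in `history`.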

In [None]:
from tools import ResourceMonitor

from tempfile import TemporaryFile

def use_cpu(t):
    t0 = time.time()
    while time.time() - t0 < t:
        x = 1

def use_mem(s, n):
    x = []
    for _ in range(n):
        x += [1] * s * 1_000_000
        time.sleep(1)

def write(f, size_mb):
    size = size_mb * 2**20
    count = 0
    block_size = 8 * 2**10
    data = b'a' * block_size
    f.seek(0)
    while count < size:
        count += f.write(data)
        f.flush()

def read(f):
    block_size = 8 * 2**10
    f.seek(0)
    while f.peek():
        f.read(block_size)

mon = ResourceMonitor(interval=0.2)
mon.start()
time.sleep(1)
mon.tag('cpu v')
use_cpu(2)
mon.tag('cpu ^')
time.sleep(1)
mon.tag('mem v')
use_mem(30, 2)
mon.tag('mem ^')
time.sleep(1)
with TemporaryFile() as tf:
    mon.tag('write v')
    write(tf, 1000)
    mon.tag('write ^')
    time.sleep(1)
    mon.tag('read v')
    read(tf)
    mon.tag('read ^')
time.sleep(1)
mon.stop()
mon.plot()

# more...

Many more tools exist for monitoring and profiling. Check out these if you need finer measurement of individual pieces of your code.

- [`cProfile`](https://docs.python.org/3/library/profile.html): standard Python profiler of execution time.
- [`snakeviz`](https://jiffyclub.github.io/snakeviz/): graphical interface to `cProfile`.
- [`line_profiler`](https://github.com/pyutils/line_profiler): for line-by-line profiling.
- [`tracemalloc`](https://docs.python.org/3/library/tracemalloc.html) and [`resource`](https://docs.python.org/3/library/resource.html): respectively, trace memory blocks allocated by Python code and query resource usage of the whole process (Unix only).
- [`memory_profiler`](https://github.com/pythonprofilers/memory_profiler): line-by-line memory usage.
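
As a small taste of these tools, here is a minimal `tracemalloc` sketch from the standard library. It measures memory allocated by Python between `start()` and `get_traced_memory()`. The `data` list is made-up example data, and the exact numbers will vary between Python versions.

```python
import tracemalloc

tracemalloc.start()

# Made-up workload: allocate many small string objects.
data = [str(i) for i in range(100_000)]

# Bytes currently allocated, and the peak, since start() was called.
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```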