In [None]:
%load_ext autoreload
%autoreload 2
import io_backend as io

**Note: this notebook requires `io_backend.py` and `compModel.js` to be in the same directory**

Sequential Flooding: An example of better buffer management
--------------------------------------------------------

A _buffer manager_ is the system component that handles the operations supporting the buffer.
* For example, a buffer manager should handle **eviction** of pages from the buffer when space needs to be freed up for reading in new pages.

In general the OS has a built-in buffer manager; however, most DBMSs implement their own.  *Why?*

**Key concept:** A DBMS can often implement *more efficient* buffer management because it knows which operations it is carrying out.

**Example: _Sequential flooding_**

In many situations a DBMS will want to loop over an (ordered) list of pages repeatedly (for example, when carrying out a *join*; see the next lectures).

In general, most OS buffer managers will have a **least recently used (LRU)** eviction policy, which means that the least recently used page\* will be evicted if more space is needed in the buffer.

\* *[Note: here we define "used" to mean any sort of action involving the page, for simplicity!]*

Let's see what happens under this policy when we have a buffer of size $B$ and want to loop over $B+1$ pages $M$ times:

In [None]:
# Create pages & flush to disk
# NOTE: set display mark so that we don't bother displaying this
def init_pages(b, n_pages):
    file_id = b.new_file()
    for i in range(n_pages):
        page = b.new_page(file_id)
        b.flush(page)
    b.display_set_mark()
    return file_id

# Read a single page; if the buffer is full, first release the page chosen by the LRU/MRU policy
def get_next_page(b, fid, pid, eviction_method='LRU'):
    try:
        page = b.read(fid, pid)
    except io.BufferMemoryException:
        old_page, old_buffer_idx = b.get_buffer_page(eviction_method)
        b.release(old_page)
        page = b.read(fid, pid)
    return page

We will do the following $M$ times:
* For $i$ in range $B+1$:
    * If page $i$ is already in buffer, read from buffer (*fast!*)
    * Else:
        * If buffer **is full**:
            * **Evict** (i.e. flush to disk) the LRU page (*slow!*)
        * Read page $i$ from disk -> buffer (*slow!*), then read from buffer
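
The steps above can also be sketched without `io_backend` as a small standalone simulation. This is only an illustrative model: the function `lru_trace` and its list-based recency tracking are assumptions made for this sketch, not part of the notebook's backend, and "evicting" here simply drops the page id rather than flushing anything to disk.

```python
def lru_trace(buffer_size, n_pages, n_passes):
    """Return a list of (page_id, 'hit' | 'miss') for repeated ordered scans under LRU.

    Hypothetical helper for illustration; not part of io_backend.
    """
    buffer = []  # page ids ordered from least to most recently used
    trace = []
    for _ in range(n_passes):
        for pid in range(n_pages):
            if pid in buffer:
                # Already buffered: a fast read that just refreshes recency
                buffer.remove(pid)
                trace.append((pid, 'hit'))
            else:
                if len(buffer) >= buffer_size:
                    buffer.pop(0)  # evict the least recently used page
                trace.append((pid, 'miss'))  # slow read from disk
            buffer.append(pid)  # pid is now the most recently used page
    return trace

trace = lru_trace(buffer_size=3, n_pages=4, n_passes=3)
misses = sum(1 for _, outcome in trace if outcome == 'miss')
print(misses, "disk reads out of", len(trace), "page requests")
```

Printing `trace` itself shows, request by request, which reads were served from the buffer and which went to disk.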

In [None]:
EVICTION_METHOD = 'LRU'
BUFFER_SIZE = 3
N_PAGES = BUFFER_SIZE + 1
M = 3

# Initialize buffer
b = io.Buffer(buffer_size=BUFFER_SIZE, page_size=1, buffer_queue_indicator=EVICTION_METHOD)
file_id = init_pages(b, N_PAGES)

# Do M ordered passes over the N pages
for i in range(M):
    for pid in range(N_PAGES):
        page = get_next_page(b, file_id, pid, eviction_method=EVICTION_METHOD)

Can you see what happens?

In [None]:
# Visualize what we just did
b.display(speed=1000)

We see that the problem in our example above is that, once the buffer is full, **the LRU page which we evict is always _exactly the page we are going to want to read next!_**  Thus, we end up having to read in a page from disk _for every page we read from the buffer_.  This seems like an incredibly pointless use of a buffer!

What happens if $N > B+1$?  Play around with the demo above to see!

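We see that with LRU eviction, looping through pages costs $O(M*N)$ disk IOs **for all $N>B$**: the buffer only ever holds the $B$ most recently read pages, and in an ordered loop over $N>B$ pages the next page we need is never one of them.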

What happens on the other hand if we use a **most recently used (MRU)** eviction policy?

In [None]:
EVICTION_METHOD = 'MRU'
BUFFER_SIZE = 3
N_PAGES = BUFFER_SIZE + 1
M = 3

# Initialize buffer
b = io.Buffer(buffer_size=BUFFER_SIZE, page_size=1, buffer_queue_indicator=EVICTION_METHOD)
file_id = init_pages(b, N_PAGES)

# Do M ordered passes over the N pages
for i in range(M):
    for pid in range(N_PAGES):
        page = get_next_page(b, file_id, pid, eviction_method=EVICTION_METHOD)
        
# Visualize what we just did; a different buffer_num must be specified so that it displays in its own pane
b.display(speed=1000, buffer_num=1)

Note that we have _the same number of reads from buffer_ as with LRU (indicating that we read in the same data) but **half the reads from disk!**

This is now $O(M+N-1)$ for $N=B+1$,

and $O((N-B)*(M-1) + N) \approx O((N-B)*M)$ in general for all $N>B$ (can you explain why?)

I.e. for some fixed $N$ and $B$, MRU saves roughly $B*M$ disk reads compared to LRU, which is **a speedup of roughly a factor of $N/(N-B)$!**
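
One way to sanity-check these counts without the `io_backend` visualizer is a tiny standalone simulation. The sketch below is an assumption-laden model rather than the backend's actual logic: `count_disk_reads` is a hypothetical helper, recency is tracked with a `collections.OrderedDict`, and a buffer hit counts as a "use" of the page, as in the note above.

```python
from collections import OrderedDict

def count_disk_reads(buffer_size, n_pages, n_passes, policy='LRU'):
    """Count disk reads for n_passes ordered scans over n_pages pages.

    Hypothetical helper for illustration; not part of io_backend.
    """
    buffer = OrderedDict()  # keys ordered from least to most recently used
    disk_reads = 0
    for _ in range(n_passes):
        for pid in range(n_pages):
            if pid in buffer:
                buffer.move_to_end(pid)  # buffer hit: refresh recency only
                continue
            if len(buffer) >= buffer_size:
                # last=False pops the LRU entry, last=True pops the MRU entry
                buffer.popitem(last=(policy == 'MRU'))
            buffer[pid] = True  # new keys are appended at the MRU end
            disk_reads += 1
    return disk_reads

for policy in ('LRU', 'MRU'):
    print(policy, count_disk_reads(buffer_size=3, n_pages=4, n_passes=3, policy=policy))
```

With the demo's parameters ($B=3$, $N=4$, $M=3$) this sketch reports 12 disk reads for LRU and 6 for MRU. In this simplified model the MRU count can come out slightly above the closed-form estimate for other values of $M$ (for example 8 rather than 7 at $M=4$), so the formulas are best read as order-of-growth estimates.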

In [None]:
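%matplotlib inline
import matplotlib.pyplot as plt

passes = range(1, 100)  # number of passes M over the file
B = 3
N = B + 1

# Estimated disk reads per policy; lists are used because Python 3's map() returns a lazy iterator
lru_reads = [m * N if N > B else N for m in passes]
mru_reads = [(N - B) * (m - 1) + N if N > B else N for m in passes]
plt.plot(passes, lru_reads, label='LRU')
plt.plot(passes, mru_reads, label='MRU')
plt.xlabel('Number of passes M')
plt.ylabel('Disk reads')
plt.legend()
plt.show()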