Measuring Disk Bandwidth
----------------------------

We measure how fast data can be written to and read from disk, in megabytes per second (MB/s, where 1 MB = 10^6 bytes).  Disk bandwidth is strongly affected by the following factors:

1.  Reading and writing a few large blocks of data is faster than
    reading and writing many small blocks of data
2.  Reading is faster than writing
3.  Solid state drives are faster than spinning disks, *especially for
    many small reads/writes*

In this notebook we measure the impact of file size on disk bandwidth for both reading and writing.

In [None]:
%load_ext watermark
%watermark -a "name ? sequence number: " \
   -d -v -m -p numpy

In [None]:
import numpy as np
import os
from time import time, sleep

In [None]:
import shutil
if os.path.exists('tmp'):
    shutil.rmtree('tmp')
os.mkdir('tmp')

Write Bandwidth
-----------------

We create `n` random bytes using numpy.  Each random float occupies 8 bytes, so we generate `n / 8` of them:

    data = np.random.random(int(n / 8)).tobytes()
    
We write these bytes to a number of files: only a few files for large file sizes, and many (up to 1000) for small file sizes.  For each file size we divide the total number of bytes written by the total runtime to get the bandwidth.

In [None]:
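nk = 2**28  # size of the largest file in bytes (256 MiB)
# (file size in bytes, number of files) pairs; sizes halve from 2**28 down to 2**10
nks = [(int(nk / k), min(1000, k)) for k in 2**np.arange(0, 19)]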
nks
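
To make the schedule above concrete, the following sketch rebuilds the same (file size, file count) pairs with plain Python integers and reports how many bytes are written in total at each size.  The names `schedule` and `total_bytes` are illustrative and do not appear elsewhere in this notebook.

```python
# Rebuild the (file_size_bytes, file_count) schedule with plain ints.
nk = 2**28  # size in bytes of the largest file (256 MiB)

schedule = [(nk // 2**i, min(1000, 2**i)) for i in range(19)]


def total_bytes(schedule):
    """Map each file size to the total number of bytes written at that size."""
    return {n: n * k for n, k in schedule}


totals = total_bytes(schedule)
for n, k in schedule:
    print(f"{k:>5} file(s) of {n:>10} bytes -> {totals[n] / 1e6:8.1f} MB total")
```

For the ten largest sizes the total stays at 2**28 bytes.  After that the 1000-file cap makes the total shrink, down to roughly 1 MB for 1024-byte files, so the measurements for small files rest on far less data.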

In [None]:
def median(L):
    return sorted(L)[len(L) // 2]  # asymptotically inefficient
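
The loops in this notebook time each file size only once, so `median` is not used by them.  One possible use, sketched here as an assumption rather than as the notebook's method, is to repeat a measurement several times and keep the median runtime, which is less sensitive to a single slow run.  The helper `median_runtime` and its `repeats` parameter are hypothetical.

```python
from time import perf_counter


def median(L):
    return sorted(L)[len(L) // 2]  # upper median for even-length lists


def median_runtime(func, repeats=5):
    """Call func() `repeats` times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        func()
        timings.append(perf_counter() - start)
    return median(timings)


# Example: time a small in-memory workload (a stand-in for a disk operation)
elapsed = median_runtime(lambda: sum(range(100_000)))
print(f"median runtime: {elapsed:.6f} s")
```

Repeating disk reads without clearing the cache between runs would mostly measure RAM, so this pattern suits writes better than reads.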

In [None]:
%%time
write_bandwidths = dict()
for n, k in nks:
    data = np.random.random(int(n / 8)).tobytes()
    filenames = [os.path.join('tmp', '%d-%d' % (n, i)) for i in range(k)]   

    start = time()
    for fn in filenames:
        with open(fn, 'wb') as f:
            f.write(data)
            f.flush()  # push Python's internal buffer to the operating system
            os.fsync(f.fileno())  # force the OS to write to disk (avoids file system caching)
    end = time()
    
    write_bandwidths[n] = (n*k) / (end - start) / 1e6
    
    sleep(1)  # let things settle between runs
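
The timing pattern in the loop above can be factored into small helpers.  This is a minimal sketch: `timed_write` and `mb_per_second` are hypothetical names, and the sketch writes one temporary file instead of the notebook's `tmp` directory.  It calls `f.flush()` before `os.fsync` because Python buffers small writes in memory, and `os.fsync` only acts on data the operating system has already received.

```python
import os
import tempfile
from time import perf_counter


def timed_write(path, data):
    """Write data to path, force it to disk, and return the elapsed seconds."""
    start = perf_counter()
    with open(path, 'wb') as f:
        f.write(data)
        f.flush()             # move data from Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to commit the data to the device
    return perf_counter() - start


def mb_per_second(nbytes, seconds):
    """Convert a byte count and a duration into MB/s (1 MB = 1e6 bytes)."""
    return nbytes / seconds / 1e6


with tempfile.TemporaryDirectory() as tmpdir:
    payload = os.urandom(2**20)  # 1 MiB of random bytes
    path = os.path.join(tmpdir, 'sample.bin')
    seconds = timed_write(path, payload)
    written = os.path.getsize(path)

print(f"{mb_per_second(written, seconds):.1f} MB/s")
```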

In [None]:
write_bandwidths  # MB/s

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline
x = sorted(write_bandwidths.keys())
y = [write_bandwidths[k] for k in x]
plt.figure(figsize=(12, 6))
plt.title("Write Bandwidths by File Size")
plt.xlabel("File size (bytes)")
plt.ylabel("Bandwidth (MB/s)")
plt.semilogx(x, y)

### Clear file system buffers

Warning: your operating system likely caches recently written files in RAM.  For accurate read measurements you need to clear these caches before running the code below.  A simple way to do so is to restart your machine.

Alternatively, your operating system likely has a mechanism to clear the cache.  On Ubuntu 14.04 I use the following (see http://ubuntuforums.org/showthread.php?t=589975):

```
$ sudo su 
# sync
# echo 3 > /proc/sys/vm/drop_caches
```
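
If root access is not available, a per-file alternative on Linux and some other Unix systems is `os.posix_fadvise` with `POSIX_FADV_DONTNEED`.  The sketch below is an assumption about what may work on your system and not an equivalent of `drop_caches`: the call is only a hint, the kernel may keep the pages, and pages that have not yet been written to disk are not dropped.  The helper name `evict_from_page_cache` is hypothetical.

```python
import os
import tempfile


def evict_from_page_cache(path):
    """Ask the kernel to drop cached pages for one file.

    Returns True if the hint was issued, False if this platform lacks
    os.posix_fadvise. The file should already have been synced to disk.
    """
    if not hasattr(os, 'posix_fadvise'):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, len=0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True


# Demonstrate on a small temporary file that is synced before the hint
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'x' * 4096)
    f.flush()
    os.fsync(f.fileno())
    demo_path = f.name

issued = evict_from_page_cache(demo_path)
os.remove(demo_path)
print("hint issued" if issued else "posix_fadvise not available")
```

In this notebook the helper would be applied to every file in the `tmp` directory before the read loop.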

### Read bandwidths

We read back the files written above and time the reads in the same way.

In [None]:
read_bandwidths = dict()

for n, k in nks:
    filenames = [os.path.join('tmp', '%d-%d' % (n, i)) for i in range(k)]   

    start = time()
    for fn in filenames:
        with open(fn, 'rb') as f:
            _ = f.read()
    end = time()

    read_bandwidths[n] = (n * k) / (end - start) / 1e6
    
    sleep(1)  # let things settle between runs

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline
x = sorted(read_bandwidths.keys())
read = [read_bandwidths[k] for k in x]
write = [write_bandwidths[k] for k in x]
plt.figure(figsize=(12, 6))
plt.title("Bandwidths by File Size")
plt.xlabel("File size (bytes)")
plt.ylabel("Bandwidth (MB/s)")
plt.semilogx(x, read, label='read (MB/s)', color='blue')
plt.semilogx(x, write, label='write (MB/s)', color='green')
plt.legend(loc='best')

In [None]:
read_bandwidths
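
To check the claim that reading is faster than writing, the two result dictionaries can be compared size by size.  The helper below is a sketch with a hypothetical name, `read_write_ratio`; it expects two dictionaries shaped like `read_bandwidths` and `write_bandwidths` (file size in bytes mapped to MB/s).

```python
def read_write_ratio(read_bw, write_bw):
    """Return {file_size: read MB/s divided by write MB/s} for sizes present in both."""
    return {
        n: read_bw[n] / write_bw[n]
        for n in sorted(read_bw)
        if n in write_bw and write_bw[n] > 0  # skip missing or zero write results
    }
```

Calling `read_write_ratio(read_bandwidths, write_bandwidths)` after both loops have run gives one ratio per file size.  Values above 1 mean reads were faster at that size; very large values may indicate that the reads were served from the RAM cache instead of the disk.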