In [None]:
%pip install numpy h5py

## Writing Scientific Data to Binary Files

Text files work well for small datasets, but most scientific data is too large and complex to work with easily as text.  Binary files give us more flexibility, because they let us store data however we want.

### Numpy Arrays Review: Arrays of DTypes

In scientific work, the most important kind of binary data is the array.  Controlling how arrays are stored, both on disk and in memory, affects how precise, compact, and performant our analyses are.  Since all elements of an array are stored the same way (i.e. they have the same `dtype`), all values get processed the same way in calculations.

| Code | Description |
| :-- | :-- |
| **`np.array([1, 2, 3])`** | Make an array from a list of values |
| **`np.array(..., dtype=np.uint8)`** | Make an array of 8-bit, unsigned integers |
| **`new_x = x.astype(np.uint8)`** | Make a copy of the array that has 8-bit, unsigned integers |
| **`x.nbytes`** | How many bytes in memory an array takes up. |
| **`x.size`** | How many elements an array contains. |

What do these dtypes represent?  Here are some examples to help parse them:

| DType | Value Stored | Supports Negative Numbers | N Bits | N Bytes | Commonly Used With |
| :-- | :-- | :-- | :-- | :-- | :-- |
| **`float32`** | scientific | yes | 32 | 4 | GPU Math |
| **`float64`** | scientific | yes | 64 | 8 | Any Decimal Numbers |
| **`int64`** | whole numbers | yes | 64 | 8 | Any Whole Numbers |
| **`uint8`** | whole numbers | no | 8 | 1 | Image Pixel Values for 8-bit images |
| **`bool`** | true or false | no | 8 | 1 | Filtering Masks |
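
The sizes of these dtypes can also be checked by asking numpy directly. The following is a minimal sketch; the `sizes` dictionary is just an illustrative name, and `itemsize` reports the number of bytes one element occupies.

```python
import numpy as np

names = ["float32", "float64", "int64", "uint8", "bool"]

# itemsize is the number of bytes used to store a single element
sizes = {name: np.dtype(name).itemsize for name in names}

for name, nbytes in sizes.items():
    print(f"{name:8s} {nbytes} byte(s) = {nbytes * 8} bits")
```

Note that `bool` still occupies a full byte per element, even though a true/false value only needs one bit of information.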


In [1]:
import numpy as np

**Exercises**: Let's practice making arrays with different dtypes.

**Example**: Store the values `[1, 2, 3]` as 64-bit integers. How many bytes does the array take up in memory?

In [2]:
x = np.array([1, 2, 3], dtype=np.int64)
x, x.dtype, x.nbytes

(array([1, 2, 3]), dtype('int64'), 24)

**Exercise**: Store the values `[1, 2, 3]` as 16-bit integers. How many bytes does the array take up in memory?

In [5]:
x = np.array([1, 2, 3], dtype=np.int16)
x, x.nbytes

(array([1, 2, 3], dtype=int16), 6)

**Exercise**: `float` dtypes are meant to store decimal values, and `int` dtypes are meant to store whole numbers.  What happens if a decimal value is stored as an `int`?  Store the values `[1.2, 3.4, 5.6, 7.8]` as 32-bit integers.  What happens to the values?

In [8]:
x = np.array([1.2, 3.4, 5.6, 7.8], dtype=np.int32)
x

array([1, 3, 5, 7], dtype=int32)

**Exercise**: The fewer the bits, the smaller the range of values that can be stored in them.  Let's try exceeding a limit by storing the values `[253, 254, 255, 256]` as unsigned, 8-bit integers.  What error do we get?  What's the highest number that can be stored as a `uint8` value?

In [10]:
np.array([253, 254, 255, 256], np.uint8)

OverflowError: Python integer 256 out of bounds for uint8
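
Rather than finding the limits by trial and error, the allowed range of an integer dtype can be looked up with `np.iinfo`. This is a small sketch; `uint8_max` is only an illustrative variable name.

```python
import numpy as np

# Print the smallest and largest value each integer dtype can hold
for dt in (np.uint8, np.int8, np.int16, np.int32):
    info = np.iinfo(dt)
    print(f"{info.dtype}: {info.min} to {info.max}")

uint8_max = np.iinfo(np.uint8).max
```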

**Exercise**: The fewer the bits, the worse the precision of floating point values.  Note the three calculations below: Which of the three is incorrect?

In [11]:
np.float64(10) / np.float64(0.001), \
np.float32(10) / np.float32(0.001), \
np.float16(10) / np.float16(0.001)

(np.float64(10000.0), np.float32(10000.0), np.float16(9990.0))
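
To see where the error in the 16-bit result comes from, it can help to look at how precisely each float dtype stores numbers. The sketch below uses `np.finfo` for that; `stored` is an illustrative name for the value that actually ends up in memory when `0.001` is requested as a `float16`.

```python
import numpy as np

# eps is the gap between 1.0 and the next representable number
for dt in (np.float16, np.float32, np.float64):
    info = np.finfo(dt)
    print(f"{info.dtype}: {info.bits} bits, eps={info.eps}, ~{info.precision} decimal digits")

# 0.001 cannot be represented exactly in 16 bits, so a nearby value is stored
stored = float(np.float16(0.001))
print(stored)
```

Since the divisor is already slightly off before the division happens, the quotient is off as well.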

### Serializing Numpy arrays directly to a binary File

Can we create our own binary files?  Of course!  Below, we'll make numpy arrays and play around with serializing them to binary files and reading them back.

| **Code** | Description |
| :-- | :-- |
| **`x = np.arange(10).astype(np.uint8)`** | Create an array `x` of 10 values (0-9) where each is stored as unsigned, 8-bit integers |
| <code>f = open('data.dat', 'wb')<br>f.write(x.tobytes())<br>f.close()</code> | Write the bytestring representation of the array to a binary file. |
| <code>f = open('data.dat', 'rb')<br>x = np.frombuffer(f.read(), dtype=np.uint8)<br>f.close()</code> | Read the bytestring data from a binary file and interpret it as a numpy array. |



In [12]:
import numpy as np

**Example**: Write 5 16-bit integers to the binary file `x1.dat`.

In [13]:
x = np.arange(5).astype(np.int16)
f = open('x1.dat', 'wb')
f.write(x.tobytes())
f.close()
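
The open/write/close pattern can also be written with a `with` block, which closes the file automatically even if an error occurs. Below is a hedged sketch of that variant; the temporary `path` is a hypothetical location chosen so the example does not overwrite any exercise files. It also shows `ndarray.tofile` and `np.fromfile`, which combine the file handling and byte conversion in one call.

```python
import os
import tempfile
import numpy as np

x = np.arange(10).astype(np.uint8)
path = os.path.join(tempfile.mkdtemp(), "data.dat")  # hypothetical temp file

# Write: the file is closed automatically when the with-block ends
with open(path, "wb") as f:
    f.write(x.tobytes())

# Read: the dtype must be given, because the file only contains raw bytes
with open(path, "rb") as f:
    y = np.frombuffer(f.read(), dtype=np.uint8)

# Same round trip using numpy's own file helpers
x.tofile(path)
z = np.fromfile(path, dtype=np.uint8)
```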

Write 8 8-bit, unsigned integers to the binary file `x2.dat`.

In [14]:
x = np.arange(8).astype(np.uint8)
f = open('x2.dat', 'wb')
f.write(x.tobytes())
f.close()

Open up the `x2.dat` file in your text editor (you'll have to insist that it's readable as text, even though it's really not).  What does the data look like?  Is it recognizable?  More importantly, how many characters are there in the file, and does this seem like the right number of bytes?  (Tip: You can count characters by highlighting them all in the VSCode text editor; the bottom-right corner will show how many are selected.)

Write 8 64-bit, unsigned integers to the binary file `x3.dat`.

In [15]:
x = np.arange(8).astype(np.uint64)
f = open('x3.dat', 'wb')
f.write(x.tobytes())
f.close()

Open up the `x3.dat` file in your text editor (you'll have to insist that it's readable as text, even though it's really not).  What does the data look like?  Is it recognizable?  More importantly, does it seem like it's got **more** data than the `x2.dat` file?  It should take up 8x as much space, because 64 bits is 8 times as much space as 8 bits.

Read in the `x2.dat` file back as a Numpy array, and make sure it was read back correctly  (note: you'll have to tell numpy what dtype the data should be read as). 

In [20]:
# Read the raw bytes, closing the file afterwards
with open('x2.dat', 'rb') as f:
    data = f.read()
np.frombuffer(data, dtype=np.uint8)

array([0, 1, 2, 3, 4, 5, 6, 7], dtype=uint8)

Read in the `x3.dat` file back as a Numpy array. 

In [22]:
# Read the raw bytes, closing the file afterwards
with open('x3.dat', 'rb') as f:
    data = f.read()
np.frombuffer(data, dtype=np.uint64)

array([0, 1, 2, 3, 4, 5, 6, 7], dtype=uint64)
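
Why does the dtype have to be supplied when reading? A raw binary file stores only bytes, so the same bytes can be interpreted in more than one way. The sketch below works on an in-memory bytestring (the same kind of bytes that were written to `x3.dat`) and reads it back with two different dtypes; the variable names are illustrative.

```python
import numpy as np

raw = np.arange(8, dtype=np.uint64).tobytes()  # 8 values x 8 bytes each

as_uint64 = np.frombuffer(raw, dtype=np.uint64)  # the intended interpretation
as_uint8 = np.frombuffer(raw, dtype=np.uint8)    # wrong dtype: no error, just different data

print(len(raw), as_uint64.size, as_uint8.size)
```

Numpy raises no error in the second case, which is why the dtype has to be recorded somewhere outside the file.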

### Writing Numpy Arrays to NPY and NPZ Files using Numpy

It's challenging to read in data files when you don't know ahead of time how the data is actually stored.  Numpy provides two formats, `.npy` and `.npz`, that make reading data into Numpy easier by putting the data's format (its dtype and shape) directly into the file itself.  This takes up just a little more space on disk, but it makes the data much easier to work with!

| **Code** | **Description** |
| :-- | :-- |
| **`np.random.random(size=100).astype(np.float32)`** | Create 100 random float32 values between 0 and 1 |
| **`np.random.randint(1, 10, size=20).astype(np.uint8)`** | Create 20 random 8-bit, unsigned integers from 1 to 9 (the upper bound is exclusive) |
| **`np.save('data.npy', data)`** | Save the `data` array to the `data.npy` file |
| **`np.savez('data.npz', x=x1, y=y1)`** | Save the `x1` and `y1` arrays as `x` and `y` variables in the `data.npz` file |
| **`data = np.load('data.npy')`** | Load the `data` array from the `data.npy` file |

In [None]:
# %pip install numpy

In [55]:
import numpy as np

**Exercises**

**Example**: Create 10 random 16-bit floating numbers and save them to `x1.npy`.

In [56]:
x = np.random.random(size=10).astype(np.float16)
np.save('x1.npy', x)

**Example**: Load the data in `x1.npy`.  Was it saved correctly?

In [57]:
np.load('x1.npy')

array([0.658   , 0.001224, 0.4019  , 0.788   , 0.0731  , 0.0832  ,
       0.4272  , 0.5933  , 0.5312  , 0.369   ], dtype=float16)

Create 15 random 32-bit integers between 100 and 200 and save them to `x2.npy`.

In [30]:
x = np.random.randint(100, 200, dtype=np.int32, size=15)
np.save('x2.npy', x)

Read in `x2.npy`.  Was it saved correctly?

In [31]:
np.load('x2.npy')

array([107, 140, 122, 182, 199, 102, 108, 128, 120, 122, 124, 117, 154,
       197, 187], dtype=int32)

In a text editor, take a look at how the `x2.npy` file was saved. (Note: you may have to "insist" that the data can be read in the text editor). Can you find where the data's schema is stored in the file?
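
The schema can also be read programmatically. The following is a sketch using helper functions from `numpy.lib.format`; it saves to an in-memory `BytesIO` buffer instead of `x2.npy`, and uses fixed values as a stand-in for the random data, so that the example is self-contained.

```python
from io import BytesIO
import numpy as np
from numpy.lib import format as npy_format

x = np.arange(100, 115, dtype=np.int32)  # stand-in for the 15 random integers
buf = BytesIO()
np.save(buf, x)
total_bytes = len(buf.getvalue())

buf.seek(0)
version = npy_format.read_magic(buf)  # NPY format version stored at the start of the file
shape, fortran_order, dtype = npy_format.read_array_header_1_0(buf)
data_offset = buf.tell()              # position where the raw array bytes begin

print(version, shape, fortran_order, dtype, data_offset)
```

Everything after `data_offset` is the same raw bytestring that `x.tobytes()` would produce.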

Save both an array of times (float64) and an array of voltages (uint16) to `ephys.npz`.

In [32]:
times = np.array([.1, .2, .3, .4], dtype=np.float64)
volts = np.array([300, 678, 426, 378], dtype=np.uint16)

In [33]:
np.savez('ephys.npz', times=times, volts=volts)

In a text editor, take a look at how the `ephys.npz` file was saved. (Note: you may have to "insist" that the data can be read in the text editor). Can you find where the data's schema is stored in the file?

Read in the times from `ephys.npz`.  Were they saved correctly?

In [36]:
np.load('ephys.npz')['times']

array([0.1, 0.2, 0.3, 0.4])
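
If you don't remember which variables an NPZ file contains, the object returned by `np.load` has a `files` attribute listing them. A minimal sketch, using an in-memory buffer as a stand-in for `ephys.npz`:

```python
from io import BytesIO
import numpy as np

times = np.array([.1, .2, .3, .4], dtype=np.float64)
volts = np.array([300, 678, 426, 378], dtype=np.uint16)

buf = BytesIO()  # in-memory stand-in for ephys.npz
np.savez(buf, times=times, volts=volts)
buf.seek(0)

# The with-block closes the archive when we are done reading from it
with np.load(buf) as data:
    names = sorted(data.files)    # names of the stored arrays
    loaded_times = data["times"]
    loaded_volts = data["volts"]  # each array keeps its own dtype

print(names, loaded_volts.dtype, loaded_times.dtype)
```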

## Demo: Compressing Data with `np.savez_compressed`

Numpy comes with a function called `savez_compressed` that makes NPZ files and applies Zip compression to each variable.  This simple compression method can make quite a difference in how compactly the data is stored on the computer, though how much will depend on the data itself.

Below are some examples of benchmarks on different datasets to get an idea of how the data affects things.  Note that both different data and different compression algorithms will have different performance characteristics.

In [37]:
from io import BytesIO  # Used to simulate files
from time import perf_counter  # Used to measure time

**Case 1**: All Values the Same

In [38]:
x = np.ones(1_000_000, dtype=np.int32)  # All values are 1
f = BytesIO()
t0 = perf_counter()
np.savez(f, x=x)
t1 = perf_counter()
f.seek(0)
f"Uncompressed.  File Size: {len(f.read())/1024:.1f} KB, Write Time: {(t1 - t0)*1000:.1f} msecs"

'Uncompressed.  File Size: 3906.5 KB, Write Time: 4.2 msecs'

In [39]:
x = np.ones(1_000_000, dtype=np.int32)  # All values are 1
f = BytesIO()
t0 = perf_counter()
np.savez_compressed(f, x=x)
t1 = perf_counter()
f.seek(0)
f"Compressed.  File Size: {len(f.read())/1024:.1f} KB, Write Time: {(t1 - t0)*1000:.1f} msecs"

'Compressed.  File Size: 4.0 KB, Write Time: 13.2 msecs'

**Question**: Did compression make much of a difference in file size when all the values were the same?  What about in write time?

**Case 2**: Ascending Values

In [40]:
x = np.arange(1, 1_000_000, dtype=np.int32)
f = BytesIO()
t0 = perf_counter()
np.savez(f, x=x)
t1 = perf_counter()
f.seek(0)
f"Uncompressed.  File Size: {len(f.read())/1024:.2f} KB, Write Time: {(t1 - t0)*1000:.1f} msecs"

'Uncompressed.  File Size: 3906.50 KB, Write Time: 4.2 msecs'

In [41]:
x = np.arange(1, 1_000_000, dtype=np.int32)
f = BytesIO()
t0 = perf_counter()
np.savez_compressed(f, x=x)
t1 = perf_counter()
f.seek(0)
f"Compressed.  File Size: {len(f.read())/1024:.2f} KB, Write Time: {(t1 - t0)*1000:.1f} msecs"

'Compressed.  File Size: 1350.82 KB, Write Time: 295.9 msecs'

**Question**: Did compression make much of a difference in file size when the values were different, but changing in a simple pattern?  What about in write time?

**Case 3**: Random Values

In [42]:
x = np.random.randint(1, 1_000_000, size=1_000_000, dtype=np.int32)  # random integers
f = BytesIO()
t0 = perf_counter()
np.savez(f, x=x)
t1 = perf_counter()
f.seek(0)
f"Uncompressed.  File Size: {len(f.read())/1024:.2f} KB, Write Time: {(t1 - t0)*1000:.1f} msecs"

'Uncompressed.  File Size: 3906.50 KB, Write Time: 4.7 msecs'

In [43]:
x = np.random.randint(1, 1_000_000, size=1_000_000, dtype=np.int32)  # random integers
f = BytesIO()
t0 = perf_counter()
np.savez_compressed(f, x=x)
t1 = perf_counter()
f.seek(0)
f"Compressed.  File Size: {len(f.read())/1024:.2f} KB, Write Time: {(t1 - t0)*1000:.1f} msecs"

'Compressed.  File Size: 3084.45 KB, Write Time: 169.9 msecs'

**Question**: Did compression make much of a difference in file size when all the values were random?  What about in write time?

**Case 4: Noisy Data**

In [44]:
x = np.arange(1_000_000, dtype=np.int32) + np.random.randint(1, 20, size=1_000_000, dtype=np.int32)  
f = BytesIO()
t0 = perf_counter()
np.savez(f, x=x)
t1 = perf_counter()
f.seek(0)
f"Uncompressed.  File Size: {len(f.read())/1024:.2f} KB, Write Time: {(t1 - t0)*1000:.1f} msecs"

'Uncompressed.  File Size: 3906.50 KB, Write Time: 3.1 msecs'

In [45]:
x = np.arange(1_000_000, dtype=np.int32) + np.random.randint(1, 20, size=1_000_000, dtype=np.int32) 
f = BytesIO()
t0 = perf_counter()
np.savez_compressed(f, x=x)
t1 = perf_counter()
f.seek(0)
f"Compressed.  File Size: {len(f.read())/1024:.2f} KB, Write Time: {(t1 - t0)*1000:.1f} msecs"

'Compressed.  File Size: 1262.42 KB, Write Time: 230.4 msecs'

**Question**: Did compression make much of a difference in file size when the values had noise in them?  What about in write time?

**Case 5: Random-but-small-range.**  This data is completely noise, but only in the values 0-254 (the upper bound of `randint` is exclusive), which needs far fewer bits than a 32-bit integer provides.

In [46]:
x = np.random.randint(0, 255, size=1_000_000, dtype=np.int32)  # random integers
f = BytesIO()
t0 = perf_counter()
np.savez(f, x=x)
t1 = perf_counter()
f.seek(0)
f"Uncompressed.  File Size: {len(f.read())/1024:.2f} KB, Write Time: {(t1 - t0)*1000:.1f} msecs"

'Uncompressed.  File Size: 3906.50 KB, Write Time: 5.4 msecs'

In [47]:
x = np.random.randint(0, 255, size=1_000_000, dtype=np.int32)  # random integers
f = BytesIO()
t0 = perf_counter()
np.savez_compressed(f, x=x)
t1 = perf_counter()
f.seek(0)
f"Compressed.  File Size: {len(f.read())/1024:.2f} KB, Write Time: {(t1 - t0)*1000:.1f} msecs"

'Compressed.  File Size: 1497.75 KB, Write Time: 374.2 msecs'

In [48]:
x = np.random.randint(0, 255, size=1_000_000, dtype=np.uint8)  # random integers
f = BytesIO()
t0 = perf_counter()
np.savez(f, x=x)
t1 = perf_counter()
f.seek(0)
f"Uncompressed, but Optimal Dtype.  File Size: {len(f.read())/1024:.2f} KB, Write Time: {(t1 - t0)*1000:.1f} msecs"

'Uncompressed, but Optimal Dtype.  File Size: 976.81 KB, Write Time: 1.6 msecs'

In [49]:
x = np.random.randint(0, 255, size=1_000_000, dtype=np.uint8)  # random integers
f = BytesIO()
t0 = perf_counter()
np.savez_compressed(f, x=x)
t1 = perf_counter()
f.seek(0)
f"Compressed, and Optimal Dtype.  File Size: {len(f.read())/1024:.2f} KB, Write Time: {(t1 - t0)*1000:.1f} msecs"

'Compressed, and Optimal Dtype.  File Size: 977.09 KB, Write Time: 21.0 msecs'

**Question**: Did compression make much of a difference in file size when the values were in a smaller range than needed?  What about in write time?  Did compressing remove the benefits of storing the data in an optimal dtype in the first place?
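
To try further cases without copying the timing code each time, the benchmark can be wrapped in a small function. This is only a sketch: `benchmark` is a hypothetical helper, not part of numpy, and the timings it reports will vary from machine to machine.

```python
from io import BytesIO
from time import perf_counter
import numpy as np

def benchmark(save_func, x):
    """Save x with save_func into memory; return (size in KB, write time in ms)."""
    f = BytesIO()
    t0 = perf_counter()
    save_func(f, x=x)
    t1 = perf_counter()
    return len(f.getvalue()) / 1024, (t1 - t0) * 1000

x = np.ones(1_000_000, dtype=np.int32)  # same data as Case 1
plain_kb, plain_ms = benchmark(np.savez, x)
packed_kb, packed_ms = benchmark(np.savez_compressed, x)

print(f"Uncompressed: {plain_kb:.1f} KB in {plain_ms:.1f} msecs")
print(f"Compressed:   {packed_kb:.1f} KB in {packed_ms:.1f} msecs")
```

Swapping in a different `x` makes it easy to test how your own data responds to compression.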

## Demo: Selecting Data "Out-of-Core" using Memory Mapping: `np.memmap`

If you have a massive file on disk, but don't want it to take up a lot of space in RAM, one technique you can use is to read in only a small part of your data at a time.  For example, one could read in one frame of a video at a time, or one channel at a time of a long multi-channel recording.  Numpy provides a technique called Memory Mapping, where it uses its knowledge of the expected byte layout of the file to read in only the needed data.

Note that this saves a lot of memory, but slows down calculations compared to when the data is already loaded into RAM.
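
As an illustration of the "one frame at a time" idea, the sketch below writes a tiny fake video to disk and then pulls a single frame out through a memory map. The file name, the array shape, and the variable names are all hypothetical choices for the example.

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "video.dat")  # hypothetical temp file

# A tiny fake "video": 10 frames of 4 x 5 pixels, written as raw bytes
frames = np.arange(10 * 4 * 5, dtype=np.uint16).reshape(10, 4, 5)
frames.tofile(path)

# Map the file without loading it; dtype and shape must be supplied by us
video = np.memmap(path, dtype=np.uint16, mode="r", shape=(10, 4, 5))

# Indexing touches only the bytes for this frame; np.array copies them into RAM
frame3 = np.array(video[3])
print(frame3.shape, frame3.nbytes)
```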

**Case 1**: A Simple Buffer of Data.  Here, all you need to know is the dtype and shape of the array to be read.

In [50]:
with open('mydata2.dat', 'wb') as f:
    f.write(np.arange(100, dtype=np.uint16).tobytes())

x = np.memmap('mydata2.dat', dtype=np.uint16, shape=100)
x

memmap([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
        96, 97, 98, 99], dtype=uint16)

**Case 2**: An NPY File.  Here, you need to know the dtype and shape of the array to be read, plus how many bytes to skip before the data starts (the NPY header, which is 128 bytes for this file).

In [53]:
np.save("mydata3.npy", np.arange(100, dtype=np.uint16))
x = np.memmap('mydata3.npy', dtype=np.uint16, shape=100, offset=128)
x

memmap([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
        96, 97, 98, 99], dtype=uint16)
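
Since an NPY file already records its own dtype, shape, and header size, `np.load` can set up the memory map for you through its `mmap_mode` argument, so no offset has to be hard-coded. A minimal sketch, writing to a hypothetical temporary path:

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "mydata3.npy")  # hypothetical temp file
np.save(path, np.arange(100, dtype=np.uint16))

# mmap_mode="r" returns a read-only memmap instead of loading the whole array
x = np.load(path, mmap_mode="r")
print(type(x).__name__, x.dtype, x.shape, x[:5])
```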

**Case 3:** A Compressed NPZ File.  As far as I know, this is **impossible** to memory map, as it cannot be known in advance which bytes in the file correspond to which values in the data.

In [54]:
np.savez_compressed("mydata3.npz", x=np.arange(100, dtype=np.uint16))
x = np.memmap('mydata3.npz', dtype=np.uint16, offset=1)
x

memmap([  843, 11524,     0,  2048,     0,  8448, 13824, 62680, 65475,
        65535, 65535, 65535,  1535,  5120, 30720, 28206, 31088,     1,
           16,   328,     0,     0,     0,   217,     0,     0,     0,
        51357, 14041,   347, 16384, 19921, 12685, 45907,  3442, 42477,
        25170, 62294, 19276, 62290,  4444, 62148, 24928, 30757, 31921,
        32581, 65176, 62338, 63094, 50367, 31838, 10483, 46023, 61591,
        40234, 52413,  4996, 14401, 13557, 17944, 61571, 16122, 39155,
        56649, 61221, 22323, 65513, 37566, 52666, 57254, 31549, 31379,
        48968, 63359, 11328, 60694,  1673, 49583, 35719,  9104, 10199,
        62073, 10261, 42068, 17496, 21129, 51813, 43093, 42324, 36186,
        27343, 43477, 41175, 37713, 11622, 26814,  9941, 44752, 42819,
        24366,  4989, 55018, 48099, 20830, 62527,  6121, 24627, 37072,
         9057, 36166, 63257, 34003, 21321, 52646, 13720, 56935, 17794,
        38475, 22701, 59061, 57239, 28118, 19336, 25592, 38611, 15213,
      