# Preparation of raw (binary) data

Raw data comes in many forms, often binary. Working with it is straightforward, but requires some care to ensure that what we read in fact contains the information we expect.

As a straightforward exercise in manipulating binary files using standard Python functions, here we shall make use of the well-known database of handwritten digits, called MNIST, a __m__odified subset of a larger dataset from the __N__ational __I__nstitute of __S__tandards and __T__echnology.

<img src="img/ex_MNIST.png" alt="Stimuli Image" />

A typical source for this data set is the website of Y. LeCun (http://yann.lecun.com/exdb/mnist/). They provide the following description,

> *"The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image."*

and the following files containing the data of interest.

> train-images-idx3-ubyte.gz:  training set images (9912422 bytes)

> train-labels-idx1-ubyte.gz:  training set labels (28881 bytes)

> t10k-images-idx3-ubyte.gz:   test set images (1648877 bytes)

> t10k-labels-idx1-ubyte.gz:   test set labels (4542 bytes)

The files are stored in a binary format called __IDX__, typically used for storing vector data. First we decompress the files via

```
$ cd data/MNIST
$ gunzip train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz
$ gunzip t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz
```
which leaves us with the binary files. Let us begin by opening a file connection with the training examples.

In [1]:

import numpy as np
import matplotlib.pyplot as plt

toread = "data/MNIST/train-images-idx3-ubyte"

f_bin = open(toread, mode="rb")

print(f_bin)


<_io.BufferedReader name='data/MNIST/train-images-idx3-ubyte'>


To be sure that we are reading the data correctly, we must check what we read against what the authors of the data file tell us *should* be there. From the page of LeCun et al. linked above, we have the following:

```
TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel
```

The "offset" here refers to the number of bytes read from the start of the file. An offset of zero refers to the first byte, and an offset of 0004 refers to the fifth byte, 0008 the ninth byte, and so forth. Let's check that we are able to successfully read what we expect.

In [2]:

print("First four bytes:") # should be magic number, 2051.
b = f_bin.read(4)
print("bytes: ", b)
print(" int: ", int.from_bytes(b, byteorder="big", signed=False))


First four bytes:
bytes:  b'\x00\x00\x08\x03'
 int:  2051


Note that the byte data `b'\x00\x00\x08\x03'` shown here by Python is a hexadecimal representation of the first four bytes. This corresponds directly to the "value" in the first row of the table above, ``0x00000803``. The ``\x`` prefixes simply show where one byte starts and another ends. Recall that two hexadecimal digits can represent the integers from $0, 1, 2, \ldots$ through to $(15 \times 16^{1} + 15 \times 16^{0}) = 255$, just as 8 binary digits, or 8 *bits*, can. Converting to decimal, we get $8 \times 16^{2} + 3 \times 16^{0} = 2051$, precisely what we expect.
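As a small check on the arithmetic above, the following sketch rebuilds the integer from the four bytes by hand and compares it with `int.from_bytes`. The names `raw` and `by_hand` are illustrative only and do not appear elsewhere in this lesson.

```python
# The four bytes of the magic number, as displayed by Python above.
raw = b'\x00\x00\x08\x03'

# Each byte acts as one base-256 digit. With big-endian order, the
# first byte is the most significant one.
by_hand = 0
for byte in raw:  # iterating over a bytes object yields ints in 0..255
    by_hand = by_hand * 256 + byte

print(raw.hex())  # two hexadecimal digits per byte: 00000803
print(by_hand)    # 2051
print(by_hand == int.from_bytes(raw, byteorder="big", signed=False))  # True
```

Accumulating one byte at a time in this way is the same computation as the sum of powers of 16 written out above, just grouped two hexadecimal digits at a time.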

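Using the *read* method, let us read four bytes at a time to ensure the remaining data is read correctly. Note that Python displays any byte that corresponds to a printable ASCII character as that character rather than as a ``\x`` escape, so the byte ``0x60`` appears below as a backtick.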

In [3]:

print("Second four bytes:") # should be number of imgs = 60000
b = f_bin.read(4)
print("bytes: ", b)
print(" int: ", int.from_bytes(b, byteorder="big", signed=False))


Second four bytes:
bytes:  b'\x00\x00\xea`'
 int:  60000


In [4]:

print("Third four bytes:") # should be number of rows = 28
b = f_bin.read(4)
print("bytes: ", b)
print(" int: ", int.from_bytes(b, byteorder="big", signed=False))


Third four bytes:
bytes:  b'\x00\x00\x00\x1c'
 int:  28


In [5]:

print("Fourth four bytes:") # should be number of cols = 28
b = f_bin.read(4)
print("bytes: ", b)
print(" int: ", int.from_bytes(b, byteorder="big", signed=False))


Fourth four bytes:
bytes:  b'\x00\x00\x00\x1c'
 int:  28
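The four header reads above can also be done in a single step with the standard `struct` module. The following is only a sketch: so that it runs without the MNIST files, it builds a stand-in header by hand (`fake_header` and `f_demo` are hypothetical names) rather than opening the real file.

```python
import io
import struct

# Stand-in for the first 16 bytes of the image file, built by hand
# purely for illustration (an assumption, not the real file).
fake_header = struct.pack(">IIII", 2051, 60000, 28, 28)
f_demo = io.BytesIO(fake_header)

# ">" means big-endian (MSB first), "I" means unsigned 32-bit integer.
magic, n_images, n_rows, n_cols = struct.unpack(">IIII", f_demo.read(16))

print(magic, n_images, n_rows, n_cols)  # 2051 60000 28 28
```

With the real file, `f_demo` would simply be replaced by a freshly opened file object, read from offset zero.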


Things seem to be as they should be. We have been able to accurately extract all the information necessary to read out all the remaining data stored in this file. Since these happen to be images, the accuracy of our read-out can be easily assessed by looking at the image content.

In [9]:

n = 60000 # (anticipated) number of images.
d = 28*28 # number of entries (int values) per image.
times_todo = 5 # number of images to view.
bytes_left = d
data_x = np.zeros((d,), dtype=np.uint8) # initialize.


Note that we are using the *uint8* (unsigned 1-byte int) data type, because we know that the values range between 0 and 255, based on the description by the authors of the data set, which says

> "Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black)."

More concretely, we have that the remaining elements of the data set are of the form

```
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel
```
and so to read out one pixel value at a time (values between 0 and 255), we must read one byte at a time (rather than four bytes as we had just been doing). There should be $28 \times 28 = 784$ pixels per image.

In [10]:

for t in range(times_todo):

    idx = 0
    while idx < bytes_left:
        # Iterate one byte at a time.
        b = f_bin.read(1)
        data_x[idx] = int.from_bytes(b, byteorder="big", signed=False)
        idx += 1

    img_x = data_x.reshape( (28,28) ) # populate one row at a time.
    # binary colour map highlights foreground (black) against background(white)
    plt.imshow(img_x, cmap=plt.cm.binary)
    plt.show()


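Reading one byte at a time, as in the loop above, makes each step explicit, but a whole image can also be read in one call. The sketch below compares the two approaches on synthetic bytes (the names `fake_pixels` and `f_demo` are hypothetical stand-ins for the real file), using `np.frombuffer` to interpret the bytes as *uint8* values.

```python
import io
import numpy as np

d = 28 * 28  # number of pixels per image.

# Synthetic stand-in for the pixel bytes of one image (values 0..255),
# used only so that the sketch runs without the MNIST files.
fake_pixels = bytes(i % 256 for i in range(d))
f_demo = io.BytesIO(fake_pixels)

# One byte at a time, as in the loop above.
one_at_a_time = np.zeros((d,), dtype=np.uint8)
for idx in range(d):
    b = f_demo.read(1)
    one_at_a_time[idx] = int.from_bytes(b, byteorder="big", signed=False)

# All d bytes in a single read, interpreted directly as uint8 values.
f_demo.seek(0)
all_at_once = np.frombuffer(f_demo.read(d), dtype=np.uint8)

print(np.array_equal(one_at_a_time, all_at_once))  # True
print(all_at_once.reshape((28, 28)).shape)         # (28, 28)
```

Note that the array returned by `np.frombuffer` is read-only. Calling `.copy()` on it gives a writable array if one is needed.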

The images look to be as we expect. Let's close this file for the moment, and confirm that the *times_todo* patterns observed above have (correct) corresponding labels in the *train-labels-idx1-ubyte* file.

In [11]:

f_bin.close()
if f_bin.closed:
    print("Successfully closed.")
    

Successfully closed.


In [12]:

toread = "data/MNIST/train-labels-idx1-ubyte"

f_bin = open(toread, mode="rb")

print(f_bin)


<_io.BufferedReader name='data/MNIST/train-labels-idx1-ubyte'>


Once again from the page of LeCun et al. linked above, we have for labels that the contents should be as follows.

```
TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label
```

Let's inspect the first eight bytes.

In [13]:

print("First four bytes:") # should be magic number, 2049.
b = f_bin.read(4)
print("bytes: ", b)
print(" int: ", int.from_bytes(b, byteorder="big"))


First four bytes:
bytes:  b'\x00\x00\x08\x01'
 int:  2049


In [14]:

print("Second four bytes:") # should be number of observations, 60000.
b = f_bin.read(4)
print("bytes: ", b)
print(" int: ", int.from_bytes(b, byteorder="big"))


Second four bytes:
bytes:  b'\x00\x00\xea`'
 int:  60000


From here on, the file should contain labels from 0 to 9. Let's confirm that the patterns viewed above have the correct labels in the place that we expect:

In [15]:

for t in range(times_todo):

    b = f_bin.read(1)
    mylabel = int.from_bytes(b, byteorder="big", signed=False)
    
    print("Label =", mylabel)
    

Label = 5
Label = 0
Label = 4
Label = 1
Label = 9
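Since each label occupies exactly one byte, several labels can be read in one call and converted together. The following sketch assumes a tiny synthetic label file (`fake_file`, a hypothetical stand-in holding only the five labels printed above) so that it runs without the MNIST files.

```python
import io

# Synthetic stand-in for a label file: an 8-byte header (magic number
# 2049 and an item count of 5) followed by five one-byte labels.
fake_file = (
    (2049).to_bytes(4, byteorder="big")
    + (5).to_bytes(4, byteorder="big")
    + bytes([5, 0, 4, 1, 9])
)
f_demo = io.BytesIO(fake_file)

f_demo.seek(8)                 # skip the two 32-bit header fields.
labels = list(f_demo.read(5))  # each byte becomes an int in 0..255.

print(labels)  # [5, 0, 4, 1, 9]
```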


### Exercises:

1. Change `signed` from False to True, and see if/how the output changes.
2. Replace the `np.uint8` data type specification with `np.int8`. What happens?
3. Change the byte order from "big" to "little", and see how things change.
4. Do an analogous check for the first few images (and labels) of the test data to ensure it can be read as desired.
5. Make a function that uses the `seek` method to jump to an arbitrary byte offset, which can be used to display the $k$th image, given integer $k$ only.



All that remains to be done here is to read out the full data sets and write them to raw binary files that numpy can read back directly, given the *dtype*. When the data may be read many times on the same machine, the *tofile* method and the *fromfile* function from numpy allow for very fast reads, as long as we know the data type.


In [17]:

toread = "data/MNIST/train-images-idx3-ubyte"
n = 60000
d = 28*28
bytes_left = n * d
data_x = np.zeros((d,), dtype=np.uint8)

with open(toread, mode="rb") as f_bin:

    f_bin.seek(16) # go to start of images.
    idx = 0
    data_X = np.empty((n*d,), dtype=np.uint8)
    
    print("Reading binary file...", end=" ")
    while bytes_left > 0:
        b = f_bin.read(1)
        data_X[idx] = np.uint8(int.from_bytes(b, byteorder="big", signed=False))
        bytes_left -= 1
        idx += 1
    print("Done reading...", end=" ")
print("OK, file closed.")


Reading binary file... Done reading... OK, file closed.
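The byte-by-byte loop above mirrors the file layout exactly, but it is slow for 60000 images. As a hedged alternative, `np.fromfile` can read all remaining bytes after the header in one call. The sketch below uses a tiny synthetic file with the same layout (the sizes `n_demo`, `d_demo` and the file name are hypothetical) so that it runs without the MNIST data.

```python
import os
import tempfile
import numpy as np

# Small synthetic file in the same layout as the image file: a 16-byte
# header followed by n*d pixel bytes. Sizes are tiny and hypothetical.
n_demo, d_demo = 3, 4
header = b"".join(v.to_bytes(4, byteorder="big") for v in (2051, n_demo, 2, 2))
pixels = bytes(range(n_demo * d_demo))

path_demo = os.path.join(tempfile.mkdtemp(), "demo-images-idx3-ubyte")
with open(path_demo, mode="wb") as f_out:
    f_out.write(header + pixels)

with open(path_demo, mode="rb") as f_demo:
    f_demo.seek(16)  # go to start of images, as above.
    data_demo = np.fromfile(f_demo, dtype=np.uint8, count=n_demo * d_demo)

print(data_demo.shape, data_demo.dtype)
```

Because every pixel is a single unsigned byte, no byte-order conversion is needed here. The 32-bit header fields, in contrast, still need the big-endian handling shown earlier.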


With the data successfully read, let's store it.

In [23]:

print("Writing binary file...", end=" ")
towrite = "data/MNIST/X_tr.dat"
with open(towrite, mode="bw") as g_bin:
    data_X.tofile(g_bin)
print("OK.")


Writing binary file... OK.


In [24]:

print(towrite)


data/MNIST/X_tr.dat


Now, try reading this file back, and compare with the original data. There should be no errors in reconstruction, and reading from the raw binary file with *fromfile* is *much* faster than reading from the IDX file one byte at a time.

In [25]:

with open(towrite, mode="br") as g_bin:
    data_X_check = np.fromfile(g_bin, dtype=np.uint8)
print("OK.")

print("Taking difference of data_X_check and data_X:")
print("Difference =", np.linalg.norm(data_X_check-data_X))


OK.
Taking difference of data_X_check and data_X:
Difference = 0.0
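One point worth keeping in mind is that a file written with *tofile* holds only the raw values: neither the shape nor the data type is stored. The following sketch, on a tiny hypothetical array `X_demo`, shows that the shape must be restored by hand and that reading with a different *dtype* raises no error but silently regroups the bytes.

```python
import os
import tempfile
import numpy as np

# A tiny hypothetical "data set": 3 examples with 4 values each.
X_demo = np.arange(12, dtype=np.uint8).reshape((3, 4))

path_demo = os.path.join(tempfile.mkdtemp(), "X_demo.dat")
X_demo.tofile(path_demo)  # writes the 12 raw bytes and nothing else.

# Reading back with the right dtype gives a flat array. The shape
# must be restored by hand.
X_flat = np.fromfile(path_demo, dtype=np.uint8)
X_back = X_flat.reshape((3, 4))

# Reading with a different dtype raises no error. It simply groups
# the bytes differently (here, two bytes per value).
X_wrong = np.fromfile(path_demo, dtype=np.uint16)

print(X_flat.shape, np.array_equal(X_back, X_demo))  # (12,) True
print(X_wrong.shape)                                 # (6,)
```

For the MNIST training inputs, this means keeping track of `n`, `d`, and `np.uint8` alongside the `.dat` file.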


### Exercises:

1. Do the same process for `X_te`, `y_tr`, and `y_te`, saving their respective `.dat` files on disk (written in code examples below).

### End of lesson: paste any routines to be re-used in the `scripts/MNIST.py` file.