# Loading data with TensorFlow
In this notebook we will investigate a few different ways to handle data with TensorFlow on Alvis.

## Using your own data
In many cases you already have a dataset in mind, stored in your home folder or, more likely, in a storage project.

In this section we will use the dataset in `data.tar.gz`. First, let us take a look at it.

***N.B.:*** We've found that the fastest way to load data on Alvis is to stream directly from archives stored on Mimer. Utilities exist in tensorflow.datasets for zip and tar, but loading from TFRecords might work just as well, if not better.
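
As a rough illustration of what streaming from an archive means, the sketch below reads the members of a tar archive one at a time in stream mode (`"r|*"`) without extracting anything to disk. It uses only the standard library `tarfile` module; the in-memory archive, its file names, and the helper name `stream_members` are made up for this example and are not part of the course material.

```python
import io
import tarfile


def stream_members(fileobj):
    """Yield (name, bytes) for each regular file, reading the tar sequentially."""
    # "r|*" opens the archive as a forward-only stream and detects compression
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode a member must be read before moving to the next one
                yield member.name, tar.extractfile(member).read()


# Build a tiny gzip-compressed archive in memory (hypothetical contents)
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    for name, payload in [("data/1/a.png", b"first"), ("data/2/b.png", b"second")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buffer.seek(0)

streamed = list(stream_members(buffer))
print(streamed)
```

Stream mode never seeks backwards, which is why it suits sequential reads from network storage, but it also means members can only be visited in archive order.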

### The file tree
First we inspect the dataset archive that we have.

In [None]:
%%bash
# First we will decompress the dataset for better reading
if [ ! -f data.tar ]; then
    gunzip data.tar.gz
fi

In [None]:
# Utility to iterate over dataset
import tarfile
import random
from matplotlib.image import imread
def iter_archive(path, extractfile=False, shuffle=False):
    # Open the archive given by `path` (previously 'data.tar' was hard-coded)
    with tarfile.open(path) as datatar:
        members = datatar.getmembers()
        if shuffle:
            random.shuffle(members)
        for member in members:
            filename = member.name
            if not filename.endswith(".png"):
                continue
            if extractfile:
                img = imread(datatar.extractfile(member))
            else:
                img = None
            yield filename, img

In [None]:
# Imports used in the following cells to count the archive entries
# per directory and file extension
import os
import tarfile
from collections import Counter
from tqdm.auto import tqdm

In [None]:
# Print the first five files in the archive
for i, (filename, _) in enumerate(iter_archive('data.tar')):
    print(filename)
    if i >= 4:  # stop after five files (i starts at 0)
        break

In [None]:
# For each directory print the number of files with specific extension
directory_entries = Counter()
for filename, _ in tqdm(iter_archive('data.tar')):
    dirname = os.path.dirname(filename)
    extension = os.path.splitext(filename)[-1]
    directory_entries.update([f"{dirname}/*{extension}"])
    while '/' in dirname:
        dirname = os.path.dirname(dirname)
        directory_entries.update([f"{dirname}/**/*{extension}"])

for path, entry_count in sorted(directory_entries.items(), key=lambda t:t[1], reverse=True):
    print(path, entry_count)


***NOTE:*** This tar file contained "only" 60000 files. For much larger archives these operations mean significant file I/O and should be avoided as much as possible. If there is a README accompanying the dataset, it is wise to take a look at it. Zip files have an advantage here, as it is possible to learn about the members of the archive without reading through the whole archive.
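
To illustrate the point about zip files, here is a minimal sketch using the standard library `zipfile` module. `ZipFile.namelist()` reads the member names from the central directory stored at the end of the file, so no member data has to be decompressed. The archive and its file names are made up for the example.

```python
import io
import os
import zipfile
from collections import Counter

# Build a small zip archive in memory (hypothetical file names)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    for name in ["data/1/a.png", "data/1/b.png", "data/2/c.png", "README.txt"]:
        archive.writestr(name, b"")

# namelist() only consults the central directory of the archive
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()

# Count the .png files per directory without touching their contents
per_directory = Counter(
    os.path.dirname(name) for name in names if name.endswith(".png")
)
print(per_directory)
```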

### Looking at some of the data
Now that we know the file structure, let us get acquainted with the data by looking at a few of the images.

In [None]:
from fnmatch import fnmatch

import matplotlib.pyplot as plt
%matplotlib inline

# Visualize images
fig, ax_grid = plt.subplots(3, 3, figsize=(15, 15))
for ax, (fn, img) in zip(ax_grid.flatten(), iter_archive('data.tar', extractfile=True)):
    # Get path to file and label
    label = fn.split('/')[1]

    # Add to axis
    ax.imshow(img)
    ax.set_title(f'Label {label}')
    
fig.tight_layout()

Note that the labels are offset by 1 compared to the digits. The dataset is actually a modified version of the MNIST handwritten digit training database. The images have been shrunk to only 9x9 pixels and converted to monochrome to reduce the size of the dataset.
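
The label offset can be handled with a small helper like the one sketched below. It assumes, as in the plotting cell, that the second path component is the label; the helper name `digit_from_path` and the example paths are hypothetical.

```python
def digit_from_path(filename):
    """Map an archive path such as 'data/3/img.png' to a zero-based digit."""
    # The second path component is the directory label
    directory_label = int(filename.split("/")[1])
    # Labels are offset by one relative to the digits
    return directory_label - 1


print(digit_from_path("data/3/img.png"))  # prints 2
```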

### Training a classifier from this data
Now we have some understanding of what the dataset contains and are ready to do some ML on it.

First we will define our machine learning model.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.data import Dataset


In [None]:
# 10 (3, 3) convolutional filters followed by a dense layer
model = keras.Sequential([
    layers.Conv2D(10, 3, activation="relu", input_shape=(10, 10, 1), use_bias=True),
    layers.Flatten(),
    layers.Dense(10),
])

model.summary()


Now we come to the step where we load the data. When a dataset has the structure "root/class/input", we can use `tf.keras.utils.image_dataset_from_directory`. Here we have an archive instead, so we will create our own generator and use that.

In [None]:
import random

def generate_tiny_mnist(archive_path, shuffle=True):
    data = list(iter_archive(archive_path, extractfile=True, shuffle=shuffle))
    
    # Read
    for filename, img in data:
        label = tf.one_hot(
            indices=int(filename.split('/')[1]) - 1,  # labels
            depth=10,  # num classes
        )
        img = tf.convert_to_tensor(img[..., None])
        yield img, label

        
batch_size = 128
dataset = tf.data.Dataset.from_generator(
    generator=lambda: generate_tiny_mnist('data.tar'),
    output_signature=(
        tf.TensorSpec(shape=(10, 10, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(10,), dtype=tf.float32),  # tf.one_hot returns float32 by default
    ),
)
dataset = dataset.repeat(3)
# Optional extra shuffling: the generator is called again for each repetition
# and already shuffles the archive members itself
dataset = dataset.shuffle(1024)
dataset = dataset.batch(batch_size)
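
The `shuffle(1024)` call in the cell above uses a fixed-size buffer, so it only mixes elements that are close to each other in the stream. The pure-Python sketch below imitates that behaviour to make the effect visible. It is a simplified model for illustration, not TensorFlow's implementation, and the name `buffered_shuffle` is made up.

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything
            buffer.append(item)
            continue
        # Emit a random buffered item and replace it with the new one
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_size=10, seed=0))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1))
print(shuffled[:10])
print(unshuffled)
```

With a buffer of 10, the element emitted at position `k` always comes from the first `k + 10` inputs, so a small buffer cannot replace shuffling in the generator when the archive is sorted by class.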

In [None]:
model.compile(
    keras.optimizers.Adam(learning_rate=0.01),
    keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.fit(dataset, steps_per_epoch=(1 + 60000 // batch_size), epochs=3, verbose=2)
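
The `steps_per_epoch` expression above counts the batches needed to cover the 60000 images, including the last partial batch. A quick check of the arithmetic, using only the numbers from the cell above:

```python
import math

num_samples = 60000
batch_size = 128

# Number of batches including the last, partially filled one
steps_ceil = math.ceil(num_samples / batch_size)
# Expression used in the model.fit call
steps_notebook = 1 + num_samples // batch_size

print(steps_ceil, steps_notebook)  # 469 469

# The two expressions differ when the batch size divides the dataset evenly
even_ceil = math.ceil(60000 / 100)
even_notebook = 1 + 60000 // 100
print(even_ceil, even_notebook)  # 600 601
```

For batch sizes that divide the dataset size evenly, the ceiling form avoids asking for one step too many.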

### Tasks
 1. Make yourself acquainted with the above code.
 2. In the future, if you have data on Mimer you can probably skip step 3.
 3. Take a look at `jobscript-tensorflow.sh`. In this script we unpack the dataset on \$TMPDIR and then train the model on the entire dataset. Make sure not to unpack in your home folder, as you would then exceed your file quota.
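
The job script itself is not reproduced here. As a hedged Python sketch of the same idea, the helper below extracts an archive into a fresh directory under `$TMPDIR`, falling back to the system temporary directory if the variable is not set. The helper name `unpack_to_scratch` and the demo archive are assumptions for illustration only.

```python
import io
import os
import tarfile
import tempfile


def unpack_to_scratch(archive_path, scratch_root=None):
    """Extract archive_path into a new directory under $TMPDIR and return its path."""
    scratch_root = scratch_root or os.environ.get("TMPDIR", tempfile.gettempdir())
    target = tempfile.mkdtemp(prefix="dataset-", dir=scratch_root)
    with tarfile.open(archive_path) as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters
            tar.extractall(target, filter="data")
        else:
            tar.extractall(target)
    return target


# Demo with a tiny archive (hypothetical contents)
with tempfile.TemporaryDirectory() as workdir:
    archive_path = os.path.join(workdir, "demo.tar")
    with tarfile.open(archive_path, "w") as tar:
        payload = b"pixels"
        info = tarfile.TarInfo(name="data/1/a.png")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    unpacked = unpack_to_scratch(archive_path, scratch_root=workdir)
    file_was_extracted = os.path.isfile(os.path.join(unpacked, "data", "1", "a.png"))

print(file_was_extracted)
```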

## Using available datasets
Some common public datasets are available at `/mimer/NOBACKUP/Datasets`. If there is a specific dataset you would like to see added, you can create a request at [SNIC-support](https://supr.snic.se/support/).

In this part we will access the MNIST dataset available at `/mimer/NOBACKUP/Datasets/MNIST/mnist.npz`

In [None]:
# 10 (3, 3) convolutional filters followed by a dense layer
model = keras.Sequential([
    layers.Conv2D(10, 3, activation="relu", input_shape=(28, 28, 1), use_bias=True),
    layers.Flatten(),
    layers.Dense(10),
])

model.summary()


In this case we'll load the data as NumPy arrays through the TensorFlow Keras backend. Then we'll massage this output into the correct shape. An alternative would have been to use the TensorFlow Datasets API.

In [None]:
(train_imgs, train_labels), _ = keras.datasets.mnist.load_data(path="/mimer/NOBACKUP/Datasets/MNIST/mnist.npz")
train_data = (
    tf.expand_dims(train_imgs, 3),
    tf.one_hot(train_labels, 10),
)
dataset = tf.data.Dataset.from_tensor_slices(train_data).batch(128)


In [None]:
model.compile(
    keras.optimizers.Adam(learning_rate=0.01),
    keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# The dataset is finite and not repeated, so Keras can infer the number of steps per epoch
model.fit(dataset, epochs=3, verbose=2)

## Loading data through a TensorFlow related API
Some datasets can be found and used through TensorFlow Keras as we did in the earlier example. The only difference is that you change the path to where you would like to store the dataset. More datasets can be found through [TensorFlow Datasets](https://www.tensorflow.org/datasets/overview). This package doesn't currently exist in the module tree, but if there is interest it can probably be added.

However, note that for both of these the data download can take some time and you will have to store the data yourself. So for your own and others' sake, please check whether the dataset already exists, and for larger datasets don't hesitate to contact support if you are unsure about anything.
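
One way to check for an existing copy before downloading is sketched below. It looks for a path under the shared dataset folder mentioned earlier and returns `None` if nothing is found. The helper name `resolve_dataset` is hypothetical, and the demo uses a temporary directory in place of the real shared folder.

```python
import os
import tempfile


def resolve_dataset(relative_path, shared_root="/mimer/NOBACKUP/Datasets"):
    """Return the path inside the shared dataset folder if it exists, else None."""
    candidate = os.path.join(shared_root, relative_path)
    return candidate if os.path.exists(candidate) else None


# Demo with a temporary directory standing in for the shared folder
with tempfile.TemporaryDirectory() as fake_root:
    os.makedirs(os.path.join(fake_root, "MNIST"))
    open(os.path.join(fake_root, "MNIST", "mnist.npz"), "wb").close()

    found = resolve_dataset("MNIST/mnist.npz", shared_root=fake_root)
    missing = resolve_dataset("NotThere/data.npz", shared_root=fake_root)
    found_ends_as_expected = found is not None and found.endswith("mnist.npz")

print(found_ends_as_expected, missing)
```

If the helper returns `None`, the dataset would have to be downloaded and stored in your own project storage.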