# Loading data with TensorFlow
In this notebook we will investigate a few different ways to handle data with TensorFlow on Alvis.

## Using your own data
In many cases you already have a dataset in mind that you've acquired and keep in your home folder or, more likely, in a storage project.

In this section we will use `tiny-imagenet-200`, which is one of the centrally stored datasets. If you have your own private dataset, the only difference is that you would store it in your project storage instead.

***N.B.:*** We've found that the fastest way to load data on Alvis is to stream directly from archives stored on Mimer. Utilities exist in tensorflow.datasets for zip and tar, but you could just as well, and probably more easily, use the built-in `zipfile` and `tarfile` libraries in combination with [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) as done here. Loading from TFRecords might work just as well, if not better.

### The file tree
First we inspect the dataset archive that we have.

In [None]:
import zipfile

path_to_dataset = '/mimer/NOBACKUP/Datasets/tiny-imagenet-200/tiny-imagenet-200.zip'
with zipfile.ZipFile(path_to_dataset) as archive:
    # Print first 10 files
    print(*archive.namelist()[:10], sep='\n')

***NOTE:*** Investigating files like this can be quite slow if the archives are very large. Looking at the first few files is fast and a good way to get a sense of the archive, but you don't want to search through all of them every time. If there is a README accompanying the dataset, it is wise to take a look at it. You might also want to note down the structure inside the archive yourself if it isn't described in the README.
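One way to note down the structure is to count files per directory prefix and extension, which only needs the archive's file listing and not the file contents. The following is a minimal sketch under our own assumptions: the helper `summarize_archive` and the small in-memory archive are made up for illustration and are not part of the dataset or of any library.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(archive, depth=2):
    """Count files per (directory prefix, extension) without reading file contents."""
    counts = Counter()
    for info in archive.infolist():
        if info.is_dir():
            continue
        path = PurePosixPath(info.filename)
        # Keep only the first `depth` directory levels as the group key
        prefix = "/".join(path.parts[:-1][:depth])
        counts[(prefix, path.suffix)] += 1
    return counts


# Small in-memory archive standing in for a real dataset (hypothetical layout)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/README.txt", "example")
    archive.writestr("data/train/a/a_0.JPEG", b"")
    archive.writestr("data/train/b/b_0.JPEG", b"")
    archive.writestr("data/val/val_0.JPEG", b"")

with zipfile.ZipFile(buffer) as archive:
    summary = summarize_archive(archive)

for (prefix, suffix), n_files in sorted(summary.items()):
    print(f"{prefix}/*{suffix}: {n_files} files")
```

For a real dataset you would open the archive by its path instead of the in-memory buffer, and save the printed summary next to your code so that you only have to scan the archive once.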

In [None]:
# Let's take a look at one of the text files next
with zipfile.ZipFile(path_to_dataset) as archive:
    wnids = archive.read(
        'tiny-imagenet-200/wnids.txt'
    ).decode(
        'utf-8'
    ).split()
    print(wnids)

This will later be used as the labels for our task.

In [None]:
# Looking at a few example images
from fnmatch import fnmatch

import matplotlib.pyplot as plt
%matplotlib inline

# Construct filter for training images
def train_image_filter(filename):
    # fnmatch needs both the filename and the pattern to match against
    return fnmatch(filename, '*train*.JPEG')

# Construct generator for train images
def iter_train_images():
    with zipfile.ZipFile(path_to_dataset) as archive:
        for fn in archive.namelist():
            # Filter for train images
            if not fnmatch(fn, '*train*.JPEG'):
                continue
            
            # Decode label from filename
            label = fn.split('/')[-1].split('_')[0]
            
            # Parse image
            with archive.open(fn) as imgfile:
                img = plt.imread(imgfile)
            
            yield img, label

            
# Visualize images
fig, ax_grid = plt.subplots(3, 3, figsize=(15, 15))
for ax, (img, label) in zip(ax_grid.flatten(), iter_train_images()):
    ax.imshow(img)
    ax.set_title(f'Label {label}')

fig.tight_layout()

### Creating a Dataset
Now we have an idea of the structure of the dataset and are ready to write our Dataset object.

In [None]:
import random
import tensorflow as tf

# Write a generator for the dataset
def dataset_generator(
    path=path_to_dataset,
    split='train',
    shuffle=True,
):
    with zipfile.ZipFile(path) as archive:
        # Find wnid to label mapping
        wnids = archive.read(
            'tiny-imagenet-200/wnids.txt'
        ).decode(
            'utf-8'
        ).split()
        wnid2label = {wnid: [label] for label, wnid in enumerate(wnids)}
        
        # Iterate over images
        namelist = archive.namelist()
        if shuffle:
            random.shuffle(namelist)
        
        for filename in namelist:
            # Filter for JPEG files and split
            if not fnmatch(filename, f'*{split}*.JPEG'):
                continue
            
            # Read label
            if split != 'train':
                raise NotImplementedError('Reading label only implemented for train split.')
            wnid = filename.split('/')[-1].split('_')[0]
            label = wnid2label[wnid]
            
            # Read image
            with archive.open(filename) as imgfile:
                img = plt.imread(imgfile)
            if img.ndim == 2:
                # Not all images in tiny-imagenet are RGB valued
                img = img[..., None]
                img = tf.image.grayscale_to_rgb(
                    tf.convert_to_tensor(
                        img,
                        dtype=tf.float32,
                    )
                )
            
            yield img, label
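
The generator above only reads labels for the train split, where the wnid is part of the filename. For the validation split the label has to come from somewhere else. As far as we know, the commonly distributed `tiny-imagenet-200` archive lists validation labels in `tiny-imagenet-200/val/val_annotations.txt` as tab-separated lines that start with the image filename and its wnid. Treat that path and format as an assumption and check it against your archive first. The helper `parse_val_annotations` and the example data below are made up for illustration.

```python
def parse_val_annotations(text, wnid2label):
    """Map validation image filenames to integer labels.

    Assumes tab-separated lines that start with: filename, wnid.
    """
    filename2label = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Any remaining columns (assumed to be bounding boxes) are ignored
        filename, wnid = line.split("\t")[:2]
        filename2label[filename] = wnid2label[wnid]
    return filename2label


# Made-up stand-in for the archive contents
example_wnids = ["n00000001", "n00000002"]
example_wnid2label = {wnid: label for label, wnid in enumerate(example_wnids)}
example_annotations = (
    "val_0.JPEG\tn00000002\t0\t32\t44\t62\n"
    "val_1.JPEG\tn00000001\t52\t55\t57\t59\n"
)

val_labels = parse_val_annotations(example_annotations, example_wnid2label)
print(val_labels)
```

With a real archive, the text would come from `archive.read(...).decode('utf-8')` in the same way as `wnids.txt` is read above.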


In [None]:
# Get size of train split
with zipfile.ZipFile(path_to_dataset) as archive:
    n_train = len([fn for fn in archive.namelist() if fnmatch(fn, '*train*.JPEG')])
print("Train split is", n_train, "images.")

In [None]:
# Create Dataset object
dataset = tf.data.Dataset.from_generator(
    generator=dataset_generator,
    output_signature=(
        tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(1,), dtype=tf.int32),
    ),
)

n_epochs = 3
dataset = dataset.repeat(n_epochs)
batch_size = 128
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(tf.data.AUTOTUNE);
# See https://www.tensorflow.org/guide/data_performance
# for more performance considerations

When we have a dataset with the structure "root/class/input", we can use `tf.keras.utils.image_dataset_from_directory`. However, we want to avoid working with many small files for performance reasons, so we wrote our own iterator over an archive instead. A similar approach can be used for tar files, but it will not be as fast if we want to shuffle the data. HDF5 and TFRecord based approaches should work just as well, but use somewhat different tools.
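To illustrate the tar case, here is a minimal sketch that reads a tar file as a forward-only stream using the built-in `tarfile` library. The helper `iter_tar_members` and the in-memory archive are our own assumptions for illustration. Unlike a zip file, a tar file has no central index, so members come out in archive order, which is why shuffling is more awkward.

```python
import io
import tarfile


def iter_tar_members(fileobj, suffix=".JPEG"):
    """Yield (name, raw bytes) for matching files, in archive order."""
    # Mode "r|*" reads the tar file as a forward-only stream
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            # In stream mode the data must be read before moving to the next member
            yield member.name, archive.extractfile(member).read()


# Small in-memory tar file standing in for a real dataset (hypothetical layout)
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    for name, payload in [
        ("train/a_0.JPEG", b"aaa"),
        ("train/b_0.JPEG", b"bb"),
        ("README.txt", b"x"),
    ]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

members = list(iter_tar_members(buffer))
print([name for name, _ in members])
```

For a file on disk you could pass `open(path, 'rb')` as `fileobj`. Since the order is fixed by the archive, one option for approximate shuffling is a shuffle buffer such as `dataset.shuffle(buffer_size)` after wrapping the generator in a `tf.data.Dataset`.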

### Training a classifier from this data
Now we have some understanding of what the dataset does and we are ready to do some ML on it.

First we will define our machine learning model.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.data import Dataset


In [None]:
# 10 (3, 3) convolutional filters followed by a dense layer
model = keras.Sequential([
    layers.Conv2D(10, 3, activation="relu", input_shape=(64, 64, 3), use_bias=True),
    layers.Flatten(),
    layers.Dense(200),
])

model.summary()


In [None]:
model.compile(
    keras.optimizers.Adam(learning_rate=0.01),
    keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    # Labels are integer class indices, so use the sparse top-k metric
    metrics=['accuracy', 'sparse_top_k_categorical_accuracy'],
)

model.fit(
    dataset,
    steps_per_epoch=(n_train // batch_size),
    epochs=n_epochs,
    verbose=2,
);

As you might notice, the accuracy is not very good with this very simple model and no hyperparameter tuning. But that's not the topic of this exercise, so let's ignore that for now.

### Tasks
 1. Make yourself acquainted with the above code.
 2. (Optional) Play around with hyperparameters to make data loading faster.
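For the optional task it helps to measure how fast batches arrive, independently of the model. The following is a minimal sketch of such a measurement. The helper `measure_throughput` and the artificially slow generator are made up for illustration and are not part of TensorFlow.

```python
import time


def measure_throughput(batches, max_batches=50):
    """Return (number of batches consumed, batches per second)."""
    n_batches = 0
    start = time.perf_counter()
    for _ in batches:
        n_batches += 1
        if n_batches >= max_batches:
            break
    elapsed = time.perf_counter() - start
    # Guard against division by zero for extremely fast iterables
    return n_batches, n_batches / max(elapsed, 1e-9)


# Stand-in for a data pipeline; any iterable of batches works
def slow_batches(n, delay=0.001):
    for i in range(n):
        time.sleep(delay)
        yield i


n_consumed, rate = measure_throughput(slow_batches(20), max_batches=10)
print(f"Consumed {n_consumed} batches at {rate:.1f} batches/s")
```

In eager mode a `tf.data.Dataset` can be iterated directly, so you could pass `dataset` instead of `slow_batches(20)` and compare the rate for different batch sizes or prefetch settings.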

## Using available datasets
Some common public datasets are available at `/mimer/NOBACKUP/Datasets`. If there is a specific dataset you would like to see added, you can create a request at [NAISS-support](https://supr.naiss.se/support/).

In this part we will access the MNIST dataset available at `/mimer/NOBACKUP/Datasets/MNIST/mnist.npz`.

In [None]:
# 10 (3, 3) convolutional filters followed by a dense layer
model = keras.Sequential([
    layers.Conv2D(10, 3, activation="relu", input_shape=(28, 28, 1), use_bias=True),
    layers.Flatten(),
    layers.Dense(10),
])

model.summary()


In this case we'll load the data as NumPy arrays through the TensorFlow Keras datasets API. Then we'll massage this output into the correct shape. An alternative would have been to use the TensorFlow Datasets API.
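To make the reshaping concrete, here is a minimal NumPy sketch of the same two steps: adding a channel axis to the images and one-hot encoding the labels. It uses synthetic arrays as a stand-in for MNIST, and all variable names are our own. The notebook itself does this with `tf.expand_dims` and `tf.one_hot`.

```python
import numpy as np

# Synthetic stand-in with the same layout as MNIST: uint8 images and integer labels
rng = np.random.default_rng(0)
example_imgs = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
example_labels = np.array([3, 0, 9, 1, 3])

# Add a trailing channel axis: (N, 28, 28) -> (N, 28, 28, 1)
imgs_with_channel = np.expand_dims(example_imgs, axis=3)

# One-hot encode the labels: (N,) -> (N, 10)
one_hot_labels = np.eye(10, dtype=np.float32)[example_labels]

print(imgs_with_channel.shape, one_hot_labels.shape)
```

The channel axis is needed because the convolutional layer expects inputs of shape `(28, 28, 1)`, and the one-hot labels match the `CategoricalCrossentropy` loss used when compiling the model.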

In [None]:
(train_imgs, train_labels), _ = keras.datasets.mnist.load_data(path="/mimer/NOBACKUP/Datasets/MNIST/mnist.npz")
train_data = (
    tf.expand_dims(train_imgs, 3),
    tf.one_hot(train_labels, 10),
)
dataset = tf.data.Dataset.from_tensor_slices(train_data).batch(128)


In [None]:
len(dataset)

In [None]:
model.compile(
    keras.optimizers.Adam(learning_rate=0.01),
    keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.fit(dataset, steps_per_epoch=len(dataset), epochs=3, verbose=2);

## Loading data through a TensorFlow related API
Some datasets can be found and used through TensorFlow Keras, as we did in the earlier example. The only difference is that you change the path to where you would like to store the dataset. More datasets can be found through [TensorFlow Datasets](https://www.tensorflow.org/datasets/overview). This package doesn't currently exist in the module tree, but if there is interest it can probably be added.

However, note that for both of these the data download can take some time and you will have to store the data yourself. So for your own and others' sake, please check whether the dataset already exists, and for larger datasets don't hesitate to contact support if you are unsure about anything.