# Loading data with TensorFlow
In this notebook we will investigate a few different ways to handle data with TensorFlow on Alvis.

## Using your own data
In many cases you already have a dataset in mind, stored in your home folder or, more likely, in a storage project.

In this section we will use the dataset in `data.tar.gz`. First, let us take a look at it.

### The file tree
To see what a tar file contains, the command `tar -tf my_tarfile.tar` is useful. However, we often want to know specific things about the directory structure and the filenames. The script below prints the number of png files in each directory and some typical filenames.

In [None]:
%%bash
# This will find the directories and files that do not have names ending with .png
# and then count the number of files with names containing ".png" for each of these
echo " #Files   Path"
echo "=================="
for dir in $(tar --exclude="*.png" -tf data.tar.gz); do
    n_files=$(tar -tf data.tar.gz --wildcards "$dir*.png" | wc -l)
    printf "  %5s   %s\n" "$n_files" "$dir"
done

# List the first 5 and last 5 png filenames in data/1/
echo  # New line
echo " Typical filenames"
echo "==================="
tar -tf data.tar.gz --wildcards --no-anchored "data/1/*.png" | (head -n 5; echo "..."; tail -n 5)

***NOTE:*** This tar file contains "only" 60000 files. For much larger archives these operations cause significant file I/O and should be avoided as much as possible. If the dataset comes with a README, it is wise to read it first.
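
The bash loop above decompresses and scans the archive once per directory. As a rough sketch of how the same counts could be gathered in a single pass, the snippet below uses Python's standard `tarfile` module. The function names `count_files_per_directory` and `print_counts` are made up for this illustration and are not part of the notebook.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_files_per_directory(archive_path, suffix=".png"):
    """Count regular files ending in `suffix`, grouped by parent directory."""
    counts = Counter()
    # Iterating over the TarFile yields members in archive order,
    # so the compressed archive is read only once.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(suffix):
                counts[str(PurePosixPath(member.name).parent)] += 1
    return counts


def print_counts(counts):
    """Print the counts in the same layout as the bash script."""
    print(" #Files   Path")
    print("==================")
    for directory, n_files in sorted(counts.items()):
        print(f"  {n_files:5d}   {directory}")
```

Calling `print_counts(count_files_per_directory("data.tar.gz"))` would then print one line per class directory.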

### Looking at some of the data
Now that we know the file structure, let us get acquainted with the data itself.

First extract a small subset of the images.

In [None]:
%%bash
# Extract the first 50 images (im-00000.png to im-00049.png)
tar -xvf data.tar.gz --wildcards "data/*/im-000[0-4]?.png"
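
To check which files a wildcard pattern like the one above selects, without touching the archive, the pattern can be tried on a list of names with Python's `fnmatch` module. This is only a sketch: the example names are hypothetical (100 images spread over class folders `1` to `10`) and `select_names` is a helper invented for this illustration.

```python
from fnmatch import fnmatchcase

# Same shell-style pattern as in the tar command above.
PATTERN = "data/*/im-000[0-4]?.png"


def select_names(names, pattern=PATTERN):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical member names, not read from the real archive.
example_names = [f"data/{1 + ix % 10}/im-{ix:05d}.png" for ix in range(100)]
selected = select_names(example_names)
print(len(selected), selected[0], selected[-1])
```

The character class `[0-4]` followed by `?` covers the image indices 0 to 49, so the pattern selects 50 of the 100 example names.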

Now let us take a look at these.

In [None]:
from glob import glob
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

fig, ax_grid = plt.subplots(7, 7, figsize=(15, 15))
for ix, ax in enumerate(ax_grid.flatten()):
    # Get path to file and label
    filepath = glob(f"data/*/im-{ix:05d}.png")[0]
    _, label, filename = filepath.split("/")
    
    # Add to axis
    img = mpimg.imread(filepath)
    ax.imshow(img)
    # here I cheated because I already knew what the label meant
    ax.set_title(f"Digit {int(label) - 1}")

fig.tight_layout()

Note that the labels are offset by 1 compared to the digits. The dataset is actually a modified version of the MNIST handwritten digit training set. To reduce the size of the dataset, the images have been shrunk to only 9x9 pixels and converted to monochrome.

### Training a classifier from this data
Now that we have some understanding of what the dataset contains, we are ready to do some ML on it.

First we will define our machine learning model.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.data import Dataset
from tensorflow.keras.preprocessing.image import ImageDataGenerator


In [None]:
# 10 (3, 3) convolutional filters followed by a dense layer
model = keras.Sequential([
    layers.Conv2D(10, 3, activation="relu", input_shape=(9, 9, 1), use_bias=True),
    layers.Flatten(),
    layers.Dense(10),
])

model.summary()


Now we come to the step where we load the data. When a dataset has the structure "root/class/input" and the inputs are images, we can use `ImageDataGenerator.flow_from_directory` from `tensorflow.keras.preprocessing.image`, which infers the labels from the directory names.

In [None]:
# "*" also matches class directories whose names are longer than one character
if len(glob("data/*/*.png")) < 60000:
    import warnings
    warnings.warn("\"data/\" is not fully unpacked!")

train_batches = ImageDataGenerator().flow_from_directory(
    "data",
    target_size=(9, 9),
    color_mode="grayscale",
    batch_size=128,
    shuffle=True,
)


dataset = Dataset.from_generator(
    lambda: train_batches,
    output_types=(tf.float32, tf.float32), 
    output_shapes=([None, 9, 9, 1], [None, 10]),
)
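
One detail worth checking is how `flow_from_directory` turns folder names into class indices: it sorts the subdirectory names as strings, not as numbers. The sketch below reproduces that ordering in plain Python. It assumes the class folders are named `1` to `10` (an assumption based on the labels being digit + 1; the actual archive may differ).

```python
# Assumption: the class folders are named "1" to "10" (digit + 1).
class_dirs = [str(n) for n in range(1, 11)]

# Sorting as strings puts "10" directly after "1".
class_indices = {name: index for index, name in enumerate(sorted(class_dirs))}

# Map each class index (position of the 1 in the one-hot label) to its digit.
index_to_digit = {index: int(name) - 1 for name, index in class_indices.items()}
print(index_to_digit)
```

Under this assumption class index 1 corresponds to the digit 9, not the digit 1. On the real data, `train_batches.class_indices` shows the mapping that was actually used, which matters when interpreting predictions.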


In [None]:
model.compile(
    keras.optimizers.Adam(learning_rate=0.01),
    keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.fit(dataset, steps_per_epoch=(1 + train_batches.n // train_batches.batch_size), epochs=3, verbose=2)
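
The expression `1 + n // batch_size` in the call above gives the number of batches per epoch as long as `n` is not a multiple of the batch size. A small sketch comparing it with a ceiling division follows. The function names are made up for this illustration, and 60032 is just an example value that is divisible by 128.

```python
import math


def steps_floor_plus_one(n_samples, batch_size):
    """The expression used in the model.fit call above."""
    return 1 + n_samples // batch_size


def steps_ceil(n_samples, batch_size):
    """Number of batches needed to see every sample exactly once."""
    return math.ceil(n_samples / batch_size)


for n_samples in (60000, 60032):
    print(n_samples, steps_floor_plus_one(n_samples, 128), steps_ceil(n_samples, 128))
```

For 60000 samples both give 469 steps. For a sample count that is an exact multiple of the batch size, the first expression asks for one step more than is needed. With a Keras directory iterator, `len(train_batches)` should give the ceiling value directly.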

### Tasks
 1. Make yourself acquainted with the above code.
 2. Take a look at `jobscript-tensorflow.sh`. In this script we unpack the dataset on \$TMPDIR and then train the model on the entire dataset.
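
The unpacking step could also be done from Python. The following is a minimal sketch, not the contents of `jobscript-tensorflow.sh`: the helper `unpack_to_tmpdir` is hypothetical, and it falls back to the system temporary directory when `TMPDIR` is not set.

```python
import os
import tarfile
import tempfile


def unpack_to_tmpdir(archive_path):
    """Extract an archive under $TMPDIR and return the extraction root."""
    # Fall back to the system temp dir when $TMPDIR is not set.
    target = os.environ.get("TMPDIR", tempfile.gettempdir())
    # Only do this with archives you trust, since extractall writes
    # whatever paths the archive contains.
    with tarfile.open(archive_path, "r:*") as archive:
        archive.extractall(target)
    return target
```

After `root = unpack_to_tmpdir("data.tar.gz")`, the images would be found under `os.path.join(root, "data")`, which can then be passed to `flow_from_directory` instead of `"data"`.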

## Using available datasets
Some common public datasets are available at `/cephyr/NOBACKUP/Datasets`. If there is a specific dataset you would like to see added, you can create a request at [SNIC-support](https://supr.snic.se/support/).

In this part we will access the MNIST dataset available at `/cephyr/NOBACKUP/Datasets/MNIST/mnist.npz`.

In [None]:
# 10 (3, 3) convolutional filters followed by a dense layer
model = keras.Sequential([
    layers.Conv2D(10, 3, activation="relu", input_shape=(28, 28, 1), use_bias=True),
    layers.Flatten(),
    layers.Dense(10),
])

model.summary()


In this case we'll load the data as NumPy arrays through TensorFlow Keras and then massage them into the correct shape. An alternative would be to use the TensorFlow Datasets API.

In [None]:
(train_imgs, train_labels), _ = keras.datasets.mnist.load_data(path="/cephyr/NOBACKUP/Datasets/MNIST/mnist.npz")
train_data = (
    tf.expand_dims(train_imgs, 3),
    tf.one_hot(train_labels, 10),
)
dataset = tf.data.Dataset.from_tensor_slices(train_data).batch(128)
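
To make the reshaping above more concrete, here is a sketch of the same steps using NumPy only. It assumes the file uses the keys `x_train` and `y_train`, as the Keras copy of `mnist.npz` does. The function `load_mnist_npz` is a hypothetical helper and not part of Keras.

```python
import numpy as np


def load_mnist_npz(path):
    """Read the training split from an mnist.npz file using NumPy only."""
    with np.load(path) as archive:
        images, labels = archive["x_train"], archive["y_train"]
    # Add a trailing channel axis: (N, 28, 28) -> (N, 28, 28, 1).
    images = images[..., np.newaxis]
    # One-hot encode: row i of the identity matrix encodes label i.
    one_hot = np.eye(10, dtype=np.float32)[labels]
    return images, one_hot
```

The two returned arrays have the same shapes as the tensors produced by `tf.expand_dims(train_imgs, 3)` and `tf.one_hot(train_labels, 10)` in the cell above.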


In [None]:
model.compile(
    keras.optimizers.Adam(learning_rate=0.01),
    keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# The dataset is finite, so each epoch simply runs through all of its batches
model.fit(dataset, epochs=3, verbose=2)

## Loading data through a TensorFlow related API
Some datasets can be found and used through TensorFlow Keras, as we did in the earlier example. The only difference is that you change the path to where you would like to store the dataset. More datasets can be found through [TensorFlow Datasets](https://www.tensorflow.org/datasets/overview). This package is currently not available in the module tree, but if there is interest it can probably be added.

However, note that for both of these the download can take some time and you will have to store the data yourself. So, for your own and others' sake, please check first whether the dataset is already available, and for larger datasets don't hesitate to contact support if you are unsure about anything.