# Sample Notebook for Saving and Loading TensorFlow Models

This notebook shows how to save and load TensorFlow models and checkpoints in Synapse.
It uses mssparkutils to mount an ADLS storage location to a local folder.
Models and checkpoints are then saved to or loaded from the mounted folder,
so the files are written to or read from ADLS storage.

**This approach should work with any framework: TensorFlow, PyTorch, MXNet, etc.**
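
Because the mount exposes the storage as an ordinary directory, anything that can write to a file path can use it. The sketch below illustrates the idea under stated assumptions: `save_artifact` and `load_artifact` are hypothetical helpers, Python's `pickle` stands in for a framework's own serializer, and a temporary directory stands in for the mounted folder.

```python
import os
import pickle
import tempfile


def save_artifact(obj, folder, name):
    """Serialize obj to folder/name with pickle and return the file path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, name)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path


def load_artifact(folder, name):
    """Read back an object previously written by save_artifact."""
    with open(os.path.join(folder, name), "rb") as f:
        return pickle.load(f)


# A temporary directory stands in for the mounted ADLS folder (assumption).
with tempfile.TemporaryDirectory() as fake_mount:
    model_dir = os.path.join(fake_mount, "saved_model")
    weights = {"layer1": [0.1, 0.2], "bias": [0.0]}
    save_artifact(weights, model_dir, "weights.pkl")
    restored = load_artifact(model_dir, "weights.pkl")
```

In a real run, the folder argument would be the mounted path rather than a temporary directory.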

### **Before Running:**
1. Set the configuration for the Spark pool; see [Spark pool configuration](#spark_configuration).
2. Prepare the ADLS storage linked service and account information; see [Define parameters and Setup ADLS storage](#adls_storage).

<span id="spark_configuration"></span>
## Spark pool configuration
**The configuration must be in the first cell (including markdown and code cells) of the notebook.**

In most cases, we only need to set `numExecutors` and `spark.rapids.memory.gpu.reserve`. For very large models such as BERT, `spark.kryoserializer.buffer.max` may also be needed.
For TensorFlow models, `spark.executorEnv.TF_FORCE_GPU_ALLOW_GROWTH` should be set to `true`.

An example with more configuration parameters is shown below. The detailed meaning of each parameter is explained in [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html).
The values here are the recommended settings for Synapse GPU large pools. `numExecutors` should be less than or equal to the number of nodes.
```
%%configure -f
{
    "driverMemory": "30g",
    "driverCores": 4,
    "executorMemory": "60g",
    "executorCores": 12,
    "numExecutors": 3,
    "conf":{
        "spark.rapids.memory.gpu.reserve": "10g",
        "spark.executorEnv.TF_FORCE_GPU_ALLOW_GROWTH": "true",
        "spark.kryoserializer.buffer.max": "2000m"
   }
}
```
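
The constraint on `numExecutors` can be checked before the payload is pasted into the `%%configure` cell. This is a minimal sketch, assuming a hypothetical helper `build_pool_config`; the node count is something you supply yourself, since nothing here queries the pool.

```python
import json


def build_pool_config(num_executors, num_nodes, gpu_reserve="10g"):
    """Return a %%configure payload as JSON text (hypothetical helper).

    Raises ValueError when more executors are requested than nodes exist.
    """
    if num_executors > num_nodes:
        raise ValueError(
            f"numExecutors ({num_executors}) exceeds node count ({num_nodes})"
        )
    config = {
        "numExecutors": num_executors,
        "conf": {
            "spark.rapids.memory.gpu.reserve": gpu_reserve,
            "spark.executorEnv.TF_FORCE_GPU_ALLOW_GROWTH": "true",
        },
    }
    return json.dumps(config, indent=4)


payload = build_pool_config(num_executors=3, num_nodes=3)
print(payload)
```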

**The configuration below is only needed for GPU pools.**

In [1]:
%%configure -f
{
    "numExecutors": 3,
    "conf":{
        "spark.rapids.memory.gpu.reserve": "10g",
        "spark.executorEnv.TF_FORCE_GPU_ALLOW_GROWTH": "true"
   }
}

StatementMeta(, 36, -1, Finished, Available)

<span id="adls_storage"></span>
## Define parameters and Setup ADLS storage

ADLS storage is needed to store intermediate data and models.

- First, set up a [linked service](https://docs.microsoft.com/en-us/azure/data-factory/concepts-linked-services?context=%2Fazure%2Fsynapse-analytics%2Fcontext%2Fcontext&tabs=data-factory).

- Second, modify `remote_url` and `linked_service_name` according to your service.

In [2]:
# Configure the ADLS remote URL, the linked service name, and the model subfolder
remote_url = "abfss://default@lijiang1ppesg.dfs.core.windows.net"
linked_service_name = 'lijiang1-synapse-ml-gpu-preprod-ne-WorkspaceDefaultStorage'
model_path = 'saved_model'

StatementMeta(gpu32, 36, 2, Finished, Available)
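
The `remote_url` above follows the `abfss://<container>@<account>.dfs.core.windows.net` pattern. As a small sketch, a hypothetical helper `build_abfss_url` can assemble it from its parts; `myaccount` below is a placeholder, not a real storage account.

```python
def build_abfss_url(container, account, path=""):
    """Assemble an abfss:// URL from a container, an account, and an optional path."""
    url = f"abfss://{container}@{account}.dfs.core.windows.net"
    if path:
        # Strip stray slashes so the result never contains "//" after the host.
        url += "/" + path.strip("/")
    return url


example_url = build_abfss_url("default", "myaccount", "/tmp/")
print(example_url)
```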

### Mount ADLS storage to local

In [3]:
from notebookutils import mssparkutils

local_folder = "/tmp"  # local folder to mount adls folder
remote_folder = "/tmp/"  # folder in adls to save model


def mount_folder():
    # Mount the ADLS folder so that models can be read and written through a local path
    mssparkutils.fs.mount(remote_url + remote_folder, local_folder,
                          {"linkedService": linked_service_name})
    jobId = mssparkutils.env.getJobId()
    mount_path = f"/synfs/{jobId}/{remote_folder}/"
    return mount_path


def unmount_folder():
    mssparkutils.fs.unmount(local_folder)


mount_path = mount_folder()  # save your model in this folder
print(mount_path)

StatementMeta(gpu32, 36, 3, Finished, Available)

/synfs/36//tmp//
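
The printed path contains doubled slashes because both path pieces carry their own separators. That is harmless on POSIX file systems, but a cleaner path can be built with `posixpath`. The sketch below assumes, as the output above suggests, that the mount is visible under `/synfs/<jobId>/<mount point>`; `build_mount_path` is a hypothetical helper.

```python
import posixpath


def build_mount_path(job_id, mount_point):
    """Join the /synfs root, the job id, and the mount point without doubled slashes."""
    return posixpath.join("/synfs", str(job_id), mount_point.strip("/"))


clean_mount_path = build_mount_path(36, "/tmp/")
checkpoint_path = posixpath.join(clean_mount_path, "saved_model", "checkpointfile")
print(clean_mount_path)
print(checkpoint_path)
```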


## Define Model

In [8]:
import numpy as np
from tensorflow import keras
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ModelCheckpoint  # use the same Keras package as the other imports

StatementMeta(gpu32, 36, 8, Finished, Available)

In [9]:
# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# Load the data and split it between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

StatementMeta(gpu32, 36, 9, Finished, Available)

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
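
To make the two preprocessing steps concrete, here is a dependency-free sketch of what they compute on tiny inputs. `scale_pixels` and `one_hot` are illustrative stand-ins written for this example, not the Keras or NumPy implementations.

```python
def scale_pixels(pixels, max_value=255):
    """Map integer pixel values in [0, max_value] to floats in [0, 1]."""
    return [p / max_value for p in pixels]


def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 at the label index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


scaled = scale_pixels([0, 51, 255])
encoded = one_hot([0, 2, 1], num_classes=3)
print(scaled)
print(encoded)
```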


In [10]:
batch_size = 128
epochs = 3

def create_model():
    model = keras.Sequential(
        [
            keras.Input(shape=input_shape),
            layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Flatten(),
            layers.Dropout(0.5),
            layers.Dense(num_classes, activation="softmax"),
        ]
    )

    model.compile(loss="categorical_crossentropy",
                optimizer="adam",
                metrics=["accuracy"])

    return model

model = create_model()
model.summary()

StatementMeta(gpu32, 36, 10, Finished, Available)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dropout (Dropout)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                1
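
The shapes and parameter counts of the convolutional rows in the summary can be reproduced by hand. The sketch below assumes the Keras defaults used in `create_model` (no padding, stride 1 for `Conv2D`, and pooling stride equal to the pool size); the helper names are made up for this example.

```python
def conv2d_output_size(size, kernel):
    """Spatial size after a convolution with no padding and stride 1."""
    return size - kernel + 1


def pool_output_size(size, pool):
    """Spatial size after max pooling with stride equal to the pool size."""
    return size // pool


def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters


conv1_size = conv2d_output_size(28, 3)         # 26
pool1_size = pool_output_size(conv1_size, 2)   # 13
conv2_size = conv2d_output_size(pool1_size, 3) # 11
pool2_size = pool_output_size(conv2_size, 2)   # 5
flatten_units = pool2_size * pool2_size * 64   # 1600

conv1_params = conv2d_params(3, 1, 32)         # 320
conv2_params = conv2d_params(3, 32, 64)        # 18496
```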

## Train and Save Model

In [11]:
checkpoint = ModelCheckpoint(filepath=mount_path + model_path +
                             '/checkpointfile',
                             save_weights_only=True,
                             monitor='val_accuracy',
                             verbose=2,
                             save_best_only=True)

callbacks = [checkpoint]

model.fit(x_train,
          y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.1,
          callbacks=callbacks)


StatementMeta(gpu32, 36, 11, Finished, Available)

Epoch 1/3

Epoch 00001: val_accuracy improved from -inf to 0.97750, saving model to /synfs/36//tmp//saved_model/checkpointfile
Epoch 2/3

Epoch 00002: val_accuracy improved from 0.97750 to 0.98467, saving model to /synfs/36//tmp//saved_model/checkpointfile
Epoch 3/3

Epoch 00003: val_accuracy improved from 0.98467 to 0.98733, saving model to /synfs/36//tmp//saved_model/checkpointfile


<keras.callbacks.History at 0x7f5818145550>
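
With `save_best_only=True`, the checkpoint is rewritten only when the monitored metric improves, which is why all three epochs in the log above saved a file. The sketch below is a simplified illustration of that rule, not the Keras implementation; `epochs_that_save` is a hypothetical helper.

```python
def epochs_that_save(val_metrics, mode="max"):
    """Return the 1-based epochs at which a best-only checkpoint would be written."""
    best = float("-inf") if mode == "max" else float("inf")
    saved = []
    for epoch, value in enumerate(val_metrics, start=1):
        improved = value > best if mode == "max" else value < best
        if improved:
            best = value
            saved.append(epoch)
    return saved


# Validation accuracies taken from the training log above.
saved_epochs = epochs_that_save([0.97750, 0.98467, 0.98733])
# A made-up run where the second epoch is worse and therefore skipped.
skipped_example = epochs_that_save([0.90, 0.80, 0.95])
```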

In [13]:
# Evaluate the model
loss, acc = model.evaluate(x_test, y_test, verbose=2)
print("Trained model, accuracy: {:5.2f}%".format(100 * acc))

StatementMeta(gpu32, 36, 13, Finished, Available)

313/313 - 1s - loss: 0.0498 - accuracy: 0.9852
Trained model, accuracy: 98.52%


The checkpoint files are saved to ADLS storage.

![](https://synapseaisolutionsa.blob.core.windows.net/public/imgs/tf_save_load.png)

## Load Model

In [16]:
loaded_model = create_model()
loaded_model.load_weights(mount_path + model_path + '/checkpointfile')

StatementMeta(gpu32, 36, 16, Finished, Available)

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f5b0dd0bcd0>

In [17]:
# Evaluate the loaded model
loss1, acc1 = loaded_model.evaluate(x_test, y_test, verbose=2)
print("Loaded model, accuracy: {:5.2f}%".format(100 * acc1))

StatementMeta(gpu32, 36, 17, Finished, Available)

313/313 - 1s - loss: 0.0498 - accuracy: 0.9852
Loaded model, accuracy: 98.52%


In [19]:
# Parenthesize the conditional so that the 'Save and Load ' prefix is printed in both cases
print('Save and Load ' + ('Succeed!' if acc1 == acc else 'Failed!'))

StatementMeta(gpu32, 36, 19, Finished, Available)

Save and Load Succeed!
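
Exact equality between two floating-point accuracies works here because the same weights are evaluated on the same data, but a tolerance-based check is more forgiving if evaluation is not bit-for-bit reproducible. This is a minimal sketch with a hypothetical `report` helper; the tolerance value is an arbitrary choice.

```python
import math


def accuracies_match(acc_before, acc_after, rel_tol=1e-6):
    """Compare two accuracy values with a relative tolerance."""
    return math.isclose(acc_before, acc_after, rel_tol=rel_tol)


def report(acc_before, acc_after):
    """Build the same status message as the cell above, using the tolerant check."""
    status = "Succeed!" if accuracies_match(acc_before, acc_after) else "Failed!"
    return "Save and Load " + status


print(report(0.9852, 0.9852))
```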
