# Working with datasets in TensorFlow
## tf.keras.datasets vs. TensorFlow Datasets
Introductory ML class on datasets and ML engineering

Daniel Trad, using ChatGPT.

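The tf.keras.datasets module provides access to a handful of small public datasets (MNIST, Fashion-MNIST, CIFAR-10, and others). Each load_data() call returns NumPy arrays as (x_train, y_train), (x_test, y_test) tuples, which can be passed directly to tf.keras models. These datasets are small and well understood, which makes them useful for testing and debugging.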
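As a minimal sketch of how the NumPy arrays returned by load_data() can be wrapped in a tf.data.Dataset, the block below uses random arrays with MNIST's shapes in place of the real download. The stand-in data and the variable names are assumptions made for illustration only.

```python
import numpy as np
import tensorflow as tf

# Stand-in for tf.keras.datasets.mnist.load_data(): random uint8 images and
# integer labels with MNIST's shapes (assumption: avoids a network download).
rng = np.random.default_rng(0)
x_train = rng.integers(0, 256, size=(64, 28, 28), dtype=np.uint8)
y_train = rng.integers(0, 10, size=(64,), dtype=np.uint8)

# Wrap the in-memory arrays in a tf.data.Dataset: one (image, label) pair per element.
ds_train = tf.data.Dataset.from_tensor_slices((x_train, y_train))

# Scale pixels to [0, 1], then group the elements into batches of 32.
ds_train = ds_train.map(
    lambda image, label: (tf.cast(image, tf.float32) / 255.0, label))
ds_train = ds_train.batch(32)

first_images, first_labels = next(iter(ds_train))
print(first_images.shape, first_labels.shape)
```

With the real data, the two rng.integers lines would be replaced by the load_data() call; the rest of the pipeline stays the same.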

tfds (TensorFlow Datasets, imported as tensorflow_datasets) is a collection of datasets ready to use with TensorFlow, returned as tf.data.Dataset objects. It covers a wide range of tasks such as object detection, language translation, and recommendation systems. The datasets provided by tfds are typically larger and more complex than those in tf.keras.datasets. They are also well documented and include detailed information about the data, such as the number of classes, the format of the data, and how the data was collected and preprocessed. Additionally, tfds includes tools for loading, preprocessing, and manipulating the data, making it easier to work with large and complex datasets.

Here is a simple example of how you can create a deep neural network (DNN) in tf.keras for the MNIST dataset using tf.keras.datasets:

In [2]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras import Model

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Normalize the pixel values
x_train, x_test = x_train / 255.0, x_test / 255.0

# Build the model
inputs = tf.keras.layers.Input(shape=(28, 28))
x = Flatten()(inputs)
x = Dense(128, activation='relu')(x)
x = Dense(128, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)
model = Model(inputs=inputs, outputs=predictions)

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=5)

# Evaluate the model
model.evaluate(x_test, y_test)



Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


[0.07693281024694443, 0.9782999753952026]

This code builds a simple DNN with two hidden layers (128 units each) and a 10-unit softmax output layer, one unit per MNIST digit class. The model is compiled with the Adam optimizer and the sparse_categorical_crossentropy loss, which accepts the integer labels returned by load_data() directly, and is trained on the training data for 5 epochs. Finally, the model is evaluated on the test data; the two numbers printed above are the test loss (about 0.077) and the test accuracy (about 0.978).
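To make the loss function concrete: sparse_categorical_crossentropy takes integer class labels and averages the negative log of the probability that the model assigns to the correct class. The block below is a small NumPy re-implementation written for illustration (the function is hypothetical, not the Keras source), applied to toy probabilities rather than real MNIST outputs.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    """Mean of -log(p[true class]) over the batch; y_true holds integer class ids."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=np.float64)
    # Pick, for every row, the probability assigned to that row's true class.
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())

# Two samples, three classes (toy numbers chosen for the example).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
labels = np.array([0, 2])

loss = sparse_categorical_crossentropy(labels, probs)
print(loss)  # mean of -ln(0.7) and -ln(0.8)
```

Because the labels are plain integers, no one-hot encoding step is needed before training.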

Here is a simple example of how you can create a deep neural network (DNN) in tf.keras for the MNIST dataset using tfds:


In [3]:
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras import Model

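# Load the MNIST train and test splits.
# with_info=True is needed for tfds.load to also return the DatasetInfo object.
(ds, ds_test), info = tfds.load('mnist', split=['train', 'test'],
                                as_supervised=True, with_info=True)

# Preprocess the data: cast the uint8 pixels to float32 and scale them to [0, 1]
def preprocess(image, label):
  image = tf.cast(image, tf.float32) / 255.0
  return image, label

ds = ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.batch(32)
ds = ds.prefetch(tf.data.AUTOTUNE)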

# Build the model
inputs = tf.keras.layers.Input(shape=(28, 28, 1))  # tfds MNIST images carry a trailing channel axis
x = Flatten()(inputs)
x = Dense(128, activation='relu')(x)
x = Dense(128, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)
model = Model(inputs=inputs, outputs=predictions)

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(ds, epochs=5)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f40ba074700>

This code builds the same network as the previous example: two hidden layers (128 units each) and a 10-unit output layer, one unit per MNIST class. The data is loaded with tfds.load as a tf.data.Dataset, then preprocessed by scaling the pixel values to [0, 1], batching, and prefetching. The model is compiled with the Adam optimizer and the sparse_categorical_crossentropy loss, and is trained on the training split for 5 epochs. Note that the test split is not used in this example.

To evaluate the model on the test data, build a ds_test dataset from the 'test' split, apply the same map, batch, and prefetch steps used for ds, and pass it to model.evaluate.
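The evaluation step described above might look like the following sketch. To keep it runnable offline, random arrays stand in for the two tfds splits; the helpers make_pipeline and fake_split are hypothetical names introduced here, and the smaller network is an assumption chosen only to keep the example quick.

```python
import numpy as np
import tensorflow as tf

def preprocess(image, label):
    # Same scaling as in the training pipeline.
    return tf.cast(image, tf.float32) / 255.0, label

def make_pipeline(dataset, batch_size=32):
    # Hypothetical helper: applies identical steps to any split.
    return (dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
                   .batch(batch_size)
                   .prefetch(tf.data.AUTOTUNE))

def fake_split(n, seed):
    # Stand-in for one split of tfds.load('mnist', as_supervised=True).
    rng = np.random.default_rng(seed)
    images = rng.integers(0, 256, size=(n, 28, 28, 1), dtype=np.uint8)
    labels = rng.integers(0, 10, size=(n,), dtype=np.int64)
    return tf.data.Dataset.from_tensor_slices((images, labels))

ds = make_pipeline(fake_split(256, seed=0))
ds_test = make_pipeline(fake_split(64, seed=1))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(ds, epochs=1, verbose=0)

# evaluate returns the loss followed by the compiled metrics.
test_loss, test_accuracy = model.evaluate(ds_test, verbose=0)
print(test_loss, test_accuracy)
```

Routing both splits through one helper keeps the training and test preprocessing from drifting apart.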

## Bigger datasets: machine learning engineering

There are a few different ways you could modify the example code to work with a larger dataset:

Use a generator to load the data in batches: Instead of loading the entire dataset into memory at once, you can use a generator function to load the data in smaller batches, which can be more memory-efficient. You can then pass the generator to the fit function using the steps_per_epoch and validation_steps arguments.

Use model checkpointing to save and restore the model weights: As the model trains, you can use the ModelCheckpoint callback to save the model weights to disk after each epoch. This way, if the training process is interrupted, you can restore the model weights from the most recent checkpoint and continue training from there.

Use distributed training to train the model across multiple GPUs: If you have access to multiple GPUs, you can use TensorFlow's tf.distribute API to distribute the training process across multiple devices. This can significantly speed up training on large datasets.

Preprocess the data in parallel: You can use the tf.data API to preprocess the data in parallel using multiple CPU cores. This can help to speed up the data loading and preprocessing steps, especially if the dataset is large.

Use data augmentation to generate additional training data: If the dataset is small, you can use data augmentation techniques to generate additional training examples by applying random transformations to the existing data. This can help to improve the generalization performance of the model.
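As an illustration of the checkpointing option in the list above, here is a minimal sketch using the ModelCheckpoint callback. It trains on random stand-in data, and the temporary directory, file name pattern, and build_model helper are assumptions made for the example. The .weights.h5 suffix is used because recent Keras versions require it for weight-only checkpoints.

```python
import glob
import os
import tempfile

import numpy as np
import tensorflow as tf

# Synthetic stand-in data with MNIST's shapes (assumption: keeps the sketch offline).
rng = np.random.default_rng(0)
x = rng.random((128, 28, 28)).astype("float32")
y = rng.integers(0, 10, size=(128,))

def build_model():
    # Hypothetical helper so the same architecture can be rebuilt after a restart.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical checkpoint location; Keras fills in {epoch:02d} at save time.
ckpt_dir = tempfile.mkdtemp()
ckpt_pattern = os.path.join(ckpt_dir, "epoch-{epoch:02d}.weights.h5")

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=ckpt_pattern,
    save_weights_only=True,  # store the weights only, not the full model
)

model = build_model()
model.fit(x, y, epochs=2, batch_size=32, callbacks=[checkpoint_cb], verbose=0)

# After an interruption: rebuild the architecture and load the latest weights.
saved = sorted(glob.glob(os.path.join(ckpt_dir, "*.weights.h5")))
restored = build_model()
restored.load_weights(saved[-1])
print(saved)
```

To continue training from the restored weights, fit accepts an initial_epoch argument so that epoch numbering picks up where it stopped. Weight-only checkpoints do not carry the optimizer state.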

Here is an example of how you could modify the code to use a data generator:

In [4]:
import tensorflow as tf

# Create a data generator
def data_generator(x, y, batch_size=32):
  while True:
    for i in range(0, len(x), batch_size):
      x_batch = x[i:i+batch_size]
      y_batch = y[i:i+batch_size]
      yield x_batch, y_batch

# Load and preprocess the data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Create the data generators
train_generator = data_generator(x_train, y_train)
test_generator = data_generator(x_test, y_test)

# Build the model
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

# Compile and train the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_generator, steps_per_epoch=len(x_train) // 32, epochs=5)

# Evaluate the model
model.evaluate(test_generator, steps=len(x_test) // 32)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


[0.0832531750202179, 0.9737580418586731]

In this version of the code, we define a data_generator function that yields batches of data from the input arrays, looping forever so that Keras can draw as many batches as it needs. We then create generator objects for the training and test data and pass them to fit and evaluate with the steps_per_epoch and steps arguments, which tell Keras how many batches make up one pass. Because len(x_test) // 32 rounds down, the final partial batch of 16 test images is skipped in the evaluation. Note that MNIST is still loaded fully into memory here by load_data(); the memory saving comes when the generator instead reads each batch from disk, while the training loop stays the same.
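A Python generator like this one can also be wrapped in a tf.data.Dataset, which adds prefetching and gives TensorFlow explicit shape and dtype information. The following is a sketch under stated assumptions: random arrays stand in for the normalized MNIST data, and the generator is changed to make a single pass, since a tf.data pipeline can repeat it with .repeat() when needed.

```python
import numpy as np
import tensorflow as tf

def data_generator(x, y, batch_size=32):
    # One pass over the arrays; the final batch may be smaller than batch_size.
    for i in range(0, len(x), batch_size):
        yield x[i:i + batch_size], y[i:i + batch_size]

# Synthetic stand-in for the normalized MNIST arrays (assumption).
rng = np.random.default_rng(0)
x_train = rng.random((100, 28, 28)).astype("float32")
y_train = rng.integers(0, 10, size=(100,)).astype("int64")

# output_signature declares the dtype and shape of each yielded batch;
# None marks the batch axis, whose size can vary.
ds = tf.data.Dataset.from_generator(
    lambda: data_generator(x_train, y_train),
    output_signature=(
        tf.TensorSpec(shape=(None, 28, 28), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.int64),
    ),
).prefetch(tf.data.AUTOTUNE)

batch_sizes = [int(images.shape[0]) for images, _ in ds]
print(batch_sizes)
```

Because the dataset ends after one pass, it can be passed to fit or evaluate without steps_per_epoch or steps, and the partial final batch is included.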

To use multiple GPUs with TensorFlow, you can use the tf.distribute.Strategy API. Here is an example of how you could modify the code to use 2 GPUs:

In [6]:
import tensorflow as tf

# Create a data generator
def data_generator(x, y, batch_size=32):
  while True:
    for i in range(0, len(x), batch_size):
      x_batch = x[i:i+batch_size]
      y_batch = y[i:i+batch_size]
      yield x_batch, y_batch

# Load and preprocess the data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Create the data generators
train_generator = data_generator(x_train, y_train)
test_generator = data_generator(x_test, y_test)

In [9]:
def createModel():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    return model


In [10]:
# MirroredStrategy() with no arguments uses every GPU visible to TensorFlow (2 on this machine)
strategy = tf.distribute.MirroredStrategy()

# Build, compile, and train the model inside the strategy scope
with strategy.scope():
    # Variables created here are mirrored on each GPU
    model = createModel()
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(train_generator, steps_per_epoch=len(x_train) // 32, epochs=5)

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')


Epoch 1/5
INFO:tensorflow:batch_all_reduce: 4 all-reduces with algorithm = nccl, num_packs = 1


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [11]:
# Evaluate the model
model.evaluate(test_generator, steps=len(x_test) // 32)



[0.08742991834878922, 0.9740168452262878]

In this version of the code, we use tf.distribute.MirroredStrategy to train on the two GPUs of this machine. MirroredStrategy creates a copy of the model on each GPU, lets each copy process a different part of the input, and all-reduces the gradients so that the copies' variables stay in sync. To use the strategy, we first create a MirroredStrategy object and then build and compile the model inside the strategy.scope() context manager, so that its variables are created as mirrored variables; model.fit can be called inside or outside the scope. Called without arguments, MirroredStrategy uses every GPU visible to TensorFlow, so the code also runs on a machine with one GPU or none, just without any speed-up.
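One practical detail when combining MirroredStrategy with a tf.data input is the batch size: the dataset's batch is the global batch, which is divided among the replicas. The sketch below scales the global batch with the number of replicas so that each device still sees 32 examples per step. The per-replica value, the random stand-in data, and the small network are assumptions made for the example.

```python
import numpy as np
import tensorflow as tf

# Uses every visible GPU, or falls back to a single CPU replica if there is none.
strategy = tf.distribute.MirroredStrategy()

per_replica_batch = 32  # assumption: the batch size each device should see
global_batch = per_replica_batch * strategy.num_replicas_in_sync

# Synthetic stand-in data with MNIST's shapes (assumption).
rng = np.random.default_rng(0)
x = rng.random((256, 28, 28)).astype("float32")
y = rng.integers(0, 10, size=(256,))
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(global_batch)

# Variables must be created inside the scope so they are mirrored.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# fit may be called outside the scope once the model is built and compiled.
history = model.fit(ds, epochs=1, verbose=0)
print(strategy.num_replicas_in_sync, global_batch)
```

When the global batch grows with the number of GPUs, the learning rate often needs retuning as well; that choice is left open here.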