## `tensorflow_datasets` and the `tf.data.Dataset` API

This notebook illustrates various ways to load data into `tf.data.Dataset` objects.

`tensorflow_datasets` (imported as `tfds`) is a library for accessing and loading TensorFlow's collection of curated data sets, which we will use from time to time. `tf.data.Dataset` is an API for building input pipelines within the TensorFlow modeling ecosystem. It is often convenient, many of the examples in Chollet use it, and it is also discussed in Geron. We illustrate how to download data from the tfds library. Those data sets are loaded as `tf.data.Dataset` objects, so we also expose a little of that functionality.

There are many ways to acquire data and process it as `tf.data.Dataset` objects, and you don't need `tensorflow_datasets` to do it. Here we illustrate several approaches and show their similarities and differences.



In [1]:
import numpy as np
import tensorflow as tf
import keras



### `cifar10` data.

This is one of the data sets curated in `tensorflow_datasets`. We will use it as an example, and you will also use it in your assignment. The data set consists of color images, each stored as pixel data of shape 32 x 32 x 3, labeled with one of ten categories. See the documentation for further information about the data set.


In [20]:
import tensorflow_datasets as tfds

# download=False assumes the data set has already been downloaded and prepared locally
train = tfds.load('cifar10', split='train[:20%]', shuffle_files=True, as_supervised=True, download=False)
val = tfds.load('cifar10', split='train[20%:30%]', shuffle_files=True, as_supervised=True, download=False)



This gives you `tf.data.Dataset` objects. When you use `tfds.load()` with `as_supervised=True`, it returns a `tf.data.Dataset` where each element is a tuple of `(image, label)`, which is ready to use directly with Keras/TensorFlow models.

The key points:
- `as_supervised=True` ensures you get tuples of `(features, labels)` instead of dictionaries
- No need for lambda functions to extract the image and label components
- The returned `train` and `val` variables are `tf.data.Dataset` objects
- You can then apply additional transformations like `.batch()`, `.shuffle()`, `.prefetch()` etc.
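
As a rough sketch of the last point, the cell below chains these transformations on a small synthetic stand-in for the CIFAR-10 data, so that it runs without any download. The random arrays, the `normalize` helper, and the buffer and batch sizes are illustrative assumptions, not recommended settings.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for CIFAR-10: 100 random uint8 images with integer labels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
labels = rng.integers(0, 10, size=(100,), dtype=np.int64)
demo_ds = tf.data.Dataset.from_tensor_slices((images, labels))

def normalize(image, label):
    """Cast uint8 pixels to float32 in [0, 1]; the label passes through unchanged."""
    return tf.cast(image, tf.float32) / 255.0, label

pipeline = (
    demo_ds
    .map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=100)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

# Inspect the first batch produced by the pipeline.
for batch_images, batch_labels in pipeline.take(1):
    print(batch_images.shape, batch_images.dtype)
    print(batch_labels.shape, batch_labels.dtype)
```

The same chain should apply unchanged to the `train` dataset returned by `tfds.load()`, since both are `tf.data.Dataset` objects yielding `(image, label)` tuples.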


In [21]:
train.element_spec

(TensorSpec(shape=(32, 32, 3), dtype=tf.uint8, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

`take(1)` returns a new dataset containing only the first element. Since `train` has not been batched, each element is a single `(image, label)` example, so `take(1)` yields just one example.
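
A small sketch of what `take(1)` yields, using `tf.data.Dataset.range` as a toy stand-in for the image data (an assumption made only to keep the example self-contained). What counts as "one element" depends on whether the dataset has been batched.

```python
import tensorflow as tf

numbers = tf.data.Dataset.range(10)  # elements are the scalars 0..9

# Unbatched: one element is one example.
unbatched_first = [int(x.numpy()) for x in numbers.take(1)]
print(unbatched_first)  # [0]

# Batched: one element is a whole batch of 4 examples.
batched_first = [x.numpy().tolist() for x in numbers.batch(4).take(1)]
print(batched_first)  # [[0, 1, 2, 3]]
```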

In [25]:
# Let's examine the actual shape and value of a label
for image, label in train.take(1):
    print(f"Label value: {label}")
    print(f"Label shape: {label.shape}")
    print(f"Label dtype: {label.dtype}")
    print(f"image shape: {image.shape}")
    print(f"image dtype: {image.dtype}")

Label value: 7
Label shape: ()
Label dtype: <dtype: 'int64'>
image shape: (32, 32, 3)
image dtype: <dtype: 'uint8'>


2025-11-02 16:56:59.673966: W tensorflow/core/kernels/data/cache_dataset_ops.cc:858] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2025-11-02 16:56:59.674588: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In [None]:
# Getting single example shape from batched data for neural network input layer

# Method 1: From element_spec (before batching)
single_image_shape = train.element_spec[0].shape
print(f"Single image shape from element_spec: {single_image_shape}")

# Method 2: Create a small batch and examine
batched_train = train.batch(32)
print(f"Batched data element_spec: {batched_train.element_spec}")

# Method 3: Take one batch and get shape of single example
for batch_images, batch_labels in batched_train.take(1):
    print(f"Batch shape: {batch_images.shape}")
    print(f"Single example shape (remove batch dim): {batch_images.shape[1:]}")
    
    # This is what you use for your neural network input layer:
    input_shape = batch_images.shape[1:]
    print(f"Input shape for NN: {input_shape}")
    break

In [None]:
# Practical example: Using the shape in a neural network
import tensorflow as tf
from tensorflow import keras

# Get the input shape for a single example (without batch dimension)
input_shape = train.element_spec[0].shape  # This gives (32, 32, 3)

print(f"Input shape for model: {input_shape}")

# Example neural network using this shape
model = keras.Sequential([
    keras.layers.Input(shape=input_shape),  # (32, 32, 3) - no batch dimension!
    keras.layers.Conv2D(32, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation='softmax')
])

print("Model input shape:", model.input_shape)  # This will show (None, 32, 32, 3)
# The None is for the batch dimension - Keras adds it automatically

## Train, validation, test 

To obtain train, validation, and test sets, we can carve non-overlapping slices out of the training split.

In [None]:

train = tfds.load('cifar10', split='train[:20%]', shuffle_files=True, as_supervised=True, download=False)
val = tfds.load('cifar10', split='train[20%:30%]', shuffle_files=True, as_supervised=True, download=False)
test = tfds.load('cifar10', split='train[30%:40%]', shuffle_files=True, as_supervised=True, download=False)
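
The arithmetic behind these percentage slices can be sketched as follows. The `percent_slice` helper is hypothetical (it is not part of `tensorflow_datasets`) and assumes the 50,000 examples of the CIFAR-10 training split. It shows which index ranges the three slices are intended to cover and that they do not overlap.

```python
NUM_TRAIN = 50_000  # size of the CIFAR-10 training split

def percent_slice(start_pct, stop_pct, total=NUM_TRAIN):
    """Return the (start, stop) example indices covered by a percent slice."""
    return (total * start_pct // 100, total * stop_pct // 100)

train_range = percent_slice(0, 20)    # first 20% of the training split
val_range = percent_slice(20, 30)     # next 10%
test_range = percent_slice(30, 40)    # the 10% after that

for name, (start, stop) in [("train", train_range), ("val", val_range), ("test", test_range)]:
    print(f"{name}: examples {start} to {stop - 1} ({stop - start} examples)")
```

TFDS applies its own rounding rules when a percentage does not land on a whole example. For these values the boundaries are exact, so plain integer division is enough for the illustration.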


## Alternative

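Finally, if you don't want to use `tensorflow_datasets`, you can load the data as NumPy arrays with `tf.keras.datasets` and wrap them in a `tf.data.Dataset` yourself.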

In [1]:
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()

train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(50000)
    .take(5000)          # keep 5000 examples (a random subset, since shuffle comes first)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)
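
One subtlety in a pipeline like this is the order of `.shuffle()` and `.take()`. A minimal sketch on a toy `tf.data.Dataset.range(10)` (a stand-in, not the CIFAR-10 data) illustrates the difference: shuffling first draws a random subset, while taking first always keeps the same leading elements and only reorders them.

```python
import tensorflow as tf

toy = tf.data.Dataset.range(10)

# shuffle -> take: 3 elements drawn from anywhere in the dataset.
shuffle_then_take = [int(x.numpy()) for x in toy.shuffle(10, seed=0).take(3)]

# take -> shuffle: always the elements 0, 1, 2, in a shuffled order.
take_then_shuffle = [int(x.numpy()) for x in toy.take(3).shuffle(3, seed=0)]

print("shuffle then take:", shuffle_then_take)
print("take then shuffle:", take_then_shuffle)
```

Because `shuffle` reshuffles on every pass by default (`reshuffle_each_iteration=True`), the shuffle-then-take form can also produce a different subset in each epoch. If a fixed subset is wanted, taking before shuffling is the safer order.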

## Remarks

**`tf.data.Dataset.from_tensor_slices()` is for converting in-memory arrays to `tf.data.Dataset` objects.**

## Common scenarios:

### 1. **Converting NumPy arrays**


In [None]:
# You have NumPy arrays
X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([0, 1, 0])

# Convert to tf.data.Dataset for batching/transformations
dataset = tf.data.Dataset.from_tensor_slices((X, y))



### 2. **Converting Keras datasets**


In [None]:
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()  # NumPy arrays
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))  # Convert to Dataset



### 3. **Converting pandas DataFrames**


In [None]:
import pandas as pd

df = pd.read_csv('data.csv')  # 'data.csv' is a placeholder file name
features = df[['feature1', 'feature2']].values
labels = df['target'].values

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
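
Since `data.csv` is only a placeholder, here is a self-contained variant that builds a small DataFrame in memory instead. The column names and values are made up for illustration.

```python
import pandas as pd
import tensorflow as tf

# A tiny in-memory table standing in for the contents of a CSV file.
df = pd.DataFrame({
    "feature1": [0.1, 0.2, 0.3, 0.4],
    "feature2": [1.0, 2.0, 3.0, 4.0],
    "target": [0, 1, 0, 1],
})

# Extract NumPy arrays with explicit dtypes before wrapping them.
features = df[["feature1", "feature2"]].to_numpy(dtype="float32")
labels = df["target"].to_numpy(dtype="int64")

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
print(dataset.element_spec)
```

Setting the dtypes explicitly is a design choice here: pandas defaults to `float64`, while most Keras layers work in `float32`.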




## **Why use `from_tensor_slices()`?**

✅ **Benefits of converting to `tf.data.Dataset`:**
- **Batching**: `.batch(32)`
- **Shuffling**: `.shuffle(1000)` 
- **Transformations**: `.map(preprocess_function)`
- **Prefetching**: `.prefetch(tf.data.AUTOTUNE)`
- **Performance optimizations**: Parallel processing, memory efficiency
- **Consistent API**: Same interface as other TensorFlow datasets

## **Summary of the workflow:**
1. **Start with**: In-memory arrays (NumPy, lists, tensors)
2. **Convert using**: `tf.data.Dataset.from_tensor_slices()`
3. **Apply transformations**: `.batch()`, `.shuffle()`, `.map()`, etc.
4. **Use in training**: `model.fit(dataset)`
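
The four steps can be sketched end to end as below. The data is synthetic and the tiny model is an arbitrary choice for illustration, so the sketch shows only the mechanics of the workflow, not a meaningful training result.

```python
import numpy as np
import tensorflow as tf

# 1. Start with in-memory arrays (synthetic, for illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(256, 4)).astype("float32")
y = (X.sum(axis=1) > 0).astype("float32")  # a simple binary target

# 2. Convert to a tf.data.Dataset.
dataset = tf.data.Dataset.from_tensor_slices((X, y))

# 3. Apply transformations.
dataset = dataset.shuffle(256).batch(32).prefetch(tf.data.AUTOTUNE)

# 4. Use in training. The dataset is already batched,
#    so no batch_size argument is passed to fit().
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(dataset, epochs=2, verbose=0)
print(sorted(history.history.keys()))
```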

`from_tensor_slices()` is the bridge between traditional array-based data and TensorFlow's `tf.data` pipeline system.




In [None]:
# Demonstrating tf.data.Dataset.from_tensor_slices() usage

# Example 1: Simple NumPy arrays
print("Example 1: Simple arrays")
X_simple = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y_simple = np.array([0, 1, 0, 1])

simple_ds = tf.data.Dataset.from_tensor_slices((X_simple, y_simple))
print(f"Created dataset from arrays: {simple_ds}")

# Now we can batch, shuffle, etc.
batched_simple = simple_ds.batch(2)
for batch_x, batch_y in batched_simple:
    print(f"Batch X: {batch_x.numpy()}")
    print(f"Batch y: {batch_y.numpy()}")
    break

print("\nExample 2: CIFAR-10 arrays")
print("(x_train, y_train) are NumPy arrays → convert to tf.data.Dataset → apply transformations")

In [None]:
# Memory comparison: 

# Method 1: Loads full dataset into memory
print("Method 1: tf.keras.datasets (loads all into memory)")
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
print(f"x_train memory usage: {x_train.nbytes / (1024**2):.1f} MB")
print(f"y_train memory usage: {y_train.nbytes / (1024**2):.1f} MB")

# Method 2: Reading from disk (memory efficient)
# Note: TFDS may cache small data sets in memory after the first full pass.
print("\nMethod 2: tensorflow_datasets (reads from disk)")
import tensorflow_datasets as tfds
train_stream = tfds.load('cifar10', split='train[:10%]', as_supervised=True, download=False)
train_batched = train_stream.batch(64).prefetch(tf.data.AUTOTUNE)

print("With tfds: Data is read batch by batch from disk during training")
print("Memory usage: Roughly the current batch (64 images) plus the prefetch buffer")

## Clarification: Different data loading methods

**Important distinction between data loading approaches:**

### 1. `tensorflow_datasets` (tfds)
- **Pre-curated datasets**: CIFAR-10, ImageNet, MNIST, etc.
- **Downloads from the internet**: Fetched on first use and stored locally, not read from your own folders
- **Standard format**: All datasets follow the same structure
- **Example**: `tfds.load('cifar10')` downloads CIFAR-10 on first use and reuses the local copy afterwards

### 2. `image_dataset_from_directory` (Keras utility)  
- **Your custom images**: Reads from YOUR folder structure
- **Expects class folders**: Each subfolder = one class
- **Local files**: Works with images on your computer
- **Example**: Reading cats_vs_dogs from your local folders

### 3. `tf.keras.datasets`
- **Built-in datasets**: Limited selection (CIFAR-10, MNIST, etc.)
- **Loads into memory**: Downloads then loads all into RAM
- **NumPy arrays**: Returns regular arrays, not tf.data.Dataset

All three can be turned into `tf.data.Dataset` objects, but they get their data from different sources.

In [None]:
# Example comparison of the three approaches:

# 1. tensorflow_datasets - curated data sets (download=False reuses an existing local copy)
import tensorflow_datasets as tfds
tfds_data = tfds.load('cifar10', split='train[:1%]', as_supervised=True, download=False)
print("1. TFDS: Curated standard datasets, downloaded once and stored locally")

# 2. Keras datasets - built-in datasets loaded into memory  
import tensorflow as tf
(x, y), _ = tf.keras.datasets.cifar10.load_data()
keras_data = tf.data.Dataset.from_tensor_slices((x[:100], y[:100]))
print("2. Keras datasets: Built-in datasets loaded into memory")

# 3. image_dataset_from_directory - YOUR folders (hypothetical example)
# from tensorflow.keras.utils import image_dataset_from_directory
# custom_data = image_dataset_from_directory(
#     "my_photos/",  # YOUR folder with subfolders like cats/, dogs/
#     image_size=(224, 224),
#     batch_size=32
# )
print("3. image_dataset_from_directory: Reads YOUR custom folder structure")

print("\nAll three can create tf.data.Dataset objects, but from different sources!")

The `tf.keras.datasets` code shown earlier does **both**: it loads the full dataset into memory first, then creates a batched processing pipeline.

## What happens step by step:

1. **`tf.keras.datasets.cifar10.load_data()`** 
   - ✅ **Loads ENTIRE dataset into memory** as NumPy arrays
   - `x_train` shape: `(50000, 32, 32, 3)` - all 50,000 images in RAM
   - `y_train` shape: `(50000, 1)` - all labels in RAM

2. **`tf.data.Dataset.from_tensor_slices()`**
   - Creates a dataset from the in-memory arrays
   - The data is still in memory, but now wrapped in a `tf.data.Dataset`

3. **`.batch(64).prefetch()`**
   - ✅ **Processes in batches of 64** during training
   - Only loads one batch at a time to the GPU/model
   - `prefetch` prepares the next batch while current batch is processing

## Memory vs Processing:

- **Memory**: Full dataset loaded into RAM upfront
- **Processing**: Batched processing during training

## For truly memory-efficient loading:

If you want to avoid loading everything into memory, use `tensorflow_datasets` or `image_dataset_from_directory`.


## Summary:

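- **`tf.keras.datasets` approach**: Loads about 150 MB of CIFAR-10 training images (50,000 x 32 x 32 x 3 `uint8` values) into RAM, then processes them in batches
- **Alternative (tfds)**: Reads data from disk, keeping roughly one batch plus a prefetch buffer in memory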

For CIFAR-10 (small dataset), loading into memory is fine. For larger datasets (ImageNet, custom datasets), streaming approaches are essential to avoid running out of memory.

The key difference is **where** the data lives before training, not **how** it's fed to the model during training.
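
A quick back-of-the-envelope calculation puts numbers on the memory side. The sketch below uses only the array shape and dtype sizes (no data is loaded), and it ignores the labels and any framework overhead.

```python
NUM_IMAGES = 50_000              # CIFAR-10 training images
BYTES_PER_IMAGE = 32 * 32 * 3    # one byte per uint8 pixel value

def mib(num_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024**2 bytes)."""
    return num_bytes / 1024**2

full_uint8 = NUM_IMAGES * BYTES_PER_IMAGE
full_float32 = full_uint8 * 4            # float32 uses 4 bytes per value
one_batch_uint8 = 64 * BYTES_PER_IMAGE   # a single batch of 64 images

print(f"All training images, uint8:   {mib(full_uint8):.1f} MiB")      # 146.5 MiB
print(f"All training images, float32: {mib(full_float32):.1f} MiB")    # 585.9 MiB
print(f"One batch of 64, uint8:       {one_batch_uint8 / 1024:.1f} KiB")  # 192.0 KiB
```

The gap between the first two lines is worth keeping in mind: casting the whole array to `float32` up front quadruples its footprint, whereas casting inside a `.map()` step only affects the batches currently in flight.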




## 1. **`image_dataset_from_directory` is a Keras utility**
- It's part of `tensorflow.keras.utils`
- Import it with `from tensorflow.keras.utils import image_dataset_from_directory`

## 2. **It returns `tf.data.Dataset` objects**
- The returned objects (for example, ones assigned to `train_dataset`, `validation_dataset`, and `test_dataset`) are `tf.data.Dataset` objects
- They have the same API and methods as datasets created with `tfds.load()` or `tf.data.Dataset.from_tensor_slices()`
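
One way to check this is sketched below. Because no image folder ships with this notebook, the cell first writes a few random PNG files into a temporary directory. The `cats` and `dogs` class names, the image size, and the file names are all made up for the demonstration.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Build a throwaway folder with one subfolder per class (names are made up).
root = tempfile.mkdtemp()
rng = np.random.default_rng(0)
for class_name in ["cats", "dogs"]:
    class_dir = os.path.join(root, class_name)
    os.makedirs(class_dir)
    for i in range(4):
        pixels = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
        tf.io.write_file(os.path.join(class_dir, f"img_{i}.png"),
                         tf.io.encode_png(pixels))

# Read the folder back; class labels are inferred from the subfolder names.
folder_ds = tf.keras.utils.image_dataset_from_directory(
    root, image_size=(32, 32), batch_size=4, shuffle=False
)

print(isinstance(folder_ds, tf.data.Dataset))
print(folder_ds.class_names)
print(folder_ds.element_spec)
```

Unlike `tfds.load()`, this utility returns a dataset that is already batched, so the leading dimension in `element_spec` is the batch dimension.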


## Key Points:

**Same ecosystem, different utilities:**
- `tensorflow_datasets` (tfds): For accessing curated datasets
- `keras.utils.image_dataset_from_directory`: For loading your own image folders
- `tf.data.Dataset.from_tensor_slices()`: For creating datasets from arrays

**All return `tf.data.Dataset` objects**, so you can:
- Apply the same transformations (`.batch()`, `.map()`, `.shuffle()`, etc.)
- Use them interchangeably in `model.fit()`
- Chain operations in the same way

**Why this matters:**
- Consistent API across different data loading methods
- Same performance optimizations (prefetching, parallel processing)
- Same memory management patterns
- Can mix and match approaches in the same project

The Keras utility is essentially a convenience wrapper that creates `tf.data.Dataset` objects from folder structures, making it easy to work with your own image collections!
