# Chapter 13: Loading and Preprocessing Data with TensorFlow

**Tujuan:** Menguasai `tf.data` API untuk memuat & memroses data secara efisien, termasuk:
- Membaca dari array, CSV, dan TFRecord  
- Transformasi (map, shuffle, batch, prefetch)  
- Preprocessing layers (`Normalization`, `CategoryEncoding`, `StringLookup`)  

---

## 1. The `tf.data.Dataset` API

Dengan `tf.data`, Anda dapat membangun pipeline data yang:
1. **Memuat** data (`from_tensor_slices`, `from_csv`, `TFRecordDataset`)  
2. **Transform**: `.map()`, `.filter()`  
3. **Shuffle** & **Batch**  
4. **Prefetch** untuk overlap I/O & komputasi

---

In [1]:
import tensorflow as tf
from tensorflow import data

# Contoh: dataset sederhana dari array NumPy
import numpy as np

X = np.random.rand(1000, 5).astype("float32")
y = (np.sum(X, axis=1) > 2.5).astype("int32")

ds = tf.data.Dataset.from_tensor_slices((X, y))
ds = ds.shuffle(buffer_size=1000).batch(32).prefetch(tf.data.AUTOTUNE)

# Iterasi singkat
for x_batch, y_batch in ds.take(1):
    print(x_batch.shape, y_batch.shape)

(32, 5) (32,)


## 2. Membaca CSV dengan `tf.data`

In [2]:
# Misal kita punya file CSV 'data.csv' dengan header:
# feature1,feature2,...,label
# 0.1,0.2,...,0

# Definisikan parsing function
def parse_csv(line):
    # DefaultTextLineDataset mengembalikan satu string per baris
    defaults = [0.0] * 5 + [0]    # 5 fitur float + 1 label int
    fields = tf.io.decode_csv(line, record_defaults=defaults)
    features = tf.stack(fields[:-1], axis=0)
    label    = fields[-1]
    return features, label

# Bangun dataset
csv_ds = tf.data.TextLineDataset("data.csv") \
              .skip(1) \
              .map(parse_csv) \
              .shuffle(1000) \
              .batch(32) \
              .prefetch(tf.data.AUTOTUNE)

## 3. TFRecord: Format Binary Efisien

### 3.1 Menulis TFRecord

In [3]:
# Fungsi bantu untuk serialisasi
def serialize_example(features, label):
    feature = {
        "features": tf.train.Feature(float_list=tf.train.FloatList(value=features)),
        "label":    tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
    }
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

# Tulis ke file
with tf.io.TFRecordWriter("data.tfrecord") as writer:
    for f, l in zip(X, y):
        writer.write(serialize_example(f.tolist(), int(l)))

## 3.2 Membaca TFRecord

In [4]:
raw_ds = tf.data.TFRecordDataset("data.tfrecord")

# Definisikan parsing spec
feature_spec = {
    "features": tf.io.FixedLenFeature([5], tf.float32),
    "label":    tf.io.FixedLenFeature([],   tf.int64),
}

def parse_tfrecord(example_proto):
    parsed = tf.io.parse_single_example(example_proto, feature_spec)
    return parsed["features"], parsed["label"]

tfrecord_ds = raw_ds.map(parse_tfrecord) \
                    .shuffle(1000) \
                    .batch(32) \
                    .prefetch(tf.data.AUTOTUNE)

## 4. Preprocessing Layers di Keras
Keras menyediakan layer preprocessing yang bisa disertakan dalam model:

1. Normalization → normalisasi `mean=0, std=1`

2. StringLookup & CategoryEncoding → mapping string → integer → one‑hot

3. Discretization, Hashing, TextVectorization

In [5]:
from tensorflow.keras import layers

# 4.1 Contoh Normalization
num_data = np.random.rand(1000,3).astype("float32")
norm_layer = layers.Normalization(axis=-1)
norm_layer.adapt(num_data)  # hitung mean & var

print("Mean:", norm_layer.mean.numpy())
print("Transformed:", norm_layer(num_data[:2]))

# 4.2 Contoh Category Encoding
raw_cat = np.array([["apple"], ["banana"], ["orange"], ["banana"]])
str_lookup = layers.StringLookup(output_mode="one_hot")
str_lookup.adapt(raw_cat)
print("Encoded:", str_lookup(raw_cat))

Mean: [[0.47925436 0.5087743  0.504174  ]]
Transformed: tf.Tensor(
[[ 1.3886446  -1.0806028   1.0751902 ]
 [-0.24987271  1.6817173   0.10524415]], shape=(2, 3), dtype=float32)
Encoded: tf.Tensor(
[[0 0 0 1]
 [0 1 0 0]
 [0 0 1 0]
 [0 1 0 0]], shape=(4, 4), dtype=int64)


## 5. Pipeline Lengkap dalam Model
Gabungkan `tf.data` + preprocessing layer dalam `tf.keras.Sequential`:

In [6]:
# Dataset dummy
(ds_X, ds_y), _ = tf.keras.datasets.boston_housing.load_data()
ds = tf.data.Dataset.from_tensor_slices((ds_X, ds_y)) \
                   .shuffle(512) \
                   .batch(32) \
                   .prefetch(tf.data.AUTOTUNE)

# Model dengan preprocessing
model = tf.keras.Sequential([
    layers.Normalization(input_shape=(ds_X.shape[1],)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1)
])

model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(ds, epochs=5)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/boston_housing.npz
[1m57026/57026[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step


  super().__init__(**kwargs)


Epoch 1/5
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 38ms/step - loss: 2067.8362 - mae: 39.7209
Epoch 2/5
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 581.4373 - mae: 17.9957
Epoch 3/5
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 199.2198 - mae: 10.8742
Epoch 4/5
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 117.4487 - mae: 8.3402
Epoch 5/5
[1m13/13[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 114.2525 - mae: 8.0627 


<keras.src.callbacks.history.History at 0x7dcefcb380d0>

# Ringkasan Chapter 13
1. `tf.data.Dataset` untuk pipeline data efisien (map, shuffle, batch, prefetch).

2. Bisa membaca array, CSV, TFRecord (format binary).

3. Keras punya preprocessing layers untuk normalisasi & encoding.

4. Gabungkan pipeline data & model dalam satu graph untuk performa maksimal.