
#Chapter 13: Loading and Preprocessing Data with TensorFlow
This chapter covers how to load and preprocess data efficiently with the TensorFlow Data API (tf.data). We will look at handling large datasets, building efficient preprocessing pipelines, and optimization techniques that improve training performance.

##1. TensorFlow Data API (tf.data)
tf.data is an API designed for building efficient, scalable data pipelines. It lets us work with datasets that are too large to load into memory all at once.

###Advantages of tf.data:

- Streaming data from disk for large datasets
- Parallel processing for preprocessing
- Prefetching to reduce I/O bottlenecks
- Seamless integration with the training loop
- Memory efficiency for very large datasets
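
As a rough illustration of how these pieces fit together, the sketch below chains parallel preprocessing, batching, and prefetching on a tiny in-memory array. The data and the `scale` function are made up for the example; a real pipeline would usually read from disk.

```python
import numpy as np
import tensorflow as tf

# Toy in-memory data standing in for a large on-disk dataset (assumption).
features = np.arange(20, dtype=np.float32).reshape(10, 2)
labels = np.arange(10, dtype=np.int64)

def scale(x, y):
    # Hypothetical preprocessing step: rescale the features.
    return x / 10.0, y

pipeline = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .map(scale, num_parallel_calls=tf.data.AUTOTUNE)  # parallel preprocessing
    .batch(4)                                         # group elements into batches
    .prefetch(tf.data.AUTOTUNE)                       # overlap loading with consumption
)

# Ten elements in batches of four give two full batches and one partial batch.
batch_shapes = [tuple(x.shape.as_list()) for x, _ in pipeline]
print(batch_shapes)
```

A dataset built this way can be passed directly to `model.fit`, which is where the integration with the training loop comes from.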

##2. Dataset Objects
Dataset is the core abstraction of tf.data. It represents a sequence of data elements.

###Types of Dataset:

- Dataset from tensors: created from data that is already in memory
- Dataset from a generator: backed by a Python generator
- Dataset from files: reads directly from files (CSV, TFRecord, etc.)
- Dataset from a directory: for image datasets
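
File-based and directory-based datasets often start from a list of matching file names. The sketch below shows one possible way to do this with `tf.data.Dataset.list_files` and `interleave`; the file names and contents are hypothetical and written to a temporary directory so the example is self-contained. It uses text files for simplicity, but the same file-pattern approach applies to image files.

```python
import os
import tempfile
import tensorflow as tf

# Write two small text files to a temporary directory (hypothetical data).
data_dir = tempfile.mkdtemp()
for name, lines in [("part_0.txt", ["a", "b"]), ("part_1.txt", ["c", "d"])]:
    with open(os.path.join(data_dir, name), "w") as f:
        f.write("\n".join(lines) + "\n")

# List the matching files, then read each one line by line.
files = tf.data.Dataset.list_files(os.path.join(data_dir, "part_*.txt"), shuffle=False)
lines_ds = files.interleave(
    lambda path: tf.data.TextLineDataset(path),
    cycle_length=2,  # read from two files at a time
)

# Interleaving mixes lines from both files, so sort before printing.
all_lines = sorted(line.numpy().decode("utf-8") for line in lines_ds)
print(all_lines)
```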

##3. Dataset Transformations
A dataset can be transformed with a range of built-in methods. Each method returns a new dataset, so calls can be chained.

###Common transformations:

- .map(): applies a function to every element
- .filter(): keeps only the elements that satisfy a condition
- .batch(): groups elements into batches
- .shuffle(): randomizes the order of elements
- .repeat(): repeats the dataset
- .take() and .skip(): take or skip a given number of elements
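
One common use of `.take()` and `.skip()` is a simple train/validation split. The following is a minimal sketch on a toy range dataset; the split size of 2 is arbitrary.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# First 2 elements for validation, the rest for training.
val_ds = dataset.take(2)
train_ds = dataset.skip(2)

val_items = [int(x.numpy()) for x in val_ds]
train_items = [int(x.numpy()) for x in train_ds]
print(val_items, train_items)
```

If the dataset is shuffled before splitting, passing `reshuffle_each_iteration=False` to `.shuffle()` keeps the two subsets from overlapping across epochs.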

##4. Data Preprocessing
Preprocessing is the step that turns raw data into a format that is ready for training.

###Common preprocessing techniques:

- Normalization and scaling
- Encoding categorical data
- Image preprocessing (resize, crop, flip)
- Text preprocessing (tokenization, embedding)
- Feature engineering
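
The first two techniques can be sketched with Keras preprocessing layers. This is one possible approach, not the only one. The numeric values and the education vocabulary below are invented for illustration.

```python
import numpy as np
import tensorflow as tf

# Made-up numeric features: age and income on very different scales.
numeric = np.array(
    [[25.0, 50000.0], [30.0, 75000.0], [22.0, 35000.0], [35.0, 90000.0]],
    dtype=np.float32,
)

# Learn per-feature mean and variance from the data, then standardize.
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(numeric)
standardized = normalizer(numeric).numpy()

# Map category strings to integer indices; index 0 is reserved for unknown values.
lookup = tf.keras.layers.StringLookup(
    vocabulary=["highschool", "bachelor", "master", "phd"]
)
education_ids = lookup(tf.constant(["bachelor", "phd", "diploma"])).numpy().tolist()

print(standardized.mean(axis=0))  # close to zero for each feature
print(education_ids)              # "diploma" is not in the vocabulary, so it maps to 0
```

The statistics should be computed on the training set only and then reused unchanged for validation and test data.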

##5. Handling Different Data Formats
TensorFlow supports several data formats, each with a dedicated parser.

###Supported formats:

- CSV files: with tf.data.experimental.CsvDataset
- TFRecord files: TensorFlow's binary format
- Image files: JPEG, PNG with tf.image
- Text files: with tf.data.TextLineDataset
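
For CSV files there is also a higher-level helper, `tf.data.experimental.make_csv_dataset`, which infers column types and returns batches of `(features_dict, label)`. The sketch below assumes a small hypothetical file written to a temporary directory.

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical CSV content, written to a temporary file for the example.
csv_path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(csv_path, "w") as f:
    f.write("age,education,purchased\n25,bachelor,1\n30,master,1\n22,highschool,0\n")

# make_csv_dataset infers column types and yields (features_dict, label) batches.
csv_ds = tf.data.experimental.make_csv_dataset(
    csv_path,
    batch_size=3,
    label_name="purchased",
    num_epochs=1,   # read the file once instead of repeating forever
    shuffle=False,  # keep file order so the output is predictable
)

features, label = next(iter(csv_ds))
ages = features["age"].numpy().tolist()
education = [e.decode("utf-8") for e in features["education"].numpy()]
purchased = label.numpy().tolist()
print(ages, education, purchased)
```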

##6. Performance Optimization
Performance optimization is key to efficient training, especially with large datasets.

###Optimization techniques:

- Prefetching: loads the next batch while the GPU is training on the current one
- Parallel processing: uses multiple cores for preprocessing
- Caching: stores preprocessing results in memory or on disk
- Vectorization: processes multiple samples at once
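
To see why prefetching helps, consider a deliberately idealized timing model: without prefetching, loading and training alternate and their costs add up, while with prefetching each step costs roughly the slower of the two. The per-batch times below are assumptions chosen for illustration, not measurements.

```python
def epoch_time_without_prefetch(n_batches, load_s, train_s):
    # Loading and training alternate, so their costs add up.
    return n_batches * (load_s + train_s)

def epoch_time_with_prefetch(n_batches, load_s, train_s):
    # Idealized overlap: after the first load, each step costs the slower of
    # the two stages, and the last batch still has to be trained on.
    return load_s + (n_batches - 1) * max(load_s, train_s) + train_s

# Assumed costs: 20 ms to load a batch, 50 ms to train on it.
n, load, train = 100, 0.02, 0.05
sequential = epoch_time_without_prefetch(n, load, train)
overlapped = epoch_time_with_prefetch(n, load, train)
print(f"sequential: {sequential:.2f}s, overlapped: {overlapped:.2f}s")
```

In this model the benefit disappears when loading is already negligible compared with training, which matches what small in-memory benchmarks tend to show.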

##7. TFRecord Format
TFRecord is an efficient binary file format for storing TensorFlow data.

###Advantages of TFRecord:

- Efficient storage and loading
- Supports compression
- Well suited to sequential access
- Built-in support for protocol buffers
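
As a small sketch of the compression support, the example below writes a few raw byte records with GZIP compression and reads them back. The payloads are placeholders; real records would normally be serialized `tf.train.Example` messages.

```python
import os
import tempfile
import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "records.tfrecord.gz")

# Write a few raw byte records with GZIP compression.
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(path, options=options) as writer:
    for payload in [b"first", b"second", b"third"]:
        writer.write(payload)

# The reader must be told which compression was used.
records = [r.numpy() for r in tf.data.TFRecordDataset(path, compression_type="GZIP")]
print(records)
```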

##8. Feature Engineering with tf.feature_column
tf.feature_column provides feature engineering tools that integrate with the model. Note that recent TensorFlow releases mark tf.feature_column as deprecated in favor of Keras preprocessing layers.

###Types of feature columns:

- Numeric columns
- Categorical columns
- Embedding columns
- Crossed columns for interaction features
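
The idea behind a crossed column can be sketched with plain string ops: join two categorical values and hash the pair into a fixed number of buckets. This is a stand-in technique, not a call into tf.feature_column itself, and the `city` column is hypothetical.

```python
import tensorflow as tf

education = tf.constant(["bachelor", "master", "bachelor"])
city = tf.constant(["jakarta", "bandung", "jakarta"])  # hypothetical second column

# Join the two categorical values and hash the pair into a fixed number of buckets.
num_buckets = 100
pairs = tf.strings.join([education, city], separator="_x_")
crossed_ids = tf.strings.to_hash_bucket_fast(pairs, num_buckets).numpy().tolist()
print(crossed_ids)
```

Identical pairs always land in the same bucket, while distinct pairs may collide. The number of buckets trades memory against collision rate.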

##9. Data Validation and Quality Checks
Checking data quality before training helps avoid problems during training.

###Validation checks:

- Missing values detection
- Data type consistency
- Range validation
- Distribution analysis
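
The checks above can be sketched with NumPy alone. The function name `validate_features` and the valid range of 0 to 1 are assumptions made for this example.

```python
import numpy as np

def validate_features(X, low, high):
    """Return a small report of basic quality checks for a 2-D float array."""
    X = np.asarray(X)
    return {
        "is_float": bool(np.issubdtype(X.dtype, np.floating)),  # data type consistency
        "n_missing": int(np.isnan(X).sum()),                    # missing values
        "n_out_of_range": int(((X < low) | (X > high)).sum()),  # range validation
        "mean": np.nanmean(X, axis=0),                          # distribution summary
        "std": np.nanstd(X, axis=0),
    }

# Toy data with one missing value and one value above the allowed range.
X = np.array([[0.1, 0.5], [np.nan, 0.7], [0.3, 1.8]])
report = validate_features(X, low=0.0, high=1.0)
print(report)
```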



#Code Implementation
##1. tf.data Dataset Basics

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd

# Create a dataset from a tensor
data = tf.constant([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
dataset = tf.data.Dataset.from_tensor_slices(data)

print("Dataset elements:")
for element in dataset:
    print(element.numpy())

# Create a dataset from NumPy arrays
X = np.random.randn(1000, 5)
y = np.random.randint(0, 2, 1000)

dataset = tf.data.Dataset.from_tensor_slices((X, y))
print(f"Dataset element spec: {dataset.element_spec}")

# Create a dataset from a generator
def data_generator():
    for i in range(100):
        yield (np.random.randn(5), np.random.randint(0, 2))

dataset = tf.data.Dataset.from_generator(
    data_generator,
    output_signature=(
        tf.TensorSpec(shape=(5,), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32)
    )
)

print("Generated dataset sample:")
for x, y in dataset.take(3):
    print(f"X shape: {x.shape}, y: {y.numpy()}")

Dataset elements:
1
2
3
4
5
6
7
8
9
10
Dataset element spec: (TensorSpec(shape=(5,), dtype=tf.float64, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))
Generated dataset sample:
X shape: (5,), y: 1
X shape: (5,), y: 1
X shape: (5,), y: 1


##2. Dataset Transformations

In [2]:
# Create an example dataset
dataset = tf.data.Dataset.range(100)

# Map transformation: apply a function to every element
squared_dataset = dataset.map(lambda x: x ** 2)

# Filter transformation: keep elements that satisfy a condition
even_dataset = dataset.filter(lambda x: x % 2 == 0)

# Batch transformation: group elements into batches
batched_dataset = dataset.batch(10)

# Shuffle transformation: randomize the element order
shuffled_dataset = dataset.shuffle(buffer_size=50)

# Chaining transformations
processed_dataset = (dataset
                    .filter(lambda x: x % 2 == 0)  # Keep even numbers
                    .map(lambda x: x ** 2)         # Square them
                    .shuffle(50)                   # Shuffle
                    .batch(5)                      # Batches of size 5
                    .repeat(2))                    # Repeat twice

print("Processed dataset batches:")
for batch in processed_dataset.take(3):
    print(f"Batch: {batch.numpy()}")

# More complex mapping over (features, labels) pairs
def preprocess_data(features, labels):
    # Normalize features
    features = tf.cast(features, tf.float32) / 255.0
    # One-hot encode labels
    labels = tf.one_hot(labels, depth=10)
    return features, labels

# Example with MNIST-like data
X_sample = np.random.randint(0, 256, (1000, 28, 28), dtype=np.uint8)
y_sample = np.random.randint(0, 10, 1000)

dataset = tf.data.Dataset.from_tensor_slices((X_sample, y_sample))
processed_dataset = dataset.map(preprocess_data)

print("Preprocessed sample:")
for features, labels in processed_dataset.take(1):
    print(f"Features shape: {features.shape}, dtype: {features.dtype}")
    print(f"Labels shape: {labels.shape}, sum: {tf.reduce_sum(labels).numpy()}")

Processed dataset batches:
Batch: [ 900 1600 8100 7056 2304]
Batch: [8836 1156 4900  484 2500]
Batch: [9216 3600 4356 1444 6724]
Preprocessed sample:
Features shape: (28, 28), dtype: <dtype: 'float32'>
Labels shape: (10,), sum: 1.0
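
The `buffer_size` argument of `.shuffle()` controls how far an element can move from its original position. The pure-Python imitation below is only a sketch of the bounded-buffer idea, not TensorFlow's actual implementation: it keeps at most `buffer_size` elements in memory and emits a random one each time a new element arrives.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Pure-Python imitation of a bounded shuffle buffer."""
    rng = random.Random(seed)
    buffer, output = [], []
    for item in items:
        if len(buffer) == buffer_size:
            # Buffer is full: emit one randomly chosen element to make room.
            output.append(buffer.pop(rng.randrange(buffer_size)))
        buffer.append(item)
    # Input exhausted: drain what is left in random order.
    while buffer:
        output.append(buffer.pop(rng.randrange(len(buffer))))
    return output

shuffled = buffered_shuffle(range(100), buffer_size=10)
print(shuffled[:10])
```

With a buffer of 10, the first emitted element can only come from the first 10 inputs. This is why a buffer much smaller than the dataset gives only a local shuffle, and why a buffer as large as the dataset is needed for a uniform one.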


##3. Loading Data from CSV Files

In [3]:
# Create an example CSV file
import tempfile
import os

# Create sample CSV data
csv_data = """age,income,education,purchased
25,50000,bachelor,1
30,75000,master,1
22,35000,highschool,0
28,60000,bachelor,1
35,90000,phd,1
24,40000,bachelor,0
"""

# Write to temporary file
temp_dir = tempfile.mkdtemp()
csv_file = os.path.join(temp_dir, 'sample_data.csv')
with open(csv_file, 'w') as f:
    f.write(csv_data)

# Define column names and defaults
column_names = ['age', 'income', 'education', 'purchased']
column_defaults = [0, 0, '', 0]

# Load CSV dataset
def parse_csv_line(line):
    fields = tf.io.decode_csv(line, column_defaults)
    features = {
        'age': fields[0],
        'income': fields[1],
        'education': fields[2]
    }
    label = fields[3]
    return features, label

# Create dataset from CSV
dataset = tf.data.TextLineDataset(csv_file)
dataset = dataset.skip(1)  # Skip header
dataset = dataset.map(parse_csv_line)

print("CSV dataset samples:")
for features, label in dataset:
    print(f"Features: {features}, Label: {label.numpy()}")

# Alternative: use tf.data.experimental.CsvDataset
csv_dataset = tf.data.experimental.CsvDataset(
    csv_file,
    record_defaults=column_defaults,
    header=True
)

print("\nAlternative CSV loading:")
for record in csv_dataset.take(2):
    print([field.numpy() for field in record])

CSV dataset samples:
Features: {'age': <tf.Tensor: shape=(), dtype=int32, numpy=25>, 'income': <tf.Tensor: shape=(), dtype=int32, numpy=50000>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'bachelor'>}, Label: 1
Features: {'age': <tf.Tensor: shape=(), dtype=int32, numpy=30>, 'income': <tf.Tensor: shape=(), dtype=int32, numpy=75000>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'master'>}, Label: 1
Features: {'age': <tf.Tensor: shape=(), dtype=int32, numpy=22>, 'income': <tf.Tensor: shape=(), dtype=int32, numpy=35000>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'highschool'>}, Label: 0
Features: {'age': <tf.Tensor: shape=(), dtype=int32, numpy=28>, 'income': <tf.Tensor: shape=(), dtype=int32, numpy=60000>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'bachelor'>}, Label: 1
Features: {'age': <tf.Tensor: shape=(), dtype=int32, numpy=35>, 'income': <tf.Tensor: shape=(), dtype=int32, numpy=90000>, 'education': <tf.Tensor: shape=(), dtype=string,

##4. Image Dataset Processing

In [4]:
# Create an example image dataset
import matplotlib.pyplot as plt

# Simulated image paths
image_paths = [f"image_{i}.jpg" for i in range(10)]
labels = np.random.randint(0, 3, 10)  # 3 classes

# Create a dataset from the paths and labels
path_dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels))

def load_and_preprocess_image(path, label):
    # In real use, load the file with tf.io.read_file and tf.image.decode_image
    # Here we simulate it with a random image
    image = tf.random.uniform([224, 224, 3], 0, 255, dtype=tf.float32)

    # Preprocessing steps
    image = tf.cast(image, tf.float32) / 255.0  # Normalize
    image = tf.image.resize(image, [224, 224])   # Resize

    # Data augmentation
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, 0.2)

    return image, label

# Apply preprocessing
image_dataset = path_dataset.map(
    load_and_preprocess_image,
    num_parallel_calls=tf.data.AUTOTUNE
)

print("Image dataset sample:")
for image, label in image_dataset.take(1):
    print(f"Image shape: {image.shape}, Label: {label.numpy()}")

# Batch and prefetch for training
train_dataset = (image_dataset
                .shuffle(1000)
                .batch(4)
                .prefetch(tf.data.AUTOTUNE))

print("Batched image dataset:")
for images, labels in train_dataset.take(1):
    print(f"Batch images shape: {images.shape}")
    print(f"Batch labels: {labels.numpy()}")

Image dataset sample:
Image shape: (224, 224, 3), Label: 2
Batched image dataset:
Batch images shape: (4, 224, 224, 3)
Batch labels: [0 0 2 2]
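
The cell above simulates images with random tensors. For completeness, here is a minimal sketch of reading a real file with `tf.io.read_file` and `tf.io.decode_png`. The PNG is generated on the fly in a temporary directory, and the target size of 24x24 is arbitrary.

```python
import os
import tempfile
import tensorflow as tf

# Create a small random PNG so the example has a real file to read (made-up data).
original = tf.cast(tf.random.uniform([32, 48, 3], 0, 256, dtype=tf.int32), tf.uint8)
image_path = os.path.join(tempfile.mkdtemp(), "sample.png")
tf.io.write_file(image_path, tf.io.encode_png(original))

def load_image(path, label):
    raw = tf.io.read_file(path)                 # read the encoded bytes
    image = tf.io.decode_png(raw, channels=3)   # decode to a uint8 tensor
    image = tf.image.resize(image, [24, 24])    # resize returns float32
    return image / 255.0, label                 # scale to [0, 1]

ds = tf.data.Dataset.from_tensor_slices(([image_path], [1])).map(load_image)
image, label = next(iter(ds))
print(image.shape, image.dtype, int(label.numpy()))

# PNG is lossless, so decoding the file gives back the original pixels.
decoded = tf.io.decode_png(tf.io.read_file(image_path), channels=3)
```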


##5. TFRecord Format

In [5]:
# Create and write a TFRecord file
def create_example(features, label):
    """Create a tf.Example message from features and a label."""
    feature = {
        'features': tf.train.Feature(
            float_list=tf.train.FloatList(value=features.flatten())
        ),
        'label': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])
        ),
        'feature_shape': tf.train.Feature(
            int64_list=tf.train.Int64List(value=features.shape)
        )
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Generate sample data
n_samples = 100
X_data = np.random.randn(n_samples, 10, 5)
y_data = np.random.randint(0, 3, n_samples)

# Write TFRecord file
tfrecord_file = os.path.join(temp_dir, 'sample.tfrecord')
with tf.io.TFRecordWriter(tfrecord_file) as writer:
    for i in range(n_samples):
        example = create_example(X_data[i], y_data[i])
        writer.write(example.SerializeToString())

print(f"TFRecord file created: {tfrecord_file}")

# Reading TFRecord
def parse_tfrecord(example_proto):
    """Parse TFRecord example."""
    feature_description = {
        'features': tf.io.FixedLenFeature([50], tf.float32),  # 10*5 = 50
        'label': tf.io.FixedLenFeature([], tf.int64),
        'feature_shape': tf.io.FixedLenFeature([2], tf.int64),
    }

    parsed_features = tf.io.parse_single_example(example_proto, feature_description)

    # Reshape features back to original shape
    features = tf.reshape(parsed_features['features'], [10, 5])
    label = parsed_features['label']

    return features, label

# Load TFRecord dataset
tfrecord_dataset = tf.data.TFRecordDataset(tfrecord_file)
parsed_dataset = tfrecord_dataset.map(parse_tfrecord)

print("TFRecord dataset samples:")
for features, label in parsed_dataset.take(2):
    print(f"Features shape: {features.shape}, Label: {label.numpy()}")

TFRecord file created: /tmp/tmpv_gt8v86/sample.tfrecord
TFRecord dataset samples:
Features shape: (10, 5), Label: 2
Features shape: (10, 5), Label: 1


##6. Performance Optimization

In [6]:
# Example dataset with several optimizations applied
def create_optimized_dataset(X, y, batch_size=32):
    """Create a dataset that combines several optimization techniques."""

    dataset = tf.data.Dataset.from_tensor_slices((X, y))

    # 1. Cache: store the elements so later epochs can reuse them
    dataset = dataset.cache()

    # 2. Shuffle with a sufficiently large buffer
    dataset = dataset.shuffle(buffer_size=len(X))

    # 3. Batch
    dataset = dataset.batch(batch_size)

    # 4. Prefetch: load the next batch while the current one is being trained on
    dataset = dataset.prefetch(tf.data.AUTOTUNE)

    return dataset

# Benchmark function for measuring performance
def benchmark_dataset(dataset, num_epochs=3):
    """Benchmark dataset performance."""
    import time

    start_time = time.time()

    for epoch in range(num_epochs):
        epoch_start = time.time()
        batch_count = 0

        for batch in dataset:
            # Simulate processing time
            tf.reduce_mean(batch[0])
            batch_count += 1

        epoch_time = time.time() - epoch_start
        print(f"Epoch {epoch + 1}: {epoch_time:.2f}s, {batch_count} batches")

    total_time = time.time() - start_time
    print(f"Total time: {total_time:.2f}s")
    return total_time

# Create sample data
X_large = np.random.randn(10000, 100)
y_large = np.random.randint(0, 10, 10000)

# Non-optimized dataset
basic_dataset = tf.data.Dataset.from_tensor_slices((X_large, y_large)).batch(32)

# Optimized dataset
optimized_dataset = create_optimized_dataset(X_large, y_large, batch_size=32)

print("Benchmarking basic dataset:")
basic_time = benchmark_dataset(basic_dataset, num_epochs=2)

print("\nBenchmarking optimized dataset:")
optimized_time = benchmark_dataset(optimized_dataset, num_epochs=2)

print(f"\nSpeedup: {basic_time/optimized_time:.2f}x")

Benchmarking basic dataset:
Epoch 1: 0.16s, 313 batches
Epoch 2: 0.17s, 313 batches
Total time: 0.33s

Benchmarking optimized dataset:
Epoch 1: 0.19s, 313 batches
Epoch 2: 0.16s, 313 batches
Total time: 0.36s

Speedup: 0.93x

Note: the "optimized" pipeline is slightly slower in this run (0.93x). The data already sits in memory and no expensive preprocessing is applied, so caching and prefetching have no latency to hide, while shuffling 10,000 elements adds overhead. These techniques pay off when loading or preprocessing is the bottleneck.


##7. Advanced Data Pipeline

In [7]:
class DataPipeline:
    """Advanced data pipeline with preprocessing and augmentation."""

    def __init__(self, batch_size=32, shuffle_buffer=1000):
        self.batch_size = batch_size
        self.shuffle_buffer = shuffle_buffer

    def preprocess_features(self, features):
        """Preprocess numerical features."""
        # Normalize features
        features = tf.cast(features, tf.float32)
        features = tf.nn.l2_normalize(features, axis=-1)

        # Add noise for regularization
        noise = tf.random.normal(tf.shape(features), stddev=0.01)
        features = features + noise

        return features

    def augment_data(self, features, labels):
        """Data augmentation."""
        # Random feature dropout
        dropout_mask = tf.random.uniform(tf.shape(features)) > 0.1
        features = tf.where(dropout_mask, features, tf.zeros_like(features))

        return features, labels

    def create_pipeline(self, X, y, is_training=True):
        """Create complete data pipeline."""
        dataset = tf.data.Dataset.from_tensor_slices((X, y))

        if is_training:
            # Shuffle for training
            dataset = dataset.shuffle(self.shuffle_buffer)

            # Data augmentation
            dataset = dataset.map(
                self.augment_data,
                num_parallel_calls=tf.data.AUTOTUNE
            )

        # Preprocessing
        dataset = dataset.map(
            lambda x, y: (self.preprocess_features(x), y),
            num_parallel_calls=tf.data.AUTOTUNE
        )

        # Batch
        dataset = dataset.batch(self.batch_size)

        if is_training:
            # Repeat for multiple epochs
            dataset = dataset.repeat()

        # Prefetch
        dataset = dataset.prefetch(tf.data.AUTOTUNE)

        return dataset

# Use the advanced pipeline
pipeline = DataPipeline(batch_size=64, shuffle_buffer=2000)

# Create sample data
X_train = np.random.randn(5000, 20)
y_train = np.random.randint(0, 5, 5000)
X_val = np.random.randn(1000, 20)
y_val = np.random.randint(0, 5, 1000)

# Create datasets
train_dataset = pipeline.create_pipeline(X_train, y_train, is_training=True)
val_dataset = pipeline.create_pipeline(X_val, y_val, is_training=False)

print("Advanced pipeline created successfully!")

# Test pipeline
print("Training dataset sample:")
for features, labels in train_dataset.take(1):
    print(f"Features shape: {features.shape}")
    print(f"Labels shape: {labels.shape}")
    print(f"Feature statistics - mean: {tf.reduce_mean(features):.4f}, std: {tf.math.reduce_std(features):.4f}")

print("\nValidation dataset sample:")
for features, labels in val_dataset.take(1):
    print(f"Features shape: {features.shape}")
    print(f"Labels shape: {labels.shape}")

Advanced pipeline created successfully!
Training dataset sample:
Features shape: (64, 20)
Labels shape: (64,)
Feature statistics - mean: 0.0073, std: 0.2236

Validation dataset sample:
Features shape: (64, 20)
Labels shape: (64,)


##8. Handling Text Data

In [8]:
# Text preprocessing pipeline
def create_text_pipeline(texts, labels, vocab_size=10000, max_length=100):
    """Create text preprocessing pipeline."""

    # Create dataset
    dataset = tf.data.Dataset.from_tensor_slices((texts, labels))

    # Text preprocessing function
    def preprocess_text(text, label):
        # Convert to lowercase
        text = tf.strings.lower(text)

        # Remove punctuation (simplified)
        text = tf.strings.regex_replace(text, r'[^\w\s]', '')

        # Split into tokens
        tokens = tf.strings.split(text)

        return tokens, label

    # Apply preprocessing
    dataset = dataset.map(preprocess_text)

    # Create vocabulary
    vocab_dataset = dataset.map(lambda tokens, label: tokens)
    vocab_dataset = vocab_dataset.flat_map(lambda tokens: tf.data.Dataset.from_tensor_slices(tokens))

    # Get unique tokens (simplified vocabulary creation)
    # In practice, use tf.keras.layers.TextVectorization to build the vocabulary

    def tokenize_and_pad(tokens, label):
        # Simple hash-based tokenization (for demo)
        token_ids = tf.strings.to_hash_bucket_fast(tokens, vocab_size)

        # Pad sequences
        padded = tf.pad(token_ids[:max_length], [[0, max_length - tf.size(token_ids[:max_length])]])
        padded = padded[:max_length]  # Ensure exact length

        return padded, label

    # Apply tokenization and padding
    dataset = dataset.map(
        tokenize_and_pad,
        num_parallel_calls=tf.data.AUTOTUNE
    )

    return dataset

# Sample text data
sample_texts = [
    "This is a positive review about the product",
    "Negative feedback about poor quality",
    "Excellent service and great experience",
    "Bad experience with customer support",
    "Amazing product, highly recommended"
]
sample_labels = [1, 0, 1, 0, 1]  # 1: positive, 0: negative

# Create text pipeline
text_dataset = create_text_pipeline(sample_texts, sample_labels)

print("Text preprocessing pipeline:")
for tokens, label in text_dataset:
    print(f"Tokenized text shape: {tokens.shape}, Label: {label.numpy()}")
    print(f"First few tokens: {tokens.numpy()[:10]}")
    break

# Batch and prepare for training
text_dataset = text_dataset.shuffle(100).batch(2).prefetch(tf.data.AUTOTUNE)

print("\nBatched text dataset:")
for batch_tokens, batch_labels in text_dataset.take(1):
    print(f"Batch tokens shape: {batch_tokens.shape}")
    print(f"Batch labels: {batch_labels.numpy()}")

Text preprocessing pipeline:
Tokenized text shape: (100,), Label: 1
First few tokens: [6019 8320 3939 2691 8932 2062 5354 5792    0    0]

Batched text dataset:
Batch tokens shape: (2, 100)
Batch labels: [1 0]
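
The hash-based tokenization above is a demo shortcut. One possible alternative is the Keras `TextVectorization` layer, which learns a vocabulary from the data and handles lowercasing, punctuation stripping, and padding. The sketch below reuses two of the sample sentences; `max_tokens` and the sequence length of 6 are arbitrary choices.

```python
import tensorflow as tf

texts = tf.constant([
    "This is a positive review about the product",
    "Bad experience with customer support",
])

# Learn a vocabulary from the texts; lowercasing and punctuation stripping
# are the layer's default standardization.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,
    output_sequence_length=6,  # pad or truncate every text to 6 token ids
)
vectorizer.adapt(texts)

token_ids = vectorizer(tf.constant(["Bad product"])).numpy()
vocab = vectorizer.get_vocabulary()
print(token_ids)
print(vocab[:2])  # index 0 is padding, index 1 is the unknown-token marker
```

Unlike hashing, a learned vocabulary avoids collisions between different words and can be inspected or saved alongside the model.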
