# Data Preparation with Streaming

Training large language models requires massive datasets, and downloading terabytes of data to a local disk is often impractical. Instead, we can **stream** the data directly from the Hugging Face Hub.

In this notebook, we will:
1.  Stream a sample of the **FineWeb-Edu** dataset from Hugging Face.
2.  Tokenize the text on the fly.
3.  Create an `IterableDataset` for PyTorch training.
4.  Save a small sample for local debugging.

In [None]:
import torch
from torch.utils.data import IterableDataset, DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer # Or our custom tokenizer

# Use a popular tokenizer for demonstration (e.g., GPT-2)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 defines no padding token, so reuse the end-of-text token
tokenizer.pad_token = tokenizer.eos_token

## 1. Streaming the Dataset

We use the `datasets` library with `streaming=True`. This lets us iterate over the dataset without downloading it in full first: records are fetched lazily as we iterate.

We'll use the `sample-10BT` subset of `HuggingFaceFW/fineweb-edu`, a sample of roughly 10 billion tokens drawn from FineWeb-Edu, which is FineWeb filtered for educational content.

In [None]:
dataset_name = "HuggingFaceFW/fineweb-edu"
subset = "sample-10BT" # A smaller subset for demonstration

print(f"Streaming {dataset_name} ({subset})...")
dataset = load_dataset(dataset_name, name=subset, split="train", streaming=True)

# Peek at the first example
print(next(iter(dataset)))
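
To illustrate what lazy iteration buys us, here is a minimal sketch that uses a plain Python generator as a stand-in for the streamed dataset. `fake_stream` and its `fetched` list are hypothetical names for illustration, not part of the `datasets` API. `itertools.islice` pulls only the records requested, and the same call should work on the streamed `dataset` above, since it is an ordinary iterable.

```python
from itertools import islice

fetched = []  # indices of the records the stand-in stream has actually produced

def fake_stream(num_records=1_000_000):
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    for i in range(num_records):
        fetched.append(i)
        yield {"text": f"document {i}"}

# Take only the first three records; the rest are never generated.
first_three = list(islice(fake_stream(), 3))

print([record["text"] for record in first_three])
print("Records produced:", len(fetched))
```

Only three of the million possible records are ever created, which is the property that makes streaming a multi-terabyte corpus feasible.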

## 2. On-the-fly Tokenization

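We define a generator function that tokenizes each document, appends the tokens to a running buffer, and yields fixed-length chunks of `seq_len + 1` tokens. The extra token lets each chunk supply both the input sequence and the target sequence shifted by one position.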

In [None]:
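def tokenization_generator(dataset, tokenizer, seq_len=1024):
    """Yield 1-D tensors of seq_len + 1 token ids packed from streamed text."""
    buffer = []
    for sample in dataset:
        text = sample['text']
        tokens = tokenizer.encode(text)
        # Mark the document boundary so packed chunks do not run documents together
        tokens.append(tokenizer.eos_token_id)
        buffer.extend(tokens)

        # Yield chunks of seq_len + 1 (input + target)
        while len(buffer) >= seq_len + 1:
            yield torch.tensor(buffer[:seq_len + 1], dtype=torch.long)
            buffer = buffer[seq_len + 1:]
    # Tokens left in the buffer when the stream ends are dropped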

# Create an IterableDataset wrapper
class StreamedTextDataset(IterableDataset):
    def __init__(self, dataset, tokenizer, seq_len):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.seq_len = seq_len

    def __iter__(self):
        return tokenization_generator(self.dataset, self.tokenizer, self.seq_len)

streamed_dataset = StreamedTextDataset(dataset, tokenizer, seq_len=128)
dataloader = DataLoader(streamed_dataset, batch_size=4)

# Test the dataloader
batch = next(iter(dataloader))
# Expected: torch.Size([4, 129]), i.e. batch_size x (seq_len + 1)
print("Batch shape:", batch.shape)
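
The packing logic can be checked without a tokenizer or network access. The sketch below re-implements the same buffering with plain lists and made-up token ids (`pack_chunks` and the toy documents are illustrative, not part of the notebook's code), then splits one chunk into the input and target sequences that a next-token objective would typically use.

```python
def pack_chunks(token_lists, seq_len):
    """Concatenate token lists and yield chunks of seq_len + 1 ids."""
    buffer = []
    for tokens in token_lists:
        buffer.extend(tokens)
        while len(buffer) >= seq_len + 1:
            yield buffer[:seq_len + 1]
            buffer = buffer[seq_len + 1:]

# Three toy "documents" with made-up token ids (10 ids in total)
docs = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]

chunks = list(pack_chunks(docs, seq_len=3))
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]

# Shift by one position: the model sees inputs[i] and predicts targets[i]
inputs, targets = chunks[0][:-1], chunks[0][1:]
print(inputs, targets)  # [1, 2, 3] [2, 3, 4]
```

Note that chunks cross document boundaries, and the trailing ids `9` and `10` never fill a chunk, so they are dropped when the input ends.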

## 3. Saving a Local Sample

To debug other notebooks without internet access, or simply to iterate faster, it's useful to save a small sample locally. Here we keep the text of the first 100 documents.

In [None]:
import json

local_samples = []
for i, sample in enumerate(dataset):
    if i >= 100:
        break
    local_samples.append(sample['text'])

with open("local_sample.json", "w") as f:
    json.dump(local_samples, f)

print(f"Saved {len(local_samples)} samples to local_sample.json")
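
A sketch of how the saved file might be read back in another notebook. `load_local_samples` is a hypothetical helper, not something defined elsewhere in this notebook; it wraps each string in a dict with a `text` key so the result has the same record shape the tokenization code expects. The example writes a tiny file to a temporary directory first so that it runs on its own.

```python
import json
import os
import tempfile

def load_local_samples(path):
    """Read a JSON list of strings and wrap each one as {'text': ...}."""
    with open(path, "r", encoding="utf-8") as f:
        texts = json.load(f)
    return [{"text": text} for text in texts]

# Write a tiny stand-in file so the example is self-contained
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "local_sample.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(["first document", "second document"], f)

records = load_local_samples(path)
print(len(records), records[0]["text"])
```

Because each record has a `text` field, the returned list can be passed to `StreamedTextDataset` in place of the streamed dataset, for example `StreamedTextDataset(load_local_samples("local_sample.json"), tokenizer, seq_len=128)`.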