# Hugging Face Datasets — End‑to‑End Tutorial

*A hands‑on, topic‑by‑topic guide with explanations before each code block.*

> **Last updated:** 2025-09-17 09:11 UTC
> 
> **What you’ll learn**
> - Installing and setting up `datasets`
> - Loading, inspecting, and transforming datasets
> - Efficient preprocessing with `map`, filtering, and splitting
> - Streaming large datasets
> - Working with text, images, and audio
> - Interoperability with pandas, NumPy, PyTorch, and TensorFlow
> - Saving, exporting, and versioning data
> - Pushing datasets to the Hub and writing Dataset Cards
> - Performance tips, caching, and troubleshooting


## 0) Prerequisites & Environment

This notebook is designed to be **read top‑to‑bottom**, with each topic introduced by a markdown explanation followed by runnable code.
You’ll need:

- Python 3.9+ (recent `datasets` releases no longer support 3.8)
- Recent versions of:
  - `datasets`
  - `pandas` (optional, for tabular interop)
  - `numpy`
  - `torch` or `tensorflow` (optional, for model training examples)
  - `transformers` (optional, for tokenization examples)

> **Tip:** If you’re running in a restricted environment (e.g., no internet), you can still read through the code cells. Execute them later in a connected runtime.


In [None]:
# (Optional) Install or update libraries in your environment
# If you're in an offline environment, skip this. Otherwise, uncomment to run.
# %pip install -U datasets pandas numpy transformers torch torchvision torchaudio tensorflow pillow soundfile pyarrow fastparquet

## 1) Quickstart: Load and Peek at a Dataset

`datasets.load_dataset` fetches datasets from the Hugging Face Hub (or from local files). It returns a **DatasetDict** holding every split when no `split` is given, or a single **Dataset** when you pass `split=...`.

Below we load the classic IMDB reviews dataset and take a quick look.

> **Key ideas**
> - Use `load_dataset("<namespace>/<name>")` or just `load_dataset("<name>")`.
> - Access splits like `dataset["train"]` and `dataset["test"]`.
> - Use `.features`, `.column_names`, `.num_rows`, and slicing to inspect.


In [None]:
from datasets import load_dataset

# Load a small text dataset (will download on first run and cache locally)
imdb = load_dataset("imdb")

# Peek
imdb, imdb["train"][0], imdb["train"].features, imdb["train"].column_names, imdb["train"].num_rows

## 2) Inspecting Structure & Metadata

Understanding a dataset’s schema is crucial. `features` defines column types; `DatasetInfo` contains license, citation, description, and more.

> **Pro tips**
> - `dataset.info.description` may mirror the Dataset Card, but it can be empty for some Hub datasets, so treat it as optional.
> - Use `dataset.unique("label")` or `value_counts()` via pandas for quick EDA.


In [None]:
info = imdb["train"].info
print("Description (truncated):\n", (info.description or "")[:500], "...")
print("\nLicense:", info.license)
print("\nFeatures:", imdb["train"].features)

## 3) Basic Operations: Select, Filter, and Map

`datasets` uses **Apache Arrow** under the hood for efficient, memory‑aware ops.

- **`select`**: take rows by index (or slice) — great for subsampling.
- **`filter`**: keep rows matching a predicate.
- **`map`**: transform rows; supports batched processing and multiprocessing.

> **Immutability:** Operations return **new** datasets; originals remain intact.


In [None]:
from datasets import Dataset

train = imdb["train"]

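# Select a subset. Shuffle first: taking the leading rows of a label-ordered
# split such as IMDB would otherwise yield examples from a single class.
small_train = train.shuffle(seed=42).select(range(2000))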

# Filter by length
def is_long(example):
    return len(example["text"].split()) > 50

long_reviews = small_train.filter(is_long)

# Map: add a simple feature (word count)
def add_word_count(example):
    example["word_count"] = len(example["text"].split())
    return example

with_wc = long_reviews.map(add_word_count)

with_wc[0], with_wc.features

## 4) Train/Validation/Test Splits

If a dataset doesn’t provide a validation split, you can create one deterministically.

> **Determinism:** Set a `seed` for reproducible splits.


In [None]:
from datasets import DatasetDict

# Create a validation split from the training set
split = small_train.train_test_split(test_size=0.2, seed=42)
split = DatasetDict({
    "train": split["train"],
    "validation": split["test"],
    # Smaller test set for demos; shuffled so both classes are present
    "test": imdb["test"].shuffle(seed=42).select(range(2000))
})

{key: ds.num_rows for key, ds in split.items()}

## 5) Text Preprocessing with 🤗 Transformers Tokenizers

Use `map` with a tokenizer for fast, batched tokenization. Keep columns compact to save memory.

> **Tips**
> - Use `batched=True` and set `num_proc` for parallelism (where supported).
> - Use `remove_columns` to drop raw text once tokenized (if you don’t need it).

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_batch(batch):
    return tokenizer(batch["text"], truncation=True, padding=False)

tokenized = split.map(tokenize_batch, batched=True, remove_columns=["text"])
tokenized

## 6) Building PyTorch Dataloaders (Optional)

Create `DataLoader`s directly from a `Dataset` using a **data collator** to pad dynamically.

> **Why dynamic padding?** Saves compute by padding only to the longest sequence in the batch.


In [None]:
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

train_loader = DataLoader(tokenized["train"], batch_size=8, shuffle=True, collate_fn=data_collator)
batch = next(iter(train_loader))
{ k: v.shape for k, v in batch.items() }
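
For intuition, dynamic padding amounts to the following plain-Python sketch. `pad_batch` is a hypothetical helper written for this tutorial, not the implementation of `DataCollatorWithPadding`, and `pad_id=0` is an assumed padding token id.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short_batch = pad_batch([[5, 6], [7]])
long_batch = pad_batch([[5, 6, 7, 8, 9], [1, 2]])

# Each batch is only as wide as its own longest sequence.
print(len(short_batch["input_ids"][0]), len(long_batch["input_ids"][0]))
```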

## 7) Interoperability with pandas & NumPy

Convert to pandas when you need rich tabular EDA or plotting. Convert back if needed.

> **Note:** Conversions materialize data in memory.


In [None]:
import pandas as pd

df = with_wc.to_pandas()
df.head(), df["word_count"].describe()

## 8) Saving, Loading, and Exporting

Persist datasets to disk to avoid repeated downloads or heavy preprocessing.

- `save_to_disk` / `load_from_disk` for Arrow‑backed format
- `to_parquet` / `from_parquet` and `to_csv` for interop


In [None]:
from datasets import load_from_disk

with_wc.save_to_disk("/mnt/data/imdb-with-wc")
reloaded = load_from_disk("/mnt/data/imdb-with-wc")

# Export a small sample to parquet and CSV
sample = reloaded.select(range(100))
sample.to_parquet("/mnt/data/imdb-sample.parquet")
sample.to_csv("/mnt/data/imdb-sample.csv")
print("Saved to /mnt/data/ (parquet & csv).")

## 9) Streaming Large Datasets

For huge datasets, avoid downloading everything: use **streaming** to iterate examples lazily.

> **Caveats:** Not all operations are supported in streaming mode (e.g., random access).


In [None]:
# The C4 corpus lives under the "allenai" namespace on the Hub
streamed = load_dataset("allenai/c4", "en", split="train", streaming=True)
it = iter(streamed)
for _ in range(2):
    ex = next(it)
    print({k: (str(v)[:80] + '...') for k, v in ex.items()})

## 10) Working with Image Datasets

`datasets` supports image columns via **PIL** and can apply transforms with `set_transform` or inside `map`.

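> **Example:** Load CIFAR‑10 and normalize a small sample with `map`. Note that `map` runs eagerly and stores its output; use `set_transform` when you want the transform applied lazily on access.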


In [None]:
from datasets import load_dataset
from PIL import Image
import numpy as np

cifar10 = load_dataset("cifar10")

mean = [0.4914, 0.4822, 0.4465]
std  = [0.2470, 0.2435, 0.2616]

def normalize_images(batch):
    imgs = np.stack([np.array(img) for img in batch["img"]], axis=0).astype("float32") / 255.0
    imgs = (imgs - mean) / std
    batch["img_array"] = imgs
    return batch

cifar10_small = cifar10["train"].select(range(64)).with_format("numpy")
cifar10_small = cifar10_small.map(normalize_images, batched=True, batch_size=32)
cifar10_small[0].keys()

## 11) Working with Audio Datasets

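Audio columns are decoded on access by an audio backend (**soundfile**, with **librosa** for resampling, in many `datasets` releases; the exact backend depends on the installed version). Resampling and feature extraction can be done with `cast_column` and `map`.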

> **Example:** Load a speech dataset and resample to 16 kHz.


In [None]:
speech = load_dataset("PolyAI/minds14", "en-US")

from datasets import Audio
speech = speech.cast_column("audio", Audio(sampling_rate=16_000))

sample = speech["train"][0]["audio"]
sample["sampling_rate"], type(sample["array"]), sample["array"].shape

## 12) Metrics with `evaluate` (Optional)

Use 🤗 `evaluate` for standard metrics (accuracy, F1, BLEU, WER, etc.).


In [None]:
# %pip install -U evaluate
import evaluate

accuracy = evaluate.load("accuracy")
preds = [0, 1, 1, 0]
refs  = [0, 1, 0, 0]
accuracy.compute(predictions=preds, references=refs)
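
To see what the metric computes, here is a plain-Python sketch of accuracy on the same predictions and references. `simple_accuracy` is a hypothetical helper for illustration, not part of `evaluate`.

```python
def simple_accuracy(predictions, references):
    """Fraction of positions where prediction equals reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

preds = [0, 1, 1, 0]
refs = [0, 1, 0, 0]
score = simple_accuracy(preds, refs)
print(score)  # 3 of 4 positions agree
```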

## 13) Versioning & Sharing: Push Datasets to the Hub

You can push processed datasets to your (private or public) Hugging Face repo.

> **Steps**
> 1. Login with `huggingface-cli login` or `huggingface_hub.login()`
> 2. Give your dataset a repo name like `username/my-imdb-processed`
> 3. Call `push_to_hub()`

> **Privacy:** Use private repos for sensitive data and include a clear license.


In [None]:
# from huggingface_hub import login
# login()  # Follow prompts or pass a token string

# Reuse the 'reloaded' dataset from section 8
# reloaded.push_to_hub("username/my-imdb-processed")

## 14) Writing a Good Dataset Card

A **Dataset Card** documents motivation, composition, collection process, preprocessing, intended uses, and limitations.
Include:
- **Overview:** summary, source, size, splits
- **Licensing & Rights**
- **Ethical Considerations & Biases**
- **Usage:** tasks, benchmarks, metrics
- **Caveats:** known issues, annotation quality


In [None]:
dataset_card_template = """
# Dataset Card for <your-dataset-name>

## Dataset Summary
<1–3 sentences>

## Supported Tasks and Leaderboards
- Text Classification

## Languages
English

## Dataset Structure
- Splits: train/validation/test
- Features: text (string), label (class)

## Data Instances
```
{"text": "...", "label": 0}
```

## Data Fields
- text: string — the review text
- label: int — 0 = negative, 1 = positive

## Data Splits
| Split | #Examples |
|------:|----------:|
| train | 20000     |
| validation | 5000 |
| test  | 5000      |

## Licensing Information
MIT

## Citation
```
@inproceedings{...}
```

## Ethical Considerations
- Potential biases in user reviews.

## Limitations
- Binary sentiment only.
"""

print(dataset_card_template)
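
On the Hub, a Dataset Card is the repository's `README.md`, usually starting with a YAML metadata block. The sketch below writes a minimal card to a temporary folder. `render_card` is a hypothetical helper, and the metadata values (`mit`, `en`, `text-classification`) are example choices you should replace with your own.

```python
import os
import tempfile

def render_card(name, summary, license_id="mit"):
    """Build a minimal card: YAML front matter followed by Markdown."""
    front_matter = "\n".join(
        [
            "---",
            f"license: {license_id}",
            "language:",
            "- en",
            "task_categories:",
            "- text-classification",
            "---",
        ]
    )
    body = f"# Dataset Card for {name}\n\n## Dataset Summary\n{summary}\n"
    return front_matter + "\n\n" + body

card = render_card(
    "my-imdb-processed", "IMDB reviews with an added word_count column."
)

# Write the card where a dataset repository would expect it.
card_path = os.path.join(tempfile.mkdtemp(), "README.md")
with open(card_path, "w", encoding="utf-8") as handle:
    handle.write(card)

print(card)
```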

## 15) Caching, Memory, and Performance Tips

- **Cache**: The first `load_dataset` download is cached under `~/.cache/huggingface/datasets`. Subsequent loads reuse the cached files instead of downloading again.
- **Arrow format**: Columnar in‑memory format enables fast slicing and vectorized ops.
- Use `batched=True` (and optionally `num_proc`) in `map` for speedups.
- Use `with_format("torch")` / `"tensorflow"` / `"numpy"` to avoid extra conversions in training loops.
- Remove large raw columns after feature extraction to reduce memory.
- For **very large** data, use **streaming** and write out processed shards to disk.


In [None]:
# Example: set a target framework format to avoid per-batch conversion
tokenized_torch = tokenized.with_format("torch")
next(iter(DataLoader(tokenized_torch["train"], batch_size=8, collate_fn=data_collator))).keys()

## 16) Troubleshooting & Common Errors

- **`ConnectionError` / 403 / 404**: Check internet, dataset name, or private repo access.
- **`ArrowInvalid`**: Often due to mixed types in a column — clean or cast columns.
- **`Killed` / OOM**: Reduce batch sizes, remove columns, or stream data.
- **`transformers` tokenizer slow**: Use fast tokenizers (default), batched mapping, and multiprocessing where supported.
- **Windows path issues**: Use shorter cache paths or move cache via env var `HF_HOME`.


In [None]:
# Quick helper: inspect memory footprint of a dataset
def table_size_mb(ds):
    # Estimate: sum of column buffers (approximate)
    return round(ds.data.nbytes / (1024**2), 2)

size_mb = table_size_mb(tokenized["train"])
print(f"Approx Arrow payload size: {size_mb} MB")

## 17) Where to Go Next

- Explore the [Hugging Face Hub](https://huggingface.co/datasets) for thousands of datasets.
- Read the official docs for advanced features (streaming, sharding, pausing/resuming maps).
- Try adapting this workflow to your **own** dataset: load from local files with `load_dataset("csv", data_files=...)`, or use `"json"` or `"parquet"` as the first argument for those formats.


In [None]:
# Example: load from local files
# local = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})