# Overview of the Hugging Face datasets Library
The Hugging Face **datasets** library provides a uniform, efficient, and lightweight way to access, process, and share datasets for NLP, as well as Computer Vision and Audio tasks.

## Key Classes

**DatasetDict**: A dictionary-like object that holds multiple `Dataset` objects, typically the different splits (e.g., `train`, `validation`, `test`). This is what `load_dataset` returns when no `split` argument is given.

**Dataset**: The main object, similar to a `pandas.DataFrame`, but optimized for ML workflows. It provides dictionary-style access to rows and columns.

**Features**: Defines the data types and structure of your dataset columns (e.g., `Value('string')`, `ClassLabel` for labels, or specialized features such as `Image` and `Audio`).

## How to Use the Hugging Face datasets Library

### Step 1: Installation

```shell
pip install datasets
```

### Step 2: Loading Data

#### A. Loading from the Hugging Face Hub (Public Datasets)

This is the most common use case. You simply provide the dataset identifier.

In [2]:
from datasets import load_dataset

# Load the entire dataset dictionary for the IMDB review classification dataset
# It will download and cache the data.
dataset_dict = load_dataset("imdb")

# Access the individual splits
train_dataset = dataset_dict["train"]
test_dataset = dataset_dict["test"]

print(f"Train split size: {len(train_dataset)}")
print(f"Test split size: {len(test_dataset)}")
print(train_dataset.features)


Train split size: 25000
Test split size: 25000
{'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}


#### B. Loading from Local Files (CSV, JSON, Text, etc.)

You specify the data format (e.g., `'csv'`, `'json'`) and provide the path to your file(s). When `data_files` is a single path, the rows are placed in a `train` split. Assume you have a local file named `my_data.csv`.

In [3]:
from datasets import load_dataset
import pandas as pd

# Create a dummy CSV file for the example
pd.DataFrame({
    'text': ["This is great.", "This is terrible.", "So-so."],
    'label': [1, 0, 1]
}).to_csv("my_data.csv", index=False)

# Load the local CSV file
local_dataset = load_dataset("csv", data_files="my_data.csv")

print(local_dataset)
print(local_dataset["train"][0])

Generating train split: 3 examples [00:00, 768.56 examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 3
    })
})
{'text': 'This is great.', 'label': 1}





##### How to Load `ClassLabel` from CSV

To load a string label column (e.g., `'positive'`, `'negative'`) from a local CSV as a dedicated **`ClassLabel`** feature, explicitly **cast** the column's type after the initial load. Without the cast, the column is inferred as a plain string `Value`.

In [4]:
from datasets import load_dataset, Features, ClassLabel
import pandas as pd
import os

# Create dummy data for the example
df = pd.DataFrame({
    'text': ["This is great!", "I'm not happy.", "Neutral feeling."],
    'sentiment': ["positive", "negative", "neutral"]
})
df.to_csv("sentiment_data.csv", index=False)

# 1. Initial Load: 'sentiment' loads as a simple string Value
local_data = load_dataset("csv", data_files="sentiment_data.csv")
raw_dataset = local_data["train"]

# 2. Define the Target Features
# IMPORTANT: The 'names' list defines the string-to-integer mapping (0, 1, 2)
new_features = raw_dataset.features.copy()
new_features["sentiment"] = ClassLabel(names=['negative', 'neutral', 'positive'])

# 3. Cast the Dataset
# This converts the string labels into their corresponding integer IDs (0, 1, or 2)
dataset_with_classlabel = raw_dataset.cast(new_features)

print("\n--- ClassLabel Conversion Complete ---")
print(f"Final Feature Type: {dataset_with_classlabel.features['sentiment']}")

# Demonstrating label decoding:
example_id = dataset_with_classlabel[1]['sentiment']
example_label_name = dataset_with_classlabel.features['sentiment'].int2str(example_id)

print(f"Example ID: {example_id}, Decoded Label: {example_label_name}")

# Clean up the dummy file
os.remove("sentiment_data.csv")

Generating train split: 3 examples [00:00, 886.81 examples/s]
Casting the dataset: 100%|██████████| 3/3 [00:00<00:00, 1165.30 examples/s]


--- ClassLabel Conversion Complete ---
Final Feature Type: ClassLabel(names=['negative', 'neutral', 'positive'])
Example ID: 0, Decoded Label: negative





##### Saving the Processed Dataset

After you have loaded your initial CSV and successfully cast the column to `ClassLabel`, you should save the resulting `Dataset` object.

**1. The Best Practice Code**

This process saves the data, the schema, and the `ClassLabel` mapping to a local directory.

In [5]:
from datasets import load_from_disk
import os

# Assume 'dataset_with_classlabel' is the Dataset object you created in the last step
# with the 'sentiment' column already cast to ClassLabel.

# Define the path where the dataset structure will be saved
SAVE_PATH = "./my_processed_sentiment_dataset"

# --- Save the Dataset ---
dataset_with_classlabel.save_to_disk(SAVE_PATH)

print(f"Dataset successfully saved to: {SAVE_PATH}")
print("This directory now contains the data and the feature schema.")

Saving the dataset (1/1 shards): 100%|██████████| 3/3 [00:00<00:00, 1045.79 examples/s]

Dataset successfully saved to: ./my_processed_sentiment_dataset
This directory now contains the data and the feature schema.





**2. Loading the Dataset with Features Intact**

To load the dataset later, use the `load_from_disk()` function, which reads the saved Arrow data back together with all the feature definitions.

In [6]:
# --- Load the Dataset ---
reloaded_dataset = load_from_disk(SAVE_PATH)

print("\n--- Reloaded Dataset Check ---")
print(f"Features upon reload: {reloaded_dataset.features['sentiment']}")
print(f"Example value: {reloaded_dataset[0]['sentiment']}")
print(f"Decoded label: {reloaded_dataset.features['sentiment'].int2str(reloaded_dataset[0]['sentiment'])}")



--- Reloaded Dataset Check ---
Features upon reload: ClassLabel(names=['negative', 'neutral', 'positive'])
Example value: 2
Decoded label: positive


**Why this is the best practice**

1. **Preserves Metadata:** When you save to disk using `save_to_disk()`, the entire **schema**, including your custom `ClassLabel` definition, is serialized and saved alongside the data files (Arrow files plus JSON metadata).
2. **Instant Loading:** `load_from_disk()` is typically **faster** than re-loading from a generic format like CSV, as it skips file parsing and schema inference and memory-maps the saved Arrow files directly.
3. **Efficiency:** The dataset stays in an **optimized columnar format** (Apache Arrow) for subsequent mapping and processing operations, avoiding the overhead of re-parsing plain text files.

#### C. Creating a Dataset from a Python Object

You can easily convert standard Python lists or dictionaries into a **Dataset** object.

In [7]:
from datasets import Dataset

# Create a dataset from a dictionary
data = {
    "sentence": ["Hello, world!", "Coding is fun."],
    "id": [1, 2]
}
my_dataset = Dataset.from_dict(data)

print(my_dataset)
print(my_dataset["sentence"])

Dataset({
    features: ['sentence', 'id'],
    num_rows: 2
})
Column(['Hello, world!', 'Coding is fun.'])


## Data Manipulation Examples

The **Dataset** object has a powerful set of methods for preprocessing, which are applied efficiently using Apache Arrow.

### 1. Tokenization with `map()`

The `map()` method is the workhorse of data processing. It applies a function to every example (or batch of examples) in the dataset. This is where you would typically perform tokenization.

In [8]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load a small dataset split
raw_datasets = load_dataset("glue", "mrpc", split="train[:500]")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Define the tokenization function
def tokenize_function(examples):
    # This processes two sentences, as in a sentence-pair classification task
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

# Apply the function to the entire dataset
# batched=True passes batches of examples to the function, which lets fast tokenizers process them much more quickly
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Remove original text columns and rename 'label' to 'labels' for Transformer model compatibility
final_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
final_datasets = final_datasets.rename_column("label", "labels")

print("--- Tokenized Dataset ---")
print(final_datasets)
print(final_datasets[0])

--- Tokenized Dataset ---
Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 500
})
{'labels': 1, 'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


### 2. Filtering Examples

The `filter()` method keeps examples that satisfy a condition defined in a function.

In [9]:
# Filter the MRPC subset loaded above to keep only paraphrase pairs (label 1)
def is_paraphrase(example):
    # The 'label' feature is a ClassLabel, where 1 means 'equivalent'
    return example["label"] == 1

paraphrase_pairs = raw_datasets.filter(is_paraphrase)

print("\n--- Filtered Dataset (Paraphrase Pairs) ---")
print(f"Original size: {len(raw_datasets)}, Paraphrase size: {len(paraphrase_pairs)}")


--- Filtered Dataset (Paraphrase Pairs) ---
Original size: 500, Paraphrase size: 346


### 3. Selecting Columns and Rows

You can use familiar Python/NumPy slicing and indexing.


| **Method**             | **Description**                           | **Example**                              |
| ---------------------- | ----------------------------------------- | ---------------------------------------- |
| **Indexing/Slicing**   | Accesses a single row or a range of rows. | `train_dataset[0]`, `train_dataset[:3]`  |
| **Column Access**      | Accesses an entire column.                | `train_dataset['text']`                  |
| **`select()`**         | Selects examples by a list of indices.    | `train_dataset.select(range(100))`       |
| **`remove_columns()`** | Removes one or more columns.              | `train_dataset.remove_columns(["text"])` |




### 4. Splitting and Shuffling

`train_test_split()` divides a single split into training and testing portions. It shuffles the rows before splitting by default, so pass a `seed` to make the split reproducible.


In [10]:
# Split the training set into new train and validation splits
split_datasets = raw_datasets.train_test_split(test_size=0.1, seed=42)

print("\n--- Split DatasetDict ---")
print(split_datasets)
print(f"New validation split size: {len(split_datasets['test'])}")


--- Split DatasetDict ---
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 450
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 50
    })
})
New validation split size: 50


### 5. Formatting for Frameworks

To prepare the dataset for direct use in PyTorch, TensorFlow, or NumPy, use `with_format()`, which returns a new dataset view, or `set_format()`, which changes the format in place. Either way, the underlying Apache Arrow data is converted on the fly when rows are accessed, so columns such as `input_ids` come back as framework tensors instead of Python lists.

In [11]:
# Set the format to PyTorch tensors
pytorch_dataset = final_datasets.with_format("torch")

print("\n--- PyTorch Dataset ---")
# The columns are now PyTorch tensors when accessed
print(type(pytorch_dataset[0]['input_ids']))


--- PyTorch Dataset ---
<class 'torch.Tensor'>
