# Quick Guide: Exploring Hugging Face Datasets

A simple walkthrough of loading and inspecting datasets before fine-tuning.

**Goal**: Before we fine-tune a model, we need to understand the dataset structure and content.


In [1]:
# !pip install datasets



In [2]:
from datasets import load_dataset
import random


## Step 1: Load the Dataset

We'll use the **Alpaca dataset** (`tatsu-lab/alpaca` on the Hugging Face Hub), a collection of instruction-following examples.


In [3]:
print("=" * 70)
print("Loading the Alpaca Dataset from Hugging Face")
print("=" * 70)
print()

# Load dataset
dataset = load_dataset("tatsu-lab/alpaca")

print("Dataset loaded! Let's explore it...")
print()


Loading the Alpaca Dataset from Hugging Face

Dataset loaded! Let's explore it...



## Step 2: Check Available Splits

Many datasets ship with separate train, validation, and test splits, but not all of them do. Printing the loaded object shows which splits we actually have:


In [4]:
print("Available splits:")
print(dataset)
print()
print(f"Number of examples: {len(dataset['train']):,}")
print()


Available splits:
DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 52002
    })
})

Number of examples: 52,002



## Step 3: Understand the Structure

What fields does each example have?


In [5]:
print("=" * 70)
print("Dataset Structure")
print("=" * 70)
print()

print("Fields in each example:")
print(dataset['train'].column_names)
print()


Dataset Structure

Fields in each example:
['instruction', 'input', 'output', 'text']



## Step 4: Look at Examples

Let's examine some actual examples to understand the data:


In [6]:
print("First example:")
print("-" * 70)
first = dataset['train'][0]
for key, value in first.items():
    print(f"{key}: {value}")
print()


First example:
----------------------------------------------------------------------
instruction: Give three tips for staying healthy.
input: 
output: 1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.
text: Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.
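
One record is rarely representative, so it helps to look at several. The sketch below draws a few distinct row indices in a reproducible way using only the standard library. The helper name `sample_indices` is hypothetical, and the row count is the one printed in Step 2.

```python
import random

def sample_indices(num_rows, k, seed=0):
    """Return k distinct row indices, reproducible for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(range(num_rows), k)

indices = sample_indices(num_rows=52_002, k=3, seed=42)
print(indices)

# On the loaded dataset, each index can then be inspected with:
#   for i in indices:
#       print(dataset["train"][i])
```

Fixing the seed means the same rows come back on every run, which makes it easier to discuss a specific example with someone else.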



In [7]:
print("-" * 70)
print("Random example:")
print("-" * 70)
random_ex = dataset['train'][random.randint(0, len(dataset['train']) - 1)]
for key, value in random_ex.items():
    print(f"{key}: {value}")
print()


----------------------------------------------------------------------
Random example:
----------------------------------------------------------------------
instruction: Update the following sentence with the right punctuation
input: What are you doing
output: What are you doing?
text: Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Update the following sentence with the right punctuation

### Input:
What are you doing

### Response:
What are you doing?
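
The two examples above suggest that `text` is just the other three fields poured into one of two prompt templates, depending on whether `input` is empty. Here is a minimal sketch of that logic. The template strings are copied from the printed outputs, the names `build_text`, `PROMPT_WITH_INPUT`, and `PROMPT_NO_INPUT` are hypothetical, and the exact whitespace in the real data has not been verified here.

```python
# Templates transcribed from the `text` fields printed above (whitespace is an assumption)
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

def build_text(example):
    """Pick the template based on whether the example has a non-empty input."""
    template = PROMPT_WITH_INPUT if example["input"].strip() else PROMPT_NO_INPUT
    return template.format(**example)  # unused keys such as "text" are ignored

with_input = build_text({
    "instruction": "Update the following sentence with the right punctuation",
    "input": "What are you doing",
    "output": "What are you doing?",
})
without_input = build_text({
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "Eat well, exercise, and sleep.",
})
print(with_input)
```

Comparing `build_text(example)` against `example["text"]` on a handful of rows is a quick way to check whether the templates really match the dataset.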



## Step 5: Quick Statistics

Let's gather some useful statistics about the dataset:


In [8]:
print("=" * 70)
print("Quick Statistics")
print("=" * 70)
print()

# Check output lengths in characters (not tokens)
output_lengths = [len(ex['output']) for ex in dataset['train']]
print(f"Output length - Min: {min(output_lengths)}, Max: {max(output_lengths)}, Avg: {sum(output_lengths)/len(output_lengths):.0f} chars")

# Check for empty inputs
empty_inputs = sum(1 for ex in dataset['train'] if not ex['input'].strip())
print(f"Examples with empty input field: {empty_inputs:,} ({100*empty_inputs/len(dataset['train']):.1f}%)")
print()


Quick Statistics

Output length - Min: 0, Max: 4181, Avg: 270 chars
Examples with empty input field: 31,323 (60.2%)
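
A minimum output length of 0 means at least one example has an empty `output`, which gives the model nothing to learn from. The sketch below shows one way to count and drop such rows, and adds the median as a more robust companion to the average. It runs on a small list of toy records (hypothetical data with the same keys as Alpaca), and the helper name `summarize` is also hypothetical.

```python
import statistics

# Toy records standing in for dataset["train"] (illustrative only)
records = [
    {"instruction": "Say hi", "input": "", "output": "Hi!"},
    {"instruction": "Fix punctuation", "input": "What are you doing", "output": "What are you doing?"},
    {"instruction": "Broken row", "input": "", "output": ""},
    {"instruction": "Count to three", "input": "  ", "output": "One, two, three."},
]

def summarize(rows):
    """Character-length statistics plus counts of blank fields."""
    lengths = [len(row["output"]) for row in rows]
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        "empty_outputs": sum(1 for row in rows if not row["output"].strip()),
        "empty_inputs": sum(1 for row in rows if not row["input"].strip()),
    }

summary = summarize(records)
print(summary)

# Keep only rows that actually contain a response
usable = [row for row in records if row["output"].strip()]
print(f"Kept {len(usable)} of {len(records)} rows")
```

On the real dataset, the same predicate can be passed to `Dataset.filter`, for example `dataset["train"].filter(lambda ex: len(ex["output"].strip()) > 0)`.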



## Step 6: Create Train/Validation Split

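The Alpaca dataset only has a train split, so we hold out 10% of it for validation. Note that `train_test_split` always names the held-out portion `test`, so our validation set lives under `split_dataset['test']`: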


In [9]:
print("=" * 70)
print("Creating Train/Validation Split")
print("=" * 70)
print()

split_dataset = dataset['train'].train_test_split(test_size=0.1, seed=42)

print(f"Training examples: {len(split_dataset['train']):,}")
print(f"Validation examples: {len(split_dataset['test']):,}")
print()


Creating Train/Validation Split

Training examples: 46,801
Validation examples: 5,201



---

## Key Takeaways

1. **Loading is easy**: Use `load_dataset("dataset-name")` from the Hugging Face `datasets` library
2. **Inspect the structure**: Check column names and splits before training
3. **Sample examples**: Always look at actual examples to understand the data format
4. **Statistics matter**: Check for empty fields, length distributions, etc.
5. **Create validation sets**: If the dataset doesn't have one, split it yourself

**Next Steps**: 
- Format these examples for instruction tuning
- Tokenize the data
- Create a DataLoader for training
