# Load MiraData Video Dataset from Hugging Face

Import the 100-video MiraData sample from GitHub into a Pixeltable table. MiraData is a video dataset with detailed caption annotations for video understanding tasks.

**What's in this recipe:**
- Import 100 sample videos from MiraData via CSV
- Load video URLs with rich metadata (captions, resolution, frame rate, duration, source info)
- Work with diverse video content from YouTube and other sources
- Leverage detailed annotations for video understanding tasks


## Problem

You need a representative MiraData sample in Pixeltable so you can apply AI models for video analysis, captioning, or understanding without downloading the much larger full dataset from Hugging Face.


## Solution

**What you'll load:**
- 100 video samples from MiraData's curated sample CSV on GitHub
- Metadata for each clip: video URL, captions, resolution, frame rate, duration, and source information
- Diverse content from YouTube, including 3D engine-rendered scenes and real-world footage

The MiraData team provides a [sample CSV with 100 videos](https://github.com/mira-space/MiraData/blob/v1/assets/miradata_v1_100_samples.csv) that you can load directly into Pixeltable without downloading the full dataset from [Hugging Face](https://huggingface.co/datasets/TencentARC/MiraData).


### Setup


In [None]:
!uv add pixeltable pandas

In [None]:
import pixeltable as pxt
import pandas as pd


### Load MiraData Sample CSV

Load the [MiraData sample CSV](https://raw.githubusercontent.com/mira-space/MiraData/v1/assets/miradata_v1_100_samples.csv) containing 100 video samples with metadata.


In [None]:
# Load MiraData sample CSV from GitHub
csv_url = "https://raw.githubusercontent.com/mira-space/MiraData/v1/assets/miradata_v1_100_samples.csv"
df = pd.read_csv(csv_url)

# Display first few rows to see the structure
df.head()
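Before building the table, it can help to confirm that the CSV actually contains the columns the schema in this recipe reads. The following is a minimal sketch: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the list simply mirrors the schema keys used later in this recipe.

```python
# Column names the table schema in this recipe relies on.
EXPECTED_COLUMNS = [
    'video_url', 'clip_id', 'source', 'video_id', 'width',
    'height', 'fps', 'seconds', 'short_caption', 'dense_caption',
]


def missing_columns(actual, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `actual`, in order."""
    present = set(actual)
    return [name for name in expected if name not in present]


# In the notebook, `actual` would be df.columns; a plain list stands in here.
example_columns = ['video_url', 'clip_id', 'source', 'width', 'height']
print(missing_columns(example_columns))
```

In the notebook, calling `missing_columns(df.columns)` should return an empty list if the CSV matches the schema; a non-empty result names the columns to investigate before inserting.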


In [None]:
# Start clean: drop any existing 'miradata_videos' directory (and its contents), then recreate it
pxt.drop_dir('miradata_videos', force=True)
pxt.create_dir('miradata_videos')


### Create Pixeltable Table


In [None]:
# Create table with schema for videos and metadata
t = pxt.create_table(
    'miradata_videos.samples',
    schema={
        'video_url': pxt.String,
        'clip_id': pxt.String,
        'source': pxt.String,
        'video_id': pxt.String,
        'width': pxt.Int,
        'height': pxt.Int,
        'fps': pxt.Float,
        'seconds': pxt.Float,
        'short_caption': pxt.String,
        'dense_caption': pxt.String
    },
    comment='MiraData video dataset: 100 samples loaded from the GitHub sample CSV'
)


In [None]:
# Build one dict per CSV row, casting pandas/NumPy scalars to native Python types
rows = []
for _, row in df.iterrows():
    rows.append({
        'video_url': row['video_url'],
        'clip_id': str(row['clip_id']),
        'source': row['source'],
        'video_id': row['video_id'],
        'width': int(row['width']),
        'height': int(row['height']),
        'fps': float(row['fps']),
        'seconds': float(row['seconds']),
        'short_caption': row['short_caption'],
        'dense_caption': row['dense_caption']
    })

t.insert(rows)
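pandas marks empty CSV cells as `NaN`, which is a float and may not be accepted by a string column. Whether the sample CSV has any empty cells is not stated here, so the following is only a defensive sketch: `nan_to_none` and `clean_row` are hypothetical helpers, and the approach assumes the table columns accept null values.

```python
import math


def nan_to_none(value):
    """Return None for a float NaN (how pandas marks missing cells), else the value unchanged."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def clean_row(row):
    """Apply nan_to_none to every value in a row dictionary."""
    return {key: nan_to_none(value) for key, value in row.items()}


# Hypothetical row with a missing dense caption, for illustration only.
example = {'clip_id': '42', 'seconds': 12.5, 'dense_caption': float('nan')}
print(clean_row(example))
```

If missing values turn out to be a problem, the row list could be cleaned with `rows = [clean_row(r) for r in rows]` before it is passed to `t.insert`.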


In [None]:
# View sample data
t.select(t.video_url, t.source, t.short_caption, t.width, t.height).head(10)


In [None]:
# Confirm the row count (the sample CSV should yield 100 rows)
t.count()
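Alongside the total count, a per-source tally gives a quick view of how the sample is distributed across sources. This sketch works on plain row dictionaries such as the `rows` list built for insertion; `clips_per_source` is a hypothetical helper, and the example rows below are made up for illustration rather than taken from the CSV.

```python
from collections import Counter


def clips_per_source(rows):
    """Count how many row dictionaries carry each 'source' value."""
    return Counter(row['source'] for row in rows)


# Made-up rows for illustration; in the notebook, pass the `rows` list instead.
example_rows = [
    {'source': 'source_a'},
    {'source': 'source_a'},
    {'source': 'source_b'},
]
print(clips_per_source(example_rows).most_common())
```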


### Publish to Pixeltable Cloud

Publish the table to Pixeltable Cloud with public access so that others can use it.


In [None]:
# Publish the table to Pixeltable Cloud
pxt.publish(
    'miradata_videos.samples',
    'pxt://pixeltable:huggingface/miradata_videos',
    access='public'
)


## See also

- [MiraData on Hugging Face](https://huggingface.co/datasets/TencentARC/MiraData)
- [MiraData GitHub Repository](https://github.com/mira-space/MiraData)
- [MiraData Sample CSV](https://raw.githubusercontent.com/mira-space/MiraData/v1/assets/miradata_v1_100_samples.csv)
