anyscale/video-curation

Building a Video Processing Pipeline with Ray

⏱️ Time to complete: 1 hr

In this workshop, we build a multimodal video curation pipeline with Ray Data on Anyscale. The pipeline turns raw videos into clean, semantically annotated clip datasets in a single streaming run, with CPU and GPU stages executing concurrently under automatic backpressure.

Typical video curation pipeline

Pipeline

Videos are streamed directly from the HuggingFaceFV/finevideo dataset, so nothing has to be downloaded locally in advance. Each video is split on the fly into multiple clips, which are then streamed through the remaining stages and written to Parquet.

HF parquet (mp4 bytes)
    |
    +--flat_map(process_video_bytes)    # 1 video -> ~10 clips
    |     scene detect + quality filter + keyframe extraction (fused)
    |
    +--vLLMEngineProcessor              # 1:1, attaches category/is_safe/desc
    |     Qwen2.5-VL-3B, one replica per GPU
    |
    +--filter(is_safe)                  # drops unsafe rows
    |
    +--map_batches(CLIPEmbedder)        # 1:1, attaches 512-d embedding
    |     CLIP ViT-B/32 on CPU actor pool
    |
    +--write_parquet                    # /mnt/cluster_storage/...
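The row shapes in the diagram above can be sketched in plain Python. This is a minimal illustration, not the code in stages.py: `split_into_clips_stub`, `annotate_stub`, `keep_safe`, and the row fields `video_id`, `duration_s`, `start_s`, and `end_s` are hypothetical names, and the fixed-length split stands in for real scene detection. The point is only the contract of each stage: the fan-out stage returns a list of rows per input row, the annotation stage is 1:1, and the filter is a predicate on a row.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def split_into_clips_stub(row: Row, clip_seconds: float = 10.0) -> List[Row]:
    """Hypothetical stand-in for the fan-out stage: 1 video row -> N clip rows."""
    clips: List[Row] = []
    start = 0.0
    while start < row["duration_s"]:
        end = min(start + clip_seconds, row["duration_s"])
        clips.append({"video_id": row["video_id"], "start_s": start, "end_s": end})
        start = end
    return clips


def annotate_stub(row: Row) -> Row:
    """Hypothetical 1:1 stage: attaches is_safe (made-up rule, no model involved)."""
    return {**row, "is_safe": row["video_id"] != "b"}


def keep_safe(row: Row) -> bool:
    """Predicate with the shape a row-level filter expects: row in, bool out."""
    return bool(row.get("is_safe", False))


def flat_map(rows: Iterable[Row], fn: Callable[[Row], List[Row]]) -> Iterator[Row]:
    """Local emulation of flat_map semantics: flatten each returned list."""
    for row in rows:
        yield from fn(row)


videos = [
    {"video_id": "a", "duration_s": 25.0},
    {"video_id": "b", "duration_s": 8.0},
]
clips = list(flat_map(videos, split_into_clips_stub))
annotated = [annotate_stub(c) for c in clips]
safe = [c for c in annotated if keep_safe(c)]
print(len(videos), "videos ->", len(clips), "clips ->", len(safe), "safe clips")
```

With Ray Data, functions of this shape would be passed to `flat_map` and `filter` on a dataset rather than driven by the local helpers shown here.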

All processing functions for the pipeline stages are defined in stages.py. The notebook imports them and builds the Ray Data pipeline incrementally, one stage at a time.

The key idea is streaming execution across heterogeneous resources. Traditional staged pipelines run one stage at a time, so GPUs sit idle during CPU stages. This pipeline chains all five stages so that CPU and GPU work runs concurrently:

Video curation pipeline with Ray Data
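To make the idle-GPU argument concrete, here is a back-of-the-envelope model. It is a sketch under strong assumptions that are not taken from the workshop: one CPU worker, one GPU worker, a constant processing time per block, and made-up numbers (100 blocks, 2 s of CPU work and 3 s of GPU work per block). In the staged case the GPU waits for the whole CPU stage; in the streamed case it only waits for the first block.

```python
def staged_makespan(n_blocks: int, cpu_s: float, gpu_s: float) -> float:
    """Run the whole CPU stage first, then the whole GPU stage."""
    return n_blocks * cpu_s + n_blocks * gpu_s


def streamed_makespan(n_blocks: int, cpu_s: float, gpu_s: float) -> float:
    """Two-stage pipeline: fill time plus (n - 1) steps of the slower stage."""
    return cpu_s + gpu_s + (n_blocks - 1) * max(cpu_s, gpu_s)


def gpu_utilization(n_blocks: int, gpu_s: float, makespan: float) -> float:
    """Fraction of wall-clock time during which the GPU is busy."""
    return n_blocks * gpu_s / makespan


# Illustrative numbers only; they are not measurements from this pipeline.
N_BLOCKS, CPU_S, GPU_S = 100, 2.0, 3.0

staged = staged_makespan(N_BLOCKS, CPU_S, GPU_S)
streamed = streamed_makespan(N_BLOCKS, CPU_S, GPU_S)
util_staged = gpu_utilization(N_BLOCKS, GPU_S, staged)
util_streamed = gpu_utilization(N_BLOCKS, GPU_S, streamed)

print(f"staged:   {staged:.0f} s, GPU busy {util_staged:.0%}")
print(f"streamed: {streamed:.0f} s, GPU busy {util_streamed:.0%}")
```

Under these assumptions the streamed run is bounded by the slower stage rather than by the sum of both stages, which is why overlapping CPU and GPU work pays off.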

Ray Data executes each operation on the specified compute type, streams data block-by-block between operations, and applies backpressure automatically.
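As an analogy for what backpressure means here, the sketch below connects a fast producer to a slower consumer through a bounded buffer using only the Python standard library. It is not how Ray Data implements backpressure internally; `run_pipeline` and `max_buffered` are hypothetical names, and the arithmetic stands in for real CPU and GPU work. The idea it shows is that the upstream stage blocks once the buffer is full, so in-flight data stays bounded no matter how large the input is.

```python
import queue
import threading
from typing import Iterable, List, Tuple

_SENTINEL = object()  # marks the end of the stream


def run_pipeline(items: Iterable[int], max_buffered: int = 2) -> Tuple[List[int], int]:
    """Stream items through two stages joined by a bounded buffer.

    Returns the results and the largest buffer depth the consumer observed.
    """
    buf: "queue.Queue[object]" = queue.Queue(maxsize=max_buffered)
    results: List[int] = []
    peak = 0

    def upstream() -> None:
        for item in items:
            # put() blocks while the buffer is full: this is the backpressure.
            buf.put(item * 2)
        buf.put(_SENTINEL)

    def downstream() -> None:
        nonlocal peak
        while True:
            peak = max(peak, buf.qsize())
            item = buf.get()
            if item is _SENTINEL:
                break
            results.append(item + 1)

    threads = [threading.Thread(target=upstream), threading.Thread(target=downstream)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, peak


results, peak = run_pipeline(range(10), max_buffered=2)
print(results, "peak buffered items:", peak)
```

The buffer size is the knob in this toy version: a larger buffer smooths out bursts at the cost of more memory held between stages.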

Heterogeneous scheduling with Ray Data

How to run

Open ray_data_video_curation_pipeline.ipynb in your Anyscale workspace and execute each cell sequentially from start to finish.

Running at Production Scale

This pipeline can be run as an Anyscale Job at production scale. Follow the step-by-step tutorial here:

Anyscale Job Tutorial: Streaming Video Curation

