⏱️ Time to complete: 1 hr
In this workshop, we build a multimodal video curation pipeline with Ray Data on Anyscale. The pipeline turns raw videos into clean, semantically annotated clip datasets in a single streaming run, where CPU and GPU stages execute concurrently with automatic backpressure.
Videos are streamed directly from the HuggingFaceFV/finevideo dataset, so no local prefetching is needed. Each video is split on the fly into multiple clips, which are then processed and written to Parquet as they flow through the pipeline.
HF parquet (mp4 bytes)
|
+--flat_map(process_video_bytes) # 1 video -> ~10 clips
| scene detect + quality filter + keyframe extraction (fused)
|
+--vLLMEngineProcessor # 1:1, attaches category/is_safe/desc
| Qwen2.5-VL-3B, one replica per GPU
|
+--filter(is_safe) # drops unsafe rows
|
+--map_batches(CLIPEmbedder) # 1:1, attaches 512-d embedding
| CLIP ViT-B/32 on CPU actor pool
|
+--write_parquet # /mnt/cluster_storage/...
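To make the row counts in the diagram concrete, here is a minimal plain-Python sketch of how each stage changes the number of rows. It does not use Ray or any model: `split_video`, `annotate`, and `embed` are hypothetical toy stand-ins (not the functions in stages.py), and the "safety" rule is arbitrary.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def split_video(row: Row) -> List[Row]:
    """flat_map-style stage: one video row becomes one row per clip (toy clip count)."""
    return [{"video_id": row["video_id"], "clip_idx": i} for i in range(row["num_clips"])]


def annotate(row: Row) -> Row:
    """1:1 stage: attach annotation fields. Toy rule standing in for a real model."""
    return {**row, "category": "demo", "is_safe": row["clip_idx"] % 2 == 0}


def embed(row: Row) -> Row:
    """1:1 stage: attach a 512-d vector (all zeros here, purely illustrative)."""
    return {**row, "embedding": [0.0] * 512}


def run(videos: Iterable[Row]) -> Iterator[Row]:
    """Chain the stages lazily: expand, annotate, drop unsafe rows, embed."""
    for video in videos:
        for clip in split_video(video):
            annotated = annotate(clip)
            if annotated["is_safe"]:  # filter stage: unsafe rows are dropped
                yield embed(annotated)


videos = [{"video_id": "a", "num_clips": 3}, {"video_id": "b", "num_clips": 2}]
out = list(run(videos))
print(len(out), "clips kept from", len(videos), "videos")
```

Two input rows expand to five clips, the filter drops two of them, and the remaining rows each gain an embedding column without changing the row count.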
All processing functions for each pipeline stage are defined in stages.py. The notebook brings these in and constructs the Ray Data pipeline incrementally, adding one stage at a time.
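As a rough illustration of building a pipeline one stage at a time, the sketch below adds a stage, inspects a few rows, and then adds the next. It assumes Ray Data is installed; the helper functions and the in-memory input are hypothetical stand-ins for the real stages in stages.py and the real dataset, so treat it as a pattern rather than the notebook's actual code.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def split_into_clips(row: Row) -> List[Row]:
    """Hypothetical stand-in for the video-splitting stage (1 row in, N rows out)."""
    return [{"video_id": row["video_id"], "clip_idx": i} for i in range(row["num_clips"])]


def mark_safe(row: Row) -> Row:
    """Hypothetical stand-in for the annotation stage (toy rule, no model)."""
    return {**row, "is_safe": row["clip_idx"] % 2 == 0}


def keep_safe(row: Row) -> bool:
    """Predicate used by the filter stage."""
    return bool(row["is_safe"])


def build_incrementally() -> List[Row]:
    # Imported lazily so the helpers above can be used without Ray installed.
    import ray

    # Toy in-memory input; the workshop reads real videos instead.
    ds = ray.data.from_items([{"video_id": "a", "num_clips": 3}])

    ds = ds.flat_map(split_into_clips)  # add the first stage
    print(ds.take(2))                   # peek at a few rows before going further

    ds = ds.map(mark_safe)              # add the next stage
    ds = ds.filter(keep_safe)           # then the filter
    return ds.take_all()                # execute the whole chain
```

Because Ray Data builds the plan lazily, each `take` call only runs the stages added so far, which makes it cheap to check a stage's output schema before chaining the next one.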
The key idea is streaming execution with heterogeneous resources. Traditional staged pipelines run one stage at a time, so GPUs sit idle during CPU stages. This pipeline chains all five stages so that CPU and GPU work runs concurrently.
Ray Data executes each operation on the specified compute type, streams data block-by-block between operations, and applies backpressure automatically.
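The idea of backpressure can be shown with a small standard-library sketch. This is not how Ray Data implements it internally; it is only an analogy in which a bounded queue sits between a fast producer stage and a slow consumer stage. All names and sizes here are illustrative assumptions.

```python
import queue
import threading
import time
from typing import List, Tuple


def run_two_stages(num_items: int = 20, buffer_size: int = 4) -> Tuple[int, List[int]]:
    """Run a fast producer and a slow consumer joined by a bounded buffer.

    Returns the largest buffer occupancy observed and the consumer's results.
    """
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel that tells the consumer to stop
    max_buffered = 0
    results: List[int] = []

    def producer() -> None:
        nonlocal max_buffered
        for i in range(num_items):
            buf.put(i)  # blocks while the buffer is full: this is the backpressure
            max_buffered = max(max_buffered, buf.qsize())
        buf.put(done)

    def consumer() -> None:
        while True:
            item = buf.get()
            if item is done:
                break
            time.sleep(0.001)  # simulate a slower downstream stage
            results.append(item * 2)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return max_buffered, results


max_buffered, results = run_two_stages()
print("max items buffered:", max_buffered, "| items processed:", len(results))
```

The producer never gets more than `buffer_size` items ahead of the consumer, so memory use stays bounded even though the two stages run at different speeds.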
Open ray_data_video_curation_pipeline.ipynb in your Anyscale workspace and execute each cell sequentially from start to finish.
This pipeline can also run as an Anyscale Job at production scale. Follow the step-by-step tutorial:
Anyscale Job Tutorial: Streaming Video Curation
- Ray Data Documentation: Learn about Ray Data, its features, and pipeline construction.
- Ray Data LLM API Guide: Official documentation for running LLM-based data operations with Ray Data.
- HuggingFaceFV/finevideo Dataset: The open video dataset used for this pipeline.
- Qwen2.5-VL-3B-Instruct Model: Multimodal model for video understanding and annotation.
- OpenAI CLIP ViT-B/32 Model: Used to generate the 512-dimensional clip embeddings.


