Motivation
While the wasi-gfx proposal provides standard interfaces for graphics rendering and windowing, there is currently no standard for handling the ingestion, compression, and packaging of audio and video streams.
wasi-media is a proposal for a standardized set of high-level WIT interfaces for media pipelines.
By mirroring existing W3C web standards (such as WebCodecs, Media Capture, and Web Audio) where possible, wasi-media would enable developers to write isomorphic media pipelines on top of Wasm.
Scope
The wasi-media API should be backend-agnostic and rely on robust capability negotiation to map guest requests to available hardware drivers or software fallbacks.
In Scope:
- Device enumeration and capture (webcams, microphones, desktop screen capture)
- Hardware-accelerated and software video/audio encoding and decoding
- Media container multiplexing and demultiplexing (e.g. mp4, mkv)
- Audio routing, mixing, and resampling
- Virtual loopback broadcasting (virtual webcam)
Out of Scope:
- Visual compositing, scaling, and 3D rendering (delegated to wasi-gfx)
- Window creation and UI event handling
- Direct access to GPU or other hardware
Use Cases
wasi-media workflows may include:
- Basic Media Recording: Discovering host devices (microphones, webcams), initiating a stream, passing raw frames to hardware encoders, and multiplexing the compressed chunks to a local file.
- Stream Transcoding: A headless, server-side pipeline that demuxes an incoming RTSP or file stream, decodes the video, and re-encodes it to a different bitrate or resolution for distribution.
- Real-Time Webcam Filters and Virtual Broadcasting: Acquiring a webcam source, passing the buffer to wasi-gfx or wasi-nn to apply a virtual overlay, and writing the composited frames back to the host OS as a virtual camera.
- Non-Linear Video Editing (NLE): Loading disparate media clips, demuxing with precise seek-to-timestamp capabilities to generate frame previews, and rendering/muxing a finalized timeline to disk.
- FFmpeg-style Processing: Stitching sequential rendered images into a video container, mixing multiple audio tracks with host-provided resampling, and muxing streams synthetically without relying on a real-time system clock.
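The FFmpeg-style use case depends on deriving presentation timestamps from frame and sample counts instead of a wall clock. The following is a minimal sketch of that arithmetic, not part of the proposal: the function names are hypothetical, and the use of a rational frame rate is an assumption (the draft `video-encoder-config` below carries `framerate` as a plain `u32`). The microsecond unit matches the `timestamp-us` fields in the draft types.

```python
from fractions import Fraction

US_PER_SECOND = 1_000_000

def frame_timestamp_us(frame_index: int, framerate: Fraction) -> int:
    """Presentation timestamp of frame N at a fixed frame rate, in microseconds."""
    # Derive the value from the index every time, so rounding error
    # never accumulates across a long image sequence.
    return int(Fraction(frame_index * US_PER_SECOND) / framerate)

def audio_timestamp_us(samples_written: int, sample_rate: int) -> int:
    """Presentation timestamp of the next audio chunk, from a running sample count."""
    return samples_written * US_PER_SECOND // sample_rate
```

Because both functions depend only on counters, a headless pipeline can mux faster or slower than real time and still produce identical timestamps.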
Implementation Environments
The interfaces should be designed to be fulfilled by a variety of host embeddings:
- Web Browsers: Implemented via a lightweight JS shim that translates wasi-media calls directly to the standard WebCodecs, Media Capture, and Web Audio APIs.
- Desktop/General Server Runtimes like Wasmtime: Backed by multimedia frameworks like GStreamer, FFmpeg (libavcodec/libavformat), or libwebrtc.
- Native Edge/Embedded: Wired directly to OS-level SDKs like AVFoundation (macOS), Media Foundation (Windows), or V4L2/VA-API (Linux) for native performance.
- Virtualized Fallbacks: If a host lacks a specific hardware capability (like AV1 encoding), the component model allows linking a wasi-media import to a software-compiled wasm polyfill.
Draft WIT Definitions
- Types (wasi:media/types.wit)
Shared data structures representing raw buffers and compressed chunks.
package wasi:media@0.1.0-draft;
interface types {
/// Represents standard video pixel formats for uncompressed frames.
enum pixel-format {
rgba8888,
bgra8888,
yuv420p,
nv12,
}
/// Represents standard audio sample formats for uncompressed audio.
enum sample-format {
f32-planar,
s16-interleaved,
}
/// An opaque handle to raw, uncompressed media data in memory.
/// Designed to be interoperable with `wasi:gfx` abstract buffers.
resource media-buffer {
/// Returns the byte size of the underlying buffer.
size: func() -> u64;
}
/// Represents a single, uncompressed frame of video.
record video-frame {
buffer: media-buffer,
format: pixel-format,
width: u32,
height: u32,
/// Presentation timestamp in microseconds.
timestamp-us: u64,
}
/// Represents a chunk of uncompressed audio data.
record audio-data {
buffer: media-buffer,
format: sample-format,
sample-rate: u32,
channels: u8,
/// Presentation timestamp in microseconds.
timestamp-us: u64,
}
/// Represents a packet of compressed media (audio or video).
record encoded-chunk {
data: list<u8>,
/// Presentation timestamp in microseconds.
timestamp-us: u64,
/// The duration of the chunk in microseconds, if known.
duration-us: option<u64>,
/// Indicates if this chunk can be decoded independently.
is-keyframe: bool,
}
}
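As a worked example of what `media-buffer.size` would report for the formats above, the sketch below computes byte sizes for tightly packed frames and audio chunks. It assumes no row padding or stride, which the draft `video-frame` record does not currently describe; a real host may pad rows, so these figures are lower bounds. The function names are hypothetical.

```python
BYTES_PER_SAMPLE = {"f32-planar": 4, "s16-interleaved": 2}

def video_frame_size(pixel_format: str, width: int, height: int) -> int:
    """Byte size of one tightly packed frame (assumption: no stride padding)."""
    if pixel_format in ("rgba8888", "bgra8888"):
        return width * height * 4  # four 8-bit channels per pixel
    if pixel_format in ("yuv420p", "nv12"):
        # Full-resolution luma plane plus two chroma components subsampled
        # 2x2. Chroma dimensions round up for odd frame sizes.
        chroma_w = (width + 1) // 2
        chroma_h = (height + 1) // 2
        return width * height + 2 * chroma_w * chroma_h
    raise ValueError(f"unknown pixel format: {pixel_format}")

def audio_data_size(sample_format: str, frames: int, channels: int) -> int:
    """Byte size of an audio chunk holding `frames` samples per channel."""
    return frames * channels * BYTES_PER_SAMPLE[sample_format]
```

For 1920x1080 video this gives 8,294,400 bytes per RGBA frame and 3,110,400 bytes per 4:2:0 frame, which is one reason capture and encode paths tend to prefer `nv12` or `yuv420p`.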
- Codec (wasi:media/codec.wit)
Hardware and software compression engine and capability negotiation.
interface codec {
use types.{video-frame, audio-data, encoded-chunk};
/// Supported video compression standards.
enum video-codec-type {
h264,
hevc,
av1,
vp9,
}
/// Supported audio compression standards.
enum audio-codec-type {
aac,
opus,
mp3,
flac,
}
/// Indicates whether the host is utilizing hardware acceleration or a software fallback.
enum codec-implementation {
hardware,
software,
}
/// Configuration parameters for initializing a video encoder.
record video-encoder-config {
codec: video-codec-type,
width: u32,
height: u32,
bitrate: u32,
framerate: u32,
/// Optional parameters to declare incoming frame dimensions.
/// If different from target width/height, the host performs a native resize.
source-width: option<u32>,
source-height: option<u32>,
}
/// Configuration parameters for initializing an audio encoder.
record audio-encoder-config {
codec: audio-codec-type,
sample-rate: u32,
channels: u8,
bitrate: u32,
}
/// The response from the host regarding requested codec capabilities.
record encoder-support-result {
supported: bool,
implementation: codec-implementation,
}
/// Queries the host to check if a specific video configuration is supported.
query-video-encoder-support: func(config: video-encoder-config) -> encoder-support-result;
/// Queries the host to check if a specific audio configuration is supported.
query-audio-encoder-support: func(config: audio-encoder-config) -> encoder-support-result;
/// A resource representing an active video encoding session.
resource video-encoder {
/// Initializes the encoder. Traps if the configuration is unsupported.
constructor(config: video-encoder-config);
/// Submits a raw video frame for compression.
encode: func(frame: video-frame, force-keyframe: bool);
/// Retrieves compressed packets from the pipeline.
get-encoded-chunks: func() -> list<encoded-chunk>;
/// Forces the encoder to empty its internal buffers.
flush: func();
}
/// A resource representing an active video decoding session.
resource video-decoder {
constructor(codec: video-codec-type);
/// Submits a compressed packet for decoding.
decode: func(chunk: encoded-chunk);
/// Retrieves uncompressed video frames from the pipeline.
get-decoded-frames: func() -> list<video-frame>;
flush: func();
}
/// A resource representing an active audio encoding session.
resource audio-encoder {
constructor(config: audio-encoder-config);
/// Submits raw audio data for compression.
encode: func(data: audio-data);
get-encoded-chunks: func() -> list<encoded-chunk>;
flush: func();
}
/// A resource representing an active audio decoding session.
resource audio-decoder {
constructor(codec: audio-codec-type);
/// Submits a compressed packet for decoding.
decode: func(chunk: encoded-chunk);
/// Retrieves uncompressed audio data from the pipeline.
get-decoded-audio: func() -> list<audio-data>;
flush: func();
}
}
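The capability negotiation described above could be driven from the guest roughly as follows. This is a sketch under stated assumptions: `FakeHost` is a hypothetical in-process stand-in for the host side of `query-video-encoder-support`, and `pick_config` is an illustrative helper, neither being part of the proposal. The records mirror the draft `video-encoder-config` and `encoder-support-result`, with the optional source dimensions omitted.

```python
from dataclasses import dataclass, replace

@dataclass
class VideoEncoderConfig:
    codec: str
    width: int
    height: int
    bitrate: int
    framerate: int

@dataclass
class EncoderSupportResult:
    supported: bool
    implementation: str  # "hardware" or "software"

class FakeHost:
    """Hypothetical host used only to exercise the negotiation logic."""
    def __init__(self, hardware, software):
        self.hardware = set(hardware)
        self.software = set(software)

    def query_video_encoder_support(self, config):
        if config.codec in self.hardware:
            return EncoderSupportResult(True, "hardware")
        return EncoderSupportResult(config.codec in self.software, "software")

def pick_config(host, preferred_codecs, base):
    """Return (config, result) for the first hardware-backed codec in
    preference order, else the first software-backed one, else None."""
    software_choice = None
    for codec in preferred_codecs:
        config = replace(base, codec=codec)
        result = host.query_video_encoder_support(config)
        if not result.supported:
            continue
        if result.implementation == "hardware":
            return config, result
        if software_choice is None:
            software_choice = (config, result)
    return software_choice
```

Querying before constructing matters because the draft `video-encoder` constructor traps on an unsupported configuration, so the query is the only recoverable way to probe the host.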
- Capture (wasi:media/capture.wit)
Hardware ingest and virtual loopback broadcasting.
interface capture {
use types.{video-frame, audio-data};
/// Categories of physical and virtual media devices.
enum device-type {
camera,
microphone,
screen,
window,
/// A virtual loopback device created by a Wasm guest.
virtual-output,
}
/// Metadata describing an available capture device on the host.
record device-info {
id: string,
name: string,
%type: device-type,
}
/// Request parameters for opening a media stream.
record video-capture-constraints {
device-id: option<string>,
ideal-width: option<u32>,
ideal-height: option<u32>,
ideal-framerate: option<u32>,
}
/// Discovers available media capture devices on the host system.
enumerate-devices: func() -> list<device-info>;
/// A resource representing an active ingest stream from a physical device.
resource media-stream {
/// Opens a stream based on the provided constraints.
constructor(constraints: video-capture-constraints);
/// Pulls the latest video frame from the capture device buffer.
read-video-frame: func() -> option<video-frame>;
/// Pulls the latest audio data from the capture device buffer.
read-audio-data: func() -> option<audio-data>;
/// Closes the stream and releases the hardware.
stop: func();
}
/// A resource that allows the Wasm component to present itself as a virtual device to the host OS.
resource virtual-broadcaster {
/// Registers a new virtual device with the host operating system.
constructor(name: string, has-video: bool, has-audio: bool);
/// Pushes a guest-generated video frame out to the host system as a virtual camera.
write-video-frame: func(frame: video-frame);
/// Pushes guest-generated audio out to the host system as a virtual microphone.
write-audio-data: func(data: audio-data);
/// Unregisters the virtual device.
stop: func();
}
}
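The draft says `read-video-frame` pulls the "latest" frame and returns an `option`, but does not state whether older frames are queued or dropped. One possible reading, sketched below, is a single-slot buffer in which a newer frame overwrites an unread one and a read empties the slot. The class and its `dropped` counter are hypothetical and only illustrate that interpretation.

```python
class LatestFrameSlot:
    """Single-slot buffer: the host overwrites, the guest takes.

    Assumption: 'latest' means stale frames are discarded, not queued.
    """
    def __init__(self):
        self._frame = None
        self.dropped = 0  # frames overwritten before the guest read them

    def push(self, frame):
        """Called on the host side whenever the device delivers a frame."""
        if self._frame is not None:
            self.dropped += 1
        self._frame = frame

    def read(self):
        """Models read-video-frame: returns the newest frame, or None if
        nothing has arrived since the previous read."""
        frame, self._frame = self._frame, None
        return frame
```

Under this reading a slow guest sees a lower effective frame rate but never growing latency, which suits live filters and virtual broadcasting; a recording pipeline that must not lose frames would need queueing semantics instead.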
- Container (wasi:media/container.wit)
Multiplexing and demultiplexing for streams and files.
interface container {
use types.{encoded-chunk};
/// Supported media packaging formats.
enum container-format {
mp4,
mkv,
webm,
ts,
}
/// Metadata describing a specific stream within a multiplexed container.
record track-info {
id: u32,
codec: string,
is-video: bool,
}
/// A resource for packaging separate encoded tracks into a unified byte stream.
resource muxer {
/// Initializes a muxer for the target format.
constructor(format: container-format);
/// Registers a new video or audio track into the container header.
add-track: func(info: track-info);
/// Writes a compressed chunk to the multiplexed stream.
write-chunk: func(track-id: u32, chunk: encoded-chunk);
/// Retrieves the serialized container bytes to be written to disk or network.
get-output-bytes: func() -> list<u8>;
/// Finalizes the container's footers and metadata.
finalize: func();
}
/// A resource for parsing a unified byte stream into separate encoded tracks.
resource demuxer {
/// Feeds raw container bytes (from a file or socket) into the parser.
append-bytes: func(data: list<u8>);
/// Reads the available tracks discovered in the container.
get-tracks: func() -> list<track-info>;
/// Pulls the next sequential compressed chunk for a specific track.
read-chunk: func(track-id: u32) -> option<encoded-chunk>;
/// Fast-forwards or rewinds the parser to the nearest keyframe at the target timestamp.
seek-to-timestamp: func(time-us: u64) -> result<_, string>;
}
}
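The phrase "nearest keyframe" in `seek-to-timestamp` leaves room for interpretation. A common choice, assumed in the sketch below, is the last keyframe at or before the target, so that a decoder can start there and decode forward to the exact requested frame. The function name and the idea of a pre-built keyframe index are hypothetical.

```python
from bisect import bisect_right

def seek_keyframe(keyframe_timestamps_us, target_us):
    """Return the timestamp of the keyframe to resume decoding from.

    `keyframe_timestamps_us` must be sorted ascending. Picks the last
    keyframe at or before the target; if the target precedes every
    keyframe, clamps to the first one.
    """
    if not keyframe_timestamps_us:
        raise ValueError("no keyframes indexed")
    i = bisect_right(keyframe_timestamps_us, target_us)
    if i == 0:
        return keyframe_timestamps_us[0]
    return keyframe_timestamps_us[i - 1]
```

For an NLE preview, the guest would then call `read-chunk` and decode forward from the returned position, discarding frames until the target timestamp is reached.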
- Audio Mixer (wasi:media/audio-mixer.wit)
Host-delegated audio resampling and channel mixing graph.
interface audio-mixer {
use types.{audio-data};
/// A resource representing a node-based audio processing and mixing graph.
resource audio-context {
/// Initializes the master output format. The host handles resampling to match this.
constructor(master-sample-rate: u32, master-channels: u8);
/// Registers an input track into the mix.
add-source-track: func(track-id: u32);
/// Submits raw audio data to a specific source track.
/// The host automatically resamples this data to match the master format.
submit-audio: func(track-id: u32, data: audio-data);
/// Applies gain (volume) adjustments to a specific track before mixing.
set-track-gain: func(track-id: u32, gain: f32);
/// Pulls the finalized, mixed audio data from the graph.
read-master-output: func() -> option<audio-data>;
}
}
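To make the host's responsibilities in `audio-context` concrete, here is a minimal mono sketch of the three steps the doc comments describe: resample each track to the master rate, apply per-track gain, and sum. Linear interpolation and hard clipping to [-1.0, 1.0] are simplifying assumptions; a production host would more likely use a band-limited resampler and a limiter. All names are hypothetical.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono float track by linear interpolation."""
    if src_rate == dst_rate:
        return list(samples)
    out_len = len(samples) * dst_rate // src_rate
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate       # fractional source position
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

def mix(tracks, master_rate):
    """Mix tracks given as (samples, sample_rate, gain) tuples."""
    scaled = [
        [s * gain for s in resample_linear(samples, rate, master_rate)]
        for samples, rate, gain in tracks
    ]
    length = max((len(t) for t in scaled), default=0)
    mixed = []
    for i in range(length):
        total = sum(t[i] for t in scaled if i < len(t))
        mixed.append(max(-1.0, min(1.0, total)))  # hard clip
    return mixed
```

Shorter tracks are treated as silent once they run out, so the mixed output is as long as the longest resampled input.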
- Playback (wasi:media/playback.wit)
Hardware playback of sampled media. Mirrors the wasi-gfx surface.wit interface.
interface playback {
use types.{audio-data};
use audio-mixer.{audio-context};
/// Represents a physical audio output device on the host.
record audio-output-info {
id: string,
name: string,
}
/// Queries the host for available speakers/headphones.
enumerate-audio-outputs: func() -> list<audio-output-info>;
/// Represents a connection to a physical output device.
resource audio-player {
/// Initializes playback on a specific device (or the default device if `none`).
constructor(device-id: option<string>);
/// Option A: Push/Enqueue raw audio data directly to the speaker.
/// Useful for simple, unmixed playback (like an alert sound).
enqueue-audio: func(data: audio-data);
/// Option B: Link the speaker directly to the mixer's master output.
/// The host will now automatically pull audio from the mixer graph
/// and play it out of the physical speakers in real-time.
connect-mixer: func(context: audio-context);
/// Playback state controls
play: func();
pause: func();
set-system-volume: func(level: f32);
}
}
- Synchronization (wasi:media/clocks.wit)
Centralized clock to prevent audio/video sync drift.
package wasi:media@0.1.0-draft;
interface clock {
/// Defines what hardware or system mechanism drives the clock's progression.
enum clock-source {
/// Driven by the audio output hardware's buffer consumption rate.
/// (Recommended master for media players).
audio-playback,
/// Driven by the display's VSync or surface presentation rate.
/// (Recommended master for pure video/UI rendering or games).
video-surface,
/// Driven by the host OS monotonic system clock.
/// (Recommended for headless environments or stream multiplexing).
system-monotonic,
}
/// Represents a synchronized timeline for media playback and processing.
resource media-clock {
/// Initializes a new clock tied to a specific hardware or system source.
constructor(source: clock-source);
/// Starts or resumes the clock.
start: func();
/// Pauses the clock. The presentation time remains fixed until started again.
pause: func();
/// Sets the playback speed multiplier (e.g., 1.0 is normal, 2.0 is fast-forward,
/// -1.0 could theoretically be reverse if the demuxer supports it).
set-rate: func(rate: f32);
/// Seeks the clock's internal timeline to a specific timestamp in microseconds.
/// Crucial for syncing the pipeline when a user scrubs an NLE timeline.
seek: func(time-us: u64);
/// Returns the current presentation time (PTS) in microseconds.
/// Decoders and renderers constantly poll this to determine if a frame is due.
now: func() -> u64;
/// Blocks or yields guest execution until the clock reaches the target timestamp.
/// This is the primary pacing mechanism.
wait-until: func(target-time-us: u64);
}
}
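The interaction between `start`, `pause`, `set-rate`, `seek`, and `now` can be modelled with a single anchor pair: the media time and the host time at the last state change. The sketch below is one possible model, not the proposal's implementation. `MediaClock` takes an injected `monotonic_us` callable (a hypothetical stand-in for the `system-monotonic` source) so the behaviour can be checked deterministically, and clamping at zero for negative rates is an assumption motivated by the unsigned `u64` timestamps.

```python
class MediaClock:
    """Anchor-based presentation clock (illustrative model only)."""

    def __init__(self, monotonic_us):
        self._monotonic_us = monotonic_us  # callable: host time in microseconds
        self._rate = 1.0
        self._running = False
        self._anchor_media_us = 0
        self._anchor_host_us = 0

    def _media_at(self, host_us):
        if not self._running:
            return self._anchor_media_us
        elapsed = host_us - self._anchor_host_us
        return max(0, self._anchor_media_us + int(elapsed * self._rate))

    def _reanchor(self):
        # Fold elapsed time into the media anchor before any state change.
        host_us = self._monotonic_us()
        self._anchor_media_us = self._media_at(host_us)
        self._anchor_host_us = host_us

    def start(self):
        self._reanchor()
        self._running = True

    def pause(self):
        self._reanchor()
        self._running = False

    def set_rate(self, rate):
        self._reanchor()
        self._rate = rate

    def seek(self, time_us):
        self._anchor_media_us = time_us
        self._anchor_host_us = self._monotonic_us()

    def now(self):
        return self._media_at(self._monotonic_us())
```

Re-anchoring on every state change means a rate change or pause never rewrites time that has already elapsed, so `now` stays continuous except across an explicit `seek`.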