Focus on Chapter metadata

Centralized information about audio chapters and their various implementations.

Recap

Standard	Marker	Supported fields	Specifications	Notes
ID3v1	None	N/A
APE tag	None	N/A
ID3v2	CHAP / CTOC	Start timestamp, End timestamp, Start frame, End Frame, Title, Subtitle + any other ID3v2 field	Link	Very modular
Vorbis	CHAPTERnnnXXX	Start timestamp, title, URL	Link	Purely text-based, simple and efficient
MP4	"Quicktime" chapters	Duration, title, attached picture	Link - short paragraph about chapter lists	Text data muxed into the media stream (akin to subtitles) -> complex to read and write
MP4	"Nero" chapters (CHPL)	Start timestamp, title	None	Metadata-based -> simpler to use than QT chapters

ID3v2

The ID3v2 chapter structure is organized around the CTOC (table of contents) and CHAP (chapter) frames.

CTOC lists all chapters and gives them an order. CTOC frames can be nested in one another, which makes it possible to describe a tree structure.
CHAP describes a chapter: title, subtitle, start and end frames / timestamps... The format allows any ID3v2 field to be used to describe a chapter, which makes it very flexible.

The standard is documented thoroughly at http://id3.org/id3v2-chapters-1.0
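To make the CHAP layout more concrete, here is a minimal sketch of how the body of a CHAP frame could be decoded, based on my reading of the specification linked above. The function name `parse_chap_body` and the returned dictionary keys are illustrative assumptions, and the input is assumed to be the frame body only (the bytes that follow the 10-byte ID3v2 frame header).

```python
import struct


def parse_chap_body(body: bytes) -> dict:
    """Decode the fixed part of a CHAP frame body (hypothetical helper)."""
    # The body starts with a null-terminated element ID naming the chapter.
    terminator = body.index(b"\x00")
    element_id = body[:terminator].decode("latin-1")

    # Four big-endian unsigned 32-bit integers follow:
    # start time and end time in milliseconds, then start and end offsets.
    # In the spec the offsets are byte positions of audio frames;
    # 0xFFFFFFFF means "not used, rely on the timestamps".
    start_ms, end_ms, start_offset, end_offset = struct.unpack_from(
        ">IIII", body, terminator + 1
    )

    # Whatever remains is a sequence of embedded ID3v2 frames
    # (e.g. TIT2 for the title, TIT3 for the subtitle), left undecoded here.
    subframes = body[terminator + 1 + 16:]

    return {
        "element_id": element_id,
        "start_ms": start_ms,
        "end_ms": end_ms,
        "start_offset": start_offset,
        "end_offset": end_offset,
        "subframes": subframes,
    }
```

The embedded sub-frames are returned raw on purpose: decoding them requires a regular ID3v2 frame parser, which is outside the scope of this sketch.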

Vorbis

The Vorbis structure for chapters is very simple and efficient. Being text-based, it fits into "vanilla" Vorbis comments.

The idea is to describe chapter metadata with ordinary comment fields whose names are prefixed with "CHAPTERnnn" (e.g. CHAPTER001NAME=Prologue).

The standard is documented thoroughly at https://wiki.xiph.org/Chapter_Extension
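As an illustration of how little is needed to read these fields, here is a minimal sketch that groups CHAPTERnnn comments into a chapter list. It assumes the comments have already been read into a dictionary, that the bare CHAPTERnnn field carries the start time as HH:MM:SS.mmm, and that NAME and URL are the only suffixes of interest; the function names are hypothetical.

```python
import re

# Matches CHAPTER001, CHAPTER001NAME, CHAPTER001URL (case-insensitive).
_CHAPTER_KEY = re.compile(r"^CHAPTER(\d{3})(NAME|URL)?$", re.IGNORECASE)


def parse_timestamp_ms(value: str) -> int:
    """Convert 'HH:MM:SS.mmm' into milliseconds."""
    hours, minutes, seconds = value.strip().split(":")
    total_seconds = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return round(total_seconds * 1000)


def parse_vorbis_chapters(comments: dict) -> list:
    """Group CHAPTERnnn* comments into a list of chapters sorted by index."""
    chapters = {}
    for key, value in comments.items():
        match = _CHAPTER_KEY.match(key)
        if match is None:
            continue  # not a chapter field (e.g. TITLE, ARTIST)
        index = int(match.group(1))
        suffix = (match.group(2) or "").upper()
        chapter = chapters.setdefault(
            index, {"start_ms": None, "title": None, "url": None}
        )
        if suffix == "":
            chapter["start_ms"] = parse_timestamp_ms(value)
        elif suffix == "NAME":
            chapter["title"] = value
        else:  # suffix == "URL"
            chapter["url"] = value
    return [chapters[index] for index in sorted(chapters)]
```

For example, `parse_vorbis_chapters({"CHAPTER001": "00:00:00.000", "CHAPTER001NAME": "Prologue"})` yields a single chapter starting at 0 ms and titled "Prologue".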

MP4

MP4 allows two methods for describing chapters: "Quicktime" chapters and "Nero" chapters.

Quicktime (QT) Chapters

The MP4 file format descends from the Quicktime file format, so the two have much in common, including chapter formatting.

Contrary to the ID3v2 or Vorbis implementations, QT chapters are not described in an isolated metadata field somewhere in the header of the file. They are incorporated (multiplexed, to be more precise) into the media data stream itself. They "show up" at their own start timestamps, just like subtitles would. In fact, they use the same data structure as subtitles.

As a result, parsing QT chapters requires a bit more than peeking into the udta.meta atom:

In every audio track, look for moov.trak.tref.chap (optional atom).
If present, 'chap' contains as many int32 (1) as there are related text tracks that contain chapters (e.g. track 1 has a 'chap' atom containing a single int32 with the value 3 -> track 3 contains chapters for track 1).

NB: As far as I know, if an audio track refers to multiple chapter tracks, the one containing chapter titles is the first in the list. I'm not sure how to identify the contents of the other referred chapter tracks (they might contain chapter URLs, chapter pictures or chapters in other languages?).

Go to the referred track and make sure its handler type is 'text'.
Get its number of samples and their durations using moov.trak.mdia.minf.stbl.stts. Be aware that actual durations have to be calculated using the track timescale (moov.trak.mdia.mdhd), and not the file timescale (moov.mvhd).
Map each sample to its containing chunk using moov.trak.mdia.minf.stbl.stsc. Make sure to read the documentation of stsc carefully, as its way of describing data is somewhat unusual.
Calculate the absolute offset of each sample using moov.trak.mdia.minf.stbl.stco (chunk offset) and moov.trak.mdia.minf.stbl.stsz (sample size).
For each sample thus located, read the chapter title stored at its offset.
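The stts / stsc / stco / stsz steps above can be sketched as follows. This is only an illustration under stated assumptions: the tables are supposed to be already decoded into plain Python lists, stsc entries are reduced to (first_chunk, samples_per_chunk) pairs with 1-based chunk numbers, and the function names are hypothetical.

```python
def sample_start_times_ms(stts, timescale):
    """Start time of each sample, in ms.

    stts: list of (sample_count, sample_delta) runs.
    timescale: units per second for this track (not for the whole file).
    """
    starts = []
    elapsed = 0  # elapsed time in track timescale units
    for sample_count, sample_delta in stts:
        for _ in range(sample_count):
            starts.append(elapsed * 1000 // timescale)
            elapsed += sample_delta
    return starts


def sample_offsets(stsc, stco, stsz):
    """Absolute file offset of each sample.

    stsc: list of (first_chunk, samples_per_chunk); each entry applies from
          first_chunk up to the chunk before the next entry's first_chunk.
    stco: list of chunk offsets (index 0 is chunk 1).
    stsz: list of sample sizes in bytes.
    """
    offsets = []
    sample_index = 0
    for i, (first_chunk, samples_per_chunk) in enumerate(stsc):
        # The last entry runs until the final chunk listed in stco.
        if i + 1 < len(stsc):
            last_chunk = stsc[i + 1][0] - 1
        else:
            last_chunk = len(stco)
        for chunk in range(first_chunk, last_chunk + 1):
            position = stco[chunk - 1]
            for _ in range(samples_per_chunk):
                if sample_index >= len(stsz):
                    return offsets  # no more samples declared
                offsets.append(position)
                # Samples are stored back to back inside a chunk.
                position += stsz[sample_index]
                sample_index += 1
    return offsets
```

The run-length nature of stsc is what makes it unusual: an entry does not describe one chunk, but every chunk from its first_chunk until the next entry takes over.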

Type (1)	Data	Notes
int16	String data size	Size of following string data
string	Chapter name	Uses UTF-8 encoding; size of binary data is declared on previous field

(1): Big-endian convention
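Reading one chapter title following the layout in the table above could look like this. It is a minimal sketch that assumes the whole file (or at least the relevant part) is available as a bytes object and that the title is UTF-8, as stated in the table; `read_chapter_title` is a hypothetical name.

```python
import struct


def read_chapter_title(data: bytes, offset: int) -> str:
    """Read a text sample: big-endian int16 size, then the string itself."""
    (size,) = struct.unpack_from(">H", data, offset)
    raw = data[offset + 2: offset + 2 + size]
    return raw.decode("utf-8")
```

Any bytes that follow the string inside the sample are simply ignored by this sketch.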

Specifications for QT chapters are limited to a short paragraph in the Quicktime File Format specification, which explains how they work from a functional point of view: https://developer.apple.com/standards/qtff-2001.pdf

Nero Chapters

"Nero" chapters are an alternative to Quicktime chapters implemented by the Nero software suite. They aim to provide a simpler, metadata-based chapter description akin to Vorbis chapters.

They are implemented as a specific atom located at moov.udta.chpl.

The contents of the atom are as follows:

Type (1)	Data	Notes
int32	Atom Size	As part of standard MP4 atom header
char[4]	Atom Name	As part of standard MP4 atom header; value is "chpl"
byte	Version	Atom version
int24	Flags	Atom flags (none known so far)
byte	Reserved	Unknown reserved byte
int32	Chapter count	Number of chapters
---	---	--- Following lines are repeated for each chapter
int64	Chapter start time	Uses 100-nanosecond base; divide by 10 000 to get milliseconds
byte	String data size	Size of following string data
string	Chapter name	Uses UTF-8 encoding; size of binary data is declared on previous field

(1): Big-endian convention
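The table above translates fairly directly into code. The following is a minimal sketch that follows the layout exactly as described here (which, as noted, comes from observation rather than from an official specification). It assumes the input is the atom payload, i.e. the bytes after the 8-byte size and name header; `parse_chpl_payload` is a hypothetical name.

```python
import struct


def parse_chpl_payload(payload: bytes) -> list:
    """Return a list of (start_ms, title) tuples from a 'chpl' payload."""
    # Byte 0: version; bytes 1-3: flags; byte 4: reserved (all unused here).
    (chapter_count,) = struct.unpack_from(">I", payload, 5)
    position = 9

    chapters = []
    for _ in range(chapter_count):
        # Start time is a big-endian 64-bit integer in 100-nanosecond units.
        (start_100ns,) = struct.unpack_from(">q", payload, position)
        position += 8
        # The title is prefixed with its size on a single byte.
        title_size = payload[position]
        position += 1
        title = payload[position: position + title_size].decode("utf-8")
        position += title_size
        chapters.append((start_100ns // 10_000, title))
    return chapters
```

Because the title size is stored on a single byte, a title cannot exceed 255 bytes once encoded in UTF-8.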

To my knowledge, understanding of Nero chapters comes from reverse engineering, as there are no official specifications.

NB1: Quicktime player, iTunes and the built-in iOS audiobook player support Quicktime chapters only, and ignore Nero chapters entirely.

NB2: Some players such as VLC seem to fail to read Nero chapters properly when there are more than 255 of them, for instance on (very) long audiobooks. As the Nero structure described above allows any number of chapters to be written, I'm unsure whether this is a bug or a part of the Nero standard I'm unaware of...

Other formats

As far as I know, there is no other implementation of audio chapters. I wouldn't be surprised to see Vorbis-like chapters included informally in other standards, as they are portable to any tagging system without effort.
