Focus on Chapter metadata

Zeugma440 edited this page Aug 24, 2022 · 14 revisions

Centralized information about audio chapters and their various implementations.

Recap

| Standard | Marker | Supported fields | Specifications | Notes |
| --- | --- | --- | --- | --- |
| ID3v1 | None | N/A | | |
| APE tag | None | N/A | | |
| ID3v2 | CHAP / CTOC | Start timestamp, End timestamp, Start frame, End frame, Title, Subtitle + any other ID3v2 field | Link | Very modular |
| Vorbis | CHAPTERnnnXXX | Start timestamp, title, URL | Link | Purely text-based, simple and efficient |
| MP4 | "Quicktime" chapters | Duration, title, attached picture | Link - short paragraph about chapter lists | Text data muxed into the media stream (akin to subtitles) -> complex to read and write |
| MP4 | "Nero" chapters (CHPL) | Start timestamp, title | None | Metadata-based -> simpler to use than QT chapters |

ID3v2

The ID3v2 chapter structure is organized around the CTOC (table of contents) and CHAP (chapter) frames:

  • CTOC lists the chapters and gives them an order. CTOC frames can be nested in one another, which makes it possible to describe a tree structure.
  • CHAP describes a single chapter: title, subtitle, start and end frames / timestamps... The format allows any ID3v2 field to be used to describe a chapter, which makes it very flexible.

The standard is documented thoroughly at http://id3.org/id3v2-chapters-1.0
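As an illustration of what a CHAP frame contains, here is a minimal Python sketch that decodes a frame body, assuming the ID3v2 frame header has already been stripped. The layout used (null-terminated element ID, then four big-endian 32-bit values for start time, end time, start offset and end offset, then embedded sub-frames) follows the chapter specification linked above; the `ChapFrame` class and `parse_chap_body` function are names made up for this example, and the embedded sub-frames (typically TIT2 for the title and TIT3 for the subtitle) are left unparsed.

```python
import struct
from dataclasses import dataclass

# An offset with all bits set means "ignore the offset, use the timestamp instead"
UNUSED_OFFSET = 0xFFFFFFFF


@dataclass
class ChapFrame:
    element_id: str    # unique identifier, referenced by CTOC frames
    start_ms: int      # start time in milliseconds
    end_ms: int        # end time in milliseconds
    start_offset: int  # byte offset of the first audio frame of the chapter
    end_offset: int    # byte offset of the first audio frame after the chapter
    sub_frames: bytes  # raw embedded ID3v2 frames (e.g. TIT2, TIT3), left unparsed


def parse_chap_body(body: bytes) -> ChapFrame:
    """Decode the body of a CHAP frame (frame header already removed)."""
    # The element ID is a null-terminated string
    end = body.index(b"\x00")
    element_id = body[:end].decode("latin-1")
    # Four big-endian unsigned 32-bit integers follow the terminator (16 bytes)
    start_ms, end_ms, start_off, end_off = struct.unpack_from(">4I", body, end + 1)
    return ChapFrame(element_id, start_ms, end_ms, start_off, end_off, body[end + 17:])
```

A reader would usually rely on `start_ms` / `end_ms` and only use the offsets when they differ from `UNUSED_OFFSET`.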

Vorbis

The Vorbis chapter structure is simple and efficient. Being purely text-based, it fits into "vanilla" Vorbis comments.

The idea is to describe chapter metadata with ordinary comment fields whose names are prefixed with "CHAPTERnnn" (e.g. CHAPTER001NAME=Prologue)

The standard is documented thoroughly at https://wiki.xiph.org/Chapter_Extension
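To make the naming scheme concrete, here is a small Python sketch that groups CHAPTERnnn fields by chapter number. It assumes the comments have already been read from the file as "KEY=value" strings, that the bare CHAPTERnnn field holds the start time as HH:MM:SS.mmm, and that any suffix (NAME, URL...) is kept as-is; `parse_vorbis_chapters` and `timestamp_to_ms` are hypothetical helper names, not part of any library.

```python
import re
from collections import defaultdict

# CHAPTER + three digits + optional suffix (NAME, URL...)
_CHAPTER_KEY = re.compile(r"^CHAPTER(\d{3})([A-Z]*)$", re.IGNORECASE)


def timestamp_to_ms(value: str) -> int:
    """Convert 'HH:MM:SS.mmm' to milliseconds using integer arithmetic only."""
    hours, minutes, seconds = value.strip().split(":")
    whole, _, fraction = seconds.partition(".")
    millis = int((fraction + "000")[:3])  # pad or truncate to three digits
    return (int(hours) * 3600 + int(minutes) * 60 + int(whole)) * 1000 + millis


def parse_vorbis_chapters(comments):
    """Return one dict per chapter, ordered by chapter number."""
    chapters = defaultdict(dict)
    for comment in comments:
        key, _, value = comment.partition("=")
        match = _CHAPTER_KEY.match(key)
        if not match:
            continue  # not a chapter field
        index, suffix = int(match.group(1)), match.group(2).lower()
        if suffix == "":
            chapters[index]["start_ms"] = timestamp_to_ms(value)
        else:
            chapters[index][suffix] = value
    return [chapters[i] for i in sorted(chapters)]
```

For instance, the fields CHAPTER001=00:00:00.000 and CHAPTER001NAME=Prologue end up in the same dictionary, with keys `start_ms` and `name`.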

MP4

MP4 allows two methods for describing chapters: "Quicktime" chapters and "Nero" chapters

Quicktime (QT) Chapters

The MP4 file format descends from the Quicktime file format, so the two share many features, including chapter formatting.

Contrary to the ID3v2 or Vorbis implementations, QT chapters are not described in an isolated metadata field somewhere in the header of the file. They are incorporated (multiplexed, to be more precise) into the media data stream itself. They "show up" at their own start timestamps, just like subtitles would. In fact, they use the same data structure as subtitles.

As a result, parsing QT chapters requires a bit more than peeking into the udta.meta atom:

  1. In every audio track, look for moov.trak.tref.chap (optional atom)
  2. If present, 'chap' contains one int32 (1) per related text track that contains chapters (e.g. if track 1 has a 'chap' atom containing a single int32 with the value 3, track 3 contains the chapters for track 1)

NB: As far as I know, if an audio track refers to multiple chapter tracks, the one containing chapter titles is the first of the list. I'm not sure how to identify the contents of the other referred chapter tracks (they might contain chapter URLs, chapter pictures or chapters in other languages?)

  3. Go to the referred track and make sure its handler type is 'text'
  4. Get its number of samples and their duration using moov.trak.mdia.minf.stbl.stts. Be aware that actual durations have to be calculated using the track's own media timescale (moov.trak.mdia.mdhd), and not the file timescale (moov.mvhd)
  5. Map each sample to its containing chunk using moov.trak.mdia.minf.stbl.stsc. Make sure to carefully read the documentation of stsc, as its way of describing data is somewhat unusual.
  6. Calculate the absolute offset of each sample using moov.trak.mdia.minf.stbl.stco (chunk offset) and moov.trak.mdia.minf.stbl.stsz (sample size)
  7. For each sample thus located, read its title at that offset.
| Type (1) | Data | Notes |
| --- | --- | --- |
| int16 | String data size | Size of following string data |
| string | Chapter name | Uses UTF-8 encoding; size of binary data is declared on previous field |

(1) : Big-Endian convention
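The two small binary reads involved (the 'chap' payload and the chapter title) can be sketched in Python as follows. This is only an illustration under assumptions: the atom payload and the file contents are already available as `bytes`, the int16 size is treated as unsigned, and the function names `read_chap_track_ids` and `read_chapter_title` are invented for the example.

```python
import struct


def read_chap_track_ids(payload: bytes) -> list:
    """Decode the payload of a 'chap' atom: a plain sequence of big-endian int32 track IDs."""
    count = len(payload) // 4
    return list(struct.unpack_from(">%dI" % count, payload, 0))


def read_chapter_title(data: bytes, offset: int) -> str:
    """Read one chapter title from a text sample located at `offset` in `data`."""
    # Big-endian 16-bit size, treated here as unsigned (assumption)
    (size,) = struct.unpack_from(">H", data, offset)
    raw = data[offset + 2 : offset + 2 + size]
    return raw.decode("utf-8")
```

Note that the declared size counts bytes, not characters, so a title with accented letters occupies more bytes than its length in characters.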

Specifications for QT chapters are limited to a small paragraph in the Quicktime File Format specification, which explains how they work from a functional point of view: https://developer.apple.com/standards/qtff-2001.pdf
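The timing and sample-location steps listed earlier can be sketched in Python as below. The sketch assumes the stts, stsc, stco and stsz tables have already been decoded into plain Python lists (the function and parameter names are made up for the example), that stsz has been expanded to one size per sample, and that `timescale` is the chapter track's own timescale. The point of interest is stsc: each entry gives a 1-based first chunk and a number of samples per chunk, and applies to every chunk up to the first chunk of the next entry.

```python
def sample_timings_ms(stts_entries, timescale):
    """Expand stts (sample_count, sample_delta) runs into (start_ms, duration_ms) pairs."""
    timings = []
    ticks = 0  # elapsed time, in track timescale units
    for count, delta in stts_entries:
        for _ in range(count):
            timings.append((ticks * 1000 // timescale, delta * 1000 // timescale))
            ticks += delta
    return timings


def sample_offsets(stsc_entries, chunk_offsets, sample_sizes):
    """Return the absolute file offset of every sample.

    stsc_entries:  (first_chunk, samples_per_chunk) pairs, first_chunk being 1-based
    chunk_offsets: absolute offset of each chunk, as listed in stco
    sample_sizes:  size of each sample, as listed in stsz
    """
    offsets = []
    sample = 0
    for i, (first_chunk, per_chunk) in enumerate(stsc_entries):
        # An entry is valid until the next entry's first chunk;
        # the last entry runs until the final chunk listed in stco
        if i + 1 < len(stsc_entries):
            next_first = stsc_entries[i + 1][0]
        else:
            next_first = len(chunk_offsets) + 1
        for chunk in range(first_chunk, next_first):
            position = chunk_offsets[chunk - 1]
            for _ in range(per_chunk):
                if sample >= len(sample_sizes):
                    return offsets  # no more samples to place
                offsets.append(position)
                # Samples are stored back to back inside a chunk
                position += sample_sizes[sample]
                sample += 1
    return offsets
```

Each offset returned by `sample_offsets` is where the size-prefixed chapter title of the corresponding sample starts, and the matching entry of `sample_timings_ms` gives the chapter's start time and duration.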

Nero Chapters

"Nero" chapters are an alternative to Quicktime chapters implemented by the Nero software suite. They aim at providing a simpler, metadata-based chapter description akin to Vorbis chapters.

They are implemented as a specific atom located at moov.udta.chpl

The contents of the atom are as follows:

| Type (1) | Data | Notes |
| --- | --- | --- |
| int32 | Atom Size | As part of standard MP4 atom header |
| char[4] | Atom Name | As part of standard MP4 atom header; value is "chpl" |
| byte | Version | Atom version |
| int24 | Flags | Atom flags (none known so far) |
| byte | Reserved | Unknown reserved byte |
| int32 | Chapter count | Number of chapters |
| | | The following rows are repeated for each chapter |
| int64 | Chapter start time | Uses 100-nanosecond base; divide by 10 000 to get milliseconds |
| byte | String data size | Size of following string data |
| string | Chapter name | Uses UTF-8 encoding; size of binary data is declared on previous field |

(1) : Big-Endian convention
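Read literally, the layout above translates into the following Python sketch. It assumes the whole atom, header included, is available as `bytes` and that the field sizes are exactly those of the table; `parse_chpl` is a name invented for the example, and the result is a list of (start time in milliseconds, title) pairs.

```python
import struct


def parse_chpl(atom: bytes):
    """Decode a 'chpl' atom (header included) into (start_ms, title) pairs."""
    _size, name = struct.unpack_from(">I4s", atom, 0)
    if name != b"chpl":
        raise ValueError("not a chpl atom")
    # Skip header (8 bytes), version (1), flags (3) and the reserved byte (1)
    pos = 8 + 1 + 3 + 1
    (count,) = struct.unpack_from(">I", atom, pos)
    pos += 4
    chapters = []
    for _ in range(count):
        # Start time is a big-endian int64 expressed in 100-nanosecond units
        (start_100ns,) = struct.unpack_from(">q", atom, pos)
        title_size = atom[pos + 8]  # single byte: size of the title in bytes
        pos += 9
        title = atom[pos : pos + title_size].decode("utf-8")
        pos += title_size
        chapters.append((start_100ns // 10_000, title))
    return chapters
```

For example, a start time of 900 000 000 units corresponds to 90 000 milliseconds, i.e. one minute and thirty seconds.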

To my knowledge, what is known about Nero chapters comes from reverse engineering, as there are no official specifications.

NB1: Quicktime player, iTunes and the built-in iOS audiobook player support Quicktime chapters only, and ignore Nero chapters entirely.

NB2: Some players such as VLC seem to fail to read Nero chapters properly when there are more than 255 of them, for instance on (very) long audiobooks. As the Nero structure described above allows more than 255 chapters to be written, I'm unsure whether this is a bug or a part of the Nero standard I'm unaware of...

Other formats

As far as I know, there is no other implementation of audio chapters. I wouldn't be surprised to see Vorbis-like chapters included informally in other standards, as they are portable to any tagging system without effort.