podcast:transcript type is always "text/plain" regardless of actual file format #7

@mark-c4r

Description

We consume RSS feeds from multiple podcast sources. When parsing transcripts from a PODSTR-powered feed, we noticed the <podcast:transcript> type attribute is always text/plain, even when the linked file is SRT format. This causes our transcript pipeline to miss the SRT structure — timestamps and cue markers end up as noise in the extracted text, which degrades the quality of downstream summaries and analysis built on top of it.

Details

scripts/build-rss.ts hardcodes the type:

${transcriptUrl ? `<podcast:transcript url="${escapeXml(transcriptUrl)}" type="text/plain" />` : ''}

The Podcasting 2.0 transcript spec defines type as a required attribute. Podcast apps and validators use it to interpret transcript format.
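To illustrate why the attribute matters on the consuming side, here is a minimal sketch of a consumer that dispatches on the declared type. The function names `srtToPlainText` and `transcriptToText` are hypothetical and not taken from any real pipeline, and the SRT handling is deliberately simplified: it assumes cues are separated by blank lines and start with a numeric index followed by a timing line.

```typescript
// Hypothetical consumer-side sketch: strip SRT cue numbers and timing lines
// only when the feed declares the transcript as SRT.
function srtToPlainText(srt: string): string {
  // Matches a timing line such as "00:00:01,000 --> 00:00:04,000".
  const timing = /^\d{2}:\d{2}:\d{2}[,.]\d{3}\s+-->\s+\d{2}:\d{2}:\d{2}[,.]\d{3}/;
  return srt
    .replace(/^\uFEFF/, '') // drop a leading byte-order mark, if present
    .split(/\r?\n\s*\r?\n/) // cues are separated by blank lines
    .map((block) => {
      const lines = block
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line !== '');
      // Drop the cue index, then the timing line, if they lead the block.
      if (/^\d+$/.test(lines[0] ?? '')) lines.shift();
      if (timing.test(lines[0] ?? '')) lines.shift();
      return lines.join(' ');
    })
    .filter((text) => text !== '')
    .join(' ');
}

function transcriptToText(body: string, type: string): string {
  // With type="text/plain" the body is passed through untouched, which is
  // how timestamps and cue markers end up in the extracted text.
  return type === 'application/x-subrip' ? srtToPlainText(body) : body;
}
```

With the feed as it stands today, a consumer like this receives `text/plain` and never reaches the SRT branch.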

Why this happens

The Nostr event tag stores ['transcript', url] without a MIME type, so build-rss.ts has no type information at RSS generation time.

Suggested fix

At upload time, infer the MIME type from the file extension. The browser's File.type is unreliable for formats like .srt, whose MIME type is not in the IANA registry, and is often an empty string in that case. Pass the inferred type as an optional third element in the Nostr tag:

// usePublishEpisode.ts — infer MIME from extension (File.type is unreliable for .srt)
function inferTranscriptMime(filename: string, fileType: string): string {
  const ext = filename.split('.').pop()?.toLowerCase();
  const mimeMap: Record<string, string> = {
    srt: 'application/x-subrip',
    vtt: 'text/vtt',
    json: 'application/json',
    html: 'text/html',
    txt: 'text/plain',
  };
  return mimeMap[ext ?? ''] || fileType || 'text/plain';
}

tags.push(['transcript', transcriptUrl, inferTranscriptMime(transcriptFile.name, transcriptFile.type)]);

Then in build-rss.ts, read the optional third element:

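// Assumes tags.get() returns the values after the tag name ([url, type]),
// so the type sits at index 1; use index 2 if it returns the full tag.
const transcriptType = tags.get('transcript')?.[1] || 'text/plain';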

Backward-compatible: existing 2-element ['transcript', url] tags continue defaulting to text/plain with no behavior change.
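For reference, a minimal sketch of how the RSS side could look end to end. It assumes tags are available as raw `string[]` arrays, which may not match how build-rss.ts actually stores them. `renderTranscriptElement` is a hypothetical name, and the `escapeXml` shown here is a stand-in for the existing helper, not a copy of it.

```typescript
// Hypothetical sketch; the real build-rss.ts may structure this differently.
type NostrTag = string[];

// Stand-in for the existing escapeXml helper.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

function renderTranscriptElement(tags: NostrTag[]): string {
  const tag = tags.find((t) => t[0] === 'transcript');
  const url = tag?.[1];
  if (!url) return ''; // no transcript tag, so emit nothing
  // Third element is optional; older 2-element tags fall back to text/plain.
  const type = tag?.[2] || 'text/plain';
  return `<podcast:transcript url="${escapeXml(url)}" type="${escapeXml(type)}" />`;
}
```

A 3-element tag yields the stored type, while a 2-element tag produces the same output as the current hardcoded template.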

Reproduction

curl -sI "https://blossom.primal.net/82daa00294af2bda132885feef9085c5daeb265c09ad15f9f8e0e65c5dbf8520" | grep -i content-type
# → application/x-subrip

curl -s "https://podcast.nostrcompass.org/rss.xml" | grep "podcast:transcript"
# → type="text/plain"  (expected: application/x-subrip)

I'm opening a PR with this fix.
