# Technical Documentation
CaptionMate follows the Model-View-ViewModel (MVVM) architecture pattern for clean separation of concerns and maintainability.
```
┌─────────────────────────────────────────────────┐
│                   View Layer                    │
│         (SwiftUI Views - Presentation)          │
└────────────────┬────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────┐
│                 ViewModel Layer                 │
│       (ContentViewModel - Business Logic)       │
└────────────────┬────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────┐
│                   Model Layer                   │
│          (Data Models, Services, APIs)          │
└─────────────────────────────────────────────────┘
```
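In code, this layering looks roughly like the following minimal sketch (the type names and view contents here are illustrative, not the actual CaptionMate sources):

```swift
import SwiftUI

// Model layer: plain data owned by services.
struct Transcript {
    var text: String = ""
}

// ViewModel layer: observable state plus business logic.
@MainActor
final class ExampleViewModel: ObservableObject {
    @Published var transcript = Transcript()

    func transcribe() {
        // Delegate to a service (e.g., WhisperService) and publish the result.
        transcript.text = "..."
    }
}

// View layer: declarative presentation, no business logic.
struct ExampleView: View {
    @StateObject private var viewModel = ExampleViewModel()

    var body: some View {
        VStack {
            Text(viewModel.transcript.text)
            Button("Transcribe") { viewModel.transcribe() }
        }
    }
}
```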
**Project Structure**

```
CaptionMate/
├── App/
│   └── CaptionMateApp.swift             # App entry point
├── Configuration/
│   ├── Constants/
│   │   └── Constants.swift              # App constants
│   └── Extensions/
│       ├── Ex+Animation.swift           # Animation helpers
│       ├── Ex+Color.swift               # Color extensions
│       ├── Ex+NSSavePanel.swift         # File save dialog
│       ├── Ex+Theme.swift               # Theme management
│       ├── Ex+TimeFormatter.swift       # Time formatting
│       └── Ex+UTType.swift              # File type utilities
├── Source/
│   ├── Data/
│   │   ├── Models/
│   │   │   ├── StateModels.swift        # State management
│   │   │   └── DownloadError.swift      # Error types
│   │   └── Services/
│   │       ├── WhisperService.swift     # Whisper integration
│   │       ├── ExportService.swift      # Subtitle export
│   │       └── FileTypeService.swift    # File type handling
│   └── Presentation/
│       ├── ViewModels/
│       │   └── ContentViewModel.swift   # Main ViewModel
│       └── Views/
│           ├── ContentView.swift        # Main view
│           ├── AudioViews/              # Audio UI components
│           ├── ModelManagementViews/    # Model management UI
│           └── TranscriptionViews/      # Transcription UI
└── Tests/
    ├── CaptionMateTests/                # Unit tests
    └── CaptionMateUITests/              # UI tests
```
**ContentViewModel**

Purpose: Central ViewModel managing all app state and business logic.
Key Responsibilities:
- Model lifecycle management
- Audio file processing
- Transcription orchestration
- Export coordination
- User settings persistence
Published Properties:

```swift
@Published var whisperKit: WhisperKit?
@Published var transcriptionState: TranscriptionState
@Published var modelManagementState: ModelManagementState
@Published var audioState: AudioState
@Published var uiState: UIState
@Published var transcriptionResult: TranscriptionResult?
@Published var audioPlayer: AVAudioPlayer?
```

Key Methods:

```swift
// Model Management
func loadModel(_ model: String, redownload: Bool = false)
func releaseModel() async
func downloadModel(_ model: String)
func cancelDownload(_ model: String)
func deleteModel(_ model: String)
// Transcription
func transcribeFile(path: String)
func transcribeAudioSamples(_ samples: [Float]) async throws -> TranscriptionResult?
// Audio Control
func playImportedAudio()
func pauseImportedAudio()
func seekToPosition(_ position: Double)
func setVolume(_ volume: Double)
// Export
func exportTranscription() async
```

**State Models**

```swift
struct TranscriptionState {
    var currentText: String
    var currentChunks: [Int: (chunkText: [String], fallbacks: Int)]
    var confirmedSegments: [TranscriptionSegment]

    // Performance metrics
    var tokensPerSecond: TimeInterval
    var effectiveRealTimeFactor: TimeInterval
    var effectiveSpeedFactor: TimeInterval
    var totalInferenceTime: TimeInterval
}

struct ModelManagementState {
    var modelState: ModelState
    var availableModels: [String]
    var localModels: [String]
    var downloadProgress: [String: Float]
    var currentDownloadingModels: Set<String>
    var modelSizes: [String: Int64]
}

struct AudioState {
    var importedAudioURL: URL?
    var audioFileName: String
    var isPlaying: Bool
    var isTranscribing: Bool
    var totalDuration: Double
    var waveformSamples: [Float]
}

struct UIState {
    var showAdvancedOptions: Bool
    var isModelmanagerViewPresented: Bool
    var showDownloadErrorAlert: Bool
    var downloadError: DownloadError?
}
```

Transcription pipeline:

```
Audio File → AudioProcessor → WhisperKit → DecodingOptions → TranscriptionResult → Export
```
Process Flow:

1. **Audio Loading**

   ```swift
   let samples = try AudioProcessor.loadAudioAsFloatArray(fromPath: path)
   ```

2. **Model Inference**

   ```swift
   let results = try await whisperKit.transcribe(
       audioArray: samples,
       decodeOptions: options,
       callback: progressCallback
   )
   ```

3. **DecodingOptions Configuration**

   ```swift
   DecodingOptions(
       task: .transcribe,
       language: languageCode,
       temperature: Float(temperatureStart),
       temperatureFallbackCount: Int(fallbackCount),
       sampleLength: sampleLength,
       usePrefillPrompt: enablePromptPrefill,
       usePrefillCache: enableCachePrefill,
       skipSpecialTokens: !enableSpecialCharacters,
       withoutTimestamps: !enableTimestamps,
       wordTimestamps: enableWordTimestamp,
       concurrentWorkerCount: Int(concurrentWorkerCount),
       chunkingStrategy: chunkingStrategy
   )
   ```

4. **Result Processing** (see the sketch after this list)
   - Segment cleaning and sorting
   - Overlap adjustment
   - Empty segment removal

5. **Export**
   - Format conversion (SRT/WebVTT/JSON/FCPXML)
   - File writing with user-selected path
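Steps 1-3 are shown with code above; step 4 has no code in this document, so here is a plausible sketch of the cleanup it describes (the `Segment` type and helper name are illustrative, not CaptionMate's actual implementation):

```swift
import Foundation

// Hypothetical segment shape; WhisperKit's TranscriptionSegment differs in details.
struct Segment {
    var start: Double
    var end: Double
    var text: String
}

// Post-processing pass: drop empty segments, sort by start time,
// and clamp overlapping boundaries to the next segment's start.
func cleanSegments(_ segments: [Segment]) -> [Segment] {
    var sorted = segments
        .filter { !$0.text.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty }
        .sorted { $0.start < $1.start }

    guard sorted.count > 1 else { return sorted }
    for i in 0..<(sorted.count - 1) where sorted[i].end > sorted[i + 1].start {
        sorted[i].end = sorted[i + 1].start  // overlap adjustment
    }
    return sorted
}
```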
Supported Models:
| Model | Size | Speed | Memory | Use Case |
|---|---|---|---|---|
| Tiny | ~150MB | ~10x RT | ~500MB | Quick drafts |
| Base | ~250MB | ~8x RT | ~800MB | General use |
| Small | ~500MB | ~5x RT | ~1.5GB | Balanced |
| Medium | ~1.5GB | ~3x RT | ~3GB | High quality |
| Large-v3 | ~3GB | ~2x RT | ~6GB | Maximum accuracy |
RT = Real-time factor (1x = same duration as audio)
Model States:
```swift
enum ModelState: String {
    case unloaded     // No model in memory
    case downloading  // Downloading from remote
    case downloaded   // Downloaded, not loaded
    case prewarming   // Optimizing for device
    case loading      // Loading into memory
    case loaded       // Ready for use
    case unloading    // Releasing resources
}
```

Download Management:
- Concurrent downloads (max 3)
- Progress tracking (weighted by model size)
- Automatic disk space verification (sketched below)
- Graceful cancellation handling
- Partial download cleanup
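The disk-space check can be implemented with standard URL resource values; a minimal sketch (the function name and threshold handling are assumptions, not the actual CaptionMate code):

```swift
import Foundation

// Returns true if the volume holding `directory` has at least `required` bytes free.
func hasEnoughDiskSpace(at directory: URL, required: Int64) -> Bool {
    let values = try? directory.resourceValues(
        forKeys: [.volumeAvailableCapacityForImportantUsageKey]
    )
    guard let available = values?.volumeAvailableCapacityForImportantUsage else {
        return false  // capacity unknown: fail safe and refuse the download
    }
    return available >= required
}
```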
**Export Format Examples**

SRT (SubRip):

```
1
00:00:00,000 --> 00:00:05,000
Subtitle text here

2
00:00:05,000 --> 00:00:10,000
Next subtitle
```
Timestamp Format:
```swift
func toSRTTimestamp() -> String {
    let hours = Int(self) / 3600
    let minutes = (Int(self) % 3600) / 60
    let seconds = Int(self) % 60
    let milliseconds = Int((self - Double(Int(self))) * 1000)
    return String(format: "%02d:%02d:%02d,%03d", hours, minutes, seconds, milliseconds)
}
```

WebVTT:

```
WEBVTT
00:00:00.000 --> 00:00:05.000
Subtitle text here
00:00:05.000 --> 00:00:10.000
Next subtitle
```
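WebVTT differs from SRT mainly in the millisecond separator (`.` instead of `,`), and the export code below reuses exactly that substitution. A VTT counterpart to the SRT formatter could therefore be as small as this sketch (the helper name is an assumption):

```swift
import Foundation

extension TimeInterval {
    // Same layout as toSRTTimestamp(), but WebVTT uses '.' before milliseconds.
    func toVTTTimestamp() -> String {
        toSRTTimestamp().replacingOccurrences(of: ",", with: ".")
    }
}
```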
JSON:

```json
{
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 5.0,
      "text": "Subtitle text here"
    }
  ],
  "text": "Full transcription text",
  "language": "en"
}
```

FCPXML:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE fcpxml>
<fcpxml version="1.11">
    <resources>
        <format id="r1" frameDuration="100/3000s" width="1920" height="1080"/>
    </resources>
    <library>
        <event name="Subtitles">
            <project name="Subtitle Project">
                <sequence format="r1">
                    <spine>
                        <title offset="0/30s" duration="150/30s">
                            <text>Subtitle text here</text>
                        </title>
                    </spine>
                </sequence>
            </project>
        </event>
    </library>
</fcpxml>
```

**Advanced Settings**

| Setting | Type | Range | Default | Description |
|---|---|---|---|---|
| Temperature | Double | 0.0 - 1.0 | 0.0 | Sampling randomness |
| Fallback Count | Int | 0 - 10 | 5 | Temperature fallback attempts |
| Sample Length | Int | 224 - 448 | 224 | Audio sample window |
| Compression Check | Int | 20 - 100 | 60 | Token window for compression check |
| Concurrent Workers | Int | 1 - 8 | 4 | Parallel processing workers |
**Chunking Strategies**

```swift
enum ChunkingStrategy {
    case vad             // Voice Activity Detection
    case none            // No chunking
    case timeBasedSplit  // Fixed time intervals
}
```

VAD (Voice Activity Detection):
- Automatically detects speech segments
- Skips silent portions
- More efficient processing
- Better segment boundaries
Time-Based Split:
- Fixed duration chunks
- Predictable processing time
- Useful for continuous speech
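Either strategy is selected through the same `DecodingOptions` shown earlier; for example, a minimal sketch assuming WhisperKit's defaulted initializer:

```swift
import WhisperKit

// VAD chunking suits long recordings with pauses; .none suits short clips.
let options = DecodingOptions(
    task: .transcribe,
    chunkingStrategy: .vad
)
```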
**Theme Management**

```swift
enum AppTheme: String, CaseIterable, Identifiable {
    case light = "Light"
    case dark = "Dark"
    case auto = "Auto"

    var id: String { rawValue }  // Identifiable conformance

    var colorScheme: ColorScheme? {
        switch self {
        case .light: return .light
        case .dark: return .dark
        case .auto:
            let appearance = NSApp.effectiveAppearance
            return appearance.bestMatch(from: [.darkAqua, .aqua]) == .darkAqua ? .dark : .light
        }
    }
}
```

All custom colors use `NSColor.dynamicProvider` for automatic theme adaptation:

```swift
static var transcriptionBackground: Color {
    Color(nsColor: NSColor(name: nil, dynamicProvider: { appearance in
        appearance.bestMatch(from: [.darkAqua, .aqua]) == .darkAqua
            ? NSColor(Color(red: 0.12, green: 0.12, blue: 0.14)) // Dark
            : NSColor(Color(red: 0.99, green: 0.99, blue: 1.0))  // Light
    }))
}
```

Custom Colors:
- `transcriptionBackground`
- `audioControlBackground`
- `audioControlSectionBackground`
- `audioControlContainerBackground`
- `audioControlDivider`
- `audioPlaceholderBackground`
- `waveformBackground`
- `controlButtonForeground`
- `modelSelectorText`
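Views can then use these directly; the dynamic provider resolves the color per appearance at draw time. A usage sketch (the view content is illustrative):

```swift
// The background adapts to light/dark mode without any extra code.
Text(segment.text)
    .padding()
    .background(Color.transcriptionBackground)
```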
Test Suites:

- `ContentViewModelTests`: ViewModel state management
- `FileTypeServiceTests`: File type conversions
- `TimeIntervalExtensionTests`: Time formatting
- `ColorExtensionTests`: Theme colors
- `StateModelsTests`: State initialization
- `ExportServiceTests`: Subtitle export
Example Test:

```swift
@Test("SRT timestamp conversion test")
func testToSRTTimestamp() throws {
    let result = TimeInterval(3661.5).toSRTTimestamp()
    #expect(result == "01:01:01,500")
}
```

Test Classes:
- `BasicUITests`: App launch and UI elements (single launch)
- `InteractionTests`: User interactions (fresh launch per test)
- `PerformanceTests`: Launch performance metrics
Test Strategy:

- Use `waitForExistence(timeout:)` for reliable element detection
- Support localization with `NSPredicate` queries
- Minimize app launches for speed
- Use `accessibilityIdentifier` for stable element references (sketched below)
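A sketch of that strategy in XCTest (the identifier and label strings are illustrative, not the app's actual values):

```swift
import XCTest

final class ExampleUITests: XCTestCase {
    func testTranscribeControlsAppear() {
        let app = XCUIApplication()
        app.launch()

        // Stable lookup via accessibilityIdentifier rather than a localized label.
        let button = app.buttons["transcribeButton"]
        XCTAssertTrue(button.waitForExistence(timeout: 5))

        // Case-insensitive NSPredicate query that tolerates label variations.
        let transcribeTexts = app.staticTexts
            .matching(NSPredicate(format: "label CONTAINS[c] %@", "transcribe"))
        XCTAssertGreaterThan(transcribeTexts.count, 0)
    }
}
```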
**Performance Benchmarks**

| Model | Real-Time Factor | 1 Hour Audio | Memory Usage |
|---|---|---|---|
| Tiny | ~10x | ~6 minutes | ~500MB |
| Base | ~8x | ~7.5 minutes | ~800MB |
| Small | ~5x | ~12 minutes | ~1.5GB |
| Medium | ~3x | ~20 minutes | ~3GB |
| Large | ~2x | ~30 minutes | ~6GB |
Memory Management:

```swift
let samples = try await Task {
    try autoreleasepool {
        try AudioProcessor.loadAudioAsFloatArray(fromPath: path)
    }
}.value
```

UI Performance:
- `LazyVStack` for long lists
- Combine timers for smooth updates (30fps)
- Debounced progress updates (0.5s interval; sketched below)
- Background task execution
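The debounced updates can be expressed with Combine; a minimal sketch (the subject and property names are assumptions):

```swift
import Combine
import Foundation

final class ProgressReporter: ObservableObject {
    @Published var displayedProgress: Double = 0
    let rawProgress = PassthroughSubject<Double, Never>()
    private var cancellables = Set<AnyCancellable>()

    init() {
        // Coalesce bursts of progress callbacks: the UI only updates once
        // the stream has been quiet for 0.5s.
        rawProgress
            .debounce(for: .seconds(0.5), scheduler: RunLoop.main)
            .sink { [weak self] value in self?.displayedProgress = value }
            .store(in: &cancellables)
    }
}
```

`throttle` would be an alternative if updates should keep firing during a continuous stream rather than after it settles.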
Audio Processing:
- Chunking strategies (VAD/time-based)
- Parallel worker processing
- Automatic volume normalization
**Model Loading Pipeline**

```
1. Check Disk Space
         ↓
2. Initialize WhisperKit
         ↓
3. Download Model (if needed)
         ↓
4. Prewarm Model (optimize for device)
         ↓
5. Load Model (into memory)
         ↓
6. Ready for Transcription
```
Progress Stages:
- Initialization: 0% - 20%
- Prewarming: 20% - 70%
- Loading: 70% - 100%
Features:
- Concurrent downloads (max 3 simultaneous)
- Weighted progress calculation by model size
- Automatic retry on transient failures
- Cancellation with cleanup
- Network error detection
Error Handling:
```swift
enum DownloadError {
    case networkOffline         // No internet
    case networkLost            // Connection dropped
    case cannotConnectToServer  // Server unreachable
    case timeout                // Request timeout
    case diskSpaceFull          // Insufficient storage
    case permissionDenied       // File access denied
    case fileNotFound           // Missing file
    case unknown(String)        // Other errors
}
```

**Waveform Visualization**

Process:
- Load audio as Float array
- Compute RMS (Root Mean Square) values
- Chunk into 1024-sample segments
- Generate visual representation
Implementation:

```swift
private func computeWaveform(from samples: [Float]) -> [Float] {
    let chunkSize = 1024
    var rmsValues = [Float]()
    var index = 0
    while index < samples.count {
        let chunk = samples[index..<min(index + chunkSize, samples.count)]
        let sumSquares = chunk.reduce(0) { $0 + $1 * $1 }
        let rms = sqrt(sumSquares / Float(chunk.count))
        rmsValues.append(rms)
        index += chunkSize
    }
    return rmsValues
}
```

**Audio Normalization**

Target: -14 LUFS (Loudness Units relative to Full Scale)
Process:
- Sample audio at multiple points
- Calculate average power (dB)
- Estimate LUFS
- Compute gain adjustment
- Apply attenuation only (no amplification)
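A sketch of the gain computation under those rules (the LUFS estimate itself is simplified here; the real pipeline derives it from the averaged power measurements described above):

```swift
import Foundation

// Attenuation-only normalization toward a -14 LUFS target.
func normalizationGain(estimatedLUFS: Double, target: Double = -14.0) -> Float {
    let gainDB = target - estimatedLUFS      // negative when the audio is too loud
    guard gainDB < 0 else { return 1.0 }     // never amplify, only attenuate
    return Float(pow(10.0, gainDB / 20.0))   // dB → linear amplitude factor
}

// e.g. audio measured at -8 LUFS → gain of 10^(-6/20) ≈ 0.5 (about -6 dB)
```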
**Localization**

File: `Localizable.xcstrings`
Structure:

```json
{
  "sourceLanguage": "en",
  "strings": {
    "key": {
      "localizations": {
        "en": { "stringUnit": { "value": "English text" } },
        "ko": { "stringUnit": { "value": "한국어 텍스트" } }
      }
    }
  }
}
```

Implementation:
```swift
.environment(\.locale, Locale(identifier: appLanguage))
```

Features:
- No app restart required
- Instant UI update
- Menu bar localization
- Sheet content localization
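Putting the pieces together, the restart-free switch is an `@AppStorage` value driving the environment; a sketch (the storage key and view contents are assumptions):

```swift
import SwiftUI

struct RootView: View {
    // Persisted language code; changing it re-renders the hierarchy immediately.
    @AppStorage("appLanguage") private var appLanguage = "en"

    var body: some View {
        VStack {
            Text("greeting_key")  // resolved from Localizable.xcstrings per locale
            Picker("Language", selection: $appLanguage) {
                Text("English").tag("en")
                Text("한국어").tag("ko")
            }
        }
        .environment(\.locale, Locale(identifier: appLanguage))
    }
}
```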
**Error Mapping**

Network Errors:

```swift
if nsError.domain == NSURLErrorDomain {
    switch nsError.code {
    case NSURLErrorNotConnectedToInternet:
        return .networkOffline
    case NSURLErrorNetworkConnectionLost:
        return .networkLost
    // ...
    }
}
```

File System Errors:
```swift
if nsError.domain == NSCocoaErrorDomain {
    switch nsError.code {
    case NSFileWriteOutOfSpaceError:
        return .diskSpaceFull
    case NSFileWriteNoPermissionError:
        return .permissionDenied
    // ...
    }
}
```

**Recovery Mechanisms**

Automatic:
- CoreML cache clearing on space issues
- Model reload retry on MPSGraph errors
- Partial download cleanup on cancellation
User-Initiated:
- Retry button on download failures
- Manual cache clearing
- State reset option
**Memory Optimizations**

Autoreleasepool:

```swift
try autoreleasepool {
    try AudioProcessor.loadAudioAsFloatArray(fromPath: path)
}
```

Resource Cleanup:
```swift
func releaseModel() async {
    await whisperKit?.unloadModels()
    whisperKit = nil
    clearCoreMLRuntimeCache()
}
```

Lazy Loading:
```swift
LazyVStack {
    ForEach(models) { model in
        ModelRowView(model: model)
    }
}
```

Efficient Updates:
```swift
// 30fps playback timer
Timer.publish(every: 0.033, on: .main, in: .common)
    .autoconnect()
    .sink { [weak self] _ in
        self?.currentPlayerTime = player.currentTime
    }
```

**Build Requirements**

- Xcode: 15.0+
- macOS SDK: 15.0+
- Swift: 5.0+
- Swift Package Manager: For dependencies
```swift
dependencies: [
    .package(url: "https://github.com/argmaxinc/WhisperKit", from: "0.9.0")
]
```

Frameworks:
- SwiftUI
- AVFoundation
- Core ML
- AppKit
- Combine
- UniformTypeIdentifiers
SwiftFormat:

```bash
mint run swiftformat .
```

Validation Scripts:

```bash
scripts/check_whitespace.sh       # Trailing whitespace check
scripts/check_filename_spaces.sh  # Filename validation
scripts/check_copyright.sh        # Copyright header verification
scripts/style.sh                  # Code style check
```

CI Workflows:

```
name: CaptionMate CI
on: [push, pull_request]
jobs:
  build:
    - Checkout code
    - Setup Xcode
    - Build macOS app
    - Run unit tests
    - Run UI tests
```

```
name: Code Quality Check
on: [push, pull_request]
jobs:
  quality:
    - Check trailing whitespace
    - Validate filenames
    - Verify copyright headers
    - Run SwiftFormat
```

**User Interface**

Main Window:

- Split view with sidebar and main content
- Model selector in sidebar
- Audio control and transcription in main area
Model Management View:

- Model search and filtering
- Download progress visualization
- Model size and status display
- Grouped sections (Downloaded/Available)
Settings View:

- Grouped settings (Format/Quality/Performance)
- Slider controls with info buttons
- Real-time preview of settings
Audio Control View:

- Waveform visualization
- Playback controls
- Volume and speed adjustment
- File information display
Transcription View:

- Scrollable segment list
- Timestamped text display
- Text selection enabled
- Auto-scroll to latest segment
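The main-window arrangement maps naturally onto SwiftUI's `NavigationSplitView`; a rough sketch of the shape (view names and contents are illustrative, not the actual CaptionMate views):

```swift
import SwiftUI

struct MainWindow: View {
    var body: some View {
        NavigationSplitView {
            // Sidebar: model selector.
            List {
                Text("Model selector goes here")
            }
        } detail: {
            // Main area: audio controls above the transcription list.
            VStack {
                Text("Audio controls")
                Divider()
                Text("Transcription segments")
            }
        }
    }
}
```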
**WhisperKit Integration**

Initialization:

```swift
let config = WhisperKitConfig(
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndNeuralEngine,
        textDecoderCompute: .cpuAndNeuralEngine
    ),
    verbose: true,
    logLevel: .debug
)
let whisperKit = try await WhisperKit(config)
```

Download:
```swift
let folder = try await WhisperKit.download(
    variant: modelName,
    from: repositoryName,
    progressCallback: { progress in
        // Update UI with progress.fractionCompleted
    }
)
```

Load:
```swift
try await whisperKit.prewarmModels()  // Optimize
try await whisperKit.loadModels()     // Load
```

Transcribe:
```swift
let results = try await whisperKit.transcribe(
    audioArray: samples,
    decodeOptions: options,
    callback: decodingCallback
)
```

**Data Flow**

```
User Action (Import File)
↓
AudioProcessor.loadAudioAsFloatArray()
↓
ContentViewModel.transcribeAudioSamples()
↓
WhisperKit.transcribe()
↓
DecodingCallback (Progress Updates)
↓
TranscriptionResult
↓
UI Update (TranscriptionView)
↓
User Action (Export)
↓
ExportService.exportTranscriptionResult()
↓
NSSavePanel (File Selection)
↓
File Written to Disk
```
State Persistence:

```
User Input → ViewModel (@Published) → SwiftUI View
                     ↓
          UserDefaults (@AppStorage)
                     ↓
           Persistent Storage
```
Weighted Download Progress:

```swift
private var totalDownloadProgress: Double {
    let downloadingModels = viewModel.modelManagementState.currentDownloadingModels
    if downloadingModels.isEmpty { return 0 }

    var totalSize: Int64 = 0
    var downloadedSize: Double = 0
    for model in downloadingModels {
        let modelSize = viewModel.modelManagementState.modelSizes[model] ?? 0
        let progress = Double(viewModel.modelManagementState.downloadProgress[model] ?? 0)
        totalSize += modelSize
        downloadedSize += Double(modelSize) * progress
    }
    return totalSize > 0 ? downloadedSize / Double(totalSize) : 0
}
```

Loading Progress Animation:

```swift
func updateProgressBar(startProgress: Float, targetProgress: Float, maxTime: TimeInterval) async {
    let progressRange = targetProgress - startProgress
    let decayConstant = Float(-log(1 - 0.95)) / Float(maxTime)
    let startTime = Date()

    while !Task.isCancelled {
        let elapsedTime = Date().timeIntervalSince(startTime)
        let decayFactor = exp(-decayConstant * Float(elapsedTime))
        let progressIncrement = progressRange * (1 - decayFactor)
        let currentProgress = startProgress + progressIncrement

        await MainActor.run {
            modelManagementState.loadingProgressValue = min(currentProgress, targetProgress)
        }

        if currentProgress >= targetProgress { break }
        try? await Task.sleep(nanoseconds: 200_000_000) // 0.2s
    }
}
```

**Supported Languages**

UI Languages:

- English (en)
- Korean (ko)
Transcription languages: all languages supported by the Whisper model:
- English, Spanish, French, German, Italian, Portuguese
- Chinese, Japanese, Korean
- Russian, Arabic, Hindi
- And 90+ more languages
**Supported Audio Formats**

- MP3
- M4A (AAC)
- WAV
- FLAC
- AIFF
- CAF
- Any format supported by AVFoundation
**Export Formats**

- SRT (SubRip)
- WebVTT (Web Video Text Tracks)
- JSON (JavaScript Object Notation)
- FCPXML (Final Cut Pro XML 1.11)
Custom UTTypes:

```swift
extension UTType {
    static var srt: UTType {
        UTType(exportedAs: "com.CaptionMate.srtsubtitle")
    }

    static var vtt: UTType {
        UTType(exportedAs: "com.CaptionMate.vttsubtitle")
    }

    static var fcpxml: UTType {
        UTType(importedAs: "com.apple.finalcutpro.fcpxml")
    }
}
```

**Build Configurations**

Debug:

- Optimization Level: None (-Onone)
- Debug Symbols: Yes
- Assertions: Enabled
- Dead Code Stripping: Yes
Release:

- Optimization Level: Aggressive (-O)
- Debug Symbols: No
- Assertions: Disabled
- Dead Code Stripping: Yes
- Whole Module Optimization: Yes
Deployment Target:

- macOS: 15.0 (Sequoia)
- Architecture: Universal (Apple Silicon + Intel)
**Key Patterns**

State publishing:

```swift
@Published var audioState: AudioState
// Automatically triggers UI updates when changed
```

ViewModel injection:

```swift
struct ContentView: View {
    @ObservedObject var viewModel: ContentViewModel
    // ViewModel injected from parent
}
```

Persistent settings:

```swift
@AppStorage("selectedModel") var selectedModel: String
// Persistent storage with UserDefaults
```

Async work:

```swift
func transcribeFile(path: String) {
    Task {
        try await transcribeCurrentFile(path: path)
    }
}
```

**UX Principles**

Progressive Disclosure:

- Basic controls visible by default
- Advanced settings in separate view
- Info buttons for detailed explanations
Visual Feedback:

- Progress bars for long operations
- Loading indicators for async tasks
- Visual state changes (colors, animations)
Drag and Drop:

```swift
.onDrop(of: [.fileURL], isTargeted: $isTargeted) { providers in
    handleDroppedFiles(providers: providers)
    return true
}
```

**API Reference: ContentViewModel**

Model Management:

```swift
func loadModel(_ model: String, redownload: Bool = false)
func releaseModel() async
func downloadModel(_ model: String)
func cancelDownload(_ model: String)
func deleteModel(_ model: String)
func fetchModels()
```

Transcription:

```swift
func transcribeFile(path: String)
func transcribeCurrentFile(path: String) async throws
func transcribeAudioSamples(_ samples: [Float]) async throws -> TranscriptionResult?
```

Audio Control:

```swift
func playImportedAudio()
func pauseImportedAudio()
func stopImportedAudio()
func seekToPosition(_ position: Double)
func seekToPositionInLine(lineIndex: Int, secondsPerLine: Double, ratio: Double, totalDuration: Double)
func setVolume(_ volume: Double)
func toggleMute()
func changePlaybackRate(faster: Bool)
func currentPlaybackRateText() -> String
func skipForward()
func skipBackward()
```

File Handling:

```swift
func selectFile()
func handleFilePicker(result: Result<[URL], Error>)
func handleDroppedFiles(providers: [NSItemProvider])
func deleteImportedAudio()
```

Export:

```swift
func exportTranscription() async
```

Settings:

```swift
func changeAppLanguage(to language: String)
func getCurrentLanguageDisplayName() -> String
var appTheme: AppTheme { get set }
```

**Property Wrappers**

`@Published` automatically triggers UI updates when changed:

```swift
@Published var transcriptionState = TranscriptionState()
// Any change to transcriptionState updates all observing views
```

`@AppStorage` persists values across app launches:

```swift
@AppStorage("enableTimestamps") var enableTimestamps: Bool = true
// Saved to UserDefaults, survives app restart
```

Environment values propagate through the view hierarchy:

```swift
.environment(\.locale, Locale(identifier: appLanguage))
// All child views receive updated locale
```

**Time Formatting**

SRT timestamp:

```swift
// Input: 3661.5 seconds
// Output: "01:01:01,500"
func toSRTTimestamp() -> String {
    let hours = Int(self) / 3600
    let minutes = (Int(self) % 3600) / 60
    let seconds = Int(self) % 60
    let milliseconds = Int((self - Double(Int(self))) * 1000)
    return String(format: "%02d:%02d:%02d,%03d", hours, minutes, seconds, milliseconds)
}
```

Time range:

```swift
// Input: start 3661.5 s, end 3666.5 s
// Output: "01:01:01.500 - 01:01:06.500"
func formatTimeRange(to endTime: TimeInterval) -> String {
    let startFormatted = self.toSRTTimestamp().replacingOccurrences(of: ",", with: ".")
    let endFormatted = endTime.toSRTTimestamp().replacingOccurrences(of: ",", with: ".")
    return "\(startFormatted) - \(endFormatted)"
}
```

**Export Implementation**

SRT writer:

```swift
func writeSRT(segments: [TranscriptionSegment], to url: URL) throws {
    var srtContent = ""
    for (index, segment) in segments.enumerated() {
        srtContent += "\(index + 1)\n"
        srtContent += "\(segment.start.toSRTTimestamp()) --> \(segment.end.toSRTTimestamp())\n"
        srtContent += "\(segment.text)\n\n"
    }
    try srtContent.write(to: url, atomically: true, encoding: .utf8)
}
```

FCPXML writer:

```swift
func writeFCPXML(segments: [TranscriptionSegment], frameRate: Double, to url: URL) throws {
    let frameDuration = "100/\(Int(frameRate * 100))s"
    // Generate XML structure with proper timing
    // Convert seconds to frame-based timing
    // Write to file
}
```

**Debugging**

WhisperKit Verbose Mode:
```swift
let config = WhisperKitConfig(
    verbose: true,
    logLevel: .debug
)
```

Console Output:
- Model loading progress
- Transcription metrics
- Error details
- Performance statistics
Available Metrics:
```swift
struct TranscriptionState {
    var tokensPerSecond: TimeInterval
    var effectiveRealTimeFactor: TimeInterval
    var effectiveSpeedFactor: TimeInterval
    var totalInferenceTime: TimeInterval
    var currentFallbacks: Int
    var currentEncodingLoops: Int
    var currentDecodingLoops: Int
}
```

**Release Process**

- Clean build folder
- Update version and build number
- Archive for distribution
- Automatic code signing
- Notarization submission
- App Store Connect upload
Format: MAJOR.MINOR.PATCH
- MAJOR: Breaking changes
- MINOR: New features (backward compatible)
- PATCH: Bug fixes
**Bug Reports**

Include the following details:

- macOS Version: [e.g., 15.0]
- App Version: [e.g., 1.0.0]
- Model Used: [e.g., openai_whisper-small]
- Audio Format: [e.g., MP3, 44.1kHz]
- File Duration: [e.g., 30 minutes]
- Error Message: [exact error text]
Email: parfume407@gmail.com
Response Time: 3-5 business days
© 2025 Harrison Cho. All rights reserved.