# Bark TTS - High Quality Text-to-Speech

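This notebook demonstrates Bark TTS, loaded as the `suno/bark-small` checkpoint through the Hugging Face `transformers` pipeline. Bark produces very natural, human-like speech but is slow: in the run recorded below, generating about 14 seconds of audio took about 93 seconds on CPU.

**Features:**
- Excellent quality and naturalness
- Runs on CPU here (MPS has dtype compatibility issues)
- Slow generation: ~93 seconds for ~14 seconds of audio in the run below
- Large model size (~1B+ parameters for full Bark; this notebook loads the smaller `suno/bark-small` checkpoint)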

In [2]:
# Set up the environment for Bark TTS (the model itself is loaded on CPU in the next cell)
from transformers import pipeline
import torch
import os

# Let operations that MPS does not support fall back to the CPU
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

print(f"MPS available: {torch.backends.mps.is_available()}")
print("Loading Bark TTS...")

MPS available: True
Loading Bark TTS...


In [3]:
# Initialize Bark pipeline on CPU (due to MPS compatibility issues)
pipe = pipeline("text-to-speech", model="suno/bark-small", device="cpu")
print("Bark TTS loaded successfully on CPU!")
print("Note: Running on CPU due to MPS dtype compatibility issues")


Bark TTS loaded successfully on CPU!
Note: Running on CPU due to MPS dtype compatibility issues


In [4]:
# Generate speech with Bark TTS
import time

text = "Hello, this is Bark TTS. I produce the most natural and human-like speech, though I take a bit longer to generate."

start_time = time.time()
speech = pipe(text)
end_time = time.time()

print(f"Audio generated in {end_time - start_time:.2f} seconds")
print(f"Sample rate: {speech['sampling_rate']}")
print(f"Audio shape: {speech['audio'].shape}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Audio generated in 93.27 seconds
Sample rate: 24000
Audio shape: (1, 333120)
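
The numbers above can be turned into a real-time factor (generation time divided by audio duration), which makes the speed cost easier to compare across models. This is a minimal sketch; `real_time_factor` is a hypothetical helper, not part of `transformers`, and the inputs are copied from the output of this run.

```python
def real_time_factor(generation_seconds: float, num_samples: int, sampling_rate: int) -> float:
    """Return generation time divided by audio duration.

    Values above 1 mean generation is slower than real time.
    """
    if num_samples <= 0 or sampling_rate <= 0:
        raise ValueError("num_samples and sampling_rate must be positive")
    audio_seconds = num_samples / sampling_rate
    return generation_seconds / audio_seconds


# Values copied from the cell output above (hypothetical variable names)
audio_seconds = 333120 / 24000
rtf = real_time_factor(93.27, 333120, 24000)

print(f"Audio duration: {audio_seconds:.2f} s")
print(f"Real-time factor: {rtf:.2f}")
```

For this run the clip is 13.88 seconds long, so generation on CPU was roughly 6.7 times slower than real time.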


In [5]:
# Save Bark audio as WAV, then convert it to MP3
import soundfile as sf
from pydub import AudioSegment  # MP3 export requires ffmpeg to be installed
import numpy as np
import os

# Create output directory
os.makedirs("output", exist_ok=True)

# Output paths: the WAV is written first, then converted to MP3
wav_file = "output/bark_output.wav"
mp3_file = "output/bark_output.mp3"

# Peak-normalize to [-1, 1] to prevent clipping; leave silent audio unchanged
audio = speech['audio'].squeeze()
peak = np.max(np.abs(audio))
audio_normalized = audio / peak if peak > 0 else audio

# Save as WAV
sf.write(wav_file, audio_normalized, speech['sampling_rate'])

# Convert to MP3
audio_segment = AudioSegment.from_wav(wav_file)
audio_segment.export(mp3_file, format="mp3")

print(f"Bark audio saved as {mp3_file}")
print(f"Duration: {len(audio_segment)/1000:.2f} seconds")
print(f"Generation time: {end_time - start_time:.2f} seconds")
print("\n🎭 Bark TTS: Best quality, human-like speech!")

Bark audio saved as output/bark_output.mp3
Duration: 13.88 seconds
Generation time: 93.27 seconds

🎭 Bark TTS: Best quality, human-like speech!


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
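
The `tokenizers` warning above can be silenced by setting `TOKENIZERS_PARALLELISM` explicitly, as the message itself suggests. The sketch below assumes the fork comes from the MP3 export (pydub runs ffmpeg in a subprocess) and that the assignment is placed in the first cell, before `transformers` is imported.

```python
import os

# Assumption: this runs before `transformers` / `tokenizers` is imported,
# so the setting is in place before any tokenizer parallelism is used.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(f"TOKENIZERS_PARALLELISM={os.environ['TOKENIZERS_PARALLELISM']}")
```

Setting the value to `false` disables tokenizer parallelism, which is the same thing the library falls back to after the fork, so generation results should not change.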
