
**🔊 Voice Cloning Similarity (VCS)**

This code measures how closely a generated voice matches a reference voice. It converts both audio files into speaker embeddings (numerical voice representations) using the **Resemblyzer** model, then computes the cosine similarity between the two embeddings. The result is reported as a percentage, where higher values mean the generated speech is closer to the reference speaker's voice.
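
To make the cosine-similarity step concrete, here is a minimal sketch on toy vectors. The three-dimensional "embeddings" below are made up for illustration and are not real Resemblyzer outputs, and the helper `cosine_similarity` is a hypothetical name; it uses only the standard library so it runs without any audio files.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented purely to illustrate the arithmetic
reference = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]   # same direction, different length
orthogonal = [0.0, 1.0, 0.0]       # nothing in common with the reference
partly_aligned = [3.0, 4.0, 0.0]   # partially aligned with the reference

sim_same = cosine_similarity(reference, same_direction)
sim_orth = cosine_similarity(reference, orthogonal)
sim_part = cosine_similarity(reference, partly_aligned)

for name, sim in [("same", sim_same), ("orthogonal", sim_orth), ("partial", sim_part)]:
    print(f"{name}: {round(sim * 100, 1)} %")
```

Only the direction of the vectors matters, not their length, which is why `same_direction` scores the same as an identical vector would.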


In [None]:
# Install required packages (the %pip magic runs pip inside the notebook kernel):
%pip install resemblyzer numpy soundfile
from resemblyzer import VoiceEncoder, preprocess_wav
import numpy as np

encoder = VoiceEncoder()
# Load reference and generated audio
ref_wav = preprocess_wav("reference_vcs.wav")
gen_wav = preprocess_wav("gpt_generated_vcs.wav")

# Compute embeddings
ref_embed = encoder.embed_utterance(ref_wav)
gen_embed = encoder.embed_utterance(gen_wav)

# Cosine similarity
similarity = np.dot(ref_embed, gen_embed) / (np.linalg.norm(ref_embed) * np.linalg.norm(gen_embed))

# Convert to a percentage and round to 1 decimal place
# (float() turns the NumPy scalar into a plain Python float)
similarity_percent = round(float(similarity) * 100, 1)
print("Similarity (%):", similarity_percent)

**🕒 Real-Time Factor (RTF)**

The following code is a general template for measuring the Real-Time Factor (RTF) of any Text-to-Speech (TTS) system. It synthesizes a sample text with the chosen model, records the time taken for synthesis, and divides it by the duration of the generated audio. The resulting RTF shows how fast the model generates speech relative to real time: values below 1 mean synthesis is faster than playback. An optional scoring function then converts the RTF value into a 0–5 scale for easier comparison between models.
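
As a quick illustration of the ratio itself, the sketch below computes RTF from two pairs of timings. The numbers are hypothetical, chosen only to show the arithmetic, and `real_time_factor` is an illustrative helper name rather than part of any TTS library.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timings, not measurements of any real model
fast = real_time_factor(2.0, 8.0)    # 2 s of compute for 8 s of audio
slow = real_time_factor(12.0, 8.0)   # 12 s of compute for 8 s of audio

print(f"fast model RTF: {fast}")
print(f"slow model RTF: {slow}")
```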

In [None]:
import time
import soundfile as sf
from IPython.display import Audio, display

text = (
    "Субҳ барвақт бедор шудам. "
    "Ба ошхона рафтам ва чой нӯшидем. "
    "Субҳона омода кардам ва ба хона хӯрок хӯрдем. "
    "Баъд ба бозор рафтам."
)

# -----------------------------
# ✅ Placeholder TTS function
# Replace this with the API call / model inference
# -----------------------------
def synthesize_text(text, output_file="output.wav"):
    """
    General TTS synthesis function.
    Replace this part with:
      - OpenAI API call
      - Hugging Face pipeline
      - Local TTS model inference
    """
    # Example (OpenAI):
    # response = openai.audio.speech.create(
    #     model="gpt-4o-mini-tts",
    #     voice="alloy",
    #     input=text,
    #     response_format="wav",  # the API returns MP3 by default; request WAV to match output_file
    # )
    # audio_bytes = response.read()
    # with open(output_file, "wb") as f:
    #     f.write(audio_bytes)

    # Placeholder: nothing is synthesized here, so output_file must already exist
    return output_file

def calculate_rtf(text, output_file="output.wav"):
    start_time = time.perf_counter()  # monotonic, high-resolution timer
    synthesize_text(text, output_file=output_file)  # Run TTS
    end_time = time.perf_counter()

    synthesis_time = end_time - start_time

    # Load audio and get duration
    audio, sr = sf.read(output_file)
    audio_duration = len(audio) / sr

    # Real-Time Factor
    rtf = synthesis_time / audio_duration
    return synthesis_time, audio_duration, rtf

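def rtf_to_score(rtf):
    """Map RTF linearly onto a 0–5 score: RTF 0 gives 5.0, RTF 1.5 or more gives 0.0."""
    if rtf >= 2.0:
        return 0.0
    # The linear mapping already reaches 0 at RTF = 1.5; the clamp below
    # keeps the score in [0, 5] for values between 1.5 and 2.0.
    score = 5.0 * (1.5 - rtf) / 1.5
    return round(max(0.0, min(score, 5.0)), 2)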

synthesis_time, audio_duration, rtf = calculate_rtf(text)
rtf_score = rtf_to_score(rtf)

print(f"🕒 Synthesis Time: {round(synthesis_time, 3)} sec")
print(f"🎧 Audio Duration: {round(audio_duration, 3)} sec")
print(f"📏 RTF: {round(rtf, 3)}")
print(f"✅ RTF Score (0–5): {rtf_score}")

# Playback audio
display(Audio("output.wav"))
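
Because the placeholder `synthesize_text` does not create any audio, the template only runs if a WAV file is already on disk. One way to try the pipeline before wiring in a real model is to write a short test tone first. The sketch below does this with the standard library only; `write_test_tone` and `wav_duration_seconds` are hypothetical helper names, and the tone parameters (16 kHz, 440 Hz, mono, 16-bit) are arbitrary assumptions.

```python
import math
import os
import struct
import tempfile
import wave


def write_test_tone(path, seconds=1.0, sample_rate=16000, freq_hz=440.0):
    """Write a mono 16-bit PCM sine tone to `path` and return the path."""
    n_frames = int(seconds * sample_rate)
    frames = b"".join(
        struct.pack(
            "<h",
            int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)),
        )
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(frames)
    return path


def wav_duration_seconds(path):
    """Duration of a WAV file = number of frames / sample rate."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


# Demo: write a 2-second tone to a temporary location and check its duration
demo_path = os.path.join(tempfile.gettempdir(), "rtf_demo_tone.wav")
write_test_tone(demo_path, seconds=2.0)
demo_duration = wav_duration_seconds(demo_path)
print(f"Wrote {demo_path} ({demo_duration:.2f} s)")
```

Calling `write_test_tone("output.wav")` before `calculate_rtf(text)` lets the rest of the cell run end to end. The RTF measured that way is meaningless (no synthesis happens), so it only checks the plumbing.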
