<a href="https://colab.research.google.com/github/anshupandey/Working_with_Large_Language_models/blob/main/WWL_C23_Text2Voice_ElevenLabs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text-to-speech using Python and ElevenLabs API

Sign up for an account to get an API key: https://elevenlabs.io/app/speech-synthesis


Documentation: https://elevenlabs.io/docs/api-reference/getting-started

In [1]:
# The API key authenticates every request to the ElevenLabs API.
# Never leave a real key in a shared notebook; replace the placeholder with your own key before running.
XI_API_KEY = "<YOUR_ELEVENLABS_API_KEY>"
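
A key pasted into a cell is easy to leak when the notebook is shared. One possible alternative, sketched below, reads the key from an environment variable and falls back to a hidden prompt. The helper name `load_api_key` and the variable name `ELEVENLABS_API_KEY` are illustrative choices, not part of the ElevenLabs API.

```python
import os
from getpass import getpass


def load_api_key(env=None, name="ELEVENLABS_API_KEY", prompt=True):
    """Return the API key from the environment, or ask for it interactively.

    `name` is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if key:
        return key
    if prompt:
        # getpass hides the typed value so it is not echoed into the cell output.
        return getpass(f"Enter {name}: ").strip()
    raise RuntimeError(f"{name} is not set")


# Usage in the notebook (commented out so the sketch has no side effects):
# XI_API_KEY = load_api_key()
```

With this in place, the key never appears in the saved notebook or its outputs.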

## Fetching the voice_id

In [2]:
# The 'requests' and 'json' libraries are imported.
# 'requests' is used to send HTTP requests, while 'json' is used for parsing the JSON data that we receive from the API.
import requests
import json

# This is the URL for the API endpoint we'll be making a GET request to.
url = "https://api.elevenlabs.io/v1/voices"

# Here, headers for the HTTP request are being set up.
# Headers provide metadata about the request. In this case, we're specifying the content type and including our API key for authentication.
headers = {
  "Accept": "application/json",
  "xi-api-key": XI_API_KEY,
  "Content-Type": "application/json"
}

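# A GET request is sent to the API endpoint. The URL and the headers are passed into the request.
response = requests.get(url, headers=headers)
# Stop early with a clear HTTP error (for example, an invalid API key) instead of a KeyError further down.
response.raise_for_status()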

# The JSON response from the API is parsed using the built-in .json() method from the 'requests' library.
# This transforms the JSON data into a Python dictionary for further processing.
data = response.json()

# A loop is created to iterate over each 'voice' in the 'voices' list from the parsed data.
# The 'voices' list consists of dictionaries, each representing a unique voice provided by the API.
for voice in data['voices']:
  # For each 'voice', the 'name' and 'voice_id' are printed out.
  # These keys in the voice dictionary contain values that provide information about the specific voice.
  print(f"{voice['name']}; {voice['voice_id']}")


Rachel; 21m00Tcm4TlvDq8ikWAM
Drew; 29vD33N1CtxCmqQRPOHJ
Clyde; 2EiwWnXFnvU5JabPnv8n
Paul; 5Q0t7uMcjvnagumLfvZi
Domi; AZnzlk1XvdvUeBnXmlld
Dave; CYw3kZ02Hs0563khs1Fj
Fin; D38z5RcWu1voky8WS1ja
Sarah; EXAVITQu4vr4xnSDxMaL
Antoni; ErXwobaYiN019PkySvjV
Thomas; GBv7mTt0atIp3Br8iCZE
Charlie; IKne3meq5aSn9XLyUdCD
George; JBFqnCBsd6RMkjVDRZzb
Emily; LcfcDJNUP1GQjkzn1xUU
Elli; MF3mGyEYCl7XYWbV9V6O
Callum; N2lVS1w4EtoT3dr4eOWO
Patrick; ODq5zmih8GrVes37Dizd
Harry; SOYHLrjzK2X1ezoPC6cr
Liam; TX3LPaxmHKxFdv7VOQHJ
Dorothy; ThT5KcBeYPX3keUQqHPh
Josh; TxGEqnHWrfWFTfGW9XjX
Arnold; VR6AewLTigWG4xSOukaG
Charlotte; XB0fDUnXU5powFXDhCwa
Alice; Xb7hH8MSUJpSbSDYk0k2
Matilda; XrExE9yKIg1WjnnlVkGX
James; ZQe5CZNOzWyzPSCn5a3c
Joseph; Zlb1dXrM653N07WRdFW3
Jeremy; bVMeCyTHy58xNoL34h3p
Michael; flq6f7yk4E4fJM5XTYuZ
Ethan; g5CIjZEefAph4nQFvHAz
Chris; iP95p4xoKVk53GoZ742B
Gigi; jBpfuIE2acCO8z3wKNLl
Freya; jsCqWAovK2LkecY7zXl4
Brian; nPczCjzI2devNBz1zQrb
Grace; oWAxZDx7w5VEj9dCyTzz
Daniel; onwK4e9ZLuTAKqWW03F9
Lily; p
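
Copying a `voice_id` by hand from the printed list is error-prone. As a minimal sketch, the lookup below finds an ID by voice name in a dictionary shaped like the parsed `/v1/voices` response. The function name `voice_id_by_name` is hypothetical, and the sample data reuses two entries from the output above.

```python
def voice_id_by_name(voices_response, name):
    """Return the voice_id whose name matches (case-insensitive), or None."""
    for voice in voices_response.get("voices", []):
        if voice.get("name", "").lower() == name.lower():
            return voice.get("voice_id")
    return None


# Sample shaped like the parsed response, with two voices from the list above.
sample = {
    "voices": [
        {"name": "Rachel", "voice_id": "21m00Tcm4TlvDq8ikWAM"},
        {"name": "Elli", "voice_id": "MF3mGyEYCl7XYWbV9V6O"},
    ]
}

print(voice_id_by_name(sample, "elli"))  # MF3mGyEYCl7XYWbV9V6O
```

In the notebook, `voice_id_by_name(data, "Elli")` would be called on the `data` dictionary returned by `response.json()`.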

## Text-to-speech

In [3]:
# Import necessary libraries
import requests  # Used for making HTTP requests
import json  # Used for working with JSON data

# Define constants for the script
CHUNK_SIZE = 1024  # Size of chunks to read/write at a time
VOICE_ID = "zrHiDhphv9ZnVXBqCLjz"  # ID of the voice model to use
TEXT_TO_SPEAK = "Hello everyone, I hope you are enjoying learning Generative AI and Large Language Models."  # Text you want to convert to speech
OUTPUT_PATH = "output.mp3"  # Path to save the output audio file

# Construct the URL for the Text-to-Speech API request
tts_url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream"

# Set up headers for the API request, including the API key for authentication
headers = {
    "Accept": "application/json",
    "xi-api-key": XI_API_KEY
}

# Set up the data payload for the API request, including the text and voice settings
data = {
    "text": TEXT_TO_SPEAK,
    "model_id": "eleven_multilingual_v2",
    "voice_settings": {
        "stability": 0.5,
        "similarity_boost": 0.8,
        "style": 0.0,
        "use_speaker_boost": True
    }
}

# Make the POST request to the TTS API with headers and data, enabling streaming response
response = requests.post(tts_url, headers=headers, json=data, stream=True)

# Check if the request was successful
if response.ok:
    # Open the output file in write-binary mode
    with open(OUTPUT_PATH, "wb") as f:
        # Read the response in chunks and write to the file
        for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
            f.write(chunk)
    # Inform the user of success
    print("Audio stream saved successfully.")
else:
    # Print the error message if the request was not successful
    print(response.text)


Audio stream saved successfully.
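
The request body above combines the text, a model ID, and a nested `voice_settings` dictionary. The sketch below wraps that structure in a small builder that rejects obviously bad values before any request is sent. The function name `build_tts_payload` is hypothetical, and the 0 to 1 range for the numeric settings is an assumption that should be checked against the ElevenLabs documentation.

```python
def build_tts_payload(text, model_id="eleven_multilingual_v2",
                      stability=0.5, similarity_boost=0.8,
                      style=0.0, use_speaker_boost=True):
    """Build the JSON body for a text-to-speech request."""
    if not text.strip():
        raise ValueError("text must not be empty")
    # Assumption: each numeric setting is expected to lie in [0, 1].
    numeric = {"stability": stability,
               "similarity_boost": similarity_boost,
               "style": style}
    for name, value in numeric.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return {
        "text": text,
        "model_id": model_id,
        "voice_settings": {**numeric, "use_speaker_boost": use_speaker_boost},
    }


payload = build_tts_payload("Hello everyone")
print(payload["voice_settings"]["stability"])  # 0.5
```

The returned dictionary can be passed as `json=payload` to `requests.post`, exactly like the hand-written `data` dictionary in the cell above.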


In [4]:
from IPython.display import Audio
Audio(OUTPUT_PATH)
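
The file played above was written chunk by chunk from the streamed response. As an illustration of that pattern in isolation, the sketch below writes any iterable of byte chunks to disk and reports how many bytes were saved. The helper name `save_audio_stream` is hypothetical, and the fake chunk list stands in for `response.iter_content(chunk_size=CHUNK_SIZE)`.

```python
import os
import tempfile


def save_audio_stream(chunks, path):
    """Write an iterable of byte chunks to `path`; return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in for response.iter_content(chunk_size=CHUNK_SIZE).
fake_chunks = [b"ID3", b"", b"\x00\x01\x02"]
demo_path = os.path.join(tempfile.gettempdir(), "tts_demo.mp3")
print(save_audio_stream(fake_chunks, demo_path))  # 6
```

A return value of 0 is a cheap signal that the response body was empty even though the request reported success.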

## Speech to Speech

In [6]:
!pip install ffmpeg-python --quiet

In [7]:
# @title Audio record utility function
# @markdown Run this cell before proceeding to import libraries and define utility functions. \
# @markdown This cell will define a function to record audio in Google Colab.

"""
To write this piece of code I took inspiration/code from a lot of places.
It was late night, so I'm not sure how much I created or just copied o.O
Here are some of the possible references:
https://blog.addpipe.com/recording-audio-in-the-browser-using-pure-html5-and-minimal-javascript/
https://stackoverflow.com/a/18650249
https://hacks.mozilla.org/2014/06/easy-audio-capture-with-the-mediarecorder-api/
https://air.ghost.io/recording-to-an-audio-file-using-html5-and-js/
https://stackoverflow.com/a/49019356
"""
from IPython.display import HTML, Audio, display
from google.colab.output import eval_js
from base64 import b64decode
import numpy as np
from scipy.io.wavfile import read as wav_read
import io
import ffmpeg

AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");

my_btn.appendChild(t);
//my_p.appendChild(my_btn);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);

var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;

var handleSuccess = function(stream) {
  gumStream = stream;
  var options = {
    //bitsPerSecond: 8000, //chrome seems to ignore, always 48k
    mimeType : 'audio/webm;codecs=opus'
    //mimeType : 'audio/webm;codecs=pcm'
  };
  //recorder = new MediaRecorder(stream, options);
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = function(e) {
    var url = URL.createObjectURL(e.data);
    var preview = document.createElement('audio');
    preview.controls = true;
    preview.src = url;
    document.body.appendChild(preview);

    reader = new FileReader();
    reader.readAsDataURL(e.data);
    reader.onloadend = function() {
      base64data = reader.result;
      //console.log("Inside FileReader:" + base64data);
    }
  };
  recorder.start();
  };

recordButton.innerText = "Recording... press to stop";

navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);


function toggleRecording() {
  if (recorder && recorder.state == "recording") {
      recorder.stop();
      gumStream.getAudioTracks()[0].stop();
      recordButton.innerText = "Saving the recording... pls wait!"
  }
}

// https://stackoverflow.com/a/951057
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

var data = new Promise(resolve=>{
//recordButton.addEventListener("click", toggleRecording);
recordButton.onclick = ()=>{
toggleRecording()

sleep(2000).then(() => {
  // wait 2000ms for the data to be available...
  // ideally this should use something like await...
  //console.log("Inside data:" + base64data)
  resolve(base64data.toString())

});

}
});

</script>
"""

def get_audio():
  display(HTML(AUDIO_HTML))
  data = eval_js("data")
  binary = b64decode(data.split(',')[1])

  process = (ffmpeg
    .input('pipe:0')
    .output('pipe:1', format='wav')
    .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True, quiet=True, overwrite_output=True)
  )
  output, err = process.communicate(input=binary)

  riff_chunk_size = len(output) - 8
  # Break up the chunk size into four bytes, held in b.
  q = riff_chunk_size
  b = []
  for i in range(4):
      q, r = divmod(q, 256)
      b.append(r)

  # Replace bytes 4:8 of the ffmpeg output with the actual size of the RIFF chunk.
  riff = output[:4] + bytes(b) + output[8:]

  sr, audio = wav_read(io.BytesIO(riff))

  return audio, sr
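
The `divmod` loop in `get_audio` encodes the RIFF chunk size as four little-endian bytes and writes it into bytes 4 to 8 of the WAV header, presumably because ffmpeg cannot seek back to fill in the size when it writes to a pipe. A more compact sketch of the same patch uses `int.to_bytes`. The function name `patch_riff_size` is hypothetical, and the fake header below is only long enough to demonstrate the arithmetic.

```python
def patch_riff_size(wav_bytes):
    """Return wav_bytes with bytes 4..8 set to the true RIFF chunk size."""
    if wav_bytes[:4] != b"RIFF":
        raise ValueError("not a RIFF/WAV byte string")
    # The RIFF size counts everything after the 8-byte 'RIFF' + size prefix.
    size = len(wav_bytes) - 8
    return wav_bytes[:4] + size.to_bytes(4, "little") + wav_bytes[8:]


# Tiny fake file: 'RIFF', a placeholder size, 'WAVE', and 4 payload bytes.
fake = b"RIFF" + b"\xff\xff\xff\xff" + b"WAVE" + b"\x00" * 4
patched = patch_riff_size(fake)
print(int.from_bytes(patched[4:8], "little"))  # 8
```

Here the file is 16 bytes long, so the patched size field holds 16 - 8 = 8.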

In [8]:
audio, sr = get_audio()
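
The browser hands the recording back as a data URL, and `get_audio` keeps only the part after the comma before Base64-decoding it. If the button is clicked before the `FileReader` has finished, the JavaScript resolves with the initial value `0`, and `data.split(',')[1]` then fails with an `IndexError`. The sketch below shows a stricter decoder that raises a clearer error in that case. The helper name `decode_data_url` is hypothetical.

```python
from base64 import b64decode, b64encode


def decode_data_url(data_url):
    """Decode the Base64 payload of a 'data:<mime>;base64,<payload>' string."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("no recording data received; try recording again")
    return b64decode(payload)


# Build a sample data URL the way FileReader.readAsDataURL would format it.
sample_url = ("data:audio/webm;codecs=opus;base64,"
              + b64encode(b"webm-bytes").decode("ascii"))
print(decode_data_url(sample_url))  # b'webm-bytes'
```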

In [10]:
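# Import the writer explicitly; a bare `import scipy` does not guarantee that scipy.io.wavfile is loaded.
from scipy.io.wavfile import write as wav_write
human_path = 'recording.wav'
wav_write(human_path, sr, audio)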

### Speech to Speech conversion

In [11]:
# Import necessary libraries
import requests  # Used for making HTTP requests
import json  # Used for working with JSON data

# Define constants for the script
CHUNK_SIZE = 1024  # Size of chunks to read/write at a time
VOICE_ID = "MF3mGyEYCl7XYWbV9V6O"  # ID of the voice model to use
AUDIO_FILE_PATH = human_path  # Path to the input audio file
OUTPUT_PATH = "output.mp3"  # Path to save the output audio file

# Construct the URL for the Speech-to-Speech API request
sts_url = f"https://api.elevenlabs.io/v1/speech-to-speech/{VOICE_ID}/stream"

# Set up headers for the API request, including the API key for authentication
headers = {
    "Accept": "application/json",
    "xi-api-key": XI_API_KEY
}

# Set up the data payload for the API request, including model ID and voice settings
# Note: voice settings are converted to a JSON string
data = {
    "model_id": "eleven_english_sts_v2",
    "voice_settings": json.dumps({
        "stability": 0.5,
        "similarity_boost": 0.8,
        "style": 0.0,
        "use_speaker_boost": True
    })
}

# Open the input audio file and attach it to the request; the with-block closes the file once the upload is done.
# Make the POST request to the STS API with headers, data, and files, enabling streaming response
with open(AUDIO_FILE_PATH, "rb") as audio_file:
    files = {"audio": audio_file}
    response = requests.post(sts_url, headers=headers, data=data, files=files, stream=True)

# Check if the request was successful
if response.ok:
    # Open the output file in write-binary mode
    with open(OUTPUT_PATH, "wb") as f:
        # Read the response in chunks and write to the file
        for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
            f.write(chunk)
    # Inform the user of success
    print("Audio stream saved successfully.")
else:
    # Print the error message if the request was not successful
    print(response.text)


Audio stream saved successfully.
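
Unlike the text-to-speech call, this request passes `data=` and `files=` rather than `json=`, so it is sent as form fields. Form field values are plain strings, which is presumably why the nested `voice_settings` dictionary is serialized with `json.dumps` first. The short sketch below shows what that field looks like and that it round-trips back to the original dictionary.

```python
import json

voice_settings = {
    "stability": 0.5,
    "similarity_boost": 0.8,
    "style": 0.0,
    "use_speaker_boost": True,
}

# Each form field is a flat string, so the nested dict is serialized to JSON text.
form_fields = {
    "model_id": "eleven_english_sts_v2",
    "voice_settings": json.dumps(voice_settings),
}

print(form_fields["voice_settings"])
# {"stability": 0.5, "similarity_boost": 0.8, "style": 0.0, "use_speaker_boost": true}

# Parsing the string recovers the original settings.
round_tripped = json.loads(form_fields["voice_settings"])
```

Note that Python's `True` becomes the JSON literal `true` in the serialized string.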


In [12]:
from IPython.display import Audio
Audio(OUTPUT_PATH)

## Thank You