
VoxSDK is a toolkit for integrating AI-driven speech recognition and synthesis into your applications. It provides a set of React hooks and utilities that connect to AI services for voice interactions.


  • VoxProvider: A context provider to encapsulate the SDK's functionalities and make them accessible throughout your React application.
  • useListen: A hook to capture and transcribe user speech in real-time.
  • useSpeak: A hook for text-to-speech functionality, converting text responses into natural-sounding speech.


Install VoxSDK using npm:

npm install vox-sdk

Or using yarn:

yarn add vox-sdk

Install tslib.

Using npm

npm install tslib --save-dev

Using yarn

yarn add tslib -D


Server Setup

  • On your server, you will need to create a GET endpoint at /token.

  • Using the speech_key and region, you will generate an authorization token from Microsoft's APIs.

  • Set these values in the .env file as SPEECH_KEY and SPEECH_REGION.

  • The /token endpoint should return a JSON response containing the token and the region.

  • Here's a sample implementation of the /token endpoint.

import express from "express";
import cors from "cors";
import "dotenv/config";
import axios from "axios";

const app = express();

// Allow requests from the frontend origin only
app.use(
  cors({
    origin: process.env.FRONTEND_URL,
  })
);

let token = null;
const speechKey = process.env.SPEECH_KEY;
const speechRegion = process.env.SPEECH_REGION;

// Fetch a fresh authorization token and cache it in memory
const getToken = async () => {
  try {
    const headers = {
      headers: {
        "Ocp-Apim-Subscription-Key": speechKey,
        "Content-Type": "application/x-www-form-urlencoded",
      },
    };

    // Assumed URL: the standard Azure Speech issueToken endpoint for the region
    const tokenResponse = await axios.post(
      `https://${speechRegion}.api.cognitive.microsoft.com/sts/v1.0/issueToken`,
      null,
      headers
    );

    token = tokenResponse.data;
  } catch (error) {
    console.error("Error while getting token:", error);
  }
};

app.get("/token", async (req, res) => {
  try {
    res.setHeader("Content-Type", "application/json");

    // Set when the client asks for a refreshed token
    const refreshTheToken = req.query?.refresh;

    if (!token || refreshTheToken) {
      await getToken();
    }

    res.send({
      token: token,
      region: speechRegion,
    });
  } catch (error) {
    console.error("Error while handling /token request:", error);
    res.status(500).send({ error: "An error occurred while processing your request." });
  }
});

app.listen(8080, () => console.log("Server running on port 8080"));
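
The sample endpoint keeps the token in memory until a client passes the refresh query parameter. As a possible variation, the sketch below also expires the cached token after a time-to-live. Azure Speech authorization tokens are valid for 10 minutes, so the nine-minute default here is an assumption that leaves a safety margin. The names createTokenCache, fetchToken, ttlMs, and now are hypothetical and not part of VoxSDK.

```javascript
// Hypothetical helper: caches a token and refetches it when it is missing,
// older than ttlMs, or when the caller forces a refresh.
function createTokenCache(fetchToken, ttlMs = 9 * 60 * 1000, now = Date.now) {
  let token = null;
  let fetchedAt = 0;

  return async function getCachedToken(forceRefresh = false) {
    const expired = now() - fetchedAt >= ttlMs;
    if (!token || expired || forceRefresh) {
      token = await fetchToken();
      fetchedAt = now();
    }
    return token;
  };
}
```

Inside the /token handler, such a cache could be called as `await getCachedToken(Boolean(req.query?.refresh))`, with fetchToken wrapping the request to Microsoft's token API.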

Client Setup

  • Wrap your application with VoxProvider to make the SDK available throughout your app:

    import { VoxProvider } from "vox-sdk";

    function App() {
      return <VoxProvider>{/* Your app components go here */}</VoxProvider>;
    }

    export default App;
  • VoxProvider expects a config object that includes:

    1. baseUrl : The URL of your backend. Ensure that the /token route serves the token and region.

    2. onAuthRefresh : A callback function that is invoked when an authentication error occurs or the token expires.

    3. headersForBaseUrl : Optional configuration for requests to baseUrl, such as a Bearer authentication token.

  • Here's an implementation of the above two steps.

    <VoxProvider
      config={{
        baseUrl: "", // URL of your backend
        onAuthRefresh: async () => {
          const { data } = await axios.get(""); // URL of your /token route
          return { token: data.token, region: data.region };
        },
        headersForBaseUrl: {
          //... Bearer Authentication token or other config
        },
      }}
    >
      <App />
    </VoxProvider>
  • The onAuthRefresh callback will refresh the token and return it with the region.

  • For more details, see the sample app implementation.
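
The onAuthRefresh callback can also be written without axios. The sketch below is one hedged way to build it on the standard fetch API (available in browsers and Node 18+). The factory name makeAuthRefresh and the fetchImpl parameter are hypothetical; the refresh query parameter matches the sample server's /token handler.

```javascript
// Hypothetical factory for an onAuthRefresh callback. fetchImpl is injectable
// so the function can be exercised without a running server.
function makeAuthRefresh(baseUrl, fetchImpl = fetch) {
  return async function onAuthRefresh() {
    // refresh=true asks the sample server to fetch a new token
    const response = await fetchImpl(`${baseUrl}/token?refresh=true`);
    if (!response.ok) {
      throw new Error(`Token refresh failed with status ${response.status}`);
    }
    const data = await response.json();
    if (!data.token || !data.region) {
      throw new Error("Token response must include token and region");
    }
    return { token: data.token, region: data.region };
  };
}
```

The returned function has the same shape as the callback in the config example: it resolves to an object with token and region, and it rejects when the backend does not return both.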


Using useListen Hook

After setting up the server and VoxProvider, you are ready to use useListen and useSpeak.

Integrate speech-to-text functionality in your components:

import React from "react";
import { useListen } from "vox-sdk";

const SpeechToText = () => {
  const { answers, loading, startSpeechRecognition, stopSpeechRecognition } = useListen({
    onEndOfSpeech: () => {
      // Runs when the user stops speaking
    },
    automatedEnd: true,
    delay: 1000,
  });

  return (
    <div>
      <button disabled={loading} onClick={startSpeechRecognition}>
        Start Listening
      </button>
      <button onClick={stopSpeechRecognition}>Stop Listening</button>
    </div>
  );
};

export default SpeechToText;

The useListen hook expects the following parameters.

  1. automatedEnd :

    • Expects a boolean value, default is true.
    • When the user finishes speaking, the hook will automatically start the speech-to-text conversion.
    • To listen continuously until the user clicks stopSpeechRecognition, pass false.
  2. delay :

    • Expects a value in milliseconds.
    • This is the debounce duration for listening to the user.
    • The default is set to 2000ms.
  3. onEndOfSpeech :

    • Expects a callback function that is invoked when speech ends.
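
To make the role of delay more concrete, here is a minimal sketch of the general debounce technique: the end-of-speech callback fires only after delay milliseconds pass without new speech. This README does not show how VoxSDK implements it internally, so createEndOfSpeechDebouncer, speechDetected, cancel, setTimer, and clearTimer are hypothetical names used for illustration only.

```javascript
// Hypothetical illustration of debouncing: onEndOfSpeech fires only after
// `delay` ms pass without another speech event. Timer functions are injectable
// for testing and default to the standard setTimeout/clearTimeout.
function createEndOfSpeechDebouncer({
  delay = 2000,
  onEndOfSpeech,
  setTimer = setTimeout,
  clearTimer = clearTimeout,
}) {
  let timerId = null;

  return {
    // Call whenever new speech is detected; restarts the silence countdown.
    speechDetected() {
      if (timerId !== null) clearTimer(timerId);
      timerId = setTimer(() => {
        timerId = null;
        onEndOfSpeech();
      }, delay);
    },
    // Cancel a pending callback, for example when listening is stopped manually.
    cancel() {
      if (timerId !== null) clearTimer(timerId);
      timerId = null;
    },
  };
}
```

With a short delay the callback fires soon after the speaker pauses; a longer delay tolerates pauses in the middle of a sentence at the cost of a slower response.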

useListen Hook Returns.

  1. startSpeechRecognition : Function to start speech recognition.
  2. stopSpeechRecognition : Function to stop speech recognition.
  3. answers : Returns an array of strings containing all the transcribed text.
  4. answer : The last transcribed text.
  5. recognizerRef : An instance of microsoft-cognitiveservices-speech-sdk.
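
Because answers holds every transcribed segment and answer holds the most recent one, a component can derive a running transcript from the array. The helpers below are a small sketch; toTranscript and lastAnswer are hypothetical names, and joining segments with a single space is an assumption.

```javascript
// Hypothetical helper: joins transcribed segments into one transcript string,
// skipping empty segments.
function toTranscript(answers) {
  return answers
    .map((segment) => segment.trim())
    .filter((segment) => segment.length > 0)
    .join(" ");
}

// The most recent segment, mirroring what `answer` is described as holding.
function lastAnswer(answers) {
  return answers.length > 0 ? answers[answers.length - 1] : "";
}
```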

Using useSpeak Hook

Implement text-to-speech in your application:

import React, { useState } from "react";
import { useSpeak, SpeechVoices } from "vox-sdk";

const TextToSpeech = () => {
  const [text, setText] = useState("");
  const { interruptSpeech, speak, isSpeaking } = useSpeak({
    onEnd: () => {
      console.log("Speech ended");
    },
    shouldCallOnEnd: true,
    throttleDelay: 1000,
    voice: SpeechVoices.enUSAIGenerate1Neural, // AI voices
  });

  return (
    <div>
      <h3>Text To Speech</h3>
      <input type="text" onChange={(e) => setText(e.target.value)} value={text} />
      <button
        onClick={() => {
          speak(text);
        }}
      >
        Start Speaking
      </button>
      <button
        onClick={() => {
          interruptSpeech();
        }}
      >
        Stop Speaking
      </button>
    </div>
  );
};

export default TextToSpeech;

The useSpeak hook expects the following parameters.

  1. voice :

    • Expects a string value.

    • Choose your preferred AI voice from Microsoft Azure.

    • Here's the list of available voices.

      export enum SpeechVoices {
        // Arabic
        arAEFatimaNeural = "ar-AE-FatimaNeural",
        arBHAliNeural = "ar-BH-AliNeural",
        arEGSalmaNeural = "ar-EG-SalmaNeural",
        arJOTaimNeural = "ar-JO-TaimNeural",
        arKWFahedNeural = "ar-KW-FahedNeural",
        arLYImanNeural = "ar-LY-ImanNeural",
        arQAAmalNeural = "ar-QA-AmalNeural",
        arSAHamedNeural = "ar-SA-HamedNeural",
        arSYAmanyNeural = "ar-SY-AmanyNeural",
        arTNHediNeural = "ar-TN-HediNeural",
        arYEMaryamNeural = "ar-YE-MaryamNeural",
        // Chinese
        zhCNXiaoxiaoNeural = "zh-CN-XiaoxiaoNeural",
        zhCNYunxiNeural = "zh-CN-YunxiNeural",
        zhCNYunyeNeural = "zh-CN-YunyeNeural",
        zhHKHiuGaaiNeural = "zh-HK-HiuGaaiNeural",
        zhHKHiuMaanNeural = "zh-HK-HiuMaanNeural",
        zhTWHsiaoChenNeural = "zh-TW-HsiaoChenNeural",
        zhTWHsiaoYuNeural = "zh-TW-HsiaoYuNeural",
        // Danish
        daDKChristelNeural = "da-DK-ChristelNeural",
        daDKJeppeNeural = "da-DK-JeppeNeural",
        // Dutch
        nlBEArnaudNeural = "nl-BE-ArnaudNeural",
        nlBEDenaNeural = "nl-BE-DenaNeural",
        nlNLColetteNeural = "nl-NL-ColetteNeural",
        nlNLFennaNeural = "nl-NL-FennaNeural",
        // English (Australia)
        enAUNatashaNeural = "en-AU-NatashaNeural",
        enAUWilliamNeural = "en-AU-WilliamNeural",
        // English (Canada)
        enCAClaraNeural = "en-CA-ClaraNeural",
        enCALiamNeural = "en-CA-LiamNeural",
        // English (India)
        enINNeerjaNeural = "en-IN-NeerjaNeural",
        enINPrabhatNeural = "en-IN-PrabhatNeural",
        // English (UK)
        enGBLibbyNeural = "en-GB-LibbyNeural",
        enGBRyanNeural = "en-GB-RyanNeural",
        // English (US)
        enUSAIGenerate1Neural = "en-US-AIGenerate1Neural",
        enUSAmberNeural = "en-US-AmberNeural",
        enUSAriaNeural = "en-US-AriaNeural",
        enUSAshleyNeural = "en-US-AshleyNeural",
        enUSBrandonNeural = "en-US-BrandonNeural",
        enUSChristopherNeural = "en-US-ChristopherNeural",
        enUSCoraNeural = "en-US-CoraNeural",
        enUSDavisNeural = "en-US-DavisNeural",
        enUSElizabethNeural = "en-US-ElizabethNeural",
        enUSEricNeural = "en-US-EricNeural",
        enUSGuyNeural = "en-US-GuyNeural",
        enUSJacobNeural = "en-US-JacobNeural",
        enUSJasonNeural = "en-US-JasonNeural",
        enUSJennyNeural = "en-US-JennyNeural",
        enUSMichelleNeural = "en-US-MichelleNeural",
        enUSMonicaNeural = "en-US-MonicaNeural",
        enUSSaraNeural = "en-US-SaraNeural",
        enUSTonyNeural = "en-US-TonyNeural",
        // Finnish
        fiFINooraNeural = "fi-FI-NooraNeural",
        fiFISelmaNeural = "fi-FI-SelmaNeural",
        // French (Canada)
        frCADiegoNeural = "fr-CA-DiegoNeural",
        frCAFelixNeural = "fr-CA-FelixNeural",
        frCAJeanNeural = "fr-CA-JeanNeural",
        frCASylvieNeural = "fr-CA-SylvieNeural",
        // French (France)
        frFRDeniseNeural = "fr-FR-DeniseNeural",
        frFREloiseNeural = "fr-FR-EloiseNeural",
        frFRHenriNeural = "fr-FR-HenriNeural",
        // German
        deDEKatjaNeural = "de-DE-KatjaNeural",
        deDEKillianNeural = "de-DE-KillianNeural",
        // Greek
        elGRAthinaNeural = "el-GR-AthinaNeural",
        elGRNestorasNeural = "el-GR-NestorasNeural",
        // Hindi
        hiINMadhurNeural = "hi-IN-MadhurNeural",
        hiINSwaraNeural = "hi-IN-SwaraNeural",
        // Italian
        itITDiegoNeural = "it-IT-DiegoNeural",
        itITElsaNeural = "it-IT-ElsaNeural",
        // Japanese
        jaJPAoiNeural = "ja-JP-AoiNeural",
        jaJPNanamiNeural = "ja-JP-NanamiNeural",
        // Korean
        koKRInJoonNeural = "ko-KR-InJoonNeural",
        koKRSunHiNeural = "ko-KR-SunHiNeural",
        // Portuguese (Brazil)
        ptBRFranciscaNeural = "pt-BR-FranciscaNeural",
        ptBRAntonioNeural = "pt-BR-AntonioNeural",
        // Russian
        ruRUDmitryNeural = "ru-RU-DmitryNeural",
        ruRUSvetlanaNeural = "ru-RU-SvetlanaNeural",
        // Spanish (Mexico)
        esMXJorgeNeural = "es-MX-JorgeNeural",
        esMXDaliaNeural = "es-MX-DaliaNeural",
        // Spanish (Spain)
        esESElviraNeural = "es-ES-ElviraNeural",
        esESAlvaroNeural = "es-ES-AlvaroNeural",
        // Swedish
        svSESofieNeural = "sv-SE-SofieNeural",
        svSEMattiasNeural = "sv-SE-MattiasNeural",
      }
  2. throttleDelay :

    • Expects a value in milliseconds.
    • This is the throttle duration applied to speech output.
    • The default is set to 2000ms.
  3. onEnd :

    • Expects a callback function that is invoked when the AI speech ends.
    • To invoke this, set shouldCallOnEnd to true.
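
The three parameters interact in small ways: voice must be one of the listed voice strings, throttleDelay falls back to 2000ms, and onEnd only runs when shouldCallOnEnd is true. The sketch below gathers those rules in one place. buildSpeakOptions is a hypothetical helper, not part of VoxSDK, and the SpeechVoices object is a small subset of the voice list repeated so that the example is self-contained.

```javascript
// Small subset of the voice list, repeated so the sketch is self-contained.
const SpeechVoices = {
  enUSJennyNeural: "en-US-JennyNeural",
  enGBRyanNeural: "en-GB-RyanNeural",
  hiINSwaraNeural: "hi-IN-SwaraNeural",
};

// Hypothetical helper: builds an options object for useSpeak.
function buildSpeakOptions({ voice, throttleDelay = 2000, onEnd } = {}) {
  if (!Object.values(SpeechVoices).includes(voice)) {
    throw new Error(`Unknown voice: ${voice}`);
  }
  const options = { voice, throttleDelay };
  if (typeof onEnd === "function") {
    options.onEnd = onEnd;
    // onEnd is only invoked when shouldCallOnEnd is true
    options.shouldCallOnEnd = true;
  }
  return options;
}
```

A component could then call `useSpeak(buildSpeakOptions({ voice: SpeechVoices.enUSJennyNeural, onEnd: () => console.log("done") }))`, so that forgetting shouldCallOnEnd cannot silently disable the callback.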

useSpeak Hook Returns.

  1. speak :

    • Function to start text-to-speech synthesis.
    • Expects a string argument to be converted to speech.
  2. interruptSpeech :

    • Function to stop the AI speech.
  3. hasAllSentencesBeenSpoken :

    • Returns a boolean value indicating if all sentences have been spoken.
  4. isSpeaking :

    • Returns a boolean value indicating if the AI is speaking.
  5. streamedSentences :

    • Returns an array of strings with all streamed sentences.


Contributions are welcome! Please read our Contributing Guide for more information.


This project is licensed under the MIT License.

