Add Support for Google Cloud Speech-To-Text v2 in mod_google_transcribe #164

entenschnabel · 2024-02-21T23:43:15Z

This PR addresses #149 and adds support for v2 of the Speech-To-Text library while continuing to support v1. By default the module uses v1, and everything works exactly as it did in the previous version. To use v2, the FreeSWITCH variable GOOGLE_SPEECH_CLOUD_SERVICES_VERSION must be set to the value "v2". Setting it to "v1", or not setting it at all, results in the default behaviour.
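
As a rough illustration of the selection rule described above, here is a minimal sketch. The enum and function names are hypothetical and are not taken from the PR's source. Treating any unrecognised value as v1 is also an assumption, since the description only covers "v1", "v2" and the unset case.

```cpp
#include <string>

// Hypothetical type for illustration only; not part of mod_google_transcribe.
enum class SpeechApiVersion { V1, V2 };

// Maps the raw value of GOOGLE_SPEECH_CLOUD_SERVICES_VERSION to an API version.
// A null pointer models an unset variable.
// Assumption: anything other than the exact string "v2" falls back to v1.
SpeechApiVersion select_speech_api_version(const char* raw) {
    if (raw != nullptr && std::string(raw) == "v2") {
        return SpeechApiVersion::V2;
    }
    return SpeechApiVersion::V1;
}
```

With this sketch, an unset variable and "v1" both select v1, and only "v2" selects the new library.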

If v2 is selected, it is essential to provide a so-called recognizer parent path in the GOOGLE_SPEECH_RECOGNIZER_PARENT FreeSWITCH variable. Without it, the GStreamer class cannot be constructed. Recognizers allow commonly used streaming recognition parameters to be stored in the cloud. These stored values can be overridden with parameters passed at runtime, but a recognizer must always be provided to v2 streaming recognition invocations.

If you have already created a recognizer in your Google Cloud account, its id can be passed using the GOOGLE_SPEECH_RECOGNIZER_ID variable. If this is not set, mod_google_transcribe uses the so-called wildcard recognizer id (the "_" character), and a recognizer is created on the fly and not stored for future use. Note that even if a persistent recognizer is not required, it is always necessary to provide at least the parent id of the recognizer in GOOGLE_SPEECH_RECOGNIZER_PARENT, otherwise even the wildcard recognizer cannot be created. This parent id is a path string consisting of the Google Cloud project id that was used to create the Google credentials file, and a geographical location. For more details about recognizers, see https://cloud.google.com/speech-to-text/v2/docs/recognizers
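
To make the relationship between the parent path and the recognizer id concrete, here is a minimal sketch. The helper name is hypothetical and not taken from the PR. It assumes the resource name layout `projects/PROJECT/locations/LOCATION/recognizers/ID` described in the Google Cloud recognizer documentation, and the project id and location in the usage note are placeholders.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration; not part of mod_google_transcribe.
// parent:        value of GOOGLE_SPEECH_RECOGNIZER_PARENT,
//                assumed to look like "projects/PROJECT/locations/LOCATION".
// recognizer_id: value of GOOGLE_SPEECH_RECOGNIZER_ID, empty if unset.
std::string build_recognizer_name(const std::string& parent,
                                  const std::string& recognizer_id) {
    if (parent.empty()) {
        // Without a parent, not even the wildcard recognizer can be addressed.
        throw std::invalid_argument(
            "GOOGLE_SPEECH_RECOGNIZER_PARENT must be set when using v2");
    }
    // Fall back to the wildcard recognizer "_" when no id is given.
    const std::string id = recognizer_id.empty() ? "_" : recognizer_id;
    return parent + "/recognizers/" + id;
}
```

For example, a parent of `projects/my-project/locations/global` with no recognizer id would yield `projects/my-project/locations/global/recognizers/_`.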

As long as GOOGLE_SPEECH_CLOUD_SERVICES_VERSION is set to "v2" and GOOGLE_SPEECH_RECOGNIZER_PARENT is set to a valid recognizer parent id, the v2 library will be used and calls to uuid_google_transcribe should function as they did previously. Any configuration parameters provided at runtime override those already defined in a predefined recognizer.

Differences between `v1` and `v2`

No single utterances in v2. That is to say, it is no longer required to specify this as a parameter; instead it is implied by the selected model. If single-utterance behaviour is required, it is supported by the short model, for example. For more details on models, see https://cloud.google.com/speech-to-text/v2/docs/streaming-recognize.
Speaker diarization does not seem to be supported yet. The code to perform this is still there in mod_google_transcribe for v2, but I didn't manage to stumble across a combination of model, language and location which supports it. See https://stackoverflow.com/questions/76779418/speaker-diarization-is-disabled-even-for-supported-languages-in-google-speech-to
Multiple language support. If you provide up to three languages in the recognition request, the speech engine automatically determines which of them is most likely to have been spoken.
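
The three-language limit mentioned in the last item could be enforced along these lines. This is only a sketch: the function name is hypothetical, and passing the languages as a comma-separated string is an assumption, not something the PR description specifies.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper for illustration; not part of mod_google_transcribe.
// Splits a comma-separated list of language codes, trims surrounding
// spaces and tabs, skips empty entries, and keeps at most max_languages codes.
std::vector<std::string> parse_language_codes(const std::string& csv,
                                              std::size_t max_languages = 3) {
    std::vector<std::string> codes;
    std::istringstream stream(csv);
    std::string token;
    while (codes.size() < max_languages && std::getline(stream, token, ',')) {
        const auto first = token.find_first_not_of(" \t");
        if (first == std::string::npos) {
            continue;  // entry was empty or whitespace only
        }
        const auto last = token.find_last_not_of(" \t");
        codes.push_back(token.substr(first, last - first + 1));
    }
    return codes;
}
```

Any codes beyond the third are silently dropped in this sketch; a real implementation might prefer to log a warning instead.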

There are sure to be many more differences but these are the main things I found so far.

Some Notes on the Code and Building

To avoid code duplication we placed the v1-specific code in google_glue_v1.cpp and the v2-specific code in google_glue_v2.cpp. Generic code used by both libraries now resides in generic_google_glue.h. We use our own Docker image to build the drachtio modules, but our makefile is based on this one:
https://github.com/drachtio/docker-drachtio-freeswitch-base/blob/main/files/Makefile.am.extra
In order to compile and link the v2 code we had to add the following lines to the nodist_libfreeswitch_libgoogleapis_la_SOURCES assignment:

libs/googleapis/gens/google/api/policy.pb.cc \
libs/googleapis/gens/google/cloud/speech/v1/resource.pb.cc \
libs/googleapis/gens/google/cloud/speech/v1/resource.grpc.pb.cc \
libs/googleapis/gens/google/cloud/speech/v2/cloud_speech.pb.cc \
libs/googleapis/gens/google/cloud/speech/v2/cloud_speech.grpc.pb.cc \

If you don't do this, you will most likely run into linker errors.

That's all I can think of for now. It would be really great if you also find this useful and we manage to get it merged. I am of course available for questions.


entenschnabel added 16 commits February 13, 2024 22:15

Reference v2 namespace as opposed to v1p1beta1.

6bdda2c

Change function calls to v2 function names.

eb8dfc3

Some methods and data structures are now obsolete in v2.

Integrate V2 version of Google Cloud Services API parallel with V1

1d5cb3d

Initial attempt to use v2, with reduced number of RecognitionConfig parameters

8ece435

The recognizer was initialized too late. It must be done in the GStreamer's constructor.

248e59e

Add some debug logging

b136d13

Set the recognizer and the audio content within the same call to `Write` on the grpc stream.

1408fea

Ensure that interim result property is set initially before starting with streaming recognition in v2.

61a77e6

Allow wild card recognizer id to be used if no recognizer id is explicitly provided. If a recognizer id is provided then use this and ignore all parameters except for interim.

1bc7a67

Do not check for FreeSWITCH variable for recognizer parent in v1 code. This can only happen in v2-specific code.

4277668

Allow sample rate to be set using a FreeSWITCH/environment variable

605cd2a

Move duplicated code into more generic functions.

4171cbf

Enable multiple languages for v2

57eb10b

Tidy up some comments and log statements.

2778c5c

Event max_duration_exceeded not firing in v2.

12f2779

Add TODO comment for solution to 5 minute timeout.

d251867
