In [1]:
# SPDX-FileCopyrightText: Copyright (c) 2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: MIT

# ASR API tutorial

This tutorial demonstrates how to use the Riva Python API.

## <font color="blue">Server</font>

Before running the client part of Riva, please set up a server. The simplest
way to do this is to follow the
[quick start guide](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html#local-deployment-using-quick-start-scripts).


## <font color="blue">Authentication</font>

Before using Riva services, you need to establish a connection with a server.

In [1]:
import riva.client

uri = "10.195.220.132:8888"  # host:port of the Riva server; replace with your own address

auth = riva.client.Auth(uri=uri)
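
Hard-coding the server address makes the notebook harder to share. One option is to read the address from an environment variable and fall back to a local one. The sketch below is only an illustration: the variable name `RIVA_URI` and the helper `resolve_riva_uri` are hypothetical and not part of the Riva client, and the fallback `localhost:50051` is an assumption about a local quick-start deployment.

```python
import os


def resolve_riva_uri(env_var: str = "RIVA_URI", fallback: str = "localhost:50051") -> str:
    """Return a "host:port" string from the environment, or the fallback.

    Both the variable name and the fallback are illustrative assumptions.
    """
    uri = os.environ.get(env_var, fallback).strip()
    host, sep, port = uri.rpartition(":")
    # Reject values that are not of the form host:port before they reach the client.
    if not sep or not host or not port.isdigit():
        raise ValueError(f"Expected 'host:port', got {uri!r}")
    return uri


uri = resolve_riva_uri()
# auth = riva.client.Auth(uri=uri)  # same call as in the cell above
```

Validating the string early gives a clearer error than a failed connection attempt later on.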

## <font color="blue">Setting up service</font>

To instantiate a service, pass a `riva.client.Auth` instance to its constructor.

In [2]:
asr_service = riva.client.ASRService(auth)

For speech recognition, you need to create a recognition config (an instance of `riva.client.RecognitionConfig`).
A detailed description of the config fields is available in the Riva
[documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/riva_asr.proto.html?highlight=max%20alternatives#riva-proto-riva-asr-proto).
If you intend to use streaming recognition, the offline config has to be wrapped in a `riva.client.StreamingRecognitionConfig`.


In [3]:
from copy import deepcopy
offline_config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
    verbatim_transcripts=False,
)
streaming_config = riva.client.StreamingRecognitionConfig(config=deepcopy(offline_config), interim_results=True)

You also need to set the frame rate and the number of channels of the audio that is going to be processed. For example, to process the file `data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav` (commented out in the cell below in favor of a local recording), the code would be:

In [4]:
# my_wav_file = '../data/examples/en-US_AntiBERTa_for_word_boosting_testing.wav'
my_wav_file = "interview-with-bill.wav"
riva.client.add_audio_file_specs_to_config(offline_config, my_wav_file)
riva.client.add_audio_file_specs_to_config(streaming_config, my_wav_file)
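
To see which values such a helper needs from the audio file, the sketch below reads the frame rate, channel count and sample width with the standard-library `wave` module. It is not the Riva implementation: `read_wav_specs` is a hypothetical helper, and the example writes its own one-second file of silence so that it runs without the tutorial's recording.

```python
import os
import tempfile
import wave


def read_wav_specs(path: str) -> dict:
    """Return the frame rate, channel count, sample width and frame count of a WAV file."""
    with wave.open(path, "rb") as wf:
        return {
            "framerate": wf.getframerate(),
            "nchannels": wf.getnchannels(),
            "sampwidth": wf.getsampwidth(),
            "nframes": wf.getnframes(),
        }


# Write one second of 16 kHz mono 16-bit silence, then read its header back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "silence.wav")
    with wave.open(demo_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(b"\x00\x00" * 16000)
    specs = read_wav_specs(demo_path)

print(specs)
```

The `framerate` and `nchannels` values correspond to the `sample_rate_hertz` and `audio_channel_count` fields that appear in the printed configs.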

If you intend to use word boosting, use the convenience method `riva.client.add_word_boosting_to_config()` to add boosting parameters to the config.

In [5]:
boosted_lm_words = ['AntiBERTa', 'ABlooper']
boosted_lm_score = 20.0
riva.client.add_word_boosting_to_config(offline_config, boosted_lm_words, boosted_lm_score)
riva.client.add_word_boosting_to_config(streaming_config, boosted_lm_words, boosted_lm_score)

In [6]:
print(offline_config)

encoding: LINEAR_PCM
sample_rate_hertz: 16000
language_code: "en-US"
max_alternatives: 1
speech_contexts {
  phrases: "AntiBERTa"
  phrases: "ABlooper"
  boost: 20
}
audio_channel_count: 1
enable_automatic_punctuation: true



In [7]:
print(streaming_config)

config {
  encoding: LINEAR_PCM
  sample_rate_hertz: 16000
  language_code: "en-US"
  max_alternatives: 1
  speech_contexts {
    phrases: "AntiBERTa"
    phrases: "ABlooper"
    boost: 20
  }
  audio_channel_count: 1
  enable_automatic_punctuation: true
}
interim_results: true



## <font color="blue">Offline</font>

To run offline speech recognition, read the audio data from a file and pass it to the service.

In [8]:
with open(my_wav_file, 'rb') as fh:
    data = fh.read()

In [9]:
response = asr_service.offline_recognize(data, offline_config)

In [10]:
print(response)

results {
  alternatives {
    transcript: "I\'m Bill Bill Turner. I\'m from St., Cloud, Minnesota, and it\'s 19 June in 1955. "
    confidence: 0.23720637
    words {
      start_time: 3000
      end_time: 3160
      word: "i\'m"
      confidence: 0.211123496
    }
    words {
      start_time: 3240
      end_time: 3440
      word: "bill"
      confidence: 0.0618735403
    }
    words {
      start_time: 3840
      end_time: 4040
      word: "bill"
      confidence: 0.208551154
    }
    words {
      start_time: 4160
      end_time: 4520
      word: "turner"
      confidence: 0.0643256
    }
    words {
      start_time: 4720
      end_time: 4880
      word: "i\'m"
      confidence: 0.14079
    }
    words {
      start_time: 5040
      end_time: 5080
      word: "from"
      confidence: 0.188600674
    }
    words {
      start_time: 5840
      end_time: 6040
      word: "saint"
      confidence: 0.195283547
    }
    words {
      start_time: 6080
      end_time: 6360
      word: "

To extract a transcript you may use

In [11]:
print(response.results[0].alternatives[0].transcript)

I'm Bill Bill Turner. I'm from St., Cloud, Minnesota, and it's 19 June in 1955. 


In [12]:
print(response.results[0].alternatives[0].confidence)

0.237206369638443
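
The per-word entries of the response can be turned into a compact timing table. The sketch below uses a small `Word` dataclass as a stand-in for the entries of `alternatives[0].words`, with values copied from the response printed above. It assumes that `start_time` and `end_time` are expressed in milliseconds; `format_word_timings` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Stand-in for one entry of `alternatives[0].words` (illustrative only)."""
    word: str
    start_time: int  # assumed to be milliseconds
    end_time: int    # assumed to be milliseconds
    confidence: float


def format_word_timings(words):
    """Return one 'start-end s  word (confidence)' line per word."""
    lines = []
    for w in words:
        start_s = w.start_time / 1000
        end_s = w.end_time / 1000
        lines.append(f"{start_s:.2f}-{end_s:.2f} s  {w.word} ({w.confidence:.2f})")
    return lines


# Values copied from the response printed above.
sample_words = [
    Word("i'm", 3000, 3160, 0.211123496),
    Word("bill", 3240, 3440, 0.0618735403),
    Word("turner", 4160, 4520, 0.0643256),
]
for line in format_word_timings(sample_words):
    print(line)
```

With a real response, the same function could be called on `response.results[0].alternatives[0].words`, since it only reads the four attributes shown.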


### <font color="green">Asynchronous calls</font>

You can recognize speech asynchronously by setting `future=True` in `ASRService.offline_recognize()`.

In [13]:
from time import time

num_repeats = 10

In [37]:
sync_transcripts = []
start_time = time()
for _ in range(num_repeats):
    sync_transcripts.append(
        asr_service.offline_recognize(data, offline_config).results[0].alternatives[0].transcript
    )
print(f"Time spent on synchronous recognition: {time() - start_time:.2f}")

Time spent on synchronous recognition: 5.96


In [38]:
async_transcripts = []
start_time = time()
futures = []
for _ in range(num_repeats):
    futures.append(asr_service.offline_recognize(data, offline_config, future=True))
for f in futures:
    async_transcripts.append(f.result().results[0].alternatives[0].transcript)
print(f"Time spent on async recognition: {time() - start_time:.2f}")

Time spent on async recognition: 2.54


In [39]:
assert sync_transcripts == async_transcripts
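
The speed-up comes from submitting all requests before waiting for any of them. The sketch below shows the same submit-then-collect pattern with the standard-library `concurrent.futures` module. It is an analogy rather than the client's mechanism: `fake_recognize` is a hypothetical stand-in that only sleeps, and no Riva call is made.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_recognize(request_id: int, latency: float = 0.05) -> str:
    """Hypothetical stand-in for a blocking recognition request."""
    time.sleep(latency)
    return f"transcript {request_id}"


num_requests = 5

# Sequential: each request waits for the previous one to finish.
start = time.perf_counter()
sequential = [fake_recognize(i) for i in range(num_requests)]
sequential_elapsed = time.perf_counter() - start

# Overlapped: submit everything first, then collect results in submission order.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=num_requests) as pool:
    futures = [pool.submit(fake_recognize, i) for i in range(num_requests)]
    overlapped = [f.result() for f in futures]
overlapped_elapsed = time.perf_counter() - start

print(f"sequential: {sequential_elapsed:.2f} s, overlapped: {overlapped_elapsed:.2f} s")
```

Collecting the futures in the order they were submitted keeps the results aligned with the requests, which is why the two transcript lists can be compared directly.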

## <font color="blue">Streaming</font>

To imitate audio streaming, use `riva.client.AudioChunkFileIterator`. You can imitate real-time audio by providing a delay callback to the iterator.

In [44]:
wav_parameters = riva.client.get_wav_file_parameters(my_wav_file)
# corresponds to 1 second of audio
chunk_size = wav_parameters['framerate']
with riva.client.AudioChunkFileIterator(
    my_wav_file, chunk_size, delay_callback=riva.client.sleep_audio_length,
) as audio_chunk_iterator:
    for i, chunk in enumerate(audio_chunk_iterator):
        print(i, len(chunk))

0 32000
1 32000
2 32000
3 32000
4 32000
5 32000
6 32000
7 32000
8 32000
9 32000
10 32000
11 32000
12 32000
13 32000
14 32000
15 32000
16 32000
17 32000
18 32000
19 32000
20 32000
21 32000
22 32000
23 32000
24 32000
25 32000
26 32000
27 32000
28 32000
29 32000
30 32000
31 32000
32 32000
33 32000
34 32000
35 32000
36 32000
37 32000
38 32000
39 32000
40 32000
41 32000
42 32000
43 32000
44 32000
45 32000
46 32000
47 32000
48 32000
49 32000
50 32000
51 32000
52 32000
53 32000
54 32000
55 32000
56 32000
57 32000
58 32000
59 32000
60 736
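
The chunk lengths printed above follow from simple arithmetic: a chunk of `framerate` frames holds `framerate * sampwidth * nchannels` bytes. The sketch below reproduces the numbers under the assumption of 16 kHz, mono, 16-bit audio. The frame count is inferred from the output (60 full chunks plus a 736-byte remainder) rather than read from the file, and `chunk_byte_sizes` is a hypothetical helper.

```python
def chunk_byte_sizes(n_frames: int, chunk_frames: int, sampwidth: int, nchannels: int) -> list:
    """Return the byte length of every chunk when reading `chunk_frames` frames at a time."""
    bytes_per_frame = sampwidth * nchannels
    sizes = []
    remaining = n_frames
    while remaining > 0:
        frames = min(chunk_frames, remaining)
        sizes.append(frames * bytes_per_frame)
        remaining -= frames
    return sizes


# 60 full one-second chunks plus 368 leftover frames (368 * 2 bytes = 736 bytes).
sizes = chunk_byte_sizes(n_frames=60 * 16000 + 368, chunk_frames=16000, sampwidth=2, nchannels=1)
print(len(sizes), sizes[0], sizes[-1])  # 61 32000 736
```

By the same reasoning, a chunk size of 4800 frames at 16 kHz covers 0.3 seconds of audio.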


The audio chunks are then passed to `ASRService.streaming_response_generator()`, which creates a response generator.

In [46]:
audio_chunk_iterator = riva.client.AudioChunkFileIterator(my_wav_file, 4800)
response_generator = asr_service.streaming_response_generator(audio_chunk_iterator, streaming_config)

You can find a description of the streaming response (`StreamingRecognizeResponse`) fields in the Riva [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/riva_asr.proto.html?highlight=max%20alternatives#riva-proto-riva-asr-proto).

In [47]:
streaming_response = next(response_generator)
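
When `interim_results=True`, the generator yields partial hypotheses as well as final ones. The sketch below shows one way to keep only the final transcripts. It runs on mock objects built with `SimpleNamespace` and assumes that each result exposes `is_final` and `alternatives`, as described in the proto documentation linked above; `collect_final_transcripts` and `make_response` are hypothetical helpers.

```python
from types import SimpleNamespace


def collect_final_transcripts(responses):
    """Keep the top alternative of every final result and skip interim ones."""
    finals = []
    for response in responses:
        for result in response.results:
            if result.is_final and result.alternatives:
                finals.append(result.alternatives[0].transcript)
    return finals


def make_response(transcript, is_final):
    """Build a mock object shaped like a streaming response (illustrative only)."""
    alternative = SimpleNamespace(transcript=transcript)
    result = SimpleNamespace(alternatives=[alternative], is_final=is_final)
    return SimpleNamespace(results=[result])


mock_responses = [
    make_response("so where", False),
    make_response("so where are you", False),
    make_response("So where are you working? ", True),
]
print(collect_final_transcripts(mock_responses))
```

Because the function only iterates over its argument, it could also be given a real response generator, which it would consume until the stream ends.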

To display streaming results, it is convenient to use the function `riva.client.print_streaming()`.

In [1]:
# riva.client.print_streaming(response_generator, additional_info='time')

If you set a delay callback in the audio chunk iterator and `show_intermediate=True` in `riva.client.print_streaming()`, you will be able to watch the transcript forming.

In [55]:
audio_chunk_iterator = riva.client.AudioChunkFileIterator(my_wav_file, 4800, riva.client.sleep_audio_length)
response_generator = asr_service.streaming_response_generator(audio_chunk_iterator, streaming_config)
riva.client.print_streaming(response_generator, show_intermediate=True)

## Well, I'm Bill. Bill Turner. I'm from. 
## St., Cloud, Minnesota, and it's 19 June in 1955.                               
## So where are you working?  
## Well, since I moved here, I've been working as a part in town. It's alright. Work pays well. Good enough to support my family of four. 
## How's home life?  
## Okay, Barb and I just got back from a camping trip in the boundary waters. Steve, my six year old, is just about to start kindergarten. 
## really sounds like you're at a good place in your life.  
## Yeah, most definitely. Junior's got a Little League game coming and we're all very excited for it. Should be a very nice time for all of us. 
## Oh, I'm sure it'll be just lovely. 
## Do you consider yourself happy?  
## Well, got a good job, good family, nice sturdy house. 
## That ain't happiness. I don't know what it is. 


It is also possible to print streaming results to several places at once, e.g. to STDOUT and a file.

In [57]:
import sys
output_file = "my_results.txt"
audio_chunk_iterator = riva.client.AudioChunkFileIterator(my_wav_file, 4800)
response_generator = asr_service.streaming_response_generator(audio_chunk_iterator, streaming_config)
riva.client.print_streaming(response_generator,show_intermediate=True, output_file=[sys.stdout, output_file])

## Well, I'm Bill. Bill Turner. I'm from. 
## St., Cloud, Minnesota, and it's 19 June in 1955.                               
## So where are you working?  
## Well, since I moved here, I've been working as a part in town. It's alright. Work pays well. Good enough to support my family of four. 
## How's home life?  
## Okay, Barb and I just got back from a camping trip in the boundary waters. Steve, my six year old, is just about to start kindergarten. 
## really sounds like you're at a good place in your life.  
## Yeah, most definitely. Junior's got a Little League game coming and we're all very excited for it. Should be a very nice time for all of us. 
## Oh, I'm sure it'll be just lovely. 
## Do you consider yourself happy?  
## Well, got a good job, good family, nice sturdy house. 
## That ain't happiness. I don't know what it is. 
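
Writing the same text to several destinations is a simple fan-out over open streams. The sketch below illustrates the idea with standard-library objects only. It is not how `riva.client.print_streaming()` is implemented: `write_to_all` is a hypothetical helper, and an in-memory `StringIO` buffer stands in for an open file.

```python
import io
import sys


def write_to_all(text: str, targets) -> None:
    """Write `text` to every open text stream in `targets`."""
    for target in targets:
        target.write(text)
        target.flush()


buffer = io.StringIO()  # stands in for an open file
write_to_all("## So where are you working?\n", [sys.stdout, buffer])
```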


Showing the file and cleaning up in bash

In [15]:
# Print the contents of the results file
with open(output_file, 'r') as file:
    print(file.read())

In [60]:
!rm $output_file

'rm' is not recognized as an internal or external command,
operable program or batch file.


Showing the file and cleaning up in cmd.exe

In [14]:
# !type $output_file

In [63]:
!del $output_file
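
Shell commands such as `rm` and `del` depend on the operating system, as the error above shows. A portable alternative is to do the cleanup in Python. The sketch below uses `pathlib` and demonstrates the idea on a throwaway file in a temporary directory; `remove_if_exists` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def remove_if_exists(path) -> bool:
    """Delete `path` if it is present; return True when a file was removed."""
    path = Path(path)
    existed = path.exists()
    path.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
    return existed


# Demonstrate on a throwaway file rather than on the tutorial's results file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "my_results_demo.txt"
    demo_file.write_text("## demo line\n", encoding="utf-8")
    print(demo_file.read_text(encoding="utf-8"), end="")
    first = remove_if_exists(demo_file)
    second = remove_if_exists(demo_file)
print(first, second)  # True False
```

Calling `remove_if_exists(output_file)` would behave the same way in bash and cmd.exe environments.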

## <font color="blue">Audio input/output</font>

To use audio input and output, you need to install PyAudio.

```bash
conda install -c anaconda pyaudio
```

### <font color="green">Playing audio during transcribing</font>

To play audio while it is being transcribed, provide an instance of `riva.client.audio_io.SoundCallBack` as a `delay_callback` to `riva.client.AudioChunkFileIterator`.

In [71]:
import riva.client.audio_io

In [73]:
# show available output devices
riva.client.audio_io.list_output_devices()

Output audio devices:
3: Microsoft Sound Mapper - Output
4: Headset Earphone (EPOS ADAPT 16
5: Speakers (Realtek(R) Audio)
9: Primary Sound Driver
10: Headset Earphone (EPOS ADAPT 160T)
11: Speakers (Realtek(R) Audio)
12: Speakers (Realtek(R) Audio)
13: Headset Earphone (EPOS ADAPT 160T)
18: Output 1 (EPOS ADAPT 160T)
19: Output 2 (EPOS ADAPT 160T)
21: Communication Speaker (EPOS ADAPT 160T)
22: Headphones ()
24: Speakers 1 (Realtek HD Audio output with SST)
25: Speakers 2 (Realtek HD Audio output with SST)
28: Headphones 1 (Realtek HD Audio 2nd output with SST)
29: Headphones 2 (Realtek HD Audio 2nd output with SST)


In [75]:
output_device = None  # use default device
wav_parameters = riva.client.get_wav_file_parameters(my_wav_file)
sound_callback = riva.client.audio_io.SoundCallBack(
    output_device, wav_parameters['sampwidth'], wav_parameters['nchannels'], wav_parameters['framerate'],
)
audio_chunk_iterator = riva.client.AudioChunkFileIterator(my_wav_file, 4800, sound_callback)
response_generator = asr_service.streaming_response_generator(audio_chunk_iterator, streaming_config)
riva.client.print_streaming(response_generator, show_intermediate=True)
sound_callback.close()

## Well, I'm Bill. Bill Turner. I'm from. 
## St., Cloud, Minnesota, and it's 19 June in 1955.                               
## So where are you working?  
## Well, since I moved here, I've been working as a part in town. It's alright. Work pays well. Good enough to support my family of four. 
## How's home life?  
## Okay, Barb and I just got back from a camping trip in the boundary waters. Steve, my six year old, is just about to start kindergarten. 
## really sounds like you're at a good place in your life.  
## Yeah, most definitely. Junior's got a Little League game coming and we're all very excited for it. Should be a very nice time for all of us. 
## Oh, I'm sure it'll be just lovely. 
## Do you consider yourself happy?  
## Well, got a good job, good family, nice sturdy house. 
## That ain't happiness. I don't know what it is. 


### <font color="green">Streaming from microphone</font>

In [77]:
riva.client.audio_io.list_input_devices()

Input audio devices:
0: Microsoft Sound Mapper - Input
1: Headset Microphone (EPOS ADAPT 
2: Microphone Array (Intel® Smart 
6: Primary Sound Capture Driver
7: Headset Microphone (EPOS ADAPT 160T)
8: Microphone Array (Intel® Smart Sound Technology for Digital Microphones)
14: Headset Microphone (EPOS ADAPT 160T)
15: Microphone Array (Intel® Smart Sound Technology for Digital Microphones)
16: Headset Microphone 1 (EPOS ADAPT 160T)
17: Headset Microphone 2 (EPOS ADAPT 160T)
20: Input (EPOS ADAPT 160T)
23: Microphone (Realtek HD Audio Mic input)
26: PC Speaker (Realtek HD Audio output with SST)
27: Stereo Mix (Realtek HD Audio Stereo input)
30: PC Speaker (Realtek HD Audio 2nd output with SST)
31: Microphone Array 1 ()
32: Microphone Array 2 ()


Run the code below, then say something in English.

In [85]:
input_device = None  # default device
with riva.client.audio_io.MicrophoneStream(
    rate=streaming_config.config.sample_rate_hertz,
    chunk=streaming_config.config.sample_rate_hertz // 10,
    device=input_device,
) as audio_chunk_iterator:
    riva.client.print_streaming(
        responses=asr_service.streaming_response_generator(
            audio_chunks=audio_chunk_iterator,
            streaming_config=streaming_config,
        ),
        show_intermediate=True,
    )

## Hm.  


KeyboardInterrupt: 

# Speaker Diarization

In [1]:
import io
import IPython.display as ipd
import grpc

In [2]:
path = "interview-with-bill.wav"
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)

In [17]:
config = riva.client.RecognitionConfig(
  language_code="en-US",
  max_alternatives=1,
  enable_automatic_punctuation=True,
  enable_word_time_offsets=True,
)

# Use utility function to add SpeakerDiarizationConfig with enable_speaker_diarization=True
# Value of max_speaker_count in SpeakerDiarizationConfig has no effect as of now. It will be honoured in future.
riva.client.asr.add_speaker_diarization_to_config(config, diarization_enable=True)

# ASR inference call with Recognize
response = asr_service.offline_recognize(content, config)
print("ASR Transcript with Speaker Diarization:\n", response)

ASR Transcript with Speaker Diarization:
 results {
  alternatives {
    transcript: "I\'m Bill Bill Turner. I\'m from St., Cloud, Minnesota, and it\'s 19 June in 1955. "
    confidence: 0.237279698
    words {
      start_time: 3000
      end_time: 3160
      word: "i\'m"
      confidence: 0.212278828
      speaker_tag: 1
    }
    words {
      start_time: 3240
      end_time: 3440
      word: "bill"
      confidence: 0.0618815236
      speaker_tag: 1
    }
    words {
      start_time: 3840
      end_time: 4040
      word: "bill"
      confidence: 0.208552644
      speaker_tag: 1
    }
    words {
      start_time: 4160
      end_time: 4520
      word: "turner"
      confidence: 0.0643188581
      speaker_tag: 1
    }
    words {
      start_time: 4720
      end_time: 4880
      word: "i\'m"
      confidence: 0.140793696
      speaker_tag: 1
    }
    words {
      start_time: 5040
      end_time: 5080
      word: "from"
      confidence: 0.188606516
      speaker_tag: 1
    }
    w

In [19]:
# Pretty print transcript with color coded speaker tags. Black color text indicates no speaker tag was assigned.
for result in response.results:
    for word in result.alternatives[0].words:
        color = '\033['+ str(30 + word.speaker_tag) + 'm'
        print(color, word.word, end="")
      

[31m i'm[31m bill[31m bill[31m turner[31m i'm[31m from[31m saint[31m cloud[31m minnesota[31m and[31m it's[31m the[31m nineteenth[31m of[31m june[31m in[31m nineteen[31m fifty[31m five[32m So[32m where[32m are[32m you[32m working?[31m Well,[31m since[31m I[31m moved[31m here,[31m I've[31m been[31m working[31m as[31m a[31m pharmacist[31m in[31m town.[31m It's[31m alright,[31m work[31m pays[31m well.[31m Good[31m enough[31m to[31m support[31m my[31m family[31m of[31m four.[32m How's[32m home[32m life?[31m Okay,[31m Barb[31m and[31m I[31m just[31m got[31m back[31m from[31m a[31m camping[31m trip[31m in[31m the[31m boundary[31m waters.[31m Steve,[31m my[31m six[31m year[31m old,[31m is[31m just[31m about[31m to[31m start[31m kindergarten.[32m Really[32m sounds[32m like[32m you're[32m at[32m a[32m good[32m place[32m in[32m your[32m lie.[31m Oh[31m yeah,[31m most[31m definitely.[31m Junior[31m has

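
The loop above never resets the terminal color, so text printed afterwards keeps the color of the last word. The sketch below wraps the escape-code logic in a small function that appends the ANSI reset sequence. `colorize` is a hypothetical helper, and wrapping tags modulo 8 is an added assumption to stay within the standard foreground codes 30 to 37.

```python
RESET = "\033[0m"


def colorize(text: str, speaker_tag: int) -> str:
    """Wrap `text` in an ANSI foreground color chosen from the speaker tag.

    Codes 30-37 are the standard foreground colors, so tag 0 maps to black,
    1 to red and 2 to green; larger tags wrap around.
    """
    code = 30 + (speaker_tag % 8)
    return f"\033[{code}m{text}{RESET}"


print(colorize("i'm bill", 1), colorize("So where are you working?", 2))
```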
In [71]:
print('\033['+str(response.results[0].alternatives[0].words[0].speaker_tag)+'m',response.results[0].alternatives[0].transcript)

[1m I'm Bill Bill Turner. I'm from St., Cloud, Minnesota, and it's 19 June in 1955. 


In [95]:
# Print each result on its own line, labelled and colored by the speaker tag of its first word.
for result in response.results:
    alternative = result.alternatives[0]
    speaker_tag = alternative.words[0].speaker_tag
    person = f"Person_{speaker_tag}: "
    color = f"\033[{30 + speaker_tag}m"
    print(color, person + alternative.transcript)

[31m Person_1: I'm Bill Bill Turner. I'm from St., Cloud, Minnesota, and it's 19 June in 1955. 
[32m Person_2: So where are you working? 
[31m Person_1: Well, since I moved here, I've been working as a pharmacist in town. It's alright, work pays well. Good enough to support my family of four. 
[32m Person_2: How's home life? 
[31m Person_1: Okay, Barb and I just got back from a camping trip in the boundary waters. Steve, my six year old, is just about to start kindergarten. 
[32m Person_2: Really sounds like you're at a good place in your lie. 
[31m Person_1: Oh yeah, most definitely. Junior has got a Little League game coming up and we're all very excited for it Should be a very nice time for all of us. 
[32m Person_2: Oh, I'm sure it'll be just lovely. 
[32m Person_2: Do you consider yourself happy? 
[31m Person_1: Well, got a good job, good family, nice sturdy house. 
[31m Person_1: That ain't happiness. I don't know what is. 

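
Labelling a whole result by the tag of its first word works when each result contains a single speaker. If a result could contain words from more than one speaker, grouping consecutive words by tag may be safer. The sketch below shows that grouping on mock words built with `SimpleNamespace`; `group_by_speaker` is a hypothetical helper, and the sample tags follow the output above.

```python
from itertools import groupby
from types import SimpleNamespace


def group_by_speaker(words):
    """Merge consecutive words with the same speaker tag into (tag, text) turns."""
    turns = []
    for tag, run in groupby(words, key=lambda w: w.speaker_tag):
        turns.append((tag, " ".join(w.word for w in run)))
    return turns


# Mock words shaped like `alternatives[0].words`.
mock_words = [
    SimpleNamespace(word="four.", speaker_tag=1),
    SimpleNamespace(word="How's", speaker_tag=2),
    SimpleNamespace(word="home", speaker_tag=2),
    SimpleNamespace(word="life?", speaker_tag=2),
    SimpleNamespace(word="Okay,", speaker_tag=1),
]
for tag, text in group_by_speaker(mock_words):
    print(f"Person_{tag}: {text}")
```

`itertools.groupby` only merges adjacent items, so a speaker who talks twice produces two separate turns, which preserves the order of the conversation.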