<a href="https://colab.research.google.com/github/cul-data-club/meetings/blob/main/2022/march-31-vosk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transcribing Audio with Vosk (Offline!)

by Moacir P. de Sá Pereira, [Research Data Services](https://library.columbia.edu/services/research-data-services.html), Columbia University Libraries

A [2022 article in _Politico_](https://www.politico.com/news/2022/02/16/my-journey-down-the-rabbit-hole-of-every-journalists-favorite-app-00009216#:~:text=The%20Otter%20privacy%20policy%20claims,for%20sharing%20with%20third%20parties.) has raised (anew) issues surrounding speech recognition tools used for audio transcription, in this case the popular service [Otter](http://otter.ai), which has even been recommended for use at Columbia. ([Susan McGregor](https://datascience.columbia.edu/people/susan-mcgregor/), of Columbia's Data Science Institute, is quoted in the _Politico_ piece.)

The issues are twofold: the first is giving third parties access to your data (in this case both the audio recordings themselves and your transcripts). This calls to mind the January 2022 scandal regarding the sharing of AI-chat transcripts for The Crisis Text Line ([also reported in _Politico_](https://www.politico.com/news/2022/01/28/suicide-hotline-silicon-valley-privacy-debates-00002617)) with third-party vendors.

The second issue is that AI- (or machine learning-) driven speech recognition solutions are only as good as the training that the AI has received (or how good the training model is). We see the same phenomenon in real life, where if we take a car in for service, a mechanic with 20 years of experience will be able to determine the problem much more quickly than a teenager who has never popped the hood of a car. As a result, services like Otter, but also AI assistants like Siri and Alexa, eat their own dogfood: they take requests they have transcribed and add them to their model so that they can "understand" requests better in the future. This is, of course, also a concern for privacy advocates.

## ML-Driven Speech Recognition with User Control

The alternatives to the above systems (which include [Google's Cloud Speech-to-Text API](https://cloud.google.com/speech-to-text)) are ones where the model used for transcription is in the hands of the transcriber at all times. On the one hand, transcribers can train their own models, which can be powerful but also time-consuming. For some languages or registers, a locally trained model may be the only option. On the other hand, transcribers can make use of pre-trained models. These models don't change as they are fed new information, which means the privacy of the transcriber's source material is maintained.

As with all ML-driven work, the model is a black box whose only important parameter is efficacy. As a result, a handful of open source APIs have emerged for audio transcription, allowing researchers to investigate how the APIs function and further allowing researchers to use them offline, with no worry of data compromises.

Here are a few open source speech recognition options:

* Mozilla's [DeepSpeech](https://github.com/mozilla/DeepSpeech). As of 2020, [the status of this system is unclear](https://discourse.mozilla.org/t/future-of-deepspeech-stt-after-recent-changes-at-mozilla/66191), and it has been over a year since a release was cut.
* [Kaldi](https://kaldi-asr.org/index.html), which includes [13 pre-trained models](https://kaldi-asr.org/models.html)
* Facebook's Flashlight ML library includes [an automatic speech recognition app](https://github.com/flashlight/flashlight/tree/main/flashlight/app/asr) based on their earlier work, Wav2Letter++.
* [Athena](https://github.com/athena-team/athena) is a Python-based speech processing engine built atop Google's popular [TensorFlow](https://www.tensorflow.org/) ML library.

## Enter Vosk

For this notebook, however, we'll be focusing on relative newcomer [Vosk](https://alphacephei.com/vosk/). What attracts me to Vosk is that they ship models for twenty different languages, and that the models have small versions (50MB) designed for "offline" use on either embedded systems (Raspberry Pis) or smartphones. With such a light footprint, implementing Vosk in a Colab becomes rather trivial.

Additionally, Vosk supports streaming recognition, which can provide real-time transcription, much like how Google's Speech-to-Text API powers the real-time subtitling of the [Studio Remote live stream](https://www.youtube.com/channel/UCLOUh6s8E2FYAVAsJg3lgoA). Finally, Vosk has wrappers for many languages other than just Python.
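
To give a rough sense of what streaming recognition looks like, here is a minimal sketch. It assumes a recognizer object exposing the `AcceptWaveform`, `PartialResult`, `Result`, and `FinalResult` methods that Vosk's `KaldiRecognizer` provides. The names `stream_transcript` and `FakeRecognizer` are hypothetical and invented for illustration; the fake stands in for a real recognizer so the loop runs without a model or audio.

```python
import json


def stream_transcript(recognizer, chunks):
    """Yield ("partial", text) or ("final", text) as audio chunks arrive."""
    for chunk in chunks:
        if recognizer.AcceptWaveform(chunk):
            # The recognizer detected the end of an utterance.
            yield "final", json.loads(recognizer.Result())["text"]
        else:
            # Still mid-utterance: report the best guess so far.
            yield "partial", json.loads(recognizer.PartialResult())["partial"]
    # Flush whatever is still buffered once the stream ends.
    yield "final", json.loads(recognizer.FinalResult())["text"]


class FakeRecognizer:
    """Hypothetical stand-in: each chunk is one "word", and an empty
    chunk marks the end of an utterance."""

    def __init__(self):
        self.words = []

    def AcceptWaveform(self, chunk):
        if chunk:
            self.words.append(chunk.decode())
            return False
        return True

    def PartialResult(self):
        return json.dumps({"partial": " ".join(self.words)})

    def Result(self):
        text, self.words = " ".join(self.words), []
        return json.dumps({"text": text})

    FinalResult = Result


events = list(stream_transcript(FakeRecognizer(), [b"one", b"zero", b"", b"three"]))
for kind, text in events:
    print(f"{kind:>7}: {text}")
```

With a real `KaldiRecognizer`, the chunks would be raw PCM bytes read from a microphone or file, and the partial results are what a live subtitling display would show while a speaker is mid-sentence.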

With the rest of this notebook, we'll try three things:

1. Run the Vosk tutorial to get a feel for the library.
2. Implement an updated version [of this Colab](https://colab.research.google.com/gist/dauuricus/4d9ba614afd7558a2591451fe08949ef/vosk_chinese.ipynb#scrollTo=xNNdQKf4oU2Q) to learn how to pull a video from YouTube and process its speech.
3. Upload bespoke audio (the beginning of this workshop) and process it.

## The Vosk tutorial

This section of the notebook is based on the [Vosk-API Python example](https://github.com/alphacep/vosk-api/tree/master/python/example).

In [None]:
# Install Vosk and clone the Vosk-API to the Colab storage area

import sys
!{sys.executable} -m pip install vosk
!git clone https://github.com/alphacep/vosk-api

In [None]:
# Move some files from the Python example directory to the root directory.

%mv vosk-api/python/example/test_simple.py ./test_simple.py
%mv vosk-api/python/example/test.wav ./test.wav

# Download, unzip, and rename the English model.
# The built-in test script expects to find a model in a folder called 
# "model" in the same directory as the script.

!wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.15.zip
!unzip vosk-model-small-en-us-0.15.zip
%mv vosk-model-small-en-us-0.15 model

In [None]:
# Let's play the audio file

from IPython.display import Audio

Audio("test.wav")

In [None]:
# Let's execute the sample script

!{sys.executable} ./test_simple.py test.wav

Our ears hear:

> one zero zero zero one nine oh two one oh zero one eight zero three

However, the model transcribes: 

> one zero zero zero one nah no to i know zero one eight zero three

In other words, it's not perfect. But maybe it's a bit of a time saver. Note that the phrase the model stumbles on, "nine oh two one oh," is idiomatic. Presumably, if the speaker had said "nine zero two one zero," the model would have understood, but the idiomatic "oh" in place of "zero" may have suggested to the model that it was not listening to numbers.
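
One common way to put a number on "not perfect" is the word error rate: the word-level edit distance between what was said and what was transcribed, divided by the number of words said. The sketch below is an illustration rather than part of the Vosk tutorial; `word_error_rate` is a hypothetical helper, and it assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


heard = "one zero zero zero one nine oh two one oh zero one eight zero three"
transcribed = "one zero zero zero one nah no to i know zero one eight zero three"
wer = word_error_rate(heard, transcribed)
print(f"Word error rate: {wer:.2f}")
```

Here five of the fifteen spoken words are wrong, so the word error rate comes out to about one third.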


## Vosk in Mandarin

Now we can download a video from YouTube, strip the audio, and process it using the Chinese model. We'll use this six-minute video:

In [None]:
from IPython.display import YouTubeVideo

video_id = "cNSq5RdVf28"

YouTubeVideo(video_id)

In [None]:
# Let's download the video and extract the audio

video_url = f"https://youtu.be/{video_id}"

!{sys.executable} -m pip install -q youtube-dl
!youtube-dl --extract-audio --audio-format wav --output "extract.%(ext)s" {video_url}

In [None]:
# The audio is in stereo, which will cause problems down the line. Let's
# convert it to a mono, 16 kHz, 16-bit PCM WAV here.
!apt install -y ffmpeg
!ffmpeg -i extract.wav -vn -acodec pcm_s16le -ac 1 -ar 16000 -f wav extract-mono.wav
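
In the `ffmpeg` call above, `-ac 1` requests a single channel, `-ar 16000` a 16 kHz sample rate, and `pcm_s16le` signed 16-bit little-endian samples. As a rough sketch of how to confirm that a file meets those requirements, the following uses only Python's standard `wave` module. The helper `describe_wav` and the file `tone.wav` are hypothetical; the script writes a synthetic one-second tone so that it runs without any download.

```python
import math
import struct
import wave


def describe_wav(path):
    """Return channel count, sample width in bytes, frame rate, and duration."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_width": wf.getsampwidth(),
            "framerate": wf.getframerate(),
            "seconds": wf.getnframes() / wf.getframerate(),
        }


# Write one second of a 440 Hz tone as mono, 16-bit, 16 kHz PCM.
rate = 16000
samples = [int(12000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate)]
with wave.open("tone.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(rate)
    wf.writeframes(struct.pack(f"<{len(samples)}h", *samples))

info = describe_wav("tone.wav")
print(info)
```

Calling `describe_wav("extract-mono.wav")` after the conversion should report one channel and a sample width of two bytes if the conversion worked as intended.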

In [None]:
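# We now have a .wav file. What do we know about it?
import librosa
%matplotlib inline
import matplotlib.pyplot as plt
import librosa.display

def getWavInfo(file):
  # sr=None keeps the file's own sample rate. By default, librosa resamples
  # to 22,050 Hz, which would misreport the framerate.
  data, framerate = librosa.load(file, sr=None)
  print(f"The framerate is {framerate} Hz")
  print(f"The duration is {len(data)/framerate} s")
  plt.figure(figsize=(14, 5))
  # waveshow replaces waveplot, which was removed in librosa 0.10.
  librosa.display.waveshow(data, sr=framerate)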

getWavInfo("extract-mono.wav")

In [None]:
# And the test file, just for laughs?

getWavInfo("test.wav")

In [None]:
# Next, let's get the Chinese model.

!wget https://alphacephei.com/vosk/models/vosk-model-small-cn-0.3.zip 
!unzip vosk-model-small-cn-0.3.zip
%mv vosk-model-small-cn-0.3 model-cn
# Clean up the downloaded zip archives (the model folders were renamed above).
!rm -rf vosk-model*

In [None]:
# As noted above, the default test script expects a "model" folder, but we now
# have a "model" folder for English and a "model-cn" folder for Chinese. Also, I
# think we can do better for output options, so let's retool the script:

from vosk import Model, KaldiRecognizer, SetLogLevel
import os
import wave
import json

def transcribe_simple(file, model_path="model"):
  SetLogLevel(0)
  if not os.path.exists(model_path):
    raise FileNotFoundError(
        "Please download a model from https://alphacephei.com/vosk/models "
        f"and unpack it as '{model_path}' in the current folder.")
  wf = wave.open(file, "rb")
  if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    wf.close()
    raise ValueError("Audio file must be WAV format mono PCM.")
  model = Model(model_path)
  rec = KaldiRecognizer(model, wf.getframerate())
  rec.SetWords(True)
  segments = []
  while True:
    data = wf.readframes(4000)
    if len(data) == 0:
      break
    # AcceptWaveform returns True when an utterance ends, so keep its result.
    if rec.AcceptWaveform(data):
      segments.append(json.loads(rec.Result()))
  wf.close()
  # FinalResult flushes whatever audio is still buffered.
  segments.append(json.loads(rec.FinalResult()))
  # Merge all utterances into one dictionary with the same keys Vosk uses.
  words = [w for s in segments for w in s.get("result", [])]
  text = " ".join(s["text"] for s in segments if s.get("text"))
  return {"text": text, "result": words}

In [None]:
# With the modified script, let's try out the test.wav.
transcription = transcribe_simple("test.wav")
transcription['text']

In [None]:
# And the Chinese wav?
transcription = transcribe_simple("extract-mono.wav", "model-cn")


In [None]:
transcription['text']
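
Because `transcribe_simple` calls `rec.SetWords(True)`, the recognizer's output can also carry a `result` list with a `word`, `start`, `end`, and `conf` entry for each word. As one possible use of those timings, the sketch below groups words into subtitle cues in SRT format. The helpers `srt_timestamp` and `words_to_srt` are hypothetical, and the sample words are invented in the shape of a Vosk word list rather than taken from real output.

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp that SRT files use."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, words_per_cue=5):
    """Group word entries into numbered SRT cues of a fixed word count."""
    cues = []
    for index in range(0, len(words), words_per_cue):
        group = words[index:index + words_per_cue]
        start = srt_timestamp(group[0]["start"])
        end = srt_timestamp(group[-1]["end"])
        text = " ".join(entry["word"] for entry in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Invented sample in the shape of a Vosk word list, not real output.
sample_words = [
    {"word": "one", "start": 0.84, "end": 1.17, "conf": 1.0},
    {"word": "zero", "start": 1.17, "end": 1.5, "conf": 1.0},
    {"word": "three", "start": 1.62, "end": 2.01, "conf": 0.9},
]
srt = words_to_srt(sample_words, words_per_cue=2)
print(srt)
```

A fixed word count per cue is the simplest grouping; splitting on long pauses between one word's `end` and the next word's `start` would likely read more naturally.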

## Vosk with Whatever

Now we can move to our own files if we want to use them. Similarly, we can download [various other Vosk models](https://alphacephei.com/vosk/models) and extract audio from other YouTube videos. For example:

- [Indian English (small)](https://alphacephei.com/vosk/models/vosk-model-small-en-in-0.4.zip)
- [Russian (small)](https://alphacephei.com/vosk/models/vosk-model-small-ru-0.22.zip)
- [French (small)](https://alphacephei.com/vosk/models/vosk-model-small-fr-0.22.zip)
- [German (small)](https://alphacephei.com/vosk/models/vosk-model-small-de-0.15.zip)
- [Spanish (small)](https://alphacephei.com/vosk/models/vosk-model-small-es-0.22.zip)
- [Brazilian Portuguese (small)](https://alphacephei.com/vosk/models/vosk-model-small-pt-0.3.zip)
- [Turkish (small)](https://alphacephei.com/vosk/models/vosk-model-small-tr-0.3.zip)
- [Hindi (small)](https://alphacephei.com/vosk/models/vosk-model-small-hi-0.22.zip)
- [Esperanto (small)](https://alphacephei.com/vosk/models/vosk-model-small-eo-0.22.zip)

and so on... What will we transcribe next?
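
The models listed above share a URL pattern, so the `wget`, `unzip`, and `mv` steps used earlier can be wrapped in a small helper. This is a minimal sketch using only the standard library; `model_url` and `fetch_model` are hypothetical names, and the code assumes each archive unpacks to a folder named after the model, as the English and Chinese models did.

```python
import os
import urllib.request
import zipfile

MODEL_BASE_URL = "https://alphacephei.com/vosk/models"


def model_url(name):
    """Build the download URL for a model name taken from the list above."""
    return f"{MODEL_BASE_URL}/{name}.zip"


def fetch_model(name, target_dir):
    """Download and unpack a model, then rename its folder to target_dir."""
    if os.path.exists(target_dir):
        return target_dir  # Already unpacked, so skip the download.
    archive = f"{name}.zip"
    urllib.request.urlretrieve(model_url(name), archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(".")
    # Assumes the archive unpacks to a folder called `name`.
    os.rename(name, target_dir)
    os.remove(archive)
    return target_dir


print(model_url("vosk-model-small-fr-0.22"))
# To download for real (network access required):
# fetch_model("vosk-model-small-fr-0.22", "model-fr")
```

The returned folder name can then be passed as the `model_path` argument of `transcribe_simple`.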