![Automatic speech recognition](https://raw.githubusercontent.com/arvest-data-in-context/ml-notebooks/refs/heads/main/docs/images/notebooks/automatic-speech-recognition.png)

In this notebook, we shall find a video that we have stored on Arvest, and then extract the words that are spoken using a speech recognition model provided by [vosk](https://alphacephei.com/vosk/). Once this is done, we shall take the results and build an interactive IIIF Manifest which can be viewed directly in [Arvest](https://arvest.app).

# 0. Setup

Let's begin by installing and importing all of the different components we will need.

In [None]:
print("Installing and importing packages...")

# Uninstall and reinstall packages for a clean environment
!pip uninstall -q -y arvestapi
!pip uninstall -q -y arvesttools
!pip uninstall -q -y jhutils
!pip uninstall -q -y iiif_prezi3
!pip uninstall -q -y dvt
!pip install -q --disable-pip-version-check git+https://github.com/arvest-data-in-context/arvest-api.git
!pip install -q --disable-pip-version-check git+https://github.com/arvest-data-in-context/arvest-api-tools.git
!pip install -q --disable-pip-version-check git+https://github.com/jdchart/jh-py-utils.git
!pip install -q --disable-pip-version-check git+https://github.com/iiif-prezi/iiif-prezi3.git
!pip install -q --disable-pip-version-check git+https://github.com/distant-viewing/dvt.git
!pip install -q --disable-pip-version-check vosk
!pip install -q --disable-pip-version-check librosa
!pip install -q --disable-pip-version-check scipy

# Import packages
import arvestapi
import arvesttools.manifest_creation
from jhutils.local_files import read_json, write_json
import jhutils.online_files
from jhutils.misc import print_progress_bar_colab, slugify
import os
import dvt
import iiif_prezi3
import shutil
import numpy as np
import vosk
import librosa
import wave
from scipy.io import wavfile
from scipy.signal import wiener
import json
import mimetypes
mimetypes.add_type('image/webp', '.webp')

TEMP_FOLDER = os.path.join(os.getcwd(), "_TEMP")
os.makedirs(TEMP_FOLDER, exist_ok=True)

print("👍 Ready!")
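
Before going further, it can be useful to confirm that every package was installed correctly. The following is a minimal sketch and not part of the original workflow: `missing_modules()` is a hypothetical helper that uses the standard library's `importlib.util.find_spec()` to report which top-level module names cannot be found.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level module names imported by this notebook.
REQUIRED = ["arvestapi", "arvesttools", "jhutils", "iiif_prezi3", "dvt",
            "vosk", "librosa", "scipy", "numpy"]

missing = missing_modules(REQUIRED)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All packages found.")
```

If anything is listed as missing, re-running the installation cell is usually the first thing to try.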

Next, let's get the vosk model we want to use. The model we use will depend on the language of the content we wish to analyse - [here is a list](https://alphacephei.com/vosk/models) of the models that vosk has available. Change the `MODEL_NAME` variable to the model you wish to download and use.

The `MODEL_PATH` variable allows you to choose where you would like to save the model. If the folder doesn't exist, it will be created. Even if you have already downloaded the model, run this cell anyway, as it lets the rest of the notebook know where to find the model.

In [None]:
# Change these variables if needed:
MODEL_NAME = "vosk-model-small-en-us-0.15"
MODEL_PATH = os.path.join(os.getcwd(), "models")

# Create folder if needed:
os.makedirs(MODEL_PATH, exist_ok=True)

# Download model if it doesn't already exist:
if not os.path.isdir(os.path.join(MODEL_PATH, MODEL_NAME)):
    print(f"Downloading model \"{MODEL_NAME}\" to {MODEL_PATH}...")
    # Build the URL with string formatting: os.path.join would insert backslashes on Windows.
    jhutils.online_files.download_zip(f"https://alphacephei.com/vosk/models/{MODEL_NAME}.zip", MODEL_PATH)

# Load the model:
model = vosk.Model(os.path.join(MODEL_PATH, MODEL_NAME))

print(f"👍 {MODEL_NAME} model ready and downloaded!")
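
The download step can also be sketched with the standard library alone, which may help if `jhutils` is unavailable. This is a hedged illustration, not the notebook's own method: `model_zip_url()` and `ensure_model()` are hypothetical helpers, and the sketch assumes, as the cell above does, that the archive unpacks into a folder named after the model.

```python
import os
import urllib.request
import zipfile

VOSK_MODELS_URL = "https://alphacephei.com/vosk/models"


def model_zip_url(model_name):
    """Build the download URL for a model archive (URLs always use forward slashes)."""
    return f"{VOSK_MODELS_URL}/{model_name}.zip"


def ensure_model(model_name, model_dir):
    """Download and unzip the model into model_dir unless it is already there."""
    target = os.path.join(model_dir, model_name)
    if os.path.isdir(target):
        return target  # Already downloaded: nothing to do.

    os.makedirs(model_dir, exist_ok=True)
    zip_path = os.path.join(model_dir, f"{model_name}.zip")
    urllib.request.urlretrieve(model_zip_url(model_name), zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(model_dir)
    os.remove(zip_path)  # The archive is no longer needed once extracted.
    return target


print(model_zip_url("vosk-model-small-en-us-0.15"))
```

Calling `ensure_model(MODEL_NAME, MODEL_PATH)` would then return the folder to pass to `vosk.Model()`.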

# 1. Find our video
The first step is to get the video that we wish to process. We have ours stored on our Arvest account, and we have given its metadata `identifier` field the value `"API-TUTORIAL-CONTENT-SPEECH-REC"`. This allows us to find our media using the [Arvest API](https://github.com/arvest-data-in-context/arvest-api).

First, we need to "connect" to Arvest using the Arvest API package. For this, we need our user email and our password which we will give to an instance of the `arvestapi.Arvest()` class.

In [None]:
EMAIL = "my_email@something.com"
PASSWORD = "myarvestpassword"

ar = arvestapi.Arvest(EMAIL, PASSWORD)
print(f"👍 Successfully connected to Arvest with \"{ar.profile.name}\"")
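
Writing a password directly into a notebook makes it easy to share by accident. As a hedged alternative, the sketch below reads the credentials from environment variables and falls back to an interactive prompt. The variable names `ARVEST_EMAIL` and `ARVEST_PASSWORD` and the helper `load_credentials()` are assumptions made for this example; they are not defined by the Arvest API.

```python
import os
from getpass import getpass


def load_credentials(env=os.environ):
    """Read the email and password from environment variables, prompting if absent."""
    email = env.get("ARVEST_EMAIL") or input("Arvest email: ")
    # getpass hides the password while it is being typed.
    password = env.get("ARVEST_PASSWORD") or getpass("Arvest password: ")
    return email, password
```

With this in place, `EMAIL, PASSWORD = load_credentials()` could replace the two hard-coded assignments above.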

Next, we'll get all of our media using the `get_medias()` function, and search until we find the right video.

In [None]:
found_media = []
media_items = ar.get_medias()

for media_item in media_items:
    media_item_metadata = media_item.get_metadata()
    if media_item_metadata["identifier"] == "API-TUTORIAL-CONTENT-SPEECH-REC":
        found_media.append(media_item)

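print(f"🔍 Found {len(found_media)} media files corresponding to search criteria.")

# Stop with a clear message if nothing matched, rather than failing with an IndexError:
if not found_media:
    raise ValueError("No media found with the identifier \"API-TUTORIAL-CONTENT-SPEECH-REC\".")

video_item = found_media[0]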
print(f"Treating first item: \"{video_item.title}\"")
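
The search logic can be tried on its own with plain dictionaries. This sketch assumes that the metadata returned by `get_metadata()` behaves like a dictionary (the cell above indexes it with square brackets); `filter_by_identifier()` is a hypothetical helper, and it uses `.get()` so that items without an `identifier` field are skipped rather than raising a `KeyError`.

```python
def filter_by_identifier(metadata_list, identifier):
    """Return the indices of the metadata dictionaries whose identifier matches."""
    return [
        index
        for index, metadata in enumerate(metadata_list)
        if metadata.get("identifier") == identifier
    ]


# Made-up metadata, standing in for the values returned by get_metadata():
sample_metadata = [
    {"title": "Interview", "identifier": "API-TUTORIAL-CONTENT-SPEECH-REC"},
    {"title": "No identifier set"},
    {"title": "Other", "identifier": "SOMETHING-ELSE"},
]

matches = filter_by_identifier(sample_metadata, "API-TUTORIAL-CONTENT-SPEECH-REC")
print(matches)
```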

Next, we need to download the corresponding video in order to analyze it. To do this, we shall use the `download()` helper function from `jhutils.online_files`.

In [None]:
local_video_path = jhutils.online_files.download(video_item.get_full_url(), dir = TEMP_FOLDER)

print(f"👍 Video downloaded to {local_video_path}")

Finally, let's convert the video into a mono WAV file with a sample rate of 16000 Hz - this is the type of file that vosk will accept for processing.

In [None]:
audio_file_path = os.path.join(TEMP_FOLDER, os.path.splitext(os.path.basename(local_video_path))[0] + '.wav')
# -y overwrites the output file if this cell is run more than once:
!ffmpeg -y -i "{local_video_path}" -ar 16000 -ac 1 "{audio_file_path}" > /dev/null 2>&1
print("👍 Converted to audio!")
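
The shell command above hides ffmpeg's output, so a failed conversion goes unnoticed until a later cell breaks. The following is a minimal sketch of the same conversion run through `subprocess`, which raises an error on failure. `build_ffmpeg_command()` and `convert_to_wav()` are hypothetical helpers; the flags used (`-y`, `-i`, `-vn`, `-ar`, `-ac`) are standard ffmpeg options.

```python
import shutil
import subprocess


def build_ffmpeg_command(video_path, audio_path, sample_rate=16000):
    """Return the ffmpeg argument list for a mono WAV at the given sample rate."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it already exists
        "-i", video_path,         # input video
        "-vn",                    # ignore the video stream
        "-ar", str(sample_rate),  # audio sample rate in Hz
        "-ac", "1",               # one audio channel (mono)
        audio_path,
    ]


def convert_to_wav(video_path, audio_path, sample_rate=16000):
    """Run ffmpeg and raise an error if it is missing or the conversion fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on this system.")
    subprocess.run(
        build_ffmpeg_command(video_path, audio_path, sample_rate),
        check=True,
        capture_output=True,
    )


print(" ".join(build_ffmpeg_command("video.mp4", "video.wav")))
```

Passing the arguments as a list avoids quoting problems when a file name contains spaces.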

# 2. Audio pre-processing
There are a few other things to do in order to make sure that the file will work best with vosk. Here, we perform some simple noise reduction and audio normalization, and then scale the audio to a 16-bit depth.

In [None]:
print(f"Processing \"{os.path.basename(audio_file_path)}\"...")
    
# Load the audio file with librosa:
audio_data, sample_rate = librosa.load(audio_file_path, sr = None)

# Perform noise reduction and normalization:
noise_reduction = wiener(audio_data)
normalized = librosa.util.normalize(noise_reduction)

# Scale to 16 bit depth for vosk:
scaled = np.int16(normalized * 32767)

# Output file:
wavfile.write(audio_file_path, sample_rate, scaled)
print("👍 Finished!")
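
Once the file has been written, its format can be checked with the standard library's `wave` module before it is passed to the recognizer. This is a hedged sketch: `is_vosk_ready()` is a hypothetical helper that checks the properties this notebook aims for (mono, 16-bit, 16000 Hz), and the test tone is made-up data used only so that the example can run without a real file.

```python
import math
import os
import struct
import tempfile
import wave


def is_vosk_ready(wav_path, expected_rate=16000):
    """Check that a WAV file is mono, 16-bit and at the expected sample rate."""
    with wave.open(wav_path, "rb") as wf:
        return (
            wf.getnchannels() == 1
            and wf.getsampwidth() == 2  # 2 bytes per sample = 16 bits
            and wf.getframerate() == expected_rate
        )


# Write a short 440 Hz test tone (0.1 seconds at half amplitude).
demo_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
samples = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(1600)]
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack(f"<{len(samples)}h", *samples))

print(is_vosk_ready(demo_path))
```

In the notebook itself, the equivalent call would be `is_vosk_ready(audio_file_path)`.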

# 3. Perform Analysis
Now that we have our audio source, we can run the speech recognition model. We shall use an instance of vosk's `KaldiRecognizer` class to do this.

In [None]:
print(f"Running speech recognition...")

# Create the vosk recognizer:
recognizer = vosk.KaldiRecognizer(model, 16000)
recognizer.SetWords(True)

# Open the audio file:
with wave.open(audio_file_path, 'rb') as wf:
    audio_data = wf.readframes(wf.getnframes())

# Run the model:
recognizer.AcceptWaveform(audio_data)
# The "result" key may be absent when no words are recognized, so fall back to an empty list:
result = json.loads(recognizer.Result()).get("result", [])

# Print the results:
print(f"👍 Analysis complete! Found {len(result)} words.")

full_string = " ".join(item['word'] for item in result)
print(f"\"{full_string}\"")
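
Each entry in the result describes one word, with the `word`, `conf`, `start` and `end` fields that are used later in this notebook. The sketch below works on a made-up response of that shape, so it can be run without a model. `extract_words()` and `low_confidence_words()` are hypothetical helpers, and the threshold of 0.9 is an arbitrary value chosen for illustration.

```python
import json

# A made-up response in the shape returned when word details are enabled:
sample_response = json.dumps({
    "result": [
        {"conf": 1.0, "start": 0.42, "end": 0.81, "word": "hello"},
        {"conf": 0.87, "start": 0.81, "end": 1.35, "word": "world"},
    ],
    "text": "hello world",
})


def extract_words(response_json):
    """Return the list of word entries, or an empty list if there are none."""
    return json.loads(response_json).get("result", [])


def low_confidence_words(words, threshold=0.9):
    """Return the words whose confidence falls below the threshold."""
    return [w["word"] for w in words if w["conf"] < threshold]


words = extract_words(sample_response)
print(" ".join(w["word"] for w in words))
print(low_confidence_words(words))
```

Listing the low-confidence words is a quick way to spot passages that may need manual correction.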

# 4. Export to Arvest
Finally, we shall export the results of our analysis to an interactive IIIF Manifest that can be opened in Arvest. Let's begin by creating the basic Manifest with the [arvesttools](https://github.com/arvest-data-in-context/arvest-api-tools) package's `media_to_manifest()` function.

In [None]:
manifest = arvesttools.manifest_creation.media_to_manifest(video_item)
print("👍 Manifest created!")

Next, let's add timed annotations to the main Canvas, one for each word.

In [None]:
print("Adding annotations...")
for i, word in enumerate(result):
    print_progress_bar_colab(i + 1, len(result), f"(word {i + 1}/{len(result)})")

    arvesttools.manifest_creation.add_textual_annotation(
        manifest,
        text_content = f"<p><strong>{word['word']}</strong><br>Confidence: {word['conf']}<br>(<em>{word['start']}-{word['end']}</em>)</p>",
        t = {"start" : word['start'], "end" : word['end']}
    )

print("👍 Finished")
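
In IIIF Presentation 3.0, an annotation is tied to a time range by appending a temporal fragment of the form `#t=start,end` to the ID of the target Canvas. The sketch below shows what such an annotation could look like as a Python dictionary. It is an assumption about the general shape only: the exact structure produced by `add_textual_annotation()` is not shown in this notebook, `timed_annotation()` is a hypothetical helper, and the IDs are made up.

```python
import html


def timed_annotation(annotation_id, canvas_id, word):
    """Build a IIIF-style annotation dictionary for one recognized word."""
    body = (
        f"<p><strong>{html.escape(word['word'])}</strong>"
        f"<br>Confidence: {word['conf']}</p>"
    )
    return {
        "id": annotation_id,
        "type": "Annotation",
        "motivation": "commenting",
        "body": {"type": "TextualBody", "value": body, "format": "text/html"},
        # The temporal fragment restricts the annotation to a time range in seconds.
        "target": f"{canvas_id}#t={word['start']},{word['end']}",
    }


annotation = timed_annotation(
    "https://example.org/annotation/1",  # made-up ID
    "https://example.org/canvas/1",      # made-up ID
    {"conf": 1.0, "start": 0.42, "end": 0.81, "word": "hello"},
)
print(annotation["target"])
```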

Finally, we can upload the Manifest to Arvest. You can either go and find it in your [workspace](https://workspace.arvest.app/) or view it at the URL given below.

In [None]:
# Save to disk
local_path = os.path.join(TEMP_FOLDER, f"{slugify(video_item.title)}-speech-recognition.json")
write_json(local_path, manifest.dict())

# Upload Manifest:
added_manifest = ar.add_manifest(path = local_path, update_id = True)
added_manifest.update_title(f"{video_item.title} (automatic speech recognition)")
added_manifest.update_description("A Manifest annotated using an automatic speech recognition model.")
if video_item.thumbnail_url is not None:
    added_manifest.update_thumbnail_url(video_item.thumbnail_url)

# Update metadata:
manifest_metadata = added_manifest.get_metadata()
manifest_metadata["creator"] = "Automatic speech recognition tutorial"
manifest_metadata["identifier"] = "&&API-TUTORIAL-AUTO-SPEECH-REC"
added_manifest.update_metadata(manifest_metadata)

print(f"👍 Manifest created, view it here: {added_manifest.get_preview_url()}")
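
Before uploading, it can be reassuring to check that the saved file contains as many annotations as there were recognized words. The sketch below counts annotations in a manifest dictionary, assuming the standard IIIF Presentation 3.0 layout in which each Canvas holds a list of annotation pages under `annotations`. Whether `arvesttools` stores its annotations exactly there is an assumption, `count_annotations()` is a hypothetical helper, and the sample manifest is made up and heavily reduced.

```python
def count_annotations(manifest_dict):
    """Count the annotations attached to every Canvas of a manifest dictionary."""
    total = 0
    for canvas in manifest_dict.get("items", []):
        for page in canvas.get("annotations", []):
            total += len(page.get("items", []))
    return total


# A made-up, heavily reduced manifest used only to try the function:
sample_manifest = {
    "type": "Manifest",
    "items": [
        {
            "type": "Canvas",
            "annotations": [
                {
                    "type": "AnnotationPage",
                    "items": [{"type": "Annotation"}, {"type": "Annotation"}],
                }
            ],
        }
    ],
}

print(count_annotations(sample_manifest))
```

Under that assumption, `count_annotations(read_json(local_path))` would be expected to equal `len(result)`.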

# 5. Cleanup
To finish, let's clean up our mess! First, we can delete the temporary folder.

In [None]:
shutil.rmtree(TEMP_FOLDER)
print(f"🗑️ {TEMP_FOLDER} removed!")
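
For scripts that run from start to finish in one go, the standard library offers a temporary folder that deletes itself. The following sketch shows the idea with `tempfile.TemporaryDirectory`; it is an alternative for illustration, not the approach used in this notebook.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as temp_folder:
    # Files written here exist only for the duration of the "with" block.
    scratch_file = os.path.join(temp_folder, "scratch.txt")
    with open(scratch_file, "w") as f:
        f.write("temporary data")
    existed_inside = os.path.isfile(scratch_file)

# The folder and its contents have been deleted at this point.
print(existed_inside, os.path.isdir(temp_folder))
```

In a notebook, where the work is spread over several cells, a single `with` block is not practical, so an explicit `shutil.rmtree()` at the end remains the simpler choice.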

And finally, we can remove the Manifests we created from Arvest. We can get all of our Manifests by using the `get_manifests()` function, then check the metadata of each one. If it's one of the files we want to remove, we can then use the `remove()` function.

**⚠️ Warning: there's no going back after using the remove function, so be careful! To avoid accidental removal, we've added a `REMOVE` variable that needs to be set to `True` for the code to run.**

In [None]:
REMOVE = False

if REMOVE:
    all_manifests = ar.get_manifests()
    count = 0
    print("Removing manifests...")

    for i, media_file in enumerate(all_manifests):
        print_progress_bar_colab(i + 1, len(all_manifests), f"(Processing file {i + 1}/{len(all_manifests)})")
        media_metadata = media_file.get_metadata()
        if media_metadata["creator"] == "Automatic speech recognition tutorial" and media_metadata["identifier"] == "&&API-TUTORIAL-AUTO-SPEECH-REC":
            media_file.remove()
            count = count + 1

    print(f"🗑️ Removed {count} items!")
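
Because removal cannot be undone, a dry run that only lists what would be deleted is a useful extra safeguard. The sketch below applies the same two metadata conditions to made-up dictionaries. It assumes that the metadata behaves like a dictionary, and `is_tutorial_manifest()` is a hypothetical helper.

```python
TUTORIAL_CREATOR = "Automatic speech recognition tutorial"
TUTORIAL_IDENTIFIER = "&&API-TUTORIAL-AUTO-SPEECH-REC"


def is_tutorial_manifest(metadata):
    """Return True only when both the creator and the identifier match."""
    return (
        metadata.get("creator") == TUTORIAL_CREATOR
        and metadata.get("identifier") == TUTORIAL_IDENTIFIER
    )


# Made-up metadata standing in for the values returned by get_metadata():
sample_metadata = [
    {"title": "Tutorial output", "creator": TUTORIAL_CREATOR, "identifier": TUTORIAL_IDENTIFIER},
    {"title": "Same creator, other identifier", "creator": TUTORIAL_CREATOR, "identifier": "OTHER"},
    {"title": "Unrelated manifest"},
]

to_remove = [m["title"] for m in sample_metadata if is_tutorial_manifest(m)]
print(f"Would remove {len(to_remove)} item(s): {to_remove}")
```

Reviewing this list before setting `REMOVE` to `True` helps confirm that only the tutorial's own Manifests match.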