# Introduction
## Implementing a Speech Recognition capability in DataRobot

This accelerator demonstrates how to use DataRobot custom models to deploy a speech recognition capability based on the OpenAI Whisper models (it currently uses the "base" model). Deploying it this way lets the capability use the DataRobot environment and resources, in the cloud or on-premises.

Possible Enhancements
- Using alternative models from the OpenAI Whisper set (larger or smaller), or a different model altogether
- Adding custom metrics to track in DataRobot MLOps (e.g. total audio length, average audio length)
- Adding the ability to batch process files in a given URL location
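
As an illustration of the custom metrics idea, the sketch below derives total and average audio length from transcription results. It assumes each result is a dictionary with a `segments` list carrying `start` and `end` times in seconds, as in the Whisper output shown later in this notebook. The function names and the sample data are hypothetical, and the sketch does not report anything to DataRobot.

```python
from statistics import mean


def audio_length_seconds(result: dict) -> float:
    """Approximate audio length as the latest segment end time (0.0 if no segments)."""
    segments = result.get("segments", [])
    return max((seg["end"] for seg in segments), default=0.0)


def summarize_audio_lengths(results: list) -> dict:
    """Compute total and average audio length over several transcription results."""
    lengths = [audio_length_seconds(r) for r in results]
    return {
        "total_audio_length": sum(lengths),
        "average_audio_length": mean(lengths) if lengths else 0.0,
    }


# Hypothetical, hand-written results shaped like Whisper's output
sample_results = [
    {"text": " a", "segments": [{"start": 0.0, "end": 1.5}, {"start": 1.5, "end": 4.0}]},
    {"text": " b", "segments": [{"start": 0.0, "end": 2.0}]},
]
summary = summarize_audio_lengths(sample_results)
print(summary)
```

The last segment's end time only approximates the true audio length, since trailing silence is not covered by a segment.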

References:

https://openai.com/research/whisper

https://ffmpeg.org

# Setup
## READ BEFORE STARTING NOTEBOOK
1. Use a Python 3.9 notebook and deployment environment
2. Enable the following **feature flags** on your account:
    - Enable Notebooks Filesystem Management
    - Enable Public Network Access for all Custom Models
3. Enable the notebook filesystem for this notebook in the notebook sidebar
4. Set the notebook session timeout to 180 minutes
5. Upload the ffmpeg file into the notebook filesystem (see below)
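
For step 5, one possible way to pull the single `ffmpeg` binary out of a downloaded static build archive is sketched below, using only the Python standard library. The helper name `extract_ffmpeg` is hypothetical, and the sketch assumes the archive is a `.tar.xz` file that contains a file named exactly `ffmpeg`.

```python
import os
import tarfile


def extract_ffmpeg(archive_path: str, dest_dir: str) -> str:
    """Extract only the file named 'ffmpeg' from a .tar.xz archive into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, "ffmpeg")
    with tarfile.open(archive_path, "r:xz") as archive:
        for member in archive.getmembers():
            # Skip directories and any other files (docs, ffprobe, ...)
            if member.isfile() and os.path.basename(member.name) == "ffmpeg":
                source = archive.extractfile(member)
                with source, open(target, "wb") as out:
                    out.write(source.read())
                return target
    raise FileNotFoundError("No 'ffmpeg' file found in the archive")


# Example call (assumes the archive was downloaded next to this script):
# extract_ffmpeg("ffmpeg-release-amd64-static.tar.xz", "./storage")
```

The extracted file still needs to end up directly under `storage` in the notebook filesystem.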

In [1]:
# Download a static Linux build of ffmpeg, and upload the >>>single<<< "ffmpeg" file into the notebook
# filesystem (directly under storage)
# Source: https://johnvansickle.com/ffmpeg/
# For example: https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
# Tested with ffmpeg-6.1-amd64-static

import os

ffmpeg_file_path = "./storage/ffmpeg"

# Fail early with a clear message if ffmpeg has not been uploaded
if not os.path.isfile(ffmpeg_file_path):
    raise RuntimeError(
        "Please follow the setup steps before running the notebook to upload ffmpeg."
    )
print("Found ffmpeg file")

Found ffmpeg file


In [2]:
!pip install -U openai-whisper \
                ffmpeg \
                datarobotx

Collecting openai-whisper
  Downloading openai-whisper-20231117.tar.gz (798 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting ffmpeg
  Downloading ffmpeg-1.4.tar.gz (5.1 kB)
Collecting datarobotx
  Downloading datarobotx-0.1.20-py3-none-any.whl (177 kB)
Collecting tiktoken
  Downloading tiktoken-0.5.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Collecting torch
  Downloading torch-2.1.2-cp39-cp39-manylinux1_x86_64.whl (670.2 MB)
Collecting triton<3,>=2.0.0
  Downloading triton-2.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (167.9 MB)
Collecting ipywidgets
  Downloading ipywidgets-8.1.1-py3-none-any.whl (139 kB)
Collecting names-generator
  Downloading names_generator-0.1.0-py3-none-any.whl (26 kB)
Collecting regex>=2022.1.18
  Downloading regex-2023.12.25-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (773 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26; platform_system == "Linux" and platform_machine == "x86_64"
  Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-nvtx-cu12==12.1.105; platform_system == "Linux" and platform_machine == "x86_64"
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105; platform_system == "Linux" and platform_machine == "x86_64"
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cufft-cu12==11.0.2.54; platform_system == "Linux" and platform_machine == "x86_64"
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
Collecting nvidia-cusolver-cu12==11.4.5.107; platform_system == "Linux" and platform_machine == "x86_64"
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)
Collecting nvidia-curand-cu12==10.3.2.106; platform_system == "Linux" and platform_machine == "x86_64"
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
Collecting sympy
  Downloading sympy-1.12-py3-none-any.whl (5.7 MB)
Collecting nvidia-cusparse-cu12==12.1.0.106; platform_system == "Linux" and platform_machine == "x86_64"
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)
Collecting nvidia-nccl-cu12==2.18.1; platform_system == "Linux" and platform_machine == "x86_64"
  Downloading nvidia_nccl_cu12-2.18.1-py3-none-manylinux1_x86_64.whl (209.8 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == "Linux" and platform_machine == "x86_64"
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1; platform_system == "Linux" and platform_machine == "x86_64"
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Collecting nvidia-cuda-cupti-cu12==12.1.105; platform_system == "Linux" and platform_machine == "x86_64"
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting widgetsnbextension~=4.0.9
  Downloading widgetsnbextension-4.0.9-py3-none-any.whl (2.3 MB)
Collecting jupyterlab-widgets~=3.0.9
  Downloading jupyterlab_widgets-3.0.9-py3-none-any.whl (214 kB)
Collecting cmdkit>=2.1.2
  Downloading cmdkit-2.7.3-py3-none-any.whl (26 kB)
Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (20.5 MB)
Collecting mpmath>=0.19
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
Collecting toml>=0.10.1
  Downloading toml-0.10.2-py2.py3-none-any.whl (16 kB)
Building wheels for collected packages: openai-whisper, ffmpeg
  Building wheel for openai-whisper (PEP 517) ... done
  Created wheel for openai-whisper: filename=openai_whisper-20231117-py3-none-any.whl size=801356 sha256=be89ef2448b07e0d06cc6f8ac13ffad42781135942034cdb2f9c4b3a29df8f01
  Stored in directory: /tmp/pip-ephem-wheel-cache-710f3ttd/wheels/f5/77/96/4bb7b94449a47b726127100ad66bd72cba123fb4d0a8948473
  Building wheel for ffmpeg (setup.py) ... done
  Created wheel for ffmpeg: filename=ffmpeg-1.4-py3-none-any.whl size=6080 sha256=a36ef8b651c5ec4449350c8e5fe2e40e86d5f6519af1870a7a0d89fd13b50b1b
  Stored in directory: /tmp/pip-ephem-wheel-cache-710f3ttd/wheels/1d/57/24/4eff6a03a9ea0e647568e8a5a0546cdf957e3cf005372c0245
Successfully built openai-whisper ffmpeg
Installing collected packages: regex, tiktoken, nvidia-cublas-cu12, nvidia-cudnn-cu12, triton, nvidia-nvtx-cu12, nvidia-cuda-runtime-cu12, nvidia-cufft-cu12, nvidia-nvjitlink-cu12, nvidia-cusparse-cu12, nvidia-cusolver-cu12, nvidia-curand-cu12, mpmath, sympy, nvidia-nccl-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, torch, openai-whisper, ffmpeg, widgetsnbextension, jupyterlab-widgets, ipywidgets, toml, cmdkit, names-generator, datarobotx
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.
We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.
torch 2.1.2 requires triton==2.1.0; platform_system == "Linux" and platform_machine == "x86_64", but you'll have triton 2.2.0 which is incompatible.
Successfully installed cmdkit-2.7.3 datarobotx-0.1.20 ffmpeg-1.4 ipywidgets-8.1.1 jupyterlab-widgets-3.0.9 mpmath-1.3.0 names-generator-0.1.0 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.18.1 nvidia-nvjitlink-cu12-12.3.101 nvidia-nvtx-cu12-12.1.105 openai-whisper-20231117 regex-2023.12.25 sympy-1.12

In [3]:
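import os
import shutil

import whisper

# Download the "base" model; Whisper stores the weights in its cache folder
model = whisper.load_model("base")

# Make sure the target folder exists in the notebook filesystem
path = "./storage/whisper"
os.makedirs(path, exist_ok=True)

# Copy the cached weights into storage so they are uploaded with the custom model
# (the relative cache path assumes the working directory is the home directory)
source = "./.cache/whisper/base.pt"
target = "./storage/whisper/base.pt"

shutil.copyfile(source, target)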

100%|████████████████████████████████████████| 139M/139M [00:00<00:00, 188MiB/s]
'./storage/whisper/base.pt'

# Model Deployment
## Setup of custom model hooks

Reference: 

https://docs.datarobot.com/en/docs/mlops/deployment/custom-models/custom-model-assembly/unstructured-custom-models.html

The following code is used to set up the hooks that will be executed when the deployment starts up, and when it is called for scoring (transcription) purposes.

In [None]:
import base64
import codecs
import os
from pathlib import Path


def file_to_base_64(filepath: str):
    """
    Read a file and return its content encoded as base64

    Parameters
    ----------
    filepath : str
        Path to a file. In our case, this will be an audio file

    Returns
    -------
    bytes
        Base64 representation of the file
    """
    with open(filepath, "rb") as file:
        encoded_string = base64.b64encode(file.read())
        return encoded_string


def base_64_to_file(b64_string: bytes, filepath: str = "data/temp.pdf") -> str:
    """
    Decode a base64 string and write the result to a file

    Parameters
    ----------
    b64_string : bytes
        Base64 representation of a file
    filepath : str, default "data/temp.pdf"
        Path of the file to write; the parent directory is created if needed

    Returns
    -------
    str
        Path of the resulting file
    """
    parent_directory = Path(filepath).parent.absolute()
    if not os.path.exists(parent_directory):
        os.makedirs(parent_directory)

    with open(filepath, "wb") as f:
        f.write(codecs.decode(b64_string, "base64"))
    return filepath

In [None]:
def load_model(input_dir):
    import os
    import stat

    import whisper

    # Ensure we have ffmpeg defined in the PATH
    path = os.environ["PATH"]
    if "ffmpeg" not in path:
        os.environ["PATH"] = path + ":" + "./storage"

    # Change the permissions to allow execution of ffmpeg
    file = "./storage/ffmpeg"
    os.chmod(file, stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO)

    # Dynamically load this (note this encounters a permissions error when deployed, writing to the .cache folder)
    # model = whisper.load_model("base")

    model_file = "./storage/whisper/base.pt"
    model = whisper.load_model(model_file)

    return model


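def score_unstructured(model, data, query, **kwargs) -> str:
    # Import inside the hook so it does not rely on notebook-level imports
    import json
    import os

    import requests

    temp_file_name = "temp/tempfile"
    data_dict = json.loads(data)

    if "file" in data_dict:
        # Decode the base64 payload and write it to the temporary file
        base_64_to_file(data_dict["file"].encode(), filepath=temp_file_name)
    elif "url" in data_dict:
        # Download the audio; create the temp directory first, since open() will not
        os.makedirs(os.path.dirname(temp_file_name), exist_ok=True)
        resp = requests.get(data_dict["url"])
        with open(temp_file_name, "wb") as f:
            f.write(resp.content)
    else:
        result = {"error": "Missing parameter - either a file or url must be specified"}
        return json.dumps(result)

    try:
        result = model.transcribe(temp_file_name, fp16=False)
    except Exception as e:
        result = {"error": f"{e.__class__.__name__}: {str(e)}"}
    finally:
        # Remove the temporary file whether or not transcription succeeded
        if os.path.exists(temp_file_name):
            os.remove(temp_file_name)
    return json.dumps(result)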
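
To make the call order of the two hooks concrete, here is a minimal sketch, assuming the start-up hook runs once and its return value is then passed to the scoring hook on every request. `EchoModel`, `sketch_load_model`, and `sketch_score_unstructured` are hypothetical stand-ins so that the example runs without Whisper or ffmpeg; they are deliberately named differently from the real hooks to avoid overwriting them.

```python
import json


class EchoModel:
    """Hypothetical stand-in for the Whisper model, used only to show the flow."""

    def transcribe(self, path):
        return {"text": f"transcript of {path}"}


def sketch_load_model(input_dir):
    # Plays the role of load_model: runs once when the deployment starts up
    return EchoModel()


def sketch_score_unstructured(model, data, query, **kwargs):
    # Plays the role of score_unstructured: runs for each scoring request
    payload = json.loads(data)
    return json.dumps(model.transcribe(payload["path"]))


# One load, then several scoring calls that reuse the same model object
model_stub = sketch_load_model(".")
responses = [
    sketch_score_unstructured(model_stub, json.dumps({"path": p}), query=None)
    for p in ("a.wav", "b.wav")
]
print(responses)
```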

# Test by calling the local functions

We can test by downloading a file locally and submitting it directly, or by providing the URL of a publicly available audio file.

Sample files source: https://github.com/microsoft/MS-SNSD/tree/master

Note that here we will be returning the full output from the Whisper model, which includes the transcribed text along with additional metadata.
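
Because the hook returns the full Whisper output as a JSON string, a caller typically needs to parse it before use. The sketch below shows one way to turn the segments into timestamped lines and to surface the `error` key that the hook returns on failure. The helper name `format_segments` is hypothetical, and the sample payload is a hand-shortened excerpt of the output shape, with the token and probability fields omitted.

```python
import json

# Hand-shortened sample shaped like the hook's JSON output
raw = json.dumps(
    {
        "text": " She seemed irritated. You've got no business up here.",
        "segments": [
            {"id": 0, "start": 0.0, "end": 1.2, "text": " She seemed irritated."},
            {"id": 1, "start": 1.2, "end": 2.64, "text": " You've got no business up here."},
        ],
    }
)


def format_segments(result_json: str) -> list:
    """Return one 'start - end text' line per segment, or raise on an error result."""
    result = json.loads(result_json)
    if "error" in result:
        raise RuntimeError(result["error"])
    return [
        f"[{seg['start']:6.2f}s - {seg['end']:6.2f}s] {seg['text'].strip()}"
        for seg in result["segments"]
    ]


lines = format_segments(raw)
print("\n".join(lines))
```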

In [6]:
import base64
import json

import requests

# ------------- Test using a file -------------

# Download the file first
url = "https://github.com/microsoft/MS-SNSD/raw/master/clean_test/clnsp0.wav"
data_file_name = "./storage/sample_audio"
resp = requests.get(url)
with open(data_file_name, "wb") as f:
    f.write(resp.content)

# Open the file and encode it
with open(data_file_name, "rb") as data_file:
    encoding = base64.b64encode(data_file.read())

# Test the hooks locally
result = score_unstructured(
    load_model("."),
    json.dumps(
        {
            "file": encoding.decode(),
        }
    ),
    None,
)
result

'{"text": " She seemed irritated. You\'ve got no business up here. You took me by surprise. There would still be plenty of moments of regret and sadness and guilty relief. A warm breeze played across it, moving it like waves.", "segments": [{"id": 0, "seek": 0, "start": 0.0, "end": 1.2, "text": " She seemed irritated.", "tokens": [50364, 1240, 6576, 43650, 13, 50424], "temperature": 0.0, "avg_logprob": -0.20225826331547328, "compression_ratio": 1.3766233766233766, "no_speech_prob": 0.003493815427646041}, {"id": 1, "seek": 0, "start": 1.2, "end": 2.64, "text": " You\'ve got no business up here.", "tokens": [50424, 509, 600, 658, 572, 1606, 493, 510, 13, 50496], "temperature": 0.0, "avg_logprob": -0.20225826331547328, "compression_ratio": 1.3766233766233766, "no_speech_prob": 0.003493815427646041}, {"id": 2, "seek": 0, "start": 2.64, "end": 3.92, "text": " You took me by surprise.", "tokens": [50496, 509, 1890, 385, 538, 6365, 13, 50560], "temperature": 0.0, "avg_logprob": -0.20225826331

In [7]:
# ------------- Test using a URL -------------

# Test the hooks locally
result = score_unstructured(
    load_model("."),
    json.dumps({"url": "https://github.com/microsoft/MS-SNSD/raw/master/clean_test/clnsp1.wav"}),
    None,
)
result

'{"text": " In some measure, they depend upon the structure of individual personality. Many selections are themselves convincing contributions to this appraisal. What shall these effects be?", "segments": [{"id": 0, "seek": 0, "start": 0.0, "end": 4.84, "text": " In some measure, they depend upon the structure of individual personality.", "tokens": [50364, 682, 512, 3481, 11, 436, 5672, 3564, 264, 3877, 295, 2609, 9033, 13, 50606], "temperature": 0.0, "avg_logprob": -0.23194175017507454, "compression_ratio": 1.3185185185185184, "no_speech_prob": 0.006317353807389736}, {"id": 1, "seek": 0, "start": 4.84, "end": 9.76, "text": " Many selections are themselves convincing contributions to this appraisal.", "tokens": [50606, 5126, 47829, 366, 2969, 24823, 15725, 281, 341, 724, 20769, 304, 13, 50852], "temperature": 0.0, "avg_logprob": -0.23194175017507454, "compression_ratio": 1.3185185185185184, "no_speech_prob": 0.006317353807389736}, {"id": 2, "seek": 0, "start": 9.76, "end": 11.28, "text

# Deploy the Whisper model

This step uses the `deploy` convenience method from the DataRobot drx package (https://drx.datarobot.com/consume/deploy.html), which:

- Builds a new Custom Model Environment (or reuses an existing one) and loads the contents of storage/
- Assembles a new Custom Model with the provided hooks
- Deploys an Unstructured Custom Model to your Deployments
- Returns an object which can be used to make predictions

Use `environment_id` to reuse an existing Custom Model Environment that you're happy with; this shortens iteration cycles on the custom model hooks.

In [8]:
import datarobotx as drx

deployment = drx.deploy(
    "storage",
    name="Whisper Speech Recognition",
    hooks={"score_unstructured": score_unstructured, "load_model": load_model},
    extra_requirements=["openai-whisper", "ffmpeg"],
    # Reuse an existing environment if a suitable one is available; this can speed up the process
    environment_id="64c964448dd3f0c07f47d040",  # [DataRobot] Python 3.9 GenAI
)

[1m[34m#[0m[1m Deploying custom model[0m
[1m  - [0mUnable to auto-detect model type; any provided paths and files will be
    exported - dependencies should be explicitly specified using
    `extra_requirements` or `environment_id`
[1m  - [0mPreparing model and environment...
[1m  - [0mUsing environment [[DataRobot] Python 3.9 GenAI
    v7](https://app.datarobot.com/model-registry/custom-environments/64c964448dd3f0c07f47d040)
    for deployment
[1m  - [0mConfiguring and uploading custom model...
    100%|█████████████████████████████████████| 225M/225M [00:01<00:00, 134MB/s]


[1m  - [0mRegistered custom model [Whisper Speech
    Recognition](https://app.datarobot.com/model-registry/custom-models/65a8e64e25149018f7c1d4c3/info)
    with target type: Unstructured
[1m  - [0mInstalling additional dependencies...


[1m  - [0mCreating and deploying model package...


[1m  - [0mCreated deployment [Whisper Speech
    Recognition](https://app.datarobot.com/deployments/65a8e775d3c751468792502c/overview)
[1m[34m#[0m[1m Custom model deployment complete[0m


# Test by calling the deployment

We can test by downloading a file locally and submitting it directly, or by providing the URL of a publicly available audio file.

Sample files source: https://github.com/microsoft/MS-SNSD/tree/master
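
Both request styles share the same payload shape: a dictionary with either a base64-encoded `file` or a `url`. A small helper can build either one. This is a sketch, and the name `build_payload` is hypothetical; the temporary file only stands in for a real audio file so that the round trip can be checked without downloading anything.

```python
import base64
import os
import tempfile


def build_payload(source: str) -> dict:
    """Return a {"url": ...} payload for URLs, or a {"file": ...} payload for local files."""
    if source.startswith(("http://", "https://")):
        return {"url": source}
    with open(source, "rb") as f:
        return {"file": base64.b64encode(f.read()).decode()}


# Round-trip check with a small temporary file standing in for an audio file
with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
    tmp.write(b"not real audio")
payload = build_payload(tmp.name)
decoded = base64.b64decode(payload["file"])
os.remove(tmp.name)
```

The resulting dictionary could then be passed to `deployment.predict_unstructured(...)` in the same way as the hand-built payloads in the cells that follow.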

In [None]:
import datarobotx as drx

# If using an existing deployment, copy your deployment_id here
# deployment = drx.Deployment("653995d19da96f2431c90516")

In [10]:
import base64

import requests

# Download the file first
url = "https://github.com/microsoft/MS-SNSD/raw/master/clean_test/clnsp0.wav"
data_file_name = "./storage/sample_audio"
resp = requests.get(url)
with open(data_file_name, "wb") as f:
    f.write(resp.content)

# Open the file and encode it
with open(data_file_name, "rb") as data_file:
    encoding = base64.b64encode(data_file.read())

# Send the encoded file to the deployment
result = deployment.predict_unstructured(
    {
        "file": encoding.decode(),
    }
)
result["text"]

[1m[34m#[0m[1m Making predictions[0m
[1m  - [0mMaking predictions with deployment [Whisper Speech
    Recognition](https://app.datarobot.com/deployments/65a8e775d3c751468792502c/overview)


[1m[34m#[0m[1m Predictions complete[0m
" She seemed irritated. You've got no business up here. You took me by surprise. There would still be plenty of moments of regret and sadness and guilty relief. A warm breeze played across it, moving it like waves."

In [11]:
import json

import requests

# ------------- Test using a URL -------------

# Send the URL to the deployment
result = deployment.predict_unstructured(
    {"url": "https://github.com/microsoft/MS-SNSD/raw/master/clean_test/clnsp1.wav"}
)
result["text"]

[1m[34m#[0m[1m Making predictions[0m
[1m  - [0mMaking predictions with deployment [Whisper Speech
    Recognition](https://app.datarobot.com/deployments/65a8e775d3c751468792502c/overview)


[1m[34m#[0m[1m Predictions complete[0m
' In some measure, they depend upon the structure of individual personality. Many selections are themselves convincing contributions to this appraisal. What shall these effects be?'