<a href="https://colab.research.google.com/github/atilatech/atlas-service/blob/master/notebooks/whisper_model_deployment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deploy OpenAI Whisper Model

This notebook shows how to save a model and upload it to S3.

1. Create model on local machine
2. Save model using joblib
3. Verify that saved model works
4. Upload to S3

> Inspired by [Deploy A Locally Trained ML Model In Cloud Using AWS SageMaker](https://medium.com/geekculture/84af8989d065)

Install dependencies

In [3]:
!pip install pytube

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
import pytube

url = "https://www.youtube.com/watch?v=bGk8qcHc1A0" # Joe Rogan & Lex Fridman: Lionel Messi Is The GOAT Over Cristiano Ronaldo
yt = pytube.YouTube(url)
# yt.streams.filter(only_audio=True).first()\
# .download(output_path='mp3', filename=f"{yt.video_id}.mp3")
itag = None
files = yt.streams.filter(only_audio=True)
for file in files:
    # grab the first audio-only stream in an MP4 container
    # (it is saved with an .mp3 extension below)
    if file.mime_type == 'audio/mp4':
        itag = file.itag
        break

if itag is None:
    # just in case no audio/mp4 stream is found (shouldn't happen)
    raise ValueError("No audio/mp4 stream found")

# get the correct mp3 'stream'
stream = yt.streams.get_by_itag(itag)
# downloading the audio
stream.download(
    output_path='mp3',
    filename=f"{yt.video_id}.mp3"
)

# Add the video info to the list of downloaded videos
video_info = {
    'id': yt.video_id,
    'thumbnail': yt.thumbnail_url,
    'title': yt.title,
    'views': yt.views,
    'length': yt.length,
}
video_info

{'id': 'bGk8qcHc1A0',
 'thumbnail': 'https://i.ytimg.com/vi/bGk8qcHc1A0/sddefault.jpg',
 'title': 'Joe Rogan & Lex Fridman: Lionel Messi Is The GOAT Over Cristiano Ronaldo',
 'views': 185138,
 'length': 218}
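
The stream-selection step can be pulled out into a small function so it can be checked without network access. This is only a sketch: `pick_audio_itag` is a hypothetical helper, not part of pytube, and the `SimpleNamespace` objects are stand-ins that assume only the `mime_type` and `itag` attributes used in the cell above. The itag values are made up.

```python
from types import SimpleNamespace


def pick_audio_itag(streams, mime_type="audio/mp4"):
    """Return the itag of the first stream with the given MIME type, else None."""
    return next((s.itag for s in streams if s.mime_type == mime_type), None)


# Stand-in stream objects; real ones would come from yt.streams.filter(only_audio=True)
fake_streams = [
    SimpleNamespace(mime_type="audio/webm", itag=1),
    SimpleNamespace(mime_type="audio/mp4", itag=2),
    SimpleNamespace(mime_type="audio/mp4", itag=3),
]
chosen = pick_audio_itag(fake_streams)
print(chosen)
```

Returning `None` instead of raising leaves it to the caller to decide how to handle a video with no matching stream.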

# Transcribe Audio
1. Download model


Tip: [Use](https://stackoverflow.com/questions/51058533/passing-secret-variables-to-google-colaboratory-notebook/74892619#74892619) `getpass` for passing secret environment variables. It isn't needed here, but it's noted so we don't forget.

In [3]:
!pip install git+https://github.com/openai/whisper.git -q
!apt install ffmpeg # https://stackoverflow.com/questions/51856340/how-to-install-package-ffmpeg-in-google-colab

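# optional: install PyTorch so you can use a GPU for faster transcription
# the command below is for Linux; see instructions for Mac and Windows: https://pytorch.org/get-started/locally/
# note: the extra index URL below is the CPU wheel index; pick the CUDA variant from the page above
# if torch is not already installed with GPU support
!pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu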

Reading package lists... Done
Building dependency tree       
Reading state information... Done
ffmpeg is already the newest version (7:3.4.11-0ubuntu0.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/cpu


In [4]:
import whisper
import torch  # install steps: pytorch.org

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f'whisper will use: {device}')

large_gpu_model = whisper.load_model("large").to(device)
# large_cpu_model = whisper.load_model("large").to("cpu")

whisper will use: cuda


100%|██████████████████████████████████████| 2.87G/2.87G [00:25<00:00, 122MiB/s]


In [None]:
from pathlib import Path

audio_paths = [str(x) for x in Path('./mp3').glob('*.mp3')]
audio_path = audio_paths[0]

# verbose: bool
# Whether to display the text being decoded to the console.
# If True, displays all the details (live transcription)
# If False, displays minimal details (progress bar)
# If None, does not display anything
# Only show the live transcript if the video is at most 300 seconds (5 minutes) long,
# to avoid too much text in the console
verbose = yt.length <= 300
audio_transcript = large_gpu_model.transcribe(audio_path, verbose=verbose)
text = audio_transcript['text']
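
Besides `text`, the dictionary returned by `transcribe` also carries a `language` entry and a `segments` list whose items include `start`, `end`, and `text`. The sketch below formats such segments as timestamped lines. `format_segments` is a hypothetical helper, and `sample_transcript` is made-up data shaped like a transcription result, not output from this notebook.

```python
def format_segments(transcript):
    """Render each segment as '[start -> end] text', with times in seconds."""
    lines = []
    for seg in transcript.get("segments", []):
        lines.append(f"[{seg['start']:.2f} -> {seg['end']:.2f}] {seg['text'].strip()}")
    return lines


# Made-up example data with the same shape as a transcription result
sample_transcript = {
    "text": " Hello there. General remarks.",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello there."},
        {"start": 1.5, "end": 3.25, "text": " General remarks."},
    ],
}
formatted = format_segments(sample_transcript)
print("\n".join(formatted))
```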

# Save Model

Save the model using joblib, then reload it to verify that the saved copy works.

In [6]:
import joblib

model_file_name = 'whisper-large_gpu_model'
joblib.dump(large_gpu_model, model_file_name)

['whisper-large_gpu_model']

In [7]:
large_gpu_model_dumped = joblib.load(model_file_name)

In [None]:
audio_transcript_2 = large_gpu_model_dumped.transcribe(audio_path,verbose=True)
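
One way to check that the reloaded model works is to compare its transcript with the one from the original model. The sketch below assumes both results are dictionaries with a `text` key, as in the cells above; `transcripts_match` is a hypothetical helper. Decoding may not be strictly deterministic in every configuration, so a mismatch is a prompt to inspect the output rather than proof that the saved model is broken.

```python
def transcripts_match(first, second):
    """Compare two transcription results by their whitespace-normalized text."""
    def normalize(result):
        return " ".join(result["text"].split())

    return normalize(first) == normalize(second)


# Made-up stand-ins for audio_transcript and audio_transcript_2
same = transcripts_match({"text": " hello   world"}, {"text": "hello world "})
different = transcripts_match({"text": "hello world"}, {"text": "goodbye world"})
print(same, different)
```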

In [9]:
import os

# Get the file size in bytes
file_size_bytes = os.path.getsize(model_file_name)

# Convert the file size to GB
file_size_gb = file_size_bytes / (1024 ** 3)

# Convert the file size to MB
file_size_mb = file_size_bytes / (1024 ** 2)

print(f"File size: {file_size_gb:.2f} GB ({file_size_mb:.2f} MB)")


File size: 5.75 GB (5888.47 MB)
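
The same size calculation is needed again later for the compressed file, so it could live in a small helper. This is a sketch: `file_size_gb_mb` is a hypothetical name, and the temporary file only stands in for the model file.

```python
import os
import tempfile


def file_size_gb_mb(path):
    """Return the size of the file at path as a (GB, MB) tuple, using powers of 1024."""
    size_bytes = os.path.getsize(path)
    return size_bytes / (1024 ** 3), size_bytes / (1024 ** 2)


# Demonstrate on a 2 MB placeholder file instead of the real model
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (2 * 1024 ** 2))
    tmp_path = tmp.name

size_gb, size_mb = file_size_gb_mb(tmp_path)
print(f"File size: {size_gb:.2f} GB ({size_mb:.2f} MB)")
os.remove(tmp_path)
```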


In [31]:
# Save Model to Google Drive
from google.colab import drive

drive.mount('/content/drive')

model_path_in_google_drive = f'/content/drive/MyDrive/Atlas-models/{model_file_name}'
joblib.dump(large_gpu_model, model_path_in_google_drive)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


['/content/drive/MyDrive/Atlas-models/whisper-large_gpu_model']

# Upload Model to S3

This section is meant to be run on its own when you start a new
session with a model that has already been saved with joblib.

The file is about 5.75 GB, so we upload it to S3 directly from Colab.

[S3 Multipart Upload Limits](https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html)
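
To make the limits concrete, here is a rough calculation of how many parts a multipart upload of the uncompressed file would need. The limits are the ones listed on the page linked above (at most 10,000 parts, each between 5 MiB and 5 GiB, except that the last part may be smaller). The 8 MiB chunk size is an assumption for illustration; check your client's transfer configuration for the real value. `part_count` is a hypothetical helper.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3

# Limits from the S3 multipart upload limits page
MAX_PARTS = 10_000
MIN_PART_SIZE = 5 * MIB
MAX_PART_SIZE = 5 * GIB


def part_count(file_size_bytes, chunk_size_bytes):
    """Number of parts needed to upload a file in chunks of the given size."""
    if not MIN_PART_SIZE <= chunk_size_bytes <= MAX_PART_SIZE:
        raise ValueError("chunk size is outside the allowed part-size range")
    parts = math.ceil(file_size_bytes / chunk_size_bytes)
    if parts > MAX_PARTS:
        raise ValueError("too many parts; use a larger chunk size")
    return parts


file_size_bytes = int(5888.47 * MIB)  # uncompressed model size reported earlier
assumed_chunk = 8 * MIB               # assumption: illustrative chunk size
parts = part_count(file_size_bytes, assumed_chunk)
print(f"{parts} parts of {assumed_chunk // MIB} MiB")
```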

## Compress to Google Drive

Compress the model file because the uncompressed 5.75 GB file ran into S3's maximum part size of 5 GB.

- [Zlib compression](https://joblib.readthedocs.io/en/latest/persistence.html#compressed-joblib-pickles:~:text=By%20default%2C%20joblib.dump()%20uses%20the%20zlib%20compression%20method%20as%20it%20gives%20the%20best%20tradeoff%20between%20speed%20and%20disk%20space.)

- [Comparison of different compressors](https://joblib.readthedocs.io/en/latest/auto_examples/compressors_comparison.html#sphx-glr-auto-examples-compressors-comparison-py)
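
As a rough illustration of the size side of that trade-off, the sketch below compresses synthetic bytes with Python's standard `zlib` module at a few levels. It does not use joblib or the Whisper model: the data is deliberately repetitive, so the ratios say nothing about how well real model weights compress.

```python
import zlib

# Synthetic, highly repetitive bytes (1 MiB); real model weights compress far less well
data = bytes(range(256)) * 4096

# Compressed size at a fast, a middle, and the slowest/highest level
sizes = {level: len(zlib.compress(data, level)) for level in (1, 3, 9)}
for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({size / len(data):.2%} of original)")

# Compression is lossless: decompressing restores the exact bytes
restored = zlib.decompress(zlib.compress(data, 3))
```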

Raw file: 5.75 GB, 59 seconds to load.

|                    | Raw  | Compressed |
|--------------------|------|------------|
| File Size (GB)     | 5.75 |       3.45 |
| Save Time (s)      | ?    |        489 |
| Load Time (s)      |   59 |         66 |
| Inference Time (s) |    ? |         54 |

? = That value hasn't been recorded.
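
Working through the recorded numbers in the table: compression shrinks the file to 60% of its original size (a 40% saving) at the cost of 7 extra seconds of load time. The short calculation below uses only the values from the table.

```python
# Values copied from the table above
raw_gb, compressed_gb = 5.75, 3.45
raw_load_s, compressed_load_s = 59, 66

ratio = compressed_gb / raw_gb                    # fraction of the original size
saved_pct = (1 - ratio) * 100                     # percent of disk space saved
extra_load_s = compressed_load_s - raw_load_s     # added load time in seconds
extra_load_pct = extra_load_s / raw_load_s * 100  # added load time in percent

print(f"Compressed size: {ratio:.0%} of raw ({saved_pct:.0f}% saved)")
print(f"Load time: +{extra_load_s} s (+{extra_load_pct:.1f}%)")
```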

2. Upload to S3
  1. [Get S3 credentials](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html)



## Use Model in SageMaker

1. Create a SageMaker Instance and

In [None]:
import joblib

model_file_name = 'whisper-large_gpu_model'
model_path_in_google_drive = f'/content/drive/MyDrive/Atlas-models/{model_file_name}'
large_gpu_model_dumped = joblib.load(model_path_in_google_drive)

large_gpu_model_dumped

In [3]:
# compress the file because it is 5.75 GB uncompressed
# AWS has a Maximum Part Size of 5GB
# https://joblib.readthedocs.io/en/latest/persistence.html#compressed-joblib-pickles:~:text=By%20default%2C%20joblib.dump()%20uses%20the%20zlib%20compression%20method%20as%20it%20gives%20the%20best%20tradeoff%20between%20speed%20and%20disk%20space.
# https://joblib.readthedocs.io/en/latest/auto_examples/compressors_comparison.html#sphx-glr-auto-examples-compressors-comparison-py
# Use zlib because it has the best tradeoff between size and speed
model_path_in_google_drive = f'/content/drive/MyDrive/Atlas-models/{model_file_name}'

model_path_in_google_drive_compressed = model_path_in_google_drive + '.gz'
joblib.dump(large_gpu_model_dumped,
            model_path_in_google_drive_compressed,
            compress=True)


['/content/drive/MyDrive/Atlas-models/whisper-large_gpu_model.gz']

In [4]:
import os

# Get the file size in bytes
file_size_bytes = os.path.getsize(model_path_in_google_drive_compressed)

# Convert the file size to GB
file_size_gb = file_size_bytes / (1024 ** 3)

# Convert the file size to MB
file_size_mb = file_size_bytes / (1024 ** 2)

print(f"File size: {file_size_gb:.2f} GB ({file_size_mb:.2f} MB)")

File size: 3.45 GB (3536.86 MB)


In [5]:
large_gpu_model_dumped_compressed = joblib.load(model_path_in_google_drive_compressed)

In [None]:
large_gpu_model_dumped_compressed

In [None]:
from pathlib import Path

audio_paths = [str(x) for x in Path('./mp3').glob('*.mp3')]
audio_path = audio_paths[0]


decode_options = {
     # Set language to None to support multilingual, 
     # but it will take longer to process while it detects the language.
     # Realized this by running in verbose mode and seeing how much time
     # was spent on the decoding language step
    "language":"en"
} 
audio_transcript = large_gpu_model_dumped_compressed.transcribe(audio_path,
                                                     verbose=True,
                                                     **decode_options)
audio_transcript

In [None]:
!pip install boto3

In [8]:
from getpass import getpass

AWS_ACCESS_KEY = getpass('Enter AWS_ACCESS_KEY')
AWS_SECRET_KEY = getpass('Enter AWS_SECRET_KEY')

Enter AWS_ACCESS_KEY··········
Enter AWS_SECRET_KEY··········
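
When the notebook runs somewhere that already has the credentials in environment variables, prompting is unnecessary. A minimal sketch of that fallback, assuming a hypothetical helper `read_secret` and a placeholder variable name and value (neither is a real credential):

```python
import os
from getpass import getpass


def read_secret(name, prompt=None):
    """Return the environment variable if it is set, otherwise prompt without echoing."""
    value = os.environ.get(name)
    if value:
        return value
    return getpass(prompt or f"Enter {name}: ")


# Placeholder name and value, set here only so the example runs without prompting
os.environ["EXAMPLE_SECRET"] = "not-a-real-key"
example = read_secret("EXAMPLE_SECRET")
del os.environ["EXAMPLE_SECRET"]
```

In the notebook this could be called as `read_secret('AWS_ACCESS_KEY')`, which would prompt only when the variable is missing.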


In [9]:
# Create the S3 client. The next cell verifies that the credentials work.
# That check won't verify that you have upload/write access to the bucket

import logging
import boto3
from botocore.exceptions import ClientError
import os

s3_client = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
)


In [None]:
# verify your connection
s3_client.list_buckets()['Buckets'][:5]

In [11]:
import os
import sys
import threading
# https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html#the-callback-parameter
class ProgressPercentage(object):

    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # To simplify, assume this is hooked up to a single filename
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            # Overwrite the same line using a carriage return; one write is enough
            sys.stdout.write(
                "\r%s  %s / %s  (%.2f%%)" % (
                    self._filename, self._seen_so_far, self._size,
                    percentage))
            sys.stdout.flush()
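
For a multi-gigabyte file the callback fires many times, which can flood the cell output. The variant below is a sketch that reports only when the whole-number percentage changes. `WholePercentProgress` is a hypothetical class, and the loop at the end simulates an uploader invoking the callback with byte counts; no S3 call is made.

```python
import os
import tempfile
import threading


class WholePercentProgress:
    """Upload callback that reports only when the whole-number percentage changes."""

    def __init__(self, filename):
        self._size = os.path.getsize(filename)
        self._seen_so_far = 0
        self._last_percent = -1
        self._lock = threading.Lock()
        self.reported = []  # kept so the behaviour can be inspected afterwards

    def __call__(self, bytes_amount):
        with self._lock:
            self._seen_so_far += bytes_amount
            percent = self._seen_so_far * 100 // self._size
            if percent != self._last_percent:
                self._last_percent = percent
                self.reported.append(percent)
                print(f"\r{percent}%", end="", flush=True)


# Simulate an upload of a 1000-byte placeholder file in 8 chunks of 125 bytes
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 1000)
    tmp_path = tmp.name

progress = WholePercentProgress(tmp_path)
for _ in range(8):
    progress(125)
os.remove(tmp_path)
```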

In [None]:
bucket_name = 'atila-ai-models-2'  # your bucket name here
model_path_in_google_drive = f'/content/drive/MyDrive/Atlas-models/{model_file_name}'

model_path_in_google_drive_compressed = model_path_in_google_drive + '.gz'
try:
    response = s3_client.upload_file(model_path_in_google_drive_compressed, 
                                     bucket_name,
                                     os.path.basename(model_path_in_google_drive_compressed),
                                     Callback=ProgressPercentage(model_path_in_google_drive_compressed))
except ClientError as e:
    logging.error(e)

In [15]:
os.path.basename(model_path_in_google_drive_compressed)

'whisper-large_gpu_model.gz'

# Running in SageMaker

1. [Available EC2 Instances in SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-instance-types.html)

1. [ml.g4dn.xlarge](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-instance-types.html#:~:text=ml.p3dn.24xlarge-,ml.g4dn.xlarge,-%3E%3E%20Fast%20launch) because it is the only GPU instance type with fast launch
  1. Uses [Elastic Inference](https://aws.amazon.com/machine-learning/elastic-inference/)

1. [EC2 Instances](https://aws.amazon.com/ec2/instance-types/)

Note: If you try to add Elastic Inference, you might need to request a service
limit increase, which may take a few days.

https://stackoverflow.com/questions/71738894/unable-to-create-aws-segamaker-error-the-account-level-service-limit-number-o

https://support.console.aws.amazon.com/support/home?region=us-east-1&skipRegion=true#/case/create?issueType=service-limit-increase

https://discuss.huggingface.co/t/deploying-open-ais-whisper-on-sagemaker/24761/16

https://stackoverflow.com/questions/56255154/how-to-use-a-pretrained-model-from-s3-to-predict-some-data

In [None]:
!pip install sagemaker



In [2]:
import sagemaker
from sagemaker.tensorflow import TensorFlowModel
from sagemaker.utils import S3DataConfig

import shutil
import tarfile
import tensorflow as tf
from tensorflow.python.keras.utils.np_utils import to_categorical

role = sagemaker.get_execution_role()
sm_session = sagemaker.Session()
bucket_name = sm_session.default_bucket()



ValueError: ignored
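
The cell above imports `tarfile` and `shutil`, which suggests that packaging the model is the next step. SageMaker hosting generally expects model artifacts as a gzip-compressed tar archive in S3, commonly named `model.tar.gz`. Below is a minimal sketch of that packaging step. `make_model_archive` is a hypothetical helper, and the tiny placeholder file stands in for the real model so that the example runs anywhere.

```python
import os
import tarfile
import tempfile


def make_model_archive(model_path, archive_path):
    """Write the file at model_path into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores the file at the archive root instead of under its full path
        tar.add(model_path, arcname=os.path.basename(model_path))
    return archive_path


# Demonstration with a placeholder file instead of the real model
with tempfile.TemporaryDirectory() as workdir:
    placeholder = os.path.join(workdir, "whisper-large_gpu_model.gz")
    with open(placeholder, "wb") as f:
        f.write(b"placeholder bytes, not a real model")
    archive = make_model_archive(placeholder, os.path.join(workdir, "model.tar.gz"))
    with tarfile.open(archive, "r:gz") as tar:
        archive_members = tar.getnames()

print(archive_members)
```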