## Text Extraction with Whisper

In this notebook, we extract text from video and audio files with the [Whisper model](https://github.com/openai/whisper). Because Whisper is multilingual, we can extract transcripts from video files in different languages, including a single video file that mixes several languages. We provide the following options for Whisper inference:
- Batch inference with a SageMaker Processing job: process large volumes of data and store the results in a vector database for a RAG solution.
- Real-time inference with a SageMaker endpoint: run summarization or QA on a short video/audio file (less than 6 MB).

In [6]:
!pip install -U sagemaker -q

## Set up

In [7]:
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker import get_execution_role
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.huggingface import HuggingFaceModel
import sagemaker
import boto3
import json

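try:
    # Inside SageMaker (notebook instance or Studio) the execution role is available directly
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker, look up the role by name; adjust RoleName to match your account
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']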

sess = sagemaker.session.Session()
bucket = sess.default_bucket()
prefix = "sagemaker/rag_video"
s3_input = f"s3://{bucket}/{prefix}/raw_data" # Directory for video files
s3_output_clips = f"s3://{bucket}/{prefix}/clips" # Directory for video clips
s3_output_transcript = f"s3://{bucket}/{prefix}/transcript" # Directory for transcripts

In [8]:
%store s3_output_transcript
%store s3_output_clips

Stored 's3_output_transcript' (str)
Stored 's3_output_clips' (str)


## Upload test data to S3 bucket

In [13]:
!aws s3 cp test_video.mp4 {s3_input}/
!aws s3 cp test_audio.mp3 {s3_input}/

upload: ./test_video.mp4 to s3://sagemaker-us-east-1-822507008821/sagemaker/rag_video/raw_data/test_video.mp4
upload: ./test_audio.mp3 to s3://sagemaker-us-east-1-822507008821/sagemaker/rag_video/raw_data/test_audio.mp3


## Batch inference with SageMaker Processing

In [10]:
hfp = HuggingFaceProcessor(
    role=role,  # reuse the role resolved in Set up, which also works outside SageMaker
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    transformers_version='4.28.1',
    pytorch_version='2.0.0', 
    base_job_name='frameworkprocessor-hf',
    py_version="py310"
)

In [19]:
hfp.run(
    code='preprocessing.py',
    source_dir="data_preparation",
    inputs=[
        ProcessingInput(source=s3_input, destination="/opt/ml/processing/input")
    ], 
    outputs=[
        ProcessingOutput(source='/opt/ml/processing/output_clips', destination=s3_output_clips),
        ProcessingOutput(source='/opt/ml/processing/transcripts', destination=s3_output_transcript),
    ],
    arguments=[
        "--whisper-model", "whisper-large-v2",
        "--target-language", "en",
        "--sentence-embedding-model", "all-mpnet-base-v2"
    ]
)

INFO:sagemaker.processing:Uploaded data_preparation to s3://sagemaker-us-east-1-822507008821/frameworkprocessor-hf-2023-07-12-16-44-37-718/source/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sagemaker-us-east-1-822507008821/frameworkprocessor-hf-2023-07-12-16-44-37-718/source/runproc.sh
INFO:sagemaker:Creating processing-job with name frameworkprocessor-hf-2023-07-12-16-44-37-718


Using provided s3_resource
Collecting git+https://github.com/openai/whisper.git (from -r requirements.txt (line 3))
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-bap8p6al
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-bap8p6al
  Resolved https://github.com/openai/whisper.git to commit b91c907694f96a3fb9da03d4bbdc83fbcd3a40a4
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting tiktoken (from -r requirements.txt (line 1))
  Downloading tiktoken-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 45.2 MB/s eta 0:00:00
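
The `arguments` list passed to `hfp.run` reaches `preprocessing.py` as ordinary command-line flags. The script itself is not shown in this notebook, so the following is only a minimal sketch of how such a script might read those flags with `argparse`. The function name `parse_args` and the default values are assumptions for illustration, not the actual contents of `preprocessing.py`.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the processing job passes on the command line.

    Hypothetical sketch: the real preprocessing.py may define its CLI differently.
    """
    parser = argparse.ArgumentParser(description="Sketch of a preprocessing CLI")
    # Flag names are taken from the `arguments` list in the hfp.run call.
    parser.add_argument("--whisper-model", default="whisper-large-v2")
    parser.add_argument("--target-language", default="en")
    parser.add_argument("--sentence-embedding-model", default="all-mpnet-base-v2")
    return parser.parse_args(argv)


# Example: the same flag/value pairs used in the notebook.
args = parse_args([
    "--whisper-model", "whisper-large-v2",
    "--target-language", "en",
    "--sentence-embedding-model", "all-mpnet-base-v2",
])
print(args.whisper_model, args.target_language, args.sentence_embedding_model)
```

`argparse` converts dashes in flag names to underscores, so `--whisper-model` is read back as `args.whisper_model`.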

In [18]:
import os

# Create a local scratch directory; exist_ok avoids an error when the cell is re-run
os.makedirs('mytest', exist_ok=True)

## Deploy Whisper model to SageMaker for real-time inference

In [None]:
endpoint_name="whisper-large-v2"
# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'openai/whisper-large-v2',
    'HF_TASK':'automatic-speech-recognition',
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    
    env=hub,
    role=role
)

In [None]:
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1, # number of instances
    instance_type='ml.g5.xlarge' # ec2 instance type
)

In [None]:
client = boto3.client('sagemaker-runtime')  # current name of the SageMaker runtime service client
file = "test_audio.mp3"  # local copy of the audio file uploaded earlier
with open(file, "rb") as f:
    data = f.read()

In [None]:
response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='audio/x-audio', Body=data)
output = json.loads(response['Body'].read())
print(f"Extracted text from the audio file:\n {output['text']}")
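
The introduction notes that the real-time endpoint is meant for short files (less than 6 MB). A small guard before calling `invoke_endpoint` can route larger files to the batch path instead. This is a minimal sketch: the helper names `fits_realtime_limit` and `choose_inference_path` are hypothetical, and the limit is assumed to mean 6 * 1024 * 1024 bytes.

```python
import os

# Assumption: "6 MB" is read as 6 * 1024 * 1024 bytes.
MAX_REALTIME_BYTES = 6 * 1024 * 1024


def fits_realtime_limit(path, limit_bytes=MAX_REALTIME_BYTES):
    """Return True if the file at `path` is strictly smaller than the limit."""
    return os.path.getsize(path) < limit_bytes


def choose_inference_path(path):
    """Pick the real-time endpoint for small files, the processing job otherwise."""
    return "endpoint" if fits_realtime_limit(path) else "processing-job"
```

For example, `choose_inference_path("test_audio.mp3")` could be checked before reading the file into memory, so that oversized files are uploaded to `s3_input` for the processing job rather than sent to the endpoint.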

In [4]:
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    device=device
)

In [None]:
generate_kwargs = {"task": "transcribe", "language": "<|en|>"}
prediction = pipe(
    'test_audio.mp3',
    return_timestamps=True,
    chunk_length_s=20,
    stride_length_s=5,
    generate_kwargs=generate_kwargs
)
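
With `return_timestamps=True`, the pipeline result is a dictionary with the full `text` plus a `chunks` list, where each chunk carries a `timestamp` pair (start, end) in seconds and its own `text`. The sketch below turns that structure into readable timestamped lines. The helper `format_chunks` is hypothetical, and `sample` is a hand-written stand-in for `prediction`, not real model output.

```python
def _fmt(seconds):
    """Format a timestamp in seconds; a boundary may be missing (None)."""
    return "?" if seconds is None else f"{seconds:.2f}"


def format_chunks(prediction):
    """Return one '[start - end] text' line per timestamped chunk."""
    lines = []
    for chunk in prediction.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{_fmt(start)} - {_fmt(end)}] {chunk['text'].strip()}")
    return lines


# Hand-written example in the same shape as the pipeline output.
sample = {
    "text": " Hello world. Goodbye.",
    "chunks": [
        {"timestamp": (0.0, 2.5), "text": " Hello world."},
        {"timestamp": (2.5, None), "text": " Goodbye."},
    ],
}

for line in format_chunks(sample):
    print(line)
```

Calling `format_chunks(prediction)` on the real result gives one line per segment, which is a convenient form for inspecting where each sentence falls in the audio.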