# Whisper template

The code cells below import Whisper and use it to transcribe and translate audio files. The code assumes that the audio file is named `audio.mp3`. Change the variable `audio` below if your file has a different name or location.

To upload an audio file, first use the panel on the left to navigate to the folder where you want to store it. It is recommended to store the files in a (new) folder in `/data/volume_2`. Then press the 'Upload Files' button (upward arrow) in the same panel to upload the file.

### Import whisper

In [None]:
import whisper

### Load model and specify audio file
Run the code cell below to load the Whisper model that you need. To select a different model, change 'medium' to any of the following options:

**Model options:**  
- `'tiny'`
- `'base'`
- `'small'`
- `'medium'`
- `'large'`

**English-only models:**  
- `'tiny.en'`
- `'base.en'`
- `'small.en'`
- `'medium.en'`

In [None]:
model = whisper.load_model('medium')     
audio = "/data/volume_2/audiofiles/audio.mp3"
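
A wrong path is an easy mistake to make after uploading a file. The sketch below checks that the file exists before any transcription is started. The helper name `require_audio_file` is hypothetical and not part of Whisper.

```python
from pathlib import Path


def require_audio_file(path: str) -> Path:
    """Return the path as a Path object, or raise if the file does not exist."""
    audio_path = Path(path)
    if not audio_path.is_file():
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    return audio_path
```

In this notebook it could be called as `require_audio_file(audio)` right after the variable `audio` is set.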

### Transcribe
Run the code cell below to transcribe the audio file with the model selected above. 

In [None]:
result = model.transcribe(audio, verbose=True)
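
The returned `result` is a dictionary with, among other keys, the full `"text"` and a list of `"segments"`, each with `"start"`, `"end"` (in seconds) and `"text"`. The sketch below turns those segments into timestamped lines. The helper names and the sample data are made up for illustration.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS.mmm (minutes may exceed 59)."""
    total_ms = round(seconds * 1000)
    minutes, remainder_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(remainder_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(result: dict) -> list[str]:
    """Return one '[start --> end] text' line per segment."""
    return [
        f"[{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in result["segments"]
    ]


# Made-up data in the same shape as a Whisper result
sample_result = {
    "text": " Hello world. How are you?",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello world."},
        {"start": 2.5, "end": 65.25, "text": " How are you?"},
    ],
}
for line in format_segments(sample_result):
    print(line)
```

With the real output, `print("\n".join(format_segments(result)))` would list the whole transcript this way.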

### Translate
Run the code cell below to translate audio in another language into an English transcript, using the model selected above.

In [None]:
result = model.transcribe(audio, task='translate', verbose=True)

You can also specify the input language, which may improve accuracy or efficiency in some cases:

In [None]:
result = model.transcribe(audio, task='translate', language='nl', verbose=True)
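
Because the `.en` models are listed as English only, they are presumably not suitable for translating audio in other languages. The sketch below guards against that combination before a model is loaded. The function name `check_model_for_task` is hypothetical, and the rule it enforces is an assumption based on the model list above.

```python
def check_model_for_task(model_name: str, task: str = "transcribe") -> None:
    """Raise ValueError for an unknown task or an English-only model used for translation."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"Unknown task: {task!r}")
    # Assumption: models whose name ends in '.en' only handle English audio
    if task == "translate" and model_name.endswith(".en"):
        raise ValueError(
            f"Model {model_name!r} is English only; choose a multilingual model to translate."
        )
```

For example, `check_model_for_task('medium', 'translate')` passes silently, while `check_model_for_task('medium.en', 'translate')` raises an error.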

### Save output to files
Run this cell to save the transcript in all file formats that are supported by Whisper.

In [None]:
output_directory = "./"
options = {"max_line_width":None,
           "max_line_count":None,
           "highlight_words":None}

writer = whisper.utils.get_writer("all", output_directory)
writer(result, audio, options)

To write only the plain text to a text file:

In [None]:
with open("audio_sample.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])
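
The file name above is fixed, so a second transcript would overwrite the first. As a sketch, the name can instead be derived from the audio file so that each recording gets its own text file. The helper names `transcript_path` and `save_text` are hypothetical.

```python
from pathlib import Path


def transcript_path(audio_path: str, output_directory: str = "./", suffix: str = ".txt") -> Path:
    """Build an output path that reuses the audio file's name, e.g. audio.mp3 -> audio.txt."""
    return Path(output_directory) / (Path(audio_path).stem + suffix)


def save_text(result: dict, audio_path: str, output_directory: str = "./") -> Path:
    """Write the plain transcript text to a UTF-8 file and return its path."""
    target = transcript_path(audio_path, output_directory)
    target.write_text(result["text"].strip(), encoding="utf-8")
    return target
```

In this notebook, `save_text(result, audio, output_directory)` would write `audio.txt` to the output directory.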

# WhisperX

[WhisperX](https://github.com/m-bain/whisperX) can be used to improve timestamp accuracy, produce word-level timestamps, perform speaker diarization, and more.

To enable speaker diarization, add your Hugging Face access token in step 3 below. You can generate a token [here](https://huggingface.co/settings/tokens). On the Hugging Face website, you also need to accept the user agreement for the following models: Segmentation, Voice Activity Detection (VAD), and Speaker Diarization.

In [None]:
import gc
import torch
import whisperx

device = "cuda"
audio_file = "/data/volume_2/audiofiles/audio.mp3"  # same file as in the Whisper section above
batch_size = 16  # reduce if low on GPU memory
compute_type = "float16"  # change to "int8" if low on GPU memory (may reduce accuracy)

### 1. Transcribe with original whisper (batched)

In [None]:
model = whisperx.load_model("small", device, compute_type=compute_type)

audio_whisperx = whisperx.load_audio(audio_file)
result = model.transcribe(audio_whisperx, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# del model; gc.collect(); torch.cuda.empty_cache()
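
The settings used above assume that a CUDA GPU is available. The sketch below shows one way to fall back to the CPU when it is not. It assumes that CPU runs need `compute_type="int8"`, as the WhisperX documentation suggests. The function name is hypothetical.

```python
def pick_device_and_compute_type() -> tuple[str, str]:
    """Return ('cuda', 'float16') when a GPU is usable, else ('cpu', 'int8')."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
        has_gpu = torch.cuda.is_available()
    except ImportError:
        has_gpu = False
    # Assumption: float16 is meant for GPU runs, int8 for CPU runs
    return ("cuda", "float16") if has_gpu else ("cpu", "int8")
```

The result could replace the fixed values, for example `device, compute_type = pick_device_and_compute_type()`. Expect CPU runs to be much slower.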

### 2. Align whisper output

In [None]:
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio_whisperx, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc, torch; del model_a; gc.collect(); torch.cuda.empty_cache()
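
After alignment, each segment normally carries a `"words"` list with a `"word"` and, where alignment succeeded, `"start"` and `"end"` times. The sketch below flattens these into one list and tolerates words without timestamps, which can happen for tokens such as numbers. The helper name and sample data are made up for illustration.

```python
def collect_words(aligned_result: dict) -> list[dict]:
    """Flatten word-level entries from all segments; missing times become None."""
    words = []
    for segment in aligned_result["segments"]:
        for word in segment.get("words", []):
            words.append(
                {
                    "word": word["word"],
                    "start": word.get("start"),
                    "end": word.get("end"),
                }
            )
    return words


# Made-up data in the same shape as an aligned WhisperX result
sample_aligned = {
    "segments": [
        {
            "start": 0.0,
            "end": 1.2,
            "text": " Room 42 please",
            "words": [
                {"word": "Room", "start": 0.0, "end": 0.4, "score": 0.9},
                {"word": "42"},  # no timestamps available for this word
                {"word": "please", "start": 0.8, "end": 1.2, "score": 0.8},
            ],
        }
    ]
}
print(collect_words(sample_aligned))
```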

### 3. Assign speaker labels

If you don't have a Hugging Face token, create one [here](https://huggingface.co/settings/tokens). Make sure you have accepted the user agreement for the following models: Segmentation, Voice Activity Detection (VAD), and Speaker Diarization.

In [None]:
YOUR_HF_TOKEN = '<insert your huggingface token here>'

diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio_whisperx)
# diarize_model(audio_whisperx, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs
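
The printed segments are hard to read as raw dictionaries. As a sketch, the helper below merges consecutive segments from the same speaker into one line each. It assumes each segment has a `"text"` key and, when a speaker was assigned, a `"speaker"` key. The labels in the sample data and the `UNKNOWN` fallback are illustrative.

```python
def speaker_transcript(segments: list[dict]) -> list[str]:
    """Return 'SPEAKER: text' lines, merging consecutive segments of one speaker."""
    lines = []
    previous_speaker = None
    for segment in segments:
        speaker = segment.get("speaker", "UNKNOWN")  # fallback label is an assumption
        text = segment["text"].strip()
        if lines and speaker == previous_speaker:
            lines[-1] += " " + text
        else:
            lines.append(f"{speaker}: {text}")
        previous_speaker = speaker
    return lines


# Made-up segments in the shape produced after speaker assignment
sample_segments = [
    {"speaker": "SPEAKER_00", "text": " Hi."},
    {"speaker": "SPEAKER_00", "text": " How are you?"},
    {"speaker": "SPEAKER_01", "text": " Fine."},
    {"text": " Mm."},
]
print("\n".join(speaker_transcript(sample_segments)))
```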

### 4. Save output

In [None]:
import json
with open('data.json', 'w') as f:
    json.dump(result["segments"], f)
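
For transcripts in languages with accented or non-Latin characters, it can help to write the JSON as readable UTF-8 and to check that it loads back unchanged. The sketch below does both. The helper names are hypothetical.

```python
import json


def save_segments(segments: list, path: str) -> None:
    """Write segments as indented UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(segments, f, ensure_ascii=False, indent=2)


def load_segments(path: str) -> list:
    """Read segments back from a JSON file written by save_segments."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

In this notebook, `save_segments(result["segments"], "data.json")` would take the place of the cell above, and `load_segments("data.json")` reads the data back in a later session.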