# Translate with your own voice - Whisper

> PhD. Joel Omar Juarez Gambino

Team:

* Armas Ramirez Daniel
* Prezas Bernal Emiliano
* Escamilla Gachuz Karla Escamilla
* Dorado Alcala Nathaly

## Experiment 2 using Whisper

1. Install the tools and model dependencies

Download Whisper


In [3]:
! pip install git+https://github.com/openai/whisper.git
! pip install sounddevice wavio
! pip install ipywebrtc notebook

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-bhe1koow
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-bhe1koow
  Resolved https://github.com/openai/whisper.git to commit dd985ac4b90cafeef8712f2998d62c59c3e62d22
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->openai-whisper==20240930)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->openai-whisper==20240930)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch->openai-whisper==20240930)
  Downloading nvidia_cuda_c

We also need the following in order to record audio from this notebook and process the resulting files.

In [4]:
!apt install ffmpeg
!apt-get install libportaudio2

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  libportaudio2
0 upgraded, 1 newly installed, 0 to remove and 35 not upgraded.
Need to get 65.3 kB of archives.
After this operation, 223 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libportaudio2 amd64 19.6.0-1.1 [65.3 kB]
Fetched 65.3 kB in 1s (54.4 kB/s)
Selecting previously unselected package libportaudio2:amd64.
(Reading database ... 126111 files and directories currently installed.)
Preparing to unpack .../libportaudio2_19.6.0-1.1_amd64.deb ...
Unpacking libportaudio2:amd64 (19.6.0-1.1) ...
Setting up libportaudio2:amd64 (19.6.0-1.1) ...
Processing triggers for
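
Before moving on, it can help to confirm that the installs above actually landed in the current runtime. The following is a minimal sketch, not part of the original notebook: the helper names `missing_modules` and `missing_binaries` are hypothetical, and the module list assumes the usual import names of the packages installed above.

```python
import importlib.util
import shutil

# Import names of the packages installed above (assumed; import names can
# differ from pip package names).
REQUIRED_MODULES = ["whisper", "sounddevice", "wavio", "ipywebrtc"]
REQUIRED_BINARIES = ["ffmpeg"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_binaries(names):
    """Return the executables that are not available on the PATH."""
    return [name for name in names if shutil.which(name) is None]


print("Missing modules:", missing_modules(REQUIRED_MODULES))
print("Missing binaries:", missing_binaries(REQUIRED_BINARIES))
```

Two empty lists suggest the environment is ready. The check only locates modules without importing them, so it does not verify that system libraries such as PortAudio load correctly.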

In [5]:
import os
import numpy as np

try:
    import tensorflow
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

from ipywebrtc import AudioRecorder, CameraStream
from IPython.display import Audio, display
import ipywidgets as widgets

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

## Make your recording

1. Load widgets in Colab

In [6]:
from google.colab import output
output.enable_custom_widget_manager()

In [7]:
camera = CameraStream(constraints={'audio': True, 'video': False})
recorder = AudioRecorder(stream=camera)
recorder

AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …

2. Transform the audio to WAV for the model

In [16]:
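# Save the bytes captured by the recorder widget to disk (WebM container)
with open('record.webm', 'wb') as f:
    f.write(recorder.audio.value)
# Convert the saved file to mono WAV; the input name must match the file written above
!ffmpeg -i record.webm -ac 1 -f wav my_recording.wav -y -hide_banner -loglevel panic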
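
Because the ffmpeg call above runs with `-loglevel panic`, a failed conversion produces no message. One way to check the result is to read the WAV header back with Python's standard `wave` module. This is a sketch under the assumption that the output is a plain PCM WAV file; `describe_wav` is a hypothetical helper, not part of the notebook.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        frames = wav_file.getnframes()
    return channels, sample_rate, frames / sample_rate


# Usage, assuming the conversion cell has produced my_recording.wav:
# channels, sample_rate, seconds = describe_wav("my_recording.wav")
# print(f"{channels} channel(s), {sample_rate} Hz, {seconds:.2f} s")
```

With the `-ac 1` flag used above, the channel count should be 1. A missing file or a duration of zero points to a problem in the recording or conversion step.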

## Select Language and mode

1. We choose Spanish for our task; the cell below lists the language names Whisper accepts

In [17]:
language_options = whisper.tokenizer.TO_LANGUAGE_CODE
language_list = list(language_options.keys())
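
The table loaded above maps lowercase language names to short codes. The sketch below illustrates that kind of lookup with a three-entry excerpt; `LANGUAGE_EXCERPT` and `resolve_language` are hypothetical names for illustration, and the real table is much larger.

```python
# Illustrative excerpt only: the real table is whisper.tokenizer.TO_LANGUAGE_CODE.
LANGUAGE_EXCERPT = {"english": "en", "spanish": "es", "french": "fr"}


def resolve_language(name_or_code, table=LANGUAGE_EXCERPT):
    """Return the code for a name such as 'Spanish' or a code such as 'es'."""
    key = name_or_code.strip().lower()
    if key in table:
        return table[key]      # a language name was given
    if key in table.values():
        return key             # a code was given directly
    raise ValueError(f"Unsupported language: {name_or_code!r}")


print(resolve_language("Spanish"))  # prints "es"
```

In the notebook, passing `language_options` as the `table` argument would run the same lookup against the full list.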

## Load Whisper model

In [18]:
model = whisper.load_model("base")
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

Model is multilingual and has 71,825,920 parameters.
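
The parameter count printed above is the sum, over all parameter tensors, of the product of each tensor's dimensions. A minimal sketch of that arithmetic using only the standard library follows; the shapes are hypothetical and much smaller than any real model.

```python
import math


def count_parameters(shapes):
    """Sum the number of elements over a collection of tensor shapes."""
    return sum(math.prod(shape) for shape in shapes)


# Hypothetical shapes for a tiny linear layer: a 4x3 weight matrix and a bias of 4.
toy_shapes = [(4, 3), (4,)]
print(f"{count_parameters(toy_shapes):,} parameters")  # prints "16 parameters"
```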


Finally, let's set the rest of our task and language options below and see what we've got. Check that your task and language settings are correct, but don't worry about the other defaults.

In [19]:
options = whisper.DecodingOptions(language='spanish', task='transcribe', without_timestamps=True)
options

DecodingOptions(task='transcribe', language='spanish', temperature=0.0, sample_len=None, best_of=None, beam_size=None, patience=None, length_penalty=None, prompt=None, prefix=None, suppress_tokens='-1', suppress_blank=True, without_timestamps=True, max_initial_timestamp=1.0, fp16=True)

## Start transcribing

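Time to load our file and see the results. The audio is padded or trimmed to the 30-second window the model expects before the log-Mel spectrogram is computed.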

In [20]:
audio = whisper.load_audio("record.webm")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
result = model.decode(mel, options)
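
The `pad_or_trim` call above fits the audio to a fixed window of 30 seconds at 16 kHz, which is 480,000 samples. The following sketch shows the idea on plain Python lists. It is an illustration, not Whisper's implementation, and `pad_or_trim_sketch` is a hypothetical name.

```python
SAMPLE_RATE = 16_000                     # samples per second expected by Whisper
CHUNK_SECONDS = 30                       # length of one decoding window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples


def pad_or_trim_sketch(samples, length=N_SAMPLES):
    """Trim a sequence to `length` items, or pad it with zeros at the end."""
    samples = list(samples)
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


short_clip = [0.1] * SAMPLE_RATE         # one second of audio
print(len(pad_or_trim_sketch(short_clip)))  # prints 480000
```

A consequence worth keeping in mind is that `model.decode` sees a single window, so anything recorded past the first 30 seconds is dropped in this cell.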

In [21]:
result.text

'Hola, hoy es 15 de julio.'

# Multilingual translation using NLLB (No Language Left Behind) from Meta

### 1. Installing dependencies

In [1]:
!pip install transformers sentencepiece



### 2. Building translation architecture

In [22]:
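from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Pretrained NLLB translation model from Meta (distilled, 600M parameters).
# Note: this rebinds `model`, so the Whisper model loaded earlier is no longer
# reachable under that name after this cell runs.
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(DEVICE)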

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### 3. Setting languages

In [39]:
# Spanish language names mapped to NLLB (FLORES-200) language codes
languages = {
    'inglés': 'eng_Latn',
    'coreano': 'kor_Hang',
    'japonés': 'jpn_Jpan',
    'chino': 'zho_Hans',
    'francés': 'fra_Latn',
    'italiano': 'ita_Latn',
    'ruso': 'rus_Cyrl',
    'urdu': 'urd_Arab',
    'alemán': 'deu_Latn'
}
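
The codes in the dictionary above share one shape: a three-letter lowercase language code, an underscore, and a four-letter script name such as `Latn` or `Cyrl`. A typo in a code is easy to miss, so a quick shape check can catch it early. This is a sketch: `invalid_codes` is a hypothetical helper, and it only checks the format, not whether the model actually supports the code.

```python
import re

# Three lowercase letters (language), an underscore, then a capitalised
# four-letter script name.
NLLB_CODE_PATTERN = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")


def invalid_codes(language_map):
    """Return the entries whose code does not match the expected shape."""
    return {
        name: code
        for name, code in language_map.items()
        if not NLLB_CODE_PATTERN.match(code)
    }


sample = {"inglés": "eng_Latn", "ruso": "rus_Cyrl", "typo": "zho_hans"}
print(invalid_codes(sample))  # prints {'typo': 'zho_hans'}
```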

### 4. Defining the translation function

In [26]:
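def translate(text, destination_lang):
    """Translate `text` into the language named by a key of `languages`."""
    if destination_lang not in languages:
        raise ValueError(f"Language not found: {destination_lang!r}")

    target_lang = languages[destination_lang]

    # Tokenize the input text and move the tensors to the model's device
    inputs = tokenizer(text, return_tensors="pt").to(DEVICE)

    # Look up the token id of the target language code
    target_lang_id = tokenizer.convert_tokens_to_ids(target_lang)

    # Generate the translation, forcing the target language token as the first output token
    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=target_lang_id,
        max_length=200
    )

    # Decode the generated ids back to text, dropping special tokens
    translation = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return translation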
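
The function above requires an exact key match, so `"ingles"` or `"Inglés"` would raise an error. One possible refinement is to ignore case and accents when looking up the name. This is a sketch, not part of the original notebook; `normalize_name` and `find_language_code` are hypothetical helpers.

```python
import unicodedata


def normalize_name(name):
    """Lowercase a language name and strip accents, e.g. 'Inglés' -> 'ingles'."""
    decomposed = unicodedata.normalize("NFD", name.strip().lower())
    # Drop combining marks (category "Mn") left behind by the decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def find_language_code(name, language_map):
    """Look up a code while ignoring case and accents in the name."""
    lookup = {normalize_name(key): code for key, code in language_map.items()}
    try:
        return lookup[normalize_name(name)]
    except KeyError:
        raise ValueError(f"Language not found: {name!r}") from None


print(find_language_code("Ingles", {"inglés": "eng_Latn"}))  # prints "eng_Latn"
```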

### 5. Testing

**Korean**

In [27]:
corpus = result.text
idioma = "coreano"

traduccion = translate(corpus, idioma)
print(f"Texto original: {corpus}")
print(f"Traducción a {idioma}: {traduccion}")

Texto original: Hola, hoy es 15 de julio.
Traducción a coreano: 안녕하세요, 오늘은 7월 15일입니다.


**English**

In [29]:
idioma = "inglés"

traduccion = translate(corpus, idioma)
print(f"Texto original: {corpus}")
print(f"Traducción a {idioma}: {traduccion}")

Texto original: Hola, hoy es 15 de julio.
Traducción a inglés: Hello, today is July 15th.


**Japanese**

In [30]:
idioma = "japonés"

traduccion = translate(corpus, idioma)
print(f"Texto original: {corpus}")
print(f"Traducción a {idioma}: {traduccion}")

Texto original: Hola, hoy es 15 de julio.
Traducción a japonés: こんにちは 今日は7月15日です


**Chinese**

In [31]:
idioma = "chino"

traduccion = translate(corpus, idioma)
print(f"Texto original: {corpus}")
print(f"Traducción a {idioma}: {traduccion}")

Texto original: Hola, hoy es 15 de julio.
Traducción a chino: 你好,今天是7月15日.


**French**

In [32]:
idioma = "francés"

traduccion = translate(corpus, idioma)
print(f"Texto original: {corpus}")
print(f"Traducción a {idioma}: {traduccion}")

Texto original: Hola, hoy es 15 de julio.
Traducción a francés: Bonjour, aujourd'hui est le 15 juillet.


**Italian**

In [33]:
idioma = "italiano"

traduccion = translate(corpus, idioma)
print(f"Texto original: {corpus}")
print(f"Traducción a {idioma}: {traduccion}")

Texto original: Hola, hoy es 15 de julio.
Traducción a italiano: Ciao, oggi è il 15 luglio.


**Russian**

In [34]:
idioma = "ruso"

traduccion = translate(corpus, idioma)
print(f"Texto original: {corpus}")
print(f"Traducción a {idioma}: {traduccion}")

Texto original: Hola, hoy es 15 de julio.
Traducción a ruso: Здравствуйте, сегодня 15 июля.


**Urdu**

In [38]:
idioma = "urdu"

traduccion = translate(corpus, idioma)
print(f"Texto original: {corpus}")
print(f"Traducción a {idioma}: {traduccion}")

Texto original: Hola, hoy es 15 de julio.
Traducción a urdu: ہیلو، آج 15 جولائی ہے.


**German**

In [40]:
idioma = "alemán"

traduccion = translate(corpus, idioma)
print(f"Texto original: {corpus}")
print(f"Traducción a {idioma}: {traduccion}")

Texto original: Hola, hoy es 15 de julio.
Traducción a alemán: Hallo, heute ist der 15. Juli.
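
The test cells above repeat the same few lines once per language. A loop could cover every entry in one cell. The sketch below takes the translation function as an argument so it can be tried without loading the model; `translate_all` and `format_report` are hypothetical helpers, not part of the original notebook.

```python
def translate_all(text, language_names, translate_fn):
    """Apply translate_fn(text, name) to each name and collect the results."""
    return {name: translate_fn(text, name) for name in language_names}


def format_report(text, translations):
    """Build a report: the original text followed by one line per language."""
    lines = [f"Texto original: {text}"]
    lines += [f"Traducción a {name}: {result}" for name, result in translations.items()]
    return "\n".join(lines)


# Usage in the notebook (assumes `translate`, `languages` and `corpus` are defined):
# print(format_report(corpus, translate_all(corpus, languages, translate)))
```

Iterating over the `languages` dictionary yields its keys, which are the names that `translate` expects as its second argument.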
