<a href="https://colab.research.google.com/github/ayabdi/SeamlessM4t/blob/main/SeamlessM4T.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SeamlessM4T

| Code Credits | Link |
| ----------- | ---- |
| 🎉 seamless_communication | [![GitHub Repository](https://img.shields.io/github/stars/facebookresearch/seamless_communication?style=social)](https://github.com/facebookresearch/seamless_communication) |
| 🚀 Online inference | [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/facebook/seamless_m4t) |
| 🔥 Discover More Colab Notebooks | [![GitHub Repository](https://img.shields.io/badge/GitHub-Repository-black?style=flat-square&logo=github)](https://github.com/R3gm/InsightSolver-Colab/) |


SeamlessM4T (Massively Multilingual & Multimodal Machine Translation) is a single model that handles both speech and text translation for up to 100 languages.

Translating one audio track into another is usually done with a cascaded system: transcription, then text translation, then speech synthesis, as in [SoniTranslate](https://github.com/R3gm/SoniTranslate). SeamlessM4T can perform all of these tasks with one model.

In [None]:
!pip install fairseq2 pydub yt-dlp
!git clone https://github.com/facebookresearch/seamless_communication.git
%cd seamless_communication
!pip install .

Collecting fairseq2
  Downloading fairseq2-0.2.1-py3-none-any.whl (191 kB)
     191.8/191.8 kB 1.9 MB/s eta 0:00:00
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Collecting yt-dlp
  Downloading yt_dlp-2024.7.9-py3-none-any.whl (3.1 MB)
     3.1/3.1 MB 41.7 MB/s eta 0:00:00
Collecting fairseq2n==0.2.1 (from fairseq2)
  Downloading fairseq2n-0.2.1-cp310-cp310-manylinux2014_x86_64.whl (2.3 MB)
     2.3/2.3 MB 63.4 MB/s eta 0:00:00
Collecting jiwer~=3.0 (from fairseq2)
  Downloading jiwer-3.0.4-py3-none-any.whl (21 kB)
Collecting overrides~=7.3 (from fairseq2)
  Downloading overrides-7.7.0-py3-none-any.whl (17 kB)
Collecting packaging~=23.1 (from fairseq2)
  Downloading packaging-23.2-py3-none-any.whl (53 k

Utility Functions and Libraries

In [None]:
from seamless_communication.inference import Translator
from IPython.display import Audio, display
from pydub import AudioSegment
from pydub.silence import split_on_silence
import torchaudio
import torch
import os

def save_and_play_audio(path_save, wav):
    torchaudio.save(
        path_save,
        wav.audio_wavs[0][0].cpu().to(torch.float32),
        sample_rate=wav.sample_rate
    )

    audio_play = Audio(path_save, rate=wav.sample_rate, autoplay=True, normalize=True)
    display(audio_play)

def split_audio_with_max_duration(input_file, output_directory, min_silence_len=2500, silence_thresh=-60, max_chunk_duration=15000):
    """Split a WAV file on silence, then cap every chunk at max_chunk_duration.

    Durations are in milliseconds and silence_thresh is in dBFS (pydub conventions).
    """
    sound = AudioSegment.from_wav(input_file)

    # First pass: cut wherever there is at least min_silence_len ms of silence
    audio_chunks = split_on_silence(sound, min_silence_len=min_silence_len, silence_thresh=silence_thresh)

    # Second pass: divide chunks that are still too long into equal parts
    final_audio_chunks = []
    for chunk in audio_chunks:
        if len(chunk) > max_chunk_duration:
            num_subchunks = len(chunk) // max_chunk_duration + 1
            for i in range(num_subchunks):
                # Integer boundaries that cover the whole chunk, so no trailing audio is dropped
                start_idx = i * len(chunk) // num_subchunks
                end_idx = (i + 1) * len(chunk) // num_subchunks
                final_audio_chunks.append(chunk[start_idx:end_idx])
        else:
            final_audio_chunks.append(chunk)

    # Export every chunk as chunk{i}.wav
    os.makedirs(output_directory, exist_ok=True)
    for i, chunk in enumerate(final_audio_chunks):
        output_file = f"{output_directory}/chunk{i}.wav"
        print("Exporting file", output_file)
        chunk.export(output_file, format="wav")

Load the model

In [None]:
# Initialize a Translator with the multitask model and the vocoder, both on the GPU.
translator = Translator(
    "seamlessM4T_v2_large",
    "vocoder_36langs",
    torch.device("cuda:0")
)

Downloading the checkpoint of seamlessM4T_v2_large...
100%|██████████| 8.45G/8.45G [02:26<00:00, 62.1MB/s]
Downloading the tokenizer of seamlessM4T_v2_large...
100%|██████████| 360k/360k [00:00<00:00, 7.41MB/s]
Downloading the tokenizer of seamlessM4T_v2_large...
100%|██████████| 4.93M/4.93M [00:00<00:00, 71.1MB/s]
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Downloading the checkpoint of vocoder_36langs...
100%|██████████| 160M/160M [00:00<00:00, 191MB/s]


We will process the audio from a YouTube video.
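
A video URL can contain characters such as `&`, which a shell treats as a control operator, so it is safer to pass the URL to `yt-dlp` as a single argument or to quote it. The sketch below builds a download command as an argument list and prints the quoted form. `build_ytdlp_command` is a hypothetical helper, not part of `yt-dlp`, and it only reuses flags that appear in this notebook.

```python
import shlex

def build_ytdlp_command(url, output="Video.mp4"):
    """Return a yt-dlp invocation as an argument list (hypothetical helper)."""
    return [
        "yt-dlp", "-f", "mp4",
        "--force-overwrites", "--no-warnings",
        "--restrict-filenames",
        "-o", output,
        url,  # passed as one argument, so '&' is never interpreted by a shell
    ]

url = "https://www.youtube.com/watch?v=Iwg-jGQVUVE&t"
command = build_ytdlp_command(url)

# shlex.join quotes every argument that needs it, for pasting into a shell
print(shlex.join(command))
```

The list form can be handed to `subprocess.run(command, check=True)` without `shell=True`, which avoids shell parsing altogether.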

In [None]:
# Download the video
video_url = 'https://www.youtube.com/watch?v=Iwg-jGQVUVE'
# Quote the URL so the shell never interprets characters such as '&'
!yt-dlp -f "mp4" --force-overwrites --no-warnings --no-abort-on-error --ignore-no-formats-error --restrict-filenames -o Video.mp4 "$video_url"
#!yt-dlp -f "(bestvideo+bestaudio/best)[protocol!*=dash][ext=mp4]" --external-downloader ffmpeg --external-downloader-args "ffmpeg_i:-ss 00:00:58.00 -to 04:54:34.00"  --force-overwrites --no-warnings --no-abort-on-error --ignore-no-formats-error --restrict-filenames -o Video.mp4  $video_url

[youtube] Extracting URL: https://www.youtube.com/watch?v=Iwg-jGQVUVE
[youtube] Iwg-jGQVUVE: Downloading webpage
[youtube] Iwg-jGQVUVE: Downloading ios player API JSON
[youtube] Iwg-jGQVUVE: Downloading m3u8 information
[info] Iwg-jGQVUVE: Downloading 1 format(s): 18
Deleting existing file Video.mp4
[download] Destination: Video.mp4
[download] 100% of    3.13MiB in 00:00:00 at 8.37MiB/s


In [None]:
# Convert to 16 kHz mono WAV (the seamless_communication README asks for 16 kHz input)
!ffmpeg -y -i Video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab

## Split the audio

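Because of model limitations, the audio has to be split into short chunks before translation. `split_audio_with_max_duration` first cuts on silences and then divides any chunk longer than `max_chunk_duration` (15 seconds by default) into equal parts.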
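
The second stage of the splitting, where a chunk longer than `max_chunk_duration` is divided into equal parts, comes down to integer arithmetic on millisecond positions. Here is a minimal sketch in plain Python with no audio involved. `subchunk_bounds` is a hypothetical helper, and it places the boundaries so that the parts cover the whole chunk:

```python
def subchunk_bounds(length_ms, max_chunk_duration=15000):
    """Return (start, end) millisecond pairs that cover a chunk of length_ms."""
    if length_ms <= max_chunk_duration:
        return [(0, length_ms)]
    # Same part count as the notebook's helper: one more than the whole multiples
    num_subchunks = length_ms // max_chunk_duration + 1
    return [
        (i * length_ms // num_subchunks, (i + 1) * length_ms // num_subchunks)
        for i in range(num_subchunks)
    ]

bounds = subchunk_bounds(40_000)
print(bounds)  # [(0, 13333), (13333, 26666), (26666, 40000)]
```

A 40-second chunk therefore becomes three parts of roughly 13.3 seconds each, all under the 15-second limit.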

In [None]:
input_audio_file = "/content/seamless_communication/audio.wav"
output_directory = "/content/seamless_communication/split_segments"

# -p avoids an error when the directory already exists
!mkdir -p split_segments
!rm -rf /content/seamless_communication/split_segments/*
split_audio_with_max_duration(input_audio_file, output_directory)

Exporting file /content/seamless_communication/split_segments/chunk0.wav
Exporting file /content/seamless_communication/split_segments/chunk1.wav
Exporting file /content/seamless_communication/split_segments/chunk2.wav
Exporting file /content/seamless_communication/split_segments/chunk3.wav
Exporting file /content/seamless_communication/split_segments/chunk4.wav


In [None]:
# Play a split
audio_path = '/content/seamless_communication/split_segments/chunk0.wav'
audio = Audio(audio_path, rate=44100, autoplay=True, normalize=True)
display(audio)



## Speech-to-speech translation

In [None]:
# Example: translate one chunk from English to Hindi
translated_text, wav = translator.predict(
    input='/content/seamless_communication/split_segments/chunk1.wav',
    task_str='S2ST',
    tgt_lang='hin', # target language
    src_lang='eng', # source language; specifying it improves the model's result
)

# Save the audio and play it
save_and_play_audio(
    '/content/seamless_communication/audiot.wav',
    wav, # speech output returned by predict
)

Now we will translate all the segments and combine them into a new audio file.
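
The order in which the chunks are read back determines the order of the combined audio. The files are named `chunk{i}.wav`, and lexicographic ordering puts `chunk10.wav` before `chunk2.wav`, so a numeric sort key is needed once there are more than ten chunks. Below is a small sketch of the difference. `chunk_index` is a hypothetical helper name and the file list is made up for illustration:

```python
import re

def chunk_index(filename):
    """Extract the integer index from a name such as 'chunk12.wav'."""
    match = re.fullmatch(r"chunk(\d+)\.wav", filename)
    if match is None:
        raise ValueError(f"unexpected chunk filename: {filename}")
    return int(match.group(1))

names = ["chunk10.wav", "chunk2.wav", "chunk0.wav", "chunk1.wav"]

lexicographic = sorted(names)
numeric = sorted(names, key=chunk_index)
print(lexicographic)  # ['chunk0.wav', 'chunk1.wav', 'chunk10.wav', 'chunk2.wav']
print(numeric)        # ['chunk0.wav', 'chunk1.wav', 'chunk2.wav', 'chunk10.wav']
```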

In [None]:
segments = []

def chunk_index(filename):
    # Filenames look like chunk{index}.wav
    return int(filename[5:-4])

chunk_files = [
    f for f in os.listdir(output_directory)
    if f.startswith("chunk") and f.endswith(".wav")
]

# Sort numerically so that chunk10.wav comes after chunk2.wav
for filename in sorted(chunk_files, key=chunk_index):
    segment_path = os.path.join(output_directory, filename)

    translated_text, wav = translator.predict(
        input=segment_path,
        task_str='s2st',
        tgt_lang='hin',
        src_lang='eng',
    )
    print(translated_text, segment_path)

    # Write the translation next to the source chunk instead of overwriting it
    translated_path = os.path.join(output_directory, f"translated_{filename}")
    torchaudio.save(
        translated_path,
        wav.audio_wavs[0][0].cpu().to(torch.float32),
        sample_rate=wav.sample_rate,
    )
    segments.append(AudioSegment.from_file(translated_path))

# Concatenate the translated segments once, after the loop has finished
combined_audio = sum(segments)
combined_audio.export('/content/seamless_communication/audio_eng.mp3', format="mp3")

[CString('जब कोई आप पर मुस्कुराता है, तो यह आराम और आनंद लाता है। कभी-कभी मुस्कुराना आपको घबरा सकता है।')] /content/seamless_communication/split_segments/chunk0.wav
[CString('he was born for smiling the prophet smiled to the white the one who was older the black')] /content/seamless_communication/split_segments/chunk1.wav
[CString('Because smiling always means something good. There's different kinds of smiling. There's smiling which is mockery, there's smiling which is fake, there's smiling when you're angry.')] /content/seamless_communication/split_segments/chunk2.wav
[CString('Smile is not always one interpretation, that's why I want to talk about it. What is smile?')] /content/seamless_communication/split_segments/chunk3.wav
[CString('यह समझने के लिए इस बात का ध्यान रखें कि मुस्कुराहट का आपके जीवन पर कितना सकारात्मक प्रभाव पड़ेगा।')] /content/seamless_communication/split_segments/chunk4.wav
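
Concatenating the translated segments means that each one starts where the previous one ends, so the start times follow from the durations of the translated audio and not from the index of the source chunk. Here is a sketch of that bookkeeping. `start_offsets` is a hypothetical helper, and the durations are illustrative values in milliseconds, not measurements from this video:

```python
from itertools import accumulate

def start_offsets(durations_ms):
    """Start time of each segment when segments are played back to back."""
    if not durations_ms:
        return []
    # The first segment starts at 0; each later one starts at the running total
    return [0] + list(accumulate(durations_ms))[:-1]

durations_ms = [7200, 5400, 11800, 6100, 8000]  # hypothetical segment lengths
offsets = start_offsets(durations_ms)
total_ms = sum(durations_ms)
print(offsets)   # [0, 7200, 12600, 24400, 30500]
print(total_ms)  # 38500
```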


In [None]:
audio_path = '/content/seamless_communication/audio_eng.mp3'
audio = Audio(audio_path, rate=44100, autoplay=True, normalize=True)
display(audio)

## Text-to-speech translation

In [None]:
text = 'En el bosque encantado'

In [None]:
# predict returns the translated text and the speech output
translated_text, wav = translator.predict(
    text,
    "t2st",
    tgt_lang='eng',
    src_lang='spa'
)

save_and_play_audio(
    '/content/seamless_communication/text2speech.wav',
    wav,
)

## Text-to-text translation

In [None]:
text = 'En el bosque encantado, un zorro curioso halló un reloj antiguo. Al tocarlo, quedó atrapado en un bucle temporal. Buscó ayuda de un búho sabio, quien reveló que solo resolviendo acertijos podría romper el hechizo. Juntos descifraron enigmas, liberando al zorro y tejiendo una amistad eterna.'

In [None]:
# predict returns (text outputs, speech output); the second value is unused for text-only tasks
translated_text, _ = translator.predict(text, "t2tt", 'eng', src_lang='spa')
translated_text[0]

CString('In the enchanted forest, a curious fox found an ancient clock. When he touched it, he was trapped in a time loop. He sought help from a wise owl, who revealed that only by solving riddles could he break the spell. Together they solved riddles, freeing the fox and forging an eternal friendship.')

## Speech-to-text translation

In [None]:
# Resample the audio to 16 kHz, the input rate the seamless_communication README asks for
resample_rate = 16000
waveform, sample_rate = torchaudio.load('/content/seamless_communication/split_segments/chunk1.wav')
resampler = torchaudio.transforms.Resample(sample_rate, resample_rate, dtype=waveform.dtype)
resampled_waveform = resampler(waveform)
torchaudio.save('/content/seamless_communication/split_segments/resample_chunk1.wav', resampled_waveform, resample_rate)
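
Before resampling, it can help to check what a file actually contains. The sketch below reads the sample rate and channel count from a WAV header with Python's standard `wave` module. It first writes one second of silence to a temporary directory so that it runs on its own. `wav_info` is a hypothetical helper, and `TARGET_RATE = 16000` is an assumption based on the 16 kHz input that the seamless_communication README asks for.

```python
import os
import tempfile
import wave

TARGET_RATE = 16000  # assumed model input rate, see the note above

def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / rate
    return rate, channels, duration

# Write one second of 16-bit stereo silence at 44.1 kHz as a stand-in input
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(2)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav_file.setframerate(44100)
    wav_file.writeframes(b"\x00" * (44100 * 2 * 2))

rate, channels, duration = wav_info(demo_path)
needs_resampling = rate != TARGET_RATE
print(rate, channels, duration, needs_resampling)  # 44100 2 1.0 True
```

Calling `wav_info` on a chunk from `split_segments` in the same way shows whether the resampling step is needed for that file.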

In [None]:
# predict returns (text outputs, speech output); only the text is needed here
translated_text, _ = translator.predict('/content/seamless_communication/split_segments/resample_chunk1.wav', "s2tt", 'eng')
translated_text[0]

CString('And he's going to answer some questions: First, what are the most polluted areas from solid waste or packages that are in the school?')

License: Attribution-NonCommercial 4.0 International (https://github.com/facebookresearch/seamless_communication/blob/main/LICENSE)