### Work plan

1. Choose two 7-10 minute videos in English (TED talk, interview, etc.), one with a male voice and one with a female voice.
2. Preprocess the data: extract the audio, segment it, convert it to the format the ASR model requires, and apply denoising if needed.
3. Run any open-source ASR and MT models. Explain the choice of models.
4. Study the architecture of the XTTS model: https://arxiv.org/pdf/2406.04904
5. Zero-shot voice cloning:
   - clone the repository https://github.com/coqui-ai/TTS
   - generate audio for the corresponding Russian texts with the XTTSv2 model, using the original English recordings as the prompt for voice cloning
   - overlay the audio on the video
6. Few-shot voice cloning:
   - choose the voice of a character/actor for which zero-shot generation copies the voice imprecisely or misses its specific characteristics, and collect audio recordings of that voice
   - fine-tune the model on this data (try different amounts of data and tune hyperparameters if needed)
   - generate the same sentences before and after fine-tuning
   - compute an objective voice-similarity metric (speaker similarity): the cosine distance between the vectors the model produces for the original (gt) recording and for the generated recording, for both the zero-shot and few-shot variants
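
The speaker-similarity metric from step 6 can be sketched as follows. This is a minimal illustration, assuming the speaker vectors have already been extracted by some speaker encoder; the function names and the placeholder vectors are hypothetical, and cosine distance is taken here as 1 minus cosine similarity.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


# Arbitrary placeholder vectors standing in for real speaker embeddings.
gt_vector = [0.2, 0.7, 0.1]
zero_shot_vector = [0.1, 0.6, 0.4]
few_shot_vector = [0.2, 0.65, 0.15]

zero_shot_distance = cosine_distance(gt_vector, zero_shot_vector)
few_shot_distance = cosine_distance(gt_vector, few_shot_vector)
print(f"zero-shot: {zero_shot_distance:.3f}, few-shot: {few_shot_distance:.3f}")
```

A smaller distance means the generated voice is closer to the reference under the chosen encoder, so the same encoder should be used for every recording being compared.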

Present the results as:
  - a notebook with a step-by-step description of each stage, or scripts
  - a description of the XTTS model, its main architectural components and their roles, including which parts are updated during few-shot cloning
  - a description of the fine-tuning experiments and the current problems of the pipeline
  - the source and resulting videos

### Preprocessing

In [148]:
import os
import librosa
import torch
import datetime
import json
from moviepy import VideoFileClip
from pathlib import Path
from tqdm.notebook import tqdm
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [7]:
def convert_video_to_audio_moviepy(video_file: str | Path, output_ext: str = "mp3") -> None:
    """Extract the audio track of a video with MoviePy,
    which uses `ffmpeg` under the hood."""
    filename, _ = os.path.splitext(video_file)
    # The context manager closes the clip and releases its file handles.
    with VideoFileClip(str(video_file)) as clip:
        clip.audio.write_audiofile(f"{filename}.{output_ext}")

In [None]:
for video_file in tqdm(list(Path("data").glob("*.mp4"))):
    convert_video_to_audio_moviepy(video_file)

  0%|          | 0/6 [00:00<?, ?it/s]

MoviePy - Writing audio in data/ch_broken.mp3
MoviePy - Done.
MoviePy - Writing audio in data/Tom Cruise on Performing His Own Dangerous Stunts, ‘Mission Impossible’ Training, and Cliff-Hangers (360p).mp3
MoviePy - Done.
MoviePy - Writing audio in data/Cameron's interview - CANADA - #HUMAN (360p).mp3
MoviePy - Done.
MoviePy - Writing audio in data/1987 on TODAY- Bruce Willis talks balancing fame and privacy (360p).mp3
MoviePy - Done.
MoviePy - Writing audio in data/Bruno's interview - ENGLAND - #HUMAN (360p).mp3
MoviePy - Done.
MoviePy - Writing audio in data/James May Talks- TV, Animals, Clarkson & Hammond (360p).mp3
MoviePy - Done.
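
Step 2 of the plan asks for audio in the format the ASR model needs, and Whisper models work with 16 kHz mono input. As an alternative to writing MP3 files, the audio could be extracted directly as 16 kHz mono WAV. The sketch below assumes the `ffmpeg` command-line tool is installed and on `PATH`; the helper names `build_ffmpeg_command` and `extract_audio` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_file: Path, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that extracts mono audio at `sample_rate` Hz."""
    output_file = video_file.with_suffix(".wav")
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(video_file),    # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample to the target rate
        str(output_file),
    ]


def extract_audio(video_file: Path) -> None:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(video_file), check=True)


# Only build and show the command here; call extract_audio() to run it.
command = build_ffmpeg_command(Path("data/ch_broken.mp4"))
print(" ".join(command))
```

Keeping command construction separate from execution makes it easy to inspect the exact call before running it over every file in `data`.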


In [92]:
pipe = pipeline(
    "automatic-speech-recognition", model="openai/whisper-base", device=device
    # "automatic-speech-recognition", model="openai/whisper-small.en", device=device
)

Device set to use cpu


In [138]:
type(pipe)

transformers.pipelines.automatic_speech_recognition.AutomaticSpeechRecognitionPipeline

In [145]:
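def speech_to_text(model, audio_file):
    """Transcribe one audio file and return the text with chunk timestamps."""
    # Whisper expects 16 kHz input; librosa.load defaults to 22050 Hz.
    array, sampling_rate = librosa.load(audio_file, sr=16000)
    # Trim leading and trailing silence. Timestamps are then relative to the
    # trimmed audio; the second return value holds the trim interval in samples.
    trimmed_array, _ = librosa.effects.trim(array)
    return model(
        {
            "path": str(audio_file),
            "array": trimmed_array,
            "sampling_rate": sampling_rate
        },
        chunk_length_s=30,
        return_timestamps=True,
    )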

In [144]:
list(Path("data").glob("*.mp3"))[1].stem

'Tom Cruise on Performing His Own Dangerous Stunts, ‘Mission Impossible’ Training, and Cliff-Hangers (360p)'

In [147]:
result

{'ch_broken': {'text': ' Старый, ну чё ты, блядя, что такое случилось с тобой?',
  'chunks': [{'timestamp': (0.0, 9.48),
    'text': ' Старый, ну чё ты, блядя, что такое случилось с тобой?'}]}}

In [149]:
result = {}
for audio_file in tqdm(list(Path("data").glob("*.mp3"))):
    result[audio_file.stem] = speech_to_text(pipe, audio_file)

  0%|          | 0/5 [00:00<?, ?it/s]

Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


In [153]:
with open("stt_data.json", "w") as fout:
    json.dump(result, fout, sort_keys=True, indent=4)
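
One detail to keep in mind when reading `stt_data.json` back: JSON has no tuple type, so the `timestamp` tuples come back as lists. The sketch below shows the round trip on a small stand-in for one pipeline result (the values are hypothetical) and converts the timestamps back to tuples.

```python
import json

# A small stand-in for one pipeline result (hypothetical values).
example = {
    "clip": {
        "text": " Hello.",
        "chunks": [{"timestamp": (0.0, 1.5), "text": " Hello."}],
    }
}

restored = json.loads(json.dumps(example, sort_keys=True, indent=4))

# After the round trip the timestamp is a list, not a tuple.
timestamp = restored["clip"]["chunks"][0]["timestamp"]
print(type(timestamp).__name__, timestamp)

# Convert back to tuples if later code relies on tuple timestamps.
for entry in restored.values():
    for chunk in entry["chunks"]:
        chunk["timestamp"] = tuple(chunk["timestamp"])
```

Passing `ensure_ascii=False` to `json.dump` keeps non-ASCII characters (for example in file names or non-English transcripts) readable in the file instead of escaping them.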

In [None]:
dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
dataset

README.md:   0%|          | 0.00/520 [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/9.19M [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/73 [00:00<?, ? examples/s]

Dataset({
    features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
    num_rows: 73
})

In [93]:
# array, sampling_rate = librosa.load("data/James May Talks- TV, Animals, Clarkson & Hammond (360p).mp3")
array, sampling_rate = librosa.load("data/Cameron's interview - CANADA - #HUMAN (360p).mp3")

In [126]:
trimmed_array, _ = librosa.effects.trim(array)

In [None]:
out = pipe(
    {
        "path": "data/Cameron's interview - CANADA - #HUMAN (360p).mp3",
        "array": trimmed_array,
        "sampling_rate": sampling_rate
    },
    chunk_length_s=30,
    # batch_size=8,
    return_timestamps=True,
    # generate_kwargs={"task": "transcribe", "max_new_tokens": 256}
)



My name is Cameron Diaz. I'm 41 years old. I'm my weight of 42. I am an actor and film. I recently just wrote a book called The Body Book, co-authored with a wonderful writing partner named Sondra Park. And I grew up in Long Beach, California. I live in Los Angeles and New York. I go back and forth, but I like the seasons. And what else do I need to tell you? That's, I think, that's it, right? Okay. If one people... I like the seasons, and what else do I need to tell you? That's, I think that's it, right? Okay. When people say, oh, I want to be like you, I want to be an actor, I want to look like you. The question I always ask them is why? Like really why? And people, and especially America, this idea of fame, that to be famous means that you're successful, that you're happy. It's not about, I don't do what I do because I want to be famous. That's part of being famous is my job. That's, when I'm home, I am not, and I'm with my family and my friends. I am not famous, I am me, and I'm Cameron, and that doesn't, fame does not define me. And so if you are looking for fame to define you, then you will never be happy and you will always be searching for happiness and you will never find it in fame. And so it goes back to I think for me, authenticity, intention. Why do you want to do anything you do? It should not be motivated by something that you think is going to make fulfillment comes from within you by being authentic to yourself, not chasing fame. I was really fortunate, the education that I got from my parents was really an amazing, I was so lucky to have just the best parents who really trusted and empowered and they just gave us everything. There was an moment where my parents weren't teaching us. They were always teaching us, very actively teaching us, work ethic, how to be a good person, how to treat people, how to be a good human being. That was a very important thing to them. And there was always lessons, always lessons. 
And I think that they just loved with every part of their being. They loved us so deeply and that I think is just to know what it feels like to be loved, to know how to love and how to be loved is such an important thing. They really showed us, my sister and I, how to, you know, how to do that. When my father died, the moment we need to not knowing what that meant, not knowing what it would mean to be without him, I think was probably the hardest moment, and to have to see my mother have to say goodbye to him. and oh, just really hard and all the family but yeah I think that that moment of not knowing really was so painful and then after that knowing that and learning the lessons like being very soon like the next morning when I woke up and I was I had a I had this had this moment where I was like, oh my god, my dad is dead. Like he's dead. He's gone. And I had this moment where I had a visual of my dad standing next to me. And all of a sudden, I looked over and he was gone. And there was this massive hole, this huge, huge hole that was so deep I couldn't see the end of it, it was so dark, it was so overwhelming and on the other side of that hole was this massive pile of dirt. It was so high, I couldn't see the top of it. And I instinctively, I went, I walked around the hole, I saw myself walking around the hole and I started, I rolled up my sleeves and I started climbing through the dirt. And as soon as I dug my hand into the dirt, I pulled my hand out, and there was strands of emeralds and rubies, and in this hand was a gold goblet, and I was so perplexed for a moment, and I just immediately all of this, and I realized, Okay, where he once was is now this, he has left me a task, a responsibility that I must climb up this pile of dirt that seems so overwhelming. And I must dig out all of the treasures that he's left behind, all of the jewels, all of the lessons, all of the things that are going to enrich my life. 
And I just saw myself climbing through it and I'm still climbing up to the top. And I've gotten so many, he's left me so many beautiful treasures. So.

In [133]:
out["text"]

" My name is Cameron Diaz. I'm 41 years old. I'm my weight of 42. I am an actor and film. I recently just wrote a book called The Body Book, co-authored with a wonderful writing partner named Sondra Park. And I grew up in Long Beach, California. I live in Los Angeles and New York. I go back and forth, but I like the seasons. And what else do I need to tell you? That's, I think, that's it, right? Okay. If one people... I like the seasons, and what else do I need to tell you? That's, I think that's it, right? Okay. When people say, oh, I want to be like you, I want to be an actor, I want to look like you. The question I always ask them is why? Like really why? And people, and especially America, this idea of fame, that to be famous means that you're successful, that you're happy. It's not about, I don't do what I do because I want to be famous. That's part of being famous is my job. That's, when I'm home, I am not, and I'm with my family and my friends. I am not famous, I am me, and I'm 

In [135]:
out

{'text': " My name is Cameron Diaz. I'm 41 years old. I'm my weight of 42. I am an actor and film. I recently just wrote a book called The Body Book, co-authored with a wonderful writing partner named Sondra Park. And I grew up in Long Beach, California. I live in Los Angeles and New York. I go back and forth, but I like the seasons. And what else do I need to tell you? That's, I think, that's it, right? Okay. If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... If one people... I like the seasons, and what else do I need to tell you? That's, I think that's it, right? Okay. When people say, oh, I want to
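
The output above contains a loop in which the same sentence is repeated many times in a row. One simple way to clean the joined text is to drop any sentence that is identical to the one immediately before it. This is a minimal sketch, not part of the pipeline itself: the function name `collapse_repeats` is hypothetical, the sentence splitting is deliberately naive, and a genuine immediate repetition in the speech would be removed as well.

```python
import re


def collapse_repeats(text: str) -> str:
    """Drop sentences that immediately repeat the previous sentence."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        if not kept or sentence != kept[-1]:
            kept.append(sentence)
    return " ".join(kept)


looped = (
    "That's it, right? Okay. If one people... If one people... "
    "If one people... I like the seasons."
)
print(collapse_repeats(looped))
```

The same idea can be applied to `out["chunks"]` by comparing the `text` of neighbouring chunks, which also keeps the timestamps attached to the surviving chunks.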

In [112]:
chunks = out["chunks"]
processed_chunks = []
for i, chunk in enumerate(chunks):
    # Text of the following chunk; the last chunk has nothing after it.
    next_text = chunks[i + 1]["text"] if i + 1 < len(chunks) else ""
    start = chunk["timestamp"][0]
    # Skip chunks without a start time and chunks repeated inside the next one.
    if start is None or chunk["text"] in next_text:
        continue
    processed_chunks.append(
        {
            "start_time_sec": start,
            "start_time_min": str(datetime.timedelta(seconds=start)),
            **chunk,
        }
    )
processed_chunks[0]

{'start_time_sec': 0.0,
 'start_time_min': '0:00:00',
 'timestamp': (0.0, 7.0),
 'text': " My name is Cameron Diaz. I'm 41 years old on my way to 42. I am an actor and film."}

In [109]:
processed_chunks[0]

{'start_time_sec': 7.0,
 'start_time_min': '0:00:07',
 'timestamp': (7.0, 14.0),
 'text': ' I recently just wrote a book called The Body Book, co-authored with a wonderful writing partner named Sandra Park.'}

In [117]:
out

{'text': " My name is Cameron Diaz. I'm 41 years old on my way to 42. I am an actor and film. I recently just wrote a book called The Body Book, co-authored with a wonderful writing partner named Sandra Park. And I grew up in Long Beach, California. I live in Los Angeles and New York. I go back and forth, but I like the seasons. And what else do I need to tell you? That's, I think, that's it, right? Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. but I like the seasons, and what else do I need to tell you? That's, I think that's it, right? Okay. When people say, oh, I want to be like, you want to be an actor, I want to look like you. The question I always ask them is why? Like, really why? And people, and especially America, this idea of fame, that to be famous means that you're successful, that you're happy. It's not about, I don't do... I don't do.

In [115]:
[
    chunk
    for chunk in processed_chunks
    if chunk["timestamp"][0] < chunk["timestamp"][1]
]

[{'start_time_sec': 0.0,
  'start_time_min': '0:00:00',
  'timestamp': (0.0, 7.0),
  'text': " My name is Cameron Diaz. I'm 41 years old on my way to 42. I am an actor and film."},
 {'start_time_sec': 7.0,
  'start_time_min': '0:00:07',
  'timestamp': (7.0, 14.0),
  'text': ' I recently just wrote a book called The Body Book, co-authored with a wonderful writing partner named Sandra Park.'},
 {'start_time_sec': 14.0,
  'start_time_min': '0:00:14',
  'timestamp': (14.0, 22.0),
  'text': ' And I grew up in Long Beach, California. I live in Los Angeles and New York. I go back and forth, but I like the seasons.'},
 {'start_time_sec': 24.0,
  'start_time_min': '0:00:24',
  'timestamp': (24.0, 26.0),
  'text': ' Okay. Okay. Okay. Okay. Okay. Okay. Okay. but I like the seasons, and what else do I need to tell you?'},
 {'start_time_sec': 26.0,
  'start_time_min': '0:00:26',
  'timestamp': (26.0, 27.5),
  'text': " That's, I think that's it, right?"},
 {'start_time_sec': 27.5,
  'start_time_min

In [113]:
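# Keep only chunks whose start time does not go backwards.
monotonic_chunks = []
min_ts = 0.0
for chunk in processed_chunks:
    if chunk["timestamp"][0] >= min_ts:
        monotonic_chunks.append(chunk)
        min_ts = chunk["timestamp"][0]
monotonic_chunks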

[{'start_time_sec': 0.0,
  'start_time_min': '0:00:00',
  'timestamp': (0.0, 7.0),
  'text': " My name is Cameron Diaz. I'm 41 years old on my way to 42. I am an actor and film."},
 {'start_time_sec': 7.0,
  'start_time_min': '0:00:07',
  'timestamp': (7.0, 14.0),
  'text': ' I recently just wrote a book called The Body Book, co-authored with a wonderful writing partner named Sandra Park.'},
 {'start_time_sec': 14.0,
  'start_time_min': '0:00:14',
  'timestamp': (14.0, 22.0),
  'text': ' And I grew up in Long Beach, California. I live in Los Angeles and New York. I go back and forth, but I like the seasons.'},
 {'start_time_sec': 22.0,
  'start_time_min': '0:00:22',
  'timestamp': (22.0, 1.0),
  'text': " And what else do I need to tell you? That's, I think, that's it, right? Okay."},
 {'start_time_sec': 24.0,
  'start_time_min': '0:00:24',
  'timestamp': (24.0, 26.0),
  'text': ' Okay. Okay. Okay. Okay. Okay. Okay. Okay. but I like the seasons, and what else do I need to tell you?'},


In [None]:
# Inspect the raw chunks returned by the pipeline.
out["chunks"]

[{'timestamp': (0.0, 9.5),
  'text': " Hello viewers, I'm here James May sitting in this very comfortable armchair in Lucy Brown's social media dungeon to answer the difficult questions you've asked."},
 {'timestamp': (9.5, 15.0),
  'text': " And let's go straight into it whilst I enjoy this California dream gin and tonic."},
 {'timestamp': (20.5, 2.4),
  'text': " I know the idea of the animal sanctuary is something we've talked about as a TV show, which I think is fascinating as an idea. I had the, the neighbours would hate this."},
 {'timestamp': (2.4, 3.4),
  'text': ' I had the, the neighbours would hate this.'},
 {'timestamp': (3.4, 4.4),
  'text': ' I had the, the neighbours would hate this.'},
 {'timestamp': (4.4, 5.4),
  'text': ' I had the, the neighbours would hate this.'},
 {'timestamp': (5.4, 6.4),
  'text': ' I had the, the neighbours would hate this.'},
 {'timestamp': (6.4, 7.4),
  'text': ' I had the, the neighbours would hate this.'},
 {'timestamp': (7.4, 8.4),
  'text

In [78]:
out

{'text': " Hello viewers, I'm here James May sitting in this very comfortable armchair in Lucy Brown's social media dungeon to answer the difficult questions you've asked. And let's go straight into it whilst I enjoy this California dream gin and tonic. I know the idea of the animal sanctuary is something we've talked about as a TV show, which I think is fascinating as an idea. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the neighbours would hate this. I had the, the ne

In [36]:
sample = dataset[2]

In [49]:
sample

{'file': '/Users/sanchitgandhi/.cache/huggingface/datasets/downloads/extracted/aad76e6f21870761d7a8b9b34436f6f8db846546c68cb2d9388598d7a164fa4b/dev_clean/1272/128104/1272-128104-0002.flac',
 'audio': {'path': '1272-128104-0002.flac',
  'array': array([-6.71386719e-04,  6.10351562e-05,  5.18798828e-04, ...,
          1.52587891e-04,  2.13623047e-04,  1.83105469e-04], shape=(199760,)),
  'sampling_rate': 16000},
 'text': 'HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND',
 'speaker_id': 1272,
 'chapter_id': 128104,
 'id': '1272-128104-0002'}

In [52]:
out = pipe(sample["audio"].copy(), generate_kwargs={"language": "spanish", "task": "transcribe"})

In [53]:
out

{'text': ' Él dice a nosotros que en este feste de la season de la noche, con Chris Menes y Rose Beath looming antes de nosotros, similez de él y su resulta ocurre más rato a la mano.'}

In [32]:
sample["text"]

'HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND'
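
To compare an English transcription with the reference `text` more systematically than by eye, one common option is word error rate (WER): the word-level edit distance divided by the number of reference words. The sketch below is a plain-Python illustration with a simple normalization step; the function names are hypothetical and the hypothesis string is made up for the example rather than taken from a model run.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR"
hypothesis = "He tells us that at this festive season of the ear."  # made-up example
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

The metric only makes sense when the hypothesis is in the same language as the reference, so it does not apply to the run above where the language was forced to Spanish.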