# Project for the course [Algorithms for speech and natural language processing](https://github.com/edupoux/MVA_2021_SL)

Authors: Hugo Laurençon (hugo.laurencon@gmail.com), Alexandre Perez (alexandre.perez.enpc@gmail.com), Charbel-Raphaël Ségerie (charbel-raphael.segerie@hotmail.fr)

Project proposal: https://fr.overleaf.com/project/601e7f56c6528f6574fe77e8

# Resources

Unsupervised pretraining transfers well across languages: https://arxiv.org/abs/2002.02848

Wav2Vec 2.0 Paper: https://arxiv.org/abs/2006.11477

Wav2Vec2 Documentation: https://huggingface.co/transformers/master/model_doc/wav2vec2.html#wav2vec2forctc

Other resources: https://docs.google.com/document/d/1P8pTAdIAZ14lZJzENwXBSFqUIUIdVa59SCTv1pXdhFs/edit#

# Getting started

## Toy dataset of LibriSpeech

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset
import soundfile as sf

def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

## Creation of a first pre-trained Wav2Vec model

In [None]:
!pip install transformers

In [None]:
from transformers import Wav2Vec2Tokenizer, Wav2Vec2Model

tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

## Creation of embeddings for the toy dataset of LibriSpeech with the pre-trained Wav2Vec model

In [None]:
input_values = tokenizer(ds["speech"][0], return_tensors="pt").input_values  # Batch size 1
hidden_states = model(input_values).last_hidden_state

## Analysis

ds is a dataset whose attributes include 'text' and 'speech'.

ds['text'] is a list of sentences, which serve as the labels.

ds['speech'] is a list of lists. Each sub-list has a variable length that depends on the duration of the audio, but is typically around 100,000 samples long (note that only a mono signal is considered, not stereo), and contains real numbers that generally lie between -1 and 1 at the extremes.

input_values = tokenizer(ds["speech"][0], return_tensors="pt").input_values returns essentially the same values as ds["speech"][0], but as a tensor and with the values rounded to the fourth decimal place.

hidden_states = model(input_values).last_hidden_state returns a tensor of shape (1, N, 768), where 1 is the batch size, N depends on the length of the audio in input_values (typically N is around 500), and 768 is the embedding size.
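
The dependence of N on the audio length can be made concrete. The following is a minimal sketch, assuming the convolutional feature encoder of the base wav2vec 2.0 architecture (seven unpadded 1-D convolutions with kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2, as described in the paper). The function name `num_frames` is introduced here for illustration only.

```python
# Kernel sizes and strides of the wav2vec 2.0 feature encoder
# (assumed from the paper's description of the base architecture).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Return the number of output frames N for a raw waveform length."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        # Output length of an unpadded 1-D convolution
        length = (length - kernel) // stride + 1
    return length


print(num_frames(160000))  # 10 s of audio at 16 kHz -> 499
print(num_frames(100000))  # 6.25 s of audio at 16 kHz -> 312
```

Under these assumptions each frame corresponds to a hop of 320 samples (20 ms at 16 kHz), so a value of N close to 500 corresponds to roughly ten seconds of audio.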

# Processing of the LibriSpeech dataset

Download a dataset of your choice [here](https://www.openslr.org/12). The dev-clean dataset contains approximately 5 hours of read English speech.

If using Google Colab, upload the dataset to your drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import soundfile as sf

path_to_dataset = '/content/drive/My Drive/dev-clean'

# Maps an utterance id (<speaker>-<chapter>-<utterance>) to [audio samples, transcript]
dic_flac_txt = {}

for subdir, dirs, files in os.walk(path_to_dataset):

    # Chapter directories hold several .flac files plus one transcript .txt file
    if len(files) > 1:
        filepath_txt = None

        for file in files:
            if file.endswith(".flac"):

                filepath_flac = os.path.join(subdir, file)
                data_flac, _ = sf.read(filepath_flac)
                dic_flac_txt[file[:-5]] = [data_flac, None]

            elif file.endswith(".txt"):
                filepath_txt = os.path.join(subdir, file)

        # Skip directories that contain no transcript file
        if filepath_txt is None:
            continue

        with open(filepath_txt, "r") as txt_file:
            content = txt_file.read().split("\n")

        for line in content:
            if len(line) > 0:
                # Each line has the form "<utterance id> <transcript>"
                key, txt = line.split(" ", 1)
                dic_flac_txt[key][1] = txt

        #print(len(dic_flac_txt))

In [None]:
import pickle

data_audio = []
data_txt = []

for key in dic_flac_txt:
    data_audio.append(dic_flac_txt[key][0])
    data_txt.append(dic_flac_txt[key][1])

save_filepath_data_audio = '/content/drive/My Drive/data_audio.pkl'
save_filepath_data_txt = '/content/drive/My Drive/data_txt.pkl'

with open(save_filepath_data_audio, 'wb') as f:
    pickle.dump(data_audio, f)

with open(save_filepath_data_txt, 'wb') as f:
    pickle.dump(data_txt, f)

In [None]:
import pickle

save_filepath_data_audio = '/content/drive/My Drive/data_audio.pkl'
save_filepath_data_txt = '/content/drive/My Drive/data_txt.pkl'

with open(save_filepath_data_audio, 'rb') as f:
    data_audio = pickle.load(f)

with open(save_filepath_data_txt, 'rb') as f:
    data_txt = pickle.load(f)

In [None]:
import pickle
import torch

data_audio_embedding = []

for i in range(len(data_audio)):
    #print(i)
    with torch.no_grad():
        input_values = tokenizer(data_audio[i], return_tensors="pt").input_values
        hidden_states = model(input_values).last_hidden_state
        data_audio_embedding.append(hidden_states)

save_filepath_data_audio_embedding = '/content/drive/My Drive/data_audio_embedding.pkl'

with open(save_filepath_data_audio_embedding, 'wb') as f:
    pickle.dump(data_audio_embedding, f)
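
Pickling every embedding into a single file gets large quickly. Here is a rough back-of-the-envelope estimate, assuming 16 kHz audio, one 768-dimensional float32 vector per 320 input samples (about 20 ms), and the approximate 5 hours quoted above for dev-clean. `embedding_bytes` is a hypothetical helper.

```python
SAMPLE_RATE = 16000   # LibriSpeech sampling rate in Hz
HOP = 320             # assumed number of input samples per output frame
EMBED_DIM = 768       # embedding size
BYTES_PER_FLOAT = 4   # float32


def embedding_bytes(duration_seconds):
    """Approximate storage needed for the embeddings of a given duration."""
    frames = duration_seconds * SAMPLE_RATE // HOP
    return frames * EMBED_DIM * BYTES_PER_FLOAT


size = embedding_bytes(5 * 3600)
print(f"{size / 1e9:.2f} GB")  # about 2.76 GB for 5 hours of audio
```

An estimate of this order is one reason to store each utterance in its own file and load batches lazily.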

In [None]:
import torch
import pickle

save_filepath_data_audio_embedding = '/content/drive/My Drive/data_audio_embedding.pkl'

with open(save_filepath_data_audio_embedding, 'rb') as f:
    data_audio_embedding = pickle.load(f)

"""
print("data_audio_embedding loaded")

max_len = 0
for e in data_audio_embedding_padding_right:
    length = e.shape[1]
    if length > max_len:
        max_len = length

for i in range(len(data_audio_embedding_padding_right)):
    length = data_audio_embedding_padding_right[i].shape[1]
    if length < max_len:
        data_audio_embedding_padding_right[i] = torch.cat((data_audio_embedding_padding_right[i], torch.zeros((1,max_len-length,768))), dim=1)
data_audio_embedding_padding_right = torch.cat(data_audio_embedding_padding_right, dim=0)

print("data_audio_embedding_padding_right done")

save_filepath_data_audio_embedding_padding_right = '/content/drive/My Drive/data_audio_embedding_padding_right.pkl'

with open(save_filepath_data_audio_embedding_padding_right, 'wb') as f:
    pickle.dump(data_audio_embedding_padding_right, f)

print("data_audio_embedding_padding_right saved")
"""

In [None]:
!mkdir '/content/drive/My Drive/data_libri_en'
!mkdir '/content/drive/My Drive/data_libri_en/audio'
!mkdir '/content/drive/My Drive/data_libri_en/labels'

In [None]:
for i in range(len(data_audio_embedding)):
    torch.save(data_audio_embedding[i], '/content/drive/My Drive/data_libri_en/audio/audio_'+str(i)+'.pt')

In [None]:
import pickle

save_filepath_data_txt = '/content/drive/My Drive/data_txt.pkl'

with open(save_filepath_data_txt, 'rb') as f:
    data_txt = pickle.load(f)

data_txt = data_txt[:2607] # There was a None at position 2607
txt = '\n'.join(data_txt)
with open('/content/drive/My Drive/data_txt.txt', "w") as text_file:
    text_file.write(txt)
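
The hard-coded cut at index 2607 is specific to this particular run. A more general sketch, assuming `data_audio` and `data_txt` are parallel lists as built above, drops every utterance whose transcript is missing while keeping the two lists aligned. `drop_missing` is a hypothetical helper and the toy values are placeholders.

```python
def drop_missing(audio_items, transcripts):
    """Keep only the (audio, transcript) pairs whose transcript is not None."""
    kept = [(a, t) for a, t in zip(audio_items, transcripts) if t is not None]
    kept_audio = [a for a, _ in kept]
    kept_txt = [t for _, t in kept]
    return kept_audio, kept_txt


# Toy example with placeholder strings standing in for waveforms
audio, txt = drop_missing(["wav0", "wav1", "wav2"], ["HELLO", None, "WORLD"])
print(audio)  # ['wav0', 'wav2']
print(txt)    # ['HELLO', 'WORLD']
```

If the audio list is filtered this way, the embeddings have to be filtered identically so that `audio_i.pt` and `label_i.pt` still refer to the same utterance.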


In [None]:
!sudo apt-get install festival espeak-ng mbrola

In [None]:
!pip install phonemizer

In [None]:
!phonemize -b espeak -l en-us -p '-' -w ' ' '/content/drive/My Drive/data_txt.txt' -o '/content/drive/My Drive/data_phones.txt'

In [None]:
filepath_txt = '/content/drive/My Drive/data_phones.txt'
with open(filepath_txt, "r") as txt_file:
    content = txt_file.read().split("\n")
del content[-1]  # drop the empty string left after the final newline
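
Each line of the phonemizer output can then be split into phones. This is a minimal sketch that assumes the separators passed on the command line above (`-` between phones, a space between words) and that a separator is also emitted after the last phone of each word. The sample line is made up for illustration and `split_phones` is a hypothetical helper.

```python
from collections import Counter


def split_phones(line):
    """Split one phonemized line into a flat list of phones."""
    # Removing spaces merges the words; empty strings come from
    # consecutive or trailing separators and are discarded.
    return [p for p in line.replace(" ", "").split("-") if p]


sample = "ð-ə- k-æ-t- "  # hypothetical output line
phones = split_phones(sample)
print(phones)  # ['ð', 'ə', 'k', 'æ', 't']

# Counting phones over a corpus gives the vocabulary with its frequencies
counts = Counter(phones)
print(len(counts))  # 5
```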

In [None]:
vocab = {}
for i in range(len(content)):
    cont = content[i][:-2].replace(" ", "")
    cont = cont.replace("--", "-")
    split = cont.split('-')
    for e in split:
        if e != '':
            if e in vocab:
                vocab[e] += 1
            else:
                vocab[e] = 1

print(len(vocab.keys()))
print(vocab.keys())
print(vocab)

60
dict_keys(['ð', 'iː', 'z', 'eɪ', 's', 'p', 'ɹ', 'ɛ', 'd', 'aʊ', 't', 'ɔ', 'n', 'ə', 'ɡ', 'j', 'uː', 'ɪ', 'ŋ', 'b', 'æ', 'k', 'oʊ', 'f', 'ɔːɹ', 'l', 'w', 'v', 'ɑː', 'ɐ', 'h', 'ɑːɹ', 'aɪ', 'ᵻ', 'oːɹ', 'i', 'm', 'ʌ', 'əl', 'ɚ', 'ʊɹ', 'ʊ', 'dʒ', 'ɜː', 'ɛɹ', 'ɾ', 'tʃ', 'ɔɪ', 'ɔː', 'ʃ', 'aɪɚ', 'oː', 'θ', 'ɪɹ', 'iə', 'aɪə', 'ʒ', 'ʔ', 'n̩', 'r'])
{'ð': 6365, 'iː': 4280, 'z': 5349, 'eɪ': 2877, 's': 8852, 'p': 3510, 'ɹ': 6296, 'ɛ': 5148, 'd': 9478, 'aʊ': 1301, 't': 11823, 'ɔ': 761, 'n': 13098, 'ə': 7086, 'ɡ': 1713, 'j': 1161, 'uː': 3327, 'ɪ': 11005, 'ŋ': 2304, 'b': 3341, 'æ': 6497, 'k': 4862, 'oʊ': 2460, 'f': 3510, 'ɔːɹ': 813, 'l': 6423, 'w': 4409, 'v': 3940, 'ɑː': 1616, 'ɐ': 2338, 'h': 4066, 'ɑːɹ': 673, 'aɪ': 3736, 'ᵻ': 1683, 'oːɹ': 462, 'i': 2378, 'm': 5555, 'ʌ': 5539, 'əl': 1167, 'ɚ': 3165, 'ʊɹ': 187, 'ʊ': 1006, 'dʒ': 814, 'ɜː': 1585, 'ɛɹ': 491, 'ɾ': 1261, 'tʃ': 1098, 'ɔɪ': 207, 'ɔː': 1069, 'ʃ': 1422, 'aɪɚ': 127, 'oː': 122, 'θ': 863, 'ɪɹ': 266, 'iə': 230, 'aɪə': 71, 'ʒ': 90, 'ʔ': 59, 'n̩':

In [None]:
map_ipa_idx = {}
ipa_vocab = list(vocab.keys())
for i in range(len(ipa_vocab)):
    map_ipa_idx[ipa_vocab[i]] = i+1 # the index 0 is for the blank
print(map_ipa_idx)

{'ð': 1, 'iː': 2, 'z': 3, 'eɪ': 4, 's': 5, 'p': 6, 'ɹ': 7, 'ɛ': 8, 'd': 9, 'aʊ': 10, 't': 11, 'ɔ': 12, 'n': 13, 'ə': 14, 'ɡ': 15, 'j': 16, 'uː': 17, 'ɪ': 18, 'ŋ': 19, 'b': 20, 'æ': 21, 'k': 22, 'oʊ': 23, 'f': 24, 'ɔːɹ': 25, 'l': 26, 'w': 27, 'v': 28, 'ɑː': 29, 'ɐ': 30, 'h': 31, 'ɑːɹ': 32, 'aɪ': 33, 'ᵻ': 34, 'oːɹ': 35, 'i': 36, 'm': 37, 'ʌ': 38, 'əl': 39, 'ɚ': 40, 'ʊɹ': 41, 'ʊ': 42, 'dʒ': 43, 'ɜː': 44, 'ɛɹ': 45, 'ɾ': 46, 'tʃ': 47, 'ɔɪ': 48, 'ɔː': 49, 'ʃ': 50, 'aɪɚ': 51, 'oː': 52, 'θ': 53, 'ɪɹ': 54, 'iə': 55, 'aɪə': 56, 'ʒ': 57, 'ʔ': 58, 'n̩': 59, 'r': 60}
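
With index 0 reserved for the CTC blank, a phone sequence can be mapped to indices and back. The sketch below uses a three-phone toy vocabulary instead of the full mapping printed above. `encode` and `decode` are hypothetical helpers.

```python
BLANK = 0  # index reserved for the CTC blank

toy_vocab = ["ð", "iː", "z"]
phone_to_idx = {p: i + 1 for i, p in enumerate(toy_vocab)}
idx_to_phone = {i: p for p, i in phone_to_idx.items()}


def encode(phones):
    """Map a list of phones to their integer indices."""
    return [phone_to_idx[p] for p in phones]


def decode(indices):
    """Map indices back to phones, skipping the blank."""
    return [idx_to_phone[i] for i in indices if i != BLANK]


ids = encode(["ð", "iː", "z"])
print(ids)          # [1, 2, 3]
print(decode(ids))  # ['ð', 'iː', 'z']
```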


In [None]:
import torch

for i in range(len(content)):
    cont = content[i][:-2].replace(" ", "")
    cont = cont.replace("--", "-")
    split = cont.split('-')
    val = []
    for e in split:
        if e != '':
            val.append(map_ipa_idx[e])
    tens = torch.tensor(val)
    torch.save(tens, '/content/drive/My Drive/data_libri_en/labels/label_'+str(i)+'.pt')


# Creation of the model for phone recognition

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

class CustomDataset:
    def __init__(self, dataset_path, len_dataset, train_size, batch_size):
        self.dataset_path = dataset_path
        self.len_dataset = len_dataset
        self.train_size = train_size
        self.batch_size = batch_size
        self.it_train = 0
        self.it_eval = 0

    def get_next_train_batch(self):
        list_tens_audio, list_tens_labels = None, None
        if (self.it_train + 1) * self.batch_size <= self.train_size:
            list_tens_audio = [torch.load(self.dataset_path + '/audio/audio_' + str(i) + '.pt') for i in range(self.it_train * self.batch_size, (self.it_train + 1) * self.batch_size)]
            list_tens_labels = [torch.load(self.dataset_path + '/labels/label_' + str(i) + '.pt') for i in range(self.it_train * self.batch_size, (self.it_train + 1) * self.batch_size)]
            self.it_train += 1
        else:
            list_tens_audio = [torch.load(self.dataset_path + '/audio/audio_' + str(i) + '.pt') for i in range(self.it_train * self.batch_size, self.train_size)]
            list_tens_labels = [torch.load(self.dataset_path + '/labels/label_' + str(i) + '.pt') for i in range(self.it_train * self.batch_size, self.train_size)]
            self.it_train = 0
        input_lengths = torch.tensor([e.shape[1] for e in list_tens_audio])
        target_lengths = torch.tensor([e.shape[0] for e in list_tens_labels])
        targets = torch.cat(list_tens_labels)
        max_len = torch.max(input_lengths)
        for i in range(len(list_tens_audio)):
            length = list_tens_audio[i].shape[1]
            if length < max_len:
                list_tens_audio[i] = torch.cat((list_tens_audio[i], torch.zeros((1,max_len-length,768))), dim=1)
        X = torch.cat(list_tens_audio, dim=0)
        return X, input_lengths, targets, target_lengths

    def get_next_eval_batch(self):
        list_tens_audio, list_tens_labels = None, None
        if self.train_size + (self.it_eval + 1) * self.batch_size <= self.len_dataset:
            list_tens_audio = [torch.load(self.dataset_path + '/audio/audio_' + str(i) + '.pt') for i in range(self.train_size + self.it_eval * self.batch_size, self.train_size + (self.it_eval + 1) * self.batch_size)]
            list_tens_labels = [torch.load(self.dataset_path + '/labels/label_' + str(i) + '.pt') for i in range(self.train_size + self.it_eval * self.batch_size, self.train_size + (self.it_eval + 1) * self.batch_size)]
            self.it_eval += 1
        else:
            list_tens_audio = [torch.load(self.dataset_path + '/audio/audio_' + str(i) + '.pt') for i in range(self.train_size + self.it_eval * self.batch_size, self.len_dataset)]
            list_tens_labels = [torch.load(self.dataset_path + '/labels/label_' + str(i) + '.pt') for i in range(self.train_size + self.it_eval * self.batch_size, self.len_dataset)]
            self.it_eval = 0
        input_lengths = torch.tensor([e.shape[1] for e in list_tens_audio])
        target_lengths = torch.tensor([e.shape[0] for e in list_tens_labels])
        targets = torch.cat(list_tens_labels)
        max_len = torch.max(input_lengths)
        for i in range(len(list_tens_audio)):
            length = list_tens_audio[i].shape[1]
            if length < max_len:
                list_tens_audio[i] = torch.cat((list_tens_audio[i], torch.zeros((1,max_len-length,768))), dim=1)
        X = torch.cat(list_tens_audio, dim=0)
        return X, input_lengths, targets, target_lengths
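
The index arithmetic in both methods amounts to cutting a range of utterance indices into consecutive slices, the last of which may be shorter. A stateless sketch of that logic, with the hypothetical helper `batch_ranges`, is shown here with the sizes used in this notebook (2607 utterances, 2200 of them for training, batch size 64).

```python
def batch_ranges(start, stop, batch_size):
    """Yield (first, last_exclusive) index pairs covering [start, stop)."""
    for first in range(start, stop, batch_size):
        yield first, min(first + batch_size, stop)


train_batches = list(batch_ranges(0, 2200, 64))
eval_batches = list(batch_ranges(2200, 2607, 64))

print(len(train_batches), train_batches[-1])  # 35 (2176, 2200)
print(len(eval_batches), eval_batches[-1])    # 7 (2584, 2607)
```

Because the generator never produces an empty slice, it also covers the case where the split size is an exact multiple of the batch size, which the counter-based methods above do not appear to handle (the `else` branch would then receive an empty range).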



In [None]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(768, 61)

    def forward(self, x):
        x_drop = self.dropout(x)
        fc = self.fc(x_drop)
        output = F.log_softmax(fc, dim=2)
        return output

In [None]:
def train(model, device, dataset, n_epochs, learning_rate):
    ctc_loss = nn.CTCLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    for ep in range(n_epochs):
        print("Epoch:", ep)
        model.train()
        for it in range(dataset.train_size // dataset.batch_size + 1*(dataset.train_size % dataset.batch_size > 0)):
            X, input_lengths, targets, target_lengths = dataset.get_next_train_batch()
            X, input_lengths, targets, target_lengths = X.to(device), input_lengths.to(device), targets.to(device), target_lengths.to(device)
            optimizer.zero_grad()
            X = model(X).permute(1,0,2)
            loss = ctc_loss(X, targets, input_lengths, target_lengths)
            loss.backward()
            optimizer.step()
            print("It:", it, "Train loss:", loss.item())
        model.eval()
        mean_loss_eval = []
        with torch.no_grad():
            for it in range((dataset.len_dataset - dataset.train_size) // dataset.batch_size + 1*((dataset.len_dataset - dataset.train_size) % dataset.batch_size > 0)):
                X, input_lengths, targets, target_lengths = dataset.get_next_eval_batch()
                X, input_lengths, targets, target_lengths = X.to(device), input_lengths.to(device), targets.to(device), target_lengths.to(device)
                X = model(X).permute(1,0,2)
                loss = ctc_loss(X, targets, input_lengths, target_lengths)
                mean_loss_eval.append(loss.item())
        print("Average eval loss:", sum(mean_loss_eval)/len(mean_loss_eval))
        print("")

In [None]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
model = Net().to(device)
dataset = CustomDataset('/content/drive/My Drive/data_libri_en', 2607, 2200, 64)
n_epochs = 1
learning_rate = 0.001

train(model, device, dataset, n_epochs, learning_rate)
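
Once the model is trained, its frame-level outputs still have to be turned into a phone sequence. A common choice is greedy (best-path) CTC decoding: take the most likely class for each frame, merge consecutive repeats, then drop blanks. The notebook does not include this step, so the following is only a sketch. It works on a plain list of per-frame class indices (for example obtained with an argmax over the last dimension of the model output) and assumes the blank is index 0.

```python
def ctc_greedy_collapse(frame_ids, blank=0):
    """Merge consecutive repeats, then remove blanks."""
    decoded = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank:
            decoded.append(idx)
        previous = idx
    return decoded


# The blank between the two runs of 1 keeps them as separate phones
print(ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```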

In [1]:
!pip install jiwer


# Pipeline for the planned experiments

**Priorities**

- Choose a wav2vec model (mainly its size and the language or languages it was trained on). The available models are listed [here](https://huggingface.co/models?filter=wav2vec2).

- Choose a target language for the phone/phoneme recognition task.

- Download the data in wav format for the target language, with the labels given as sentences.

- Preprocess the data, i.e. convert every wav file into a list of lists of real numbers, which is then passed to the pre-trained Wav2vec model to create the embeddings.

- Preprocess the labels, i.e. use a phonemizer (for example https://github.com/bootphon/phonemizer) to turn the sentences into lists of phones/phonemes.

- Create a data loader and pad the audio data and the labels so that batches can be formed, giving clean inputs and outputs for an ML model.

- Create a model, whether a linear model (dropout plus a linear layer shared across all timesteps), an LSTM, a transformer or something else, that takes the embeddings as input and predicts the output of the phonemizer. Train the model with the CTC loss (see https://huggingface.co/transformers/master/_modules/transformers/models/wav2vec2/modeling_wav2vec2.html#Wav2Vec2ForCTC for the loss).

- Evaluate the model and find a good metric.
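
One candidate metric is the phone error rate: the edit distance between the predicted and reference phone sequences, divided by the reference length, by analogy with the word error rate. The sketch below implements it in plain Python so that the definition is explicit. `edit_distance` and `phone_error_rate` are hypothetical helpers, not part of any library used in this notebook.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    # prev[j] is the distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """Edit distance normalised by the length of the reference."""
    return edit_distance(ref, hyp) / len(ref)


reference = ["ð", "ə", "k", "æ", "t"]
hypothesis = ["ð", "ə", "k", "ʌ"]
print(phone_error_rate(reference, hypothesis))  # 0.4
```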

**Next steps**

- Repeat the previous steps while changing the wav2vec model (possibly replacing it with various pre-trained CPC models), the target language (or keeping the same one with more or less data), and the final model, so as to answer as many of the proposal's questions as possible.

**If it does not work**

- Add an extra phone that encodes the "silence" between two words.

- Use a non-deprecated tokenizer (the deprecation message of Wav2Vec2Tokenizer points to Wav2Vec2Processor).

- Fix the warning shown at the start, "Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized", which is apparently not that serious.
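
Regarding the suggestion of a silence phone, word boundaries are already visible in the phonemizer output, because words are separated by a space. Here is a sketch of how a boundary symbol could be inserted between words, assuming the `-`/space separators used earlier. The `<sil>` symbol and the helper name are hypothetical.

```python
SILENCE = "<sil>"  # hypothetical symbol for the gap between two words


def phones_with_silence(line):
    """Split a phonemized line into phones, marking word boundaries."""
    phones = []
    for word in line.split():
        word_phones = [p for p in word.split("-") if p]
        if not word_phones:
            continue
        if phones:
            # Insert the boundary symbol before every word except the first
            phones.append(SILENCE)
        phones.extend(word_phones)
    return phones


print(phones_with_silence("ð-ə- k-æ-t- "))
# ['ð', 'ə', '<sil>', 'k', 'æ', 't']
```

The new symbol would need its own index in the phone mapping, and the output layer would grow from 61 to 62 classes.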

