### Task 3
3a) 
From Task 2d, use the common-voice mp3 files under cv-valid-train, together with cv-valid-train.csv, as the training dataset for fine-tuning.

Write a Python Jupyter notebook called cv-train-2a.ipynb for this task, using either TensorFlow or PyTorch.

Split the dataset in a 70-30 ratio, where 30% is held out for validation during training.

You are to list down your explanation for your chosen:
- preprocessing, 
- tokenizer, 
- feature extraction and 
- pipeline processes (including hyperparameters selected). 


You are also required to visualise the training and validation metrics and explain your interpretation of these visualisations.
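
One possible way to produce these plots, assuming training is run with the Hugging Face `Trainer` (an assumption; this notebook has not reached the training step yet), is to read `trainer.state.log_history`, which is a list of dicts keyed by `step`, with `loss` for training logs and `eval_*` keys for evaluation logs. The helper names `split_log_history` and `plot_curves` are hypothetical, and the example entries are placeholders that only illustrate the structure.

```python
def split_log_history(log_history, eval_key="eval_loss"):
    """Separate training-loss entries from evaluation entries, keyed by step."""
    train_points = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
    eval_points = [(e["step"], e[eval_key]) for e in log_history if eval_key in e]
    return train_points, eval_points


def plot_curves(train_points, eval_points, eval_label="eval_loss"):
    """Plot both curves on one axis. Requires matplotlib (imported lazily)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    if train_points:
        ax.plot(*zip(*train_points), label="train loss")
    if eval_points:
        ax.plot(*zip(*eval_points), marker="o", label=eval_label)
    ax.set_xlabel("step")
    ax.set_ylabel("value")
    ax.legend()
    return fig


# Placeholder entries that only illustrate the log structure; NOT real results.
example_log = [
    {"step": 10, "loss": 2.0, "learning_rate": 1e-4, "epoch": 0.1},
    {"step": 20, "loss": 1.5, "learning_rate": 9e-5, "epoch": 0.2},
    {"step": 20, "eval_loss": 1.7, "epoch": 0.2},
]
train_points, eval_points = split_log_history(example_log)
```

In the notebook this would be called as `plot_curves(*split_log_history(trainer.state.log_history))`. A validation curve that flattens or rises while the training loss keeps falling is the usual sign of overfitting.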

3b) 
Rename your fine-tuned AI model: wav2vec2-large-960h-cv.


3c) 
Within your Jupyter notebook, cv-train-2a.ipynb, in task 2d, use your fine-tuned AI model to transcribe the common-voice mp3 files under cv-valid-test and compare the generated text against cv-valid-test.csv.
Log your overall performance.


In [1]:
# Imports
import pandas as pd
import os 
from sklearn.model_selection import train_test_split

# Audio Loading
import librosa 

# Modelling
import torch 
from torch.utils.data import Dataset


In [2]:
# Import data
df = pd.read_csv("../data/common_voice/cv-valid-train.csv")

# Create filepath col to audiofiles 
df['file_path'] = df['filename'].apply(lambda x: os.path.join("../data/common_voice/cv-valid-train", x))

# Remove unnecessary columns + assume the 'text' col is the ground truth labels
df_subset = df[['file_path', 'text']]
df_subset

Unnamed: 0,file_path,text
0,../data/common_voice/cv-valid-train/cv-valid-t...,learn to recognize omens and follow them the o...
1,../data/common_voice/cv-valid-train/cv-valid-t...,everything in the universe evolved he said
2,../data/common_voice/cv-valid-train/cv-valid-t...,you came so that you could learn about your dr...
3,../data/common_voice/cv-valid-train/cv-valid-t...,so now i fear nothing because it was those ome...
4,../data/common_voice/cv-valid-train/cv-valid-t...,if you start your emails with greetings let me...
...,...,...
195771,../data/common_voice/cv-valid-train/cv-valid-t...,the englishman said nothing
195772,../data/common_voice/cv-valid-train/cv-valid-t...,the irish man sipped his tea
195773,../data/common_voice/cv-valid-train/cv-valid-t...,what do you know about that
195774,../data/common_voice/cv-valid-train/cv-valid-t...,the phone rang while she was awake


In [3]:
# Split the dataset
train_df, val_df = train_test_split(df_subset, test_size=0.3, random_state=42)
train_df.shape, val_df.shape

((137043, 2), (58733, 2))

In [4]:
sample = df_subset.loc[0]
sample

file_path    ../data/common_voice/cv-valid-train/cv-valid-t...
text         learn to recognize omens and follow them the o...
Name: 0, dtype: object

In [23]:
sample['file_path']

'../data/common_voice/cv-valid-train/cv-valid-train/sample-000000.mp3'

In [5]:
import torch 
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

librosa_sample_mp3_input, librosa_sample_mp3_sample_rate = librosa.load(sample['file_path'], sr=16000)

# pad input values and return pt tensor
input_values = processor(librosa_sample_mp3_input, sampling_rate=16000, return_tensors="pt", padding='longest').input_values
input_values

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor([[-0.0005, -0.0005, -0.0005,  ...,  0.0010,  0.0084,  0.0085]])

In [6]:
input_values.shape

torch.Size([1, 65664])
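
As a quick check on this shape, the sketch below converts the 65664 samples into a duration and into the number of frames the model emits logits for. It assumes the default wav2vec2 convolutional feature encoder (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2, as in the default `Wav2Vec2Config`); the function name `num_output_frames` is hypothetical.

```python
SAMPLE_RATE = 16_000

# Default wav2vec2 feature-encoder layout (assumed; see Wav2Vec2Config.conv_kernel / conv_stride)
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples):
    """Apply the unpadded 1-D convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


num_samples = 65664                        # from input_values.shape above
duration_s = num_samples / SAMPLE_RATE     # clip length in seconds
frames = num_output_frames(num_samples)    # time steps in the CTC logits

print(f"{duration_s:.3f} s of audio -> {frames} frames")
```

The total stride is 320 samples, so each frame covers roughly 20 ms of audio. This matters for CTC training because the label sequence must not be longer than the number of frames.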

In [9]:
input_values[0]

tensor([-0.0005, -0.0005, -0.0005,  ...,  0.0010,  0.0084,  0.0085])

In [8]:
from transformers import Wav2Vec2CTCTokenizer

# from_pretrained() needs a checkpoint name (or local path) as its first argument.
# The checkpoint's vocabulary already defines its own special tokens ("<unk>", "<pad>", "|"),
# so they are not overridden with "[UNK]" / "[PAD]" here.
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/wav2vec2-base-960h")

In [None]:
from transformers import SeamlessM4TFeatureExtractor

feature_extractor = SeamlessM4TFeatureExtractor(feature_size=80, num_mel_bins=80, sampling_rate=16000, padding_value=0.0)

In [None]:
from transformers import Wav2Vec2BertProcessor

processor = Wav2Vec2BertProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

In [None]:
from transformers import SeamlessM4TFeatureExtractor

feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")

In [None]:
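# SeamlessM4TFeatureExtractor belongs to the w2v-BERT 2.0 family, so pair it with Wav2Vec2BertProcessor
# (Wav2Vec2Processor expects a Wav2Vec2FeatureExtractor and also requires a tokenizer)
processor = Wav2Vec2BertProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# pad and return a pt tensor; this processor returns input_features rather than input_values
input_features = processor(librosa_sample_mp3_input, sampling_rate=16000, return_tensors="pt", padding='longest').input_features
input_features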

In [11]:
# Data Preprocessing, Tokenizing and Feature Extraction 

# Audio Loading
def load_audio(file_path, target_sr=16000):
    waveform, sr = librosa.load(file_path, sr=target_sr)
    return waveform 

# Tokenizer and Feature Extraction 
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")

In [15]:
# Creating PyTorch Dataset class 
class ASRDataset(Dataset):
    def __init__(self, dataframe, processor):
        self.data = dataframe
        self.processor = processor

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        audio = load_audio(row['file_path'])

        # feature extraction: normalise the waveform and return a (1, num_samples) pt tensor
        input_values = self.processor(audio, sampling_rate=16000, return_tensors="pt").input_values

        # sanity check only: transcribe with the pretrained `model` loaded in an earlier cell;
        # no_grad() avoids building the autograd graph during inference
        with torch.no_grad():
            logits = model(input_values).logits

        # take argmax and decode
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = self.processor.batch_decode(predicted_ids)[0]

        # NOTE: for fine-tuning, return tokenized ground-truth labels instead of predictions.
        # The ground truth is in the 'text' column (there is no 'transcription' column):
        # labels = self.processor.tokenizer(row['text'].upper()).input_ids
        return {"input_values": input_values, "transcription": transcription}

In [16]:
train_dataset = ASRDataset(train_df, processor)
val_dataset = ASRDataset(val_df, processor)

In [17]:
train_dataset[0]

{'input_values': tensor([[-0.0005, -0.0005, -0.0005,  ..., -0.0001,  0.0007,  0.0010]]),
 'transcription': 'AT THE FIRST GLANCE IT WAS DEALLY NOT VERY EXCITED'}
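
The prediction above is upper-case, while the `text` column in the CSV is lower-case. The wav2vec2 960h vocabulary contains only upper-case letters, the apostrophe and a word delimiter, so the ground truth needs to be normalised before it is tokenized into labels or compared with predictions. A minimal sketch of such a step follows; `normalize_transcript` is a hypothetical helper, not part of any library.

```python
import re

# Anything outside A-Z, apostrophe and space has no entry in the (assumed) 960h vocabulary
_DISALLOWED = re.compile(r"[^A-Z' ]+")


def normalize_transcript(text):
    """Upper-case, replace out-of-vocabulary characters with spaces, collapse whitespace."""
    text = _DISALLOWED.sub(" ", text.upper())
    return " ".join(text.split())


normalized = normalize_transcript("we'd better forget it")
print(normalized)
```

Applying the same function to both the reference and the predicted text keeps the later error-rate comparison from being dominated by casing or punctuation differences.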

In [None]:
"""
# Metric to assess model on:
- Word Error Rate (WER)
- Character Error Rate (CER)
"""
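
Both metrics are edit distances divided by the reference length: WER counts word-level substitutions, insertions and deletions, and CER does the same at character level. The sketch below implements them in plain Python to show the computation; a library such as `jiwer` could be used instead. The function names are hypothetical, and the hypothesis string is a made-up misrecognition of one reference sentence from the training CSV.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level word error rate: total word edits / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return errors / total


def cer(references, hypotheses):
    """Corpus-level character error rate: total character edits / total reference characters."""
    errors = sum(edit_distance(list(r), list(h)) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


refs = ["the boy preferred wine"]   # reference text from the training CSV
hyps = ["the boy prefered wine"]    # hypothetical model output, for illustration only
example_wer = wer(refs, hyps)
example_cer = cer(refs, hyps)
```

Here one of the four reference words is wrong, and one of the 22 reference characters is missing, which shows how CER penalises near-misses much less than WER does.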


In [14]:
train_df

Unnamed: 0,filename,text,up_votes,down_votes,age,gender,accent,duration,file_path
35234,cv-valid-train/sample-035234.mp3,at the first glance it was really not very exc...,3,1,,,,,../data/common_voice/cv-valid-train/cv-valid-t...
168647,cv-valid-train/sample-168647.mp3,dense clouds of smoke or dust can be seen thro...,4,1,thirties,female,us,,../data/common_voice/cv-valid-train/cv-valid-t...
163199,cv-valid-train/sample-163199.mp3,the boy preferred wine,1,0,,,,,../data/common_voice/cv-valid-train/cv-valid-t...
27555,cv-valid-train/sample-027555.mp3,the men climbed the hill and they were tired w...,3,0,,,,,../data/common_voice/cv-valid-train/cv-valid-t...
34896,cv-valid-train/sample-034896.mp3,the night was warm and i was thirsty,5,0,fifties,female,australia,,../data/common_voice/cv-valid-train/cv-valid-t...
...,...,...,...,...,...,...,...,...,...
119879,cv-valid-train/sample-119879.mp3,the boy observed in silence the progress of th...,3,0,fourties,male,newzealand,,../data/common_voice/cv-valid-train/cv-valid-t...
103694,cv-valid-train/sample-103694.mp3,we'd better forget it,1,0,twenties,male,us,,../data/common_voice/cv-valid-train/cv-valid-t...
131932,cv-valid-train/sample-131932.mp3,air was either entering or escaping at the rim...,4,0,teens,male,australia,,../data/common_voice/cv-valid-train/cv-valid-t...
146867,cv-valid-train/sample-146867.mp3,as soon as he saw me among the crowd he called...,2,1,,,,,../data/common_voice/cv-valid-train/cv-valid-t...


In [10]:
df.loc[0]['file_path']


'../data/common_voice/cv-valid-train/cv-valid-train/sample-000000.mp3'

In [9]:
df.loc[0]['file_path']

import librosa 

librosa.load(df.loc[0]['file_path'], sr=16000)

(array([-4.3655746e-11,  9.0949470e-12,  4.0017767e-11, ...,
         1.2503879e-04,  7.3011382e-04,  7.3690247e-04], dtype=float32),
 16000)

In [2]:
eg = pd.read_csv("../data/common_voice/cv-valid-train.csv")
eg

Unnamed: 0,filename,text,up_votes,down_votes,age,gender,accent,duration
0,cv-valid-train/sample-000000.mp3,learn to recognize omens and follow them the o...,1,0,,,,
1,cv-valid-train/sample-000001.mp3,everything in the universe evolved he said,1,0,,,,
2,cv-valid-train/sample-000002.mp3,you came so that you could learn about your dr...,1,0,,,,
3,cv-valid-train/sample-000003.mp3,so now i fear nothing because it was those ome...,1,0,,,,
4,cv-valid-train/sample-000004.mp3,if you start your emails with greetings let me...,3,2,,,,
...,...,...,...,...,...,...,...,...
195771,cv-valid-train/sample-195771.mp3,the englishman said nothing,1,0,thirties,male,england,
195772,cv-valid-train/sample-195772.mp3,the irish man sipped his tea,1,0,,,,
195773,cv-valid-train/sample-195773.mp3,what do you know about that,1,0,,,,
195774,cv-valid-train/sample-195774.mp3,the phone rang while she was awake,2,0,twenties,male,us,
