# <b><span style="color:#ADF88B">10 x 3 sec audio files to 30 sec audio file </span> </b>

This script concatenates 10x3sec audio files into one 30sec audio file. 
The script uses the <span style="color:#ADF88B"><b> pydub library</b> </span> to manipulate the audio files. 


## <span style="color:#58F2B3"> <b>Method: </b> Concatenated Clips  </span>

An ID column containing the assessment information was added to the data frame. 


It was then sorted and grouped by the ID column. The audio clips in each group were then concatenated in batches of ten 3-second clips.


All the concatenated audio clips were saved to new directories for easy access.
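
The batching step can be sketched as follows. This is a minimal illustration, assuming that only complete groups of ten rows are used and that leftover rows are dropped; `batch_bounds` is a hypothetical helper, not a function from the notebook.

```python
def batch_bounds(n_rows, batch_size=10):
    """Yield (start, end) index pairs for every complete batch of rows."""
    last_full_start = n_rows - batch_size
    # The step equals the batch size, so batches never overlap.
    for start in range(0, last_full_start + 1, batch_size):
        yield start, start + batch_size


for start, end in batch_bounds(35):
    print(f"rows {start}..{end - 1} go into one concatenated clip")
```

For a group of 35 rows this yields three batches and leaves the last five rows unused.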



Each audio clip is 3 seconds long, while Whisper processes audio in 30-second windows and pads shorter input. Therefore, Whisper's responses were tested using 30-second audio clips, created by concatenating ten 3-second clips. The words in a concatenated clip are unrelated to each other, but the test might still reveal useful information about Whisper's performance.
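
The notebook relies on pydub for the actual concatenation. As a dependency-free illustration of the same idea, the sketch below uses Python's standard `wave` module instead: it builds ten silent 3-second clips in memory and joins them into one 30-second WAV. The 16 kHz mono format and all helper names are assumptions made for the example.

```python
import io
import wave

SAMPLE_RATE = 16_000  # assumed sample rate for the example
CLIP_SECONDS = 3


def make_silent_wav(seconds, sample_rate=SAMPLE_RATE):
    """Return the bytes of a silent mono 16-bit WAV clip."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * (seconds * sample_rate))
    return buffer.getvalue()


def concat_wavs(clips):
    """Append the frames of every clip, in order, into one WAV."""
    if not clips:
        raise ValueError("need at least one clip")
    output = io.BytesIO()
    audio_format = None
    with wave.open(output, "wb") as out:
        for clip in clips:
            with wave.open(io.BytesIO(clip), "rb") as wav:
                this_format = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
                if audio_format is None:
                    audio_format = this_format
                    out.setnchannels(this_format[0])
                    out.setsampwidth(this_format[1])
                    out.setframerate(this_format[2])
                elif this_format != audio_format:
                    raise ValueError("clips must share the same audio format")
                out.writeframes(wav.readframes(wav.getnframes()))
    return output.getvalue()


def duration_seconds(wav_bytes):
    """Length of a WAV clip in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


clips = [make_silent_wav(CLIP_SECONDS) for _ in range(10)]
combined = concat_wavs(clips)
print(duration_seconds(combined))
```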

The Verbatim Tiny and Medium models from the Norwegian National Library (NNL) were used to compare performance between two models that differ greatly in size.

Multiple tests were conducted to evaluate Whisper's response in this scenario:

1. <span style="color:#00E6DB"> Grouping by score: Each score was grouped before transcription. Sorted by speaker ID.</span>
2. <span style="color:#00E6DB"> Grouping by score: Each score was grouped before transcription. Randomized speakers.</span>
3. <span style="color:#00D5F7"> Grouping by person: Each person's independent words were grouped, and sorted by score.</span>
4. <span style="color:#00D5F7"> Random shuffling of data before transcription.</span>
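
The four orderings can be illustrated on a handful of toy rows. This is only a sketch in plain Python; the notebook itself does the same with pandas `sort_values`, `sample` and `groupby`. The `a02` rows, the helper names and the fixed seed are made up for the example.

```python
import random
from itertools import groupby

# Toy rows; the a02 entries are invented for illustration.
records = [
    {"file": "a02_barn.wav", "id": "a02", "score": 3},
    {"file": "a01_artist.wav", "id": "a01", "score": 4},
    {"file": "a01_barn.wav", "id": "a01", "score": 3},
    {"file": "a02_bart.wav", "id": "a02", "score": 4},
]


def group_rows(rows, key):
    """Group rows by `key`, keeping the incoming order inside each group."""
    ordered = sorted(rows, key=lambda row: row[key])  # sorted() is stable
    return {k: list(g) for k, g in groupby(ordered, key=lambda row: row[key])}


def shuffled(rows, seed=0):
    """Return a shuffled copy; the seed only makes the example repeatable."""
    copy = list(rows)
    random.Random(seed).shuffle(copy)
    return copy


by_score_sorted = group_rows(sorted(records, key=lambda row: row["id"]), "score")  # test 1
by_score_mixed = group_rows(shuffled(records), "score")                            # test 2
by_speaker = group_rows(sorted(records, key=lambda row: row["score"]), "id")       # test 3
no_grouping = shuffled(records)                                                    # test 4
```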


The results were analyzed using <span style="color:#F9F871">**Word Error Rate (WER)**</span> and Character Error Rate (CER) scores to assess the performance of the models.
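
The notebook computes these scores with `jiwer`. To make the definitions concrete, here is a small standalone sketch based on edit distance: WER is the number of word-level edits divided by the number of reference words, and CER does the same at character level. It illustrates the standard definitions only; it is not jiwer's implementation and it applies no text normalisation.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character edits divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("artist barn bart", "artist bar bart"))
print(cer("barn", "bar"))
```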

When looking for outliers, this could have proved useful, for example by checking whether a transcription was much longer than expected or contained question marks.
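
Such an outlier check could look like the sketch below. The function name and the threshold of three extra words are assumptions chosen for illustration, not values used in the notebook.

```python
def flag_outliers(expected_words, transcription, max_extra_words=3):
    """Return the reasons, if any, why a transcription looks like an outlier."""
    reasons = []
    extra_words = len(transcription.split()) - expected_words
    if extra_words > max_extra_words:
        reasons.append("too_long")
    if "?" in transcription:
        reasons.append("question_mark")
    return reasons


print(flag_outliers(10, "hva er dette?"))
print(flag_outliers(2, "a b c d e f"))
```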


### <span style="color:#ADF88B"> <b>Library imports </b> </span>


In [23]:
import sys      
script_directory = '../'
sys.path.append(script_directory)
import self_made_functions as smf

from prettytable import PrettyTable
from transformers import pipeline
import matplotlib.pyplot as plt
from pydub import AudioSegment # Good for audio manipulation
import pandas as pd 
import numpy as np
import whisper
import jiwer
import os

### <span style="color:#58F2B3"> <b> Concatenation information functions </b> </span>


Save and extract relevant information from the audio files.

The CER and WER were not calculated correctly for the first batch of results.

In [24]:
# --------- Part 1 - Generate the audio files ----------- #
def concat_audio(df:pd.DataFrame, save_directory:str, s:int=0, e:int=10, wv_path:str = '/talebase/data/speech_raw/teflon_no/speech16khz/', name = ''):
    concat_rows = df.iloc[s:e]  # Rows to concatenate
    input_lst = list(concat_rows.Word.values) # get input string
    
    combined_audio = AudioSegment.empty() # Empty audio file to concatenate the audio clips
    id_list = [] # List of the speaker IDs
    
    target_wer_count = 0
    pron_lst = [] # list of pron scores to calculate CER
    
    for row in concat_rows.itertuples():
        id_list.append(row.id.split('_')[0])
        
        audio = AudioSegment.from_file(os.path.join(wv_path, row[1]))
        combined_audio += audio
        cer_score = (row.pronScores.count('1')/len(row.pronScores.split(' ')))
        
        pron_lst.append(row.pronScores)
        if cer_score != 0:
            target_wer_count += 1   
    
    target_wer = target_wer_count/len(input_lst) # Correct Word Error Rate (WER)
    
    pron_tokens = ' '.join(pron_lst).split() # All the 0s and 1s as separate tokens
    target_cer = pron_tokens.count('1')/len(pron_tokens) # Target Character Error Rate (CER): share of '1' tokens
    
    audio_name = name + f'words_{s}_{e}.wav'
    # print('in function', audio_name)

    if not os.path.exists(save_directory):
        os.makedirs(save_directory)
        print(f"Directory {save_directory} created")

    # Large files! Do not store them directly in the git repo; keep them on the server
    audio_directory = os.path.join(save_directory, audio_name)
    # Save the combined audio
    if not os.path.exists(audio_directory):
        combined_audio.export(audio_directory, format="wav")   
    
    return  audio_name, id_list, input_lst, target_cer, target_wer

# Create dir if not exist
def new_dir(dir_name): 
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
        print(f"Directory {dir_name} created")
    else:
        print(f"Directory {dir_name} already exists")

# -------------- Part 2 - Look at the result -------------- #
def get_transcribed_words(string_words:str):
    # Return a list with lower case words and no dots
    split_string = string_words.split(' ')
    remove_empty = [word for word in split_string if word != '']
    lower = [word.lower() for word in remove_empty] 
    no_dot = [word.replace('.', '') for word in lower]
    return no_dot 
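
As a usage illustration, the same normalisation can be written more compactly. This is a sketch rather than a drop-in replacement: `normalise_words` is a hypothetical name, it splits on any whitespace instead of single spaces, and it also drops tokens that become empty once dots are removed.

```python
def normalise_words(text):
    """Lower-case every word, strip dots, and drop empty tokens."""
    cleaned = (word.lower().replace(".", "") for word in text.split())
    return [word for word in cleaned if word]


print(normalise_words(" Artist.  Barn bart. "))
```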

In [25]:
df_fin, wv_path = smf.get_correct_df()
# Speaker ID column added
df_trail = df_fin.copy()  # Copy so that the new column is not added to df_fin as well
df_trail['id'] = df_fin['File name'].apply(lambda x: x.split('_')[0])

In [26]:
df_trail.head()
# cer = (df_trail.pronScores[0].count('1')/len(df_trail.pronScores[0].split(' ')))

| Unnamed: 0 | File name | Score | Prosody | Noise/Disruption | Pre-speech noise | Repetition | Word | Pronunciation | pronScores | Assessor | id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a01_artist.wav | 4 | | | | | artist | A t` I s t | 1 0 1 1 1 | Anne Marte | a01 |
| 1 | a01_barn.wav | 3 | 1.0 | | | | barn | b A: n` | 1 0 1 | Anne Marte | a01 |
| 2 | a01_bart.wav | 4 | | | | | bart | b A t` | 1 1 0 | Anne Marte | a01 |
| 3 | a01_bjoern.wav | 3 | | | | | bjørn | b j 2: n` | 1 1 0 0 | Anne Marte | a01 |
| 4 | a01_blaa.wav | 4 | | | | | blå | b l o: | 1 1 0 | Anne Marte | a01 |
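
The target CER and WER are derived from the `pronScores` column. The sketch below shows one way to compute them for the five rows above, following the convention in `concat_audio` where the `1` tokens are counted: the target CER is the share of `1` tokens over all tokens, and the target WER is the share of words with at least one `1`. What a `1` means in the annotation is taken from that code, not verified here, and `target_rates` is a hypothetical helper.

```python
def target_rates(pron_scores):
    """Return (target_cer, target_wer) for a list of pronScores strings."""
    flagged_tokens = 0
    total_tokens = 0
    flagged_words = 0
    for scores in pron_scores:
        tokens = scores.split()  # split first, so spaces are never counted
        ones = tokens.count("1")
        flagged_tokens += ones
        total_tokens += len(tokens)
        if ones:
            flagged_words += 1
    return flagged_tokens / total_tokens, flagged_words / len(pron_scores)


head_scores = ["1 0 1 1 1", "1 0 1", "1 1 0", "1 1 0 0", "1 1 0"]
target_cer, target_wer = target_rates(head_scores)
print(target_cer, target_wer)
```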


### <span style="color:#00E6DB"> <b>1. Grouping by score:</b> Each score was grouped before transcription. Sorted by speaker</span>

In [27]:
def run_for_score_sort(model_name):
        # Group by score, sort by speaker ID
        df_trail_1 = df_trail.sort_values(by=['id'], ascending=False) # Sorted by ID
        df_trail_1 = df_trail_1.reset_index(drop=True)
        df_score_groups = df_trail_1.groupby('Score')# Grouped by score

        concatenated_audio_information = pd.DataFrame(columns=[
                'score', 'input_string', 'translated_string',
                'translated_CER', 'translated_WER', 'target_CER',  'target_WER',
                "unique_id's", "speaker_id's",
                'length_deviation_words',
                'audio_name', 'audio_path'])

        ground_folder = '../../3x10_Concatenations' # Save outside of git repo

        # model_name = 'nb-whisper-tiny-verbatim'
        model_path = smf.get_whisper_path(model_name)
        model = pipeline("automatic-speech-recognition", model_path)

        for idx, score_group in df_score_groups:
                # if idx > 1:
                #         print('out at', idx)
                #         break
                print(idx)
        # -------------- Make or Find directory -------------- #
                score_directory = f'{ground_folder}/3x10_score_{idx}_sorted'
        # --------- Concatenate and save audio files --------- #
                for i in range(0, len(score_group) - 9, 10): # One step per complete group of 10 rows
                        # Concatenate the 10 audio files together
                        s, e = i, i + 10 # get start and end index
                        audio_name, ids, input_words_list, target_cer, target_wer = concat_audio(score_group, score_directory, s, e)
                        unique_ids = list(set(ids))

        # ---------- Transcribe the audio file ---------- #
                        load_audio = whisper.load_audio(os.path.join(score_directory,audio_name))
                        translation = model(load_audio, generate_kwargs={'task': 'transcribe', 'language': 'no'}) 
                        
                        # Calculate deviation in number of words
                        translation_list = get_transcribed_words(translation['text'])
                        length_deviation = len(translation_list) - len(input_words_list)
                        
                        # Calculate CER and WER with jiwer
                        input_str = ' '.join(input_words_list)
                        translation_str = ' '.join(translation_list)
                        # print(input_str)
                        # print(translation['text'])
                        # print(translation_str)
                        trans_cer = jiwer.cer(input_str, translation_str)
                        trans_wer = jiwer.wer(input_str, translation_str)
                        
                        if len(input_words_list) == len(translation_list):
                                trans_cer = []
                                for i in range(len(input_words_list)):
                                        cer = jiwer.cer(input_words_list[i], translation_list[i])
                                        trans_cer.append(cer)
                                        # print(input_words_list[i], translation_list[i], cer) 
                                        # wer = jiwer.wer(input_words_list[i], translation_list[i])
                                        # trans_wer.append(wer)
                                # print(input_str)
                                # print(translation['text'])
                                # print('CER:', trans_cer)
                                # print('target CER', target_cer)
                        
        # --------- Store information in data frame --------- #
                        # add to dataframe
                        new_row = {'audio_path': [score_directory],
                                'audio_name': [audio_name],
                                
                                'score': [idx],
                                'unique_id\'s': [unique_ids],
                                'speaker_id\'s': [ids],
                                
                                'target_CER' : [target_cer],
                                'target_WER' : [target_wer],#[score_group.iloc[s,e]['Score'].values[0]],
                                'input_string': [input_str],
                                
                                'translated_CER':[trans_cer], 
                                'translated_WER':[trans_wer], 
                                'translated_string': [translation_str],# [translation],
                                
                                'length_deviation_words' : [length_deviation]
                        }
                        new_row = pd.DataFrame(new_row, index=[0])
                        # print(new_row.columns.values)
                        concatenated_audio_information = pd.concat([concatenated_audio_information, new_row], ignore_index=True)
                        
                
        # ----- Save the concatenated audio information ----- #
        csv_folder = '../3x10_Concatenation_information'
        base_name = f'{model_name}__concatenated_audio_information_scores_id_sorted'
        new_dir(csv_folder)
        new_csv_name, _  = smf.get_new_csv_name(csv_folder, base_name)
        print(new_csv_name)
        concatenated_audio_information.to_csv(new_csv_name, index=False)

### <span style="color:#00E6DB"> <b>2. Grouping by score:</b> Each score was grouped before transcription. Randomized speakers </span>

In [28]:
def run_for_score_mixed(model_name):
        # Shuffle the rows, then group by score
        df_trail_2 = df_trail.sample(frac=1).reset_index(drop=True)
        df_score_groups = df_trail_2.groupby('Score')# Grouped by score
        
        concatenated_audio_information = pd.DataFrame(columns=[
                'score', 'input_string', 'translated_string',
                'translated_CER', 'translated_WER', 'target_CER',  'target_WER',
                "unique_id's", "speaker_id's",
                'length_deviation_words',
                'audio_name', 'audio_path'])

        ground_folder = '../../3x10_Concatenations' # Save outside of git repo

        # model_name = 'nb-whisper-tiny-verbatim'
        model_path = smf.get_whisper_path(model_name)
        model = pipeline("automatic-speech-recognition", model_path)

        for idx, score_group in df_score_groups:
                # if idx > 1:
                #         print('out at', idx)
                #         break
                print(idx)
        # -------------- Make or Find directory -------------- #
                score_directory = f'{ground_folder}/3x10_score_{idx}_mixed'
        # --------- Concatenate and save audio files --------- #
                for i in range(0, len(score_group) - 9, 10): # One step per complete group of 10 rows
                        # Concatenate the 10 audio files together
                        s, e = i, i + 10 # get start and end index
                        audio_name, ids, input_words_list, target_cer, target_wer = concat_audio(score_group, score_directory, s, e)
                        unique_ids = list(set(ids))
                        
        # ---------- Transcribe the audio file ---------- #
                        load_audio = whisper.load_audio(os.path.join(score_directory,audio_name))
                        translation = model(load_audio, generate_kwargs={'task': 'transcribe', 'language': 'no'}) 
                        
                        # Calculate deviation in number of words
                        translation_list = get_transcribed_words(translation['text'])
                        length_deviation = len(translation_list) - len(input_words_list)
                        
                        # Calculate CER and WER with jiwer
                        input_str = ' '.join(input_words_list)
                        translation_str = ' '.join(translation_list)
                        # print(input_str)
                        # print(translation['text'])
                        # print(translation_str)
                        trans_cer = jiwer.cer(input_str, translation_str)
                        trans_wer = jiwer.wer(input_str, translation_str)
                        
                        if len(input_words_list) == len(translation_list):
                                trans_cer = []
                                for i in range(len(input_words_list)):
                                        cer = jiwer.cer(input_words_list[i], translation_list[i])
                                        trans_cer.append(cer)
                                        # print(input_words_list[i], translation_list[i], cer) 
                                        # wer = jiwer.wer(input_words_list[i], translation_list[i])
                                        # trans_wer.append(wer)
                                # print(input_str)
                                # print(translation['text'])
                                # print('CER:', trans_cer)
                                # print('target CER', target_cer)
                        
        # --------- Store information in data frame --------- #
                        # add to dataframe
                        new_row = {'audio_path': [score_directory],
                                'audio_name': [audio_name],
                                
                                'score': [idx],
                                'unique_id\'s': [unique_ids],
                                'speaker_id\'s': [ids],
                                'target_CER' : [target_cer],
                                #[score_group.iloc[s,e]['Score'].values[0]],
                                'target_WER' : [target_wer],#[score_group.iloc[s,e]['Score'].values[0]],
                                'input_string': [input_str],
                                
                                'translated_CER' :[trans_cer], 
                                'translated_WER':[trans_wer], 
                                'translated_string': [translation_str],# [translation],
                                
                                'length_deviation_words' : [length_deviation]
                        }
                        new_row = pd.DataFrame(new_row, index=[0])
                        # print(new_row.columns.values)
                        concatenated_audio_information = pd.concat([concatenated_audio_information, new_row], ignore_index=True)
                        
                
        # ----- Save the concatenated audio information ----- #
        csv_folder = '../3x10_Concatenation_information'
        base_name = f'{model_name}__concatenated_audio_information_scores_id_mixed'
        new_dir(csv_folder)
        new_csv_name, _  = smf.get_new_csv_name(csv_folder, base_name)
        print(new_csv_name)
        concatenated_audio_information.to_csv(new_csv_name, index=False)

### <span style="color:#00D5F7"><b> 3. Grouping by person:</b> Each person's independent words were grouped, and sorted by score</span>


In [29]:
def group_by_id(model_name): 
        df_name_groups = df_trail.groupby('id')  # Grouped by speaker ID

        concatenated_audio_information = pd.DataFrame(columns=[
        "speaker_id", 'score', 'input_string', 'translated_string',
        'translated_CER', 'translated_WER', 'target_CER',  'target_WER',

        'length_deviation_words',
        'audio_name', 'audio_path'])

        ground_folder = '../../3x10_Concatenations/3x10_id_sorted' # Save outside of git repo
        
        model_path = smf.get_whisper_path(model_name)
        model = pipeline("automatic-speech-recognition", model_path)
        for idx, person in df_name_groups:
                # print(person.iloc[0]['id'])
                # if person.iloc[0]['id'] == 'a06':
                #         break
        # --------- Concatenate and save audio files --------- #
                for i in range(0, len(person) - 9, 10): # One step per complete group of 10 rows
                        # Concatenate the 10 audio files together
                        s, e = i, i + 10 # get start and end index
                        speaker_id = person.iloc[0]['id']
                        audio_name, _, input_words_list, target_cer, target_wer = concat_audio(person, ground_folder, s, e, name=f'{speaker_id}_')
                        # print(audio_name)
                        
        # ---------- Transcribe the audio file ---------- #
                        load_audio = whisper.load_audio(os.path.join(ground_folder,audio_name))
                        translation = model(load_audio, generate_kwargs={'task': 'transcribe', 'language': 'no'}) 
                        
                        # Calculate deviation in number of words
                        translation_list = get_transcribed_words(translation['text'])
                        length_deviation = len(translation_list) - len(input_words_list)
                        
                        # Calculate CER and WER with jiwer
                        input_str = ' '.join(input_words_list)
                        translation_str = ' '.join(translation_list)
                        # print(input_str)
                        # print(translation['text'])
                        # print(translation_str)
                        trans_cer = jiwer.cer(input_str, translation_str)
                        trans_wer = jiwer.wer(input_str, translation_str)
                        
                        if len(input_words_list) == len(translation_list):
                                trans_cer = []
                                for i in range(len(input_words_list)):
                                        cer = jiwer.cer(input_words_list[i], translation_list[i])
                                        trans_cer.append(cer)
                        
        # --------- Store information in data frame --------- #
                        # add to dataframe
                        scores = person.iloc[s:e]['Score'].values
                        
                        new_row = {'audio_path': [ground_folder],
                                'audio_name': [audio_name],
                                
                                'score': [scores],
                                'speaker_id': [speaker_id],
                                'target_CER' : [target_cer],
                                #[score_group.iloc[s,e]['Score'].values[0]],
                                'target_WER' : [target_wer],#[score_group.iloc[s,e]['Score'].values[0]],
                                'input_string': [input_str],
                                
                                'translated_CER' :[trans_cer], 
                                'translated_WER':[trans_wer], 
                                'translated_string': [translation_str],# [translation],
                                
                                'length_deviation_words' : [length_deviation]
                        }
                        new_row = pd.DataFrame(new_row, index=[0])
                        # print(new_row.columns.values)
                        # print(new_row)
                        concatenated_audio_information = pd.concat([concatenated_audio_information, new_row], ignore_index=True)
                        
                
        # ----- Save the concatenated audio information ----- #
        csv_folder = '../3x10_Concatenation_information'
        base_name = f'{model_name}__concatenated_audio_information_group_id_sorted_score'
        new_dir(csv_folder)
        new_csv_name, _  = smf.get_new_csv_name(csv_folder, base_name)
        print(new_csv_name)
        concatenated_audio_information.to_csv(new_csv_name, index=False)

### <span style="color:#00D5F7"><b> 4. Random shuffling of data before transcription.</b></span>

Persons not sorted, and results not grouped by score.

In [30]:
def random_shuffel_concat(model_name):
        # Shuffle all rows; no grouping or sorting
        df_no_group = df_trail.sample(frac=1).reset_index(drop=True)

        concatenated_audio_information = pd.DataFrame(columns=[
        'score', 'input_string', 'translated_string',
        'translated_CER', 'translated_WER', 'target_CER',  'target_WER',
        "unique_id's", "speaker_id's",
        'length_deviation_words',
        'audio_name', 'audio_path'])

        ground_folder = '../../3x10_Concatenations/3x10_no_group_mixed' # Save outside of git repo
        
        # model_name = 'nb-whisper-tiny-verbatim'
        model_path = smf.get_whisper_path(model_name)
        model = pipeline("automatic-speech-recognition", model_path)

# --------- Concatenate and save audio files --------- #
        for i in range(0, len(df_no_group) - 9, 10): # One step per complete group of 10 rows
                # Concatenate the 10 audio files together
                s, e = i, i + 10 # get start and end index
                audio_name, ids, input_words_list, target_cer, target_wer = concat_audio(df_no_group, ground_folder, s, e)
                unique_ids = list(set(ids))
                
# ---------- Transcribe the audio file ---------- #
                load_audio = whisper.load_audio(os.path.join(ground_folder,audio_name))
                translation = model(load_audio, generate_kwargs={'task': 'transcribe', 'language': 'no'}) 
                
                # Calculate deviation in number of words
                translation_list = get_transcribed_words(translation['text'])
                length_deviation = len(translation_list) - len(input_words_list)
                
                # Calculate CER and WER with jiwer
                input_str = ' '.join(input_words_list)
                translation_str = ' '.join(translation_list)
                # print(input_str)
                # print(translation['text'])
                # print(translation_str)
                trans_cer = jiwer.cer(input_str, translation_str)
                trans_wer = jiwer.wer(input_str, translation_str)
                
                if len(input_words_list) == len(translation_list):
                        trans_cer = []
                        for i in range(len(input_words_list)):
                                cer = jiwer.cer(input_words_list[i], translation_list[i])
                                trans_cer.append(cer)
                
# --------- Store information in data frame --------- #
                # add to dataframe
                scores = df_no_group.iloc[s:e]['Score'].values
                
                new_row = {'audio_path': [ground_folder],
                        'audio_name': [audio_name],
                        
                        'score': [scores],
                        'unique_id\'s': [unique_ids],
                        'speaker_id\'s': [ids],
                        'target_CER' : [target_cer],
                        #[score_group.iloc[s,e]['Score'].values[0]],
                        'target_WER' : [target_wer],#[score_group.iloc[s,e]['Score'].values[0]],
                        'input_string': [input_str],
                        
                        'translated_CER' :[trans_cer], 
                        'translated_WER':[trans_wer], 
                        'translated_string': [translation_str],# [translation],
                        
                        'length_deviation_words' : [length_deviation]
                }
                new_row = pd.DataFrame(new_row, index=[0])
                # print(new_row.columns.values)
                concatenated_audio_information = pd.concat([concatenated_audio_information, new_row], ignore_index=True)
                
        # ----- Save the concatenated audio information ----- #
        csv_folder = '../3x10_Concatenation_information'
        base_name = f'{model_name}__concatenated_audio_information_no_group_mixed'
        new_dir(csv_folder)
        new_csv_name, _  = smf.get_new_csv_name(csv_folder, base_name)
        concatenated_audio_information.to_csv(new_csv_name, index=False)

## <span style="color:#58F2B3">  **Run** every test for every model </span>

In [31]:
# Run all four tests for every model
test_for_models = ['tiny', 'nb-whisper-tiny', 'nb-whisper-tiny-verbatim',
                    'base', 'nb-whisper-base', 'nb-whisper-base-verbatim',
                    'medium', 'nb-whisper-medium', 'nb-whisper-medium-verbatim']
for model_name in test_for_models:
    print(model_name)
    
    run_for_score_sort(model_name) 
    run_for_score_mixed(model_name) 
    group_by_id(model_name)
    random_shuffel_concat(model_name)
# # Go through all the CSV files created and compare which model produces the most transcriptions of the expected length.

tiny


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


1


  concatenated_audio_information = pd.concat([concatenated_audio_information, new_row], ignore_index=True)


2
3
4
5
Directory ../3x10_Concatenation_information already exists
../3x10_Concatenation_information/tiny__concatenated_audio_information_scores_id_sorted_v1.csv


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


1


  concatenated_audio_information = pd.concat([concatenated_audio_information, new_row], ignore_index=True)


2
3
4


## <span style="color:#F9F871"> Old Code</span>

Only tried for global score 5, but for multiple models.

The first test showed good results on one example for the base, nb-whisper-medium and nb-whisper-medium-verbatim models.

Since nb-whisper-medium-verbatim showed promising results on the other metrics, I will try concatenating the rest of the score-5 clips for this model.

Then calculate the success rate and decide whether the remaining scores should be handled in the same way.
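
One possible definition of the success rate is the share of concatenated clips whose transcription has exactly the expected number of words, that is, a `length_deviation_words` value of 0. The sketch below assumes that definition; `equal_length_rate` is a hypothetical helper and the input values are made up.

```python
def equal_length_rate(length_deviations):
    """Share of clips whose transcription has the expected word count."""
    if not length_deviations:
        return 0.0
    matches = sum(1 for deviation in length_deviations if deviation == 0)
    return matches / len(length_deviations)


# Made-up deviations for four clips: two match, one is too long, one too short.
print(equal_length_rate([0, 2, 0, -1]))
```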


Here we see that in some cases the model transcribes the correct number of words, and the transcriptions are correct apart from a few.

But in most cases both the number of words and the transcription are wrong.

Note: clip 22 (words 210_220) contains a longer sentence.

Clip 20, with 23 different transcriptions, has no extra words.

Whisper base performs extremely poorly, so the NB models are at least better here.