# <b><span style="color:#FF9671"> Remove Empty Transcriptions from the Transcribed Data Frame </span></b>

For some reason, some of the empty transcriptions sneak into the transcribed data frame. This script fixes this.

Look for rows in the transcription with <b><span style="color:#FF6F91">NaN</span></b> values and rows in the assessment that are not in the transcription.

After <b><span style="color:#FF6F91">removing</span></b> the NaN rows, the remaining step is <b><span style="color:#FF6F91">comparing</span></b> the assessment with each transcription to find the rows that are missing from the transcription.

All <b><span style="color:#FF6F91">missing rows</span></b> are assumed to be <b><span style="color:#FF6F91">empty transcriptions</span></b> and are given <b><span style="color:#FF6F91">CER = 1.0</span></b>. 

A new empty transcription file is created and saved.
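
To make the CER = 1.0 convention concrete: the character error rate is commonly defined as the character-level edit distance between the transcription and the target, divided by the length of the target. For an empty transcription every target character counts as a deletion, so the ratio is exactly 1.0. The sketch below is only an illustration with a hand-written Levenshtein distance; the names `edit_distance` and `cer` and the toy word are hypothetical and are not taken from `self_made_functions`.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("hund", ""))      # empty transcription
print(cer("hund", "hund"))  # perfect transcription
```

With this definition, assigning CER = 1.0 to a missing row is the same as scoring an empty string against the target word.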


### <b><span style="color:#FF9671">  Library Imports </span></b>

In [8]:
# This script's task is to go through all the transcribed files.
# It checks whether there are any missing values in the transcribed files that are not in the empty transcriptions file.
# It also compares the length to the assessment files and assumes the missing rows are all files the model could not transcribe.
# They are therefore deemed empty transcriptions and given CER = 1.0
import sys      

script_directory = '../'
sys.path.append(script_directory)

import self_made_functions as smf
import pandas as pd
import os

###  <b><span style="color:#FF9671">  Data Initialization </span></b>

Get the list of transcribed files and the assessment data frame. The data frame that collects the empty transcriptions is created in a later cell.

In [9]:
# Get the transcribed files to look through
path_transcriptions = '../Transcriptions'
lst = os.listdir(path_transcriptions)
lst = [file for file in lst if file.startswith('tran') and file.endswith('.csv')]

# Compare with the original assessment
df_assessment, _ = smf.get_correct_df()

###  <b><span style="color:#FF9671">  Remove NaN values </span></b>
Iterate through all the transcriptions and find rows with a missing transcription. Remove these rows from the transcribed data frame and save the file again.

In [10]:
for file in lst:
    df_csv = pd.read_csv(os.path.join(path_transcriptions, file))
    nan_df = df_csv[df_csv['Transcribed'].isna()]
    
    if not nan_df.empty:
        print(f'NaN in {file}')
        # Drop only the rows where the transcription itself is missing
        df_csv = df_csv.dropna(subset=['Transcribed'])
        df_csv.to_csv(os.path.join(path_transcriptions, file), index=False)
    else:
        print(f'No NaN')

No NaN
No NaN
No NaN
No NaN
No NaN
No NaN
No NaN
No NaN
No NaN
No NaN
No NaN
No NaN
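
One possible way for NaN rows to appear (an assumption, not something the notebook confirms) is the CSV round trip: pandas reads an empty field as NaN by default, so a transcription saved as an empty string comes back as a missing value. The sketch below uses toy data and a hypothetical `Note` column to show this, and to show that `dropna()` without arguments removes rows with NaN in any column, while `subset=['Transcribed']` only looks at the transcription.

```python
import io

import pandas as pd

# Toy data: b.wav has an empty transcription, a.wav has an empty 'Note'
# ('Note' is a hypothetical column used only for illustration)
df = pd.DataFrame({
    "File name": ["a.wav", "b.wav", "c.wav"],
    "Transcribed": ["hei", "", "bil"],
    "Note": ["", "x", "y"],
})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer)

# The empty strings are now NaN
nan_mask = df_back["Transcribed"].isna()

# Restrict the check to the transcription column ...
cleaned = df_back.dropna(subset=["Transcribed"])
# ... versus dropping rows with NaN in any column
cleaned_any = df_back.dropna()

print(nan_mask.tolist())
print(cleaned["File name"].tolist())
print(cleaned_any["File name"].tolist())
```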


###  <b><span style="color:#FF9671">  Creating empty transcription file </span></b>
Now there are no NaN values in the transcribed data frames, meaning that every assessment row missing from a transcription is an empty transcription.

In [122]:
# Column layout of the empty transcriptions data frame
columns = ["file_name",
           "CER",  # Character Error Rate (CER)
           "target_word",
           "global_score",
           "model_name", "version"]

# Iterate through all the transcriptions and collect the missing rows
rows = []
for file in lst:
    df_csv = pd.read_csv(os.path.join(path_transcriptions, file))
    model = file.split('_')[1]
    version = file.split('_')[2].split('.')[0]

    # Find the assessment rows that are missing from the transcription
    missing_rows = df_assessment[~df_assessment['File name'].isin(df_csv['File name'])]

    # Treat every missing row as an empty transcription with CER = 1.0
    for _, row in missing_rows.iterrows():
        rows.append({"file_name": row['File name'],
                     "CER": 1.0,  # Character Error Rate (CER)
                     "target_word": row['Word'],
                     "global_score": row['Score'],
                     "model_name": model,
                     "version": version})

# Build the data frame once instead of concatenating row by row
empty_transcriptions = pd.DataFrame(rows, columns=columns)


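The lookup of missing rows is an anti-join: keep the assessment rows whose file name does not occur in the transcription. The sketch below reproduces that step on toy data and builds the empty-transcription rows in one vectorized step. The file names, words, scores, and the model/version labels are made up for illustration; only the column names follow the cell above.

```python
import pandas as pd

# Toy stand-ins for the assessment and one transcription file
df_assessment = pd.DataFrame({
    "File name": ["a.wav", "b.wav", "c.wav"],
    "Word": ["hund", "katt", "bil"],
    "Score": [5, 3, 4],
})
df_csv = pd.DataFrame({
    "File name": ["a.wav", "c.wav"],
    "Transcribed": ["hund", "bil"],
})

# Anti-join: assessment rows whose file name is absent from the transcription
missing_rows = df_assessment[~df_assessment["File name"].isin(df_csv["File name"])]

# Turn the missing rows into empty transcriptions with CER = 1.0
empty_rows = (
    missing_rows
    .rename(columns={"File name": "file_name",
                     "Word": "target_word",
                     "Score": "global_score"})
    .assign(CER=1.0, model_name="tiny", version="v1")
    [["file_name", "CER", "target_word", "global_score", "model_name", "version"]]
    .reset_index(drop=True)
)

print(empty_rows)
```
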

####  <b><span style="color:#FF6F91">  Check if it was correctly saved in the empty data frame</span></b>

In [123]:
for file in lst:
    df_csv = pd.read_csv(os.path.join(path_transcriptions, file))
    model = file.split('_')[1]
    version = file.split('_')[2].split('.')[0]
    check = empty_transcriptions[(empty_transcriptions['version']==version) & (empty_transcriptions['model_name']==model)]

    print(len(check) + len(df_csv)==len(df_assessment), len(check) + len(df_csv), len(df_assessment), model, version)
    

True 9322 9322 nb-whisper-base-verbatim v1
True 9322 9322 nb-whisper-medium-verbatim v1
True 9322 9322 tiny v1
True 9322 9322 base v1
True 9322 9322 medium v1
True 9322 9322 nb-whisper-base v1
True 9322 9322 nb-whisper-medium v1
True 9322 9322 nb-whisper-base-verbatim v2
True 9322 9322 nb-whisper-medium-verbatim v2
True 9322 9322 nb-whisper-tiny-verbatim v1
True 9322 9322 nb-whisper-tiny-verbatim v2
True 9322 9322 nb-whisper-tiny v1
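
The length comparison confirms that the counts add up. A stricter variant, sketched below on toy data, also checks that the combined file names contain no duplicates and match the assessment exactly. The helper `covers_assessment` is hypothetical, and the sketch assumes that file names are unique within the assessment.

```python
import pandas as pd

# Toy file-name columns standing in for the three data frames
assessment_files = pd.Series(["a.wav", "b.wav", "c.wav"])
transcribed_files = pd.Series(["a.wav", "c.wav"])
empty_files = pd.Series(["b.wav"])


def covers_assessment(transcribed, empty, assessment) -> bool:
    """True if transcribed + empty file names equal the assessment, without duplicates."""
    combined = pd.concat([transcribed, empty], ignore_index=True)
    no_duplicates = not combined.duplicated().any()
    same_files = set(combined) == set(assessment)
    return no_duplicates and same_files


print(covers_assessment(transcribed_files, empty_files, assessment_files))
```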


###  <b><span style="color:#FF9671">  Save the empty data frame </span></b>

In [124]:
for version in empty_transcriptions['version'].unique():
    df_to_save = empty_transcriptions[empty_transcriptions['version']==version]
    df_to_save = df_to_save.drop(columns=['version'])
    df_to_save.to_csv(f'./Transcriptions/empty_transcriptions_{version}.csv', index=False)
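
The per-version split can also be written with `groupby`. The sketch below uses toy rows and writes to a temporary directory so that it does not touch the real `Transcriptions` folder; the data values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy empty-transcription rows for two versions
empty_transcriptions = pd.DataFrame({
    "file_name": ["b.wav", "d.wav", "b.wav"],
    "CER": [1.0, 1.0, 1.0],
    "model_name": ["tiny", "tiny", "base"],
    "version": ["v1", "v2", "v1"],
})

with tempfile.TemporaryDirectory() as out_dir:
    # One CSV per version, without the 'version' column
    for version, group in empty_transcriptions.groupby("version"):
        path = os.path.join(out_dir, f"empty_transcriptions_{version}.csv")
        group.drop(columns=["version"]).to_csv(path, index=False)

    saved = sorted(os.listdir(out_dir))
    v1 = pd.read_csv(os.path.join(out_dir, "empty_transcriptions_v1.csv"))

print(saved)
print(v1)
```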