# Purpose

The purpose of this code is to tie annotations to their time of occurrence in WAV files and map them to the points output by VGGish. We do this in 6 major steps.

*If the file human_readable_annotation_timings.csv already exists, skip to Part II of the code*

### Part I - creates human_readable_annotation_timings.csv
1. We pull the most recent annotation file for each image. This list was created by an earlier notebook, "unique_images_annotations.ipynb". If you have not run it yet, run it now to generate unique_images_annotations.csv, which this script requires.
2. Open the most recent JSON for each image to get the annotations and their coordinates.
3. Each annotated image has an associated file number. These file numbers directly relate to the windows covered e.g., file-0 covers from 0 to 1 minute of time in original audio, file-1 covers from minute 1 to minute 2 of original audio, etc. We will use these file numbers to get the spectrogram start time in seconds.
4. To find where each noise specifically happened we pull the coordinates of the annotations. By finding how far the annotation started and ended in the spectrogram and knowing that each spectrogram accounts for a minute, we can get the start and end time of a sound in seconds. This output is saved as human_readable_annotation_timings.csv

### Part II - creates vggish_X_sec_comb_labels.csv
1. We need to match these annotation start and stop times back to the VGGish points. We know that each VGGish point accounts for X seconds of raw audio time (denoted by the variable example_duration), so dividing the start and stop sound times by X tells us which example each belongs to.
2. Because we use each annotation as the label of the vggish_point in later clustering algorithms, we need to ensure each point has a single label. This requires combining labels for vggish_points with multiple annotations. This output is saved as vggish_X_sec_comb_labels.csv

## Part I

Step 1: We start by loading the basic Python libraries and unique_images_annotations.csv. The mapping file developed by Saumya, which ties wav filenames back to the shortened labels in VGGish's output, and the constant "example_duration" are loaded at the start of Part II, where they are first used; example_duration can be changed based on model inputs.

In [74]:
#Library Imports
import numpy as np
import pandas as pd
import json
import os
from os import listdir
from os.path import isfile, join

#File imports
uniq_annotation_df = pd.read_csv('unique_images_annotations.csv')

Step 2: We begin by opening each JSON listed in uniq_annotation_df and pulling out the image name, the sound label, and the annotation points.

In [75]:
#Getting a list of all the unique JSON files
json_file_list = list(uniq_annotation_df['json_file_path'])

#Creating a df to save data from the JSON in
annotation_df = pd.DataFrame(columns=['sound','points','image_name', 'json_file_path'])

#Iterate through json files on external hard drive to get annotated image info
for file in json_file_list:
    #Loading the JSON data & turning into dict
    #The with-block closes each file once it has been read
    with open(file) as annotated_file:
        annotated_dict = json.load(annotated_file)
    #Pulling out the labels, points, and image path
    image_name = annotated_dict['imagePath']
    if len(annotated_dict['shapes']) > 0:
        for shape in annotated_dict['shapes']:
            sound = shape['label']
            points = shape['points']
            annotation_df.loc[len(annotation_df.index)] = [sound, points, image_name, file]
    else:
        sound = None
        points = None
        annotation_df.loc[len(annotation_df.index)] = [sound, points, image_name, file]
    
#Validating we got all the json files from the original unique annotations dataframe in the new annotation df
if len(uniq_annotation_df) != len(annotation_df['json_file_path'].unique()):
    print("ERROR - not all JSON contained in the annotation_df")
else:
    pass

#Validating that images paired with json from unique annotations file are the same as the images referenced in the JSON files themselves
comb_df = pd.merge(left = uniq_annotation_df, right = annotation_df, how = 'left', left_on='json_file_path', right_on='json_file_path')
comb_df['image_file_name_tester'] = comb_df['image_file_name']+'.png'
comb_df['name_compare'] = np.where((comb_df['image_file_name_tester'] == comb_df['image_name']), 0, 1)
if comb_df['name_compare'].sum() != 0:
    print("ERROR - The following json files contain different image filenames than those they were aligned to in the unique_image_annotations code")
    print(comb_df.loc[comb_df['name_compare']!=0])
else:
    pass

#Dropping the JSON file path column as it will no longer be used
annotation_df = annotation_df.drop(['json_file_path'], axis = 1)

#Removing rows with images with no sounds since they aren't useful for what we want to do
annotation_df = annotation_df[annotation_df['sound'].notnull()]
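
The loop above can also be sketched as a small function that collects rows in a plain list and builds the DataFrame once, which avoids growing it row by row. This is a minimal sketch, assuming each JSON uses the same `imagePath` and `shapes` keys read above; the function name `rows_from_annotation`, the sample dictionary, and the 'boat' label are hypothetical.

```python
import pandas as pd

def rows_from_annotation(annotated_dict, json_file_path):
    """Return one row per labelled shape, or a single empty row if there are none."""
    image_name = annotated_dict['imagePath']
    shapes = annotated_dict.get('shapes', [])
    if not shapes:
        return [{'sound': None, 'points': None,
                 'image_name': image_name, 'json_file_path': json_file_path}]
    return [{'sound': shape['label'], 'points': shape['points'],
             'image_name': image_name, 'json_file_path': json_file_path}
            for shape in shapes]

# Hypothetical annotation content, for illustration only
sample = {
    'imagePath': '20181227T100004-File-20.png',
    'shapes': [
        {'label': 'fish', 'points': [[400.0, 50.0], [650.0, 120.0]]},
        {'label': 'boat', 'points': [[900.0, 10.0], [1200.0, 300.0]]},
    ],
}

rows = rows_from_annotation(sample, 'example.json')
sample_df = pd.DataFrame(rows, columns=['sound', 'points', 'image_name', 'json_file_path'])
```

In the real loop, the rows from every JSON file would be appended to one list before the DataFrame is created.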

Step 3: Each annotated image has an associated file number. These file numbers directly relate to the windows covered e.g., file-0 covers from 0 to 1 minute of time in original audio, file-1 covers from minute 1 to minute 2 of original audio, etc. We will use these file numbers to get the spectrogram start time in seconds. We add a column which contains the file number only and multiply it by 60 to get the number of seconds that have passed in the audio before the spectrogram starts. The new column is called "spectrogram_start_sec"

In [76]:
#Getting just the image file number and multiplying by 60 seconds
annotation_df['spectrogram_start_min'] = annotation_df['image_name'].str[21:-4]
annotation_df['spectrogram_start_sec'] = pd.to_numeric(annotation_df['spectrogram_start_min'])*60
annotation_df = annotation_df.drop('spectrogram_start_min', axis = 1)
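
The fixed slice `[21:-4]` relies on every image name having the same prefix length. A hedged alternative, assuming names always follow the `<timestamp>-File-<number>.png` pattern seen in this notebook's output, is to pull the number out with a regular expression. The DataFrame `demo_df` and its rows are illustrative only.

```python
import pandas as pd

demo_df = pd.DataFrame({'image_name': ['20181227T100004-File-0.png',
                                       '20181227T100004-File-20.png']})

# Capture the digits between "-File-" and ".png"
file_number = demo_df['image_name'].str.extract(r'-File-(\d+)\.png$', expand=False)

# Each file covers one minute, so the spectrogram starts at file_number * 60 seconds
demo_df['spectrogram_start_sec'] = pd.to_numeric(file_number) * 60
```

Names that do not match the pattern come back as missing values, which makes a malformed filename easier to spot than a silently wrong slice.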

Step 4: To find where each noise specifically happened we pull the coordinates of the annotations. By finding how far the annotation started and ended in the spectrogram and knowing that each spectrogram accounts for a minute, we can get the start and end time of a sound in seconds.

We start by getting x1 and x2 from the coordinates of the bounding boxes. The min is the start time in pixels, and the max is the stop time in pixels. We then subtract the distance from the edge of the image to the spectrogram so that we don't count white space in the image which isn't relevant to the audio. Then, by finding how far through the spectrogram the annotation starts (as a fraction of its width), we can multiply by 60 (60 secs per spectrogram) to find where the sound started within that minute. Adding spectrogram_start_sec gives the time in the audio file at which the sound started.

In [77]:
#Getting individual points from points column
annotation_df['point1'] = annotation_df['points'].str[0]
annotation_df['point2'] = annotation_df['points'].str[1]

#Getting x1 and x2 from point1 and point2
annotation_df['x1'] = annotation_df['point1'].str[0]
annotation_df['x2'] = annotation_df['point2'].str[0]

#Finding the start vs. the stop time
annotation_df['annotation_start'] = round(annotation_df[["x1", "x2"]].min(axis=1), 3)
annotation_df['annotation_stop'] = round(annotation_df[["x1", "x2"]].max(axis=1), 3)

#Subtract the avg. left edge of the spectrogram in pixels (309.532) from the annotation start and stops
#to ensure we don't count whitespace in the image which has no relevance to the audio in the wav files
annotation_df['annotation_start_shifted'] = annotation_df['annotation_start'] - 309.532
annotation_df['annotation_stop_shifted'] = annotation_df['annotation_stop'] - 309.532

#Divide the start & stop times by the total avg pixels in the spectrogram (1863.734)
#& multiply by 60 to get time in seconds
annotation_df['annotation_start_sec'] = annotation_df['annotation_start_shifted'] * 60.000 / 1863.734
annotation_df['annotation_stop_sec'] = annotation_df['annotation_stop_shifted'] * 60.000 / 1863.734

#Adding spectrogram_start_sec to the annotation start/stop secs to get the final annotation start/stop time
annotation_df['time_in_wav_start_sec'] = annotation_df['annotation_start_sec'] + annotation_df['spectrogram_start_sec']
annotation_df['time_in_wav_stop_sec'] = annotation_df['annotation_stop_sec'] + annotation_df['spectrogram_start_sec']

#Removing all the calculation columns
annotation_df = annotation_df.drop(['points','spectrogram_start_sec','point1','point2','x1','x2','annotation_start','annotation_stop','annotation_start_sec','annotation_stop_sec', 'annotation_start_shifted', 'annotation_stop_shifted'], axis=1)
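
The arithmetic in this step can be condensed into a single helper. The sketch below reuses the two pixel constants from the cell above and omits the rounding to three decimals; the function name `pixel_to_wav_seconds` and the midpoint coordinate are hypothetical and only serve as a worked example.

```python
# Constants taken from the cell above
LEFT_EDGE_PX = 309.532           # average left edge of the spectrogram within the image
SPECTROGRAM_WIDTH_PX = 1863.734  # average spectrogram width in pixels
SECONDS_PER_SPECTROGRAM = 60.0

def pixel_to_wav_seconds(x_px, file_number):
    """Convert an x coordinate in the image to seconds from the start of the wav file."""
    fraction = (x_px - LEFT_EDGE_PX) / SPECTROGRAM_WIDTH_PX
    return file_number * SECONDS_PER_SPECTROGRAM + fraction * SECONDS_PER_SPECTROGRAM

# Hypothetical x coordinate exactly halfway across the spectrogram of File-20
midpoint_px = LEFT_EDGE_PX + SPECTROGRAM_WIDTH_PX / 2
midpoint_sec = pixel_to_wav_seconds(midpoint_px, 20)
```

A point halfway across File-20 lands at approximately 1230 seconds: 1200 seconds for the twenty preceding minutes plus half of the current minute.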

We save the current annotation data frame as "human_readable_annotation_timings.csv". The format is useful for humans to double check the location of annotations in wav files. In Part II of the code below we will change the format to be more machine-friendly and serve as an input in future clustering models.

In [78]:
#Saving the annotation timings in a human-readable format
annotation_df.to_csv('human_readable_annotation_timings.csv', index = False)

## Part II

In Part II of the code we alter "human_readable_annotation_timings.csv" to be machine-friendly and serve as data label inputs in future clustering models. We start by importing files and libraries necessary for Part II of the code below.

In [79]:
#Library Imports
import numpy as np
import pandas as pd
import json
import os
from os import listdir
from os.path import isfile, join

#File imports
mapping_file_df = pd.read_csv('mapping_filenames.csv')
annotation_df = pd.read_csv('human_readable_annotation_timings.csv')

#Setting duration of VGGish examples
example_duration = 0.96

Step 1: We need to match the annotation start and stop times in human_readable_annotation_timings.csv back to the VGGish points. We know that each VGGish point accounts for X seconds of raw audio time (denoted by the variable example_duration above), so dividing the start and stop sound times by X tells us which example each belongs to. We create an intermediate column with the range of examples that cover the sound.

In [80]:
#Creating a col with the starting example and ending example by dividing the time by example_duration
#for the seconds in the spectrogram
annotation_df['start_example_float'] = annotation_df['time_in_wav_start_sec'] / example_duration
annotation_df['stop_example_float'] = annotation_df['time_in_wav_stop_sec'] / example_duration

#Taking the floor of the start example and stop example to ensure we get all the sound
#in our examples (rounding the start and stop times down to the nearest example)
annotation_df['start_example'] = annotation_df['start_example_float'].apply(np.floor)
annotation_df['stop_example'] = annotation_df['stop_example_float'].apply(np.floor)

#Creating column which is the list of the examples between start_example and stop_example
start_examples = list(annotation_df['start_example'])
stop_examples = list(annotation_df['stop_example'])
example_range = []
for i in range(len(start_examples)):
    range_list = list(range(int(start_examples[i]), int(stop_examples[i])+1))
    example_range.append(range_list)

#Adding a col which is the range of example numbers per sound
annotation_df['example_numbers'] = example_range

#Verifying that no start/stop examples occur outside the expected timeframe based on the image name
#(uses example_duration rather than a hard-coded 0.96 so the check stays valid if the duration changes)
test_annotation_df = annotation_df.copy()
test_annotation_df['image_number'] = test_annotation_df['image_name'].str[21:-4]
test_annotation_df['min_example'] = (pd.to_numeric(test_annotation_df['image_number'])*60/example_duration)
test_annotation_df['max_example'] = ((pd.to_numeric(test_annotation_df['image_number'])+1)*60/example_duration)
test_annotation_df['min_example_floor'] = test_annotation_df['min_example'].apply(np.floor)
test_annotation_df['max_example_floor'] = test_annotation_df['max_example'].apply(np.floor)
test_annotation_df['min_bounds_exceeded'] = np.where((test_annotation_df['start_example'] < test_annotation_df['min_example_floor']), 1, 0)
test_annotation_df['max_bounds_exceeded'] = np.where((test_annotation_df['stop_example'] > test_annotation_df['max_example_floor']), 1, 0)
if len(test_annotation_df.loc[test_annotation_df['min_bounds_exceeded']==1]) > 0:
    print("ERROR - example below min example present in data, verify close enough")
    print(test_annotation_df.loc[test_annotation_df['min_bounds_exceeded']==1])
if len(test_annotation_df.loc[test_annotation_df['max_bounds_exceeded']==1]) > 0:
    print("ERROR - example above max example present in data, verify close enough")
    print(test_annotation_df.loc[test_annotation_df['max_bounds_exceeded']==1])

#Dropping the start and stop example cols
annotation_df = annotation_df.drop(['start_example','stop_example', 'start_example_float','stop_example_float'], axis = 1)

ERROR - example below min example present in data, verify close enough
     sound                   image_name  time_in_wav_start_sec  \
1833  fish  20181227T100004-File-20.png            1199.990439   

      time_in_wav_stop_sec  start_example_float  stop_example_float  \
1833           1209.175032           1249.99004         1259.557326   

      start_example  stop_example  \
1833         1249.0        1259.0   

                                        example_numbers image_number  \
1833  [1249, 1250, 1251, 1252, 1253, 1254, 1255, 125...           20   

      min_example  max_example  min_example_floor  max_example_floor  \
1833       1250.0       1312.5             1250.0             1312.0   

      min_bounds_exceeded  max_bounds_exceeded  
1833                    1                    0  
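
The flagged row can be reproduced by hand. The sketch below applies the same floor-and-range logic to the timings printed for row 1833; the function name `examples_covering` is hypothetical. The annotation starts about 0.01 seconds before the 1200-second mark where File-20 begins, so its first example is 1249, one below File-20's first example of 1250. A likely cause is a bounding box drawn slightly left of the average left edge used in Part I, though that is an assumption.

```python
import math

example_duration = 0.96  # seconds of audio per VGGish example, as set above

def examples_covering(start_sec, stop_sec, duration=example_duration):
    """Return the indices of every example that overlaps [start_sec, stop_sec]."""
    first = math.floor(start_sec / duration)
    last = math.floor(stop_sec / duration)
    return list(range(first, last + 1))

# Timings from row 1833 in the output above
examples = examples_covering(1199.990439, 1209.175032)

# First example that belongs to File-20, which starts at 20 * 60 seconds
file_20_first_example = math.floor(20 * 60 / example_duration)
```

Here `examples` runs from 1249 to 1259 inclusive, matching the `start_example` and `stop_example` columns in the output.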


Step 1 (continued): Now we need to map the wav files in mapping_filenames.csv back to the images. The image_name begins with a YYYYMMDDTHHMMSS timestamp, so we take its YYMMDD and HHMMS parts and match them to the first 12 characters of the wav filename, which are formatted YYMMDD-HHMMS. Joining on this key gives us the file lookup number.

We don't match on the final seconds digit in either filename because a batch processing issue with the original files misrecorded '2' as '4', which would otherwise create non-matches between files.

In [81]:
#Create a column with the image name in a format that is matchable back to the wav file name in the
#mapping_filenames.csv file
annotation_df['image_name_wav_format'] = (annotation_df['image_name'].str[2:8] + '-' + annotation_df['image_name'].str[9:14])

#Getting the matching characters from the mapping_filenames wav names
mapping_file_df['wav_name_match'] = mapping_file_df['wav_filename'].str[:12]

#Merging the annotation_df and mapping_filenames df together
wav_annotation_df = pd.merge(left = annotation_df, right = mapping_file_df, how = 'left',
                             left_on='image_name_wav_format', right_on='wav_name_match')

#Check that all files have a wav file matching
if len(wav_annotation_df.loc[wav_annotation_df['wav_filename'].isnull()]) > 0:
    print("ERROR - bad merge, some images don't have associated wav files")

#Removing extra columns used to merge datasets
wav_annotation_df = wav_annotation_df.drop(['image_name_wav_format','wav_name_match'], axis = 1)
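
To make the key construction concrete, the sketch below builds the 12-character key for one image name from the output above and merges it against a single mapping row. The wav filename and the mapped filename `file_a` are hypothetical, chosen so that the final seconds digit differs ('2' versus '4') as described. The `indicator=True` argument of `pd.merge` is one optional way to list unmatched rows.

```python
import pandas as pd

# Image name taken from the output above; the wav and mapped names are hypothetical
annotations = pd.DataFrame({'image_name': ['20181227T100004-File-20.png']})
mapping = pd.DataFrame({'wav_filename': ['181227-100002.wav'],
                        'mapped_filename': ['file_a']})

# YYMMDD from characters 2-7, then HHMMS from characters 9-13 (final seconds digit dropped)
annotations['match_key'] = (annotations['image_name'].str[2:8] + '-'
                            + annotations['image_name'].str[9:14])
mapping['match_key'] = mapping['wav_filename'].str[:12]

# indicator=True adds a "_merge" column that flags rows with no wav match
merged = pd.merge(annotations, mapping, how='left', on='match_key', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
```

Both names reduce to the key `181227-10000`, so the merge succeeds even though the full timestamps differ in their last digit.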

Step 1 (continued): Now we pull apart the 'example_numbers' column to create one row per example number. We create a final column which shows the mapped_filename-example_number which is the unique label on each vggish point.

In [82]:
#Getting list of sounds and the list of example_numbers
sounds_list = list(wav_annotation_df['sound'])
image_name_list = list(wav_annotation_df['image_name'])
mapped_filename_list = list(wav_annotation_df['mapped_filename'])
example_numbers_list = list(wav_annotation_df['example_numbers'])

#Creating final table w/ sound, image_name, vggish_point
wav_to_annotation_df = pd.DataFrame(columns= ['sound','image_name','vggish_point'])

#Iterate through each example numbers array, pull it apart, save each element + the mapped filename and sound
#to wav_to_annotation_df
for i in range(len(example_numbers_list)):
    sound = sounds_list[i]
    image_name = image_name_list[i]
    mapped_filename = mapped_filename_list[i]
    examples = example_numbers_list[i]
    for example in examples:
        vggish_point = str(mapped_filename) + '-' + str(example)
        wav_to_annotation_df.loc[len(wav_to_annotation_df)] = [sound, image_name, vggish_point]

#Checking that all examples are in the final wav_to_annotation_df
example_count = 0
for example in example_numbers_list:
    example_count = example_count + len(example)
if example_count != len(wav_to_annotation_df):
    print("ERROR - file wrong length. Check code to ensure all examples are included.")
else:
    pass

#Saving output for later checks
#wav_to_annotation_df.to_csv('wav_to_annotation_new_ceil.csv', index= False)
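
Appending one row at a time with `.loc` can become slow as the table grows. As a hedged alternative, pandas' `DataFrame.explode` produces the same one-row-per-example layout in a single call. The rows in `demo_df` are hypothetical and only mirror the shape of wav_annotation_df.

```python
import pandas as pd

# Hypothetical rows in the same shape as wav_annotation_df
demo_df = pd.DataFrame({
    'sound': ['fish', 'boat'],
    'image_name': ['20181227T100004-File-20.png', '20181227T100004-File-20.png'],
    'mapped_filename': ['file_a', 'file_a'],
    'example_numbers': [[1249, 1250], [1250]],
})

# One row per (annotation, example number) pair
exploded = demo_df.explode('example_numbers').reset_index(drop=True)

# Build the "<mapped_filename>-<example_number>" label for each VGGish point
exploded['vggish_point'] = (exploded['mapped_filename'].astype(str) + '-'
                            + exploded['example_numbers'].astype(str))
result = exploded[['sound', 'image_name', 'vggish_point']]
```

The length check used above still applies: the number of rows in `result` should equal the total number of elements across all example_numbers lists.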

Step 2: Because we use each annotation as the label of the vggish_point in later clustering algorithms, we need to ensure each point has a single label. This requires combining labels for vggish_points with multiple annotations.

In [83]:
#Grouping each point by its annotations to count how many have mult. annotations
point_counts = pd.DataFrame(wav_to_annotation_df.groupby(['vggish_point']).count().reset_index())
print("{0}% of points have >1 annotation ({1} out of {2}).".format(
    round(len(point_counts.loc[point_counts['sound']>1])*100/len(point_counts),2),
    len(point_counts.loc[point_counts['sound']>1]), len(point_counts)))

#Making vggish_point the primary key and combining sounds into a list per primary key
grouped_annotation_df = wav_to_annotation_df.groupby('vggish_point')['sound'].agg(list).reset_index()

#Validating we didn't lose any annotations
if len(grouped_annotation_df) != len(wav_to_annotation_df['vggish_point'].unique()):
    print("ERROR - missing points after grouping")
else:
    pass

#Checking that we captured all vggish_points with > 1 annotation
sound_lists = list(grouped_annotation_df['sound'])
multi_sound_count = 0
for sound in sound_lists:
    if len(sound)>1:
        multi_sound_count = multi_sound_count + 1
    else:
        pass
if multi_sound_count != len(point_counts.loc[point_counts['sound']>1]):
    print("ERROR - not all duplicate sounds captured in grouped sound lists")
else:
    pass

#Iterating through individual points' sounds to remove dupes, alphabetize, and make strings
sounds_list = list(grouped_annotation_df['sound'])
vggish_point_list = list(grouped_annotation_df['vggish_point'])
deduped_labels_df = pd.DataFrame(columns=['vggish_point','label'])
for i in range(len(vggish_point_list)):
    vggish_point = vggish_point_list[i]
    #set() removes duplicate sounds and sorted() alphabetizes them
    unique_sound_list = sorted(set(sounds_list[i]))
    label = '-'.join(unique_sound_list)
    deduped_labels_df.loc[len(deduped_labels_df)] = [vggish_point, label]
print("Labels have been aggregated for points with multiple annotations")

#Validating that there are as many deduplicated points as there were unique points previously
if len(grouped_annotation_df) == len(wav_to_annotation_df['vggish_point'].unique()) == len(deduped_labels_df):
    pass
else:
    print("ERROR - missing points after deduping annotation labels")

#Validate that the vggish_points are unique
if len(deduped_labels_df) != len(list(deduped_labels_df['vggish_point'].unique())):
    print("ERROR - points are not unique")
else:
    pass

15.21% of points have >1 annotation (1586 out of 10430).
Labels have been aggregated for points with multiple annotations
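
The deduplicate, alphabetize, and join steps can also be expressed as a single groupby aggregation. This is a minimal sketch on hypothetical data, not a replacement for the validated code above; the helper name `combine_labels` and the 'boat' label are assumptions for illustration.

```python
import pandas as pd

# Hypothetical points; 'file_a-1250' carries two different sounds and one repeat
demo_df = pd.DataFrame({
    'sound': ['fish', 'fish', 'boat', 'fish'],
    'vggish_point': ['file_a-1249', 'file_a-1250', 'file_a-1250', 'file_a-1250'],
})

def combine_labels(sounds):
    """Deduplicate, alphabetize, and join the sounds attached to one point."""
    return '-'.join(sorted(set(sounds)))

labels_df = (demo_df.groupby('vggish_point')['sound']
             .agg(combine_labels)
             .reset_index()
             .rename(columns={'sound': 'label'}))
```

In this example the point with a repeated 'fish' and one 'boat' receives the combined label 'boat-fish', while the single-annotation point keeps 'fish'.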


Step 2 (continued): The vggish_points and their deduplicated annotation labels are saved as vggish_X_sec_comb_labels.csv, where X is the value of example_duration.

In [84]:
#Saving the file
filename = 'vggish_'+str(example_duration)+'_sec_comb_labels.csv'
deduped_labels_df.to_csv(filename, index=False)