# Purpose

The purpose of this code is to tie VGGish points back to their annotations. We do this in five major steps.

1. We pull the most recent annotation file for each image. This list was created by a previous notebook, "unique_images_annotations.ipynb". Run that notebook first to generate the CSV of the same name, which this script requires.
2. Opening the most recent JSON for each image, we get the annotations and their coordinates.
3. Each annotated image has an associated file number. These file numbers directly relate to the windows covered e.g., file-0 covers from 0 to 1 minute of time in original audio, file-1 covers from minute 1 to minute 2 of original audio, etc. We will use these file numbers to get the spectrogram start time in seconds.
4. To find where each noise specifically happened, we pull the coordinates of the annotations. By finding how far into the spectrogram the annotation starts and ends, and knowing that each spectrogram covers one minute, we can get the start and end time of a sound in seconds.
5. Finally, we need to match these annotation start and stop times back to the VGGish points. We know that each VGGish point accounts for 0.96 seconds of raw audio time, so dividing the start and stop sound times by 0.96 will give us which slice each belongs to.

Step 1: We start by loading basic Python libraries and files for use (the unique image annotations file noted above, as well as the mapping file which ties wav filenames back to the shortened labels for VGGish's output).

In [159]:
#Library Imports
import numpy as np
import pandas as pd
import json
import os
import matplotlib.pyplot as plt
from os import listdir
from os.path import isfile, join

#File imports
uniq_annotation_df = pd.read_csv('unique_images_annotations.csv')
mapping_file_df = pd.read_csv('mapping_filenames.csv')

Step 2: We open each JSON listed in uniq_annotation_df and pull out the image name, sound label, and annotation points.

In [160]:
#Getting a list of all the unique JSON files
json_file_list = list(uniq_annotation_df['json_file_path'])

#Creating a df to save data from the JSON in
annotation_df = pd.DataFrame(columns=['sound','points','image_name', 'json_file_path'])

#Iterate through json files on external hard drive to get annotated image info
for file in json_file_list:
    #Loading the JSON data & turning into dict (the with-block closes the file once it has been read)
    with open(file) as annotated_file:
        annotated_dict = json.load(annotated_file)
    #Pulling out the labels, points, and image path
    image_name = annotated_dict['imagePath']
    if len(annotated_dict['shapes']) > 0:
        for shape in annotated_dict['shapes']:
            sound = shape['label']
            points = shape['points']
            annotation_df.loc[len(annotation_df.index)] = [sound, points, image_name, file]
    else:
        sound = None
        points = None
        annotation_df.loc[len(annotation_df.index)] = [sound, points, image_name, file]
    
#Validating we got all the json files from the original unique annotations dataframe in the new annotation df
if len(uniq_annotation_df) != len(annotation_df['json_file_path'].unique()):
    print("ERROR - not all JSON contained in the annotation_df")
else:
    pass

#Validating that images paired with json from unique annotations file are the same as the images referenced in the JSON files themselves
comb_df = pd.merge(left = uniq_annotation_df, right = annotation_df, how = 'left', left_on='json_file_path', right_on='json_file_path')
comb_df['image_file_name_tester'] = comb_df['image_file_name']+'.png'
comb_df['name_compare'] = np.where((comb_df['image_file_name_tester'] == comb_df['image_name']), 0, 1)
if comb_df['name_compare'].sum() != 0:
    print("ERROR - The following json files contain different image filenames than those they were aligned to in the unique_image_annotations code")
    print(comb_df.loc[comb_df['name_compare']!=0])
else:
    pass

#Dropping the JSON file path column as it will no longer be used
annotation_df = annotation_df.drop(['json_file_path'], axis = 1)

#Removing rows for images with no sounds since they aren't useful for what we want to do
annotation_df = annotation_df[annotation_df['sound'].notnull()]
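
To make the shape of the parsed data concrete, here is a minimal sketch of the per-file extraction on an inline example. Only the keys `imagePath`, `shapes`, `label`, and `points` come from the loop above; the helper name `rows_from_annotation`, the file name `example.json`, and the pixel coordinates are hypothetical and used for illustration only.

```python
import json

# Hypothetical annotation file contents; only the keys used in the loop above
# ('imagePath', 'shapes', 'label', 'points') come from the notebook.
example_json = """
{
  "imagePath": "20181204T100004-File-8.png",
  "shapes": [
    {"label": "mooring", "points": [[400.0, 120.0], [450.0, 300.0]]},
    {"label": "helicopter", "points": [[1300.0, 80.0], [1100.0, 500.0]]}
  ]
}
"""

def rows_from_annotation(annotated_dict, json_file_path):
    """Return one (sound, points, image_name, json_file_path) row per shape."""
    image_name = annotated_dict["imagePath"]
    shapes = annotated_dict.get("shapes", [])
    if not shapes:
        # Keep a placeholder row so the file is still counted during validation
        return [(None, None, image_name, json_file_path)]
    return [(s["label"], s["points"], image_name, json_file_path) for s in shapes]

rows = rows_from_annotation(json.loads(example_json), "example.json")
for row in rows:
    print(row)
```

An image with an empty `shapes` list yields a single row with `None` for the sound and points, which mirrors the `else` branch of the loop and is later filtered out.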

Step 3: Each annotated image has an associated file number. These file numbers directly relate to the windows covered, e.g., file-0 covers from 0 to 1 minute of time in the original audio, file-1 covers from minute 1 to minute 2 of the original audio, etc. We will use these file numbers to get the spectrogram start time in seconds. We extract the file number into its own column and multiply it by 60 to get the number of seconds that have passed in the audio before the spectrogram starts. The new column is called "spectrogram_start_sec".

In [162]:
#Getting just the image file number and multiplying by 60 seconds
annotation_df['spectrogram_start_min'] = annotation_df['image_name'].str[21:-4]
annotation_df['spectrogram_start_sec'] = pd.to_numeric(annotation_df['spectrogram_start_min'])*60
annotation_df = annotation_df.drop('spectrogram_start_min', axis = 1)
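
The fixed offsets in `str[21:-4]` work as long as every image name has the same prefix length. As a hedged alternative, the sketch below pulls the file number with a regular expression instead. The naming pattern `<timestamp>-File-<n>.png` is an assumption inferred from the sample image names, and `spectrogram_start_sec` here is a hypothetical helper function, not part of the notebook.

```python
import re

# Assumed naming pattern, inferred from the sample names: <timestamp>-File-<n>.png
FILE_NUMBER_PATTERN = re.compile(r"-File-(\d+)\.png$")

def spectrogram_start_sec(image_name):
    """Seconds of audio that precede the one-minute spectrogram in image_name."""
    match = FILE_NUMBER_PATTERN.search(image_name)
    if match is None:
        # Fail loudly rather than silently producing a wrong offset
        raise ValueError(f"Unexpected image name: {image_name!r}")
    return int(match.group(1)) * 60

print(spectrogram_start_sec("20181204T100004-File-8.png"))   # 480
print(spectrogram_start_sec("20181204T113004-File-16.png"))  # 960
```

Raising on an unexpected name is a design choice for this sketch; it surfaces malformed file names early instead of letting them turn into NaN values downstream.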

Step 4: To find where each noise specifically happened, we pull the coordinates of the annotations. By finding how far into the spectrogram the annotation starts and ends, and knowing that each spectrogram covers one minute, we can get the start and end time of a sound in seconds.

We start by getting x1 and x2 from the coordinates of the bounding boxes. The minimum is the start position in pixels, and the maximum is the stop position in pixels. We then subtract the distance from the left edge of the image to the left edge of the spectrogram so that we don't count white space in the image, which isn't relevant to the audio. Dividing by the spectrogram's width in pixels gives the fraction of the way through the spectrogram at which the annotation starts or stops; multiplying that fraction by 60 (60 seconds per spectrogram) gives the offset in seconds within the spectrogram. Adding spectrogram_start_sec gives the time in the audio file at which the sound started or stopped.

In [163]:
#Getting individual points from points column
annotation_df['point1'] = annotation_df['points'].str[0]
annotation_df['point2'] = annotation_df['points'].str[1]

#Getting x1 and x2 from point1 and point2
annotation_df['x1'] = annotation_df['point1'].str[0]
annotation_df['x2'] = annotation_df['point2'].str[0]

#Finding the start vs. the stop time
annotation_df['annotation_start'] = annotation_df[["x1", "x2"]].min(axis=1)
annotation_df['annotation_stop'] = annotation_df[["x1", "x2"]].max(axis=1)

#Subtract the left edge of the spectrogram in pixels (309.532) from the annotation start and stops
#to ensure we don't count whitespace in the image which has no relevance to the audio in the wav files
annotation_df['annotation_start'] = annotation_df['annotation_start'] - 309.532
annotation_df['annotation_stop'] = annotation_df['annotation_stop'] - 309.532

#Divide the start & stop times by the total pixels in the spectrogram (2173.266 - 309.532)
#& multiply by 60 to get time in seconds
annotation_df['annotation_start_sec'] = annotation_df['annotation_start'] * 60 / (2173.266- 309.532)
annotation_df['annotation_stop_sec'] = annotation_df['annotation_stop'] * 60 / (2173.266- 309.532)

#Adding spectrogram_start_sec to the annotation start/stop offsets to get the final annotation start/stop time in the wav file
annotation_df['time_in_wav_start_sec'] = annotation_df['annotation_start_sec'] + annotation_df['spectrogram_start_sec']
annotation_df['time_in_wav_stop_sec'] = annotation_df['annotation_stop_sec'] + annotation_df['spectrogram_start_sec']

#Removing all the calculation columns
annotation_df = annotation_df.drop(['points','spectrogram_start_sec','point1','point2','x1','x2','annotation_start','annotation_stop','annotation_start_sec','annotation_stop_sec'], axis=1)
#annotation_df
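
The pixel-to-seconds arithmetic above can be checked on a few boundary values. This is a minimal sketch using the edge positions from the cell above (309.532 and 2173.266 pixels); the function name `pixel_to_wav_seconds` and the constant names are hypothetical, and the start offset of 480 seconds corresponds to file number 8.

```python
# Pixel positions of the spectrogram's left and right edges, as used in the cell above
LEFT_EDGE_PX = 309.532
RIGHT_EDGE_PX = 2173.266
SECONDS_PER_SPECTROGRAM = 60

def pixel_to_wav_seconds(x_px, spectrogram_start_sec):
    """Convert an x coordinate in the image to a time in the wav file."""
    # Fraction of the way through the spectrogram (0.0 at left edge, 1.0 at right edge)
    fraction = (x_px - LEFT_EDGE_PX) / (RIGHT_EDGE_PX - LEFT_EDGE_PX)
    return spectrogram_start_sec + fraction * SECONDS_PER_SPECTROGRAM

midpoint_px = (LEFT_EDGE_PX + RIGHT_EDGE_PX) / 2

print(pixel_to_wav_seconds(LEFT_EDGE_PX, 480))   # left edge: start of the minute
print(pixel_to_wav_seconds(midpoint_px, 480))    # midpoint: roughly 30 s into the minute
print(pixel_to_wav_seconds(RIGHT_EDGE_PX, 480))  # right edge: end of the minute
```

An x coordinate left of 309.532 or right of 2173.266 would map outside the minute covered by the spectrogram, so it may be worth checking for such values if annotations can extend into the image margins.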

Step 5: Finally, we need to match these annotation start and stop times back to the VGGish points. We know that each VGGish point accounts for 0.96 seconds of raw audio time, so dividing the start and stop sound times by 0.96 will give us which slice each belongs to.

We create an intermediate column with the range of slices that cover the sound.

In [183]:
#Creating cols with the start slice and the end slice by dividing the time in seconds by 0.96 (the seconds of audio per VGGish slice)
annotation_df['start_slice'] = annotation_df['time_in_wav_start_sec'] / 0.96
annotation_df['stop_slice'] = annotation_df['time_in_wav_stop_sec'] / 0.96

#Taking the floor of the start slice and the ceiling of the stop slice to ensure we get all the sound
#in our slices
annotation_df['start_slice'] = annotation_df['start_slice'].apply(np.floor)
annotation_df['stop_slice'] = annotation_df['stop_slice'].apply(np.ceil)

#Creating column which is the list of the slices between start_slice and stop_slice
start_slices = list(annotation_df['start_slice'])
stop_slices = list(annotation_df['stop_slice'])
slice_range = []
for i in range(len(start_slices)):
    range_list = list(range(int(start_slices[i]), int(stop_slices[i])+1))
    slice_range.append(range_list)

#Adding a col which is the range of slice numbers per sound
annotation_df['slice_numbers'] = slice_range

#Dropping the start and stop slice cols
annotation_df = annotation_df.drop(['start_slice','stop_slice'], axis = 1)
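
As a worked example of the slice arithmetic, the sketch below applies the same floor, ceiling, and inclusive range to two annotations and reproduces the slice lists shown for them in the output table further down. The function name `slice_numbers` and the constant name are hypothetical; the 0.96-second window and the time values come from this notebook.

```python
import math

# Seconds of raw audio covered by each VGGish point, as stated above
VGGISH_WINDOW_SEC = 0.96

def slice_numbers(start_sec, stop_sec):
    """Inclusive list of slice indices from floor(start) to ceil(stop)."""
    start_slice = math.floor(start_sec / VGGISH_WINDOW_SEC)
    stop_slice = math.ceil(stop_sec / VGGISH_WINDOW_SEC)
    return list(range(start_slice, stop_slice + 1))

# 482.351224 / 0.96 is about 502.45 (floor 502); 484.794822 / 0.96 is about 504.99 (ceil 505)
print(slice_numbers(482.351224, 484.794822))    # [502, 503, 504, 505]
print(slice_numbers(1662.745326, 1663.658924))  # [1732, 1733]
```

Because the stop value is rounded up and the range is inclusive, the list can extend one slice past the slice that contains the stop time. In the first example, slice 505 begins at 484.8 seconds, just after the sound ends, which is consistent with the stated goal of erring on the side of capturing all of the sound.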

Step 5 (continued): Now we need to map the wav files in mapping_filenames.csv back to the images. We do this by taking the YYMMDD and HHMMS portions of the timestamp in the image_name and matching them to the first 12 characters of the wav filename, which have the form YYMMDD-HHMMS, then joining to get the file lookup number (mapped_filename).

We don't match on the final seconds digit in either filename because a batch processing issue with the original files misrecorded '2' as '4', which would create non-matches between files.

In [184]:
#Create a column with the image name in a format that is matchable back to the wav file name in the
#mapping_filenames.csv file
annotation_df['image_name_wav_format'] = (annotation_df['image_name'].str[2:8] + '-' + annotation_df['image_name'].str[9:14])

#Getting the matching characters from the mapping_filenames wav names
mapping_file_df['wav_name_match'] = mapping_file_df['wav_filename'].str[:12]

#Matching the annotation_df and mapping_filenames df together
wav_annotation_df = pd.merge(left = annotation_df, right = mapping_file_df, how = 'left',
                             left_on='image_name_wav_format', right_on='wav_name_match')

#Checking match
if len(wav_annotation_df) == len(annotation_df):
    pass
else:
    print("ERROR - unsuccessful merge, rows differ between annotation_df and wav_annotation_df. Check code.")

#Removing extra columns used to merge datasets
wav_annotation_df = wav_annotation_df.drop(['image_name_wav_format','wav_name_match'], axis = 1)
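
To show what the two match keys look like, here is a small sketch that applies the same string slices to one image name and one wav filename taken from the output table below. The function names `image_match_key` and `wav_match_key` are hypothetical.

```python
def image_match_key(image_name):
    """YYMMDD-HHMMS key from an image name such as 20181204T100004-File-8.png."""
    # Characters 2-7 are YYMMDD; characters 9-13 are HHMMS (final seconds digit dropped)
    return image_name[2:8] + "-" + image_name[9:14]

def wav_match_key(wav_filename):
    """First 12 characters of the wav filename, i.e. YYMMDD-HHMMS."""
    return wav_filename[:12]

image_key = image_match_key("20181204T100004-File-8.png")
wav_key = wav_match_key("181204-100002-437599-806141979_resampled")

print(image_key)             # 181204-10000
print(wav_key)               # 181204-10000
print(image_key == wav_key)  # True
```

In this pair the full timestamps end in 4 and 2 respectively, so including the final seconds digit in the key would have prevented the match.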


| | sound | image_name | time_in_wav_start_sec | time_in_wav_stop_sec | slice_numbers | wav_filename | mapped_filename |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | mooring | 20181204T100004-File-8.png | 482.351224 | 484.794822 | [502, 503, 504, 505] | 181204-100002-437599-806141979_resampled | 551 |
| 1 | helicopter | 20181204T100004-File-8.png | 531.877158 | 539.772215 | [554, 555, 556, 557, 558, 559, 560, 561, 562, ... | 181204-100002-437599-806141979_resampled | 551 |
| 2 | mooring | 20181204T113004-File-16.png | 963.708778 | 965.842078 | [1003, 1004, 1005, 1006, 1007] | 181204-113002-437599-806141979_resampled | 506 |
| 3 | mooring | 20181204T113004-File-16.png | 988.959292 | 991.092592 | [1030, 1031, 1032, 1033] | 181204-113002-437599-806141979_resampled | 506 |
| 4 | mooring | 20181204T113004-File-16.png | 1003.000284 | 1005.909330 | [1044, 1045, 1046, 1047, 1048] | 181204-113002-437599-806141979_resampled | 506 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2947 | humpback | 20190217T023004-File-27.png | 1659.874020 | 1662.092757 | [1729, 1730, 1731, 1732] | 190217-023002-437599-806141979_resampled | 407 |
| 2948 | humpback | 20190217T023004-File-27.png | 1662.745326 | 1663.658924 | [1732, 1733] | 190217-023002-437599-806141979_resampled | 407 |
| 2949 | humpback | 20190217T023004-File-27.png | 1670.967704 | 1672.272843 | [1740, 1741, 1742] | 190217-023002-437599-806141979_resampled | 407 |
| 2950 | humpback | 20190217T023004-File-27.png | 1675.840224 | 1677.188867 | [1745, 1746, 1747, 1748] | 190217-023002-437599-806141979_resampled | 407 |


Step 5 (continued): Now we pull apart the 'slice_numbers' column to create a row per slice number. We create a final column which shows the mapped_filename-slice_number which is the unique label on each vggish point. We save this focused data frame for easy reference in model output clustering scripts.

In [200]:
#Getting list of sounds and the list of slice_numbers
sounds_list = list(wav_annotation_df['sound'])
image_name_list = list(wav_annotation_df['image_name'])
mapped_filename_list = list(wav_annotation_df['mapped_filename'])
slice_numbers_list = list(wav_annotation_df['slice_numbers'])

#Creating final table w/ sound, image_name, vggish_point
wav_to_annotation_df = pd.DataFrame(columns= ['sound','image_name','vggish_point'])

#Iterate through each slice numbers array, pull it apart, save each element + the mapped filename and sound to a new df
for i in range(len(slice_numbers_list)):
    sound = sounds_list[i]
    image_name = image_name_list[i]
    mapped_filename = mapped_filename_list[i]
    slices = slice_numbers_list[i]
    for slice_number in slices:  #named slice_number to avoid shadowing the built-in slice
        vggish_point = str(mapped_filename) + '-' + str(slice_number)
        wav_to_annotation_df.loc[len(wav_to_annotation_df)] = [sound, image_name, vggish_point]

#Saving the file
wav_to_annotation_df.to_csv('wav_to_annotation.csv', index=False)
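
The label format can be illustrated on a single row. This sketch builds the mapped_filename-slice_number labels for the first row of the table above (mapped_filename 551, slices 502 to 505); the function name `vggish_points` is hypothetical.

```python
def vggish_points(mapped_filename, slice_numbers):
    """One 'mapped_filename-slice_number' label per slice."""
    return [f"{mapped_filename}-{n}" for n in slice_numbers]

labels = vggish_points(551, [502, 503, 504, 505])
print(labels)  # ['551-502', '551-503', '551-504', '551-505']
```

For larger tables, pandas' `DataFrame.explode` on the 'slice_numbers' column is one possible way to produce a row per slice without appending to a DataFrame inside a loop; that is a suggestion, not what the cell above does.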

In [197]:
slice_numbers_list

[[502, 503, 504, 505],
 [554, 555, 556, 557, 558, 559, 560, 561, 562, 563],
 [1003, 1004, 1005, 1006, 1007],
 [1030, 1031, 1032, 1033],
 [1044, 1045, 1046, 1047, 1048],
 [1659, 1660, 1661, 1662],
 [640, 641, 642, 643],
 [647, 648, 649, 650],
 [667, 668, 669, 670],
 [1271, 1272, 1273, 1274, 1275],
 [1277, 1278, 1279, 1280],
 [1291, 1292, 1293, 1294, 1295],
 [1304, 1305, 1306, 1307, 1308],
 [1788, 1789, 1790, 1791],
 [1792, 1793, 1794, 1795],
 [1808, 1809, 1810],
 [37, 38, 39, 40],
 [1602, 1603, 1604, 1605, 1606],
 [715, 716, 717, 718],
 [722, 723, 724, 725, 726],
 [687, 688, 689, 690, 691, 692, 693, 694],
 [250, 251, 252, 253],
 [255, 256, 257, 258, 259, 260],
 [215, 216, 217, 218],
 [1305, 1306, 1307],
 [1273, 1274, 1275, 1276, 1277, 1278, 1279, 1280, 1281],
 [1036, 1037, 1038, 1039],
 [1053, 1054, 1055, 1056, 1057],
 [1547, 1548, 1549, 1550],
 [1550, 1551, 1552, 1553, 1554],
 [25, 26, 27, 28, 29],
 [897, 898, 899, 900, 901],
 [912, 913, 914, 915, 916],
 [912, 913, 914, 915],
 ...

We save this final large dataframe to manually listen to audio. We also save a smaller df with just the sound and vggish_point_id for easy future reference.