The purpose of this file is to explore whether we actually need the metadata files to map annotated images back to their wav files and recover their window information. We will do this by:

1. Matching annotated images back to their metadata files. Do the windows in the metadata files correspond to the image file numbers? If so, we don't need windows_plotted from the metadata files and can map back to the wav files based on YYMMDDTHHMMs.
2. If image file numbers and metadata windows_plotted aren't correlated, we need to keep thinking about how to work around the missing metadata files.


In [1]:
#Imports
import numpy as np
import pandas as pd
import json
import os
import matplotlib.pyplot as plt
from os import listdir
from os.path import isfile, join

Getting the most recent json information from the unique_images_annotations.csv

In [2]:
#Creating a df to save the data in
annotation_df = pd.DataFrame(columns=['sound','points','image_name', 'json_file_path'])
annotation_df.head()

#Getting the path to the JSON files & listing files
uniq_annotation_df = pd.read_csv('unique_images_annotations.csv')
json_file_list = list(uniq_annotation_df['json_file_path'])

#Iterate through json files on external hard drive to get annotated image info
for file in json_file_list:
    #Loading the JSON data & turning into dict
    with open(file) as annotated_file:
        annotated_dict = json.load(annotated_file)
    #annotated_dict.keys()

    #Pulling out the labels, points, and image path
    image_name = annotated_dict['imagePath']
    if len(annotated_dict['shapes']) > 0:
        for shape in annotated_dict['shapes']:
            sound = shape['label']
            points = shape['points']
            annotation_df.loc[len(annotation_df.index)] = [sound, points, image_name, file]
    else:
        sound = None
        points = None
        annotation_df.loc[len(annotation_df.index)] = [sound, points, image_name, file]
    
#Validating that images paired with json from unique annotations file are the same as the images referenced in the JSON files themselves
comb_df = pd.merge(left = uniq_annotation_df, right = annotation_df, how = 'left', left_on='json_file_path', right_on='json_file_path')
comb_df['image_file_name_tester'] = comb_df['image_file_name']+'.png'
comb_df['name_compare'] = np.where((comb_df['image_file_name_tester'] == comb_df['image_name']), 0, 1)
if comb_df['name_compare'].sum() != 0:
    print("The following json files contain different image filenames than those they were aligned to in the unique_image_annotations code")
    print(comb_df.loc[comb_df['name_compare']!=0])
else:
    pass

#Validating we got all the json files from the original unique annotations dataframe in the new annotation df
if len(uniq_annotation_df) != len(annotation_df['json_file_path'].unique()):
    print("ERROR - not all JSON contained in the annotation_df")
else:
    pass

#Examining head of file
print(annotation_df.head())

        sound                                             points  \
0     mooring  [[382.56626506024094, 614.4819277108433], [458...   
1  helicopter  [[1920.952380952381, 1048.1904761904761], [216...   
2     mooring  [[424.7349397590361, 748.2168674698794], [490....   
3     mooring  [[1209.0722891566263, 938.5783132530119], [127...   
4     mooring  [[1645.2168674698794, 744.6024096385541], [173...   

                    image_name  \
0   20181204T100004-File-8.png   
1   20181204T100004-File-8.png   
2  20181204T113004-File-16.png   
3  20181204T113004-File-16.png   
4  20181204T113004-File-16.png   

                                      json_file_path  
0  D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...  
1  D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...  
2  D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...  
3  D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...  
4  D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...  


Now we need to match the annotated spectrograms with the wav files that generated them. We have the "image_name" info; everything before the ".png" (call it X) references a file called XMetadata. Inside that XMetadata file there is a "FileName" field, which holds the raw title that matches the hydrophone recording title, and the windows plotted in the spectrogram. As an example:

    image_name: 20181204T100004-File-0.png
    X: 20181204T100004-File-0
    Metadata file: 20181204T100004-File-0Metadata
    FileName: 181204-100002-437599-806141979_Spectrograms_20Hz.mat
    WindowsPlotted: "1-1201"
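
The naming convention above can be sketched as two small helpers. This is only an illustration: the function names `metadata_path_for` and `wav_path_for` are hypothetical, the directory constants are the ones used in the cells of this notebook, and the sketch assumes every FileName ends in "_Spectrograms_20Hz.mat" as in the example.

```python
# Directories as used elsewhere in this notebook
META_DIR = 'D:/1Dec2018_28Feb2019/MLFigsMeta/'
WAV_DIR = 'D:/1Dec2018_28Feb2019/Hydrophone/'
# Assumption: every FileName in the metadata ends with this suffix
SPECTROGRAM_SUFFIX = '_Spectrograms_20Hz.mat'


def metadata_path_for(image_name):
    """Map an image name (X.png) to the path of its XMetadata.txt file."""
    if not image_name.endswith('.png'):
        raise ValueError(f'unexpected image name: {image_name}')
    stem = image_name[:-len('.png')]
    return META_DIR + stem + 'Metadata.txt'


def wav_path_for(mat_filename):
    """Map the metadata FileName entry to the hydrophone wav path."""
    if not mat_filename.endswith(SPECTROGRAM_SUFFIX):
        raise ValueError(f'unexpected FileName: {mat_filename}')
    return WAV_DIR + mat_filename[:-len(SPECTROGRAM_SUFFIX)] + '.wav'


print(metadata_path_for('20181204T100004-File-0.png'))
print(wav_path_for('181204-100002-437599-806141979_Spectrograms_20Hz.mat'))
```

Checking the suffix explicitly, rather than slicing off a fixed number of characters, makes an unexpected file name fail loudly instead of silently producing a wrong wav path.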

In [3]:
#Getting the image prefixes
image_names = list(annotation_df['image_name'])
image_start = [name.replace('.png', '') for name in image_names]

#Creating all the metadata file names
metadata_names = []
for image_name in image_start:
    metadata_names.append('D:/1Dec2018_28Feb2019/MLFigsMeta/'+image_name+'Metadata.txt')
    
#Adding the metadata names to the table with annotation info
annotation_df['metadata_file_path'] = metadata_names
annotation_df.head()

Unnamed: 0,sound,points,image_name,json_file_path,metadata_file_path
0,mooring,"[[382.56626506024094, 614.4819277108433], [458...",20181204T100004-File-8.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1000...
1,helicopter,"[[1920.952380952381, 1048.1904761904761], [216...",20181204T100004-File-8.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1000...
2,mooring,"[[424.7349397590361, 748.2168674698794], [490....",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...
3,mooring,"[[1209.0722891566263, 938.5783132530119], [127...",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...
4,mooring,"[[1645.2168674698794, 744.6024096385541], [173...",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...


For the images that match with a metadata file (we don't have metadata for some files because annotated images are from some time in 2018 to 2/23/2019, but metadata files only run from 12/04/2018 to 01/14/2019), let's pull the windows plotted from that metadata file.

In [4]:
#List of metadata file names
metadata_file_names = annotation_df['metadata_file_path'].unique()

#Creating a metadata df to store info taken from each file
metadata_df = pd.DataFrame(columns=['metadata_file_path','filename','wav_filename', 'windows_plotted'])

for file in metadata_file_names:
    #Read all the lines in the filepath & save in array if it exists
    #Otherwise add it to a list for printing later
    try:
        with open(file, 'r') as meta_file:
            text = meta_file.readlines()
        filename = text[1][10:-2]
        #starttime = text[2][11:-2]
        windows_plotted = text[3][16:-2]
        wav_filename = 'D:/1Dec2018_28Feb2019/Hydrophone/'+filename[:-22]+'.wav'
        metadata_df.loc[len(metadata_df.index)] = [file, filename, wav_filename, windows_plotted]
    except FileNotFoundError:
        metadata_df.loc[len(metadata_df.index)] = [file, None, None, None]

#Matching the metadata_df information back to annotation_df
meta_annotation_df = pd.merge(left = annotation_df, right = metadata_df, how = 'left', on='metadata_file_path')
meta_annotation_df.head()

Unnamed: 0,sound,points,image_name,json_file_path,metadata_file_path,filename,wav_filename,windows_plotted
0,mooring,"[[382.56626506024094, 614.4819277108433], [458...",20181204T100004-File-8.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1000...,181204-100002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-100002...,9602-10801
1,helicopter,"[[1920.952380952381, 1048.1904761904761], [216...",20181204T100004-File-8.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1000...,181204-100002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-100002...,9602-10801
2,mooring,"[[424.7349397590361, 748.2168674698794], [490....",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...,181204-113002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-113002...,19202-20401
3,mooring,"[[1209.0722891566263, 938.5783132530119], [127...",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...,181204-113002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-113002...,19202-20401
4,mooring,"[[1645.2168674698794, 744.6024096385541], [173...",20181204T113004-File-16.png,D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chri...,D:/1Dec2018_28Feb2019/MLFigsMeta/20181204T1130...,181204-113002-437599-806141979_Spectrograms_20...,D:/1Dec2018_28Feb2019/Hydrophone/181204-113002...,19202-20401


Now that we have the windows_plotted information, let's make a column for the "File-#" in image_name to see if they always match. We would expect File-0 files to have windows 1-1201, File-1 to have windows 1202-2401, etc.

In [14]:
#Getting the file number from each image_name
meta_annotation_df['file_number'] = meta_annotation_df['image_name'].str[16:-4]

#Grouping the df by file_number and windows plotted, hoping for a 1:1 ratio
file_window_df = meta_annotation_df.groupby(['file_number','windows_plotted']).count().sort_values('file_number', ascending=True).reset_index()
print(list(file_window_df['file_number']))
print(list(file_window_df['windows_plotted']))

['File-0', 'File-1', 'File-10', 'File-11', 'File-12', 'File-13', 'File-14', 'File-15', 'File-16', 'File-17', 'File-18', 'File-19', 'File-2', 'File-20', 'File-21', 'File-22', 'File-23', 'File-24', 'File-25', 'File-26', 'File-27', 'File-28', 'File-29', 'File-3', 'File-4', 'File-5', 'File-6', 'File-7', 'File-8', 'File-9']
['1-1201', '1202-2401', '12002-13201', '13202-14401', '14402-15601', '15602-16801', '16802-18001', '18002-19201', '19202-20401', '20402-21601', '21602-22801', '22802-24001', '2402-3601', '24002-25201', '25202-26401', '26402-27601', '27602-28801', '28802-30001', '30002-31201', '31202-32401', '32402-33601', '33602-34801', '34802-35904', '3602-4801', '4802-6001', '6002-7201', '7202-8401', '8402-9601', '9602-10801', '10802-12001']
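
The two lists above are sorted as strings (File-10 comes before File-2), which makes the pairing hard to check by eye. Below is a minimal sketch of a numeric check. It assumes the pattern suggested by the output, 1200 windows per image with File-0 starting at window 1; `file_index`, `expected_windows` and `WINDOWS_PER_FILE` are hypothetical names, and `observed` is a hand-copied subset of the pairs printed above.

```python
import re

# Assumption inferred from the printed output: 1200 windows per image
WINDOWS_PER_FILE = 1200


def file_index(file_number):
    """Extract the integer N from a 'File-N' string."""
    match = re.fullmatch(r'File-(\d+)', file_number)
    if match is None:
        raise ValueError(f'unexpected file number: {file_number}')
    return int(match.group(1))


def expected_windows(file_number):
    """Predict the windows_plotted string for a given 'File-N'."""
    n = file_index(file_number)
    start = 1 if n == 0 else n * WINDOWS_PER_FILE + 2
    end = (n + 1) * WINDOWS_PER_FILE + 1
    return f'{start}-{end}'


# Subset of (file_number, windows_plotted) pairs copied from the output above
observed = {
    'File-0': '1-1201',
    'File-1': '1202-2401',
    'File-2': '2402-3601',
    'File-8': '9602-10801',
    'File-16': '19202-20401',
    'File-29': '34802-35904',
}

# Walk the files in numeric order and keep only the disagreements
mismatches = {
    name: (windows, expected_windows(name))
    for name, windows in sorted(observed.items(), key=lambda kv: file_index(kv[0]))
    if windows != expected_windows(name)
}
print(mismatches)  # {'File-29': ('34802-35904', '34802-36001')}
```

In this subset only File-29 disagrees, and only in its end window, which is what we would see if the last image simply covers a shorter remainder. Under that assumption the start window is recoverable from the file number alone.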
