## Purpose

The purpose of this code is to produce a csv containing one row per image, with the image's file path and the file path of its most recent JSON annotation. This is necessary because images have been annotated multiple times; the csv will be used to move the appropriate images and JSON files to the labeling software/blob storage/etc.

Input(s):
1. String of file path to annotation JSON files. Variable initialized to path to "Annotation Stuff" on Chris-generated external hard drive. Code written to accept user input but commented out due to current ipynb format. In the event that this file becomes a .py file it can be uncommented.
2. String of file path to output csv save location. Variable initialized to current directory. Code written to accept user input but commented out due to current ipynb format. In the event that this file becomes a .py file it can be uncommented.

Output(s):
1. CSV file with each row corresponding to a unique image file with the following columns: image file name, image file path, JSON file path
    
We will begin by importing standard Python libraries.

In [2]:
import os
import pandas as pd

First we will define some constants for use in this code - namely the directory containing the image annotation files. The block of code commented out accepts user input in the event this becomes a .py file rather than .ipynb file.

In [3]:
#Constants of file paths & final data frame name
annotation_path = 'D:/Annotation Stuff/'
save_path = './'
image_annotation_df = pd.DataFrame()
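
If this notebook is later converted to a .py file, the two inputs could be read from the command line. The following is only a minimal sketch of one way to do that with `argparse`; the function name `parse_paths` and the flag names `--annotation-path` and `--save-path` are hypothetical and do not come from the original code. The defaults mirror the constants defined above.

```python
import argparse

def parse_paths(argv=None):
    """Parse the two inputs; defaults mirror the notebook constants."""
    parser = argparse.ArgumentParser(
        description="Build a csv of images and their most recent JSON annotations.")
    # Hypothetical flag names, chosen for illustration only
    parser.add_argument("--annotation-path", default="D:/Annotation Stuff/",
                        help="Directory containing the annotation subdirectories")
    parser.add_argument("--save-path", default="./",
                        help="Directory where the output csv is written")
    args = parser.parse_args(argv)
    return args.annotation_path, args.save_path

# Passing an explicit list keeps the example runnable inside a notebook;
# in a .py file, parse_paths() with no argument would read sys.argv instead.
annotation_path, save_path = parse_paths(["--save-path", "./output/"])
print(annotation_path, save_path)
```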

Next we will iterate through all of the directories and files stored in the annotation_path to save the directory, annotation file name, and image file name.

In [148]:
#Get list of subdirectories
subdir_list = [f.name for f in os.scandir(annotation_path) if f.is_dir()]

#Iterate through subdirectories, saving the name of the subdirectory,
#annotation files and image files
for subdir in subdir_list:
    subdir_path = annotation_path + subdir + '/'
    json_names = [f[:-5] for f in os.listdir(subdir_path) if f.endswith('.json')]
    image_names = [f[:-4] for f in os.listdir(subdir_path) if f.endswith('.png')]
    if len(json_names) != len(image_names):
        print("Different number of JSON and images ({0} vs. {1}) in {2}".format(
        len(json_names), len(image_names), subdir))
        unmatched_image = list(set(image_names)-set(json_names))
        unmatched_json = list(set(json_names)-set(image_names))
        unmatched_files = unmatched_image + unmatched_json
        print("The unmatched file has the name {0}".format(unmatched_files))
        print("CODER PLEASE FIX AND RE-RUN THIS CELL")
        image_annotation_df = pd.DataFrame()
        break
    else:
        subdirs = [subdir] * len(json_names)
        json_names.sort()
        image_names.sort()
        subdir_df = pd.DataFrame()
        subdir_df['subdirectory'] = subdirs
        #Image names go in the image column, JSON names in the JSON column
        subdir_df['image_file_name'] = image_names
        subdir_df['json_file_name'] = json_names
        image_annotation_df = pd.concat([image_annotation_df,subdir_df],
                                        ignore_index=True)
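
The loop above compares only the number of JSON and image files in each subdirectory. A directory with equal counts but different names would pass that check and only be caught by the pairing check later on. As a sketch of a stricter alternative, the names can be compared as sets. The helper `unmatched_stems` and the example listing are hypothetical and are not part of the original code.

```python
import os

def unmatched_stems(filenames):
    """Return (images_without_json, jsons_without_image) for a list of file names."""
    image_stems = {os.path.splitext(f)[0] for f in filenames if f.endswith('.png')}
    json_stems = {os.path.splitext(f)[0] for f in filenames if f.endswith('.json')}
    return sorted(image_stems - json_stems), sorted(json_stems - image_stems)

# Hypothetical directory listing: two images and two JSON files,
# so the counts agree, but the names do not pair up
example_listing = ['a.png', 'a.json', 'b.png', 'c.json']
images_only, jsons_only = unmatched_stems(example_listing)
print(images_only, jsons_only)  # ['b'] ['c']
```

In the loop, `os.listdir(subdir_path)` would take the place of `example_listing`, and a non-empty result in either list would trigger the same "please fix and re-run" message.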

Because images may have been annotated multiple times, we want to keep only the most recent annotation of each image. We do this by numbering the subdirectories by recency (e.g., 1 is the most recent annotation on Nov 9, 2 is the second most recent annotation on Nov 3, etc.) and keeping, for each image, the row with the minimum recency rank.

In [149]:
#Define function with recency
def label_recency(row):
    if row['subdirectory'] == 'MLFigsLabeled_Nov_09_Chris':
        return 1
    if row['subdirectory'] == 'MLFigsLabeled_Nov_03':
        return 2
    if row['subdirectory'] == 'MLFigs_Labeled_Oct_26_Chris':
        return 3
    if row['subdirectory'] == 'MlFigsLabeled_Oct_25_Chris':
        return 4
    
#Apply that function to the df
image_annotation_df['recency_rank'] = image_annotation_df.apply(label_recency, axis = 1)
#image_annotation_df

#Test that each image appears no more than 4 times (the # of subdirectories)
image_counts = image_annotation_df.groupby('image_file_name').count().reset_index()
more_than_four_copies_of_image = (image_counts['subdirectory']>4).any()
if more_than_four_copies_of_image:
    print("There are more than 4 copies of an image")
    print("The images in question are: ")
    print(image_counts.loc[image_counts['subdirectory']>4]['image_file_name'])
else:
    pass
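
Note that `label_recency` returns `None` for any subdirectory it does not list, which would silently leave a missing rank in the table. As a hedged sketch of an alternative, the same ranking can be kept in a dictionary with a lookup that fails loudly. The names `RECENCY_BY_SUBDIR` and `recency_rank` are hypothetical; the subdirectory names and ranks are copied from the function above.

```python
# Lower rank = more recent; entries copied from label_recency
RECENCY_BY_SUBDIR = {
    'MLFigsLabeled_Nov_09_Chris': 1,
    'MLFigsLabeled_Nov_03': 2,
    'MLFigs_Labeled_Oct_26_Chris': 3,
    'MlFigsLabeled_Oct_25_Chris': 4,
}

def recency_rank(subdirectory):
    """Return the recency rank, raising for a subdirectory with no rank."""
    if subdirectory not in RECENCY_BY_SUBDIR:
        raise KeyError("No recency rank defined for subdirectory: " + subdirectory)
    return RECENCY_BY_SUBDIR[subdirectory]

print(recency_rank('MLFigsLabeled_Nov_03'))  # 2
```

The lookup could then be applied to the table with `image_annotation_df['subdirectory'].map(recency_rank)`, so adding a new annotation folder means adding one dictionary entry.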

Now we will filter the table to keep only the most recent annotation of an image.

In [150]:
#Grouping df by image_file_name, then taking the min recency_rank per image
recent_images_df = image_annotation_df.groupby('image_file_name')['recency_rank'].min().reset_index()

#Verifying that each image is unique
#print(recent_images_df.groupby('image_file_name').count().max())

#Keeping only the images which are the most recent
final_images = pd.merge(left=recent_images_df, right=image_annotation_df,
                       how='left', on=['image_file_name', 'recency_rank'])
if len(final_images) != len(recent_images_df):
    print("Bad join - # of images in unique rankings != final # of images")
else:
    pass

#Verifying that each image is paired with its matching JSON
if len(final_images.loc[final_images['image_file_name'] != final_images['json_file_name']]) >0:
    print("Images are not paired with correct JSON - check code")
else:
    pass

#Verifying each final image is unique
#print(final_images.groupby('image_file_name').count().max())
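
The group-then-merge step above can also be written as a sort followed by a de-duplication. This is only a sketch of that alternative, not the method used in this notebook, and the rows in `example_df` are hypothetical data in the same shape as `image_annotation_df`.

```python
import pandas as pd

# Hypothetical rows in the same shape as image_annotation_df
example_df = pd.DataFrame({
    'subdirectory': ['MLFigsLabeled_Nov_03', 'MLFigsLabeled_Nov_09_Chris',
                     'MLFigsLabeled_Nov_03'],
    'image_file_name': ['File-0', 'File-0', 'File-1'],
    'json_file_name': ['File-0', 'File-0', 'File-1'],
    'recency_rank': [2, 1, 2],
})

# Sort so the most recent annotation (lowest rank) comes first,
# then keep the first row seen for each image
most_recent = (example_df
               .sort_values('recency_rank')
               .drop_duplicates(subset='image_file_name', keep='first')
               .sort_values('image_file_name')
               .reset_index(drop=True))
print(most_recent[['image_file_name', 'subdirectory']])
```

Because each image appears at most once per subdirectory, no two rows for the same image share a rank, so the result should match the merge-based approach while avoiding the join-size check.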

Finally, we will format the dataframe as agreed upon internally by the MSDS Capstone team and save it as a csv.

Output: CSV file with each row corresponding to a unique image file with the following columns: image file name, image file path, JSON file path

In [156]:
#Dropping the recency_rank col
clean_images = final_images.drop('recency_rank', axis = 1)

#Creating image file path
clean_images['image_file_path'] = (annotation_path + 
                                   clean_images['subdirectory'] + 
                                  '/' + clean_images['image_file_name'] +
                                  '.png')

#Creating JSON file path
clean_images['json_file_path'] = (annotation_path + 
                                   clean_images['subdirectory'] + 
                                  '/' + clean_images['json_file_name']+
                                  '.json')


#Dropping subdirectory & json filename col
clean_images = clean_images.drop(['subdirectory', 'json_file_name'], axis = 1)

#Save output to csv
clean_images.to_csv(os.path.join(save_path, 'unique_images_annotations.csv'), index = False)
print("Unique image filenames with most recent images and most recent "+
     "annotation json saved to a csv in current directory.")
clean_images.head()

Unique image filenames with most recent images and most recent annotation json saved to a csv in current directory.


| | image_file_name | image_file_path | json_file_path |
|---|---|---|---|
| 0 | 20181204T100004-File-0 | D:/Annotation Stuff/MLFigsLabeled_Nov_09_Chris... | D:/Annotation Stuff/MLFigsLabeled_Nov_09_Chris... |
| 1 | 20181204T100004-File-1 | D:/Annotation Stuff/MLFigsLabeled_Nov_03/20181... | D:/Annotation Stuff/MLFigsLabeled_Nov_03/20181... |
| 2 | 20181204T100004-File-10 | D:/Annotation Stuff/MLFigsLabeled_Nov_09_Chris... | D:/Annotation Stuff/MLFigsLabeled_Nov_09_Chris... |
| 3 | 20181204T100004-File-17 | D:/Annotation Stuff/MLFigsLabeled_Nov_09_Chris... | D:/Annotation Stuff/MLFigsLabeled_Nov_09_Chris... |
| 4 | 20181204T100004-File-21 | D:/Annotation Stuff/MLFigsLabeled_Nov_09_Chris... | D:/Annotation Stuff/MLFigsLabeled_Nov_09_Chris... |


The cell below checks that every file in the October 25th directory is also present in the October 26th directory. If so, no files from October 25th should appear in the final data, because the October 26th copies rank as more recent.

In [None]:
'''#Testing to see if everything in october 25th dir is in the oct 26th dir
#Get list of subdirectories
subdir_list = ['D:/Annotation Stuff/MlFigsLabeled_Oct_25_Chris', 'D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chris']

#Pulling the Oct 25 and 26th files
json_names_25 = [f[:-5] for f in os.listdir('D:/Annotation Stuff/MlFigsLabeled_Oct_25_Chris') if f.endswith('.json')]
json_names_26 = [f[:-5] for f in os.listdir('D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chris') if f.endswith('.json')]
image_names_25 = [f[:-4] for f in os.listdir('D:/Annotation Stuff/MlFigsLabeled_Oct_25_Chris') if f.endswith('.png')]
image_names_26 = [f[:-4] for f in os.listdir('D:/Annotation Stuff/MLFigs_Labeled_Oct_26_Chris') if f.endswith('.png')]

#Seeing what's in 10/25 which isn't in 10/26 - ideally nothing
json_25_26 = list(set(json_names_25)-set(json_names_26))
print(json_25_26)
image_25_26 = list(set(image_names_25)-set(image_names_26))
print(image_25_26)

#Seeing what's in 10/26 which isn't in 10/25 - should be many files based on folder item counts
json_26_25 = list(set(json_names_26)-set(json_names_25))
print(len(json_26_25))
image_26_25 = list(set(image_names_26)-set(image_names_25))
print(len(image_26_25))

#Checking that the 2 printed numbers above = the delta in file contents
print((692/2)-(234/2))'''
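
The check in the cell above can be reduced to a set difference between two directories. The sketch below illustrates that idea on temporary directories so it can run anywhere; the helpers `stems_with_extension` and `missing_from_newer` and the file names are hypothetical, and the real directories would be passed in place of the temporary ones.

```python
import os
import tempfile

def stems_with_extension(directory, extension):
    """Return the set of file names (without extension) ending in `extension`."""
    return {os.path.splitext(f)[0] for f in os.listdir(directory)
            if f.endswith(extension)}

def missing_from_newer(older_dir, newer_dir, extension):
    """Return names present in older_dir but absent from newer_dir."""
    return sorted(stems_with_extension(older_dir, extension)
                  - stems_with_extension(newer_dir, extension))

# Hypothetical stand-ins for the Oct 25 and Oct 26 directories
with tempfile.TemporaryDirectory() as older, tempfile.TemporaryDirectory() as newer:
    for name in ['a.json', 'b.json']:
        open(os.path.join(older, name), 'w').close()
    for name in ['a.json', 'b.json', 'c.json']:
        open(os.path.join(newer, name), 'w').close()
    missing = missing_from_newer(older, newer, '.json')

print(missing)  # [] means everything in the older directory is also in the newer one
```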