# Image Curation
This notebook curates the images for the training and testing datasets. Images that are not relevant to the project are removed, and the remaining images are renamed to a standard format.


## Methodology
I have a folder of images that need annotations. To keep them organized, I want to number them sequentially from 01 to n. The files can be `.png`, `.jpg`, `.jpeg`, or `.webp`. When I add new images to the collection, the images that are already numbered should keep their existing names and order.

The images in the collection should be named image_01, image_02, image_03, and so on.
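
The naming rule above can be sketched as a small pure function that only plans the renames and does not touch the disk. This is a minimal sketch, not the notebook's own code: `plan_new_names` and `IMAGE_EXTENSIONS` are hypothetical names, and it assumes the four extensions listed above are the only image types and that two-digit indices are sufficient.

```python
import os
import re

# Assumption: the four extensions from the methodology are the only image types.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def plan_new_names(filenames):
    """Map each unindexed image to the lowest unused image_NN name (hypothetical helper)."""
    used = set()
    pending = []
    for name in sorted(filenames):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in IMAGE_EXTENSIONS:
            continue  # skip anything that is not an image
        match = re.fullmatch(r"image_(\d{2})", stem)
        if match:
            used.add(int(match.group(1)))  # already indexed: leave untouched
        else:
            pending.append(name)

    plan = {}
    candidate = 1
    for name in pending:
        # advance to the next index that no existing or planned file uses
        while candidate in used:
            candidate += 1
        plan[name] = f"image_{candidate:02d}{os.path.splitext(name)[1]}"
        used.add(candidate)
    return plan


print(plan_new_names(["image_01.jpg", "image_03.png", "cat.webp", "dog.jpeg"]))
# {'cat.webp': 'image_02.webp', 'dog.jpeg': 'image_04.jpeg'}
```

Returning a plan instead of renaming directly makes it possible to inspect the mapping before any file is changed. The gap at index 02 is filled first, and the already-indexed files keep their names.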

In [14]:
import os

# build the path with os.path.join to avoid the invalid '\m' escape sequence
image_directory = os.path.join('..', 'ml_train', 'base_images')

# list all of the images in the directory
image_list = os.listdir(image_directory)

# print the number of images in the directory
print('Number of images in the directory: ', len(image_list))

# get count of images by file extension
image_extensions = [os.path.splitext(x)[1] for x in image_list]
print('Count of images by file extension: ', {ext: image_extensions.count(ext) for ext in sorted(set(image_extensions))})

# get a list of images that are already named by index
images_already_indexed = [x for x in image_list if x.startswith('image_')]
print(images_already_indexed)

# keep only the names with exactly two digits between the underscore and the extension
images_already_indexed = [x for x in images_already_indexed if len(os.path.splitext(x)[0]) == 8 and x[6:8].isdigit()]
print(images_already_indexed)

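# get a sorted list of image files that are not already indexed (other files and folders are skipped)
images_to_index = sorted(x for x in image_list if x not in images_already_indexed and os.path.splitext(x)[1].lower() in ('.png', '.jpg', '.jpeg', '.webp'))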
print(images_to_index)

# create a list of unused index values for the images that need names, skipping the indices already taken
used_indices = {int(x[6:8]) for x in images_already_indexed}
index_values = [i for i in range(1, len(images_to_index) + len(images_already_indexed) + 1) if i not in used_indices]

# create a list of new image names without extensions
new_image_names = ['image_' + str(x).zfill(2) for x in index_values]
print(new_image_names)

# create a list of new image names with extensions
new_image_names_with_extensions = [x + os.path.splitext(y)[1] for x,y in zip(new_image_names, images_to_index)]
print(new_image_names_with_extensions)

# rename files in the directory
for x,y in zip(images_to_index, new_image_names_with_extensions):
    os.rename(os.path.join(image_directory, x), os.path.join(image_directory, y))

# Print out how many images were renamed, and how many images were not renamed
print('Number of images renamed: ', len(images_to_index))
print('Number of images not renamed: ', len(images_already_indexed))

Number of images in the directory:  43
Count of images by file extension:  {'.jpeg': 21, '.jpg': 13, '.png': 1, '.webp': 8}
['image_01.jpg', 'image_02.jpeg', 'image_03.jpeg', 'image_04.webp', 'image_05.jpeg', 'image_06.jpeg', 'image_07.jpeg', 'image_08.jpeg', 'image_09.jpeg', 'image_10.jpeg', 'image_11.jpeg', 'image_12.jpeg', 'image_13.jpeg', 'image_14.png', 'image_15.jpeg', 'image_16.webp', 'image_17.jpg', 'image_18.jpeg', 'image_19.jpg', 'image_20.jpeg', 'image_21.jpeg', 'image_22.jpeg', 'image_23.jpeg', 'image_24.webp', 'image_25.jpeg', 'image_26.jpeg', 'image_27.jpeg', 'image_28.webp', 'image_29.jpg', 'image_30.jpg', 'image_31.jpg', 'image_32.jpg', 'image_33.jpg', 'image_34.jpg', 'image_35.jpg', 'image_36.webp', 'image_37.jpg', 'image_38.jpg', 'image_39.jpg', 'image_40.jpeg', 'image_41.webp', 'image_42.webp', 'image_43.webp']
['image_01.jpg', 'image_02.jpeg', 'image_03.jpeg', 'image_04.webp', 'image_05.jpeg', 'image_06.jpeg', 'image_07.jpeg', 'image_08.jpeg', 'image_09.jpeg', 'imag