## 1. Manual Cleaning

Some of the scraped images were not suitable for training a model (some weren't even of guitars). So the first step is to go through the images manually and discard any unusable ones. There is nothing to document here other than the results after the clean. It wasn't fun.



Note: In the raw, uncleaned data, Les Paul has 878 entries and Stratocaster has 901.

Unusable entries included blurry images, heavily watermarked images, logos, and any image containing more than one guitar body or only a non-distinguishing partial view.

In the manual cleaning process, 137 Stratocaster entries were removed. For Les Paul, only 76 entries were unsatisfactory. This leaves 764 Stratocaster entries and 802 Les Paul entries.
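As a quick sanity check on the figures above, the remaining counts can be recomputed from the raw totals and the number of removals. This is only a minimal sketch: the dictionary names are made up for illustration, and the numbers are copied from the text.

```python
# Counts copied from the notes above; the variable names are illustrative only.
raw_counts = {"les paul": 878, "stratocaster": 901}
removed_counts = {"les paul": 76, "stratocaster": 137}

# Remaining images per class after manual cleaning.
kept_counts = {name: raw_counts[name] - removed_counts[name] for name in raw_counts}

for name, kept in kept_counts.items():
    share_kept = kept / raw_counts[name]
    print(f"{name}: kept {kept} of {raw_counts[name]} ({share_kept:.1%})")
```

The two classes end up at a similar size, so no rebalancing looks necessary at this stage.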


## 2. Systematic Cleaning

Here I will outline the steps I am going to take in the programmatic cleaning:

General stuff:
1. Ensure all files use the same file type (in case colours differ between formats)
2. Convert all images to RGB colour space

Transformations:
1. All images need to be the same size to fit the model; aim for 224×224 pixels as the input standard.
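One way the 224×224 target could be met is sketched here with Pillow's `ImageOps.pad`, which scales an image to fit inside the target and pads the rest. The names `TARGET_SIZE` and `resize_with_padding` are hypothetical, and padding with black (rather than stretching or cropping) is an assumption, not a decision made in these notes.

```python
from PIL import Image, ImageOps

# Assumed model input size, taken from the plan above.
TARGET_SIZE = (224, 224)


def resize_with_padding(img, size=TARGET_SIZE, fill=(0, 0, 0)):
    """Return an RGB copy of img scaled to fit inside size and padded to exactly size."""
    rgb = img.convert("RGB")
    # ImageOps.pad keeps the aspect ratio and centres the image on a fill-coloured canvas.
    return ImageOps.pad(rgb, size, method=Image.LANCZOS, color=fill)


if __name__ == "__main__":
    # Synthetic landscape image standing in for a scraped photo.
    sample = Image.new("RGB", (400, 200), (255, 0, 0))
    resized = resize_with_padding(sample)
    print(resized.size, resized.mode)
```

Padding keeps the guitar's proportions intact, which may matter when body shape is the distinguishing feature. Stretching to 224×224 would avoid the borders but distort the outline.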


In [None]:
from PIL import Image
import os

BASE_DIR = r'C:\Users\archi\Data stuff\Guitar classifier\data\test'
CLASSES = ['stratocaster', 'les paul']
# ---------------------

def convert_to_rgb_and_standardize():
    for class_name in CLASSES:
        folder_path = os.path.join(BASE_DIR, class_name)
        print(f"Processing images in: {folder_path}")
        
        for filename in os.listdir(folder_path):
            file_path = os.path.join(folder_path, filename)
            
            # Skip directories
            if not os.path.isfile(file_path):
                continue
                
            try:
                # Open the image and convert it to the RGB colour space.
                # convert() loads the pixel data, so the result stays usable
                # after the source file has been closed.
                with Image.open(file_path) as img:
                    img_rgb = img.convert('RGB')

                # Define new filename with the standard .jpg extension.
                # Note: files sharing a base name (e.g. a.png and a.jpg)
                # map to the same output and will overwrite each other.
                base_name = os.path.splitext(filename)[0]
                new_file_path = os.path.join(folder_path, base_name + '.jpg')

                # Save as JPEG (overwrites existing JPGs, converts other formats)
                img_rgb.save(new_file_path, 'JPEG')

                # If the original is a different file (e.g. .png or .jpeg), delete it
                # so only the .jpg copy remains. This runs after the source file is
                # closed, because Windows can refuse to delete a file that is still open.
                if not os.path.samefile(file_path, new_file_path):
                    os.remove(file_path)

            except Exception as e:
                print(f"Failed to process {filename}: {e}")
                
if __name__ == "__main__":
    convert_to_rgb_and_standardize()

Processing images in: C:\Users\archi\Data stuff\Guitar classifier\data\test\stratocaster
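To check the result of the conversion, a small audit could count the format and colour mode that Pillow reports for each file in a class folder. This is a sketch only; `audit_folder` is a hypothetical helper and not part of the cell above.

```python
import os
from collections import Counter

from PIL import Image


def audit_folder(folder_path):
    """Count (format, mode) pairs for the files directly inside folder_path."""
    counts = Counter()
    for filename in sorted(os.listdir(folder_path)):
        file_path = os.path.join(folder_path, filename)
        if not os.path.isfile(file_path):
            continue
        try:
            with Image.open(file_path) as img:
                counts[(img.format, img.mode)] += 1
        except OSError:
            # Unreadable or non-image files are grouped under one key.
            counts[("unreadable", None)] += 1
    return counts
```

After a clean run, the only key expected in the result is `("JPEG", "RGB")`. Anything else points to a file that was skipped or failed to convert.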


