<a href="https://colab.research.google.com/github/ciupava/Yolo_experiments/blob/main/YoloPreprocessing_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Preprocessing** ###

This notebook adapts the one provided by Omdena to HOT as an outcome of the challenge that took place in summer 2024.
(The original notebook downloads and preprocesses the RAMP data package (RAMP-DATA-V0), which comprises over 100k image-label pairs in TIF-GEOJSON format across 22 regional folders. The output of that notebook was the YOLO-DATA-V1 dataset. The YOLO-DATA-V2 and YOLO-DATA-V3 datasets were created using the Pruning notebook.)

Summary of the tasks accomplished in the notebook:

1. Data Exploration
    
 - Obtain data from the dedicated folder (currently in HOT's Google Drive).
 - Complete basic data integrity checks.
 - Count the image and label files and confirm that each image has a label with the matching name.
 - Check the shape of the images.

~2. Enhance the dataset~
 - ~Create a dataframe for analysis and data control.~
 - ~Set Shanghai and Paris subfolders from the dataset by flagging 'use' to False for image shape is not (256, 256, 3).~
 - ~Set 'use' flag to False randomly for 13% of background images to reduce from 18% to 5%.~
 - ~Identify images found with incorrect image-label in the exclusions_list.txt~

2. Data wrangling

 - Assess all the folders, one per city
 - Get the train/valid split list for the set of images, as it was used in the first experiment run on the RAMP metric

3. Convert data (TIF/GEOJSON) to YOLO (JPG/TXT)

 - ~Image-label pairs were randomly assigned to train-val-test folders using a 70-15-15 split.~
 - Generate text files in YOLO's segmentation format from the GEOJSON and TIF files. Polygons are extracted from each GEOJSON file and aligned with the location information embedded in the TIF. Note: the origin (0, 0) is the top-left corner, and coordinates are normalized to the range 0 to 1.
 - Convert the TIF images to JPG at quality level 100.
 - Save to separate folders (one per city)
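
As a minimal sketch of the label format described in step 3 (not the notebook's own implementation): each line of a YOLO segmentation label file holds a class index followed by the polygon vertices as `x y` pairs, normalized to the range 0 to 1 with the origin at the top-left corner. The helper name `to_yolo_line` and the pixel-based input are assumptions made for illustration only.

```python
def to_yolo_line(pixel_polygon, img_width, img_height, class_index=0):
    """Format one polygon as a YOLO segmentation line (hypothetical helper)."""
    values = []
    for px, py in pixel_polygon:
        # Normalize by the image size; (0, 0) is the top-left corner
        values.append(round(px / img_width, 6))
        values.append(round(py / img_height, 6))
    return f"{class_index} " + " ".join(str(v) for v in values)


# A square building footprint in a 256 x 256 chip (illustrative values)
line = to_yolo_line([(64, 64), (192, 64), (192, 192), (64, 192)], 256, 256)
print(line)  # 0 0.25 0.25 0.75 0.25 0.75 0.75 0.25 0.75
```

The notebook itself starts from geographic coordinates rather than pixels, but the resulting text line has the same layout.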


Note: A GPU is not required for preprocessing. Because of the numerous disk I/O operations, running this notebook locally is preferred over running it on Google Colab.

### Initial import

In [1]:
!pip3 install -q rasterio
!pip3 install -q pyproj

In [3]:
import os
import glob
import json
import cv2
import pandas as pd
import matplotlib.pyplot as plt
import random
from pyproj import Transformer
import rasterio
import matplotlib.image as mpimg
from PIL import Image
import numpy as np
import yaml
import gc
from tqdm import tqdm

In [4]:
# Connecting to 'My Drive', to be able to access local data

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
os.chdir('/content/drive/My Drive/YOLO_test/data/metric_test_data/')
os.listdir()


['model51_td364',
 'model95_td370',
 'model97_td372',
 'model98_td373',
 'model102_td391',
 'model108_td529',
 'model110_td394',
 'model112_td397',
 'model113_td398',
 'model114_td399',
 'model134_td456',
 'model135_td459',
 'model136_td462',
 'model137_td463',
 'model147_td485',
 'model148_td488',
 'model149_td489',
 'model158_td508',
 'model162_td530',
 'model163_td523',
 'model164_td524',
 'model165_td525',
 'model166_td526',
 'model167_td528',
 'model168_td539']

### Defining path variables

In [9]:
base_path = os.getcwd()
print(f"\n---\nCurrent working directory {base_path}")

list_filename_tent = os.path.join(os.getcwd(), "../../cities_list.txt")
print(f'Tentative: {os.path.normpath(list_filename_tent)}')
list_filename = os.path.normpath(os.path.join(os.getcwd(), "../../cities_list.txt"))
print(f'File with list of city names to be used: {list_filename}')


---
Current working directory /content/drive/MyDrive/YOLO_test/data/metric_test_data
Tentative: /content/drive/MyDrive/YOLO_test/cities_list.txt
File with list of city names to be used: /content/drive/MyDrive/YOLO_test/cities_list.txt


In [10]:
# Check that the file listing the training datasets (cities) exists
if not os.path.exists(list_filename):
    raise FileNotFoundError(f"Can't find file {list_filename}")

# Generate the list of regions (cities) from the input text file
print(f"---\nI am going to get the names from {list_filename} (name of the list file you provided)")

# Read one name per line, skipping blank lines and commented lines (starting with "#")
with open(list_filename, 'r') as f:
    cities_list = []
    for line in f.read().split('\n'):
        line = line.strip()
        if line and not line.startswith('#'):
            cities_list.append(line)

# Stop early if the file contained no usable names
if not cities_list:
    raise ValueError(f"No city names found in {list_filename}")
print(f"---\nList of cities: {cities_list}")

---
I am going to get the names from /content/drive/MyDrive/YOLO_test/cities_list.txt (name of the list file you provided)
---
List of cities: ['model51_td364']


### Defining functions


Function to create lists of the chip (.tif) and label (.geojson) files with their paths

In [None]:
# Changed from Omdena's original version, to take into account the different folder structure
def find_files():
    """
    Find chip (.tif) and label (.geojson) files in the specified folders.

    Returns:
    cwps (list): List of chip filenames with path.
    lwps (list): List of label filenames with path.
    base_folders (list): List of base folder names.
    """

    # Find the folders
    data_folders = glob.glob(base_path)

    # Create lists to store chip (.tif) and label (.geojson) filenames with path
    cwps = []
    lwps = []

    # Create a list to store the base folder names
    base_folders = []

    for folder in data_folders:
        print(f'folder is {folder}')
        # Pattern to match all .tif files in the current folder, including subdirectories
        tif_pattern = f"{folder}/**/**/**/*.tif"
        print(f'tif pattern is {tif_pattern}')
        # Find all .tif files in the current folder and its subdirectories
        found_tif_files = glob.glob(tif_pattern, recursive=True)
        print(f'found {len(found_tif_files)} tif files (masks included)')
        # Filter out .mask.tif files and add the rest to the chip list
        for file in found_tif_files:
            if not file.endswith('mask.tif'):
                cwps.append(file)

        # Pattern to match all .geojson files in the current folder, including subdirectories
        geojson_pattern = f"{folder}/**/**/**/*.geojson"
        print(f'geojson pattern is {geojson_pattern}')
        # Find all .geojson files
        found_geojson_files = glob.glob(geojson_pattern, recursive=True)
        print(f'found {len(found_geojson_files)} geojson files')
        # Add found .geojson files to the label list
        lwps.extend(found_geojson_files)

    # Sort the lists
    cwps.sort()
    lwps.sort()

    # Assert that the number of files of each type is the same
    assert len(cwps) == len(lwps), "Number of tif files and label files do not match"

    # Check that the filenames match
    for n, cwp in enumerate(cwps):
        c = os.path.basename(cwp).replace('.tif', '')
        l = os.path.basename(lwps[n]).replace('.geojson', '')

        assert c == l, f"Chip and label filenames do not match: {c} != {l}"

        # Store the top-level folder under base_path (cwp is an absolute path here,
        # so cwp.split('/')[1] would always return 'content')
        base_folders.append(os.path.relpath(cwp, base_path).split(os.sep)[0])

    return cwps, lwps, base_folders

# # Call the function and print the number of found files
# cwps, lwps, base_folders = find_files()  ##### MOVED TO AFTER !!!! ######
# print('Found {} chip files'.format(len(cwps)))
# N = 6
# test_list_cwps = cwps[:N]
# print(f"The first {N} elements of list are : {str(test_list_cwps)}")
# print('Found {} label files\n'.format(len(lwps)))
# test_list_lwps = lwps[:N]
# print(f"The first {N} elements of list are : {str(test_list_lwps)}")
# # Print message if all filenames match
# print('All filenames match; each tif has a label!')
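
The matching step in `find_files()` relies on both sorted lists lining up index by index. As an alternative sketch (not the author's method), chips and labels can be paired by file stem, which also reports which files are unmatched instead of stopping at the first assertion. The helper name `pair_chips_and_labels` is hypothetical, and the sketch assumes file stems are unique across subfolders.

```python
import tempfile
from pathlib import Path


def pair_chips_and_labels(root):
    """Return (pairs, chips_without_label, labels_without_chip) found under root."""
    root = Path(root)
    # Index files by stem; *.mask.tif files are skipped, as in find_files()
    chips = {p.stem: p for p in root.rglob("*.tif") if not p.name.endswith("mask.tif")}
    labels = {p.stem: p for p in root.rglob("*.geojson")}
    common = sorted(chips.keys() & labels.keys())
    pairs = [(str(chips[s]), str(labels[s])) for s in common]
    return pairs, sorted(chips.keys() - labels.keys()), sorted(labels.keys() - chips.keys())


# Demonstration on a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["chips/a.tif", "chips/b.tif", "chips/a.mask.tif", "labels/a.geojson"]:
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    pairs, no_label, no_chip = pair_chips_and_labels(tmp)

print(len(pairs), no_label, no_chip)  # 1 ['b'] []
```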

Function to check the data shapes and store them in a dictionary

In [None]:
def check_shapes(iwps):
    """
    Check the shapes of image files and store them in a dictionary.

    Parameters:
    iwps (list): A list of image files with paths.

    Returns:
    tuple: A tuple containing two elements:
        - shapes_dict (dict): A dictionary where the keys are the image shapes and the values are the counts.
        - shapes (list): A list of the shapes of the chip files in the same order as the input list.
    """
    # Create a dictionary to store the shape of the chip files
    shapes_dict = {}
    shapes = []

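    for iwp in tqdm(iwps):
        # Read the chip file unchanged (keeps the original channels and bit depth)
        img = cv2.imread(iwp, cv2.IMREAD_UNCHANGED)
        if img is None:
            raise ValueError(f"Could not read image file {iwp}")
        shape = img.shape

        # Count how many images have this shape
        shapes_dict[str(shape)] = shapes_dict.get(str(shape), 0) + 1

        shapes.append(shape)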

    # Return the dictionary
    return shapes_dict, shapes

# # Get shapes data     ##### MOVED TO AFTER !!!! ######
# shapes_data = check_shapes(cwps)

# # Print the shape of the first chip file
# print(f'Chip shapes with counts are: {shapes_data[0]}')
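
The tally built by `check_shapes()` can also be expressed with `collections.Counter`. The sketch below is only an illustration on hard-coded shapes (no images are read); `count_shapes` is a hypothetical helper, and it keeps the shapes as tuples rather than converting them to strings.

```python
from collections import Counter


def count_shapes(shapes):
    """Tally how many times each image shape occurs (hypothetical helper)."""
    return Counter(tuple(shape) for shape in shapes)


# Illustrative shapes, as they would be returned by cv2.imread(...).shape
shapes = [(256, 256, 3), (256, 256, 3), (512, 512, 3)]
shape_counts = count_shapes(shapes)
print(shape_counts.most_common())  # [((256, 256, 3), 2), ((512, 512, 3), 1)]
```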

A few functions that process the geospatial data to make it available in the YOLO format,

i.e. the TIF becomes a JPG and the GEOJSON becomes a TXT

In [None]:
def get_geo_data(iwp):
    """
    Extracts geo data from a geotif.

    Parameters:
    iwp (str): The image file with path.

    Returns:
    dict: A dictionary containing the extracted geo data. The dictionary includes the following keys:
        - 'left': The left coordinate of the bounding box.
        - 'right': The right coordinate of the bounding box.
        - 'top': The top coordinate of the bounding box.
        - 'bottom': The bottom coordinate of the bounding box.
        - 'width': The width of the bounding box.
        - 'height': The height of the bounding box.
        - 'crs': The coordinate reference system (CRS) of the geotif.
    """
    # Open the GeoTIFF with rasterio to read its CRS and bounds
    with rasterio.open(iwp) as src:

        if src.crs is None:
            raise ValueError("No CRS found in the image file. Please check the file and try again.")
        elif src.bounds is None:
            raise ValueError("No bounds found in the image file. Please check the file and try again.")

        # Convert the bounds to the expected format
        transformer = Transformer.from_crs(src.crs, 'EPSG:4326')
        left, bottom = transformer.transform(src.bounds.left, src.bounds.bottom)
        right, top = transformer.transform(src.bounds.right, src.bounds.top)
        width = right - left
        height = top - bottom

        # Collect and return the extracted geo data
        results = {'left': left, 'right': right, 'top': top, 'bottom': bottom, 'width': width, 'height': height, 'crs': src.crs.to_string()}

        return results

In [None]:
def check_and_clamp(values):
    """
    Check and clamp the values in a nested list.

    Parameters:
    values (list): A nested list of values to be checked and clamped.

    Returns:
    list: A nested list of clamped values.

    """
    # Initialize an empty list to store the clamped values
    clamped_values = []

    # Iterate over each sublist in the list
    for sublist in values:
        # Use a list comprehension to check and clamp each value in the sublist
        clamped_sublist = [[max(0, min(1, value)) for value in pair] for pair in sublist]

        # Add the processed sublist to the clamped_values list
        clamped_values.append(clamped_sublist)

    return clamped_values

# Example usage
# This function clamps polygons coordinates that go outside of the coordinate bounds of the chip.
# YOLO requires all coordinates be in the range of (0, 1)
# test = [[[0.998301, 0.642099], [0.997246, 1.417954], [-0.99515, 0.404696], [0.997182, 0.40438], [0.996855, 0.334872], [0.926036, 0.346056], [0.973166, 0.646062], [0.998301, 0.642099]]]
# print(check_and_clamp(test))

In [None]:
test = [[[0.998301, 0.642099], [0.997246, 1.417954], [-0.99515, 0.404696], [0.997182, 0.40438], [0.996855, 0.334872], [0.926036, 0.346056], [0.973166, 0.646062], [0.998301, 0.642099]]]
print(check_and_clamp(test))

In [None]:
def flatten_list(nested_list):
    """
    Flattens a nested list into a single flat list.

    Parameters:
    nested_list (list): The nested list to be flattened.

    Returns:
    list: The flattened list.
    """
    flat_list = []

    # Iterate over all the elements in the given list
    for item in nested_list:
        # Check if the item is a list itself
        if isinstance(item, list):
            # If the item is a list, extend the flat list by adding elements of this item
            flat_list.extend(flatten_list(item))
        else:
            # If the item is not a list, append the item itself
            flat_list.append(item)
    return flat_list

In [None]:
def convert_coordinates(coordinates, geo_dict):
    """
    Convert coordinates from one coordinate system to another based on the provided geo_dict.

    Args:
        coordinates (list): A list of coordinate sets.
        geo_dict (dict): A dictionary containing information about the coordinate system.

    Returns:
        list: The converted coordinates.

    Raises:
        AssertionError: If the maximum coordinate value is greater than 1 or the minimum coordinate value is less than 0.
    """
    # Iterate over the outer list (the rings of the polygon)
    for i in range(len(coordinates)):
        # Iterate over each coordinate pair in the ring
        for j in range(len(coordinates[i])):
            x, y = coordinates[i][j][0], coordinates[i][j][1]
            if geo_dict['crs'] == 'EPSG:4326':
                # Chip in EPSG:4326: the bounds in geo_dict are in (lon, lat) order
                coordinates[i][j] = [round((x - geo_dict['left'])/geo_dict['width'], 6),
                                     round((geo_dict['top'] - y)/geo_dict['height'], 6)]
            else:
                # Chip in another CRS: get_geo_data() reprojects the bounds with pyproj's
                # default axis order for EPSG:4326 (lat, lon), so 'bottom'/'height' hold the
                # longitude range and 'right'/'width' hold the latitude range
                coordinates[i][j] = [round((x - geo_dict['bottom'])/geo_dict['height'], 6),
                                     round((geo_dict['right'] - y)/geo_dict['width'], 6)]

    coordinates = check_and_clamp(coordinates)

    # Make sure that the coordinates are within the expected range
    assert max(flatten_list(coordinates)) <= 1, "The maximum coordinate value is greater than 1"
    assert min(flatten_list(coordinates)) >= 0, "The minimum coordinate value is less than 0"

    return coordinates
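
To make the mapping in `convert_coordinates()` concrete, here is a small worked sketch in plain Python. It assumes the chip bounds are already expressed as longitude/latitude; the function name `normalize_point` and the bounds are made up for illustration and deliberately coarse so the arithmetic is easy to follow.

```python
def normalize_point(lon, lat, left, right, top, bottom):
    """Map a lon/lat point to chip-relative (x, y) in [0, 1], origin at the top-left."""
    x = (lon - left) / (right - left)
    # y grows downwards in image space, so measure from the top edge
    y = (top - lat) / (top - bottom)
    # Clamp, since polygons may extend beyond the chip
    x = min(1.0, max(0.0, x))
    y = min(1.0, max(0.0, y))
    return (round(x, 6), round(y, 6))


# Hypothetical chip bounds (degrees)
bounds = dict(left=10.0, right=12.0, top=46.0, bottom=44.0)

inside = normalize_point(11.0, 45.5, **bounds)
outside = normalize_point(13.0, 44.0, **bounds)
print(inside)   # (0.5, 0.25)
print(outside)  # (1.0, 1.0): x was 1.5 before clamping
```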

Function to write LABELS to Yolo format

In [None]:
def write_yolo_file(iwp, folder, class_index=0):
    """
    Writes a YOLO label file based on the given image with path and class index.

    Args:
        iwp (str): The image with path.
        folder (str): The output subfolder under 'ramp_data_yolo' (e.g. 'train', 'val' or 'test').
        class_index (int, optional): The class index for the YOLO label. Defaults to 0.

    Returns:
        None
    """

    # Get the GeoJSON filename with path from the chip filename with path
    # (the label path mirrors the chip path, with 'chips' replaced by 'labels')
    lwp = iwp.replace(".tif", ".geojson").replace("chips", "labels")

    # Create the YOLO label filename with path from the chip filename with path
    ywp = os.path.join('ramp_data_yolo', folder, 'labels', os.path.basename(iwp).replace('.tif', '.txt'))

    # Create the YOLO label folder if it does not exist
    os.makedirs(os.path.dirname(ywp), exist_ok=True)

    # Fetch the chip's geo data (bounds and CRS)
    geo_dict = get_geo_data(iwp)

    # Open the GeoJSON file
    with open(lwp, 'r') as file:
        data = json.load(file)

    # Build one YOLO line per polygon: "<class_index> x1 y1 x2 y2 ..."
    yolo_lines = []
    for feature in data['features']:
        if feature['geometry']['type'] == 'Polygon':
            # Get the coordinates of the polygon and convert them
            coordinates = feature['geometry']['coordinates']
            new_coordinates = flatten_list(convert_coordinates(coordinates, geo_dict))
            yolo_lines.append(f'{class_index} ' + ' '.join(map(str, new_coordinates)))

    # Write all lines at once, overwriting any existing file; a chip without
    # polygons (background image) gets an empty label file
    with open(ywp, 'w') as file:
        file.write('\n'.join(yolo_lines))
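
A quick way to sanity-check the files produced by `write_yolo_file()` is to read one back and confirm that every line has a class index, an even number of coordinates, and values within 0 to 1. The reader below is a sketch for that purpose; `read_yolo_labels` is a hypothetical helper, and the demonstration file contents are made up.

```python
import os
import tempfile


def read_yolo_labels(path):
    """Parse a YOLO segmentation label file into (class_index, [(x, y), ...]) tuples."""
    polygons = []
    with open(path, "r") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # empty file or blank line: background chip
            values = [float(v) for v in parts[1:]]
            if len(values) % 2 != 0:
                raise ValueError(f"Odd number of coordinates in {path}")
            if any(v < 0 or v > 1 for v in values):
                raise ValueError(f"Coordinate outside [0, 1] in {path}")
            points = list(zip(values[0::2], values[1::2]))
            polygons.append((int(parts[0]), points))
    return polygons


# Demonstration on a temporary file with two made-up polygons
with tempfile.TemporaryDirectory() as tmp:
    label_path = os.path.join(tmp, "example.txt")
    with open(label_path, "w") as f:
        f.write("0 0.25 0.25 0.75 0.25 0.75 0.75\n0 0.1 0.1 0.2 0.1 0.2 0.2")
    parsed = read_yolo_labels(label_path)

print(len(parsed), parsed[0])
```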

Function to save TIFs to Yolo format

In [None]:
def convert_tif_to_jpg(cwp, folder, ql=100):
    """
    Converts a TIFF image file to JPEG format.

    Parameters:
    cwp (str): The path to the TIFF image file.
    folder (str): The output subfolder under 'ramp_data_yolo' (e.g. 'train', 'val' or 'test').
    ql (int): The quality level of the JPEG image (default is 100).

    Returns:
    str: A message with the path of the written JPEG file.
    """
    # Open the tif image file
    with Image.open(cwp) as img:
        # Convert the image to RGB
        rgb_img = img.convert('RGB')

        # Define the output path with .jpg extension
        jwp = os.path.join('ramp_data_yolo', folder, 'images', os.path.basename(cwp).replace('.tif', '.jpg'))

        # Create the output folder if it does not exist
        os.makedirs(os.path.dirname(jwp), exist_ok=True)

        # Save the image at quality level ql
        rgb_img.save(jwp, "JPEG", quality=ql)

        # Return a message with the output path
        return f'Writing: {jwp}'

### Unused functions

In [None]:
# WE MIGHT NOT NEED THIS ONE, AS IT'S ONLY USED ONCE IN THE FUNCTION (NOT USED) BELOW

def get_polygon_count(lwp):
    """
    Count the number of polygons in a GeoJSON file.

    Parameters:
    lwp (str): The path to the GeoJSON file.

    Returns:
    int: The number of polygons in the GeoJSON file.
    """
    # Open the GeoJSON file
    with open(lwp, 'r') as file:
        data = json.load(file)

    # Initialize the polygon count
    polygon_count = 0

    # Navigate through the GeoJSON structure
    for feature in data['features']:
        if feature['geometry']['type'] == 'Polygon':
            # Increment the polygon count
            polygon_count += 1

    return polygon_count

In [None]:
#  WE DO NOT NEED THIS ONE

# 1. Create a DataFrame with the filenames, basefolders, polygon counts, and image shapes.
# 2. Add a 'use' field that will be set to False for image-label pairs that will not be used.
#     a. Randomly set 'use' to False to reach the background images to the Z_TARGET of 5%.
#     b. Set 'use' to False when shape is not 256, 256, 3.
# 3. Review counts to verify the changes.

Z_TARGET_PERCENTAGE = 0.05

# Create a list to store the polygon counts
polygon_counts = []

# Loop through the label files and store the polygon count of each file
for lwp in tqdm(lwps):
    polygon_counts.append(get_polygon_count(lwp))

# Create a list of chip and label filenames
fwps = [os.path.basename(lwp).replace('.geojson', '.tif') for lwp in lwps]

# Create a DataFrame to store the chip, mask, and label filenames and the polygon counts
df = pd.DataFrame({
    'base_folder': base_folders,
    'cwp': cwps,
    'fwp': fwps,
    'polygon_count': polygon_counts,
    'shape': shapes_data[1],
    'use': True
})

# Figure out how many zero count images to use
z_count = len(df[df['polygon_count'] == 0])
z_target = int(round(len(df)*Z_TARGET_PERCENTAGE, 0))

# Filter the DataFrame where polygon_count is 0
condition = df['polygon_count'] == 0
subset_df = df[condition]

# # Randomly select 5001 indices from this subset and set use to True
# indices_to_set_true = np.random.choice(subset_df.index, size=5001, replace=False)
# ValueError: Cannot take a larger sample than population when 'replace=False'
# Randomly select 200 indices from this subset and set use to True (setting to 200, as the images are 4200)
indices_to_set_true = np.random.choice(subset_df.index, size=200, replace=False)
df.loc[indices_to_set_true, 'use'] = True

# Set 'use' to False for the remaining indices in the subset
remaining_indices = subset_df.index.difference(indices_to_set_true)
df.loc[remaining_indices, 'use'] = False

# Set 'use' to False where shape is not (256, 256, 3)
df.loc[df['shape'] != (256, 256, 3), 'use'] = False

# Optional: Verify the changes
print("Count of 'True' in 'use' where polygon_count is 0:", df[(df['polygon_count'] == 0) & (df['use'])].shape[0])
print("Count of 'False' in 'use' where polygon_count is 0:", df[(df['polygon_count'] == 0) & (~df['use'])].shape[0])
print('Percentage of rows where polygon count is 0:', round(100*df[df['polygon_count'] == 0].shape[0]/df.shape[0], 1))
print('Percentage of rows where polygon count is 0 and use is True:', round(100*df[(df['polygon_count'] == 0) & (df['use'])].shape[0]/df.shape[0], 1))
print('\nTotal number of rows:', df.shape[0])

# Check for duplicate filenames
duplicates = df.duplicated(subset='fwp')
duplicated_files = df[duplicates]
print(f'\nNumber of duplicated files: {duplicated_files.shape[0]}')

# Print the head of the df
print(f'\n{df.head()}')

In [None]:
#  WE DO NOT NEED THIS ONE

def aggregate_data(df):
    """
    Aggregates the data in the given DataFrame by grouping it based on the 'base_folder' column.
    Calculates the count of images, sum of 'use' column, and count of 'polygon_count' where the value is 0.

    Args:
        df (pandas.DataFrame): The DataFrame containing the data to be aggregated.

    Returns:
        pandas.DataFrame: The aggregated DataFrame with columns 'img_count', 'use', and 'z_count'.
    """
    # Group by 'base_folder' and aggregate
    df_bf = df.groupby('base_folder').agg(
        img_count=('cwp', 'count'),
        use=('use', 'sum'),
        z_count=('polygon_count', lambda x: (x == 0).sum())
    )

    return df_bf

# Aggregate data for analysis
aggregated_df = aggregate_data(df)
print(aggregated_df[['img_count', 'use', 'z_count']])

# Print the total number of files
print(f'\n\tTotal: \t\t{df.shape[0]}')

# Print the number of files to use
print(f'\tUsed: \t\t{df[df["use"]].shape[0]}')
print(f'\tNot Used: \t{df.shape[0] - df[df["use"]].shape[0]}')

### Actual preprocessing

In [None]:
# Add part to loop into all the cities list
# Add part to obtain dataset split from the existing files structure

# Call the find_files function and print the number of found files
cwps, lwps, base_folders = find_files()
print('Found {} chip files'.format(len(cwps)))
N = 2
test_list_cwps = cwps[:N]
print(f"The first {N} elements of list are : {str(test_list_cwps)}")
print('Found {} label files\n'.format(len(lwps)))
test_list_lwps = lwps[:N]
print(f"The first {N} elements of list are : {str(test_list_lwps)}")
# Print message if all filenames match
print('All filenames match; each tif has a label!')

# Call the shapes_data and print the shape of the first chip file
shapes_data = check_shapes(cwps)
print(f'Chip shapes with counts are: {shapes_data[0]}')

In [None]:
# Split the data into train-val-test and convert label and image formats for YOLO model training.
# 1. Define SEED and train-val-test split percentages
# 2. Get the chip filenames where 'use'=True
# 3. Shuffle and divide the chips to prevent biased data splitting.
# 4. Confirm the training, validation, and testing arrays to verify the correct split.
# 5. Write YOLO label files.
# 6. Convert image files from TIFF to JPEG.

# SEED = 42
# TRAIN_SPLIT = 0.7 ... not needed!
# VAL_SPLIT = 0.15
# TEST_SPLIT = 0.15

# assert(TRAIN_SPLIT + VAL_SPLIT + TEST_SPLIT == 1), "The sum of the splits must be equal to 1"

# print(f'Train-val-test split: {TRAIN_SPLIT}-{VAL_SPLIT}-{TEST_SPLIT}')

# # Set the random seed
# np.random.seed(SEED)

# # Get the cwps
# cwps = df[df['use']]['cwp'].values

# # Shuffle indices
# shuffled_indices = np.random.permutation(len(cwps))

# # Calculate split indices for the specified split percentage
# train_end = int(len(cwps) * TRAIN_SPLIT)
# val_end = train_end + int(len(cwps) * VAL_SPLIT)

# # Split the indices into training, validation, and testing
# train_indices = shuffled_indices[:train_end]
# val_indices = shuffled_indices[train_end:val_end]
# test_indices = shuffled_indices[val_end:]

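# Create train, val, and test arrays
# (train_indices, val_indices and test_indices come from the split block above, which is
# currently commented out; cwps must also be a NumPy array for this index-array selection to work)
train_cwps = cwps[train_indices]
val_cwps = cwps[val_indices]
test_cwps = cwps[test_indices]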

# Output the results to verify
print(f'\nTrain array size: {len(train_cwps)}')
print(f'Validation array size: {len(val_cwps)}')
print(f'Test array size: {len(test_cwps)}\n')

# Check if the YOLO folder exists, if not create labels, images, and folders
if not os.path.exists('ramp_data_yolo'):
    # Create the folder
    os.makedirs('ramp_data_yolo')

    # Write the YOLO label files for the training set
    print('Generating training labels')
    for train_cwp in tqdm(train_cwps):
        write_yolo_file(train_cwp, 'train')

    # Write the YOLO label files for the validation set
    print('Generating validation labels')
    for val_cwp in tqdm(val_cwps):
        write_yolo_file(val_cwp, 'val')

    # Write the YOLO label files for the test set
    print('Generating test labels')
    for test_cwp in tqdm(test_cwps):
        write_yolo_file(test_cwp, 'test')

    # Convert the chip files to JPEG format
    print('Generating training images')
    for train_cwp in tqdm(train_cwps):
        convert_tif_to_jpg(train_cwp, 'train')

    print('Generating validation images')
    for val_cwp in tqdm(val_cwps):
        convert_tif_to_jpg(val_cwp, 'val')

    print('Generating test images')
    for test_cwp in tqdm(test_cwps):
        convert_tif_to_jpg(test_cwp, 'test')

else:
    print('Data already converted')
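
The comments at the top of this section mention obtaining the dataset split from the existing folder structure rather than from a random shuffle. One possible sketch of that idea is shown below. The folder layout is not documented here, so the split names, the helper `split_from_path`, and the example paths are all assumptions for illustration.

```python
from pathlib import PurePosixPath


def split_from_path(chip_path, split_names=("train", "val", "test")):
    """Return the first path component that names a split, or None if there is none."""
    for part in PurePosixPath(chip_path).parts:
        if part in split_names:
            return part
    return None


# Hypothetical paths; adjust split_names to the folder names actually used
with_split = split_from_path("model51_td364/train/chips/tile_1.tif")
without_split = split_from_path("model51_td364/chips/tile_1.tif")
print(with_split, without_split)  # train None
```

The returned name could then be passed as the `folder` argument of `write_yolo_file()` and `convert_tif_to_jpg()`.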