<a href="https://colab.research.google.com/github/ciupava/Yolo_experiments/blob/main/YoloPreprocessing_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Preprocessing** ###

TO BE UPDATED!!

This notebook downloads and preprocesses the RAMP data package (RAMP-DATA-V0), which comprises over 100k image-label pairs in TIF/GEOJSON format across 22 regional folders. The output of this notebook is the YOLO-DATA-V1 dataset. The YOLO-DATA-V2 and YOLO-DATA-V3 datasets were created using the Pruning notebook.

The key steps carried out in this notebook are listed below.

1. Data Exploration
    
 - Download the RAMP data package from Google Drive and unzip it.
 - Run basic data integrity checks.
 - Count the image and label files and confirm that each image has a label with a matching name.
 - Check the shape of the images; this step revealed issues with the Shanghai and Paris subfolders.
 - Identify background images by the filename of the image-label pair and compute their percentage of the total.

2. Enhance the dataset

 - Create a dataframe for analysis and data control.
 - Exclude the Shanghai and Paris subfolders from the dataset by setting the 'use' flag to False, because their image shape is not (256, 256, 3).
 - Randomly set the 'use' flag to False for background images amounting to 13% of the dataset, reducing the background share from 18% to 5%.
 - Identify the image-label pairs found to be incorrect, as listed in exclusions_list.txt.

3. Convert RAMP (TIF/GEOJSON) to YOLO (JPG/TXT)

 - Randomly assign image-label pairs to train, val, and test folders using a 70-15-15 split.
 - Generate text files in YOLO's segmentation format from the GEOJSON and TIF files. Polygons are extracted from each GEOJSON file and aligned with the location information embedded in the TIF. Note: the origin (0, 0) is the top-left corner of the image, and coordinates are normalized to the range [0, 1].
 - Convert the TIF images to JPG at quality level 100.
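
As a rough illustration of the label conversion in step 3, the sketch below maps polygon vertices from map coordinates to YOLO's normalized image coordinates. It is not the notebook's implementation: `polygon_to_yolo_line`, its parameters, and the example values are hypothetical. It assumes the polygon is already expressed in the TIF's coordinate reference system and that the image is north-up, so that its extent (left, bottom, right, top) fully describes where it sits.

```python
def polygon_to_yolo_line(polygon, bounds, class_id=0):
    """Convert one polygon ring to a YOLO segmentation label line.

    polygon: list of (x, y) map coordinates, assumed to be in the image's CRS.
    bounds:  (left, bottom, right, top) extent of the image in that CRS.
    """
    left, bottom, right, top = bounds
    width = right - left
    height = top - bottom

    values = []
    for x, y in polygon:
        # x is measured from the left edge of the image.
        nx = (x - left) / width
        # Image rows grow downward, so y is measured from the top edge.
        ny = (top - y) / height
        # Clamp to [0, 1] in case a vertex falls slightly outside the image.
        nx = min(max(nx, 0.0), 1.0)
        ny = min(max(ny, 0.0), 1.0)
        values.extend([nx, ny])

    coords = " ".join(f"{v:.6f}" for v in values)
    return f"{class_id} {coords}"


# Hypothetical 10 x 10 map-unit image and a square footprint inside it.
example_bounds = (100.0, 200.0, 110.0, 210.0)
example_polygon = [(102.0, 208.0), (106.0, 208.0), (106.0, 204.0), (102.0, 204.0)]
print(polygon_to_yolo_line(example_polygon, example_bounds))
```

With rasterio, the extent could be read from the opened dataset's `bounds` attribute. A GEOJSON ring usually repeats its first vertex at the end; that closing vertex can be dropped before the label line is written.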


Note: A GPU is not required for preprocessing. Because of the numerous disk I/O operations, running this notebook locally is preferred over running it on Google Colab.
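
The random 70-15-15 assignment described in step 3 could be sketched as follows. This is a minimal illustration under assumptions, not the code used to build YOLO-DATA-V1: the `assign_splits` helper, the fixed seed, and the chip names are all hypothetical.

```python
import random


def assign_splits(names, fractions=(0.70, 0.15, 0.15), seed=42):
    """Randomly assign each name to 'train', 'val' or 'test'."""
    # Sort first so the shuffle does not depend on the input order.
    shuffled = sorted(names)
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * fractions[0])
    n_val = round(len(shuffled) * fractions[1])
    # The test split takes whatever remains after train and val.

    splits = {}
    for i, name in enumerate(shuffled):
        if i < n_train:
            splits[name] = "train"
        elif i < n_train + n_val:
            splits[name] = "val"
        else:
            splits[name] = "test"
    return splits


# Hypothetical chip names standing in for the image-label pairs.
chips = [f"chip_{i:03d}" for i in range(100)]
splits = assign_splits(chips)
counts = {s: list(splits.values()).count(s) for s in ("train", "val", "test")}
print(counts)
```

Using a seeded `random.Random` instance keeps the assignment reproducible between runs without touching the global random state.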

In [None]:
!pip3 install -q rasterio
!pip3 install -q pyproj

In [None]:
import os
import glob
import json
import cv2
import pandas as pd
import matplotlib.pyplot as plt
import random
from pyproj import Transformer
import rasterio
import matplotlib.image as mpimg
from PIL import Image
import numpy as np
import yaml
import gc
from tqdm import tqdm

In [None]:
# Mount Google Drive ('My Drive') so the notebook can access the data stored there

from google.colab import drive
drive.mount('/content/drive')

In [None]:
os.chdir('/content/drive/My Drive/YOLO_test/data/')
os.listdir()


In [None]:
home_folder = os.getcwd()
DATA_FOLDERS = 'fair_*'
data_folders = glob.glob(DATA_FOLDERS)

In [None]:
def find_files():
    """
    Find chip (.tif) and label (.geojson) files in the specified folders.

    Returns:
    cwps (list): List of chip filenames with path.
    lwps (list): List of label filenames with path.
    base_folders (list): List of base folder names.
    """

    # Find the folders
    data_folders = glob.glob(DATA_FOLDERS)

    # Create lists to store chip (.tif) and label (.geojson) filenames with path; mask (.mask.tif) files are skipped
    cwps = []
    lwps = []

    # Create a list to store the base folder names
    base_folders = []

    for folder in data_folders:
        print(f'folder is {folder}')
        # Pattern to match all .tif files in the current folder, including subdirectories
        tif_pattern = f"{folder}/**/**/**/*.tif"
        print(f'tif pattern is {tif_pattern}')
        print(len(tif_pattern))
        # Find all .tif files in the current folder and its subdirectories
        found_tif_files = glob.glob(tif_pattern, recursive=True)
        print(f'found tif files {found_tif_files}')
        print(len(found_tif_files))
        # Filter out .mask.tif files and add the rest to the chip list (cwps)
        for file in found_tif_files:
            if not file.endswith('mask.tif'):
                cwps.append(file)

        # Pattern to match all .geojson files in the current folder, including subdirectories
        geojson_pattern = f"{folder}/**/**/**/*.geojson"
        print(f'geojson pattern is {geojson_pattern}')
        print(len(geojson_pattern))
        # Find all .geojson files
        found_geojson_files = glob.glob(geojson_pattern, recursive=True)
        print(f'found geojson files {found_geojson_files}')
        print(len(found_geojson_files))
        # Add found .geojson files to the label list (lwps)
        lwps.extend(found_geojson_files)

    # Sort the lists
    cwps.sort()
    lwps.sort()

    # Assert that the number of files of each type is the same
    assert len(cwps) == len(lwps), "Number of tif files and label files do not match"

    # Check that each chip filename matches its label filename
    for n, cwp in enumerate(cwps):
        c = os.path.basename(cwp).replace('.tif', '')
        l = os.path.basename(lwps[n]).replace('.geojson', '')

        assert c == l, f"Chip and label filenames do not match: {c} != {l}"

        base_folders.append(cwp.split('/')[1])

    return cwps, lwps, base_folders

# Call the function and print the number of found files
cwps, lwps, base_folders = find_files()
print('Found {} chip files'.format(len(cwps)))
print('Found {} label files\n'.format(len(lwps)))

# Print message if all filenames match
print('All filenames match; each tif has a label!')
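
The check above relies on both sorted lists lining up position by position, so one missing file makes the assertion fail without showing which file is missing. A possible alternative, sketched below, pairs files by base name and reports the unmatched ones. The `pair_by_stem` helper and the example paths are hypothetical, and the sketch assumes base filenames are unique across folders.

```python
import os


def pair_by_stem(chip_paths, label_paths):
    """Pair chips and labels by base filename and report unmatched names."""

    def stem(path, ext):
        # Base filename without the given extension.
        name = os.path.basename(path)
        return name[:-len(ext)] if name.endswith(ext) else name

    chips = {stem(p, ".tif"): p for p in chip_paths}
    labels = {stem(p, ".geojson"): p for p in label_paths}

    # Names present on both sides form the usable pairs.
    pairs = [(chips[s], labels[s]) for s in sorted(chips.keys() & labels.keys())]
    missing_labels = sorted(chips.keys() - labels.keys())
    missing_chips = sorted(labels.keys() - chips.keys())
    return pairs, missing_labels, missing_chips


# Hypothetical paths for illustration only.
example_chips = ["fair_a/chips/x1.tif", "fair_a/chips/x2.tif"]
example_labels = ["fair_a/labels/x1.geojson", "fair_a/labels/x3.geojson"]
pairs, missing_labels, missing_chips = pair_by_stem(example_chips, example_labels)
print(pairs)
print("chips without a label:", missing_labels)
print("labels without a chip:", missing_chips)
```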