# 0. Directories

We are going to place our datasets in a folder named "images":

In [None]:
import os

dir_base = os.getcwd()                            # Base directory
dir_images = os.path.join(dir_base, 'images')     # ./images

# 1. Making the general dataset (TCIA)

To make this dataset, we used a total of 35 histopathological High-Resolution (HR) images from The Cancer Imaging Archive (TCIA). These images are shown in the following table:

| **Collection** | **Location** | **Data Format** | **Magnification** | **Number of Images** | **Size (GB)** | **Images used**                                                                                                            |
|:--------------:|:------------:|:---------------:|:-----------------:|:--------------------:|:-------------:|:--------------------------------------------------------------------------------------------------------------------------:|
| CMB-MML        | Blood, Bone  | SVS             | 20 or 40          | 2                    | 0.29          | MSB-04030-12-02 <br> MSB-04030-12-03                                                                                           |
| CPTAC-GBM      | Brain        | SVS             | 20 or 40          | 508                  | 87            | C3L-00016-21 <br> C3N-00661-21 <br> C3N-04693-21                                                                                   |
| CPTAC-BRCA     | Breast       | SVS             | 20 or 40          | 642                  | 113.29        | 01BR001-0684a407-f446-486d-9160-b483cb <br>  11BR003-2dc74c29-2e89-4600-80bf-1e3637 <br>  22BR001-02797e67-5651-4cc8-b815-904900 |
| CPTAC-COAD     | Colon        | SVS             | 20 or 40          | 373                  | 66.69         | 01CO001-760f15d2-444c-4deb-b133-62f3ca <br>  11CO003-2e1d9ae2-f8f8-4d9f-aa2d-ca5577 <br>  22CO006-e1cd3d70-132b-452f-ba10-026721 |
| CMB-GEC        | Esophagus    | SVS             | 20 or 40          | 3                    | 0.103         | MSB-06857-01-02 <br> MSB-06857-01-07 <br> MSB-06857-01-12                                                                          |
| CPTAC-CCRCC    | Kidney       | SVS             | 20 or 40          | 783                  | 190           | C3L-00004-21 <br> C3N-00148-21 <br> C3N-03021-24                                                                                   |
| CPTAC-LUAD     | Lung         | SVS             | 20 or 40          | 1139                 | 435           | C3L-00001-21 <br> C3N-00167-21 <br> C3N-02923-24                                                                                   |
| CPTAC-OV       | Ovary        | SVS             | 20 or 40          | 222                  | 54.64         | 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77 <br>  13OV003-407c600a-c0ac-41a6-8d87-fd8a3c <br>  20OV005-4113892a-c489-499e-9255-7328ea |
| CPTAC-PDA      | Pancreas     | SVS             | 20 or 40          | 557                  | 88            | C3L-00017-21 <br> C3N-00198-21 <br> C3N-04284-23                                                                                   |
| CMB-PCA        | Prostate     | SVS             | 20 or 40          | 3                    | 0.582         | MSB-02917-01-02 <br> MSB-03973-01-02 <br> MSB-07483-01-02                                                                          |
| CPTAC-CM       | Skin         | SVS             | 20 or 40          | 404                  | 107           | C3L-00275-21 <br> C3N-00179-21 <br> C3N-05624-25                                                                                   |
| CPTAC-UCEC     | Uterus       | SVS             | 20 or 40          | 888                  | 154           | C3L-00006-21 <br> C3N-00151-21 <br> C3N-03767-21                                                                                   |


The images can be downloaded from: https://www.cancerimagingarchive.net/histopathology-imaging-on-tcia/

We downloaded each image (in SVS format) by following the instructions in TCIA.

We placed each image in a corresponding folder for its type of tissue (e.g. Brain or Lung) and we placed these folders inside a folder "images_svs" inside of the "images" folder.

For example, the image "C3L-00016-21" would be in ./images/images_svs/Brain/C3L-00016-21.svs
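
Before slicing, it can help to confirm that every slide landed in the expected tissue folder. The following is a minimal sketch, assuming the layout described above; the helper name `list_slides_by_tissue` is hypothetical and is not part of the original notebook.

```python
import os

def list_slides_by_tissue(images_folder_svs):
    """Map each tissue folder name to the sorted list of .svs files inside it."""
    slides = {}
    for entry in sorted(os.listdir(images_folder_svs)):
        tissue_dir = os.path.join(images_folder_svs, entry)
        if os.path.isdir(tissue_dir):
            # Keep only SVS slides; any other file in the folder is ignored
            slides[entry] = sorted(
                name for name in os.listdir(tissue_dir)
                if name.lower().endswith(".svs")
            )
    return slides
```

Summing the lengths of the returned lists should account for the 35 slides listed in the table above.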

In [None]:
images_folder_svs = os.path.join(dir_images, 'images_svs')

We also create a directory to place the images in PNG format:

In [None]:
images_folder_png = os.path.join(dir_images, 'images_png')

# If it doesn't exist already, make it
if not os.path.exists(images_folder_png):
    os.makedirs(images_folder_png, exist_ok=True)
    print(f"{images_folder_png} directory successfully created.")

**Note:** We need the **OpenSlide library** to read the images in SVS format. It has to be installed in our environment beforehand; in Google Colab we can do it by running:

In [None]:
try:
  import openslide
except ImportError:
  # Install the OpenSlide system library and its Python bindings (Google Colab)
  !apt update && apt install -y openslide-tools
  !pip install openslide-python
  import openslide

## 1.1 Function definitions

We defined some functions to handle the images:

In [None]:
## Define some useful functions
import math

# Return closest slice size and number of slices to divide a certain value
def closest_slices_to(value, slice_size=2000, tolerance=0.01):
  """
  Given an input "value" we want to divide into an integer number of "slices"
  of a size close (or above) to a "slice size", returns the slice size "slice_size"
  and number of slices "n_slices", so the remaining quantity after the slicing
  divided by the input value is below the given tolerance, if possible.

  This is:
  value = n_slices * slice_size + remaining,   where  remaining/value < tolerance

  If tolerance can't be reached, it will return the values so remaining/value
  would be the smallest possible.
  """
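  n_slices = math.floor(value/slice_size)
  if n_slices == 0 :
    # The value is smaller than one slice: cover it with a single slice
    # (this also avoids a division by zero in the loop below)
    return value, 1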

  while True :
    remaining = value - n_slices * slice_size
    # If tolerance is reached or won't be reached, breaks
    if remaining/value < tolerance or remaining < n_slices :
      break
    # If not, continues
    slice_size += math.floor(remaining/n_slices)

  return slice_size, n_slices

# Return closest (from below) number to multiple of n
def closest_multiple_of(value, n):
  """
  For given "value" number, returns the multiple of n closest (from below) to "value".
  """
  return value // n * n

# Given some area dimensions, return slices dimensions in which it could be divided
def slicing_function(height, width, slice_size=2000, multiple_of=1, tolerance=0.01,
                     keep_proportions=True):
  """
  Given some area of dimensions (height, width), finds the dimensions of slices in
  which this area could be divided, for which the bigger dimension of each slice
  will be close (from above) to a "slice size" while also being a multiple of
  some desired number.

  The smaller of the dimensions of each slice would try to keep the ratio of the
  dimensions of the original area.

  If keep_proportions = False, the slice would be a square and both of its dimensions
  would be equal.

  The initial value for one of the dimensions of the slice (either the width or
  the height, whichever is larger) will be equal to slice_size and it will be
  gradually increased until the whole corresponding dimension of the bigger
  area can be covered with a whole number of slices, leaving out as few pixels
  as possible. This will try to reach at least a desired value of "tolerance",
  corresponding to the fraction of that dimension that would be left over.

  The other dimension of the slice will be determined using the final value of
  this dimension.

  Will return:
      [slice_height,slice_width] - Height and width of the slices
      [new_height, new_width]    - Height and width of the area successfully
                                   covered by the slices
  """
  # 1. Find the bigger dimension between height and width
  big_dim = max(height, width)

  # 2. Find closest slice size for bigger dimension
  big_slice_size, big_n_slices = closest_slices_to(big_dim,
                                                   slice_size=slice_size,
                                                   tolerance=tolerance)

  # 3. Find corresponding slice size for smaller dimension
  if keep_proportions :
    small_slice_size = math.floor(min(height, width)/big_dim * big_slice_size)
  else :
    small_slice_size = big_slice_size

  # 4. Round down to closest multiple of multiple_of
  big_slice_size = closest_multiple_of(big_slice_size, n=multiple_of)
  small_slice_size = closest_multiple_of(small_slice_size, n=multiple_of)

  # 5. Get new number of slices
  big_n_slices = math.floor(max(height, width)/big_slice_size)
  small_n_slices = math.floor(min(height, width)/small_slice_size)

  # 6. Get new height and widths to be multiple of slices sizes
  if height > width :
    new_height = big_slice_size * big_n_slices
    new_width = small_slice_size * small_n_slices

    slice_height = big_slice_size
    slice_width = small_slice_size
  else :
    new_width = big_slice_size * big_n_slices
    new_height = small_slice_size * small_n_slices

    slice_width = big_slice_size
    slice_height = small_slice_size

  # 7. Return slices dimensions and new image dimensions [height, width]
  return [slice_height,slice_width], [new_height, new_width]

# Function to get the boundaries to crop an image to given new dimensions
def get_image_boundaries_to_crop(height, width, new_height, new_width,
                                 alignment_ver=None, alignment_hor=None):
  """
  Given some image dimensions (height, width) and the desired new dimensions
  (new_height, new_width), will return the top, left, right and bottom coordinates
  of the image to be cropped.

  If alignment_ver = "center" (or alignment_hor = "center"), will crop from the
  center of the image in the vertical (or horizontal) direction. Otherwise,
  from the top (or left) of the image.
  """
  # Defining the cropping coordinates
  if alignment_hor == "center":
      left = int((width - new_width) / 2)
      right = int((width + new_width) / 2)
  else :
      left = 0
      right = new_width
  if alignment_ver == "center":
      top = int((height - new_height) / 2)
      bottom = int((height + new_height) / 2)
  else :
      top = 0
      bottom = new_height

  return top, left, right, bottom


# Function to crop image (numpy array) with given specifications
def cropping_image_with_slices(image, slice_size=2000, tolerance=0.01, multiple_of=1):
  """
  Images are expected to be numpy arrays of shape: (height,width,channels).
  """
  # Get dimensions from the image
  height = image.shape[0]
  width = image.shape[1]

  # Get slices dimensions and new dimensions
  [slice_height,slice_width], [new_height, new_width] = \
            slicing_function(height, width, slice_size=slice_size,
                             multiple_of=multiple_of, tolerance=tolerance)

  # Get image boundaries (crop centered in both directions)
  top, left, right, bottom = \
      get_image_boundaries_to_crop(height=height, width=width,
                                   new_height=new_height, new_width=new_width,
                                   alignment_ver="center", alignment_hor="center")

  # Return cropped image and slices dimensions [height, width]
  return image[top:bottom,left:right,:], [slice_height,slice_width]


In [None]:
import cv2
import numpy as np
import os
import openslide

# Return if the input image is mostly background
def check_if_mostly_background(image, background_color=(255, 255, 255),
                               tolerance=20, threshold=0.7):
  """
  Takes an image as input and checks if its pixels are close to a certain background
  color for a given tolerance (difference in value). Returns True if the fraction of the
  image that is close to the background color is greater than or equal to "threshold".

  Pixels values are expected to be from 0 to 255. Threshold is expected to be
  between 0 and 1.

  Images are expected to be numpy arrays of shape: (height,width,channels).
  """
  # Create a mask of the background color
  lower_bound = np.array([int(max(value - tolerance,0)) for value in background_color],
                         dtype=np.uint8)
  upper_bound = np.array([int(min(value + tolerance,255)) for value in background_color],
                         dtype=np.uint8)
  background_mask = cv2.inRange(image, lower_bound, upper_bound)

  # Calculate the proportion of background pixels
  height, width, _ = image.shape
  background_proportion = cv2.countNonZero(background_mask) / (height * width)

  # Returns True if image is mostly background
  return background_proportion >= threshold

# Slice SVS into smaller PNGs tiles or patches, excluding background
def slice_SVS_to_PNGs(image_path, export_folder, image_name="image",
                      level=0, magnification=20.0,
                      tile_size=2000, scale=1, tolerance=0.01,
                      background_color1=(255, 255, 255),
                      bk_tolerance1=10, bk_threshold1=0.7,
                      background_color2=(0, 0, 0),
                      bk_tolerance2=0, bk_threshold2=1.0):
  """
  Load a SVS image from image_path and slice it according to the specifications,
  exporting the slices in PNG format to export_folder under image_name.

  Slices are square (keep_proportions=False is used), with a side close to
  tile_size from above (for a given tolerance) after any downscaling to the
  requested magnification. Both dimensions are rounded down to an integer
  multiple of "scale".

  Slices of the image that are mostly background will be discarded, for a given
  background_color, a bk_tolerance (value from 0 to 255) and a bk_threshold
  (fraction from 0 to 1 of the slice that must be background for it to be discarded).
  Two background colors can be provided.
  """
  # Load slide
  slide = openslide.OpenSlide(image_path)

  # Get information from the image slide
  (width, height) = slide.level_dimensions[level]
  ApparentMagnification = float(slide.properties["aperio.AppMag"])   # float() also accepts values such as "20.0"

  # Define some quantities
  MagnificationRatio = max(ApparentMagnification/magnification, 1.0)  # In case "magnification" would be bigger than the AppMag

  # Divide region into slices
  slice_size = int(tile_size * MagnificationRatio)   # If this MagRatio is bigger than 1, we will reduce the image tiles size later

  [slice_height,slice_width], [new_height, new_width] = \
            slicing_function(height, width, slice_size=slice_size,
                             multiple_of=int(scale * MagnificationRatio),
                             tolerance=tolerance,keep_proportions=False)

  # Get starting positions to crop the whole image
  top, left, right, bottom = \
      get_image_boundaries_to_crop(height=height, width=width,
                                   new_height=new_height, new_width=new_width,
                                   alignment_hor="center",alignment_ver="center")

  # Read the whole image by regions and export them
  rows = int(new_height/slice_height)
  columns = int(new_width/slice_width)
  image_counter=1
  for row in range(rows):
    for column in range(columns):
      region = slide.read_region((left + column * slice_width, top + row * slice_height),
                                 level,
                                 (slice_width, slice_height))

      # Transform to numpy array (read_region returns an RGBA PIL image)
      image = np.array(region)
      image = cv2.cvtColor(image, cv2.COLOR_RGBA2BGR)   # Drop alpha and reorder to BGR, the channel order cv2.imwrite expects

      # Resize image if necessary
      if MagnificationRatio > 1.0 :
        scale_down_factor = 1 / MagnificationRatio
        image = cv2.resize(image, None, fx=scale_down_factor, fy=scale_down_factor,
                          interpolation= cv2.INTER_AREA)   # INTER_AREA is better for downscaling than INTER_CUBIC

      # Discard background tiles and save the others
      # Check if mostly white
      if not check_if_mostly_background(image, background_color=background_color1,
                                        tolerance=bk_tolerance1,
                                        threshold=bk_threshold1):
        # Check if mostly black or dark gray
        if not check_if_mostly_background(image, background_color=background_color2,
                                        tolerance=bk_tolerance2,
                                        threshold=bk_threshold2):
          # If not background, export image
          image_number = f"{image_counter:0{len(str(rows * columns))}d}"    # This will format the numbers like 0016 when applicable
          image_counter += 1
          image_filename = str(image_name + "_x" + str(int(magnification)) + "_" + image_number + ".png")
          export_path = os.path.join(export_folder,image_filename)
          cv2.imwrite(export_path, image)

  return



## 1.2 Preparing the HR images (slice the SVS images into PNG patches)

We load the SVS images at 20x magnification. Because these images are very large (typically 20,000 px to 50,000 px per side), we slice them into patches of around 2,000 px per side.
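
To make the tile-size arithmetic concrete, the sketch below restates, for a single image side, the logic of closest_slices_to() and closest_multiple_of() as they are combined in slice_SVS_to_PNGs(). The helper name `plan_tiles` and the 45,000 px and 90,000 px sides are illustrative assumptions, and the sketch assumes the side is at least one tile long.

```python
import math

def plan_tiles(side, tile_size=2000, scale=4, tolerance=0.005,
               native_mag=20.0, target_mag=20.0):
    """Return (tile side read at native resolution, number of tiles, pixels covered)."""
    # Tiles are read larger when the slide was scanned above the target magnification
    ratio = max(native_mag / target_mag, 1.0)
    size = int(tile_size * ratio)
    n_tiles = math.floor(side / size)

    # Grow the tile until the leftover fraction drops below the tolerance,
    # or until the leftover cannot be shared among the tiles any more
    while True:
        remaining = side - n_tiles * size
        if remaining / side < tolerance or remaining < n_tiles:
            break
        size += math.floor(remaining / n_tiles)

    # Round down so the tile stays a multiple of "scale" after downscaling
    multiple = int(scale * ratio)
    size = size // multiple * multiple
    n_tiles = math.floor(side / size)
    return size, n_tiles, size * n_tiles

print(plan_tiles(45000))                    # (2044, 22, 44968)
print(plan_tiles(90000, native_mag=40.0))   # (4088, 22, 89936)
```

In the first case 32 px of the 45,000 px side are left uncovered. In the second case each 4,088 px tile is later downscaled by a factor of 2, giving 2,044 px at 20x.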

We will also use our function check_if_mostly_background() to discard patches that are mostly background (i.e. either mostly white or mostly black), because the background can be a huge portion of these images, with the parameters:

* background_color1 = (225, 225, 225)     (near white)
* background_color2 = (30, 30, 30)        (near black)
* bk_tolerance1 = bk_tolerance2 = 30
* bk_threshold1 = bk_threshold2 = 0.75

So a patch was discarded if at least 75% of its pixels were within a tolerance of 30 (per channel) of either of the colors above. All other patches were saved in the PNG folder we previously created.
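
The discard rule can be illustrated without OpenCV. This is a dependency-free sketch of the same per-channel range test, not the notebook's cv2-based implementation; the names `background_fraction` and `is_discarded` and the toy pixel values are assumptions made for the example.

```python
def background_fraction(pixels, background_color, tolerance):
    """Fraction of pixels whose every channel lies within tolerance of background_color."""
    lower = [max(c - tolerance, 0) for c in background_color]
    upper = [min(c + tolerance, 255) for c in background_color]
    hits = sum(
        all(lo <= value <= hi for value, lo, hi in zip(pixel, lower, upper))
        for pixel in pixels
    )
    return hits / len(pixels)

def is_discarded(pixels, tolerance=30, threshold=0.75):
    """True if the patch is mostly near-white or mostly near-black."""
    near_white = background_fraction(pixels, (225, 225, 225), tolerance) >= threshold
    near_black = background_fraction(pixels, (30, 30, 30), tolerance) >= threshold
    return near_white or near_black

# Toy "patches" given as flat lists of pixels: near-white background plus some tissue
mostly_background = [(250, 248, 251)] * 8 + [(180, 90, 150)] * 2   # 80% near white
mostly_tissue = [(250, 248, 251)] * 7 + [(180, 90, 150)] * 3       # 70% near white

print(is_discarded(mostly_background))   # True
print(is_discarded(mostly_tissue))       # False
```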



In [None]:
# Count total number of SVS images (ignore any other file)
number_of_svs_images = 0
for folder, _, images_svs_list in os.walk(images_folder_svs):
  number_of_svs_images += len([f for f in images_svs_list if f.lower().endswith(".svs")])

# Export all PNG images to the same folder (ignore subfolders)
export_folder = images_folder_png
if not os.path.exists(export_folder):
  os.makedirs(export_folder, exist_ok=True)

# Iterate over the SVS images and slice them
current_img_number = 1
for folder, _, images_svs_list in os.walk(images_folder_svs):
  # Go over each SVS image on each folder (skip any other file)
  for image in images_svs_list:
    if not image.lower().endswith(".svs"):
      continue
    image_path = os.path.join(folder, image)    # "folder" already includes images_folder_svs

    # Prepare image name and folder
    image_name = image.split(".svs")[0]
    folder_name = folder.split(os.path.sep)[-1]

    # Slice images in images_svs folder
    print(f"Image {folder_name}/{image} is being sliced... ({current_img_number}/{number_of_svs_images})")
    slice_SVS_to_PNGs(image_path=image_path, export_folder=export_folder,
                      image_name=image_name,
                      level=0, magnification=20.0,
                      tile_size=2000, scale=4, tolerance=0.005,
                      background_color1=(225, 225, 225),
                      bk_tolerance1=30, bk_threshold1=0.75,
                      background_color2=(30, 30, 30),
                      bk_tolerance2=30, bk_threshold2=0.75)
    print(f"{image_name} from {image_path} to {export_folder}")
    print(f"All PNG slices of {folder_name}/{image} have been exported to {export_folder}.")
    current_img_number += 1

We can make a .zip file with the PNG folder by running:

In [None]:
!zip -r images_png.zip "images/images_png/"

If the .zip file is too big to handle, we can split it into smaller pieces (e.g. 6 GB) by running:

In [None]:
!zip images_png.zip --out images_png_part.zip -s 6g

## 1.3 Preparing the LR images (downscaling the HR images)

Since we are interested in EDSR x4, we need to downscale the HR images by a factor of 4 to get our LR images. We do it with OpenCV's cv2 library, using the resize() function with INTER_AREA interpolation.
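
Downscaling by 4 gives exact LR dimensions only when both HR sides are multiples of 4; the slicing step above already rounds patch sizes down to a multiple of scale=4. A small sketch of that arithmetic follows (the helper name `lr_dimensions` is hypothetical, and the sizes are illustrative):

```python
def lr_dimensions(height, width, scale=4):
    """Return the (height, width) of the LR image for an HR image of the given size."""
    if height % scale or width % scale:
        raise ValueError(f"HR size {height}x{width} is not a multiple of {scale}")
    return height // scale, width // scale

print(lr_dimensions(2000, 2000))   # (500, 500)
print(lr_dimensions(256, 256))     # (64, 64)
```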

The LR images go to an "images_png_x4" folder (i.e. ./images/images_png_x4).

Following the requirements of our training code, each LR image downscaled by a factor of 4 has the same name as its HR image with "_x4" appended.

For example, an HR image named "image06.png" will have a corresponding LR image named "image06_x4.png".
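
The naming rule can be written as a small helper. This is a sketch, assuming the "_x4" suffix convention described above; the name `lr_filename` is hypothetical.

```python
import os

def lr_filename(hr_filename, scale=4):
    """Build the LR file name for an HR file name by appending _x<scale> to the stem."""
    stem, extension = os.path.splitext(hr_filename)
    return f"{stem}_x{scale}{extension}"

print(lr_filename("image06.png"))   # image06_x4.png
```

os.path.splitext() only removes the final extension, so the rest of the file name is left untouched.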

In [None]:
import cv2

def downscale_and_save_image(image_path, export_path, scale=1):
  # Read the image
  image = cv2.imread(image_path)

  # Downscale image with area interpolation
  image = cv2.resize(image, None, fx=1/scale, fy=1/scale,
                     interpolation= cv2.INTER_AREA)        #cv2.INTER_CUBIC for bicubic was giving poor results

  # Save image
  cv2.imwrite(export_path, image)
  return


# Export all PNG images to the same folder (ignore subfolders)
export_folder = os.path.join(dir_images, "images_png_x4")
if not os.path.exists(export_folder):
  os.makedirs(export_folder, exist_ok=True)

# Iterate over the PNG images and downscale them
scale = 4
number_of_folders = 0
current_folder_number = 1
for folder, folder_list, images_png_list in os.walk(images_folder_png):
  # Count number of folders (direct subfolders plus the root folder itself)
  if number_of_folders == 0 :
    number_of_folders = len(folder_list) + 1

  # Go over each PNG image on each folder
  print(f"Images on {folder} are being downscaled... ({current_folder_number}/{number_of_folders})")
  for image in images_png_list:
    image_path = os.path.join(folder, image)    # "folder" already includes images_folder_png

    # Prepare image name and export path
    image_name = image.split(".png")[0]
    image_filename = str(image_name + "_x4.png")
    export_path = os.path.join(export_folder,image_filename)

    # Downscale and save the images
    downscale_and_save_image(image_path=image_path, export_path=export_path,
                             scale=scale)

  print(f"All PNG images on {folder} have been downscaled and saved to {export_folder}.")
  current_folder_number += 1

We can make a .zip file with the folder with the LR images by running:

In [None]:
!zip -r images_png_x4.zip "images/images_png_x4/"

And again, we can split the .zip file into smaller pieces (e.g. 6 GB) by running:

In [None]:
!zip images_png_x4.zip --out images_png_x4_part.zip -s 6g

## 1.4 Final folder structure

Given we want to use these images to train the EDSR x4 model, we need to prepare them in a folder with a particular structure.

If we are going to use our custom.py module, we need to make a folder named "Custom".

Inside such folder, we will create 2 folders: one named "HR" and one named "LR_bicubic".

Inside the HR folder, we will place the images from "images_png".

Inside the LR_bicubic folder, we will create a folder named "X4". Inside the X4 folder, we will place the images from "images_png_x4".

Given this is our general dataset, we can place the "Custom" folder inside a dedicated folder "image-data-general".

We should end up having a dataset that would look like:

* image-data-general
  * Custom
    * HR
      * tcia_image0001.png
      * ...
      * tcia_image2500.png
    * LR_bicubic
      * X4
        * tcia_image0001_x4.png
        * ...
        * tcia_image2500_x4.png
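
The layout above can be assembled by copying the two PNG folders into place. This is a minimal sketch, assuming the folder names described in this section; the helper `assemble_custom_dataset` and its parameters are hypothetical and not part of the original notebook.

```python
import os
import shutil

def assemble_custom_dataset(hr_source, lr_source, dataset_root, scale=4):
    """Copy HR and LR PNG files into <dataset_root>/Custom/HR and .../LR_bicubic/X<scale>."""
    hr_dir = os.path.join(dataset_root, "Custom", "HR")
    lr_dir = os.path.join(dataset_root, "Custom", "LR_bicubic", f"X{scale}")
    os.makedirs(hr_dir, exist_ok=True)
    os.makedirs(lr_dir, exist_ok=True)

    for source, target in ((hr_source, hr_dir), (lr_source, lr_dir)):
        for name in sorted(os.listdir(source)):
            # Only PNG files belong to the dataset; anything else is skipped
            if name.lower().endswith(".png"):
                shutil.copy2(os.path.join(source, name), os.path.join(target, name))
    return hr_dir, lr_dir
```

For the general dataset, the call would take the "images_png" and "images_png_x4" folders as sources and "image-data-general" as dataset_root.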

# 2. Making the dedicated dataset (Humanitas)

Given that we made our previous dataset in its own folder, we will prepare a dedicated folder for this dataset as well. We will then make a folder "image-data-dedicated" inside our "images" folder, and create the "Custom" folder inside.

In [None]:
import os

dir_custom = os.path.join(dir_images,"image-data-dedicated", "Custom")    # ./images/image-data-dedicated/Custom
if not os.path.exists(dir_custom):
  os.makedirs(dir_custom, exist_ok=True)

## 2.1 Preparing the HR images

The patches provided to us from the Humanitas Research Institute were already in PNG format, with square dimensions of 256 px per side.

Therefore, we didn't need to process them any further, other than placing them inside the corresponding HR folder of the dataset.

**Note:** Even if images are in PNG format, they could be in either RGB or RGBA mode. Our images were in RGBA mode, and we needed to convert them to RGB in order to avoid conflicts during training. A way to do it is the following:

In [None]:
from PIL import Image

img_path = "/content/img.png"
img = Image.open(img_path)
img_rgb = img.convert('RGB')

new_img_rgb_path = "/content/img_rgb.png"
img_rgb.save(new_img_rgb_path)

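Alternatively, given that we had access to the Humanitas patches on a server (in a path stored in the variable "dir_patches", which must be defined before running the cell below), we decided to copy and convert the HR images directly with the code below, by using cv2. By default, cv2.imread() loads images as 3-channel BGR, so the alpha channel is dropped when the image is written back: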

In [None]:
import os
import cv2

# Create the HR folder
dir_hr = os.path.join(dir_custom, "HR")    # ./images/image-data-dedicated/Custom/HR
if not os.path.exists(dir_hr):
  os.makedirs(dir_hr, exist_ok=True)

# Function to copy images from one folder to another using cv2
def copy_images_with_cv2(source_dir, destination_dir, verbose=False):
    # Create destination directory if it doesn't exist
    if not os.path.exists(destination_dir):
        os.makedirs(destination_dir)

    # Iterate over files in the source directory
    for file_name in os.listdir(source_dir):
        source_file = os.path.join(source_dir, file_name)
        destination_file = os.path.join(destination_dir, file_name)
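        # Load image with cv2 (3-channel BGR by default, which drops the alpha channel)
        image = cv2.imread(source_file)
        if image is None :
            # Not a readable image (e.g. a subfolder or another type of file): skip it
            continue
        # Save image with cv2
        cv2.imwrite(destination_file, image)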
        if verbose :
            print(f"Copied {file_name} to {destination_dir}")


# Copy HR images from "PatchesHumanitas_256x256_10x" to HR
# Note: dir_patches must already hold the path to the folder with the Humanitas patches
source_dir = dir_patches
destination_dir = dir_hr

copy_images_with_cv2(source_dir, destination_dir, verbose=False)

print(f"All images from {source_dir} were copied to {destination_dir}")

## 2.2 Preparing the LR images

We created the LR_bicubic/X4 folder and placed the LR images inside (downscaled from the HR images by a factor of 4), with the appropriate names, in the same way we did with the general dataset.

We can do this with the code below:

In [None]:
import os
import cv2

# Create the LR/X4 folder
dir_lr = os.path.join(dir_custom, "LR_bicubic")    # ./images/image-data-dedicated/Custom/LR_bicubic
dir_lr_x4 = os.path.join(dir_lr, "X4")             # ./images/image-data-dedicated/Custom/LR_bicubic/X4
if not os.path.exists(dir_lr_x4):
  os.makedirs(dir_lr_x4, exist_ok=True)


# Function to downscale one HR image and store it as its LR counterpart
def downscale_and_save_image(image_path, export_path, scale=1):
  # Read the image
  image = cv2.imread(image_path)

  # Downscale image
  image = cv2.resize(image, None, fx=1/scale, fy=1/scale,
                     interpolation= cv2.INTER_AREA)        #cv2.INTER_CUBIC for bicubic was giving poor results

  # Save image
  cv2.imwrite(export_path, image)
  return


# Iterate over the HR images and downscale them
scale = 4
counter = 0
print_saves = True
print_each = 2000

for file_name in os.listdir(dir_hr):
    if file_name.lower().endswith('.png'):
        #Prepare image name
        image_name = file_name.split(".png")[0]
        image_filename = str(image_name + "_x4.png")

        #Get path
        image_path = os.path.join(dir_hr, file_name)
        export_path = os.path.join(dir_lr_x4, image_filename)

        # Downscale and save the images
        downscale_and_save_image(image_path=image_path, export_path=export_path, scale=scale)

        counter+=1

        # Print
        if print_saves and counter % print_each == 0 :
            print(f"A total of: {counter} images have been downscaled and saved.")

print(f"All PNG images on {dir_hr} have been 4x downscaled and saved in {dir_lr_x4}.")

## 2.3 Checking the images

Given that the dataset is composed of a large number of images (32,261), we defined the functions below in order to check relevant information about the HR and LR image folders.

By running the appropriate code, we got that:

* For the HR folder:
  * Number of files: 32261
  * Folder size: 4.07 GB
  * Different file extensions within the directory: ['.png']
  * Dimensions count len: 1
  * (width,height): (256, 256), Count: 32261

* For the LR_bicubic/X4 folder:
  * Number of files: 32261
  * Folder size: 294.92 MB
  * Different file extensions within the directory: ['.png']
  * Dimensions count len: 1
  * (width,height): (64, 64), Count: 32261

This means we had all 32,261 corresponding images in each folder, all in .png format and all with the same dimensions within each folder (dimensions count length = 1): 256 x 256 px for the HR images and 64 x 64 px for the LR images, with folder sizes of 4.07 GB and 294.92 MB, respectively.

In [None]:
import os
from PIL import Image

def get_dir_stats(dir_path):
    total_files = 0
    total_size = 0

    # Walk through the directory tree
    for root, dirs, files in os.walk(dir_path):
        # Count files
        total_files += len(files)
        # Calculate total size
        for file in files:
            file_path = os.path.join(root, file)
            total_size += os.path.getsize(file_path)

    # Convert size to human-readable format
    total_size_str = _format_size(total_size)

    return total_files, total_size_str

def _format_size(size_bytes):
    # Convert bytes to appropriate unit (KB, MB, GB, etc.)
    for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
        if size_bytes < 1024.0:
            return f"{size_bytes:.2f} {unit}"
        size_bytes /= 1024.0
    # Anything of 1024 TB or more is reported in PB
    return f"{size_bytes:.2f} PB"

def get_unique_file_extensions(dir_path):
    extensions = []

    # Walk through the directory tree
    for root, dirs, files in os.walk(dir_path):
        # Extract file extensions
        for file in files:
            _, extension = os.path.splitext(file)
            if extension.lower() not in extensions:
                extensions.append(extension.lower())

    return extensions

def get_unique_dimensions_with_count(dir_path):
    dimensions_count = {}

    # Walk through the directory tree
    for root, dirs, files in os.walk(dir_path):
        # Iterate over PNG files
        for file in files:
            if file.lower().endswith('.png'):
                file_path = os.path.join(root, file)
                try:
                    # Open the image and get its dimensions
                    with Image.open(file_path) as img:
                        width, height = img.size
                        dimensions = (width, height)
                        # Update dimensions count
                        if dimensions in dimensions_count:
                            dimensions_count[dimensions] += 1
                        else:
                            dimensions_count[dimensions] = 1
                except Exception as e:
                    print(f"Error processing {file}: {e}")

    return dimensions_count

In [None]:
## Checking the contents of the HR folder
folder_to_check = dir_hr

# Number of files and folder size
num_files, dir_size = get_dir_stats(folder_to_check)
print(f"Number of files: {num_files}")
print(f"Folder size: {dir_size}")

# File extensions present in the folder
extensions = get_unique_file_extensions(folder_to_check)
print(f"Different file extensions within the directory: {extensions}")

# Dimensions of the images presents in the folder
dimensions_count = get_unique_dimensions_with_count(folder_to_check)
print(f"Dimensions count len: {len(dimensions_count)}")
for dimensions, count in dimensions_count.items():
    print(f"(width,height): {dimensions}, Count: {count}")

In [None]:
## Checking the contents of the LR/X4 folder
folder_to_check = dir_lr_x4

# Number of files and folder size
num_files, dir_size = get_dir_stats(folder_to_check)
print(f"Number of files: {num_files}")
print(f"Folder size: {dir_size}")

# File extensions present in the folder
extensions = get_unique_file_extensions(folder_to_check)
print(f"Different file extensions within the directory: {extensions}")

# Dimensions of the images presents in the folder
dimensions_count = get_unique_dimensions_with_count(folder_to_check)
print(f"Dimensions count len: {len(dimensions_count)}")
for dimensions, count in dimensions_count.items():
    print(f"(width,height): {dimensions}, Count: {count}")

## 2.4 Final folder structure

If everything went alright, and all images are properly placed, we should have a dataset with the expected structure:

* image-data-dedicated
  * Custom
    * HR
      * humanitas_image00001.png
      * ...
      * humanitas_image32261.png
    * LR_bicubic
      * X4
        * humanitas_image00001_x4.png
        * ...
        * humanitas_image32261_x4.png
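
Matching file counts do not by themselves guarantee that every HR image has its own LR counterpart. A complementary check is sketched below, assuming the "_x4" naming convention used in this notebook; the helper name `find_unpaired_images` is hypothetical.

```python
import os

def find_unpaired_images(hr_dir, lr_dir, suffix="_x4"):
    """Return (HR stems without an LR file, LR stems without an HR file)."""
    hr_stems = {
        os.path.splitext(name)[0]
        for name in os.listdir(hr_dir)
        if name.lower().endswith(".png")
    }
    lr_stems = set()
    for name in os.listdir(lr_dir):
        stem, extension = os.path.splitext(name)
        if extension.lower() == ".png" and stem.endswith(suffix):
            # Strip the suffix to recover the name of the HR image
            lr_stems.add(stem[:-len(suffix)])
    return sorted(hr_stems - lr_stems), sorted(lr_stems - hr_stems)
```

Both returned lists should be empty for a dataset that is ready for training.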