# Data Exploration and Cleaning for SRGD data

This notebook does a quick exploration of the files in our dataset.

## Mounting the Google Drive

Configure Google Drive access and set up the paths to the training folder, for both low-res and high-res files.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os
from pathlib import Path
import shutil

base_path = Path("/content/drive/MyDrive/Deep Learning/super-resolution-in-video-games/train")
lr_path = base_path / "lr"
hr_path = base_path / "hr"

print("Path for files:")
print(f"Low resolution: {lr_path}")
print(f"High resolution: {hr_path}")

Path for files:
Low resolution: /content/drive/MyDrive/Deep Learning/super-resolution-in-video-games/train/lr
High resolution: /content/drive/MyDrive/Deep Learning/super-resolution-in-video-games/train/hr


## Checking unique files

We check the number of files in each folder and compare it against the number of unique names. There should be no duplicates, and the low-res and high-res folders should contain the same number of files.

In [None]:
lr_files = os.listdir(lr_path)
hr_files = os.listdir(hr_path)

print("Number of files in folders:")
print(f"Low resolution: {len(lr_files)}. Unique: {len(set(lr_files))}")
print(f"High resolution: {len(hr_files)}. Unique: {len(set(hr_files))}")

Number of files in folders:
Low resolution: 14441. Unique: 14441
High resolution: 14432. Unique: 14432


In [None]:
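# Note: this is a one-sided difference (names in lr that are missing from hr);
# names that exist only in hr are not reported by this cell.
different_files_between_folders = set(lr_files) - set(hr_files)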

print(f"Number of different files between folders: {len(different_files_between_folders)}")
print("Different files:")
print(different_files_between_folders)

Number of different files between folders: 10
Different files:
{'04151 (1).png', '04154 (1).png', '04156 (1).png', '04160 (1).png', '04155 (1).png', '04152 (1).png', '04153 (1).png', '12366 (1).png', '04157 (1).png', '12499 (1).png'}
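The cell above subtracts in one direction only, so a file that exists only in the high-res folder would go unnoticed. A minimal sketch of a two-sided comparison follows; `compare_listings` is a hypothetical helper, and the toy name lists stand in for `os.listdir(lr_path)` and `os.listdir(hr_path)`.

```python
def compare_listings(lr_names, hr_names):
    """Return (only_in_lr, only_in_hr) as sorted lists of file names."""
    lr_set, hr_set = set(lr_names), set(hr_names)
    return sorted(lr_set - hr_set), sorted(hr_set - lr_set)


# Toy listings (assumption: illustrative names, not the real dataset)
only_lr, only_hr = compare_listings(
    ["00000.png", "00001.png", "00001 (1).png"],
    ["00000.png", "00001.png", "00002 (1).png"],
)
print(f"Only in lr: {only_lr}")
print(f"Only in hr: {only_hr}")
```

Reporting both directions separately also shows which folder each stray file has to be removed from.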


The folders do not match, so we need to track down the extra files. They seem to be marked with a ` (1)` suffix in the file name (`* (1).png`). I'll move them out of the training folders and into a "removed" folder, for traceability.

In [None]:
# 1. List the file names (without extension), skipping the "(1)" copies
lr_files = {f.stem for f in lr_path.glob("*.png") if "(1)" not in f.name}
hr_files = {f.stem for f in hr_path.glob("*.png") if "(1)" not in f.name}

# 2. Intersection: files that exist in both folders
common_files = sorted(lr_files & hr_files)
print(f"Valid files: {len(common_files)}")
print(f"First 10 files: {common_files[:10]}")

Valid files: 14431
First 10 files: ['00000', '00001', '00002', '00003', '00004', '00005', '00006', '00007', '00008', '00009']
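The filter above looks for the literal text `(1)` anywhere in the name. As a hedged alternative, the sketch below matches only a trailing copy marker such as ` (1)` or ` (2)` at the end of the stem. The pattern, the `is_copy` helper and the sample stems are assumptions for illustration; the dataset shown here only contains ` (1)` markers.

```python
import re

# A space, then a number in parentheses, at the very end of the stem
COPY_SUFFIX = re.compile(r" \(\d+\)$")


def is_copy(stem: str) -> bool:
    """True when the stem ends with a marker like ' (1)' or ' (2)'."""
    return COPY_SUFFIX.search(stem) is not None


# Illustrative stems (assumption: not taken from the real folders)
stems = ["04151", "04151 (1)", "12366 (2)", "level(1)map"]
clean = [s for s in stems if not is_copy(s)]
print(clean)
```

Anchoring the pattern at the end avoids discarding a legitimate name that merely contains `(1)` somewhere in the middle.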


In [None]:
# Folders that will hold the removed files
removed_lr = base_path / "removed" / "lr"
removed_hr = base_path / "removed" / "hr"

# Create the destination folders if they do not exist
removed_lr.mkdir(parents=True, exist_ok=True)
removed_hr.mkdir(parents=True, exist_ok=True)

# Move every file marked with " (1)" into the "removed" folder
def move_duplicates(src_path: Path, dest_path: Path):
    moved = 0
    for file in src_path.glob("* (1).png"):
        print(f"Moving: {file.name}")
        shutil.move(str(file), dest_path / file.name)
        moved += 1
    print(f"Total moved from {src_path.name}: {moved}")

# Move the duplicated files
move_duplicates(lr_path, removed_lr)
move_duplicates(hr_path, removed_hr)

Moving: 12499 (1).png
Moving: 12366 (1).png
Moving: 04156 (1).png
Moving: 04160 (1).png
Moving: 04155 (1).png
Moving: 04151 (1).png
Moving: 04152 (1).png
Moving: 04157 (1).png
Moving: 04153 (1).png
Moving: 04154 (1).png
Total moved from lr: 10
Moving: 02623 (1).png
Total moved from hr: 1
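If a file with the same name is already present in the destination, `shutil.move` may replace it or raise an error, depending on the platform. The sketch below is one possible variant that skips such files and returns the names it actually moved. `move_without_overwrite` is a hypothetical helper, and the demo runs on a temporary directory rather than on the Drive folders.

```python
import shutil
import tempfile
from pathlib import Path


def move_without_overwrite(src_dir: Path, dest_dir: Path,
                           pattern: str = "* (1).png") -> list:
    """Move files matching pattern, skipping names already in dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for file in sorted(src_dir.glob(pattern)):
        target = dest_dir / file.name
        if target.exists():
            continue  # keep the earlier copy, leave this one in place
        shutil.move(str(file), str(target))
        moved.append(file.name)
    return moved


# Demo on throwaway folders (assumption: illustrative file names)
with tempfile.TemporaryDirectory() as tmp:
    src, dest = Path(tmp) / "lr", Path(tmp) / "removed"
    src.mkdir()
    dest.mkdir()
    for name in ("a (1).png", "b (1).png", "c.png"):
        (src / name).touch()
    (dest / "b (1).png").touch()  # already present in the destination

    moved = move_without_overwrite(src, dest)
    remaining = sorted(p.name for p in src.iterdir())

print(f"Moved: {moved}")
print(f"Left in source: {remaining}")
```

Returning the list of moved names, instead of only printing a count, makes it easier to log what was removed.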


Then, we can check again:

In [None]:
# Checking again:
lr_files = os.listdir(lr_path)
hr_files = os.listdir(hr_path)

print("Number of files in folders:")
print(f"Low resolution: {len(lr_files)}. Unique: {len(set(lr_files))}")
print(f"High resolution: {len(hr_files)}. Unique: {len(set(hr_files))}")

Number of files in folders:
Low resolution: 14431. Unique: 14431
High resolution: 14431. Unique: 14431


Both folders now contain 14431 files each, so it seems that we're good to go!
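Equal counts do not by themselves prove that every low-res file has a high-res partner with the same name. A minimal sketch of a name-level check follows; `unpaired_stems` is a hypothetical helper, and the demo uses temporary folders in place of `lr_path` and `hr_path`.

```python
import tempfile
from pathlib import Path


def unpaired_stems(lr_dir: Path, hr_dir: Path) -> set:
    """Return the stems that appear in only one of the two folders."""
    lr_stems = {p.stem for p in lr_dir.glob("*.png")}
    hr_stems = {p.stem for p in hr_dir.glob("*.png")}
    return lr_stems ^ hr_stems  # symmetric difference


# Demo on throwaway folders (assumption: illustrative file names)
with tempfile.TemporaryDirectory() as tmp:
    demo_lr, demo_hr = Path(tmp) / "lr", Path(tmp) / "hr"
    demo_lr.mkdir()
    demo_hr.mkdir()
    for name in ("00000.png", "00001.png"):
        (demo_lr / name).touch()
        (demo_hr / name).touch()
    (demo_lr / "00002.png").touch()  # no high-res partner

    leftover = unpaired_stems(demo_lr, demo_hr)

print(f"Unpaired stems: {leftover}")
```

On the real folders, an empty result from `unpaired_stems(lr_path, hr_path)` would confirm that the pairs line up by name.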