# Skin Disease Detection using Mobile Application
## Final Year Project 2
Ahmad Daniel Ikhwan Bin Rosli <br>
1201103071

### Dataset Preparation

This notebook prepares the downloaded skin disease datasets for training. We will prepare each dataset separately and then combine them into one clean dataset.

In [25]:
import os
import shutil
from pathlib import Path
from PIL import Image

In [26]:
image_extensions = ["*.jpg", "*.jpeg", "*.png", "*.JPG", "*.JPEG", "*.PNG"]

### Phase 1: Prepare `pacificrm` Dataset

**Note:** 
Before running the cleanup code, the dataset folders were manually reorganized:
- The `train` and `test` folders were moved directly under the `pacificrm` directory.
- An extra nested `SkinDisease/SkinDisease/` directory was removed.
- The `README.md` file and other unnecessary files were deleted to simplify the structure.
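
The manual reorganization above could also be scripted so that it is repeatable. The following is only a minimal sketch, not the procedure that was actually used: the function name `flatten_nested_dataset` is hypothetical, the location of the nested folder under `datasets/pacificrm` is an assumption, and treating every loose file directly under the dataset root as "unnecessary" is also an assumption.

```python
import shutil
from pathlib import Path


def flatten_nested_dataset(root: Path, nested: Path, keep=("train", "test")) -> None:
    """Move the split folders from `nested` up into `root`, then clean up.

    Hypothetical helper: `nested` is assumed to be a directory inside `root`.
    """
    # Move train/ and test/ directly under the dataset root.
    for name in keep:
        src = nested / name
        dst = root / name
        if src.is_dir() and not dst.exists():
            shutil.move(str(src), str(dst))

    # Remove the top of the now-unneeded nested tree (e.g. root/SkinDisease).
    top = root / nested.relative_to(root).parts[0]
    if top.is_dir():
        shutil.rmtree(top)

    # Assumption: any loose file directly under root (e.g. README.md) is not needed.
    for item in root.iterdir():
        if item.is_file():
            item.unlink()


# Example call (paths are assumptions about the downloaded layout):
# flatten_nested_dataset(
#     Path("datasets/pacificrm"),
#     Path("datasets/pacificrm/SkinDisease/SkinDisease"),
# )
```

Because the helper deletes files, it is worth running it on a copy of the download first.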

In [27]:
pacificrm_path = Path("datasets/pacificrm")


In [28]:
# Delete class folders that are not target diseases
target_classes = [
    "Acne", "Eczema", "Tinea", "Vitiligo", "Psoriasis", "Warts", "Seborrh_Keratoses", "Moles","Unknown_Normal"
]

for split in ['train', 'test']:
    folder = pacificrm_path / split
    for sub in folder.iterdir():
        if sub.is_dir() and sub.name not in target_classes:
            print(f"Deleting {sub}")
            shutil.rmtree(sub)

Deleting datasets\pacificrm\train\Actinic_Keratosis
Deleting datasets\pacificrm\train\Benign_tumors
Deleting datasets\pacificrm\train\Bullous
Deleting datasets\pacificrm\train\Candidiasis
Deleting datasets\pacificrm\train\DrugEruption
Deleting datasets\pacificrm\train\Infestations_Bites
Deleting datasets\pacificrm\train\Lichen
Deleting datasets\pacificrm\train\Lupus
Deleting datasets\pacificrm\train\Rosacea
Deleting datasets\pacificrm\train\SkinCancer
Deleting datasets\pacificrm\train\Sun_Sunlight_Damage
Deleting datasets\pacificrm\train\Vascular_Tumors
Deleting datasets\pacificrm\train\Vasculitis
Deleting datasets\pacificrm\test\Actinic_Keratosis
Deleting datasets\pacificrm\test\Benign_tumors
Deleting datasets\pacificrm\test\Bullous
Deleting datasets\pacificrm\test\Candidiasis
Deleting datasets\pacificrm\test\DrugEruption
Deleting datasets\pacificrm\test\Infestations_Bites
Deleting datasets\pacificrm\test\Lichen
Deleting datasets\pacificrm\test\Lupus
Deleting datasets\pacificrm\test\R
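
Since `shutil.rmtree` deletes folders permanently, it can help to list what would be removed before running the cell above. The sketch below is one possible dry-run helper; the name `find_non_target_dirs` is hypothetical and is not part of the original notebook.

```python
from pathlib import Path


def find_non_target_dirs(dataset_root: Path, target_classes, splits=("train", "test")):
    """Return the class folders that are NOT in `target_classes`.

    Nothing is deleted here; the caller decides what to do with the list.
    """
    to_delete = []
    for split in splits:
        split_dir = dataset_root / split
        if not split_dir.is_dir():
            continue  # skip splits that do not exist
        for sub in sorted(split_dir.iterdir()):
            if sub.is_dir() and sub.name not in target_classes:
                to_delete.append(sub)
    return to_delete


# Example usage with the notebook's variables:
# for folder in find_non_target_dirs(pacificrm_path, target_classes):
#     print("Would delete:", folder)
```

Once the printed list looks right, the same list can be passed to `shutil.rmtree` one folder at a time.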

In [29]:
# Delete any file that is not a .jpg, .jpeg, or .png image
for path in pacificrm_path.rglob("*"):
    if path.is_file() and path.suffix.lower() not in [".jpg", ".jpeg", ".png"]:
        print(f"Deleting {path}")
        path.unlink()


In [30]:
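# Check and delete corrupted images
corrupted = []

# Filter by lowercased suffix so each file is checked exactly once,
# even where glob patterns match case-insensitively (e.g. on Windows).
for img_path in pacificrm_path.rglob("*"):
    if not img_path.is_file() or img_path.suffix.lower() not in [".jpg", ".jpeg", ".png"]:
        continue
    try:
        with Image.open(img_path) as img:
            img.verify()  # checks file integrity without decoding the full image
    except Exception as e:
        print(f"Corrupted: {img_path} ({e})")
        corrupted.append(img_path)

for c in corrupted:
    c.unlink()

print(f"{len(corrupted)} corrupted images removed.")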


0 corrupted images removed.


**Note:** 
Additional cleanup was done after the automated checks:
- Renamed the folder **`Unknown_Normal`** to **`Normal`** in both `train` and `test` directories for consistency.
- Manually reviewed the images and **deleted unsuitable ones**, including:
  - **Non-skin or irrelevant content** (e.g., clothing, backgrounds, non-human skin).
  - **Blurry or low-quality images** that may affect model performance.

In [32]:
# Combine the train and test splits into one folder per class, ready for re-splitting later
combined_path = Path("datasets/pacificrm_combined")
combined_path.mkdir(parents=True, exist_ok=True)

splits = ["train", "test"]

for split in splits:
    for class_folder in (pacificrm_path / split).iterdir():
        if class_folder.is_dir():
            dest_class_folder = combined_path / class_folder.name
            dest_class_folder.mkdir(parents=True, exist_ok=True)

            for ext in image_extensions:
                for img_file in class_folder.glob(ext):
                    shutil.copy(img_file, dest_class_folder / img_file.name)

print("Combined pacificrm dataset is ready.")


Combined pacificrm dataset is ready.
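
The cell above copies each file under its original name, so if `train` and `test` ever contain a file with the same name in the same class, the later copy silently replaces the earlier one. Whether this actually happens in `pacificrm` was not checked here. A minimal sketch of a collision-safe variant is shown below; the function name `combine_splits` and the `<split>_` filename prefix are illustrative assumptions, not part of the original notebook.

```python
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def combine_splits(dataset_root: Path, combined_root: Path, splits=("train", "test")) -> int:
    """Copy every image from each split into combined_root/<class>/.

    Each copy is prefixed with its split name, so identically named files
    from different splits cannot overwrite each other.
    Returns the number of files copied.
    """
    copied = 0
    for split in splits:
        split_dir = dataset_root / split
        if not split_dir.is_dir():
            continue
        for class_dir in split_dir.iterdir():
            if not class_dir.is_dir():
                continue
            dest_dir = combined_root / class_dir.name
            dest_dir.mkdir(parents=True, exist_ok=True)
            for img in class_dir.iterdir():
                # Match on the lowercased suffix so each file is handled once.
                if img.is_file() and img.suffix.lower() in IMAGE_SUFFIXES:
                    shutil.copy(img, dest_dir / f"{split}_{img.name}")
                    copied += 1
    return copied
```

With this variant the returned count should equal the number of files in the combined folder, which gives a quick check that nothing was overwritten.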


In [33]:
# Count the images in each skin disease class
pacificrm_combined_path = Path("datasets/pacificrm_combined")

for class_folder in pacificrm_combined_path.iterdir():
    if class_folder.is_dir():
        count = 0
        for ext in image_extensions:
            count += len(list(class_folder.glob(ext)))
        print(f"{class_folder.name}: {count} images")

Acne: 1316 images
Eczema: 2244 images
Moles: 802 images
Normal: 900 images
Psoriasis: 1816 images
Seborrh_Keratoses: 1012 images
Tinea: 2050 images
Vitiligo: 1590 images
Warts: 1288 images
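
One caveat about these numbers: on platforms where `Path.glob` matches case-insensitively (typically Windows, which the backslash paths in the earlier output suggest), the patterns `*.jpg` and `*.JPG` can both match the same file, so the counts above may be inflated. This was not verified for this dataset. The sketch below counts each file once by comparing its lowercased suffix; the name `count_images_per_class` is hypothetical.

```python
from collections import Counter
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root: Path) -> Counter:
    """Count image files in each class folder, visiting every file exactly once."""
    counts = Counter()
    for class_dir in root.iterdir():
        if not class_dir.is_dir():
            continue
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


# Example usage with the notebook's path:
# for name, n in sorted(count_images_per_class(Path("datasets/pacificrm_combined")).items()):
#     print(f"{name}: {n} images")
```

If these counts come out at half of the figures printed above, the glob-based loop was counting each file twice.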


### Phase 2: Prepare `ham10000` Dataset

In [None]:
# Combine the two HAM10000 image folders into a single folder
part1_path = Path("datasets/ham10000/HAM10000_images_part_1")
part2_path = Path("datasets/ham10000/HAM10000_images_part_2")
combined_path = Path("datasets/ham10000_combined")
combined_path.mkdir(parents=True, exist_ok=True)

for part in [part1_path, part2_path]:
    for img_file in part.glob("*.jpg"):
        shutil.copy(img_file, combined_path / img_file.name)

print("ham10000 images combined into:", combined_path)
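
After copying, it may be worth confirming that every source image reached the combined folder and that no filename occurs in both parts (which would mean one copy overwrote the other). The following is a minimal sketch of such a check; the function name `check_combined` is hypothetical and not part of the original notebook.

```python
from collections import Counter
from pathlib import Path


def check_combined(parts, combined, pattern="*.jpg"):
    """Compare source folders against the combined folder.

    Returns (missing, duplicated):
      missing    - file names present in a source part but not in `combined`
      duplicated - file names that appear in more than one source part
    """
    seen = Counter()
    for part in parts:
        seen.update(p.name for p in Path(part).glob(pattern))
    combined_names = {p.name for p in Path(combined).glob(pattern)}
    missing = sorted(name for name in seen if name not in combined_names)
    duplicated = sorted(name for name, n in seen.items() if n > 1)
    return missing, duplicated


# Example usage with the notebook's variables:
# missing, duplicated = check_combined([part1_path, part2_path], combined_path)
# print(len(missing), "missing;", len(duplicated), "duplicated")
```

Both lists being empty indicates that the combined folder holds exactly one copy of every source image.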