### Check 1: Validity

To check validity, we verify that every landmark coordinate lies inside its image. For example, a landmark at (3000, 3000) cannot be valid if the image resolution is 1800x2400.
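
As a minimal sketch of this bounds rule, assuming zero-based pixel coordinates (x from 0 to width - 1, y from 0 to height - 1). The helper name `is_within_bounds` is hypothetical and not part of any dataset tooling:

```python
def is_within_bounds(x, y, width, height):
    """Return True if (x, y) is a valid zero-based pixel position in a width x height image."""
    return 0 <= x < width and 0 <= y < height


# The example from the text: a 1800x2400 image cannot contain the point (3000, 3000).
print(is_within_bounds(3000, 3000, 1800, 2400))  # False
print(is_within_bounds(1799, 2399, 1800, 2400))  # True: the bottom-right pixel
print(is_within_bounds(1800, 0, 1800, 2400))     # False: x equals the width, one past the last column
```

The upper bound is exclusive, so a coordinate equal to the width or height counts as out of bounds.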

In [3]:
import json
from PIL import Image
from pathlib import Path

In [4]:
# Set paths
annotations_dir = Path(r"C:\Users\victo\Downloads\Aariz\Aariz\train\Annotations\Cephalometric Landmarks\Senior Orthodontists")
images_dir = Path(r"C:\Users\victo\Downloads\Aariz\Aariz\train\Cephalograms")

# %%
# Get the first annotation and the image that shares its file stem
annotation_file = list(annotations_dir.glob("*.json"))[0]
image_file = next(images_dir.glob(f"{annotation_file.stem}.*"))  # Assumes annotation and image share a stem

print(f"Checking: {annotation_file.name}")
print(f"With image: {image_file.name}")

# %%
# Get image resolution
img = Image.open(image_file)
width, height = img.size
print(f"\nImage resolution: {width} x {height}")

# %%
# Load annotation and check each landmark
with open(annotation_file) as f:
    data = json.load(f)

print(f"\nChecking {len(data['landmarks'])} landmarks:")
print("-" * 40)

invalid_count = 0
for landmark in data['landmarks']:
    name = landmark['title']
    x = landmark['value']['x']
    y = landmark['value']['y']
    
    # Check if within bounds (0 <= x < width, 0 <= y < height)
    if not (0 <= x < width and 0 <= y < height):
        print(f"{name}: ({x}, {y}) - OUT OF BOUNDS")
        invalid_count += 1
    else:
        print(f"✓ {name}: ({x}, {y})")

print("-" * 40)
print(f"\nResult: {invalid_count} invalid landmarks out of {len(data['landmarks'])}")

Checking: cks2ip8fq29yq0yufc4scftj8.json
With image: cks2ip8fq29yq0yufc4scftj8.png

Image resolution: 1968 x 2225

Checking 29 landmarks:
----------------------------------------
✓ A-point: (1315, 1086)
✓ Anterior Nasal Spine: (1338, 1048)
✓ B-point: (1333, 1564)
✓ Menton: (1297, 1733)
✓ Nasion: (1183, 508)
✓ Orbitale: (1112, 790)
✓ Pogonion: (1348, 1663)
✓ Posterior Nasal Spine: (793, 1138)
✓ Pronasale: (1585, 946)
✓ Ramus: (523, 1265)
✓ Sella: (499, 758)
✓ Articulare: (445, 1061)
✓ Condylion: (449, 980)
✓ Gnathion: (1336, 1707)
✓ Gonion: (593, 1496)
✓ Porion: (291, 958)
✓ Lower 2nd PM Cusp Tip: (1189, 1329)
✓ Lower Incisor Tip: (1371, 1308)
✓ Lower Molar Cusp Tip: (1147, 1334)
✓ Upper 2nd PM Cusp Tip: (1177, 1339)
✓ Upper Incisor Apex: (1288, 1106)
✓ Upper Incisor Tip: (1400, 1334)
✓ Upper Molar Cusp Tip: (1119, 1339)
✓ Lower Incisor Apex: (1288, 1556)
✓ Labrale inferius: (1501, 1369)
✓ Labrale superius: (1508, 1212)
✓ Soft Tissue Nasion: (1240, 557)
✓ Soft Tissue Pogonion: (1488, 16

In [6]:
import hashlib
from pathlib import Path
from collections import defaultdict

images_dir = Path(r"C:\Users\victo\Downloads\Aariz\Aariz\train\Cephalograms")

# %%
def get_file_hash(filepath):
    """Generate MD5 hash of a file"""
    hash_md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

# %%
# Find all images and compute their hashes
print("Scanning for duplicate images...")
print("-" * 40)

hash_to_files = defaultdict(list)

# Get all image files (using set to avoid duplicates)
image_files = set()
for pattern in ['*.jpg', '*.jpeg', '*.png', '*.bmp', '*.tif', '*.tiff']:
    image_files.update(images_dir.glob(pattern))

# Convert back to list
image_files = list(image_files)

print(f"Found {len(image_files)} image files")
print("Computing hashes...")

# Hash each file
for img_path in image_files:
    file_hash = get_file_hash(img_path)
    hash_to_files[file_hash].append(img_path.name)

# %%
# Find and report duplicates
duplicates_found = False
duplicate_groups = []

for file_hash, files in hash_to_files.items():
    if len(files) > 1:
        duplicates_found = True
        duplicate_groups.append(files)

print("\n" + "=" * 40)
print("RESULTS")
print("=" * 40)

if duplicates_found:
    print(f"⚠️  Found {len(duplicate_groups)} groups of duplicate images:\n")
    for i, group in enumerate(duplicate_groups, 1):
        print(f"Duplicate Group {i} ({len(group)} identical files):")
        for filename in group:
            print(f"  - {filename}")
        print()
else:
    print("✅ No duplicate images found!")

# %%
# Summary statistics
total_unique = len(hash_to_files)
total_files = len(image_files)
total_duplicates = total_files - total_unique

print("Summary:")
print(f"  Total files: {total_files}")
print(f"  Unique images: {total_unique}")
print(f"  Duplicate files: {total_duplicates}")

# %%
# Debug: Show what's happening with a specific case
if duplicate_groups:
    print("\nDebug - Full paths for first duplicate group:")
    first_group = duplicate_groups[0]
    for img_path in image_files:
        if img_path.name in first_group:
            print(f"  {img_path}")

Scanning for duplicate images...
----------------------------------------
Found 700 image files
Computing hashes...

RESULTS
⚠️  Found 11 groups of duplicate images:

Duplicate Group 1 (2 identical files):
  - cl5lg05ug01au074k511u78le.jpg
  - cl5lg05uj01d6074k93foas54.jpg

Duplicate Group 2 (2 identical files):
  - cl5lg05un01ia074kf2a4ayes.jpg
  - cl5lg05uk01fu074kfuqths55.jpg

Duplicate Group 3 (2 identical files):
  - cl5lg05uf01a6074k9wdlgqlf.jpg
  - cl5lg05ue019i074kfgtu5ot7.jpg

Duplicate Group 4 (2 identical files):
  - cl5lg05uh01c2074kaiage5hg.jpg
  - cl5lg05uh01bu074k41u2cc4k.jpg

Duplicate Group 5 (2 identical files):
  - cl5lg05uf019q074k77b49iib.jpg
  - cl5lg05um01h2074k1fmo9z46.jpg

Duplicate Group 6 (2 identical files):
  - cl5lg05uk01ey074k4s9dftjk.jpg
  - cl5lg05uf01aa074k30lz1oq2.jpg

Duplicate Group 7 (2 identical files):
  - cl5lg05uk01f2074k0wkaaelh.jpg
  - cl5lg05uj01dq074kg2o42anm.jpg

Duplicate Group 8 (2 identical files):
  - cl5lg05uk01fi074k69ht356y.jpg
  - 

### Check 2: Uniqueness

To check uniqueness, we look for images that appear more than once. To do this, we compute a hash of each file, which acts like a fingerprint of the file's contents. Once we have the fingerprints, any hash shared by two or more files marks a group of identical images.
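
The fingerprint idea can be sketched on in-memory bytes, without touching the dataset. The file names and contents below are made up for illustration; only `hashlib.md5` and `collections.defaultdict` are real library calls.

```python
import hashlib
from collections import defaultdict

# Hypothetical file contents: a.jpg and c.jpg hold identical bytes.
files = {
    "a.jpg": b"same pixels",
    "b.jpg": b"other pixels",
    "c.jpg": b"same pixels",
}

# Group file names by the MD5 digest of their contents.
hash_to_names = defaultdict(list)
for name, content in files.items():
    hash_to_names[hashlib.md5(content).hexdigest()].append(name)

# Any digest shared by two or more names marks a group of byte-identical files.
duplicate_groups = [names for names in hash_to_names.values() if len(names) > 1]
print(duplicate_groups)  # [['a.jpg', 'c.jpg']]
```

A byte-level hash only matches files whose bytes are identical, so the same radiograph saved in a different format or re-compressed would not be flagged by this approach.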

In [None]:
import hashlib
from pathlib import Path
from collections import defaultdict
from datetime import datetime

# %%
# Define base path and splits
base_path = r"C:\Users\victo\Downloads\Aariz\Aariz"
splits = ['train', 'test', 'valid']

# %%
def get_file_hash(filepath):
    """Generate MD5 hash of a file"""
    hash_md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def check_duplicates_in_split(split_name, images_dir):
    """Check for duplicates in a single split"""
    print(f"\n{'='*50}")
    print(f"Checking {split_name.upper()} set")
    print('='*50)
    
    # Get all image files (using set to avoid duplicates)
    image_files = set()
    for pattern in ['*.jpg', '*.jpeg', '*.png', '*.bmp', '*.tif', '*.tiff']:
        image_files.update(images_dir.glob(pattern))
    
    image_files = list(image_files)
    
    if not image_files:
        print(f"No images found in {images_dir}")
        return None
    
    print(f"Found {len(image_files)} image files")
    print("Computing hashes...")
    
    # Hash each file
    hash_to_files = defaultdict(list)
    for img_path in image_files:
        file_hash = get_file_hash(img_path)
        hash_to_files[file_hash].append(img_path.name)
    
    # Find duplicates
    duplicate_groups = []
    for file_hash, files in hash_to_files.items():
        if len(files) > 1:
            duplicate_groups.append(files)
    
    # Report results
    total_unique = len(hash_to_files)
    total_files = len(image_files)
    total_duplicates = total_files - total_unique
    
    result = {
        'split': split_name,
        'total_files': total_files,
        'unique_images': total_unique,
        'duplicate_files': total_duplicates,
        'duplicate_groups': duplicate_groups
    }
    
    # Print results
    if duplicate_groups:
        print(f"\n Found {len(duplicate_groups)} groups of duplicate images:")
        for i, group in enumerate(duplicate_groups, 1):
            print(f"\nDuplicate Group {i} ({len(group)} identical files):")
            for filename in group:
                print(f"  - {filename}")
    else:
        print("\nNo duplicate images found!")
    
    print(f"\nSummary for {split_name}:")
    print(f"  Total files: {total_files}")
    print(f"  Unique images: {total_unique}")
    print(f"  Duplicate files: {total_duplicates}")
    
    return result

# %%
# Check all splits and save results
results = []
output_lines = []
output_lines.append(f"Duplicate Image Detection Report")
output_lines.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
output_lines.append("="*60)

for split in splits:
    images_dir = Path(base_path) / split / "Cephalograms"
    
    if images_dir.exists():
        result = check_duplicates_in_split(split, images_dir)
        if result:
            results.append(result)
            
            # Add to output file
            output_lines.append(f"\n{split.upper()} SET")
            output_lines.append("-"*40)
            output_lines.append(f"Total files: {result['total_files']}")
            output_lines.append(f"Unique images: {result['unique_images']}")
            output_lines.append(f"Duplicate files: {result['duplicate_files']}")
            
            if result['duplicate_groups']:
                output_lines.append(f"\nFound {len(result['duplicate_groups'])} duplicate groups:")
                for i, group in enumerate(result['duplicate_groups'], 1):
                    output_lines.append(f"\n  Group {i} ({len(group)} identical files):")
                    for filename in group:
                        output_lines.append(f"    - {filename}")
            else:
                output_lines.append("\nNo duplicates found.")
    else:
        print(f"\nDirectory not found: {images_dir}")
        output_lines.append(f"\n{split.upper()} SET: Directory not found")

# %%
# Save results to file
output_file = "duplicate_check_results.txt"
with open(output_file, 'w') as f:
    f.write('\n'.join(output_lines))

print(f"\n{'='*60}")
print(f"Results saved to: {output_file}")

# %%
# Overall summary
print(f"\n{'='*60}")
print("OVERALL SUMMARY")
print('='*60)

total_all = sum(r['total_files'] for r in results)
unique_all = sum(r['unique_images'] for r in results)
duplicates_all = sum(r['duplicate_files'] for r in results)

print(f"Across all {len(results)} splits:")
print(f"  Total files: {total_all}")
print(f"  Total unique images: {unique_all}")
print(f"  Total duplicate files: {duplicates_all}")

# Add overall summary to file
output_lines.append(f"\n{'='*60}")
output_lines.append("OVERALL SUMMARY")
output_lines.append(f"Total files across all splits: {total_all}")
output_lines.append(f"Total unique images: {unique_all}")
output_lines.append(f"Total duplicate files: {duplicates_all}")

# Re-save with overall summary
with open(output_file, 'w') as f:
    f.write('\n'.join(output_lines))

print(f"\nFinal results saved to: {output_file}")


Checking TRAIN set
Found 700 image files
Computing hashes...

⚠️  Found 11 groups of duplicate images:

Duplicate Group 1 (2 identical files):
  - cl5lg05ug01au074k511u78le.jpg
  - cl5lg05uj01d6074k93foas54.jpg

Duplicate Group 2 (2 identical files):
  - cl5lg05un01ia074kf2a4ayes.jpg
  - cl5lg05uk01fu074kfuqths55.jpg

Duplicate Group 3 (2 identical files):
  - cl5lg05uf01a6074k9wdlgqlf.jpg
  - cl5lg05ue019i074kfgtu5ot7.jpg

Duplicate Group 4 (2 identical files):
  - cl5lg05uh01c2074kaiage5hg.jpg
  - cl5lg05uh01bu074k41u2cc4k.jpg

Duplicate Group 5 (2 identical files):
  - cl5lg05uf019q074k77b49iib.jpg
  - cl5lg05um01h2074k1fmo9z46.jpg

Duplicate Group 6 (2 identical files):
  - cl5lg05uk01ey074k4s9dftjk.jpg
  - cl5lg05uf01aa074k30lz1oq2.jpg

Duplicate Group 7 (2 identical files):
  - cl5lg05uk01f2074k0wkaaelh.jpg
  - cl5lg05uj01dq074kg2o42anm.jpg

Duplicate Group 8 (2 identical files):
  - cl5lg05uk01fi074k69ht356y.jpg
  - cl5lg05ue018y074kdq6yc9cj.jpg

Duplicate Group 9 (2 identical 

In [1]:
import json
from PIL import Image
from pathlib import Path
from collections import defaultdict
import hashlib
from datetime import datetime

# %%
# Set paths for dataset 2
annotations_dir = Path(r"C:\Users\victo\Downloads\dental-cepha-dataset\dental-cepha-dataset_json\doctor1")
images_dir = Path(r"C:\Users\victo\Downloads\dental-cepha-dataset\dental-cepha-dataset_json\image")

# %%
# Get first annotation and corresponding image for validation check
annotation_files = list(annotations_dir.glob("*.json"))
image_files = list(images_dir.glob("*.*"))

print(f"Found {len(annotation_files)} annotation files")
print(f"Found {len(image_files)} image files")

# %%
# Validate landmarks for first file as example
if annotation_files and image_files:
    annotation_file = annotation_files[0]
    
    # Load annotation to get ceph_id
    with open(annotation_file) as f:
        data = json.load(f)
        ceph_id = data.get("ceph_id", "unknown")
    
    print(f"\nChecking: {annotation_file.name}")
    print(f"Ceph ID: {ceph_id}")
    
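    # Find matching image by file stem (assumes 1.json pairs with 1.bmp).
    # Substring matching on ceph_id is unreliable here: a stem such as "1"
    # occurs inside every ID that starts with "ceph_doc1_".
    image_file = next((img for img in image_files if img.stem == annotation_file.stem), None)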
    
    if not image_file:
        image_file = image_files[0]  # Use first image as fallback
        print(f"Using image: {image_file.name} (no ID match found)")
    else:
        print(f"Matched image: {image_file.name}")
    
    # Get image resolution
    img = Image.open(image_file)
    width, height = img.size
    print(f"\nImage resolution: {width} x {height}")
    
    # Check each landmark
    print(f"\nChecking {len(data['landmarks'])} landmarks:")
    print("-" * 40)
    
    invalid_count = 0
    for landmark in data['landmarks']:
        name = landmark['title']
        symbol = landmark['symbol']
        x = landmark['value']['x']
        y = landmark['value']['y']
        
        # Check if within bounds
        if not (0 <= x < width and 0 <= y < height):
            print(f"❌ {name} ({symbol}): ({x}, {y}) - OUT OF BOUNDS")
            invalid_count += 1
        else:
            print(f"✓ {name} ({symbol}): ({x}, {y})")
    
    print("-" * 40)
    print(f"\nResult: {invalid_count} invalid landmarks out of {len(data['landmarks'])}")

# %% [markdown]
# ## 2. Check All Files for Invalid Landmarks

# %%
def validate_all_landmarks():
    """Check all annotation files for out-of-bounds landmarks"""
    
    results = []
    files_with_invalid = 0
    total_invalid_landmarks = 0
    
    print("\nValidating all annotation files...")
    print("=" * 50)
    
    for ann_file in annotation_files:
        with open(ann_file) as f:
            data = json.load(f)
            ceph_id = data.get("ceph_id", "unknown")
        
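        # Find the matching image by file stem (assumes 1.json pairs with 1.bmp).
        # Substring matching on ceph_id is unreliable here: a stem such as "1"
        # occurs inside every ID that starts with "ceph_doc1_".
        img_file = next((img for img in image_files if img.stem == ann_file.stem), None)
        
        if not img_file:
            # Skip rather than validate against an unrelated image
            print(f"⚠️  {ann_file.name}: no matching image found, skipped")
        
        if img_file:
            with Image.open(img_file) as img:
                width, height = img.size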
            
            invalid_in_file = 0
            for landmark in data['landmarks']:
                x = landmark['value']['x']
                y = landmark['value']['y']
                if not (0 <= x < width and 0 <= y < height):
                    invalid_in_file += 1
            
            if invalid_in_file > 0:
                files_with_invalid += 1
                total_invalid_landmarks += invalid_in_file
                print(f"⚠️  {ann_file.name}: {invalid_in_file} invalid landmarks")
                results.append(f"{ann_file.name}: {invalid_in_file} invalid landmarks")
    
    print("\n" + "=" * 50)
    print("LANDMARK VALIDATION SUMMARY")
    print(f"Files with invalid landmarks: {files_with_invalid}/{len(annotation_files)}")
    print(f"Total invalid landmarks: {total_invalid_landmarks}")
    
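    # Return the counters as well: the report cell below needs them at module level
    return results, files_with_invalid, total_invalid_landmarks

validation_results, files_with_invalid, total_invalid_landmarks = validate_all_landmarks()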

# %% [markdown]
# ## 3. Duplicate Image Detection

# %%
def get_file_hash(filepath):
    """Generate MD5 hash of a file"""
    hash_md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

# %%
# Check for duplicate images
print("\n" + "=" * 50)
print("DUPLICATE IMAGE CHECK")
print("=" * 50)

hash_to_files = defaultdict(list)

# Get unique image files
image_files_set = set(images_dir.glob("*.*"))
image_files_list = list(image_files_set)

print(f"Found {len(image_files_list)} image files")
print("Computing hashes...")

# Hash each file
for img_path in image_files_list:
    file_hash = get_file_hash(img_path)
    hash_to_files[file_hash].append(img_path.name)

# Find duplicates
duplicate_groups = []
for file_hash, files in hash_to_files.items():
    if len(files) > 1:
        duplicate_groups.append(files)

# Report results
if duplicate_groups:
    print(f"\n⚠️  Found {len(duplicate_groups)} groups of duplicate images:\n")
    for i, group in enumerate(duplicate_groups, 1):
        print(f"Duplicate Group {i} ({len(group)} identical files):")
        for filename in group:
            print(f"  - {filename}")
        print()
else:
    print("\n✅ No duplicate images found!")

# Summary
total_unique = len(hash_to_files)
total_files = len(image_files_list)
total_duplicates = total_files - total_unique

print("Summary:")
print(f"  Total files: {total_files}")
print(f"  Unique images: {total_unique}")
print(f"  Duplicate files: {total_duplicates}")

# %% [markdown]
# ## 4. Save Results to File

# %%
# Save all results to a text file
output_file = "dental_cepha_validation_results.txt"

with open(output_file, 'w') as f:
    f.write("Dental Cepha Dataset Validation Report\n")
    f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    f.write("=" * 60 + "\n\n")
    
    # Dataset info
    f.write("DATASET INFORMATION\n")
    f.write("-" * 40 + "\n")
    f.write(f"Annotations directory: {annotations_dir}\n")
    f.write(f"Images directory: {images_dir}\n")
    f.write(f"Total annotation files: {len(annotation_files)}\n")
    f.write(f"Total image files: {len(image_files)}\n\n")
    
    # Landmark validation results
    f.write("LANDMARK VALIDATION RESULTS\n")
    f.write("-" * 40 + "\n")
    if validation_results:
        for result in validation_results:
            f.write(f"{result}\n")
    else:
        f.write("All landmarks are within image bounds.\n")
    f.write(f"\nFiles with invalid landmarks: {files_with_invalid}/{len(annotation_files)}\n")
    f.write(f"Total invalid landmarks: {total_invalid_landmarks}\n\n")
    
    # Duplicate check results
    f.write("DUPLICATE IMAGE CHECK RESULTS\n")
    f.write("-" * 40 + "\n")
    f.write(f"Total files: {total_files}\n")
    f.write(f"Unique images: {total_unique}\n")
    f.write(f"Duplicate files: {total_duplicates}\n\n")
    
    if duplicate_groups:
        f.write(f"Found {len(duplicate_groups)} duplicate groups:\n\n")
        for i, group in enumerate(duplicate_groups, 1):
            f.write(f"Group {i} ({len(group)} identical files):\n")
            for filename in group:
                f.write(f"  - {filename}\n")
            f.write("\n")
    else:
        f.write("No duplicate images found.\n")

print(f"\n{'=' * 50}")
print(f"Results saved to: {output_file}")

Found 102 annotation files
Found 102 image files

Checking: 1.json
Ceph ID: ceph_doc1_img1_20250924115919
Matched image: 1.bmp

Image resolution: 2089 x 1937

Checking 19 landmarks:
----------------------------------------
✓ Sella (S): (1175, 774)
✓ Nasion (N): (1632, 639)
✓ Orbitale (Or): (1597, 862)
✓ Porion (Po): (1022, 932)
✓ Anterior Nasal Spine (ANS): (1728, 1129)
✓ Posterior Nasal Spine (PNS): (1697, 1474)
✓ A-point (A): (1686, 1582)
✓ B-point (B): (1638, 1630)
✓ Pogonion (Pog): (1671, 1621)
✓ Menton (Me): (1177, 1367)
✓ Gnathion (Gn): (1791, 1345)
✓ Gonion (Go): (1800, 1330)
✓ Articulare (Ar): (1894, 1221)
✓ Lower Incisor Tip (LIT): (1872, 1401)
✓ Upper Incisor Tip (UIT): (1834, 1131)
✓ Soft Tissue Pogonion (Pos): (1773, 1598)
✓ Subnasale (Sn): (1351, 1111)
✓ Labrale superius (Ls): (1750, 1094)
✓ Labrale inferius (Li): (1103, 1054)
----------------------------------------

Result: 0 invalid landmarks out of 19

Validating all annotation files...

LANDMARK VALIDATION SUMMARY
Fil

NameError: name 'files_with_invalid' is not defined
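
The `NameError` in the output above is what Python raises when a name that was only assigned inside a function is used at module level. A minimal sketch of that scoping rule, using hypothetical names rather than the notebook's own:

```python
def count_invalid():
    """Hypothetical stand-in for a validation function with a local counter."""
    invalid = 3  # local variable: it exists only while the function runs
    return invalid  # returning the value is what makes it available to the caller


try:
    print(invalid)  # the name was never defined at module level
except NameError as err:
    print(err)  # name 'invalid' is not defined

invalid = count_invalid()  # bind the returned value to a module-level name
print(invalid)  # 3
```

Returning every value the later cells need, and unpacking them at the call site, avoids the error.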

In [2]:
# %% [markdown]
# # Kaggle Dataset Validation
# Dataset 3/3: Checking landmark coordinates and duplicate images

# %% [markdown]
# ## 1. Landmark Coordinate Validation

# %%
import json
from PIL import Image
from pathlib import Path
from collections import defaultdict
import hashlib
from datetime import datetime

# %%
# Set paths for dataset 3 (Kaggle)
annotations_dir = Path(r"C:\Users\victo\Downloads\OneDrive_2025-10-02\Dataset Kaggle\Annotations")
images_dir = Path(r"C:\Users\victo\Downloads\OneDrive_2025-10-02\Dataset Kaggle\Cephalograms")

# %%
# Get annotation and image files
annotation_files = list(annotations_dir.glob("*.json"))
image_files = list(images_dir.glob("*.*"))

print(f"Found {len(annotation_files)} annotation files")
print(f"Found {len(image_files)} image files")

# %%
# Validate landmarks for first file as example
if annotation_files and image_files:
    annotation_file = annotation_files[0]
    
    # Load annotation to get ceph_id
    with open(annotation_file) as f:
        data = json.load(f)
        ceph_id = data.get("ceph_id", "unknown")
    
    print(f"\nChecking: {annotation_file.name}")
    print(f"Ceph ID: {ceph_id}")
    
    # Find matching image (try to match by ID in filename)
    image_file = None
    for img in image_files:
        if ceph_id in img.stem or img.stem in ceph_id:
            image_file = img
            break
    
    if not image_file:
        image_file = image_files[0]  # Use first image as fallback
        print(f"Using image: {image_file.name} (no ID match found)")
    else:
        print(f"Matched image: {image_file.name}")
    
    # Get image resolution
    img = Image.open(image_file)
    width, height = img.size
    print(f"\nImage resolution: {width} x {height}")
    
    # Check each landmark
    print(f"\nChecking {len(data['landmarks'])} landmarks:")
    print("-" * 40)
    
    invalid_count = 0
    for landmark in data['landmarks']:
        name = landmark['title']
        symbol = landmark['symbol']
        x = landmark['value']['x']
        y = landmark['value']['y']
        
        # Check if within bounds
        if not (0 <= x < width and 0 <= y < height):
            print(f"❌ {name} ({symbol}): ({x}, {y}) - OUT OF BOUNDS")
            invalid_count += 1
        else:
            print(f"✓ {name} ({symbol}): ({x}, {y})")
    
    print("-" * 40)
    print(f"\nResult: {invalid_count} invalid landmarks out of {len(data['landmarks'])}")

# %% [markdown]
# ## 2. Check All Files for Invalid Landmarks

# %%
def validate_all_landmarks():
    """Check all annotation files for out-of-bounds landmarks"""
    
    results = []
    files_with_invalid = 0
    total_invalid_landmarks = 0
    
    print("\nValidating all annotation files...")
    print("=" * 50)
    
    for ann_file in annotation_files:
        with open(ann_file) as f:
            data = json.load(f)
            ceph_id = data.get("ceph_id", "unknown")
        
        # Find matching image
        img_file = None
        for img in image_files:
            if ceph_id in img.stem or img.stem in ceph_id:
                img_file = img
                break
        
        if not img_file:
            # No image matched: skip this file rather than validate against an unrelated image
            print(f"⚠️  {ann_file.name}: no matching image found, skipped")
        
        if img_file:
            img = Image.open(img_file)
            width, height = img.size
            
            invalid_in_file = 0
            for landmark in data['landmarks']:
                x = landmark['value']['x']
                y = landmark['value']['y']
                if not (0 <= x < width and 0 <= y < height):
                    invalid_in_file += 1
            
            if invalid_in_file > 0:
                files_with_invalid += 1
                total_invalid_landmarks += invalid_in_file
                print(f"⚠️  {ann_file.name}: {invalid_in_file} invalid landmarks")
                results.append(f"{ann_file.name}: {invalid_in_file} invalid landmarks")
    
    print("\n" + "=" * 50)
    print("LANDMARK VALIDATION SUMMARY")
    print(f"Files with invalid landmarks: {files_with_invalid}/{len(annotation_files)}")
    print(f"Total invalid landmarks: {total_invalid_landmarks}")
    
    return results, files_with_invalid, total_invalid_landmarks

validation_results, files_with_invalid, total_invalid_landmarks = validate_all_landmarks()

# %% [markdown]
# ## 3. Duplicate Image Detection

# %%
def get_file_hash(filepath):
    """Generate MD5 hash of a file"""
    hash_md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

# %%
# Check for duplicate images
print("\n" + "=" * 50)
print("DUPLICATE IMAGE CHECK")
print("=" * 50)

hash_to_files = defaultdict(list)

# Get unique image files
image_files_set = set(images_dir.glob("*.*"))
image_files_list = list(image_files_set)

print(f"Found {len(image_files_list)} image files")
print("Computing hashes...")

# Hash each file
for img_path in image_files_list:
    file_hash = get_file_hash(img_path)
    hash_to_files[file_hash].append(img_path.name)

# Find duplicates
duplicate_groups = []
for file_hash, files in hash_to_files.items():
    if len(files) > 1:
        duplicate_groups.append(files)

# Report results
if duplicate_groups:
    print(f"\n⚠️  Found {len(duplicate_groups)} groups of duplicate images:\n")
    for i, group in enumerate(duplicate_groups, 1):
        print(f"Duplicate Group {i} ({len(group)} identical files):")
        for filename in group:
            print(f"  - {filename}")
        print()
else:
    print("\n✅ No duplicate images found!")

# Summary
total_unique = len(hash_to_files)
total_files = len(image_files_list)
total_duplicates = total_files - total_unique

print("Summary:")
print(f"  Total files: {total_files}")
print(f"  Unique images: {total_unique}")
print(f"  Duplicate files: {total_duplicates}")

# %% [markdown]
# ## 4. Save Results to File

# %%
# Save all results to a text file
output_file = "kaggle_dataset_validation_results.txt"

with open(output_file, 'w') as f:
    f.write("Kaggle Dataset Validation Report\n")
    f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    f.write("=" * 60 + "\n\n")
    
    # Dataset info
    f.write("DATASET INFORMATION\n")
    f.write("-" * 40 + "\n")
    f.write(f"Annotations directory: {annotations_dir}\n")
    f.write(f"Images directory: {images_dir}\n")
    f.write(f"Total annotation files: {len(annotation_files)}\n")
    f.write(f"Total image files: {len(image_files)}\n\n")
    
    # Landmark validation results
    f.write("LANDMARK VALIDATION RESULTS\n")
    f.write("-" * 40 + "\n")
    if validation_results:
        for result in validation_results:
            f.write(f"{result}\n")
    else:
        f.write("All landmarks are within image bounds.\n")
    f.write(f"\nFiles with invalid landmarks: {files_with_invalid}/{len(annotation_files)}\n")
    f.write(f"Total invalid landmarks: {total_invalid_landmarks}\n\n")
    
    # Duplicate check results
    f.write("DUPLICATE IMAGE CHECK RESULTS\n")
    f.write("-" * 40 + "\n")
    f.write(f"Total files: {total_files}\n")
    f.write(f"Unique images: {total_unique}\n")
    f.write(f"Duplicate files: {total_duplicates}\n\n")
    
    if duplicate_groups:
        f.write(f"Found {len(duplicate_groups)} duplicate groups:\n\n")
        for i, group in enumerate(duplicate_groups, 1):
            f.write(f"Group {i} ({len(group)} identical files):\n")
            for filename in group:
                f.write(f"  - {filename}\n")
            f.write("\n")
    else:
        f.write("No duplicate images found.\n")

print(f"\n{'=' * 50}")
print(f"Results saved to: {output_file}")

# %% [markdown]
# ## 5. Summary Across All Three Datasets

# %%
print("\n" + "=" * 60)
print("ALL DATASETS VALIDATION COMPLETE")
print("=" * 60)
print("\nDataset 1: Aariz Dataset - Check 'duplicate_check_results.txt'")
print("Dataset 2: Dental Cepha Dataset - Check 'dental_cepha_validation_results.txt'")
print("Dataset 3: Kaggle Dataset - Check 'kaggle_dataset_validation_results.txt'")
print("\nAll validation reports have been saved to text files.")

Found 400 annotation files
Found 400 image files

Checking: 001.json
Ceph ID: 001
Matched image: 001.jpg

Image resolution: 1935 x 2400

Checking 19 landmarks:
----------------------------------------
✓ Sella (S): (835, 996)
✓ Nasion (N): (1473, 1029)
✓ Orbitale (Or): (1289, 1279)
✓ Porion (Po): (604, 1228)
✓ A-point (A): (1375, 1654)
✓ B-point (B): (1386, 2019)
✓ Pogonion (Pog): (1333, 2200)
✓ Menton (Me): (1263, 2272)
✓ Gnathion (Gn): (1305, 2252)
✓ Gonion (Go): (694, 1805)
✓ Lower Incisor Tip (LIT): (1460, 1870)
✓ Upper Incisor Tip (UIT): (1450, 1864)
✓ Labrale superius (Ls): (1588, 1753)
✓ Labrale inferius (Li): (1569, 2013)
✓ Subnasale (Sn): (1514, 1620)
✓ Soft Tissue Pogonion (Pos): (1382, 2310)
✓ Posterior Nasal Spine (PNS): (944, 1506)
✓ Anterior Nasal Spine (ANS): (1436, 1569)
✓ Articulare (Ar): (664, 1340)
----------------------------------------

Result: 0 invalid landmarks out of 19

Validating all annotation files...

LANDMARK VALIDATION SUMMARY
Files with invalid landmark