
# HAM10000 – Concept Reading & Categorization Notebook

Goal of this notebook:
- Read and categorize all *usable concepts* for future Concept Bottleneck Model (CBM) work
- Clearly separate:
  - Concepts (C)
  - Targets / labels (Y)
  - Technical / excluded fields

This notebook is ONLY about concepts (no modeling).



## Dataset structure

Expected directory:

```
data/HAM10000/
 ├── ham10000_metadata_2026-02-09.csv
 └── ISIC-images/
```


# HAM10000 – Concept Bottleneck Model (CBM) view (X–C–Y)

This markdown cell explains how the HAM10000 / ISIC skin lesion dataset can be structured for a Concept Bottleneck Model (CBM) using the X–C–Y formulation.  
The objective is to clearly define the input, the available concepts, their categories, and the possible target labels, in order to prepare the dataset for future CBM-based research.

Dataset columns:

['isic_id',
 'attribution',
 'copyright_license',
 'age_approx',
 'anatom_site_general',
 'anatom_site_special',
 'concomitant_biopsy',
 'diagnosis_1',
 'diagnosis_2',
 'diagnosis_3',
 'diagnosis_confirm_type',
 'image_manipulation',
 'image_type',
 'lesion_id',
 'melanocytic',
 'sex']


---------------------------------------------------------------------

## 1. Dataset overview

This dataset is derived from the ISIC / HAM10000 dermoscopy collection.  
Each sample corresponds to one dermoscopic image of a skin lesion and is associated with clinical and acquisition metadata as well as a hierarchical diagnostic annotation.

The dataset contains:

- a dermoscopic image identified by `isic_id`,
- patient and lesion context information,
- a three-level pathological annotation hierarchy.

The purpose of this dataset is skin lesion diagnosis and pathological classification.

---------------------------------------------------------------------

## 2. Diagnostic hierarchy and pathological classes

The dataset provides three diagnostic variables:

### diagnosis_1 – clinical malignancy level

This variable represents a coarse clinical outcome:

- Benign
- Malignant
- Indeterminate

This level answers the clinical question of malignancy.

---

### diagnosis_2 – pathological family

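This variable groups lesions into pathological families, one level above the specific lesion type. The values present in this dataset are:

- Benign melanocytic proliferations  
- Benign epidermal proliferations  
- Malignant melanocytic proliferations (Melanoma)  
- Malignant adnexal epithelial proliferations - Follicular  
- Malignant epidermal proliferations  
- Benign soft tissue proliferations - Vascular  
- Benign soft tissue proliferations - Fibro-histiocytic  
- Indeterminate epidermal proliferations  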

This level corresponds to broad pathological families.

---

### diagnosis_3 – fine-grained pathological class

This variable gives the specific lesion type. The values present in this dataset are:

- Nevus  
- Pigmented benign keratosis  
- Melanoma, NOS  
- Basal cell carcinoma  
- Squamous cell carcinoma, NOS  
- Dermatofibroma  
- Solar or actinic keratosis  

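This is the most clinically meaningful and fine-grained pathology label. No vascular class appears at this level in the counts shown later in this notebook, even though `diagnosis_2` includes a vascular family, so some rows may lack a `diagnosis_3` value.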

Important note for CBM:
diagnosis_1, diagnosis_2 and diagnosis_3 are targets (Y), not concepts (C).
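
If the three levels really form a hierarchy, each `diagnosis_3` value should map to exactly one `diagnosis_2` value, and each `diagnosis_2` value to exactly one `diagnosis_1` value. A minimal sketch of that check, assuming pandas; the helper name `is_strict_hierarchy` and the toy rows are illustrative and not part of the dataset:

```python
import pandas as pd

def is_strict_hierarchy(frame: pd.DataFrame, child: str, parent: str) -> bool:
    """Return True if every non-missing child value has exactly one parent value."""
    pairs = frame[[child, parent]].dropna()
    parents_per_child = pairs.groupby(child)[parent].nunique()
    return bool((parents_per_child == 1).all())

# Toy rows for illustration only; on the real data, pass the loaded metadata frame.
toy = pd.DataFrame({
    "diagnosis_1": ["Benign", "Benign", "Malignant"],
    "diagnosis_2": [
        "Benign melanocytic proliferations",
        "Benign epidermal proliferations",
        "Malignant melanocytic proliferations (Melanoma)",
    ],
    "diagnosis_3": ["Nevus", "Pigmented benign keratosis", "Melanoma, NOS"],
})

print(is_strict_hierarchy(toy, "diagnosis_3", "diagnosis_2"))
print(is_strict_hierarchy(toy, "diagnosis_2", "diagnosis_1"))
```

Rows with a missing value at either level are ignored by this check, so it should be read together with a count of missing labels.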

---------------------------------------------------------------------

## 3. CBM structure

A Concept Bottleneck Model follows the structure:

X → C → Y

For this dataset:

- X is the dermoscopic image,
- C is a set of interpretable clinical or semantic concepts,
- Y is the pathological diagnosis.

---------------------------------------------------------------------

## 4. Definition of X

X is the dermoscopic skin lesion image associated with `isic_id`, loaded from the image directory.

The image is the only raw visual input to the model.
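
As a small illustration, the image path for a sample can be derived from its `isic_id`. This sketch assumes one JPEG per identifier, named `<isic_id>.jpg`, directly under `ISIC-images/`; the helper name `image_path` is hypothetical:

```python
from pathlib import Path

# Assumption: images are stored as "<isic_id>.jpg" directly under ISIC-images/.
IMAGE_DIR = Path("data/HAM10000") / "ISIC-images"

def image_path(isic_id: str, image_dir: Path = IMAGE_DIR) -> Path:
    """Build the expected image path for one ISIC identifier."""
    return image_dir / f"{isic_id}.jpg"

print(image_path("ISIC_0024306"))
```

Checking `image_path(...).exists()` for every row is a cheap way to confirm that the metadata and the image folder agree.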

---------------------------------------------------------------------

## 5. Candidate concepts available in this dataset

From the available columns, the interpretable concept candidates are:

- age_approx
- sex
- anatom_site_general
- anatom_site_special
- melanocytic
- concomitant_biopsy

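These variables describe the patient or the lesion context. Most are available before diagnosis; `melanocytic` is the exception to watch, because it is tightly linked to the diagnosis (see section 6.3).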

All other columns are identifiers, legal information, acquisition metadata or diagnostic annotations.

---------------------------------------------------------------------

## 6. Concept categorization for CBM

For CBM, concepts should be separated according to their semantic role.

---------------------------------------------------------------------

### 6.1 Patient context concepts

These describe the patient rather than the lesion appearance:

- Age (derived from age_approx, typically discretized into age groups)
- Sex

These concepts are non-visual but clinically meaningful and useful as contextual concepts in CBM.
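
One possible discretization of `age_approx` is sketched below with `pandas.cut`. The bin edges and labels are assumptions chosen for illustration, not clinically validated cut-offs, and missing ages are kept as an explicit `unknown` group:

```python
import pandas as pd

# Assumption: bin edges and labels are illustrative, not clinically validated.
AGE_BINS = [0, 20, 40, 60, 80, 120]
AGE_LABELS = ["0-19", "20-39", "40-59", "60-79", "80+"]

def to_age_group(age: pd.Series) -> pd.Series:
    """Map approximate ages to ordered groups; missing ages become 'unknown'."""
    # right=False makes each bin left-closed, e.g. [40, 60) for "40-59".
    groups = pd.cut(age, bins=AGE_BINS, labels=AGE_LABELS, right=False)
    return groups.cat.add_categories("unknown").fillna("unknown")

ages = pd.Series([45.0, 5.0, 80.0, None])
print(to_age_group(ages).tolist())  # ['40-59', '0-19', '80+', 'unknown']
```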

---------------------------------------------------------------------

### 6.2 Anatomical context concepts

These describe where the lesion is located on the body:

- anatom_site_general  
  (recommended main anatomical concept)
- anatom_site_special  
  (fine-grained and very sparse; optional and usually grouped)

These concepts provide anatomical context and are well suited for CBM.
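
A categorical site can be turned into binary concepts with one-hot encoding. The sketch below keeps missing sites as their own `unknown` category rather than dropping rows, which is one option among several; the helper name `encode_site` is hypothetical:

```python
import pandas as pd

def encode_site(site: pd.Series) -> pd.DataFrame:
    """One-hot encode the anatomical site, keeping missing values as 'unknown'."""
    filled = site.fillna("unknown")
    return pd.get_dummies(filled, prefix="site", dtype=int)

# Toy values for illustration; on the real data use df["anatom_site_general"].
sites = pd.Series(["lower extremity", None, "anterior torso", "lower extremity"])
encoded = encode_site(sites)
print(encoded.columns.tolist())
```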

---------------------------------------------------------------------

### 6.3 Biological prior concept

- melanocytic

This indicates whether the lesion is of melanocytic origin.

This is a high-level medical concept and extremely informative.  
However, it is also strongly correlated with the final diagnosis and must be handled carefully in CBM experiments, as it can dominate the prediction and reduce the role of visual concepts.

---------------------------------------------------------------------

### 6.4 Clinical procedure related concept

- concomitant_biopsy

This represents a clinical context (whether another lesion was biopsied at the same time).  
It does not describe lesion appearance but can be used as a weak auxiliary concept.

---------------------------------------------------------------------

## 7. Concepts that should NOT be used

The following columns must not be used as concepts:

- isic_id
- lesion_id
- attribution
- copyright_license
- image_type
- image_manipulation
- diagnosis_confirm_type
- diagnosis_1
- diagnosis_2
- diagnosis_3

They correspond to identifiers, legal or acquisition metadata, or directly encode diagnostic information and would introduce information leakage.
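
A simple guard can keep these columns out of any concept set by accident. This is a sketch; the names `EXCLUDED_COLUMNS` and `check_concept_columns` are hypothetical:

```python
# Columns listed above as not usable as concepts.
EXCLUDED_COLUMNS = {
    "isic_id", "lesion_id", "attribution", "copyright_license",
    "image_type", "image_manipulation", "diagnosis_confirm_type",
    "diagnosis_1", "diagnosis_2", "diagnosis_3",
}

def check_concept_columns(columns):
    """Raise ValueError if any proposed concept column is on the exclusion list."""
    leaked = sorted(set(columns) & EXCLUDED_COLUMNS)
    if leaked:
        raise ValueError(f"Not allowed as concepts: {leaked}")
    return list(columns)

print(check_concept_columns(["age_approx", "sex", "anatom_site_general"]))
```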

---------------------------------------------------------------------

## 8. Visual dermatological concepts are not provided

This dataset does not contain visual dermatological concepts such as:

- asymmetry,
- border irregularity,
- color variation,
- pigment network,
- dots and globules,
- streaks,
- blue-white veil,
- vascular structures.

In a CBM research setting, these would form a set of visual concepts (C_visual) that must be learned from images or annotated separately.

---------------------------------------------------------------------

## 9. What fits best with CBM for this dataset

This dataset naturally supports a hybrid CBM formulation:

- metadata concepts are directly available,
- visual dermatological concepts can be learned later.

Conceptually, the model can be written as:

X_image → C_metadata → Y  
X_image → C_visual   → Y

At present, only C_metadata is directly available.

---------------------------------------------------------------------

## 10. Definition of Y for CBM experiments

Several valid targets are possible:

- Coarse classification:
  Y = diagnosis_1

- Intermediate classification:
  Y = diagnosis_2

- Fine-grained pathological classification (recommended):
  Y = diagnosis_3

For medical and interpretability-oriented CBM research, diagnosis_3 is the most appropriate target.

---------------------------------------------------------------------

## 11. Final X–C–Y formulation

X = dermoscopic image

C = {
     Age_group,
     Sex,
     Anatomical_site_general,
     Melanocytic_origin,
     Concomitant_biopsy
    }

Y = diagnosis_3

---------------------------------------------------------------------

## 12. CBM-style metadata object for this dataset

```python
metadata = {
    "task_id": "HAM10000_CBM_Pathology_Task",
    "task_name": "Skin_Lesion_Pathology_Classification_with_CBM",
    "X": "Dermoscopy image loaded from data/HAM10000/ISIC-images/{isic_id}.jpg",
    "base_concepts": [
        "Age_group",
        "Sex",
        "Anatomical_site_general",
        "Melanocytic_origin",
        "Concomitant_biopsy"
    ],
    "concept_groups": {
        "Patient_context": [
            "Age_group",
            "Sex"
        ],
        "Anatomical_context": [
            "Anatomical_site_general"
        ],
        "Biological_prior": [
            "Melanocytic_origin"
        ],
        "Clinical_context": [
            "Concomitant_biopsy"
        ]
    },
    "target": "diagnosis_3",
    "expected_pattern": "Dermoscopy image → clinical and anatomical concepts → pathological class",
    "description": "Dermoscopy image combined with interpretable clinical and anatomical metadata concepts to predict fine-grained skin lesion pathology (diagnosis_3).",
    "concept_indices": {
        "Age_group": 0,
        "Sex": 1,
        "Anatomical_site_general": 2,
        "Melanocytic_origin": 3,
        "Concomitant_biopsy": 4
    }
}
```


In [1]:

import pandas as pd
from pathlib import Path

DATA_DIR = Path("data/HAM10000")
META_FILE = DATA_DIR / "ham10000_metadata_2026-02-09.csv"

df = pd.read_csv(META_FILE)
df.head()


| | isic_id | attribution | copyright_license | age_approx | anatom_site_general | anatom_site_special | concomitant_biopsy | diagnosis_1 | diagnosis_2 | diagnosis_3 | diagnosis_confirm_type | image_manipulation | image_type | lesion_id | melanocytic | sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ISIC_0024306 | ViDIR Group, Department of Dermatology, Medica... | CC-BY-NC | 45.0 | | | False | Benign | Benign melanocytic proliferations | Nevus | serial imaging showing no change | | dermoscopic | IL_7252831 | True | male |
| 1 | ISIC_0024307 | ViDIR Group, Department of Dermatology, Medica... | CC-BY-NC | 50.0 | lower extremity | | False | Benign | Benign melanocytic proliferations | Nevus | serial imaging showing no change | | dermoscopic | IL_6125741 | True | male |
| 2 | ISIC_0024308 | ViDIR Group, Department of Dermatology, Medica... | CC-BY-NC | 55.0 | | | False | Benign | Benign melanocytic proliferations | Nevus | serial imaging showing no change | | dermoscopic | IL_3692653 | True | female |
| 3 | ISIC_0024309 | ViDIR Group, Department of Dermatology, Medica... | CC-BY-NC | 40.0 | | | False | Benign | Benign melanocytic proliferations | Nevus | serial imaging showing no change | | dermoscopic | IL_0959663 | True | male |
| 4 | ISIC_0024310 | ViDIR Group, Department of Dermatology, Medica... | CC-BY-NC | 60.0 | anterior torso | | True | Malignant | Malignant melanocytic proliferations (Melanoma) | Melanoma, NOS | histopathology | | dermoscopic | IL_8194852 | True | male |



## 1. Raw columns


In [2]:

df.columns.tolist()

['isic_id',
 'attribution',
 'copyright_license',
 'age_approx',
 'anatom_site_general',
 'anatom_site_special',
 'concomitant_biopsy',
 'diagnosis_1',
 'diagnosis_2',
 'diagnosis_3',
 'diagnosis_confirm_type',
 'image_manipulation',
 'image_type',
 'lesion_id',
 'melanocytic',
 'sex']


## 2. Diagnosis hierarchy (labels, not concepts)

This dataset contains a three-level diagnostic hierarchy:

- diagnosis_1 : clinical malignancy status
- diagnosis_2 : pathological family
- diagnosis_3 : specific pathological entity

These are the targets (Y), not concepts (C).


In [3]:

df["diagnosis_1"].value_counts()

diagnosis_1
Benign           9415
Malignant        2156
Indeterminate     149
Name: count, dtype: int64

In [4]:

df["diagnosis_2"].value_counts()

diagnosis_2
Benign melanocytic proliferations                           7737
Benign epidermal proliferations                             1338
Malignant melanocytic proliferations (Melanoma)             1305
Malignant adnexal epithelial proliferations - Follicular     622
Malignant epidermal proliferations                           229
Benign soft tissue proliferations - Vascular                 180
Benign soft tissue proliferations - Fibro-histiocytic        160
Indeterminate epidermal proliferations                       149
Name: count, dtype: int64

In [5]:

df["diagnosis_3"].value_counts()

diagnosis_3
Nevus                           7737
Pigmented benign keratosis      1338
Melanoma, NOS                   1305
Basal cell carcinoma             622
Squamous cell carcinoma, NOS     229
Dermatofibroma                   160
Solar or actinic keratosis       149
Name: count, dtype: int64
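
The `diagnosis_3` counts above sum to 11,540, while the `diagnosis_1` and `diagnosis_2` counts each sum to 11,720. The gap of 180 equals the size of the "Benign soft tissue proliferations - Vascular" family, which suggests, but does not prove, that those rows have no `diagnosis_3` value. Since `value_counts` drops missing values by default, a sketch like the following makes them visible; the helper name `label_summary` and the toy labels are illustrative:

```python
import pandas as pd

def label_summary(labels: pd.Series) -> pd.DataFrame:
    """Count each label, including missing values, with its share of all rows."""
    summary = labels.value_counts(dropna=False).to_frame("count")
    summary["share"] = summary["count"] / len(labels)
    return summary

# Toy labels for illustration; on the real data use df["diagnosis_3"].
toy_labels = pd.Series(["Nevus", "Nevus", "Melanoma, NOS", None])
summary = label_summary(toy_labels)
print(summary)
```

The same summary also exposes the class imbalance, which matters when choosing a target level.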


### Diagnostic meaning

diagnosis_1  → coarse clinical outcome  
diagnosis_2  → pathological group  
diagnosis_3  → fine-grained pathology  

We will later choose which level to use as Y.



## 3. Definition of CBM concepts

We define concepts as *interpretable variables available before diagnosis*.

In this dataset, only metadata-based concepts are available.



### 3.1 Candidate concept columns


In [6]:

concept_columns = [
    "age_approx",
    "sex",
    "anatom_site_general",
    "anatom_site_special",
    "melanocytic",
    "concomitant_biopsy"
]

df[concept_columns].head()


| | age_approx | sex | anatom_site_general | anatom_site_special | melanocytic | concomitant_biopsy |
|---|---|---|---|---|---|---|
| 0 | 45.0 | male | | | True | False |
| 1 | 50.0 | male | lower extremity | | True | False |
| 2 | 55.0 | female | | | True | False |
| 3 | 40.0 | male | | | True | False |
| 4 | 60.0 | male | anterior torso | | True | True |



## 4. Concept categorization

We divide concepts into clinically meaningful groups.



### 4.1 Patient context concepts
- age_approx
- sex


In [7]:

df[["age_approx","sex"]].describe(include="all")


| | age_approx | sex |
|---|---|---|
| count | 11337.0 | 11377 |
| unique | | 2 |
| top | | male |
| freq | | 6179 |
| mean | 52.037135 | |
| std | 16.704833 | |
| min | 5.0 | |
| 25% | 40.0 | |
| 50% | 50.0 | |
| 75% | 65.0 | |



### 4.2 Lesion anatomical context concepts
- anatom_site_general
- anatom_site_special


In [8]:

df["anatom_site_general"].value_counts()


anatom_site_general
lower extremity    2731
posterior torso    2504
anterior torso     1624
upper extremity    1382
head/neck          1255
oral/genital         55
palms/soles           7
Name: count, dtype: int64

In [9]:

df["anatom_site_special"].value_counts().head(15)


anatom_site_special
acral NOS               475
oral or genital          55
acral palms or soles      7
Name: count, dtype: int64
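
The counts above already hint at missing data: the `anatom_site_general` values sum to 9,558 and the `anatom_site_special` values to 537, whereas `diagnosis_1` totals 11,720 rows. A per-column missingness report makes this explicit. This is a minimal sketch; the helper name `missing_report` and the toy frame are illustrative:

```python
import pandas as pd

def missing_report(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Per-column count and fraction of missing values."""
    missing = frame[columns].isna().sum()
    return pd.DataFrame({"n_missing": missing, "frac_missing": missing / len(frame)})

# Toy frame for illustration; on the real data pass df and concept_columns.
toy = pd.DataFrame({
    "age_approx": [45.0, None, 55.0, 40.0],
    "anatom_site_general": [None, "lower extremity", None, None],
})
report = missing_report(toy, ["age_approx", "anatom_site_general"])
print(report)
```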


### 4.3 Biological / pathological prior concepts
- melanocytic


In [10]:

df["melanocytic"].value_counts()


melanocytic
True     9042
False    2678
Name: count, dtype: int64


### 4.4 Clinical procedure related concept
- concomitant_biopsy


In [11]:

df["concomitant_biopsy"].value_counts()


concomitant_biopsy
True     6227
False    5493
Name: count, dtype: int64
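
In the five sample rows shown earlier, the only histopathology-confirmed lesion is also the only one with `concomitant_biopsy = True`. Five rows prove nothing, but it is worth checking whether the flag tracks `diagnosis_confirm_type`, since how a diagnosis was confirmed can itself be informative about the diagnosis. A sketch of that check using a row-normalized crosstab; the helper name and the toy frame (which mirrors the five sample rows) are illustrative:

```python
import pandas as pd

def biopsy_vs_confirmation(frame: pd.DataFrame) -> pd.DataFrame:
    """Row-normalized crosstab: share of each confirmation type per biopsy flag."""
    return pd.crosstab(
        frame["concomitant_biopsy"],
        frame["diagnosis_confirm_type"],
        normalize="index",
    )

# Toy frame mirroring the five sample rows; on the real data pass df.
toy = pd.DataFrame({
    "concomitant_biopsy": [False, False, False, False, True],
    "diagnosis_confirm_type": ["serial imaging showing no change"] * 4 + ["histopathology"],
})
table = biopsy_vs_confirmation(toy)
print(table)
```

If the real table turns out to be close to diagonal, `concomitant_biopsy` would deserve the same caution as `melanocytic`.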


## 5. Strong relationship warning (concept leakage analysis)

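We check how strongly the `melanocytic` concept is tied to the diagnosis. In the crosstab below, every `diagnosis_3` class falls entirely under one value of `melanocytic`: Nevus and Melanoma are always `True`, and the other five classes are always `False`. The flag is therefore fully determined by the diagnosis.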


In [12]:

pd.crosstab(df["melanocytic"], df["diagnosis_3"])


| melanocytic | Basal cell carcinoma | Dermatofibroma | Melanoma, NOS | Nevus | Pigmented benign keratosis | Solar or actinic keratosis | Squamous cell carcinoma, NOS |
|---|---|---|---|---|---|---|---|
| False | 622 | 160 | 0 | 0 | 1338 | 149 | 229 |
| True | 0 | 0 | 1305 | 7737 | 0 | 0 | 0 |
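
One way to put a number on this relationship is the accuracy reached by predicting, for each concept value, its most frequent target class. From the table above that would be (1338 + 7737) / 11540 ≈ 0.79 for `melanocytic` alone. The sketch below computes the same quantity for any concept and target; the helper name `single_concept_accuracy` and the toy frame are illustrative:

```python
import pandas as pd

def single_concept_accuracy(frame: pd.DataFrame, concept: str, target: str) -> float:
    """Accuracy of predicting the majority target class within each concept value."""
    table = pd.crosstab(frame[concept], frame[target])
    return float(table.max(axis=1).sum() / table.to_numpy().sum())

# Toy frame for illustration; on the real data pass df.
toy = pd.DataFrame({
    "melanocytic": [True, True, True, False, False],
    "diagnosis_3": ["Nevus", "Nevus", "Melanoma, NOS",
                    "Basal cell carcinoma", "Dermatofibroma"],
})
print(single_concept_accuracy(toy, "melanocytic", "diagnosis_3"))  # 0.6
```

Comparing this value across concepts shows which ones could dominate the label predictor on their own.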



## 6. Final concept taxonomy for CBM

C_meta (usable now):

- C_age (from age_approx, later discretized)
- C_sex
- C_anatom_site_general
- C_melanocytic
- C_concomitant_biopsy

Optional / advanced:

- C_anatom_site_special (only if grouped)

Excluded from concepts (the three diagnosis columns are also excluded, as targets):

- isic_id
- lesion_id
- attribution
- copyright_license
- image_type
- image_manipulation
- diagnosis_confirm_type



## 7. Label (Y) hierarchy summary

Y can be defined at three levels:

Level 1 (coarse):
- diagnosis_1  → {Benign, Malignant, Indeterminate}

Level 2 (intermediate):
- diagnosis_2  → pathological families

Level 3 (fine-grained):
- diagnosis_3  → concrete pathology classes

This hierarchy enables multi-level CBM experiments.
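
Switching between levels is easier if the target is encoded the same way for each one. A minimal sketch, assuming integer class codes are wanted and that rows with a missing label should be marked rather than silently dropped; the helper name `encode_target` is hypothetical:

```python
import pandas as pd

def encode_target(frame: pd.DataFrame, level: str = "diagnosis_3"):
    """Return integer class codes and the class names for one diagnosis level.

    Rows with a missing label get code -1 so they can be filtered explicitly.
    """
    categories = sorted(frame[level].dropna().unique())
    codes = pd.Categorical(frame[level], categories=categories).codes
    return codes, categories

# Toy frame for illustration; on the real data pass df and the chosen level.
toy = pd.DataFrame({"diagnosis_1": ["Benign", "Malignant", None, "Benign"]})
codes, classes = encode_target(toy, "diagnosis_1")
print(list(codes), classes)
```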



## 8. CBM formulation for this dataset

X : dermoscopic image

C : metadata concepts
    {age, sex, anatomical site, melanocytic, concomitant biopsy}

Y : diagnosis (chosen level among diagnosis_1, diagnosis_2 or diagnosis_3)

X → C → Y
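
The formulation above can be materialized as one table with an image path (X), the concept columns (C) and the chosen label (Y). This is a sketch under stated assumptions: images are named `<isic_id>.jpg`, rows without a target are dropped, and the concepts are left unencoded. The names `CONCEPTS` and `build_cbm_table`, and the toy identifiers, are hypothetical:

```python
from pathlib import Path
import pandas as pd

CONCEPTS = ["age_approx", "sex", "anatom_site_general",
            "melanocytic", "concomitant_biopsy"]

def build_cbm_table(frame: pd.DataFrame, target: str,
                    image_dir: Path = Path("data/HAM10000/ISIC-images")) -> pd.DataFrame:
    """Keep rows with a known target; return image path (X), concepts (C), label (Y)."""
    kept = frame.dropna(subset=[target])
    out = kept[CONCEPTS].copy()
    # Assumption: one JPEG per isic_id, named "<isic_id>.jpg".
    out.insert(0, "image_path", [str(image_dir / f"{i}.jpg") for i in kept["isic_id"]])
    out["y"] = kept[target]
    return out.reset_index(drop=True)

# Toy rows with made-up identifiers, for illustration only.
toy = pd.DataFrame({
    "isic_id": ["ISIC_A", "ISIC_B"],
    "age_approx": [45.0, 50.0],
    "sex": ["male", "female"],
    "anatom_site_general": [None, "lower extremity"],
    "melanocytic": [True, False],
    "concomitant_biopsy": [False, True],
    "diagnosis_3": ["Nevus", None],
})
cbm = build_cbm_table(toy, "diagnosis_3")
print(cbm)
```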



## 9. Diagnostic categories in the Kaggle version of HAM10000

The Kaggle release of HAM10000 uses 7 diagnostic categories:

| Code | Meaning |
|------|-------|
| nv | Melanocytic nevi |
| mel | Melanoma |
| bkl | Benign keratosis-like lesions |
| bcc | Basal cell carcinoma |
| akiec | Actinic keratoses / intraepithelial carcinoma |
| df | Dermatofibroma |
| vasc | Vascular lesions |

Link: https://www.kaggle.com/datasets/surajghuwalewala/ham1000-segmentation-and-classification/data

These categories correspond to the target label Y in CBM terminology.
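
To compare the two label sets, a lookup from the Kaggle codes to the `diagnosis_3` names used in this notebook can help. The correspondence below is an assumption inferred from the label names only, not an official mapping, and should be verified against both datasets. In particular, `akiec` is assumed to cover two `diagnosis_3` classes, and `vasc` has no counterpart among the `diagnosis_3` values seen here:

```python
# Assumption: this correspondence is inferred from label names, not from an
# official mapping; verify it against both datasets before relying on it.
KAGGLE_TO_DIAGNOSIS_3 = {
    "nv": ["Nevus"],
    "mel": ["Melanoma, NOS"],
    "bkl": ["Pigmented benign keratosis"],
    "bcc": ["Basal cell carcinoma"],
    "akiec": ["Solar or actinic keratosis", "Squamous cell carcinoma, NOS"],
    "df": ["Dermatofibroma"],
    "vasc": [],  # no diagnosis_3 value in this notebook's counts looks vascular
}

def kaggle_code_for(diagnosis_3: str):
    """Return the Kaggle code whose candidate list contains this label, else None."""
    for code, names in KAGGLE_TO_DIAGNOSIS_3.items():
        if diagnosis_3 in names:
            return code
    return None

print(kaggle_code_for("Nevus"))
```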



## 10. Summary

This notebook isolates and categorizes all interpretable concepts available
in the HAM10000 metadata and clearly separates them from diagnostic labels.

This is the conceptual preparation step before building CBM datasets.
