
# HAM10000 – Concept Reading & Categorization Notebook
Author: Meher

Goal of this notebook:
- Read and categorize all *usable concepts* for future concept bottleneck model (CBM) work
- Clearly separate:
  - Concepts (C)
  - Targets / labels (Y)
  - Technical / excluded fields

This notebook is ONLY about concepts (no modeling).



## Dataset structure

Expected directory:

```
data/HAM10000/
 ├── ham10000_metadata_2026-02-09.csv
 └── ISIC-images/
```


In [None]:

import pandas as pd
from pathlib import Path

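# Paths follow the expected directory layout shown above.
DATA_DIR = Path("data/HAM10000")
META_FILE = DATA_DIR / "ham10000_metadata_2026-02-09.csv"

# Load the metadata table and preview the first rows.
df = pd.read_csv(META_FILE)
df.head()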



## 1. Raw columns


In [None]:

df.columns.tolist()



## 2. Diagnosis hierarchy (labels, not concepts)

This dataset contains a three-level diagnostic hierarchy:

- diagnosis_1 : clinical malignancy status
- diagnosis_2 : pathological family
- diagnosis_3 : specific pathological entity

These are the targets (Y), not concepts (C).


In [None]:

df["diagnosis_1"].value_counts()


In [None]:

df["diagnosis_2"].value_counts()


In [None]:

df["diagnosis_3"].value_counts()



### Diagnostic meaning

diagnosis_1  → coarse clinical outcome  
diagnosis_2  → pathological group  
diagnosis_3  → fine-grained pathology  

We will later choose which level to use as Y.
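
Before choosing a level, it may help to confirm that the three columns really form a nested hierarchy, meaning each diagnosis_3 value belongs to exactly one diagnosis_2 value, and each diagnosis_2 value to exactly one diagnosis_1 value. The sketch below is one way to check this. The helper name `check_nested` and the toy values are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

def check_nested(frame: pd.DataFrame, fine: str, coarse: str) -> pd.Series:
    """Return the fine-level classes that map to more than one coarse-level class."""
    # Count distinct coarse values per fine value (missing values are ignored).
    n_parents = frame.dropna(subset=[fine]).groupby(fine)[coarse].nunique()
    return n_parents[n_parents > 1]

# Toy rows with made-up class names, only to illustrate the check.
toy = pd.DataFrame({
    "diagnosis_1": ["Benign", "Benign", "Malignant", "Malignant"],
    "diagnosis_2": ["family_a", "family_a", "family_b", "family_b"],
    "diagnosis_3": ["entity_1", "entity_2", "entity_3", "entity_3"],
})

violations_3_to_2 = check_nested(toy, "diagnosis_3", "diagnosis_2")
violations_2_to_1 = check_nested(toy, "diagnosis_2", "diagnosis_1")
print(len(violations_3_to_2), len(violations_2_to_1))
```

On the real metadata, the same calls would take `df` instead of `toy`. An empty result at both levels would support treating the columns as a strict hierarchy.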



## 3. Definition of CBM concepts

We define concepts as *interpretable variables available before diagnosis*.

In this dataset, only metadata-based concepts are available.



### 3.1 Candidate concept columns


In [None]:

concept_columns = [
    "age_approx",
    "sex",
    "anatom_site_general",
    "anatom_site_special",
    "melanocytic",
    "concomitant_biopsy"
]

df[concept_columns].head()
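
A concept is only usable if it is actually populated, so a per-column missingness summary is a natural next check. The following is a minimal sketch. The helper name `missing_summary` and the toy values are assumptions for illustration.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Summarize missing values and cardinality for the given columns."""
    # Skip columns that are absent instead of raising a KeyError.
    present = [c for c in columns if c in frame.columns]
    subset = frame[present]
    summary = pd.DataFrame({
        "n_missing": subset.isna().sum(),
        "frac_missing": subset.isna().mean(),
        "n_unique": subset.nunique(),
    })
    return summary.sort_values("frac_missing", ascending=False)

# Toy rows with made-up values, only to illustrate the output shape.
toy = pd.DataFrame({
    "age_approx": [45, None, 60, 70],
    "sex": ["male", "female", None, None],
})

summary = missing_summary(toy, ["age_approx", "sex", "not_a_column"])
print(summary)
```

On the real metadata this would be called as `missing_summary(df, concept_columns)`.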



## 4. Concept categorization

We divide concepts into clinically meaningful groups.



### 4.1 Patient context concepts
- age_approx
- sex


In [None]:

df[["age_approx","sex"]].describe(include="all")



### 4.2 Lesion anatomical context concepts
- anatom_site_general
- anatom_site_special


In [None]:

df["anatom_site_general"].value_counts()


In [None]:

df["anatom_site_special"].value_counts().head(15)



### 4.3 Biological / pathological prior concepts
- melanocytic


In [None]:

df["melanocytic"].value_counts()



### 4.4 Clinical-procedure-related concepts
- concomitant_biopsy


In [None]:

df["concomitant_biopsy"].value_counts()



## 5. Strong relationship warning (concept leakage analysis)

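We check how strongly the melanocytic concept is associated with the diagnosis.
If each diagnosis_3 class maps to a single melanocytic value, the concept is
effectively derived from the label and would leak Y into C.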


In [None]:

pd.crosstab(df["melanocytic"], df["diagnosis_3"])
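
Raw counts can be hard to read when classes differ in size. One way to quantify the relationship is a row-normalized crosstab: for each diagnosis class, take the share of rows that fall into its most common concept value. The sketch below assumes this "purity" measure. The helper name `concept_label_purity` and the toy values are illustrative.

```python
import pandas as pd

def concept_label_purity(frame: pd.DataFrame, concept: str, label: str) -> pd.Series:
    """For each label class, return the share of rows with its most common concept value."""
    # normalize="index" makes each label row sum to 1.
    table = pd.crosstab(frame[label], frame[concept], normalize="index")
    return table.max(axis=1)

# Toy rows with made-up class names, only to illustrate the computation.
toy = pd.DataFrame({
    "diagnosis_3": ["entity_1", "entity_1", "entity_2", "entity_2", "entity_3", "entity_3"],
    "melanocytic": [True, True, False, False, True, False],
})

purity = concept_label_purity(toy, "melanocytic", "diagnosis_3")
print(purity)
```

A purity of 1.0 for every class would mean the concept is a deterministic function of the label at that level. On the real metadata the call would be `concept_label_purity(df, "melanocytic", "diagnosis_3")`.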



## 6. Final concept taxonomy for CBM

C_meta (usable now):

- C_age (from age_approx, later discretized)
- C_sex
- C_anatom_site_general
- C_melanocytic
- C_concomitant_biopsy

Optional / advanced:

- C_anatom_site_special (only if grouped)

Excluded from concepts:

- isic_id
- lesion_id
- attribution
- copyright_license
- image_type
- image_manipulation
- diagnosis_confirm_type
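
The taxonomy mentions two preprocessing steps: discretizing age and grouping anatom_site_special. A possible sketch of both follows. The bin edges, bin labels, `min_count` threshold, and helper names are hypothetical choices for illustration, not values fixed by this notebook.

```python
import pandas as pd

# Hypothetical age bins; the actual cut points are still to be decided.
AGE_BINS = [0, 30, 50, 70, 120]
AGE_LABELS = ["<30", "30-49", "50-69", "70+"]

def discretize_age(age: pd.Series) -> pd.Series:
    """Map ages to left-closed intervals [0, 30), [30, 50), [50, 70), [70, 120)."""
    return pd.cut(age, bins=AGE_BINS, labels=AGE_LABELS, right=False)

def group_rare(values: pd.Series, min_count: int, other: str = "other") -> pd.Series:
    """Replace categories seen fewer than min_count times; missing values stay missing."""
    counts = values.value_counts()
    keep = counts[counts >= min_count].index
    return values.where(values.isin(keep) | values.isna(), other)

# Toy values, only to illustrate the behavior.
ages = pd.Series([25, 30, 69, 70, None])
sites = pd.Series(["site_a", "site_a", "site_a", "site_b", None])

age_binned = discretize_age(ages)
sites_grouped = group_rare(sites, min_count=2)
print(age_binned.tolist())
print(sites_grouped.tolist())
```

On the real metadata these would be applied to `df["age_approx"]` and `df["anatom_site_special"]`.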



## 7. Label (Y) hierarchy summary

Y can be defined at three levels:

Level 1 (coarse):
- diagnosis_1  → {Benign, Malignant, Indeterminate}

Level 2 (intermediate):
- diagnosis_2  → pathological families

Level 3 (fine-grained):
- diagnosis_3  → concrete pathology classes

This hierarchy enables multi-level CBM experiments.



## 8. CBM formulation for this dataset

X : dermoscopic image

C : metadata concepts
    {age, sex, anatomical site, melanocytic, concomitant biopsy}

Y : diagnosis (one level chosen from diagnosis_1, diagnosis_2, or diagnosis_3)

X → C → Y
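
On the metadata side, this formulation amounts to a table of concept indicators C and a label column Y for each image. The image X is not handled here. The sketch below shows one way to assemble C and Y, assuming one-hot encoding of the categorical concepts. Age is omitted because it would first need to be discretized. The helper name `build_c_and_y` and the toy values are illustrative.

```python
import pandas as pd

# Categorical concept columns taken from the taxonomy above.
CATEGORICAL_CONCEPTS = ["sex", "anatom_site_general", "melanocytic", "concomitant_biopsy"]

def build_c_and_y(frame: pd.DataFrame, label_column: str):
    """Return (C, Y): one-hot concept indicators and the chosen label column."""
    # Casting to category makes get_dummies encode boolean columns as well.
    c = pd.get_dummies(frame[CATEGORICAL_CONCEPTS].astype("category"), dtype=int)
    y = frame[label_column]
    return c, y

# Toy rows with made-up values, only to illustrate the output.
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "anatom_site_general": ["site_a", "site_b", "site_a"],
    "melanocytic": [True, False, True],
    "concomitant_biopsy": [False, False, True],
    "diagnosis_1": ["Benign", "Malignant", "Benign"],
})

C, Y = build_c_and_y(toy, "diagnosis_1")
print(C.columns.tolist())
```

Switching the label level only requires changing `label_column`, which keeps multi-level experiments on the same concept table.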



## 9. Summary

This notebook isolates and categorizes all interpretable concepts available
in the HAM10000 metadata and clearly separates them from diagnostic labels.

This is the conceptual preparation step before building CBM datasets.
