# Derm7pt Dataset – Exploration and CBM Preparation

This notebook introduces the Derm7pt dataset, explores its structure and metadata, and explains how it can be used in a Concept Bottleneck Model (CBM) pipeline.

Dataset location assumed: `data/derm7pt/`

## 1. Dataset overview

Derm7pt is a dermatology dataset built around the 7‑point checklist, a dermoscopic scoring rule that dermatologists use when assessing lesions for suspected melanoma.

Each sample contains:
- One dermoscopic image (and optionally a clinical image)
- Seven interpretable visual concepts
- A final diagnosis label

We aim to use this dataset for Concept Bottleneck Models, which predict the concepts C from the image X and then the diagnosis Y from the concepts alone (X → C → Y).

In [1]:
import os
import pandas as pd

root = 'data/derm7pt'
meta_path = os.path.join(root, 'meta', 'meta.csv')

print(os.listdir(root))
print(os.listdir(os.path.join(root, 'meta')))

['derm.html', 'meta', 'README.txt', 'images', 'clinic.html']
['train_indexes.csv', 'test_indexes.csv', 'valid_indexes.csv', 'meta.csv']


## 2. Load metadata

In [2]:
df = pd.read_csv(meta_path)
df.shape, df.columns.tolist()

((1011, 19),
 ['case_num',
  'diagnosis',
  'seven_point_score',
  'pigment_network',
  'streaks',
  'pigmentation',
  'regression_structures',
  'dots_and_globules',
  'blue_whitish_veil',
  'vascular_structures',
  'level_of_diagnostic_difficulty',
  'elevation',
  'location',
  'sex',
  'management',
  'clinic',
  'derm',
  'case_id',
  'notes'])

In [3]:
df.head()

| | case_num | diagnosis | seven_point_score | pigment_network | streaks | pigmentation | regression_structures | dots_and_globules | blue_whitish_veil | vascular_structures | level_of_diagnostic_difficulty | elevation | location | sex | management | clinic | derm | case_id | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | basal cell carcinoma | 0 | absent | absent | absent | absent | absent | absent | arborizing | medium | nodular | abdomen | female | excision | NEL/NEL025.JPG | NEL/Nel026.jpg | | |
| 1 | 2 | basal cell carcinoma | 1 | absent | absent | absent | absent | irregular | absent | absent | low | palpable | head neck | female | excision | NEL/NEL027.JPG | NEL/Nel028.jpg | | |
| 2 | 3 | basal cell carcinoma | 1 | absent | absent | absent | absent | irregular | absent | arborizing | medium | palpable | head neck | female | excision | NEL/Nel032.jpg | NEL/Nel033.jpg | | |
| 3 | 4 | basal cell carcinoma | 4 | absent | absent | absent | blue areas | irregular | present | within regression | low | palpable | lower limbs | male | excision | NEL/NEL034.JPG | NEL/Nel035.jpg | | |
| 4 | 5 | basal cell carcinoma | 1 | absent | absent | diffuse irregular | absent | absent | absent | absent | high | palpable | upper limbs | female | excision | NEL/NEL036.JPG | NEL/Nel037.jpg | | |


## 3. Concept and label definition

Final label (Y): `diagnosis`

Concepts (C):

In [4]:
concept_cols = [
    'pigment_network',
    'streaks',
    'pigmentation',
    'regression_structures',
    'dots_and_globules',
    'blue_whitish_veil',
    'vascular_structures'
]

y_col = 'diagnosis'

concept_cols, y_col

(['pigment_network',
  'streaks',
  'pigmentation',
  'regression_structures',
  'dots_and_globules',
  'blue_whitish_veil',
  'vascular_structures'],
 'diagnosis')

## 4. Distribution of diagnoses

In [5]:
df[y_col].value_counts()

diagnosis
clark nevus                     399
melanoma (less than 0.76 mm)    102
reed or spitz nevus              79
melanoma (in situ)               64
melanoma (0.76 to 1.5 mm)        53
seborrheic keratosis             45
basal cell carcinoma             42
dermal nevus                     33
vascular lesion                  29
blue nevus                       28
melanoma (more than 1.5 mm)      28
lentigo                          24
dermatofibroma                   20
congenital nevus                 17
melanosis                        16
combined nevus                   13
miscellaneous                     8
recurrent nevus                   6
melanoma metastasis               4
melanoma                          1
Name: count, dtype: int64

## 5. Distribution of each concept

In [6]:
for c in concept_cols:
    print('\n===', c, '===')
    print(df[c].value_counts(dropna=False))


=== pigment_network ===
pigment_network
absent      400
typical     381
atypical    230
Name: count, dtype: int64

=== streaks ===
streaks
absent       653
irregular    251
regular      107
Name: count, dtype: int64

=== pigmentation ===
pigmentation
absent                 588
diffuse irregular      265
diffuse regular        115
localized irregular     40
localized regular        3
Name: count, dtype: int64

=== regression_structures ===
regression_structures
absent          758
blue areas      116
combinations     99
white areas      38
Name: count, dtype: int64

=== dots_and_globules ===
dots_and_globules
irregular    448
regular      334
absent       229
Name: count, dtype: int64

=== blue_whitish_veil ===
blue_whitish_veil
absent     816
present    195
Name: count, dtype: int64

=== vascular_structures ===
vascular_structures
absent               823
dotted                53
within regression     46
arborizing            31
comma                 23
linear irregular      18
hairpi

## 6. Image paths

Two image modalities are provided:
- dermoscopic image: column `derm`
- clinical image: column `clinic`

Images are stored in `data/derm7pt/images/`.

In [7]:
img_root = os.path.join(root, 'images')
df[['clinic','derm']].head()

| | clinic | derm |
|---|---|---|
| 0 | NEL/NEL025.JPG | NEL/Nel026.jpg |
| 1 | NEL/NEL027.JPG | NEL/Nel028.jpg |
| 2 | NEL/Nel032.jpg | NEL/Nel033.jpg |
| 3 | NEL/NEL034.JPG | NEL/Nel035.jpg |
| 4 | NEL/NEL036.JPG | NEL/Nel037.jpg |


## 7. Train / validation / test split

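The dataset defines its splits as row indices of `meta.csv`, stored in three files under `meta/`. The split sizes below add up to 1011, the number of rows in the metadata, which is consistent with every case belonging to exactly one split (provided no index is repeated).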

In [8]:
train_idx = pd.read_csv(os.path.join(root,'meta','train_indexes.csv'))['indexes'].dropna().astype(int)
val_idx   = pd.read_csv(os.path.join(root,'meta','valid_indexes.csv'))['indexes'].dropna().astype(int)
test_idx  = pd.read_csv(os.path.join(root,'meta','test_indexes.csv'))['indexes'].dropna().astype(int)

len(train_idx), len(val_idx), len(test_idx)

(413, 203, 395)
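
Before using the splits, it may be worth checking that they are disjoint and that the indices really are zero-based row positions. The sketch below is one way to do that; it runs on a small stand-in frame with made-up index lists, and the helper name `check_and_split` is hypothetical. Whether the real index files are zero-based is an assumption to verify against the data.

```python
import pandas as pd

# Stand-in for meta.csv; the real frame has 1011 rows.
df = pd.DataFrame({"case_num": range(1, 11), "diagnosis": ["clark nevus"] * 10})

# Made-up index lists standing in for train/valid/test_indexes.csv.
train_idx = pd.Series([0, 1, 2, 3, 4, 5])
val_idx = pd.Series([6, 7])
test_idx = pd.Series([8, 9])


def check_and_split(frame, train_idx, val_idx, test_idx):
    """Check that the splits are disjoint and in range, then slice by position."""
    parts = {"train": train_idx, "val": val_idx, "test": test_idx}
    all_idx = pd.concat(list(parts.values()), ignore_index=True)
    assert all_idx.is_unique, "an index appears in more than one split"
    # Assumption: indices are zero-based positions, so the valid range is 0..len-1.
    assert all_idx.between(0, len(frame) - 1).all(), "index out of range"
    return {
        name: frame.iloc[idx.to_numpy()].reset_index(drop=True)
        for name, idx in parts.items()
    }


splits = check_and_split(df, train_idx, val_idx, test_idx)
print({name: len(part) for name, part in splits.items()})
```

If the range check fails on the real files, the indices may be one-based or refer to `case_num` instead, and the slicing would need to change accordingly.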

## 8. CBM formulation for Derm7pt

We define:

**X**: dermoscopic image

**C**: seven categorical visual concepts

**Y**: diagnosis

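Unlike datasets with purely binary concepts, most concepts here take more than two values; only `blue_whitish_veil` is binary (absent / present).
The concept predictor is therefore best implemented as a multi‑head classifier with one multi‑class head per concept.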

Pipeline:
1. Image encoder f(x)
2. Concept heads g_k(f(x)) → p(c_k)
3. Diagnosis head h(c_1,...,c_7)
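
The data flow of this pipeline can be sketched at the level of array shapes. The NumPy example below uses random weights in place of a trained encoder and trained heads, so it only illustrates how the pieces connect, not a usable model. The feature size `feat_dim`, the linear heads, and the choice to feed concatenated concept probabilities into the diagnosis head are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Number of classes per concept, read off the value counts in section 5.
n_classes = {
    "pigment_network": 3,
    "streaks": 3,
    "pigmentation": 5,
    "regression_structures": 4,
    "dots_and_globules": 3,
    "blue_whitish_veil": 2,
    # Assumption: the printed distribution for this concept is cut off;
    # use df["vascular_structures"].nunique() on the real data.
    "vascular_structures": 8,
}
n_diagnoses = 20  # distinct values of `diagnosis` in section 4
feat_dim = 16     # hypothetical encoder output size


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Random stand-ins for trained weights.
W_heads = {k: rng.normal(size=(feat_dim, n)) for k, n in n_classes.items()}
W_diag = rng.normal(size=(sum(n_classes.values()), n_diagnoses))


def forward(features):
    """features: (batch, feat_dim) array standing in for f(x)."""
    # One softmax head per concept: g_k(f(x)) -> p(c_k)
    concept_probs = {k: softmax(features @ W) for k, W in W_heads.items()}
    # The diagnosis head sees only the concept probabilities, never f(x).
    bottleneck = np.concatenate(list(concept_probs.values()), axis=-1)
    diag_probs = softmax(bottleneck @ W_diag)
    return concept_probs, diag_probs


features = rng.normal(size=(4, feat_dim))
concept_probs, diag_probs = forward(features)
print(diag_probs.shape)
```

The key property is that `diag_probs` depends on the image only through `bottleneck`, which is what makes the concepts an actual bottleneck.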

The column `seven_point_score` must NOT be used as a concept because it is derived from the concepts.
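
To illustrate why the score leaks concept information, the sketch below recomputes it from the seven concepts for the five rows printed by `df.head()` in section 2. The point weights (2 for major criteria, 1 for minor criteria) and the mapping from concept values to "scoring" values are assumptions based on the 7‑point checklist as commonly described; they are not taken from this notebook and should be checked against the dataset documentation.

```python
import pandas as pd

# Assumed scoring rule: (values that score, points awarded).
scoring = {
    "pigment_network": (["atypical"], 2),
    "blue_whitish_veil": (["present"], 2),
    "vascular_structures": (["dotted", "linear irregular"], 2),
    "streaks": (["irregular"], 1),
    "pigmentation": (["diffuse irregular", "localized irregular"], 1),
    "dots_and_globules": (["irregular"], 1),
    "regression_structures": (["blue areas", "white areas", "combinations"], 1),
}

# Concept values and seven_point_score of the five rows shown by df.head().
head = pd.DataFrame(
    [
        ["absent", "absent", "absent", "absent", "absent", "absent", "arborizing", 0],
        ["absent", "absent", "absent", "absent", "irregular", "absent", "absent", 1],
        ["absent", "absent", "absent", "absent", "irregular", "absent", "arborizing", 1],
        ["absent", "absent", "absent", "blue areas", "irregular", "present", "within regression", 4],
        ["absent", "absent", "diffuse irregular", "absent", "absent", "absent", "absent", 1],
    ],
    columns=[
        "pigment_network", "streaks", "pigmentation", "regression_structures",
        "dots_and_globules", "blue_whitish_veil", "vascular_structures",
        "seven_point_score",
    ],
)


def checklist_score(frame):
    """Sum the assumed points over all seven concepts, row by row."""
    total = pd.Series(0, index=frame.index)
    for col, (scoring_values, points) in scoring.items():
        total += frame[col].isin(scoring_values).astype(int) * points
    return total


recomputed = checklist_score(head)
print((recomputed == head["seven_point_score"]).all())
```

Agreement on these five rows does not validate the whole mapping (for example, no row here exercises the vascular values assumed to score), so running the same comparison on the full `df` would be the real check.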

## 9. Example skeleton for future CBM usage

In [9]:
# Example only – not a full training script

sample = df.iloc[0]

x_path = os.path.join(img_root, sample['derm'])
C = sample[concept_cols].values
y = sample[y_col]

x_path, C, y

('data/derm7pt/images/NEL/Nel026.jpg',
 array(['absent', 'absent', 'absent', 'absent', 'absent', 'absent',
        'arborizing'], dtype=object),
 'basal cell carcinoma')
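
The concept values above are still strings, so they need to be mapped to integers before they can serve as classification targets. One possible approach, sketched below on a small stand-in frame, uses `pd.Categorical` with a sorted, explicit category list per concept. The helper name `encode_concepts` is hypothetical.

```python
import pandas as pd

concept_cols = ["pigment_network", "blue_whitish_veil"]  # subset for brevity

# Stand-in rows; the values are taken from the distributions in section 5.
df = pd.DataFrame(
    {
        "pigment_network": ["absent", "typical", "atypical", "absent"],
        "blue_whitish_veil": ["absent", "present", "absent", "absent"],
    }
)


def encode_concepts(frame, cols):
    """Return integer codes per concept plus the code -> name vocabulary."""
    codes = pd.DataFrame(index=frame.index)
    vocab = {}
    for col in cols:
        categories = sorted(frame[col].dropna().unique())
        cat = pd.Categorical(frame[col], categories=categories)
        codes[col] = cat.codes  # a code of -1 would mark a missing value
        vocab[col] = list(cat.categories)
    return codes, vocab


codes, vocab = encode_concepts(df, concept_cols)
print(vocab)
print(codes.to_numpy())
```

Building the vocabulary on the full metadata rather than on the training rows alone keeps the codes identical across the train, validation, and test splits, even when a rare value such as `localized regular` is missing from one of them.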

## 10. Next steps

- Encode categorical concepts to integer labels per concept
- Build a Dataset class that returns (image, concept_vector, diagnosis)
- Train a CBM using:
  - multi-class loss for each concept head
  - classification loss for diagnosis
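
The Dataset class from the list above could look roughly like the following. This is a framework-free sketch: it implements only `__len__` and `__getitem__`, the interface used by map-style datasets in PyTorch, and takes already-encoded integer concepts and diagnoses. The class name `Derm7ptConcepts` and the `load_image` hook are hypothetical; by default the "image" returned is just the file path, so the example runs without the image files.

```python
import os

import numpy as np
import pandas as pd


class Derm7ptConcepts:
    """Map-style dataset sketch: index -> (image, concept_vector, diagnosis)."""

    def __init__(self, frame, concept_codes, diagnosis_codes, img_root, load_image=None):
        self.paths = [os.path.join(img_root, p) for p in frame["derm"]]
        self.concepts = np.asarray(concept_codes, dtype=np.int64)
        self.labels = np.asarray(diagnosis_codes, dtype=np.int64)
        # Hypothetical hook for a real image loader; the default returns the path.
        self.load_image = load_image or (lambda path: path)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        return self.load_image(self.paths[i]), self.concepts[i], self.labels[i]


# Two rows of the `derm` column from section 6, with made-up integer codes.
frame = pd.DataFrame({"derm": ["NEL/Nel026.jpg", "NEL/Nel028.jpg"]})
ds = Derm7ptConcepts(
    frame,
    concept_codes=[[0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 1, 0, 0]],
    diagnosis_codes=[0, 0],
    img_root=os.path.join("data", "derm7pt", "images"),
)

image, concepts, label = ds[0]
print(len(ds), image, concepts.shape)
```

Passing the loader in as a callable keeps image decoding and augmentation out of the class, so the same object can be reused for the train, validation, and test subsets.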

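Because every case pairs its diagnosis with seven checklist-based concept annotations, and the metadata also records a `level_of_diagnostic_difficulty`, this dataset is well suited for studying interpretability and uncertainty in medical concept learning.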
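
The two loss terms listed in the next steps can be combined into a single objective. One common way to write it, using the notation of section 8, is shown below; the weighting factor $\lambda$ and the choice to train both parts jointly (rather than sequentially) are assumptions, not something this notebook prescribes.

```latex
\hat{p}_k = g_k\big(f(x)\big), \qquad k = 1, \dots, 7

\mathcal{L}(x, c, y) =
    \mathrm{CE}\big(h(\hat{p}_1, \dots, \hat{p}_7),\; y\big)
    + \lambda \sum_{k=1}^{7} \mathrm{CE}\big(\hat{p}_k,\; c_k\big)
```

Here $\mathrm{CE}$ denotes the categorical cross-entropy, $c_k$ is the annotated value of concept $k$, and $y$ is the diagnosis.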