# Data Exploration Notebook
This notebook explores the CMMD dataset (Chinese Mammography Database). The findings inform how the data will be preprocessed and split for training and experimentation.

The main objective is to classify the molecular subtypes of breast cancer from mammographic images alone, so the analysis focuses on the cases labeled "Malignant".

# 0. Imports

In [1]:
import pandas as pd
import plotly.express as px


# 1. Load Data

In [17]:
clinical_data_df = pd.read_excel('../data/raw/CMMD/CMMD_clinicaldata_revision.xlsx')
metadata_df = pd.read_csv('../data/raw/CMMD/metadata.csv')

In [18]:
clinical_data_df.head(5)

Unnamed: 0,ID1,LeftRight,Age,number,abnormality,classification,subtype
0,D1-0001,R,44,2,calcification,Benign,
1,D1-0002,L,40,2,calcification,Benign,
2,D1-0003,L,39,2,calcification,Benign,
3,D1-0004,L,41,2,calcification,Benign,
4,D1-0005,R,42,2,calcification,Benign,


In [19]:
metadata_df.head(5)

Unnamed: 0,Series UID,Collection,3rd Party Analysis,Data Description URI,Subject ID,Study UID,Study Description,Study Date,Series Description,Manufacturer,Modality,SOP Class Name,SOP Class UID,Number of Images,File Size,File Location,Download Timestamp
1.3.6.1.4.1.14519.5.2.1.1239.1759.623006463861567934606116970244,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D1-0001,1.3.6.1.4.1.14519.5.2.1.1239.1759.335790956129...,,07-18-2010,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,2,8,79 MB,./CMMD/D1-0001/07-18-2010-NA-NA-79377/1.000000...,2025-02-27T01:22:03.4
1.3.6.1.4.1.14519.5.2.1.1239.1759.610823649257711756778765445313,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D1-0002,1.3.6.1.4.1.14519.5.2.1.1239.1759.241519791051...,,07-18-2010,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,2,8,79 MB,./CMMD/D1-0002/07-18-2010-NA-NA-49231/1.000000...,2025-02-27T01:22:03.4
1.3.6.1.4.1.14519.5.2.1.1239.1759.292560899611058154484740024283,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D1-0003,1.3.6.1.4.1.14519.5.2.1.1239.1759.113089024322...,,07-18-2011,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,2,8,79 MB,./CMMD/D1-0003/07-18-2011-NA-NA-25491/1.000000...,2025-02-27T01:22:03.447
1.3.6.1.4.1.14519.5.2.1.1239.1759.202447815325989213972136564676,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D1-0006,1.3.6.1.4.1.14519.5.2.1.1239.1759.241563313733...,,07-18-2010,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,2,8,79 MB,./CMMD/D1-0006/07-18-2010-NA-NA-16802/1.000000...,2025-02-27T01:22:07.503
1.3.6.1.4.1.14519.5.2.1.1239.1759.328825651506038448195218436301,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D1-0004,1.3.6.1.4.1.14519.5.2.1.1239.1759.132173027545...,,07-18-2011,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,2,8,79 MB,./CMMD/D1-0004/07-18-2011-NA-NA-14914/1.000000...,2025-02-27T01:22:07.681


In [20]:
clinical_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1872 entries, 0 to 1871
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   ID1             1872 non-null   object
 1   LeftRight       1872 non-null   object
 2   Age             1872 non-null   int64 
 3   number          1872 non-null   int64 
 4   abnormality     1872 non-null   object
 5   classification  1872 non-null   object
 6   subtype         749 non-null    object
dtypes: int64(2), object(5)
memory usage: 102.5+ KB
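
The `info()` output above shows that `subtype` is the only column with missing values (749 of 1872 rows are annotated). A minimal sketch of making that explicit per column follows; the helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values for each column."""
    return pd.DataFrame({
        "n_missing": df.isna().sum(),
        "share_missing": df.isna().mean(),
    })


# Toy frame standing in for clinical_data_df (values are made up).
toy_df = pd.DataFrame({
    "ID1": ["D1-0001", "D1-0002", "D2-0001", "D2-0002"],
    "subtype": [None, None, "Luminal A", "Luminal B"],
})

summary = missing_summary(toy_df)
print(summary)
```

On the real table the same call would be `missing_summary(clinical_data_df)`.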


In [21]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1775 entries, 1.3.6.1.4.1.14519.5.2.1.1239.1759.623006463861567934606116970244 to 1.3.6.1.4.1.14519.5.2.1.1239.1759.311460119741546999437566015082
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Series UID            1775 non-null   object 
 1   Collection            1775 non-null   object 
 2   3rd Party Analysis    1775 non-null   object 
 3   Data Description URI  1775 non-null   object 
 4   Subject ID            1775 non-null   object 
 5   Study UID             0 non-null      float64
 6   Study Description     1775 non-null   object 
 7   Study Date            0 non-null      float64
 8   Series Description    0 non-null      float64
 9   Manufacturer          1775 non-null   object 
 10  Modality              1775 non-null   object 
 11  SOP Class Name        1775 non-null   object 
 12  SOP Class UID         1775 non-null   int64  
 13  Numbe

# 2. Exploratory Data Analysis

## 2.1 Clinical Data Analysis
Because the objective is subtype classification, we keep only the rows that have a molecular subtype annotation.

In [23]:
clinical_data_df_filtered = clinical_data_df[clinical_data_df['subtype'].notnull()]
print(f'Original clinical data shape: {clinical_data_df.shape}')
print(f'Filtered clinical data shape: {clinical_data_df_filtered.shape}')

Original clinical data shape: (1872, 7)
Filtered clinical data shape: (749, 7)


Of the 1872 rows, 749 have a molecular subtype annotation.

In [24]:
clinical_data_df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 749 entries, 1107 to 1871
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   ID1             749 non-null    object
 1   LeftRight       749 non-null    object
 2   Age             749 non-null    int64 
 3   number          749 non-null    int64 
 4   abnormality     749 non-null    object
 5   classification  749 non-null    object
 6   subtype         749 non-null    object
dtypes: int64(2), object(5)
memory usage: 46.8+ KB


In [25]:
# A cell displays only its last expression, so show the rows sorted by age.
clinical_data_df_filtered.sort_values(by='Age')

Unnamed: 0,ID1,LeftRight,Age,number,abnormality,classification,subtype
1355,D2-0241,L,21,2,both,Malignant,Luminal B
1167,D2-0059,R,25,2,calcification,Malignant,Luminal A
1526,D2-0409,R,27,2,mass,Malignant,Luminal B
1543,D2-0426,R,27,2,mass,Malignant,Luminal B
1381,D2-0265,R,27,2,both,Malignant,Luminal B
...,...,...,...,...,...,...,...
1619,D2-0502,L,83,2,mass,Malignant,Luminal A
1535,D2-0418,L,83,2,mass,Malignant,Luminal B
1449,D2-0333,R,85,2,mass,Malignant,Luminal B
1773,D2-0654,L,85,2,mass,Malignant,triple negative


In [26]:
clinical_data_df_filtered['classification'].value_counts()

classification
Malignant    749
Name: count, dtype: int64
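
All 749 annotated rows are malignant. The reverse question, whether every malignant row carries a subtype, can be checked on the unfiltered table with a cross-tabulation. The sketch below uses a made-up toy frame in place of `clinical_data_df`; the column name `subtype_status` is an illustrative assumption.

```python
import pandas as pd

# Toy frame standing in for the unfiltered clinical_data_df (values are made up).
toy_df = pd.DataFrame({
    "classification": ["Benign", "Benign", "Malignant", "Malignant", "Malignant"],
    "subtype": [None, None, "Luminal A", None, "Luminal B"],
})

# Label each row by whether a subtype annotation is present.
subtype_status = (
    toy_df["subtype"]
    .notna()
    .map({True: "annotated", False: "not annotated"})
    .rename("subtype_status")
)

# Rows: classification; columns: annotation status; cells: row counts.
status_table = pd.crosstab(toy_df["classification"], subtype_status)
print(status_table)
```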

### 2.1.1 ID1 (Patient ID)

In [27]:
clinical_data_df_filtered['ID1'].nunique()

749

In [28]:
clinical_data_df_filtered['ID1'].duplicated().sum()

np.int64(0)
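
`ID1` is unique within the filtered rows. Whether it is also unique in the full 1872-row table is not shown here; if some patients appear more than once (for example, one row per breast, which is an assumption), a check along these lines would list them. The helper `rows_per_patient` and the toy frame are hypothetical.

```python
import pandas as pd


def rows_per_patient(df: pd.DataFrame, id_col: str = "ID1") -> pd.Series:
    """Count rows per patient ID and keep only IDs that occur more than once."""
    counts = df[id_col].value_counts()
    return counts[counts > 1]


# Toy frame standing in for the unfiltered clinical table (values are made up).
toy_df = pd.DataFrame({
    "ID1": ["D1-0001", "D1-0001", "D1-0002"],
    "LeftRight": ["L", "R", "L"],
})

repeated = rows_per_patient(toy_df)
print(repeated)
```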

### 2.1.2 LeftRight

In [29]:
clinical_data_df_filtered['LeftRight'].value_counts()

LeftRight
L    385
R    364
Name: count, dtype: int64

In [30]:
px.pie(clinical_data_df_filtered, names='LeftRight', title='LeftRight distribution')

### 2.1.3 Age

In [None]:
clinical_data_df_filtered['Age'].describe()

count    749.000000
mean      49.818425
std       10.798179
min       21.000000
25%       43.000000
50%       49.000000
75%       57.000000
max       87.000000
Name: Age, dtype: float64

In [None]:
px.histogram(clinical_data_df_filtered, x='Age', title='Age distribution by subtype', color='subtype')

In [None]:
px.histogram(clinical_data_df_filtered, x='Age', title='Age distribution by breast side', color='LeftRight')
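
The histograms give a visual comparison; a numeric summary of age per subtype can complement them. This is a small sketch on a made-up toy frame, with the same column names as the clinical table.

```python
import pandas as pd

# Toy frame standing in for clinical_data_df_filtered (values are made up).
toy_df = pd.DataFrame({
    "Age": [40, 50, 60, 70],
    "subtype": ["Luminal A", "Luminal A", "Luminal B", "Luminal B"],
})

# One row per subtype with basic age statistics.
age_by_subtype = toy_df.groupby("subtype")["Age"].agg(
    ["count", "mean", "median", "min", "max"]
)
print(age_by_subtype)
```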

### 2.1.4 Abnormality

In [None]:
px.histogram(clinical_data_df_filtered, x='abnormality',
             title='Abnormality distribution', color='subtype')

### 2.1.5 Subtype

In [None]:
px.histogram(clinical_data_df_filtered, x='subtype',
             title='Subtype distribution')

In [None]:
px.pie(clinical_data_df_filtered, names='subtype',
       title='Subtype distribution')
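
Since the subtype distribution drives how the data should be split later, the sketch below shows one way to compute class proportions and hold out a fixed fraction of each subtype using pandas only. The function name `stratified_holdout`, the 20% fraction, and the seed are illustrative assumptions, and the toy frame is made up.

```python
import pandas as pd


def stratified_holdout(df, label_col="subtype", frac=0.2, seed=42):
    """Split df into (train, test), sampling `frac` of each label group for test."""
    test = df.groupby(label_col).sample(frac=frac, random_state=seed)
    train = df.drop(index=test.index)
    return train, test


# Toy frame standing in for clinical_data_df_filtered (values are made up).
toy_df = pd.DataFrame({
    "ID1": [f"P{i:03d}" for i in range(20)],
    "subtype": ["Luminal A"] * 10 + ["Luminal B"] * 10,
})

# Share of each subtype in the full set.
proportions = toy_df["subtype"].value_counts(normalize=True)

train_df, test_df = stratified_holdout(toy_df)
print(proportions)
print(len(train_df), len(test_df))
```

Because each filtered row corresponds to one `ID1`, a row-level split like this also keeps each patient on one side of the split.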

## 2.2 Metadata Analysis

In [32]:
metadata_df

Unnamed: 0,Series UID,Collection,3rd Party Analysis,Data Description URI,Subject ID,Study UID,Study Description,Study Date,Series Description,Manufacturer,Modality,SOP Class Name,SOP Class UID,Number of Images,File Size,File Location,Download Timestamp
1.3.6.1.4.1.14519.5.2.1.1239.1759.623006463861567934606116970244,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D1-0001,1.3.6.1.4.1.14519.5.2.1.1239.1759.335790956129...,,07-18-2010,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,2,8,79 MB,./CMMD/D1-0001/07-18-2010-NA-NA-79377/1.000000...,2025-02-27T01:22:03.4
1.3.6.1.4.1.14519.5.2.1.1239.1759.610823649257711756778765445313,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D1-0002,1.3.6.1.4.1.14519.5.2.1.1239.1759.241519791051...,,07-18-2010,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,2,8,79 MB,./CMMD/D1-0002/07-18-2010-NA-NA-49231/1.000000...,2025-02-27T01:22:03.4
1.3.6.1.4.1.14519.5.2.1.1239.1759.292560899611058154484740024283,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D1-0003,1.3.6.1.4.1.14519.5.2.1.1239.1759.113089024322...,,07-18-2011,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,2,8,79 MB,./CMMD/D1-0003/07-18-2011-NA-NA-25491/1.000000...,2025-02-27T01:22:03.447
1.3.6.1.4.1.14519.5.2.1.1239.1759.202447815325989213972136564676,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D1-0006,1.3.6.1.4.1.14519.5.2.1.1239.1759.241563313733...,,07-18-2010,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,2,8,79 MB,./CMMD/D1-0006/07-18-2010-NA-NA-16802/1.000000...,2025-02-27T01:22:07.503
1.3.6.1.4.1.14519.5.2.1.1239.1759.328825651506038448195218436301,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D1-0004,1.3.6.1.4.1.14519.5.2.1.1239.1759.132173027545...,,07-18-2011,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,2,8,79 MB,./CMMD/D1-0004/07-18-2011-NA-NA-14914/1.000000...,2025-02-27T01:22:07.681
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1.3.6.1.4.1.14519.5.2.1.1239.1759.231264718340548671615654377404,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D2-0744,1.3.6.1.4.1.14519.5.2.1.1239.1759.169383765396...,,07-18-2011,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,4,17,57 MB,./CMMD/D2-0744/07-18-2011-NA-NA-36434/1.000000...,2025-02-27T02:13:20.482
1.3.6.1.4.1.14519.5.2.1.1239.1759.972296336858504752743199236363,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D2-0746,1.3.6.1.4.1.14519.5.2.1.1239.1759.291758813596...,,07-18-2011,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,4,17,57 MB,./CMMD/D2-0746/07-18-2011-NA-NA-61307/1.000000...,2025-02-27T02:13:28.112
1.3.6.1.4.1.14519.5.2.1.1239.1759.242283723072008178075216409459,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D2-0748,1.3.6.1.4.1.14519.5.2.1.1239.1759.194867387180...,,07-18-2011,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,4,17,57 MB,./CMMD/D2-0748/07-18-2011-NA-NA-50416/1.000000...,2025-02-27T02:13:30.731
1.3.6.1.4.1.14519.5.2.1.1239.1759.240741173543977273447334314949,CMMD,NO,https://doi.org/10.7937/tcia.eqde4b16,D2-0747,1.3.6.1.4.1.14519.5.2.1.1239.1759.240317944400...,,07-18-2011,,,MG,Digital Mammography X-Ray Image Storage - For ...,1.2.840.10008.5.1.4.1.1.1.2,4,17,57 MB,./CMMD/D2-0747/07-18-2011-NA-NA-07955/1.000000...,2025-02-27T02:13:34.652


In [31]:
metadata_df['Manufacturer'].value_counts()

Manufacturer
MG    1775
Name: count, dtype: int64
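
`MG` is the DICOM modality code for mammography rather than a manufacturer name, and `metadata_df.info()` reports the Series UID values as the index. Together these suggest the column labels are shifted by one position. One plausible cause (an assumption, not verified against the raw file) is an unquoted decimal comma in `File Size`, such as `8,79 MB`, which gives every data row one more field than the header. The sketch below reproduces that effect on a tiny synthetic CSV and shows one possible workaround; `size_int` and `size_frac` are hypothetical helper column names.

```python
import io

import pandas as pd

# Synthetic sample (not real CMMD rows): the header has 4 names, but the
# unquoted "8,79 MB" adds a fifth field to every data row.
raw_csv = (
    "Subject ID,Modality,Number of Images,File Size\n"
    "D1-0001,MG,2,8,79 MB\n"
    "D2-0744,MG,4,17,57 MB\n"
)

# Default parsing: with one extra field per row, pandas uses the first field
# as the index, so every label ends up one column to the right of its data.
shifted = pd.read_csv(io.StringIO(raw_csv))

# Workaround: name the two halves of the size separately, then rejoin them.
names = ["Subject ID", "Modality", "Number of Images", "size_int", "size_frac"]
fixed = pd.read_csv(io.StringIO(raw_csv), names=names, header=0, dtype=str)
fixed["File Size"] = fixed["size_int"] + "." + fixed["size_frac"]
fixed = fixed.drop(columns=["size_int", "size_frac"])

print(shifted)
print(fixed)
```

If the real `metadata.csv` has the same layout, the full list of 17 header names would need the same extra slot before `File Location`.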