# DUKE TCIA metadata: numeric encoding for DL/ML


### 🧬 Pathologic Complete Response (pCR) Definition

To maintain consistency with other datasets, we established a clear definition of pathologic complete response (pCR) in the Duke cohort of 922 patients. Participants were categorized into three groups:

- **Responders (pCR = 1, n = 64):** Patients who achieved complete pathologic response (Pathologic_Response_to_Neoadjuvant_Therapy = 1.0) after neoadjuvant chemotherapy (NAC) without concurrent endocrine therapy.  
- **Non-responders (pCR = 0, n = 214):** Patients with residual invasive disease (n = 202), residual DCIS (n = 11), or residual LCIS (n = 1) following NAC.  
- **Irrelevant cases (pCR = None, n = 644):** Patients who either received neoadjuvant endocrine therapy, lacked pathology data, or did not receive NAC.

This stratification resulted in a final analysis cohort of 278 NAC-treated patients, with a pCR rate of 23.0%. Our definition ensures comparability with other datasets, where pCR specifically reflects the response to chemotherapy rather than alternative treatment modalities.
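The three-group definition above can be sketched as a single function. This is a minimal illustration rather than the notebook's own implementation: the helper name `derive_pcr` and the toy `demo` frame are assumptions, and the column names are taken from the renamed columns used later in this notebook.

```python
import numpy as np
import pandas as pd

def derive_pcr(frame: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: pCR per the definition above (1, 0, or NaN)."""
    resp = frame["Pathologic_Response_to_Neoadjuvant_Therapy"]
    nac = frame["Neoadjuvant_Chemotherapy"]
    endo = frame["Neoadjuvant_Endocrine_Therapy_Medications"]

    # Eligible: received NAC and no neoadjuvant endocrine therapy.
    eligible = (nac == 1) & (endo == 0)

    # Start with "irrelevant" (NaN) for everyone, then fill in eligible rows.
    pcr = pd.Series(np.nan, index=frame.index, dtype="float64")
    pcr.loc[eligible & (resp == 1)] = 1.0            # complete response
    pcr.loc[eligible & resp.isin([2, 3, 4])] = 0.0   # residual invasive, DCIS, LCIS
    return pcr

# Toy rows (made up for illustration, not taken from the Duke data).
demo = pd.DataFrame({
    "Pathologic_Response_to_Neoadjuvant_Therapy": [1.0, 2.0, 3.0, 5.0, 2.0, np.nan],
    "Neoadjuvant_Chemotherapy": [1.0, 1.0, 1.0, 1.0, 1.0, 0.0],
    "Neoadjuvant_Endocrine_Therapy_Medications": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
})
demo_pcr = derive_pcr(demo)
print(demo_pcr.tolist())
```

Rows with response code 5, with neoadjuvant endocrine therapy, or without NAC stay NaN, which corresponds to the "irrelevant" group.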


In [435]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import struct
import os

In [437]:
my_computer=True
my_colab=False

In [438]:
if my_colab:
    from google.colab import drive
    drive.mount('/content/drive')

    base_path  ="drive/MyDrive/new_breast_mri/DUKE/"
else:
    base_path  = 'G:\\My Drive\\breast_mri\\DUKE\\'
    base_jpg_path="D:\\duke"

In [439]:
df = pd.read_excel(os.path.join('TCIA_metadata',
                              "Clinical_and_Other_Features_proc.xlsx"))
df.head()

Unnamed: 0,Patient ID,Slice Thickness,Rows,Columns,FOV Computed (Field of View) in cm,Date of Birth (Days),Menopause (at diagnosis),Race and Ethnicity,ER,PR,...,Tumor Location,Position,Bilateral Information,"Clinical Response, Evaluated Through Imaging",Pathologic Response to Neoadjuvant Therapy,Neoadjuvant Chemotherapy,Adjuvant Chemotherapy,Neoadjuvant Endocrine Therapy Medications,Adjuvant Endocrine Therapy Medications,Near Complete ResponseOverall Near-complete Response: Stricter Definition
0,Breast_MRI_001,5,1,1,15,-15209,0,2,0,0,...,L,L 9 medial,0,2.0,2.0,1.0,0.0,0.0,0.0,0.0
1,Breast_MRI_002,12,2,2,5,-14061,0,2,0,0,...,L,L 1,0,1.0,3.0,1.0,0.0,0.0,0.0,2.0
2,Breast_MRI_003,18,2,2,14,-22685,1,1,1,1,...,,L 2,NC,2.0,2.0,0.0,0.0,1.0,1.0,0.0
3,Breast_MRI_004,18,2,2,15,-21479,1,1,1,1,...,L,L,0,,,0.0,0.0,0.0,1.0,
4,Breast_MRI_005,18,1,1,14,-13932,0,5,1,0,...,R,R 3,0,1.0,1.0,1.0,0.0,0.0,1.0,1.0


In [440]:
import pandas as pd
import numpy as np

def map_category_to_thickness(category):
    # Dictionary mapping categories to thickness values
    category_map = {
        0: 0.90,
        1: 0.95,
        2: 1.00,
        3: 1.04,
        4: 1.06,
        5: 1.10,
        6: 1.12,
        7: 1.15,
        8: 1.20,
        9: 1.23,
        10: 1.24,
        11: 1.25,
        12: 1.30,
        13: 1.40,
        14: 1.45,
        15: 1.50,
        16: 1.60,
        17: 1.80,
        18: 2.00,
        19: 2.20,
        20: 2.50
    }

    return category_map.get(int(category), None)


In [441]:
# Strip surrounding whitespace and replace spaces with underscores in column names
df.columns = [col.strip().replace(' ', '_') for col in df.columns]

In [442]:
df['slice_thick'] = df['Slice_Thickness'].apply(map_category_to_thickness)

In [443]:
dicct_rows={0:320,1:448,2:512}

In [444]:
df['n_xy'] = df['Rows'].map(dicct_rows)

In the Duke Breast MRI dataset, "FOV_Computed_(Field_of_View)_in_cm" or "FOVcm" refers to the Field of View measurement in centimeters used during the MRI acquisition.

The Field of View (FOV) in MRI is a critical technical parameter that defines the spatial area or volume that is captured in the imaging process. Specifically:

1. It represents the physical dimensions of the region being imaged (in this case, in centimeters)
2. It's an important acquisition parameter that affects spatial resolution and image quality
3. In breast MRI, the FOV typically needs to be large enough to encompass the entire breast volume

The FOV value impacts several aspects of the imaging:
- Larger FOV values capture more anatomy but may reduce spatial resolution
- Smaller FOV values provide higher spatial resolution but cover less area
- The FOV directly affects the pixel/voxel size in combination with the matrix size

This parameter might be relevant in analyses as it can potentially influence the apparent size of lesions and the quality of extracted radiomics features. It's one of the technical MRI parameters that researchers might need to account for when comparing images acquired with different protocols.
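The relationship between FOV, matrix size, and in-plane pixel size can be written as pixel size = FOV / matrix size. The sketch below illustrates that arithmetic only. The function name `pixel_size_mm` and the input values are assumptions; the `FOVcm` column in this file may hold category codes rather than centimetres, so it would likely need decoding before being used this way.

```python
def pixel_size_mm(fov_cm: float, matrix_size: int) -> float:
    """Hypothetical helper: in-plane pixel size in mm for a square FOV."""
    # Convert the FOV from cm to mm, then divide by the pixels along one side.
    return fov_cm * 10.0 / matrix_size

# Illustrative values only (not read from the Duke metadata).
small_matrix = pixel_size_mm(36.0, 320)
large_matrix = pixel_size_mm(36.0, 512)
print(small_matrix, large_matrix)
```

For a fixed FOV, a larger matrix gives a smaller pixel, which is the resolution trade-off described above.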

In [445]:
df=df.rename(columns={'FOV_Computed_(Field_of_View)_in_cm':'FOVcm',
                     })

In [446]:
df['age']=df['Date_of_Birth_(Days)'].apply(lambda x: int(-1.*x/365.))

In [447]:
df.head()

Unnamed: 0,Patient_ID,Slice_Thickness,Rows,Columns,FOVcm,Date_of_Birth_(Days),Menopause_(at_diagnosis),Race_and_Ethnicity,ER,PR,...,"Clinical_Response,_Evaluated_Through_Imaging",Pathologic_Response_to_Neoadjuvant_Therapy,Neoadjuvant_Chemotherapy,Adjuvant_Chemotherapy,Neoadjuvant_Endocrine_Therapy_Medications,Adjuvant_Endocrine_Therapy_Medications,Near_Complete_ResponseOverall_Near-complete_Response:__Stricter_Definition,slice_thick,n_xy,age
0,Breast_MRI_001,5,1,1,15,-15209,0,2,0,0,...,2.0,2.0,1.0,0.0,0.0,0.0,0.0,1.1,448,41
1,Breast_MRI_002,12,2,2,5,-14061,0,2,0,0,...,1.0,3.0,1.0,0.0,0.0,0.0,2.0,1.3,512,38
2,Breast_MRI_003,18,2,2,14,-22685,1,1,1,1,...,2.0,2.0,0.0,0.0,1.0,1.0,0.0,2.0,512,62
3,Breast_MRI_004,18,2,2,15,-21479,1,1,1,1,...,,,0.0,0.0,0.0,1.0,,2.0,512,58
4,Breast_MRI_005,18,1,1,14,-13932,0,5,1,0,...,1.0,1.0,1.0,0.0,0.0,1.0,1.0,2.0,448,38


In the Duke breast MRI dataset, the "Pathologic_Response_to_Neoadjuvant_Therapy" field represents the degree of tumor response to neoadjuvant therapy (treatment given before surgery), with the following code meanings:

- 1.0 (64 cases): Complete response - No evidence of invasive tumor remaining after therapy
- 2.0 (224 cases): Not complete response - Residual invasive disease present after therapy
- 3.0 (11 cases): DCIS only remaining - All invasive disease eliminated, but ductal carcinoma in situ still present
- 4.0 (1 case): LCIS only remaining - All invasive disease eliminated, but lobular carcinoma in situ still present
- 5.0 (12 cases): Treatment response assessment unavailable - Unable to determine response

The values show that most patients (224) did not achieve a complete response to neoadjuvant therapy, while 64 patients had a complete response (pCR - pathologic Complete Response). The relatively small number of patients with only DCIS or LCIS remaining represents an intermediate response category where the invasive component was eliminated but pre-invasive disease remained.

This variable is often used as a primary endpoint in breast cancer research, as achieving pCR is associated with improved long-term outcomes in many breast cancer subtypes.

In [448]:
df.Pathologic_Response_to_Neoadjuvant_Therapy.value_counts()

Pathologic_Response_to_Neoadjuvant_Therapy
2.0    224
1.0     64
5.0     12
3.0     11
4.0      1
Name: count, dtype: int64

Mapping from response code to pCR:
* 5 -> None
* 1 -> 1
* 2, 3, 4 -> 0

In [449]:
pcr_dict={2:0,3:0,4:0,5:None,1:1}
df['pCR']=df['Pathologic_Response_to_Neoadjuvant_Therapy'].map(pcr_dict)

In [450]:
df['pCR'].value_counts(dropna=False)

pCR
NaN    622
0.0    236
1.0     64
Name: count, dtype: int64

In [451]:
pd.crosstab(df.Neoadjuvant_Chemotherapy, df['pCR'], dropna=False)

pCR,0.0,1.0,NaN
Neoadjuvant_Chemotherapy,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,19,0,582
1.0,217,64,11
,0,0,29


In [452]:
pd.crosstab(df.Neoadjuvant_Chemotherapy, df.Neoadjuvant_Endocrine_Therapy_Medications, dropna=False)

Neoadjuvant_Endocrine_Therapy_Medications,0.0,1.0,NaN
Neoadjuvant_Chemotherapy,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,582,19,0
1.0,288,3,1
,0,0,29


Patients who did not receive neoadjuvant chemotherapy (Neoadjuvant_Chemotherapy == 0) should have pCR = None.

In [453]:
df['pCR']=np.where(df.Neoadjuvant_Chemotherapy==0, np.nan, df['pCR'])

In [454]:
pd.crosstab(df.Neoadjuvant_Chemotherapy, df['pCR'], dropna=False)

pCR,0.0,1.0,NaN
Neoadjuvant_Chemotherapy,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,0,0,601
1.0,217,64,11
,0,0,29


In [455]:
pd.crosstab(df.Pathologic_Response_to_Neoadjuvant_Therapy, df['pCR'], dropna=False)

pCR,0.0,1.0,NaN
Pathologic_Response_to_Neoadjuvant_Therapy,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1.0,0,64,0
2.0,205,0,19
3.0,11,0,0
4.0,1,0,0
5.0,0,0,12
,0,0,610


In [456]:
df.to_csv('duke_tcia_metadata.csv', index = False)

In [457]:
pd.crosstab(df.Neoadjuvant_Chemotherapy,
            df.Neoadjuvant_Endocrine_Therapy_Medications, dropna=False)

Neoadjuvant_Endocrine_Therapy_Medications,0.0,1.0,NaN
Neoadjuvant_Chemotherapy,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,582,19,0
1.0,288,3,1
,0,0,29


In [458]:
pd.crosstab(df.Neoadjuvant_Endocrine_Therapy_Medications,
            df.pCR, dropna=False)

pCR,0.0,1.0,NaN
Neoadjuvant_Endocrine_Therapy_Medications,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,214,64,592
1.0,3,0,19
,0,0,30


In [459]:
pd.crosstab(df.Adjuvant_Chemotherapy,
            df.pCR)

pCR,0.0,1.0
Adjuvant_Chemotherapy,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,173,62
1.0,41,2


In [460]:
pd.crosstab(df.Adjuvant_Chemotherapy,
            df.Neoadjuvant_Chemotherapy)

Neoadjuvant_Chemotherapy,0.0,1.0
Adjuvant_Chemotherapy,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,294,243
1.0,291,44


In [461]:
pd.crosstab(df['Near_Complete_ResponseOverall_Near-complete_Response:__Stricter_Definition'],
            df.pCR)

pCR,0.0,1.0
Near_Complete_ResponseOverall_Near-complete_Response:__Stricter_Definition,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,188,0
1.0,0,64
2.0,29,0


* 0 = not complete or near-complete
* 1 = complete
* 2 = near-complete
* 3 = pathologic assessment not available
* NA = not applicable

In [462]:
pd.crosstab(df['Near_Complete_ResponseOverall_Near-complete_Response:__Stricter_Definition'],
            df.Adjuvant_Chemotherapy)

Adjuvant_Chemotherapy,0.0,1.0
Near_Complete_ResponseOverall_Near-complete_Response:__Stricter_Definition,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,157,48
1.0,62,2
2.0,28,0
3.0,9,1


In [463]:
pd.crosstab(df['Near_Complete_ResponseOverall_Near-complete_Response:__Stricter_Definition'],
            df.Neoadjuvant_Chemotherapy)

Neoadjuvant_Chemotherapy,0.0,1.0
Near_Complete_ResponseOverall_Near-complete_Response:__Stricter_Definition,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,19,188
1.0,0,64
2.0,0,29
3.0,1,11


In [464]:
df.pCR.value_counts()

pCR
0.0    217
1.0     64
Name: count, dtype: int64

In [465]:
df.pCR.value_counts(dropna = False)

pCR
NaN    641
0.0    217
1.0     64
Name: count, dtype: int64

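## pCR summary
We have:
* 19 - not complete or near-complete response (near-complete code 0) without neoadjuvant chemotherapy (Neoadjuvant_Chemotherapy == 0); the crosstabs above suggest these patients received neoadjuvant endocrine therapy, so their pCR is set to None
* 214 - no complete response after Neoadjuvant_Chemotherapy, without neoadjuvant endocrine therapy (`df.pCR` holds 217 zeros because it still includes 3 patients who also received endocrine therapy)
* 64 - complete response after Neoadjuvant_Chemotherapy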
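The cohort size and pCR rate quoted in the definition at the top (278 patients, 23.0%) follow from these counts. The sketch below reproduces that arithmetic on a synthetic Series built from the stated group sizes; the helper name `pcr_rate` is hypothetical, and the Series is constructed for illustration rather than read from `df`.

```python
import numpy as np
import pandas as pd

def pcr_rate(pcr: pd.Series):
    """Hypothetical helper: (evaluable patients, fraction with pCR == 1)."""
    evaluable = pcr.dropna()  # drop the "irrelevant" (NaN) group
    return len(evaluable), float(evaluable.mean())

# Synthetic labels with the group sizes stated in the definition:
# 64 responders, 214 non-responders, 644 irrelevant cases.
labels = pd.Series([1.0] * 64 + [0.0] * 214 + [np.nan] * 644)
n_evaluable, rate = pcr_rate(labels)
print(f"{n_evaluable} patients, pCR rate {rate:.1%}")
```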


In [467]:
df.columns

Index(['Patient_ID', 'Slice_Thickness', 'Rows', 'Columns', 'FOVcm',
       'Date_of_Birth_(Days)', 'Menopause_(at_diagnosis)',
       'Race_and_Ethnicity', 'ER', 'PR', 'HER2', 'Mol_Subtype',
       'Staging(Tumor_Size)#_[T]', 'Tumor_Grade', 'Unnamed:_14', 'Unnamed:_15',
       'Nottingham_grade', 'Histologic_type', 'Tumor_Location', 'Position',
       'Bilateral_Information', 'Clinical_Response,_Evaluated_Through_Imaging',
       'Pathologic_Response_to_Neoadjuvant_Therapy',
       'Neoadjuvant_Chemotherapy', 'Adjuvant_Chemotherapy',
       'Neoadjuvant_Endocrine_Therapy_Medications',
       'Adjuvant_Endocrine_Therapy_Medications',
       'Near_Complete_ResponseOverall_Near-complete_Response:__Stricter_Definition',
       'slice_thick', 'n_xy', 'age', 'pCR'],
      dtype='object')

In [468]:
df = df[['Patient_ID', 'Slice_Thickness', 'Rows', 'Columns', 'FOVcm',
       'Date_of_Birth_(Days)', 'Menopause_(at_diagnosis)',
       'Race_and_Ethnicity', 'ER', 'PR', 'HER2', 'Mol_Subtype',
       'Staging(Tumor_Size)#_[T]', 'Tumor_Grade', 
       'Nottingham_grade', 'Histologic_type',
       'Bilateral_Information', 
       'slice_thick', 'n_xy', 'age', 'pCR']]

In [469]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 922 entries, 0 to 921
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Patient_ID                922 non-null    object 
 1   Slice_Thickness           922 non-null    int64  
 2   Rows                      922 non-null    int64  
 3   Columns                   922 non-null    int64  
 4   FOVcm                     922 non-null    int64  
 5   Date_of_Birth_(Days)      922 non-null    int64  
 6   Menopause_(at_diagnosis)  922 non-null    int64  
 7   Race_and_Ethnicity        922 non-null    int64  
 8   ER                        922 non-null    int64  
 9   PR                        922 non-null    int64  
 10  HER2                      922 non-null    int64  
 11  Mol_Subtype               922 non-null    int64  
 12  Staging(Tumor_Size)#_[T]  916 non-null    float64
 13  Tumor_Grade               907 non-null    float64
 14  Nottingham

In [470]:
df['ER'].value_counts()

ER
1    686
0    236
Name: count, dtype: int64

In [471]:
df['PR'].value_counts()

PR
1    598
0    324
Name: count, dtype: int64

In [472]:
df['HER2'].value_counts()

HER2
0    759
1    163
Name: count, dtype: int64

In [473]:
df['HR']=np.where(df['ER']+df['PR']>0,1,0)

In [474]:
df['HR'].value_counts()

HR
1    699
0    223
Name: count, dtype: int64

# Mol_Subtype

In [475]:
df['Mol_Subtype'].value_counts()

Mol_Subtype
0    595
3    164
1    104
2     59
Name: count, dtype: int64

**Mol_Subtype** refers to molecular subtypes of breast cancer, which are important classifications based on receptor status that help determine treatment strategies and predict prognosis.

- **0 = luminal-like**: These are estrogen receptor (ER) and/or progesterone receptor (PR) positive, but HER2 negative cancers. Often divided into Luminal A and Luminal B subtypes. They generally have better prognosis and respond to hormone therapy.

- **1 = ER/PR pos, HER2 pos**: These are "triple positive" cancers that express both hormone receptors (ER/PR) and HER2. They can be treated with a combination of hormone therapy and HER2-targeted therapies.

- **2 = her2**: These are HER2-positive but hormone receptor-negative (ER/PR negative) cancers. They tend to be more aggressive but respond well to HER2-targeted therapies.

- **3 = trip neg**: Triple-negative breast cancers (TNBC) lack expression of all three receptors (ER-negative, PR-negative, and HER2-negative). These are generally more aggressive with fewer targeted treatment options, often requiring chemotherapy.

In breast MRI research and clinical trials like ISPY2 or ACRIN-6698, these molecular subtypes are crucial variables as they influence:
- Contrast enhancement patterns in DCE-MRI
- Response to neoadjuvant therapy
- Correlation between imaging features and tumor biology
- Prediction of pathological complete response (pCR)

This classification system helps researchers stratify patients in analysis and may reveal different imaging characteristics across molecular subtypes.

### Mol_Subtype
Codes:
* 0 = luminal-like (ER and/or PR positive, HER2 negative)
* 1 = ER/PR positive, HER2 positive
* 2 = HER2 (HER2 positive, ER/PR negative)
* 3 = triple negative
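The subtype codes can be cross-checked against the individual receptor columns. The sketch below derives a subtype code from ER, PR, and HER2 and counts disagreements with `Mol_Subtype`. The function name `subtype_from_receptors` and the toy `demo` frame are assumptions made for illustration; this is a consistency check under the coding listed above, not the dataset's official derivation.

```python
import pandas as pd

def subtype_from_receptors(er: int, pr: int, her2: int) -> int:
    """Hypothetical helper: map receptor status to the Mol_Subtype codes."""
    hr_positive = (er == 1) or (pr == 1)  # hormone receptor positive
    if hr_positive and her2 == 0:
        return 0  # luminal-like
    if hr_positive and her2 == 1:
        return 1  # ER/PR positive, HER2 positive
    if her2 == 1:
        return 2  # HER2 positive, hormone receptor negative
    return 3      # triple negative

# Toy rows (made up), one per subtype.
demo = pd.DataFrame({
    "ER": [1, 0, 0, 0],
    "PR": [0, 1, 0, 0],
    "HER2": [0, 1, 1, 0],
    "Mol_Subtype": [0, 1, 2, 3],
})
derived = demo.apply(
    lambda row: subtype_from_receptors(row["ER"], row["PR"], row["HER2"]), axis=1
)
mismatches = int((derived != demo["Mol_Subtype"]).sum())
print(mismatches)
```

Running the same check on the full table should report zero mismatches if the receptor columns and `Mol_Subtype` agree.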

In [476]:
df['Mol_Subtype'].value_counts()

Mol_Subtype
0    595
3    164
1    104
2     59
Name: count, dtype: int64

In [477]:
df['hormon_status']=df['Mol_Subtype'].map({0:'luminal_hrpos_her2neg',
                                          1:'hrpos_her2pos',
                                          2:'hrneg_her2pos',
                                          3:'triple_neg' })

In [478]:
pd.crosstab(df['Mol_Subtype'],df['HR'])

HR,0,1
Mol_Subtype,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0,595
1,0,104
2,59,0
3,164,0


In [479]:
pd.crosstab(df['Mol_Subtype'],df['HER2'])

HER2,0,1
Mol_Subtype,Unnamed: 1_level_1,Unnamed: 2_level_1
0,595,0
1,0,104
2,0,59
3,164,0


In [480]:
import pandas as pd
import numpy as np

def create_hr_her2_status(df):
    """
    Creates HR_HER2 status column matching the categories:
    HER2pos, HRposHER2neg, TripleNeg

    Parameters:
    -----------
    df : pandas DataFrame
        Must contain columns 'HR' and 'HER2'

    Returns:
    --------
    pandas Series
        HR_HER2 status with categories matching the target format
    """
    conditions = [
        (df['HER2'] == 1),                          # HER2 positive
        (df['HR'] == 1) & (df['HER2'] == 0),       # HR positive and HER2 negative
        (df['HR'] == 0) & (df['HER2'] == 0)        # Triple negative
    ]

    choices = ['HER2pos', 'HRposHER2neg', 'TripleNeg']

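    # Reuse df's index so the result aligns on assignment even if the index is not 0..n-1
    return pd.Series(np.select(conditions, choices, default='Unknown'), index=df.index)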

In [481]:
df['HR_HER2'] = create_hr_her2_status(df)
df['HR_HER2'].value_counts(dropna=False)

HR_HER2
HRposHER2neg    595
TripleNeg       164
HER2pos         163
Name: count, dtype: int64

In [482]:
df['TripleNeg'] = np.where(df.HR_HER2=='TripleNeg',1,0)

df['HER2pos'] = np.where(df.HR_HER2=='HER2pos',1,0)

df['HRposHER2neg'] = np.where(df.HR_HER2=='HRposHER2neg',1,0)

In [483]:
pd.crosstab(df['Mol_Subtype'], df['HR_HER2'])

HR_HER2,HER2pos,HRposHER2neg,TripleNeg
Mol_Subtype,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,595,0
1,104,0,0
2,59,0,0
3,0,0,164


In [484]:
pd.crosstab(df['Mol_Subtype'], df['HER2'])

HER2,0,1
Mol_Subtype,Unnamed: 1_level_1,Unnamed: 2_level_1
0,595,0
1,0,104
2,0,59
3,164,0


In [485]:
pd.crosstab(df['Mol_Subtype'], df['HR'])

HR,0,1
Mol_Subtype,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0,595
1,0,104
2,59,0
3,164,0


In [486]:
pd.crosstab(df['Mol_Subtype'], df['ER'])

ER,0,1
Mol_Subtype,Unnamed: 1_level_1,Unnamed: 2_level_1
0,10,585
1,3,101
2,59,0
3,164,0


In [487]:
pd.crosstab(df['Mol_Subtype'], df['PR'])

PR,0,1
Mol_Subtype,Unnamed: 1_level_1,Unnamed: 2_level_1
0,77,518
1,24,80
2,59,0
3,164,0


In [488]:
df['Menopause_(at_diagnosis)'].value_counts()

Menopause_(at_diagnosis)
1    499
0    407
2     16
Name: count, dtype: int64

In [490]:
df['menopause'] = df['Menopause_(at_diagnosis)'].map({1:1,0:0,2:None})

In [491]:
df.to_csv('duke_tcia_metadata.csv', index = False)

In [492]:
df[df.menopause.isna()].age

6      44
28     64
87     42
119    46
121    47
308    62
381    32
471    47
539    48
583    54
598    51
668    80
812    40
866    55
875    56
905    45
Name: age, dtype: int64

In [493]:
# Fill missing menopause status with an age-based proxy:
# age >= 50 -> 1, otherwise 0
df['menopause'] = df['menopause'].fillna(
    (df['age'] >= 50).astype(int)
)

# Verify the changes
print(f"menopause: {df['menopause'].value_counts()}")
print(f"Missing values remaining: {df['menopause'].isna().sum()}")

menopause: menopause
1.0    506
0.0    416
Name: count, dtype: int64
Missing values remaining: 0


In [494]:
import warnings
warnings.filterwarnings('ignore')

In [495]:
df.columns

Index(['Patient_ID', 'Slice_Thickness', 'Rows', 'Columns', 'FOVcm',
       'Date_of_Birth_(Days)', 'Menopause_(at_diagnosis)',
       'Race_and_Ethnicity', 'ER', 'PR', 'HER2', 'Mol_Subtype',
       'Staging(Tumor_Size)#_[T]', 'Tumor_Grade', 'Nottingham_grade',
       'Histologic_type', 'Bilateral_Information', 'slice_thick', 'n_xy',
       'age', 'pCR', 'HR', 'hormon_status', 'HR_HER2', 'TripleNeg', 'HER2pos',
       'HRposHER2neg', 'menopause'],
      dtype='object')

In [496]:
df.to_csv('duke_tcia_metadata.csv', index = False)

In [497]:
df.columns

Index(['Patient_ID', 'Slice_Thickness', 'Rows', 'Columns', 'FOVcm',
       'Date_of_Birth_(Days)', 'Menopause_(at_diagnosis)',
       'Race_and_Ethnicity', 'ER', 'PR', 'HER2', 'Mol_Subtype',
       'Staging(Tumor_Size)#_[T]', 'Tumor_Grade', 'Nottingham_grade',
       'Histologic_type', 'Bilateral_Information', 'slice_thick', 'n_xy',
       'age', 'pCR', 'HR', 'hormon_status', 'HR_HER2', 'TripleNeg', 'HER2pos',
       'HRposHER2neg', 'menopause'],
      dtype='object')

In [498]:
df.Race_and_Ethnicity.value_counts()

Race_and_Ethnicity
1    651
2    203
0     19
5     18
3     14
6      9
8      4
4      3
7      1
Name: count, dtype: int64

In [499]:
df['r_white']=np.where(df.Race_and_Ethnicity==1,1,0)
df['r_black']=np.where(df.Race_and_Ethnicity==2,1,0)


In [500]:
df.r_black.value_counts(dropna=False)

r_black
0    719
1    203
Name: count, dtype: int64

# Tumor characteristics
## Tumor stage

The `Staging(Tumor_Size)#_[T]` column name indicates the T (tumor size) category of TNM staging rather than the overall cancer stage. The values in the Duke Breast MRI dataset are:

1. T1 (1.0, 409 cases): small primary tumor
2. T2 (2.0, 395 cases): intermediate tumor size
3. T3 (3.0, 90 cases): large primary tumor
4. T4 (4.0, 22 cases): most advanced T category (in TNM, extension to the chest wall or skin rather than distant metastasis)

Most cases in the dataset fall in the T1 and T2 categories, with far fewer in T3 and T4.

Tumor size is only one indicator of severity, and a smaller tumor can be more severe than a larger one. Staging takes multiple factors into account:

1. **Tumor Size**: Only one component of staging
2. **Cellular Aggressiveness**: How quickly the tumor grows
3. **Genetic Characteristics**: Molecular markers
4. **Spread Potential**: Lymph node involvement
5. **Invasion Capability**: How the tumor interacts with surrounding tissues

Real-world example: A small 1 cm tumor with aggressive, rapidly dividing cells and high metastatic potential could be more severe than a 5 cm tumor with slow-growing, less invasive cells.

Think of it like this:
- A tiny but "smart" and aggressive cancer cell can be more dangerous than a larger, slower-growing tumor
- Staging tries to capture this complexity, not just measure size

Oncologists use multiple tests to determine:
- Tumor size
- Cellular characteristics
- Genetic markers
- Potential for spread

So while size is important, it's not the only factor determining a cancer's severity or stage. Each tumor is unique and requires a comprehensive assessment.

Staging is determined before any treatment begins, that is, before surgery, chemotherapy, or radiation.

This initial staging is crucial because:
1. It provides the baseline understanding of the cancer
2. It guides the entire treatment plan
3. It helps predict potential outcomes
4. It allows for personalized treatment approach

The process involves:
- Comprehensive imaging (MRI, CT scans)
- Detailed biopsy
- Molecular and genetic testing
- Physical examinations
- Assessing potential spread

Think of it like a complete medical detective work before deciding on the treatment strategy. The goal is to understand the cancer as comprehensively as possible before taking any action.

After initial treatment, there might be a pathological restaging, which can refine the initial assessment, but the original pre-treatment staging remains the primary reference point.

## Tumor grade
Tumor grading follows standardized criteria but retains some subjectivity.

Standardization Efforts:
1. Worldwide organizations like WHO have created guidelines
2. Most common system is the Nottingham Grading System
3. Specific scoring criteria for:
   - Cell differentiation
   - Mitotic activity
   - Nuclear characteristics

Sources of Subjectivity:
- Different pathologists may interpret cellular characteristics slightly differently
- Studies show up to 20-30% variation in grade assignment
- More subjective in complex or borderline cases

Reduction of Subjectivity:
- Multiple pathologist reviews
- Quantitative scoring systems
- Emerging AI-assisted digital pathology tools
- Molecular testing to complement visual grading

Real-World Analogy:
Think of it like art criticism. While there are established criteria for evaluating art, there's still room for individual interpretation. Similarly, tumor grading has standard guidelines, but some expert judgment is always involved.

Key Takeaway: While there are strong standardization efforts, some level of subjectivity is inherent in the process. This is why:
- Multiple expert opinions are valuable
- Molecular testing is increasingly used
- Continuous research refines grading techniques

In [501]:
df['Tumor_Grade'].value_counts(dropna=False)

Tumor_Grade
3.0    695
2.0    144
1.0     68
NaN     15
Name: count, dtype: int64

In [502]:
df['Nottingham_grade'].value_counts(dropna=False)

Nottingham_grade
2.0    318
NaN    284
3.0    207
1.0    113
Name: count, dtype: int64

In [503]:
cc=[ 'ER', 'PR', 'HER2', 'Mol_Subtype',
       'Staging(Tumor_Size)#_[T]', 'Tumor_Grade', 'Nottingham_grade',
       'Histologic_type']

**Key Point: These NaN values are not random errors, but likely represent real limitations in the diagnostic process.**
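If the missing grades are informative rather than random, one option is to keep that information as an explicit indicator column instead of imputing a value. The sketch below shows this idea on a toy frame; the helper name `add_missing_flag` and the `_missing` column suffix are assumptions, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

def add_missing_flag(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: add a 0/1 column marking rows where `column` is NaN."""
    out = frame.copy()  # leave the caller's frame untouched
    out[f"{column}_missing"] = out[column].isna().astype(int)
    return out

# Toy values (made up for illustration).
demo = pd.DataFrame({"Nottingham_grade": [2.0, np.nan, 3.0, np.nan]})
flagged = add_missing_flag(demo, "Nottingham_grade")
print(flagged["Nottingham_grade_missing"].tolist())
```

A downstream model can then use the flag alongside the (possibly imputed) grade, so the fact that a grade was unavailable is not lost.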

In [504]:
df[df['Nottingham_grade'].isna()][cc]

Unnamed: 0,ER,PR,HER2,Mol_Subtype,Staging(Tumor_Size)#_[T],Tumor_Grade,Nottingham_grade,Histologic_type
2,1,1,0,0,2.0,3.0,,
5,1,1,0,0,1.0,1.0,,
10,0,0,0,3,1.0,3.0,,
15,1,1,0,0,3.0,3.0,,
21,1,1,1,1,2.0,3.0,,
...,...,...,...,...,...,...,...,...
902,0,0,1,2,1.0,2.0,,
903,1,1,0,0,1.0,3.0,,1.0
904,1,1,0,0,2.0,2.0,,
910,1,0,0,0,2.0,3.0,,


In [505]:
pd.crosstab(df['Nottingham_grade'], df.pCR, dropna=False)

pCR,0.0,1.0,NaN
Nottingham_grade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1.0,9,0,104
2.0,71,9,238
3.0,63,37,107
,74,18,192


In [506]:
pd.crosstab(df['Nottingham_grade'], df.Histologic_type, dropna=False)

Histologic_type,0.0,1.0,2.0,3.0,5.0,9.0,NaN
Nottingham_grade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1.0,0,93,18,0,2,0,0
2.0,1,273,39,1,0,4,0
3.0,0,201,6,0,0,0,0
,0,8,0,0,0,0,276


In [507]:
pd.crosstab(df.pCR, df.Histologic_type, dropna=False)

Histologic_type,0.0,1.0,2.0,3.0,5.0,9.0,NaN
pCR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0.0,0,131,11,0,0,1,74
1.0,0,46,0,0,0,0,18
,1,398,52,1,2,3,184


## Histologic Type distribution in the Duke MRI dataset:

The Histologic Type codes break down as follows:

1. **1.0 (Invasive Ductal Carcinoma - IDC)**
   - Most common type
   - Starts in milk ducts
   - Breaks through duct wall
   - Represents majority of cases (575/922 cases)

2. **2.0 (Invasive Lobular Carcinoma - ILC)**
   - Begins in milk-producing lobules
   - Spreads to surrounding tissue
   - Less common (63/922 cases)

3. **Other Values (0.0, 3.0, 5.0, 9.0)**
   - Rare or specialized subtypes
   - Could include:
     * Medullary Carcinoma
     * Metaplastic Carcinoma
     * Inflammatory Breast Cancer
     * Mixed cell types

4. **NaN (276 cases)**
   - Missing or undetermined histologic type
   - Could be due to:
     * Incomplete medical records
     * Challenging sample analysis
     * Preliminary diagnosis

The key takeaway is that the dataset is predominantly composed of Invasive Ductal Carcinoma (1.0), with a small number of other types and a significant portion of unclassified (NaN) cases.

import pandas as pd

# Read specific sheet
ddd = pd.read_excel('E:\\duke\\Clinical_and_Other_Features.xlsx', sheet_name='Column Explanations')

# Save as CSV
ddd.to_csv('duke_clinical_explenation.csv', index=False)

In [508]:
df['Histologic_type'].value_counts(dropna=False)

Histologic_type
1.0    575
NaN    276
2.0     63
9.0      4
5.0      2
3.0      1
0.0      1
Name: count, dtype: int64

In [509]:
hist_dict={1:'IDC', 2:'ILC',9:'rare',3:'rare',0:'rare',5:'rare'}

In [510]:
df['hist_type']=df['Histologic_type'].map(hist_dict)

In [511]:
df['hist_type'].value_counts(dropna=False)

hist_type
IDC     575
NaN     276
ILC      63
rare      8
Name: count, dtype: int64

### Bilateral_Information
Practical implications:

- 0 (Unilateral): standard treatment approach
- 1 (Bilateral):
  - More comprehensive surgery
  - Potential bilateral mastectomy
  - Enhanced genetic testing
- NC: requires further investigation

In [512]:
df['Bilateral_Information'].value_counts(dropna=False)

Bilateral_Information
0     623
NC    271
1      28
Name: count, dtype: int64

In [513]:
df['Bilateral_Information'].values[3]==0

True

In [514]:
df['Bilateral_Information'].values[2]=='NC'

True

In [515]:
df['Bilateral']=df['Bilateral_Information'].map({0:0,1:1,'NC':np.nan})

In [516]:
df['Bilateral'].value_counts(dropna=False)

Bilateral
0.0    623
NaN    271
1.0     28
Name: count, dtype: int64

### Rows, Columns
Codes: 0 = 320, 1 = 448, 2 = 512



In [517]:
df['n_xy'].value_counts()

n_xy
512    628
448    261
320     33
Name: count, dtype: int64

In [518]:
df['pid']=df['Patient_ID']

In [519]:
df.columns

Index(['Patient_ID', 'Slice_Thickness', 'Rows', 'Columns', 'FOVcm',
       'Date_of_Birth_(Days)', 'Menopause_(at_diagnosis)',
       'Race_and_Ethnicity', 'ER', 'PR', 'HER2', 'Mol_Subtype',
       'Staging(Tumor_Size)#_[T]', 'Tumor_Grade', 'Nottingham_grade',
       'Histologic_type', 'Bilateral_Information', 'slice_thick', 'n_xy',
       'age', 'pCR', 'HR', 'hormon_status', 'HR_HER2', 'TripleNeg', 'HER2pos',
       'HRposHER2neg', 'menopause', 'r_white', 'r_black', 'hist_type',
       'Bilateral', 'pid'],
      dtype='object')

In [520]:
cols=['pid', 'pCR', 'slice_thick', 'n_xy',
      'menopause','age','r_white', 'r_black',
       'hormon_status', 'HR_HER2',
      'HR', 'ER', 'PR', 'HER2',
      'TripleNeg','HER2pos', 'HRposHER2neg',
       'hist_type','Tumor_Grade','Nottingham_grade','Bilateral'
     ]

In [521]:
df=df[cols]
df.head()

Unnamed: 0,pid,pCR,slice_thick,n_xy,menopause,age,r_white,r_black,hormon_status,HR_HER2,...,ER,PR,HER2,TripleNeg,HER2pos,HRposHER2neg,hist_type,Tumor_Grade,Nottingham_grade,Bilateral
0,Breast_MRI_001,0.0,1.1,448,0.0,41,0,1,hrneg_her2pos,HER2pos,...,0,0,1,0,1,0,IDC,3.0,2.0,0.0
1,Breast_MRI_002,0.0,1.3,512,0.0,38,0,1,triple_neg,TripleNeg,...,0,0,0,1,0,0,IDC,3.0,3.0,0.0
2,Breast_MRI_003,,2.0,512,1.0,62,1,0,luminal_hrpos_her2neg,HRposHER2neg,...,1,1,0,0,0,1,,3.0,,
3,Breast_MRI_004,,2.0,512,1.0,58,1,0,luminal_hrpos_her2neg,HRposHER2neg,...,1,1,0,0,0,1,IDC,3.0,2.0,0.0
4,Breast_MRI_005,1.0,2.0,448,0.0,38,0,0,hrpos_her2pos,HER2pos,...,1,0,1,0,1,0,IDC,3.0,3.0,0.0


In [522]:
df.r_black.value_counts()

r_black
0    719
1    203
Name: count, dtype: int64

In [523]:
df.to_csv('duke_tcia_metadata.csv', index = False)