## **Introduction**

In this notebook, we perform an initial exploration of five clinical datasets from the MSK-CHORD 2024 study, available through cBioPortal. The ydata_profiling library is used to generate comprehensive reports that summarize each dataset’s structure, distributions, missing values, and inter-variable relationships.

This exploratory data analysis (EDA) is intended to provide a solid understanding of the clinical and molecular variables associated with patients and samples, serving as a foundation for downstream tasks such as survival analysis and, especially, predicting response to cancer therapy.

## **Description of datasets**

📊 Dataset Overview
1. **clinical_patient_raw** (26 variables)
This dataset contains patient-level clinical information.

Key variables include:

- PATIENT_ID: Unique patient identifier.

- GENDER, RACE, ETHNICITY: Demographic variables.

- CURRENT_AGE_DEID: Patient’s age (de-identified; capped at 89).

- STAGE_HIGHEST_RECORDED: Highest recorded cancer stage from the tumor registry (stage 1-3, stage 4, Unknown).

- NUM_ICDO_DX: Number of tumor diagnoses (ICD-O codes).

- Tumor site history (10 columns: ADRENAL_GLANDS, BONE, CNS_BRAIN, INTRA_ABDOMINAL, LIVER, LUNG, LYMPH_NODES, OTHER, PLEURA, REPRODUCTIVE_ORGANS): Indicates whether metastasis or tumor involvement was observed at each anatomical site, derived via NLP from radiology reports.

- SMOKING_PREDICTIONS_3_CLASSES: Inferred smoking history (Current/Former, Never, or Unknown).

- GLEASON_FIRST_REPORTED / GLEASON_HIGHEST_REPORTED: First and highest Gleason scores reported in pathology (relevant for prostate cancer).

- HISTORY_OF_PDL1: Whether the patient had a PD-L1 positive sample (related to immunotherapy).

- PRIOR_MED_TO_MSK: Indicates whether the patient received anti-cancer treatment prior to admission at MSK.

- OS_MONTHS / OS_STATUS: Overall survival in months and survival status (Alive or Deceased).

- HR / HER2: Hormone receptor and HER2 status, relevant in cancers like breast cancer.

2. **clinical_sample_raw** (24 variables)
This dataset contains sample-level clinical and molecular data.

Key variables include:

- SAMPLE_ID / PATIENT_ID: Sample and associated patient identifiers.

- GLEASON_SAMPLE_LEVEL: Gleason score specific to the sample.

- PDL1_POSITIVE: Indicates whether the sample tested positive for PD-L1.

- CANCER_TYPE / CANCER_TYPE_DETAILED / PRIMARY_SITE: Cancer classification and primary tumor site.

- SAMPLE_TYPE / SAMPLE_CLASS: Describes the sample source (e.g., biopsy, surgical) and classification (tumor, normal, etc.).

- METASTATIC_SITE: Metastatic site, if applicable.

- GENE_PANEL: Gene panel used for sequencing.

- SAMPLE_COVERAGE: Sequencing depth or coverage of the sample.

- TUMOR_PURITY: Estimated percentage of tumor cells in the sample.

- MSI_SCORE / MSI_TYPE / MSI_COMMENT: Microsatellite instability (MSI) metrics and annotations.

- TMB_NONSYNONYMOUS: Tumor mutational burden (number of nonsynonymous mutations).

- CLINICAL_GROUP / PATHOLOGICAL_GROUP: Clinical and pathological groupings used by the institution.

- ICD_O_HISTOLOGY_DESCRIPTION / DIAGNOSIS_DESCRIPTION: Diagnostic histopathology descriptions.

- CLINICAL_SUMMARY: NLP-derived summary of the sample’s clinical context.

3. **data_timeline_treatment** (data_treatment): (8 variables)
The treatment dataset from MSK-CHORD 2024 contains detailed information about therapeutic interventions administered to patients. It enables detailed temporal and categorical analysis of treatment patterns across patients.

Key variables include:

- PATIENT_ID: Unique anonymized patient identifier.

- START_DATE / STOP_DATE: Time frame during which the treatment occurred, given as the number of days relative to the patient's cancer diagnosis date (with Day 0 representing the date of diagnosis).

- EVENT_TYPE: General nature of the recorded event; in this dataset it is consistently labeled "Treatment".

- SUBTYPE: Type of treatment, such as chemotherapy ("Chemo").

- AGENT: Specific drug or therapeutic agent used.

- RX_INVESTIGATIVE: Whether the treatment was part of a clinical trial or investigational protocol ("Y" = investigational, "N" = standard therapy).

- FLAG_OROTOPICAL: Binary indicator of whether the treatment was administered via oral or topical routes (1 = yes, 0 = no).
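As a small illustration of the temporal analysis these fields allow, the sketch below computes treatment durations from START_DATE and STOP_DATE. It reuses the five rows shown in the `data_treatment.head()` output of this notebook. The column name `DURATION_DAYS`, the `regimens` table, and the rule that agents sharing a start day form one regimen are assumptions made for illustration, not part of the dataset.

```python
import pandas as pd

# Toy rows copied from the data_treatment.head() output in this notebook.
toy_treatment = pd.DataFrame({
    "PATIENT_ID": ["P-0000012"] * 5,
    "START_DATE": [-5437, -5437, -5437, 33, 33],
    "STOP_DATE": [-5369, -5326, -5327, 40, 65],
    "SUBTYPE": ["Chemo"] * 5,
    "AGENT": ["CYCLOPHOSPHAMIDE", "FLUOROURACIL", "METHOTREXATE",
              "CISPLATIN", "ETOPOSIDE"],
})

# Duration of each treatment record in days (hypothetical derived column).
toy_treatment["DURATION_DAYS"] = (
    toy_treatment["STOP_DATE"] - toy_treatment["START_DATE"]
)

# Assumption: agents that start on the same day belong to one regimen.
regimens = (
    toy_treatment.groupby(["PATIENT_ID", "START_DATE"])
    .agg(
        AGENTS=("AGENT", lambda s: " + ".join(sorted(s))),
        MAX_DURATION=("DURATION_DAYS", "max"),
    )
    .reset_index()
)
print(regimens)
```

Grouping by start day is only a rough proxy for a regimen; overlapping agents that start on different days would need an interval-based rule instead.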

4. **data_mutations** (123 variables)
The data_mutations.txt file captures the genomic alterations identified in tumor samples analyzed using the MSK‑IMPACT targeted sequencing platform. Each variable records specific details about somatic mutations, including their biological and functional context.
Key categories of variables include:

- Sample identifiers and patient metadata
Variables like Tumor_Sample_Barcode and PATIENT_ID link each mutation to both the biological sample and the patient.

- Genomic coordinates and variant description
Fields such as Hugo_Symbol, Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, and Variant_Classification/Type describe the gene affected, chromosomal location, nucleotide change, and mutation type (e.g., missense, nonsense, frameshift).

- Protein-level annotation
Variables like HGVSp_Short and HGVSp_Long provide protein change notations (e.g., p.V600E).

- Functional impact and consequence
Includes Variant_Classification, Variant_Type, Transcript_ID, and Exon_Number to assess mutation effect.

- Allelic and sequencing metrics
Metrics such as Tumor_Seq_Allele1/2, t_depth, t_ref_count, t_alt_count, and n_depth/ref/alt_count record the number of reads supporting the variant, both in tumor and normal samples.

- Allele frequency and coverage
Variables like t_alt_freq and n_alt_freq indicate the proportion of reads carrying the mutation.

- Database annotations and filtering flags
Includes dbSNP_RS, dbSNP_Val_Status, is_silent, FILTER, and Sequencing_Phase.

- Clinical and therapeutic implications
Fields such as IMPACT_Classification, AMRFAM, OncoKB, COSMIC_ID, PharmGKB, and Germline provide information on clinical actionability, known pathogenicity, drug associations, and germline origin.

- Bioinformatic and QC annotations
Includes algorithmic predictions, QC scores, and annotation flags indicating mutation reliability and filtering status.
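The allelic read counts listed above can be turned into a variant allele frequency. The sketch below is a minimal example on made-up read counts; it assumes the frequency is defined as `t_alt_count / (t_ref_count + t_alt_count)`, and the `vaf` column name is hypothetical. Whether this matches how `t_alt_freq` was computed in the file is not confirmed here.

```python
import pandas as pd

# Hypothetical read counts; only the column names come from the dataset.
toy_mutations = pd.DataFrame({
    "Hugo_Symbol": ["EGFR", "TP53", "TP53"],
    "Variant_Classification": [
        "In_Frame_Del", "Missense_Mutation", "Nonsense_Mutation"
    ],
    "t_ref_count": [300, 170, 0],
    "t_alt_count": [100, 30, 0],
})

# Total informative reads; zero totals become NaN to avoid division by zero.
total_reads = toy_mutations["t_ref_count"] + toy_mutations["t_alt_count"]
total_reads = total_reads.where(total_reads > 0)

# Fraction of tumor reads carrying the alternate allele.
toy_mutations["vaf"] = (toy_mutations["t_alt_count"] / total_reads).round(3)
print(toy_mutations[["Hugo_Symbol", "Variant_Classification", "vaf"]])
```

Rows without read support end up with a missing `vaf` rather than an error or an infinite value, which keeps later summaries well defined.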

5. **clinical_data** (53 variables)
This dataset enables integration of molecular and clinical profiles. It captures a broad range of patient-level and sample-level characteristics relevant to cancer type, treatment history, survival outcomes, and tumor profiling.

Key fields include:

- Identifiers (Study ID, Patient ID, Sample ID) for linking with genomic and treatment data.

- Cancer diagnosis and classification (Cancer Type, Cancer Type Detailed, ICD-O Histology Description, Oncotree Code, Clinical Group, Pathological Group).

- Tumor site annotations derived via NLP (e.g., Tumor Site: Liver (NLP), Tumor Site: CNS/Brain (NLP), Metastatic Site).

- Demographics (Sex, Race, Ethnicity, Current Age, Smoking History (NLP)).

- Molecular biomarkers and pathology (e.g., HER2, PD-L1 status, HR, MSI Type, TMB, Tumor Purity, Gleason Score fields).

- Sample-specific details (Sample Type, Sample Class, Gene Panel, Somatic Status, Sample Coverage).

- Clinical outcomes (Overall Survival (Months), Overall Survival Status).

- Other variables track prior treatments (Prior Treatment to MSK (NLP)), mutation burden (Mutation Count), and tumor staging (Stage (Highest Recorded)).


### **Install dependencies**

In [1]:
!pip install -U ydata-profiling

Collecting ydata-profiling
  Downloading ydata_profiling-4.16.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting visions<0.8.2,>=0.7.5 (from visions[type_image_path]<0.8.2,>=0.7.5->ydata-profiling)
  Downloading visions-0.8.1-py3-none-any.whl.metadata (11 kB)
Collecting htmlmin==0.1.12 (from ydata-profiling)
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting phik<0.13,>=0.11.1 (from ydata-profiling)
  Downloading phik-0.12.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting multimethod<2,>=1.4 (from ydata-profiling)
  Downloading multimethod-1.12-py3-none-any.whl.metadata (9.6 kB)
Collecting imagehash==4.3.1 (from ydata-profiling)
  Downloading ImageHash-4.3.1-py2.py3-none-any.whl.metadata (8.0 kB)
Collecting dacite>=1.8 (from ydata-profiling)
  Downloading dacite-1.9.2-py3-none-any.whl.metadata (17 kB)
Collecting puremagic (from visions<0.8.2,>=0.7.5->visions[type_image_path]<0.8.2,>=0.7.5->

### **Import Libraries**

In [2]:
import torch
from datasets import load_dataset
import numpy as np
from tqdm import tqdm
import transformers

import pandas as pd
from ydata_profiling import ProfileReport


In [3]:
from google.colab import userdata
from huggingface_hub import login

token = userdata.get("HF_TOKEN")
login(token=token)
!huggingface-cli whoami

amlopeza


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
%cd /content/drive/MyDrive/Colab Notebooks/Intern_summer2025/code/

/content/drive/MyDrive/Colab Notebooks/Intern_summer2025/code


In [14]:
file_path =  '../data/msk_chord_2024/data_clinical_patient.txt'
clinical_patient_raw = pd.read_csv(file_path, sep="\t", comment= "#")
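A note on `comment="#"`: pandas drops everything after the comment character on any line, not only on lines that start with it. The sketch below uses a small in-memory file to show the difference from skipping the header rows explicitly. It assumes the file follows the cBioPortal clinical-file layout with four `#`-prefixed metadata rows before the column names; the `NOTE` column and all values are invented for illustration.

```python
import io
import pandas as pd

# Invented file contents: four '#' metadata rows, then the column names.
raw = (
    "#Patient Identifier\tNote\tSex\n"
    "#Identifier\tFree text\tSex\n"
    "#STRING\tSTRING\tSTRING\n"
    "#1\t1\t1\n"
    "PATIENT_ID\tNOTE\tGENDER\n"
    "P-1\tlesion #2 resected\tFemale\n"
    "P-2\tnone\tMale\n"
)

# comment="#" also cuts the first data row at the '#' inside the NOTE value.
with_comment = pd.read_csv(io.StringIO(raw), sep="\t", comment="#")

# skiprows=4 skips only the metadata rows and leaves data values intact.
with_skiprows = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=4)

print(with_comment)
print(with_skiprows)
```

If none of the values in a file contain `#`, both calls return the same table, so the choice only matters for free-text columns.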

## 1. Exploration of clinical patient dataset

In [5]:
# Data Exploration
ProfileReport(clinical_patient_raw)

Output hidden; open in https://colab.research.google.com to view.

In [7]:
#checking column names of clinical patient dataset
clinical_patient_raw.columns

Index(['PATIENT_ID', 'GENDER', 'RACE', 'ETHNICITY', 'CURRENT_AGE_DEID',
       'STAGE_HIGHEST_RECORDED', 'NUM_ICDO_DX', 'ADRENAL_GLANDS', 'BONE',
       'CNS_BRAIN', 'INTRA_ABDOMINAL', 'LIVER', 'LUNG', 'LYMPH_NODES', 'OTHER',
       'PLEURA', 'REPRODUCTIVE_ORGANS', 'SMOKING_PREDICTIONS_3_CLASSES',
       'GLEASON_FIRST_REPORTED', 'GLEASON_HIGHEST_REPORTED', 'HISTORY_OF_PDL1',
       'PRIOR_MED_TO_MSK', 'OS_MONTHS', 'OS_STATUS', 'HR', 'HER2'],
      dtype='object')

In [18]:
#Checking metastatic sites for patient P-0000012
clinical_patient_raw[['PATIENT_ID','ADRENAL_GLANDS', 'BONE', 'CNS_BRAIN', 'INTRA_ABDOMINAL',
                      'LIVER', 'LUNG', 'LYMPH_NODES', 'OTHER', 'PLEURA', 'REPRODUCTIVE_ORGANS']].query('`PATIENT_ID`=="P-0000012"')

Unnamed: 0,PATIENT_ID,ADRENAL_GLANDS,BONE,CNS_BRAIN,INTRA_ABDOMINAL,LIVER,LUNG,LYMPH_NODES,OTHER,PLEURA,REPRODUCTIVE_ORGANS
0,P-0000012,No,No,No,Yes,No,Yes,Yes,Yes,No,No
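The Yes/No site columns can also be summarized per patient. The sketch below rebuilds the single row shown above for P-0000012 and counts the positive sites; `N_TUMOR_SITES` and `TUMOR_SITES` are hypothetical column names introduced for this example.

```python
import pandas as pd

SITE_COLUMNS = [
    "ADRENAL_GLANDS", "BONE", "CNS_BRAIN", "INTRA_ABDOMINAL", "LIVER",
    "LUNG", "LYMPH_NODES", "OTHER", "PLEURA", "REPRODUCTIVE_ORGANS",
]

# Values copied from the P-0000012 row shown in the output above.
values = ["No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", "No"]
toy_patients = pd.DataFrame(
    [{"PATIENT_ID": "P-0000012", **dict(zip(SITE_COLUMNS, values))}]
)

# Boolean view of the site columns: True where the site is flagged "Yes".
site_flags = toy_patients[SITE_COLUMNS].eq("Yes")

# Number of positive sites and their names, per patient.
toy_patients["N_TUMOR_SITES"] = site_flags.sum(axis=1)
toy_patients["TUMOR_SITES"] = site_flags.apply(
    lambda row: ", ".join(row.index[row.to_numpy()]), axis=1
)
print(toy_patients[["PATIENT_ID", "N_TUMOR_SITES", "TUMOR_SITES"]])
```

The same two lines would apply unchanged to `clinical_patient_raw`, provided the site columns only contain "Yes" and "No" (other values would be counted as not positive).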


## 2. Exploration of clinical sample dataset

In [12]:
file_path =  '/content/drive/MyDrive/Colab Notebooks/Intern_summer2025/data/msk_chord_2024/data_clinical_sample.txt'
clinical_sample_raw = pd.read_csv(file_path, sep="\t", comment= "#")

In [11]:
ProfileReport(clinical_sample_raw)

Output hidden; open in https://colab.research.google.com to view.

In [12]:
#checking column names of clinical sample dataset
clinical_sample_raw.columns

Index(['SAMPLE_ID', 'PATIENT_ID', 'GLEASON_SAMPLE_LEVEL', 'PDL1_POSITIVE',
       'CANCER_TYPE', 'SAMPLE_TYPE', 'SAMPLE_CLASS', 'METASTATIC_SITE',
       'PRIMARY_SITE', 'CANCER_TYPE_DETAILED', 'GENE_PANEL', 'SAMPLE_COVERAGE',
       'TUMOR_PURITY', 'ONCOTREE_CODE', 'MSI_COMMENT', 'MSI_SCORE', 'MSI_TYPE',
       'SOMATIC_STATUS', 'CLINICAL_GROUP', 'PATHOLOGICAL_GROUP',
       'CLINICAL_SUMMARY', 'ICD_O_HISTOLOGY_DESCRIPTION',
       'DIAGNOSIS_DESCRIPTION', 'TMB_NONSYNONYMOUS'],
      dtype='object')

In [13]:
clinical_sample_raw['PATIENT_ID'].nunique()

24950

In [14]:
clinical_sample_raw.head()

Unnamed: 0,SAMPLE_ID,PATIENT_ID,GLEASON_SAMPLE_LEVEL,PDL1_POSITIVE,CANCER_TYPE,SAMPLE_TYPE,SAMPLE_CLASS,METASTATIC_SITE,PRIMARY_SITE,CANCER_TYPE_DETAILED,...,MSI_COMMENT,MSI_SCORE,MSI_TYPE,SOMATIC_STATUS,CLINICAL_GROUP,PATHOLOGICAL_GROUP,CLINICAL_SUMMARY,ICD_O_HISTOLOGY_DESCRIPTION,DIAGNOSIS_DESCRIPTION,TMB_NONSYNONYMOUS
0,P-0000012-T03-IM3,P-0000012,,,Non-Small Cell Lung Cancer,Metastasis,Tumor,Neck,Lung,Lung Adenocarcinoma,...,MICROSATELLITE STABLE (MSS). See MSI note below.,0.47,Stable,Matched,3B,,Distant,"Adenocarcinoma, Nos",Lung and Bronchus,32.165504
1,P-0000012-T02-IM3,P-0000012,,,Breast Cancer,Primary,Tumor,Not Applicable,Breast,Breast Invasive Ductal Carcinoma,...,MICROSATELLITE INSTABILITY-INDETERMINATE. See ...,4.1,Indeterminate,Matched,,,,Infiltrating Duct Carcinoma,Breast,1.109155
2,P-0000015-T01-IM3,P-0000015,,,Breast Cancer,Metastasis,Tumor,Liver,Breast,Breast Invasive Ductal Carcinoma,...,Not Available,2.55,Stable,Matched,1,1.0,Localized,Infiltrating Duct Carcinoma,Breast,7.764087
3,P-0000036-T01-IM3,P-0000036,,,Non-Small Cell Lung Cancer,Primary,Tumor,Not Applicable,Lung,Lung Adenocarcinoma,...,,-1.0,Do not report,Unmatched,4,,Distant,"Adenocarcinoma, Nos",Lung and Bronchus,7.764087
4,P-0000041-T01-IM3,P-0000041,,,Breast Cancer,Primary,Tumor,Not Applicable,Breast,Breast Invasive Ductal Carcinoma,...,MICROSATELLITE INSTABILITY-INDETERMINATE. See ...,3.55,Indeterminate,Matched,2A,1.0,Localized,Infiltrating Duct Carcinoma,Breast,11.091553


In [None]:
clinical_sample_raw

## 3. Exploration of treatment dataset

In [6]:
data_treatment = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Intern_summer2025/data/msk_chord_2024/data_timeline_treatment.txt', sep="\t", comment= "#")

In [9]:
ProfileReport(data_treatment)


In [28]:
data_treatment['PATIENT_ID'].nunique()

21473

In [8]:
data_treatment['SUBTYPE'].unique()

array(['Chemo', 'Investigational', 'Immuno', 'Hormone', 'Bone Treatment',
       'Biologic', 'Targeted', 'Other'], dtype=object)

In [22]:
treats = pd.read_csv('../data/msk_chord_2024/treatment_history.csv', sep=",", comment= "#")

In [26]:
data_treatment.head()

Unnamed: 0,PATIENT_ID,START_DATE,STOP_DATE,EVENT_TYPE,SUBTYPE,AGENT,RX_INVESTIGATIVE,FLAG_OROTOPICAL
0,P-0000012,-5437,-5369,Treatment,Chemo,CYCLOPHOSPHAMIDE,N,0
1,P-0000012,-5437,-5326,Treatment,Chemo,FLUOROURACIL,N,0
2,P-0000012,-5437,-5327,Treatment,Chemo,METHOTREXATE,N,0
3,P-0000012,33,40,Treatment,Chemo,CISPLATIN,N,0
4,P-0000012,33,65,Treatment,Chemo,ETOPOSIDE,N,0


## 4. Exploration of data mutations dataset

In [9]:
data_mutations = pd.read_csv('../data/msk_chord_2024/data_mutations.txt', sep="\t", comment= "#")


In [11]:
data_mutations['Tumor_Sample_Barcode']

Unnamed: 0,Tumor_Sample_Barcode
0,P-0081657-T01-IM7
1,P-0081657-T01-IM7
2,P-0081657-T01-IM7
3,P-0083825-T01-IM7
4,P-0083825-T01-IM7
...,...
208948,P-0014611-T01-IM6
208949,P-0014611-T01-IM6
208950,P-0014611-T01-IM6
208951,P-0014611-T01-IM6


In [11]:
ProfileReport(data_mutations)

Output hidden; open in https://colab.research.google.com to view.

In [21]:
data_mutations.head()

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,Consequence,Variant_Classification,...,VARIANT_CLASS,all_effects,amino_acid_change,cDNA_Change,cDNA_position,cdna_change,comments,n_depth,t_depth,transcript
0,EGFR,1956,MSKCC,GRCh37,7,55242470,55242487,+,inframe_deletion,In_Frame_Del,...,,,,,,,,,,
1,PDGFRB,5159,MSKCC,GRCh37,5,149513271,149513271,+,missense_variant,Missense_Mutation,...,,,,,,,,,,
2,RBM10,8241,MSKCC,GRCh37,X,47041565,47041598,+,frameshift_variant,Frame_Shift_Del,...,,,,,,,,,,
3,TP53,7157,MSKCC,GRCh37,17,7578235,7578235,+,missense_variant,Missense_Mutation,...,,,,,,,,,,
4,TP53,7157,MSKCC,GRCh37,17,7577058,7577058,+,stop_gained,Nonsense_Mutation,...,,,,,,,,,,


## 5. Exploration of clinical_data dataset

In [16]:
clinical_data = pd.read_csv('../data/msk_chord_2024/msk_chord_2024_clinical_data.tsv', sep="\t", comment= "#")

In [14]:
ProfileReport(clinical_data)

Output hidden; open in https://colab.research.google.com to view.

In [8]:
clinical_data.head()

Unnamed: 0,Study ID,Patient ID,Sample ID,Tumor Site: Adrenal Glands (NLP),Tumor Site: Bone (NLP),Cancer Type,Cancer Type Detailed,Clinical Group,Clinical Summary,Tumor Site: CNS/Brain (NLP),...,Tumor Site: Reproductive Organs (NLP),Sample Class,Number of Samples Per Patient,Sample coverage,Sample Type,Smoking History (NLP),Somatic Status,Stage (Highest Recorded),TMB (nonsynonymous),Tumor Purity
0,msk_chord_2024,P-0000012,P-0000012-T02-IM3,No,No,Breast Cancer,Breast Invasive Ductal Carcinoma,,,No,...,No,Tumor,2,344,Primary,Former/Current Smoker,Matched,Stage 1-3,1.109155,
1,msk_chord_2024,P-0000012,P-0000012-T03-IM3,No,No,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,3B,Distant,No,...,No,Tumor,2,428,Metastasis,Former/Current Smoker,Matched,Stage 1-3,32.165504,
2,msk_chord_2024,P-0000015,P-0000015-T01-IM3,No,Yes,Breast Cancer,Breast Invasive Ductal Carcinoma,1,Localized,Yes,...,No,Tumor,1,281,Metastasis,Unknown,Matched,Stage 1-3,7.764087,40.0
3,msk_chord_2024,P-0000036,P-0000036-T01-IM3,No,Yes,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,4,Distant,No,...,No,Tumor,1,380,Primary,Never,Unmatched,Stage 4,7.764087,30.0
4,msk_chord_2024,P-0000041,P-0000041-T01-IM3,No,Yes,Breast Cancer,Breast Invasive Ductal Carcinoma,2A,Localized,Yes,...,No,Tumor,1,401,Primary,Unknown,Matched,Stage 1-3,11.091553,30.0


In [16]:
clinical_data.columns

Index(['Study ID', 'Patient ID', 'Sample ID',
       'Tumor Site: Adrenal Glands (NLP)', 'Tumor Site: Bone (NLP)',
       'Cancer Type', 'Cancer Type Detailed', 'Clinical Group',
       'Clinical Summary', 'Tumor Site: CNS/Brain (NLP)', 'Current Age',
       'Diagnosis Description', 'Ethnicity', 'Fraction Genome Altered', 'Sex',
       'Gene Panel', 'Gleason Score, 1st Reported (NLP)',
       'Gleason Score, Highest Reported (NLP)',
       'Gleason Score Reported on Sample (NLP)', 'HER2',
       'History for Positive PD-L1 (NLP)', 'HR', 'ICD-O Histology Description',
       'Tumor Site: Intra Abdominal', 'Tumor Site: Liver (NLP)',
       'Tumor Site: Lung (NLP)', 'Tumor Site: Lymph Node (NLP)',
       'Metastatic Site', 'MSI Comment', 'MSI Score', 'MSI Type',
       'Mutation Count', 'Number of Tumor Registry Entries', 'Oncotree Code',
       'Overall Survival (Months)', 'Overall Survival Status',
       'Tumor Site: Other (NLP)', 'Pathological Group',
       'Sample PD-L1 Positive (NL

In [19]:
#Total number of patients
clinical_data['Patient ID'].nunique()

24950

In [12]:
clinical_data.shape

(25040, 53)

In [None]:

clinical_data_variables= ['Cancer Type Detailed', 'Fraction Genome Altered',
       'MSI Type','Mutation Count', 'Primary Tumor Site', 'TMB (nonsynonymous)', 'Tumor Purity']

In [17]:
clinical_data['Patient ID'].duplicated().sum()

np.int64(90)

# Review of Data Exploration – MSK-CHORD 2024  
## **Findings**

We explored five datasets from the MSK-CHORD 2024 study, which together provide pathological, clinical, and molecular information for 24,950 patients.

We began with the clinical_patient_raw dataset, which contains core clinical information at the patient level. From there, we integrated data from the treatment (data_timeline_treatment) and mutation (data_mutations) datasets, followed by the addition of seven variables from the comprehensive clinical_data file:
- **Cancer Type Detailed**
- **Fraction Genome Altered**
- **MSI Type**
- **Mutation Count**
- **Primary Tumor Site**
- **TMB (nonsynonymous)**
- **Tumor Purity**

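Since clinical_data aggregates information from several auxiliary datasets, we used it to evaluate sample availability across the cohort. The table has 25,040 rows for 24,950 unique patients, so 90 rows repeat a patient who already appears. At most 90 patients (0.36% of 24,950) therefore have more than one sequenced sample (exactly 90 if each of them has two entries), and these are the candidates for paired pre- and post-treatment samples.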
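The figure of 90 comes from `duplicated().sum()`, which counts repeated rows rather than patients. The toy example below, with invented IDs, shows how the two quantities differ when a patient has three samples, and how `value_counts` gives the number of multi-sample patients directly.

```python
import pandas as pd

# Invented IDs: P-1 has three samples, P-2 has two, P-3 has one.
toy_clinical = pd.DataFrame({
    "Patient ID": ["P-1", "P-1", "P-1", "P-2", "P-2", "P-3"],
    "Sample ID": ["P-1-T01", "P-1-T02", "P-1-T03",
                  "P-2-T01", "P-2-T02", "P-3-T01"],
})

# Rows whose patient already appeared earlier in the table.
extra_rows = int(toy_clinical["Patient ID"].duplicated().sum())

# Patients with more than one sample.
samples_per_patient = toy_clinical["Patient ID"].value_counts()
multi_sample_patients = int((samples_per_patient > 1).sum())

print(extra_rows, multi_sample_patients)
```

On the real table, the `Number of Samples Per Patient` column shown in the `clinical_data.head()` output could serve as a cross-check for `samples_per_patient`.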

This data exploration step helped us identify clinically informative variables, evaluate data completeness, and define the subset of patients to be used for generating prompts in downstream tasks focused on cancer therapy response.

In [18]:
especimen = pd.read_csv('../data/msk_chord_2024/data_timeline_specimen.txt', sep='\t', comment='#')

In [19]:
especimen.head()

Unnamed: 0,PATIENT_ID,START_DATE,STOP_DATE,EVENT_TYPE,SUBTYPE,SAMPLE_ID
0,P-0000012,0,,Sequencing,,P-0000012-T03-IM3
1,P-0000012,48,,Sequencing,,P-0000012-T02-IM3
2,P-0000015,0,,Sequencing,,P-0000015-T01-IM3
3,P-0000036,0,,Sequencing,,P-0000036-T01-IM3
4,P-0000041,0,,Sequencing,,P-0000041-T01-IM3


In [20]:
especimen['PATIENT_ID'].duplicated().sum()

np.int64(90)

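The specimen timeline gives the same count: at most 90 patients have more than one sequenced specimen, and these are the candidates for paired pre- and post-treatment samples.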
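One way to check which multi-sample patients actually have a specimen on each side of treatment is to compare specimen days against treatment days. The sketch below uses invented patients and dates; it assumes both timelines share the same Day 0 and defines "pre" as sequenced before the patient's first treatment start, which is only one possible definition. `FIRST_TREATMENT_DAY` and `TIMING` are hypothetical column names.

```python
import pandas as pd

# Invented specimen and treatment timelines (days on a shared Day 0).
toy_specimen = pd.DataFrame({
    "PATIENT_ID": ["P-A", "P-A", "P-B", "P-B", "P-C", "P-D"],
    "START_DATE": [0, 120, 0, 30, 0, 0],
    "SAMPLE_ID": ["P-A-T01", "P-A-T02", "P-B-T01", "P-B-T02",
                  "P-C-T01", "P-D-T01"],
})
toy_treatment = pd.DataFrame({
    "PATIENT_ID": ["P-A", "P-A", "P-B", "P-C"],
    "START_DATE": [15, 60, 45, 10],
})

# First treatment start per patient (P-D has no treatment record).
first_treatment = (
    toy_treatment.groupby("PATIENT_ID")["START_DATE"].min()
    .rename("FIRST_TREATMENT_DAY")
    .reset_index()
)
samples = toy_specimen.merge(first_treatment, on="PATIENT_ID", how="left")

# Label each specimen; untreated patients stay "unknown".
treated = samples["FIRST_TREATMENT_DAY"].notna()
before = samples["START_DATE"] < samples["FIRST_TREATMENT_DAY"]
samples["TIMING"] = "unknown"
samples.loc[treated & before, "TIMING"] = "pre"
samples.loc[treated & ~before, "TIMING"] = "post"

# Patients with at least one "pre" and at least one "post" specimen.
has_pre = samples["TIMING"].eq("pre").groupby(samples["PATIENT_ID"]).any()
has_post = samples["TIMING"].eq("post").groupby(samples["PATIENT_ID"]).any()
paired_patients = sorted(has_pre.index[(has_pre & has_post).to_numpy()])
print(paired_patients)
```

In this toy data only P-A qualifies: P-B has two specimens but both precede treatment, which is the case a simple duplicate count cannot distinguish.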