# Pancreatic Cancer Survival Prediction

**Introduction and Background**

Pancreatic cancer is one of the most lethal malignancies worldwide, with a five-year survival rate of less than 10% due to its late diagnosis and aggressive progression. Accurate prediction of patient survival at the time of diagnosis remains a critical challenge in oncology, limiting clinicians’ ability to personalize treatment and allocate resources effectively. With the increasing availability of clinical and genomic data, there is a growing opportunity to leverage machine learning to estimate patient outcomes and support evidence-based decision-making. 


**Problem Statement**

Pancreatic cancer is often diagnosed late and is difficult to treat, making it one of the deadliest cancers. Doctors lack accurate tools to predict how long a patient might survive, which limits personalized care. Although there is a lot of clinical and genomic data available, survival is still mainly predicted using general staging systems. This project aims to build a machine learning model that can better predict survival in pancreatic cancer patients, helping doctors make more informed decisions.

By identifying high-risk individuals early, this tool can support treatment planning, improve patient counseling, and guide the selection of candidates for advanced therapies or clinical trials.

**Project Objectives**

1.	Filter and isolate pancreatic cancer patient data from a larger clinical-genomic dataset to create a focused, high-quality subset for analysis.
2.	Preprocess and engineer features from clinical, pathological, and genomic variables such as tumor mutational burden (TMB), tumor purity, disease stage, and demographic data.
3.	Build a predictive machine learning model to estimate the overall survival status (alive vs. deceased) of pancreatic cancer patients at the time of sample collection or diagnosis.
4.	Evaluate model performance using classification metrics such as accuracy, precision, recall, F1-score, and ROC-AUC to ensure clinical relevance and reliability.
5.	Interpret model outputs to identify the most influential features contributing to survival predictions, thereby providing insights into potential prognostic biomarkers.
6.	Demonstrate potential clinical utility by outlining how the model could support risk stratification, personalized treatment planning, and early patient counseling in real-world oncology settings.


## 1.	Data Understanding

In [299]:
# importing relevant libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report, roc_curve


MSK-CHORD (MSK, Nature 2024)
The dataset was sourced from the MSK CHORD 2024 clinical-genomic database, which includes over 25,000 cancer cases. A subset of 3,109 records corresponding to patients with pancreatic cancer was extracted based on the Cancer Type column.
Targeted sequencing of 25040 tumors from 24950 patients and their matched normals via MSK-IMPACT, along with clinical annotations, some of which are derived from natural language processing (denoted NLP). This data is available under the Creative Commons BY-NC-ND 4.0 license. 

Data Url: https://www.cbioportal.org/study/summary?id=msk_chord_2024

In [300]:
# Imported the dataset
df = pd.read_csv('msk_chord_2024_clinical_data.csv')

# The first 5 rows
df.head()

Unnamed: 0,Study ID,Patient ID,Sample ID,Tumor Site: Adrenal Glands (NLP),Tumor Site: Bone (NLP),Cancer Type,Cancer Type Detailed,Clinical Group,Clinical Summary,Tumor Site: CNS/Brain (NLP),...,Tumor Site: Reproductive Organs (NLP),Sample Class,Number of Samples Per Patient,Sample coverage,Sample Type,Smoking History (NLP),Somatic Status,Stage (Highest Recorded),TMB (nonsynonymous),Tumor Purity
0,msk_chord_2024,P-0000012,P-0000012-T02-IM3,No,No,Breast Cancer,Breast Invasive Ductal Carcinoma,,,No,...,No,Tumor,2,344,Primary,Former/Current Smoker,Matched,Stage 1-3,1.109155,
1,msk_chord_2024,P-0000012,P-0000012-T03-IM3,No,No,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,3B,Distant,No,...,No,Tumor,2,428,Metastasis,Former/Current Smoker,Matched,Stage 1-3,32.165504,
2,msk_chord_2024,P-0000015,P-0000015-T01-IM3,No,Yes,Breast Cancer,Breast Invasive Ductal Carcinoma,1,Localized,Yes,...,No,Tumor,1,281,Metastasis,Unknown,Matched,Stage 1-3,7.764087,40.0
3,msk_chord_2024,P-0000036,P-0000036-T01-IM3,No,Yes,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,4,Distant,No,...,No,Tumor,1,380,Primary,Never,Unmatched,Stage 4,7.764087,30.0
4,msk_chord_2024,P-0000041,P-0000041-T01-IM3,No,Yes,Breast Cancer,Breast Invasive Ductal Carcinoma,2A,Localized,Yes,...,No,Tumor,1,401,Primary,Unknown,Matched,Stage 1-3,11.091553,30.0


In [301]:
#Dataset overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25040 entries, 0 to 25039
Data columns (total 53 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Study ID                                25040 non-null  object 
 1   Patient ID                              25040 non-null  object 
 2   Sample ID                               25040 non-null  object 
 3   Tumor Site: Adrenal Glands (NLP)        25040 non-null  object 
 4   Tumor Site: Bone (NLP)                  25040 non-null  object 
 5   Cancer Type                             25040 non-null  object 
 6   Cancer Type Detailed                    25040 non-null  object 
 7   Clinical Group                          20376 non-null  object 
 8   Clinical Summary                        24552 non-null  object 
 9   Tumor Site: CNS/Brain (NLP)             25040 non-null  object 
 10  Current Age                             25037 non-null  fl

In [302]:
# columns vrs Rows
df.shape

(25040, 53)

This shows the dataset has mostly float and object data. They are 25040 rows and 52 columns. The rows and columns will need to be selected and filtered to reduce the dataset and focus on Pancreatic cancer.

In [303]:
#Overview of all Numeric columns
df.describe()

Unnamed: 0,Current Age,Fraction Genome Altered,"Gleason Score, 1st Reported (NLP)","Gleason Score, Highest Reported (NLP)",Gleason Score Reported on Sample (NLP),MSI Score,Mutation Count,Number of Tumor Registry Entries,Overall Survival (Months),Number of Samples Per Patient,Sample coverage,TMB (nonsynonymous),Tumor Purity
count,25037.0,24870.0,3272.0,3272.0,2110.0,24487.0,23862.0,25040.0,25040.0,25040.0,25040.0,25040.0,24454.0
mean,65.476655,0.180309,7.845355,7.981663,8.012322,1.43865,8.725756,1.315735,32.128503,1.007188,609.113099,7.258216,36.414452
std,12.683747,0.185514,1.022186,0.976473,0.942561,5.004455,19.416691,0.687179,25.712181,0.084481,213.755797,16.375452,19.201781
min,10.0,0.0,6.0,6.0,6.0,-1.0,1.0,0.0,0.032877,1.0,23.0,0.0,0.0
25%,57.0,0.0211,7.0,7.0,7.0,0.02,3.0,1.0,11.145193,1.0,470.0,2.461042,20.0
50%,67.0,0.1258,8.0,8.0,8.0,0.23,5.0,1.0,24.657507,1.0,598.0,4.101736,30.0
75%,75.0,0.2839,9.0,9.0,9.0,0.9,8.0,1.0,49.183508,1.0,737.0,6.917585,50.0
max,89.0,1.0,10.0,10.0,10.0,53.53,696.0,10.0,118.454665,2.0,2610.0,570.961677,95.0


Understanding the columns:

- The Current Age: Patient age ranges from 10 to 89, with an average of 65.5 years.

-  Fraction Genome Altered: Represents how much of the genome is altered, ranging from 0 to 1.

-  Gleason Scores  Indicates prostate cancer severity generally ranges from 6 to 10.

-  MSI Score: Microsatellite Instability score.

- Mutation Count: Mutation events, ranging widely from 1 to 696.

- Number of Tumor Registry Entries: Most patients have 1, with a max of 10.

- Overall Survival : Ranges from near 0 to over 118 months.

- Number of Samples Per Patient: Usually 1, some with 2.

- Sample Coverage: Mean of 609, max up to 2610.

- (TMB) Tumor Mutational Burden mean of 7.26

- Tumor Purity: Ranges from 0 to 95%, with a mean of 36.4%.

## 2. Data Cleaning and preprocessing

We focus specifically on pancreatic cancer by filtering the dataset to include only patients diagnosed with this cancer type.

In [304]:
# Cancer types found in the dataset.
print(df['Cancer Type'].unique())

['Breast Cancer' 'Non-Small Cell Lung Cancer' 'Colorectal Cancer'
 'Prostate Cancer' 'Pancreatic Cancer']


In [305]:
# Filter for Pancreatic cancer 

df_filtered = df[df['Cancer Type'] == 'Pancreatic Cancer'].copy()

#Confirm the filter
df_filtered.reset_index(drop=True, inplace=True)
print(df_filtered['Cancer Type'].unique())

['Pancreatic Cancer']


In [306]:
# New columns and rows
df_filtered.shape

(3109, 53)

The data set is large with 53 columns, we decided to drop certain columns to focus on what will be used for prediction and EDA. The target variable being the 'Overall Survival Status' and the rest are features.

In [307]:
# Dropping columns
columns_chosen = [
    'Overall Survival Status','Overall Survival (Months)','Current Age','Sex','Race','Stage (Highest Recorded)',
    'Smoking History (NLP)',
    'TMB (nonsynonymous)',
    'Tumor Purity',
    'Sample coverage'
]

# The new dataset
df_filtered = df[columns_chosen].copy()
df_filtered.head()

Unnamed: 0,Overall Survival Status,Overall Survival (Months),Current Age,Sex,Race,Stage (Highest Recorded),Smoking History (NLP),TMB (nonsynonymous),Tumor Purity,Sample coverage
0,0:LIVING,118.454665,68.0,Female,White,Stage 1-3,Former/Current Smoker,1.109155,,344
1,0:LIVING,118.454665,68.0,Female,White,Stage 1-3,Former/Current Smoker,32.165504,,428
2,1:DECEASED,13.906834,45.0,Female,White,Stage 1-3,Unknown,7.764087,40.0,281
3,0:LIVING,115.462887,68.0,Female,Other,Stage 4,Never,7.764087,30.0,380
4,1:DECEASED,13.610944,53.0,Female,White,Stage 1-3,Unknown,11.091553,30.0,401


In [308]:
# Finding Nulls and Missing entries
df_filtered.isnull().sum()

Overall Survival Status        0
Overall Survival (Months)      0
Current Age                    3
Sex                            0
Race                           0
Stage (Highest Recorded)       0
Smoking History (NLP)          0
TMB (nonsynonymous)            0
Tumor Purity                 586
Sample coverage                0
dtype: int64

The column with missing entries is the Tumor Purity and current Age. The Age null rows will be dropped and the Tumor with N/A as not to distort the dataset.

In [309]:
# Dropped rows
df_filtered = df_filtered.dropna(subset=['Current Age'])
df_filtered.isnull().sum()

Overall Survival Status        0
Overall Survival (Months)      0
Current Age                    0
Sex                            0
Race                           0
Stage (Highest Recorded)       0
Smoking History (NLP)          0
TMB (nonsynonymous)            0
Tumor Purity                 586
Sample coverage                0
dtype: int64

In [310]:
# Filled rows
df_filtered['Tumor Purity'] = df_filtered['Tumor Purity'].fillna('N/A')
df_filtered.isnull().sum()

Overall Survival Status      0
Overall Survival (Months)    0
Current Age                  0
Sex                          0
Race                         0
Stage (Highest Recorded)     0
Smoking History (NLP)        0
TMB (nonsynonymous)          0
Tumor Purity                 0
Sample coverage              0
dtype: int64

## 3. Exploratory Data Analysis (EDA)

In this section , we will look into key patient characteristics, treatment factors, and clinical variables associated with pancreatic cancer. We will explore feature distributions, and potential relationships between variables and survival outcomes.