The purpose of this project was to analyze a global dataset of cancer patients (2015–2024) to identify demographic patterns, risk factors, treatment effectiveness, and survival outcomes. Using a combination of exploratory data analysis, hypothesis testing, and initial machine learning models, we aimed to uncover insights that could

bhuvi16t/Cancer-Data-Analysis-Python

🩺 Cancer Data Analysis (2015–2024)

📌 Overview

This project explores a global cancer patient dataset (2015–2024) to understand:

  • Patient demographics
  • Lifestyle & environmental risk factors
  • Treatment effectiveness
  • Survival outcomes

We combine exploratory data analysis (EDA), statistical hypothesis testing, and machine learning modeling to extract insights that can support healthcare decision-making, clinical strategy, and future predictive systems.


🎯 Objectives

  • Identify key risk factors influencing cancer survival.
  • Study demographic patterns (age, gender, country).
  • Evaluate treatment effectiveness across chemotherapy, surgery, radiation, and combinations.
  • Test hypotheses regarding stage, treatment, and geography.
  • Build initial predictive models for survival outcomes.
  • Provide policy-level and clinical recommendations.

📊 Dataset

  • Source: Global Cancer Patients Dataset (2015–2024)
  • Observations: Thousands of patient-level entries

Features:

  • Demographics: Age, Gender, Country, Year of Diagnosis
  • Risk Factors (scaled 0–10): Genetic Risk, Smoking, Alcohol Use, Obesity, Air Pollution
  • Clinical: Cancer Stage, Treatment Type
  • Outcome: Survival Status, Survival Months

Data Quality:

  • ✅ No missing values
  • ✅ No inconsistencies
  • ✅ Risk factors standardized
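The checks above can be reproduced with a few lines of pandas. This is a minimal sketch rather than the notebook's actual code: the column names (`Genetic_Risk`, `Smoking`, and so on) and the sample values are assumptions made for illustration and should be adjusted to the real dataset.

```python
import pandas as pd

# Hypothetical column names; adjust to match the actual dataset.
RISK_COLUMNS = ["Genetic_Risk", "Smoking", "Alcohol_Use", "Obesity", "Air_Pollution"]


def quality_report(df: pd.DataFrame) -> dict:
    """Count missing values, duplicate rows, and risk scores outside 0-10."""
    present = [c for c in RISK_COLUMNS if c in df.columns]
    out_of_range = ((df[present] < 0) | (df[present] > 10)).sum().sum()
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "risk_out_of_range": int(out_of_range),
    }


# Tiny illustrative frame (made-up values, not taken from the dataset).
sample = pd.DataFrame(
    {
        "Age": [45, 62, 71],
        "Genetic_Risk": [3.2, 7.5, 5.0],
        "Smoking": [0.0, 10.0, 4.4],
        "Alcohol_Use": [1.1, 6.3, 2.0],
        "Obesity": [5.5, 8.1, 3.3],
        "Air_Pollution": [2.2, 9.9, 6.0],
    }
)
report = quality_report(sample)
print(report)
```

A report of all zeros would match the "no missing values, no inconsistencies" claims for these three checks; it does not rule out other problems such as mislabelled categories.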

🔍 Exploratory Data Analysis (EDA)

👥 Demographics

  • Age: 20–89 years (mean ~54). Risk rises sharply after 50.
  • Gender: Balanced dataset.
  • Year of Diagnosis: 2015–2024 (median ~2019).

⚠️ Risk Factors

  • All risk scores follow symmetric distributions (mean ~5, std ~2.8).
  • Normalized → fair comparisons across patients.

🧬 Cancer Stage

  • Most patients diagnosed at Stage II & III.
  • Stage IV shows lowest survival months.

💊 Treatment Effectiveness

  • Treatments: Chemotherapy, Surgery, Radiation, Combinations.
  • Combination therapy (Chemo + Surgery): highest survival.
  • Single treatments less effective.
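Stage and treatment comparisons like these usually come down to a grouped summary of survival months. The sketch below shows one way to compute it; `Cancer_Stage`, `Survival_Months`, and the toy values are assumed names and made-up data, not the project's.

```python
import pandas as pd


def survival_by_group(df: pd.DataFrame, group_col: str,
                      value_col: str = "Survival_Months") -> pd.DataFrame:
    """Count, mean, and median of survival months per group, best group first."""
    return (
        df.groupby(group_col)[value_col]
        .agg(["count", "mean", "median"])
        .sort_values("mean", ascending=False)
    )


# Toy data for illustration only (hypothetical column names and values).
toy = pd.DataFrame(
    {
        "Cancer_Stage": ["I", "II", "III", "IV", "II", "IV"],
        "Survival_Months": [60, 48, 30, 10, 52, 14],
    }
)
summary = survival_by_group(toy, "Cancer_Stage")
print(summary)
```

The same helper can be called with a treatment or country column. Reporting the count next to the mean helps spot groups that are too small to compare reliably.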

🌍 Geographic Variation

  • Developed countries: Higher survival in descriptive summaries, plausibly due to early detection & advanced care.
  • Developing countries: More late-stage diagnoses and shorter survival in descriptive summaries.

📑 Hypothesis Testing

Hypothesis 1: Cancer Stage vs Survival

  • H0: Survival independent of stage
  • H1: Survival differs across stages
  • Result: p < 0.05 → Reject H0
    ✅ Survival significantly affected by cancer stage

Hypothesis 2: Treatment Type vs Survival

  • H0: Survival independent of treatment
  • H1: Survival differs across treatments
  • Result: p < 0.05 → Reject H0
    ✅ Survival differs significantly across treatments; combination therapies show the best outcomes

Hypothesis 3: Country vs Survival

  • H0: Survival independent of country
  • H1: Survival differs across countries
  • Result: p > 0.05 → Fail to reject H0
    ⚠️ Differences exist descriptively, but are not statistically significant
    → Additional socio-economic data needed
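Each of the three tests above compares survival across the levels of one categorical variable, which is what a one-way ANOVA does. The following is a hedged sketch using `scipy.stats.f_oneway`; SciPy is an assumption here (the project lists Statsmodels, which offers ANOVA through `anova_lm`), and the column names and values are invented for illustration.

```python
import pandas as pd
from scipy.stats import f_oneway


def anova_by_group(df: pd.DataFrame, group_col: str,
                   value_col: str = "Survival_Months") -> tuple:
    """One-way ANOVA of value_col across the levels of group_col."""
    groups = [g[value_col].to_numpy() for _, g in df.groupby(group_col)]
    result = f_oneway(*groups)
    return float(result.statistic), float(result.pvalue)


# Made-up survival months for three stages (illustration only).
toy = pd.DataFrame(
    {
        "Cancer_Stage": ["I"] * 4 + ["III"] * 4 + ["IV"] * 4,
        "Survival_Months": [58, 60, 62, 61, 40, 42, 39, 41, 12, 10, 14, 11],
    }
)
f_stat, p_value = anova_by_group(toy, "Cancer_Stage")

# Reject H0 at the 5% level when p < 0.05.
print(f"F = {f_stat:.1f}, p = {p_value:.3g}, reject H0: {p_value < 0.05}")
```

ANOVA fits a numeric outcome such as survival months. If the outcome tested is the binary survival status instead, a chi-square test of independence would be the more usual choice.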

🤖 Machine Learning Models

Models Tested

  • Logistic Regression (LR)
  • Stochastic Gradient Descent (SGD)
  • Gradient Boosting (GB)
  • Random Forest (RF)

Observations

  • All models initially showed 100% accuracy
  • ⚠️ Indicates overfitting / data leakage
  • Likely due to strong correlation between Stage and Survival
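One way to investigate a suspiciously perfect score is to test each feature on its own: if a single column predicts the target almost perfectly, it is a leakage candidate. The sketch below runs that screen on synthetic data with scikit-learn; the `leaky` and `noise` columns are hypothetical stand-ins, not features from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def single_feature_scores(X: pd.DataFrame, y, cv: int = 5) -> dict:
    """Cross-validated accuracy of a one-split tree trained on each feature alone."""
    scores = {}
    for col in X.columns:
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        scores[col] = float(cross_val_score(stump, X[[col]], y, cv=cv).mean())
    return scores


# Synthetic example: "leaky" is the label plus tiny noise, "noise" is unrelated.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = pd.DataFrame(
    {
        "leaky": y + rng.normal(0, 0.05, size=200),
        "noise": rng.normal(size=200),
    }
)
scores = single_feature_scores(X, y)
print(scores)
```

A feature that scores near 100% alone deserves a closer look: it may be recorded after the outcome is known, or derived from the target, in which case it should be dropped before modeling.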

Next Steps

  • Apply train-test split with stratification
  • Use cross-validation
  • Handle class imbalance (SMOTE, undersampling)
  • Engineer interaction features (e.g., Stage × Treatment)
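The steps above can be combined in a single scikit-learn pipeline. This is a sketch under stated assumptions: the data is synthetic with random labels, the column names are hypothetical, and `class_weight="balanced"` stands in for SMOTE (which lives in the separate imbalanced-learn package) as a simpler way to handle class imbalance.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, random labels).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame(
    {
        "Cancer_Stage": rng.choice(["I", "II", "III", "IV"], size=n),
        "Treatment_Type": rng.choice(["Chemotherapy", "Surgery", "Radiation"], size=n),
        "Age": rng.integers(20, 90, size=n),
        "Survived": rng.integers(0, 2, size=n),
    }
)

# Interaction feature: one category per Stage x Treatment combination.
df["Stage_x_Treatment"] = df["Cancer_Stage"] + " x " + df["Treatment_Type"]

categorical = ["Cancer_Stage", "Treatment_Type", "Stage_x_Treatment"]
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), ["Age"]),
    ]
)
model = make_pipeline(
    preprocess, LogisticRegression(class_weight="balanced", max_iter=1000)
)

X, y = df.drop(columns="Survived"), df["Survived"]

# Stratified hold-out split keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified 5-fold cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="balanced_accuracy")

model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```

Because the labels here are random, scores near chance level are expected. Keeping preprocessing inside the pipeline means encoders and scalers are fitted on training folds only, which avoids one common source of leakage.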

💡 Key Insights

  1. Risk increases after age 50 → target screening
  2. Cancer stage at diagnosis is the most critical survival factor
  3. Combination therapies (Chemo + Surgery) extend survival significantly
  4. Geographic disparities highlight healthcare inequality
  5. Predictive modeling needs refinement to prevent overfitting

🛠️ Tech Stack

  • Languages: Python
  • Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, Statsmodels
  • Methods:
    • Exploratory Data Analysis (EDA)
    • Hypothesis Testing (ANOVA)
    • Predictive Modeling (LR, SGD, GB, RF)

🚀 How to Run

Clone the repository

git clone https://github.com/your-username/cancer-data-analysis.git
cd cancer-data-analysis

Install dependencies

pip install -r requirements.txt

Run Jupyter Notebook

jupyter notebook Cancer_Data_Analysis.ipynb

For Data Science

  • Refine models with cross-validation
  • Perform feature engineering for better predictive power
  • Explore survival analysis (Kaplan-Meier, Cox Regression)
  • Deploy results in a dashboard or API
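As a starting point for the survival analysis item, here is a minimal Kaplan-Meier estimator in plain Python. It is an illustrative sketch, not project code: it assumes Survival Months gives the follow-up time and that Survival Status can be mapped to 1 (death observed) or 0 (censored, still alive at last follow-up). The sample values are made up.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with an observed death.

    durations: follow-up time per patient (e.g. months).
    events:    1 if death was observed at that time, 0 if censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # every patient leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Made-up follow-up data: six patients, two of them censored.
months = [6, 12, 12, 20, 24, 30]
observed = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(months, observed)
for t, s in curve:
    print(f"month {t:>2}: S(t) = {s:.3f}")
```

Unlike a plain average of survival months, this estimator uses censored patients for as long as they were observed. For real analyses, a dedicated survival library with confidence intervals and Cox regression would be the more practical choice.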

⚠️ Limitations

  • No socio-economic variables included
  • Risk of model overfitting
  • Dataset does not separate cancer subtypes
  • Limited post-treatment follow-up data

🔮 Future Work

  • Apply time-to-event survival models
  • Add socio-economic & healthcare access data
  • Build an AI-powered cancer outcome predictor
  • Deploy findings into a web-based clinical dashboard

👨‍💻 Contributors

  • Bhoopendra Vishwakarma – Data Analyst | Python • SQL • Power BI • Machine Learning (LinkedIn)
  • Yogesh Chouhan – Collaborator & Contributor (Data Scientist) (LinkedIn)
