🩺 Cancer Data Analysis (2015–2024)

📌 Overview

This project explores a global cancer patient dataset (2015–2024) to understand:

Patient demographics
Lifestyle & environmental risk factors
Treatment effectiveness
Survival outcomes

We combine exploratory data analysis (EDA), statistical hypothesis testing, and machine learning modeling to extract insights that can support healthcare decision-making, clinical strategy, and future predictive systems.

🎯 Objectives

Identify key risk factors influencing cancer survival.
Study demographic patterns (age, gender, country).
Evaluate treatment effectiveness across chemotherapy, surgery, radiation, and combinations.
Test hypotheses regarding stage, treatment, and geography.
Build initial predictive models for survival outcomes.
Provide policy-level and clinical recommendations.

📊 Dataset

Source: Global Cancer Patients Dataset (2015–2024)
Observations: Thousands of patient-level entries

Features:

Demographics: Age, Gender, Country, Year of Diagnosis
Risk Factors (scaled 0–10): Genetic Risk, Smoking, Alcohol Use, Obesity, Air Pollution
Clinical: Cancer Stage, Treatment Type
Outcome: Survival Status, Survival Months

Data Quality:

✅ No missing values
✅ No inconsistencies
✅ Risk factors standardized
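
The checks above can be reproduced with a short pandas routine. This is a minimal sketch, not the notebook's actual code: the column names in `RISK_COLUMNS` and the `quality_report` helper are hypothetical, and the sample rows are made up purely to show the call.

```python
import pandas as pd

# Hypothetical column names for the 0-10 risk scores.
RISK_COLUMNS = ["Genetic_Risk", "Smoking", "Alcohol_Use", "Obesity", "Air_Pollution"]


def quality_report(df: pd.DataFrame) -> dict:
    """Count missing values, duplicate rows, and risk scores outside 0-10."""
    risk = df[RISK_COLUMNS]
    out_of_range = ((risk < 0) | (risk > 10)).sum().sum()
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "risk_out_of_range": int(out_of_range),
    }


# Tiny made-up sample, only to show the call; not real data.
sample = pd.DataFrame({
    "Age": [54, 61, 47],
    "Genetic_Risk": [5.0, 7.2, 3.1],
    "Smoking": [2.0, 9.5, 0.0],
    "Alcohol_Use": [4.4, 6.0, 1.2],
    "Obesity": [5.5, 8.1, 2.9],
    "Air_Pollution": [3.3, 10.0, 6.7],
})
print(quality_report(sample))
```

A report with all three counts at zero corresponds to the three check marks listed above.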

🔍 Exploratory Data Analysis (EDA)

👥 Demographics

Age: 20–89 years (mean ~54). Risk rises sharply after 50.
Gender: Balanced dataset.
Year of Diagnosis: 2015–2024 (median ~2019).

⚠️ Risk Factors

All risk scores follow symmetric distributions (mean ~5, std ~2.8).
All scores share the same 0–10 scale, so they can be compared directly across patients.

🧬 Cancer Stage

Most patients diagnosed at Stage II & III.
Stage IV shows lowest survival months.

💊 Treatment Effectiveness

Treatments: Chemotherapy, Surgery, Radiation, Combinations.
Combination therapy (Chemo + Surgery) → highest survival.
Single treatments less effective.
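
A per-treatment comparison of this kind can be produced with a pandas group-by. The sketch below assumes hypothetical column names (`Treatment_Type`, `Survival_Months`) and uses made-up rows; it only illustrates the shape of the summary, not the project's actual numbers.

```python
import pandas as pd

# Made-up rows; column names are hypothetical.
df = pd.DataFrame({
    "Treatment_Type": ["Chemotherapy", "Surgery", "Radiation", "Chemo + Surgery",
                       "Chemotherapy", "Surgery", "Radiation", "Chemo + Surgery"],
    "Survival_Months": [30, 36, 28, 52, 34, 40, 26, 48],
})

# Count, mean, and median survival per treatment, best mean first.
summary = (
    df.groupby("Treatment_Type")["Survival_Months"]
      .agg(["count", "mean", "median"])
      .sort_values("mean", ascending=False)
)
print(summary)
```

Reporting the group size next to the mean helps show whether a treatment's ranking rests on only a handful of patients.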

🌍 Geographic Variation

Developed countries: Higher survival, commonly attributed to earlier detection and more advanced care (the dataset has no variables to confirm this).
Developing countries: More late-stage diagnoses and shorter survival.

📑 Hypothesis Testing

Hypothesis 1: Cancer Stage vs Survival

H0: Survival independent of stage
H1: Survival differs across stages
Result: p < 0.05 → Reject H0
✅ Survival significantly affected by cancer stage
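
For reference, a one-way ANOVA of survival months across stages could look like the sketch below. It uses `scipy.stats.f_oneway`, which is an assumption on our part (the notebook may use Statsmodels instead), and the survival values are made up for illustration.

```python
from scipy.stats import f_oneway

# Made-up survival months per stage, for illustration only.
survival_by_stage = {
    "Stage I": [58, 62, 55, 60, 64],
    "Stage II": [45, 48, 50, 43, 47],
    "Stage III": [30, 28, 35, 33, 29],
    "Stage IV": [12, 15, 10, 14, 11],
}

# One-way ANOVA: H0 says every stage has the same mean survival.
result = f_oneway(*survival_by_stage.values())
alpha = 0.05
reject_h0 = bool(result.pvalue < alpha)
print(f"F = {result.statistic:.2f}, p = {result.pvalue:.3g}, reject H0: {reject_h0}")
```

The same pattern applies to the treatment and country hypotheses by grouping survival months on a different column.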

Hypothesis 2: Treatment Type vs Survival

H0: Survival independent of treatment
H1: Survival differs across treatments
Result: p < 0.05 → Reject H0
✅ Survival differs significantly by treatment; combination therapies show the highest survival

Hypothesis 3: Country vs Survival

H0: Survival independent of country
H1: Survival differs across countries
Result: p > 0.05 → Fail to reject H0
⚠️ Differences exist descriptively but are not statistically significant
→ Additional socio-economic data needed

🤖 Machine Learning Models

Models Tested

Logistic Regression (LR)
Stochastic Gradient Descent (SGD)
Gradient Boosting (GB)
Random Forest (RF)

Observations

All models initially showed 100% accuracy
⚠️ Indicates overfitting / data leakage
Likely due to strong correlation between Stage and Survival
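
One way to investigate a suspicion like this is to score each feature on its own and flag any that nearly predicts the outcome alone. The sketch below is illustrative only: the data is synthetic, the `single_feature_scores` helper and the 0.95 threshold are hypothetical, and `survival_months` is built to be leaky on purpose to show what a flagged column looks like. It does not claim to identify the cause in the actual notebook.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic data with hypothetical feature names.
survived = rng.integers(0, 2, size=n)
features = {
    "age": rng.uniform(20, 89, size=n),
    "smoking": rng.uniform(0, 10, size=n),
    # Deliberately leaky: non-survivors all fall below 60 months in this toy setup.
    "survival_months": np.where(
        survived == 1, rng.uniform(60, 120, n), rng.uniform(1, 59, n)
    ),
}


def single_feature_scores(features, target, threshold=0.95):
    """Return {name: (cv_accuracy, is_suspicious)} using one shallow tree per feature."""
    report = {}
    for name, values in features.items():
        tree = DecisionTreeClassifier(max_depth=2, random_state=0)
        acc = cross_val_score(tree, values.reshape(-1, 1), target, cv=5).mean()
        report[name] = (round(float(acc), 3), bool(acc >= threshold))
    return report


for name, (acc, suspicious) in single_feature_scores(features, survived).items():
    print(f"{name:16s} accuracy={acc:.3f} suspicious={suspicious}")
```

A feature flagged here is a candidate for removal or closer inspection before any model is trusted.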

Next Steps

Apply train-test split with stratification
Use cross-validation
Handle class imbalance (SMOTE, undersampling)
Engineer interaction features (e.g., Stage × Treatment)
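
The first two steps can be sketched with scikit-learn as follows. The data is synthetic and the model choice (scaled logistic regression) is only an example; class-imbalance handling and interaction features are left out to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the feature matrix: 5 risk scores on a 0-10 scale.
X = rng.uniform(0, 10, size=(400, 5))
# Synthetic binary outcome loosely driven by the first two features plus noise.
signal = 0.4 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 1.5, size=400)
y = (signal > np.median(signal)).astype(int)

# Hold out a test set, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is re-fit on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

model.fit(X_train, y_train)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")
```

Keeping preprocessing inside the pipeline means no statistic from a validation fold leaks into training, which is one common source of inflated accuracy.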

💡 Key Insights

Risk increases after age 50 → target screening
Cancer stage at diagnosis is the most critical survival factor
Combination therapies (Chemo + Surgery) extend survival significantly
Geographic differences appear descriptively but are not statistically significant; more data is needed before attributing them to healthcare inequality
Predictive modeling needs refinement to prevent overfitting

🛠️ Tech Stack

Languages: Python
Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, Statsmodels
Methods:
- Exploratory Data Analysis (EDA)
- Hypothesis Testing (ANOVA)
- Predictive Modeling (LR, SGD, GB, RF)

🚀 How to Run

Clone the repository

git clone https://github.com/your-username/cancer-data-analysis.git
cd cancer-data-analysis

Install dependencies

pip install -r requirements.txt

Run Jupyter Notebook

jupyter notebook Cancer_Data_Analysis.ipynb

For Data Science

Refine models with cross-validation

Perform feature engineering for better predictive power

Explore survival analysis (Kaplan-Meier, Cox Regression)

Deploy results in a dashboard or API
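
As a starting point for the survival analysis item, the Kaplan-Meier estimator can be written in a few lines of plain Python. This is a minimal sketch with a hypothetical `kaplan_meier` helper and made-up follow-up data; a dedicated survival library would be the usual choice in practice.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with at least one death.

    durations: follow-up time per patient (e.g. months)
    events: 1 if death was observed at that time, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    leaving = Counter(durations)  # everyone leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients who survive time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Made-up follow-up data for illustration.
months = [6, 6, 10, 12, 12, 20, 24]
observed = [1, 0, 1, 1, 1, 0, 1]
for t, s in kaplan_meier(months, observed):
    print(f"month {t:>2}: S(t) = {s:.3f}")
```

Unlike a plain mean of survival months, this estimate accounts for censored patients, who count as at risk until their last follow-up.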

⚠️ Limitations

No socio-economic variables included

Risk of model overfitting

Dataset does not separate cancer subtypes

Limited post-treatment follow-up data

🔮 Future Work

Apply time-to-event survival models

Add socio-economic & healthcare access data

Build an AI-powered cancer outcome predictor

Deploy findings into a web-based clinical dashboard

👨‍💻 Contributors

Bhoopendra Vishwakarma – Data Analyst | Python • SQL • Power BI • Machine Learning

Yogesh Chouhan – Collaborator & Contributor (Data Scientist)
