This project explores a global cancer patient dataset (2015–2024) to understand:
- Patient demographics
- Lifestyle & environmental risk factors
- Treatment effectiveness
- Survival outcomes
We combine exploratory data analysis (EDA), statistical hypothesis testing, and machine learning modeling to extract insights that can support healthcare decision-making, clinical strategy, and future predictive systems.
- Identify key risk factors influencing cancer survival.
- Study demographic patterns (age, gender, country).
- Evaluate treatment effectiveness across chemotherapy, surgery, radiation, and combinations.
- Test hypotheses regarding stage, treatment, and geography.
- Build initial predictive models for survival outcomes.
- Provide policy-level and clinical recommendations.
- Source: Global Cancer Patients Dataset (2015–2024)
- Observations: Thousands of patient-level entries
Features:
- Demographics: Age, Gender, Country, Year of Diagnosis
- Risk Factors (scaled 0–10): Genetic Risk, Smoking, Alcohol Use, Obesity, Air Pollution
- Clinical: Cancer Stage, Treatment Type
- Outcome: Survival Status, Survival Months
Data Quality:
- ✅ No missing values
- ✅ No inconsistencies
- ✅ Risk factors standardized
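The checks above can be reproduced with a few lines of pandas. This is a minimal sketch, not the notebook's actual code: the column names (`Genetic_Risk`, `Smoking`, and so on) and the helper `quality_report` are hypothetical, and the small table is synthetic.

```python
import pandas as pd

# Hypothetical column names; adjust them to match the actual dataset.
RISK_COLS = ["Genetic_Risk", "Smoking", "Alcohol_Use", "Obesity", "Air_Pollution"]


def quality_report(df: pd.DataFrame) -> dict:
    """Return simple data-quality indicators for a patient table."""
    risk = df[RISK_COLS]
    return {
        # Total number of missing cells across all columns
        "missing_values": int(df.isna().sum().sum()),
        # Number of fully duplicated rows
        "duplicate_rows": int(df.duplicated().sum()),
        # True only if every risk score lies in the documented 0-10 range
        "risk_in_range": bool(((risk >= 0) & (risk <= 10)).all().all()),
    }


# Tiny synthetic example (not real patient data)
demo = pd.DataFrame({
    "Age": [45, 62, 71],
    "Genetic_Risk": [3.2, 7.5, 5.0],
    "Smoking": [0.0, 9.1, 4.4],
    "Alcohol_Use": [2.0, 6.3, 5.5],
    "Obesity": [4.1, 8.0, 1.2],
    "Air_Pollution": [5.0, 2.2, 10.0],
})
report = quality_report(demo)
print(report)
```

A report with zero missing values, zero duplicates, and `risk_in_range` set to `True` would be consistent with the three checkmarks listed above.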
- Age: 20–89 years (mean ~54). Risk rises sharply after 50.
- Gender: Balanced dataset.
- Year of Diagnosis: 2015–2024 (median ~2019).
- All risk scores follow symmetric distributions (mean ~5, std ~2.8).
- Normalized → fair comparisons across patients.
- Most patients diagnosed at Stage II & III.
- Stage IV shows lowest survival months.
- Treatments: Chemotherapy, Surgery, Radiation, Combinations.
- Combination therapy (Chemo + Surgery) → highest survival.
- Single treatments less effective.
- Developed countries: Higher average survival, plausibly reflecting earlier detection & advanced care (the dataset does not measure these directly).
- Developing countries: More late-stage diagnoses and shorter average survival.
- H0: Survival independent of stage
- H1: Survival differs across stages
- Result: p < 0.05 → Reject H0
✅ Survival significantly affected by cancer stage
- H0: Survival independent of treatment
- H1: Survival differs across treatments
- Result: p < 0.05 → Reject H0
✅ Combination therapies improve survival outcomes
- H0: Survival independent of country
- H1: Survival differs across countries
- Result: p > 0.05 → Fail to reject H0
⚠️ Differences exist descriptively, but not statistically significant
→ Additional socio-economic data needed
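Each of the tests above compares survival across the levels of one factor, which is the setting for a one-way ANOVA. The following is a sketch under assumptions: it uses `scipy.stats.f_oneway` (SciPy is not in the project's library list, though Statsmodels depends on it), hypothetical column names, and synthetic numbers, so the printed values say nothing about the real dataset.

```python
import pandas as pd
from scipy.stats import f_oneway


def anova_by_group(df: pd.DataFrame, factor: str,
                   outcome: str = "Survival_Months",
                   alpha: float = 0.05) -> dict:
    """One-way ANOVA of `outcome` across the levels of `factor`."""
    # One array of outcome values per factor level
    groups = [g[outcome].to_numpy() for _, g in df.groupby(factor)]
    f_stat, p_value = f_oneway(*groups)
    return {
        "F": float(f_stat),
        "p": float(p_value),
        "reject_H0": bool(p_value < alpha),
    }


# Synthetic example with clearly separated stage groups
demo = pd.DataFrame({
    "Cancer_Stage": ["I"] * 3 + ["II"] * 3 + ["IV"] * 3,
    "Survival_Months": [60, 65, 70, 45, 50, 55, 8, 12, 16],
})
result = anova_by_group(demo, "Cancer_Stage")
print(result)
```

ANOVA assumes roughly normal residuals and similar variances across groups; if survival months are heavily skewed, a rank-based test such as Kruskal-Wallis may be a reasonable cross-check.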
- Logistic Regression (LR)
- Stochastic Gradient Descent (SGD)
- Gradient Boosting (GB)
- Random Forest (RF)
- All models initially showed 100% accuracy
⚠️ Indicates overfitting / data leakage
- Likely due to strong correlation between Stage and Survival
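One quick way to investigate a suspicious 100% score is to check whether any single feature predicts the target almost perfectly on its own. The sketch below is a generic diagnostic, not the project's method: the helper `single_feature_scores`, the column names, the 0.95 threshold, and the deliberately leaky synthetic `Survival_Months` column are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def single_feature_scores(X: pd.DataFrame, y: pd.Series, cv: int = 5) -> pd.Series:
    """Cross-validated accuracy of a shallow tree trained on one feature at a time."""
    scores = {}
    for col in X.columns:
        clf = DecisionTreeClassifier(max_depth=3, random_state=0)
        scores[col] = cross_val_score(clf, X[[col]], y, cv=cv).mean()
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic data: the label is random, so honest features should score near 0.5
rng = np.random.default_rng(1)
n = 300
y = pd.Series(rng.integers(0, 2, n), name="Survived")
X = pd.DataFrame({
    "Age": rng.integers(20, 90, n),
    "Smoking": rng.uniform(0, 10, n),
    # Deliberately leaky column: its value is derived from the outcome
    "Survival_Months": np.where(y == 1, 120, rng.integers(1, 119, n)),
})

scores = single_feature_scores(X, y)
suspects = scores[scores > 0.95].index.tolist()
print(scores)
print("Possible leakage:", suspects)
```

Any feature flagged this way deserves a closer look: it may be recorded after the outcome is known, or it may encode the outcome directly.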
- Apply train-test split with stratification
- Use cross-validation
- Handle class imbalance (SMOTE, undersampling)
- Engineer interaction features (e.g., Stage × Treatment)
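The improvement steps above can be combined in a single scikit-learn pipeline. This is a sketch under stated assumptions: the data is synthetic, the column names are hypothetical, and `class_weight="balanced"` is swapped in as a simpler stand-in for SMOTE or undersampling (which would need the separate imbalanced-learn package).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic patients with a noisy, stage-dependent label (illustration only)
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Age": rng.integers(20, 90, n),
    "Smoking": rng.uniform(0, 10, n),
    "Cancer_Stage": rng.choice(["I", "II", "III", "IV"], n),
    "Treatment_Type": rng.choice(["Chemo", "Surgery", "Radiation", "Chemo+Surgery"], n),
})
stage_num = df["Cancer_Stage"].map({"I": 1, "II": 2, "III": 3, "IV": 4})
df["Survived"] = ((1.5 - 0.6 * stage_num + rng.normal(0, 1, n)) > 0).astype(int)

# Interaction feature: Stage x Treatment as a combined category
df["Stage_x_Treatment"] = df["Cancer_Stage"] + "_" + df["Treatment_Type"]

X = df.drop(columns=["Survived"])
y = df["Survived"]

# Stratified split keeps the class ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

numeric = ["Age", "Smoking"]
categorical = ["Cancer_Stage", "Treatment_Type", "Stage_x_Treatment"]
model = Pipeline([
    ("prep", ColumnTransformer([
        ("scale", StandardScaler(), numeric),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    # class_weight="balanced" reweights classes instead of resampling them
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Stratified 5-fold cross-validation on the training data only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="balanced_accuracy")

model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print("CV balanced accuracy:", cv_scores.mean())
print("Held-out accuracy:", test_score)
```

Keeping preprocessing inside the pipeline means the scaler and encoder are refit on each training fold, which avoids leaking information from validation folds into the model.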
- Risk increases after age 50 → target screening
- Cancer stage at diagnosis is the most critical survival factor
- Combination therapies (Chemo + Surgery) extend survival significantly
- Geographic differences in survival appear descriptively but were not statistically significant
- Predictive modeling needs refinement to prevent overfitting
- Languages: Python
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, Statsmodels
- Methods:
  - Exploratory Data Analysis (EDA)
  - Hypothesis Testing (ANOVA)
  - Predictive Modeling (LR, SGD, GB, RF)
git clone https://github.com/your-username/cancer-data-analysis.git
cd cancer-data-analysis
pip install -r requirements.txt
jupyter notebook Cancer_Data_Analysis.ipynb
- Refine models with cross-validation
- Perform feature engineering for better predictive power
- Explore survival analysis (Kaplan-Meier, Cox Regression)
- Deploy results in a dashboard or API
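As a starting point for the survival analysis mentioned above, the Kaplan-Meier (product-limit) estimator can be written in a few lines of plain Python. This sketch assumes that Survival Months is the follow-up duration and that patients still alive at last follow-up are treated as censored (event = 0); the function name and the sample values are illustrative only.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier product-limit estimate.

    durations: follow-up time per patient (e.g. months)
    events:    1 if death was observed, 0 if the patient was censored
    Returns a list of (time, survival_probability) at each distinct death time.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t
        n_at_risk -= removed
    return curve


# Synthetic example: five patients, two of them censored
curve = kaplan_meier([5, 8, 8, 12, 20], [1, 1, 0, 1, 0])
for t, s in curve:
    print(f"month {t}: S(t) = {s:.2f}")
```

For real analyses, a dedicated library (for example lifelines, which is not part of this project's current dependencies) would also provide confidence intervals and Cox regression.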
- No socio-economic variables included
- Risk of model overfitting
- Dataset does not separate cancer subtypes
- Limited post-treatment follow-up data
- Apply time-to-event survival models
- Add socio-economic & healthcare access data
- Build an AI-powered cancer outcome predictor
- Deploy findings into a web-based clinical dashboard
- Bhoopendra Vishwakarma – Data Analyst | Python • SQL • Power BI • Machine Learning
- Yogesh Chouhan – Collaborator & Contributor (Data Scientist)