# Grand Challenge: Understanding Water Pollution and Global Health Display

## Introduction

Water pollution is a pressing global issue that affects not just ecosystems, but also human health, economic stability, and social equity. Contaminated water sources contribute to the spread of preventable diseases, especially in developing regions, where access to clean water and sanitation remains limited. Despite global efforts, many countries continue to face high rates of waterborne diseases and environmental degradation, while others have managed to control these risks effectively.

This project aims to explore **why some countries struggle with water pollution impacts more than others** by analyzing a comprehensive dataset of water quality indicators, health outcomes, and socioeconomic infrastructure. Using statistical modeling and visual analytics, we seek to uncover the **key relationships between pollution, disease rates, and national resilience factors** like GDP, healthcare access, and sanitation coverage.

## Project Goals

- Quantify the level of water contamination across countries based on WHO/EPA thresholds.
- Correlate contamination levels with health indicators like diarrheal diseases, cholera, typhoid, and infant mortality.
- Identify whether higher GDP, better healthcare, and stronger infrastructure offer significant protection.
- Compare "high-risk" and "low-risk" countries to extract actionable insights.
- Suggest policy directions and water treatment strategies that can mitigate future risks.

By combining data science with environmental analysis, we hope not only to diagnose the problem but also to highlight what works, where, and why.
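As an illustration of the first goal, the sketch below flags measurements that exceed a guideline limit and computes the share of exceedances per country. The helper `flag_exceedances`, the `THRESHOLDS` dictionary, and the toy rows are hypothetical. The two limits (10 µg/L for lead, 50 mg/L for nitrate) are the WHO drinking-water guideline values as we understand them, and should be checked against current WHO/EPA guidance before being used in the analysis.

```python
import pandas as pd

# Illustrative limits only; confirm against current WHO/EPA guidance before use.
THRESHOLDS = {
    "Lead Concentration (µg/L)": 10.0,  # assumed WHO guideline value for lead
    "Nitrate Level (mg/L)": 50.0,       # assumed WHO guideline value for nitrate (as NO3)
}


def flag_exceedances(frame: pd.DataFrame, limits: dict) -> pd.DataFrame:
    """Return a boolean frame that is True where a measurement exceeds its limit."""
    flags = pd.DataFrame(index=frame.index)
    for column, limit in limits.items():
        flags[column] = frame[column] > limit
    return flags


# Toy rows that reuse the dataset's column names (values are made up).
sample = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Lead Concentration (µg/L)": [7.9, 14.7, 9.9],
    "Nitrate Level (mg/L)": [8.3, 15.7, 55.0],
})

flags = flag_exceedances(sample, THRESHOLDS)

# The mean of a boolean column is the share of samples above the limit.
exceedance_rate = flags.groupby(sample["Country"]).mean()
print(exceedance_rate)
```

Keeping the limits in a single dictionary makes it easy to swap in a different standard without touching the flagging logic.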


## Imported Libraries – Why We Use Them

To effectively explore, analyze, and visualize our dataset, we use the following Python libraries:

- **pandas**: For loading, cleaning, and manipulating tabular data.
- **numpy**: For numerical operations, especially useful when handling arrays and applying mathematical transformations.
- **matplotlib.pyplot**: For creating clear and customizable static plots.
- **seaborn**: For more aesthetically pleasing and informative statistical plots (built on top of matplotlib).
- **scipy.stats**: To compute statistical metrics such as correlations and z-scores.
- **sklearn.preprocessing**: For feature scaling (normalizing risk scores between 0 and 100).
- **sklearn.linear_model** & **sklearn.metrics** *(optional)*: For running and evaluating regression models to test relationships between variables.
- **warnings**: To suppress unnecessary warnings and keep our notebook output clean.

These libraries form the backbone of our analysis pipeline, from data wrangling and scoring to visualization and hypothesis testing.
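To show how `zscore`, `MinMaxScaler`, and `pearsonr` might fit together, here is a minimal sketch on made-up data. The composite `risk_score` (an unweighted mean of standardized indicators) is an assumption for illustration, not necessarily the scoring used later in the notebook.

```python
import pandas as pd
from scipy.stats import pearsonr, zscore
from sklearn.preprocessing import MinMaxScaler

# Toy data reusing three of the dataset's column names (values are made up).
toy = pd.DataFrame({
    "Contaminant Level (ppm)": [1.0, 3.0, 5.0, 7.0, 9.0],
    "Bacteria Count (CFU/mL)": [500, 1500, 2500, 3500, 4500],
    "Diarrheal Cases per 100,000 people": [60, 150, 240, 330, 480],
})

pollution_cols = ["Contaminant Level (ppm)", "Bacteria Count (CFU/mL)"]

# Standardize each indicator column so different units become comparable.
standardized = zscore(toy[pollution_cols].to_numpy(dtype=float), axis=0)

# Hypothetical composite: the unweighted mean of the standardized indicators.
raw_score = standardized.mean(axis=1)

# Rescale the composite to the 0-100 range.
scaler = MinMaxScaler(feature_range=(0, 100))
toy["risk_score"] = scaler.fit_transform(raw_score.reshape(-1, 1)).ravel()

# Pearson correlation between the score and one health outcome.
r, p_value = pearsonr(toy["risk_score"], toy["Diarrheal Cases per 100,000 people"])
print(toy["risk_score"].round(1).tolist())
print(f"r = {r:.3f}, p = {p_value:.4f}")
```

Standardizing before averaging prevents an indicator with a large numeric range, such as bacteria count, from dominating the score.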


In [None]:
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import pearsonr, zscore
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

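import warnings
from IPython.display import display  # built into Jupyter; explicit import for use outside notebooks

warnings.filterwarnings('ignore')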

# Loading the dataset
file_path = 'water_pollution_disease.csv'
df = pd.read_csv(file_path)

# Displaying the first few rows of the dataset
print("First few rows of the dataset:")
display(df.head())

# Show all the columns
print("\n Columns in the dataset:")
for col in df.columns:
    print(col)

# Basic shape information
print(f"\nDataset shape: {df.shape[0]} rows, {df.shape[1]} columns.")

# Summary statistics
print("\nSummary Statistics:")
display(df.describe())


First few rows of the dataset:


Unnamed: 0,Country,Region,Year,Water Source Type,Contaminant Level (ppm),pH Level,Turbidity (NTU),Dissolved Oxygen (mg/L),Nitrate Level (mg/L),Lead Concentration (µg/L),Bacteria Count (CFU/mL),Water Treatment Method,Access to Clean Water (% of Population),"Diarrheal Cases per 100,000 people","Cholera Cases per 100,000 people","Typhoid Cases per 100,000 people","Infant Mortality Rate (per 1,000 live births)",GDP per Capita (USD),Healthcare Access Index (0-100),Urbanization Rate (%),Sanitation Coverage (% of Population),Rainfall (mm per year),Temperature (°C),Population Density (people per km²)
0,Mexico,North,2015,Lake,6.06,7.12,3.93,4.28,8.28,7.89,3344,Filtration,33.6,472,33,44,76.16,57057,96.92,84.61,63.23,2800,4.94,593
1,Brazil,West,2017,Well,5.24,7.84,4.79,3.86,15.74,14.68,2122,Boiling,89.54,122,27,8,77.3,17220,84.73,73.37,29.12,1572,16.93,234
2,Indonesia,Central,2022,Pond,0.24,6.43,0.79,3.42,36.67,9.96,2330,,35.29,274,39,50,48.45,86022,58.37,72.86,93.56,2074,21.73,57
3,Nigeria,East,2016,Well,7.91,6.71,1.96,3.12,36.92,6.77,3779,Boiling,57.53,3,33,13,95.66,31166,39.07,71.07,94.25,937,3.79,555
4,Mexico,South,2005,Well,0.12,8.16,4.22,9.15,49.35,12.51,4182,Filtration,36.6,466,31,68,58.78,25661,23.03,55.55,69.23,2295,31.44,414



 Columns in the dataset:
Country
Region
Year
Water Source Type
Contaminant Level (ppm)
pH Level
Turbidity (NTU)
Dissolved Oxygen (mg/L)
Nitrate Level (mg/L)
Lead Concentration (µg/L)
Bacteria Count (CFU/mL)
Water Treatment Method
Access to Clean Water (% of Population)
Diarrheal Cases per 100,000 people
Cholera Cases per 100,000 people
Typhoid Cases per 100,000 people
Infant Mortality Rate (per 1,000 live births)
GDP per Capita (USD)
Healthcare Access Index (0-100)
Urbanization Rate (%)
Sanitation Coverage (% of Population)
Rainfall (mm per year)
Temperature (°C)
Population Density (people per km²)

Dataset shape: 3000 rows, 24 columns.

Summary Statistics:


Unnamed: 0,Year,Contaminant Level (ppm),pH Level,Turbidity (NTU),Dissolved Oxygen (mg/L),Nitrate Level (mg/L),Lead Concentration (µg/L),Bacteria Count (CFU/mL),Access to Clean Water (% of Population),"Diarrheal Cases per 100,000 people","Cholera Cases per 100,000 people","Typhoid Cases per 100,000 people","Infant Mortality Rate (per 1,000 live births)",GDP per Capita (USD),Healthcare Access Index (0-100),Urbanization Rate (%),Sanitation Coverage (% of Population),Rainfall (mm per year),Temperature (°C),Population Density (people per km²)
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0
mean,2012.012667,4.95439,7.255847,2.480023,6.49285,25.08025,10.047913,2488.477333,64.612333,249.776667,24.251,49.27,50.8119,50036.196667,50.029193,50.06248,60.371007,1591.849,20.130917,505.390333
std,7.229287,2.860072,0.720464,1.419984,2.027966,14.50517,5.798238,1431.421553,20.308463,144.111543,14.33259,28.984165,28.465323,28598.750508,28.896676,22.779125,23.159678,817.502434,11.689244,283.275224
min,2000.0,0.0,6.0,0.0,3.0,0.05,0.0,0.0,30.01,0.0,0.0,0.0,2.06,521.0,0.19,10.03,20.01,200.0,0.06,10.0
25%,2006.0,2.56,6.63,1.2575,4.71,12.525,5.12,1268.0,47.0275,124.0,12.0,24.0,26.4675,25010.25,24.9825,30.5575,40.44,865.75,9.84,254.75
50%,2012.0,4.95,7.28,2.46,6.49,24.79,10.065,2469.0,64.78,248.0,24.0,49.0,50.23,49621.5,50.39,49.795,60.58,1572.0,20.175,513.0
75%,2018.0,7.4,7.87,3.66,8.2525,37.91,15.0325,3736.25,82.3025,378.0,37.0,75.0,76.26,74778.25,74.8175,69.7275,80.42,2308.25,30.6725,745.0
max,2024.0,10.0,8.5,4.99,10.0,49.99,20.0,4998.0,99.99,499.0,49.0,99.0,99.99,99948.0,99.98,89.98,99.99,2999.0,39.99,999.0


## Dataset Overview

This dataset captures the multifaceted nature of water pollution and its impacts on public health and infrastructure across various countries. It contains 3,000 rows and 24 columns, including:

- **Water quality indicators** (e.g. contaminant level, pH, turbidity, dissolved oxygen)
- **Health statistics** (e.g. diarrheal, cholera, typhoid cases, infant mortality)
- **Socioeconomic factors** (e.g. GDP per capita, healthcare access, urbanization)
- **Infrastructure access** (e.g. access to clean water, sanitation coverage, water treatment method)

The data spans from the year 2000 to 2024, providing a temporal dimension for analysis.

A preview of the first few rows helps us understand the data structure, formatting, and potential cleaning needs.
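One way to keep these categories usable in later cells is a small lookup from theme to column names. The sketch below is an assumption about how that could be organized: `COLUMN_GROUPS` and `summarize_groups` are hypothetical names, each group lists only a few of the relevant columns, and the toy frame stands in for `df`.

```python
import pandas as pd

# Hypothetical grouping of some dataset columns by theme.
COLUMN_GROUPS = {
    "water_quality": ["Contaminant Level (ppm)", "pH Level", "Turbidity (NTU)"],
    "health": ["Diarrheal Cases per 100,000 people",
               "Infant Mortality Rate (per 1,000 live births)"],
    "socioeconomic": ["GDP per Capita (USD)", "Healthcare Access Index (0-100)"],
    "infrastructure": ["Access to Clean Water (% of Population)",
                       "Sanitation Coverage (% of Population)"],
}


def summarize_groups(frame: pd.DataFrame, groups: dict) -> pd.DataFrame:
    """Count how many expected columns each group has and list the absent ones."""
    rows = []
    for name, cols in groups.items():
        present = [c for c in cols if c in frame.columns]
        missing = [c for c in cols if c not in frame.columns]
        rows.append({"group": name, "present": len(present), "missing": missing})
    return pd.DataFrame(rows).set_index("group")


# Toy frame standing in for df (values are made up or copied from the preview).
toy = pd.DataFrame({
    "Year": [2000, 2012, 2024],
    "Contaminant Level (ppm)": [6.06, 5.24, 0.24],
    "pH Level": [7.12, 7.84, 6.43],
    "GDP per Capita (USD)": [57057, 17220, 86022],
})

summary = summarize_groups(toy, COLUMN_GROUPS)
year_range = (int(toy["Year"].min()), int(toy["Year"].max()))
print(summary)
print("Years covered:", year_range)
```

Run against the real `df`, the `missing` column would surface any renamed or misspelled column before it causes a `KeyError` further down.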

In [8]:
# Count missing values per column and show only the columns that have any
missing_counts = df.isnull().sum()
missing_counts[missing_counts > 0]

Water Treatment Method    747
dtype: int64

## Handling Missing Values

The only column with missing values is **Water Treatment Method**, with 747 missing entries out of 3,000 rows (about 25%). Because this is a categorical feature and dropping those rows would discard roughly a quarter of the data, we fill the missing entries with `"Unknown"` and keep every row.
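The trade-off between dropping and filling can be checked on a small example. The sketch below uses a made-up four-row frame, not the real dataset, and fills only the categorical column so that numeric columns keep their dtype.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing treatment method (values are illustrative).
toy = pd.DataFrame({
    "Country": ["Mexico", "Brazil", "Indonesia", "Nigeria"],
    "Water Treatment Method": ["Filtration", "Boiling", np.nan, "Boiling"],
    "pH Level": [7.12, 7.84, 6.43, 6.71],
})

# Share of rows with a missing treatment method.
missing_share = toy["Water Treatment Method"].isna().mean()

# Option 1: drop incomplete rows and lose their other measurements too.
rows_if_dropped = len(toy.dropna(subset=["Water Treatment Method"]))

# Option 2: keep every row and mark the gap with an explicit category.
filled = toy.copy()
filled["Water Treatment Method"] = filled["Water Treatment Method"].fillna("Unknown")

print(f"Missing share: {missing_share:.0%}")
print(f"Rows kept when dropping: {rows_if_dropped} of {len(toy)}")
print(filled["Water Treatment Method"].value_counts())
```

Treating `"Unknown"` as its own category also makes it possible to check later whether rows without a recorded treatment method differ from the rest.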


In [12]:
# Fill missing treatment methods with 'Unknown' (the only column with gaps),
# leaving every other column and its dtype untouched
df['Water Treatment Method'] = df['Water Treatment Method'].fillna('Unknown')

# Save the processed dataset
df.to_csv('water_pollution_disease_processed.csv', index=False)
print("\nProcessed dataset saved as 'water_pollution_disease_processed.csv'")

# Verify no missing values remain
print("\nMissing values after filling:")
print(df.isnull().sum()[df.isnull().sum() > 0])



Processed dataset saved as 'water_pollution_disease_processed.csv'

Missing values after filling:
Series([], dtype: int64)
