Darlene Eligado (1001889134)

# **Tabular Prototype - Machine Learning** 
**Predicting Waterborne Disease Counts Using Global Water Quality and Socioeconomic Indicators**

----

## **Problem Formulation**
- I begin with standard linear regression as a baseline for each disease (Diarrhea, Typhoid, Cholera) to identify any linear patterns, despite the weak correlations. I then move to nonlinear, tree-based models such as Random Forest and XGBoost to capture more complex interactions in the data.

In [1]:
# import libraries and cleaned datasets
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

diarrheal_linear = pd.read_csv('diarrhea_ml_ready.csv')  
typhoid_linear = pd.read_csv('typhoid_ml_ready.csv')  
cholera_linear = pd.read_csv('cholera_ml_ready.csv') 

---
### **Diarrheal Cases per 100,000 People**

In [3]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [5]:
# check dataset
diarrheal_linear.head()

| | Year | Contaminant Level (ppm) | pH Level | Turbidity (NTU) | Dissolved Oxygen (mg/L) | Nitrate Level (mg/L) | Lead Concentration (µg/L) | Bacteria Count (CFU/mL) | Access to Clean Water (% of Population) | Infant Mortality Rate (per 1,000 live births) | ... | Country_Mexico | Country_Nigeria | Country_Pakistan | Country_USA | Water Source Type_Pond | Water Source Type_River | Water Source Type_Spring | Water Source Type_Tap | Water Source Type_Well | Diarrheal Cases per 100,000 people |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | 6.06 | 7.12 | 3.93 | 4.28 | 8.28 | 7.89 | 3344 | 33.6 | 76.16 | ... | True | False | False | False | False | False | False | False | False | 472 |
| 1 | 2017 | 5.24 | 7.84 | 4.79 | 3.86 | 15.74 | 14.68 | 2122 | 89.54 | 77.3 | ... | False | False | False | False | False | False | False | False | True | 122 |
| 2 | 2022 | 0.24 | 6.43 | 0.79 | 3.42 | 36.67 | 9.96 | 2330 | 35.29 | 48.45 | ... | False | False | False | False | True | False | False | False | False | 274 |
| 3 | 2016 | 7.91 | 6.71 | 1.96 | 3.12 | 36.92 | 6.77 | 3779 | 57.53 | 95.66 | ... | False | True | False | False | False | False | False | False | True | 3 |
| 4 | 2005 | 0.12 | 8.16 | 4.22 | 9.15 | 49.35 | 12.51 | 4182 | 36.6 | 58.78 | ... | True | False | False | False | False | False | False | False | True | 466 |


In [7]:
diarrheal_linear.dtypes

Year                                               int64
Contaminant Level (ppm)                          float64
pH Level                                         float64
Turbidity (NTU)                                  float64
Dissolved Oxygen (mg/L)                          float64
Nitrate Level (mg/L)                             float64
Lead Concentration (µg/L)                        float64
Bacteria Count (CFU/mL)                            int64
Access to Clean Water (% of Population)          float64
Infant Mortality Rate (per 1,000 live births)    float64
GDP per Capita (USD)                               int64
Healthcare Access Index (0-100)                  float64
Urbanization Rate (%)                            float64
Sanitation Coverage (% of Population)            float64
Rainfall (mm per year)                             int64
Temperature (°C)                                 float64
Population Density (people per km²)                int64
Water Treatment Method_Chlorina

In [9]:
# checking for duplicate rows
num_duplicates = diarrheal_linear.duplicated().sum()

print(f"Number of duplicate rows: {num_duplicates}")

Number of duplicate rows: 0


#### **Linear Regression as a baseline model first**

In [12]:
# separate features and target
X = diarrheal_linear.drop(columns='Diarrheal Cases per 100,000 people')
y = diarrheal_linear['Diarrheal Cases per 100,000 people']

# identify numerical columns (skipping bools), then scale the numerical columns
num_cols = X.select_dtypes(include=['int64', 'float64']).columns

scale_diarrheal = StandardScaler()
X[num_cols] = scale_diarrheal.fit_transform(X[num_cols])
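
One caveat about the scaling step above: fitting `StandardScaler` on all of `X` before splitting lets the validation and test rows influence the scaling statistics. A minimal sketch of the alternative, fitting the scaler on the training rows only, is shown here on synthetic stand-in data. The column names (`feat_a`, `feat_b`, `flag`) and variable names are hypothetical and are not taken from the cleaned datasets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the cleaned dataset (hypothetical columns)
rng = np.random.default_rng(42)
demo_X = pd.DataFrame({
    "feat_a": rng.normal(50, 10, 500),
    "feat_b": rng.normal(5, 2, 500),
    "flag": rng.integers(0, 2, 500).astype(bool),  # one-hot style column, left unscaled
})
demo_y = pd.Series(rng.normal(250, 140, 500))

# split first so held-out rows never influence the scaler
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)
Xd_train, Xd_val = Xd_train.copy(), Xd_val.copy()

demo_num_cols = Xd_train.select_dtypes(include=["int64", "float64"]).columns

# fit on the training rows only, then reuse the same statistics for validation
demo_scaler = StandardScaler()
Xd_train[demo_num_cols] = demo_scaler.fit_transform(Xd_train[demo_num_cols])
Xd_val[demo_num_cols] = demo_scaler.transform(Xd_val[demo_num_cols])
```

For plain linear regression and tree-based models the effect on the metrics is expected to be small, since neither is sensitive to feature scale, but the split-then-scale order keeps the held-out sets untouched.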

In [14]:
# splitting into train, validation, and test sets (80/10/10)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.1111, random_state=42)
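
The second `test_size` of 0.1111 may look odd, so here is a quick check of the arithmetic: 0.1111 of the remaining 90% is roughly 10% of the original data. The sketch uses 1,000 dummy rows (an assumption for illustration, not the real dataset size).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 dummy rows stand in for the real dataset
demo_rows = np.arange(1000)

# first cut: hold out 10% as the test set, leaving 90%
rest, test_part = train_test_split(demo_rows, test_size=0.1, random_state=42)
# second cut: 0.1111 of the remaining 90% is about 10% of the original
train_part, val_part = train_test_split(rest, test_size=0.1111, random_state=42)

print(len(train_part), len(val_part), len(test_part))
```

For 1,000 rows this gives an 800/100/100 split.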

In [16]:
# train LR model
diarrheal_lr = LinearRegression()
diarrheal_lr.fit(X_train, y_train)

# predict on validation set
y_val_pred = diarrheal_lr.predict(X_val)



In [18]:
# evaluate LR diarrheal cases performance
diarrheal_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
diarrheal_mae = mean_absolute_error(y_val, y_val_pred)
diarrheal_r2 = r2_score(y_val, y_val_pred)

print("Validation Performance Diarrheal Cases per 100,000 People Linear Regression:")
print(f"R² Score: {diarrheal_r2:.4f}")
print(f"RMSE: {diarrheal_rmse:.2f}")
print(f"MAE: {diarrheal_mae:.2f}")

Validation Performance Diarrheal Cases per 100,000 People Linear Regression:
R² Score: -0.0257
RMSE: 146.32
MAE: 127.25


**Linear Regression on Diarrheal Cases per 100,000 People Evaluation**
- The model had an R² score of about -0.03 on the validation set, meaning it performs worse than just predicting the average value.
- With an RMSE of 146.32 and MAE of 127.25, the model shows significant prediction errors relative to the range of diarrheal case counts.
- The results show weak linear relationships in the data, suggesting that this simple linear model isn't enough to understand the factors affecting diarrheal disease rates.
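
To make the "worse than predicting the average" reading concrete, the sketch below scores a mean-only predictor. Since R² = 1 − SS_res / SS_tot, a model that always predicts the training mean scores at or slightly below 0 on held-out data, so a negative score means the model fell below that floor. `DummyRegressor` is scikit-learn's built-in mean predictor; the data here is random and purely illustrative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
Xb = rng.normal(size=(400, 5))        # features carry no signal
yb = rng.normal(250, 140, size=400)   # illustrative target values

Xb_train, Xb_val, yb_train, yb_val = train_test_split(
    Xb, yb, test_size=0.2, random_state=0
)

# always predicts the mean of the training target
mean_model = DummyRegressor(strategy="mean").fit(Xb_train, yb_train)
baseline_r2 = r2_score(yb_val, mean_model.predict(Xb_val))
print(f"Mean-only baseline R²: {baseline_r2:.4f}")
```

On the notebook's own data, the same comparison would be `DummyRegressor(strategy="mean").fit(X_train, y_train)` scored on `X_val` and `y_val`.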

----
- For the upcoming models, I will use only the top positively correlated features and drop the rest.

For Reference: 

Top positively correlated features with Diarrheal Cases per 100,000 people:
- Water Source Type_Tap: 0.039820
- Sanitation Coverage (% of Population): 0.033032
- Infant Mortality Rate (per 1,000 live births): 0.031683
- Region_South: 0.030641
- Country_Brazil: 0.026303
- Country_Mexico: 0.023499
- pH Level: 0.018386
- Turbidity (NTU): 0.016556
- Region_East: 0.015668
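
A ranking like the one above can be produced with pandas. The sketch below is one possible way to do it, not necessarily how the EDA computed it: the helper `top_positive_features` and the demo columns are hypothetical, and Pearson correlation is assumed.

```python
import numpy as np
import pandas as pd

def top_positive_features(df, target, k=9):
    """Return the k features with the largest positive Pearson correlation to `target`."""
    # cast bool one-hot columns to float so they are included in the correlation
    corr = df.astype(float).corr()[target].drop(target)
    return corr[corr > 0].sort_values(ascending=False).head(k)

# synthetic demo frame (hypothetical columns)
rng = np.random.default_rng(1)
base = rng.normal(size=300)
demo_df = pd.DataFrame({
    "pos_feat": base + rng.normal(scale=0.5, size=300),
    "neg_feat": -base + rng.normal(scale=0.5, size=300),
    "flag": rng.integers(0, 2, 300).astype(bool),
    "target": base,
})

top_feats = top_positive_features(demo_df, "target", k=2)
print(top_feats)
```

The resulting index can then restrict the feature matrix, for example `X_train[top_feats.index]`. With every coefficient in the list above below 0.04, the ranking may not be stable across different samples of the data.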

#### **XGBoost Baseline**

In [22]:
import xgboost as xgb

# XGBoost regressor w/ basic params
diarrheal_xgb_reg = xgb.XGBRegressor(
    objective='reg:squarederror',  # for regression
    n_estimators=100,
    random_state=42,
    verbosity=1
)

# train model
diarrheal_xgb_reg.fit(X_train, y_train)


# predict on validation set and evaluate XGBoost performance
y_val_pred = diarrheal_xgb_reg.predict(X_val)
dxg_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
dxg_mae = mean_absolute_error(y_val, y_val_pred)
dxg_r2 = r2_score(y_val, y_val_pred)

print("XGBoost:")
print(f"R² Score: {dxg_r2:.4f}")
print(f"RMSE: {dxg_rmse:.2f}")
print(f"MAE: {dxg_mae:.2f}")

XGBoost:
R² Score: -0.2769
RMSE: 163.26
MAE: 137.45


#### **Possible Next Steps**
- **Go back and do more feature engineering**, such as aggregating some columns together
- **Feature selection**, possibly using only the top correlated features gathered during EDA and cleaning
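
One way to judge whether a reduced feature set helps is to compare cross-validated scores for the full and reduced sets. The sketch below assumes scikit-learn's `RandomForestRegressor` (the tree model named in the problem formulation) with shallow trees, and runs on toy data in which only one column carries signal. The helper `cv_r2` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# toy data: only f0 carries signal
rng = np.random.default_rng(7)
cv_X = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"f{i}" for i in range(6)])
cv_y = 3 * cv_X["f0"] + rng.normal(size=300)

def cv_r2(features):
    """Mean 5-fold cross-validated R² for a shallow random forest on `features`."""
    model = RandomForestRegressor(n_estimators=50, max_depth=4, random_state=42)
    return cross_val_score(model, cv_X[features], cv_y, cv=5, scoring="r2").mean()

all_score = cv_r2(list(cv_X.columns))
subset_score = cv_r2(["f0", "f1"])
print(f"all features: {all_score:.3f}, subset: {subset_score:.3f}")
```

Averaging over five folds gives a less noisy comparison than a single 10% validation split, which matters when the scores being compared are all close to zero.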

-----