## Water Quality Prediction Project Proposal
### Introduction

Water is a basic human need, yet many regions around the globe have limited access to clean, safe drinking water. A body of water's potability, that is, whether it is safe to drink, can be assessed using a range of water quality metrics, each of which serves as an indicator.

For this project, we will use data from the [Water Quality dataset](https://www.kaggle.com/datasets/adityakadiwal/water-potability/data), which contains water quality metrics for 3276 distinct water bodies.


<br>


Our **objective** with this project is to answer this question:


> **Can we predict the potability of water based on its quality metrics?**
   

<br>

The **variables** of this dataset are detailed below:

- **pH Value:** How acidic or alkaline water is.

- **Hardness:** The presence of calcium and magnesium salts.

- **Solids (Total Dissolved Solids - TDS):** The concentration of dissolved minerals in water.

- **Chloramines:** The concentration of disinfectants used in public water systems.

- **Sulfate:** The concentration of sulfate found in many natural sources.

- **Conductivity:** Water's electrical conductivity (EC) based on the amount of dissolved solids in water.

- **Organic_carbon (Total Organic Carbon - TOC):** The amount of carbon in organic compounds in pure water. 

- **Trihalomethanes (THMs):** Chemicals present in chlorine-treated water.

- **Turbidity:** The amount of solid matter suspended in water, influencing transparency.

- **Potability:** Whether the water is safe for human consumption. '1' is potable, '0' is not potable.

### Preliminary Exploratory Data Analysis
(Because Kaggle requires authentication to download the file, we uploaded the dataset to Google Drive and created a share link with public access.)

In [29]:
#import commands
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [20]:
raw_water_data = pd.read_csv('https://drive.google.com/uc?id=13N4nBi8cZCQUQambCexi0-XArwSghdrj')

missing_sum = raw_water_data.isnull().sum()

print(missing_sum)

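# shape is a (rows, columns) tuple, so take the row count explicitly
total_rows = raw_water_data.shape[0]

print(raw_water_data.shape)

# Missing cells in the three affected columns as a percentage of the row count.
# A row missing more than one value is counted once per missing cell.
percent_missing = ((missing_sum["ph"] + missing_sum["Sulfate"] + missing_sum["Trihalomethanes"]) / total_rows) * 100

print(round(percent_missing, 2))

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64
(3276, 10)
43.77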
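
The percentage computed above adds up missing cells across three columns, so it can overstate how many rows are affected. A minimal sketch of reporting the two quantities separately, using a small made-up frame (`toy` and its values are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, np.nan],
    "Sulfate": [330.0, np.nan, np.nan, 310.0],
    "Hardness": [200.0, 180.0, 210.0, 190.0],
})

# Percentage of missing values in each column.
percent_missing_by_column = toy.isnull().mean() * 100

# Percentage of rows that contain at least one missing value.
percent_rows_with_missing = toy.isnull().any(axis=1).mean() * 100

print(percent_missing_by_column)
print(percent_rows_with_missing)
```

In this toy frame four cells are missing across four rows, yet only three of the four rows are affected, because one row is missing two values.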


In [21]:
preprocessor_missing = make_column_transformer(
    # SimpleImputer defaults to strategy="mean": missing values become the column mean
    (SimpleImputer(strategy="mean"), ["ph", "Sulfate", "Trihalomethanes"]),
    remainder='passthrough',
    verbose_feature_names_out=False
)

preprocessor_missing.fit(raw_water_data)
water_data = preprocessor_missing.transform(raw_water_data)


water_data["Potability"] = water_data["Potability"].replace({
    0: "Not Potable",
    1: "Potable"
})
water_data["Potability"].value_counts()

Not Potable    1998
Potable        1278
Name: Potability, dtype: int64
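
`SimpleImputer` replaces each missing value with the mean of its column by default. The cell above fits the imputer on the full dataset; one possible variation, sketched here on made-up data, is to fit it on the training rows only and reuse those means for the test rows. The frames `train` and `test` and their values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames; the values are illustrative only.
train = pd.DataFrame({"ph": [6.0, 8.0, np.nan], "Sulfate": [300.0, np.nan, 340.0]})
test = pd.DataFrame({"ph": [np.nan], "Sulfate": [310.0]})

# Learn the column means from the training rows only.
imputer = SimpleImputer(strategy="mean")
imputer.fit(train)

# Apply the training means to both splits.
train_filled = pd.DataFrame(imputer.transform(train), columns=train.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(train_filled)
print(test_filled)
```

Here the missing `ph` value in `test` is filled with 7.0, the mean of the two observed training values, so no information from the test rows enters the imputation.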

In [22]:
np_water = water_data[water_data["Potability"] == "Not Potable"]
p_water = water_data[water_data["Potability"] == "Potable"]
p_water_upsampled = resample(
    p_water, n_samples=np_water.shape[0]
)
upsampled_water = pd.concat((p_water_upsampled, np_water))
upsampled_water['Potability'].value_counts()

Potable        1998
Not Potable    1998
Name: Potability, dtype: int64
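
`resample` draws with replacement and, without a seed, returns a different sample on every run. A hedged sketch of one alternative ordering, not what the cell above does: split first, then upsample only the training rows with a fixed `random_state`, so that copies of a row cannot land in both splits. The frame `toy` and the seed value 123 are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data with the same kind of class imbalance.
toy = pd.DataFrame({
    "Solids": list(range(20)),
    "Potability": ["Not Potable"] * 12 + ["Potable"] * 8,
})

# Split before upsampling so the test rows stay untouched.
train, test = train_test_split(
    toy, train_size=0.75, stratify=toy["Potability"], random_state=123
)

majority = train[train["Potability"] == "Not Potable"]
minority = train[train["Potability"] == "Potable"]

# Sample the minority class with replacement up to the majority size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=123
)
balanced_train = pd.concat([minority_upsampled, majority]).reset_index(drop=True)

print(balanced_train["Potability"].value_counts())
```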

In [23]:
water_data = upsampled_water.reset_index(drop=True)
water_data.head(10)

| | ph | Sulfate | Trihalomethanes | Hardness | Solids | Chloramines | Conductivity | Organic_carbon | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.580049 | 333.775777 | 78.354341 | 225.088555 | 31749.924396 | 5.884795 | 503.908733 | 18.502406 | 3.959637 | Potable |
| 1 | 7.493291 | 341.460632 | 52.951363 | 216.38188 | 9465.323905 | 4.947531 | 501.345347 | 11.247507 | 3.748747 | Potable |
| 2 | 4.812434 | 398.161904 | 62.511467 | 250.183954 | 11465.575643 | 4.362641 | 607.026542 | 18.210032 | 3.020996 | Potable |
| 3 | 8.848586 | 182.39737 | 77.671337 | 188.919983 | 32033.332019 | 13.127 | 479.791975 | 12.070444 | 4.014682 | Potable |
| 4 | 7.381758 | 475.73746 | 55.132546 | 203.460302 | 1372.091043 | 11.129154 | 361.342496 | 16.548438 | 3.338022 | Potable |
| 5 | 8.137713 | 333.775777 | 95.905288 | 178.716633 | 33786.716309 | 9.101885 | 481.073539 | 12.273181 | 2.743867 | Potable |
| 6 | 6.170526 | 333.775777 | 75.067706 | 193.335517 | 16206.219671 | 7.123966 | 528.096091 | 20.532277 | 3.652207 | Potable |
| 7 | 6.851443 | 333.775777 | 79.155822 | 197.339559 | 15349.142585 | 7.446412 | 373.549867 | 11.367275 | 4.935557 | Potable |
| 8 | 7.617033 | 298.413238 | 76.51317 | 242.989402 | 17681.272357 | 2.85579 | 549.987318 | 10.065225 | 4.299543 | Potable |
| 9 | 7.378597 | 333.775777 | 37.389452 | 175.982447 | 9460.322635 | 5.941012 | 402.019945 | 15.639455 | 3.215219 | Potable |


In [24]:
water_train, water_test = train_test_split(
    water_data, train_size=0.75, stratify=water_data["Potability"]
)
# Every predictor column is already float64, so no type conversion is needed
print(water_train.info())
print(water_test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2997 entries, 3433 to 286
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2997 non-null   float64
 1   Sulfate          2997 non-null   float64
 2   Trihalomethanes  2997 non-null   float64
 3   Hardness         2997 non-null   float64
 4   Solids           2997 non-null   float64
 5   Chloramines      2997 non-null   float64
 6   Conductivity     2997 non-null   float64
 7   Organic_carbon   2997 non-null   float64
 8   Turbidity        2997 non-null   float64
 9   Potability       2997 non-null   object 
dtypes: float64(9), object(1)
memory usage: 257.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 999 entries, 2662 to 2932
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               999 non-null    float64
 1   Sulfate          999 non-null    float64
 ...

In [25]:
predictor_vals_summary = water_train.describe()
predictor_vals_summary

| | ph | Sulfate | Trihalomethanes | Hardness | Solids | Chloramines | Conductivity | Organic_carbon | Turbidity |
|---|---|---|---|---|---|---|---|---|---|
| count | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 |
| mean | 7.050312 | 333.348215 | 66.359882 | 196.917865 | 22000.837674 | 7.135386 | 425.485238 | 14.247171 | 3.970265 |
| std | 1.47936 | 37.260366 | 15.798871 | 34.625862 | 8668.67472 | 1.579708 | 81.393155 | 3.289666 | 0.77434 |
| min | 0.0 | 180.206746 | 0.738 | 47.432 | 728.75083 | 0.352 | 181.483754 | 4.371899 | 1.45 |
| 25% | 6.2668 | 316.572415 | 56.71551 | 176.779947 | 15661.940335 | 6.142618 | 365.439202 | 12.024866 | 3.437946 |
| 50% | 7.080795 | 333.775777 | 66.396293 | 197.617494 | 21065.663658 | 7.148518 | 420.549219 | 14.174063 | 3.955917 |
| 75% | 7.835907 | 349.995746 | 77.030487 | 218.26117 | 27418.781044 | 8.104498 | 482.63718 | 16.488156 | 4.51515 |
| max | 14.0 | 481.030642 | 118.357275 | 323.124 | 61227.196008 | 13.127 | 753.34262 | 28.3 | 6.739 |


In [26]:
selected_predictors_summary = water_train[["Solids", "Conductivity", "Hardness", "Organic_carbon", "Chloramines"]].describe()
selected_predictors_summary

| | Solids | Conductivity | Hardness | Organic_carbon | Chloramines |
|---|---|---|---|---|---|
| count | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 |
| mean | 22000.837674 | 425.485238 | 196.917865 | 14.247171 | 7.135386 |
| std | 8668.67472 | 81.393155 | 34.625862 | 3.289666 | 1.579708 |
| min | 728.75083 | 181.483754 | 47.432 | 4.371899 | 0.352 |
| 25% | 15661.940335 | 365.439202 | 176.779947 | 12.024866 | 6.142618 |
| 50% | 21065.663658 | 420.549219 | 197.617494 | 14.174063 | 7.148518 |
| 75% | 27418.781044 | 482.63718 | 218.26117 | 16.488156 | 8.104498 |
| max | 61227.196008 | 753.34262 | 323.124 | 28.3 | 13.127 |


In [27]:
metric_hists = alt.Chart(water_train).mark_bar(opacity = .8).encode(
    alt.X(alt.repeat("repeat"), type='quantitative', bin=True),
    alt.Y("count()", type='quantitative', stack=False),
    color=alt.Color("Potability", scale=alt.Scale(scheme = 'paired'))
).properties(
    width=200,
    height=200
).repeat(
    repeat=['Hardness', 'Solids', 
             'Chloramines', 'Conductivity', 'Organic_carbon'],
    columns=3
).properties(title="Selected Water Quality Metrics by Potability")

metric_hists

### Method
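To choose initial predictors, we ranked the variables by the relative difference between their class means. For each variable, we computed the mean over the potable samples and the mean over the non-potable samples, then took the absolute difference divided by the non-potable mean. From this ranking we selected "Solids", "Conductivity", "Hardness", "Organic_carbon" and "Chloramines" as the initial predictor variables. Because the upsampling step draws a random sample, the ranking can change between runs; in the run shown below, "ph" and "Trihalomethanes" appear in the top five while "Conductivity" and "Hardness" do not. As we develop our classification model, we will check whether each predictor is needed, so that the final model keeps only the variables that improve its performance.

We will visualize the model's performance with a confusion matrix, and use a line plot of estimated accuracy against the number of neighbors K to choose the best K-value. We will then apply our model to the testing data set to check its accuracy, precision, and recall.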
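
The planned workflow could look roughly like the sketch below. It is an outline under stated assumptions rather than the final model: the data are synthetic (generated with `make_classification` as a stand-in for the five selected predictors), and the K grid, the number of folds, and the seeds are arbitrary choices.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for five predictors and a binary label (not the real data).
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0
)

# Scale the predictors, then classify with K-nearest neighbors.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Try odd values of K with 5-fold cross-validation on the training set.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))}
grid = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Mean cross-validated accuracy per K: the data behind a K-vs-accuracy line plot.
cv_results = pd.DataFrame(grid.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score"]
]
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]

# Evaluate the refitted best model once on the held-out test set.
y_pred = grid.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(best_k, accuracy, precision, recall)
print(cm)
```

Tuning K by cross-validation on the training set keeps the test set for a single final evaluation, which is what makes the reported accuracy, precision, and recall an honest estimate.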


In [28]:
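# Mean of each metric over the non-potable samples
np_means = water_data[water_data["Potability"] == "Not Potable"].mean(numeric_only=True)
display(np_means)

# Mean of each metric over the potable samples
p_means = water_data[water_data["Potability"] == "Potable"].mean(numeric_only=True)
display(p_means)

# Five largest relative differences between the class means
((np_means - p_means) / np_means).abs().nlargest(5)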

ph                     7.084658
Sulfate              334.371700
Trihalomethanes       66.308522
Hardness             196.733292
Solids             21777.490788
Chloramines            7.092175
Conductivity         426.730454
Organic_carbon        14.364335
Turbidity              3.965800
dtype: float64

ph                     7.007324
Sulfate              333.353204
Trihalomethanes       66.743507
Hardness             196.277087
Solids             22228.949975
Chloramines            7.161154
Conductivity         426.065989
Organic_carbon        14.168298
Turbidity              3.970963
dtype: float64

Solids             0.020731
Organic_carbon     0.013647
ph                 0.010916
Chloramines        0.009726
Trihalomethanes    0.006560
dtype: float64
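
As a worked check of the relative-difference calculation, the sketch below recomputes it for four variables from the class means printed above. The means are copied from that output, rounded to six decimals; the dictionary names are introduced here for illustration.

```python
# Class means copied from the output above (rounded to six decimals).
not_potable_means = {
    "Solids": 21777.490788,
    "Organic_carbon": 14.364335,
    "Conductivity": 426.730454,
    "Hardness": 196.733292,
}
potable_means = {
    "Solids": 22228.949975,
    "Organic_carbon": 14.168298,
    "Conductivity": 426.065989,
    "Hardness": 196.277087,
}

# |mean_not_potable - mean_potable| / mean_not_potable for each variable.
relative_diff = {
    name: abs(not_potable_means[name] - potable_means[name]) / not_potable_means[name]
    for name in not_potable_means
}

# Print the variables from largest to smallest relative difference.
for name, value in sorted(relative_diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.6f}")
```

With these means, Conductivity and Hardness each come out at around 0.002, well below the 0.0207 for Solids, which is why they do not appear in the top five for this run.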

### Expected Outcomes and Significance
#### What do you expect to find? 

We expect to build a classifier that predicts a water sample's potability with a high level of accuracy. In developing the model, we also expect to learn which variables are most indicative of potability. This may prove challenging because the class means differ only marginally: the largest relative difference computed above is about 2%.

#### What impact could such findings have? 

It is vital to know whether water is potable before a population has access to it. Applying this model could help prevent harm to an individual's, or even an entire population's, health. We could also identify the range each metric falls in for potable water, which would let researchers rule out certain samples quickly before testing other variables.

#### What future questions could this lead to? 

These findings could raise the question of whether variables not included in the dataset also affect potability. Inaccurate predictions could likewise prompt us to consider factors that may have been overlooked.