## Water Quality Prediction Project Proposal
### Introduction

Water is a basic human need, yet many regions around the globe have limited access to clean, safe drinking water. A body of water's potability can be assessed using a range of water quality metrics, each of which indicates whether the water is safe to drink, that is, potable.

For this project, we will use the `water_potability.csv` file from the [Water Quality dataset](https://www.kaggle.com/datasets/adityakadiwal/water-potability/data) on Kaggle, which contains water quality metrics for 3276 distinct water bodies.

<br>


Our **objective** with this project is to answer this question:


> **Can we predict the potability of water based on its quality metrics?**
   

<br>

The **variables** of this dataset are detailed below:

- **pH Value:** A measure of how acidic or alkaline water is. The WHO recommends a pH between 6.5 and 8.5 for drinking water.

- **Hardness:** Reflects the concentration of dissolved calcium and magnesium salts. Hardness was originally defined as the capacity of water to precipitate soap.

- **Solids (Total Dissolved Solids - TDS):** Denotes the concentration of dissolved minerals in water, affecting its taste and appearance.

- **Chloramines:** The concentration of the disinfectants (chlorine and chloramine) primarily used in public water systems. Chlorine levels up to 4 mg/L are considered safe in drinking water.

- **Sulfate:** Found in many natural sources (groundwater, plants, food, etc.). Concentration varies greatly.

- **Conductivity:** A measure of water's ability to conduct electric current, primarily determined by the amount of dissolved solids in water. Pure water is not a good conductor, and WHO standards state that EC (electrical conductivity) should not exceed 400 μS/cm. 

- **Organic_carbon (Total Organic Carbon - TOC):** A measure of the total amount of carbon in organic compounds in pure water. Comes from decaying natural organic matter and synthetic sources. 

- **Trihalomethanes (THMs):** Chemicals that might be present in water that has been treated with chlorine. Concentration fluctuates according to level of organic material in the water, the temperature of the water, and the amount of chlorine needed to treat the water.

- **Turbidity:** The amount of solid matter suspended in water, which influences the water's transparency. It is measured through the light-scattering properties of water. The WHO recommends a value of at most 5.00 NTU.

- **Potability:** Indicates if water is safe for human consumption or not. '1' is potable, '0' is not potable. 

(389 words)


### Preliminary exploratory data analysis
To begin, we imported the libraries we might need and read the data from the web into Python. Because downloading directly from Kaggle requires authentication, we uploaded the dataset file to Google Drive and created a publicly accessible share link. This way, we did not have to upload the file into Jupyter or use the Kaggle API. We then cleaned and wrangled the data and split it into training and testing sets.

In [18]:
#import commands
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [19]:
raw_water_data = pd.read_csv('https://drive.google.com/uc?id=13N4nBi8cZCQUQambCexi0-XArwSghdrj')

missing_sum = raw_water_data.isnull().sum()

print(missing_sum)

# Number of rows (the first element of the shape tuple)
total_rows = raw_water_data.shape[0]

print(total_rows)

# Missing cells across the three affected columns, as a percentage of the row count
percent_missing = ((missing_sum["ph"] + missing_sum["Sulfate"] + missing_sum["Trihalomethanes"]) / total_rows) * 100

print(f"{percent_missing:.2f}")

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64
3276
43.77
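The percentage of about 43.77 computed above divides the number of missing cells in three columns by the number of rows, so a row that is missing two values is counted twice. As a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration, not taken from the dataset), the share of rows with at least one missing value can be computed separately:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for raw_water_data
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, np.nan],
    "Sulfate": [330.0, np.nan, np.nan, 310.0],
    "Trihalomethanes": [60.0, 70.0, 80.0, 90.0],
})

# Missing cells relative to the row count (the quantity computed above)
cells_missing_pct = toy.isnull().sum().sum() / len(toy) * 100

# Rows with at least one missing value, as a percentage of all rows
rows_missing_pct = toy.isnull().any(axis=1).mean() * 100

print(cells_missing_pct, rows_missing_pct)
```

The second figure is the one that describes how many samples are affected by imputation.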


In [20]:
preprocessor_missing = make_column_transformer(
    (SimpleImputer(), ["ph", "Sulfate", "Trihalomethanes"]),
    remainder='passthrough',
    verbose_feature_names_out=False
)

preprocessor_missing.fit(raw_water_data)
water_data = preprocessor_missing.transform(raw_water_data)


water_data["Potability"] = water_data["Potability"].replace({
    0: "Not Potable",
    1: "Potable"
})

water_data.isnull().sum()

ph                 0
Sulfate            0
Trihalomethanes    0
Hardness           0
Solids             0
Chloramines        0
Conductivity       0
Organic_carbon     0
Turbidity          0
Potability         0
dtype: int64

In [21]:
water_data["Potability"].value_counts()

Not Potable    1998
Potable        1278
Name: Potability, dtype: int64

In [22]:
np_water = water_data[water_data["Potability"] == "Not Potable"]
p_water = water_data[water_data["Potability"] == "Potable"]
p_water_upsampled = resample(
    p_water, n_samples=np_water.shape[0]
)
upsampled_water = pd.concat((p_water_upsampled, np_water))
upsampled_water['Potability'].value_counts()

Potable        1998
Not Potable    1998
Name: Potability, dtype: int64
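One caveat with this step: because the upsampling happens before the train/test split, copies of the same potable row can end up in both the training and the testing set, which may make test accuracy look better than it is. A possible alternative, sketched here on hypothetical toy data (the `toy` frame and the `random_state` value are assumptions for illustration), is to split first and upsample only the training portion:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 12 "Not Potable" rows and 8 "Potable" rows
toy = pd.DataFrame({
    "Solids": range(20),
    "Potability": ["Not Potable"] * 12 + ["Potable"] * 8,
})

# 1. Split first, so the test set only holds original, unduplicated rows
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["Potability"], random_state=123
)

# 2. Upsample the minority class within the training set only
majority = toy_train[toy_train["Potability"] == "Not Potable"]
minority = toy_train[toy_train["Potability"] == "Potable"]
minority_upsampled = resample(
    minority, n_samples=len(majority), random_state=123
)
balanced_train = pd.concat((minority_upsampled, majority))

counts = balanced_train["Potability"].value_counts()
print(counts)
```

Fixing `random_state` also makes the resampling and the split reproducible between runs.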

In [23]:
water_data = upsampled_water.reset_index(drop=True)
water_data.head(10)

| | ph | Sulfate | Trihalomethanes | Hardness | Solids | Chloramines | Conductivity | Organic_carbon | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.080795 | 233.792588 | 52.164236 | 273.255621 | 44506.441561 | 8.412688 | 485.647232 | 14.705014 | 4.124781 | Potable |
| 1 | 6.149185 | 333.775777 | 64.394907 | 150.563594 | 20596.391231 | 6.906911 | 431.651283 | 12.82938 | 4.275615 | Potable |
| 2 | 5.768868 | 356.552697 | 100.231668 | 184.395003 | 31155.98882 | 8.392834 | 380.09681 | 10.43034 | 3.092626 | Potable |
| 3 | 7.402653 | 295.931373 | 83.118742 | 182.999381 | 19259.193937 | 7.63237 | 339.737304 | 10.494609 | 6.22658 | Potable |
| 4 | 5.836105 | 400.167599 | 72.059866 | 277.065713 | 17711.487774 | 3.458192 | 456.732862 | 17.552294 | 3.738991 | Potable |
| 5 | 6.941719 | 333.775777 | 66.396293 | 173.334389 | 20111.821256 | 6.697194 | 374.485332 | 19.937486 | 4.563183 | Potable |
| 6 | 4.872561 | 323.036852 | 79.962803 | 224.705105 | 16960.434631 | 7.078015 | 354.390604 | 15.692176 | 3.918006 | Potable |
| 7 | 8.787668 | 333.775777 | 66.396293 | 232.462637 | 7035.133797 | 9.306449 | 415.624882 | 12.051417 | 3.489946 | Potable |
| 8 | 6.203573 | 333.775777 | 68.298689 | 139.129083 | 6698.239095 | 3.876813 | 601.526167 | 13.368165 | 4.305549 | Potable |
| 9 | 6.280978 | 383.671459 | 32.799029 | 205.123123 | 25972.803751 | 8.417896 | 456.543945 | 13.95471 | 4.599432 | Potable |


In [24]:
water_train, water_test = train_test_split(
    water_data, train_size=0.75, stratify=water_data["Potability"]
)
# All predictor columns are already float64 (see the info output), so no numeric conversion is needed
print(water_train.info())
print(water_test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2997 entries, 194 to 3647
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2997 non-null   float64
 1   Sulfate          2997 non-null   float64
 2   Trihalomethanes  2997 non-null   float64
 3   Hardness         2997 non-null   float64
 4   Solids           2997 non-null   float64
 5   Chloramines      2997 non-null   float64
 6   Conductivity     2997 non-null   float64
 7   Organic_carbon   2997 non-null   float64
 8   Turbidity        2997 non-null   float64
 9   Potability       2997 non-null   object 
dtypes: float64(9), object(1)
memory usage: 257.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 999 entries, 3186 to 1571
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               999 non-null    float64
 1   Sulfate          999 non-nul

In [25]:
predictor_vals_summary = water_train.describe()
predictor_vals_summary

| | ph | Sulfate | Trihalomethanes | Hardness | Solids | Chloramines | Conductivity | Organic_carbon | Turbidity |
|---|---|---|---|---|---|---|---|---|---|
| count | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 |
| mean | 7.085324 | 334.091954 | 66.696555 | 196.583685 | 22001.393953 | 7.119761 | 426.994169 | 14.188246 | 3.96207 |
| std | 1.432629 | 37.312948 | 15.904399 | 33.293118 | 8766.263557 | 1.591358 | 80.597055 | 3.305373 | 0.77706 |
| min | 0.0 | 129.0 | 8.175876 | 47.432 | 320.942611 | 0.352 | 201.619737 | 4.371899 | 1.45 |
| 25% | 6.272475 | 317.935411 | 57.285763 | 176.841063 | 15618.135242 | 6.113608 | 366.370724 | 12.038457 | 3.426266 |
| 50% | 7.080795 | 333.775777 | 66.396293 | 197.666301 | 20834.294278 | 7.144655 | 421.220228 | 14.13712 | 3.965647 |
| 75% | 7.84588 | 350.547757 | 77.182622 | 216.752872 | 27352.343415 | 8.165222 | 481.853415 | 16.38314 | 4.501457 |
| max | 14.0 | 481.030642 | 124.0 | 323.124 | 56867.859236 | 13.127 | 708.226364 | 28.3 | 6.494249 |


In [26]:
selected_predictors_summary = water_train[["Solids", "Conductivity", "Hardness", "Organic_carbon", "Chloramines"]].describe()
selected_predictors_summary

| | Solids | Conductivity | Hardness | Organic_carbon | Chloramines |
|---|---|---|---|---|---|
| count | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 |
| mean | 22001.393953 | 426.994169 | 196.583685 | 14.188246 | 7.119761 |
| std | 8766.263557 | 80.597055 | 33.293118 | 3.305373 | 1.591358 |
| min | 320.942611 | 201.619737 | 47.432 | 4.371899 | 0.352 |
| 25% | 15618.135242 | 366.370724 | 176.841063 | 12.038457 | 6.113608 |
| 50% | 20834.294278 | 421.220228 | 197.666301 | 14.13712 | 7.144655 |
| 75% | 27352.343415 | 481.853415 | 216.752872 | 16.38314 | 8.165222 |
| max | 56867.859236 | 708.226364 | 323.124 | 28.3 | 13.127 |


In [64]:
metric_hists = alt.Chart(water_train).mark_bar().encode(
    alt.X(alt.repeat("repeat"), type='quantitative', bin=True),
    alt.Y("count()", type='quantitative', stack=False),
    color=alt.Color("Potability", scale=alt.Scale(scheme = 'paired'))
).properties(
    width=200,
    height=200
).repeat(
    repeat=['Hardness', 'Solids', 
             'Chloramines', 'Conductivity', 'Organic_carbon'],
    columns=3
).properties(title="Selected Water Quality Metrics by Potability")

metric_hists

### Method

Although any of the variables may carry some information about potability, we decided to pick the five that differ most between the two classes, as measured by their relative difference. We first computed the mean of each variable for the not potable and the potable samples (the first and second pandas Series below). We then calculated the absolute relative difference between these means and kept the five largest values (the third pandas Series below). Based on this comparison, we chose "Solids", "Conductivity", "Hardness", "Organic_carbon" and "Chloramines" as the predictor variables and "Potability" as the response variable. Note that in the run shown below, "Trihalomethanes" ranks fourth and "Hardness" falls outside the top five. The class means differ by at most about 2.4 percent and the upsampling step is random, so this ranking can change between runs.

Using a K-nearest neighbors classifier, we will visualize the results with a confusion matrix and a plot of estimated accuracy versus the number of neighbors to evaluate our model. We will also report the accuracy on the testing dataset, as well as the precision and recall of the model.
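A minimal sketch of this evaluation workflow is given here, assuming a standardized K-nearest neighbors pipeline tuned by 5-fold cross-validation. The data is synthetic (generated with `make_classification` as a stand-in for the five selected predictors), and the grid of neighbor counts, the number of folds, and the `random_state` values are illustrative assumptions rather than choices fixed by this proposal.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for five predictors and a binary label (not the real data)
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=1
)

# Scale the predictors, then classify with K-nearest neighbors
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 30, 2))}
knn_grid = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
knn_grid.fit(X_train, y_train)

# Table behind the accuracy-versus-neighbors plot
accuracy_by_k = pd.DataFrame(knn_grid.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score"]
].rename(columns={
    "param_kneighborsclassifier__n_neighbors": "n_neighbors",
    "mean_test_score": "mean_cv_accuracy",
})

# Evaluate the best model on the held-out test set
test_predictions = knn_grid.predict(X_test)
test_accuracy = knn_grid.score(X_test, y_test)
cm = confusion_matrix(y_test, test_predictions)
precision = precision_score(y_test, test_predictions)
recall = recall_score(y_test, test_predictions)
print(accuracy_by_k.head(), test_accuracy, cm, precision, recall, sep="\n")
```

The `accuracy_by_k` table could then be passed to an Altair line chart with `n_neighbors` on the x-axis and `mean_cv_accuracy` on the y-axis.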


(136 words)


In [11]:
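# Mean of each numeric column for the not potable samples
np_means = water_data[water_data["Potability"] == "Not Potable"].mean(numeric_only=True)
display(np_means)

# Mean of each numeric column for the potable samples
p_means = water_data[water_data["Potability"] == "Potable"].mean(numeric_only=True)
display(p_means)

# Five largest absolute relative differences between the class means
((np_means - p_means)/np_means).abs().nlargest(5)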

ph                     7.084658
Sulfate              334.371700
Trihalomethanes       66.308522
Hardness             196.733292
Solids             21777.490788
Chloramines            7.092175
Conductivity         426.730454
Organic_carbon        14.364335
Turbidity              3.965800
dtype: float64

ph                     7.089844
Sulfate              333.045507
Trihalomethanes       67.109710
Hardness             195.437133
Solids             22301.681043
Chloramines            7.184823
Conductivity         422.764661
Organic_carbon        14.178265
Turbidity              3.957450
dtype: float64

Solids             0.024070
Chloramines        0.013063
Organic_carbon     0.012954
Trihalomethanes    0.012083
Conductivity       0.009293
dtype: float64
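For reference, the quantity ranked in the third Series can be written as follows, where $\bar{x}_j^{\mathrm{NP}}$ and $\bar{x}_j^{\mathrm{P}}$ are our own notation for the mean of variable $j$ over the not potable and potable samples:

$$
d_j = \left| \frac{\bar{x}_j^{\mathrm{NP}} - \bar{x}_j^{\mathrm{P}}}{\bar{x}_j^{\mathrm{NP}}} \right|
$$

As a check against the output for "Solids":

$$
d_{\text{Solids}} = \left| \frac{21777.490788 - 22301.681043}{21777.490788} \right| \approx 0.024070
$$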

### Expected Outcomes and Significance
**What do you expect to find?**

We expect to build a classifier that predicts a water sample's potability. In the original data, not potable samples (1998) outnumber potable ones (1278), so without rebalancing a classifier could lean towards predicting "Not Potable". After upsampling, the two classes are equally represented, so predictions should depend on the predictor values rather than on class frequency. For example, the potable water in our data has conductivity ranging from 307 to 562 μS/cm. If a new water sample has conductivity within this range, we would expect its chance of being potable to be higher. However, we cannot tell until we run the classification with the other four predictors for a more accurate prediction.

**What impact could such findings have?**

These findings could be helpful in screening drinking water. It is vital to know whether water is potable before it is sold or consumed. Given our classifier and measurements of the relevant variables for a water sample, we could input the data and obtain a prediction of whether the sample is potable. This could reduce the need for more extensive testing and lower the risk to someone's health. Additionally, our findings could show what range each variable falls in for potable water (e.g. conductivity ranging from 307 to 562 μS/cm, as mentioned). This would let researchers rule out certain samples quickly before testing for other variables. For example, if a water sample has a conductivity of 15 μS/cm, which is far below that range, it could reasonably be flagged as likely not potable before more time and resources are spent testing other variables.
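The range-based screening idea can be sketched as follows. The `toy` frame, its values, and the helper `outside_potable_range` are hypothetical and only illustrate the mechanics. Since the ranges of potable and not potable samples may overlap, a value inside the potable range does not by itself imply potability.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    "Conductivity": [310.0, 450.0, 560.0, 200.0, 480.0, 700.0],
    "Potability": ["Potable"] * 3 + ["Not Potable"] * 3,
})

# Observed minimum and maximum conductivity for each class
ranges = toy.groupby("Potability")["Conductivity"].agg(["min", "max"])
low = ranges.loc["Potable", "min"]
high = ranges.loc["Potable", "max"]


def outside_potable_range(value):
    """Return True if value lies outside the observed potable range."""
    return not (low <= value <= high)


print(ranges)
print(outside_potable_range(15.0))
```

A sample flagged this way would be a candidate for early rejection, while samples inside the range would still need the full classifier.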

**What future questions could this lead to?**

These findings could raise the question of whether variables not included in the dataset further affect potability. Although we will aim to maximize accuracy, the chance of achieving 100% accurate predictions is very low. Inaccurate predictions could prompt us to ask whether overlooked factors can make water not potable.
