## Water Quality Prediction Project Proposal
### Introduction

Water is one of the most basic human needs, yet many regions around the globe have limited access to clean, safe drinking water. A body of water's potability can be assessed using a range of water quality metrics, each of which serves as an indicator of whether the water is safe to drink, or in other words, potable.

For this project, we use the 'water_potability.csv' file from the [Water Quality dataset](https://www.kaggle.com/datasets/adityakadiwal/water-potability/data) on Kaggle, which contains water quality metrics for 3276 distinct water bodies.

<br>


Our **objective** with this project is to answer this question:


> **Can we predict the potability of water based on its quality metrics?**
   

<br>

The **variables** of this dataset are detailed below:

- **pH Value:** A measure of how acidic or alkaline the water is. Recorded values span roughly 0.23 to 14 in the training-set summary below, so not every sample falls within the WHO's recommended limits.

- **Hardness:** The concentration of dissolved calcium and magnesium salts, which determines the water's capacity to precipitate soap.

- **Solids (Total Dissolved Solids - TDS):** Denotes the concentration of dissolved minerals in water, affecting its taste and appearance.

- **Chloramines:** The concentration of chlorine and chloramine, the main disinfectants used in public water systems. Chlorine levels up to 4 mg/L are considered safe in drinking water.

- **Sulfate:** Found in many natural sources (groundwater, plants, food, etc.). Concentration varies greatly.

- **Conductivity:** A measure of water's ability to conduct electric current, primarily determined by the amount of dissolved solids in water. Pure water is not a good conductor, and WHO standards state that EC (electrical conductivity) should not exceed 400 μS/cm. 

- **Organic_carbon (Total Organic Carbon - TOC):** A measure of the total amount of carbon in organic compounds in pure water. Comes from decaying natural organic matter and synthetic sources. 

- **Trihalomethanes (THMs):** Chemicals that may be present in water that has been treated with chlorine. Their concentration varies with the level of organic material in the water, the temperature of the water, and the amount of chlorine needed to treat it.

- **Turbidity:** The amount of solid matter suspended in the water, which reduces its transparency. It is measured through the water's light-scattering properties. The WHO-recommended value is 5.00 NTU.

- **Potability:** Indicates if water is safe for human consumption or not. '1' is potable, '0' is not potable. 

(389 words)
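
The limits quoted in the list above can be turned into a quick screening check. The sketch below is illustrative only: `LIMITS` and `exceeded_limits` are hypothetical names, the thresholds are simply the ones stated above, and the sample values are copied from the first row of the upsampled table shown later in this proposal.

```python
# Upper limits quoted in the variable descriptions above.
LIMITS = {
    "Chloramines": 4.0,     # mg/L
    "Conductivity": 400.0,  # μS/cm
    "Turbidity": 5.0,       # NTU
}


def exceeded_limits(sample, limits=LIMITS):
    """Return the names of metrics whose value is above its quoted limit."""
    return [
        name
        for name, limit in limits.items()
        if name in sample and sample[name] > limit
    ]


# Values from the first row of the upsampled table in this proposal.
sample = {"Chloramines": 6.349561, "Conductivity": 403.61756, "Turbidity": 4.390702}
flags = exceeded_limits(sample)
print(flags)
```

A check like this only screens individual metrics against fixed limits. It is not a substitute for the classifier, which learns from all of the selected predictors together.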


### Preliminary exploratory data analysis
To begin, we imported the libraries we expected to need and read the data from the web into Python. Because Kaggle requires authentication to download datasets, we uploaded the dataset file to Google Drive and created a publicly accessible share link. This way, we did not have to upload the file into Jupyter directly or use the Kaggle API. We then cleaned and wrangled the data (imputing missing values and balancing the two classes) and split it into training and testing sets.
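
As a rough illustration of the share-link step, the sketch below converts a Drive share link into the `uc?id=` form that the notebook passes to `pd.read_csv`. The helper name `drive_direct_url`, the assumed share-link layout (`/file/d/<id>/view`), and the placeholder file ID are all assumptions made for this example.

```python
import re


def drive_direct_url(share_link):
    """Convert a Drive share link into a uc?id= link (assumed link layout)."""
    match = re.search(r"/d/([\w-]+)", share_link)
    if match is None:
        raise ValueError("no file ID found in share link")
    return f"https://drive.google.com/uc?id={match.group(1)}"


# FILE_ID_PLACEHOLDER is a made-up ID used only for illustration.
share_link = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
direct_url = drive_direct_url(share_link)
print(direct_url)
```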

In [1]:
#import commands
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [2]:
raw_water_data = pd.read_csv('https://drive.google.com/uc?id=13N4nBi8cZCQUQambCexi0-XArwSghdrj')

missing_sum = raw_water_data.isnull().sum()

print(missing_sum)

# shape is a (rows, columns) tuple, so take the row count explicitly
total_rows = raw_water_data.shape[0]

print(total_rows)

# Missing cells in the three affected columns as a percentage of the row count
percent_missing = ((missing_sum["ph"] + missing_sum["Sulfate"] + missing_sum["Trihalomethanes"]) / total_rows) * 100

print(f"{percent_missing:.2f}")

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64
3276
43.77
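
The calculation above adds the missing counts of three columns before dividing by the number of rows, so a row with two missing values is counted twice. The sketch below shows two alternative summaries on made-up toy data: the percentage missing per column, and the percentage of rows with at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real file; the values are made up.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, np.nan],
    "Sulfate": [330.0, 340.0, np.nan, np.nan],
    "Hardness": [200.0, 190.0, 210.0, 180.0],
})

# Percentage of missing cells in each column.
percent_missing_by_column = toy.isnull().mean() * 100

# Percentage of rows with at least one missing value.
percent_rows_incomplete = toy.isnull().any(axis=1).mean() * 100

print(percent_missing_by_column)
print(percent_rows_incomplete)
```

In this toy frame, summing the per-column counts gives four missing cells in four rows, yet only three of the four rows are incomplete, which is why the two summaries can disagree.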


In [3]:
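# Fill missing values in the three affected columns with the column mean
# (the default SimpleImputer strategy); all other columns pass through unchanged
preprocessor_missing = make_column_transformer(
    (SimpleImputer(), ["ph", "Sulfate", "Trihalomethanes"]),
    remainder='passthrough',
    verbose_feature_names_out=False
)

preprocessor_missing.fit(raw_water_data)
water_data = preprocessor_missing.transform(raw_water_data)


# Replace the numeric class codes with readable labels
water_data["Potability"] = water_data["Potability"].replace({
    0: "Not Potable",
    1: "Potable"
})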

water_data.isnull().sum()

ph                 0
Sulfate            0
Trihalomethanes    0
Hardness           0
Solids             0
Chloramines        0
Conductivity       0
Organic_carbon     0
Turbidity          0
Potability         0
dtype: int64
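
To make the imputation step concrete, here is a minimal sketch of mean imputation with `SimpleImputer` on made-up values. The array `X` is hypothetical. The strategy is written out explicitly, although `"mean"` is also the default that the cell above relies on.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up values; np.nan marks a missing measurement.
X = np.array([
    [6.0, 300.0],
    [np.nan, 340.0],
    [8.0, np.nan],
])

imputer = SimpleImputer(strategy="mean")  # "mean" is also the default
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # column means learned during fit
print(X_filled)
```

Every missing cell in a column receives the same value, which is consistent with the repeated entries (for example a Sulfate of 333.775777) in the table printed further below.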

In [4]:
water_data["Potability"].value_counts()

Not Potable    1998
Potable        1278
Name: Potability, dtype: int64

In [5]:
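# Separate the two classes
np_water = water_data[water_data["Potability"] == "Not Potable"]
p_water = water_data[water_data["Potability"] == "Potable"]
# Upsample the minority (potable) class with replacement to match the majority class size
p_water_upsampled = resample(
    p_water, n_samples=np_water.shape[0]
)
# Recombine into a balanced dataset
upsampled_water = pd.concat((p_water_upsampled, np_water))
upsampled_water['Potability'].value_counts()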

Potable        1998
Not Potable    1998
Name: Potability, dtype: int64
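
The cell above draws a different random sample each time it runs. As a sketch of how the same step could be made repeatable, the example below upsamples made-up toy data with a fixed `random_state`. The seed value 123 and the toy frame are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up, imbalanced toy data: 5 "Not Potable" rows and 2 "Potable" rows.
toy = pd.DataFrame({
    "ph": [6.1, 6.8, 7.2, 7.9, 8.3, 6.9, 7.4],
    "Potability": ["Not Potable"] * 5 + ["Potable"] * 2,
})

majority = toy[toy["Potability"] == "Not Potable"]
minority = toy[toy["Potability"] == "Potable"]

# Sample the minority class with replacement until it matches the majority.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=123
)
balanced = pd.concat([minority_upsampled, majority])
counts = balanced["Potability"].value_counts()
print(counts)
```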

In [6]:
water_data = upsampled_water.reset_index(drop=True)
water_data.head(10)

|   | ph | Sulfate | Trihalomethanes | Hardness | Solids | Chloramines | Conductivity | Organic_carbon | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.893103 | 341.256362 | 63.846319 | 239.269481 | 20526.666156 | 6.349561 | 403.61756 | 18.963707 | 4.390702 | Potable |
| 1 | 9.001823 | 355.006426 | 61.514342 | 263.514188 | 15379.912763 | 6.473674 | 561.687003 | 22.045457 | 3.976873 | Potable |
| 2 | 6.552847 | 274.904351 | 78.173063 | 198.80694 | 34006.420733 | 8.691206 | 477.163907 | 14.36963 | 4.687986 | Potable |
| 3 | 10.761898 | 318.427241 | 52.246972 | 81.710895 | 25999.953669 | 8.477394 | 392.704082 | 12.71635 | 4.661799 | Potable |
| 4 | 7.30099 | 333.775777 | 49.895342 | 182.447697 | 29136.338677 | 8.253015 | 307.433303 | 8.730149 | 4.596347 | Potable |
| 5 | 4.037288 | 333.775777 | 87.8148 | 291.461897 | 52318.917298 | 7.779459 | 401.204271 | 16.542921 | 3.045049 | Potable |
| 6 | 7.080795 | 266.639384 | 28.400877 | 205.235194 | 22613.297485 | 6.48581 | 313.009639 | 11.623605 | 3.978495 | Potable |
| 7 | 5.324942 | 180.206746 | 55.084668 | 280.089655 | 35344.658047 | 13.043806 | 392.421496 | 10.50482 | 4.427138 | Potable |
| 8 | 7.080795 | 333.775777 | 78.499418 | 188.743562 | 19037.462638 | 6.034236 | 388.065857 | 15.149068 | 2.723651 | Potable |
| 9 | 5.803497 | 255.976746 | 8.577013 | 193.200991 | 19451.767603 | 4.146601 | 365.477618 | 14.920616 | 2.181714 | Potable |


In [7]:
water_train, water_test = train_test_split(
    water_data, train_size=0.75, stratify=water_data["Potability"]
)
# Every predictor is already float64 (see the info output below), so no numeric conversion is needed
print(water_train.info())
print(water_test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2997 entries, 2051 to 745
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2997 non-null   float64
 1   Sulfate          2997 non-null   float64
 2   Trihalomethanes  2997 non-null   float64
 3   Hardness         2997 non-null   float64
 4   Solids           2997 non-null   float64
 5   Chloramines      2997 non-null   float64
 6   Conductivity     2997 non-null   float64
 7   Organic_carbon   2997 non-null   float64
 8   Turbidity        2997 non-null   float64
 9   Potability       2997 non-null   object 
dtypes: float64(9), object(1)
memory usage: 257.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 999 entries, 1085 to 3519
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               999 non-null    float64
 1   Sulfate          999 non-nul
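
In the cells above, upsampling happens before the split, so copies of the same potable row can land in both the training and testing sets. One possible variation, sketched below on made-up toy data, is to split first and upsample only the training portion. This is not what the notebook does; the toy frame and the seed 123 are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up toy data with a 12:8 class imbalance.
toy = pd.DataFrame({
    "Solids": range(20),
    "Potability": ["Not Potable"] * 12 + ["Potable"] * 8,
})

# 1. Split first, so every test row is an original, unseen observation.
train, test = train_test_split(
    toy, train_size=0.75, stratify=toy["Potability"], random_state=123
)

# 2. Upsample the minority class inside the training set only.
train_major = train[train["Potability"] == "Not Potable"]
train_minor = train[train["Potability"] == "Potable"]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=123
)
train_balanced = pd.concat([train_major, train_minor_up])

# No original row appears in both sets.
overlap = set(train_balanced.index) & set(test.index)
print(len(overlap))
print(train_balanced["Potability"].value_counts())
```

With this ordering the test set keeps the original class proportions, so test accuracy, precision, and recall describe performance on data the model has not seen in any form.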

In [8]:
predictor_vals_summary = water_train.describe()
predictor_vals_summary

|   | ph | Sulfate | Trihalomethanes | Hardness | Solids | Chloramines | Conductivity | Organic_carbon | Turbidity |
|---|---|---|---|---|---|---|---|---|---|
| count | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 |
| mean | 7.071013 | 333.773046 | 66.881703 | 195.444185 | 22053.533185 | 7.132274 | 424.320805 | 14.186205 | 3.957547 |
| std | 1.438726 | 37.090423 | 15.705247 | 33.247841 | 8713.259761 | 1.622079 | 81.164531 | 3.323279 | 0.782751 |
| min | 0.227499 | 180.206746 | 0.738 | 73.492234 | 1198.943699 | 0.352 | 201.619737 | 4.371899 | 1.496101 |
| 25% | 6.284985 | 316.753889 | 57.280265 | 175.588146 | 15704.482093 | 6.110295 | 363.235756 | 11.992772 | 3.415851 |
| 50% | 7.080795 | 333.775777 | 66.396293 | 195.6658 | 21043.626929 | 7.144655 | 419.534133 | 14.098786 | 3.93221 |
| 75% | 7.830608 | 350.681732 | 76.824752 | 215.479068 | 27283.780655 | 8.151081 | 479.825497 | 16.470528 | 4.498685 |
| max | 14.0 | 481.030642 | 124.0 | 323.124 | 61227.196008 | 13.043806 | 695.369528 | 28.3 | 6.739 |


In [9]:
selected_predictors_summary = water_train[["Solids", "Conductivity", "Hardness", "Organic_carbon", "Chloramines"]].describe()
selected_predictors_summary

|   | Solids | Conductivity | Hardness | Organic_carbon | Chloramines |
|---|---|---|---|---|---|
| count | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 |
| mean | 22053.533185 | 424.320805 | 195.444185 | 14.186205 | 7.132274 |
| std | 8713.259761 | 81.164531 | 33.247841 | 3.323279 | 1.622079 |
| min | 1198.943699 | 201.619737 | 73.492234 | 4.371899 | 0.352 |
| 25% | 15704.482093 | 363.235756 | 175.588146 | 11.992772 | 6.110295 |
| 50% | 21043.626929 | 419.534133 | 195.6658 | 14.098786 | 7.144655 |
| 75% | 27283.780655 | 479.825497 | 215.479068 | 16.470528 | 8.151081 |
| max | 61227.196008 | 695.369528 | 323.124 | 28.3 | 13.043806 |


In [10]:
unscaled_water_train_plot = alt.Chart(water_train).mark_circle(opacity=0.4).encode(
    x=alt.X("Solids").title("Total Dissolved Solids (TDS)").scale(zero=False),
    y=alt.Y("Organic_carbon").title("Total Organic Carbon (TOC)").scale(zero=False),
    color=alt.Color("Potability")
)
unscaled_water_train_plot
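
The plot above uses unscaled values, and the predictors sit on very different scales (Solids averages about 22,054 while Chloramines averages about 7.13 in the training-set summary). Since a nearest-neighbors classifier relies on distances, the sketch below shows how standardizing changes which variable dominates a comparison. The means and standard deviations are copied from the training-set summary above. The two samples `a` and `b` and the helper `standardize` are hypothetical.

```python
# Means and standard deviations copied from the training-set summary above.
STATS = {
    "Solids": (22053.533185, 8713.259761),
    "Chloramines": (7.132274, 1.622079),
}


def standardize(value, mean, std):
    """Return the z-score: how many standard deviations value lies from mean."""
    return (value - mean) / std


# Two hypothetical samples, chosen only to illustrate the effect of scale.
a = {"Solids": 20000.0, "Chloramines": 4.0}
b = {"Solids": 21000.0, "Chloramines": 10.0}

# Per-variable gap between the two samples, before and after scaling.
raw_gap = {k: abs(a[k] - b[k]) for k in STATS}
scaled_gap = {
    k: abs(standardize(a[k], *STATS[k]) - standardize(b[k], *STATS[k]))
    for k in STATS
}
print(raw_gap)     # Solids dominates the raw difference
print(scaled_gap)  # Chloramines dominates after scaling
```

`StandardScaler` applies the same idea inside a pipeline, although it uses the population standard deviation, so its results differ very slightly from the `describe()` figures used here.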


### Method

Although all of the variables may contribute to classifying the potability of water, we decided to pick the five that contribute most, judged by the relative difference between their class means. We first computed the mean of each variable for the not potable and potable samples (the first and second pandas Series below). We then calculated the relative difference between these means and kept the five largest values (the third pandas Series). Based on these values, we chose "Solids", "Conductivity", "Hardness", "Organic_carbon" and "Chloramines" as the predictor variables and "Potability" as the response variable. Because the potable rows are upsampled at random, the ranking can shift between runs; in the output shown below, Trihalomethanes rather than Hardness appears in the top five.

We will visualize the results with a confusion matrix and a plot of estimated accuracy against the number of neighbors, which we will use to tune and evaluate our K-nearest neighbors model. We will also report the model's accuracy on the testing set, along with its precision and recall.


(136 words)
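
The evaluation plan above could be sketched roughly as follows. This is a minimal outline under stated assumptions, not the final analysis: the data are synthetic (`make_classification` stands in for the five selected predictors), and the grid of odd neighbor counts from 1 to 29 and the seed 123 are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five selected predictors; not the water data.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=123
)

# Scale inside the pipeline so each CV fold is standardized on its own training part.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))},
    cv=5,
)
grid.fit(X_train, y_train)

# One (K, mean CV accuracy) pair per candidate; these feed the accuracy-versus-K plot.
cv_scores = list(zip(
    grid.cv_results_["param_kneighborsclassifier__n_neighbors"],
    grid.cv_results_["mean_test_score"],
))
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]

# Evaluate the refitted best model on the held-out test set.
y_pred = grid.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
print(best_k, test_accuracy)
print(precision_score(y_test, y_pred), recall_score(y_test, y_pred))
print(matrix)
```

With the real data, the string labels "Potable" and "Not Potable" would require passing `pos_label` to `precision_score` and `recall_score`.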


In [11]:
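# Mean of each numeric column for the not potable samples
np_means = water_data[water_data["Potability"] == "Not Potable"].mean(numeric_only=True)
display(np_means)

# Mean of each numeric column for the potable samples
p_means = water_data[water_data["Potability"] == "Potable"].mean(numeric_only=True)
display(p_means)

# Relative difference between the class means; keep the five largest
((np_means - p_means) / np_means).abs().nlargest(5)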

ph                     7.084658
Sulfate              334.371700
Trihalomethanes       66.308522
Hardness             196.733292
Solids             21777.490788
Chloramines            7.092175
Conductivity         426.730454
Organic_carbon        14.364335
Turbidity              3.965800
dtype: float64

ph                     7.089844
Sulfate              333.045507
Trihalomethanes       67.109710
Hardness             195.437133
Solids             22301.681043
Chloramines            7.184823
Conductivity         422.764661
Organic_carbon        14.178265
Turbidity              3.957450
dtype: float64

Solids             0.024070
Chloramines        0.013063
Organic_carbon     0.012954
Trihalomethanes    0.012083
Conductivity       0.009293
dtype: float64
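
As a check on the third series, the quantity computed in the last line of the cell can be written as follows, where $\bar{x}_{\text{NP}}$ and $\bar{x}_{\text{P}}$ denote the not potable and potable means of a variable (notation introduced here for illustration):

$$
\text{relative difference} = \left| \frac{\bar{x}_{\text{NP}} - \bar{x}_{\text{P}}}{\bar{x}_{\text{NP}}} \right|
$$

Plugging in the Solids means from the first two series reproduces the top entry:

$$
\left| \frac{21777.490788 - 22301.681043}{21777.490788} \right| = \frac{524.190255}{21777.490788} \approx 0.024070
$$

The largest relative difference is therefore about 2.4%, and the remaining four are closer to 1%.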