## Water Quality Prediction Project Proposal
### Introduction

Water is a basic human need, yet many regions around the globe have limited access to clean, safe drinking water. A body of water's potability can be assessed using a range of water quality metrics, each of which indicates whether the water is safe to drink, or in other words, potable.

For this project, we will be utilizing the 'water_potability.csv' file retrieved from the [Water Quality dataset](https://www.kaggle.com/datasets/adityakadiwal/water-potability/data) on Kaggle, which consists of water quality metrics from 3276 distinct water bodies. 

<br>


Our **objective** with this project is to answer this question:


> **Can we predict the potability of water based on its quality metrics?**
   

<br>

The **variables** of this dataset are detailed below:

- **pH Value:** A measure of how acidic or alkaline water is. The WHO recommends a pH between 6.5 and 8.5.

- **Hardness:** Reflects the concentration of dissolved calcium and magnesium salts. Hardness was originally defined as the capacity of water to precipitate soap.

- **Solids (Total Dissolved Solids - TDS):** Denotes the concentration of dissolved minerals in water, affecting its taste and appearance.

- **Chloramines:** The concentration of chlorine and chloramine, the main disinfectants used in public water systems. Chlorine levels up to 4 mg/L are considered safe in drinking water.

- **Sulfate:** Found in many natural sources (groundwater, plants, food, etc.). Concentration varies greatly.

- **Conductivity:** A measure of water's ability to conduct electric current, primarily determined by the amount of dissolved solids in water. Pure water is not a good conductor, and WHO standards state that EC (electrical conductivity) should not exceed 400 μS/cm. 

- **Organic_carbon (Total Organic Carbon - TOC):** A measure of the total amount of carbon in organic compounds in pure water. Comes from decaying natural organic matter and synthetic sources. 

- **Trihalomethanes (THMs):** Chemicals that might be present in water that has been treated with chlorine. Their concentration varies with the level of organic material in the water, the temperature of the water, and the amount of chlorine needed to treat the water.

- **Turbidity:** The amount of solid matter suspended in water, which affects the water's transparency. It is measured by how strongly the water scatters light. The WHO recommends a value no higher than 5.00 NTU.

- **Potability:** Indicates if water is safe for human consumption or not. '1' is potable, '0' is not potable. 
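
As a rough illustration of how the numeric limits quoted in the list above could be checked against the data, the sketch below computes the share of samples at or below each limit. The helper name `share_within_limits`, the `LIMITS` table, and the four-row example frame are hypothetical and only serve the illustration; the thresholds are the ones mentioned for chloramines, conductivity, and turbidity.

```python
import pandas as pd

# Upper limits quoted in the variable descriptions (assumed to be inclusive).
LIMITS = {
    "Chloramines": 4.0,     # mg/L
    "Conductivity": 400.0,  # μS/cm
    "Turbidity": 5.0,       # NTU
}


def share_within_limits(df: pd.DataFrame, limits: dict) -> pd.Series:
    """Return, per column, the fraction of non-missing values at or below its limit."""
    shares = {
        col: (df[col].dropna() <= limit).mean()
        for col, limit in limits.items()
        if col in df.columns
    }
    return pd.Series(shares, name="share_within_limit")


# Made-up example rows, not taken from the dataset.
example = pd.DataFrame({
    "Chloramines": [3.5, 7.1, 9.8, 2.6],
    "Conductivity": [355.5, 530.7, 297.6, 460.1],
    "Turbidity": [4.1, 3.0, 5.4, 3.7],
})

shares = share_within_limits(example, LIMITS)
print(shares)
```

A summary like this would only describe the samples; it does not replace the `Potability` label, which is what the model is meant to predict.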




### Preliminary exploratory data analysis
To begin, we imported the libraries we might need and read the data from the web into Python. Because downloading directly from Kaggle requires authentication, we uploaded the dataset file to Google Drive and created a share link with public access. This way, we did not have to upload the file into Jupyter or use the Kaggle API. We then cleaned and wrangled the data and split it into training and testing sets.
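
For readers who want to reproduce the Google Drive step, the sketch below shows one way a share link could be turned into the `uc?id=` form that `pd.read_csv` can fetch. The function name `drive_share_to_direct` is hypothetical, and the `/file/d/<id>/view` layout of the share link is an assumption about how Google Drive formats such links; only the file id and the `uc?id=` URL come from this notebook.

```python
import re


def drive_share_to_direct(share_url: str) -> str:
    """Extract the file id from a Drive share link and build a uc?id= URL."""
    # Assumed share-link layouts: .../file/d/<id>/view... or ...?id=<id>
    match = re.search(r"/d/([\w-]+)", share_url) or re.search(r"[?&]id=([\w-]+)", share_url)
    if match is None:
        raise ValueError(f"No Google Drive file id found in: {share_url}")
    return f"https://drive.google.com/uc?id={match.group(1)}"


share_link = "https://drive.google.com/file/d/13N4nBi8cZCQUQambCexi0-XArwSghdrj/view?usp=sharing"
direct_link = drive_share_to_direct(share_link)
print(direct_link)
```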

In [61]:
#import commands
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.impute import SimpleImputer

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [62]:
raw_water_data = pd.read_csv('https://drive.google.com/uc?id=13N4nBi8cZCQUQambCexi0-XArwSghdrj')

missing_sum = raw_water_data.isnull().sum()

print(missing_sum)

total_rows = raw_water_data.shape

print(total_rows)

# shape is a (rows, columns) tuple, so divide by the row count only
percent_missing = ((missing_sum["ph"] + missing_sum["Sulfate"] + missing_sum["Trihalomethanes"]) / total_rows[0]) * 100

print(round(percent_missing, 2))

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64
(3276, 10)
43.77
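
Summing the per-column missing counts treats a row with two missing values as two rows, so the percentage printed above is an upper bound on the share of incomplete rows. The sketch below, on a made-up four-row frame (toy values, not from the dataset), contrasts that summed figure with the per-column percentages and with the share of rows that have at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 is missing two values, rows 2 and 3 are missing one each.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, np.nan],
    "Sulfate": [330.0, np.nan, np.nan, 310.0],
    "Turbidity": [4.0, 3.5, 3.9, 4.2],
})

n_rows = toy.shape[0]  # shape is (rows, columns); index 0 is the row count

# Percentage of missing cells in each column.
percent_missing_by_column = toy.isnull().mean() * 100

# Summed missing counts relative to the row count (row 1 is counted twice).
percent_summed = toy.isnull().sum().sum() / n_rows * 100

# Percentage of rows with at least one missing value.
percent_incomplete_rows = toy.isnull().any(axis=1).mean() * 100

print(percent_missing_by_column)
print(percent_summed, percent_incomplete_rows)
```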


In [70]:
preprocessor_missing = make_column_transformer(
    (SimpleImputer(), ["ph", "Sulfate", "Trihalomethanes"]),
    remainder='passthrough',
    verbose_feature_names_out=False
)

preprocessor_missing.fit(raw_water_data)
water_data = preprocessor_missing.transform(raw_water_data)


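# Relabel only the response column; a frame-wide replace would also
# overwrite any predictor value that is exactly 0 or 1
water_data["Potability"] = water_data["Potability"].replace({
    0: "Not Potable",
    1: "Potable"
})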

water_data.isnull().sum()

ph                 0
Sulfate            0
Trihalomethanes    0
Hardness           0
Solids             0
Chloramines        0
Conductivity       0
Organic_carbon     0
Turbidity          0
Potability         0
dtype: int64

In [71]:
water_data["Potability"].value_counts()

Not Potable    1998
Potable        1278
Name: Potability, dtype: int64

In [72]:
np_water = water_data[water_data["Potability"] == "Not Potable"]
p_water = water_data[water_data["Potability"] == "Potable"]
p_water_upsampled = resample(
    p_water, n_samples=np_water.shape[0]
)
upsampled_water = pd.concat((p_water_upsampled, np_water))
upsampled_water['Potability'].value_counts()

Potable        1998
Not Potable    1998
Name: Potability, dtype: int64
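
Upsampling draws rows with replacement and therefore creates exact duplicates. If the train/test split happens afterwards, copies of the same row can end up in both sets, which may inflate the test accuracy. One possible alternative, sketched here on synthetic data, is to split first and upsample only the training portion. The function name `upsample_minority`, the synthetic frame, and the fixed `random_state` values are illustrative assumptions and not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


def upsample_minority(train: pd.DataFrame, label_col: str, random_state: int = 0) -> pd.DataFrame:
    """Resample every smaller class with replacement up to the size of the largest class."""
    counts = train[label_col].value_counts()
    target = counts.max()
    parts = []
    for label, count in counts.items():
        group = train[train[label_col] == label]
        if count < target:
            group = resample(group, replace=True, n_samples=target, random_state=random_state)
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)


# Synthetic stand-in data with a 60/40 class imbalance.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Solids": rng.normal(22000, 9000, size=100),
    "Potability": ["Not Potable"] * 60 + ["Potable"] * 40,
})

# Split first, so the test set contains only original, non-duplicated rows.
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["Potability"], random_state=0
)
balanced_train = upsample_minority(toy_train, "Potability")
print(balanced_train["Potability"].value_counts())
```

With this ordering the test set keeps the original class proportions, so accuracy on it should be read alongside the majority-class baseline.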

In [110]:
water_data = upsampled_water.reset_index(drop=True)
water_data.head(10)

| | ph | Sulfate | Trihalomethanes | Hardness | Solids | Chloramines | Conductivity | Organic_carbon | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.080795 | 417.911837 | 68.388378 | 246.239763 | 6974.225135 | 9.812468 | 355.53247 | 13.330091 | 4.138697 | Potable |
| 1 | 6.817608 | 340.937614 | 68.568791 | 219.337429 | 27548.614177 | 6.298121 | 530.710026 | 19.486791 | 3.048292 | Potable |
| 2 | 5.729303 | 415.287072 | 76.744677 | 162.857585 | 34573.678786 | 5.654856 | 297.631365 | 13.990842 | 3.556395 | Potable |
| 3 | 7.080795 | 412.407278 | 33.56304 | 176.772186 | 19486.191112 | 7.960133 | 276.58861 | 16.710056 | 3.888177 | Potable |
| 4 | 7.080795 | 365.080963 | 48.254307 | 155.675777 | 52060.2268 | 2.577555 | 323.001036 | 14.166602 | 2.000757 | Potable |
| 5 | 6.991685 | 331.252916 | 69.670038 | 152.976217 | 20389.593816 | 2.64839 | 460.146174 | 15.485378 | 3.724824 | Potable |
| 6 | 6.191241 | 248.304391 | 54.060851 | 231.322797 | 29778.357877 | 4.381097 | 488.954164 | 13.022135 | 3.315071 | Potable |
| 7 | 5.632732 | 318.465146 | 67.651025 | 283.997284 | 28315.437777 | 7.144655 | 425.984213 | 11.813231 | 5.114607 | Potable |
| 8 | 7.186931 | 295.834151 | 62.756891 | 177.486533 | 34510.752995 | 4.984432 | 477.994992 | 16.77754 | 4.275645 | Potable |
| 9 | 6.260111 | 340.792574 | 82.365378 | 211.594112 | 18577.623969 | 7.154891 | 357.098395 | 7.99221 | 5.403615 | Potable |


In [111]:
water_train, water_test = train_test_split(
    water_data, train_size=0.75, stratify=water_data["Potability"]
)
print(water_train.info())
print(water_test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2997 entries, 2511 to 720
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2997 non-null   object 
 1   Sulfate          2997 non-null   float64
 2   Trihalomethanes  2997 non-null   float64
 3   Hardness         2997 non-null   float64
 4   Solids           2997 non-null   float64
 5   Chloramines      2997 non-null   float64
 6   Conductivity     2997 non-null   float64
 7   Organic_carbon   2997 non-null   float64
 8   Turbidity        2997 non-null   float64
 9   Potability       2997 non-null   object 
dtypes: float64(8), object(2)
memory usage: 257.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 999 entries, 1759 to 1789
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               999 non-null    object 
 1   Sulfate          999 non-nul

In [115]:
ph_summary = water_train['ph'].describe()
print(ph_summary)

predictor_vals_summary = pd.DataFrame(water_train.describe())
predictor_vals_summary

count     2997.000000
unique    1986.000000
top          7.080795
freq       479.000000
Name: ph, dtype: float64


| | Sulfate | Trihalomethanes | Hardness | Solids | Chloramines | Conductivity | Organic_carbon | Turbidity |
|---|---|---|---|---|---|---|---|---|
| count | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 |
| mean | 333.10269 | 66.337539 | 197.027186 | 22328.96932 | 7.131579 | 426.287034 | 14.21425 | 3.95021 |
| std | 37.287367 | 15.864046 | 33.099053 | 8986.600998 | 1.61697 | 78.987433 | 3.27673 | 0.7718 |
| min | 129.0 | 8.175876 | 73.492234 | 320.942611 | 1.390871 | 201.619737 | 2.2 | 1.492207 |
| 25% | 316.820248 | 56.460386 | 177.233643 | 15825.182571 | 6.121662 | 366.370724 | 11.887161 | 3.412853 |
| 50% | 333.775777 | 66.396293 | 197.685838 | 21165.590649 | 7.123611 | 420.219104 | 14.15411 | 3.937188 |
| 75% | 349.948941 | 77.241984 | 217.936799 | 27795.732041 | 8.142513 | 480.143109 | 16.542921 | 4.493747 |
| max | 481.030642 | 120.030077 | 323.124 | 61227.196008 | 13.127 | 708.226364 | 27.006707 | 6.494749 |


### Method
Explain how you will conduct your data analysis and which variables/columns you will use. Note: you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable, think: is this a useful variable for prediction?
Describe at least one way that you will visualize the results.


In [35]:
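# Mean of each numeric column per class (new names, so the np_water and
# p_water DataFrames from the upsampling step are not overwritten)
np_means = water_data[water_data["Potability"] == "Not Potable"].mean(numeric_only=True)
display(np_means)

p_means = water_data[water_data["Potability"] == "Potable"].mean(numeric_only=True)
display(p_means)

# Normalized absolute difference between the class means; keep the five largest
((np_means - p_means)/(np_means + p_means)).abs().nlargest(5)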

ph                     7.067201
Hardness             196.008440
Solids             21628.535122
Chloramines            7.107267
Sulfate              333.742928
Conductivity         427.554342
Organic_carbon        14.400250
Trihalomethanes       66.278712
Turbidity              3.955181
dtype: float64

ph                     7.113791
Hardness             195.908341
Solids             22344.922883
Chloramines            7.174395
Sulfate              332.457832
Conductivity         425.005423
Organic_carbon        14.294764
Trihalomethanes       66.581596
Turbidity              3.991254
dtype: float64

Solids            0.016291
Chloramines       0.004700
Turbidity         0.004539
Organic_carbon    0.003676
ph                0.003285
dtype: float64

Judging from the normalized differences in the last pandas Series, we will use "Solids", "Chloramines", "Turbidity", "Organic_carbon" and "ph" as the predictor variables and "Potability" as the response variable.
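
With the predictors chosen, one way the classification step could look, given the libraries imported at the top of the notebook, is a K-nearest-neighbors pipeline tuned by cross-validation. This is a sketch under assumptions: the choice of KNN, the grid of odd `n_neighbors` values, the 5-fold setting, and the synthetic stand-in data are not stated in the proposal and are used here only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

predictors = ["Solids", "Chloramines", "Turbidity", "Organic_carbon", "ph"]

# Synthetic stand-in data with the same column names (not the real dataset).
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(200, len(predictors))), columns=predictors)
demo["Potability"] = np.where(demo["Solids"] + demo["ph"] > 0, "Potable", "Not Potable")

demo_train, demo_test = train_test_split(
    demo, train_size=0.75, stratify=demo["Potability"], random_state=42
)

# Standardize the predictors first, because KNN relies on distances.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Try odd values of k and score each with 5-fold cross-validation.
param_grid = {"kneighborsclassifier__n_neighbors": range(1, 31, 2)}
grid = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
grid.fit(demo_train[predictors], demo_train["Potability"])

# Cross-validated accuracy for each k, plus accuracy on the held-out test set.
cv_results = pd.DataFrame(grid.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score"]
]
test_accuracy = grid.score(demo_test[predictors], demo_test["Potability"])

print(grid.best_params_)
print(test_accuracy)
```

The `cv_results` table lends itself to one possible visualization: a line chart of mean cross-validated accuracy against the number of neighbors, which shows how sensitive the classifier is to the choice of k.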