## Water Quality Prediction Project Proposal
### Introduction

Water is one of the most basic human needs, yet many regions around the globe have limited access to clean, safe drinking water. A body of water's potability can be assessed using a range of water quality metrics, each of which indicates whether the water is safe to drink, or in other words, potable.

For this project, we will be utilizing the 'water_potability.csv' file retrieved from the [Water Quality dataset](https://www.kaggle.com/datasets/adityakadiwal/water-potability/data) on Kaggle, which consists of water quality metrics from 3276 distinct water bodies. 

<br>


Our **objective** with this project is to answer this question:


> **Can we predict the potability of water based on its quality metrics?**
   

<br>

The **variables** of this dataset are detailed below:

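- **pH Value:** A measure of how acidic or alkaline water is. The WHO recommends a pH between 6.5 and 8.5 for drinking water. The training data summarized further down ranges from 0 to about 13.5, so not every sample falls within these limits.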

- **Hardness:** Indicates the presence of calcium and magnesium salts, which are vital determinants of water's ability to precipitate soap.

- **Solids (Total Dissolved Solids - TDS):** Denotes the concentration of dissolved minerals in water, affecting its taste and appearance.

- **Chloramines:** The concentration of chlorine and chloramine, the main disinfectants used in public water systems. Chlorine levels up to 4 mg/L are considered safe in drinking water.

- **Sulfate:** Found in many natural sources (groundwater, plants, food, etc.). Concentration varies greatly.

- **Conductivity:** A measure of water's ability to conduct electric current, primarily determined by the amount of dissolved solids in water. Pure water is not a good conductor, and WHO standards state that EC (electrical conductivity) should not exceed 400 μS/cm. 

- **Organic_carbon (Total Organic Carbon - TOC):** A measure of the total amount of carbon in organic compounds in pure water. Comes from decaying natural organic matter and synthetic sources. 

- **Trihalomethanes (THMs):** Chemicals that might be present in water that has been treated with chlorine. Concentration fluctuates according to level of organic material in the water, the temperature of the water, and the amount of chlorine needed to treat the water.

- **Turbidity:** The amount of solid matter suspended in water, which affects the water's transparency. It is measured by how much the water scatters light. The WHO recommends a value no higher than 5.00 NTU.

- **Potability:** Indicates if water is safe for human consumption or not. '1' is potable, '0' is not potable. 




### Preliminary exploratory data analysis
To begin, we imported the libraries we expected to need and read the data from the web into Python. Because Kaggle downloads require authentication, we uploaded the dataset file to Google Drive and created a publicly accessible share link. This way, we did not have to upload the file directly into Jupyter or use the Kaggle API. We then cleaned and wrangled the data and split it into training and testing sets.
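A share link can be turned into a direct-download URL of the `uc?id=` form that `pd.read_csv` accepts. The sketch below assumes the common share-link shape `https://drive.google.com/file/d/<FILE_ID>/view?...`. The helper name `drive_csv_url` and the placeholder ID are hypothetical, and the file must be shared publicly for a later download to succeed.

```python
import re


def drive_csv_url(share_link: str) -> str:
    """Convert a Google Drive share link into a direct-download URL (assumed link format)."""
    match = re.search(r"/file/d/([^/]+)", share_link)
    if match is None:
        raise ValueError("Could not find a file ID in the share link")
    return f"https://drive.google.com/uc?id={match.group(1)}"


# Placeholder ID for illustration only; substitute the ID from a real share link.
example_link = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
print(drive_csv_url(example_link))
```

The returned string can then be passed to `pd.read_csv` in place of a local file path.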

In [2]:
#import commands
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.impute import SimpleImputer

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [3]:
raw_water_data = pd.read_csv('https://drive.google.com/uc?id=13N4nBi8cZCQUQambCexi0-XArwSghdrj')

missing_sum = raw_water_data.isnull().sum()

print(missing_sum)

# .shape is a (rows, columns) tuple, so take the first element for the row count
total_rows = raw_water_data.shape[0]

print(total_rows)

# Missing cells in the three affected columns, as a percentage of the row count
percent_missing = ((missing_sum["ph"] + missing_sum["Sulfate"] + missing_sum["Trihalomethanes"]) / total_rows) * 100

print(round(percent_missing, 2))

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64
3276
43.77
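The percentage above divides the number of missing cells by the number of rows, so a row with gaps in several columns is counted more than once. As a hedged illustration on toy data (not the real dataset), the sketch below contrasts that cell-based figure with the share of rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy data (not the real dataset): row 0 has two missing cells, row 2 has one
toy = pd.DataFrame({
    "ph": [np.nan, 7.0, 6.5, 8.1],
    "Sulfate": [np.nan, 330.0, np.nan, 310.0],
    "Trihalomethanes": [60.0, 70.0, 80.0, 90.0],
})

# Cell-based figure: total missing cells divided by the number of rows
cells_pct = toy.isnull().sum().sum() / len(toy) * 100

# Row-based figure: rows with at least one missing value
rows_pct = toy.isnull().any(axis=1).mean() * 100

print(cells_pct, rows_pct)
```

The row-based figure is the one to use when deciding how many observations would be lost by dropping incomplete rows instead of imputing them.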


In [32]:
preprocessor_missing = make_column_transformer(
    (SimpleImputer(), ["ph", "Sulfate", "Trihalomethanes"]),
    remainder='passthrough',
    verbose_feature_names_out=False
)

preprocessor_missing.fit(raw_water_data)
water_data = preprocessor_missing.transform(raw_water_data)


water_data["Potability"] = water_data["Potability"].replace({
    0: "Not Potable",
    1: "Potable"
})

water_data.isnull().sum()

ph                 0
Sulfate            0
Trihalomethanes    0
Hardness           0
Solids             0
Chloramines        0
Conductivity       0
Organic_carbon     0
Turbidity          0
Potability         0
dtype: int64
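`SimpleImputer()` with no arguments uses the `"mean"` strategy, so each missing entry is replaced by the mean of the observed values in its column. This is a likely reason why identical values recur in the imputed columns of the data preview further down. A minimal sketch on a toy column (not the real data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one gap; the mean of the observed values (6.0 and 8.0) is 7.0
toy = np.array([[6.0], [np.nan], [8.0]])

imputer = SimpleImputer()  # default strategy is "mean"
filled = imputer.fit_transform(toy)

print(filled.ravel())
```

Other strategies, such as `SimpleImputer(strategy="median")`, are available if the column means turn out to be skewed by outliers.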

In [33]:
water_data["Potability"].value_counts()

Not Potable    1998
Potable        1278
Name: Potability, dtype: int64

In [34]:
np_water = water_data[water_data["Potability"] == "Not Potable"]
p_water = water_data[water_data["Potability"] == "Potable"]
p_water_upsampled = resample(
    p_water, n_samples=np_water.shape[0]
)
upsampled_water = pd.concat((p_water_upsampled, np_water))
upsampled_water['Potability'].value_counts()

Potable        1998
Not Potable    1998
Name: Potability, dtype: int64
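Upsampling draws rows with replacement, so duplicates of one observation can end up on both sides of a later train/test split. One possible alternative, sketched below on toy data (not the real dataset), is to split first and upsample only the training portion. The variable names and the `random_state` values are assumptions made for reproducibility of the illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data (not the real dataset): 12 "Not Potable" rows and 8 "Potable" rows
toy = pd.DataFrame({
    "ph": [float(i) for i in range(20)],
    "Potability": ["Not Potable"] * 12 + ["Potable"] * 8,
})

# Split first so that no test row can be duplicated into the training set
train, test = train_test_split(
    toy, train_size=0.75, stratify=toy["Potability"], random_state=1
)

majority = train[train["Potability"] == "Not Potable"]
minority = train[train["Potability"] == "Potable"]

# Sample the minority class with replacement up to the majority class size
minority_up = resample(minority, n_samples=len(majority), random_state=1)
train_balanced = pd.concat([minority_up, majority]).reset_index(drop=True)

counts = train_balanced["Potability"].value_counts()
print(counts)
```

With this ordering the test set keeps the original class proportions, which gives a more honest estimate of performance on new water samples.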

In [35]:
water_data = upsampled_water.reset_index(drop=True)
water_data.head(10)

|   | ph | Sulfate | Trihalomethanes | Hardness | Solids | Chloramines | Conductivity | Organic_carbon | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.701033 | 354.250252 | 69.260209 | 121.575139 | 16978.926829 | 6.812416 | 368.224575 | 12.997263 | 3.395483 | Potable |
| 1 | 6.428866 | 333.775777 | 87.163631 | 225.613094 | 15596.473334 | 6.153472 | 466.044399 | 18.015385 | 4.184783 | Potable |
| 2 | 8.589202 | 268.971355 | 41.930439 | 233.727975 | 7263.056749 | 6.953308 | 258.880209 | 8.986363 | 4.041446 | Potable |
| 3 | 7.080795 | 333.775777 | 52.331462 | 161.038971 | 25583.492009 | 6.796914 | 467.479901 | 14.105554 | 4.514175 | Potable |
| 4 | 4.945695 | 340.64567 | 84.551081 | 170.440921 | 23000.723137 | 4.433725 | 346.642267 | 13.276638 | 3.782517 | Potable |
| 5 | 8.458797 | 313.885548 | 84.510985 | 241.76834 | 29317.14244 | 5.783275 | 328.579429 | 18.296001 | 3.827431 | Potable |
| 6 | 6.170526 | 333.775777 | 75.067706 | 193.335517 | 16206.219671 | 7.123966 | 528.096091 | 20.532277 | 3.652207 | Potable |
| 7 | 9.24142 | 281.99564 | 65.516668 | 127.918826 | 39566.754352 | 8.860818 | 487.339169 | 9.534499 | 4.718851 | Potable |
| 8 | 8.357613 | 317.30168 | 67.598158 | 163.098254 | 34989.047081 | 7.696943 | 404.492614 | 8.271882 | 4.366242 | Potable |
| 9 | 7.080795 | 333.775777 | 85.338053 | 158.762959 | 26210.116012 | 4.088763 | 354.847053 | 16.006711 | 5.010163 | Potable |


In [36]:
water_train, water_test = train_test_split(
    water_data, train_size=0.75, stratify=water_data["Potability"]
)
# All predictor columns are already float64 (see the info output), so no numeric conversion is needed
print(water_train.info())
print(water_test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2997 entries, 2648 to 669
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2997 non-null   float64
 1   Sulfate          2997 non-null   float64
 2   Trihalomethanes  2997 non-null   float64
 3   Hardness         2997 non-null   float64
 4   Solids           2997 non-null   float64
 5   Chloramines      2997 non-null   float64
 6   Conductivity     2997 non-null   float64
 7   Organic_carbon   2997 non-null   float64
 8   Turbidity        2997 non-null   float64
 9   Potability       2997 non-null   object 
dtypes: float64(9), object(1)
memory usage: 257.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 999 entries, 2813 to 2486
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               999 non-null    float64
 1   Sulfate          999 non-nul

In [37]:
ph_summary = water_train['ph'].describe()
print(ph_summary)

predictor_vals_summary = pd.DataFrame(water_train.describe())
predictor_vals_summary

count    2997.000000
mean        7.060013
std         1.435278
min         0.000000
25%         6.304769
50%         7.080795
75%         7.833361
max        13.541240
Name: ph, dtype: float64


|   | ph | Sulfate | Trihalomethanes | Hardness | Solids | Chloramines | Conductivity | Organic_carbon | Turbidity |
|---|---|---|---|---|---|---|---|---|---|
| count | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 |
| mean | 7.060013 | 334.490185 | 66.296911 | 195.45163 | 22074.85347 | 7.084829 | 423.673461 | 14.355114 | 3.963419 |
| std | 1.435278 | 36.875215 | 15.957651 | 33.902606 | 8692.598647 | 1.599163 | 81.04374 | 3.353203 | 0.772581 |
| min | 0.0 | 129.0 | 0.738 | 47.432 | 1198.943699 | 0.530351 | 181.483754 | 2.2 | 1.45 |
| 25% | 6.304769 | 317.103903 | 56.437485 | 173.975027 | 15751.1753 | 6.073041 | 361.705354 | 12.091189 | 3.445756 |
| 50% | 7.080795 | 333.775777 | 66.396293 | 195.540967 | 20852.764496 | 7.071299 | 419.895888 | 14.274092 | 3.958609 |
| 75% | 7.833361 | 351.418634 | 76.634798 | 216.122144 | 27463.654795 | 8.080156 | 479.402509 | 16.620933 | 4.49214 |
| max | 13.54124 | 476.539717 | 124.0 | 317.338124 | 61227.196008 | 13.127 | 753.34262 | 27.006707 | 6.739 |


### Method
Explain how you will conduct your data analysis and which variables/columns you will use. Note: you do not need to use all variables/columns that exist in the raw data set; in fact, that is often not a good idea. For each variable, think: is this a useful variable for prediction?
Describe at least one way that you will visualize the results.


In [None]:
np_water = water_data[water_data["Potability"] == "Not Potable"].mean(numeric_only=True)
display(np_water)

p_water = water_data[water_data["Potability"] == "Potable"].mean(numeric_only=True)
display(p_water)

# Relative difference between the two class means for each variable; keep the five largest
((np_water - p_water)/(np_water + p_water)).abs().nlargest(5)

Judging from the relative differences between the class means in the last pandas Series, we should pick "Solids", "Chloramines", "Turbidity", "Organic_carbon" and "ph" as the predictor variables and "Potability" as the response variable.
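The imports at the top of the notebook include `KNeighborsClassifier`, `StandardScaler`, `make_pipeline` and `GridSearchCV`, but the proposal does not yet state which model will be fitted. Assuming a K-nearest neighbours classifier on the five selected predictors, a minimal sketch could look like the following. The data here is random stand-in data, and the parameter grid is illustrative rather than a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

predictors = ["Solids", "Chloramines", "Turbidity", "Organic_carbon", "ph"]

# Synthetic stand-in data (random values, not the real dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 5)), columns=predictors)
toy["Potability"] = np.where(toy["ph"] + toy["Solids"] > 0, "Potable", "Not Potable")

X_train, X_test, y_train, y_test = train_test_split(
    toy[predictors], toy["Potability"], train_size=0.75,
    stratify=toy["Potability"], random_state=0,
)

# Standardize the predictors so that no single scale dominates the distance metric
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Tune the number of neighbours with 5-fold cross-validation (grid is illustrative)
grid = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 16, 2))},
    cv=5,
)
grid.fit(X_train, y_train)

test_accuracy = grid.score(X_test, y_test)
print(grid.best_params_)
print(test_accuracy)
```

Scaling matters here because "Solids" is recorded in the tens of thousands while "Turbidity" stays in single digits, so unscaled distances would be driven almost entirely by "Solids".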