## Water Quality Prediction Project Report
### Introduction

Water is a basic human need, yet many regions around the globe have limited access to clean, safe drinking water. A body of water's potability, that is, whether it is safe to drink, can be assessed using a range of water quality metrics, each serving as an indicator of safety.

For this project, we use data from the [Water Quality dataset](https://www.kaggle.com/datasets/adityakadiwal/water-potability/data), which consists of water quality metrics from 3276 distinct water bodies.


<br>


Our **objective** with this project is to answer this question:


> **Can we predict the potability of water based on its quality metrics?**
   

<br>

The **variables** of this dataset are detailed below:

- **pH Value:** How acidic or alkaline water is.

- **Hardness:** The presence of calcium and magnesium salts.

- **Solids (Total Dissolved Solids - TDS):** The concentration of dissolved minerals in water.

- **Chloramines:** The concentration of disinfectants used in public water systems.

- **Sulfate:** The concentration of sulfate found in many natural sources.

- **Conductivity:** Water's electrical conductivity (EC) based on the amount of dissolved solids in water.

- **Organic_carbon (Total Organic Carbon - TOC):** The amount of carbon in organic compounds in pure water. 

- **Trihalomethanes (THMs):** Chemicals present in chlorine-treated water.

- **Turbidity:** The amount of solid matter suspended in water, influencing transparency.

- **Potability:** Whether the water is safe for human consumption. '1' is potable, '0' is not potable.
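
As a quick sanity check on the variables listed above, the sketch below compares a DataFrame's columns against the ten expected names. The column spellings follow the dataset as it is loaded in this report; the helper `missing_columns` and the toy frame are hypothetical and exist only for illustration.

```python
import pandas as pd

# Column names as they appear in the dataset used in this report
EXPECTED_COLUMNS = [
    "ph", "Hardness", "Solids", "Chloramines", "Sulfate",
    "Conductivity", "Organic_carbon", "Trihalomethanes",
    "Turbidity", "Potability",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected column names absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Toy frame that deliberately lacks the Turbidity column
toy = pd.DataFrame(columns=[c for c in EXPECTED_COLUMNS if c != "Turbidity"])
absent = missing_columns(toy)
print(absent)
```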

### Methods & Results

[description of methods used]

In [1]:
#import commands
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [2]:
raw_water_data = pd.read_csv('https://drive.google.com/uc?id=13N4nBi8cZCQUQambCexi0-XArwSghdrj')

#### Relevant Summary of the Dataset for Exploratory Data Analysis

In [3]:
# The number of rows that have at least one NaN
missing_sum = (raw_water_data.isna().sum(axis=1) > 0).sum()
missing_sum

1265

In [4]:
# The total number of rows in the dataset
total_rows = raw_water_data.shape[0]
total_rows

3276

In [5]:
# The percentage of rows in the dataset that have missing data
percent_missing = (missing_sum / total_rows) * 100
percent_missing

38.614163614163616
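
The row-level count above does not show which columns the gaps come from. A minimal sketch of a per-column breakdown follows, using a made-up toy frame in place of `raw_water_data` (the values are illustrative assumptions, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Toy data standing in for raw_water_data (values are made up)
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Sulfate": [330.0, 310.0, np.nan, np.nan],
    "Hardness": [200.0, 180.0, 210.0, 190.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows with at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```

Applied to the real data, the same `isna().sum()` call would indicate which columns need imputation.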

In [6]:
# Preprocessor to impute missing values (SimpleImputer defaults to the column mean)
preprocessor_missing = make_column_transformer(
    (SimpleImputer(), ["ph", "Sulfate", "Trihalomethanes"]),
    remainder='passthrough',
    verbose_feature_names_out=False
)

# Fit and transform the dataset with the preprocessor
preprocessor_missing.fit(raw_water_data)
water_data = preprocessor_missing.transform(raw_water_data)

# Replace the label for each class
water_data["Potability"] = water_data["Potability"].replace({
    0: "Not Potable",
    1: "Potable"
})

# The number of observations in each class
water_data["Potability"].value_counts()

Potability
Not Potable    1998
Potable        1278
Name: count, dtype: int64
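
The preprocessor above relies on `SimpleImputer` with its default settings, which replace each missing value with the mean of its column. The sketch below illustrates that behavior on a small toy frame; the values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one gap per column (values are made up)
toy = pd.DataFrame({
    "ph": [6.0, np.nan, 8.0],
    "Sulfate": [300.0, 340.0, np.nan],
})

# Default strategy is "mean": each NaN becomes its column's mean
imputer = SimpleImputer()
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(imputed)
```

Other strategies such as `strategy="median"` are available if the column distributions turn out to be skewed.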

In [7]:
# The dataset that only contains "Not Potable" water
np_water = water_data[water_data["Potability"] == "Not Potable"]

# The dataset that only contains "Potable" water
p_water = water_data[water_data["Potability"] == "Potable"]

# Upsampling to increase the number of observations for "Potable" water 
p_water_upsampled = resample(
    p_water, n_samples=np_water.shape[0]
)

# Concatenating the upsampled dataset with the dataset that only contains "Not Potable" water
upsampled_water = pd.concat((p_water_upsampled, np_water))

# The number of observations in each class
upsampled_water['Potability'].value_counts()

Potability
Potable        1998
Not Potable    1998
Name: count, dtype: int64
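
The upsampling step above draws "Potable" rows with replacement until both classes are the same size. A minimal sketch of the same idea on toy data follows; the `random_state` argument is an added assumption (it is not used in the cell above) and makes the random draw repeatable.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: four majority rows and two minority rows (values are made up)
toy = pd.DataFrame({
    "Solids": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Potability": ["Not Potable"] * 4 + ["Potable"] * 2,
})
majority = toy[toy["Potability"] == "Not Potable"]
minority = toy[toy["Potability"] == "Potable"]

# Sample the minority class with replacement up to the majority size;
# random_state (an assumption added here) fixes the draw
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=123
)

balanced = pd.concat([minority_upsampled, majority])
counts = balanced["Potability"].value_counts()
print(counts)
```

Because upsampled rows are exact copies, a duplicated observation can land in both the training and the testing set when upsampling happens before the split. Upsampling only the training portion is one way to avoid that.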

In [8]:
# The first ten rows of the dataset we are using
water_data = upsampled_water.reset_index(drop=True)
water_data.head(10)

| | ph | Sulfate | Trihalomethanes | Hardness | Solids | Chloramines | Conductivity | Organic_carbon | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.643734 | 340.975559 | 64.363793 | 202.413638 | 14110.920968 | 8.75411 | 291.084744 | 15.954655 | 3.920607 | Potable |
| 1 | 9.869232 | 281.11849 | 84.013585 | 223.772661 | 29549.658823 | 7.716923 | 356.181916 | 14.202664 | 4.73685 | Potable |
| 2 | 8.174186 | 285.332026 | 56.464157 | 244.249046 | 31114.820836 | 4.726912 | 418.6196 | 14.873626 | 4.709403 | Potable |
| 3 | 4.788861 | 187.424131 | 89.010974 | 234.893703 | 28174.620516 | 10.850036 | 444.854321 | 11.784799 | 2.896852 | Potable |
| 4 | 6.952512 | 333.775777 | 39.802907 | 211.697297 | 33737.288524 | 6.300871 | 395.407004 | 14.182396 | 4.105495 | Potable |
| 5 | 10.538098 | 409.44673 | 72.730639 | 200.139829 | 13867.244196 | 7.365015 | 340.808823 | 17.073123 | 3.823093 | Potable |
| 6 | 7.935607 | 312.343607 | 63.771833 | 207.016852 | 19657.843315 | 8.604505 | 358.849003 | 21.228127 | 3.619651 | Potable |
| 7 | 6.281904 | 305.094 | 41.27983 | 160.306685 | 17095.27007 | 6.730577 | 424.446185 | 14.374967 | 4.45773 | Potable |
| 8 | 7.080795 | 290.311034 | 58.354856 | 256.936378 | 13766.330789 | 5.083866 | 384.906516 | 17.731523 | 3.979297 | Potable |
| 9 | 6.417716 | 321.382124 | 66.396293 | 209.702425 | 31974.481631 | 7.263425 | 289.450118 | 11.369071 | 4.210327 | Potable |


In [9]:
# Splitting the dataset into training and testing sets
water_train, water_test = train_test_split(
    water_data, train_size=0.75, stratify=water_data["Potability"]
)

# The information about the two datasets
print(water_train.info())
print(water_test.info())

<class 'pandas.core.frame.DataFrame'>
Index: 2997 entries, 1650 to 2981
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2997 non-null   float64
 1   Sulfate          2997 non-null   float64
 2   Trihalomethanes  2997 non-null   float64
 3   Hardness         2997 non-null   float64
 4   Solids           2997 non-null   float64
 5   Chloramines      2997 non-null   float64
 6   Conductivity     2997 non-null   float64
 7   Organic_carbon   2997 non-null   float64
 8   Turbidity        2997 non-null   float64
 9   Potability       2997 non-null   object 
dtypes: float64(9), object(1)
memory usage: 257.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
Index: 999 entries, 2528 to 3332
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               999 non-null    float64
 1   Sulfate          999 non-null    floa

In [10]:
# The description of all the predictor variables in the training dataset
predictor_vals_summary = water_train.describe()
predictor_vals_summary

| | ph | Sulfate | Trihalomethanes | Hardness | Solids | Chloramines | Conductivity | Organic_carbon | Turbidity |
|---|---|---|---|---|---|---|---|---|---|
| count | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 |
| mean | 7.076786 | 332.247687 | 66.489572 | 197.209516 | 22290.235535 | 7.14276 | 423.840424 | 14.200352 | 3.971749 |
| std | 1.478984 | 37.769295 | 15.851283 | 34.099029 | 9056.674919 | 1.60714 | 81.611188 | 3.301492 | 0.786079 |
| min | 0.0 | 129.0 | 8.175876 | 47.432 | 1198.943699 | 0.530351 | 181.483754 | 2.2 | 1.492207 |
| 25% | 6.232011 | 314.727835 | 56.931906 | 177.353052 | 15661.940335 | 6.124095 | 360.932804 | 12.002323 | 3.439561 |
| 50% | 7.080795 | 333.775777 | 66.396293 | 197.602078 | 21111.664104 | 7.137871 | 418.642063 | 14.15411 | 3.968647 |
| 75% | 7.874671 | 350.060347 | 77.131051 | 218.424637 | 27701.794055 | 8.117748 | 480.878544 | 16.410654 | 4.497731 |
| max | 14.0 | 481.030642 | 124.0 | 323.124 | 56867.859236 | 13.127 | 753.34262 | 28.3 | 6.739 |


In [11]:
# The description of the selected predictor variables in the training dataset
selected_predictors_summary = water_train[["Solids", "Conductivity", "Hardness", "Organic_carbon", "Chloramines"]].describe()
selected_predictors_summary

| | Solids | Conductivity | Hardness | Organic_carbon | Chloramines |
|---|---|---|---|---|---|
| count | 2997.0 | 2997.0 | 2997.0 | 2997.0 | 2997.0 |
| mean | 22290.235535 | 423.840424 | 197.209516 | 14.200352 | 7.14276 |
| std | 9056.674919 | 81.611188 | 34.099029 | 3.301492 | 1.60714 |
| min | 1198.943699 | 181.483754 | 47.432 | 2.2 | 0.530351 |
| 25% | 15661.940335 | 360.932804 | 177.353052 | 12.002323 | 6.124095 |
| 50% | 21111.664104 | 418.642063 | 197.602078 | 14.15411 | 7.137871 |
| 75% | 27701.794055 | 480.878544 | 218.424637 | 16.410654 | 8.117748 |
| max | 56867.859236 | 753.34262 | 323.124 | 28.3 | 13.127 |
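
The summaries above pool both classes together. A per-class view can be sketched with `groupby`, shown here on a made-up toy frame rather than `water_train` (the values are illustrative assumptions only).

```python
import pandas as pd

# Toy data standing in for the training set (values are made up)
toy = pd.DataFrame({
    "Solids": [100.0, 200.0, 300.0, 500.0],
    "Potability": ["Potable", "Potable", "Not Potable", "Not Potable"],
})

# Mean of the predictor within each potability class
class_means = toy.groupby("Potability")["Solids"].mean()
print(class_means)
```

Replacing `.mean()` with `.describe()` would give the full set of summary statistics per class.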


#### Relevant Visualizations of the Dataset for Exploratory Data Analysis

In [12]:
# Histograms of each selected predictor, faceted by potability class
selected_predictors = ["Hardness", "Solids", "Chloramines", "Conductivity", "Organic_carbon"]
metric_hists = [
    alt.Chart(water_train).mark_bar(opacity=0.8).encode(
        # alt.Bin (not alt.X) configures the histogram bins
        x=alt.X(predictor, bin=alt.Bin(maxbins=30)),
        y=alt.Y("count()")
    ).properties(
        height=100
    ).facet(
        "Potability",
        title="Selected Water Quality Metrics by Potability",
        columns=1
    )
    for predictor in selected_predictors
]

# Display one faceted chart per predictor
for hist in metric_hists:
    display(hist)

### Discussion

summarize what you found
discuss whether this is what you expected to find?
discuss what impact could such findings have?
discuss what future questions could this lead to?

### References

At least 2 citations of literature relevant to the project (format is your choice, just be consistent across the references).
Make sure to cite the source of your data as well.