## Water Quality Prediction Project Proposal
### Introduction

Water is a basic human need, yet many regions around the globe have limited access to clean, safe drinking water. Whether a body of water is potable (safe to drink) can be assessed with a range of water quality metrics, each of which serves as an indicator of safety.

For this project, we use the `water_potability.csv` file from the [Water Quality dataset](https://www.kaggle.com/datasets/adityakadiwal/water-potability/data) on Kaggle, which contains water quality metrics for 3276 distinct water bodies.

<br>


Our **objective** with this project is to answer this question:


> **Can we predict the potability of water based on its quality metrics?**
   

<br>

The **variables** of this dataset are detailed below:

- **pH Value:** A measure of how acidic or alkaline the water is. The WHO's recommended range is 6.5 to 8.5.

- **Hardness:** The concentration of dissolved calcium and magnesium salts. Hardness was originally defined as the capacity of water to precipitate soap.

- **Solids (Total Dissolved Solids - TDS):** Denotes the concentration of dissolved minerals in water, affecting its taste and appearance.

- **Chloramines:** The concentration of chlorine-based disinfectants (chlorine and chloramine), which are the main disinfectants used in public water systems. Chlorine levels up to 4 mg/L are considered safe in drinking water.

- **Sulfate:** Found in many natural sources (groundwater, plants, food, etc.). Concentration varies greatly.

- **Conductivity:** A measure of water's ability to conduct electric current, primarily determined by the amount of dissolved solids in water. Pure water is not a good conductor, and WHO standards state that EC (electrical conductivity) should not exceed 400 μS/cm. 

- **Organic_carbon (Total Organic Carbon - TOC):** A measure of the total amount of carbon in organic compounds in pure water. This carbon comes from decaying natural organic matter and from synthetic sources.

- **Trihalomethanes (THMs):** Chemicals that may be present in water that has been treated with chlorine. Their concentration varies with the level of organic material in the water, the water temperature, and the amount of chlorine required for treatment.

- **Turbidity:** The amount of solid matter suspended in the water, which affects its transparency. It is measured by how strongly the water scatters light. The WHO recommends a value no higher than 5.00 NTU.

- **Potability:** Indicates if water is safe for human consumption or not. '1' is potable, '0' is not potable. 
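
As a rough illustration of how the limits quoted in the list above could be checked against data, the sketch below flags whether values fall at or below the three upper limits stated there (chloramines, conductivity, turbidity). The helper `within_limits` and the `LIMITS` dictionary are hypothetical names introduced for this example and are not part of the project code. The sample rows are the first five rows from the data preview later in this proposal.

```python
import pandas as pd

# Upper limits quoted in the variable list (hypothetical dictionary name).
LIMITS = {
    "Chloramines": 4.0,     # mg/L
    "Conductivity": 400.0,  # μS/cm
    "Turbidity": 5.0,       # NTU
}


def within_limits(df: pd.DataFrame, limits: dict = LIMITS) -> pd.DataFrame:
    """Return a boolean frame that is True where a value is at or below its limit.

    Missing values compare as False, so they count as "not within the limit".
    """
    return pd.DataFrame({col: df[col] <= limit for col, limit in limits.items()})


# First five rows of the dataset, copied from the preview in this proposal.
sample = pd.DataFrame({
    "Chloramines": [7.300212, 6.635246, 9.275884, 8.059332, 6.5466],
    "Conductivity": [564.308654, 592.885359, 418.606213, 363.266516, 398.410813],
    "Turbidity": [2.963135, 4.500656, 3.055934, 4.628771, 4.075075],
})

flags = within_limits(sample)
print(flags.mean())  # share of sample rows within each limit
```

A per-metric check like this is only a descriptive aid. It does not replace the classifier, since potability in the dataset is recorded as its own label.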




### Preliminary exploratory data analysis
To begin, we imported the libraries we expected to need and read the data from the web into Python. Because Kaggle requires authentication to download files, we uploaded the dataset file to Google Drive and created a publicly accessible share link. This way, we did not have to upload the file directly into Jupyter or use the Kaggle API. We then cleaned and wrangled the data and split it into training and testing sets.
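
The share-link approach described above can be sketched as a small helper that converts a Google Drive share link into the `uc?id=` form that `pd.read_csv` can read directly. The function name `drive_direct_url` and the placeholder file ID are hypothetical. The sketch assumes a small, publicly shared file, and larger files may need a different download route.

```python
import re


def drive_direct_url(share_link: str) -> str:
    """Convert a Google Drive share link into a direct-download style URL.

    Handles links of the form .../file/d/<id>/view and links that already
    carry the file ID in an `id=` query parameter.
    """
    match = (
        re.search(r"/file/d/([\w-]+)", share_link)
        or re.search(r"[?&]id=([\w-]+)", share_link)
    )
    if match is None:
        raise ValueError(f"No Google Drive file ID found in: {share_link}")
    return f"https://drive.google.com/uc?id={match.group(1)}"


# FILE_ID_PLACEHOLDER is a made-up ID used only for illustration.
example = drive_direct_url(
    "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
)
print(example)
```

The resulting URL could then be passed to `pd.read_csv`, provided the file is shared publicly.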

```python
# Import commands
import random

import altair as alt
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import set_config

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")
```

```python
# Read the dataset from the public Google Drive link
water_data = pd.read_csv('https://drive.google.com/uc?id=13N4nBi8cZCQUQambCexi0-XArwSghdrj')

# Preview the first five rows
water_data.head()
```

| Unnamed: 0 | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
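
The preview shows a stray `Unnamed: 0` column and missing values in `ph` and `Sulfate`, so the cleaning and splitting step mentioned earlier might look roughly like the sketch below. It rests on several assumptions that are not confirmed by this proposal: that incomplete rows are dropped rather than imputed, that the split is 75/25, and that it is stratified on `Potability`. The function name `clean_and_split`, the seed, and the `demo` frame are hypothetical. `demo` holds randomly generated stand-in values, not real measurements, so the example runs without downloading the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def clean_and_split(df: pd.DataFrame, test_size: float = 0.25, seed: int = 1234):
    """Drop the stray index column and incomplete rows, then split the data.

    Assumptions (not taken from the proposal): rows with missing values are
    dropped, and the split is stratified on the Potability label.
    """
    cleaned = (
        df.drop(columns=["Unnamed: 0"], errors="ignore")
        .dropna()
        .reset_index(drop=True)
    )
    train, test = train_test_split(
        cleaned,
        test_size=test_size,
        stratify=cleaned["Potability"],
        random_state=seed,
    )
    return train, test


# Synthetic stand-in data (random values, NOT real water measurements).
rng = np.random.default_rng(0)
n = 40
demo = pd.DataFrame({
    "Unnamed: 0": range(n),
    "ph": rng.uniform(0, 14, n),
    "Sulfate": rng.uniform(130, 480, n),
    "Potability": [0, 1] * (n // 2),
})
demo.loc[[0, 1], "ph"] = np.nan  # mimic the missing values seen in the preview

train, test = clean_and_split(demo)
print(train.shape, test.shape)
```

With the real data, `clean_and_split(water_data)` would take the place of the call on `demo`. Dropping rows discards every observation with any missing metric, so imputation may be worth considering if many rows are affected.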
