<a href="https://colab.research.google.com/github/cubansalsa/skills-copilot-codespaces-vscode/blob/main/Classification_of_Potable_Water_using_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font size="+3">Module 38 - Classification of Potable Water using ML</font>

- We will follow the AI Project Cycle for the entire project. The AI Project Cycle helps us create solutions efficiently and quickly while keeping an overview of the entire process.
- We proceed step by step: defining the problem, acquiring data, exploring it, and then modeling the data.
- Finally, we evaluate the model. This helps us arrive at an effective solution to the problem.

 - **AI Project Cycle**

![AI_Project_Cycle.png](attachment:download%20%281%29.png)

## Problem Scoping - Understanding the Problem Statement (AI Project Cycle - Step 1)

Water Quality - Drinking Water Potability

Water is at the core of sustainable development, being critical for socio-economic development, energy and food production, healthy ecosystems, and human survival. Water is also at the heart of adaptation to climate change, serving as the crucial link between society and the environment. In this social impact use case, we will determine whether a given set of water quality measurements indicates potable or non-potable water.
![kidsimage.jpg](attachment:ogb_1028408_children_water_hassansham_camp_iraq_900x395%20%281%29.jpg)

![problem statement.png](attachment:image-4.png)

## Dataset:  Data Acquisition (AI Project Cycle - Step 2)

Source - https://www.kaggle.com/datasets/adityakadiwal/water-potability

According to the World Health Organization (WHO), "access to safe drinking-water is essential to health, a basic human right and a component of effective policy for health protection." In 1990, only 76 percent of the global population had access to drinking water. By 2015 that number had increased to 91 percent.
Around 60 percent of the body is made up of water, and around 71 percent of the planet’s surface is covered by water. Global access to safe water, adequate sanitation, and proper hygiene resources reduces illness and death from disease, thereby leading to improved health, poverty reduction, and socio-economic development.

### Import the useful Packages and Libraries

In [None]:
# Import Libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Visualization Libraries
import matplotlib.pyplot as plt
%matplotlib inline
#!pip install seaborn
!pip install plotly
import seaborn as sns
import plotly.express as px

# Data Pre-processing Libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#Modelling Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# sklearn evaluation libraries
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report,precision_score

# ignore warnings related to versions etc..
import warnings
warnings.filterwarnings('ignore')

#for adding audio,image and YouTube video within Jupyter Notebook
from IPython.display import Audio,Image,YouTubeVideo

### Load/Read the Dataset

In [3]:
# Load the Dataset
df=pd.read_csv("water_potability.csv")

### View the data

In [None]:
# View the first 5 rows with default head().
df.head()

### Attribute Information
**ph**: pH of the water (0 to 14).

**Hardness**: Capacity of water to precipitate soap, in mg/L.

**Solids**: Total dissolved solids, in parts per million (ppm).

**Chloramines**: Amount of chloramines, in ppm.

**Sulfate**: Amount of dissolved sulfates, in mg/L.

**Conductivity**: Electrical conductivity of water, in μS/cm (microsiemens per centimeter).

**Organic_carbon**: Amount of organic carbon, in ppm.

**Trihalomethanes**: Amount of trihalomethanes, in μg/L (micrograms per liter).

**Turbidity**: Measure of the light-scattering property (cloudiness) of water, in NTU.

**Potability**: Indicates whether the water is safe for human consumption.
- 0 for non-potable water
- 1 for potable water

For details of these attributes, refer to slides 44-51 of Module 38.

## Data Preprocessing ------ Data Exploration (AI Project Cycle - Step 3)

**Data Preprocessing consists of EDA/Data Cleaning/Feature Engineering**

We have listed below all the basic steps involved in data preprocessing, but not every problem requires all of them. Which steps are essential depends on the nature of the dataset at hand and on your problem statement.

**In this particular problem, we will be using the steps -
EDA(1,2,3,6) Data Cleaning(3) Feature Engineering(1)**

-----------------------------------------------------------------------------------------------------------------------------

**Exploratory Data Analysis (EDA)** - the goal is to maximize insight into a dataset and understand its underlying structure.

1. Explore the data - view the information of the data
2. Check how many rows and columns the data consists of
3. Find null/missing values
4. Fill the missing values - use fillna (mean, median or mode)
5. Data scaling/standardization
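
The steps listed above can be tried out on a very small table before applying them to the real dataset. The sketch below uses a hypothetical three-column toy frame (`toy` and its values are made up for illustration; they are not taken from `water_potability.csv`), and the median fill and manual standardization are just one possible choice for steps 4 and 5.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the water-quality table.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Hardness": [200.0, 180.0, np.nan, 220.0],
    "Potability": [0, 1, 0, 1],
})

n_rows, n_cols = toy.shape            # step 2: number of rows and columns
missing = toy.isna().sum()            # step 3: missing values per column
filled = toy.fillna(toy.median())     # step 4: fill gaps with the column median

# Step 5: standardize the feature columns to mean 0 and standard deviation 1.
features = filled.drop(columns="Potability")
scaled = (features - features.mean()) / features.std(ddof=0)

print(n_rows, n_cols)
print(missing.to_dict())
print(scaled.round(2))
```

Working through a toy frame like this makes it easier to see what each call returns before reading the output for the full dataset.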

### View the information of data

In [None]:
# info() summarizes the dataset: it gives basic information such as the number of non-null values, datatypes and memory usage
# It is good practice to start with this information
df.info()

### Check the number of rows and columns of the dataset

In [None]:
#It returns the column names of the given dataframe
df.columns

In [None]:
#It shows the number of rows and columns of the given dataframe
df.shape

In [None]:
# describe() gives summary statistics for the numeric columns of the dataframe:
# count - the number of non-empty values
# mean  - the average value
# std   - the standard deviation
# min   - the minimum value
# 25%   - the 25th percentile
# 50%   - the 50th percentile (the median)
# 75%   - the 75th percentile
# max   - the maximum value

df.describe()

### Balancing the Dataset

In [None]:
# Check whether the classes in the dataframe are imbalanced
df['Potability'].value_counts()

In [None]:
# Apply random oversampling to balance the classes

# Class counts (value_counts() sorts by frequency, so the larger class comes first)
class_count_0, class_count_1 = df['Potability'].value_counts()

# Separate the classes
class_0 = df[df['Potability'] == 0]
class_1 = df[df['Potability'] == 1]

# Print the shape of each class
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)

# Resample class 1 with replacement until it has as many rows as class 0
class_1_over = class_1.sample(n=len(class_0), replace=True)
df_over = pd.concat([class_0, class_1_over], axis=0)
df_over.shape
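
Because the cell above duplicates rows before the data is split into training and test sets, copies of the same row can end up in both sets, which may make the test scores look better than they would on unseen water samples. One possible alternative, sketched here on hypothetical toy data, is to split first and oversample only the training part. The names `toy`, `train_balanced` and the use of `sklearn.utils.resample` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced toy data: 60 rows of class 0 and 40 rows of class 1.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=100),
    "Potability": [0] * 60 + [1] * 40,
})

# Split first, keeping the class ratio the same in both parts.
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["Potability"], random_state=42
)

# Oversample the minority class in the training part only.
majority = train[train["Potability"] == 0]
minority = train[train["Potability"] == 1]
minority_over = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_over], axis=0)

print(train_balanced["Potability"].value_counts().to_dict())
print(test["Potability"].value_counts().to_dict())
```

With this ordering the test set keeps the original class ratio and contains no row that the model has already seen during training.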

In [None]:
# Check the balance of the dataset after oversampling
counts = df_over['Potability'].value_counts().sort_index()  # count for class 0, then class 1
fig = px.pie(values=counts.values,names=['Non Potable : 0','Potable : 1'],hole=0.65,opacity=0.8)
fig.show()


### Performing EDA on the Balanced Dataset

In [None]:
# The new balanced dataset
df_over.head()

#### Checking the missing values

In [None]:
# Finding Number of Missing values in the dataset
nans = df_over.isna().sum().sort_values(ascending=False).to_frame()
px.imshow(nans,text_auto=True)

#### Histogram plots of the features with missing values

In [None]:
## Let's check the skewness of the features with missing values to decide how to fill them
# 1. Histogram for ph
fig = px.histogram(df_over, x="ph")
fig.show()

In [None]:
# 2. histogram for Sulfate
fig = px.histogram(df_over, x="Sulfate")
fig.show()

In [None]:
#3. Histogram for Trihalomethanes
fig = px.histogram(df_over, x="Trihalomethanes")
fig.show()

#### How to replace missing values in the dataset?

![MEAN.png](attachment:imputation.png)

From the histograms of the three columns above, we observe that all three features are skewed to the right.
The median is less affected by skew than the mean, so we fill the missing values with the median.
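
To see why the median is the safer fill value for a skewed column, here is a minimal sketch on a hypothetical right-skewed series (the numbers are made up for illustration). A single large value pulls the mean upward, while the median stays close to the bulk of the data.

```python
import pandas as pd

# Hypothetical right-skewed column with two missing entries.
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0, None, None])

mean_value = values.mean()      # pulled upward by the single large value
median_value = values.median()  # stays near the bulk of the data
skewness = values.skew()        # positive for a right-skewed column

# Fill the gaps with the median, as done for ph, Sulfate and Trihalomethanes.
filled = values.fillna(median_value)

print(mean_value, median_value, skewness)
print(filled.tolist())
```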

In [14]:
# Assign the result back instead of calling fillna(..., inplace=True) on a selected column,
# which newer pandas versions warn about and may silently ignore
for col in ['ph', 'Sulfate', 'Trihalomethanes']:
    df_over[col] = df_over[col].fillna(df_over[col].median())

In [None]:
## Checking the missing values after Imputation method
#plt.title('Missing Values')
nans = df_over.isna().sum().sort_values(ascending=False).to_frame()
px.imshow(nans,text_auto=True)
#sns.heatmap(nans,annot=True,fmt='d',cmap='vlag')

#### Check correlation between features
- We run this check to identify redundant (highly correlated) features that could be removed

In [None]:
# Correlation matrix
px.imshow(df_over.corr(),text_auto=True)

The correlation matrix shows no strongly correlated pairs of features, so none of the features is redundant.
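
Reading redundancy off a heatmap can be error-prone, so the same check can also be done programmatically. The helper below is a hypothetical sketch, and the 0.8 threshold is an assumed cut-off rather than a value from this module. It lists every pair of columns whose absolute correlation exceeds the threshold.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| above the threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Hypothetical toy data: "b" is almost a copy of "a", "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

print(correlated_pairs(toy))
```

Calling `correlated_pairs(df_over.drop(columns='Potability'))` on the balanced dataset would return an empty list if no pair of features crosses the chosen threshold.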

#### Checking Outliers

In [20]:
# Checking Outliers for ph feature
fig = px.box(df_over, x='Potability', y='ph', color="Potability")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()

In [None]:
# Checking Outliers for Hardness feature
fig = px.box(df_over, x='Potability', y='Hardness', color="Potability")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()

In [None]:
# Checking Outliers for Solids feature
fig = px.box(df_over, x='Potability', y='Solids', color="Potability")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()

In [None]:
# Checking Outliers for Chloramines feature
fig = px.box(df_over, x='Potability', y='Chloramines', color="Potability")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()

In [None]:
# Checking Outliers for Sulfate feature
fig = px.box(df_over, x='Potability', y='Sulfate', color="Potability")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()

In [None]:
# Checking Outliers for Conductivity feature
fig = px.box(df_over, x='Potability', y='Conductivity', color="Potability")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()

In [None]:
# Checking Outliers for Organic_carbon feature
fig = px.box(df_over, x='Potability', y='Organic_carbon', color="Potability")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()

In [None]:
# Checking Outliers for Trihalomethanes feature
fig = px.box(df_over, x='Potability', y='Trihalomethanes', color="Potability")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()

In [None]:
# Checking Outliers for Turbidity  feature
fig = px.box(df_over, x='Potability', y='Turbidity', color="Potability")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()

In [None]:
## Outlier treatment using InterQuartile Range Method

![OUTLIERS.png](attachment:1_0MPDTLn8KoLApoFvI0P2vQ.png)

In [None]:
## Fill outliers for ph
# Common notation for all features
# df_over -- the balanced dataset
# Q1  = 25th percentile (first quartile)
# Q3  = 75th percentile (third quartile)
# IQR = Q3 - Q1
# low = Q1 - 1.5 * IQR
# top = Q3 + 1.5 * IQR
df_ph = df_over['ph']
Q1_ph = df_ph.quantile(0.25)
Q3_ph = df_ph.quantile(0.75)
IQR_ph = Q3_ph - Q1_ph
low = Q1_ph - 1.5 * IQR_ph
top = Q3_ph + 1.5 * IQR_ph
print(top)
print(low)

# select the outliers: values either < low or > top
outlier_ph = (df_ph < low) | (df_ph > top)
print('Number of outliers for ph:', outlier_ph.sum())

# replace the outliers with the mean of the feature column
# (assign through df_over.loc so the change is written to the dataframe itself)
print(df_ph.mean())
df_over.loc[outlier_ph, 'ph'] = df_ph.mean()

In [None]:
## Fill outlier for Hardness
df_h = df_over['Hardness']
Q1_h = df_h.quantile(0.25)
Q3_h = df_h.quantile(0.75)
IQR_h = Q3_h - Q1_h
low_h = Q1_h - 1.5 * IQR_h
top_h = Q3_h + 1.5 * IQR_h
print(top_h)
print(low_h)

# select the outliers: values either < low or > top
outlier_h = (df_h < low_h) | (df_h > top_h)
print('Number of outliers for Hardness:', outlier_h.sum())

# replace the outliers with the mean value
print(df_h.mean())
df_over.loc[outlier_h, 'Hardness'] = df_h.mean()

In [None]:
## Fill outlier for Solids
df_solids = df_over['Solids']
Q1_solids = df_solids.quantile(0.25)
Q3_solids = df_solids.quantile(0.75)
IQR_solids = Q3_solids - Q1_solids
low_s = Q1_solids - 1.5 * IQR_solids
top_s = Q3_solids + 1.5 * IQR_solids
print(top_s)
print(low_s)

# select the outliers: values either < low or > top
outlier_solids = (df_solids < low_s) | (df_solids > top_s)
print('Number of outliers for Solids:', outlier_solids.sum())

# replace the outliers with the mean value
print(df_solids.mean())
df_over.loc[outlier_solids, 'Solids'] = df_solids.mean()

In [None]:
## Fill outliers for Chloramines
df_ch = df_over['Chloramines']
Q1_ch = df_ch.quantile(0.25)
Q3_ch = df_ch.quantile(0.75)
IQR_ch = Q3_ch - Q1_ch
low_ch = Q1_ch - 1.5 * IQR_ch
top_ch = Q3_ch + 1.5 * IQR_ch
print(top_ch)
print(low_ch)

# select the outliers: values either < low or > top
outliers_ch = (df_ch < low_ch) | (df_ch > top_ch)
print('Number of outliers for Chloramines:', outliers_ch.sum())

# replace the outliers with the mean value
print(df_ch.mean())
df_over.loc[outliers_ch, 'Chloramines'] = df_ch.mean()

In [None]:
## Fill outlier for Conductivity
df_con = df_over['Conductivity']
Q1_con = df_con.quantile(0.25)
Q3_con = df_con.quantile(0.75)
IQR_con = Q3_con - Q1_con
low_con = Q1_con - 1.5 * IQR_con
top_con = Q3_con + 1.5 * IQR_con
print(top_con)
print(low_con)

# select the outliers: values either < low or > top
outliers_con = (df_con < low_con) | (df_con > top_con)
print('Number of outliers for Conductivity:', outliers_con.sum())

# replace the outliers with the mean value
print(df_con.mean())
df_over.loc[outliers_con, 'Conductivity'] = df_con.mean()

In [None]:
## Fill outlier for Organic Carbon
df_og = df_over['Organic_carbon']
Q1_og = df_og.quantile(0.25)
Q3_og = df_og.quantile(0.75)
IQR_og = Q3_og - Q1_og
low_og = Q1_og - 1.5 * IQR_og
top_og = Q3_og + 1.5 * IQR_og
print(top_og)
print(low_og)

# select the outliers: values either < low or > top
outliers_og = (df_og < low_og) | (df_og > top_og)
print('Number of outliers for Organic Carbon:', outliers_og.sum())

# replace the outliers with the mean value
print(df_og.mean())
df_over.loc[outliers_og, 'Organic_carbon'] = df_og.mean()

In [None]:
## Fill outlier for Trihalomethanes
df_tr = df_over['Trihalomethanes']
Q1_tr = df_tr.quantile(0.25)
Q3_tr = df_tr.quantile(0.75)
IQR_tr = Q3_tr - Q1_tr
low_tr = Q1_tr - 1.5 * IQR_tr
top_tr = Q3_tr + 1.5 * IQR_tr
print(top_tr)
print(low_tr)

# select the outliers: values either < low or > top
outliers_tr = (df_tr < low_tr) | (df_tr > top_tr)
print('Number of outliers for Trihalomethanes:', outliers_tr.sum())

# replace the outliers with the mean value
print(df_tr.mean())
df_over.loc[outliers_tr, 'Trihalomethanes'] = df_tr.mean()

In [None]:
## Fill outlier for Turbidity
df_tur = df_over['Turbidity']
Q1_tur = df_tur.quantile(0.25)
Q3_tur = df_tur.quantile(0.75)
IQR_tur = Q3_tur - Q1_tur
low_tur = Q1_tur - 1.5 * IQR_tur
top_tur = Q3_tur + 1.5 * IQR_tur
print(low_tur)
print(top_tur)

# select the outliers: values either < low or > top
outliers_tur = (df_tur < low_tur) | (df_tur > top_tur)
print('Number of outliers for Turbidity:', outliers_tur.sum())

# replace the outliers with the mean value
print(df_tur.mean())
df_over.loc[outliers_tur, 'Turbidity'] = df_tur.mean()
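
The per-feature cells above repeat the same steps with different variable names, which makes copy-and-paste slips easy. As a minimal sketch, the same IQR rule could be wrapped in one helper. The function name `replace_outliers_with_mean` and the toy values are hypothetical and not part of the module.

```python
import pandas as pd

def replace_outliers_with_mean(frame, column, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the column mean.

    The frame is modified in place; the number of replaced values is returned.
    """
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    low, top = q1 - k * iqr, q3 + k * iqr
    mask = (frame[column] < low) | (frame[column] > top)
    # As in the cells above, the mean is computed before the replacement,
    # so it still includes the outlying values.
    frame.loc[mask, column] = frame[column].mean()
    return int(mask.sum())

# Hypothetical toy column with one obvious outlier.
toy = pd.DataFrame({"Turbidity": [3.0, 4.0, 4.0, 5.0, 5.0, 6.0, 40.0]})
n_replaced = replace_outliers_with_mean(toy, "Turbidity")

print(n_replaced)
print(toy["Turbidity"].tolist())
```

On the balanced dataset the helper could be called in a loop over the feature names, for example `for col in ['ph', 'Hardness', 'Solids']: replace_outliers_with_mean(df_over, col)`.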

### Splitting dataset into separate training and test set

In [34]:
X = df_over.drop('Potability',axis=1).values
y = df_over['Potability'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Common train:test split ratios are 80:20, 70:30 and 60:40, depending on the size and nature of the data
# For this dataset, we use a 75:25 train:test split

In [None]:
df_over.shape

In [None]:
X_train.shape

In [None]:
X_test.shape

### Standardize the Data

In [36]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
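
The cell above fits the scaler on the training data only and then reuses the same mean and standard deviation for the test data. A small sketch on hypothetical random matrices (standing in for `X_train` and `X_test`) shows that `StandardScaler` computes z = (x - mean) / std per column, using the training statistics for both sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrices standing in for X_train and X_test.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

toy_scaler = StandardScaler().fit(toy_train)     # statistics come from the training rows only
train_scaled = toy_scaler.transform(toy_train)
test_scaled = toy_scaler.transform(toy_test)     # reuses the training mean and std

# Same result by hand: z = (x - mean) / std, with the population std (ddof=0).
manual = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(np.allclose(train_scaled, manual))
print(train_scaled.mean(axis=0).round(6))
print(test_scaled.mean(axis=0).round(2))
```

The test columns end up close to, but not exactly at, mean 0, which is expected: the test data is scaled with statistics it did not contribute to.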

## Building a Model ------ Modeling ( AI Project Cycle - Step 4)

We will now use the sklearn library to build the models. We begin by defining each model's hyperparameters, then fit the models and determine which one fits best. We will compare the accuracy of four models: Logistic Regression, Decision Tree, Random Forest and Support Vector Machine (SVM).
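
A single train/test split gives one accuracy number per model, which can vary with the split. As a complementary sketch, the same four model types can be compared with 5-fold cross-validation. The synthetic data from `make_classification`, the pipeline setup and the variable names here are assumptions for illustration only, and the scores it prints say nothing about the water dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data with 9 features, like the water-quality table.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=12
)

candidates = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=12),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=12),
    "Support Vector Machines": SVC(C=1.0, random_state=12),
}

cv_scores = {}
for name, model in candidates.items():
    # Scaling inside the pipeline is refit on the training part of each fold.
    pipeline = make_pipeline(StandardScaler(), model)
    cv_scores[name] = cross_val_score(pipeline, X_toy, y_toy, cv=5).mean()

for name, score in sorted(cv_scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```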

In [37]:
# Logistic Regression

In [None]:
# Logistic Regression is a classification technique from sklearn
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty="l2")
lr.fit(X_train,y_train)

In [39]:
# Decision Tree

In [40]:
from sklearn.tree import DecisionTreeClassifier
dTree = DecisionTreeClassifier(random_state=12)
dTree.fit(X_train, y_train)

In [None]:
# Random Forest

In [41]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(random_state = 12,n_estimators=100)
rf.fit(X_train,y_train)

In [None]:
# Support Vector Machines

In [42]:
from sklearn.svm import SVC
svc = SVC(C=1.0,random_state=12)
svc.fit(X_train, y_train)

## Evaluating the Model ------- Evaluation (AI Project Cycle - Step 5)

In [None]:
# Logistic Regression

In [None]:
# sklearn metrics take the true labels first, then the predictions
y_pred = lr.predict(X_test)
LR_acc_score = accuracy_score(y_test, y_pred)
print(f'Test Accuracy Score for Logistic Regression: {LR_acc_score}')
print(f'\nClassification Report : \n\n{classification_report(y_test, y_pred)}')

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = lr.predict(X_test)
cm = confusion_matrix(y_test,y_pred)  # rows: actual class, columns: predicted class
plt.figure(figsize = (7,5))
sns.heatmap(cm, annot=True, fmt="d")
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
# Decision Tree

In [None]:
# sklearn metrics take the true labels first, then the predictions
y_pred = dTree.predict(X_test)
DT_acc_score = accuracy_score(y_test, y_pred)
print(f'Test Accuracy Score for Decision Tree: {DT_acc_score}')
print(f'\nClassification Report : \n\n{classification_report(y_test, y_pred)}')

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = dTree.predict(X_test)
cm = confusion_matrix(y_test,y_pred)  # rows: actual class, columns: predicted class
plt.figure(figsize = (7,5))
sns.heatmap(cm, annot=True, fmt="d")
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
# Random Forest

In [None]:
# sklearn metrics take the true labels first, then the predictions
y_pred = rf.predict(X_test)
RF_acc_score = accuracy_score(y_test, y_pred)
print(f'Test Accuracy Score for Random Forest: {RF_acc_score}')
print(f'\nClassification Report : \n\n{classification_report(y_test, y_pred)}')

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = rf.predict(X_test)
cm = confusion_matrix(y_test,y_pred)  # rows: actual class, columns: predicted class
plt.figure(figsize = (7,5))
sns.heatmap(cm, annot=True, fmt="d")
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
# Support Vector Machines

In [None]:
# sklearn metrics take the true labels first, then the predictions
y_pred = svc.predict(X_test)
SVM_acc_score = accuracy_score(y_test, y_pred)
print(f'Test Accuracy Score for SVM: {SVM_acc_score}')
print(f'\nClassification Report : \n\n{classification_report(y_test, y_pred)}')

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = svc.predict(X_test)
cm = confusion_matrix(y_test,y_pred)  # rows: actual class, columns: predicted class
plt.figure(figsize = (7,5))
sns.heatmap(cm, annot=True, fmt="d")
plt.xlabel('Predicted')
plt.ylabel('Actual')
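
sklearn's metric functions take the true labels first and the predictions second. Accuracy is the same either way, but precision and recall depend on that order. The minimal sketch below uses hypothetical labels (not model output from this notebook) to show how precision and recall for the potable class follow from the confusion matrix, and how swapping the arguments swaps the two.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = potable, 0 = non-potable.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # share of samples predicted potable that really are potable
recall = tp / (tp + fn)      # share of truly potable samples that the model found

# With the arguments swapped, the reported precision is really the recall.
swapped_precision = precision_score(y_pred_toy, y_true_toy)

print(tn, fp, fn, tp)
print(precision, recall, swapped_precision)
```

For a potability classifier, precision on class 1 is arguably the number to watch, since a false positive means labelling unsafe water as drinkable.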

In [None]:
## Summary Table for Accuracy of models

In [None]:
!pip install tabulate -q

In [None]:
import tabulate
data = [["Decision Tree",DT_acc_score],
         ["Random Forest Classifier",RF_acc_score],
         ["Logistic Regression",LR_acc_score],
         ["Support Vector Machines",SVM_acc_score]]
table = tabulate.tabulate(data, headers=["Model", "Accuracy"], tablefmt='fancy_grid')
#table = tabulate.tabulate(data, tablefmt='html')
print(table)  # Print the table string for display in Jupyter
#display(table) # Use display() to render the HTML content of 'table'

## References:

1. McIntosh, J. (2018, July 16). Fifteen benefits of drinking water. Medical News Today. https://www.medicalnewstoday.com/articles/290814#benefits

2. Wikipedia contributors. (2022, July 27). Drinking water. Wikipedia. https://en.wikipedia.org/wiki/Drinking_water#:%7E:text=According%20to%20the%20World%20Health%20Organization%20(WHO)%2C%20%22access,had%20increased%20to%2091%20percent.

3. World Water Day | Drinking Water | Healthy Water | CDC. (n.d.). Cdc.gov. Retrieved August 11, 2022, from https://www.cdc.gov/healthywater/drinking/world-water-day.html

4. United Nations. (n.d.). Water. Retrieved August 11, 2022, from https://www.un.org/en/global-issues/water

5. Water and sanitation. (2022, May 25). Oxfam International. https://www.oxfam.org/en/what-we-do/issues/water-and-sanitation

6. Nichani, P. (2021, December 14). Appropriate ways to Treat Missing Values - Analytics Vidhya. Medium. https://medium.com/analytics-vidhya/appropriate-ways-to-treat-missing-values-f82f00edd9be

7. Agarwal, V. (2021, December 12). Outlier detection with Boxplots - Vishal Agarwal. Medium. https://medium.com/@agarwal.vishal819/outlier-detection-with-boxplots-1b6757fafa21
