# Water Quality Analysis and Classification

## Context

Water quality is a critical factor in maintaining public health and environmental sustainability. Contaminants in water, such as heavy metals, bacteria, and chemicals, can pose serious health risks if present in unsafe concentrations. This project focuses on analyzing and classifying the quality of water samples from an urban environment to determine their safety for consumption. By examining various water quality parameters, we can identify potential hazards and ensure that the water meets safety standards for human use.

The analysis uses an imaginary dataset designed for educational purposes, so that users can practice data analysis and machine learning.

## Source

This dataset is available on Kaggle at the following link:

> https://www.kaggle.com/datasets/mssmartypants/water-quality

### Data Dictionary

The dataset contains various attributes related to water quality, where each attribute represents a specific substance or contaminant measured in the water. All attributes are numeric; they are listed below along with their safety thresholds:

- **aluminium**: Aluminium content in the water sample (numeric). Dangerous if greater than 2.8
- **ammonia**: Ammonia content in the water sample (numeric). Dangerous if greater than 32.5
- **arsenic**: Arsenic content in the water sample (numeric). Dangerous if greater than 0.01
- **barium**: Barium content in the water sample (numeric). Dangerous if greater than 2
- **cadmium**: Cadmium content in the water sample (numeric). Dangerous if greater than 0.005
- **chloramine**: Chloramine content in the water sample (numeric). Dangerous if greater than 4
- **chromium**: Chromium content in the water sample (numeric). Dangerous if greater than 0.1
- **copper**: Copper content in the water sample (numeric). Dangerous if greater than 1.3
- **flouride**: Fluoride content in the water sample (numeric; the column name keeps the dataset's spelling). Dangerous if greater than 1.5
- **bacteria**: Bacteria present in the water sample (numeric). Dangerous if greater than 0
- **viruses**: Viruses present in the water sample (numeric). Dangerous if greater than 0
- **lead**: Lead content in the water sample (numeric). Dangerous if greater than 0.015
- **nitrates**: Nitrate content in the water sample (numeric). Dangerous if greater than 10
- **nitrites**: Nitrite content in the water sample (numeric). Dangerous if greater than 1
- **mercury**: Mercury content in the water sample (numeric). Dangerous if greater than 0.002
- **perchlorate**: Perchlorate content in the water sample (numeric). Dangerous if greater than 56
- **radium**: Radium content in the water sample (numeric). Dangerous if greater than 5
- **selenium**: Selenium content in the water sample (numeric). Dangerous if greater than 0.5
- **silver**: Silver content in the water sample (numeric). Dangerous if greater than 0.1
- **uranium**: Uranium content in the water sample (numeric). Dangerous if greater than 0.3
- **is_safe**: Output feature that classifies whether the water is safe or not. It takes two values (**0 - not safe, 1 - safe**)
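
The thresholds listed above can be turned into a simple lookup. The following is a minimal sketch, not part of the original notebook: the helper `exceeded_thresholds` and the sample values are hypothetical and only illustrate how a single sample could be checked against the data dictionary.

```python
# Safety thresholds copied from the data dictionary above.
THRESHOLDS = {
    "aluminium": 2.8, "ammonia": 32.5, "arsenic": 0.01, "barium": 2,
    "cadmium": 0.005, "chloramine": 4, "chromium": 0.1, "copper": 1.3,
    "flouride": 1.5, "bacteria": 0, "viruses": 0, "lead": 0.015,
    "nitrates": 10, "nitrites": 1, "mercury": 0.002, "perchlorate": 56,
    "radium": 5, "selenium": 0.5, "silver": 0.1, "uranium": 0.3,
}


def exceeded_thresholds(sample):
    """Return the sorted names of parameters whose value is above its threshold."""
    return sorted(
        name
        for name, value in sample.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    )


# Hypothetical sample with only a few parameters filled in.
sample = {"aluminium": 1.65, "arsenic": 0.04, "barium": 2.85, "copper": 0.17}
flagged = exceeded_thresholds(sample)
print(flagged)
```

A per-parameter flag of this kind is a descriptive check only; the classifier in this project is trained on the `is_safe` label rather than on these rules.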

### Problem Statement

1. **Model Training**: Separate the input and output features, split them into training and testing sets, and train the model on the training set.
2. **Model Evaluation**: Evaluate the performance of the model on the testing set using different evaluation metrics such as accuracy, precision, recall and F1 score.


### Load Libraries

In [32]:
# General
import pandas as pd
import numpy as np
import warnings
import os
import pickle

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model training and Evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Model Optimization
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import GridSearchCV

### Settings

In [20]:
# Warning
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
model_path = "../models"
csv_path = os.path.join(data_path,"water_wom.csv")

### Load Data

In [21]:
df = pd.read_csv(csv_path)

In [22]:
# Check the data
df.head()

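| Unnamed: 0 | aluminium | ammonia | arsenic | barium | cadmium | chloramine | chromium | copper | flouride | bacteria | ... | lead | nitrates | nitrites | mercury | perchlorate | radium | selenium | silver | uranium | is_safe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.65 | 9.08 | 0.04 | 2.85 | 0.007 | 0.35 | 0.83 | 0.17 | 0.05 | 0.2 | ... | 0.054 | 16.08 | 1.13 | 0.007 | 37.75 | 6.78 | 0.08 | 0.34 | 0.02 | 1.0 |
| 1 | 2.32 | 21.16 | 0.01 | 3.31 | 0.002 | 5.28 | 0.68 | 0.66 | 0.9 | 0.65 | ... | 0.1 | 2.01 | 1.93 | 0.003 | 32.26 | 3.21 | 0.08 | 0.27 | 0.05 | 1.0 |
| 2 | 1.01 | 14.02 | 0.04 | 0.58 | 0.008 | 4.24 | 0.53 | 0.02 | 0.99 | 0.05 | ... | 0.078 | 14.16 | 1.11 | 0.006 | 50.28 | 7.07 | 0.07 | 0.44 | 0.01 | 0.0 |
| 3 | 1.36 | 11.33 | 0.04 | 2.96 | 0.001 | 7.23 | 0.03 | 1.66 | 1.08 | 0.71 | ... | 0.016 | 1.41 | 1.29 | 0.004 | 9.12 | 1.72 | 0.02 | 0.45 | 0.05 | 1.0 |
| 4 | 0.92 | 24.33 | 0.03 | 0.2 | 0.006 | 2.67 | 0.69 | 0.57 | 0.61 | 0.13 | ... | 0.117 | 6.74 | 1.11 | 0.003 | 16.9 | 2.41 | 0.02 | 0.06 | 0.02 | 1.0 |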


### Preprocessing

In [23]:
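# Separate input and output features
# Note: the preview above shows an "Unnamed: 0" index column; this slice keeps it as an input feature
X = df.iloc[:, :-1]
y = df.iloc[:, -1]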

In [24]:
# Split training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=42)

In [25]:
# Scale the data
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
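
The scaling step fits the scaler on the training split only and then reuses the learned statistics for the test split. The sketch below illustrates that idea for a single feature column in plain Python. The function names and the values are hypothetical; the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Learn the mean and population standard deviation of one feature column."""
    return fmean(column), pstdev(column)


def standardize(column, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in column]


# Hypothetical values for a single feature.
train_col = [1.0, 2.0, 3.0, 4.0]
test_col = [2.5, 5.0]

# Statistics come from the training column only ...
mean, std = fit_standardizer(train_col)
train_scaled = standardize(train_col, mean, std)
# ... and are reused unchanged for the test column.
test_scaled = standardize(test_col, mean, std)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps information from the test split out of the preprocessing, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.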

### Model Training and Evaluation

In [26]:
# Train the model and evaluate the performance of the trained model with different metrics
def train_evaluate(model):
    # Train the model
    model.fit(X_train_s, y_train)

    # Prediction with trained model for train and test data
    y_train_pred = model.predict(X_train_s)
    y_test_pred = model.predict(X_test_s)

    # Evaluate the model with training data
    print("=" * 60)
    print("EVALUATION  FOR TRAINING DATASET")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_train, y_train_pred)}")
    print(f"Precision: {precision_score(y_train, y_train_pred)}")
    print(f"Recall: {recall_score(y_train, y_train_pred)}")
    print(f"F1: {f1_score(y_train, y_train_pred)}")
    print("=" * 60)
    print("EVALUATION  FOR TESTING DATASET")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred)}")
    print(f"Precision: {precision_score(y_test, y_test_pred)}")
    print(f"Recall: {recall_score(y_test, y_test_pred)}")
    print(f"F1: {f1_score(y_test, y_test_pred)}")

In [28]:
rf = RandomForestClassifier()
train_evaluate(rf)

EVALUATION  FOR TRAINING DATASET
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1: 1.0
EVALUATION  FOR TESTING DATASET
Accuracy: 0.96125
Precision: 0.96
Recall: 0.72
F1: 0.8228571428571428
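
The four test metrics above are linked through the confusion matrix. The notebook does not print one, so the counts in the sketch below are an assumption: they are one set of counts for an 800-sample test split that reproduces the reported values. The function `classification_metrics` is a hypothetical helper, not a scikit-learn API.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Assumed counts (not shown in the notebook) that match the reported metrics.
accuracy, precision, recall, f1 = classification_metrics(tp=72, fp=3, fn=28, tn=697)
print(accuracy, precision, recall, f1)
```

Under these assumed counts the model rarely flags unsafe water as safe (few false positives) but misses a noticeable share of the safe samples, which is what a recall of 0.72 expresses.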


### Model Optimization

In [31]:
# Model Validation
kfold = KFold(n_splits = 10, random_state= 42, shuffle= True)
vmodel = RandomForestClassifier()
# Cross validation score
result = cross_val_score(vmodel, X, y, cv= kfold)
print(f"Accuracy: {result.mean()}")

Accuracy: 0.9612298185231541
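
The cross-validation score above is the mean accuracy over 10 shuffled folds. As a rough illustration of what a shuffled K-fold split does, here is a plain-Python sketch. The function `kfold_indices` is hypothetical and does not reproduce scikit-learn's exact shuffling; it only shows the partitioning idea.

```python
import random


def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % n_splits folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Hypothetical tiny dataset of 23 samples split into 10 folds.
folds = list(kfold_indices(n_samples=23, n_splits=10, seed=42))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each fold trains on roughly 90% of the data and scores on the remaining 10%, so the averaged result is less dependent on one particular train/test split.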


In [33]:
# Define the hyperparameter grid for the Random Forest Classifier
param_dict = {
    "n_estimators": [100, 200, 400],
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 2, 3, 4],
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": [1, 2, 4]
}

In [36]:
# Hyperparameter tuning to find the best-performing model
gmodel = RandomForestClassifier()

# Define gridsearchcv
gscv = GridSearchCV(estimator= gmodel,
                   param_grid= param_dict,
                   cv= 5,
                   verbose= 1)
# Run the grid search
gscv.fit(X, y)

# Get Best parameter set and score
print(f"Best Score: {gscv.best_score_}")
best_param_set = gscv.best_params_

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best Score: 0.8619430112570357
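
The number of candidates and fits reported in the log follows directly from the size of the grid. The short sketch below recomputes them from the same parameter grid, using the 5 folds passed as `cv`.

```python
from itertools import product

# Same grid as in the notebook cell above.
param_dict = {
    "n_estimators": [100, 200, 400],
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 2, 3, 4],
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": [1, 2, 4],
}

# Every combination of one value per hyperparameter is one candidate.
candidates = list(product(*param_dict.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold.
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

The grid grows multiplicatively, so adding one more value to any list increases the number of fits by a full multiple of the remaining combinations.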


In [38]:
best_param_set

{'criterion': 'entropy',
 'max_depth': 2,
 'min_samples_leaf': 4,
 'min_samples_split': 2,
 'n_estimators': 200}

In [39]:
model = RandomForestClassifier(** best_param_set)
train_evaluate(model)

EVALUATION  FOR TRAINING DATASET
Accuracy: 0.8886804252657912
Precision: 0.0
Recall: 0.0
F1: 0.0
EVALUATION  FOR TESTING DATASET
Accuracy: 0.875
Precision: 0.0
Recall: 0.0
F1: 0.0
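
Zero precision and recall together with high accuracy is consistent with a model that predicts class 0 for every sample. The sketch below reproduces that pattern. The label counts (800 samples, 100 of them safe) are hypothetical and chosen only so that the accuracy equals the reported 0.875; `confusion_counts` is a helper written for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical imbalanced test labels: 100 safe (1) and 700 not safe (0).
y_true = [1] * 100 + [0] * 700
# A model that always predicts the majority class.
y_pred = [0] * len(y_true)

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# With no positive predictions, precision is undefined; report it as 0.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(accuracy, precision, recall)
```

Because the grid search above uses the default accuracy scoring, a shallow forest that always answers "not safe" can still score well on imbalanced labels. Options that were not tried in the notebook, such as passing `scoring="f1"` to `GridSearchCV` or `class_weight="balanced"` to `RandomForestClassifier`, might be worth exploring.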


### Conclusion

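> The model with default parameters performs best, reaching **96%** accuracy on the test set (precision 0.96, recall 0.72). The tuned model reaches only 87.5% test accuracy with zero precision and recall, so the default model is the one that is saved.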

In [40]:
# Save the model
rf_model_path = os.path.join(model_path, "rf_model.pkl")
with open(rf_model_path, "wb") as model_path_rf:
    pickle.dump(rf, model_path_rf)
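
Since the model was trained on scaled features, new samples would need the same scaling before prediction, and the cell above saves only `rf`. One possible approach, sketched here under assumptions, is to pickle the scaler and the model together. The dictionaries below are placeholders standing in for the fitted `scaler` and `rf` objects, and the file name `rf_bundle.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

# Placeholders for the fitted StandardScaler and RandomForestClassifier;
# in the notebook these would be the `scaler` and `rf` objects themselves.
bundle = {
    "scaler": {"kind": "placeholder scaler"},
    "model": {"kind": "placeholder model"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "rf_bundle.pkl")

    # Save both objects in a single file.
    with open(bundle_path, "wb") as bundle_file:
        pickle.dump(bundle, bundle_file)

    # Load them back together, as an inference script would.
    with open(bundle_path, "rb") as bundle_file:
        restored = pickle.load(bundle_file)

print(sorted(restored))
```

After loading, an inference script would apply `restored["scaler"].transform(...)` to the raw features before calling `restored["model"].predict(...)`.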