# Water Quality Model

The water quality model employs supervised machine learning techniques to classify water as potable (1) or not potable (0), making this a binary classification problem since the output variable is categorical. Supervised learning is a type of machine learning where the model is trained on labeled data. In this context, "labeled data" means that each training example is paired with an output label. The goal of the model is to learn the mapping from inputs to outputs based on the provided labels. 

## Data

The data is stored in a CSV file located in the `data` folder under the name `water_potability.csv`. The first task involves performing Exploratory Data Analysis (EDA) to identify any discrepancies in the dataset, normalize the data, and visualize it effectively.

In [2]:

# import modules needed

import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

2024-06-05 12:16:06.118240: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
# read csv file and print the first 10 rows

data = pd.read_csv('data/water_potability.csv')
print(data.head(10))

# Number of rows and columns in entire dataset
print(f'Data has {data.shape[0]} rows and {data.shape[1]} columns')

          ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  \
0        NaN  204.890455  20791.318981     7.300212  368.516441    564.308654   
1   3.716080  129.422921  18630.057858     6.635246         NaN    592.885359   
2   8.099124  224.236259  19909.541732     9.275884         NaN    418.606213   
3   8.316766  214.373394  22018.417441     8.059332  356.886136    363.266516   
4   9.092223  181.101509  17978.986339     6.546600  310.135738    398.410813   
5   5.584087  188.313324  28748.687739     7.544869  326.678363    280.467916   
6  10.223862  248.071735  28749.716544     7.513408  393.663396    283.651634   
7   8.635849  203.361523  13672.091764     4.563009  303.309771    474.607645   
8        NaN  118.988579  14285.583854     7.804174  268.646941    389.375566   
9  11.180284  227.231469  25484.508491     9.077200  404.041635    563.885481   

   Organic_carbon  Trihalomethanes  Turbidity  Potability  
0       10.379783        86.990970   2.963135   

### Check for the availability and number of NaN values in the dataset

In [4]:
# check for Nan Values in each column
nan_counts = data.isnull().sum()

print(f'Nan Values in each column: {nan_counts}')

Nan Values in each column: ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64


From the information above the columns sulfate and Trihalomethanes both have a good number of NaN Values. 

### Replace NaN values with the mean of the column

In [5]:
# Replace NaN values in 'sulfate' and 'Trihalomethanes' with the mean of the respective column
data['ph'] = data['ph'].fillna(data['ph'].mean())
data['Sulfate'] = data['Sulfate'].fillna(data['Sulfate'].mean())
data['Trihalomethanes'] = data['Trihalomethanes'].fillna(data['Trihalomethanes'].mean())

print("NaN values after replacement:")
print(data.isnull().sum())

NaN values after replacement:
ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64


In [6]:
print(data.head(10))

          ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  \
0   7.080795  204.890455  20791.318981     7.300212  368.516441    564.308654   
1   3.716080  129.422921  18630.057858     6.635246  333.775777    592.885359   
2   8.099124  224.236259  19909.541732     9.275884  333.775777    418.606213   
3   8.316766  214.373394  22018.417441     8.059332  356.886136    363.266516   
4   9.092223  181.101509  17978.986339     6.546600  310.135738    398.410813   
5   5.584087  188.313324  28748.687739     7.544869  326.678363    280.467916   
6  10.223862  248.071735  28749.716544     7.513408  393.663396    283.651634   
7   8.635849  203.361523  13672.091764     4.563009  303.309771    474.607645   
8   7.080795  118.988579  14285.583854     7.804174  268.646941    389.375566   
9  11.180284  227.231469  25484.508491     9.077200  404.041635    563.885481   

   Organic_carbon  Trihalomethanes  Turbidity  Potability  
0       10.379783        86.990970   2.963135   

In [7]:
# Separate our data to X -> Feature Columms and Y -> Output Label

X = data.iloc[:, 0:9]
Y = data['Potability']

# Display first few rows of X and Y to verify
print(f'X features: {X.head()}')
print(f'Y target: {Y.head()}')

X features:          ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  \
0  7.080795  204.890455  20791.318981     7.300212  368.516441    564.308654   
1  3.716080  129.422921  18630.057858     6.635246  333.775777    592.885359   
2  8.099124  224.236259  19909.541732     9.275884  333.775777    418.606213   
3  8.316766  214.373394  22018.417441     8.059332  356.886136    363.266516   
4  9.092223  181.101509  17978.986339     6.546600  310.135738    398.410813   

   Organic_carbon  Trihalomethanes  Turbidity  
0       10.379783        86.990970   2.963135  
1       15.180013        56.329076   4.500656  
2       16.868637        66.420093   3.055934  
3       18.436524       100.341674   4.628771  
4       11.558279        31.997993   4.075075  
Y target: 0    0
1    0
2    0
3    0
4    0
Name: Potability, dtype: int64


### Normalize the Data

`Why Normalize?`

- Consistent Scale: Ensures all features contribute equally to the model.
- Improved Performance: Helps gradient-based algorithms converge faster.
-Prevent Bias: Avoids models being biased towards features with larger scales.

`Why Standard Scaler?`

- Standardization: Transforms features to have a mean of 0 and a standard deviation of 1.
- Robust to Different Ranges: Handles features with varying ranges effectively.
- Suitable for Normal Distribution: Aligns well with algorithms assuming normally distributed data.

In [8]:
scaler = StandardScaler()

def normalize_data(X):
    '''
    Normalizes a data
    '''
    normalized_data = scaler.fit_transform(X)
    
    return normalized_data 

X_normalized = normalize_data(X)
print(type(X_normalized))
    

<class 'numpy.ndarray'>


### Split the data into training and testing

In [9]:
X = X_normalized

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.25, random_state=1)


## Building a Neural Network for Binary Classification using tensorflow

In the next cell, we define a function neural_net that constructs and compiles a neural network model using the TensorFlow Keras API. The function takes an optional parameter regularizer, which allows us to apply regularization to the Dense layers in the network. 

In [10]:
def neural_net(regularizer=None):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape = (9,)),
        tf.keras.layers.Dense(250, activation="tanh", kernel_regularizer=regularizer),
        tf.keras.layers.Dense(125, activation="tanh", kernel_regularizer=regularizer),
        tf.keras.layers.Dense(2, activation="softmax")
    ])
    loss_function = tf.keras.losses.SparseCategoricalCrossentropy()
    model.compile(optimizer = 'adam',
              loss = loss_function,
              metrics = ['accuracy'])
    
    return model

unregularized_model = neural_net()

  super().__init__(**kwargs)


### Fit the training data to the Unregularized model

In [11]:
unregularized_model.fit(X_train, Y_train, epochs = 100, validation_data=(X_val, Y_val))

Epoch 1/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.5730 - loss: 0.7077 - val_accuracy: 0.5604 - val_loss: 0.7136
Epoch 2/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.5970 - loss: 0.6876 - val_accuracy: 0.5885 - val_loss: 0.7022
Epoch 3/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6155 - loss: 0.6626 - val_accuracy: 0.5690 - val_loss: 0.6992
Epoch 4/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6212 - loss: 0.6617 - val_accuracy: 0.5824 - val_loss: 0.6796
Epoch 5/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6203 - loss: 0.6639 - val_accuracy: 0.5995 - val_loss: 0.6717
Epoch 6/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6498 - loss: 0.6390 - val_accuracy: 0.6313 - val_loss: 0.6521
Epoch 7/100
[1m77/77[0m [32m━━━

<keras.src.callbacks.history.History at 0x145561e80>

In [12]:
# confussion matrix with sklearn
# import the module

from sklearn.metrics import confusion_matrix

y_true = Y_test
y_pred = unregularized_model.evaluate(X_test, Y_test)

confusion_matrix(y_true, y_pred)

print(confusion_matrix)

[1m21/21[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.5976 - loss: 0.9520 


ValueError: Found input variables with inconsistent numbers of samples: [656, 2]

### Create a regularized model that uses L2 Regularization and train the model with X_test, Y_test

In [None]:
regularizer = tf.keras.regularizers.l2(0.0001)
reg_model = neural_net(regularizer)

reg_model.fit(X_train, Y_train, epochs = 100, validation_data=(X_val, Y_val))

Epoch 1/100


  super().__init__(**kwargs)


[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 6ms/step - accuracy: 0.5440 - loss: 0.7327 - val_accuracy: 0.5531 - val_loss: 0.7372
Epoch 2/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.5817 - loss: 0.7069 - val_accuracy: 0.5665 - val_loss: 0.7197
Epoch 3/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6110 - loss: 0.6947 - val_accuracy: 0.5995 - val_loss: 0.7052
Epoch 4/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.6304 - loss: 0.6833 - val_accuracy: 0.5861 - val_loss: 0.6902
Epoch 5/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6128 - loss: 0.6802 - val_accuracy: 0.5971 - val_loss: 0.6978
Epoch 6/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6325 - loss: 0.6738 - val_accuracy: 0.6142 - val_loss: 0.6826
Epoch 7/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x14e7f0230>

### Implement Dropout in confluence with L2 Regularization

Our model currently has an accuracy of 89.14% and a loss of 0.34 on the training data and 64.96% accuracy and a loss of 0.86, suggesting a possible case of `Overfitting`. Now let's try to implement early stopping.

In [None]:
# Define the neural network model with L2 regularization and Dropout
def neural_net_with_dropout(regularizer=None):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(9,)),
        tf.keras.layers.Dense(500, activation="tanh", kernel_regularizer=regularizer),
        # tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(250, activation="tanh", kernel_regularizer=regularizer),
        # tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(2, activation="softmax")
    ])
    loss_function = tf.keras.losses.SparseCategoricalCrossentropy()
    model.compile(optimizer='adam',
                  loss=loss_function,
                  metrics=['accuracy'])
    return model

regularizer = tf.keras.regularizers.l2(0.0001)
dropout_model = neural_net_with_dropout(regularizer)

# Train the model with the EarlyStopping callback
dropout_model.fit(
    X_train, Y_train, 
    epochs=100, 
    validation_data=(X_val, Y_val)
)

  super().__init__(**kwargs)


Epoch 1/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - accuracy: 0.5616 - loss: 0.7736 - val_accuracy: 0.5543 - val_loss: 0.7826
Epoch 2/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.6007 - loss: 0.7339 - val_accuracy: 0.5897 - val_loss: 0.7361
Epoch 3/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.6103 - loss: 0.7125 - val_accuracy: 0.5714 - val_loss: 0.7180
Epoch 4/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.6136 - loss: 0.7058 - val_accuracy: 0.6081 - val_loss: 0.7211
Epoch 5/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.6071 - loss: 0.7047 - val_accuracy: 0.6142 - val_loss: 0.7030
Epoch 6/100
[1m77/77[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.6128 - loss: 0.6925 - val_accuracy: 0.6252 - val_loss: 0.6894
Epoch 7/100
[1m77/77[0m [32m━━━

<keras.src.callbacks.history.History at 0x14f1dd730>