<a href="https://colab.research.google.com/github/aka-gera/Data_Classification/blob/main/water_probability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **WATER PROBABILITY PREDICTION**

---

We will employ classification algorithms to predict whether the water in the dataset is potable or non-potable based on its various chemical and physical characteristics.

---

This dataset is obtained from: [Kaggle - Water Probability Dataset](https://www.kaggle.com/datasets/nayanack/water-probability)

---

The classification algorithms tested reach an accuracy of up to 66%.

---

The most influential features contributing to the prediction are:

1. The pH value of the water, indicating its acidity or alkalinity.
2. The level of water hardness, typically measured in milligrams per liter (mg/L) of calcium carbonate.

These features significantly impact the classification process, enhancing the model's predictive accuracy.


## Dataset Description


Here is a brief description of the dataset.


| Column           | Description                                                                                               |
|------------------|-----------------------------------------------------------------------------------------------------------|
| ph               | The pH value of the water, which indicates its acidity or alkalinity.                                     |
| Hardness         | The level of water hardness, typically measured in milligrams per liter (mg/L) of calcium carbonate.      |
| Solids           | The total dissolved solids (TDS) concentration in the water, usually measured in parts per million (ppm).  |
| Chloramines      | The concentration of chloramines in the water, which are disinfectants commonly used in water treatment.  |
| Sulfate          | The concentration of sulfate ions in the water, often measured in milligrams per liter (mg/L).             |
| Conductivity     | The electrical conductivity of the water, typically measured in microsiemens per centimeter (μS/cm).      |
| Organic_carbon   | The concentration of organic carbon compounds in the water, usually measured in milligrams per liter (mg/L).|
| Trihalomethanes  | The concentration of trihalomethanes in the water, which are disinfection byproducts.                      |
| Turbidity        | The turbidity of the water, which measures its clarity or cloudiness.                                      |
| Potability       | A binary indicator (0 or 1) indicating whether the water is potable (safe for drinking).                  |


# Preset Parameters

In [1]:
data_dir = 'nayanack/water-probability'  # Kaggle dataset identifier (owner/dataset)

view_hist_feat = [0, 1, 2, -2, -1]  # Features selected for histogram visualization

target_switcher = -1  # Index of the column to swap with the last column (the target); -1 keeps the current target

feat = []  # Features to drop

data_nan_drop = True  # Choose True to drop NaN values,
                      # otherwise fill them with:
                      #          the mode of the categorical feature
                      #          the mean of the numerical feature


balanced_dataset = False  # Whether to balance the dataset or not

confidence_interval_limit = [-3, 3]  # Define the limits of the confidence interval [-m, m] and eliminate the outliers

correlation_percentage_threshold = 0.7  # Set the correlation threshold between features for removal

pre_proc = 'X'  # Data preprocessing:
                #   Choose 'XY' to standardize both 'X' and 'Y',
                #   Choose 'X' to standardize only 'X',
                #   Choose 'Y' to standardize only 'Y',

target_values_label = False  # True if target values are float or integers

####### Neural Network Parameters #######
activation = 'relu'
epoch = 10
num_nodes = [2, 4]
dropout_prob = [0.05, 0.1]
lr = [0.01, 0.1]
batch_size = [2, 4]


# Import Dataset

In [2]:
! pip install kaggle



In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
!pwd
%cd /content

/content
/content


In [5]:
! mkdir -p ~/.kaggle

In [6]:
! cp /content/drive/MyDrive/Kaggle_API/kaggle.json ~/.kaggle

In [7]:
! chmod 600 ~/.kaggle/kaggle.json

In [8]:
! kaggle datasets download {data_dir}

Dataset URL: https://www.kaggle.com/datasets/nayanack/water-probability
License(s): CC0-1.0
Downloading water-probability.zip to /content
  0% 0.00/251k [00:00<?, ?B/s]
100% 251k/251k [00:00<00:00, 65.5MB/s]


In [9]:
import os
file_names = os.listdir()
zip_file = [file for file in file_names if file.endswith('.zip')]
zip_file

['water-probability.zip']

In [10]:
import zipfile

# Open the zip file
with zipfile.ZipFile(zip_file[-1], 'r') as zip_ref:
    zip_ref.extractall()
    unzipped_file_names = zip_ref.namelist()
unzipped_file_names

['water_potability.csv']

# Import the helper classes

In [11]:
!pwd
%cd /content/drive/MyDrive/ML2023/data-analysis

/content
/content/drive/MyDrive/ML2023/data-analysis


In [12]:
!pip install aka-mlearning==0.0.1
from aka_MLearning import aka_classification,aka_ML_analysis,aka_regression

Collecting aka-mlearning==0.0.1
  Downloading aka_mlearning-0.0.1.tar.gz (4.8 kB)
  Preparing metadata (setup.py) ... done
Collecting catboost (from aka-mlearning==0.0.1)
  Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl (98.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.2/98.2 MB 6.1 MB/s eta 0:00:00
Building wheels for collected packages: aka-mlearning
  Building wheel for aka-mlearning (setup.py) ... done
  Created wheel for aka-mlearning: filename=aka_mlearning-0.0.1-py3-none-any.whl size=5222 sha256=1ac2c4d874ed857be54e662cbc7b539514d898d23966be7ffedd7d58b6ba8cff
  Stored in directory: /root/.cache/pip/wheels/a9/32/37/dc5b42ab80d79613dd21357f887d4b9b1d5c93a64ccb4372ab
Successfully built aka-mlearning
Installing collected packages: catboost, aka-mlearning
Successfully installed aka-mlearning-0.0.1 catboost-1.2.5


In [13]:
!pip install aka-data-prep==0.1.2
from aka_data_prep import aka_encoding,aka_df_prepare,aka_plot_prep,aka_cleaned_data,aka_plot_shap,aka_plot_ML
aka_plot_ = aka_plot_prep()

Collecting aka-data-prep==0.1.2
  Downloading aka-data-prep-0.1.2.tar.gz (7.7 kB)
  Preparing metadata (setup.py) ... done
Collecting shap (from aka-data-prep==0.1.2)
  Downloading shap-0.45.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (540 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 540.5/540.5 kB 6.1 MB/s eta 0:00:00
Collecting slicer==0.0.8 (from shap->aka-data-prep==0.1.2)
  Downloading slicer-0.0.8-py3-none-any.whl (15 kB)
Building wheels for collected packages: aka-data-prep
  Building wheel for aka-data-prep (setup.py) ... done
  Created wheel for aka-data-prep: filename=aka_data_prep-0.1.2-py3-none-any.whl size=7988 sha256=6fe11c7207dc60d250202ffb3b1e0696266e97a83d4f88d6cb071ac521da8651
  Stored in directory: /root/.cache/pip/wheels/f2/de/d6/05cbd71695a5fc82a0e740c266bf8e77f99f7219d78f1a954d
Successfully built aka-data-prep
Installing collected packag

In [14]:
from aka_data_analysis.aka_nn import aka_nn

In [15]:
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd



# Dataset Information

In [16]:
df = aka_df_prepare().df_get(f'/content/{unzipped_file_names[0]}')
df.head()

|   | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|----|----------|--------|-------------|---------|--------------|----------------|-----------------|-----------|------------|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


In [17]:
df.describe()

|       | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|-------|----|----------|--------|-------------|---------|--------------|----------------|-----------------|-----------|------------|
| count | 2785.0 | 3276.0 | 3276.0 | 3276.0 | 2495.0 | 3276.0 | 3276.0 | 3114.0 | 3276.0 | 3276.0 |
| mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.28497 | 66.396293 | 3.966786 | 0.39011 |
| std | 1.59432 | 32.879761 | 8768.570828 | 1.583085 | 41.41684 | 80.824064 | 3.308162 | 16.175008 | 0.780382 | 0.487849 |
| min | 0.0 | 47.432 | 320.942611 | 0.352 | 129.0 | 181.483754 | 2.2 | 0.738 | 1.45 | 0.0 |
| 25% | 6.093092 | 176.850538 | 15666.690297 | 6.127421 | 307.699498 | 365.734414 | 12.065801 | 55.844536 | 3.439711 | 0.0 |
| 50% | 7.036752 | 196.967627 | 20927.833607 | 7.130299 | 333.073546 | 421.884968 | 14.218338 | 66.622485 | 3.955028 | 0.0 |
| 75% | 8.062066 | 216.667456 | 27332.762127 | 8.114887 | 359.95017 | 481.792304 | 16.557652 | 77.337473 | 4.50032 | 1.0 |
| max | 14.0 | 323.124 | 61227.196008 | 13.127 | 481.030642 | 753.34262 | 28.3 | 124.0 | 6.739 | 1.0 |
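
The `mean` of `Potability` in the summary above is the fraction of samples labelled potable, which gives a reference point for reading any accuracy figure. As a rough sketch (the variable names are illustrative, and the number applies to the raw data before any rows are dropped), the accuracy of always predicting the majority class can be worked out as:

```python
# Value copied from the df.describe() output above (raw data, before cleaning).
potable_fraction = 0.39011  # mean of the 0/1 Potability column

# A classifier that always predicts the majority class is right
# exactly as often as that class occurs.
majority_baseline = max(potable_fraction, 1 - potable_fraction)

print(f"Potable samples:   {potable_fraction:.1%}")
print(f"Majority baseline: {majority_baseline:.1%}")
```

On the raw data, a model therefore needs to exceed roughly 61% accuracy to do better than always answering "non-potable".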


In [18]:

import plotly.express as px


def plot_box(df, index_col_box=(0, 1, 2, 3, -3, -2, -1)):
    """Box plot of the selected columns, ordered by their maximum value."""
    # Map negative column indices to their positive equivalents.
    index_col_box = [ind % df.shape[1] for ind in index_col_box]

    if index_col_box:
        selected = df[df.columns[index_col_box]]
        df_max = selected.select_dtypes(exclude=['object']).max().sort_values()
        return px.box(selected, y=df_max.index)

    # Nothing to plot: report it and return an empty matplotlib figure.
    print("Empty list provided")
    return plt.figure()


plot_box(df, []).show()

Empty list provided


<Figure size 640x480 with 0 Axes>

In [19]:
aka_plot_.plot_box(df,[0,-2])

In [20]:
# view_hist_feat = [0,1,2,-2,-1]
fig = aka_plot_.Plot_histogram_Features(df,view_hist_feat )
if fig is not None:
    fig.show()

# Null Values

In [21]:
df_null_sum = df.isnull().sum()
df_null = df[df.columns[df_null_sum>0]].isnull().astype(float)
fig = aka_plot_.plot_heatmap(df_null,False)
fig.update_layout(
    xaxis_title='Feature',
    yaxis_title='Count',
    title='Missing Values',
    font=dict(size=20)
)

## Preprocess Missing Values

In [22]:
if df_null_sum.sum() > 0:
  aka_df_prepare().missing_data_processing(df, data_nan_drop=data_nan_drop)
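
The `data_nan_drop` flag from the preset parameters selects between dropping rows that contain NaN and filling them with the mode (categorical features) or the mean (numerical features). The sketch below reproduces that choice with plain pandas. `handle_missing` and the toy columns are hypothetical names used for illustration; this is not the `aka_data_prep` implementation.

```python
import numpy as np
import pandas as pd


def handle_missing(df: pd.DataFrame, drop: bool = True) -> pd.DataFrame:
    """Drop rows with NaN, or fill them column by column (illustrative helper)."""
    if drop:
        return df.dropna()
    filled = df.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            # Numerical feature: fill with the column mean.
            filled[col] = filled[col].fillna(filled[col].mean())
        else:
            # Categorical feature: fill with the most frequent value.
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled


toy = pd.DataFrame({
    "ph": [7.0, np.nan, 9.0, 8.0],
    "source": ["well", "river", "well", None],
})
dropped = handle_missing(toy, drop=True)
filled = handle_missing(toy, drop=False)
print(dropped)
print(filled)
```

Dropping keeps only complete rows and preserves their original index, which is why the cleaned frames in this notebook start at index 3.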

# Clean Dataset

## Drop Duplicate data

In [23]:
df = df.drop_duplicates()  # drop_duplicates returns a new DataFrame, so assign the result
df

|   | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|----|----------|--------|-------------|---------|--------------|----------------|-----------------|-----------|------------|
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
| 5 | 5.584087 | 188.313324 | 28748.687739 | 7.544869 | 326.678363 | 280.467916 | 8.399735 | 54.917862 | 2.559708 | 0 |
| 6 | 10.223862 | 248.071735 | 28749.716544 | 7.513408 | 393.663396 | 283.651634 | 13.789695 | 84.603556 | 2.672989 | 0 |
| 7 | 8.635849 | 203.361523 | 13672.091764 | 4.563009 | 303.309771 | 474.607645 | 12.363817 | 62.798309 | 4.401425 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3267 | 8.989900 | 215.047358 | 15921.412018 | 6.297312 | 312.931022 | 390.410231 | 9.899115 | 55.069304 | 4.613843 | 1 |
| 3268 | 6.702547 | 207.321086 | 17246.920347 | 7.708117 | 304.510230 | 329.266002 | 16.217303 | 28.878601 | 3.442983 | 1 |
| 3269 | 11.491011 | 94.812545 | 37188.826022 | 9.263166 | 258.930600 | 439.893618 | 16.172755 | 41.558501 | 4.369264 | 1 |
| 3270 | 6.069616 | 186.659040 | 26138.780191 | 7.747547 | 345.700257 | 415.886955 | 12.067620 | 60.419921 | 3.669712 | 1 |


## Swap the target and the last feature

In [24]:
# target_switcher = -1
df = aka_df_prepare().swap_features(df,target_switcher)
df.head()

Invalid feature indices or feat_a is equal to feat_b.


|   | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|----|----------|--------|-------------|---------|--------------|----------------|-----------------|-----------|------------|
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
| 5 | 5.584087 | 188.313324 | 28748.687739 | 7.544869 | 326.678363 | 280.467916 | 8.399735 | 54.917862 | 2.559708 | 0 |
| 6 | 10.223862 | 248.071735 | 28749.716544 | 7.513408 | 393.663396 | 283.651634 | 13.789695 | 84.603556 | 2.672989 | 0 |
| 7 | 8.635849 | 203.361523 | 13672.091764 | 4.563009 | 303.309771 | 474.607645 | 12.363817 | 62.798309 | 4.401425 | 0 |


## Drop feature(s)


In [25]:
# feat =  []
df = aka_df_prepare().drop_feature(df,feat)
df.head()

|   | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|----|----------|--------|-------------|---------|--------------|----------------|-----------------|-----------|------------|
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
| 5 | 5.584087 | 188.313324 | 28748.687739 | 7.544869 | 326.678363 | 280.467916 | 8.399735 | 54.917862 | 2.559708 | 0 |
| 6 | 10.223862 | 248.071735 | 28749.716544 | 7.513408 | 393.663396 | 283.651634 | 13.789695 | 84.603556 | 2.672989 | 0 |
| 7 | 8.635849 | 203.361523 | 13672.091764 | 4.563009 | 303.309771 | 474.607645 | 12.363817 | 62.798309 | 4.401425 | 0 |


## Transforming Categorical Variables into Numerical Representations Using Encoding

In [26]:
# Label-encode the categorical columns; keep the encoder so predictions can be mapped back later
df_encod = aka_encoding(df)
df = df_encod.label_encoding()
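
Label encoding replaces each distinct category with an integer code, and the inverse mapping is needed later to turn predictions back into the original labels. The following is a minimal sketch of that round trip. `SimpleLabelEncoder` and the example labels are hypothetical and only illustrate the idea; the internals of `aka_encoding` may differ.

```python
import pandas as pd


class SimpleLabelEncoder:
    """Map each distinct value of a column to an integer code and back."""

    def fit_transform(self, values: pd.Series) -> pd.Series:
        # Sorted unique values give a stable, reproducible code assignment.
        self.classes_ = sorted(values.dropna().unique())
        to_code = {label: code for code, label in enumerate(self.classes_)}
        return values.map(to_code)

    def inverse_transform(self, codes: pd.Series) -> pd.Series:
        # enumerate() yields (code, label) pairs, i.e. the reverse lookup.
        return codes.map(dict(enumerate(self.classes_)))


labels = pd.Series(["potable", "non-potable", "potable"])
encoder = SimpleLabelEncoder()
codes = encoder.fit_transform(labels)
restored = encoder.inverse_transform(codes)
print(codes.tolist())     # [1, 0, 1]
print(restored.tolist())  # ['potable', 'non-potable', 'potable']
```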

## Balance Dataset

In [27]:
aka_plot_.plot_pie(df,-1)

In [28]:
# balanced_dataset = False

if balanced_dataset:
  df = aka_cleaned_data().balance_df(df,'j')
  aka_plot_.plot_pie(df, -1)  # aka_plot_ is already an instance, so it must not be called again
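
One common way to balance a binary dataset is random undersampling: every class is reduced to the size of the smallest one. The sketch below shows that technique with pandas. It is an assumption about what balancing could look like here; the strategy actually used by `balance_df` (and the meaning of its `'j'` argument) is not documented in this notebook, and `undersample` is a hypothetical helper.

```python
import pandas as pd


def undersample(df: pd.DataFrame, target: str, seed: int = 0) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest one."""
    smallest = df[target].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in df.groupby(target)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


toy = pd.DataFrame({
    "ph": [6.5, 7.0, 7.2, 8.1, 6.9, 7.4],
    "Potability": [0, 0, 0, 0, 1, 1],
})
balanced = undersample(toy, "Potability")
print(balanced["Potability"].value_counts().to_dict())
```

Undersampling discards data, so with a small dataset it trades class balance against training-set size.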

## Correlation Matrix

In [29]:
aka_plot_.Plot_Correlation_Matrix(df)

In [30]:

# confidence_interval_limit = [-3, 3]         # Limits [-m, m] of the confidence interval; values outside are treated as outliers

# correlation_percentage_threshold = 0.7      # Correlation threshold above which a feature is removed

df_filtered,corr_tmp = aka_cleaned_data().filter_drop_corr_df(df,confidence_interval_limit,correlation_percentage_threshold)

print(f'We dropped {df.shape[0]-df_filtered.shape[0]} outliers and removed {df.shape[1]-df_filtered.shape[1]} feature(s)')
print(f"The filtered dataset's shape is {df_filtered.shape}")

We dropped 82 outliers and removed 0 feature(s)
The filtered dataset's shape is (1929, 10)
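
The two presets used in this step can be read as a z-score filter (keep rows within `[-m, m]` standard deviations) followed by removal of one feature from every highly correlated pair. The sketch below implements that reading with pandas and NumPy. It is an interpretation, not the code of `filter_drop_corr_df`; the function names, the order of the two steps, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd


def filter_outliers(df: pd.DataFrame, limits=(-3, 3)) -> pd.DataFrame:
    """Keep rows whose z-score lies inside `limits` for every column."""
    z = (df - df.mean()) / df.std()
    inside = ((z >= limits[0]) & (z <= limits[1])).all(axis=1)
    return df[inside]


def drop_correlated(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Drop one feature from each pair with |correlation| above `threshold`."""
    corr = df.corr().abs()
    # Look only above the diagonal so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy data: the last value of `a` is an outlier, and `b` duplicates `a` linearly.
a = np.append(np.arange(1.0, 20.0), 100.0)
toy = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": np.tile([1.0, -1.0], 10)})

filtered = filter_outliers(toy, limits=(-3, 3))
reduced = drop_correlated(filtered, threshold=0.7)
print(filtered.shape)
print(list(reduced.columns))
```

In the toy data the outlier row is removed and `b` is dropped because it is perfectly correlated with `a`. In the notebook run above, no feature pair exceeded the 0.7 threshold, so only rows were removed.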


## Graph the features that are highly correlated


In [31]:
fig = aka_plot_.Plot_scatter(df,list(corr_tmp))
if fig is not None:
    fig.show()

Empty list is provided


<Figure size 300x200 with 0 Axes>

### Visualize the distribution of the filtered dataset

In [32]:
aka_plot_.Plot_box_2_Features(df,df_filtered,corr_tmp=range(df_filtered.shape[1]))

# Search for the most effective ML algorithm to learn the dataset

In [33]:
# pre_proc = 'X'                                # Choose between 'XY' to standardize both 'X' and 'Y',
#                                               #                'X' to standardize only 'X',
#                                               #                'Y' to standardize only 'Y',

X_train, X_test, y_train, y_test = aka_cleaned_data().train_test_cleaned_data(df_filtered,pre_proc)
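
With `pre_proc = 'X'`, only the features are standardized and the target is left as it is. A minimal NumPy sketch of feature standardization follows. It assumes the scaling statistics are computed on the training split and then reused for the test split, which is common practice but not confirmed for `train_test_cleaned_data`; `standardize_features` and the sample values are illustrative.

```python
import numpy as np


def standardize_features(X_train, X_test):
    """Scale both splits using the mean and std of the training split only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X_train - mean) / std, (X_test - mean) / std


X_train = np.array([[6.0, 150.0], [7.0, 200.0], [8.0, 250.0]])
X_test = np.array([[7.5, 225.0]])
X_train_s, X_test_s = standardize_features(X_train, X_test)

print(X_train_s.mean(axis=0))  # approximately zero for every column
print(X_train_s.std(axis=0))   # approximately one for every column
print(X_test_s)
```

Reusing the training statistics keeps information from the test split out of the preprocessing step.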

In [34]:
clf, df_metric_algorithms, clf_algorithms = aka_classification().train_and_find_best_classifier(X_train, y_train, X_test, y_test)


Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
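
This warning is raised by scikit-learn when at least one of the candidate classifiers never predicts a given class. Precision is TP / (TP + FP), so with no predicted samples for that class it becomes 0 / 0 and is reported as 0.0. The plain-Python sketch below illustrates the case; the `precision` function and the example vectors are illustrative and not part of the notebook's libraries.

```python
def precision(y_true, y_pred, positive=1, zero_division=0.0):
    """Precision = TP / (TP + FP); returns `zero_division` when nothing is predicted positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    if tp + fp == 0:
        return zero_division
    return tp / (tp + fp)


y_true = [0, 1, 0, 1]
always_negative = [0, 0, 0, 0]  # a model that never predicts class 1
mixed = [0, 1, 1, 1]

print(precision(y_true, always_negative))  # 0.0, the undefined case
print(precision(y_true, mixed))            # 2 true positives out of 3 positive predictions
```

A precision of 0.0 reported this way means "no positive predictions at all", which is worth checking in the metric report before comparing models.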



In [35]:
fig = aka_plot_.plot_heatmap(df_metric_algorithms)
fig.update_layout(
    xaxis_title='ML algorithm',
    yaxis_title='Metric',
    title='Metric Report',
    font=dict(size=20)
)

In [36]:
clf

In [37]:
clf = clf_algorithms['Cat Boost Classifier']

In [38]:
y_pred_ = df_encod.label_encoding_inverse(clf.predict(X_test))
y_test_ = df_encod.label_encoding_inverse(y_test)

if target_values_label:
  Label = [str(un) for un in np.unique(pd.concat([y_pred_, y_test_]))]
else:
  Label = ['v_'+str(un) for un in np.unique(pd.concat([y_pred_, y_test_]))]


## Confusion Matrix

In [39]:
aka_plot_ML().plot_confusion_matrix(y_test_,y_pred_,Label).show()
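
A confusion matrix counts, for each actual class, how often each class was predicted. As a small illustration of what the plot above encodes, the counts can be computed with `pandas.crosstab`. The label vectors here are made up for the example and are not results from this notebook.

```python
import pandas as pd

y_true = pd.Series([0, 0, 1, 1, 1, 0], name="Actual")
y_pred = pd.Series([0, 1, 1, 0, 1, 0], name="Predicted")

# Rows are actual classes, columns are predicted classes.
cm = pd.crosstab(y_true, y_pred)

# Accuracy is the share of samples on the diagonal of the matrix.
accuracy = (y_true == y_pred).mean()

print(cm)
print(f"Accuracy: {accuracy:.2f}")
```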

## Classification Report

In [40]:
aka_plot_ML().plot_classification_report(y_test_,y_pred_,Label).show()

## Plot Important Features by Weight



In [41]:
# aka_plot.plot_important_features(model,df_filtered)

In [42]:
feat_names = df_filtered.columns[:-1]
aka_plot_shap(clf.best_estimator_, X_train, feat_names).plot_summary_shap().show()



# Neural Net

In [43]:
myNN = aka_nn(X_train, X_test, y_train, y_test, activation)
model, scre = myNN.DNN(epoch, num_nodes, dropout_prob, lr, batch_size)
y_pred = myNN.predict(model)

2 nodes, dropout 0.05, lr 0.01, batch size 2
2 nodes, dropout 0.05, lr 0.01, batch size 4
2 nodes, dropout 0.05, lr 0.1, batch size 2
2 nodes, dropout 0.05, lr 0.1, batch size 4
2 nodes, dropout 0.1, lr 0.01, batch size 2
2 nodes, dropout 0.1, lr 0.01, batch size 4
2 nodes, dropout 0.1, lr 0.1, batch size 2
2 nodes, dropout 0.1, lr 0.1, batch size 4
4 nodes, dropout 0.05, lr 0.01, batch size 2
4 nodes, dropout 0.05, lr 0.01, batch size 4
4 nodes, dropout 0.05, lr 0.1, batch size 2
4 nodes, dropout 0.05, lr 0.1, batch size 4
4 nodes, dropout 0.1, lr 0.01, batch size 2
4 nodes, dropout 0.1, lr 0.01, batch size 4
4 nodes, dropout 0.1, lr 0.1, batch size 2
4 nodes, dropout 0.1, lr 0.1, batch size 4
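
The log above lists every combination of the four hyperparameter lists from the preset parameters: 2 x 2 x 2 x 2 = 16 configurations. The sketch below enumerates the same grid with `itertools.product`. How `aka_nn.DNN` iterates internally is an assumption; the log is simply consistent with this order.

```python
from itertools import product

num_nodes = [2, 4]
dropout_prob = [0.05, 0.1]
lr = [0.01, 0.1]
batch_size = [2, 4]

# product() varies the last list fastest, matching the order of the log above.
grid = list(product(num_nodes, dropout_prob, lr, batch_size))
for nodes, dropout, rate, batch in grid:
    print(f"{nodes} nodes, dropout {dropout}, lr {rate}, batch size {batch}")

print(f"{len(grid)} configurations in total")
```

Because the grid grows multiplicatively, adding a third value to each list would already give 81 configurations to train.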


## Confusion Matrix

In [44]:
y_pred_ = df_encod.label_encoding_inverse(y_pred)
y_test_ = df_encod.label_encoding_inverse(y_test)

In [45]:
aka_plot_ML().plot_confusion_matrix(y_test_,y_pred_,Label).show()

## Classification Report

In [46]:
aka_plot_ML().plot_classification_report(y_test_,y_pred_,Label).show()

## Plot Important Features by Weight


In [47]:
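# Note: clf is still the CatBoost classifier selected earlier, so this SHAP summary describes that model, not the neural network.
aka_plot_shap(clf.best_estimator_, X_train, feat_names).plot_summary_shap().show()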

