# Water Quality Prediction

This notebook explores a water quality dataset, compares several classification models for predicting the water quality class, tunes the hyperparameters of the strongest candidates, and saves the best-performing model.

## Data Loading

In [2]:
import pandas as pd

data_air = pd.read_csv("water_dataset.csv")

display(data_air.head())
data_air.columns

Unnamed: 0,Temp,Turbidity (cm),DO(mg/L),BOD (mg/L),CO2,pH,Alkalinity (mg L-1 ),Hardness (mg L-1 ),Calcium (mg L-1 ),Ammonia (mg L-1 ),Nitrite (mg L-1 ),Phosphorus (mg L-1 ),H2S (mg L-1 ),Plankton (No. L-1),Water Quality
0,67.448725,10.127148,0.208153,7.473607,10.181084,4.751657,218.364855,300.12508,337.178226,0.286054,4.35531,0.005984,0.066793,6069.624017,2
1,64.626666,94.015595,11.434463,10.859998,14.860521,3.085154,273.939692,8.426776,363.66074,0.09604,2.182753,0.004906,0.023428,250.995959,2
2,65.121842,90.653462,12.430865,12.80997,12.31998,9.648515,220.81273,11.726274,309.370934,0.974501,4.90176,0.006979,0.065041,7218.927473,2
3,1.640334,0.066344,10.963529,8.508023,12.955209,4.819988,266.571628,6.627655,8.180468,0.884865,3.571842,3.174473,0.026018,1230.062252,2
4,64.863434,2.119173,1.361736,13.335372,13.603197,10.244034,252.108,339.891514,253.996871,0.801695,4.655898,3.854701,0.060995,1035.05482,2


Index(['Temp', 'Turbidity (cm)', 'DO(mg/L)', 'BOD (mg/L)', 'CO2', 'pH',
       'Alkalinity (mg L-1 )', 'Hardness (mg L-1 )', 'Calcium (mg L-1 )',
       'Ammonia (mg L-1 )', 'Nitrite (mg L-1 )', 'Phosphorus (mg L-1 )',
       'H2S (mg L-1 )', 'Plankton (No. L-1)', 'Water Quality'],
      dtype='object')

## Data Exploration

### Subtask:
Check the distribution of the target variable.

**Reasoning**:
Understanding the distribution of the target variable is important for assessing class balance and potential challenges in modeling.

In [3]:
display(data_air['Water Quality'].value_counts())

Water Quality
2    1500
1    1400
0    1400
Name: count, dtype: int64
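
The raw counts above can also be read as proportions, which makes the balance easier to judge at a glance. The following is a minimal sketch, assuming a stand-in Series (`target`, a hypothetical name) rebuilt from the counts shown above rather than the actual `data_air` column.

```python
import pandas as pd

# Hypothetical stand-in for data_air['Water Quality'], rebuilt from the counts above.
target = pd.Series([2] * 1500 + [1] * 1400 + [0] * 1400, name="Water Quality")

counts = target.value_counts()
proportions = target.value_counts(normalize=True)

# Ratio of the largest class to the smallest; values close to 1 indicate balance.
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(f"Imbalance ratio: {imbalance_ratio:.3f}")
```

With a ratio this close to 1, plain accuracy is a reasonable headline metric; a strongly skewed target would call for class-aware metrics instead.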

### Subtask:
Check the column names.

**Reasoning**:
Checking column names helps to confirm they are as expected and to identify any potential issues with special characters or spacing that might need to be addressed.

In [4]:
display(data_air.columns)

Index(['Temp', 'Turbidity (cm)', 'DO(mg/L)', 'BOD (mg/L)', 'CO2', 'pH',
       'Alkalinity (mg L-1 )', 'Hardness (mg L-1 )', 'Calcium (mg L-1 )',
       'Ammonia (mg L-1 )', 'Nitrite (mg L-1 )', 'Phosphorus (mg L-1 )',
       'H2S (mg L-1 )', 'Plankton (No. L-1)', 'Water Quality'],
      dtype='object')
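
Several labels above contain spaces, parentheses, and a trailing space before the closing parenthesis (for example `'Alkalinity (mg L-1 )'`). This notebook keeps the original names, but if they become inconvenient, one option is to normalize them. The helper below (`clean_column_name`, a hypothetical name) is a sketch of that idea on a small sample of the labels; any later code that refers to `'Water Quality'` would need to use the new name.

```python
import re

import pandas as pd


def clean_column_name(name: str) -> str:
    """Lowercase a label and collapse each run of non-alphanumeric characters into one underscore."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


# A few of the original labels, used here only for illustration.
raw_columns = ["Temp", "Turbidity (cm)", "DO(mg/L)", "Alkalinity (mg L-1 )", "Water Quality"]
df = pd.DataFrame([[0.0] * len(raw_columns)], columns=raw_columns)

df = df.rename(columns=clean_column_name)
print(list(df.columns))
```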

## Data Cleaning and Preparation

### Subtask:
Handle missing values and prepare the data for modeling.

**Reasoning**:
Check for missing values in the DataFrame.

In [5]:
display(data_air.isnull().sum())

Temp                    0
Turbidity (cm)          0
DO(mg/L)                0
BOD (mg/L)              0
CO2                     0
pH                      0
Alkalinity (mg L-1 )    0
Hardness (mg L-1 )      0
Calcium (mg L-1 )       0
Ammonia (mg L-1 )       0
Nitrite (mg L-1 )       0
Phosphorus (mg L-1 )    0
H2S (mg L-1 )           0
Plankton (No. L-1)      0
Water Quality           0
dtype: int64

**Reasoning**:
Separate features and target variable, and ensure feature columns are numerical.

In [6]:
X = data_air.drop('Water Quality', axis=1)
y = data_air['Water Quality']

# Check if all feature columns are numerical
if not all(pd.api.types.is_numeric_dtype(X[col]) for col in X.columns):
    print("Some feature columns are not numerical. Further processing may be needed.")

display(X.head())
display(y.head())

Unnamed: 0,Temp,Turbidity (cm),DO(mg/L),BOD (mg/L),CO2,pH,Alkalinity (mg L-1 ),Hardness (mg L-1 ),Calcium (mg L-1 ),Ammonia (mg L-1 ),Nitrite (mg L-1 ),Phosphorus (mg L-1 ),H2S (mg L-1 ),Plankton (No. L-1)
0,67.448725,10.127148,0.208153,7.473607,10.181084,4.751657,218.364855,300.12508,337.178226,0.286054,4.35531,0.005984,0.066793,6069.624017
1,64.626666,94.015595,11.434463,10.859998,14.860521,3.085154,273.939692,8.426776,363.66074,0.09604,2.182753,0.004906,0.023428,250.995959
2,65.121842,90.653462,12.430865,12.80997,12.31998,9.648515,220.81273,11.726274,309.370934,0.974501,4.90176,0.006979,0.065041,7218.927473
3,1.640334,0.066344,10.963529,8.508023,12.955209,4.819988,266.571628,6.627655,8.180468,0.884865,3.571842,3.174473,0.026018,1230.062252
4,64.863434,2.119173,1.361736,13.335372,13.603197,10.244034,252.108,339.891514,253.996871,0.801695,4.655898,3.854701,0.060995,1035.05482


0    2
1    2
2    2
3    2
4    2
Name: Water Quality, dtype: int64

### Subtask:
Split data into training and testing sets and scale the features.

**Reasoning**:
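Split the data into training and testing sets so the models can be evaluated on unseen data. Then standardize the features so that scale-sensitive models (Logistic Regression and SVM) are not dominated by large-valued columns such as the plankton count. The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into training.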

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

display(X_train_scaled[:5])
display(y_train.head())

array([[ 0.69692309, -0.56665658,  0.30151888,  1.15920736, -0.8130562 ,
         1.0024908 , -0.96031248, -0.86071184,  0.30150072, -0.07339259,
         1.43234004,  1.27473264, -0.1507425 ,  1.10122943],
       [ 0.29835413,  1.44322755, -0.290754  , -0.7566535 , -0.07473707,
         0.63894544, -0.64674213, -0.64818122,  0.04104753, -0.33476315,
        -0.68906292, -0.56893939,  0.2380824 , -0.29718149],
       [-0.58881335,  1.70223182, -0.87842093, -0.78172947, -0.32227905,
         0.22694701, -0.3164553 ,  0.25452999, -0.69594099, -0.22207715,
        -0.69514551,  0.60089791,  0.23758986,  0.23497502],
       [ 0.65725149, -0.68602139,  1.24875359,  0.21152991,  0.96651043,
        -1.05704408, -0.78995817, -1.35142495, -0.93804222, -0.08340112,
        -0.34606693,  1.14610597, -1.17353883, -1.05271517],
       [ 0.9112736 , -0.99335635,  0.89663586, -0.23570265, -0.90447314,
         0.90687286, -0.85925508,  2.00200748,  1.8503815 , -0.17324897,
         0.35256453,  0.83

2605    1
3328    0
3404    0
1588    1
1612    1
Name: Water Quality, dtype: int64
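
An alternative way to organize the same split-then-scale step is to wrap the scaler and the classifier in a scikit-learn `Pipeline` and to stratify the split on the target. The sketch below is not what this notebook does; it runs on synthetic data from `make_classification` (an assumption, standing in for the water dataset), and the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 14 features and 3 classes, mirroring the shape of the problem.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=14, n_informative=6, n_classes=3, random_state=42
)

# stratify keeps the class proportions the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data only, then reuses it at predict time.
pipeline = make_pipeline(StandardScaler(), SVC(random_state=42))
pipeline.fit(X_tr, y_tr)
test_accuracy = pipeline.score(X_te, y_te)

print(f"Test accuracy on synthetic data: {test_accuracy:.4f}")
```

Bundling the two steps means a single object can later be saved and loaded, instead of a separate model and scaler.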

## Model Selection and Comparison

### Subtask:
Select and compare multiple classification models.

**Reasoning**:
Compare the performance of different classification models to identify the most promising ones for this dataset. We will start with some common models like Logistic Regression, Decision Tree, and Support Vector Machine.

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Initialize models
log_reg = LogisticRegression(random_state=42, max_iter=1000)
dec_tree = DecisionTreeClassifier(random_state=42)
svm = SVC(random_state=42)

models = {
    "Logistic Regression": log_reg,
    "Decision Tree": dec_tree,
    "SVM": svm
}

# Train and evaluate models
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    results[name] = {"accuracy": accuracy, "report": report}

# Display results
for name, metrics in results.items():
    print(f"--- {name} ---")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print("Classification Report:")
    print(metrics['report'])
    print("-" * 20)

--- Logistic Regression ---
Accuracy: 0.8314
Classification Report:
              precision    recall  f1-score   support

           0       0.85      1.00      0.92       261
           1       0.78      0.91      0.84       274
           2       0.89      0.63      0.74       325

    accuracy                           0.83       860
   macro avg       0.84      0.85      0.83       860
weighted avg       0.84      0.83      0.82       860

--------------------
--- Decision Tree ---
Accuracy: 0.9942
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       261
           1       0.99      1.00      0.99       274
           2       1.00      0.98      0.99       325

    accuracy                           0.99       860
   macro avg       0.99      0.99      0.99       860
weighted avg       0.99      0.99      0.99       860

--------------------
--- SVM ---
Accuracy: 0.9500
Classification Report:
              


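
The comparison above rests on a single train/test split. A complementary check is k-fold cross-validation, which averages accuracy over several splits and reports its spread. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption), so its numbers are not comparable to the results above; `candidates` and `cv_summary` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled water-quality features and labels.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=14, n_informative=6, n_classes=3, random_state=42
)

# Scale-sensitive models get their own scaler inside a pipeline, refit in every fold.
candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, random_state=42)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=42)),
}

cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```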
## Hyperparameter Tuning

### Subtask:
Perform hyperparameter tuning on the best performing models.

**Reasoning**:
Tune the hyperparameters of the Decision Tree and SVM models to potentially improve their accuracy and generalization. We will use GridSearchCV for this.

In [9]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning for Decision Tree
param_grid_tree = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search_tree = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid_tree, cv=5, scoring='accuracy')
grid_search_tree.fit(X_train_scaled, y_train)

best_tree = grid_search_tree.best_estimator_
print(f"Best Decision Tree Parameters: {grid_search_tree.best_params_}")
print(f"Best Decision Tree Cross-validation Accuracy: {grid_search_tree.best_score_:.4f}")

# Hyperparameter tuning for SVM
param_grid_svm = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto'],
    'kernel': ['rbf', 'linear']
}
grid_search_svm = GridSearchCV(SVC(random_state=42), param_grid_svm, cv=5, scoring='accuracy')
grid_search_svm.fit(X_train_scaled, y_train)

best_svm = grid_search_svm.best_estimator_
print(f"Best SVM Parameters: {grid_search_svm.best_params_}")
print(f"Best SVM Cross-validation Accuracy: {grid_search_svm.best_score_:.4f}")

Best Decision Tree Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best Decision Tree Cross-validation Accuracy: 0.9942
Best SVM Parameters: {'C': 10, 'gamma': 'auto', 'kernel': 'rbf'}
Best SVM Cross-validation Accuracy: 0.9535
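
Beyond `best_params_` and `best_score_`, a fitted `GridSearchCV` exposes every tried combination through `cv_results_`, which helps show how close the runners-up were. The sketch below is a small illustration on synthetic data (an assumption) with a reduced grid; `search` and `results_table` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=14, n_informative=6, n_classes=3, random_state=42
)

# A deliberately small grid so the example runs quickly.
param_grid = {"max_depth": [None, 5], "min_samples_leaf": [1, 4]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
results_table = (
    pd.DataFrame(search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results_table)
```

If several combinations score within one standard deviation of the best, the simpler configuration is often a reasonable choice.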


## Model Evaluation and Selection

### Subtask:
Evaluate the tuned models and select the best performing one.

**Reasoning**:
Evaluate the performance of the tuned Decision Tree and SVM models on the test set and select the model with the highest accuracy.

In [10]:
from sklearn.metrics import accuracy_score, classification_report

# Evaluate tuned Decision Tree
y_pred_tree = best_tree.predict(X_test_scaled)
accuracy_tree = accuracy_score(y_test, y_pred_tree)
report_tree = classification_report(y_test, y_pred_tree)

print("--- Tuned Decision Tree ---")
print(f"Accuracy: {accuracy_tree:.4f}")
print("Classification Report:")
print(report_tree)
print("-" * 20)

# Evaluate tuned SVM
y_pred_svm = best_svm.predict(X_test_scaled)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
report_svm = classification_report(y_test, y_pred_svm)

print("--- Tuned SVM ---")
print(f"Accuracy: {accuracy_svm:.4f}")
print("Classification Report:")
print(report_svm)
print("-" * 20)

# Select the best model
if accuracy_tree > accuracy_svm:
    best_model = best_tree
    best_model_name = "Tuned Decision Tree"
    best_accuracy = accuracy_tree
    best_report = report_tree
else:
    best_model = best_svm
    best_model_name = "Tuned SVM"
    best_accuracy = accuracy_svm
    best_report = report_svm

print(f"The best performing model is: {best_model_name} with Accuracy: {best_accuracy:.4f}")

--- Tuned Decision Tree ---
Accuracy: 0.9942
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       261
           1       0.99      1.00      0.99       274
           2       1.00      0.98      0.99       325

    accuracy                           0.99       860
   macro avg       0.99      0.99      0.99       860
weighted avg       0.99      0.99      0.99       860

--------------------
--- Tuned SVM ---
Accuracy: 0.9581
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       261
           1       0.91      0.99      0.94       274
           2       0.99      0.90      0.94       325

    accuracy                           0.96       860
   macro avg       0.96      0.96      0.96       860
weighted avg       0.96      0.96      0.96       860

--------------------
The best performing model is: Tuned Decision Tree with Accuracy: 0.9942

## Summary

### Data Analysis Key Findings

* The dataset contains 4300 entries with 15 columns, including features related to water quality parameters and a target variable 'Water Quality'.
* There are no missing values in the dataset.
* The target variable 'Water Quality' has three classes with a relatively balanced distribution (1500, 1400, 1400).

### Model Performance and Selection

* In the initial comparison on the test set, the Decision Tree (accuracy 0.9942) and SVM (0.9500) clearly outperformed Logistic Regression (0.8314).
* Hyperparameter tuning was performed on the Decision Tree and SVM models with GridSearchCV (5-fold cross-validation).
* The tuned Decision Tree model achieved an accuracy of 0.9942 on the test set; the best parameters found were the defaults, so its score is unchanged from the initial comparison.
* The tuned SVM model achieved an accuracy of 0.9581 on the test set.

Based on test-set accuracy, the **Tuned Decision Tree** is the best performing model with an accuracy of **0.9942**.

### Insights or Next Steps

* The **Tuned Decision Tree** model can be used for predicting water quality based on the given features.
* Further analysis could include exploring feature importance for the best model or trying other advanced classification algorithms to potentially improve performance further.
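
The feature-importance idea mentioned above can be sketched as follows. A fitted `DecisionTreeClassifier` exposes impurity-based importances through `feature_importances_`. The example uses synthetic data and placeholder feature names (both assumptions); in the notebook the same pattern would apply to `best_model` with `X.columns` as the index.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder names standing in for the 14 water-quality feature columns.
feature_names = [f"feature_{i}" for i in range(14)]

X_demo, y_demo = make_classification(
    n_samples=600, n_features=14, n_informative=6, n_classes=3, random_state=42
)
tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(tree.feature_importances_, index=feature_names).sort_values(
    ascending=False
)
print(importances.head())
```

Impurity-based importances can favor features with many distinct values, so they are best treated as a first look rather than a definitive ranking.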

### Subtask:
Compare model performance on training and testing data.

**Reasoning**:
Compare the accuracy of the best model on the training and testing sets to check for overfitting. A significant difference suggests overfitting.

In [11]:
from sklearn.metrics import accuracy_score

# Evaluate the best model on training data
y_train_pred = best_model.predict(X_train_scaled)
accuracy_train = accuracy_score(y_train, y_train_pred)

print(f"Accuracy on Training Data ({best_model_name}): {accuracy_train:.4f}")
print(f"Accuracy on Testing Data ({best_model_name}): {best_accuracy:.4f}")

if accuracy_train > best_accuracy + 0.05: # A threshold of 0.05 is arbitrary, adjust as needed
    print("Potential Overfitting detected: Training accuracy is significantly higher than testing accuracy.")
else:
    print("No significant overfitting detected based on training and testing accuracy.")

Accuracy on Training Data (Tuned Decision Tree): 1.0000
Accuracy on Testing Data (Tuned Decision Tree): 0.9942
No significant overfitting detected based on training and testing accuracy.
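
A single train/test gap is a coarse signal. One way to look more closely is to track training and cross-validated accuracy as the tree is allowed to grow deeper, using `validation_curve`. The sketch below runs on synthetic data (an assumption), so the gaps it prints say nothing about the water dataset; `depths` and `gaps` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=14, n_informative=6, n_classes=3, random_state=42
)

depths = [2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X_demo,
    y_demo,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

# Mean gap between training and validation accuracy at each depth.
gaps = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for depth, gap in zip(depths, gaps):
    print(f"max_depth={depth}: train-validation gap = {gap:.4f}")
```

A gap that widens with depth while validation accuracy stalls is the usual sign that extra depth is fitting noise.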


### Subtask:
Save the best performing model.

**Reasoning**:
Save the trained model to a file so it can be loaded and used for making predictions in a web application or other deployment environment.

In [12]:
import pickle

# Define the filename for the saved model
model_filename = 'best_water_quality_model.pkl'

# Save the best model to the file
with open(model_filename, 'wb') as file:
    pickle.dump(best_model, file)
# Save the fitted scaler as well, since new inputs must be scaled the same way
with open("scaler_water.pkl", "wb") as f:
    pickle.dump(scaler, f)
print(f"Best model saved to {model_filename}")

Best model saved to best_water_quality_model.pkl
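
For completeness, here is a sketch of the loading side: reading a pickled model and scaler back and predicting on new rows. To stay self-contained it trains a small model on synthetic data and writes to a temporary directory (both assumptions); in a deployment the paths would be `best_water_quality_model.pkl` and `scaler_water.pkl`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for the fitted scaler and best model.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=14, n_informative=6, n_classes=3, random_state=42
)
scaler_demo = StandardScaler().fit(X_demo)
model_demo = DecisionTreeClassifier(random_state=42).fit(
    scaler_demo.transform(X_demo), y_demo
)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"
    scaler_path = Path(tmp_dir) / "scaler.pkl"
    model_path.write_bytes(pickle.dumps(model_demo))
    scaler_path.write_bytes(pickle.dumps(scaler_demo))

    # Load both artifacts back, as a serving process would.
    loaded_model = pickle.loads(model_path.read_bytes())
    loaded_scaler = pickle.loads(scaler_path.read_bytes())

# New rows must go through the same scaler before prediction.
new_samples = X_demo[:5]
predictions = loaded_model.predict(loaded_scaler.transform(new_samples))
print(predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.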
