# Predicting Water Quality Using Machine Learning
Author: **Fabrice Faustin**  
This notebook demonstrates the application of supervised machine learning techniques to predict groundwater quality using a dataset from Telangana, India (2018).
We'll walk through preprocessing, modeling (Decision Tree and KNN), cross-validation, and comparison of performance using accuracy and precision.


## 1. Load and Preprocess Data
We keep only the numeric feature columns (indices 4 to 6 and 8 to 22), use column 23 as the target label, and replace empty cells with 0.0, following the same methodology used in the original study.

In [None]:
import pandas as pd
import csv

file_path = 'ground_water_quality_2018_post.csv'
feature_indices = list(range(4, 7)) + list(range(8, 23))
target_index = 23

features = []
target = []

with open(file_path, mode='r', newline='') as file:
    csv_reader = csv.reader(file)
    headers = next(csv_reader)

    for row in csv_reader:
        try:
            # Empty cells become 0.0; non-numeric cells raise ValueError.
            selected_features = [float(row[i]) if row[i] else 0.0 for i in feature_indices]
            label = row[target_index]
        except (ValueError, IndexError):
            # Skip malformed or truncated rows so features and target stay aligned.
            continue
        features.append(selected_features)
        target.append(label)

X = pd.DataFrame(features, columns=[headers[i] for i in feature_indices])
y = pd.Series(target, name="Classification")
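
To make the null-handling rule concrete, here is a minimal sketch on a small in-memory CSV. The column names and values are made up for illustration, and `parse_features` is a hypothetical helper, not part of the original notebook.

```python
import csv
import io

# Hypothetical three-row sample; the real file has many more columns.
sample = io.StringIO(
    "pH,TDS,Classification\n"
    "7.2,450,C2S1\n"
    ",610,C3S1\n"        # missing pH -> treated as 0.0
    "abc,300,C2S1\n"     # non-numeric pH -> row is skipped
)

def parse_features(row, indices):
    """Convert the selected cells to float, mapping empty cells to 0.0."""
    return [float(row[i]) if row[i] else 0.0 for i in indices]

reader = csv.reader(sample)
headers = next(reader)
rows, labels = [], []
for row in reader:
    try:
        values = parse_features(row, [0, 1])
    except (ValueError, IndexError):
        continue  # skip malformed rows
    rows.append(values)
    labels.append(row[2])

print(rows)    # [[7.2, 450.0], [0.0, 610.0]]
print(labels)  # ['C2S1', 'C3S1']
```

The second row survives with its missing value filled in, while the third row is dropped because `float("abc")` raises `ValueError`.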


## 2. Remove Rare Classes
To keep the stratified train/test split valid and the cross-validation meaningful, we drop any class with fewer than 2 samples, since a single-sample class cannot appear in both partitions.

In [None]:
class_counts = y.value_counts()
valid_classes = class_counts[class_counts >= 2].index
mask = y.isin(valid_classes)
X = X[mask]
y = y[mask]
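
The same filter can be checked on a toy example. The labels below are invented for illustration. In this sketch, class `"C"` occurs once and is removed, while the other classes are left untouched.

```python
import pandas as pd

# Toy labels: class "C" occurs only once.
y_toy = pd.Series(["A", "A", "B", "B", "B", "C"])
X_toy = pd.DataFrame({"x": range(len(y_toy))})

counts = y_toy.value_counts()
keep = counts[counts >= 2].index   # classes with at least 2 samples
mask_toy = y_toy.isin(keep)

X_kept, y_kept = X_toy[mask_toy], y_toy[mask_toy]
print(sorted(y_kept.unique()))  # ['A', 'B']
print(len(X_kept))              # 5
```

A threshold of 2 is the minimum needed for a stratified split. With 5-fold cross-validation, classes with fewer than 5 samples can still trigger a warning from scikit-learn's stratified splitter, so a higher threshold may be worth considering.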


## 3. Encode Target and Split Dataset

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

le = LabelEncoder()
y_encoded = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)
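
As a rough illustration of what `stratify` does, the sketch below splits a made-up 80/20 label distribution and counts the classes on each side. The label names and data are assumptions for the example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

labels_demo = np.array(["good"] * 80 + ["poor"] * 20)  # made-up 80/20 imbalance
X_demo = np.arange(len(labels_demo)).reshape(-1, 1)

enc = LabelEncoder()
y_demo = enc.fit_transform(labels_demo)  # classes are sorted: good -> 0, poor -> 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(np.bincount(y_tr))              # [64 16]
print(np.bincount(y_te))              # [16  4]
print(enc.inverse_transform([0, 1]))  # ['good' 'poor']
```

Both partitions keep the 80/20 ratio, and `inverse_transform` maps the integer codes back to the original class names when reporting results.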

## 4. Train Models and Evaluate
We run 5-fold cross-validation on the training set to estimate mean accuracy, then refit each model on the full training set and compute weighted precision on the held-out test set.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score
import numpy as np

results = []

# Decision Tree
for depth in [3, 5, 7]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # zero_division=0 scores classes that are never predicted as 0 without a warning.
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    results.append({
        'Model': 'Decision Tree',
        'Hyperparameter': depth,
        'CV Mean Accuracy': np.mean(cv_scores),
        'Precision': precision
    })

# KNN
for k in [5, 10, 15]:
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # zero_division=0 scores classes that are never predicted as 0 without a warning.
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    results.append({
        'Model': 'KNN',
        'Hyperparameter': k,
        'CV Mean Accuracy': np.mean(cv_scores),
        'Precision': precision
    })

results_df = pd.DataFrame(results)
results_df
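
For readers unfamiliar with `average='weighted'`, the sketch below recomputes weighted precision by hand on a tiny made-up prediction vector and compares it with scikit-learn's result. Per-class precision is averaged using each class's support (its number of true samples) as the weight.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels for illustration; every class is predicted at least once.
y_true_demo = np.array([0, 0, 0, 1, 1, 2])
y_pred_demo = np.array([0, 0, 1, 1, 1, 2])

per_class = []
support = []
for c in np.unique(y_true_demo):
    predicted_c = (y_pred_demo == c)
    true_pos = np.sum(predicted_c & (y_true_demo == c))
    per_class.append(true_pos / predicted_c.sum())  # precision for class c
    support.append(np.sum(y_true_demo == c))        # true samples in class c

manual = np.average(per_class, weights=support)
library = precision_score(y_true_demo, y_pred_demo, average="weighted")

print(round(float(manual), 4), round(float(library), 4))  # 0.8889 0.8889
```

Because the weights follow class frequency, a weighted score on an imbalanced dataset is dominated by the majority classes. Passing `average='macro'` instead would weight all classes equally.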

## 5. Visualize Performance Comparison

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.barplot(data=results_df, x="Hyperparameter", y="CV Mean Accuracy", hue="Model")
plt.title("Mean Cross-Validation Accuracy by Model and Hyperparameter")
plt.ylabel("Mean Accuracy")
plt.xlabel("Depth (Decision Tree) / K (KNN)")
plt.grid(axis='y')
plt.legend(title='Model')
plt.tight_layout()
plt.show()

## 6. Conclusion
- The **Decision Tree (depth=5)** achieved the best combination of mean cross-validation accuracy (~91.6%) and weighted test precision.
- Among the KNN settings, **k=10** gave the highest accuracy, though with slightly lower precision.
- Class imbalance was significant and could be further addressed using resampling techniques.
- This project illustrates the end-to-end workflow of a classification pipeline in machine learning, grounded in real-world water quality data.
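
As one possible way to follow up on the class-imbalance point, the sketch below oversamples minority classes with replacement using `sklearn.utils.resample`. This is not part of the original analysis. The toy `train_demo` frame and its column names are assumptions, and other approaches (undersampling, class weights) may suit the data better.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up training data with a 7/3 class imbalance.
train_demo = pd.DataFrame({
    "feature": range(10),
    "label": ["A"] * 7 + ["B"] * 3,
})

# Bring every class up to the size of the largest one.
target_size = int(train_demo["label"].value_counts().max())

balanced_parts = []
for label, group in train_demo.groupby("label"):
    balanced_parts.append(
        resample(group, replace=True, n_samples=target_size, random_state=42)
    )
balanced = pd.concat(balanced_parts, ignore_index=True)

print(balanced["label"].value_counts().to_dict())
```

Resampling should be applied to the training split only, after `train_test_split`, so that duplicated rows never leak into the test set.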