# CENG 463 HW 1 – Water Resource Risk Classification
**Start Date:**  
**Due Date:** Month Dayth, 2026

## Dataset Overview
The dataset used in this assignment comes from the World Resources Institute (WRI) – Aqueduct Water Risk Atlas. It provides country-level indicators of key hydrological and environmental factors, listed in the table below. The objective is to classify each country into a Water Resource Risk Category (0-4) using these indicators. Students are also expected to create two derived features: the Composite Water Stress Index (CWSI) and the Seasonal–Flood Interaction (SFI).

| Feature | Description |
|---------|-------------|
| gid_0 | Country Code |
| bws_score | Baseline Water Stress |
| gtd_score | Groundwater Depletion |
| drr_score | Drought Risk |
| rfr_score | River Flood Risk |
| sev_score | Seasonal Variability |
| w_awr_def_tot_cat | Target: Water Risk Category (0-4) |


## 1. Feature Engineering (35 pts)
Students are expected to create two new features based on the existing indicators:

1. **Composite Water Stress Index (CWSI):**
   CWSI combines baseline water stress, groundwater depletion, and drought risk.
   Formula: CWSI = 0.5 × bws_score + 0.3 × gtd_score + 0.2 × drr_score

2. **Seasonal–Flood Interaction (SFI):**
   SFI represents the interaction between seasonal variability and river flood risk.
   Formula: SFI = sev_score × rfr_score
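
As a quick sanity check of the two formulas, the sketch below applies them to two made-up rows. The `toy` DataFrame and its values are assumptions for illustration only and are not taken from the dataset. For the first row, CWSI works out to 0.5 × 4.0 + 0.3 × 2.0 + 0.2 × 3.0 = 3.2 and SFI to 1.5 × 2.0 = 3.0.

```python
import pandas as pd

# Two made-up rows (illustrative values, not real Aqueduct data).
toy = pd.DataFrame({
    "bws_score": [4.0, 1.0],
    "gtd_score": [2.0, 0.0],
    "drr_score": [3.0, 0.5],
    "sev_score": [1.5, 2.0],
    "rfr_score": [2.0, 0.0],
})

# Weighted sum of the three stress-related indicators.
toy["CWSI"] = 0.5 * toy["bws_score"] + 0.3 * toy["gtd_score"] + 0.2 * toy["drr_score"]

# Product of seasonal variability and river flood risk.
toy["SFI"] = toy["sev_score"] * toy["rfr_score"]

print(toy[["CWSI", "SFI"]])
```

Because SFI is a product, it is zero whenever either input is zero, as in the second row.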


In [60]:
# TODO: Create CWSI and SFI features
# and preprocess the country column

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

In [61]:
df = pd.read_csv("water_risk_dataset.csv")

# Derived features, following the formulas given above
df['CWSI'] = 0.5 * df['bws_score'] + 0.3 * df['gtd_score'] + 0.2 * df['drr_score']
df['SFI'] = df['sev_score'] * df['rfr_score']

# Separate the features from the target
target_col = 'w_awr_def_tot_cat'
X = df.drop(columns=[target_col])
y = df[target_col]


In [62]:
cat_cols = ['gid_0']
num_cols = [c for c in X.columns if c not in cat_cols]

preprocessor = ColumnTransformer(transformers=[
    ('encoder', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('scaler', StandardScaler(), num_cols)
])


In [None]:
print(X)
print(y)

## 2. Model Training & Evaluation (40 pts)
Train five classification models: Random Forest, SVM, KNN, Gaussian Naive Bayes, Logistic Regression.

*Hint: Use scaled data for SVM, KNN, Logistic Regression.*
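
One possible way to organize the five models is sketched below: the scale-sensitive ones are wrapped in a pipeline with `StandardScaler`, and the results are collected into an evaluation table. The sketch runs on synthetic data from `make_classification` as a stand-in for the real dataset, and the helper `scaled` and the names `X_demo`, `y_demo`, and `results` are assumptions. In the notebook, the existing `preprocessor` and the real train/test split would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in (assumption): 5 classes, 7 numeric features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=5, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)


def scaled(clf):
    """Hypothetical helper: put a StandardScaler in front of a classifier."""
    return Pipeline([("scaler", StandardScaler()), ("clf", clf)])


models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": scaled(SVC()),
    "KNN": scaled(KNeighborsClassifier()),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": scaled(LogisticRegression(max_iter=1000)),
}

# Fit each model and record train/test accuracy for the evaluation table.
rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "Model": name,
        "Train Accuracy": accuracy_score(y_tr, model.predict(X_tr)),
        "Test Accuracy": accuracy_score(y_te, model.predict(X_te)),
    })

results = pd.DataFrame(rows).sort_values("Test Accuracy", ascending=False)
print(results.to_string(index=False))
```

Fitting the scaler inside the pipeline means it only ever sees the training rows, which avoids leaking test-set statistics into the model.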

In [None]:
# Split features X and target y into train and test sets
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=42)
print("Data split into train and test sets.")

In [64]:
from sklearn.neighbors import KNeighborsClassifier

knn_pipe = Pipeline([
    ('prep', preprocessor),
    ('clf', KNeighborsClassifier())
])

knn_pipe.fit(train_X, train_y)
print("KNN pipeline trained.")

KNN pipeline trained.


In [65]:
# KNN evaluation metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predictions
y_pred_train = knn_pipe.predict(train_X)
y_pred_test = knn_pipe.predict(test_X)

# Accuracy
train_acc = accuracy_score(train_y, y_pred_train)
test_acc = accuracy_score(test_y, y_pred_test)
print(f"Train Accuracy: {train_acc:.3f}")
print(f"Test Accuracy:  {test_acc:.3f}")

print("\nClassification Report (Test):")
print(classification_report(test_y, y_pred_test))

print("Confusion Matrix (Test):")
print(confusion_matrix(test_y, y_pred_test))

Train Accuracy: 0.917
Test Accuracy:  0.869

Classification Report (Test):
              precision    recall  f1-score   support

         0.0       0.96      0.97      0.96       186
         1.0       0.82      0.86      0.84       208
         2.0       0.81      0.75      0.78       216
         3.0       0.83      0.85      0.84       240
         4.0       0.92      0.92      0.92       289

    accuracy                           0.87      1139
   macro avg       0.87      0.87      0.87      1139
weighted avg       0.87      0.87      0.87      1139

Confusion Matrix (Test):
[[180   6   0   0   0]
 [  6 179  22   1   0]
 [  2  31 162  21   0]
 [  0   1  14 203  22]
 [  0   0   3  20 266]]


In [None]:
# TODO: Train models and evaluate accuracy
# TODO: Prepare evaluation table


## 3. Hyperparameter Optimization (20 pts)
Tune each model using GridSearchCV with 5-fold cross-validation.
Compare the baseline and tuned results and report any improvement.
Identify the model with the highest tuned performance.

*Hint: Use accuracy as the scoring metric and add a classification report.*
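
A minimal sketch of this tuning loop for a single model (KNN) is shown below, again on synthetic stand-in data. The grid values and the names `X_demo`, `y_demo`, `pipe`, and `search` are assumptions chosen for illustration, not recommended settings. Parameters of a pipeline step are addressed with the `step__param` naming convention.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 5 classes, 7 numeric features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=5, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())])

# Baseline: default hyperparameters.
baseline_acc = accuracy_score(y_te, pipe.fit(X_tr, y_tr).predict(X_te))

# Illustrative grid; keys use the "<step name>__<parameter>" convention.
param_grid = {
    "clf__n_neighbors": [3, 5, 7, 11],
    "clf__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

y_pred = search.predict(X_te)  # uses the refit best estimator
tuned_acc = accuracy_score(y_te, y_pred)

print("Best params:", search.best_params_)
print(f"Best CV accuracy:       {search.best_score_:.3f}")
print(f"Baseline test accuracy: {baseline_acc:.3f}")
print(f"Tuned test accuracy:    {tuned_acc:.3f}")
print(classification_report(y_te, y_pred))
```

The same pattern can be repeated per model with its own grid, collecting the baseline and tuned accuracies into one comparison table.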

In [None]:
# TODO: Define parameter grids for each model
# TODO: Perform GridSearchCV and compare results

## 4. Feature Importance Analysis (5 pts)
Choose one model and analyze its feature importances. Present the most influential features in a table and a bar chart.
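
The sketch below shows one way to build the table and bar chart, assuming a Random Forest and its impurity-based `feature_importances_`. It uses synthetic data with the notebook's numeric column names attached as labels, so the ranking it prints says nothing about the real dataset. On the real pipeline, the one-hot encoded `gid_0` columns would also appear, and `get_feature_names_out()` on the fitted preprocessor could supply the names.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column labels borrowed from the notebook; the data itself is synthetic (assumption).
feature_names = ["bws_score", "gtd_score", "drr_score",
                 "rfr_score", "sev_score", "CWSI", "SFI"]
X_demo, y_demo = make_classification(
    n_samples=500, n_features=len(feature_names), n_informative=5,
    n_redundant=0, n_classes=5, n_clusters_per_class=1, random_state=42,
)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_demo, y_demo)

# Table of features sorted by impurity-based importance.
importance = (
    pd.DataFrame({"Feature": feature_names, "Importance": rf.feature_importances_})
    .sort_values("Importance", ascending=False)
    .reset_index(drop=True)
)
print(importance)

# Horizontal bar chart with the most important feature at the top.
ax = importance.plot.barh(x="Feature", y="Importance", legend=False)
ax.invert_yaxis()
ax.set_xlabel("Mean decrease in impurity")
ax.set_title("Random Forest feature importances (synthetic data)")
plt.tight_layout()
plt.show()
```

Impurity-based importances can favor high-cardinality features, so `sklearn.inspection.permutation_importance` on the test set may be worth comparing against.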

In [None]:
# TODO: Select your best model (e.g., Random Forest)
# TODO: Train the model on the training data if not already trained
# TODO: Calculate feature importances
# TODO: Create a table of features sorted by importance
# TODO: Plot a bar chart of feature importances
# TODO: Optional: Comment on top 3-5 most influential features