---
# IoT Water Network Monitoring

## Problem Definition

The city monitors a **water distribution network** with IoT sensors (pressure, flow). Vendors deliver CSVs with different key names and time granularities. Build a clean, integrated dataset, engineer anomaly indicators, perform EDA, and test whether **mean pressure** differs across zones.

**Datasets:** `sensors_s4.csv`, `zones_s4.csv`, `flow_logs_s4.csv`.

## Mark Split-Up

| Data Cleaning | Data Integration | Data Transformation | EDA Visualization | Statistical Inference | Model Building | Best Practices | Total |
|:-------------:|:----------------:|:-------------------:|:-----------------:|:---------------------:|:--------------:|:--------------:|:-----:|
| (15)          | (15)             | (15)                | (15)              | (15)                  | (20)           | (5)            | (100) |

In [None]:

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 100)
np.set_printoptions(edgeitems=3, suppress=True)
print("Libs ready.")


### Generate/Load Data

In [None]:

import os
import numpy as np
import pandas as pd

np.random.seed(1004)  # fixed seed so the generated files are reproducible

# Sensor readings: one record every 15 minutes
n = 500
t0 = pd.Timestamp("2024-07-01")
timestamps = [t0 + pd.Timedelta(minutes=15 * i) for i in range(n)]
zones = np.random.choice(["Z1", "Z2", "Z3", "Z4"], size=n, p=[0.3, 0.25, 0.25, 0.2])
pressure = np.round(np.random.normal(3.2, 0.4, size=n), 2)
flow = np.round(np.random.normal(120, 25, size=n), 1)
sensors = pd.DataFrame({"rec_time": timestamps, "zone": zones, "pressure_bar": pressure,
                        "flow_lpm": flow, "sensor_id": np.random.randint(1000, 1100, size=n)})

# Inject data-quality issues: 30 missing pressure values and 8 flow spikes
sensors.loc[np.random.choice(sensors.index, size=30, replace=False), "pressure_bar"] = np.nan
sensors.loc[np.random.choice(sensors.index, size=8, replace=False), "flow_lpm"] = 400
sensors.to_csv("sensors_s4.csv", index=False)

# Zone-to-district lookup table
zone_map = pd.DataFrame({"zone": ["Z1", "Z2", "Z3", "Z4"], "district": ["North", "East", "South", "West"]})
zone_map.to_csv("zones_s4.csv", index=False)

# Vendor flow logs: different key names and a 20-minute granularity
m = 300
v_times = [t0 + pd.Timedelta(minutes=20 * i) for i in range(m)]
vendor = pd.DataFrame({"timeStamp": v_times, "area": np.random.choice(["Z1", "Z2", "Z3", "Z4"], size=m),
                       "flowRate": np.round(np.random.normal(118, 27, size=m), 1)})
vendor.to_csv("flow_logs_s4.csv", index=False)


## Q1 Cleaning
- Audit + missing pattern.
- Impute `pressure_bar` with justification.
- Treat `flow_lpm` spikes.
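
One possible approach to the cleaning steps above is sketched below. It is not the expected answer: the per-zone median imputation and the 1.5×IQR capping rule are assumptions chosen for illustration, and the helper columns `pressure_imputed` and `flow_spike` are hypothetical names. The block builds a small synthetic stand-in for `sensors_s4.csv` so it runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for sensors_s4.csv (same column names as the notebook).
rng = np.random.default_rng(0)
n = 200
sen = pd.DataFrame({
    "rec_time": pd.date_range("2024-07-01", periods=n, freq="15min"),
    "zone": rng.choice(["Z1", "Z2", "Z3", "Z4"], size=n),
    "pressure_bar": rng.normal(3.2, 0.4, size=n).round(2),
    "flow_lpm": rng.normal(120, 25, size=n).round(1),
})
sen.loc[rng.choice(n, size=12, replace=False), "pressure_bar"] = np.nan
sen.loc[rng.choice(n, size=4, replace=False), "flow_lpm"] = 400

# Audit: missing counts per column and share of missing pressure per zone.
missing_by_col = sen.isna().sum()
missing_by_zone = sen["pressure_bar"].isna().groupby(sen["zone"]).mean()

# Impute pressure with the per-zone median (assumed policy), keeping a flag
# so imputed rows can be excluded or inspected later.
sen["pressure_imputed"] = sen["pressure_bar"].isna()
sen["pressure_bar"] = sen["pressure_bar"].fillna(
    sen.groupby("zone")["pressure_bar"].transform("median")
)

# Treat flow spikes: flag and cap values outside the 1.5*IQR fences.
q1, q3 = sen["flow_lpm"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
sen["flow_spike"] = ~sen["flow_lpm"].between(lower, upper)
sen["flow_lpm"] = sen["flow_lpm"].clip(lower, upper)
```

Capping keeps the row count intact, which matters for the later time-based join; dropping the spike rows instead would be an equally defensible policy if justified.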

In [None]:

sen = pd.read_csv("sensors_s4.csv", parse_dates=["rec_time"])
# TODO



## Q2 Integration
- Harmonize vendor keys to sensors schema.
- Time-nearest join with tolerance; justify.
- Join coverage and policy for unjoined records.
- District-level summary of mean `flow_lpm` and `pressure_bar`.
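
The integration steps can be sketched with `pandas.merge_asof`, as below. This is a minimal illustration on small synthetic frames, not the reference solution. The 10-minute tolerance (half of the coarser 20-minute vendor interval) and the renamed column `vendor_flow_lpm` are assumptions; unmatched sensor rows are kept with missing vendor values, which is one possible policy among several.

```python
import numpy as np
import pandas as pd

# Small synthetic frames that mimic the three CSV schemas.
rng = np.random.default_rng(1)
t0 = pd.Timestamp("2024-07-01")
sen = pd.DataFrame({
    "rec_time": [t0 + pd.Timedelta(minutes=15 * i) for i in range(40)],
    "zone": rng.choice(["Z1", "Z2"], size=40),
    "pressure_bar": rng.normal(3.2, 0.4, size=40).round(2),
    "flow_lpm": rng.normal(120, 25, size=40).round(1),
})
vendor = pd.DataFrame({
    "timeStamp": [t0 + pd.Timedelta(minutes=20 * i) for i in range(30)],
    "area": rng.choice(["Z1", "Z2"], size=30),
    "flowRate": rng.normal(118, 27, size=30).round(1),
})
zones = pd.DataFrame({"zone": ["Z1", "Z2"], "district": ["North", "East"]})

# Harmonize vendor keys to the sensors schema.
vendor = vendor.rename(
    columns={"timeStamp": "rec_time", "area": "zone", "flowRate": "vendor_flow_lpm"}
)

# merge_asof requires both frames to be sorted by the time key.
sen = sen.sort_values("rec_time")
vendor = vendor.sort_values("rec_time")

# Nearest-in-time match within the same zone, at most 10 minutes apart.
merged = pd.merge_asof(
    sen,
    vendor,
    on="rec_time",
    by="zone",
    direction="nearest",
    tolerance=pd.Timedelta(minutes=10),
)

# Join coverage: share of sensor rows that found a vendor record.
coverage = merged["vendor_flow_lpm"].notna().mean()

# Attach districts and summarize.
merged = merged.merge(zones, on="zone", how="left")
district_summary = merged.groupby("district")[["flow_lpm", "pressure_bar"]].mean()
```

`merge_asof` behaves like a left join, so every sensor row survives and `coverage` directly reports how many rows were matched under the chosen tolerance.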


In [None]:

vendor = pd.read_csv("flow_logs_s4.csv", parse_dates=["timeStamp"])
zones = pd.read_csv("zones_s4.csv")
# TODO



## Q3 Transformation
- `hour`, `daypart`.
- `pressure_z`, `flow_z` per zone.
- Anomaly flag (|z|>2).
- Pivot anomaly rate by `district`×`daypart`; plot.
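
The transformation steps might look like the sketch below, run on synthetic data so it is self-contained. The 6-hour daypart bins and their labels are assumptions (the task does not define them), and the flag here marks a row as anomalous when either z-score exceeds 2 in absolute value, which is one reading of the `|z|>2` rule.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned, integrated dataset.
rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "rec_time": pd.date_range("2024-07-01", periods=n, freq="15min"),
    "zone": rng.choice(["Z1", "Z2", "Z3", "Z4"], size=n),
    "pressure_bar": rng.normal(3.2, 0.4, size=n),
    "flow_lpm": rng.normal(120, 25, size=n),
})
df["district"] = df["zone"].map({"Z1": "North", "Z2": "East", "Z3": "South", "Z4": "West"})

# Time features; the daypart boundaries are an assumed convention.
df["hour"] = df["rec_time"].dt.hour
df["daypart"] = pd.cut(
    df["hour"],
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)


def zscore(s: pd.Series) -> pd.Series:
    """Standardize a series using its own mean and population std."""
    return (s - s.mean()) / s.std(ddof=0)


# Per-zone z-scores, so each zone is compared against its own baseline.
df["pressure_z"] = df.groupby("zone")["pressure_bar"].transform(zscore)
df["flow_z"] = df.groupby("zone")["flow_lpm"].transform(zscore)

# Anomaly flag: either indicator is more than 2 standard deviations out.
df["anomaly"] = ((df["pressure_z"].abs() > 2) | (df["flow_z"].abs() > 2)).astype(int)

# Anomaly rate by district and daypart.
anomaly_rate = df.pivot_table(
    index="district", columns="daypart", values="anomaly",
    aggfunc="mean", observed=False,
)
```

For the plot, `anomaly_rate.plot(kind="bar")` gives a grouped bar chart; a heatmap is a reasonable alternative.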

In [None]:
# TODO


## Q4 EDA & Inference
- Distributions and correlation by zone.
- Two insights with evidence.
- One labelled figure of key risk.
- Hypotheses + checks.
- ANOVA/robust alternative; interpret at α=0.05.
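
For the inference step, one way to structure the test is sketched below using SciPy, which the notebook does not import and is assumed to be available. The null hypothesis is that mean pressure is equal across zones. Levene's test and Shapiro-Wilk serve as assumption checks; if either fails, the sketch falls back to Kruskal-Wallis as the robust alternative. The data are synthetic, and the fallback rule is an illustrative choice rather than a prescribed procedure.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in: pressure readings labelled by zone.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "zone": rng.choice(["Z1", "Z2", "Z3", "Z4"], size=400),
    "pressure_bar": rng.normal(3.2, 0.4, size=400),
})
groups = [g["pressure_bar"].to_numpy() for _, g in df.groupby("zone")]

alpha = 0.05

# Assumption checks: equal variances (Levene) and per-group normality (Shapiro-Wilk).
_, levene_p = stats.levene(*groups)
shapiro_p = [stats.shapiro(g)[1] for g in groups]
assumptions_ok = levene_p > alpha and min(shapiro_p) > alpha

# H0: mean pressure is the same in every zone.
if assumptions_ok:
    test_name = "one-way ANOVA"
    stat, p_value = stats.f_oneway(*groups)
else:
    test_name = "Kruskal-Wallis"
    stat, p_value = stats.kruskal(*groups)

reject_h0 = bool(p_value < alpha)
print(f"{test_name}: statistic={stat:.3f}, p={p_value:.4f}, reject H0={reject_h0}")
```

Rejecting the null at α=0.05 only says that at least one zone differs; a post-hoc comparison would be needed to say which.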

In [None]:
# TODO

## Q5 Model Building
- IoT sensors: cluster operating states, classify `anomaly_flag` (top 20% of vibration readings), and model `pressure`.
- Data: temperature, vibration, pressure, zone∈{Z1,Z2,Z3}.

## Generate Data

In [None]:
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)  # seeded generator for reproducibility
n = 600
temperature = rng.normal(28, 4, n)
vibration = np.abs(rng.normal(0.8, 0.4, n))
# Pressure depends linearly on temperature and vibration, plus Gaussian noise
pressure = 3.0 + 0.05 * (temperature - 28) + 0.3 * vibration + rng.normal(0, 0.15, n)
zone = rng.choice(["Z1", "Z2", "Z3"], n, p=[0.4, 0.35, 0.25])
df4 = pd.DataFrame({
    "temperature": temperature.round(2),
    "vibration": vibration.round(3),
    "pressure": pressure.round(3),
    "zone": zone,
})
# Flag the top 20% of vibration readings as anomalies
thr = np.quantile(df4["vibration"], 0.80)
df4["anomaly_flag"] = (df4["vibration"] >= thr).astype(int)
df4.head()

### Build **KNN** model
- Choose **K** via a small search; report test performance.
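
A small K search could be set up as in the sketch below, assuming scikit-learn is available (the notebook does not import it). The candidate grid, the 75/25 split, and the F1 scoring are illustrative choices. Note that `anomaly_flag` is derived directly from `vibration`, so including `vibration` as a feature makes the task close to trivial; whether to keep it is a modelling decision worth stating.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Recreate df4 with the same recipe as the data-generation cell.
rng = np.random.default_rng(4)
n = 600
temperature = rng.normal(28, 4, n)
vibration = np.abs(rng.normal(0.8, 0.4, n))
pressure = 3.0 + 0.05 * (temperature - 28) + 0.3 * vibration + rng.normal(0, 0.15, n)
zone = rng.choice(["Z1", "Z2", "Z3"], n, p=[0.4, 0.35, 0.25])
df4 = pd.DataFrame({"temperature": temperature.round(2), "vibration": vibration.round(3),
                    "pressure": pressure.round(3), "zone": zone})
df4["anomaly_flag"] = (df4["vibration"] >= np.quantile(df4["vibration"], 0.80)).astype(int)

X = df4[["temperature", "vibration", "pressure"]]
y = df4["anomaly_flag"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale inside the pipeline so each CV fold is standardized on its own training part.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
k_grid = [3, 5, 7, 9, 11, 15]
search = GridSearchCV(pipe, {"knn__n_neighbors": k_grid}, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Report performance of the best K on the held-out test set.
best_k = search.best_params_["knn__n_neighbors"]
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)
print(f"best K={best_k}, accuracy={test_accuracy:.3f}, F1={test_f1:.3f}")
```

Because only about 20% of rows are positive, F1 is more informative here than accuracy alone.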

In [None]:
#TODO


### Build **KMeans** model
- Standardize features.
- Try multiple k values and report **silhouette**; pick k and interpret clusters briefly.
- Plot a 2D scatter (two features or PCA to 2D).
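
The clustering steps can be sketched as follows, again assuming scikit-learn. The range of k values (2 to 6) and the use of PCA for the 2D view are illustrative choices, and the `cluster` column is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Recreate the numeric part of df4 with the same recipe as the data-generation cell.
rng = np.random.default_rng(4)
n = 600
temperature = rng.normal(28, 4, n)
vibration = np.abs(rng.normal(0.8, 0.4, n))
pressure = 3.0 + 0.05 * (temperature - 28) + 0.3 * vibration + rng.normal(0, 0.15, n)
df4 = pd.DataFrame({"temperature": temperature.round(2), "vibration": vibration.round(3),
                    "pressure": pressure.round(3)})

features = ["temperature", "vibration", "pressure"]
X_std = StandardScaler().fit_transform(df4[features])

# Silhouette score for each candidate k.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_std)
    scores[k] = silhouette_score(X_std, labels)
best_k = max(scores, key=scores.get)

# Refit with the chosen k and profile clusters in the original units.
km = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X_std)
df4["cluster"] = km.labels_
cluster_profile = df4.groupby("cluster")[features].mean()

# 2D coordinates for a scatter plot.
coords = PCA(n_components=2).fit_transform(X_std)
```

The scatter can then be drawn with `plt.scatter(coords[:, 0], coords[:, 1], c=df4["cluster"])`. Since the data were generated without distinct groups, low silhouette values would not be surprising and are worth commenting on in the interpretation.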

In [None]:
#TODO

### Build **Linear Regression**
- Train/test split and fit Linear Regression (optionally compare Ridge/Lasso).
- Report RMSE & R²; inspect coefficients.
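
A minimal regression sketch, assuming scikit-learn, is shown below. It predicts `pressure` from `temperature`, `vibration`, and one-hot encoded `zone`; the feature set and the 75/25 split are illustrative choices, and the Ridge/Lasso comparison is left out.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Recreate df4 with the same recipe as the data-generation cell.
rng = np.random.default_rng(4)
n = 600
temperature = rng.normal(28, 4, n)
vibration = np.abs(rng.normal(0.8, 0.4, n))
pressure = 3.0 + 0.05 * (temperature - 28) + 0.3 * vibration + rng.normal(0, 0.15, n)
zone = rng.choice(["Z1", "Z2", "Z3"], n, p=[0.4, 0.35, 0.25])
df4 = pd.DataFrame({"temperature": temperature.round(2), "vibration": vibration.round(3),
                    "pressure": pressure.round(3), "zone": zone})

# One-hot encode zone, dropping one level to avoid a redundant column.
X = pd.get_dummies(df4[["temperature", "vibration", "zone"]],
                   columns=["zone"], drop_first=True, dtype=float)
y = df4["pressure"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Test-set metrics.
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = r2_score(y_test, y_pred)

# Coefficients by feature name, for inspection.
coefs = pd.Series(model.coef_, index=X.columns)
print(f"RMSE={rmse:.3f}, R2={r2:.3f}")
print(coefs)
```

Since the generating formula uses 0.05 for temperature and 0.3 for vibration with no zone effect, the fitted coefficients can be checked against those values, and the zone dummies should come out close to zero.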

In [None]:
#TODO