# 01 — Exploratory Data Analysis (EDA)

This notebook explores the **household electricity consumption** dataset.

**Notes**  
- Usage is derived from billing records.  
- Demographics are from a third-party aggregator (watch for bias/missingness).

We'll inspect schema, missingness, distributions, seasonality (Apr–Aug), and proxies like `mozip`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read from a local CSV. If the data lives in BigQuery instead,
# load it with the Colab BigQuery connector or pandas-gbq.
csv_path = "../data/data_test.csv"  # update to match your local path
df = pd.read_csv(csv_path)

print(df.shape)
df.head()
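
The intro lists schema inspection as a first step. A minimal sketch of one way to do it follows; the helper name `schema_report` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def schema_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, non-null count, and distinct-value count per column.

    Hypothetical helper for illustration only.
    """
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),        # storage type per column
            "non_null": frame.notna().sum(),          # rows with a value present
            "n_unique": frame.nunique(dropna=True),   # distinct non-missing values
        }
    )
```

Calling `schema_report(df)` in the next cell would give a one-row-per-column overview. Columns with very low `n_unique` are likely categorical, and columns with low `non_null` are worth a closer look in the missingness report.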

In [None]:
# Basic missingness report
miss = df.isna().mean().sort_values(ascending=False)
miss.to_frame("missing_rate").head(15)
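
Since the notes flag possible bias in the third-party demographics, it may help to check whether usage differs between rows where a field is missing and rows where it is present. The sketch below assumes `lusage` as the target; the helper name `usage_by_missingness` and any demographic column passed to it are hypothetical.

```python
import pandas as pd


def usage_by_missingness(
    frame: pd.DataFrame, column: str, target: str = "lusage"
) -> pd.DataFrame:
    """Compare count and mean of `target` for rows where `column` is missing vs present.

    Hypothetical helper; `column` is whichever demographic field you want to check.
    """
    # Label each row by whether `column` has a value.
    flag = frame[column].isna().map({True: "missing", False: "present"})
    # `count` is the number of non-null target values in each group.
    return frame.groupby(flag)[target].agg(["count", "mean"])
```

For example, `usage_by_missingness(df, "<demographic column>")` with a column taken from the top of the missingness report. A large gap between the two means would suggest the field is not missing at random.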

In [None]:
# Distribution of lusage
plt.figure()
df['lusage'].dropna().hist(bins=40)
plt.title("Distribution of log(kWh) monthly usage")
plt.xlabel("lusage (log kWh)")
plt.ylabel("count")
plt.show()

In [None]:
# Seasonality by month and year
pivot = df.pivot_table(index="month", columns="year", values="lusage", aggfunc="mean")
pivot
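
To look at the Apr-Aug window mentioned in the intro directly, one option is to compare mean usage inside and outside that window for each year. This is a sketch under the assumption that `month` holds integers 1-12, which the source does not confirm; the helper name `season_summary` is hypothetical.

```python
import pandas as pd


def season_summary(
    frame: pd.DataFrame,
    year_col: str = "year",
    month_col: str = "month",
    value_col: str = "lusage",
    season_months: tuple = (4, 5, 6, 7, 8),
) -> pd.DataFrame:
    """Mean of `value_col` per year, split into Apr-Aug vs the remaining months.

    Assumes `month_col` contains integers 1-12 (an assumption, not confirmed).
    """
    in_season = frame[month_col].isin(list(season_months))
    labelled = frame.assign(
        season=in_season.map({True: "Apr-Aug", False: "other"})
    )
    # One row per year, one column per season label.
    return labelled.groupby([year_col, "season"])[value_col].mean().unstack("season")
```

If `month` turns out to be stored as strings or names, every row would land in `other`, so it is worth checking `df["month"].unique()` first.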

In [None]:
# Correlation of lusage with its lagged values (lusage1 ... lusage6)
cols = ['lusage','lusage1','lusage2','lusage3','lusage4','lusage5','lusage6']
corr = df[cols].corr()
corr
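
The intro also mentions proxies like `mozip`. A rough way to profile such a column is to count rows per group and compare mean usage across groups. The sketch below treats `mozip` as a grouping key without assuming what it encodes; the helper name `proxy_profile` is hypothetical.

```python
import pandas as pd


def proxy_profile(
    frame: pd.DataFrame,
    proxy_col: str = "mozip",
    value_col: str = "lusage",
    min_rows: int = 1,
) -> pd.DataFrame:
    """Row count and mean of `value_col` per proxy group, largest groups first.

    Hypothetical helper; missing proxy values are kept as their own group.
    """
    stats = frame.groupby(proxy_col, dropna=False)[value_col].agg(["size", "mean"])
    # Drop very small groups, whose means are noisy.
    stats = stats[stats["size"] >= min_rows]
    return stats.sort_values("size", ascending=False)
```

Something like `proxy_profile(df, min_rows=30).head(20)` would show the largest groups; the threshold of 30 is an arbitrary illustration, not a value from the source.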