# 01 — Exploratory Data Analysis

**Objective.** This notebook takes a first look at the household electricity data in the staging table. It documents basic distributions, seasonality across months, and the relationship between current and lagged usage, in order to motivate later imputation and modeling choices.

**Tables.**
- Staging: `PROJECT_ID.DATASET.stg_usage`

**Notes.**
- `lusage` denotes the natural logarithm of monthly kWh.
- The study window covers months 4–8 (April–August) in 2010–2011.

In [None]:
# Configuration
PROJECT_ID = "data-engineering-ecometricx"
DATASET = "energy_analytics"
LOCATION = "EU"

# If running locally/Colab, uncomment the next line to install requirements.
# !pip -q install --upgrade pandas pandas-gbq google-cloud-bigquery

import pandas as pd
from pandas_gbq import read_gbq
import matplotlib.pyplot as plt

# Fully qualified, backtick-quoted table reference built from the configuration above.
STG = f"`{PROJECT_ID}.{DATASET}.stg_usage`"

## 1. Row counts and missingness

In [None]:
q = (
    "SELECT COUNT(*) AS n_rows, "
    "COUNTIF(lusage IS NULL) AS miss_lusage, "
    "COUNTIF(size_sqft IS NULL) AS miss_size, "
    "COUNTIF(children IS NULL) AS miss_children "
    f"FROM {STG}"
)
summary = read_gbq(q, project_id=PROJECT_ID, location=LOCATION)
summary
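
The raw counts are easier to compare as shares of all rows. A minimal sketch, assuming `summary` has the single-row shape produced by the query above; `missing_shares` is a hypothetical helper and the numbers in `summary_demo` are made-up placeholders, not results from the staging table.

```python
import pandas as pd

# Hypothetical stand-in for the one-row `summary` frame returned by the query;
# the counts are illustrative placeholders, not real results.
summary_demo = pd.DataFrame(
    {"n_rows": [1000], "miss_lusage": [50], "miss_size": [120], "miss_children": [0]}
)

def missing_shares(summary: pd.DataFrame) -> pd.Series:
    """Return each miss_* count divided by n_rows."""
    row = summary.iloc[0]
    miss_cols = [c for c in summary.columns if c.startswith("miss_")]
    return (row[miss_cols] / row["n_rows"]).astype(float)

shares = missing_shares(summary_demo)
print(shares.round(3))
```

In the notebook itself, `missing_shares(summary)` would apply the same calculation to the real query result.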

## 2. Sample extraction for plotting

In [None]:
# NOTE: LIMIT without ORDER BY returns an arbitrary subset of rows, not a random sample.
q = f"SELECT * FROM {STG} LIMIT 100000"
df = read_gbq(q, project_id=PROJECT_ID, location=LOCATION)
df.shape
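
If a random rather than arbitrary subset is wanted for plotting, one option is to filter on BigQuery's `RAND()` function, which keeps each row independently with a chosen probability. The sketch below only builds the query string; `sample_query`, the table name, and the fraction are hypothetical and would need to be adapted to the staging table's size. The number of rows returned varies from run to run.

```python
def sample_query(table: str, fraction: float) -> str:
    """Build a query that keeps each row with probability `fraction`."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # RAND() returns a FLOAT64 in [0, 1), so the filter passes about `fraction` of rows.
    return f"SELECT * FROM {table} WHERE RAND() < {fraction}"

# Hypothetical table reference, for illustration only.
q_demo = sample_query("`my-project.my_dataset.stg_usage`", 0.1)
print(q_demo)
```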

## 3. Distribution of log usage (`lusage`)

In [None]:
plt.figure()
df['lusage'].dropna().hist(bins=50)
plt.title('Histogram of log monthly kWh (lusage)')
plt.xlabel('lusage')
plt.ylabel('count')
plt.show()
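
The direction of skew visible in a histogram can also be checked numerically. A minimal sketch using pandas' bias-corrected sample skewness (`Series.skew`); the toy values are invented to show a long right tail and are not drawn from the staging table.

```python
import pandas as pd

# Toy values with a long right tail; illustrative only.
toy_values = pd.Series([1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 10.0])

# Positive skewness indicates a longer right tail, negative a longer left tail.
toy_skew = toy_values.skew()
print(f"skewness = {toy_skew:.2f}")
```

On the real data the equivalent call would be `df['lusage'].dropna().skew()`.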

## 4. Seasonal pattern by month

In [None]:
month_avg = df.groupby('month', as_index=False)['lusage'].mean()
plt.figure()
plt.plot(month_avg['month'], month_avg['lusage'], marker='o')
plt.title('Average log usage by month')
plt.xlabel('month')
plt.ylabel('average lusage')
plt.xticks([4,5,6,7,8])
plt.show()

## 5. Relationship between contemporaneous usage and the first lag

In [None]:
sample = df.dropna(subset=['lusage','lusage1'])
n = min(20000, len(sample))
sample = sample.sample(n=n, random_state=42) if n>0 else sample

plt.figure()
plt.scatter(sample['lusage1'], sample['lusage'], s=3, alpha=0.3)
plt.title('Scatter: lusage versus lusage1')
plt.xlabel('lusage1 (t-1)')
plt.ylabel('lusage (t)')
plt.show()
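
The scatter can be complemented with a single summary number. A minimal sketch that computes the Pearson correlation between `lusage` and `lusage1` after dropping incomplete pairs, mirroring the `dropna` step above; the toy frame is synthetic and only the column names come from this notebook.

```python
import pandas as pd

# Synthetic toy data; values are illustrative, not from the staging table.
toy = pd.DataFrame({
    "lusage1": [6.0, 6.2, 6.5, 6.9, 7.1, None],
    "lusage":  [6.1, 6.3, 6.4, 7.0, 7.2, 6.8],
})

# Keep only rows where both the current value and its first lag are observed.
pairs = toy.dropna(subset=["lusage", "lusage1"])

# Series.corr uses the Pearson coefficient by default.
r = pairs["lusage"].corr(pairs["lusage1"])
print(f"n={len(pairs)}, r={r:.3f}")
```

Applied to `sample` from the cell above, a value of `r` close to 1 would correspond to the strong persistence that the scatter is meant to show.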

## 6. Brief interpretation

- The histogram of `lusage` indicates a right-skewed but compact distribution, consistent with log-kWh.
- The monthly profile (April→August) exhibits the expected summer increase in electricity usage.
- The scatter of `lusage` on `lusage1` shows strong persistence across adjacent months, motivating the use of lagged features in modeling.