In [3]:
import pandas as pd
import numpy as np

# Daily Dataset Inspection

Purpose:
- Understand the structure and quality of daily consumption data
- Identify identifiers, targets, and merge keys
- Flag risks before merging with household and weather data


In [13]:
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

In [15]:
df = pd.read_csv("/Users/loso/code/projects/forecast-and-flex/data/raw/daily_dataset.csv")

In [18]:
df.shape
df.head()


Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min
0,MAC000131,2011-12-15,0.485,0.432045,0.868,22,0.239146,9.505,0.072
1,MAC000131,2011-12-16,0.1415,0.296167,1.116,48,0.281471,14.216,0.031
2,MAC000131,2011-12-17,0.1015,0.189812,0.685,48,0.188405,9.111,0.064
3,MAC000131,2011-12-18,0.114,0.218979,0.676,48,0.202919,10.511,0.065
4,MAC000131,2011-12-19,0.191,0.325979,0.788,48,0.259205,15.647,0.066


### What does one row represent?

One row corresponds to a single household (`LCLid`) on a single calendar day (`day`).

For each household-day pair, the dataset contains summary statistics (mean, median, min, max, std, sum, count) computed from that household's half-hourly electricity consumption readings for that day.


Note:
A complete day has 48 half-hourly readings (two per hour over 24 hours), recorded in `energy_count`.
Rows with fewer readings cover only part of the day, so their `energy_sum` and other statistics are not directly comparable to those of complete days and should be flagged or filtered before analysis.
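
One possible way to make the note above actionable is to flag complete days explicitly. This is a minimal sketch on a toy frame built from the first three rows shown in `df.head()`; the names `toy`, `READINGS_PER_DAY`, `is_complete_day`, and `coverage` are illustrative assumptions, not columns of the dataset.

```python
import pandas as pd

# Toy stand-in for the daily dataset (first three rows of df.head(), subset of columns).
toy = pd.DataFrame({
    "LCLid": ["MAC000131", "MAC000131", "MAC000131"],
    "day": ["2011-12-15", "2011-12-16", "2011-12-17"],
    "energy_count": [22, 48, 48],
    "energy_sum": [9.505, 14.216, 9.111],
})

READINGS_PER_DAY = 48  # two half-hourly readings per hour over 24 hours

# Flag household-days that have the full set of half-hourly readings.
toy["is_complete_day"] = toy["energy_count"] == READINGS_PER_DAY

# Fraction of the day actually covered by readings (1.0 means complete).
toy["coverage"] = toy["energy_count"] / READINGS_PER_DAY

# Keep only complete days, e.g. before comparing daily sums across households.
complete_days = toy[toy["is_complete_day"]]
print(complete_days)
```

Keeping the flag as a column, rather than dropping rows immediately, leaves the filtering decision open until the missingness pattern is better understood.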


## Check whether (LCLid, day) is unique

🔍 Step 1: How many rows vs. unique keys

In [19]:
n_rows = len(df)
n_unique_keys = df[["LCLid", "day"]].drop_duplicates().shape[0]

n_rows, n_unique_keys


(3510433, 3510433)

### Uniqueness check

Each row is expected to represent one household-day.

Result:
- `(LCLid, day)` uniqueness: ✅ (3,510,433 rows, 3,510,433 unique keys)
- Action required: none
- The dataset is a daily household panel with no duplicate household-days
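
Had the two counts differed, the offending rows would need to be inspected. A hedged sketch of how that could look, using `DataFrame.duplicated` on a toy frame with one deliberately duplicated key (the household IDs and values here are made up for illustration):

```python
import pandas as pd

# Toy frame with one duplicated (LCLid, day) pair: household "B" on 2012-01-01.
toy = pd.DataFrame({
    "LCLid": ["A", "A", "B", "B"],
    "day": ["2012-01-01", "2012-01-02", "2012-01-01", "2012-01-01"],
    "energy_sum": [1.0, 2.0, 3.0, 3.5],
})

key = ["LCLid", "day"]

# keep=False marks every row that shares its key with another row,
# so all conflicting copies can be inspected side by side.
dup_mask = toy.duplicated(subset=key, keep=False)
duplicate_rows = toy[dup_mask]

# With the default keep="first", only the extra copies are counted.
n_extra_rows = int(toy.duplicated(subset=key).sum())

print(duplicate_rows)
print(n_extra_rows)
```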


------

## Next questions to answer:

- How many days per household do we have?

- How often is `energy_count` < 48?

- Are missing days random or systematic?

- Do some households have much shorter histories?

## The answers tell us whether:

- Time-series models are feasible

- Cross-household comparisons are fair

- We need filtering rules
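
The questions above could be approached roughly as follows. This is a sketch on a small made-up frame, not the real data; it assumes `day` has been parsed with `pd.to_datetime`, and the names `span`, `expected_days`, and `missing_days` are hypothetical helpers introduced here.

```python
import pandas as pd

# Toy panel: household "A" skips 2012-01-03 and 2012-01-04; "B" has a 1-reading day.
toy = pd.DataFrame({
    "LCLid": ["A", "A", "A", "B", "B"],
    "day": pd.to_datetime(
        ["2012-01-01", "2012-01-02", "2012-01-05", "2012-01-01", "2012-01-02"]
    ),
    "energy_count": [48, 47, 48, 48, 1],
})

# How many days of history does each household have?
days_per_household = toy.groupby("LCLid")["day"].nunique()

# How often is energy_count < 48? (share of household-days)
share_incomplete = (toy["energy_count"] < 48).mean()

# Are whole days missing? Compare observed rows with the calendar span per household.
span = toy.groupby("LCLid")["day"].agg(["min", "max", "count"])
span["expected_days"] = (span["max"] - span["min"]).dt.days + 1
span["missing_days"] = span["expected_days"] - span["count"]

print(days_per_household)
print(share_incomplete)
print(span)
```

Whether gaps are random or systematic would still need a further look, for example by checking if missing days cluster at the start or end of each household's history.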

In [20]:
df["energy_count"].describe()


count    3.510433e+06
mean     4.780364e+01
std      2.810982e+00
min      0.000000e+00
25%      4.800000e+01
50%      4.800000e+01
75%      4.800000e+01
max      4.800000e+01
Name: energy_count, dtype: float64

In [21]:
df["energy_count"].value_counts().sort_index()


energy_count
0          30
1       11301
4           2
8           2
9           2
10          2
11          1
12          2
13          2
14          2
15         12
16         15
17         29
18         65
19        132
20        168
21        250
22        318
23        369
24        423
25        421
26        510
27        501
28        543
29        544
30        529
31        399
32        274
33         39
34         43
35         39
36         56
37         69
38         92
39        112
40        233
41        164
42        204
43        252
44        290
45        426
46       1005
47      21209
48    3469352
Name: count, dtype: int64
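
The counts above already answer the question of how often `energy_count` falls below 48. A quick arithmetic sketch, with the figures copied by hand from the outputs above (the variable names are illustrative):

```python
# Figures copied from the outputs above.
n_rows = 3_510_433          # total household-days
n_complete = 3_469_352      # rows with energy_count == 48
n_one_short = 21_209        # rows with energy_count == 47
n_single_reading = 11_301   # rows with energy_count == 1

n_incomplete = n_rows - n_complete
share_incomplete = n_incomplete / n_rows

# How much of the incomplete set is explained by the two largest groups?
share_of_incomplete_47_or_1 = (n_one_short + n_single_reading) / n_incomplete

print(f"incomplete household-days: {n_incomplete} ({share_incomplete:.2%})")
print(f"of which 47 or 1 readings: {share_of_incomplete_47_or_1:.1%}")
```

On these figures, incomplete days are a small minority, and most of them either miss a single reading or contain only one reading, which suggests a simple threshold rule on `energy_count` may be enough as a first filter.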