## 02 - Interactive exploration of data

This notebook covers the interactive exploration that guides our data cleansing and pre-processing later. We look at:

- Basic statistics (counts, missing, min, max)
- Sense checking our assumptions against what we know
- Establishing bounds for outliers
- Considering the integrity of our timestamp data

We're looking at the top flow of this diagram:

<img src="../docs/imgs/energy-sa-clean-up-flow.png" width="300">

The data exploration and pre-processing steps are covered in detail in the [original weave exploration notebook](https://github.com/centre-for-ai-and-climate/weave/blob/main/docs/smart-meter-examples-v1.0.ipynb).



In [0]:
%run ./includes/common_functions_and_imports

In [0]:
import pandas as pd
import geopandas as gpd

import pyspark.sql.functions as F
from pyspark.sql.window import Window


In [0]:
source_table_name = (
    f"{CONFIG.target_catalog}.{CONFIG.target_schema}.smart_meter_data_raw"
)
if not spark.catalog.tableExists(source_table_name):
  dbutils.notebook.exit('Source table does not exist')

raw_df = spark.table(source_table_name)

## Section 1 : Initial exploration and checking our assumptions

Timeseries analysis is _hard_. Datasets are large, almost always complex, and introducing a temporal component brings with it a whole host of new things to consider.

One saving grace is that we are often dealing with real, physical systems. Since somewhere in the chain we're dealing with a physical thing, we can check our assumptions against physics before we start (Is a value of 10x the output of the sun realistic for a domestic lightbulb?), or use this to inform our modelling approach.

As an example, let's look at what we know, from the [about section of the website](https://weave.energy/smart-meter-data), with some key parts highlighted:
> **UK domestic** smart meter consumption data at **half-hourly**, **aggregated at LV feeder level**. This dataset represents **120,000 LV feeders, 2,000,000 smart meters**, and is over a billion rows of data.

**Summarising**:
- 🏡 We're dealing with UK domestic consumption 
- ⏱️ Each row should represent a half hour time period
- ∑ Each row is an aggregate at an LV feeder level
- 🏘️ The average number of smart meters fed by an LV feeder (naively...) is 2,000,000 / 120,000 ~= 17

With a bit more reading from the provided docs, we also know:
- We have some extreme outlier values for consumption.
- We have some sparse time series where data is missing for certain periods due to upstream errors.
- The units are watt-hours (Wh).

Units are important because they give us a sense of scale and let us check whether a value is plausible.

An example:
- I have 17 houses fed by my LV feeder.
- The average UK household consumption is ~2700 kWh per year, or 7.4 kWh (7400 Wh) per day.
- This means I'd expect an average of 17 * 7400 ~= 126,000 Wh for a typical LV feeder on a given day.
- Or naively, divide by 48 and get 2625 Wh per half hour if it was evenly split (not a great assumption).
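
The arithmetic above is easy to re-run with different assumptions. A minimal sketch in plain Python, using only the figures quoted in this notebook; the function name `feeder_yardstick` and the dictionary keys are just for illustration:

```python
def feeder_yardstick(smart_meters, lv_feeders, annual_kwh_per_household):
    """Rough expected consumption (Wh) for a typical LV feeder."""
    # Naive average: total meters spread evenly over all feeders
    meters_per_feeder = round(smart_meters / lv_feeders)
    # Annual kWh -> daily Wh, rounded to the nearest 100 Wh as in the text
    daily_wh_per_household = round(annual_kwh_per_household * 1000 / 365, -2)
    daily_wh_per_feeder = meters_per_feeder * daily_wh_per_household
    return {
        "meters_per_feeder": meters_per_feeder,
        "daily_wh_per_household": daily_wh_per_household,
        "daily_wh_per_feeder": daily_wh_per_feeder,
        # Evenly split over 48 half-hour periods (not a great assumption)
        "half_hourly_wh_per_feeder": daily_wh_per_feeder / 48,
    }


yardstick = feeder_yardstick(2_000_000, 120_000, 2_700)
for name, value in yardstick.items():
    print(f"{name}: {value:,.0f}")
```

The half-hourly figure comes out slightly below 2625 Wh because the text rounds the daily total to 126,000 Wh before dividing. Either way, it is only a yardstick.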

So let's look at some statistics of our data:

### Examining our numerical features
We'll start with `aggregated_device_count_active` and `total_consumption_active_import`.

In [0]:
input_count = raw_df.count()
missing_count = raw_df.filter('isnull(total_consumption_active_import)').count()
# Counts are integers, so format them without decimal places
print("Total count before processing   = {:,}".format(input_count))
print("Number of missing datapoints is = {:,} ({:,.2f}%)".format(missing_count, missing_count/input_count*100))

I use Spark's DataFrame `summary()` method below to calculate the statistics. The cell can take some time to run, so if you want to skip it, the outputs are included with the analysis that follows.

In [0]:
display(raw_df.select('aggregated_device_count_active', 'total_consumption_active_import').summary())


### Thinking about our statistics

The output looks like this:

|Statistic|aggregated_device_count_active|total_consumption_active_import|
|-|-|-|
|count|2143519514|2143519514|
|mean|24.573|776109.242|
|stddev|18.916|1.031x10^8|
|min|5|-268434518|
|25%|11|1903|
|50%|19|3737|
|75%|32|6811|
|max|1299|236223218982|

There are a few things that stand out:
1. Outliers _are_ crazy 🤯 For scale, the Hinkley Point C nuclear power station is rated at 3200 MW (3.2e9 W), so it would generate 3.2e9 Wh in an hour. Our peak domestic consumption through a single LV feeder is only pulling a measly ~74 times that hourly output, from a power station we haven't finished building yet, in a single half-hour window. Why aren't we building more reactors to feed this very hungry feeder?
1. Our initial yardstick values are pretty good: median consumption for an LV feeder is 3737 Wh against a median of 19 devices, versus our guesstimate of 2625 Wh against 17. This is a nice starting point.
1. 🤔 I want to investigate the device count max value. Some quick internet searches reveal that typically you'd expect 20-80 properties fed from a single LV feed...
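
To make those comparisons concrete, here is a small sketch that redoes the sums using only the numbers from the table and the text. The variable names are illustrative, and dividing one median by another is only a rough guide to per-device consumption:

```python
MAX_FEEDER_WH = 236_223_218_982  # max of total_consumption_active_import
HINKLEY_C_WATTS = 3.2e9          # 3200 MW rating quoted in the text

# Running at rated power for one hour: W x 1 h = Wh
hinkley_wh_per_hour = HINKLEY_C_WATTS * 1
ratio_to_hourly_output = MAX_FEEDER_WH / hinkley_wh_per_hour

# Rough per-device comparison of the observed medians against our guess
observed_wh_per_device = 3737 / 19
guessed_wh_per_device = 2625 / 17

print(f"Max reading vs hourly plant output: {ratio_to_hourly_output:.1f}x")
print(f"Observed Wh per device per half hour: {observed_wh_per_device:.0f}")
print(f"Guessed Wh per device per half hour:  {guessed_wh_per_device:.0f}")
```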


## Section 2 : Diving deeper into LV feeders

So less than a minute on the internet tells me to expect between 20 and 80 households to an LV feeder.

Less than 5 minutes tells me that modern designs can be rated significantly higher...

What does my data tell me?

Why do we care? We need to normalise against the number of devices served per LV feeder later.

We're going to use `width_bucket()` [docs link](https://docs.databricks.com/aws/en/sql/language-manual/functions/width_bucket) to quickly get a histogram. It has two useful return values we want to be aware of:
1. A null bucket for null values.
1. Bucket `numBuckets + 1` for collecting everything at or above our specified max.

I run it with the following settings:
- minValue 1, since we know our minimum devices served is 5.
- maxValue 1300, which is just beyond our observed max (1299)
- 13 buckets to get an approximate width of 100.
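
If the bucket numbering is not obvious, the following pure-Python sketch mimics the behaviour described in the linked docs for the case where the minimum is below the maximum. It is an illustration of the idea, not Spark's implementation, and is only meant for checking a few values by hand:

```python
import math


def width_bucket(value, min_value, max_value, num_buckets):
    """Equi-width bucket lookup (sketch; assumes min_value < max_value)."""
    if value is None:
        return None  # nulls stay null
    if value < min_value:
        return 0  # underflow bucket
    if value >= max_value:
        return num_buckets + 1  # overflow bucket
    width_fraction = (value - min_value) / (max_value - min_value)
    return math.floor(num_buckets * width_fraction) + 1


# Same settings as this notebook: min 1, max 1300, 13 buckets
for devices in (5, 650, 1299, 1300, None):
    print(devices, width_bucket(devices, 1, 1300, 13))
```

With these settings a feeder serving 650 devices falls in bucket 7, and the observed maximum of 1299 falls in the last regular bucket, 13.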


In [0]:
display(
    raw_df.selectExpr(
        "aggregated_device_count_active",
        "width_bucket(aggregated_device_count_active, 1, 1300, 13) as bucket",
    )
    .groupby("bucket")
    .agg(
        # Note: this counts rows (half-hourly readings), not distinct feeders
        F.count("aggregated_device_count_active").alias("num_feeders_in_bucket"),
        F.max("aggregated_device_count_active").alias(
            "highest_number_of_devices_served_in_bucket"
        ),
    )
    .orderBy("bucket")
)

Cool, it looks like the value of 1299 is a unique outlier, and actually we have a good number of LV feeders that feed a significant number of properties. (You can re-run the above experiment yourself to explore the 600-700 devices range 👀)

In the real world this could be a significant factor that influences our model - different LV feeder specs or clusters may supply different property types (for example, properties with EVs requiring a higher capacity), or inversely it may give you a way to detect outliers with more confidence based on the capacity of a feeder.

### Okay so when should we drop outliers in consumption?

According to the [FAQ page](https://docs.google.com/document/d/16tOUcaxGzSzuTw2JyR2O5yp_q7VM2J4jTV3lAkV1skQ/edit?tab=t.0), there is a known issue with the data provided by the DNOs where very large demand values are recorded.

> ### Extreme consumption values
> Some of the energy consumption values in the data are insane, like, multiple GW in a half-hour insane. In our visualisations and processing so far, we’ve tended to filter out everything higher than 20,000 Wh. Getting to the bottom of this is one of our tasks for the future.

If we accept that we have valid data in the order of ~650 devices to an LV feeder, we could use our average household value from earlier (7400 Wh per day) to get a rough half-hourly limit (650 * 7400 / 48 ~= 100K Wh).

Or we can just use the box plot feature of `display` for this range to give us a 1.5 x IQR limit to use (upper whisker value = 189,223 Wh). So we can **start with a filter value of 200K Wh.**
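
Both limits can be sketched in a few lines of plain Python. The rough limit uses the figures from this notebook; the whisker function uses the common Q3 + 1.5 x IQR rule and is shown on made-up toy values, since the real readings live in Spark. How exactly the Databricks box plot computes its whisker is an assumption here, so treat the function as an approximation of that display:

```python
import statistics


def upper_whisker_limit(values):
    """Upper fence of a box plot: Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


# Rough physical limit: ~650 devices, 7400 Wh per day each, 48 half hours
rough_limit_wh = 650 * 7_400 / 48
print(f"Rough half-hourly limit: {rough_limit_wh:,.0f} Wh")

# Toy readings only, to show the mechanics of the fence
toy_readings = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(f"Upper fence for toy data: {upper_whisker_limit(toy_readings)}")
```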


In [0]:
display(
  raw_df.filter("aggregated_device_count_active between 600 and 700")
)
# Doing the same thing with pandas and grouping by dno_alias reveals that readings in this device-count range only come from the DNO SSEN.
# raw_df.filter("aggregated_device_count_active between 600 and 700").toPandas().boxplot(column='total_consumption_active_import', by='dno_alias')


## Section 3 : Thinking about our time data

One limitation of the data we have is the lack of historical data. Because we're looking at just over a year of data, it's impossible to pull out seasonal trends with any certainty, and we will need to be careful to stop the model over-fitting against seasonal variations.

We're not going to spend long here since we need to pre-process our data before we go too far, but we do want to consider:

- Are our timestamps uniformly spaced?
- Do we have sufficient data from each DNO over the time period?


In [0]:
# Convert our timestamps to integer representation and check the remainder after dividing by 1800 (60s * 30 mins). Anything not equal to zero implies a timestamp not sitting on a half hour window.
(
  raw_df.select("dno_alias", "data_collection_log_timestamp")
  .withColumn(
      "timestamp_is_not_half_hour",
      (F.to_unix_timestamp("data_collection_log_timestamp") % 1800) != 0,
  )
  .filter("timestamp_is_not_half_hour")
  .count()
)
# We could use isEmpty() here to get a binary yes/no answer to our question, but the count is useful to give us insight into the scale of the impact.

In [0]:
# This cell takes a while to run, skip it if you like. The output is 1952629 rows with >30min gaps since last reading.
# Order each feeder's readings in time so lag() returns the previous timestamp
feeder_window = Window.partitionBy("lv_feeder_unique_id").orderBy(
    "data_collection_log_timestamp"
)
gap_count = (
    raw_df.withColumn(
        "minutes_since_last_reading",
        F.timestamp_diff(
            "minute",
            F.lag("data_collection_log_timestamp").over(feeder_window),
            F.col("data_collection_log_timestamp"),
        ),
    )
    .filter("minutes_since_last_reading > 30")
    .count()
)
print(f"{gap_count=:,}")

Our first two tests look at the integrity of our timestamps.
While we have an insignificant number of rows that aren't on the 30-minute mark (176 rows), we have significantly more rows that follow a gap of more than 30 minutes in a feeder's series (1,952,629 rows).
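
The logic of the two checks is easier to see on a handful of timestamps. This is a plain-Python sketch on made-up readings for a single feeder, not the Spark code itself; the helper names are illustrative, and the gap is measured in whole minutes on the assumption that this matches a minute-level timestamp difference:

```python
from datetime import datetime, timedelta, timezone

HALF_HOUR_SECONDS = 1800  # 60 s * 30 min


def is_off_half_hour(ts):
    """True if the timestamp does not sit on a half-hour boundary."""
    return int(ts.timestamp()) % HALF_HOUR_SECONDS != 0


def count_gaps(timestamps, max_gap_minutes=30):
    """Count readings arriving more than max_gap_minutes after the previous one."""
    ordered = sorted(timestamps)
    one_minute = timedelta(minutes=1)
    return sum(
        1
        for previous, current in zip(ordered, ordered[1:])
        if (current - previous) // one_minute > max_gap_minutes
    )


# Made-up readings for one feeder (UTC to avoid local timezone surprises)
start = datetime(2024, 8, 1, tzinfo=timezone.utc)
readings = [
    start,
    start + timedelta(minutes=30),
    start + timedelta(minutes=90),              # 60-minute gap
    start + timedelta(minutes=120, seconds=7),  # off the half-hour mark
]

off_grid = sum(is_off_half_hour(ts) for ts in readings)
gaps = count_gaps(readings)
print(f"{off_grid=}, {gaps=}")
```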

We have data for around 150K LV feeders across 22K half-hourly timesteps.

In [0]:
lv_feeder_count = (
  raw_df
  .select("lv_feeder_unique_id")
  .distinct()
  .count()
  )
print(f"{lv_feeder_count=:,}")

time_steps = (
  raw_df
  .select("data_collection_log_timestamp")
  .distinct()
  .count()
  )
print(f"{time_steps=:,}")

The data extends from early in 2024 to the end of last month.

In [0]:
display(raw_df.groupBy()
  .agg(
    F.countDistinct("data_collection_log_timestamp").alias("periods"),
    F.min("data_collection_log_timestamp").alias("start"),
    F.max("data_collection_log_timestamp").alias("end")
  )
)

In [0]:
display(
  raw_df.where('total_consumption_active_import < 200000').where("lv_feeder_unique_id == 'NGED-726031-3'")
  .where("data_collection_log_timestamp > '2024-08-01'")
  .orderBy("data_collection_log_timestamp", ascending=True)
)


## Section 4: Working with geography

We can also take a look at the regional coverage of this dataset by plotting the locations of the LV feeders on a slippy map.

In [0]:
lv_feeder_locations_pdf = (
  raw_df
  .where(~(F.col("geometry.x") == "NaN"))
  .where(~(F.col("geometry.y") == "NaN"))
  .groupby('lv_feeder_unique_id')
  .agg(F.first("geometry").alias('geometry'))
  .select("lv_feeder_unique_id", "geometry.*")
  .toPandas()
  )

In [0]:
to_geodf(lv_feeder_locations_pdf, 0.005).explore()