
# Data preparation

This notebook does some basic pre-processing to get data ready for feature engineering.

We're looking at the central part of this diagram:

<img src="../docs/imgs/energy-sa-clean-up-flow.png" width="300">

## Clean up for energy data

Based on the `02_interactive_exploration` notebook and the guidance from Weave, we need to handle the following problems:

| Problem | Applied solution |
| --- | --- |
| Extreme outliers beyond a value of 200,000 | Remove data points, no imputing |
| Timestamps not aligned to 30 minute increments | Round to nearest 30 minutes |
| Missing data | Ignore, handle in modelling prep |
| Duplicate timestamps | Take average value to deduplicate |

## Data augmentation
- The energy data is already in the format we need, so all we do is clean it up a little.
- The weather data needs to be pivoted so that each location + timestamp combination becomes one row, with every variable's value in its own column.

In [0]:
%run ./includes/common_functions_and_imports

In [0]:
source_table_name_energy = (
    f"{CONFIG.target_catalog}.{CONFIG.target_schema}.smart_meter_data_raw"
)
source_table_name_weather_history = (
  f"{CONFIG.target_catalog}.{CONFIG.target_schema}.weather_data_raw"
)
source_table_name_weather_forecast = (
  f"{CONFIG.target_catalog}.{CONFIG.target_schema}.weather_forecast_raw"
)

# Stop early if any of the raw source tables is missing
if not all(spark.catalog.tableExists(t) for t in [
    source_table_name_energy,
    source_table_name_weather_history,
    source_table_name_weather_forecast,
]):
    dbutils.notebook.exit('At least one of our source tables does not exist')

target_table_name_energy_data = f"{CONFIG.target_catalog}.{CONFIG.target_schema}.smart_meter_data_clean"
target_table_name_weather_history_data = f"{CONFIG.target_catalog}.{CONFIG.target_schema}.weather_data_clean"
target_table_name_weather_units = f"{CONFIG.target_catalog}.{CONFIG.target_schema}.weather_data_units"
target_table_name_weather_forecast_data = f"{CONFIG.target_catalog}.{CONFIG.target_schema}.weather_forecast_clean"

raw_energy_df = spark.table(source_table_name_energy)
raw_weather_history_df = spark.table(source_table_name_weather_history)
raw_weather_forecast_df = spark.table(source_table_name_weather_forecast)

### Energy data clean up
From our exploration we had the following statistics:

- Total count before processing = 2,154,498,882
- Number of missing data points = 10,979,368 (0.51%)

After the processing below, we arrive at a total count of 2,142,619,142.
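
To make the "round to nearest 30 minutes" and "take average value to deduplicate" rules concrete, here is a minimal plain-Python sketch of the arithmetic, with no Spark required. The function names and the toy readings are illustrative assumptions, not part of this notebook, and the sketch rounds exact half-way values up, which is a choice made here rather than something the source data dictates.

```python
from collections import defaultdict
from datetime import datetime, timezone

HALF_HOUR_SECONDS = 1800

def round_to_nearest_half_hour_py(ts: datetime) -> datetime:
    """Round a timezone-aware datetime to the nearest 30-minute boundary (ties round up)."""
    epoch_seconds = int(ts.timestamp())
    # Adding half a period before integer division gives round-to-nearest
    periods = (epoch_seconds + HALF_HOUR_SECONDS // 2) // HALF_HOUR_SECONDS
    return datetime.fromtimestamp(periods * HALF_HOUR_SECONDS, tz=timezone.utc)

def dedupe_by_mean(readings):
    """Average readings that share a (feeder id, rounded timestamp) key."""
    buckets = defaultdict(list)
    for feeder_id, ts, value in readings:
        buckets[(feeder_id, round_to_nearest_half_hour_py(ts))].append(value)
    return {key: sum(values) / len(values) for key, values in buckets.items()}

# Hypothetical toy readings: (feeder id, timestamp, consumption)
toy_readings = [
    ("feeder_a", datetime(2024, 1, 1, 0, 14, tzinfo=timezone.utc), 100.0),  # lands on 00:00
    ("feeder_a", datetime(2024, 1, 1, 0, 16, tzinfo=timezone.utc), 300.0),  # lands on 00:30
    ("feeder_a", datetime(2024, 1, 1, 0, 31, tzinfo=timezone.utc), 500.0),  # lands on 00:30
]
deduped = dedupe_by_mean(toy_readings)
```

The two readings that land on 00:30 collapse into a single averaged value, which is why the row count drops after processing.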

In [0]:
def round_to_nearest_half_hour(ts_col):
    # Round (rather than truncate) the epoch seconds to the nearest multiple of 1800 s
    return F.from_unixtime(F.round(F.unix_timestamp(ts_col) / 1800).cast("long") * 1800)

clean_meter_df = (
    raw_energy_df
    .filter("total_consumption_active_import <= 200000 AND total_consumption_active_import >= 0")
    .dropna(how='any', subset=['aggregated_device_count_active', 'total_consumption_active_import'])
    .withColumn(
        "data_collection_log_timestamp",
        round_to_nearest_half_hour("data_collection_log_timestamp").cast('timestamp'),
    )
    .groupby("lv_feeder_unique_id", "data_collection_log_timestamp")
    .agg(
      F.mean("total_consumption_active_import").alias("total_consumption_active_import"),
      F.first('aggregated_device_count_active').alias('aggregated_device_count_active'),
      F.first('geometry').alias('geometry'),
      F.first('secondary_substation_unique_id').alias('secondary_substation_unique_id'),
      F.first('dataset_id').alias('dataset_id'),
      F.first('dno_alias').alias('dno_alias')
    )
)

### Weather data shaping

The data comes in with the cardinality:
1 row per variable reading per time period for a set of coordinates.

We need to pivot this into:
1 row per combination of time period and coordinates, with each variable's value in its own column.
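
As a rough illustration of that reshaping, the sketch below pivots a handful of long-format rows into wide records in plain Python. The helper name `pivot_long_to_wide` and the sample values are assumptions made for this example; the notebook itself does the pivot in Spark.

```python
def pivot_long_to_wide(rows, variables):
    """Turn (time, x, y, variable, value) rows into one record per (time, x, y)."""
    wide = {}
    for time, x, y, variable, value in rows:
        record = wide.setdefault((time, x, y), {})
        # Keep only the first value seen per variable, similar to a "first" aggregation
        if variable in variables and variable not in record:
            record[variable] = value
    # Variables with no reading for a key are filled with None
    return {
        key: {name: record.get(name) for name in variables}
        for key, record in wide.items()
    }

# Hypothetical long-format rows: (time, x, y, variable, value)
long_rows = [
    ("2024-01-01T00:00", 0.0, 51.5, "t2m", 280.1),
    ("2024-01-01T00:00", 0.0, 51.5, "u10", 3.2),
    ("2024-01-01T00:00", 0.0, 51.5, "v10", -1.4),
    ("2024-01-01T01:00", 0.0, 51.5, "t2m", 279.8),
]
wide_rows = pivot_long_to_wide(long_rows, ["t2m", "u10", "v10", "ssrd", "strd"])
```

Four input rows become two output records, one per time and coordinate combination, each with a slot for all five variables.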

We're also going to extract the variable units from the metadata and store them as a reference in case we need them. Unfortunately, the map keys are a little complex: they're all of the form `variable_name#field_name`. Luckily, we can still use `getItem()` or `[]`; we just need to build the key dynamically from the variable name.
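
The key construction can be sketched with an ordinary Python dictionary standing in for the metadata map. The helper `lookup_units` and the sample metadata entries are hypothetical and only illustrate the `variable_name#field_name` pattern described above.

```python
def lookup_units(metadata: dict, variable: str):
    """Return the units entry for a variable, or None if the key is absent."""
    # Build the composite key '<variable>#units' at lookup time
    return metadata.get(f"{variable}#units")

# Hypothetical metadata map following the 'variable_name#field_name' key pattern
toy_metadata = {
    "t2m#units": "K",
    "t2m#long_name": "2 metre temperature",
    "u10#units": "m s**-1",
}

t2m_units = lookup_units(toy_metadata, "t2m")
missing_units = lookup_units(toy_metadata, "ssrd")
```

Here `t2m_units` is `"K"`, while `missing_units` is `None` because the toy map has no `ssrd#units` entry.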

In [0]:
historic_weather_pivoted_df = (
    raw_weather_history_df.groupBy("time", "x", "y")
    .pivot("variable", ["t2m", "u10", "v10", "ssrd", "strd"])
    .agg(F.first("m").alias("m"))
)

weather_units_df = raw_weather_history_df.withColumn(
    "variable_units",
    # Get the value from the map for the key  '<variable>#units'
    F.col("metadata").getItem(F.concat(F.col("variable"), F.lit("#units")))
).groupby('variable').agg(F.first('variable_units').alias('units'))

In [0]:
forecast_weather_pivoted_df = (
    raw_weather_forecast_df.groupBy("valid_time", "x", "y")
    .pivot("variable", ["t2m", "u10", "v10", "ssrd", "strd"])
    .agg(F.first("m").alias("m"))
)

### Writing the data out

Write the data out to storage. If you don't want to write this data again, you can comment out this cell entirely and pull this notebook into the next one, 03_feature_engineering, with a `%run` command. You will also need to comment out the cell titled 'Load data and configs'. It _should_ work out of the box, but double-check the variable names.

In [0]:
tables_to_process = {
    target_table_name_energy_data: clean_meter_df,
    target_table_name_weather_history_data: historic_weather_pivoted_df,
    target_table_name_weather_units: weather_units_df,
    target_table_name_weather_forecast_data: forecast_weather_pivoted_df,
}
  
for tgt, df_to_process in tables_to_process.items():
    if spark.catalog.tableExists(tgt) and (not CONFIG.overwrite_data):
        print(f"Skipping table {tgt} as it already exists and overwrite is not set")
        continue
    df_to_process.write.saveAsTable(tgt, mode="overwrite")
