## Purpose

This notebook processes the input data for the simulation of an Australian distribution network feeder. Load and generation data is sourced from the [Pecan Street](https://www.pecanstreet.org/) student-licensed data set. Power factor data is sourced from the [ECO](https://www.vs.inf.ethz.ch/res/show.html?what=eco-data) data set and combined with the Pecan Street data to produce active and reactive power load profiles.

Network models from the [Australian Low Voltage Feeder Taxonomy](https://near.csiro.au/assets/f325fb3c-2dcd-410c-97a8-e55dc68b8064) are converted from OpenDSS format to pandapower. 

The overall data cleaning process is depicted below:

![data_cleaning](https://i.ibb.co/2KZ48xg/data-cleaning.png)



## Load Data Processing
The load data was downloaded from the Pecan Street Dataport. The original files were called `1s_data_austin_file{1,2,3,4}.csv.gz`. 


Each file is unzipped and then split into one file per household, keeping only the `dataid`, `localminute`, `grid`, `solar` and `solar2` columns. This reduces the size of the data set and makes it practical to load a single household's data into pandas at once.


In [3]:
from pathlib import Path
import pandas as pd
import numpy as np
import csv
import os
from typing import Union
from datetime import datetime
from multiprocessing.pool import ThreadPool
from scipy import io, stats
import math

In [4]:
PECAN_ST_FILENAMES = [
    "1s_data_austin_file1.csv.gz",
    "1s_data_austin_file2.csv.gz",
    "1s_data_austin_file3.csv.gz",
    "1s_data_austin_file4.csv.gz",
]

INPUT_DATA_PATH = Path("./input_data")
OUTPUT_DATA_PATH = Path("./output_data")

In [4]:
def create_household_file(file_path: Path, header_row: str):
    with open(file_path, "w") as fd:
        fd.write(header_row)


def process_file(file_path: Path):
    file_descriptors = {}
    header_row = "dataid,localminute,grid,solar,solar2\n"
    try:
        with open(file_path, "r") as infile:
            csv_reader = csv.DictReader(infile)
            for row in csv_reader:
                dataid = row["dataid"]
                localminute = row["localminute"]
                grid = row["grid"]
                solar = row["solar"]
                solar2 = row["solar2"]

                file_descriptor = file_descriptors.get(dataid, None)
                if not file_descriptor:
                    output_file_path = OUTPUT_DATA_PATH / f"{dataid}-load-pv.csv"
                    if not output_file_path.exists():
                        create_household_file(output_file_path, header_row)
                    file_descriptor = open(output_file_path, "a")
                    file_descriptors[dataid] = file_descriptor
                file_descriptor.write(
                    f"{dataid},{localminute},{grid},{solar},{solar2}\n",
                )
    finally:
        for fd in file_descriptors.values():
            fd.close()


def unzip_raw_data(data_path: Path, unzip_path: Path):
    if not unzip_path.exists():
        # Lord forgive me for this but Python is terribly slow at this
        os.system(f"gzip -d < {data_path.resolve()} > {unzip_path.resolve()}")
        print("unzipped")
        


def process_data(data_path: Path):
    outfile_name = data_path.stem
    unzipped_path = INPUT_DATA_PATH / outfile_name
    unzip_raw_data(data_path, unzipped_path)
    process_file(unzipped_path)
    unzipped_path.unlink()


In [32]:
# NOTE: This takes a long time to run, about 1.5 hours on an M1 Mac.
# I could probably optimise it, but I'm only running it once.
# It should probably also be multithreaded.
for file_name in PECAN_ST_FILENAMES:
    print(file_name)
    data_path = INPUT_DATA_PATH / file_name
    process_data(data_path)



1s_data_austin_file1.csv.gz
unzipped


sh: /Users/eddie/Documents/Uni/MAPDN/data_preparation/input_data/1s_data_austin_file1.csv.gz: No such file or directory


FileNotFoundError: [Errno 2] No such file or directory: 'input_data/1s_data_austin_file1.csv'

## Aggregate to 30s resolution
Even when split by household, the 1 s data is too large to work with sensibly. Let's downsample it to 30 s resolution by averaging the readings that fall within each 30 s interval.
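As a quick illustration of the binning rule, timestamps with seconds 0 to 29 map to `:00` and seconds 30 to 59 map to `:30`, and the readings in each bin are averaged. The sketch below reproduces that rule with pandas on a handful of made-up readings; the `readings` frame and its values are hypothetical, and this is not the implementation the notebook actually runs.

```python
import pandas as pd

# Hypothetical 1 s readings for one household (values are made up for illustration)
readings = pd.DataFrame(
    {
        "localminute": pd.to_datetime(
            [
                "2018-01-01 00:00:05",
                "2018-01-01 00:00:29",
                "2018-01-01 00:00:30",
                "2018-01-01 00:00:59",
            ]
        ),
        "use": [1.0, 3.0, 2.0, 4.0],
    }
)

# Floor each timestamp to the start of its 30 s interval, then average within it
readings["interval"] = readings["localminute"].dt.floor("30s")
downsampled = readings.groupby("interval")["use"].mean()
print(downsampled)
```

The first two readings land in the `00:00:00` interval and the last two in `00:00:30`, which is the same assignment `get_closest_interval` makes.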

In [22]:
def get_closest_interval(dt):
    seconds = dt.second
    interval_args = {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "minute": dt.minute,
    }

    if seconds < 30:
        return datetime(**interval_args, second=0)
    else:
        return datetime(**interval_args, second=30)


def calculate_use(grid: str, solar: str, solar2: str) -> Union[float, None]:
    if not grid:
        return None
    grid_val = float(grid)
    solar_val = float(solar) if solar else 0
    solar2_val = float(solar2) if solar2 else 0

    # Household use = net grid import + PV generation. This may seem weird, but it's explained here:
    # https://docs.google.com/document/d/1_9H9N4cgKmJho7hK8nii6flIGKPycL7tlWEtd4UhVEQ/edit#
    return grid_val + solar_val + solar2_val


def calculate_net_solar(solar: str, solar2: str) -> Union[float, None]:
    if not solar and not solar2:
        return None
    solar_val = float(solar) if solar else 0
    solar2_val = float(solar2) if solar2 else 0

    return solar_val + solar2_val


def downsample_data(file_path: Path, out_path: Path):
    data_periods = {}
    with open(file_path, "r") as infile:
        csv_reader = csv.DictReader(infile)
        for line in csv_reader:
            # The last three characters are TZ info, we will lose an hour's data every time the clocks change.
            # Fortunately, we don't really care about that
            localminute = line["localminute"][:-3]
            grid = line["grid"]
            solar = line["solar"]
            solar2 = line["solar2"]
            solar_val = calculate_net_solar(solar, solar2)
            use = calculate_use(grid, solar, solar2)
            dt = datetime.strptime(localminute, "%Y-%m-%d %H:%M:%S")

            interval = get_closest_interval(dt)
            if not data_periods.get(interval, None):
                data_periods[interval] = dict(
                    solar_sum=0, solar_count=0, use_sum=0, use_count=0
                )

            data = data_periods[interval]
            if solar_val is not None:
                data["solar_sum"] += solar_val
                data["solar_count"] += 1
            if use is not None:
                data["use_sum"] += use
                data["use_count"] += 1

    intervals = sorted(data_periods)
    with open(out_path, "w") as outfile:
        header = "datetime,solar,use\n"
        outfile.write(header)
        for interval in intervals:
            interval_data = data_periods[interval]
            iso_date = interval.isoformat()
            solar_count = interval_data["solar_count"]
            solar_sum = interval_data["solar_sum"]

            use_count = interval_data["use_count"]
            use_sum = interval_data["use_sum"]

            solar_val = ""
            use_val = ""
            if solar_count:
                solar_val = solar_sum / solar_count
            if use_count:
                use_val = use_sum / use_count
            outfile.write(f"{iso_date},{solar_val},{use_val}\n")


In [33]:
# NOTE: This also takes ages - should really have written this more efficiently
threads = 25

files_to_process = []
for file_path in OUTPUT_DATA_PATH.iterdir():
    if not str(file_path).endswith("load-pv.csv"):
        continue
    outfile_name = f"{file_path.stem}-reduced.csv"
    outfile_path = file_path.parent / outfile_name
    files_to_process.append((
        file_path,
        outfile_path
    ))
 

with ThreadPool(threads) as t:
    t.starmap(downsample_data, files_to_process)


## Further processing
The next steps are:
1. Aggregate the ECO dataset and compute an average power factor.
2. Check the existing load and PV data for outliers and replace them with neighbouring values. We know from experience that such outliers exist.
3. Perturb the average power factor by +/- 10% and compute the appropriate reactive power for each interval in the load data.
4. Bootstrap to 100 load and PV profiles.
5. Create active, reactive, and PV profiles for all the households.
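The list above does not say how outliers are detected, so the following is only a sketch of step 2 under stated assumptions: points are flagged with a robust z-score (median and median absolute deviation), and each flagged point is replaced by its nearest valid neighbour. The function name `replace_outliers`, the threshold of 5 and the sample values are all hypothetical.

```python
import pandas as pd


def replace_outliers(series: pd.Series, z_threshold: float = 5.0) -> pd.Series:
    """Replace robust-z-score outliers with the nearest earlier (else later) valid value."""
    median = series.median()
    mad = (series - median).abs().median()
    if mad == 0:
        # No spread to measure against, so nothing can be flagged
        return series.copy()

    # 0.6745 scales the MAD so the score is comparable to a standard z-score
    robust_z = 0.6745 * (series - median) / mad
    outliers = robust_z.abs() > z_threshold

    # Blank out the outliers, fill them from neighbouring readings,
    # and only overwrite the flagged positions in the original series
    neighbours = series.mask(outliers).ffill().bfill()
    return series.where(~outliers, neighbours)


# Made-up 30 s load readings with one obvious spike
use = pd.Series([1.0, 1.2, 0.9, 250.0, 1.1, 1.0])
cleaned = replace_outliers(use)
print(cleaned.tolist())
```

A single global threshold can also flag genuine demand peaks, so in practice the threshold (or a rolling window) would need tuning against the real profiles.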



In [5]:
# The input data was requested from ETH Zurich and downloaded directly from their portal
adres_input_path = INPUT_DATA_PATH / "ADRES_Daten_120208.mat"
loaded = io.loadmat(adres_input_path)
adres_df = pd.DataFrame(loaded["Data"]["PQ"][0][0])


In [61]:
adres_df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.200001 | -14.670000 | 22.150000 | 3.189000 | 141.500000 | -80.730003 | 550.083313 | -31.236666 | 247.800003 | 66.066666 | ... | 168.100006 | 20.600000 | 10.700000 | -0.995 | 55.000000 | -31.100000 | 52.200001 | -42.900002 | 35.000000 | -1.700000 |
| 1 | 41.500000 | -14.980000 | 22.180000 | 3.529000 | 140.699997 | -80.550003 | 549.080017 | -31.166000 | 247.960007 | 66.199997 | ... | 168.500000 | 20.600000 | 10.640000 | -0.947 | 54.900002 | -31.180000 | 52.500000 | -43.000000 | 35.000000 | -1.800000 |
| 2 | 40.799999 | -15.120000 | 22.040001 | 3.408000 | 141.000000 | -80.330002 | 549.099976 | -31.172001 | 247.880005 | 65.839996 | ... | 168.100006 | 20.600000 | 10.810000 | -0.945 | 55.799999 | -31.299999 | 52.900002 | -43.000000 | 35.000000 | -1.800000 |
| 3 | 40.900002 | -14.660000 | 22.219999 | 2.956000 | 141.100006 | -80.550003 | 540.880005 | -31.386000 | 247.399994 | 65.739998 | ... | 168.500000 | 20.600000 | 10.710000 | -0.962 | 60.200001 | -31.660000 | 52.500000 | -42.900002 | 35.000000 | -1.800000 |
| 4 | 40.700001 | -14.860000 | 22.120001 | 3.148000 | 141.000000 | -80.669998 | 428.760010 | -29.719999 | 247.259995 | 65.699997 | ... | 168.399994 | 20.700001 | 10.650000 | -0.927 | 60.200001 | -31.740000 | 52.500000 | -42.900002 | 35.099998 | -1.900000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1209595 | 102.300003 | 24.510000 | 133.800003 | -62.160000 | 108.300003 | 54.270000 | 290.899994 | 17.570000 | 107.699997 | 40.610001 | ... | 139.899994 | -9.500000 | 222.000000 | 2.300 | 57.500000 | -7.800000 | 34.200001 | -27.740000 | 28.400000 | -27.469999 |
| 1209596 | 102.500000 | 24.459999 | 133.800003 | -62.169998 | 113.400002 | 54.000000 | 289.799988 | 16.240000 | 107.699997 | 40.669998 | ... | 139.800003 | -9.500000 | 221.699997 | 2.300 | 57.200001 | -7.900000 | 34.200001 | -27.889999 | 28.400000 | -27.650000 |
| 1209597 | 102.699997 | 24.549999 | 133.800003 | -62.349998 | 112.300003 | 54.090000 | 287.299988 | 13.430000 | 107.900002 | 40.720001 | ... | 139.800003 | -9.600000 | 221.800003 | 2.400 | 56.299999 | -7.700000 | 34.200001 | -27.820000 | 28.400000 | -27.600000 |
| 1209598 | 102.500000 | 24.389999 | 134.000000 | -62.290001 | 108.199997 | 54.189999 | 282.299988 | 11.320000 | 107.599998 | 40.020000 | ... | 139.899994 | -9.500000 | 221.500000 | 2.400 | 56.500000 | -7.700000 | 34.299999 | -27.180000 | 28.500000 | -27.740000 |


## Average power factor calculation

Each household in the data set has six columns: P1, Q1, P2, Q2, P3 and Q3. With 180 columns in total, that gives 30 households.

To compute the average power factor across the entire data set, we average P over all the P columns and all the rows, and do the same for Q. These two averages are then used to calculate an average power factor.

As we can't make any assumptions about the relationship between the ADRES data set and the Pecan Street data, this is all we will do.

We can then use the average power factor to finish processing the Pecan Street data.
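Written out, the calculation described above is as follows. The notation is ours rather than the data set's: $N$ is the number of rows (time steps), $M$ is the number of P columns (equal to the number of Q columns), and $P_{t,j}$, $Q_{t,j}$ are the readings in row $t$ of the $j$-th P or Q column.

$$
\bar{P} = \frac{1}{N M} \sum_{t=1}^{N} \sum_{j=1}^{M} P_{t,j},
\qquad
\bar{Q} = \frac{1}{N M} \sum_{t=1}^{N} \sum_{j=1}^{M} Q_{t,j}
$$

$$
\bar{S} = \sqrt{\bar{P}^2 + \bar{Q}^2},
\qquad
\mathrm{PF} = \frac{\bar{P}}{\bar{S}}
$$

Note that this is the power factor of the averaged P and Q, not the average of the per-reading power factors.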

In [7]:
# Even columns are P, odd are Q
p_cols = [col for col in adres_df if col % 2 == 0]
q_cols = [col for col in adres_df if col % 2 == 1]

# Sum across columns then down the column
p_sum = adres_df[p_cols].sum(axis=1).sum(axis=0)
q_sum = adres_df[q_cols].sum(axis=1).sum(axis=0)

# We've summed across the rows and the columns so need to divide by their length to get the average
avg_p = p_sum / len(p_cols) / len(adres_df)
avg_q = q_sum / len(q_cols) / len(adres_df)

avg_s = math.sqrt((avg_p**2 + avg_q**2))

print(f"Average P: {avg_p} W")
print(f"Average Q: {avg_q} VAr")
print(f"Average S: {avg_s} VA")

Average P: 215.26437624926513 W
Average Q: 27.710495002939446 VAr
Average S: 217.04060268828297 VA


In [92]:
avg_pf = avg_p / avg_s
print(f"Average PF: {avg_pf}")


Average PF: 0.9918161559771888


## Now we have a power factor to use as our baseline
Unsurprisingly, it is close to unity: residential premises typically operate near a unity power factor.

Next step: use it to create feasible reactive power readings for each load and create the final input profiles.
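One way this next step could look is sketched below. For a given active power $P$ and power factor, the reactive power follows from $Q = P \tan(\arccos(\mathrm{PF}))$. How the +/- 10% perturbation is drawn is not specified in this notebook, so the sketch assumes a uniform random factor between 0.9 and 1.1, with the result clipped to 1.0 because a power factor cannot exceed unity. The function name `reactive_power`, the seed and the sample active power values are hypothetical.

```python
import numpy as np

AVG_PF = 0.9918161559771888  # average power factor printed above


def reactive_power(p, pf):
    """Reactive power for active power p at power factor pf: Q = P * tan(arccos(pf))."""
    return p * np.tan(np.arccos(pf))


rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability

# Made-up active power readings for a few intervals
p = np.array([0.5, 1.2, 3.0])

# ASSUMPTION: perturb the baseline by a uniform factor in [0.9, 1.1], then cap at unity
perturbation = rng.uniform(0.9, 1.1, size=p.shape)
pf = np.clip(AVG_PF * perturbation, None, 1.0)

q = reactive_power(p, pf)
print(pf, q)
```

As a sanity check, feeding the average P from above (about 215.26 W) and the unperturbed power factor into `reactive_power` returns the average Q of about 27.71 VAr.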