# Load and Wind Power Dispatch Signals for AUS NEM Model
This notebook adopts the methods in [1], whose source code is available in [2]. The data in 'df_n' and 'df_g' are also downloaded from [2].<br/>
This version was updated in December 2020.

#### This code aims to ####
(1) Extract ***hourly*** load and wind/solar dispatch signals during 2016-2020 in the Australian National Electricity Market (NEM) from AEMO's MMSDM database [3], to facilitate the case study in the AUS 912-node model;<br />
(2) Fit a Gaussian Mixture Model (GMM) to the historical data (5 years = 43800 hours, assuming that these data can be regarded as the population); <br/>
(3) Sample small-scale datasets (20%, 40%, 60%, 80% of the **historical data**) for constructing empirical distributions in the case studies; <br/>
(4) Apply K-means clustering to the small-scale datasets to reduce the computational burden.<br/>

##### NEM model #####
In total, the NEM generator dataset contains technical and economic information on 203 generating units (conventional generators and renewable energy units, each with a dispatchable unit ID), while the network dataset consists of 912 nodes and 1406 AC edges with line voltages in the range of 110 kV to 500 kV.

AEMO divides the NEM into ***16*** zones for operation. To facilitate our transmission network planning study, we adopt a 16-node model by assuming that the power balance constraints are met within each NEM zone. Therefore, the load and wind power signals are aggregated to these 16 nodes based on their geographic positions.
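
The zonal aggregation described above can be sketched as a group-and-sum over a node-to-zone mapping. This is only an illustration: the node IDs, the values and the `node_to_zone` mapping below are hypothetical, and the real mapping would have to come from the node table 'df_n'.

```python
import pandas as pd

# Hypothetical nodal signals (MW): rows are hours, columns are node IDs
df_nodal = pd.DataFrame(
    {1: [10.0, 12.0], 2: [5.0, 6.0], 3: [7.0, 8.0]},
    index=pd.date_range("2019-01-01 01:00", periods=2, freq="60min"),
)

# Hypothetical node-to-zone mapping (assumption, not taken from 'df_n')
node_to_zone = pd.Series({1: "TAS", 2: "TAS", 3: "MEL"})

# Sum all nodes that belong to the same zone (columns become zone names)
df_zonal = df_nodal.T.groupby(node_to_zone).sum().T
print(df_zonal)
```

Transposing before the group-by keeps the time index on the rows, which is the layout used for the signals in the rest of this notebook.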

##### IEEE RTS 24-node model #####
The original IEEE RTS 24-node model has 17 load nodes and no wind power. <br/>
In the modified IEEE RTS 24-node model, the technical and economic information on generators and the topology of existing lines are unchanged. However, to be compatible with the NEM dataset: <br/> (1) the load demand at node #13 (the reference node) is set to zero, so that we get a 16-load model;  <br/> (2) nodes #11, #12, #17 and #24 are selected as nodes equipped with wind power.  
    

## Import packages

In [1]:
import os
import pandas as pd
import numpy as np
from zipfile import ZipFile
from IPython.core.interactiveshell import InteractiveShell 
InteractiveShell.ast_node_interactivity = 'all'

from sklearn import mixture
from sklearn.cluster import MiniBatchKMeans

import scipy.io as sio
from pyomo.environ import *
import matplotlib.pyplot as plt

from random import random
from scipy.stats import multivariate_normal
from scipy.stats import matrix_normal

##  Paths to directories

In [2]:
# Core data directory (common files)
data_dir = os.path.abspath(os.path.join(os.path.curdir, os.path.pardir, os.path.pardir, 'data'))

# MMSDM data directory
MMSDMdata_dir = os.path.join(data_dir,'AUS2016_2020')

# Network directory
network_dir = os.path.abspath(os.path.join(os.path.curdir, os.path.pardir, '1_network'))

# Generators directory
gens_dir = os.path.abspath(os.path.join(os.path.curdir, os.path.pardir, '2_generators'))

# Output directory
output_dir = os.path.abspath(os.path.join(os.path.curdir, 'output'))
output_for_mat_dir=os.path.abspath(os.path.join(os.path.curdir, 'output_for_mat'))

# basic DataFrame for network (index=NODE_ID)
df_n = pd.read_csv(os.path.join(network_dir, 'output', 'network_nodes.csv'), index_col='NODE_ID', dtype={'NEAREST_NODE':np.int32})

# basic DataFrame for generator (index=DUID)
df_g = pd.read_csv(os.path.join(gens_dir, 'output', 'generators.csv'), dtype={'NODE': int}, index_col='DUID')

## Define two functions for population (true distribution) and samples (empirical distributions)

### Fit GMM 

These GMM parameters are used to measure the "distance" between the current and future probability distributions.

In [3]:
def GMM_forTrue_PDF(df_population):
    # Fit the GMM and return its parameters as a dictionary
    # (the dictionary is written to a .mat file by the caller)
    clf = mixture.GaussianMixture(n_components=3, covariance_type='full')
    clf.fit(df_population)

    GMM_True={}
    GMM_True['Pr']=clf.weights_  # mixing probability of each component
    GMM_True['means']=clf.means_
    GMM_True['Covariances']=clf.covariances_
    return GMM_True
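
As a quick check of what the fitted parameters look like, the following minimal sketch fits a two-component GMM to synthetic data. The data, the number of components and the `random_state` are assumptions made for illustration; the notebook itself uses three components on the historical signals.

```python
import numpy as np
from sklearn import mixture

rng = np.random.default_rng(0)
# Synthetic two-dimensional "population" made of two well-separated groups
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(500, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(500, 2)),
])

clf = mixture.GaussianMixture(n_components=2, covariance_type="full", random_state=0)
clf.fit(X)

print(clf.weights_.shape)      # one mixing weight per component
print(clf.means_.shape)        # (n_components, n_features)
print(clf.covariances_.shape)  # (n_components, n_features, n_features)
```

With `covariance_type='full'`, each component carries its own full covariance matrix, which is why `covariances_` is a three-dimensional array.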

### Small-size sample and K means

In [3]:
def sample_Kmeans(df_population):
    # Draw small samples of the historical data (for the empirical distribution)
    # and reduce each sample to cluster centers with K-means
    SizeList=[0.2,0.4,0.6,0.8] # proportion of sample size to population
    df_s={}
    for sz in SizeList:
        sr=str(sz)  # e.g. '0.2'; sr[2] ('2') is used as the key suffix
        sample=df_population.sample(frac=sz, replace=False, random_state=1)
        df_s['Sample'+sr[2]]=sample.values

        #n_c=int(len(df_population)*sz*0.1)
        n_c=1000
        kmeans=MiniBatchKMeans(n_clusters=n_c).fit(sample)

        df_s['Centers'+sr[2]]=kmeans.cluster_centers_
        # Share of points in each cluster, ordered by cluster label so that
        # entry k corresponds to row k of cluster_centers_
        df_s['Pr'+sr[2]]=np.bincount(kmeans.labels_, minlength=n_c)/len(kmeans.labels_)
    return df_s
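
The reduction of a sample to weighted cluster centers can be illustrated on synthetic data. This is a sketch under assumptions: the data are random, and `n_c = 10` is chosen only to keep the example small (the notebook uses 1000 clusters).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(1)
sample = rng.normal(size=(2000, 3))  # synthetic stand-in for a sampled dataset

n_c = 10  # illustrative number of clusters
kmeans = MiniBatchKMeans(n_clusters=n_c, random_state=0, n_init=3).fit(sample)

# counts[k] is the number of points assigned to cluster k
counts = np.bincount(kmeans.labels_, minlength=n_c)
prob = counts / counts.sum()

print(kmeans.cluster_centers_.shape, prob.shape)
```

Counting with `np.bincount` and `minlength=n_c` keeps one probability per center, in the same order as `cluster_centers_`, even if a cluster happens to receive no points.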
    
    

### Small-size sample groups and data whitening 

In [4]:
def smallgroup_white(df_population):
    # Draw small samples of the historical data and whiten each sample
    SizeList=[0.2,0.4,0.6,0.8] # proportion of sample size to population
    df_diffsample={}
    for sz in SizeList:
        # (1) Get a small-size group (to show the influence of data size)
        sr=str(sz)  # e.g. '0.2'; sr[2] ('2') is used as the key suffix
        df_smallgroup=df_population.sample(frac=sz, replace=False, random_state=1)
        df_diffsample['Sample'+sr[2]]=df_smallgroup.values

        # (2) Statistics for the empirical distribution
        df_diffsample['Em_mean'+sr[2]]=df_smallgroup.mean().values
        df_diffsample['Em_cov'+sr[2]]=df_smallgroup.cov().values

        # (3) PCA whitening in each group: rotate onto the principal axes of the
        # covariance matrix, then scale each axis to unit variance (no centering)
        U, S, V = np.linalg.svd(df_smallgroup.cov().values)
        Xrot = np.dot(df_smallgroup.values, U)
        df_diffsample['Xwhite'+sr[2]]=Xrot/np.sqrt(S+1e-5)
    return df_diffsample
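
The whitening step can be checked on synthetic data: after rotation and rescaling, the sample covariance should be close to the identity matrix. The mixing matrix `A` and the offset below are assumptions chosen only to produce correlated data.

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated synthetic data: 5000 observations, 3 features
A = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.3, 0.7]])
X = rng.normal(size=(5000, 3)) @ A.T + np.array([10.0, -3.0, 4.0])

cov = np.cov(X, rowvar=False)
U, S, _ = np.linalg.svd(cov)

# Rotate onto the principal axes, then rescale each axis to unit variance
X_white = (X @ U) / np.sqrt(S + 1e-5)

cov_white = np.cov(X_white, rowvar=False)
print(np.round(cov_white, 3))
```

Because covariance is unaffected by a constant shift, the result has (near) identity covariance even though the data are not centered first; the mean of `X_white` is generally nonzero.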

In [6]:
def smallgroup_normalize(df_population):
    # Draw small samples of the historical data and standardize each sample
    SizeList=[0.2,0.4,0.6,0.8,1.0] # proportion of sample size to population
    df_diffsample={}
    for sz in SizeList:
        # (1) Get a small-size group (to show the influence of data size)
        sr=str(sz)  # e.g. '0.2' gives suffix '2'; '1.0' gives suffix '0'
        df_smallgroup=df_population.sample(frac=sz, replace=False, random_state=1)
        df_diffsample['Sample'+sr[2]]=df_smallgroup.values

        # (2) Statistics for the empirical distribution
        df_diffsample['Em_mean'+sr[2]]=df_smallgroup.mean().values
        df_diffsample['Em_cov'+sr[2]]=df_smallgroup.cov().values

        # (3) Standardization in each group: Z = L^{-1} (x - mean), with cov = L L^T.
        # Use the group itself rather than the global df_Pv, so that each
        # sample size is standardized with its own statistics
        L=np.linalg.cholesky(df_smallgroup.cov().values)
        invL=np.linalg.inv(L)
        Zsample=np.dot(invL,(df_smallgroup-df_smallgroup.mean()).values.T)
        df_diffsample['Zsample'+sr[2]]=Zsample
    return df_diffsample
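
The Cholesky-based standardization computes $Z = L^{-1}(X - \mu)^\top$ with $\Sigma = L L^\top$, so that the standardized sample has zero mean and identity covariance. A minimal sketch on synthetic data follows; the data are assumptions, and a triangular solve is used here in place of the explicit inverse.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(3)
A = np.array([[1.5, 0.0], [0.8, 0.6]])
X = rng.normal(size=(4000, 2)) @ A.T + np.array([2.0, -1.0])

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
L = np.linalg.cholesky(cov)  # cov = L @ L.T, with L lower triangular

# Z = L^{-1} (X - mu)^T, computed by a triangular solve
Z = solve_triangular(L, (X - mu).T, lower=True)

print(Z.shape)                 # (n_features, n_samples)
print(np.round(np.cov(Z), 6))  # close to the identity matrix
```

Note that `Z` has one row per feature and one column per observation, the same orientation as `Zsample` in the function above.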

## Duration of the dataset

In [7]:
month_list=['0101','0201','0301','0401','0501','0601','0701','0801','0901','1001','1101','1201']
year_list=['2016','2017','2018','2019','2020']
Missing_list=['20160101','20170701','20170801','20170901','20200601','20200701','20201201']
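
As a sanity check on the lists above, the set of monthly file keys that will actually be read can be enumerated as follows. This sketch rebuilds the same lists and assumes the file-name pattern used later in the notebook (`PUBLIC_DVD_DISPATCH_UNIT_SCADA_<key>0000.CSV`).

```python
from itertools import product

month_list = [f"{m:02d}01" for m in range(1, 13)]  # '0101', '0201', ..., '1201'
year_list = [str(y) for y in range(2016, 2021)]    # '2016', ..., '2020'
missing = {'20160101', '20170701', '20170801', '20170901',
           '20200601', '20200701', '20201201'}

# Keep every year-month key that is not flagged as missing
keys = [y + m for y, m in product(year_list, month_list) if y + m not in missing]
file_names = [f"PUBLIC_DVD_DISPATCH_UNIT_SCADA_{k}0000.CSV" for k in keys]

print(len(keys), keys[0], keys[-1])
```

Sixty candidate months minus the seven missing ones leaves 53 monthly files.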

# Get dispatch data of wind power
1. Every dispatched unit has an ID (i.e., DUID). We attach the fuel type from 'df_g' to each DUID so that wind power dispatch data can be extracted;
2. Parse and save unit dispatch data. Note that dispatch in MW is given at 5min intervals, whereas demand data are given at 30min intervals, corresponding to the length of a trading period in the NEM. To align the time resolution of these signals, unit dispatch data are aggregated, with the mean power output over 60min intervals computed for each DUID;
3. The 912 nodes are aggregated into 16 regions. Select ['TAS','NSA','CAN','MEL'] for the four wind farms in the modified IEEE RTS 24-node model;
4. Fit the GMM and save the GMM parameters as a dictionary and a .mat file for further optimization.
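
The 5min-to-60min aggregation in step 2 can be sketched on synthetic data. The unit name `UNIT_A` and the values are hypothetical, and the sketch assumes that timestamps mark the end of each 5min interval, so that the reading stamped 01:00 belongs to the hour ending at 01:00. It uses the `offset` argument of `pd.Grouper` (available from pandas 1.1).

```python
import pandas as pd

# Two hours of 5min readings: 00:05, 00:10, ..., 02:00
idx = pd.date_range("2019-01-01 00:05", periods=24, freq="5min")
df_5min = pd.DataFrame({"UNIT_A": [1.0] * 12 + [3.0] * 12}, index=idx)

# Shift the hourly bins by 1min so that the reading on the hour closes its bin,
# label each bin by its right edge, then remove the 1min shift from the labels
df_hourly = df_5min.groupby(
    pd.Grouper(freq="60min", offset="1min", label="right")
).mean()
df_hourly.index = df_hourly.index - pd.Timedelta(minutes=1)

print(df_hourly)
```

Without the 1min shift, the reading stamped exactly on the hour would be grouped with the following hour rather than the one it closes.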


In [8]:
def get_zonal_wind_power (df_DISPATCH_UNIT_SCADA):
    # Convert to datetime objects
    df_DISPATCH_UNIT_SCADA['SETTLEMENTDATE'] = pd.to_datetime(df_DISPATCH_UNIT_SCADA['SETTLEMENTDATE'])
    # Pivot dataframe. Dates are the index values, columns are DUIDs, values are DUID dispatch levels
    df_DISPATCH_UNIT_SCADA_piv = df_DISPATCH_UNIT_SCADA.pivot(index='SETTLEMENTDATE', columns='DUID', values='SCADAVALUE')
    # To ensure the reading stamped on the hour is included in the hour it closes, the bins
    # are offset by 1min. Once the groupby operation is performed this offset is removed.
    # ('offset' replaces the 'base' argument, which was removed in pandas 2.0.)
    df_DISPATCH_UNIT_SCADA_agg = df_DISPATCH_UNIT_SCADA_piv.groupby(pd.Grouper(freq='60min', offset='1min', label='right')).mean()
    df_DISPATCH_UNIT_SCADA_agg = df_DISPATCH_UNIT_SCADA_agg.set_index(df_DISPATCH_UNIT_SCADA_agg.index - pd.Timedelta(minutes=1))
    # Add fuel category to each DUID in SCADA dispatch dataframe
    df_DISPATCH_UNIT_SCADA_agg = df_DISPATCH_UNIT_SCADA_agg.T.join(df_g[['FUEL_CAT']])
    # Only consider wind generators
    mask = df_DISPATCH_UNIT_SCADA_agg['FUEL_CAT'].isin(['Wind'])
    # Keep wind DUIDs, drop fuel category column, and transpose
    # (columns=DUID, index=Timestamps)
    df_DUID_RES = df_DISPATCH_UNIT_SCADA_agg[mask].drop('FUEL_CAT', axis=1).T

    # Nodal wind injections:
    # add the node to which each generator is connected, groupby node and sum,
    # reindex columns using all node IDs, yielding total wind injection at each node
    # (columns=node ID, index=Timestamps)
    df_nodal_RES=df_DUID_RES.T.join(df_g[['NODE']], how='left').groupby('NODE').sum().T.reindex(columns=df_n.index, fill_value=0)

    return df_nodal_RES
   

## Main function for wind power data

In [9]:
# Unit dispatch data
df_wind=pd.DataFrame(columns = ['TAS','NSA','CAN','MEL']) 

for year in year_list:
    for month in month_list:
        a=year+month
        if a in Missing_list:
            continue
        print(a)
        # Monthly file name, e.g. PUBLIC_DVD_DISPATCH_UNIT_SCADA_201901010000.CSV
        name_csv='PUBLIC_DVD_DISPATCH_UNIT_SCADA_'+a+'0000.CSV'
        
        df_DISPATCH_UNIT_SCADA = pd.read_csv(os.path.join(MMSDMdata_dir, name_csv),
                                     skiprows=1, skipfooter=1, engine='python')
    
        df_wind0=get_zonal_wind_power (df_DISPATCH_UNIT_SCADA)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        df_wind=pd.concat([df_wind, df_wind0])
        
#GMM_TrueWind=GMM_forTrue_PDF(df_wind)
#Sample_Wind=sample_Kmeans(df_wind)
#sio.savemat(os.path.join(output_for_mat_dir,'GMM_TrueWind.mat'), {'GMM_TrueWind': GMM_TrueWind})
#sio.savemat(os.path.join(output_for_mat_dir,'Sample_Wind.mat'), {'Sample_Wind': Sample_Wind})



20160201
20160301
20160401
20160501
20160601
20160701
20160801
20160901
20161001
20161101
20161201
20170101
20170201
20170301
20170401
20170501
20170601
20171001
20171101
20171201
20180101
20180201
20180301
20180401
20180501
20180601
20180701
20180801
20180901
20181001
20181101
20181201
20190101
20190201
20190301
20190401
20190501
20190601
20190701
20190801
20190901
20191001
20191101
20191201
20200101
20200201
20200301
20200401
20200501
20200801
20200901
20201001
20201101


# Get load data
1. Load data in each NEM region are given at 30min intervals. As with the dispatch data, demand data are aggregated, with mean values over 60min intervals computed for each region. <br />
2. In the AEMO database, trading information is available only for the five regions, i.e., NSW, QLD, SA, VIC and TAS. To obtain the load data for the 16 zones, we allocate the total load to each zone in proportion to its population size, using the information provided in [2] (i.e., in 'df_n').<br />


In [15]:
def get_nodal_load(df_TRADINGREGIONSUM_piv,df_n):
    def node_demand(row, df_regd):
        # row['NEM_REGION']: the trading region (one of the five) the node belongs to
        # row['PROP_REG_D']: the node's proportion of regional demand, perturbed here by
        # a uniform random term in [0, 0.5) drawn once per node on each call
        return (row['PROP_REG_D']+ 0.5*np.random.rand())* df_regd.loc[:, row['NEM_REGION']]

    df_nodal_d0=df_n.apply(node_demand, args=(df_TRADINGREGIONSUM_piv,), axis=1).T

    return df_nodal_d0
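
The proportional allocation at the core of this function can be sketched without the random term. All values below are hypothetical; only the column names `NEM_REGION` and `PROP_REG_D` follow the node table used above.

```python
import pandas as pd

# Hypothetical regional demand (MW): rows are hours, columns are regions
df_regd = pd.DataFrame({"NSW1": [8000.0, 8200.0], "TAS1": [1000.0, 1100.0]})

# Hypothetical node table: region membership and share of regional demand
df_nodes = pd.DataFrame(
    {"NEM_REGION": ["NSW1", "NSW1", "TAS1"], "PROP_REG_D": [0.75, 0.25, 1.0]},
    index=pd.Index([101, 102, 201], name="NODE_ID"),
)

# Pick the regional series for each node, then scale it by the node's share
df_nodal_d = df_regd[df_nodes["NEM_REGION"].tolist()].copy()
df_nodal_d.columns = df_nodes.index
df_nodal_d = df_nodal_d * df_nodes["PROP_REG_D"]

print(df_nodal_d)
```

When the shares within each region sum to one, the nodal loads add back up to the regional totals. The random term in `get_nodal_load` deliberately breaks this exact proportionality.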


## Main function for load data

#### Note: <br/>
As detailed information for each NEM zone is unavailable, we <br/> 
(1) allocate the load in each trading region to the zones based on the population size in each zone (50%), i.e., df_load_pop <br/> 
(2) generate normal random variables based on practical mean and variance data (50%), i.e., df_loadrandom <br/> 
Otherwise, the covariance matrix would be singular, because loads allocated from the same trading region are proportional to each other.
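
The rank argument in the note can be illustrated numerically. In this sketch the regional series, the shares and the noise level are all assumptions, and an explicit tolerance is passed to `np.linalg.matrix_rank` so that round-off does not affect the result.

```python
import numpy as np

rng = np.random.default_rng(4)
regional = rng.normal(loc=100.0, scale=10.0, size=(1000, 1))  # one regional series
shares = np.array([[0.5, 0.3, 0.2]])                          # fixed nodal shares

# Pure proportional allocation: every column is a multiple of the same series
load_prop = regional * shares
rank_prop = np.linalg.matrix_rank(np.cov(load_prop, rowvar=False), tol=1e-8)

# Adding independent Gaussian noise to each node restores full rank
noise = rng.normal(scale=1.0, size=load_prop.shape)
rank_noisy = np.linalg.matrix_rank(np.cov(load_prop + noise, rowvar=False), tol=1e-8)

print(rank_prop, rank_noisy)
```

A full-rank covariance matrix is what the Cholesky-based standardization defined earlier requires.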

In [16]:
df_load_pop=pd.DataFrame()

for year in year_list:
    for month in month_list:
        a=year+month
        if a in Missing_list:
            continue
        print(a)
        # Monthly file name, e.g. PUBLIC_DVD_TRADINGREGIONSUM_201901010000.CSV
        name_csv='PUBLIC_DVD_TRADINGREGIONSUM_'+a+'0000.CSV'
        # Regional summary for each trading interval
        df_TRADINGREGIONSUM = pd.read_csv(os.path.join(MMSDMdata_dir, name_csv),
                                      skiprows=1, skipfooter=1, engine='python')
        # Convert settlement date to datetime
        df_TRADINGREGIONSUM['SETTLEMENTDATE'] = pd.to_datetime(df_TRADINGREGIONSUM['SETTLEMENTDATE'])
        # Pivot dataframe. Index is timestamp, columns are NEM region IDs, values are total demand
        df_TRADINGREGIONSUM_piv = df_TRADINGREGIONSUM.pivot(index='SETTLEMENTDATE', columns='REGIONID', values='TOTALDEMAND')
        # Hourly mean with bins offset by 1min ('offset' replaces 'base', removed in pandas 2.0)
        df_TRADINGREGIONSUM_piv = df_TRADINGREGIONSUM_piv.groupby(pd.Grouper(freq='60min', offset='1min', label='right')).mean()
        df_TRADINGREGIONSUM_piv = df_TRADINGREGIONSUM_piv.set_index(df_TRADINGREGIONSUM_piv.index - pd.Timedelta(minutes=1))

        df_load0=get_nodal_load(df_TRADINGREGIONSUM_piv,df_n)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        df_load_pop=pd.concat([df_load_pop, df_load0])
        
#df_load_pop=df_load_pop.T
#np.linalg.matrix_rank(df_load_pop.cov().values)        

20160201
20160301
20160401
20160501
20160601
20160701
20160801
20160901
20161001
20161101
20161201
20170101
20170201
20170301
20170401
20170501
20170601
20171001
20171101
20171201
20180101
20180201
20180301
20180401
20180501
20180601
20180701
20180801
20180901
20181001
20181101
20181201
20190101
20190201
20190301
20190401
20190501
20190601
20190701
20190801
20190901
20191001
20191101
20191201
20200101
20200201
20200301
20200401
20200501
20200801
20200901
20201001
20201101


In [22]:
mask=df_n['VOLTAGE_KV'].isin([275,330,400,500])
df_load=df_load_pop.T[mask]


In [25]:
df_load=df_load.T

In [33]:
mu=df_load.mean().values
sigma=np.diag(df_load.var().values)
c,r=df_load.shape
c,r

(38856, 199)

In [41]:
load_random=np.random.multivariate_normal(mu,sigma,c)

In [44]:
df_load_final=df_load+load_random

In [45]:
np.linalg.matrix_rank(df_load_final.cov().values)   

199

### GMM parameters for Pv=[Pload;Pwind]

In [14]:
df_Pv = pd.concat([df_load_pop*0.7/300, df_wind], axis=1)
# np.linalg.matrix_rank(df_Pv.cov().values)
Pv_diffsample=smallgroup_normalize(df_Pv)
sio.savemat(os.path.join(output_for_mat_dir,'Pv_diffsample2.mat'), {'Pv_diffsample': Pv_diffsample})
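
To check that a nested dictionary survives the round trip through a .mat file, a small sketch follows. The dictionary contents and the file name `Pv_demo.mat` are hypothetical; `simplify_cells=True` (available in SciPy 1.5 and later) is used so that the saved struct is read back as a plain dictionary.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

# Hypothetical stand-in for the dictionary returned by smallgroup_normalize
demo = {
    "Em_mean2": np.array([1.0, 2.0, 3.0]),
    "Em_cov2": np.eye(3),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Pv_demo.mat")
    sio.savemat(path, {"Pv_demo": demo})            # the dict is stored as a MATLAB struct
    loaded = sio.loadmat(path, simplify_cells=True)

restored = loaded["Pv_demo"]
print(sorted(restored.keys()))
```

In MATLAB, the saved variable appears as a struct whose fields are the dictionary keys, so keys should be valid MATLAB field names.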


# References
[1] Xenophon, A., Hill, D. Open grid model of Australia’s National Electricity Market allowing backtesting against historic data. Sci Data 5, 180203 (2018). https://doi.org/10.1038/sdata.2018.203

[2] Xenophon, A. K. Geospatial Modelling of Australia’s National Electricity Market. GitHub https://github.com/akxen/egrimod-nem (2018).

[3] Australian Energy Market Operator. Data Archive (2018). http://www.nemweb.com.au/#mms-data-model

[4] Australian Energy Market Operator. Data Archive (2018). https://nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/

# List of Datasets from MMSDM
A summary of the tables used from AEMO's MMSDM database [3] is given below:

| Table | Description |
| :----- | :----- |
|DISPATCH_UNIT_SCADA | MW dispatch at 5 minute (dispatch) intervals for DUIDs within the NEM.|
|TRADINGREGIONSUM | Contains load in each NEM region at 30 minute (trading) intervals.|