# Preprocessing Observational Data For Training Bias Correction

### To train the bias correction, we need data on the turbines being simulated: the model, position and hub height of each turbine, together with the observed power output or capacity factor of each turbine or farm.

### NOTE: This process is difficult to automate because it depends on the data you are able to obtain. Please adjust the code to produce the required result.

### (1) Inputting the turbine information data.
input: a CSV file with the variables described below
output: pandas dataframe called `data`

ESSENTIAL
* Latitude and longitude of the turbine/farm
* Max capacity of turbine/farm
* Number of turbines at this point (if it is a farm, to estimate the individual turbine capacity)

DESIRABLE (CAN BE ROUGHLY MATCHED LATER IF NOT PROVIDED)
* Individual turbine capacity (if it is a farm)
* Commissioning/decommissioning date (may not be strictly required, but helps make the training more accurate)
* Onshore or Offshore
* Turbine model
* Hub height
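
Where a farm entry has no individual turbine capacity, a rough estimate is the farm's maximum capacity divided by its number of turbines. The following is a minimal sketch of that estimate, assuming capacities are in MW; the helper name `estimate_turbine_capacity` is hypothetical and does not appear elsewhere in this notebook.

```python
def estimate_turbine_capacity(farm_capacity_mw, number_of_turbines):
    """Rough per-turbine capacity: farm capacity split evenly across turbines."""
    if number_of_turbines <= 0:
        raise ValueError("number_of_turbines must be positive")
    return farm_capacity_mw / number_of_turbines


# Illustrative values only: a 9.6 MW farm with 24 turbines
print(estimate_turbine_capacity(9.6, 24))
```

This assumes every turbine in the farm is identical, which may not hold for farms built in several phases.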



### (2) Using the turbine metadata to fill missing variables from (1)
If the desirable variables cannot be found, we can fill them in using turbine `metadata` collected from Denmark's turbine database. Currently each turbine is matched to the model in `metadata` with the nearest capacity. Further criteria could give a more accurate match, but these are not implemented yet.

input: `data` and `metadata` (loaded from `models.csv`)
output: pandas dataframe called `turb_info`
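
The nearest-capacity matching described in this step can be illustrated in plain Python. This is only a sketch of the idea, not the mechanism used by the notebook's own cells; `match_nearest_model` is a hypothetical helper and the short `models` list is an illustrative subset of (model, capacity in MW) pairs.

```python
def match_nearest_model(capacity_mw, models):
    """Return the (model, capacity) pair whose capacity is closest to capacity_mw."""
    return min(models, key=lambda m: abs(m[1] - capacity_mw))


# Illustrative subset of turbine models with capacities in MW
models = [
    ("Bonus.B23.150", 0.15),
    ("Vestas.V27.225", 0.225),
    ("Nordex.N29.250", 0.25),
]

print(match_nearest_model(0.22, models))
```

When two models are equally close, `min` keeps the first one in the list, so the ordering of `models` acts as the tie-break.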

### (3) Matching observational data with turbines/farms in `turb_info`
Observational data should be the observed capacity factor covering the desired training area. Ideally this is monthly generation data for each turbine/farm, but this is hard to find. Use the finest spatial and temporal resolution you can obtain, as this determines the resolution of the bias correction factors.

input: `turb_info` and `obs_cap` (loaded from observational data you find)
output: `obs_data`
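
If only generation totals are available, they can be converted to a capacity factor by dividing by the energy the turbine/farm would produce at full output over the same period. A minimal sketch for monthly data, assuming generation in MWh and capacity in MW; `monthly_capacity_factor` is a hypothetical helper, not part of the `vwf` package.

```python
from calendar import monthrange


def monthly_capacity_factor(generation_mwh, capacity_mw, year, month):
    """Capacity factor = energy generated / energy at full output for the whole month."""
    hours_in_month = monthrange(year, month)[1] * 24
    return generation_mwh / (capacity_mw * hours_in_month)


# Illustrative numbers only: a 2 MW turbine generating 372 MWh in January 2015
print(monthly_capacity_factor(372.0, 2.0, 2015, 1))
```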

In [1]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
from sklearn.cluster import KMeans
from calendar import monthrange
import itertools
from vwf.simulation import simulate_wind
from vwf.preprocessing import (
    prep_era5,
    prep_obs,
    prep_obs_test,
    prep_merra2_method_1
)
from scipy import interpolate

year_star = 2015 # start year of training period
year_end = 2019 # end year of training period
year_test = 2020 # year you wish to receive a time series for

In [2]:
powerCurveFileLoc = 'data/turbine_info/Wind Turbine Power Curves.csv'
powerCurveFile = pd.read_csv(powerCurveFileLoc)

# files for training
era5_train = prep_era5(True)
obs_cf, turb_info_train = prep_obs("DK", year_star, year_end)
unc_ws_train, unc_cf_train = simulate_wind(era5_train, turb_info_train, powerCurveFile)

Number of turbines before preprocessing:  5682
Number of turbines used in training:  3712


In [3]:
def add_times(data):
    data['year'] = pd.DatetimeIndex(data['time']).year
    data['month'] = pd.DatetimeIndex(data['time']).month
    data.insert(1, 'year', data.pop('year'))
    data.insert(2, 'month', data.pop('month'))
    return data

In [4]:
unc_cf = unc_cf_train.groupby(pd.Grouper(key='time',freq='M')).mean().reset_index()
unc_cf = unc_cf.melt(id_vars=["time"], 
                var_name="ID", 
                value_name="sim")
unc_cf = add_times(unc_cf)
unc_cf['ID'] = unc_cf['ID'].astype(str)
unc_cf['month'] = unc_cf['month'].astype(int)
unc_cf['year'] = unc_cf['year'].astype(int)
obs_cf2 = obs_cf.copy()  # copy so that renaming the columns does not modify obs_cf
obs_cf2.columns = ['ID','1','2','3','4','5','6','7','8','9','10','11','12','year']
obs_cf2 = obs_cf2.melt(id_vars=["ID", "year"], 
                var_name="month", 
                value_name="obs")
obs_cf2['ID'] = obs_cf2['ID'].astype(str)
obs_cf2['month'] = obs_cf2['month'].astype(int)
obs_cf2['year'] = obs_cf2['year'].astype(int)
train_cf = pd.merge(unc_cf, obs_cf2, on=['ID','month', "year"], how='left')
scalar_alpha = 0.6
scalar_beta = 0.2
train_cf['scalar'] = (scalar_alpha * (train_cf['obs']/train_cf['sim'])) + scalar_beta
train_cf = train_cf.drop(['time'], axis=1).reset_index(drop=True)
train_cf

Unnamed: 0,year,month,ID,sim,obs,scalar
0,2015,1,570715000001403421,0.321041,0.333199,0.822723
1,2015,2,570715000001403421,0.215571,0.321577,1.095047
2,2015,3,570715000001403421,0.179502,0.304167,1.216703
3,2015,4,570715000001403421,0.149309,0.311944,1.45355
4,2015,5,570715000001403421,0.171895,0.281317,1.181941
...,...,...,...,...,...,...
222715,2019,8,571313134808512710,0.422627,0.206886,0.493715
222716,2019,9,571313134808512710,0.615157,0.213678,0.408413
222717,2019,10,571313134808512710,0.507889,0.21256,0.45111
222718,2019,11,571313134808512710,0.470190,0.194146,0.447746
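
The `scalar` column is a linear function of the ratio of observed to simulated capacity factor. As a small worked check, the sketch below recomputes it for the first row of the table using the same `scalar_alpha` and `scalar_beta` values; the function name `bias_scalar` is hypothetical.

```python
def bias_scalar(obs, sim, alpha=0.6, beta=0.2):
    """scalar = alpha * (obs / sim) + beta, as in the cell above."""
    return alpha * (obs / sim) + beta


# First row of the table: sim = 0.321041, obs = 0.333199
print(bias_scalar(0.333199, 0.321041))
```

The result should agree with the tabulated value up to the rounding of the displayed `sim` and `obs`. Note that a perfect simulation (`obs == sim`) gives a scalar of `alpha + beta = 0.8` rather than 1.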


In [22]:
def extrapolate_wind_speed(reanal_data, turb_info):
    reanal_data = reanal_data.assign_coords(
        height=('height', turb_info['height'].unique()))
    
    # extrapolating the 100 m wind speed to each hub height using the logarithmic wind profile
    ws = reanal_data.wnd100m * (np.log(reanal_data.height / reanal_data.roughness) / np.log(100 / reanal_data.roughness))
    
    # creating coordinates to spatially interpolate to
    lat =  xr.DataArray(turb_info['lat'], dims='turbine', coords={'turbine':turb_info['ID']})
    lon =  xr.DataArray(turb_info['lon'], dims='turbine', coords={'turbine':turb_info['ID']})
    height =  xr.DataArray(turb_info['height'], dims='turbine', coords={'turbine':turb_info['ID']})

    # spatial interpolating to turbine positions
    raw_ws = ws.interp(
            x=lon, y=lat, height=height,
            kwargs={"fill_value": None})
    
    return raw_ws
    
def speed_to_power(sim_ws, turb_info, powerCurveFile): 
    # identifying the model assigned to this turbine ID to access the power curve
    # and convert the speed into power
    x = powerCurveFile['data$speed']
    turb_name = turb_info.loc[turb_info['ID'] == sim_ws.turbine.data, 'model']
    y = powerCurveFile[turb_name].to_numpy().flatten()
    f = interpolate.Akima1DInterpolator(x, y)
    return f(sim_ws.data)


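# NOTE: this redefines the `simulate_wind` imported from `vwf.simulation` above,
# with a different argument order and a (scalar, offset) pair passed via *args
def simulate_wind(turb_info, reanal_data, powerCurveFile, *args):
    scalar, offset = args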
    raw_ws = extrapolate_wind_speed(reanal_data, turb_info)
    raw_ws = (raw_ws * scalar) + offset # equation 2
    raw_ws = raw_ws.where(raw_ws > 0 , 0)
    raw_ws = raw_ws.where(raw_ws < 40 , 40)
    raw_cf = speed_to_power(raw_ws, turb_info, powerCurveFile)
    return np.mean(raw_cf)
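
To make the two wind speed steps in the cell above concrete, here is a scalar sketch using only the standard library. It applies the same logarithmic height extrapolation and the same linear correction with clamping to the 0 to 40 range, but on single numbers rather than xarray objects. The helper names `extrapolate_to_hub` and `correct_speed` and all input values are illustrative assumptions.

```python
import math


def extrapolate_to_hub(ws_100m, hub_height, roughness):
    """Log-law extrapolation of a 100 m wind speed to hub height (heights in metres)."""
    return ws_100m * math.log(hub_height / roughness) / math.log(100 / roughness)


def correct_speed(ws, scalar, offset, lower=0.0, upper=40.0):
    """Apply the linear correction ws * scalar + offset, then clamp to [lower, upper]."""
    return min(max(ws * scalar + offset, lower), upper)


# Illustrative values: 8 m/s at 100 m, 80 m hub, roughness length 0.1 m
ws_hub = extrapolate_to_hub(8.0, 80.0, 0.1)
print(ws_hub, correct_speed(ws_hub, scalar=0.9, offset=0.5))
```

A hub height of exactly 100 m leaves the wind speed unchanged, and hub heights below 100 m reduce it.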

In [23]:
def find_farm_offset(row,turb_info,reanal_data,powerCurveFile):
    myOffset = 0
    
    # decide our initial search step size
    stepSize = -0.64
    if (row.sim > row.obs):
        stepSize = 0.64
        
    # Reanalysis data for this row's month only
    month_start = f"{row.year}-{row.month}-01"
    month_end = f"{row.year}-{row.month}-{monthrange(row.year, row.month)[1]}"
    month_data = reanal_data.sel(time=slice(month_start, month_end))

    # Stop when the step size is smaller than our power curve's resolution
    while np.abs(stepSize) > 0.002:
        # Move the offset by the current step
        myOffset += stepSize
        
        # calculate the mean simulated CF using the new offset
        mean_cf = simulate_wind(
            turb_info[turb_info["ID"]==row.ID], 
            month_data, 
            powerCurveFile, 
            row.scalar, 
            myOffset)

        # if we have overshot our target, then repeat, searching the other direction
        # ((guess < target & sign(step) < 0) | (guess > target & sign(step) > 0))
        if mean_cf != 0:
            sim = mean_cf
            if np.sign(sim - row.obs) == np.sign(stepSize):
                stepSize = -stepSize / 2
            # If we have reached unreasonable places, stop
            if myOffset < -20 or myOffset > 20:
                break
        elif mean_cf == 0:
            myOffset = 0
            break
    
    return myOffset
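
The search in `find_farm_offset` is a step-halving search: keep stepping the offset in one direction, and each time the simulated value crosses the target, reverse direction and halve the step. The sketch below shows the same idea on a toy problem with no reanalysis data. It is not the notebook's function: `find_offset` and `toy_cf` are hypothetical, `toy_cf` stands in for the wind simulation, and the initial direction here is chosen by evaluating the simulation at zero offset, which is an assumption of this sketch.

```python
def find_offset(simulate, target, step=0.64, tol=0.002, limit=20.0):
    """Step-halving search for the offset at which simulate(offset) reaches target.

    `simulate` is assumed to be non-decreasing in the offset.
    """
    offset = 0.0
    # Start by stepping downwards if the zero-offset guess is already too high
    if simulate(offset) > target:
        step = -step
    while abs(step) > tol:
        offset += step
        guess = simulate(offset)
        # Overshot the target: turn around and halve the step
        if (guess > target and step > 0) or (guess < target and step < 0):
            step = -step / 2
        # Give up if the offset becomes unreasonable
        if abs(offset) > limit:
            break
    return offset


def toy_cf(offset):
    """Toy stand-in for the simulation: a linear response clipped to [0, 1]."""
    return min(max(0.3 + 0.1 * offset, 0.0), 1.0)


# The exact answer for this toy problem is an offset of 1.5
print(find_offset(toy_cf, 0.45))
```

Because the step is halved only after an overshoot, the final offset lies within roughly one final step of the true value.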

In [67]:
%%time
first_100 = train_cf.head(100)
first_100.apply(find_farm_offset, args=(turb_info_train,era5_train,powerCurveFile), axis=1)

CPU times: user 8.85 s, sys: 180 ms, total: 9.03 s
Wall time: 9.09 s


0     1.5125
1     0.4825
2    -0.0250
3    -0.8175
4    -0.0150
       ...  
95    1.9925
96    2.0200
97    1.6975
98    1.5425
99    0.0475
Length: 100, dtype: float64

In [101]:
import dask.dataframe as dd
ddf = dd.from_pandas(train_cf, npartitions=40)

In [102]:
def offset2(df):
    return df.apply(find_farm_offset, args=(turb_info_train,era5_train,powerCurveFile), axis=1)

In [103]:
%%time
ddf["offset"] = ddf.map_partitions(offset2, meta=('offset', 'f8'))
ddf.to_csv('data/turbine_info/all_bias_results.csv', single_file=True, compute_kwargs={'scheduler':'processes'})

CPU times: user 55min 48s, sys: 6min 14s, total: 1h 2min 2s
Wall time: 1h 14min 18s


['/Users/ellyess/Desktop/PhD/ninja-reimplementation/data/turbine_info/all_bias_results.csv']

In [106]:
check = pd.read_csv('data/turbine_info/all_bias_results.csv')
check

Unnamed: 0.1,Unnamed: 0,year,month,ID,sim,obs,scalar,offset
0,0,2015,1,570715000001403421,0.321041,0.333199,0.822723,1.5125
1,1,2015,2,570715000001403421,0.215571,0.321577,1.095047,0.4825
2,2,2015,3,570715000001403421,0.179502,0.304167,1.216703,-0.0250
3,3,2015,4,570715000001403421,0.149309,0.311944,1.453550,-0.8175
4,4,2015,5,570715000001403421,0.171895,0.281317,1.181941,-0.0150
...,...,...,...,...,...,...,...,...
222715,222715,2019,8,571313134808512710,0.422627,0.206886,0.493715,1.9600
222716,222716,2019,9,571313134808512710,0.615157,0.213678,0.408413,1.8925
222717,222717,2019,10,571313134808512710,0.507889,0.212560,0.451110,2.0850
222718,222718,2019,11,571313134808512710,0.470190,0.194146,0.447746,2.0825


### (1) Inputting the turbine information data.

In [2]:
data = pd.read_csv('../ninja-reimplementation/data/wind_data/UK/renewable_power_plants_UK_filtered.csv')
data.head()

Unnamed: 0,electrical_capacity,energy_source_level_1,energy_source_level_2,energy_source_level_3,technology,data_source,nuts_1_region,nuts_2_region,nuts_3_region,lon,...,country,commissioning_date,solar_mounting_type,chp,capacity_individual_turbine,number_of_turbines,site_name,uk_beis_id,operator,comment
0,1.3,Renewable energy,Wind,,Onshore,BEIS,UKE,UKE2,UKE22,-1.914154,...,England,1992-01-06,,,0.3,4,Chelker Reservoir,2921,Yorkshire Water,
1,2.7,Renewable energy,Wind,,Onshore,BEIS,UKC,UKC2,UKC21,-1.495191,...,England,1992-01-12,,,0.3,9,Blyth Harbour Wind Farm,3659,Border Wind Farms Ltd,
2,31.0,Renewable energy,Wind,,Onshore,BEIS,UKL,UKL2,UKL24,-3.430831,...,Wales,1993-01-01,,,0.3,103,Llandinam Windfarm,3057,CELTPOWER LTD,
3,4.8,Renewable energy,Wind,,Onshore,BEIS,UKD,UKD1,UKD12,-3.135725,...,England,1993-01-01,,,0.4,12,Kirkby Moor,2713,Npower Renewables,
4,9.6,Renewable energy,Wind,,Onshore,BEIS,UKD,UKD4,UKD46,-2.149984,...,England,1993-01-02,,,0.4,24,Coal Clough Wind Farm,3079,Renewable Energy Systems (RES),


In [3]:
columns = ['country','technology','lon','lat','electrical_capacity','number_of_turbines','capacity_individual_turbine', 'commissioning_date']
data = data[columns].copy()  # copy to avoid a SettingWithCopyWarning on the assignment below
data['commissioning_date'] = pd.to_datetime(data['commissioning_date'])
data = data.sort_values('capacity_individual_turbine')
data.head()

Unnamed: 0,country,technology,lon,lat,electrical_capacity,number_of_turbines,capacity_individual_turbine,commissioning_date
25,Scotland,Onshore,-4.426957,57.709987,17.0,34,0.0,1997-01-09
80,England,Onshore,-3.324533,54.203954,1.2,5,0.22,2004-01-05
299,England,Onshore,-1.342652,52.826198,1.0,4,0.25,2011-10-05
0,England,Onshore,-1.914154,53.962432,1.3,4,0.3,1992-01-06
58,England,Onshore,-4.546749,50.645509,6.6,22,0.3,2002-01-06


### (2) Using the turbine metadata to fill missing variables from (1)

In [4]:
# # Turn Elizabeth's metadata into a general turbine metadata file; the heights here are the min and max found in the Danish data
# metadata = pd.read_excel('../ninja-reimplementation/data/turbine_info/Metadata_2020.xlsx')
# metadata = metadata.sort_values('Dato for \nnettilslutning')
# metadata = metadata.drop(metadata[metadata.height < 10].index)
# max = metadata.groupby('turb_match', as_index=False)['height'].max()
# min = metadata.groupby('turb_match', as_index=False)['height'].min()
# metadata = metadata[['Dato for \nnettilslutning', 'capacity', 'turb_match']]
# metadata.columns = ['date', 'capacity', 'model']
# metadata.drop_duplicates(subset=['model'], keep='first',inplace=True)
# metadata = metadata.reset_index(drop=True)
# metadata = metadata.sort_values('model').reset_index(drop=True)
# metadata['height_min'] = min.height
# metadata['height_max'] = max.height
# metadata.to_csv('../ninja-reimplementation/data/turbine_info/models.csv', index = None) 

In [5]:
metadata = pd.read_csv('../ninja-reimplementation/data/turbine_info/models.csv')
metadata['date'] = pd.to_datetime(metadata['date'])
metadata = metadata.sort_values('capacity')
metadata.capacity = metadata.capacity/1000  # kW to MW, to match the capacities in `data`
metadata.head()

Unnamed: 0,model,capacity,height_min,height_max,date
0,Bonus.B23.150,0.15,30.0,60.0,1987-04-13
18,Nordex.N27.150,0.15,35.0,40.0,1982-04-21
44,Vestas.V27.225,0.225,29.3,39.0,1980-01-03
45,Vestas.V29.225,0.225,17.0,35.0,1979-08-16
19,Nordex.N29.250,0.25,30.0,69.0,1988-06-27


In [8]:
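# merge_asof needs both frames sorted by their key columns (done in the cells above);
# each turbine/farm is matched to the model with the nearest capacity
turb_info = pd.merge_asof(data, metadata, left_on=["capacity_individual_turbine"], right_on=["capacity"], direction="nearest")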
turb_info.head()

Unnamed: 0,country,technology,lon,lat,electrical_capacity,number_of_turbines,capacity_individual_turbine,commissioning_date,model,capacity,height_min,height_max,date
0,Scotland,Onshore,-4.426957,57.709987,17.0,34,0.0,1997-01-09,Bonus.B23.150,0.15,30.0,60.0,1987-04-13
1,England,Onshore,-3.324533,54.203954,1.2,5,0.22,2004-01-05,Vestas.V27.225,0.225,29.3,39.0,1980-01-03
2,England,Onshore,-1.342652,52.826198,1.0,4,0.25,2011-10-05,Nordex.N29.250,0.25,30.0,69.0,1988-06-27
3,England,Onshore,-1.914154,53.962432,1.3,4,0.3,1992-01-06,Bonus.B33.300,0.3,105.0,130.0,1991-12-15
4,England,Onshore,-4.546749,50.645509,6.6,22,0.3,2002-01-06,Bonus.B33.300,0.3,105.0,130.0,1991-12-15


### (3) Matching observational data with turbines/farms in `turb_info`

In [None]:
time_res = "quarter" # "quarter", "month", "season", "year", "day"
space_res = "country" # "turbine", "farm", "country", "custom" 

