## simple notebook to show basic steps for preparing era5 data for use

In many instances, it is possible to identify the single closest ERA5 point to our data, and use it to build rough models. Monthly wind speed aggregate values are typically used with great success with monthly wind plant aggregate values of gross energy (daily average or adjusted for number of days in month) as the gross production will/should follow the resource wind speed, so a linear model with wind speed as the predictor and gross energy as the predictand can be quite strong and reliable. It is not the absolute value of the wind speed that matters, but the pattern of change, month to month.

Small improvements might be made by interpolating the 4 surrounding points to the site coordinates to approximate the relative impact of the 4 surrounding sites. It is common for the interpolated wind speed value to have a slightly higher correlation and rsquared than the closest point.

In cases where models will be built at higher resolution, additional adjustments may reduce model uncertainty.  Besides interpolation to the site coordinates, we can consider the elevation of teh measurements.   ERA5 has u and v vector measurements of easting and northing at both 10 and 100m as well as higher elevations based on pressure levels, along with temperature at 2m and pressure at ground level.  The u and v parameters can be converted to wind speed and direction,  we can adjust speed based on shear from the 10/100 m measurements to our hub height (an average of 75 m at Kelmarsh) and there are lapse rates - changes with elevation for pressure and temperature.

in the simplified code that follows, we'll employ all the more common adjustments to demonstrate what is possible.  Again, it is part of model design, development, testing and verification to confirm that the additional steps are adding value to the modeling process by reducing model uncertainty. As many of these adjustments are numeric transformations, most models will self adjust to the raw unadjusted parameter, but it should be confirmed, not assumed.



### import and inspect data

In [2]:
import polars as pl
import numpy as np
from pathlib import Path
cwd =  Path.cwd()

# get the data saved to the output folder
polars_df = pl.read_parquet(cwd / 'output' / 'era5_data.parquet')

### adjust parameters to avg kelmarsh hub height 75 m

we jump to speed and direction at 10 and 100 so we can use a shear calc to get 75m wind speed, and we extrapolate wind direction, then we convert these back to u and v at 75 m


In [146]:
import polars as pl
import numpy as np

# Constants
R = 287.05  # Specific gas constant for dry air (J/(kg·K))
g = 9.80665  # Acceleration due to gravity (m/s²)
L = -0.0065  # Standard temperature lapse rate (K/m)
kelmarsh_avg_hh = 75  # Height difference (m)
z0 = 10  # Reference height (m)
z1 = 100  # Second height (m)

def calculate_temperature(df, base_temp_col, height_diff, alias):
    return df.with_columns(
        (pl.col(base_temp_col) + L * height_diff).alias(alias)
    )

def calculate_pressure(df, base_pres_col, temp_col, height_diff, alias):
    return df.with_columns(
        (pl.col(base_pres_col) * np.exp(-g * height_diff / (R * pl.col(temp_col)))).alias(alias)
    )

def calculate_air_density(df, pres_col, temp_col, alias):
    return df.with_columns(
        (pl.col(pres_col) / (R * pl.col(temp_col))).alias(alias)
    )

def wswd_from_uv(df, u_col, v_col, speed_alias, dir_alias):
    return df.with_columns(
        (pl.col(u_col)**2 + pl.col(v_col)**2).sqrt().alias(speed_alias),
        (pl.arctan2(pl.col(v_col), pl.col(u_col)) * pl.lit(180 / np.pi)).alias(dir_alias)
    )

def uv_from_wswd(df, speed_col, dir_col, u_alias, v_alias):
    df = df.with_columns(
        (pl.col(dir_col) * pl.lit(np.pi / 180)).alias(f"{dir_col}_rad")
    )
    return df.with_columns(
        (-pl.col(speed_col) * pl.col(f"{dir_col}_rad").sin()).alias(u_alias),
        (-pl.col(speed_col) * pl.col(f"{dir_col}_rad").cos()).alias(v_alias)
    ).drop([f"{dir_col}_rad"])

def interpolate_wind_direction(df, dir_col_10m, dir_col_100m, height_diff, alias):
    return df.with_columns(
        ((pl.col(dir_col_10m) + (pl.col(dir_col_100m) - pl.col(dir_col_10m)) * height_diff / (z1 - z0))).alias(alias)
    )

# Assuming polars_df is already created and contains the necessary columns
# polars_df = ... (created in the previous cell)

# Calculate temperature at 75 meters
polars_df = calculate_temperature(polars_df, "EnvTmp_2m", kelmarsh_avg_hh - 2, "EnvTmp_75m")

# Calculate pressure at 75 meters
polars_df = calculate_pressure(polars_df, "EnvPres_0m", "EnvTmp_75m", kelmarsh_avg_hh, "EnvPres_75m")

# Calculate air density at the surface (0 meters)
polars_df = calculate_air_density(polars_df, "EnvPres_0m", "EnvTmp_2m", "AirDen_0m")

# Calculate air density at 75 meters
polars_df = calculate_air_density(polars_df, "EnvPres_75m", "EnvTmp_75m", "AirDen_75m")

# Calculate temperature at 100 meters
polars_df = calculate_temperature(polars_df, "EnvTmp_2m", 100 - 2, "EnvTmp_100m")

# Calculate pressure at 100 meters
polars_df = calculate_pressure(polars_df, "EnvPres_0m", "EnvTmp_100m", 100, "EnvPres_100m")

# Calculate air density at 100 meters
polars_df = calculate_air_density(polars_df, "EnvPres_100m", "EnvTmp_100m", "AirDen_100m")

# Calculate wind speed and direction at 10 meters and 100 meters
polars_df = wswd_from_uv(polars_df, "HorWdU_10m", "HorWdV_10m", "HorWdSpd_10m", "HorWdDir_10m")
polars_df = wswd_from_uv(polars_df, "HorWdU_100m", "HorWdV_100m", "HorWdSpd_100m", "HorWdDir_100m")

# Calculate wind shear exponent using wind speeds at 10 meters and 100 meters
polars_df = polars_df.with_columns(
    (pl.col("HorWdSpd_100m") / pl.col("HorWdSpd_10m")).log().alias("log_wind_speed_ratio"),
    (pl.lit(z1 / z0).log()).alias("log_height_ratio")
)

polars_df = polars_df.with_columns(
    (pl.col("log_wind_speed_ratio") / pl.col("log_height_ratio")).alias("wind_shear_exponent")
)

# Calculate wind speed at 75 meters using the power law
polars_df = polars_df.with_columns(
    (pl.col("HorWdSpd_10m") * (kelmarsh_avg_hh / z0)**pl.col("wind_shear_exponent")).alias("HorWdSpd_75m")
)

# Interpolate wind direction at 75 meters
polars_df = interpolate_wind_direction(polars_df, "HorWdDir_10m", "HorWdDir_100m", kelmarsh_avg_hh - z0, "HorWdDir_75m")

# Calculate U and V components at 75 meters
polars_df = uv_from_wswd(polars_df, "HorWdSpd_75m", "HorWdDir_75m", "HorWdU_75m", "HorWdV_75m")

# Drop intermediate columns
polars_df = polars_df.drop(["log_wind_speed_ratio", "log_height_ratio", "wind_shear_exponent"])

# Reorder columns
columns_order = [
    "valid_time", "latitude", "longitude",
    "HorWdU_10m", "HorWdV_10m", "HorWdSpd_10m", "HorWdDir_10m", "EnvPres_0m", "EnvTmp_2m", "AirDen_0m",
    "HorWdU_75m", "HorWdV_75m", "HorWdSpd_75m", "HorWdDir_75m", "EnvPres_75m", "EnvTmp_75m", "AirDen_75m",
    "HorWdU_100m", "HorWdV_100m", "HorWdSpd_100m", "HorWdDir_100m", "EnvPres_100m", "EnvTmp_100m", "AirDen_100m"
]
polars_df = polars_df.select(columns_order)

# Print the DataFrame with all calculations
print(polars_df)

shape: (35_136, 24)
┌───────────┬──────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ valid_tim ┆ latitude ┆ longitude ┆ HorWdU_10 ┆ … ┆ HorWdDir_ ┆ EnvPres_1 ┆ EnvTmp_10 ┆ AirDen_10 │
│ e         ┆ ---      ┆ ---       ┆ m         ┆   ┆ 100m      ┆ 00m       ┆ 0m        ┆ 0m        │
│ ---       ┆ f64      ┆ f64       ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ datetime[ ┆          ┆           ┆ f32       ┆   ┆ f32       ┆ f32       ┆ f32       ┆ f32       │
│ ns]       ┆          ┆           ┆           ┆   ┆           ┆           ┆           ┆           │
╞═══════════╪══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 2020-01-0 ┆ 52.25    ┆ -1.0      ┆ -3.089966 ┆ … ┆ 154.89227 ┆ 100480.58 ┆ 278.57693 ┆ 1.256549  │
│ 1         ┆          ┆           ┆           ┆   ┆ 3         ┆ 5938      ┆ 5         ┆           │
│ 00:00:00  ┆          ┆           ┆           ┆   ┆           ┆       

In [147]:
for i,c in enumerate(polars_df.columns):
    print(c, end = ', ')
    if i % 8 == 0:
        print()

valid_time, 
latitude, longitude, HorWdU_10m, HorWdV_10m, HorWdSpd_10m, HorWdDir_10m, EnvPres_0m, EnvTmp_2m, 
AirDen_0m, HorWdU_75m, HorWdV_75m, HorWdSpd_75m, HorWdDir_75m, EnvPres_75m, EnvTmp_75m, AirDen_75m, 
HorWdU_100m, HorWdV_100m, HorWdSpd_100m, HorWdDir_100m, EnvPres_100m, EnvTmp_100m, AirDen_100m, 

### interpolate values to the site location

polars_df now has a lot of data at the 4 coordinates surrounding the site - we could not create an interpolated dataset - so the weighted influence of hte surrounding points can better related to site conditions






In [148]:
# reminder where we are - Kelmarsh site:
lat = 52.40
lon = -0.943
closest_points = get_closest_grid_points(lat, lon) # function from up above

# Print the closest points and their distances
distance_list = []
for point in closest_points:
    distance = np.round(haversine(lat, lon, point[0], point[1]), 3)
    print(f'For point {point}, distance is {distance} km from Kelmarsh at {lat}, {lon}')
    # create a new polars dataframe with lat/long/distance as columns to use in next step
    point_lat = point[0]
    point_lon = point[1]
    tmp  = pl.DataFrame({
        'latitude': [point_lat],
        'longitude': [point_lon],
        'distance': [distance] })
    distance_list.append(tmp)
distances_df = (pl.concat(distance_list)).sort(['latitude','longitude'])

distances_df = distances_df.with_columns(  (pl.lit('{') + pl.col('latitude').cast(pl.Utf8) + pl.lit(',') + pl.col('longitude').cast(pl.Utf8) + pl.lit('}')).alias('pivot_col_name')) 
print(distances_df)

polars_df = distances_df.join(polars_df, on=['latitude', 'longitude'], how='inner')    

For point (52.5, -1.0), distance is 11.771 km from Kelmarsh at 52.4, -0.943
For point (52.25, -1.0), distance is 17.123 km from Kelmarsh at 52.4, -0.943
For point (52.5, -0.75), distance is 17.167 km from Kelmarsh at 52.4, -0.943
For point (52.25, -0.75), distance is 21.219 km from Kelmarsh at 52.4, -0.943
shape: (4, 4)
┌──────────┬───────────┬──────────┬────────────────┐
│ latitude ┆ longitude ┆ distance ┆ pivot_col_name │
│ ---      ┆ ---       ┆ ---      ┆ ---            │
│ f64      ┆ f64       ┆ f64      ┆ str            │
╞══════════╪═══════════╪══════════╪════════════════╡
│ 52.25    ┆ -1.0      ┆ 17.123   ┆ {52.25,-1.0}   │
│ 52.25    ┆ -0.75     ┆ 21.219   ┆ {52.25,-0.75}  │
│ 52.5     ┆ -1.0      ┆ 11.771   ┆ {52.5,-1.0}    │
│ 52.5     ┆ -0.75     ┆ 17.167   ┆ {52.5,-0.75}   │
└──────────┴───────────┴──────────┴────────────────┘


In [168]:
import polars as pl
import numpy as np

# Example usage
lat = 52.40
lon = -0.943

# Assuming polars_df and distances_df are already created and contain the necessary columns
# polars_df contains columns: valid_time, latitude, longitude, distance, pivot_col_name and other data columns
# distances_df contains columns: latitude, longitude, distance and pivot_col_name

# Perform inverse distance weighting interpolation
def idw_interpolation(distances, values):
    weights = 1 / distances
    weights /= weights.sum()  # Normalize weights
    interpolated_value = np.dot(weights, values)
    return interpolated_value

def interpolate_columns(polars_df, distances_df, lat, lon):
    interpolated_series = []

    for col in polars_df.columns:
        if col not in ["latitude", "longitude", "valid_time", "distance", 'pivot_col_name']:
            # Pivot the DataFrame on latitude and longitude combinations
            tmp = (polars_df.select('valid_time', 'pivot_col_name', col)
                           .pivot(index='valid_time', on=['pivot_col_name'], values=col))

            # Extract the pivot column names from distances_df
            pivot_col_names = distances_df['pivot_col_name'].to_list()

            # Extract the values for the surrounding points from the pivoted DataFrame
            values = tmp.select(pivot_col_names).to_numpy()

            # Extract the distances for the surrounding points from distances_df
            distances = distances_df['distance'].to_numpy()

            # Ensure that the values array is correctly shaped
            values = values.reshape(-1, len(distances))

            # Apply IDW interpolation for each row in the pivoted DataFrame
            interpolated_values = np.apply_along_axis(
                lambda row: idw_interpolation(distances, row),
                axis=1,
                arr=values
            )

            # Create a new DataFrame with the interpolated values
            interpolated_df = tmp.select("valid_time").with_columns([
                pl.Series(name=f"{lat},{lon}", values=interpolated_values)
            ])

            # Join the interpolated values back to the tmp DataFrame
            result_df = tmp.join(interpolated_df, on="valid_time")

            # Unpivot the DataFrame back to long format
            long_df = result_df.unpivot(
                index="valid_time",
                variable_name="location",
                value_name=col
            )

            # Split the 'location' column into 'latitude' and 'longitude'
            long_df = long_df.with_columns([
                pl.col("location").str.split_exact(",", 1).alias("split_location")
            ]).with_columns([
                pl.col("split_location").struct[0].str.strip_chars("{}").cast(pl.Float64).alias("latitude"),
                pl.col("split_location").struct[1].str.strip_chars("{}").cast(pl.Float64).alias("longitude")
            ]).drop("split_location", "location")
            
            interpolated_series.append(long_df)


    # the dfs in interpolated_series have to be joined on valid_time, latitude, longitude, not concatenated
    polars_df_interp = interpolated_series[0]
    for i in range(1, len(interpolated_series)):
        polars_df_interp = polars_df_interp.join(interpolated_series[i], on=['valid_time', 'latitude', 'longitude'], how='inner')   
        
    

    return polars_df_interp

# Interpolate all columns
polars_df_interp = interpolate_columns(polars_df, distances_df, lat, lon)

# Print the resulting DataFrame
print(polars_df_interp)

shape: (43_920, 24)
┌───────────┬───────────┬──────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ valid_tim ┆ HorWdU_10 ┆ latitude ┆ longitude ┆ … ┆ HorWdDir_ ┆ EnvPres_1 ┆ EnvTmp_10 ┆ AirDen_10 │
│ e         ┆ m         ┆ ---      ┆ ---       ┆   ┆ 100m      ┆ 00m       ┆ 0m        ┆ 0m        │
│ ---       ┆ ---       ┆ f64      ┆ f64       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ datetime[ ┆ f64       ┆          ┆           ┆   ┆ f64       ┆ f64       ┆ f64       ┆ f64       │
│ ns]       ┆           ┆          ┆           ┆   ┆           ┆           ┆           ┆           │
╞═══════════╪═══════════╪══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 2020-01-0 ┆ -3.089966 ┆ 52.25    ┆ -1.0      ┆ … ┆ 154.89227 ┆ 100480.58 ┆ 278.57693 ┆ 1.256549  │
│ 1         ┆           ┆          ┆           ┆   ┆ 3         ┆ 5938      ┆ 5         ┆           │
│ 00:00:00  ┆           ┆          ┆           ┆   ┆           ┆       

In [167]:
long_df

valid_time,EnvTmp_75m,latitude,longitude
datetime[ns],f64,f64,f64
2020-01-01 00:00:00,278.739441,52.25,-1.0
2020-01-01 00:00:00,278.735535,52.25,-0.75
2020-01-01 00:00:00,278.516553,52.4,-0.943
2020-01-01 00:00:00,278.3703,52.5,-1.0
2020-01-01 00:00:00,278.296082,52.5,-0.75
…,…,…,…
2020-12-31 23:00:00,271.644287,52.25,-1.0
2020-12-31 23:00:00,271.700928,52.25,-0.75
2020-12-31 23:00:00,271.577875,52.4,-0.943
2020-12-31 23:00:00,271.509521,52.5,-1.0
