### **CPUC Data Request**: Localized air temperature and dewpoint temperature
This notebook provides the full pre-processing and batch-mode calculation used to localize WRF data at 71 WECC weather station locations for both air temperature (degF) and dewpoint temperature (degF).

Produces:
- A single netcdf file per station with air temperature and dewpoint temperature as separate variables
- Summary statistics csv file
    - Values represent the multi-model mean count of time steps in which dewpoint temperature exceeds air temperature, grouped by calendar month and expressed as a percentage of all time steps

<span style="color:#FF0000">**Reference Notebook Only**</span>: The HadISD station data used to localize the WRF data are stored in a private bucket, `wecc-hadisd`, and access is required to run this notebook. **This notebook is therefore provided to document the methodology only and cannot be run as-is.** The version of the HadISD station data currently connected to climakitae contains only the air temperature variable. Replacing the publicly available station data with the version that also includes dewpoint temperature would break the existing climakitae code. Once the climakitae backend code is updated, the two-variable version will replace the single-variable version.
    
### Step 0: Import libraries
Import useful libraries for analysis

In [None]:
import climakitae as ck
import climakitaegui as ckg
import xarray as xr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from climakitae.util.utils import read_csv_file, get_closest_gridcell, convert_to_local_time

from xclim.core.calendar import convert_calendar
from xclim.core.units import convert_units_to
from xclim.sdba.adjustment import QuantileDeltaMapping
from xclim.sdba import Grouper

from bokeh.models import HoverTool
from timezonefinder import TimezoneFinder

import s3fs
import pyproj
import itertools
import panel as pn
pn.extension()

### Step 1: "Helper" functions for pre-processing
Two functions are provided: 
- Quantile delta mapping function: `do_QDM`
   - This is the same function that is provided in climakitae for localization 
- A version of the climakitae function `get_closest_gridcell` that intentionally retrieves the closest land gridcell: `get_closest_land_gridcell`
   - This is the preliminary work for identifying the importance of land vs. water pixels for some near-shore HadISD station locations
   - Currently set-up for Santa Barbara (KSBA) only
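
To make the idea behind `do_QDM` concrete, here is a minimal NumPy sketch of additive (`kind="+"`) quantile delta mapping on plain 1-D arrays. It is illustrative only: it omits the day-of-year grouping and rolling window that `Grouper` provides, and the function name `qdm_additive` and the synthetic data are hypothetical, not part of xclim or climakitae.

```python
import numpy as np

def qdm_additive(obs, hist, sim, nquantiles=20):
    """Toy additive quantile delta mapping (no seasonal grouping)."""
    # Quantile levels at the centre of each of the nquantiles bins
    taus = (np.arange(nquantiles) + 0.5) / nquantiles
    # Adjustment factor per quantile: observed minus modelled (historical)
    af = np.quantile(obs, taus) - np.quantile(hist, taus)
    # Empirical quantile of every value within the series being adjusted
    ranks = np.argsort(np.argsort(sim))
    sim_tau = (ranks + 0.5) / sim.size
    # Interpolate the factor at each value's quantile and add it
    return sim + np.interp(sim_tau, taus, af)

rng = np.random.default_rng(0)
hist = rng.normal(60.0, 10.0, 1000)   # synthetic model data, historical period (degF)
obs = hist + 2.0                      # synthetic observations: model is 2 degF too cold
sim = hist + 5.0                      # synthetic model data with 5 degF of warming
adjusted = qdm_additive(obs, hist, sim)
```

Because the synthetic bias is a constant 2 degF, every adjustment factor is 2 and `adjusted` equals `sim + 2`: the bias is removed while the model's 5 degF change signal is preserved.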

In [None]:
window = 90
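def do_QDM(obs, ds, nquantiles=20, 
           group='time.dayofyear', window=window, 
           kind="+"):
    """Quantile delta mapping: train on obs vs. model data over the obs period, then adjust the full model series.

    Returns the trained QDM dataset (adjustment factors) and the model data merged with its adjusted version.
    """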
    
    group = Grouper(group, window=window)

    ds.attrs['variable'] = ds.name
    ds.name = 'Raw' 
    
    QDM = QuantileDeltaMapping.train(
        obs, 
        ds.sel(
            time=slice(str(obs.time.values[0]),
                       str(obs.time.values[-1]))), 
        nquantiles=nquantiles, 
        group=group, 
        kind=kind)
    
    ds_adj = QDM.adjust(ds).compute()
    
    QDM_ds = QDM.ds.rename(dict(
        dayofyear = 'Day of Year', 
        quantiles='Quantile'))    
    
    ds_adj.name = 'Adjusted' 
    ds_adj = xr.merge([ds, ds_adj])
    
    return QDM_ds,ds_adj

In [None]:
def get_closest_land_gridcell(data, lat, lon, res='3km', print_coords=True):
    """From input gridded data, get the closest gridcell to a lat, lon coordinate pair.

    This function first transforms the lat,lon coords to the gridded data’s projection.
    Then, it uses xarray’s built in method .sel to get the nearest gridcell.

    Parameters
    -----------
    data: xr.DataArray or xr.Dataset
        Gridded data
    lat: float
        Latitude of coordinate pair
    lon: float
        Longitude of coordinate pair
    res: str
        Spatial resolution for Santa Barbara station
    print_coords: bool, optional
        Print closest coordinates?
        Default to True. Set to False for backend use.

    Returns
    --------
    xr.DataArray
        Grid cell closest to input lat,lon coordinate pair

    See also
    --------
    xarray.DataArray.sel
    """
    # Make Transformer object
    lat_lon_to_model_projection = pyproj.Transformer.from_crs(
        crs_from="epsg:4326",  # Lat/lon
        crs_to=data.rio.crs,  # Model projection
        always_xy=True,
    )

    # Hard-coding for Santa Barbara, forces land pixel selection over water pixel
    # Note fine-tuning produces different results for different spatial resolutions, specify which on call
    if np.isclose(lat, 34.424) and np.isclose(lon, -119.842): # station coordinates for SB (HadISD_72392523190); isclose avoids exact float comparison
        print('Selecting closest land pixel for Santa Barbara...')
        if res == '9km':
            lat = 34.43
            lon = -119.83
        elif res == '3km':
            lat = 34.43
            lon = -119.845

    # Convert coordinates to x,y
    x, y = lat_lon_to_model_projection.transform(lon, lat)

    # Get closest gridcell
    closest_gridcell = data.sel(x=x, y=y, method="nearest")

    # Output information
    if print_coords:
        print(
            "Input coordinates: (%.4f, %.4f)" % (lat, lon)
            + "\nNearest grid cell coordinates: (%.4f, %.4f)"
            % (closest_gridcell.lat.values.item(), closest_gridcell.lon.values.item())
        )
    return closest_gridcell
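
The KSBA override above hard-codes replacement coordinates. One possible generalization, sketched here with plain NumPy arrays, masks out water cells before searching for the nearest grid cell. It assumes a `landmask` array in which 1 marks land (the WRF data carry a `landmask` coordinate, but its encoding is not confirmed here), and `nearest_land_index` is a hypothetical helper, not a climakitae function.

```python
import numpy as np

def nearest_land_index(lat2d, lon2d, landmask, lat0, lon0):
    """Return the (row, col) index of the land cell closest to (lat0, lon0)."""
    # Squared distance in degrees, with longitude scaled by cos(latitude)
    dlat = lat2d - lat0
    dlon = (lon2d - lon0) * np.cos(np.deg2rad(lat0))
    dist2 = dlat**2 + dlon**2
    # Exclude water cells (assumption: landmask == 1 marks land)
    dist2 = np.where(landmask == 1, dist2, np.inf)
    row, col = np.unravel_index(np.argmin(dist2), dist2.shape)
    return int(row), int(col)

# Synthetic 3x3 grid around the KSBA station coordinates
lats = np.array([34.40, 34.43, 34.46])
lons = np.array([-119.87, -119.84, -119.81])
lat2d, lon2d = np.meshgrid(lats, lons, indexing="ij")

# Made-up mask: ocean along the southern row and at the centre cell
landmask = np.array([[0, 0, 0],
                     [1, 0, 1],
                     [1, 1, 1]])

idx_land = nearest_land_index(lat2d, lon2d, landmask, 34.424, -119.842)
idx_any = nearest_land_index(lat2d, lon2d, np.ones_like(landmask), 34.424, -119.842)
```

With the made-up mask, the unconstrained nearest cell is the centre cell, while the land-only search moves one cell to the west.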

### Step 2: Run full batch mode for all available stations
The next cell runs the full batch calculation to generate localized data at all 71 locations for both air temperature and dewpoint temperature.

Requested data notes:
- Data is `Historical Climate` + `SSP 3-7.0 -- Business as Usual`
- Data covers 1981 - 2100, at hourly timesteps
- For 67 of the 71 stations, the 9 km spatial resolution is used, because many stations are located outside of CA, where the 3 km data cannot be used
   - 3 stations are outside of the WECC domain and were run with the 45 km data (KFSD, KMCI, KSGF)
   - 1 station was run with 3 km, using the closest land gridcell (KSBA)
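
The resolution rules above, together with the lat/lon search-box half-widths used in the batch function, can be summarized as a small lookup. This is only a sketch: `resolution_and_halfwidth` and `search_box` are hypothetical helper names, and `KXXX` is a placeholder rather than a real station in the list.

```python
# Stations handled as special cases in this notebook
OUTSIDE_WECC = {"KFSD", "KMCI", "KSGF"}
NEAREST_LAND = {"KSBA"}

def resolution_and_halfwidth(icao):
    """Return (WRF resolution label, half-width in degrees of the search box)."""
    if icao in OUTSIDE_WECC:
        return "45 km", 0.5
    if icao in NEAREST_LAND:
        return "3 km", 0.1
    return "9 km", 0.2

def search_box(lat0, lon0, icao):
    """Return the resolution plus (min, max) latitude and longitude bounds."""
    res, half = resolution_and_halfwidth(icao)
    return res, (lat0 - half, lat0 + half), (lon0 - half, lon0 + half)

res, lat_bounds, lon_bounds = search_box(34.424, -119.842, "KSBA")
```

The box is deliberately wider than a single grid cell; the batch function then reduces the retrieved data to the closest grid cell.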

<span style="color:#FF0000">**Warning**</span>: Each station takes approximately **25 minutes to run**. All 71 stations will take approximately **30 hours of continuous run time**. 
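
When `correct_tz=True`, the observations are shifted from UTC to local time, which is why the function below also loads 2015: the final local hours of 2014 carry 2015 UTC timestamps. A small standard-library sketch, assuming a fixed UTC-8 offset (Pacific Standard Time, ignoring the daylight saving shifts that the notebook's `TimezoneFinder` approach handles):

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))  # assumption: fixed offset, no daylight saving

# First 10 hourly UTC timestamps of 2015
utc_hours = [datetime(2015, 1, 1, h, tzinfo=timezone.utc) for h in range(10)]

# Convert to local time and drop the tz info, mirroring tz_convert(...).tz_localize(None)
local_hours = [t.astimezone(PST).replace(tzinfo=None) for t in utc_hours]

# Hours that still belong to 2014 on the local clock
still_2014 = [t for t in local_hours if t.year == 2014]
print(len(still_2014), still_2014[0], still_2014[-1])
# prints: 8 2014-12-31 16:00:00 2014-12-31 23:00:00
```

Without the 2015 slice, the converted series would end eight hours short of the end of 2014.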

In [None]:
# read stations df
stations = "/home/jovyan/cae-notebooks/collaborative/DFU/wecc-station-data.csv"
stations_df = read_csv_file(stations)

# prep summary stats df
stats_list = []

# initialize selections
selections = ckg.Select()

def tas_dew_localize(stations_df, correct_tz=False):
    '''
    Performs the localization procedure, where each variable is processed **individually**
    Multi-variate localization improvements are forthcoming. 
    
    Arguments
    ---------
    stations_df [pd.DataFrame]: station list of lat-lon locations to run
    correct_tz [Boolean]: flag to run timezone correction to PST, default is False (UTC)
    
    Return
    ------
    stats_df [pd.DataFrame]: one row per station, with the percentage of time steps where dpts > tas for each calendar month
    '''
    
    for my_station in stations_df['icao']:
    # for my_station in ['KFSD', 'KMCI', 'KSGF']:    # outside of wecc, run with 45km
    # for my_station in ['KSBA']:                    # SB station - water pixel check, run with 3km

        ## STEP 1 =================================================================================================
        # Grab observations data
        station_id = str(stations_df[stations_df['icao'] == my_station]['station id'].values[0]).replace('-', '')
        print('Running tas+dpts localization on: HadISD_{} ({})'.format(station_id, my_station))

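        # AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are not defined in this notebook; they must be set beforehand for the private bucket
        s3 = s3fs.S3FileSystem(anon=False, key=AWS_ACCESS_KEY_ID, secret=AWS_SECRET_ACCESS_KEY)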
        aws_path = "s3://wecc-hadisd/02_tmp_tas_dpt/"
        filepath_zarr = aws_path + "HadISD_{}.zarr".format(station_id)
        print('Opening: {}'.format(filepath_zarr))
        store = s3fs.S3Map(root=filepath_zarr, s3=s3, check=False)
        ds = xr.open_zarr(store=store, consolidated=True) # observation data retrieved

        ## STEP 2 =================================================================================================
        # Subset and run per variable
        vars_to_run = ['tas', 'dpts']
        for var in vars_to_run: # run one variable at a time
            print('\nStart processing on {}'.format(var))
            obs_ds = ds[var] # Subset for variable
            obs_ds = convert_units_to(obs_ds, "degF") # Convert units from K to degF
            obs_ds = obs_ds.chunk(dict(time=-1)).compute()

            # extract coordinates
            lat0 = obs_ds.latitude.values
            lon0 = obs_ds.longitude.values
            print('Obs lat-lon coords: ', lat0, lon0)
            
            if correct_tz: # station timezone conversion
                # cannot use the convert_to_local_time function as we are not pulling bias-corrected model data for a station 
                # but modifying the existing raw station data itself
                # 1981 - 2014 is the baseline period
                print('Starting timezone conversion...')
                obs_ds_data = obs_ds.loc[(obs_ds.time.dt.year >= 1981) & (obs_ds.time.dt.year <= 2014)]

                # need to retrieve 2015 to grab the last ~8 hours of "2014" from 2015 (in UTC) to do timezone conversion
                obs_ds_tzslice = obs_ds.loc[obs_ds.time.dt.year == 2015]

                # combine and convert for timezone correction
                obs_ds_total = xr.concat([obs_ds_data, obs_ds_tzslice], dim='time') # 1981-2015
                tf = TimezoneFinder()
                local_tz = tf.timezone_at(lng=float(lon0), lat=float(lat0))
                new_time = (pd.DatetimeIndex(obs_ds_total.time)
                            .tz_localize("UTC")
                            .tz_convert(local_tz)
                            .tz_localize(None)
                            .astype("datetime64[ns]"))
                obs_ds_total['time'] = new_time

                # subset back to the baseline time range and use the converted series as the observations
                start = obs_ds_data.time[0]
                end = obs_ds_data.time[-1]
                obs_ds = obs_ds_total.sel(time=slice(start, end))
                print('Obs timezone correction complete!')

            # retrieve WRF data
            print('Retrieve WRF data...')
            selections.scenario_historical=['Historical Climate']
            selections.scenario_ssp=['SSP 3-7.0 -- Business as Usual']
            selections.append_historical = True
            selections.area_average = 'No'
            selections.time_slice = (1981, 2100)
            selections.timescale = 'hourly'
            if var == 'tas':
                selections.variable = 'Air Temperature at 2m'
            elif var == 'dpts':
                selections.variable = 'Dew point temperature'
            selections.units = 'degF'
            selections.area_subset = 'lat/lon'
            selections.cached_area = ['coordinate selection']
            # depending on which station is being run, different spatial resolutions are required
            if my_station in ['KFSD', 'KMCI', 'KSGF']: # outside of WECC stations
                selections.resolution = '45 km' 
                selections.latitude = (lat0-.5, lat0+.5) 
                selections.longitude = (lon0-.5, lon0+.5)
            elif my_station in ['KSBA']: # Santa Barbara station run at closest gridcell, 3km
                selections.resolution = '3 km'
                selections.latitude = (lat0-.1, lat0+.1)
                selections.longitude = (lon0-.1, lon0+.1)
            else: # all other stations
                selections.resolution = '9 km' # can only use 9km, not 3km for outside of CA regions
                selections.latitude = (lat0-.2, lat0+.2) 
                selections.longitude = (lon0-.2, lon0+.2) 
            wrf_ds = selections.retrieve()
            print('Retrieving WRF lat-lon: ', selections.latitude, selections.longitude)
            # spacing on latlon kept larger because some station locations are tricky
            # keeping large spacing and reduce to single grid cell
            
            if correct_tz: # WRF timezone conversion
                wrf_ds = convert_to_local_time(wrf_ds, selections)
             
            # reduce to only closest grid cell
            if my_station in ['KSBA']: 
                # for Santa Barbara, we have tested using "get_closest_land_gridcell" defined above
                print('Retrieving closest LAND gridcell...')
                wrf_ds = get_closest_land_gridcell(wrf_ds, lat0, lon0, res='3km', print_coords=True) ## test run on land pixel selection for KSBA
            else: 
                print('Retrieving closest gridcell...')
                wrf_ds = get_closest_gridcell(wrf_ds, lat0, lon0, print_coords=True)

            # need to unchunk for bias correction
            wrf_ds = wrf_ds.chunk(dict(time=-1)).compute()
            # do some renaming for plotting ease later
            wrf_ds.attrs['physical_variable'] = wrf_ds.name
            wrf_ds.name = 'Raw'
            
            # dropping duplicate time indexes and resetting calendar
            print('Dropping duplicates in time dimension')
            wrf_ds = wrf_ds.drop_duplicates(dim='time', keep='first')
            obs_ds = obs_ds.drop_duplicates(dim='time', keep='first')
            
            print('Converting to no leap day calendar, best practice')
            wrf_ds = convert_calendar(wrf_ds, "noleap")
            obs_ds = convert_calendar(obs_ds, "noleap")

            print('{} data now processed to go into localization process'.format(var))
            if var == 'tas':
                adj_factors1, adj_ds1 = do_QDM(obs_ds, wrf_ds) # run QDM                
                # drop raw data variable (and unnecessary coordinates), and rename adjusted back to air temperature
                adj_ds1 = adj_ds1['Adjusted']
                adj_ds1.name = 'Adjusted Air Temperature at 2m'
                adj_ds1 = adj_ds1.squeeze()
                adj_ds1 = adj_ds1.reset_coords(names=['Lambert_Conformal','x','y','lakemask','landmask','lat','lon'], drop=True)
                print('QDM complete on {}'.format(var))

            elif var == 'dpts':
                adj_factors2, adj_ds2 = do_QDM(obs_ds, wrf_ds) # run QDM
                # drop raw data variable (and unnecessary coordinates), and rename adjusted back to dewpoint temperature
                adj_ds2 = adj_ds2['Adjusted']
                adj_ds2.name = 'Adjusted Dewpoint Temperature'
                adj_ds2 = adj_ds2.squeeze()
                adj_ds2 = adj_ds2.reset_coords(names=['Lambert_Conformal','x','y','lakemask','landmask','lat','lon'], drop=True)
                print('QDM complete on {}'.format(var))

        ## STEP 3 =================================================================================================
        # Merge individual variable arrays into one dataset
        merged_ds = xr.merge([adj_ds1, adj_ds2], compat='override')
        # merged_ds = merged_ds.sel(time=slice('1981-01-01', '2100-12-31')) # ensuring right length
        merged_ds.attrs['localization_version'] = 'v2_utc' # add a "localization version" attribute to tag the data
        print('\nds created')
        
        ## STEP 4 =================================================================================================
        # summary stats
        td_exceed_tas = merged_ds['Adjusted Dewpoint Temperature'] > merged_ds['Adjusted Air Temperature at 2m'] # number of instances where Td > T
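        # per-month exceedance counts, averaged across simulations (multi-model mean)
        counts_per_month_sim = td_exceed_tas.groupby('time.month').sum().mean(dim='simulation')
        # expressed as a percentage of ALL time steps in the record, not of the time steps in each month
        counts_per_month_sim_percent = (counts_per_month_sim.values / len(merged_ds['time'])) *100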
        stats_list.append(counts_per_month_sim_percent)
        
        ## STEP 5 =================================================================================================
        # export data
        filename = 'bc_tas_dpts_HadISD_{}_UTC.nc'.format(station_id)
        ck.export(merged_ds, filename, 'NetCDF')

        ## STEP 6 =================================================================================================
        # close dataset to save memory
        wrf_ds.close()
        obs_ds.close()
        merged_ds.close()
        print('All files closed ----------------------------------------------------------------------\n')
        
    stats_df = pd.DataFrame(stats_list)    
    return stats_df
        
## Takes approximately 25 minutes per station
## Total run time for all 71 stations: ~30 hours
stats_df = tas_dew_localize(stations_df, correct_tz=False)

In [None]:
# process and export summary stats dataframe
stats_df.to_csv('dewpt_exceed_tas_counts_station_localization.csv')
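
The statistic computed in Step 4 can be checked on synthetic data. This NumPy sketch (not the notebook's xarray code; all values are made up) reproduces the count-then-average logic and shows how the choice of denominator changes the reported percentage.

```python
import numpy as np

# One non-leap year of hourly timestamps (synthetic)
times = np.arange("2001-01-01T00", "2002-01-01T00", dtype="datetime64[h]")
months = times.astype("datetime64[M]").astype(int) % 12 + 1  # calendar month, 1..12

# Synthetic exceedance flags, shape (simulation, time):
# simulation 0 exceeds during every hour of January, simulation 1 never does
exceed = np.zeros((2, times.size), dtype=bool)
exceed[0, months == 1] = True

# Count per calendar month for each simulation, then average across simulations
counts = np.stack([exceed[:, months == m].sum(axis=1) for m in range(1, 13)], axis=1)
mean_counts = counts.mean(axis=0)

# Denominator used in the notebook: the total number of time steps
pct_of_all_steps = mean_counts / times.size * 100

# Alternative denominator: the number of time steps in each month
steps_per_month = np.array([(months == m).sum() for m in range(1, 13)])
pct_of_month = mean_counts / steps_per_month * 100
```

Here the January mean count is 372 hours, which is 50% of January's hours but only about 4.2% of all hours in the year, so the values in the exported CSV should be read with the total-record denominator in mind.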

**Optional**: Timezone check, ensuring that UTC timestamped data is consistently 1 hour apart

In [None]:
# read in a successfully run station
ds = xr.open_dataset('FILENAME.nc')

In [None]:
def t_freq(df):
    print('len of df: {}'.format(len(df)))
    df['t_delta'] = df['time'].diff().fillna(pd.Timedelta(0))
    df['hours_diff'] = df['t_delta']/np.timedelta64(1, 'h')
    
    return df['hours_diff'].value_counts()

df = ds.to_dataframe().reset_index()
t_freq(df)