## Temperature Density Profiles

This notebook is an early attempt to replicate the daily minimum and maximum temperature distribution profiles provided to us by the DFU. We calculate probability density functions (PDFs) for both the observed weather station data and the bias-corrected downscaled data available on the Cal-Adapt: Analytics Engine, and compare the two.

In [None]:
import numpy as np
import pandas as pd
import xarray as xr
import scipy.stats as stats
import calendar
import climakitae as ck

pd.options.plotting.backend = 'holoviews'

In [None]:
app = ck.Application()

### Step 1: Retrieve bias-corrected downscaled data for a station

First we'll read in some **bias-corrected station data**. For ease of reproducibility, we have pre-selected air temperature data for the Burbank-Glendale-Pasadena Airport station for 1985-2010. If you would like to modify these selections, or see how the data can be selected, uncomment the `app.select()` line in the cell below to pull up a panel that shows all of the data options.

In [None]:
app.location.data_type = "Station"
app.location.station=['Burbank-Glendale-Pasadena Airport']
app.selections.variable = "Air Temperature at 2m"
app.selections.units = "degF" 
app.selections.resolution = "3 km"
app.selections.time_slice = (1985, 2010)

# app.select()

In [None]:
bc_data = app.retrieve() # retrieves the bias-corrected data
bc_data # examine the dataset for information

In [None]:
bc_data = app.load(bc_data)

### Step 2: Retrieve the observed weather station data
Next we grab the observed station data itself, so that we can compare the raw weather station observations with the downscaled data at that station.

#### Step 2a: Identify the station data within the catalog
The station data is located in our catalog. We read in the `hadisd_stations.csv` file to identify the station of interest, then use the station's name to look up its ID and build the exact path to the data. The following code cells normally run "behind the scenes" when station data is retrieved, but here we reuse some of that code to illustrate how the observed station data is located and opened.

In [None]:
from climakitae.data_loaders import _preprocess_hadisd
import pkg_resources

stations = pkg_resources.resource_filename("climakitae", "data/hadisd_stations.csv")
stations_df = pd.read_csv(stations)

In [None]:
station_subset = stations_df.loc[stations_df["station"].isin(app.location.station)]
filepaths = [
    "s3://cadcat/tmp/hadisd/HadISD_{}.zarr".format(s_id)
    for s_id in station_subset["station id"]
]

In [None]:
# retrieve the data, and examine
station_ds = xr.open_mfdataset(
    filepaths,
    preprocess=_preprocess_hadisd,
    engine="zarr",
    consolidated=False,
    parallel=True,
    backend_kwargs=dict(storage_options={"anon": True}),
)

station_ds

#### Step 2b: Pre-process the observed station data
The observed station data covers a much longer time period, and the units are natively in Kelvin. Therefore, in order to compare to the bias-corrected data, we slice the observed data to match the time period of interest (1985 to 2010) and convert the units to degrees Fahrenheit.

In [None]:
# slice to match time frame
station_ds = station_ds.sel(time = slice('1985-01-01', '2010-12-31'))

# convert units: data is in K, need to convert to degF for comparison
station_ds = (station_ds - 273.15) * (9/5) + 32.0
station_ds.attrs['units'] = 'degF'
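
As a quick sanity check on the conversion formula used above, the sketch below applies it to two well-known reference points. The helper name `kelvin_to_degf` is introduced here for illustration only and is not part of `climakitae`.

```python
import numpy as np


def kelvin_to_degf(temp_k):
    """Convert Kelvin to degrees Fahrenheit (illustrative helper, same formula as above)."""
    return (temp_k - 273.15) * (9 / 5) + 32.0


# Freezing and boiling points of water, in Kelvin
check_k = np.array([273.15, 373.15])
check_f = kelvin_to_degf(check_k)

# Should be approximately 32 degF and 212 degF
print(check_f)
```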

### Step 3: Calculate daily min and max temperature distributions

#### Step 3a: Calculate daily min and max temperatures
As both the observed station data and the bias-corrected data are at an hourly scale, we need to calculate the daily minimum and maximum values. We do this below using xarray's built-in `resample` method, which groups the data into 1-day periods; taking the `max` or `min` of each group returns one value per day as a collapsed daily time series.

Note: the resampling may take 1-2 minutes, as it's doing a lot of work at this step!

In [None]:
# bias-corrected data
t2_dailymax = bc_data.resample(time="1D").max() # daily maximum from hourly data
t2_dailymin = bc_data.resample(time="1D").min() # daily minimum from hourly data

# observed station data
obs_dailymax = station_ds.resample(time="1D").max() # daily maximum from hourly data
obs_dailymin = station_ds.resample(time="1D").min() # daily minimum from hourly data
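
To see what `resample` followed by `max` or `min` returns, here is a minimal sketch on synthetic hourly data. The dataset, the variable name `t2`, and the sinusoidal daily cycle are all made up for illustration and are not part of the retrieved data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three days of synthetic hourly temperatures (degF) with a simple daily cycle
# that peaks at 15:00 (70 degF) and bottoms out at 03:00 (50 degF)
hours = pd.date_range("2000-02-01", periods=72, freq="h")
temps = 60 + 10 * np.sin(2 * np.pi * (np.arange(72) - 9) / 24)
demo_ds = xr.Dataset({"t2": ("time", temps)}, coords={"time": hours})

# Collapse each 1-day window to a single value
demo_dailymax = demo_ds.resample(time="1D").max()
demo_dailymin = demo_ds.resample(time="1D").min()

print(demo_dailymax["t2"].values)  # one maximum per day
print(demo_dailymin["t2"].values)  # one minimum per day
```

The 72 hourly values collapse to 3 daily values, which is the same reduction applied to `bc_data` and `station_ds` above.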

#### Step 3b: Calculate the probability density function for daily maximum and minimum temperature

We do this with the `stats.norm` function from the scipy library: we define a normal distribution from the mean and standard deviation of the data, then evaluate its `pdf` method at each temperature bin to obtain the probability density function. We've created a wrapper function, `data_pdf`, that does this for every available simulation. Because the observed station data has no simulation dimension (of course!), we also have a companion wrapper function, `obs_pdf`, to calculate the PDF for the observed station data.

In [None]:
def data_pdf(data, bins, ext):
    """PDF processing for bias-corrected data, with simulations"""
    
    # determines how many simulations we are working with
    num_sim = len(data.simulation.values)
    
    # collects one column of pdf values per simulation
    pdfs = {}
    
    # same process for every simulation
    for sim in range(num_sim):
        data_sim = data.isel(simulation=sim)
        data_sim_arr = data_sim.to_array() # converts to a data-array, as stats can only be calculated on a single array at a time
        data_sim_mean, data_sim_std = data_sim_arr.mean(), data_sim_arr.std() # calculates the mean, standard deviation
        data_sim_snd = stats.norm(data_sim_mean.values, data_sim_std.values) # defines normal distribution using mean and std. deviation
        data_pdf_arr = data_sim_snd.pdf(bins) # evaluates the pdf at each bin
        pdfs[str(data_sim.simulation.values) + "_" + str(ext)] = data_pdf_arr # keyed by simulation name and max/min extension
    
    # sets-up dataframe of pdf values, for easy plotting and export
    return pd.DataFrame(pdfs)

In [None]:
def obs_pdf(obs_ds, bins, ext):
    """PDF processing for observational data, no simulations"""
    data_arr = obs_ds.to_array()
    data_mean, data_std = data_arr.mean(), data_arr.std()
    data_snd = stats.norm(data_mean.values, data_std.values)
    data_pdf_arr = data_snd.pdf(bins)
    
    df = pd.DataFrame(data = data_pdf_arr, columns = ["obs_" + str(ext)])
    
    return df
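
Both wrapper functions describe the data with just two numbers, the mean and the standard deviation, and then evaluate the corresponding normal curve on the bins. The sketch below walks through those same steps on synthetic data; the random sample and its parameters (68 degF mean, 6 degF spread) are assumptions made up for illustration.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in for one month of daily maximum temperatures (degF)
rng = np.random.default_rng(0)
sample = rng.normal(loc=68.0, scale=6.0, size=500)

# Same steps as the wrapper functions: define the normal distribution from
# the sample mean and standard deviation, then evaluate its pdf on the bins
demo_bins = np.arange(20, 121, 1)
fitted = stats.norm(sample.mean(), sample.std())
demo_pdf = fitted.pdf(demo_bins)

# With 1 degF spacing, the densities sum to roughly 1 as long as the bins
# cover essentially the whole distribution
print(demo_pdf.sum())

# The curve peaks at the bin closest to the sample mean
print(demo_bins[demo_pdf.argmax()], round(sample.mean(), 2))
```

Because the curve is fully determined by the mean and standard deviation, any skew or multiple peaks in the underlying data will not appear in the resulting PDF.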

Next we set up the bins over which to calculate the PDF. We are interested in the range between 20°F and 120°F, at a 1°F interval. Because `np.arange` excludes its stop value, we add 1 to the high end of the range so that the last bin is 120 (and not 119).

In [None]:
lowest_temp = 20
highest_temp = 120
bins = np.arange(lowest_temp, highest_temp+1, 1)
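
The effect of the `+1` can be checked directly. This small sketch compares the two calls side by side; the variable names are introduced here for illustration.

```python
import numpy as np

without_plus_one = np.arange(20, 120, 1)     # stop value 120 is excluded
with_plus_one = np.arange(20, 120 + 1, 1)    # stop value 121 is excluded, so 120 is kept

print(without_plus_one[-1])  # 119
print(with_plus_one[-1])     # 120
print(len(with_plus_one))    # 101 bins, from 20 to 120 inclusive
```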

Now we calculate the PDF for a specific month. First, we need to grab just the data for that month using the `grab_months` function. Be sure to pass the month as a number (Jan=1, Dec=12). We use February (month=2) as an example here, but you can change the month to any of your choosing.

In [None]:
def grab_months(data, month):
    """Grabs the specific month of interest and returns a Dataset of all years for that month.
    Month must be passed as a number (Jan=1, Dec=12)"""
    data_months = data.groupby('time.month').groups
    month_idxs = data_months[month]
    return data.isel(time=month_idxs)
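
For comparison, month selection can also be written as a boolean mask on the time coordinate, which should pick out the same timesteps as the `groupby` approach. The sketch below uses a synthetic one-year daily dataset; the dataset and the variable name `t2` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# One synthetic year of daily values (2001 is not a leap year);
# each value is simply the day's position in the year, starting at 0
days = pd.date_range("2001-01-01", "2001-12-31", freq="D")
demo_daily = xr.Dataset(
    {"t2": ("time", np.arange(days.size, dtype=float))},
    coords={"time": days},
)

# Boolean mask: True wherever the timestamp falls in February
feb_mask = demo_daily["time"].dt.month == 2
demo_feb = demo_daily.sel(time=feb_mask)

print(demo_feb.sizes["time"])  # 28 days in February 2001
```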

In [None]:
month = 2 # default of February

# bias-corrected data
t2_dailymax_monthly = grab_months(t2_dailymax, month=month)
t2_dailymin_monthly = grab_months(t2_dailymin, month=month)

# observed station data
obs_dailymax_monthly = grab_months(obs_dailymax, month=month)
obs_dailymin_monthly = grab_months(obs_dailymin, month=month)

Calculate the daily PDFs for that month below. 

In [None]:
# bias-corrected data
maxtemp_pdf = data_pdf(t2_dailymax_monthly, bins=bins, ext='max')
mintemp_pdf = data_pdf(t2_dailymin_monthly, bins=bins, ext='min')

# observed station data
obs_maxtemp_pdf = obs_pdf(obs_dailymax_monthly, bins=bins, ext='max')
obs_mintemp_pdf = obs_pdf(obs_dailymin_monthly, bins=bins, ext='min')

Combine the dataframes together so that they are all in a single location, and can be easily visualized and exported to a .csv file. 

In [None]:
bins_df = pd.DataFrame(data=bins, columns=['Temperature'])
df_obs_bc = pd.concat([bins_df, # temperature bins ranging between 20-120
                       obs_maxtemp_pdf, obs_mintemp_pdf, # observed max and min temp
                       maxtemp_pdf, mintemp_pdf, # bias-corrected downscaled data
                      ], axis=1, join="inner")
df_obs_bc = df_obs_bc.set_index('Temperature')
df_obs_bc.head()
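
The side-by-side concatenation works because every PDF dataframe was evaluated on the same bins and therefore shares the same default row index. Here is a minimal sketch of that alignment with placeholder values; the column names `obs_max` and `sim_a_max` and the numbers themselves are made up for illustration.

```python
import numpy as np
import pandas as pd

demo_bins = np.arange(20, 25, 1)
demo_bins_df = pd.DataFrame(data=demo_bins, columns=["Temperature"])

# Placeholder density values, one row per bin
demo_obs = pd.DataFrame({"obs_max": np.linspace(0.01, 0.05, 5)})
demo_sim = pd.DataFrame({"sim_a_max": np.linspace(0.02, 0.06, 5)})

# axis=1 places the frames side by side, aligned on the shared 0..4 row index;
# join="inner" keeps only the row labels present in every frame
demo_combined = pd.concat([demo_bins_df, demo_obs, demo_sim], axis=1, join="inner")
demo_combined = demo_combined.set_index("Temperature")

print(demo_combined)
```

If one frame had been computed on a different number of bins, `join="inner"` would silently drop the unmatched rows, so it is worth confirming that the combined frame has as many rows as there are bins.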

We'll also export the dataframe of PDF values to a .csv file. It contains the temperature bins, the observed maximum and minimum PDFs, and the maximum and minimum PDFs for each simulation.

In [None]:
filename = "temperature_pdfs_{0}_{1}.csv".format(app.location.station[0].replace(" ", "_"), calendar.month_abbr[month]).lower()
df_obs_bc.to_csv(filename, index=True)
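
For reference, the sketch below shows the filename that the pattern above produces for the default station and month, and how an exported file can be read back with `Temperature` as the index. It writes to an in-memory buffer rather than to disk, and the two-row dataframe is a placeholder for illustration.

```python
import calendar
import io

import pandas as pd

# Same filename pattern as above, with the default station and month
station_name = "Burbank-Glendale-Pasadena Airport"
demo_month = 2
demo_filename = "temperature_pdfs_{0}_{1}.csv".format(
    station_name.replace(" ", "_"), calendar.month_abbr[demo_month]
).lower()
print(demo_filename)

# Round trip through an in-memory buffer instead of a file on disk
demo_df = pd.DataFrame(
    {"obs_max": [0.01, 0.02]},
    index=pd.Index([20, 21], name="Temperature"),
)
buffer = io.StringIO()
demo_df.to_csv(buffer, index=True)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Temperature")
print(restored)
```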

### Step 4: Visualize the results
We now plot the PDFs of daily maximum and minimum temperature for the selected month across all years. Remember, here we are using data from 1985-2010 as our baseline and displaying the results for February, but you can choose any month above! Play around with different months to see how the PDFs vary.

This plotting code cell may take 1-2 minutes to run -- hang tight!

In [None]:
df_obs_bc.plot(xlabel="Temperature (degF)",
                grid=True, # adds gridlines for easier interpretation
                title="PDFs for " + str(app.location.station[0]) + "\n" + calendar.month_name[month], # detailed title with station and month
               )
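
To complement the plot with numbers, the temperature at which each curve peaks can be read off with `idxmax`, since the dataframe is indexed by temperature. The sketch below uses two made-up normal curves as stand-ins for columns of `df_obs_bc`; their means and spreads are assumptions for illustration, not results.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

demo_bins = np.arange(20, 121, 1)

# Two made-up curves standing in for columns of the combined dataframe
demo_pdfs = pd.DataFrame(
    {
        "obs_max": stats.norm(68.2, 6.0).pdf(demo_bins),
        "obs_min": stats.norm(45.7, 5.0).pdf(demo_bins),
    },
    index=pd.Index(demo_bins, name="Temperature"),
)

# For each column, the index label (temperature bin) with the highest density
peak_temps = demo_pdfs.idxmax()
print(peak_temps)
```

Applied to `df_obs_bc`, the same call would give one peak temperature per column, which makes it easier to compare the observed curves against each simulation.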