# Using the Analytics Engine (AE) to produce heating and cooling degree days
This notebook reproduces the workflow that the CEC's Demand Forecast Unit uses to generate weather and climate information for its annual consumption model. The existing workflow is replicated here, but connected to new data from California's Fifth Climate Change Assessment.

To execute a given 'cell' of this notebook, place the cursor in the cell and press the 'play' icon, or simply press shift+enter together. Some cells will take longer to run, and you will see a [$\ast$] to the left of the cell while AE is still working.

**Intended Application**: As a user, I want to **<span style="color:#FF0000">generate heating and cooling degree days</span>** as input for an annual energy consumption model by:
1. Understanding and visualizing the difference between data at the weather station and data aggregated across the demand zone
2. Computing and visualizing trends in heating and cooling degree days with a flexible threshold
3. Computing and visualizing trends in heating and cooling degree hours with a flexible threshold

**Runtime**: With the default settings, this notebook takes approximately **11 minutes** to run from start to finish. Modifications to selections may increase the runtime. 

## Step 0: Setup
First, we'll import the general Python libraries required to run the notebook. We'll also import the Python library climakitae, our AE toolkit for climate data analysis, along with specific functions that we'll use within this notebook. 

In [None]:
import numpy as np
import pandas as pd
import xarray as xr
import hvplot.pandas

import climakitae as ck
import climakitaegui as ckg
from climakitae.core.data_interface import get_data
from climakitae.util.utils import (compute_annual_aggreggate, trendline, 
                                   compute_multimodel_stats, combine_hdd_cdd)
from climakitae.tools.derived_variables import compute_hdd_cdd, compute_hdh_cdh
from climakitaegui.util.utils import (hdd_cdd_lineplot, hdh_cdh_lineplot)

import warnings
warnings.filterwarnings("ignore")

## Step 1: Get data from the closest grid cell to the weather station
As an example, to replicate the historical observations at Sacramento Executive Airport, we grab the model grid cell nearest to the airport. It is **critical** to note that the station-based grid cell data we are retrieving is **bias-adjusted**. The gridded data that we retrieve in later steps is **not bias-adjusted**, so results based on it should be interpreted with care.

### 1a) Read in the data 
Here, we'll use the `get_data` function to load the data.

In [None]:
data_at_station = get_data(
    variable = "Air Temperature at 2m", 
    resolution = "3 km",
    timescale = "hourly",
    data_type = "Stations", # Retrieve the single grid cell closest to the weather station 
    stations = "Sacramento Executive Airport (KSAC)",
    units = "degF", 
    time_slice = (2005, 2025), 
    scenario = ["Historical Climate", "SSP 3-7.0"]
)
data_at_station

Because the dynamically downscaled WRF data in the Cal-Adapt: Analytics Engine is in UTC, we'll convert to the time zone of the station we've selected. This is particularly important for determining the timing of the daily maximum and minimum temperatures. For a station located in Pacific Time (US), UTC time places the daily minimum "in" the day prior because UTC is 8 hours ahead of Pacific Standard Time! Note that the fixed 8-hour shift used below does not account for daylight saving time.

In [None]:
data_at_station["time"] = data_at_station["time"] - pd.Timedelta(hours=8)
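
The effect of the time shift can be illustrated on a small synthetic index. This is a minimal sketch using pandas only; the timestamps are made up, and the DST-aware conversion with the `America/Los_Angeles` zone name is shown as an optional alternative, not something this notebook does.

```python
import pandas as pd

# Two days of synthetic hourly timestamps, labeled in UTC (illustrative only)
utc_times = pd.date_range("2020-01-01 00:00", periods=48, freq="h")

# Fixed offset, as used in this notebook: UTC minus 8 hours (Pacific Standard Time)
fixed_offset_times = utc_times - pd.Timedelta(hours=8)

# Alternative (an assumption, not part of this workflow): DST-aware conversion
dst_aware_times = utc_times.tz_localize("UTC").tz_convert("America/Los_Angeles")
summer_local = pd.Timestamp("2020-07-01 12:00", tz="UTC").tz_convert("America/Los_Angeles")

# 02:00 UTC on Jan 2 corresponds to 18:00 on Jan 1 in local standard time,
# so the same hour belongs to a different calendar day
print(utc_times[26], "->", fixed_offset_times[26])

# In summer the true offset is 7 hours, so 12:00 UTC is 05:00 local time
print(summer_local)
```

If daylight saving time matters for your application, the DST-aware conversion may be worth considering; the fixed offset keeps every day exactly 24 hours long, which is convenient for daily resampling.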

### 1b) Load the data into memory
This may take some time, because the data has to be loaded into memory and then subset to the closest grid cell. Everything we've requested before this step is only actually computed now; until this point, we have just seen a preview of the data. Because of this, **we recommend running this notebook in the Analytics Engine's Jupyter Hub, which provides additional computational resources that greatly speed up this step.**

In [None]:
data_at_station = ck.load(data_at_station)
data_at_station

### 1c) Read in a csv file of the station coordinates
We'll use the Sacramento Executive Airport here as an example. Make sure the filepath to the csv file matches the correct location on your computer. This file will be read into the notebook as a pandas DataFrame object. We'll use it in plotting below.

In [None]:
stations_df = pd.read_csv("data/CEC_Forecast_Weather_Stations_California.csv", index_col="STATION")
stations_df.head(5) # Display the first 5 rows

In [None]:
station_name = "SACRAMENTO METROPOLITAN AP"
one_station = stations_df.loc[station_name]

In [None]:
one_station

### 1d) Output final data product as a csv file
We'll drop all unneeded coordinates and convert our xarray Dataset to a pandas DataFrame, allowing us to easily output the final data product to a csv file. In the output table, the first column is the time (shifted from UTC to Pacific Standard Time in Step 1a), and the second column lists the global climate models (which can be filtered in Excel or with Python code in the notebook). The remaining columns are the variables selected at the beginning of the notebook.

In [None]:
data_at_station_df = data_at_station.isel(scenario=0).drop(["scenario"]).to_dataframe()
data_at_station_df.head()

In [None]:
filename = "hourly_data_at_station_{0}.csv".format(station_name.replace(" ", "_")).lower()
data_at_station_df.to_csv(filename, index=True)
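
Filtering the exported table by model in Python might look like the sketch below. It assumes the table has a `(time, simulation)` MultiIndex; the index level names and the simulation labels `model_a` and `model_b` are hypothetical, so check `data_at_station_df.index.names` for the real ones.

```python
import pandas as pd

# Synthetic stand-in for the exported table: a (time, simulation) MultiIndex
# with one temperature column. The simulation names here are made up.
times = pd.date_range("2020-01-01", periods=3, freq="h")
index = pd.MultiIndex.from_product(
    [times, ["model_a", "model_b"]], names=["time", "simulation"]
)
example_df = pd.DataFrame(
    {"Air Temperature at 2m": [50.0, 51.0, 52.0, 53.0, 54.0, 55.0]}, index=index
)

# Keep only the rows belonging to one simulation
one_model_df = example_df.xs("model_a", level="simulation")
print(one_model_df)
```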

## Step 2: Get data from across the demand forecast zone

As an alternative to a single point, we can instead consider weather conditions across an entire forecast zone. In this example, we calculate the median of all conditions across the Sacramento Municipal Utility District. We have pre-loaded data selections for gridded air temperature within the Sacramento Municipal Utility District for 2005-2025. Feel free to make modifications for your own workflows. <br>
**Reminder**: The gridded data we will be retrieving in this step and using throughout this notebook is not bias-corrected.

In [None]:
data_dfz = get_data(
    variable = "Air Temperature at 2m", 
    resolution = "3 km",
    timescale = "hourly",
    data_type = "Gridded", 
    cached_area = "SMUD Service Territory", # Retrieve all cells over this region 
    units = "degF", 
    time_slice = (2005, 2025), 
    scenario = ["Historical Climate", "SSP 3-7.0"]
)

In [None]:
data_dfz["time"] = data_dfz["time"] - pd.Timedelta(hours=8)

## Step 3: Compute the median value of the grid cells in the station's corresponding forecast zone

In this example, we will visualize the data across the Demand Forecast Zone for the Sacramento Municipal Utility District, and then calculate the median across all grid cells within it.

### 3a) Visualize both the Demand Forecast Zone and the weather station on the same map 

For simplicity's sake, we'll show just the first 13 hourly time steps of data (a 12-hour span). In the outputted map, you can see that our data contains multiple simulation options as well, which you can toggle between in the map's dropdown.

In [None]:
# Used to add weather station as star to map 
point_df = pd.DataFrame({
    "longitude (degrees_east)":[one_station.LON_X],
    "latitude (degrees_north)":[one_station.LAT_Y],
    "weather station": station_name
})

# Grab subset of data and load into memory 
to_plot = data_dfz.isel(time = np.arange(0,13))
to_plot = ck.load(to_plot)

In [None]:
ckg.view(to_plot) * point_df.hvplot.points(
    hover_cols = ["weather station"], 
    marker = "star", size = 300, color = "black"
)

### 3b) Aggregate values across grid cells in the forecast zone 
**Choose your aggregation: median, mean, min, or max.** All can be easily computed with just one line of code, thanks to xarray. You could also write your own code to compute a weighted mean. 

In [None]:
data_dfz_aggregated = data_dfz.median(dim=["x","y"])
# data_dfz_aggregated = data_dfz.mean(dim=["x","y"])
# data_dfz_aggregated = data_dfz.min(dim=["x","y"])
# data_dfz_aggregated = data_dfz.max(dim=["x","y"])
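
A weighted mean could be sketched as follows. The example uses numpy on a tiny synthetic grid; the weights (imagined here as each cell's share of the zone's population) are purely hypothetical and are not part of the AE data. On the real data, xarray's `.weighted(...)` method offers a similar operation.

```python
import numpy as np

# Synthetic temperatures (degF) on a 2 x 2 grid for a single time step
temps = np.array([[60.0, 62.0],
                  [64.0, 70.0]])

# Hypothetical weights, e.g. the fraction of the zone's population in each cell
weights = np.array([[0.1, 0.2],
                    [0.3, 0.4]])

simple_mean = temps.mean()
weighted_mean = np.average(temps, weights=weights)
print(simple_mean, weighted_mean)
```

Here the unweighted mean is 64.0 degF, while the weighted mean is pulled toward the warmer, more heavily weighted cells (about 65.6 degF).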

Finally, let's load this final data product into memory. 

In [None]:
data_dfz_aggregated = ck.load(data_dfz_aggregated)

### 3c) Output final data product as a csv file
We'll drop all unneeded coordinates and convert our xarray Dataset to a pandas Dataframe, allowing us to easily output the final data product to a csv file. 

In [None]:
dfz_aggregated_df = data_dfz_aggregated.isel(scenario=0).drop(
    ["scenario","Lambert_Conformal"]).to_dataframe()
dfz_aggregated_df.head()

In [None]:
filename = "dfz_aggregated_{0}.csv".format(station_name.replace(" ", "_").lower())
dfz_aggregated_df.to_csv(filename, index=True)

## Step 4: Compute heating degree days and cooling degree days
Degree days are [based on the assumption](https://www.weather.gov/key/climate_heat_cool) that when the outside temperature is 65°F, we don't need heating or cooling to be comfortable. However, you may wish to use a different threshold than 65°F: for example, you might want to assume that temperatures between 60-70°F won't require heating or cooling, and use degrees below 60°F for HDD and degrees above 70°F for CDD. <br><br> In the code below, a heating degree day (HDD) is calculated by computing how many degrees Fahrenheit **colder** the daily temperature is than a specified temperature threshold. A cooling degree day (CDD) is calculated by computing how many degrees **warmer** the daily temperature is than a specified temperature threshold. In the computation below, you can provide different thresholds for HDD and CDD based on your needs. 

### 4a) Decide which input data you want to use 

You can use the data within the demand forecast zone, which we retrieved in **Step 2**. Or, you can use the closest grid cell to the weather station, which we loaded in **step 1b**. We do not recommend using the aggregated DFZ data calculated in step 3b, as aggregating the data prior to computing HDD and CDD may remove some critical information about the weather extremes. We've chosen to show the analysis with the DFZ data; to use the closest grid cell data instead, comment out the DFZ lines and uncomment the closest grid cell lines. Note that the closest grid cell resampling will take a few minutes! <br><br>Depending on the input data, we also set a variable defining the number of grid cells. This is just 1 for the closest grid cell method; for the DFZ data, the value depends on the size of the DFZ before aggregating. This information is used to compute the annual aggregate HDD and CDD in step 4c. Lastly, we provide two options for determining the [daily mean](https://www.weather.gov/key/climate_heat_cool) against which we calculate degree days. 

In [None]:
# ALL DATA WITHIN DFZ ZONE
data_to_use = data_dfz
num_grid_cells = data_dfz.x.size * data_dfz.y.size # Number of grid cells within the demand forecast region

## CLOSEST GRID CELL 
# data_to_use = data_at_station.to_array(name='Air Temperature at 2m').squeeze()
# num_grid_cells = 1

In [None]:
# METHOD 1: Using daily max/min difference
daily_max = data_to_use.resample(time='1D').max()
daily_min = data_to_use.resample(time='1D').min()
data_to_use = (daily_max + daily_min)/2

## METHOD 2: Using daily mean
# data_to_use = data_to_use.resample(time='1D').mean()
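
The two daily-mean methods can give noticeably different answers when the temperature curve is asymmetric. The sketch below uses pandas on one synthetic day of hourly values (made up for illustration) to compare them.

```python
import pandas as pd

# One synthetic day of hourly temperatures (degF): cool for most of the day,
# with a short four-hour warm spike
hours = pd.date_range("2020-07-01", periods=24, freq="h")
temps = pd.Series([60.0] * 20 + [90.0] * 4, index=hours)

# Method 1: midpoint of the daily maximum and minimum
midpoint = (temps.resample("1D").max() + temps.resample("1D").min()) / 2

# Method 2: mean of all 24 hourly values
hourly_mean = temps.resample("1D").mean()

print(midpoint.iloc[0], hourly_mean.iloc[0])
```

For this day the max/min midpoint is 75 degF while the hourly mean is 65 degF, so with a 65 degF threshold Method 1 yields 10 cooling degree days and Method 2 yields none.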

### 4b) Compute HDD and CDD 
We'll use the climakitae helper function `compute_hdd_cdd` to compute both heating and cooling degree days. Its arguments `hdd_threshold` and `cdd_threshold` accept any thresholds of your choosing. In the example below, we calculate both HDD and CDD with a threshold of 65 degF; to use a comfort band instead, you could set, for example, `hdd_threshold=60` and `cdd_threshold=70`. The function performs the following calculations:<br><br>
**HDD = threshold - temperature<br>
CDD = (-1)\*(threshold - temperature)**<br><br>
For HDD, we subtract the 2m temperature from the selected threshold, then set any negative value to 0. For CDD, we do the same subtraction, multiply by -1 so that temperatures above the threshold become positive, and then set any remaining negative values to 0. For example, on a day of 80°F with a `cdd_threshold` of 70°F, threshold - temperature = 70 - 80 = -10; multiplying by -1 gives the true CDD value of 10.

In [None]:
#help(compute_hdd_cdd) # See information about the function

In [None]:
hdd, cdd = compute_hdd_cdd(data_to_use, hdd_threshold=65, cdd_threshold=65) # Set for all data within selected DFZ zone
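
The arithmetic described above can be checked by hand on a few values. This is a minimal numpy sketch of the stated formulas on synthetic temperatures; it is not the internal implementation of `compute_hdd_cdd`.

```python
import numpy as np

# Synthetic daily mean temperatures (degF)
daily_temps = np.array([50.0, 65.0, 80.0])
hdd_threshold = 65.0
cdd_threshold = 65.0

# HDD: degrees below the threshold, with negative values set to 0
hdd_example = np.clip(hdd_threshold - daily_temps, 0, None)

# CDD: degrees above the threshold, with negative values set to 0
# (temperature - threshold is the same as -1 * (threshold - temperature))
cdd_example = np.clip(daily_temps - cdd_threshold, 0, None)

print(hdd_example)
print(cdd_example)
```

The 50 degF day contributes 15 HDD and no CDD, the 80 degF day contributes 15 CDD and no HDD, and the 65 degF day contributes nothing to either.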

Now that we have computed the HDD and CDD, we can aggregate the results across grid cells in the forecast zone as we did above. We will need to do this for both the HDD and CDD variables. If you would like to change the aggregation method, you can easily switch between **median, mean, min, or max**, or write your own code to compute a weighted mean here too. We will use the *median* as an example here. Note that because we are aggregating here, the data is reduced to a single value per time step, so we set the number of grid cells to 1. 

Please note that this next step is **not required** if you selected the closest grid cell to the station instead of all data across the DFZ. 

In [None]:
# only for all data within DFZ zone, not station data
hdd = hdd.median(dim=["x","y"])
cdd = cdd.median(dim=["x","y"])
num_grid_cells = 1
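
Step 4a recommends computing degree days on each grid cell before aggregating. A small numpy sketch on synthetic values shows why the order can matter; it uses the mean as the aggregation, where the effect is easy to see, and the size of the effect will depend on the aggregation method and the data.

```python
import numpy as np

# Synthetic daily mean temperatures (degF) in three grid cells on one day
cell_temps = np.array([50.0, 65.0, 83.0])
threshold = 65.0

# Option A: compute HDD in each cell, then average across cells
hdd_then_mean = np.clip(threshold - cell_temps, 0, None).mean()

# Option B: average the temperatures first, then compute HDD
mean_then_hdd = max(threshold - cell_temps.mean(), 0.0)

print(hdd_then_mean, mean_then_hdd)
```

Option A reports 5 HDD because one cell is genuinely cold, while Option B reports 0 because the warm cell cancels the cold one before the threshold is applied.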

### 4c) Aggregate annually to find HDD and CDD per year
To do this, we will first group the data by year and compute a sum across space and time. Then, we will divide the annual aggregated data by the number of grid cells over which the sum was computed. 

In [None]:
hdd_annual = compute_annual_aggreggate(
    data=hdd, 
    name="Annual Heating Degree Days (HDD)", 
    num_grid_cells=num_grid_cells
)
cdd_annual = compute_annual_aggreggate(
    data=cdd, 
    name="Annual Cooling Degree Days (CDD)", 
    num_grid_cells=num_grid_cells
)
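
The group-by-year logic described above can be sketched with pandas. This illustrates the idea on synthetic values only and is not the implementation of the climakitae helper; the variable names are hypothetical.

```python
import pandas as pd

# Synthetic daily HDD values already summed over 4 grid cells (illustrative)
days = pd.date_range("2020-12-30", periods=4, freq="D")
daily_hdd_sum = pd.Series([40.0, 20.0, 8.0, 12.0], index=days)
n_cells = 4

# Sum within each calendar year, then divide by the number of grid cells
annual_hdd = daily_hdd_sum.groupby(daily_hdd_sum.index.year).sum() / n_cells
print(annual_hdd)
```

The two December days fall in 2020 (60 / 4 = 15) and the two January days fall in 2021 (20 / 4 = 5).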

### 4d) Compute the multimodel mean, min, and max. 
We'll add these statistics to our main datasets, `hdd_annual` and `cdd_annual`, so they can be easily accessed for plotting.

In [None]:
hdd_annual = compute_multimodel_stats(hdd_annual)
cdd_annual = compute_multimodel_stats(cdd_annual)
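
Conceptually, the multimodel statistics summarize each year across simulations. The pandas sketch below shows that idea on made-up numbers with hypothetical simulation names; the actual helper operates on the xarray object and may differ in detail.

```python
import pandas as pd

# Synthetic annual HDD for three made-up simulations (one column each)
annual_by_sim = pd.DataFrame(
    {"sim_1": [2500.0, 2400.0], "sim_2": [2600.0, 2450.0], "sim_3": [2700.0, 2350.0]},
    index=[2020, 2021],
)

# Statistics across simulations (columns) for each year (row)
stats = pd.DataFrame({
    "mean": annual_by_sim.mean(axis=1),
    "min": annual_by_sim.min(axis=1),
    "max": annual_by_sim.max(axis=1),
})
print(stats)
```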

### 4e) Compute a trendline using the mean of all simulations
We'll find the coefficients for a first degree (linear) polynomial using [numpy's `polyfit` function](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html). The returned coefficients (**m** and **b**) allow us to compute the trendline using the linear polynomial y = mx + b, where **y** is the trendline and **x** is the years. 

In [None]:
hdd_trendline = trendline(hdd_annual, kind='mean') 
cdd_trendline = trendline(cdd_annual, kind='mean') 
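
The linear fit can be reproduced directly with `numpy.polyfit`. The sketch below uses synthetic annual values that decline by exactly 10 per year, so the fitted slope is easy to verify; the data and variable names are illustrative only.

```python
import numpy as np

# Synthetic annual values that fall by exactly 10 per year (illustrative)
years = np.array([2005, 2006, 2007, 2008, 2009])
annual_values = np.array([2500.0, 2490.0, 2480.0, 2470.0, 2460.0])

# Fit a first-degree polynomial; polyfit returns the slope, then the intercept
m, b = np.polyfit(years, annual_values, 1)

# Evaluate the trendline y = m*x + b at each year
trend = m * years + b
print(round(m, 3))
```

The slope **m** is the change in annual degree days per year, which is usually the number of interest when describing a trend.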

### 4f) Visualize the results
Using the python package *hvplot*, we can easily make a line plot of the annual aggregated data. To do this, we'll plot the annual HDD, then add the trendline on top. The code to generate the plot is contained in a function `hdd_cdd_lineplot`. 

Please note, the gridded data is not currently bias-corrected. As a result of this, the minimum or maximum timeseries could reflect a single simulation that is biased high or low compared to others. You can toggle lines on and off in the plots below by clicking on the name in the legend. 

In [None]:
hdd_cdd_lineplot(
    annual_data = hdd_annual, 
    trendline = hdd_trendline, 
    title = "Annual Aggregate Heating Degree Days"
)

In [None]:
hdd_cdd_lineplot(
    annual_data = cdd_annual, 
    trendline = cdd_trendline, 
    title = "Annual Aggregate Cooling Degree Days"
)

### 4g) Output data as csv files
We'll drop all unneeded coordinates and convert our xarray Dataset to a pandas Dataframe, allowing us to easily output the final data product to a csv file. 

In [None]:
# Merge and simplify data 
hdd_cdd_combined = xr.merge([combine_hdd_cdd(hdd_annual), combine_hdd_cdd(cdd_annual)])
hdd_cdd_combined = ck.load(hdd_cdd_combined)

# Convert to pandas dataframe 
hdd_cdd_df = hdd_cdd_combined.to_dataframe()
hdd_cdd_df.head()

In [None]:
filename = "annual_hdd_cdd_{0}.csv".format(station_name.replace(" ", "_").lower())
hdd_cdd_df.to_csv(filename, index=True)

## Step 5: Compute heating degree hours and cooling degree hours
Alternatively, you may be interested in the number of hours in each day during which the temperature crosses a designated heating or cooling threshold. Cooling Degree Hours (CDH) is the number of hours in which the hourly temperature exceeds the cooling threshold. Likewise, Heating Degree Hours (HDH) is the number of hours in which the hourly temperature is below the heating threshold. We'll use the helper function `compute_hdh_cdh` to calculate HDH and CDH:<br><br>
**CDH = num of hours where (temperature $>$ threshold)<br>
HDH = num of hours where (temperature $<$ threshold)**<br><br>
We will display the results to see how trends change throughout the year. 

### 5a) Compute HDH and CDH
Like the CDD and HDD examples above, we'll use all of the data for our selected DFZ zone to calculate CDH and HDH. Note that we've added an attribute to the data to retain the threshold used to compute the data here. If you forget, look at the attributes of CDH or HDH. 

In [None]:
#help(compute_hdh_cdh) # See information about the function

In [None]:
data_to_use = data_dfz # reset to hourly data
hdh, cdh = compute_hdh_cdh(data_to_use, hdh_threshold=60, cdh_threshold=70) # Set for all data within selected DFZ zone
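
The hour-counting definition given above can be sketched with pandas on one synthetic day. This follows the formulas as stated in this notebook and is not taken from the source of `compute_hdh_cdh`; the temperatures are made up.

```python
import pandas as pd

# One synthetic day of hourly temperatures (degF)
hours = pd.date_range("2011-06-01", periods=24, freq="h")
hourly_temps = pd.Series([55.0] * 6 + [65.0] * 8 + [75.0] * 10, index=hours)

# Count the hours per day above the cooling threshold and below the heating threshold
cdh_example = (hourly_temps > 70).resample("1D").sum()
hdh_example = (hourly_temps < 60).resample("1D").sum()

print(int(cdh_example.iloc[0]), int(hdh_example.iloc[0]))
```

On this day, 10 hours exceed 70 degF and 6 hours fall below 60 degF; the 8 hours at 65 degF lie inside the comfort band and count toward neither.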

In [None]:
# only for all data within DFZ zone, not station data
hdh = hdh.median(dim=["x","y"])
cdh = cdh.median(dim=["x","y"])

### 5b) Display a month of CDH and HDH
Next, we'll plot specific periods of the overall timeseries produced by the CDH and HDH calculation to see the trend in degree hours. We'll use a helper plotting function and pass it the period of interest. As an example, we'll look at June 2011, but you can select any date range; we provide examples for plotting a specific month and a specific year below.

In [None]:
data_one_month = cdh.sel(time="June 2011")
hdh_cdh_lineplot(data_one_month)

In [None]:
data_one_month = hdh.sel(time="June 2011")
hdh_cdh_lineplot(data_one_month)

Alternatively, it may be useful to visualize a specific year to see the trends over time. We'll do this for 2021 as an example below with Cooling Degree Hours. 

In [None]:
data_one_year = cdh.sel(time="2021")
hdh_cdh_lineplot(data_one_year)
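
Selecting a month or a year by a partial date string works much the same way in pandas as in the xarray `.sel(time=...)` calls above. The sketch below demonstrates it on a synthetic daily series using ISO-style strings; the series itself is illustrative only.

```python
import pandas as pd

# Synthetic daily series spanning two years (values are just a running count)
days = pd.date_range("2011-01-01", "2012-12-31", freq="D")
daily_series = pd.Series(range(len(days)), index=days)

# Partial-string selection: every day in June 2011, then every day in 2012
june_2011 = daily_series.loc["2011-06"]
year_2012 = daily_series.loc["2012"]

print(len(june_2011), len(year_2012))
```

The month selection returns 30 daily values and the year selection returns 366, since 2012 is a leap year.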

### 5c) Output data as csv files
We'll drop all unneeded coordinates and convert our xarray Dataset to a pandas Dataframe, allowing us to easily output the final data product to a csv file. 

In [None]:
# Merge and simplify data 
hdh_cdh_combined = xr.merge([combine_hdd_cdd(hdh), combine_hdd_cdd(cdh)])
hdh_cdh_combined = ck.load(hdh_cdh_combined) 

# Convert to pandas dataframe 
hdh_cdh_df = hdh_cdh_combined.to_dataframe()
hdh_cdh_df.head()

In [None]:
filename = "daily_hdh_cdh_{0}.csv".format(station_name.replace(" ", "_").lower())
hdh_cdh_df.to_csv(filename, index=True)