# Using the Analytics Engine (AE) to reproduce the annual consumption model
This notebook is an early draft that reproduces the workflow the CEC's Demand Forecast Unit uses to generate weather and climate information for the annual consumption model. It replicates the existing workflow, but connects it to new data from California's Fifth Climate Change Assessment.

To execute a given 'cell' of this notebook, place the cursor in the cell and press the 'play' icon, or simply press shift+enter together. Some cells will take longer to run, and you will see a [$\ast$] to the left of the cell while AE is still working.

## Step 0: Setup
First, we'll import any general python libraries required to run the notebook.

In [None]:
import numpy as np
import pandas as pd
import xarray as xr
import panel as pn
pn.extension()

Next, we'll import the python library [climakitae](https://github.com/cal-adapt/climakitae), our AE toolkit for climate data analysis, along with the specific functions from that library that we'll use in this notebook.

In [None]:
from climakitae.utils import get_closest_gridcell
from climakitae.selectors import Boundaries
from climakitae.derive_variables import compute_hdd_cdd
from climakitae.cluster import Cluster
import climakitae as ck

Because we have two separate notebooks covering this same topic, we've put shared functions in a utils module, named `utils_notebook_1.py`. We'll import the functions from that file next.

In [None]:
from utils_notebook_1 import *

Additionally, set up a dask cluster to speed up the computations by executing the following cell. It will likely take several minutes to spin up! Learn more about dask and see some common [troubleshooting tips on our FAQ page](https://analytics.cal-adapt.org/docs/faq/).

In [None]:
cluster = Cluster()
cluster.adapt(minimum=0, maximum=8)
client = cluster.get_client()
cluster

To use climakitae, load a new application:

In [None]:
app = ck.Application()

## Step 1: Retrieve the data

### 1a) Read in the data 
To allow for better reproducibility of this notebook, we have put all data and location selections into a csv file, which should be located in the same folder as this notebook. We'll read this into our notebook by using the climakitae helper function `app.retrieve()`, providing the local filepath to the csv file as an argument to the function.

In [None]:
%%time
data = app.retrieve("data/config_hourly_data.csv") 

### 1b) Preview the data
The function above returns an xarray Dataset object, with the three variables we want to load (air temperature, relative humidity, and dew point temperature) as separate elements of a single python object. Since all the variables have the same spatial and temporal dimensions, they can be stored in the same object. To access an individual variable (for example, Air Temperature at 2m), you can simply type `data["Air Temperature at 2m"]` to get just the data for that variable.<br><br>
Understanding the dimensionality of this object is not a prerequisite to understanding the concepts in this notebook. However, if you'd like to learn more about this data type, [xarray's documentation](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html) gives an excellent description.<br>

In [None]:
data

### 1c) Preview the data using app.select()
This panel will display the data and location settings of the final row in the csv file fed to `app.retrieve()`. We'll display it in the notebook to give you a better sense of the data, as well as the other data options in the Analytics Engine catalog. <br><br>Although we won't do this in this particular notebook, you can also use this panel to directly modify the data and location selections, and then read in the data afterward by calling `app.retrieve()`. 

In [None]:
app.select()

## Step 2: Get data from the closest grid cell to the weather station. 
As an example, to replicate the historical observations at Sacramento Executive Airport, we grab the model grid cell nearest to the airport.

### 2a) Read in a csv file of the station coordinates 
Make sure the filepath to the csv file matches the correct location on your computer. This file will be read into the notebook as a pandas DataFrame object.

In [None]:
stations_df = pd.read_csv("data/CEC_Forecast_Weather Stations_California.csv", index_col="STATION")
stations_df.head(5) # Display the first 5 rows 

### 2b) Grab the closest grid cell to the weather station
To demonstrate this process, we'll use the Sacramento Executive Airport weather station.

In [None]:
station_name = "SACRAMENTO EXECUTIVE AIRPORT"
one_station = stations_df.loc[station_name]

Next, we need to convert the lat/lon coordinate pair to the model's projection coordinates. We can easily do this using the built-in helper function in climakitae: `get_closest_gridcell`. For more information on this function, you can call `help(get_closest_gridcell)` or look in the climakitae.utils module for the actual code that performs the computation.

In [None]:
data_closest_gridcell = get_closest_gridcell(
    data=data,
    lat=one_station.LAT_Y,
    lon=one_station.LON_X, 
)
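To make the idea of a nearest-grid-cell lookup concrete, here is a minimal numpy sketch on a synthetic grid. It is not the climakitae implementation: the helper name `nearest_cell_index`, the toy 0.5-degree grid, the approximate airport coordinates, and the simple squared-distance-in-degrees metric are all assumptions made for illustration.

```python
import numpy as np

def nearest_cell_index(lat2d, lon2d, lat, lon):
    """Return (row, col) of the grid cell whose center is closest to (lat, lon).

    Hypothetical helper for illustration only. It compares squared distances
    in degrees, which is enough to pick a neighbor on a small regional grid.
    """
    dist2 = (lat2d - lat) ** 2 + (lon2d - lon) ** 2
    return np.unravel_index(np.argmin(dist2), dist2.shape)

# Toy 0.5-degree grid (made-up values, not the model grid)
lats = np.arange(36.0, 40.5, 0.5)       # 36.0, 36.5, ..., 40.0
lons = np.arange(-123.0, -119.5, 0.5)   # -123.0, -122.5, ..., -120.0
lon2d, lat2d = np.meshgrid(lons, lats)

# Approximate station location (assumed for the example)
row, col = nearest_cell_index(lat2d, lon2d, lat=38.51, lon=-121.49)
nearest_lat, nearest_lon = lat2d[row, col], lon2d[row, col]
```

On this toy grid the closest cell center is at 38.5, -121.5. The real helper additionally handles the model's projected `x`/`y` coordinates, which this sketch leaves out.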

### 2c) Load the data into memory 
This may take some time, because the data has to be loaded into memory and then subset to the closest grid cell. All the operations we've defined before this step are only actually computed in this step; until now, we have just seen a preview of the data. **Because of this, we recommend running this notebook in the Analytics Engine's Jupyter Hub, which provides additional computational resources that greatly speed up this step.**

In [None]:
%%time
data_closest_gridcell = app.load(data_closest_gridcell)

### 2d) Output final data product as a csv file
We'll drop all unneeded coordinates and convert our xarray Dataset to a pandas DataFrame, allowing us to easily output the final data product to a csv file. In the output table, the first column is the time in UTC, and the second column lists the global climate models (which can be filtered in Excel or with python code in the notebook). The remaining columns are the variables selected at the beginning of the notebook.

In [None]:
data_closest_gridcell_df = data_closest_gridcell.isel(scenario=0).drop(
    ["x","y","landmask","lakemask","lat","lon","Lambert_Conformal","scenario"]
).to_dataframe()
data_closest_gridcell_df.head()

In [None]:
filename = "hourly_data_closest_gridcell_{0}.csv".format(station_name.replace(" ", "_").lower())
data_closest_gridcell_df.to_csv(filename, index=True)

## Step 3: Compute the median value of the grid cells in station's corresponding forecast zone
As an alternative to a single point, we can instead consider weather conditions across an entire forecast zone. In this example we calculate the median of all conditions across the Sacramento Municipal Utility District.

### 3a) Read in the shapefiles of the demand forecast zones 
We'll use these to find the demand forecast zone that contains the weather station, then find the overlapping grid cells over which to compute the median value. The geometries of the demand forecast zones are available in our data catalog. You can grab the data as a pandas DataFrame object using the code provided below, or subset by forecast zone in `app.select` in the location subsetting tab.

In [None]:
%%time
dfzs_df = Boundaries()._ca_forecast_zones # Load geometries from catalog
dfzs_df.head() # Display the first few rows  

### 3b) Crop the data to the corresponding forecast zone
We'll use [geopandas' `.contains` function](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.contains.html) to find the demand forecast zone where the weather station is located, and print the result. Then, we'll use [rioxarray](https://corteva.github.io/rioxarray/stable/rioxarray.html#rioxarray.raster_array.RasterArray.clip) to clip the data to the geometry that defines the forecast zone.

In [None]:
data_dfz = clip_data_to_dfz(
    gridded_data=data, 
    dfzs_df=dfzs_df, 
    station_lat=one_station.LAT_Y,
    station_lon=one_station.LON_X
)
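The two operations described above (a containment test, then a clip) can be sketched without any geospatial libraries. The example below swaps in a plain ray-casting point-in-polygon test and a NaN mask; it is not what `clip_data_to_dfz`, geopandas, or rioxarray do internally. The function `point_in_polygon`, the triangular zone, and the 4x4 field are hypothetical.

```python
import numpy as np

def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies strictly inside the polygon.

    `vertices` is a list of (x, y) tuples. Hypothetical helper standing in
    for the containment test a geometry library would perform.
    """
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Made-up triangular "zone" and a station location inside it
zone = [(0.0, 0.0), (4.5, 0.0), (0.0, 4.5)]
station_in_zone = point_in_polygon(1.0, 1.0, zone)

# "Clip" a toy 4x4 field: keep cells whose centers fall inside the zone
xs = np.arange(4) + 0.5
ys = np.arange(4) + 0.5
field = np.ones((4, 4))
mask = np.array([[point_in_polygon(x, y, zone) for x in xs] for y in ys])
clipped = np.where(mask, field, np.nan)
```

Cells outside the zone become NaN while the array keeps its rectangular shape, which matters later when counting how many cells actually belong to the zone.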

### 3c) Visualize both the Demand Forecast Zone and the weather station on the same map 

For simplicity's sake, we'll show just one variable and only the first 13 time steps of data. In the output map, you can see that our data also contains multiple simulations, which you can toggle between in the map's dropdown.

In [None]:
import hvplot.pandas  # Registers the .hvplot accessor on pandas DataFrames

# Used to add weather station as star to map 
point_df = pd.DataFrame({
    "longitude (degrees_east)":[one_station.LON_X],
    "latitude (degrees_north)":[one_station.LAT_Y],
    "weather station": station_name
})

# Grab subset of data and load into memory 
to_plot = data_dfz["Air Temperature at 2m"].isel(time = np.arange(0,13))
to_plot = app.load(to_plot)

In [None]:
app.view(to_plot) * point_df.hvplot.points(
    hover_cols = ["weather station"], 
    marker = "star", size = 300, color = "black"
)

### 3d) Aggregate values across grid cells in the forecast zone 
**Choose your aggregation: median, mean, min, or max.** Each can be computed with just one line of code, thanks to xarray. You could also write your own code to compute a weighted mean. 

In [None]:
data_dfz_aggregated = data_dfz.median(dim=["x","y"])
#data_dfz_aggregated = data_dfz.mean(dim=["x","y"])
#data_dfz_aggregated = data_dfz.min(dim=["x","y"])
#data_dfz_aggregated = data_dfz.max(dim=["x","y"])
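For readers who want to try the weighted mean mentioned above, here is a minimal numpy sketch on a toy array. The field values and the `weights` array (imagined here as the fraction of each cell lying inside the zone) are made up, and the axis order `(time, y, x)` is an assumption.

```python
import numpy as np

# Toy field with dimensions (time, y, x); NaN marks a cell outside the zone
field = np.array([
    [[10.0, 20.0], [30.0, np.nan]],
    [[12.0, 22.0], [32.0, np.nan]],
])

# Hypothetical weights, e.g. the fraction of each cell inside the zone
weights = np.array([[1.0, 0.5], [0.5, 0.0]])

# Unweighted statistics that ignore NaN cells
median_example = np.nanmedian(field, axis=(1, 2))
mean_example = np.nanmean(field, axis=(1, 2))

# Weighted mean: zero the weights where data are missing, then normalize
valid = ~np.isnan(field)
w = np.where(valid, weights, 0.0)
weighted_mean_example = np.nansum(field * w, axis=(1, 2)) / w.sum(axis=(1, 2))
```

Normalizing by the sum of the weights actually used, rather than by the number of cells, keeps the result unbiased when some cells are missing.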

Finally, let's load this aggregated data product into memory.

In [None]:
%%time
data_dfz_aggregated = app.load(data_dfz_aggregated)

### 3e) Output final data product as a csv file
We'll drop all unneeded coordinates and convert our xarray Dataset to a pandas DataFrame, allowing us to easily output the final data product to a csv file. 

In [None]:
dfz_aggregated_df = data_dfz_aggregated.isel(scenario=0).drop(
    ["scenario","Lambert_Conformal"]).to_dataframe()
dfz_aggregated_df.head()

In [None]:
filename = "dfz_aggregated_{0}.csv".format(station_name.replace(" ", "_").lower())
dfz_aggregated_df.to_csv(filename, index=True)

## Step 4: Compute heating degree days and cooling degree days
Here, a heating degree day (HDD) is calculated by computing how many degrees Fahrenheit **colder** the daily temperature is than a standard temperature of 65 degrees Fahrenheit. A cooling degree day (CDD) is calculated by computing how many degrees **warmer** the daily temperature is than the same standard temperature.

### 4a) Decide which input data you want to use 
You can use the closest grid cell to the weather station, which we computed in step 2. Or, you can use the data aggregated over the demand forecast zone, which we computed in step 3. Just comment out whichever you don't want to use. We've chosen to show the analysis with the aggregated DFZ data, but if you want to use the closest grid cell data, comment out the DFZ lines and uncomment the closest grid cell lines.<br><br>Depending on the input data, we will also set a new variable defining the number of grid cells. This will of course be just 1 for the closest grid cell method; for the aggregated DFZ data, however, this value will change depending on the size of the DFZ. This information is used to compute the annual aggregate HDD and CDD in step 4c.

In [None]:
# CLOSEST GRID CELL 
# data_to_use = data_closest_gridcell
# num_grid_cells = 1

# AGGREGATED CELLS IN DFZ 
data_to_use = data_dfz_aggregated 
num_grid_cells = data_dfz.x.size * data_dfz.y.size # Number of grid cells within the demand forecast region
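One thing worth checking when counting grid cells: the product of the `x` and `y` sizes gives the number of cells in the rectangular bounding box. If the clipped data marks cells outside the zone as NaN (an assumption about how `clip_data_to_dfz` behaves, which is not shown in this notebook), the number of cells inside the zone can be smaller. A toy numpy comparison of the two counts:

```python
import numpy as np

# Toy clipped field (y, x); NaN marks cells outside the zone boundary
clipped_example = np.array([
    [np.nan, 1.0,    1.0],
    [1.0,    1.0,    np.nan],
    [np.nan, np.nan, np.nan],
])

# Count every cell in the bounding box
cells_in_bounding_box = clipped_example.shape[0] * clipped_example.shape[1]

# Count only cells that hold data
cells_in_zone = int(np.count_nonzero(~np.isnan(clipped_example)))
```

Here the bounding box has 9 cells but only 4 hold data. Which count is appropriate depends on how the annual aggregate in step 4c uses `num_grid_cells`.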

### 4b) Compute HDD and CDD 
We'll use the climakitae helper function `compute_hdd_cdd` to perform the computation, which uses a default standard temperature of 65 degrees F. You can change this default using the function argument `standard_temp`. The function performs the following calculations:<br><br>
**HDD = 65 - temperature<br>
CDD = (-1)\*(65 - temperature)**<br><br>
For HDD, we subtract the 2m temperature from 65 degrees Fahrenheit, then set any negative values to 0. For CDD, we do the same subtraction, but multiply the result by -1 so that temperatures warmer than the standard temperature give positive values, then set any negative values to 0. For example, on a day of 80F, 65 - 80 = -15; multiplying by -1 gives the true CDD value of 15. 

In [None]:
#help(compute_hdd_cdd) # See information about the function

In [None]:
t2 = data_to_use["Air Temperature at 2m"]
hdd, cdd = compute_hdd_cdd(t2, standard_temp=65)
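As a quick sanity check of the arithmetic described in 4b, here is a small numpy sketch. `hdd_cdd_sketch` is a hypothetical stand-in written for this example, not the climakitae function, and the three temperatures are made up.

```python
import numpy as np

def hdd_cdd_sketch(temperature_f, standard_temp=65.0):
    """Illustrative stand-in for the HDD/CDD arithmetic (temperatures in F)."""
    temperature_f = np.asarray(temperature_f, dtype=float)
    # HDD: degrees below the standard temperature, never negative
    hdd_values = np.clip(standard_temp - temperature_f, 0.0, None)
    # CDD: degrees above the standard temperature, never negative
    cdd_values = np.clip(temperature_f - standard_temp, 0.0, None)
    return hdd_values, cdd_values

# A cold day, a day exactly at the standard temperature, and a hot day
hdd_example, cdd_example = hdd_cdd_sketch([50.0, 65.0, 80.0])
```

The 50F day contributes 15 HDD and 0 CDD, the 65F day contributes nothing to either, and the 80F day contributes 0 HDD and 15 CDD.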

### 4c) Aggregate annually to find HDD and CDD per year
To do this, we will first group the data by year and compute a sum across space and time. Then, we will divide the annual aggregated data by the number of grid cells over which the sum was computed. 

In [None]:
hdd_annual = compute_annual_aggreggate(
    data=hdd, 
    name="Annual Heating Degree Days (HDD)", 
    num_grid_cells=num_grid_cells
)
cdd_annual = compute_annual_aggreggate(
    data=cdd, 
    name="Annual Cooling Degree Days (CDD)", 
    num_grid_cells=num_grid_cells
)
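The helper `compute_annual_aggreggate` lives in `utils_notebook_1.py` and is not shown here. As a rough sketch of the two steps described above (sum by year, then divide by the number of grid cells), the following numpy example uses made-up daily values; the variable names and the choice of 4 grid cells are assumptions.

```python
import numpy as np

# Two years of made-up daily degree-day values, already summed over 4 cells
dates = np.arange("2001-01-01", "2003-01-01", dtype="datetime64[D]")
num_cells_example = 4
daily_sum_over_cells = np.full(dates.size, 2.0 * num_cells_example)

# Group by calendar year and sum over time
years = dates.astype("datetime64[Y]").astype(int) + 1970
unique_years, inverse = np.unique(years, return_inverse=True)
annual_sum = np.bincount(inverse, weights=daily_sum_over_cells)

# Divide by the number of grid cells to get a per-cell annual value
annual_per_cell = annual_sum / num_cells_example
```

With 2 degree days per cell per day and 365 days in each year, both years come out at 730 degree days per cell.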

### 4d) Compute the multimodel mean, min, and max. 
We'll add these statistics to our main datasets, `hdd_annual` and `cdd_annual`, so they can be easily accessed for plotting.

In [None]:
hdd_annual, cdd_annual = compute_multimodel_stats(hdd_annual, cdd_annual)
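The statistics themselves are simple reductions across the simulation dimension. A minimal numpy sketch with made-up annual values follows; it illustrates the idea only and is not the code inside `compute_multimodel_stats`, and the axis order `(simulation, year)` is an assumption.

```python
import numpy as np

# Made-up annual HDD values with dimensions (simulation, year)
annual_by_sim = np.array([
    [2500.0, 2450.0, 2400.0],
    [2600.0, 2500.0, 2450.0],
    [2400.0, 2400.0, 2350.0],
])

# Reduce across simulations (axis 0), leaving one value per year
multimodel_mean = annual_by_sim.mean(axis=0)
multimodel_min = annual_by_sim.min(axis=0)
multimodel_max = annual_by_sim.max(axis=0)
```

The min and max trace the envelope of the simulations for each year, and the mean is the series used for the trendline in the next step.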

### 4e) Compute a trendline using the mean of all simulations
We'll find the coefficients for a first degree (linear) polynomial using [numpy's `polyfit` function](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html). The returned coefficients (**m** and **b**) allow us to compute the trendline using the linear polynomial y = mx + b, where **y** is the trendline and **x** is the years. 

In [None]:
hdd_trendline = trendline(hdd_annual) 
cdd_trendline = trendline(cdd_annual) 
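To show what the linear fit looks like in isolation, here is a small sketch with a made-up series that falls by exactly 10 HDD per year. The variable names are hypothetical, and this is not necessarily how the `trendline` helper in `utils_notebook_1.py` is written.

```python
import numpy as np

years_example = np.arange(2020, 2030)
# Made-up multimodel mean that falls by exactly 10 HDD per year
mean_series_example = 2500.0 - 10.0 * (years_example - 2020)

# A degree-1 fit returns the slope m and intercept b of y = m*x + b
m, b = np.polyfit(years_example, mean_series_example, 1)
trend_values = m * years_example + b
```

Because the input is perfectly linear, the fitted slope is -10 (to within floating-point error) and the trendline reproduces the input series.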

### 4f) Visualize the results
Using the python package *hvplot*, we can easily make a line plot of the annual aggregated data. To do this, we'll plot the annual HDD, then add the trendline on top. The code to generate the plot is contained in a function `hdd_cdd_lineplot`. 

In [None]:
hdd_cdd_lineplot(
    annual_data = hdd_annual, 
    trendline = hdd_trendline, 
    title = "Annual Aggregate Heating Degree Days"
)

In [None]:
hdd_cdd_lineplot(
    annual_data = cdd_annual, 
    trendline = cdd_trendline, 
    title = "Annual Aggregate Cooling Degree Days"
)

### 4g) Output data as csv files
We'll drop all unneeded coordinates and convert our xarray Dataset to a pandas DataFrame, allowing us to easily output the final data product to a csv file. 

In [None]:
# Merge and simplify data 
hdd_cdd_combined = xr.merge([hdd_annual, cdd_annual]).drop(["Lambert_Conformal","scenario"])
hdd_cdd_combined = app.load(hdd_cdd_combined) 

# Convert to pandas dataframe 
hdd_cdd_df = hdd_cdd_combined.to_dataframe()
hdd_cdd_df.head()

In [None]:
filename = "annual_hdd_cdd_{0}.csv".format(station_name.replace(" ", "_").lower())
hdd_cdd_df.to_csv(filename, index=True)

## Step 5: Close the compute cluster
Lastly, when you are done, close your cluster resources to free them up for the next time you work. 

In [None]:
client.close()