Pythonic data access using climakitae 
--------------------------------------
This notebook showcases helper functions from `climakitae` that let you access the AE catalog data **without** using a GUI, perform spatial subsetting, and view the data options in an easy-to-use format. These functions can easily be used in a Python script. <br>

As a reminder, you can access the data using one of the following methods: 
1) the climakitae Selections GUI ([getting_started.ipynb](getting_started.ipynb))
2) using helper functions in the `climakitae` library (this notebook!) 
3) the Python library `intake` ([intake_direct_data_download.ipynb](intake_direct_data_download.ipynb))
<br>

This notebook showcases option 2.

In [1]:
from climakitae.core.data_interface import (
    get_data_options, 
    get_subsetting_options, 
    get_data
)

## See all the data options in the catalog 
These options will match those in our AE selections GUI. 

In [2]:
get_data_options()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,variable,resolution
downscaling_method,scenario,timescale,Unnamed: 3_level_1,Unnamed: 4_level_1
Statistical,Historical Climate,daily,Maximum relative humidity,3 km
Statistical,Historical Climate,daily,Minimum relative humidity,3 km
Statistical,Historical Climate,daily,Specific humidity at 2m,3 km
Statistical,Historical Climate,daily,Precipitation (total),3 km
Statistical,Historical Climate,daily,Shortwave flux at the surface,3 km
...,...,...,...,...
Dynamical,Historical Reconstruction,monthly,Maximum wind speed at 10m,9 km
Dynamical,Historical Reconstruction,monthly,Maximum wind speed at 10m,3 km
Dynamical,Historical Reconstruction,monthly,Mean wind speed at 10m,45 km
Dynamical,Historical Reconstruction,monthly,Mean wind speed at 10m,9 km


## See the data options for a particular subset of inputs
The `get_data_options` function accepts a number of arguments, corresponding to the columns in the table above, that subset the table. Passing no arguments, as we did above, returns the entire range of options.<br><br>First, let's print the function documentation to see the inputs and outputs of the function. If an argument (or "parameter", as listed in the documentation) is listed as "optional", you don't have to input anything for that argument. For this function, none of the arguments are required, so you can simply call the function with no inputs. 

In [3]:
print(get_data_options.__doc__)

Get data options, in the same format as the Select GUI, given a set of possible inputs.
    Allows the user to access the data using the same language as the GUI, bypassing the sometimes unintuitive naming in the catalog.
    If no function inputs are provided, the function returns the entire AE catalog that is available via the Select GUI

    Parameters
    ----------
    variable: str, optional
        Default to None
    downscaling_method: str, optional
        Default to None
    resolution: str, optional
        Default to None
    timescale: str, optional
        Default to None
    scenario: str or list, optional
        Default to None
    tidy: boolean, optional
        Format the pandas dataframe? This creates a DataFrame with a MultiIndex that makes it easier to parse the options.
        Default to True

    Returns
    -------
    cat_subset: pd.DataFrame
        Catalog options for user-provided inputs
    


If you call the function with **no inputs**, it will simply return the entire catalog! But, let's say you want to see all the data options for statistically downscaled data at 3 km resolution. You'll want to provide inputs for the `downscaling_method` and `resolution` arguments. 

In [4]:
get_data_options(
    downscaling_method = "Statistical", 
    resolution = "3 km"
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,variable,resolution
downscaling_method,scenario,timescale,Unnamed: 3_level_1,Unnamed: 4_level_1
Statistical,Historical Climate,daily,Maximum relative humidity,3 km
Statistical,Historical Climate,daily,Minimum relative humidity,3 km
Statistical,Historical Climate,daily,Specific humidity at 2m,3 km
Statistical,Historical Climate,daily,Precipitation (total),3 km
Statistical,Historical Climate,daily,Shortwave flux at the surface,3 km
Statistical,...,...,...,...
Statistical,SSP 5-8.5 -- Burn it All,monthly,Maximum air temperature at 2m,3 km
Statistical,SSP 5-8.5 -- Burn it All,monthly,Minimum air temperature at 2m,3 km
Statistical,SSP 5-8.5 -- Burn it All,monthly,West-East component of Wind at 10m,3 km
Statistical,SSP 5-8.5 -- Burn it All,monthly,North-South component of Wind at 10m,3 km


Perhaps you want to see all the data options for daily precipitation. The catalog has several precipitation variables. You don't need to know their exact names; simply use "precipitation" as your input for the `variable` argument.<br><br>The function prefers that your inputs match an actual option in the catalog, with exact capitalization and spelling, and will print a warning if your input is not a direct match ("precipitation" is not an option, but "Precipitation (total)" is). The function will then guess what you meant: it lists the closest options and returns results for one of them, so check the printed message to confirm that the variable it chose is the one you want. 

In [5]:
get_data_options(
    variable = "precipitation", 
    timescale = "daily"
) 

Input variable='precipitation' is not a valid option.
Closest options: 
- Maximum precipitation
- Precipitation (convective only)
- Precipitation (cumulus portion only)
- Precipitation (grid-scale portion only)
- Precipitation (total)
Outputting data for variable='Maximum precipitation'



Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,variable,resolution
downscaling_method,scenario,timescale,Unnamed: 3_level_1,Unnamed: 4_level_1
Dynamical,Historical Climate,daily,Maximum precipitation,45 km
Dynamical,Historical Climate,daily,Maximum precipitation,9 km
Dynamical,Historical Climate,daily,Maximum precipitation,3 km
Dynamical,SSP 2-4.5 -- Middle of the Road,daily,Maximum precipitation,45 km
Dynamical,SSP 2-4.5 -- Middle of the Road,daily,Maximum precipitation,9 km
Dynamical,SSP 3-7.0 -- Business as Usual,daily,Maximum precipitation,45 km
Dynamical,SSP 3-7.0 -- Business as Usual,daily,Maximum precipitation,9 km
Dynamical,SSP 3-7.0 -- Business as Usual,daily,Maximum precipitation,3 km
Dynamical,SSP 5-8.5 -- Burn it All,daily,Maximum precipitation,45 km
Dynamical,SSP 5-8.5 -- Burn it All,daily,Maximum precipitation,9 km
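
The warning above lists every catalog variable that contains the input and then falls back to the first one in the list. A minimal sketch of that kind of lookup is shown here, assuming a case-insensitive substring match followed by an alphabetical sort. The helper `closest_options` and the short `VARIABLES` list are hypothetical; climakitae's actual matching logic may differ.

```python
# Hypothetical, abbreviated list of variable names taken from outputs in this notebook.
VARIABLES = [
    "Precipitation (total)",
    "Maximum precipitation",
    "Precipitation (convective only)",
    "Maximum relative humidity",
    "Mean wind speed at 10m",
]


def closest_options(user_input, options):
    """Return the options that contain user_input (ignoring case), sorted alphabetically."""
    needle = user_input.lower()
    return sorted(opt for opt in options if needle in opt.lower())


matches = closest_options("precipitation", VARIABLES)
print(matches)
```

To skip the guess entirely, pass the exact catalog name, for example `variable = "Precipitation (total)"`.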


The function can also return a simple pandas DataFrame without the complex MultiIndex. Just set `tidy = False`.

In [6]:
get_data_options(
    variable = "precipitation", 
    timescale = "daily", 
    tidy = False
) 

Input variable='precipitation' is not a valid option.
Closest options: 
- Maximum precipitation
- Precipitation (convective only)
- Precipitation (cumulus portion only)
- Precipitation (grid-scale portion only)
- Precipitation (total)
Outputting data for variable='Maximum precipitation'



Unnamed: 0,variable,downscaling_method,resolution,timescale,scenario
0,Maximum precipitation,Dynamical,45 km,daily,Historical Climate
1,Maximum precipitation,Dynamical,9 km,daily,Historical Climate
2,Maximum precipitation,Dynamical,3 km,daily,Historical Climate
3,Maximum precipitation,Dynamical,45 km,daily,SSP 2-4.5 -- Middle of the Road
4,Maximum precipitation,Dynamical,9 km,daily,SSP 2-4.5 -- Middle of the Road
5,Maximum precipitation,Dynamical,45 km,daily,SSP 3-7.0 -- Business as Usual
6,Maximum precipitation,Dynamical,9 km,daily,SSP 3-7.0 -- Business as Usual
7,Maximum precipitation,Dynamical,3 km,daily,SSP 3-7.0 -- Business as Usual
8,Maximum precipitation,Dynamical,45 km,daily,SSP 5-8.5 -- Burn it All
9,Maximum precipitation,Dynamical,9 km,daily,SSP 5-8.5 -- Burn it All
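
Because `tidy = False` returns one flat row per option, it is easy to answer questions such as "which scenarios offer this variable at 3 km?". The sketch below uses plain Python tuples copied from the rows shown above rather than the DataFrame itself, so it runs without access to the catalog. With the real DataFrame, an equivalent pandas filter on the `resolution` column would do the same job. The helper name `scenarios_at` is hypothetical.

```python
# (resolution, scenario) pairs copied from the tidy=False output above.
ROWS = [
    ("45 km", "Historical Climate"),
    ("9 km", "Historical Climate"),
    ("3 km", "Historical Climate"),
    ("45 km", "SSP 2-4.5 -- Middle of the Road"),
    ("9 km", "SSP 2-4.5 -- Middle of the Road"),
    ("45 km", "SSP 3-7.0 -- Business as Usual"),
    ("9 km", "SSP 3-7.0 -- Business as Usual"),
    ("3 km", "SSP 3-7.0 -- Business as Usual"),
    ("45 km", "SSP 5-8.5 -- Burn it All"),
    ("9 km", "SSP 5-8.5 -- Burn it All"),
]


def scenarios_at(resolution, rows):
    """List the scenarios available at a given resolution, in table order."""
    return [scenario for res, scenario in rows if res == resolution]


print(scenarios_at("3 km", ROWS))
```

For the rows shown, only "Historical Climate" and "SSP 3-7.0 -- Business as Usual" are available at 3 km, so checking availability before a retrieval can save a failed call.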


## See all the geometry options for spatially subsetting the data during retrieval
These options will match those in our AE selections GUI. This will enable you to retrieve a subset for a specific region.

In [7]:
get_subsetting_options()

Unnamed: 0_level_0,Unnamed: 1_level_0,geometry
area_subset,cached_area,Unnamed: 2_level_1
states,ID,"POLYGON ((-117.24269 44.39655, -117.23485 44.3..."
states,WA,"MULTIPOLYGON (((-122.57041 48.53786, -122.5686..."
states,NM,"POLYGON ((-109.05018 31.48001, -109.04985 31.4..."
states,CA,"MULTIPOLYGON (((-118.60443 33.47856, -118.5988..."
states,CO,"POLYGON ((-109.06026 38.59933, -109.05955 38.7..."
...,...,...
CA Electric Load Serving Entities (IOU & POU),City of Shasta Lake,"POLYGON ((-122.37577 40.69512, -122.35561 40.6..."
CA Electric Load Serving Entities (IOU & POU),Victorville Municipal Utilities Services,"POLYGON ((-117.38335 34.60109, -117.38330 34.6..."
CA Electric Load Serving Entities (IOU & POU),Shelter Cove Resort Improvement District,"POLYGON ((-124.04385 40.01842, -124.04415 40.0..."
CA Electric Load Serving Entities (IOU & POU),Kirkwood Meadows Public Utility District,"POLYGON ((-120.06332 38.68938, -120.06371 38.6..."


This shows a lot of options! Say you're only interested in California counties. Simply set the argument `area_subset` to "CA counties" to see all the options for counties. The function documentation lists the other valid values, which match the values in the "area_subset" column of the table above. 

In [8]:
print(get_subsetting_options.__doc__)

Get all geometry options for spatial subsetting.
    Options match those in selections GUI

    Parameters
    ----------
    area_subset: str
        One of "all", "states", "CA counties", "CA Electricity Demand Forecast Zones", "CA watersheds", "CA Electric Balancing Authority Areas", "CA Electric Load Serving Entities (IOU & POU)"
        Defaults to "all", which shows all the geometry options with area_subset as a multiindex

    Returns
    -------
    geom_df: pd.DataFrame
        Geometry options
        Shows only options for one area_subset if input is provided that is not "all"
        i.e. if area_subset = "states", only the options for states will be returned
    


In [9]:
get_subsetting_options(area_subset = "CA counties")

Unnamed: 0_level_0,geometry
cached_area,Unnamed: 1_level_1
Alameda County,"POLYGON ((-122.37312 37.88388, -122.37378 37.8..."
Alpine County,"POLYGON ((-120.07333 38.70109, -120.07332 38.7..."
Amador County,"POLYGON ((-121.02771 38.50011, -121.02771 38.5..."
Butte County,"POLYGON ((-122.06943 39.84053, -122.06886 39.8..."
Calaveras County,"POLYGON ((-120.63180 38.34603, -120.63180 38.3..."
Colusa County,"POLYGON ((-121.91512 38.92535, -121.91491 38.9..."
Contra Costa County,"POLYGON ((-121.69732 37.78244, -121.69084 37.7..."
Del Norte County,"POLYGON ((-124.31611 41.72839, -124.31370 41.7..."
El Dorado County,"POLYGON ((-120.18443 39.03101, -120.18838 39.0..."
Fresno County,"POLYGON ((-119.57319 36.48884, -119.57305 36.4..."


You can see all the options for subsetting and their corresponding geometries, but you don't actually need to use the geometries for subsetting if you use climakitae's data retrieval function, `get_data`, explained in the next section. 

## Retrieve data using the get_data() function
You can easily retrieve data from the Analytics Engine data catalog using climakitae's ```get_data``` function, described below. Additional details for each of the function arguments can be viewed in function docstrings in the next code cell. 

### Required inputs 
This function requires you to input values for the following arguments: 
- variable (required)
- downscaling method (required)
- resolution (required)
- timescale (required)

### Location subsetting 
The options for location subsetting can be found using the `get_subsetting_options()` function, as described earlier in this notebook. You can also opt to perform an area average by setting `area_average = "Yes"`. By default, the `get_data()` function returns the entire spatial domain, with no area averaging performed. 
- area_subset (optional) 
- cached_area (optional) 
- area_average (optional)
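
Spatial subsetting and area averaging also determine how much data comes back. As a rough illustration, the sketch below estimates in-memory size from an array's shape and dtype, using the shapes returned by Examples 1a and 1b later in this notebook (full domain versus a county-wide area average). The helper `array_size_bytes` is hypothetical and not part of climakitae.

```python
import math


def array_size_bytes(shape, bytes_per_value=8):
    """Uncompressed size of an array; float64 values take 8 bytes each."""
    return math.prod(shape) * bytes_per_value


# Shapes copied from the float64 outputs of Examples 1a and 1b.
full_domain = array_size_bytes((1, 70, 780, 495, 559))
area_average = array_size_bytes((1, 70, 780))

print(f"full domain:  {full_domain / 2**30:.2f} GiB")
print(f"area average: {area_average / 2**10:.2f} KiB")
```

These values reproduce the 112.56 GiB and 426.56 kiB that dask reports for those two examples.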

### Additional options
Further modify the data returned using the following arguments. 
- approach (optional) 
- scenario (optional)
- units (optional)
- time_slice (optional)
- warming_level (optional)
- warming_level_window (optional)
- warming_level_months (optional)

In [17]:
# See additional details about the function arguments by printing the docstring
print(get_data.__doc__)

Retrieve formatted data from the Analytics Engine data catalog using a simple function.
    Contrasts with DataParameters().retrieve(), which retrieves data from the user inputs in climakitaegui's selections GUI.

    Parameters
    ----------
    variable: str
        String name of climate variable
    downscaling_method: str, one of ["Dynamical", "Statistical", "Dynamical+Statistical"]
        Downscaling method of the data:
        WRF ("Dynamical"), LOCA2 ("Statistical"), or both "Dynamical+Statistical"
    resolution: str, one of ["3 km", "9 km", "45 km"]
        Resolution of data in kilometers
    timescale: str, one of ["hourly", "daily", "monthly"]
        Temporal frequency of dataset
    approach: one of ["Time", "Warming Level"], optional
        Default to "Time"
    scenario: str or list of str, optional
        SSP scenario and/or historical data selection ("Historical Climate", "Historical Reconstruction")
        If approach = "Time", you need to set a valid option
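
The allowed values listed in the docstring can be checked before making a potentially slow retrieval call. The following is a minimal sketch: `check_required_inputs` is a hypothetical helper, not part of climakitae, and the allowed values are copied from the docstring printed above.

```python
# Allowed values copied from the get_data docstring printed above.
ALLOWED = {
    "downscaling_method": ["Dynamical", "Statistical", "Dynamical+Statistical"],
    "resolution": ["3 km", "9 km", "45 km"],
    "timescale": ["hourly", "daily", "monthly"],
}
REQUIRED = ["variable", "downscaling_method", "resolution", "timescale"]


def check_required_inputs(**kwargs):
    """Return a list of problems found in the keyword arguments (empty if none)."""
    problems = []
    for name in REQUIRED:
        if name not in kwargs:
            problems.append(f"missing required argument: {name}")
    for name, allowed in ALLOWED.items():
        if name in kwargs and kwargs[name] not in allowed:
            problems.append(f"{name}={kwargs[name]!r} is not one of {allowed}")
    return problems


problems = check_required_inputs(
    variable="Precipitation (total)",
    downscaling_method="Statistical",
    resolution="3km",  # deliberate typo: the docstring spells this "3 km"
)
print(problems)
```

An empty list means the required arguments are present and spelled as the docstring expects; the sketch does not check that the combination actually exists in the catalog, which is what `get_data_options()` is for.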
  

### Example 1: Time-based approach
Retrieve data using a time-based approach. ```approach``` is an optional function argument, but the default is to use a time-based approach, so you don't actually need to set this argument. 

#### Example 1a
First, let's retrieve 3 kilometer resolution statistically downscaled historical data at a monthly timestep. 

In [11]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Statistical", 
    resolution = "3 km", 
    timescale = "monthly", 
    scenario = "Historical Climate"
    # approach = "Time" # Optional because "Time" is the function default 
)

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!! Returned data array is huge. Operations could take 10x to infinity longer than 1GB of data !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!



Unnamed: 0,Array,Chunk
Bytes,112.56 GiB,255.69 MiB
Shape,"(1, 70, 780, 495, 559)","(1, 1, 308, 310, 351)"
Count,224 Graph Layers,840 Chunks
Type,float64,numpy.ndarray


#### Example 1b
Now say you're only interested in this data for San Bernardino County, and you want to compute an area average over the entire county. 

In [12]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Statistical", 
    resolution = "3 km", 
    timescale = "monthly", 
    scenario = "Historical Climate",
    
    # Modify location settings
    cached_area = "San Bernardino County", 
    area_average = "Yes"
)

Unnamed: 0,Array,Chunk
Bytes,426.56 kiB,2.41 kiB
Shape,"(1, 70, 780)","(1, 1, 308)"
Count,467 Graph Layers,210 Chunks
Type,float64,numpy.ndarray


#### Example 1c 
Perhaps next you want to get dynamically downscaled (i.e. WRF) precipitation data instead. First, you might want to check what options you have for scenario, timescale, and resolution using the ```get_data_options()``` function. 

In [13]:
get_data_options(
    variable = "Precipitation (total)", 
    downscaling_method = "Dynamical"
) 

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,variable,resolution
downscaling_method,scenario,timescale,Unnamed: 3_level_1,Unnamed: 4_level_1
Dynamical,Historical Climate,hourly,Precipitation (total),45 km
Dynamical,Historical Climate,hourly,Precipitation (total),9 km
Dynamical,Historical Climate,hourly,Precipitation (total),3 km
Dynamical,SSP 3-7.0 -- Business as Usual,hourly,Precipitation (total),45 km
Dynamical,SSP 3-7.0 -- Business as Usual,hourly,Precipitation (total),9 km
Dynamical,SSP 3-7.0 -- Business as Usual,hourly,Precipitation (total),3 km
Dynamical,Historical Climate,daily,Precipitation (total),45 km
Dynamical,Historical Climate,daily,Precipitation (total),9 km
Dynamical,Historical Climate,daily,Precipitation (total),3 km
Dynamical,Historical Climate,monthly,Precipitation (total),45 km


Next, let's retrieve both the future and historical dynamically downscaled data. "Historical Climate" is the correct historical data option here; "Historical Reconstruction" data is from ERA5 (a climate reanalysis product, rather than a climate model), and cannot be retrieved with future data in the same function call. <br><br>You can set the ```scenario``` argument to retrieve the shared socioeconomic pathway data (future projections) appended to the historical data. You can also set your desired time period using the ```time_slice``` argument. 
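
A `time_slice` can be sanity-checked against the length of the time dimension that comes back. Assuming both end years are included and the data is monthly, a slice of (2000, 2050) covers 51 years, which matches the 612 time steps in the result of this example. The helper name below is hypothetical.

```python
def monthly_steps(time_slice):
    """Number of monthly time steps in an inclusive (start_year, end_year) slice."""
    start_year, end_year = time_slice
    return (end_year - start_year + 1) * 12


print(monthly_steps((2000, 2050)))  # 612
```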

In [14]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Dynamical", 
    resolution = "45 km", 
    timescale = "monthly", 
    cached_area = "San Bernardino County", 
    
    # Modify time-based settings 
    time_slice = (2000,2050),
    scenario = [
        "Historical Climate", 
        "SSP 3-7.0 -- Business as Usual", 
        "SSP 2-4.5 -- Middle of the Road",
        "SSP 5-8.5 -- Burn it All"
    ]
) 

-------
You have retrieved data for more than one SSP, but not all ensemble members for each GCM are available for all SSPs.

As a result, some scenario and simulation combinations may contain NaN values.

If you want to remove these empty simulations, it is recommended to first subset the data object by each individual scenario and then dropping NaN values.


Unnamed: 0,Array,Chunk
Bytes,2.75 MiB,83.45 kiB
Shape,"(3, 8, 612, 7, 7)","(1, 1, 436, 7, 7)"
Count,157 Graph Layers,48 Chunks
Type,float32,numpy.ndarray


### Example 2: Warming levels approach 
By default, the function uses a time-based approach. To use a warming levels approach, set the argument ```approach = "Warming Level"```. 
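
With a warming levels approach, the length of the time dimension is set by the window around the year each warming level is reached rather than by calendar years. Assuming monthly data and a symmetric window of `warming_level_window` years on each side, the expected number of time steps can be sketched as follows. The helper is hypothetical; the values 360 and 240 match the shapes returned in Examples 2a and 2b.

```python
def window_steps(warming_level_window, steps_per_year=12):
    """Time steps in a window spanning +/- warming_level_window years."""
    return 2 * warming_level_window * steps_per_year


print(window_steps(15))  # 360 monthly steps for a +/- 15 year window
print(window_steps(10))  # 240 monthly steps for a +/- 10 year window
```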

#### Example 2a
Retrieve the same data as Example 1c, using a warming levels approach instead of a time-based approach. Note that the ```scenario``` and ```time_slice``` arguments are invalid for a warming levels approach; if provided, they will be ignored by the function. 

In [15]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Dynamical", 
    resolution = "45 km", 
    timescale = "monthly", 
    cached_area = "San Bernardino County", 
    
    # Modify your approach 
    approach = "Warming Level",
)

Unnamed: 0,Array,Chunk
Bytes,689.06 kiB,58.19 kiB
Shape,"(1, 360, 7, 7, 10)","(1, 304, 7, 7, 1)"
Count,174 Graph Layers,20 Chunks
Type,float32,numpy.ndarray


#### Example 2b
The ```get_data``` function uses a default warming levels window of +/- 15 years, resulting in a 30-year period. Let's modify that by setting ```warming_level_window = 10``` to retrieve a 20-year window.<br><br>We can also modify the warming levels computed to include additional warming levels beyond the default. Let's select a few more by setting ```warming_level = [2.5, 3.0, 4.0]```. 

In [16]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Dynamical", 
    resolution = "45 km", 
    timescale = "monthly", 
    cached_area = "San Bernardino County", 
    approach = "Warming Level",
    
    # Modify warming level settings 
    warming_level_window = 10, 
    warming_level = [2.5, 3.0, 4.0]
)

-----------------------------------
There may be NaNs in your data for certain simulation/warming level combinations if the warming level is not reached for that particular simulation before the year 2100. 

This does not mean you have missing data, but rather a feature of how the data is combined in retrieval to return a single data object. 

If you want to remove these empty simulations, it is recommended to first subset the data object by each individual warming level and then dropping NaN values.


Unnamed: 0,Array,Chunk
Bytes,1.35 MiB,45.94 kiB
Shape,"(3, 240, 7, 7, 10)","(1, 240, 7, 7, 1)"
Count,208 Graph Layers,30 Chunks
Type,float32,numpy.ndarray
