Pythonic data access using climakitae 
--------------------------------------
This notebook showcases helper functions from `climakitae` that let you access the Analytics Engine (AE) catalog data **without** using a GUI, while still allowing you to perform spatial subsetting and view the data options in an easy-to-read format. These functions can easily be called from a Python script. <br>

As a reminder, you can access the data using one of the following methods: 
1) the climakitae Selections GUI ([getting_started.ipynb](getting_started.ipynb))
2) using helper functions in the `climakitae` library (this notebook!) 
3) the python library `intake` ([intake_direct_data_download.ipynb](intake_direct_data_download.ipynb))
<br>

This notebook showcases option 2.

In [2]:
from climakitae.core.data_interface import (
    get_data_options, 
    get_subsetting_options, 
    get_data
)

## See all the data options in the catalog 
These options will match those in our AE selections GUI. 

In [None]:
get_data_options()

## See the data options for a particular subset of inputs
The `get_data_options` function accepts a number of different arguments, corresponding to the columns in the table above, that subset the table. Inputting no arguments, like we did above, will return the entire range of options.<br><br>First, let's print the function documentation to see the inputs and outputs of the function. If an argument (or "parameter", as listed in the documentation) is listed as "optional", you don't have to input anything for that argument. In the case of this function, none of the arguments are required, so you can simply call the function. 

In [None]:
print(get_data_options.__doc__)

If you call the function with **no inputs**, it will simply return the entire catalog! But, let's say you want to see all the data options for statistically downscaled data at 3 km resolution. You'll want to provide inputs for the `downscaling_method` and `resolution` arguments. 

In [None]:
get_data_options(
    downscaling_method = "Statistical", 
    resolution = "3 km"
)

Perhaps you want to see all the data options for daily precipitation. We have several precipitation options in the catalog. You don't need to know the exact names of these variables; simply use "precipitation" as your input for the `variable` argument.<br><br>The function prefers that your inputs match an actual option in the catalog, with exact capitalization and spelling, and will print a warning if your input is not a direct match ("precipitation" is not an option, but "Precipitation (total)" is). The function will then try to guess what you actually meant. 

In [None]:
get_data_options(
    variable = "precipitation", 
    timescale = "daily"
) 
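
The guessing behavior described above can be approximated with Python's standard `difflib` module. This is only a sketch of the general idea, not climakitae's actual matching logic; the helper name `closest_option` and the two-item option list are assumptions made up for illustration.

```python
import difflib


def closest_option(user_input, valid_options, cutoff=0.6):
    """Return an exact match if one exists, otherwise the best guess (or None)."""
    if user_input in valid_options:
        return user_input
    # Compare in lowercase so capitalization differences do not count against a match
    lowered = {option.lower(): option for option in valid_options}
    matches = difflib.get_close_matches(
        user_input.lower(), list(lowered), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    guess = lowered[matches[0]]
    print(f"Input '{user_input}' is not a valid option. Closest option: '{guess}'")
    return guess


# Illustrative option list; the real catalog contains many more variables
variable_options = ["Precipitation (total)", "Air Temperature at 2m"]
guess = closest_option("precipitation", variable_options)
no_guess = closest_option("zzzz", variable_options)
```

The `cutoff` value controls how similar an input must be before a guess is offered; inputs that resemble nothing in the list return `None` rather than a misleading match.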

The function can also return a simple pandas DataFrame without the complex MultiIndex. Just set `tidy = False`.

In [None]:
get_data_options(
    variable = "precipitation", 
    timescale = "daily", 
    tidy = False
) 

## See all the geometry options for spatially subsetting the data during retrieval
These options will match those in our AE selections GUI. This will enable you to retrieve a subset for a specific region.

In [None]:
get_subsetting_options()

This shows a lot of options! Say you're only interested in California counties. Simply set the argument `area_subset` to "CA counties" to see all the options for counties. The function documentation shows the other options, which also match the values in the column "area_subset" in the table above. 

In [None]:
print(get_subsetting_options.__doc__)

In [None]:
get_subsetting_options(area_subset = "CA counties")

You can see all the options for subsetting, and their corresponding geometries, but you don't actually need to use the geometries for subsetting if you use climakitae's data retrieval function, `get_data`, explained in the next section. 

## Retrieve data using the get_data() function
You can easily retrieve data from the Analytics Engine data catalog using climakitae's ```get_data``` function, described below. Additional details for each of the function arguments can be found in the function docstring, which is printed in the code cell at the end of this section. 

### Required inputs 
This function requires you to input values for the following arguments: 
- variable (required)
- downscaling method (required)
- resolution (required)
- timescale (required)
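
In a script, it can be convenient to collect the arguments in a dictionary and confirm that the four required ones are present before calling `get_data`. The following is a minimal sketch; the helper `missing_required` is hypothetical and not part of climakitae, and the `get_data` call is left commented out so the snippet runs without the library installed.

```python
# Names of the four required get_data arguments listed above
REQUIRED_ARGS = ("variable", "downscaling_method", "resolution", "timescale")


def missing_required(request):
    """Return the required argument names absent from the request (hypothetical helper)."""
    return [name for name in REQUIRED_ARGS if name not in request]


request = {
    "variable": "Precipitation (total)",
    "downscaling_method": "Statistical",
    "resolution": "3 km",
    "timescale": "monthly",
}
incomplete = {"variable": "Precipitation (total)", "resolution": "3 km"}

print(missing_required(request))     # nothing missing
print(missing_required(incomplete))  # two names missing

# With climakitae installed, the complete request could then be passed on:
# data = get_data(**request)
```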

### Location subsetting 
The options for location subsetting can be found using the `get_subsetting_options()` function, as described earlier in this notebook. You can also opt to perform an area average by setting `area_average = "Yes"`. By default, the `get_data()` function returns the entire spatial domain, with no area averaging performed. 
- area_subset (optional) 
- cached_area (optional) 
- area_average (optional)

### Additional options
Further modify the data returned using the following arguments. 
- approach (optional) 
- scenario (optional)
- units (optional)
- time_slice (optional)
- warming_level (optional)
- warming_level_window (optional)
- warming_level_months (optional)

In [None]:
# See additional details about the function arguments by printing the docstring
print(get_data.__doc__)

### Example 1: Time-based approach
Retrieve gridded data using a time-based approach. ```approach``` is an optional function argument, but the default is to use a time-based approach, so you don't actually need to set this argument. 

#### Example 1a
First, let's retrieve 3 kilometer resolution statistically downscaled historical data at a monthly timestep. 

In [None]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Statistical", 
    resolution = "3 km", 
    timescale = "monthly", 
    scenario = "Historical Climate"
    # approach = "Time" # Optional because "Time" is the function default 
)

#### Example 1b
Now say you're only interested in this data for San Bernardino County, and you want to compute an area average over the entire county. 

In [None]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Statistical", 
    resolution = "3 km", 
    timescale = "monthly", 
    scenario = "Historical Climate",
    
    # Modify location settings
    cached_area = "San Bernardino County", 
    area_average = "Yes"
)

#### Example 1c 
Perhaps next you want to get dynamically downscaled (i.e. WRF) precipitation data instead. First, you might want to check what options you have for scenario, timescale, and resolution using the ```get_data_options()``` function. 

In [None]:
get_data_options(
    variable = "Precipitation (total)", 
    downscaling_method = "Dynamical"
) 

Next, let's retrieve both the future and historical dynamically downscaled data. "Historical Climate" is the correct historical data option here; "Historical Reconstruction" data is from ERA5 (a climate reanalysis product, rather than a climate model), and cannot be retrieved with future data in the same function call. <br><br>You can set the ```scenario``` argument to retrieve the shared socioeconomic pathway data (future projections) appended to the historical data. You can also set your desired time period using the ```time_slice``` argument. 

In [None]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Dynamical", 
    resolution = "45 km", 
    timescale = "monthly", 
    cached_area = "San Bernardino County", 
    
    # Modify time-based settings 
    time_slice = (2000,2050),
    scenario = [
        "Historical Climate", 
        "SSP 3-7.0", 
        "SSP 2-4.5",
        "SSP 5-8.5"
    ]
) 

### Example 2: Warming levels approach 
By default, the function uses a time-based approach. To use a warming levels approach, set the argument ```approach = "Warming Level"```. 

#### Example 2a
Retrieve the same data as example 1c, using a warming levels approach instead of a time-based approach. Note that the ```scenario``` and ```time_slice``` arguments are invalid for a warming levels approach; if provided, they will be ignored by the function. 

In [None]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Dynamical", 
    resolution = "45 km", 
    timescale = "monthly", 
    cached_area = "San Bernardino County", 
    
    # Modify your approach 
    approach = "Warming Level",
)
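
If the arguments from a time-based request are kept in a dictionary, they can be reused for a warming levels request by dropping the two arguments that do not apply. This is a sketch under that assumption; `to_warming_level_request` is a hypothetical helper, not a climakitae function.

```python
# Arguments that are not valid for a warming levels approach
TIME_ONLY_ARGS = {"scenario", "time_slice"}


def to_warming_level_request(time_request):
    """Copy a time-based request, drop time-only arguments, and switch the approach."""
    request = {k: v for k, v in time_request.items() if k not in TIME_ONLY_ARGS}
    request["approach"] = "Warming Level"
    return request


# Settings from Example 1c
example_1c = {
    "variable": "Precipitation (total)",
    "downscaling_method": "Dynamical",
    "resolution": "45 km",
    "timescale": "monthly",
    "cached_area": "San Bernardino County",
    "time_slice": (2000, 2050),
    "scenario": ["Historical Climate", "SSP 3-7.0", "SSP 2-4.5", "SSP 5-8.5"],
}
example_2a = to_warming_level_request(example_1c)
print(example_2a)

# With climakitae installed: data = get_data(**example_2a)
```

The original dictionary is left unchanged, so the same settings can still be used for the time-based request.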

#### Example 2b
The ```get_data()``` function uses a default warming levels window of +/- 15 years, resulting in a 30 year period. Let's modify that by setting ```warming_level_window = 10``` to retrieve a 20 year window.<br><br>We can also modify the warming levels computed to include additional warming levels beyond the default. Let's select a few more by setting ```warming_level = [2.5, 3.0, 4.0]```. 

In [None]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Dynamical", 
    resolution = "45 km", 
    timescale = "monthly", 
    cached_area = "San Bernardino County", 
    approach = "Warming Level",
    
    # Modify warming level settings 
    warming_level_window = 10, 
    warming_level = [2.5, 3.0, 4.0]
)
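
To make the window arithmetic concrete: a window of `w` years on either side of a center year spans `2 * w` years. The sketch below assumes the window runs from `center - w` up to, but not including, `center + w`; that convention and the center year 2040 are assumptions for illustration, and the function itself may place its boundaries differently.

```python
def window_years(center_year, warming_level_window=15):
    """Years covered by a +/- window around center_year (end year excluded by assumption)."""
    return list(
        range(center_year - warming_level_window, center_year + warming_level_window)
    )


default_window = window_years(2040)  # hypothetical center year, default window of 15
narrow_window = window_years(2040, warming_level_window=10)

print(len(default_window), default_window[0], default_window[-1])
print(len(narrow_window), narrow_window[0], narrow_window[-1])
```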

### Example 3: Weather station data
By default, the function retrieves gridded data, but you can also retrieve point-based weather station data. This data is bias-corrected (i.e., localized) to the exact station location with the dynamically downscaled gridded data. Station data can be retrieved using the `data_type` and `station` arguments. If you don't set the `station` argument, the function will return all available weather stations. 
```
data_type = "Station" # Return weather station data 
data_type = "Gridded" # Return gridded data (function default) 
```

As of now, you can only retrieve hourly station data for the variable "Air Temperature at 2m". You can also choose the resolution of the gridded data used in bias correction by setting the `resolution` argument to "3 km", "9 km", or "45 km". 

In [3]:
get_data(
    variable = "Air Temperature at 2m", 
    downscaling_method = "Dynamical", 
    resolution = "9 km",
    timescale = "hourly",
    data_type = "Station",
    station = "San Diego"
)

Input station='San Diego' is not a valid option.
Closest option: 'San Diego Lindbergh Field (KSAN)'
Outputting data for station='San Diego Lindbergh Field (KSAN)'




| | Array | Chunk |
|---|---|---|
| Bytes | 18.18 MiB | 2.27 MiB |
| Shape | (8, 297840, 1) | (1, 297840, 1) |
| Count | 100 Graph Layers | 8 Chunks |
| Type | float64 | numpy.ndarray |


To see all the available weather station options, you can use the `get_subsetting_options()` function described earlier in this notebook.

In [4]:
get_subsetting_options(area_subset="Weather stations") 

| cached_area | geometry |
|---|---|
| Arcata Eureka Airport (KACV) | POINT (-124.10479 40.97844) |
| Bakersfield Meadows Field (KBFL) | POINT (-119.05524 35.43424) |
| Blythe Asos (KBLH) | POINT (-114.71451 33.61876) |
| Burbank-Glendale-Pasadena Airport (KBUR) | POINT (-118.36543 34.19966) |
| Desert Resorts Regional Airport (KTRM) | POINT (-116.16412 33.63166) |
| Downtown Los Angeles USC Campus (KCQT) | POINT (-118.29100 34.02400) |
| Fresno Yosemite International Airport (KFAT) | POINT (-119.72016 36.77999) |
| Gillespie Field Airport (KSEE) | POINT (-116.97250 32.82611) |
| Imperial County Airport (KIPL) | POINT (-115.57656 32.83464) |
| Lancaster William J Fox Field (KWJF) | POINT (-118.21255 34.74121) |
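
Because the `station` argument works best with a full station name, a short keyword search over the names can help find the exact string to pass. A minimal sketch, using a few names copied from the output above; `find_stations` is a hypothetical helper rather than part of climakitae.

```python
# A few station names copied from the get_subsetting_options() output
station_names = [
    "Arcata Eureka Airport (KACV)",
    "Bakersfield Meadows Field (KBFL)",
    "Blythe Asos (KBLH)",
    "Downtown Los Angeles USC Campus (KCQT)",
    "Fresno Yosemite International Airport (KFAT)",
]


def find_stations(keyword, names):
    """Return the names that contain the keyword, ignoring capitalization."""
    return [name for name in names if keyword.lower() in name.lower()]


matches = find_stations("fresno", station_names)
print(matches)
```

Passing the full name found this way as `station` avoids relying on the function's closest-match guess.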
