# Basic data access 
This notebook showcases helper functions from `climakitae` that let you access and export data from the AE catalog, subset it spatially, and browse the available data options. The same functions can be used in a standalone Python script.

**Runtime**: < 1 min

In [None]:
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt

import climakitae as ck 

## High-level details 
The AE data catalog has many different types of data. Our helper library `climakitae` attempts to make accessing and retrieving this data intuitive, as well as simplify climate and statistical analysis with the data down the line, by performing some data transformations as the data is retrieved.<br><br> To retrieve the data, you'll need to make some selections as to your climate variable, data resolution, location settings, and many other options. There are also several high-level options you'll need to set when selecting your data, detailed below: 

### Data type: Gridded or Stations
**Gridded**: Gridded (i.e. raster) climate data at various spatial resolutions.<br><br>
**Stations**: Gridded (i.e. raster) climate data at unique grid cell(s) corresponding to the central coordinates of the selected weather station(s). 
- This data is bias-corrected (i.e., localized) to the exact location of the weather station using the historical in-situ data from the weather station(s). 
- This data is currently only available for dynamically downscaled air temperature data. 

### Scientific approach: Time or Warming Level
**Time**: Retrieve the data using a traditional time-based approach that allows you to select historical data, future projections, or both, along with a time-slice of interest. 
- “Historical Climate” includes data from 1980-2014 simulated from the same GCMs used to produce the Shared Socioeconomic Pathways (SSPs). It will be automatically appended to an SSP time series when both are selected. Because this historical data is obtained through simulations, it represents average weather during the historical period and is not meant to capture the historical time series as it occurred.
- “Historical Reconstruction” provides a reference downscaled [reanalysis](https://www.ecmwf.int/en/about/media-centre/focus/2020/fact-sheet-reanalysis) dataset based on atmospheric models fit to satellite and station observations, and as a result will reflect observed historical time-evolution of the weather.
- Future projections are available for [greenhouse gas emission scenario (Shared Socioeconomic Pathway, or SSP)](https://climatescenarios.org/primer/socioeconomic-development) SSP 3-7.0 through 2100 with the dynamically-downscaled General Circulation Models (GCMs).
     - One GCM was additionally downscaled for two additional SSPs (SSP 5-8.5 and SSP 2-4.5)<br>

**Warming Level**: Retrieve the data by future global warming levels, which will automatically retrieve all available model data for the historical+future period and then calculate the time window around which each simulation reaches the selected warming level.  
- Because warming levels are defined based on amount of global mean temperature change, they can be used to compare possible outcomes across multiple scenarios or model simulations.
- This approach includes all simulations that reach a specified amount of warming regardless of when they reach that level of warming, rather than the time-based approach, which will preliminarily subset a portion of simulations that follow a given SSP trajectory.
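
To make the warming-level idea concrete, here is a minimal sketch in plain Python. It assumes a yearly series of global mean temperature anomalies (the numbers are invented), smooths it with a 21-year centered running mean, and takes a 30-year window around the first year the smoothed series reaches the chosen level. The helper names `running_mean` and `first_crossing_year`, the smoothing width, and the window length are all illustrative assumptions, not a description of how `climakitae` computes warming levels.

```python
def running_mean(values, width):
    """Centered running mean; positions without a full window are None."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo, hi = i - half, i + half
        if lo < 0 or hi >= len(values):
            out.append(None)
        else:
            window = values[lo:hi + 1]
            out.append(sum(window) / len(window))
    return out


def first_crossing_year(years, anomalies, level, width=21):
    """Return the first year whose smoothed anomaly reaches `level`, or None."""
    smoothed = running_mean(anomalies, width)
    for year, value in zip(years, smoothed):
        if value is not None and value >= level:
            return year
    return None


# Synthetic anomaly series: steady warming of 0.03 degC per year (made-up data)
years = list(range(1980, 2101))
anomalies = [0.51 + 0.03 * (y - 1980) for y in years]

center = first_crossing_year(years, anomalies, level=2.0)
window = (center - 15, center + 14)  # 30 calendar years around the crossing year
print(center, window)
```

Because each simulation warms at its own pace, the crossing year (and therefore the window) differs between simulations even though the warming level is the same.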
    
### Downscaling method: Dynamical, Statistical, or both
**Dynamical**: [Dynamically downscaled](https://dept.atmos.ucla.edu/alexhall/downscaling-cmip6) WRF data, produced at hourly intervals. If you select 'daily' or 'monthly' for 'Timescale', you will receive an average of the hourly data. The spatial resolution options, on the other hand, are each the output of a different simulation, nesting to higher resolution over smaller areas.<br><br>
**Statistical**: [Hybrid-statistically downscaled](https://loca.ucsd.edu) LOCA2-Hybrid data, available at daily and monthly timescales. Multiple LOCA2-Hybrid simulations are available (100+) at a fine spatial resolution of 3km.
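
The daily aggregation mentioned for the dynamical data amounts to grouping hourly values by calendar day and averaging each group. The sketch below shows this with the standard library only; the timestamps and temperatures are invented, and `daily_means` is a hypothetical helper rather than part of `climakitae`.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_means(timestamps, values):
    """Group hourly values by calendar day and average each group."""
    buckets = defaultdict(list)
    for ts, value in zip(timestamps, values):
        buckets[ts.date()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


# Two days of made-up hourly temperatures (K): day 1 is constant, day 2 ramps up
start = datetime(2050, 1, 1)
timestamps = [start + timedelta(hours=h) for h in range(48)]
values = [280.0] * 24 + [280.0 + h for h in range(24)]

result = daily_means(timestamps, values)
for day, mean in result.items():
    print(day, mean)
```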

## See the options in our data catalog
The interface provides several methods to explore available data options. You can get a comprehensive overview or explore step by step.

### Verbosity
You can choose the level of output the user interface provides.   
-2 : errors only  
-1 : warnings and errors  
0  : info, warnings, and errors (default)  
1  : debug, info, warnings, and errors (developers or debugging only, not recommended)
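
If you are used to Python's standard `logging` module, the four verbosity values line up naturally with its levels. The mapping below is an assumption made for illustration (the notebook does not say how `climakitae` implements verbosity internally), and `level_for` is a hypothetical helper.

```python
import logging

# Assumed mapping from the notebook's verbosity values to stdlib logging levels
VERBOSITY_TO_LEVEL = {
    -2: logging.ERROR,
    -1: logging.WARNING,
    0: logging.INFO,
    1: logging.DEBUG,
}


def level_for(verbosity):
    """Translate a verbosity value into a logging level, rejecting unknown values."""
    try:
        return VERBOSITY_TO_LEVEL[verbosity]
    except KeyError:
        raise ValueError(f"verbosity must be one of {sorted(VERBOSITY_TO_LEVEL)}") from None


logger = logging.getLogger("verbosity_sketch")
logger.setLevel(level_for(-1))  # warnings and errors only
print(logger.isEnabledFor(logging.INFO), logger.isEnabledFor(logging.WARNING))
```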

In [None]:
# Initialize the interface
cd = ck.ClimateData(verbosity=-2) # only give error messages, quiet output

In [None]:

# Get a comprehensive overview of all available options
cd.show_all_options()

# Or you can see specific categories by uncommenting any of the lines below:
# cd.show_catalog_options()
# cd.show_activity_id_options()
# cd.show_institution_id_options()
# cd.show_source_id_options()
# cd.show_experiment_id_options()
# cd.show_table_id_options()
# cd.show_grid_label_options()
# cd.show_variable_options()
# cd.show_installation_options()
# cd.show_processors()
# cd.show_boundary_options()
# cd.show_station_options()

# this will be explored in more detail in the next section

## See the data options for a particular subset of inputs
You can explore options step by step, building your query as you learn about available data.

In [None]:
# Explore options step by step
print("=== Available Catalogs ===")
cd.show_catalog_options()

print("\n=== Choose 'renewable energy generation' catalog and explore installations ===")
renewables_explorer = cd.catalog("renewable energy generation")
renewables_explorer.show_installation_options()

print("\n=== Choose 'pv_utility' installation and explore variables ===")
pv_explorer = renewables_explorer.installation("pv_utility")
pv_explorer.show_variable_options()

You can also explore the climate data catalog:

In [None]:
print("=== Climate Data Catalog ===")
cd = ck.ClimateData()
data_explorer = cd.catalog("cadcat")

print("\n=== WRF (Dynamical Downscaling) Variables ===")
wrf_explorer = data_explorer.activity_id("WRF")
wrf_explorer.show_variable_options()

At any point in building your query, you can check what parameters you've set:

In [None]:
# Build a partial query and check its state
cd = ck.ClimateData()
partial_query = (cd
    .catalog("cadcat")
    .activity_id("WRF")
    .experiment_id("historical")
    .table_id("mon")
    .grid_label("d01")
)

# Check what we've built so far
partial_query.show_query()

# See what variable options are still available
print("\nAvailable variables for this query:")
partial_query.show_variable_options()

You can reset the interface to start a new query at any time:

In [None]:
cd.reset()
print("Interface reset - ready for new query")

## Retrieve data 
The ClimateData interface lets you chain method calls to build a readable query and then retrieve the data with a single call to `.get()`. 
<br><br>
Required components of the query depend on the data catalog you're interested in. In general, the required components for all catalogs are: 
- catalog 
- variable 
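
Before sending a dictionary query, it can be handy to check that the required components are present. The sketch below is a hypothetical pre-flight check, not a `climakitae` function; it assumes the dictionary key names used in this notebook's dictionary queries (`catalog` and `variable_id`).

```python
REQUIRED_KEYS = ("catalog", "variable_id")  # key names assumed from the dictionary queries


def missing_required(query):
    """Return the required keys that are absent or empty in a query dictionary."""
    return [key for key in REQUIRED_KEYS if not query.get(key)]


complete = {"catalog": "cadcat", "activity_id": "WRF", "variable_id": "t2"}
incomplete = {"catalog": "cadcat", "table_id": "mon"}

print(missing_required(complete))    # []
print(missing_required(incomplete))  # ['variable_id']
```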

### Example 1: Future air temperature data 
You can retrieve data using a dictionary query, or by chaining operations to the ClimateData object. Either is valid and will result in the same output data, so just use whichever method is most intuitive to you. 
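
Why the two styles are interchangeable is easiest to see with a toy builder: each chained method simply records one parameter and returns the object itself, so a chain and a dictionary end up in the same internal state. `QuerySketch` below is a hypothetical stand-in written for this illustration; it is not the `ClimateData` class and does not retrieve any data.

```python
class QuerySketch:
    """Hypothetical stand-in for a chainable query object (not the climakitae class)."""

    FIELDS = ("catalog", "activity_id", "experiment_id", "table_id", "grid_label", "variable_id")

    def __init__(self):
        self.params = {}

    def _set(self, key, value):
        self.params[key] = value
        return self  # returning self is what makes chaining possible

    def catalog(self, value):
        return self._set("catalog", value)

    def activity_id(self, value):
        return self._set("activity_id", value)

    def variable(self, value):
        return self._set("variable_id", value)

    def load_query(self, query):
        """Set several parameters at once from a dictionary, rejecting unknown keys."""
        unknown = set(query) - set(self.FIELDS)
        if unknown:
            raise KeyError(f"unknown query keys: {sorted(unknown)}")
        self.params.update(query)
        return self


chained = QuerySketch().catalog("cadcat").activity_id("WRF").variable("t2")
from_dict = QuerySketch().load_query(
    {"catalog": "cadcat", "activity_id": "WRF", "variable_id": "t2"}
)
print(chained.params == from_dict.params)
```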

#### Method 1: Chained operations (recommended)

In [None]:
cd.reset()
climate_data = (cd
    .catalog("cadcat")
    .activity_id("WRF")
    .experiment_id("ssp370")
    .table_id("mon")
    .grid_label("d02")
    .variable("t2")
).get()

#### Method 2: Dictionary query

In [None]:
# Define your query 
climate_query_dict = {
    "catalog": "cadcat", # Catalog name 
    "activity_id": "WRF", # Downscaling method 
    "experiment_id": "ssp370", # Simulation
    "table_id": "mon", # Temporal resolution 
    "grid_label": "d02", # Grid resolution
    "variable_id": "t2" # Variable name 
}

# Load the query 
climate_query = ck.ClimateData().load_query(climate_query_dict)

# Retrieve the data
climate_data = climate_query.get()

### Example 2: Renewable energy model data 
Note that the renewables catalog has an additional query option: `installation`. This indicates the energy generation method, a parameter that is only applicable for this particular catalog. 

In [None]:
# Define your query 
renewables_query_dict = {
    "catalog": "renewable energy generation", # Catalog name 
    "experiment_id": "historical", # Experiment (e.g. historical) 
    "table_id": "day", # Temporal resolution 
    "grid_label": "d03", # Grid resolution
    "variable_id": "cf", # Variable name 
    "installation": "pv_utility", # Renewables catalog only! 
    # "source_id": "MPI-ESM1-2-HR" # Optional: pick a simulation within the model 
}

# Load the query 
renewables_query = ck.ClimateData().load_query(renewables_query_dict)

# Retrieve the data
renewables_data = renewables_query.get()

## Working with Processors
You can further customize your data retrieval using `processors`, which perform operations on the data before it is returned to you. The available processors are: 

- **`concat`** - Concatenates datasets along a specified dimension. The default is to concatenate on "time" using a historical+ssp approach.
- **`filter_unadjusted_models`** - Removes or includes unadjusted models (default: "yes" to remove).
- **`update_attributes`** - Updates the attributes of your dataset based on the processors applied.
- `clip` - Applies spatial clipping to the requested dataset. Supported clipping types include point-based, bounding box, user-provided shapefiles, and built-in boundaries such as states, CA counties, CA watersheds, CA electric and utilities areas, CA demand forecast zones, CA electric balancing authority areas, and CA census tracts.
- `time_slice` - Applies a time slice to the requested dataset.
- `warming_level` - Applies a global warming level approach (as opposed to the default time-based approach). Please see our guidance on the use of global warming levels.
- `metric_calc` - Applies metric calculations to your dataset, such as min, max, mean, median, percentiles, and 1-in-X calculations.
- `convert_units` - Converts the units of your dataset.
- `bias_adjust_model_to_station` - For working with gridded data that is bias-adjusted to historical HADISD weather station data.
- `export` - Exports your requested dataset to a range of file formats.

The first three processors (bolded) are run by default every time that you retrieve data. Examples of other available processors can be found in the `climakitae` library documentation, or in other example notebooks. <br><br>
It's important to note that processors are applied as a **dictionary**. This enables you to add more than one processor to your chain of operations. 
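
As a rough mental model, a processor dictionary maps each processor name to its options, and the steps are applied one after another to the data. The sketch below implements two toy processors on a plain `{year: value}` dictionary. The functions, the `PROCESSORS` registry, and `apply_processes` are hypothetical, and running the steps in dictionary order is an assumption of this sketch; the library may apply its own ordering.

```python
def time_slice(series, bounds):
    """Keep entries whose year falls inside the inclusive (start, end) bounds."""
    start, end = bounds
    return {year: v for year, v in series.items() if start <= year <= end}


def metric_calc(series, metric):
    """Reduce the series to a single number."""
    reducers = {"min": min, "max": max, "mean": lambda xs: sum(xs) / len(xs)}
    return reducers[metric](list(series.values()))


# Registry of available steps (toy versions, for illustration only)
PROCESSORS = {"time_slice": time_slice, "metric_calc": metric_calc}


def apply_processes(data, processes):
    """Run each requested processor in turn, feeding each result to the next."""
    for name, option in processes.items():
        data = PROCESSORS[name](data, option)
    return data


series = {2030: 14.0, 2031: 15.0, 2032: 16.0, 2033: 20.0}
result = apply_processes(series, {"time_slice": (2030, 2032), "metric_calc": "mean"})
print(result)
```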

### Processor Example 1: Concatenation along a specified dimension
By default, when historical data is retrieved in the same operation as future data, the historical data will be appended to the future data, giving a single timeseries. However, you can change this default behavior by setting the query to concatenate along the simulation "`sim`" dimension instead. This will return the historical and future data as separate simulations. Concatenating by `time` and concatenating by `sim` each have their own benefits, and you'll need to decide which is most appropriate for your analysis. 

In the returned data from the code below, future time periods for the historical simulation will be infilled with `NaN`, because the model has no data for that time period by definition. The same logic applies to the future simulations: any time in the past will be infilled with `NaN`. 

In [None]:
cd.reset()
concat_by_sim = (cd
    .catalog("cadcat")
    .activity_id("WRF")
    .experiment_id(["historical", "ssp370"]) # Retrieve historical and future data 
    .table_id("mon")
    .grid_label("d01")
    .variable("prec") # Precipitation 
    .processes({
        "concat": "sim"  # Concatenate along simulation dimension
    })
    .get()
)
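
The `NaN` infill that results from concatenating along `sim` can be pictured without any climate data. In this sketch each simulation is a plain `{year: value}` dictionary with invented values; aligning them on the union of their time steps leaves gaps wherever a simulation has no data. `concat_by_sim_sketch` is a hypothetical helper and only mimics the shape of the result, not the library's implementation.

```python
import math


def concat_by_sim_sketch(named_series):
    """Align several {time: value} series on the union of their time steps."""
    all_times = sorted(set().union(*(s.keys() for s in named_series.values())))
    aligned = {
        name: [series.get(t, math.nan) for t in all_times]
        for name, series in named_series.items()
    }
    return all_times, aligned


historical = {2013: 1.0, 2014: 1.1}  # made-up values
ssp370 = {2015: 1.2, 2016: 1.3}      # made-up values

times, table = concat_by_sim_sketch({"historical": historical, "ssp370": ssp370})
print(times)
for name, row in table.items():
    print(name, row)
```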

For the next examples, we'll build a complete analysis using WRF data. Any data would work, but a consistent dataset makes it easier to see how the pieces fit together.

### Processor Example 2: Global Warming Levels (GWL)

Instead of a time-based approach, we recommend a GWL approach, which determines when each simulation reaches a given global warming level and slices a window around that time.

In [None]:
wrf_wl_data = (cd
    .verbosity(1)
    .catalog("cadcat")
    .activity_id("WRF")
    .table_id("1hr")
    .grid_label("d03")
    .variable("t2") # Temperature at 2 meters
    .processes({
        "warming_level": {
            "warming_levels": [1.5, 2.0, 3.0], # Warming levels in °C
            # "warming_level_months": [1, 2, 3] # Optional: specify months to consider for warming level calculation
            # "warming_level_window": 20 # Optional: specify the window size (in years) for the warming level calculation
        }
    })
    .get()
)

### Processor Example 3: Clipping to Boundaries

There are many options for clipping to boundaries. They include:
#### Clipping to a user specified shape file:
- `"clip": "<path to geopandas readable shape file>"`

#### Clipping to user specified lat/lons
- `"clip": (lat0, lon0)`
- `"clip": [(lat0, lon0), (lat1, lon1), ..., (latN, lonN)]`

#### Clipping to a HADISD station
Use `ClimateData().show_station_options()` to see a list of all accepted stations. Note that this method DOES NOT bias-adjust your data using historical station data; it simply pulls the data from the nearest grid cell in the data you're requesting.  
- `"clip": "KBFL"`
- `"clip": "Bakersfield Meadows Field (KBFL)"`
- `"clip": ["KBFL", "KBLH", "KBUR"]`
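
"Nearest grid cell" can be made concrete with a great-circle distance search. The sketch below uses the haversine formula over a handful of invented grid-cell centers; the station coordinates are approximate, and `nearest_cell` is a hypothetical helper. How `climakitae` locates the cell on the native WRF grid is not covered here.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    radius = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def nearest_cell(point, cells):
    """Return the grid-cell centre closest to `point`; both are (lat, lon) tuples."""
    return min(cells, key=lambda cell: haversine_km(*point, *cell))


# Made-up grid-cell centres near Bakersfield; station coordinates are approximate
grid = [(35.25, -119.25), (35.25, -119.00), (35.50, -119.25), (35.50, -119.00)]
station = (35.43, -119.05)
print(nearest_cell(station, grid))
```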

#### Clipping to `climakitae` supported boundaries
The supported boundary types can be seen with `ClimateData().show_boundary_options()`. To see a comprehensive list of the types you can use `ClimateData().show_boundary_options("<type>")`. Please be advised that some of these lists (like census tracts) are immense and may be listed by their numerical code.  

- `"clip": "Los Angeles County"`
- `"clip": ["Alameda County", "Los Angeles County"]`
- `"clip": {"boundaries": ["Alameda County", "Los Angeles County"], "separated": True}`
- `"clip": ["Alameda County", "City and County of San Francisco - Hetch Hetchy Water and Power"]`

Note that you may clip to multiple boundaries. In this case the union is returned by default, which plots cleanly (shown below). If you would like to preserve the location as a dimension for comparative analyses, take the approach demonstrated in bullet 3 above: provide a dictionary with a `"boundaries"` key holding your list of boundaries and specify `"separated": True`. This will produce a dimension named after the first type of boundary provided. For example, if you were to apply this approach to the example in bullet 4, you'd get a dimension named "county" since the first element in the list is a county.
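
The difference between the union and the separated result can be sketched with sets. Here grid cells are identified by made-up ids and each region is a set of the cells it contains; `clip_union` returns one merged selection, while `clip_separated` keeps one selection per region, which plays the role of the extra dimension. Both functions are hypothetical illustrations, not `climakitae` code.

```python
def clip_union(cells, boundaries):
    """Keep every cell that falls inside at least one boundary (single merged region)."""
    merged = set().union(*boundaries.values())
    return {cell: value for cell, value in cells.items() if cell in merged}


def clip_separated(cells, boundaries):
    """Keep one clipped result per boundary, keyed by boundary name."""
    return {
        name: {cell: value for cell, value in cells.items() if cell in members}
        for name, members in boundaries.items()
    }


# Toy data: values on four grid cells, and two made-up regions given as sets of cell ids
cells = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
boundaries = {"Region 1": {"a", "b"}, "Region 2": {"c"}}

print(clip_union(cells, boundaries))
print(clip_separated(cells, boundaries))
```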


In [None]:
cd.show_station_options()

In [None]:
cd.show_boundary_options()
cd.show_boundary_options("ca_counties")
cd.show_boundary_options("ious_pous")

In [None]:
wrf_wl_data = (cd
    .catalog("cadcat")
    .activity_id("WRF")
    .table_id("day")
    .grid_label("d03")
    .variable("t2") # Temperature at 2 meters
    .processes({
        "warming_level": {
            "warming_levels": [1.5, 2.0, 3.0], # Warming levels in °C
        },
        "clip": (34.05, -118.25)  # Clip to specified coordinates
    })
    .get()
)

wrf_wl_data

In [None]:
lat_lons = [
    (34.05, -118.25),  # Los Angeles, CA
    (37.77, -122.42),  # San Francisco, CA
]
wrf_wl_data = (cd
    .catalog("cadcat")
    .activity_id("WRF")
    .table_id("day")
    .grid_label("d03")
    .variable("t2") # Temperature at 2 meters
    .processes({
        "warming_level": {
            "warming_levels": [1.5, 2.0, 3.0], # Warming levels in °C
        },
        "clip": lat_lons  # Clip to specified coordinates
    })
    .get()
)

wrf_wl_data

In [None]:
wrf_wl_data = (cd
    .catalog("cadcat")
    .activity_id("WRF")
    .table_id("day")
    .grid_label("d03")
    .variable("t2") # Temperature at 2 meters
    .processes({
        "warming_level": {
            "warming_levels": [1.5, 2.0, 3.0], # Warming levels in °C
        },
        "clip": ["Alameda County", "Los Angeles County"]
    })
    .get()
)

wrf_wl_data

wrf_wl_data.isel(time_delta=0, sim=0, warming_level=0).t2.plot(y="lat", x="lon")

In [None]:
wrf_wl_data = (cd
    .catalog("cadcat")
    .activity_id("WRF")
    .table_id("day")
    .grid_label("d03")
    .variable("t2") # Temperature at 2 meters
    .processes({
        "warming_level": {
            "warming_levels": [1.5, 2.0, 3.0], # Warming levels in °C
        },
        "clip": {
            "boundaries": ["Alameda County", "Los Angeles County"],
            "separated": True
        }
    })
    .get()
)

wrf_wl_data

### Processor Example 4: Time Slicing

Time slicing is most useful with a time-based approach and doesn't work in conjunction with warming levels. Here is a quick demo that may be useful as a comparison: we run our standard global warming level analysis, extract the years on which each warming-level window is centered, and then slice to those years using a time-based approach. This shows how you can do some interesting analyses programmatically.

In [None]:
wrf_wl_data = (cd
    .catalog("cadcat")
    .activity_id("WRF")
    .table_id("day")
    .grid_label("d03")
    .variable("t2") # Temperature at 2 meters
    .processes({
        "warming_level": {
            "warming_levels": [1.5, 2.0, 3.0], # Warming levels in °C
        },
        "clip": "Los Angeles County"
    })
    .get()
)

sim0 = wrf_wl_data.isel(sim=0)
# Take the centered years from the selected simulation so they match sim0
centered_years = sim0.centered_year.values

In [None]:
time_slice_data = []
valid_years = []
for year in centered_years:
    print(year)
    time_slice = (str(year - 15), str(year + 15))
    print(time_slice)
    result = (cd
        .catalog("cadcat")
        .activity_id("WRF")
        .table_id("day")
        .grid_label("d03")
        .variable("t2") # Temperature at 2 meters
        .experiment_id(sim0.attrs['experiment_id'])
        .source_id(sim0.attrs['source_id'])
        .processes({
            "clip": "Los Angeles County",
            "time_slice": time_slice
        })
        .get()
    )
    # Only append if we got valid data
    if result is not None:
        time_slice_data.append(result)
        valid_years.append(year)

# Only concatenate if we have valid data
if time_slice_data:
    sim0_sliced = xr.concat(
        time_slice_data, 
        dim=pd.Index(
            valid_years, 
            name='centered_year'
        )
    )
    # compare sim0 and sim0_sliced
    display(sim0)
    display(sim0_sliced)
else:
    print("No valid data returned from time slice queries")

### Processor Example 5: Unit Conversion

Unit conversion is another example of a post-fetch manipulation to the dataset. It's as simple as specifying the conversion you'd like to apply. If it fails, you'll be informed in the logs about available units to convert to and no unit conversion will be applied.
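
The behavior described above (convert when possible, otherwise report the available units and leave the data untouched) can be sketched as a small lookup table. The `CONVERTERS` table and the `convert_units` function below are hypothetical and cover only two temperature conversions; they illustrate the pattern, not the library's actual unit handling.

```python
# Hypothetical conversion table keyed by (source unit, target unit)
CONVERTERS = {
    ("K", "degC"): lambda k: k - 273.15,
    ("K", "degF"): lambda k: (k - 273.15) * 9.0 / 5.0 + 32.0,
}


def convert_units(values, source, target):
    """Convert `values` if a converter exists; otherwise report options and return them unchanged."""
    func = CONVERTERS.get((source, target))
    if func is None:
        options = sorted(t for (s, t) in CONVERTERS if s == source)
        print(f"Cannot convert {source} to {target}; available targets: {options}")
        return values, source
    return [func(v) for v in values], target


print(convert_units([273.15, 300.0], "K", "degF"))
print(convert_units([273.15, 300.0], "K", "mph"))
```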

In [None]:
wrf_wl_data = (cd
    .catalog("cadcat")
    .activity_id("WRF")
    .table_id("day")
    .grid_label("d03")
    .variable("t2") # Temperature at 2 meters
    .processes({
        "warming_level": {
            "warming_levels": [1.5, 2.0, 3.0], # Warming levels in °C
        },
        "clip": "Los Angeles County",
        "convert_units": "degF" # this will convert from Kelvin to Fahrenheit
    })
    .get()
)

wrf_wl_data

In [None]:
wrf_wl_data.isel(time_delta=0, sim=0, warming_level=0).t2.plot(y="lat", x="lon")

### Processor Example 6: Metric Calculation

For convenience some basic metric calculations have been built in as a processor. The basic options include: `min`, `mean`, `median`, `max`, and `percentiles` and are demonstrated below. 

For advanced analyses, 1-in-X calculations are also available as a processor.

#### Simple Metric Comparison Across Warming Levels

Let's start with a simple metric comparison. We'll evaluate the min, mean, and max across three global warming levels, and then plot all those in a matrix. 

In [None]:
metrics = ['min', 'mean', 'max']
data = []
for metric in metrics:
    data.append(
        (cd.catalog("cadcat")
        .activity_id("WRF")
        .table_id("day")
        .grid_label("d03")
        .variable("t2")
        .processes({
            "warming_level": {
                "warming_levels": [1.5, 2.0, 3.0],
            },
            "clip": "Humboldt County",
            "convert_units": "degF",
            "metric_calc": {
                "metric": metric,
                "dim": ["time_delta", "sim"]
                # note: don't average over warming level
            }
        })
        .get())
    )

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(15, 12), sharex=True, sharey=True)

gwls = [1.5, 2.0, 3.0][::-1]  # Reverse for plotting
for i, metric in enumerate(metrics):
    for j, gwl in enumerate(gwls):
        ax = axes[j, i]
        data[i].isel(warming_level=len(gwls) - 1 - j).t2.plot.contourf(
            ax=ax,
            y="lat",
            x="lon",
            cbar_kwargs={'label': 'Temperature (°F)'}, 
            levels=100,
            cmap='plasma',
            vmin=0,
            vmax=105
        )
        ax.set_title(f'{metric.capitalize()} at {gwl}°C Warming Level')
        
plt.show()

#### Advanced Metric Comparison Across Warming Levels

Now let's do a slightly more advanced analysis. Let's define a reference period as a global warming level of 1.2 degC and calculate the 90th, 95th, and 98th percentile temperatures within this reference period. Next we'll get the average temperature for three global warming levels (1.5, 2.0, 3.0) and count the average number of days per year above the 90th, 95th, and 98th percentile reference period temperature for each warming level. Then we'll plot the results in a matrix.
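
The core of this analysis is two small steps: compute a percentile threshold from the reference period, then count how many values in another period exceed it and divide by the number of years. The sketch below shows those steps on invented one-dimensional data using only the standard library. The `percentile` helper uses linear interpolation between closest ranks, which is one common convention and an assumption here; the library's percentile method may differ.

```python
def percentile(values, q):
    """Percentile with linear interpolation between closest ranks (q in 0..100)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def days_per_year_above(daily_values, threshold, n_years):
    """Count values above the threshold and express the count per year."""
    return sum(value > threshold for value in daily_values) / n_years


# Made-up data: a 730-day reference record and a future record shifted 1 degree warmer
reference = [float(v % 100) for v in range(730)]
future = [v + 1.0 for v in reference]

p90 = percentile(reference, 90)
print(p90, days_per_year_above(future, p90, n_years=2))
```

In the notebook's own calculation the same division appears as `/ 30`, which corresponds to the number of years in the window being counted.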

In [None]:
percentiles = [90, 95, 98]
location = "Humboldt County"
ref_percentiles = (
    cd.catalog("cadcat")
    .activity_id("WRF")
    .table_id("day")
    .grid_label("d03")
    .variable("t2")
    .processes({
        "warming_level": {
            "warming_levels": [1.2],
        },
        "clip": location,
        "convert_units": "degF",
        "metric_calc": {
            "percentiles": percentiles,
            "dim": ["time_delta", "sim"]
        }
    })
    .get()
)

wrf_wl_avg = (
    cd.catalog("cadcat")
    .activity_id("WRF")
    .table_id("day")
    .grid_label("d03")
    .variable("t2")
    .processes({
        "warming_level": {
            "warming_levels": [1.5, 2.0, 3.0],
        },
        "clip": location,
        "convert_units": "degF",
        "metric_calc": {
            "metric": "mean",
            "dim": ["sim"]
        }
    })
    .get()
)

In [None]:
display(ref_percentiles)
display(wrf_wl_avg)

In [None]:
%%time

fig, axes = plt.subplots(3, 3, figsize=(15, 12), sharex=True, sharey=True)

# mask for the spatial coordinates so that we can preserve them after counting
spatial_mask = ref_percentiles['t2_p90'].isel(warming_level=0).notnull()

gwls = [1.5, 2.0, 3.0][::-1]  # Reverse for plotting
for i, metric in enumerate(percentiles):
    for j, gwl in enumerate(gwls):
        ax = axes[j, i]

        days_exceeding = (
            wrf_wl_avg.isel(warming_level=len(gwls) - 1 - j).t2 > ref_percentiles.isel(warming_level=0)[f"t2_p{metric}"]
        ).sum(dim='time_delta', skipna=False) / 30

        # preserve the masked county
        days_exceeding = days_exceeding.where(spatial_mask)

        # contour plot
        days_exceeding.plot.contourf(
            ax=ax,
            y="lat",
            x="lon",
            cbar_kwargs={'label': 'Avg Days Per Year Above Reference'},
            levels=100,
            cmap='plasma',
            vmin=0, vmax=100
        )
        ax.set_title(f'Avg Days / Year > {metric}th pctl T2 at {gwl}°C GWL')

plt.suptitle("Average Number of Days Above Reference Percentile Temp for several Global Warming Levels")
plt.show()

### Processor Example 7: Export