# Basic data access 
This notebook showcases helper functions from `climakitae` that let you access and export the AE catalog data, perform spatial subsetting, and view the available data options. The same functions can be used in a standalone Python script.

In [None]:
import climakitae as ck 
from climakitae.new_core.user_interface import ClimateData

## High-level details 
The AE data catalog contains many different types of data. Our helper library `climakitae` aims to make accessing and retrieving this data intuitive, and to simplify downstream climate and statistical analysis by performing some data transformations as the data is retrieved.<br><br> To retrieve the data, you'll need to select a climate variable, data resolution, location settings, and several other options. You'll also need to set a few high-level options, detailed below: 

### Data type: Gridded or Stations
**Gridded**: Gridded (i.e. raster) climate data at various spatial resolutions.<br><br>
**Stations**: Gridded (i.e. raster) climate data at unique grid cell(s) corresponding to the central coordinates of the selected weather station(s). 
- This data is bias-corrected (i.e., localized) to the exact location of the weather station using the historical in-situ data from the weather station(s). 
- This data is currently only available for dynamically downscaled air temperature data. 

### Scientific approach: Time or Warming Level
**Time**: Retrieve the data using a traditional time-based approach that allows you to select historical data, future projections, or both, along with a time-slice of interest. 
- “Historical Climate” includes data from 1980-2014 simulated by the same GCMs used to produce the Shared Socioeconomic Pathways (SSPs). It is automatically appended to an SSP time series when both are selected. Because this historical data comes from simulations, it represents average weather during the historical period and is not meant to capture the historical time series as it actually occurred.
- “Historical Reconstruction” provides a reference downscaled [reanalysis](https://www.ecmwf.int/en/about/media-centre/focus/2020/fact-sheet-reanalysis) dataset based on atmospheric models fit to satellite and station observations, and as a result will reflect observed historical time-evolution of the weather.
- Future projections are available for [greenhouse gas emission scenario (Shared Socioeconomic Pathway, or SSP)](https://climatescenarios.org/primer/socioeconomic-development) SSP 3-7.0 through 2100 with the dynamically-downscaled General Circulation Models (GCMs).
     - One GCM was also downscaled for two additional SSPs (SSP 2-4.5 and SSP 5-8.5)<br>

**Warming Level**: Retrieve the data by future global warming levels, which will automatically retrieve all available model data for the historical+future period and then calculate the time window around which each simulation reaches the selected warming level.  
- Because warming levels are defined based on amount of global mean temperature change, they can be used to compare possible outcomes across multiple scenarios or model simulations.
- This approach includes all simulations that reach a specified amount of warming, regardless of when they reach it. The time-based approach, by contrast, first subsets the simulations to those that follow a given SSP trajectory.
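
To make the idea of a "time window around which each simulation reaches the selected warming level" concrete, here is a minimal pure-Python sketch. It is not how `climakitae` computes warming levels: the function name `warming_level_window`, the 20-year running mean, the 30-year window, and the synthetic anomaly series are all assumptions chosen for illustration.

```python
def warming_level_window(years, anomalies, level, smooth=20, window=30):
    """Return (start_year, end_year) of a window centered on the first year
    whose running-mean anomaly reaches `level`, or None if it never does.

    `smooth` and `window` are illustrative defaults, not library settings.
    """
    half = smooth // 2
    for i in range(half, len(years) - half + 1):
        chunk = anomalies[i - half : i + half]  # `smooth` years around year i
        if sum(chunk) / len(chunk) >= level:
            center = years[i]
            return (center - window // 2, center + window // 2 - 1)
    return None  # this simulation never reaches the warming level


years = list(range(1980, 2101))

# Synthetic global-mean temperature anomalies (degrees C), purely illustrative
fast_sim = [0.03 * (y - 1980) for y in years]   # warms quickly
slow_sim = [0.015 * (y - 1980) for y in years]  # warms slowly

fast_window = warming_level_window(years, fast_sim, level=2.0)
slow_window = warming_level_window(years, slow_sim, level=2.0)

print("fast simulation window:", fast_window)
print("slow simulation window:", slow_window)
```

In this toy setup the fast-warming simulation gets its own window, while the slow one returns `None` because it never reaches 2.0 degrees of warming before 2100 and so would be left out of a warming-level analysis.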
    
### Downscaling method: Dynamical, Statistical, or both
**Dynamical**: [Dynamically downscaled](https://dept.atmos.ucla.edu/alexhall/downscaling-cmip6) WRF data, produced at hourly intervals. If you select 'daily' or 'monthly' for 'Timescale', you will receive an average of the hourly data. The spatial resolution options, on the other hand, are each the output of a different simulation, nesting to higher resolution over smaller areas.<br><br>
**Statistical**: [Hybrid-statistically downscaled](https://loca.ucsd.edu) LOCA2-Hybrid data, available at daily and monthly timescales. Multiple LOCA2-Hybrid simulations are available (100+) at a fine spatial resolution of 3km.
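
As a rough illustration of what "an average of the hourly data" means for the dynamically downscaled output, the sketch below averages synthetic hourly values into daily means using only the standard library. The helper `daily_means` and the sample temperatures are hypothetical; the library performs this aggregation for you when you select a daily or monthly timescale.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_means(timestamps, values):
    """Group hourly values by calendar day and return the mean for each day."""
    buckets = defaultdict(list)
    for ts, value in zip(timestamps, values):
        buckets[ts.date()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


# Two days of synthetic hourly air temperature (K), for illustration only
start = datetime(2020, 1, 1)
timestamps = [start + timedelta(hours=h) for h in range(48)]
values = [280.0 + (h % 24) for h in range(48)]

daily = daily_means(timestamps, values)
for day, mean in daily.items():
    print(day, mean)
```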

## See the options in our data catalog
The interface provides several methods to explore available data options. You can get a comprehensive overview or explore step by step.

In [None]:
# Initialize the interface
cd = ClimateData()

# Get a comprehensive overview of all available options
cd.show_all_options()

## See the data options for a particular subset of inputs
You can explore options step by step, building your query as you learn about available data.

In [None]:
# Explore options step by step
print("=== Available Catalogs ===")
cd.show_catalog_options()

print("\n=== Choose 'renewable energy generation' catalog and explore installations ===")
renewables_explorer = cd.catalog("renewable energy generation")
renewables_explorer.show_installation_options()

print("\n=== Choose 'pv_utility' installation and explore variables ===")
pv_explorer = renewables_explorer.installation("pv_utility")
pv_explorer.show_variable_options()

You can also explore the climate data catalog:

In [None]:
print("=== Climate Data Catalog ===")
cd = ClimateData()
data_explorer = cd.catalog("cadcat")

print("\n=== WRF (Dynamical Downscaling) Variables ===")
wrf_explorer = data_explorer.activity_id("WRF")
wrf_explorer.show_variable_options()

At any point in building your query, you can check what parameters you've set:

In [None]:
# Build a partial query and check its state
cd = ClimateData()
partial_query = (cd
    .catalog("cadcat")
    .activity_id("WRF")
    .experiment_id("historical")
    .table_id("mon")
    .grid_label("d01")
)

# Check what we've built so far
partial_query.show_query()

# See what variable options are still available
print("\nAvailable variables for this query:")
partial_query.show_variable_options()

You can reset the interface to start a new query at any time:

In [None]:
cd.reset()
print("Interface reset - ready for new query")

## Retrieve data 
The ClimateData interface lets you chain method calls to build a readable query and then retrieve the data it describes. 
<br><br>
Required components of the query depend on the data catalog you're interested in. In general, the required components for all catalogs are: 
- catalog 
- variable 
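
If you build queries as dictionaries, a small pre-check can catch a missing component before you request any data. The sketch below is a hypothetical helper, not part of `climakitae`. It assumes the two required components listed above map to the dictionary keys `"catalog"` and `"variable_id"`, as used in the dictionary queries in this notebook.

```python
# Assumed mapping of the required components to dictionary keys
REQUIRED_KEYS = ("catalog", "variable_id")


def missing_components(query):
    """Return the required keys that are absent or empty in a query dict."""
    return [key for key in REQUIRED_KEYS if not query.get(key)]


complete_query = {"catalog": "cadcat", "variable_id": "t2"}
incomplete_query = {"catalog": "cadcat", "table_id": "mon"}

print(missing_components(complete_query))
print(missing_components(incomplete_query))
```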

### Example 1: Future air temperature data 
You can retrieve data using a dictionary query, or by chaining operations to the ClimateData object. Either is valid and will result in the same output data, so just use whichever method is most intuitive to you. 

#### Method 1: Dictionary query

In [None]:
# Define your query 
climate_query_dict = {
    "catalog": "cadcat", # Catalog name 
    "activity_id": "WRF", # Downscaling method 
    "experiment_id": "ssp370", # Simulation
    "table_id": "mon", # Temporal resolution 
    "grid_label": "d02", # Grid resolution
    "variable_id": "t2" # Variable name 
}

# Load the query 
climate_query = ClimateData().load_query(climate_query_dict)

# Retrieve the data
climate_data = climate_query.get()

#### Method 2: Chained operations

This will return the same data as above, but by chaining operations instead!

In [None]:
cd.reset()
climate_data = (cd
    .catalog("cadcat")
    .activity_id("WRF")
    .experiment_id("ssp370")
    .table_id("mon")
    .grid_label("d02")
    .variable("t2")
).get()

### Example 2: Renewable energy model data 
Note that the renewables catalog has an additional query option: `installation`. This indicates the energy generation method, a parameter that is only applicable for this particular catalog. 

In [None]:
# Define your query 
renewables_query_dict = {
    "catalog": "renewable energy generation", # Catalog name 
    "experiment_id": "historical", # Simulation 
    "table_id": "day", # Temporal resolution 
    "grid_label": "d03", # Grid resolution
    "variable_id": "cf", # Variable name 
    "installation": "pv_utility", # Renewables catalog only! 
    # "source_id": "MPI-ESM1-2-HR" # Optional: pick a simulation within the model 
}

# Load the query 
renewables_query = ClimateData().load_query(renewables_query_dict)

# Retrieve the data
renewables_data = renewables_query.get()

## Working with Processors
You can further customize your data retrieval using `processors`, which perform operations on the data before it is returned to you. For example, two available processors are: 

- **`concat`** - Concatenate datasets along specified dimensions. The default behavior is to concatenate on "time" using a historical+SSP approach.
- **`filter_unbiased_models`** - Remove or include unbiased models (default: "yes" to remove)

These two processors are run by default every time that you retrieve data. Examples of other available processors can be found in the `climakitae` library documentation, or in other example notebooks. <br><br>
It's important to note that processors are applied as a **dictionary**. This enables you to add more than one processor to your chain of operations. 
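
The sketch below illustrates why a dictionary is a convenient container for processors: each key names a processor, so several can be specified at once and individual defaults can be overridden. The default values come from the list above. The helper `resolve_processors` and the key-by-key override behavior are assumptions made for illustration, not the library's actual implementation.

```python
# Defaults as described above (assumed to be applied when not overridden)
DEFAULT_PROCESSORS = {"concat": "time", "filter_unbiased_models": "yes"}


def resolve_processors(user_processors=None):
    """Merge user-specified processors over the defaults, key by key."""
    merged = dict(DEFAULT_PROCESSORS)
    merged.update(user_processors or {})
    return merged


print(resolve_processors())
print(resolve_processors({"concat": "sim"}))
```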

### Example 3: Concatenation along a specified dimension
By default, when historical data is retrieved in the same operation as future data, the historical data is appended to the future data, giving a single timeseries. You can change this default by setting the query to concatenate along the simulation "`sim`" dimension instead, which returns the historical and future data as separate simulations. Concatenating by `time` and concatenating by `sim` each have their own benefits, and you'll need to decide which method is most appropriate for your analyses. 

In the returned data from the code below, future time periods for the historical simulation will be infilled with `NaN`, because the model has no data for that time period by definition. The same logic applies to the future simulations: any time in the past will be infilled with `NaN`. 

In [None]:
concat_by_sim = (cd
    .catalog("cadcat")
    .experiment_id(["historical", "ssp370"]) # Retrieve historical and future data 
    .table_id("mon")
    .grid_label("d01")
    .variable("prec") # Precipitation 
    .processes({
        "concat": "sim"  # Concatenate along simulation dimension
    })
    .get()
)
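
To see where the `NaN` values come from without downloading any data, here is a toy version of the two concatenation modes using plain dictionaries. The helper `stack_by_sim` and the yearly values are hypothetical and only mimic the shape of the result; the real output is an xarray object.

```python
import math


def stack_by_sim(series_by_sim):
    """Align each simulation on the union of all time steps, filling gaps with NaN."""
    all_times = sorted({t for series in series_by_sim.values() for t in series})
    table = {
        sim: [series.get(t, math.nan) for t in all_times]
        for sim, series in series_by_sim.items()
    }
    return all_times, table


# Synthetic yearly values, for illustration only
historical = {2013: 1.0, 2014: 1.1}
ssp370 = {2015: 1.2, 2016: 1.3}

# Concatenating along "sim": two rows on a shared time axis, gaps become NaN
times, by_sim = stack_by_sim({"historical": historical, "ssp370": ssp370})
print(times)
print(by_sim)

# Concatenating along "time": one continuous series, no NaN needed
by_time = {**historical, **ssp370}
print(by_time)
```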

## Exporting data  
To save data as a file, call `ck.export` and provide: 
1. data to export – an [xarray DataArray or Dataset](https://docs.xarray.dev/en/stable/user-guide/data-structures.html) (this is the default data type returned by `ClimateData.get()`)
2. output file name (without file extension)
3. file format ("NetCDF", "Zarr", or "CSV")

We recommend NetCDF or Zarr, which suit Analytics Engine data and outputs well: both efficiently store large datasets with multiple variables and dimensions, and both retain metadata.

NetCDF or Zarr files can be exported locally (such as onto the JupyterHub user partition). Optionally, Zarr can be exported to an AWS S3 scratch bucket for storing very large exports.

CSV can also store Analytics Engine data with any number of variables and dimensions, but it works best for smaller data with fewer dimensions. The output file will be compressed to ensure efficient storage. Metadata will be preserved in a separate file.

CSV stores data in tabular format. Rows will be indexed by the index coordinate(s) of the DataArray or Dataset (e.g. scenario, simulation, time). Columns will be formed by the data variable(s) and non-index coordinate(s).
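
The tabular layout can be pictured with a small standard-library sketch: every combination of index coordinates becomes a row, and the data variable becomes a column. This is not how `ck.export` is implemented. The helper `to_rows`, the coordinate names, and the values are assumptions for illustration only.

```python
import csv
import io
from itertools import product


def to_rows(coords, values, var_name):
    """Flatten values keyed by coordinate tuples into a header and table rows."""
    header = list(coords) + [var_name]
    rows = [list(index) + [values[index]] for index in product(*coords.values())]
    return header, rows


# Synthetic two-dimensional data (simulation x time), for illustration only
coords = {"simulation": ["sim_a", "sim_b"], "time": ["2023-01", "2023-02"]}
values = {
    ("sim_a", "2023-01"): 1.5,
    ("sim_a", "2023-02"): 2.0,
    ("sim_b", "2023-01"): 0.5,
    ("sim_b", "2023-02"): 3.0,
}

header, rows = to_rows(coords, values, "prec")

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Each extra dimension multiplies the number of rows, which is one reason CSV is better suited to smaller data with fewer dimensions.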

In [None]:
# Download a subset of the data from example 3 
data_to_export = concat_by_sim.sel(time=slice("2023", "2025")) # Just grab a few years of data (2023-2025)
ck.export(data_to_export, filename="my_filename1", format="NetCDF") 