# Accessing renewables data 
Data access for our derived renewables data is still a work in progress as we build a data catalog and continue generating data products. Eventually, helper functions will be incorporated into `climakitae` to streamline data access. For the time being, here's the best way to access this data using Python.<br><br>For more details on data availability and production, check our memo here: [https://wfclimres.s3.amazonaws.com/era/data-guide_pv-wind.pdf](https://wfclimres.s3.amazonaws.com/era/data-guide_pv-wind.pdf)

## The basics
Retrieve renewables data from the AWS S3 bucket and load it into your Python session as `xarray.Dataset` objects. 

In [None]:
import intake 

First, read in the catalog file from S3 using the `intake` package. This enables you to view and load the data easily. 

In [None]:
# Read the catalog JSON file from AWS S3 using its HTTPS URL 
cat = intake.open_esm_datastore("https://wfclimres.s3.amazonaws.com/era/era-ren-collection.json")

You can easily view the entire renewables data catalog using the `.df` accessor:  

In [None]:
# Access catalog as dataframe and inspect the first few rows
cat_df = cat.df
cat_df.head()

To see all the available options for each column, use the `.unique()` call for the column of interest: 

In [None]:
# See all model options 
# Replace "source_id" with a different column name to see other unique column options
cat_df["source_id"].unique()

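To survey every column at once instead of one column at a time, a small helper along the following lines may be handy. This is only a sketch: `toy_df` is a made-up stand-in for `cat.df` so the example runs without network access, `unique_options` is a hypothetical helper name, and the `path` column name is an assumption about how the catalog stores file locations.

```python
import pandas as pd

# Toy stand-in for cat.df; these rows are illustrative, not real catalog entries
toy_df = pd.DataFrame(
    {
        "source_id": ["EC-Earth3", "EC-Earth3", "MPI-ESM1-2-HR"],
        "experiment_id": ["historical", "ssp370", "historical"],
        "variable_id": ["cf", "cf", "gen"],
        "path": ["store_a", "store_b", "store_c"],  # assumed column name
    }
)


def unique_options(df, skip=("path",)):
    """Map each column name to a sorted list of its unique values.

    Columns listed in `skip` are left out, since file paths are unique
    per row and not useful as query options.
    """
    return {
        col: sorted(df[col].dropna().unique())
        for col in df.columns
        if col not in skip
    }


options = unique_options(toy_df)
for column, values in options.items():
    print(f"{column}: {values}")
```

With the real catalog, you would pass `cat_df` in place of `toy_df`.
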
You can subset the catalog and read in the Zarr stores as `xarray.Dataset` objects using the method shown below. To change which data are loaded, modify the values in the `query` dictionary. 

In [None]:
# Form query dictionary
query = {
    # GCM name
    'source_id': 'EC-Earth3',
    # time period - historical or emissions scenario
    'experiment_id': ['historical', 'ssp370'],
    # variable: 'cf' or 'gen' 
    'variable_id': 'cf',
    # time resolution 
    'table_id': 'day',
    # grid resolution: d02 = 9km, d03 = 3km
    'grid_label': 'd03',
    # installation type: 'pv_distributed', 'pv_utility', 'windpower_offshore', 'windpower_onshore'
    'installation': ['pv_distributed','pv_utility']
}

# Subset catalog 
cat_subset = cat.search(**query)

# View the data you've selected before downloading
cat_subset.df

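Then, you can open all the selected datasets. The data values are loaded lazily, so nothing is read into memory until you call `.compute()` or otherwise use the values. The datasets are returned as a dictionary with the following format: 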
```
{ <string ID of data> : <xarray.Dataset for that ID> }
```
For example, for the data below: 
```python
{
    'pv_distributed.WRF.ERA.EC-Earth3.ssp370.day.d03'      : <xarray.Dataset> ,
    'pv_utility.WRF.ERA.EC-Earth3.historical.day.d03'      : <xarray.Dataset> ,
    'pv_utility.WRF.ERA.EC-Earth3.ssp370.day.d03'          : <xarray.Dataset> ,
    'pv_distributed.WRF.ERA.EC-Earth3.historical.day.d03'  : <xarray.Dataset> ,
}
```
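
If you want to know ahead of time which string IDs a query should produce, you can enumerate the combinations yourself. The sketch below assumes the keys always follow the pattern seen in the example above (installation, two fixed fields, model, experiment, time resolution, grid), and that `variable_id` is not part of the key. The helper name `expected_keys` and the argument names `activity` and `institution` are hypothetical labels for the `WRF` and `ERA` fields.

```python
from itertools import product


def expected_keys(query, activity="WRF", institution="ERA"):
    """Enumerate the dataset keys a query is expected to produce.

    Assumes the key pattern inferred from the example keys:
    installation.activity.institution.source_id.experiment_id.table_id.grid_label
    """

    def as_list(value):
        # Query values may be a single string or a list of strings
        return [value] if isinstance(value, str) else list(value)

    fields = [
        as_list(query["installation"]),
        [activity],
        [institution],
        as_list(query["source_id"]),
        as_list(query["experiment_id"]),
        as_list(query["table_id"]),
        as_list(query["grid_label"]),
    ]
    # One key per combination of the requested options
    return sorted(".".join(parts) for parts in product(*fields))


query = {
    "source_id": "EC-Earth3",
    "experiment_id": ["historical", "ssp370"],
    "variable_id": "cf",
    "table_id": "day",
    "grid_label": "d03",
    "installation": ["pv_distributed", "pv_utility"],
}

keys = expected_keys(query)
for key in keys:
    print(key)
```

Two experiments times two installation types gives four keys. If the real dictionary has fewer entries than expected, some combinations are probably not in the catalog.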

In [None]:
# Get dataset dictionary 
dsets = cat_subset.to_dataset_dict(
    xarray_open_kwargs={'consolidated': True},
    storage_options={'anon': True}
)

To see all the string IDs for the Datasets in the dictionary, you can print them with the following code: 

In [None]:
list(dsets.keys())

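When the dictionary holds many datasets, it can help to split each string ID into its parts so you can filter by experiment or installation type. The following is a minimal sketch that assumes every key has the seven dot-separated fields seen in the examples above. `DatasetKey` and `parse_key` are hypothetical names, and the `activity` and `institution` field labels are guesses for the `WRF` and `ERA` parts.

```python
from collections import namedtuple

# Field names are assumptions inferred from the example keys
DatasetKey = namedtuple(
    "DatasetKey",
    [
        "installation",
        "activity",
        "institution",
        "source_id",
        "experiment_id",
        "table_id",
        "grid_label",
    ],
)


def parse_key(key):
    """Split a dataset string ID into named fields."""
    parts = key.split(".")
    if len(parts) != len(DatasetKey._fields):
        raise ValueError(f"Unexpected key format: {key!r}")
    return DatasetKey(*parts)


# Example keys copied from the text above
example_keys = [
    "pv_distributed.WRF.ERA.EC-Earth3.ssp370.day.d03",
    "pv_utility.WRF.ERA.EC-Earth3.historical.day.d03",
    "pv_utility.WRF.ERA.EC-Earth3.ssp370.day.d03",
    "pv_distributed.WRF.ERA.EC-Earth3.historical.day.d03",
]

# Keep only the keys for the SSP3-7.0 scenario
ssp370_keys = [k for k in example_keys if parse_key(k).experiment_id == "ssp370"]
print(ssp370_keys)
```

With real data, you would pass `list(dsets.keys())` in place of `example_keys`.
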
You can access any dataset in the dictionary using the following format: 
```
dsets[<string ID of data>]
```
For example:

In [None]:
# Retrieve a single file
ds = dsets["pv_distributed.WRF.ERA.EC-Earth3.ssp370.day.d03"]
ds

## Make a quick plot of the data 
`xarray` has some nice mapping features that enable you to quickly generate a plot for a single timestep. This lets you get a sense for the data you read in. 

In [None]:
one_timestep = ds['cf'].isel(time=0).compute() # Select the first timestep and read it into memory 
one_timestep.plot();

## Get the closest gridcell for a coordinate pair 
For this, we'll use a helper function from `climakitae`. We'll demonstrate how to do this for the coordinates of the city of San Francisco. 

In [None]:
from climakitae.util.utils import get_closest_gridcell
import numpy as np

First, we need to retrieve the data using `intake`: 

In [None]:
# Form query dictionary
query = {
    'source_id': 'MPI-ESM1-2-HR',
    'experiment_id': 'historical',
    'variable_id': 'gen',
    'table_id': 'day',
    'grid_label': 'd03',
    'installation': 'pv_distributed'
}

# Subset catalog 
cat_subset = cat.search(**query)

# Get dataset dictionary 
dsets = cat_subset.to_dataset_dict(
    xarray_open_kwargs={'consolidated': True},
    storage_options={'anon': True}
)

# Retrieve the data object 
ds = dsets['pv_distributed.WRF.ERA.MPI-ESM1-2-HR.historical.day.d03']

Next, let's use `climakitae`'s utility function `get_closest_gridcell` to grab the model gridcell that is closest to the coordinates for the city of San Francisco. <br><br>**NOTE**: The renewables data has missing values where data was not generated for a variety of reasons, so this function may return `nan` if the gridcell closest to your coordinates falls in one of these missing-value regions. Missing data regions vary by technology type. 

In [None]:
# Coordinates of San Francisco 
lat = 37.7749
lon = -122.4194

# Reassign attribute so the function can find the resolution 
ds.attrs["resolution"] = ds.attrs["nominal_resolution"]

# Use the function to get the closest gridcell of data 
closest_gridcell = get_closest_gridcell(data=ds, lat=lat, lon=lon)

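To build intuition for what a closest-gridcell lookup does, and why it can land on a missing value, here is a standalone sketch using `numpy` only. It is not `climakitae`'s implementation: `nearest_gridcell_index` is a hypothetical helper, the 3x3 grid and its values are made up, and the great-circle (haversine) distance is just one reasonable choice of metric.

```python
import numpy as np


def nearest_gridcell_index(lat2d, lon2d, lat, lon):
    """Return the (row, col) of the grid cell center closest to (lat, lon).

    Uses the haversine formula; the distance is left in units of Earth
    radii because only the ranking of cells matters here.
    """
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(lat2d), np.radians(lon2d)
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    dist = 2 * np.arcsin(np.sqrt(a))
    return np.unravel_index(np.argmin(dist), dist.shape)


# Toy grid around San Francisco (illustrative values, not real model output)
lat_1d = np.array([37.5, 37.75, 38.0])
lon_1d = np.array([-122.75, -122.5, -122.25])
lon2d, lat2d = np.meshgrid(lon_1d, lat_1d)

values = np.arange(9, dtype=float).reshape(3, 3)
values[1, 1] = np.nan  # pretend no data was generated for this cell

row, col = nearest_gridcell_index(lat2d, lon2d, lat=37.7749, lon=-122.4194)
cell_value = values[row, col]
is_missing = bool(np.isnan(cell_value))
print(f"Nearest cell: ({row}, {col}), missing: {is_missing}")
```

In this toy case the nearest cell is the one flagged as missing, which mirrors the situation described in the note above. With the real data, it is worth checking the returned gridcell for `nan` values before plotting or analysis.
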
Finally, let's make a quick plot of the data for the first year of the timeseries. 

In [None]:
# Get the first 365 days of data and read into memory 
to_plot = closest_gridcell.isel(time=np.arange(0,365)).compute()

# Generate a simple lineplot 
to_plot.gen.plot();