# GOAL

Encode the temperature information from our NetCDF exercise into an `xarray.Dataset`. 

For reference, the exercise is as follows:

>**Part 1**
>
>Imagine the following scenario: we have a network of 25 weather stations. They are located in a square grid: starting at 30°0′N 60°0′E, there is a station every 10° North and every 10° East. Each station measures the air temperature at a set time for three days, starting on September 1st, 2022. On the first day, all stations record a temperature of 0°C. On the second day, all temperatures are 1°C, and on the third day, all temperatures are 2°C. What are the variables, dimensions and attributes for this data?

> **Part 2**
>
>Now imagine we calculate the average temperature over time at each weather station, and we wish to incorporate this data into the same dataset. How will adding the average temperature data change the dataset’s variables, attributes, and dimensions?

And the conceptual diagram about how to organize this information is the following:

<img src = exercise_diagram.png width=80% height=80% class="center">


## Create an `xarray.DataArray`
Our goal is to make an `xarray.DataArray` that includes the information from our previous exercise about measuring temperature across three days. 

First, we import all the necessary libraries.

In [1]:
# we will use this to create number arrays
import numpy as np  
# we will only use it to create a vector of dates
import pandas as pd 

# THIS IS THE PACKAGE WE WILL EXPLORE
import xarray as xr   

# `xarray.DataArray`

The `xarray.DataArray` is:

* the primary data structure of the `xarray` package

* an n-dimensional array with **labeled dimensions**

* a **representation of a single variable in the NetCDF data format**: it holds the variable’s values, dimensions, and attributes.

Here you can read more about the [`xarray` terminology](https://docs.xarray.dev/en/stable/user-guide/terminology.html).

## Variables

The underlying data in the `xarray.DataArray` is a `numpy.ndarray` that holds the variable values. 

We start by making a `numpy.ndarray` with our mock temperature data:

In [2]:
# values of a single variable at each point of the coords 
temp_data = np.array([np.zeros((5,5)), 
                      np.ones((5,5)), 
                      np.ones((5,5))*2]).astype(int)
temp_data

array([[[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]],

       [[2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2]]])

We could think this is "all" we need to represent our data.
But if we stopped at this point, we would need to 

1. remember that the numbers in this array represent the temperature in degrees Celsius (doesn't seem too bad), 

2. remember that the first dimension of the array represents time, the second latitude and the third longitude (maybe ok), and 

3. keep track of the range of values that time, latitude, and longitude take (not so good).

Keeping track of all this information separately could quickly get messy and could make it challenging to share our data and analyses with others. 

This is what the NetCDF data model and `xarray` aim to simplify. 

We can get data and its descriptors together in an `xarray.DataArray` by adding the dimensions over which the variable is being measured and including attributes that appropriately describe dimensions and variables.

## Dimensions and Coordinates

To specify the dimensions of our upcoming `xarray.DataArray`, we must examine how we've constructed the `numpy.ndarray` holding the temperature data. 

The diagram below shows how the dimensions of `temp_data` are ordered: the first dimension is time, the second is latitude, and the third is longitude. 


<img src = netcdf_xarray_indexing.png width=50% height=50%>

Remember that indexing in 2-dimensional `numpy.ndarrays` starts at the top-left corner of the array, and it is done by rows first and columns second (like matrices). 
This is why latitude is the second dimension and longitude the third. 
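
As a quick check of this ordering, the sketch below rebuilds the same mock array and inspects its shape and two entries. The name `demo` is only a stand-in, chosen so that the snippet runs on its own without touching `temp_data`.

```python
import numpy as np

# same mock data as above, rebuilt so this snippet runs on its own
demo = np.array([np.zeros((5, 5)),
                 np.ones((5, 5)),
                 np.ones((5, 5)) * 2]).astype(int)

# axis 0 has length 3 (time); axes 1 and 2 have length 5 (latitude, longitude)
print(demo.shape)  # (3, 5, 5)

# the first index picks the day, the second the row (latitude),
# and the third the column (longitude)
first_day_top_left = demo[0, 0, 0]
third_day_bottom_right = demo[2, 4, 4]
print(first_day_top_left, third_day_bottom_right)  # 0 2
```

With plain NumPy we can only index by position, so it is still up to us to remember which row corresponds to which latitude.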

From the diagram, we can also see that the coordinates (values of each dimension) are as follows:

- date coordinates are 2022-09-01, 2022-09-02, 2022-09-03
- latitude coordinates are 70, 60, 50, 40, 30 (notice decreasing order)
- longitude coordinates are 60, 70, 80, 90, 100 (notice increasing order)

We add the dimensions as a tuple of strings and coordinates as a dictionary:

In [3]:
# names of the dimensions in the required order
dims = ('time', 'lat', 'lon')

# create coordinates to use for indexing along each dimension 
coords = {'time' : pd.date_range("2022-09-01", "2022-09-03"),
          'lat' : np.arange(70, 20, -10),
          'lon' : np.arange(60, 110, 10)}  
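
Before building the `xarray.DataArray`, it can help to confirm that each coordinate has as many values as the matching axis of the data. This is a minimal sketch with hypothetical variable names (`time_values`, `lat_values`, `lon_values`), using the same calls as the cell above.

```python
import numpy as np
import pandas as pd

# the same coordinate values as in the cell above
time_values = pd.date_range("2022-09-01", "2022-09-03")
lat_values = np.arange(70, 20, -10)   # the stop value (20) is excluded
lon_values = np.arange(60, 110, 10)   # the stop value (110) is excluded

print(lat_values.tolist())  # [70, 60, 50, 40, 30]
print(lon_values.tolist())  # [60, 70, 80, 90, 100]

# these lengths should match the shape of the data array, (3, 5, 5)
coord_lengths = (len(time_values), len(lat_values), len(lon_values))
print(coord_lengths)  # (3, 5, 5)
```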

## Attributes

Next, we add the attributes (metadata) for our temperature data as a dictionary:

In [4]:
# attributes (metadata) of the data array 
attrs = { 'title' : 'temperature across weather stations',
          'standard_name' : 'air_temperature',
          'units' : 'degree_C'}

## Putting It All Together

Finally, we put all these pieces together (data, dimensions, coordinates, and attributes) to create an `xarray.DataArray`:

In [5]:
# initialize xarray.DataArray
temp = xr.DataArray(data = temp_data, 
                    dims = dims,
                    coords = coords,
                    attrs = attrs)
temp
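
One payoff of attaching coordinates is that we can select values by label instead of by position. The sketch below rebuilds an equivalent array under the stand-in name `demo` (so it runs on its own) and uses `sel` to pull out the three-day series for the station at 30°N, 60°E.

```python
import numpy as np
import pandas as pd
import xarray as xr

# equivalent to the array built above, under a stand-in name
demo = xr.DataArray(
    data=np.array([np.zeros((5, 5)),
                   np.ones((5, 5)),
                   np.ones((5, 5)) * 2]).astype(int),
    dims=("time", "lat", "lon"),
    coords={"time": pd.date_range("2022-09-01", "2022-09-03"),
            "lat": np.arange(70, 20, -10),
            "lon": np.arange(60, 110, 10)},
)

# select by coordinate value: no need to remember that
# latitude 30 is the last row and longitude 60 the first column
station_series = demo.sel(lat=30, lon=60)
print(station_series.dims)             # ('time',)
print(station_series.values.tolist())  # [0, 1, 2]
```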

We can also update the variable’s attributes after creating the object. 
Notice that each of the coordinates is also an `xarray.DataArray`, so we can add attributes to them.

In [6]:
# update attributes
temp.attrs['description'] = 'simple example of an xarray.DataArray'

# add attributes to coordinates 
temp.time.attrs = {'description':'date of measurement'}
temp.lat.attrs['standard_name'] = 'latitude'
temp.lat.attrs['units'] = 'degree_N'
temp.lon.attrs['standard_name'] = 'longitude'
temp.lon.attrs['units'] = 'degree_E'
temp

At this point, since we have a single variable, the dataset attributes and the variable attributes are the same. 

# Reduction
`xarray` has implemented several methods to reduce an `xarray.DataArray` along any number of dimensions. 

One of the advantages of `xarray.DataArray` is that, if we choose to, it can carry over attributes when doing calculations.

For example, we can calculate the average temperature at each weather station over time and obtain a new `xarray.DataArray`. 

In [7]:
# compute average temperature
avg_temp = temp.mean(dim = 'time') 
# to keep attributes add keep_attrs = True

avg_temp.attrs = {'title':'average temperature over three days'}
avg_temp
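
To see what `keep_attrs` does, here is a minimal sketch on a small stand-in array (`demo`, not part of the exercise). It passes `keep_attrs` explicitly in both calls so the result does not depend on the library default.

```python
import numpy as np
import xarray as xr

# small stand-in array: 2 time steps, 3 longitudes
demo = xr.DataArray(
    data=np.arange(6).reshape(2, 3),
    dims=("time", "lon"),
    attrs={"units": "degree_C"},
)

# the same reduction, with and without the original attributes
dropped = demo.mean(dim="time", keep_attrs=False)
kept = demo.mean(dim="time", keep_attrs=True)

print(dropped.attrs)         # {}
print(kept.attrs)            # {'units': 'degree_C'}
print(kept.values.tolist())  # [1.5, 2.5, 3.5]
```

Keeping attributes is convenient when they still describe the result (units, for instance), but attributes such as a title may no longer apply after a reduction, which is why we overwrite them above.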


More about [`xarray` computations](https://docs.xarray.dev/en/stable/user-guide/computation.html).

# `xarray.Dataset`

An `xarray.Dataset`:

* is the second core data structure of `xarray`

* represents a NetCDF file with **multiple variables** (each being an `xarray.DataArray`)

* has dimensions, coordinates, and attributes, forming a **self-describing dataset**. 

Attributes can be specific to each variable, each dimension, or they can describe the whole dataset. 

**Remember!**
The variables in an `xarray.Dataset` can have the same dimensions, share some dimensions, or have no dimensions in common. 

Let's see an example of this.


# Create an `xarray.Dataset`
Following our previous example, we can create an `xarray.Dataset` by combining the temperature data with the average temperature data. 
We also add some attributes that now describe the whole dataset, not only each variable. 


In [8]:
# make dictionaries with variables and attributes
data_vars = {'avg_temp': avg_temp,
            'temp': temp}
attrs = {'title':'temperature data at weather stations: daily and average',
        'description':'simple example of an xarray.Dataset'}

# create xarray.Dataset
temp_dataset = xr.Dataset( data_vars = data_vars,
                        attrs = attrs)

Take some time to click through the data viewer and read through the variables and metadata in the dataset. 
Notice the following: 

+ `temp_dataset` is a dataset with three dimensions (time, latitude, and longitude), 

+ `temp` is a variable that uses all three dimensions in the dataset, and

+ `avg_temp` is a variable that only uses two dimensions (latitude and longitude).

In [9]:
temp_dataset
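
The same observations can be checked in code. The sketch below rebuilds an equivalent dataset under the stand-in name `demo_dataset` (so it runs on its own) and lists which dimensions each variable uses.

```python
import numpy as np
import pandas as pd
import xarray as xr

# daily temperatures, equivalent to `temp` above
daily = xr.DataArray(
    data=np.array([np.zeros((5, 5)),
                   np.ones((5, 5)),
                   np.ones((5, 5)) * 2]),
    dims=("time", "lat", "lon"),
    coords={"time": pd.date_range("2022-09-01", "2022-09-03"),
            "lat": np.arange(70, 20, -10),
            "lon": np.arange(60, 110, 10)},
)

# two variables that share the lat and lon dimensions
demo_dataset = xr.Dataset({"temp": daily,
                           "avg_temp": daily.mean(dim="time")})

# dimensions used by each variable
dims_by_variable = {name: var.dims
                    for name, var in demo_dataset.data_vars.items()}
print(dims_by_variable["temp"])      # ('time', 'lat', 'lon')
print(dims_by_variable["avg_temp"])  # ('lat', 'lon')

# dimensions of the dataset as a whole, with their lengths
dataset_sizes = dict(demo_dataset.sizes)
print(dataset_sizes)
```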

# What would happen next in this class?

Having acquired a solid understanding of the core `xarray` structures, we would continue this class by learning how to analyze NetCDF data via `xarray` with a *real-world dataset*, such as this one:

> Time series of annual Arctic freshwater fluxes and storage terms. 
> This data was produced for the publication [Jahn and Laiho, 2020](https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2020GL088854) about changes in the Arctic freshwater budget and 
> is archived at the Arctic Data Center [doi:10.18739/A2280504J](https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2280504J)

You can see the exercises I designed around this dataset in my lesson on [Data Structures and Formats for Large Data](https://learning.nceas.ucsb.edu/2022-09-arctic/sections/08-data-structures-netcdf.html). 

To finish the demo, we will just load and open this dataset, and click around to identify all the components we talked about. 

In [10]:
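# this is a standard-library module to download files from a URL
# (the submodule must be imported explicitly; `import urllib` alone is not enough)
import urllib.request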

# this is a library to navigate files in our computer
import os

In [11]:
url = 'https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A792bfc37-416e-409e-80b1-fdef8ab60033'

msg = urllib.request.urlretrieve(url, "FW_data_CESM_LW_2006_2100.nc")

In [12]:
fp = os.path.join(os.getcwd(),'FW_data_CESM_LW_2006_2100.nc')
fw_data = xr.open_dataset(fp)
fw_data