In [1]:
 %pip install --upgrade --user xarray requests tqdm netcdf4

In [1]:
import xarray as xr
import numpy as np

## Xarray DataArrays: Storing multiple arrays together with information about their relationships

Thus far, we have worked with data organized in separate numpy arrays. With real data, keeping track of many arrays stored in separate variables, and of how they relate to one another, quickly becomes complicated. Xarray lets you store multiple arrays and the information about how they relate to one another together in one ``DataArray``, which makes the data easier to work with and understand.

In this notebook, we will look at how to combine multiple arrays in an xarray `DataArray`, how to access the data stored in it, and how to save it to a file.

### DataArray: Labeling the Indices of an Array's Dimensions and Accessing the Data

| Code | Description |
| :-- | :-- |
| `da = xr.DataArray(data=x, coords={'time': y}, name='sensor')` | Make a DataArray from the equal-length arrays `x` and `y`, describing `x` as sensor data and `y` as the time points of each measurement. |
| `da = xr.DataArray(data=x, coords={'time': y, 'channel': z}, name='sensor')` | Make a 2D DataArray from `x`, `y`, and `z`, where `x` is a 2D array of sensor data (time × channel) and `z` holds the channel names. |
| `da['time']` | Get all time points at which data was recorded. |
| `da['channel']` | Get the names of all channels. |
| `da.loc[1:1.5]` | Get the sensor data from time points 1 to 1.5 s (both endpoints included). |
| `da.loc[1:1.5, :]` | Get the sensor data from time points 1 to 1.5 s, for all channels. |
| `da.loc[1:1.5, ['CHAN-2', 'CHAN-4']]` | Get the sensor data from time points 1 to 1.5 s for the channels labeled 'CHAN-2' and 'CHAN-4'. |
| `da.sel(channel='CHAN-3')` | Get the sensor data across the whole time period from the channel labeled 'CHAN-3'. |
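
The calls in the table can be tried on a small made-up recording. The sketch below is only an illustration: the values of `x`, `y`, and `z` are invented, and `dims` is passed explicitly (the table omits it) so that the axis order does not depend on how xarray infers it.

```python
import numpy as np
import xarray as xr

# Hypothetical sensor recording: 5 time points (in seconds) x 4 channels.
y = np.array([0.5, 1.0, 1.25, 1.5, 2.0])                 # time points
z = ['CHAN-1', 'CHAN-2', 'CHAN-3', 'CHAN-4']             # channel names
x = np.arange(len(y) * len(z)).reshape(len(y), len(z))   # sensor values

da = xr.DataArray(
    data=x,
    dims=('time', 'channel'),              # axis 0 is time, axis 1 is channel
    coords={'time': y, 'channel': z},
    name='sensor',
)

window = da.loc[1:1.5]                          # label-based slice, both ends included
subset = da.loc[1:1.5, ['CHAN-2', 'CHAN-4']]    # same window, two channels only
chan3 = da.sel(channel='CHAN-3')                # one channel, all time points

print(window.shape, subset.shape, chan3.shape)
```

Note that `loc` and `sel` select by the coordinate labels (seconds, channel names), not by integer position.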


**Example**: A `DataArray` can be made by simply passing a regular numpy array with the data to the xarray DataArray constructor.

In [3]:
data = np.random.random(size = 10)
data

array([0.67087079, 0.43345729, 0.22973613, 0.74500048, 0.64461041,
       0.48266201, 0.92151077, 0.18326574, 0.85865636, 0.06918388])

In [4]:
data_xr = xr.DataArray(data)
data_xr
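
To see what a bare `DataArray` is missing, it can help to inspect its metadata directly. A minimal sketch, using only attributes that every `DataArray` has (`dims`, `coords`, `name`):

```python
import numpy as np
import xarray as xr

data = np.random.random(size=10)
data_xr = xr.DataArray(data)

print(data_xr.dims)          # ('dim_0',): a default dimension name generated by xarray
print(list(data_xr.coords))  # []: no coordinates were supplied
print(data_xr.name)          # None: no name was supplied
```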

**Example**: When we display the resulting ``DataArray``, we see that it has room for more information than we provided. That is the main benefit of DataArrays, but the example above does not take advantage of it. In the following example, we add time information: the month in which each data point was recorded. In this hypothetical scenario, the data are the monthly sales of hiking boots in a sportswear store over the course of a year.

In [5]:
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
hiking_boots_sold = np.random.randint(low=2, high = 50, size = len(months))
hiking_boots_sold

array([17, 29, 48, 32, 22,  4, 15, 11, 24, 16, 47, 25], dtype=int32)

In [6]:
data_boots = xr.DataArray(
    data=hiking_boots_sold, 
    coords={'month': months}, 
    name='sale_hiking_boots',
)
data_boots

**Exercise:** Make a DataArray out of the following variables containing numpy arrays.

In [2]:
days = np.linspace(1,365,365,dtype=int)
hours_of_sunlight = np.random.uniform(low=0, high = 16, size=len(days))

In [3]:
data_sun = xr.DataArray(
    data = hours_of_sunlight,
    coords = {'day': days},
    name = 'hours_of_sunlight_over_year'
)
data_sun

**Exercise**: Get the array containing the days throughout the year.

In [9]:
data_sun['day']

**Exercise**: Get the data on hours of sunlight for day number 3 through 11 using the `loc` indexer.

In [10]:
data_sun.loc[3:11]

**Exercise**: Get the data on hours of sunlight for day number 3 through 11 using regular indexing for arrays. Do you notice a difference in which indices you use to access the data?

In [11]:
data_sun[2:11]
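
The difference between the two exercises above is label-based versus position-based selection. As a sketch, the same two selections can be written with the named methods `sel` (labels) and `isel` (integer positions). The array is rebuilt here so the block runs on its own, and `dims='day'` is spelled out explicitly, which the notebook cells above do not do.

```python
import numpy as np
import xarray as xr

days = np.arange(1, 366)
hours_of_sunlight = np.random.uniform(low=0, high=16, size=len(days))
data_sun = xr.DataArray(
    hours_of_sunlight,
    dims='day',
    coords={'day': days},
    name='hours_of_sunlight_over_year',
)

# Label-based: day labels 3 through 11, end label included.
by_label = data_sun.sel(day=slice(3, 11))

# Position-based: positions 2 through 10, end position excluded (as in numpy).
by_position = data_sun.isel(day=slice(2, 11))

print(by_label.equals(by_position))  # both select the same nine days
```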

**Exercise**: In the hiking boots DataArray from the example, get the number of hiking boots sold in October using the `loc` indexer.

In [12]:
data_boots.loc['October']

**Example**: Creating DataArrays with **multidimensional** data. Let's say that the company selling hiking boots has stores in multiple cities - Cologne, Berlin, and Munich - and that you want to store data on sales in all three cities throughout the year. In this case, you're storing multidimensional data; data across time and space, similar to neuroscience data.

In [12]:
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
cities = ['Cologne', 'Berlin', 'Munich']
hiking_boots_sold = np.random.randint(low=2, high = 50, size = (len(months), len(cities)))
hiking_boots_sold

array([[34, 13, 40],
       [44, 34, 10],
       [44, 25, 13],
       [ 7, 30, 23],
       [24, 28, 39],
       [24, 36, 13],
       [40, 35, 34],
       [36, 39, 45],
       [41, 38,  5],
       [28, 47,  3],
       [43, 11, 40],
       [ 7, 27, 12]], dtype=int32)

In [13]:
data_boots_cities = xr.DataArray(
    data=hiking_boots_sold,
    coords={'month': months, 'city': cities},
    name='hiking_boots_sold_different_cities'
)
data_boots_cities
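
With more than one dimension, it matters which axis of the numpy array belongs to which coordinate. The sketch below is a variant of the cell above that names the axes explicitly through the `dims` argument. This is an addition for illustration, not something the original cell does, and it makes the mapping independent of the order of the `coords` dictionary.

```python
import numpy as np
import xarray as xr

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
cities = ['Cologne', 'Berlin', 'Munich']
hiking_boots_sold = np.random.randint(low=2, high=50, size=(len(months), len(cities)))

data_boots_cities = xr.DataArray(
    data=hiking_boots_sold,
    dims=('month', 'city'),   # axis 0 is month, axis 1 is city
    coords={'month': months, 'city': cities},
    name='hiking_boots_sold_different_cities',
)

print(data_boots_cities.dims)
print(dict(data_boots_cities.sizes))

# A single value can be looked up by its two labels; .item() returns a plain Python number.
berlin_october = data_boots_cities.sel(month='October', city='Berlin').item()
print(berlin_october)
```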

**Exercise**: Make a 2-D DataArray with data on sunlight throughout the year in Germany, France, and Italy using the variables in the cell below.

In [6]:
days = np.linspace(1,365,365,dtype=int)
countries = ['Germany', 'France', 'Italy']
hours_of_sunlight = np.random.uniform(low=0, high = 16, size=(len(days), len(countries)))

In [7]:
data_sun_country = xr.DataArray(
    data = hours_of_sunlight,
    coords={'day': days, 'country': countries},
    name='hours_of_sunlight_countries'
)
data_sun_country

**Exercise**: Get the data on hours of sunlight from day 3 to 11 for Italy using the `loc` indexer.

In [8]:
data_sun_country.loc[3:11, 'Italy']

**Exercise**: Get the data on hours of sunlight from day 3 to 11 for *both* Germany and France together.

In [9]:
data_sun_country.loc[3:11, ['Germany', 'France']]
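
With `loc`, the selections must be given in the order of the dimensions. As a sketch of an alternative, `sel` takes the dimension names as keywords, so the order does not matter. The array is rebuilt here, with explicit `dims`, so that the block runs on its own.

```python
import numpy as np
import xarray as xr

days = np.arange(1, 366)
countries = ['Germany', 'France', 'Italy']
hours_of_sunlight = np.random.uniform(low=0, high=16, size=(len(days), len(countries)))

data_sun_country = xr.DataArray(
    data=hours_of_sunlight,
    dims=('day', 'country'),
    coords={'day': days, 'country': countries},
    name='hours_of_sunlight_countries',
)

# Positional form: the first entry selects days, the second selects countries.
by_loc = data_sun_country.loc[3:11, ['Germany', 'France']]

# Keyword form: each selection is tied to its dimension by name.
by_sel = data_sun_country.sel(country=['Germany', 'France'], day=slice(3, 11))

print(by_sel.identical(by_loc))
```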

## Saving DataArray data to a file

After the DataArray is constructed, you will often want to save it to a file so that you can load it and continue working on it later, or share it with others.

| Code | Description |
| :-- | :-- |
| `da.to_netcdf(path='data/filename.nc')` | Write the DataArray variable named "da" to a file with a filename of your choosing in the data directory |
| `data = xr.load_dataarray('data/filename.nc')` | Load the DataArray and put it in a variable |
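
The two calls in the table can be combined into a round trip. The sketch below writes to a temporary directory instead of `data/` so that it leaves no files behind. The file name is arbitrary, and the block assumes a NetCDF backend such as `netcdf4` is installed, as in the `%pip install` cell at the top of the notebook.

```python
import os
import tempfile

import numpy as np
import xarray as xr

days = np.arange(1, 366)
countries = ['Germany', 'France', 'Italy']
data_sun_country = xr.DataArray(
    np.random.uniform(low=0, high=16, size=(len(days), len(countries))),
    dims=('day', 'country'),
    coords={'day': days, 'country': countries},
    name='hours_of_sunlight_countries',
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'hours_of_sunlight_countries.nc')
    data_sun_country.to_netcdf(path)
    # load_dataarray reads everything into memory and closes the file,
    # so the temporary directory can be removed afterwards.
    loaded = xr.load_dataarray(path)

# Values, dimension names, coordinates, and the name should survive the round trip.
print(loaded.name, loaded.dims)
print(np.allclose(loaded.values, data_sun_country.values))
```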


**Exercise**: Write the DataArray with the hours of sunlight per day in different countries to a NetCDF (.nc) file.

In [22]:
# Write to file
data_sun_country.to_netcdf('hours_of_sunlight_countries.nc')

**Exercise**: Use the HDF5 Viewer at https://myhdf5.hdfgroup.org/ to examine the contents of the file. Browse the datasets and inspect the attributes. What do you notice about how the data is organized? How do NetCDF and xarray determine which datasets are coordinates and which are variables?

**Exercise**: Load the DataArray with the hours of sunlight per day in different countries that you saved into a variable. Display the variable to check that the data was stored correctly.

In [23]:
data = xr.load_dataarray('hours_of_sunlight_countries.nc')

In [24]:
data