# Xarray

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/giswqs/geog-312/blob/main/book/geospatial/xarray.ipynb)

## Overview

[Xarray](https://docs.xarray.dev) is a powerful Python library designed for working with multi-dimensional labeled datasets, often used in fields such as climate science, oceanography, and remote sensing. It provides a high-level interface for manipulating and analyzing datasets that can be thought of as extensions of NumPy arrays. Xarray is particularly useful for geospatial data because it supports labeled axes (dimensions), coordinates, and metadata, making it easier to work with datasets that vary across time, space, and other dimensions.

## Learning Objectives

By the end of this lecture, you should be able to:

- Understand the basic concepts and data structures in Xarray, including `DataArray` and `Dataset`.
- Load and inspect multi-dimensional geospatial datasets using Xarray.
- Perform basic operations on Xarray objects, such as selection, indexing, and arithmetic operations.
- Use Xarray to efficiently work with large geospatial datasets, including time series and raster data.
- Apply Xarray to common geospatial analysis tasks, such as calculating statistics, regridding, and visualization.

## What is Xarray?

Xarray extends the capabilities of NumPy by providing a data structure for labeled, multi-dimensional arrays. The two main data structures in Xarray are:

- **DataArray**: A labeled, multi-dimensional array, which includes dimensions, coordinates, and attributes.
- **Dataset**: A dict-like collection of `DataArray` objects (variables) aligned along shared dimensions.

![](https://docs.xarray.dev/en/stable/_images/dataset-diagram.png)

Xarray is particularly useful for working with datasets where dimensions have meaningful labels (e.g., time, latitude, longitude) and where metadata is important.
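
To make the two structures concrete, the sketch below builds a small `DataArray` and a `Dataset` by hand. The coordinate values, the variable names (`temp`, `elevation`, `ds_demo`), and the constant data are made up for illustration and are not part of the tutorial dataset used later.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical coordinates: 3 daily time steps on a 2 x 2 grid
time = pd.date_range("2013-01-01", periods=3, freq="D")
lat = [40.0, 42.5]
lon = [260.0, 262.5]

# A DataArray bundles values, dimension names, coordinates, and attributes
temp = xr.DataArray(
    np.full((3, 2, 2), 280.0),
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    attrs={"units": "K"},
    name="temp",
)

# A Dataset holds several variables; a variable may use only some dimensions
ds_demo = xr.Dataset(
    {"temp": temp, "elevation": (("lat", "lon"), np.zeros((2, 2)))}
)

print(temp.dims)
print(list(ds_demo.data_vars))
```

Here `elevation` has only the `lat` and `lon` dimensions, while `temp` also varies over `time`; both variables share the same `lat` and `lon` coordinates.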

## Installing Xarray

Before we start, ensure that Xarray is installed. You can install it via pip:

In [None]:
# %pip install xarray pooch

## Importing Libraries

In [4]:
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

xr.set_options(keep_attrs=True, display_expand_data=False)
np.set_printoptions(threshold=10, edgeitems=2)

## Xarray Data Structures

Xarray provides two core data structures:

1. **DataArray**: A single multi-dimensional array with labeled dimensions, coordinates, and metadata.
2. **Dataset**: A collection of `DataArray` objects, each corresponding to a variable, sharing the same dimensions and coordinates.

## Loading a Dataset

Xarray offers built-in access to several [tutorial datasets](https://docs.xarray.dev/en/latest/generated/xarray.tutorial.open_dataset.html), which we can load with `xr.tutorial.open_dataset`. Here, we load an air temperature dataset:

In [None]:
# Tip: type "xr." and press Tab to browse what Xarray provides

In [None]:
ds = xr.tutorial.open_dataset("air_temperature")
ds

# This dataset has 25 rows (lat), 53 columns (lon), and 2920 bands (time steps, one every 6 hours)
# That amounts to 2 years of data
# Without indexes we could not select or slice by label

This dataset is stored in the [netCDF](https://www.unidata.ucar.edu/software/netcdf) format, a common format for scientific data. Xarray automatically parses metadata like dimensions and coordinates.

The dataset is downloaded from the internet and stored in a temporary cache directory. You can find the location of the cache directory depending on your operating system:
- Linux: `~/.cache/xarray_tutorial_data`
- macOS: `~/Library/Caches/xarray_tutorial_data`
- Windows: `~/AppData/Local/xarray_tutorial_data`

To keep a local copy of a dataset, copy the downloaded file out of the cache directory and make sure it has the `.nc` extension. You can then open it directly:

    ds = xr.open_dataset("data/air_temperature.nc")




In [7]:
# Opening directly from a URL, e.g. a file from https://github.com/opengeos/datasets/releases
# (apparently this does not work the same way as in the YouTube video)
# ds = xr.open_dataset("https://github.com/opengeos/datasets/releases/download/netcdf/air_temperature.nc")
# ds


## Working with DataArrays

The `DataArray` is the core data structure in Xarray. It includes data values, dimensions (e.g., time, latitude, longitude), and the coordinates for each dimension.

In [None]:
# Access a single DataArray (the result is no longer a Dataset)
temperature = ds["air"]
# temperature = ds.air  # dot notation also works
temperature

You can also access a `DataArray` using dot notation:

In [None]:
ds.air  # the variable must exist in the Dataset

## DataArray Components

- **Values**: The actual data stored in a NumPy array or similar structure.
- **Dimensions**: Named axes of the data (e.g., time, latitude, longitude).
- **Coordinates**: Labels for the values in each dimension (e.g., specific times or geographic locations).
- **Attributes**: Metadata associated with the data (e.g., units, descriptions).

You can extract and print the values, dimensions, and coordinates of a `DataArray`:

In [None]:
temperature.values

In [None]:
temperature.dims

In [None]:
temperature.shape

In [None]:
temperature.coords

In [None]:
temperature.attrs

In [17]:
#we can add an attribute

temperature.attrs["creator"] = "NOAA"

In [None]:
temperature.attrs

## Indexing and Selecting Data

Xarray allows you to easily select data based on dimension labels, which is very intuitive when working with geospatial data.

In [None]:
# Select data for a specific time and location
selected_data = temperature.sel(time="2013-01-01", lat=40.0, lon=260.0)
selected_data

In [None]:
selected_data.values

In [None]:
selected_data.time.values
# The date matches 4 time steps because the data is recorded every 6 hours

In [None]:
# Check with a small synthetic example: four 6-hourly steps cover one day
import pandas as pd

# Create a time range with 6-hour intervals
time = pd.date_range('2023-01-01', periods=4, freq='6h')
print(time)
# Create a DataArray with the time dimension
data = xr.DataArray([1, 2, 3, 4], coords=[time], dims=["time"])


In [None]:
# Slice data across a range of times
time_slice = temperature.sel(time=slice("2013-01-01", "2013-01-31"))
time_slice
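
The behaviour of date strings in `sel` can be checked on a small synthetic series. This is a minimal sketch; the names `times` and `series` and the values are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic series: two days of 6-hourly values (8 time steps)
times = pd.date_range("2023-01-01", periods=8, freq="6h")
series = xr.DataArray(np.arange(8), coords={"time": times}, dims="time")

# A date without a time of day matches every step on that day
one_day = series.sel(time="2023-01-01")
print(one_day.sizes["time"])

# A full timestamp matches a single step
one_step = series.sel(time="2023-01-01T06:00:00")
print(one_step.item())

# slice() bounds are inclusive for label-based selection
both_days = series.sel(time=slice("2023-01-01", "2023-01-02"))
print(both_days.sizes["time"])
```

Unlike positional slicing in Python, a label-based `slice` includes its end label, so the last call returns all steps on both days.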

## Performing Operations on DataArrays

You can perform arithmetic operations directly on `DataArray` objects, similar to how you would with NumPy arrays. Xarray also handles broadcasting automatically.

In [None]:
# Calculate the mean temperature over time; the result keeps the lat and lon coordinates
mean_temperature = temperature.mean(dim="time")
mean_temperature

In [None]:
# Without dim, the mean is taken over all dimensions and returns a single value

mean_temperature_all = temperature.mean()
mean_temperature_all

In [None]:
temperature.values.mean()  # same calculation with NumPy; should match the value above up to rounding

In [None]:
# Subtract the mean temperature from the original data
anomalies = temperature - mean_temperature
anomalies
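
The subtraction above works even though `mean_temperature` has no `time` dimension, because Xarray matches dimensions by name. The sketch below shows the same idea on made-up data; all names and values are illustrative.

```python
import numpy as np
import xarray as xr

# Synthetic data with dimensions (time, lat)
values = xr.DataArray(
    np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    dims=("time", "lat"),
    coords={"lat": [40.0, 42.5]},
)

# The time mean has only the "lat" dimension ...
time_mean = values.mean(dim="time")

# ... yet subtraction works: "lat" is matched by name, "time" is broadcast
anomalies_demo = values - time_mean
print(anomalies_demo.dims)

# Arrays with different dimensions broadcast to their outer combination
row = xr.DataArray([1.0, 2.0], dims="lat")
col = xr.DataArray([10.0, 20.0, 30.0], dims="lon")
print((row * col).shape)
```

With plain NumPy you would have to keep track of the axis order yourself, for example by reshaping `row` before multiplying.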

## Visualization with Xarray

Xarray integrates well with Matplotlib and other visualization libraries, making it easy to create plots directly from `DataArray` and `Dataset` objects.

In [56]:
# Set a descriptive long_name; Xarray uses it to label the plot
mean_temperature.attrs["long_name"] = "Mean air temperature"

In [None]:
# Plot the mean temperature; the names on the axes come from the attributes
mean_temperature.plot()
plt.show()

You can customize the appearance of plots by passing arguments to the `plot` method. For example, you can specify the color map, add labels, and set the figure size.

In [None]:
# See the Matplotlib documentation for the list of available colormaps

mean_temperature.plot(cmap="jet", figsize=(10, 6))
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Mean Temperature")

You can also select a specific location using the `sel` method and plot a time series of temperature at that location.

In [72]:
# temperature.sel(time="2013-01-01").plot()  # still 3-D (4 time steps), so this draws a histogram, which is confusing

In [None]:
x = temperature.sel(lat=40.0, lon=260.0)
x

In [None]:
len(x)  # number of time steps, to verify it matches the number of bands

In [None]:
# Plot a time series for a specific location
temperature.sel(lat=40.0, lon=260.0).plot()
plt.show()

In [None]:
#To visualize a bit better

plt.figure(figsize=(20, 6))
temperature.sel(lat=40.0, lon=260.0).plot()
plt.show()


In [None]:
# If the requested point is not on the grid, sel raises an error unless we use method="nearest"
# The grid point closest to 261.0 is 260.0
temperature.sel(lat=40.0, lon=261.0, method="nearest").plot()
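
A small synthetic example makes the nearest-neighbour lookup easier to follow. It also uses the optional `tolerance` argument of `sel`, which the text above does not cover, to limit how far a match may be from the requested label. The grid and values are made up for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic 1-D field on a 2.5-degree longitude grid (hypothetical values)
lons = np.array([257.5, 260.0, 262.5])
field = xr.DataArray([1.0, 2.0, 3.0], coords={"lon": lons}, dims="lon")

# An exact lookup fails because 261.0 is not a grid point
try:
    field.sel(lon=261.0)
except KeyError:
    print("261.0 is not on the grid")

# method="nearest" snaps to the closest grid point (260.0)
nearest = field.sel(lon=261.0, method="nearest")
print(float(nearest.lon))

# tolerance bounds how far the match may be from the requested label
try:
    field.sel(lon=270.0, method="nearest", tolerance=2.5)
except KeyError:
    print("no grid point within 2.5 degrees of 270.0")
```

Setting a tolerance is a reasonable safeguard when a silent match to a distant grid point would be misleading.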

## Working with Datasets

A `Dataset` is a collection of `DataArray` objects. It is useful when you need to work with multiple related variables.

In [None]:
ds

In [None]:
# List all variables in the dataset  -> 1 data variable
print(ds.data_vars)

In [81]:
# Access a DataArray from the Dataset
temperature = ds["air"]

In [None]:
# Perform operations on the Dataset: average all bands (time steps) into a single value per location
mean_temp_ds = ds.mean(dim="time")
mean_temp_ds

In [None]:
mean_temp_ds = ds.mean(dim="lon")
mean_temp_ds

# Averaging over longitude shows how temperature varies with latitude at each time step

## Why Use Xarray?

Xarray is valuable for handling multi-dimensional data, especially in scientific applications. It provides metadata, dimension names, and coordinate labels, making it much easier to understand and manipulate data compared to raw NumPy arrays.

### Without Xarray (Using NumPy)

Here's how a task might look without Xarray, using NumPy arrays:

In [None]:
ds.air.data

In [84]:
lat = ds.air.lat.data
lon = ds.air.lon.data
temp = ds.air.data

In [None]:
print(lat)

In [None]:
temp.shape

In [None]:
plt.figure()
plt.pcolormesh(lon, lat, temp[1500, :, :])

# Finding the right time or location by array index alone is very difficult

While this approach works, it's not clear what `1500` refers to (in this case, it is the position of a time step along the first axis).

### With Xarray

With Xarray, you can use more intuitive and readable indexing with `sel` and `isel`:

In [None]:
ds.air.isel(time=1500).plot(x="lon")  # at least the plot title now shows the date

In [None]:
ds.air.sel(time="2013-01-01T00:00:00").plot(x="lon")  # even better: select by label instead of by row/column index

The second example selects the first time step by its timestamp and plots it using labeled axes (`lat` and `lon`), which is much clearer.



## Advanced Indexing: Label vs. Position-Based Indexing

Xarray supports both label-based and position-based indexing, making it flexible for data selection.

### Label-based Indexing

You can use `.sel()` to select data based on the labels of coordinates, such as specific times or locations:

In [None]:
# Select all data from May 2013
ds.sel(time="2013-05")

In [None]:
# Slice over time, selecting data between May and July 2013
ds.sel(time=slice("2013-05", "2013-07"))

### Position-based Indexing

Alternatively, you can use `.isel()` to select data based on the positions of coordinates:

In [None]:
# Select the first time step, third latitude, and fourth longitude (positions are zero-based)
ds.air.isel(time=0, lat=2, lon=3)
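
The following sketch contrasts the two indexing styles on a tiny synthetic grid. The coordinate values are made up, but the latitudes are stored north to south, as they are in the tutorial dataset.

```python
import numpy as np
import xarray as xr

# Synthetic grid; latitudes run from north to south
grid = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("lat", "lon"),
    coords={"lat": [45.0, 42.5, 40.0], "lon": [255.0, 257.5, 260.0, 262.5]},
)

# Position-based: index 2 is the third latitude, index 3 the fourth longitude
by_position = grid.isel(lat=2, lon=3)

# Label-based: the same cell addressed by its coordinate values
by_label = grid.sel(lat=40.0, lon=262.5)

print(by_position.item(), by_label.item())

# With descending latitudes, a label slice must also run from high to low
band = grid.sel(lat=slice(45.0, 42.5))
print(band.sizes["lat"])
```

Both selections return the same cell. The last call is a reminder that label slices follow the order of the stored coordinate, so `slice(42.5, 45.0)` would return an empty selection here.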

## High-Level Computations with Xarray

Xarray offers several high-level operations that make common computations straightforward, such as `groupby`, `resample`, `rolling`, and `weighted`.

### GroupBy Operation

You can calculate statistics such as the seasonal mean of the dataset:

In [None]:
seasonal_mean = ds.groupby("time.season").mean()
seasonal_mean.air.plot(col="season")
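
The difference between `groupby` and `resample` is easiest to see on a synthetic series. In this minimal sketch, the daily values are simply the month number, and the names `days`, `daily`, `seasonal`, and `monthly` are made up for illustration.

```python
import pandas as pd
import xarray as xr

# Synthetic daily series for 2013 whose value is simply the month number
days = pd.date_range("2013-01-01", "2013-12-31", freq="D")
daily = xr.DataArray(
    days.month.to_numpy(dtype=float), coords={"time": days}, dims="time"
)

# groupby: pool all time steps that fall in the same season
seasonal = daily.groupby("time.season").mean()
print(sorted(seasonal.season.values.tolist()))

# resample: aggregate into consecutive calendar months ("MS" = month start)
monthly = daily.resample(time="MS").mean()
print(monthly.sizes["time"])
```

`groupby("time.season")` replaces the `time` dimension with a `season` dimension of four labels, whereas `resample` keeps a `time` dimension with one entry per month.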

### Computation with Weights

Xarray allows for weighted computations, useful in geospatial contexts where grid cells vary in size. For example, you can weight the mean of the dataset by cell area.

In [None]:
cell_area = xr.ones_like(ds.air.lon)  # Placeholder for actual area calculation
weighted_mean = ds.weighted(cell_area).mean(dim=["lon", "lat"])
weighted_mean.air.plot()
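
The placeholder above gives every cell the same weight. On a regular latitude-longitude grid, a common approximation is to weight each cell by the cosine of its latitude. The sketch below applies that assumption to made-up data; it is not the area calculation of any particular dataset.

```python
import numpy as np
import xarray as xr

# Synthetic field whose value equals its latitude (hypothetical data)
lats = np.array([0.0, 30.0, 60.0])
lons = np.array([0.0, 90.0, 180.0, 270.0])
field = xr.DataArray(
    np.repeat(lats[:, None], lons.size, axis=1),
    dims=("lat", "lon"),
    coords={"lat": lats, "lon": lons},
)

# Assumption: cell area is proportional to cos(latitude) on a regular grid
area_weights = np.cos(np.deg2rad(field.lat))

plain_mean = field.mean(dim=("lat", "lon"))
area_mean = field.weighted(area_weights).mean(dim=("lat", "lon"))

print(float(plain_mean), float(area_mean))
```

The weighted mean is lower than the plain mean because the high-latitude cells, which hold the largest values in this example, cover less area and so count for less.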

### Rolling Window Operation

Xarray supports rolling window operations, which are useful for smoothing time series data spatially or temporally. For example, you can smooth the temperature data spatially using a 5x5 window.

In [None]:
ds.air.isel(time=0).plot()  # original (unsmoothed) pixels at the first time step, for comparison

In [None]:
ds.air.isel(time=0).rolling(lat=5, lon=5).mean().plot()  # smoothed with a 5 x 5 cell moving window (12.5 x 12.5 degrees on this 2.5-degree grid)

Similarly, you can smooth the temperature data temporally using a 5-day window. The window size is 20 because each day has four time steps (one every 6 hours).

In [None]:
plt.figure(figsize=(20, 6))
# Select the time series at a specific latitude and longitude
temperature = ds["air"].sel(lat=40.0, lon=260.0)

# Plot the original time series
temperature.plot(label="Original")

# Apply rolling mean smoothing with a window size of 20
smoothed_temperature = temperature.rolling(time=20, center=True).mean()

# Plot the smoothed data
smoothed_temperature.plot(label="Smoothed")

# Add a title and labels
plt.title("Temperature Time Series (lat=40.0, lon=260.0)")
plt.xlabel("Time")
plt.ylabel("Temperature (K)")

# Add a legend
plt.legend()

# Show the plot
plt.show()
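
A rolling mean leaves gaps at the ends of the series, where the window is incomplete. The sketch below shows this on a short synthetic series and uses the `min_periods` argument of `rolling` as one possible way to fill the edges. The names and values are made up for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic series of 10 steps: 0, 1, ..., 9
signal = xr.DataArray(np.arange(10, dtype=float), dims="time")

# A centered window of 5 needs 2 neighbours on each side, so the first and
# last 2 positions are NaN by default
strict = signal.rolling(time=5, center=True).mean()
print(int(strict.isnull().sum()))

# min_periods=1 averages whatever falls inside the window instead
relaxed = signal.rolling(time=5, center=True, min_periods=1).mean()
print(int(relaxed.isnull().sum()))
```

Whether the partially averaged edge values are acceptable depends on the analysis; leaving them as NaN is the more conservative choice.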

## Reading and Writing Files

Xarray supports many common scientific data formats, including [netCDF](https://www.unidata.ucar.edu/software/netcdf/) and [Zarr](https://zarr.readthedocs.io/). You can read and write datasets to disk with a few simple commands.

### Writing to netCDF

To save a dataset as a netCDF file:

In [None]:
# Cast air to a floating-point type (float32 or float64); this step is needed before saving
ds["air"] = ds["air"].astype("float32")

# Save the dataset to a NetCDF file
ds.to_netcdf("air_temperature.nc")

### Reading from netCDF

To load a dataset from a netCDF file:

In [None]:
loaded_data = xr.open_dataset("air_temperature.nc")
loaded_data

## Exercises

### Exercise 1: Exploring a New Dataset

1. Load the Xarray tutorial dataset `rasm`.
2. Inspect the `Dataset` object and list all the variables and dimensions.
3. Select the `Tair` variable (air temperature).
4. Print the attributes, dimensions, and coordinates of `Tair`.

### Exercise 2: Data Selection and Indexing

1. Select a subset of the `Tair` data for the date `1980-07-01` and latitude `70.0`.
2. Create a time slice for the entire latitude range between January and March of 1980.
3. Plot the selected time slice as a line plot.

### Exercise 3: Performing Arithmetic Operations

1. Compute the mean of the `Tair` data over the `time` dimension.
2. Subtract the computed mean from the original `Tair` dataset to get the temperature anomalies.
3. Plot the mean temperature and the anomalies on separate plots.

### Exercise 4: GroupBy and Resampling

1. Use `groupby` to calculate the seasonal mean temperature (`Tair`).
2. Use `resample` to calculate the monthly mean temperature for 1980.
3. Plot the seasonal mean for each season and the monthly mean.

### Exercise 5: Writing Data to netCDF

1. Select the temperature anomalies calculated in Exercise 3.
2. Convert the `Tair` variable to `float32` to optimize file size.
3. Write the anomalies data to a new netCDF file named `tair_anomalies.nc`.
4. Load the data back from the file and print its contents.

## Summary

Xarray is a powerful library for working with multi-dimensional geospatial data. It simplifies data handling by offering labeled dimensions and coordinates, enabling intuitive operations and making analysis more transparent. Xarray's ability to work seamlessly with NumPy, Dask, and Pandas makes it an essential tool for geospatial and scientific analysis. With Xarray, you can efficiently manage and analyze large, complex datasets, making it a valuable asset for researchers and developers alike.

In [None]:
import nbformat
import re

def extract_titles(notebook_path):
    # Load the notebook
    with open(notebook_path, 'r', encoding='utf-8') as f:
        nb = nbformat.read(f, as_version=4)
    
    titles = []
    
    # Regular expressions for HTML headings
    html_heading_re = re.compile(r'<h([1-6])>(.*?)</h\1>', re.IGNORECASE)
    
    # Iterate through the cells and extract titles
    for cell in nb.cells:
        if cell.cell_type == 'markdown':
            lines = cell.source.split('\n')
            for line in lines:
                if line.startswith('#'):
                    titles.append(line)
                else:
                    # Check for HTML headings
                    match = html_heading_re.match(line)
                    if match:
                        level = int(match.group(1))
                        title_text = match.group(2).strip()
                        titles.append('#' * level + ' ' + title_text)
    
    return titles

def generate_toc(titles):
    toc = []
    for title in titles:
        # Count only the leading '#' characters to determine the level
        level = len(title) - len(title.lstrip('#'))
        # Remove leading '#' and strip leading/trailing whitespace
        title_text = title.lstrip('#').strip()
        # Create the TOC entry with indentation based on the level
        toc.append('  ' * (level - 1) + f'- {title_text}')
    
    return '\n'.join(toc)

def main():
    notebook_path = 'xarray.ipynb'  # Replace with your notebook path
    titles = extract_titles(notebook_path)
    toc = generate_toc(titles)
    print(toc)

if __name__ == '__main__':
    main()