{{title_s1_1}}

This notebook demonstrates working with Sentinel-1 RTC imagery that has been processed on the [ASF On-Demand server](https://docs.asf.alaska.edu/vertex/manual/) and downloaded locally. 

The downloaded time series of Sentinel-1 imagery is very large. We demonstrate strategies for reading data of this nature into memory by creating a *virtual copy* of the data. 

:::{important}
As mentioned, the steps shown in this notebook involve downloading and extracting large volumes of data. **It is not necessary to do this to follow the rest of the content in the tutorial**. We include the demonstration for the purposes of completeness and to help users who may be in this situation.

***To skip downloading the data and proceed with the tutorial, use the VRT files located in the `../tutorial2/data/` directory of the tutorial repository.***
:::

::::{tab-set}
:::{tab-item} Outline   
(content:section_A)=
**[A. Prepare to read data into memory](#a-prepare-to-read-data-into-memory)**  
- {{a1_s1_nb1}}  
- {{a2_s1_nb1}}  

(content:section_B)=
**[B. Read data](#b-read-data)**  
- {{b1_s1_nb1}}  
:::

:::{tab-item} Learning goals

{{concepts}}
- Understand local file storage and create virtual datasets from locally stored files
- Read larger-than-memory data into memory
- Use VRT objects to create xarray objects from large data stacks
:::
::::
:::{admonition} ASF Data Access
You can download the RTC-processed backscatter time series [here](https://zenodo.org/record/7236413#.Y1rNi37MJ-0). This tutorial starts from the point of having the data downloaded and unzipped in a directory. See the path to the directory location and the structure of the directory holding the unzipped files below. 
:::

Expand the next cell to see specific packages used in this notebook and relevant system and version information.

In [None]:
%xmode minimal
import os
import pathlib
import re

import geopandas as gpd
import markdown
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import rioxarray as rio
import xarray as xr

from typing import List

In [3]:
cwd = pathlib.Path.cwd()
tutorial2_dir = pathlib.Path(cwd).parent


## A. Prepare to read data into memory

### {{a1_s1_nb1}}

After the data is extracted from the compressed files, we have a directory containing sub-directories for each Sentinel-1 image acquisition (scene). Within each sub-directory are all of the files associated with that scene. For more information about the files contained in each directory, see this [section](https://hyp3-docs.asf.alaska.edu/guides/rtc_product_guide/#image-files) of the ASF Sentinel-1 RTC Product Guide.

The directory should look like the diagram below:

```
.
└── s1_asf_data
    ├── S1A_IW_20210502T121414_DVP_RTC30_G_gpuned_1424
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_VH.tif.xml
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_rgb.kmz
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_shape.prj
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7.png.aux.xml
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7.README.md.txt
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_rgb.png.aux.xml
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_rgb.png
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_VV.tif.xml
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7.png
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_rgb.png.xml
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7.png.xml
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_shape.shp
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_VH.tif
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7.log
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_VV.tif
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7.kmz
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_shape.dbf
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_ls_map.tif.xml
    │   ├── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_shape.shx
    │   └── S1A_IW_20220214T121353_DVP_RTC30_G_gpuned_51E7_ls_map.tif
    └── S1A_IW_20210505T000307_DVP_RTC30_G_gpuned_54B1
        └── ...
```



To build GDAL VRT files, we need to pass a list of the input datasets. This means that we need to extract the file path to every file associated with each variable (VV, VH and L-S). 

:::{note}
If you are following along on your own computer, be sure to replace the `s1_asf_data` file path below with the path to the location of the downloaded files on your own computer.
:::

In [4]:
# Path to directory holding downloaded data
s1_asf_data = pathlib.Path(cwd.parents[3], "sentinel1_rtc/data/asf_rtcs")

In [5]:
scenes_ls = os.listdir(s1_asf_data)

In [6]:
# Function to extract the names of the target files in each scene
def extract_fnames(data_path: str, scene_name: str) -> tuple:
    """Return lists of the VV, VH, layover-shadow and README file names
    associated with a single S1 scene."""
    # List all files within the scene directory
    scene_files = os.listdir(os.path.join(data_path, scene_name))

    # README file(s) for the scene
    rm = [fname for fname in scene_files if fname.endswith("README.md.txt")]

    # tif file names for each variable
    scene_files_vv = [fname for fname in scene_files if fname.endswith("_VV.tif")]
    scene_files_vh = [fname for fname in scene_files if fname.endswith("_VH.tif")]
    scene_files_ls = [fname for fname in scene_files if fname.endswith("_ls_map.tif")]

    return scene_files_vv, scene_files_vh, scene_files_ls, rm

Below is the output of `extract_fnames()` for two sub-directories in the data directory. Note that `os.listdir()` returns entries in arbitrary order, so the sub-directories are **not** necessarily listed chronologically. This is okay because we will ensure that the files are sorted in chronological order later.

In [None]:
print(extract_fnames(s1_asf_data, scenes_ls[0]))
print(extract_fnames(s1_asf_data, scenes_ls[1]))
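As a minimal sketch of how a chronological sort could work later on, the acquisition timestamp can be parsed from the third underscore-separated field of each scene name. The helper `scene_datetime()` is a hypothetical name introduced for illustration, not part of the tutorial code, and the two scene names are taken from the directory diagram above.

```python
from datetime import datetime


def scene_datetime(scene_name: str) -> datetime:
    """Parse the acquisition timestamp from an ASF RTC scene name (hypothetical helper)."""
    # The third underscore-separated field holds the timestamp, e.g. 20210502T121414
    return datetime.strptime(scene_name.split("_")[2], "%Y%m%dT%H%M%S")


# Two scene names from the directory diagram, deliberately out of order
scenes = [
    "S1A_IW_20210505T000307_DVP_RTC30_G_gpuned_54B1",
    "S1A_IW_20210502T121414_DVP_RTC30_G_gpuned_1424",
]

# Sort by parsed timestamp rather than relying on the listing order
scenes_sorted = sorted(scenes, key=scene_datetime)
print(scenes_sorted)
```

Sorting on the parsed timestamp, rather than on the raw string, keeps the ordering correct even if scenes from different platforms (for example, names that start with a different prefix) are mixed in the same directory.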

We need to attach the filenames to the path of each file so that we end up with a list of the full paths to the VV and VH band imagery, layover-shadow maps and README files. Within this step, we will also add checks to ensure that the process is doing what we want, which is to create lists of file paths for each variable **with the same order across lists** so that we can combine the lists into a multivariate time series. As noted above, it is okay that the lists are not in chronological order, as long as they are in **the same** order across variables. 

In [8]:
def make_filename_lists(asf_s1_data_path: str):
    """Return lists of full file paths (VV, VH, layover-shadow, README),
    with one entry per scene and the same scene order across lists."""
    # Make list of all scenes in dir
    scenes_ls = os.listdir(asf_s1_data_path)

    # Make empty lists to hold file paths for different variables
    fpaths_vv, fpaths_vh, fpaths_ls, fpaths_rm = [], [], [], []

    for scene in scenes_ls:
        # Extract filenames of each file of interest
        fnames_vv, fnames_vh, fnames_ls, fnames_rm = extract_fnames(
            asf_s1_data_path, scene
        )

        # Make full path with filename for each variable
        path_vv = os.path.join(asf_s1_data_path, scene, fnames_vv[0])
        path_vh = os.path.join(asf_s1_data_path, scene, fnames_vh[0])
        path_ls = os.path.join(asf_s1_data_path, scene, fnames_ls[0])
        path_readme = os.path.join(asf_s1_data_path, scene, fnames_rm[0])

        # Check that the files are aligned correctly: the acquisition
        # timestamp is the third underscore-separated field of each file name
        date_vv = pathlib.Path(path_vv).stem.split("_")[2]
        date_vh = pathlib.Path(path_vh).stem.split("_")[2]
        date_ls = pathlib.Path(path_ls).stem.split("_")[2]
        date_rm = pathlib.Path(path_readme).stem.split("_")[2]
        assert date_vh == date_vv == date_ls == date_rm, (
            "File dates do not match across variables."
        )

        fpaths_vv.append(path_vv)
        fpaths_vh.append(path_vh)
        fpaths_ls.append(path_ls)
        fpaths_rm.append(path_readme)

    # Check that all lists are the same length
    assert len(fpaths_vv) == len(fpaths_vh) == len(fpaths_ls) == len(fpaths_rm), (
        "Files weren't extracted correctly. Expected all lists to be the same length, received \n"
        f"{len(fpaths_vv)}, {len(fpaths_vh)}, {len(fpaths_ls)}, {len(fpaths_rm)}"
    )
    return (fpaths_vv, fpaths_vh, fpaths_ls, fpaths_rm)

In [9]:
filepaths_vv, filepaths_vh, filepaths_ls, filepaths_rm = make_filename_lists(
    s1_asf_data
)

Create a dictionary to hold the file paths for each variable so that we can use them more easily later:

In [10]:
file_paths_dict = {
    "vv": filepaths_vv,
    "vh": filepaths_vh,
    "ls": filepaths_ls,
    "readme": filepaths_rm,
}

### {{a2_s1_nb1}}

We will be using the [`gdalbuildvrt`](https://manpages.ubuntu.com/manpages/bionic/man1/gdalbuildvrt.1.html) command. This command creates a *virtual* GDAL dataset given a list of other GDAL datasets (the Sentinel-1 scenes). `gdalbuildvrt` can make a VRT that either tiles the listed files into a large spatial mosaic, or places each of them in a separate band of the VRT. Because we are dealing with a temporal stack of images, we want to use the `-separate` flag to place each file into its own band of the VRT.

Here is where we use the list of file paths we created above. For each variable, write the list of file paths to a `.txt` file which is then passed to `gdalbuildvrt`. 

In [11]:
def create_vrt_object(fpaths_dict: dict, variable: str):
    """Function to create VRT files for each variable given a
    list of file paths for that variable."""

    # Name of the txt file that will hold the input file paths
    fpath_input_txt = f"s1_{variable}_fpaths.txt"

    # Location of the output VRT file, relative to this notebook.
    # The name matches the files read in section B (e.g. s1_stackVV.vrt)
    output_vrt_path = f"../data/vrt_files/s1_stack{variable.upper()}.vrt"

    # Write file paths to txt file
    with open(fpath_input_txt, "w") as fp:
        for item in fpaths_dict[f"{variable}"]:
            fp.write(f"{item}\n")

    !gdalbuildvrt -separate -input_file_list {fpath_input_txt} {output_vrt_path}

In [None]:
create_vrt_object(file_paths_dict, "vv")
create_vrt_object(file_paths_dict, "vh")
create_vrt_object(file_paths_dict, "ls")
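The `!gdalbuildvrt` line above relies on IPython shell syntax, so it only works inside a notebook. Outside of one, the same command could be assembled as an argument list and passed to `subprocess`. The sketch below is one possible approach, not the tutorial's method: `build_vrt_command()` and `run_vrt_command()` are hypothetical helpers, and `example_stack.vrt` is a placeholder output name. Only the `-separate` and `-input_file_list` flags already used above are assumed.

```python
import shutil
import subprocess


def build_vrt_command(input_list_txt: str, output_vrt: str) -> list[str]:
    """Assemble the gdalbuildvrt call used above as an argument list (hypothetical helper)."""
    return ["gdalbuildvrt", "-separate", "-input_file_list", input_list_txt, output_vrt]


def run_vrt_command(cmd: list[str]) -> None:
    """Run the command, failing early if gdalbuildvrt is not on the PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on the PATH")
    # check=True raises CalledProcessError if GDAL exits with an error
    subprocess.run(cmd, check=True)


# Build (but do not run) the command for the VV file list
cmd = build_vrt_command("s1_vv_fpaths.txt", "example_stack.vrt")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.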

## B. Read data

In [19]:
ds_vv = xr.open_dataset("../data/vrt_files/s1_stackVV.vrt", chunks="auto")
ds_vh = xr.open_dataset("../data/vrt_files/s1_stackVH.vrt", chunks="auto")
ds_ls = xr.open_dataset("../data/vrt_files/s1_stackLS.vrt", chunks="auto")

In [None]:
ds_vv

Building the `VRT` object assigns every file listed in the `.txt` file to a different band. In doing this, we lose the metadata that is associated with the files. The next notebook walks through the steps of reconstructing the necessary metadata from file names and auxiliary files and attaching it to the Xarray objects read from the VRTs.

### {{b1_s1_nb1}}

```{image} ../imgs/s1_chunks.png
:alt: chunks_image
:align: center
```

Each variable has a total shape of (103, 13379, 17452) and is 89.59 GB. It is chunked so that each chunk is (11, 1536, 1536) and 99 MB, with 1080 total chunks per variable. 
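These figures can be checked with a little arithmetic. The sketch below assumes 4-byte (`float32`) values and an x dimension of 17452, which is the width consistent with the stated total size and chunk count. Both are assumptions rather than values read from the data, and sizes are computed in binary units (MiB and GiB).

```python
import math

# Assumed array layout: (band, y, x)
shape = (103, 13379, 17452)
chunk = (11, 1536, 1536)
itemsize = 4  # bytes per value, assuming float32

# Number of chunks along each dimension (the last chunk may be partial)
chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunk)]
n_chunks = math.prod(chunks_per_dim)

# Size of one full chunk and of the whole array
chunk_mib = math.prod(chunk) * itemsize / 2**20
total_gib = math.prod(shape) * itemsize / 2**30

print(chunks_per_dim, n_chunks)
print(chunk_mib, round(total_gib, 2))
```

Under these assumptions there are 10, 9 and 12 chunks along the band, y and x dimensions, which gives the 1080 chunks quoted above.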

Depending on your use case, you may want to adjust the chunking of the object. For example, if you are interested in analyzing variability along the temporal dimension, it might make sense to re-chunk the dataset so that operations along that dimension are more easily parallelized. For more detail, see the {term}`chunking` discussion in [Relevant Concepts](../../background/relevant_concepts.md) and the [Parallel Computing with Dask](https://docs.xarray.dev/en/stable/user-guide/dask.html) section of the Xarray documentation. 
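To see why the chunk layout matters for temporal analysis, the sketch below counts how many chunks must be read to extract the full time series at a single pixel. It compares the chunk shape reported above with a hypothetical layout, `time_chunks`, that keeps all 103 time steps in one chunk. The helper `chunks_touched()`, the alternative chunk shape, and the 4-byte value size are assumptions made for illustration.

```python
import math


def chunks_touched(selection: tuple, chunk: tuple) -> int:
    """Chunks overlapped by a selection of the given extent that starts at index 0."""
    return math.prod(math.ceil(s / c) for s, c in zip(selection, chunk))


def chunk_mib(chunk: tuple, itemsize: int = 4) -> float:
    """Size of one full chunk in MiB, assuming 4-byte values by default."""
    return math.prod(chunk) * itemsize / 2**20


pixel_series = (103, 1, 1)         # every time step at one (y, x) location
default_chunks = (11, 1536, 1536)  # chunk shape reported above
time_chunks = (103, 512, 512)      # hypothetical time-contiguous layout

for name, chunk in [("default", default_chunks), ("time-contiguous", time_chunks)]:
    n = chunks_touched(pixel_series, chunk)
    print(f"{name}: {n} chunk(s), {n * chunk_mib(chunk):.0f} MiB read")
```

In Xarray, a time-contiguous layout like this could be requested with `ds_vv.chunk({"band": -1, "y": 512, "x": 512})`, assuming the dimensions are named `band`, `y` and `x`.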

## Conclusion

This notebook demonstrated how to work with a larger-than-memory dataset by creating a virtual dataset that references the full dataset without loading it into memory. 

However, we also saw that reading the data in this way produces an object that lacks important metadata. The next notebook will go through the steps of locating and adding relevant metadata to the backscatter data cubes read in this notebook.
