<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<center><h1><font color="red" size="+3">Manipulate the TROPOMI Monthly Averages of Tropospheric NO2 </font></h1></center>

# <font color="red">Objectives</font>

- We want to show how to read a collection of monthly NO2 data files and perform manipulations.
- We show how the Python Xarray package is used to read netCDF files and do time series analyses.

# <font color="red">HAQAST Sentinel-5P TROPOMI Nitrogen Dioxide (NO2) Data</font>

- The HAQAST Sentinel-5P TROPOMI Nitrogen Dioxide (NO2) GLOBAL Monthly is a Level 3, gridded dataset providing monthly averages of tropospheric NO2 vertical column density globally.
- It is derived from the Sentinel-5P satellite's TROPOMI instrument by George Washington University as part of the NASA HAQAST program.
- This dataset offers a 0.1 x 0.1 degree (roughly 10 km x 10 km) spatial resolution and is available through [NASA's Earthdata Search](https://access.earthdata.nasa.gov/collections/C2839237129-GES_DISC).
   - It can be downloaded (after registration) from: [HAQ_TROPOMI_NO2_GLOBAL_M_L3_2.4](https://disc.gsfc.nasa.gov/datasets/HAQ_TROPOMI_NO2_GLOBAL_M_L3_2.4/summary)
- The dataset record began in January 2019 and continues to the present. 
- NO2 is an air pollutant that negatively impacts human respiratory health and contributes to premature mortality.
- NO2 is also a precursor to ground-level ozone and fine particulate matter, both of which have severe health consequences. 

![fig_no2](https://docserver.gesdisc.eosdis.nasa.gov/public/project/Images/HAQ_TROPOMI_NO2_GLOBAL_M_L3.2.4.png)

# <font color="red">What is Xarray?</font>
+ `Xarray` is an open source project and Python package that makes working with **labeled multi-dimensional arrays** simple and efficient.
+ Introduces labels in the form of dimensions, coordinates and attributes on top of raw `NumPy`-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. 
+ Is inspired by and borrows heavily from `Pandas`.
+ Builds on top of, and seamlessly interoperates with, the core scientific Python packages, such as NumPy, SciPy, Matplotlib, and Pandas
+ Is particularly tailored to working with `netCDF` files and integrates tightly with `Dask` for parallel computing.


![fig_structure](https://tutorial.xarray.dev/_images/xarray-data-structures.png)
Image Source: tutorial.xarray.dev

Here is an example of how we might structure a dataset for a weather forecast:

![fig_dataset](https://docs.xarray.dev/en/stable/_images/dataset-diagram.png)

Image Source: docs.xarray.dev
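
To make the idea of labeled arrays concrete, here is a minimal sketch that builds a tiny Dataset by hand. The grid, the variable name `no2`, and the attribute text are made up for illustration and are not taken from the TROPOMI files.

```python
import numpy as np
import xarray as xr

# Hypothetical 3 x 4 grid; the values are synthetic, not real NO2 data.
lats = np.array([-10.0, 0.0, 10.0])
lons = np.array([0.0, 10.0, 20.0, 30.0])
values = np.arange(12, dtype=float).reshape(3, 4)

demo = xr.Dataset(
    data_vars={"no2": (("Latitude", "Longitude"), values)},
    coords={"Latitude": lats, "Longitude": lons},
    attrs={"title": "Synthetic example"},
)

# Label-based selection: pick the value at Latitude=0, Longitude=20.
point = demo["no2"].sel(Latitude=0.0, Longitude=20.0)
print(demo)
print(float(point))
```

The selection uses coordinate values rather than integer positions, which is the main convenience Xarray adds on top of a plain NumPy array.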

---

# <font color="red"> Python packages used</font>

- __Matplotlib__: Tool for creating high-quality 2D visualization.
- __Cartopy__: Package designed for geospatial data processing in order to produce maps and other geospatial data analyses.
- __netCDF4__: Python interface to netCDF.
- __Pandas__: Tool for data analysis and visualization on two dimensional labeled data.
- __Xarray__: Package for manipulating labelled multi-dimensional arrays.

In [None]:
try:
    import google.colab
    print("Running in Google Colab")
except ImportError:
    print("Not running in Google Colab")
else:
    print("Installing modules in Google Colab")
    !apt-get install libproj-dev proj-data proj-bin
    !apt-get install libgeos-dev
    !pip install cython
    !pip install cartopy
    !python -m pip install dask[dataframe] --upgrade
    !pip install netCDF4
    !pip install xarray

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import pprint
import datetime
import random

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import cartopy
from cartopy.mpl.ticker import LongitudeFormatter, LatitudeFormatter
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import cartopy.io.shapereader as shapereader

In [None]:
import netCDF4
import numpy as np
import pandas as pd
import xarray as xr

In [None]:
print(f"Version of Numpy:   {np.__version__}")
print(f"Version of Pandas:  {pd.__version__}")
print(f"Version of netCDF4: {netCDF4.__version__}")
print(f"Version of Xarray:  {xr.__version__}")

# <font color='red'>Manipulating NO2 monthly files</font> 

## <font color="blue">File location</font>

- We gathered the monthly tropospheric NO2 files (from NASA) for the year 2024.
- The files are in netCDF-4 format and were transferred to a publicly available remote location.

In [None]:
#base_url = "https://portal.nccs.nasa.gov/datashare/astg/training/python/cartopy"
base_url = "/Users/jkouatch/Downloads"

In [None]:
list_files = [
    'HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_012024_V2.4_20240719.nc4',
    'HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_022024_V2.4_20240719.nc4', 
    'HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_032024_V2.4_20240719.nc4', 
    'HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_042024_V2.4_20240719.nc4', 
    'HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_052024_V2.4_20240719.nc4', 
    'HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_062024_V2.4_20240719.nc4', 
    'HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_072024_V2.4_20240810.nc4', 
    'HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_082024_V2.4_20250110.nc4', 
    'HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_092024_V2.4_20250110.nc4', 
    'HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_102024_V2.4_20250110.nc4', 
    'HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_112024_V2.4_20250110.nc4', 
    'HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_122024_V2.4_20250110.nc4' 
]

Combine the `base_url` variable and the file names:

In [None]:
list_files = [f"{base_url}/{file}" for file in list_files]

In [None]:
list_files

## <font color="blue">Focus on one file</font>

Create the list of month names:

In [None]:
list_months = [datetime.date(2024, i, 1).strftime('%B') for i in range(1, 13)]
list_months

Select an arbitrary month:

In [None]:
idx = random.randint(0, 11)
mymonth = list_months[idx]
print(f"Index = {idx} --> Month = {mymonth}")

### <font color="green">Read the monthly file for the selected month</font>

- We use Xarray to read the netCDF file.
- We obtain an Xarray Dataset object.

In [None]:
myfile = list_files[idx]
myfile

In [None]:
ds = xr.open_dataset(myfile, engine='netcdf4')

In [None]:
ds

In [None]:
print(f"DataArrays in the Dataset: \n\t {list(ds.keys())}")

In [None]:
print(f"Variables in the Dataset: \n\t {list(ds.variables.keys())}")

In [None]:
print(f"Dimensions in the Dataset: \n\t {list(ds.sizes)}")

In [None]:
print(f"Coordinates in the Dataset: \n\t {list(ds.coords.keys())}")

In [None]:
print(f"Global attributes: \n\t {list(ds.attrs)}")

In [None]:
print(f"Global attributes - source: \n\t {ds.attrs['source']}")

Minimum/Maximum latitude and longitude:

In [None]:
print(f"Min/Max latitudes: \n\t {ds['Latitude'].values.min()} {ds['Latitude'].values.max()}")

In [None]:
print(f"Min/Max longitudes: \n\t {ds['Longitude'].values.min()} {ds['Longitude'].values.max()}")

__Observations__

- The only dimensions are `Latitude` and `Longitude`.
- There are two variables:
   - `Tropospheric_NO2`
   - `Number_obs`
- __There is no time dimension.__
   - <font color="red">If we want to do time series analyses, we need to combine data from all the files and add the time dimension to the combined Xarray Dataset.</font>

### <font color="green">Create plots with the selected monthly file</font>

#### Define a function creating a map

In [None]:
def create_map_contourf(xds: xr.DataArray,
                        map_projection = None,
                        data_transform = None,
                        mytitle: str = None,
                        units: str = None):
    """
    Create a contour plot on top of the map of the world.
    Use the Xarray DataArray object to extract the latitudes, longitudes
    and the data to do the plot.

    Parameters
    ----------
    xds : xr.DataArray
       Xarray DataArray containing the data we want to visualize.
       It must have `Latitude` and `Longitude` coordinates.
    map_projection :
       Cartopy object representing the map projection we want to use for our plot.
    data_transform :
       Cartopy object representing the coordinate system the data is in.
    mytitle : str
       Title of the plot.
    units : str
       Title of the colorbar.
       This title represents the unit of the variable we are plotting.
    """
    if not map_projection:
        map_projection = ccrs.PlateCarree()

    if not data_transform:
        data_transform = ccrs.PlateCarree()
        
    plt.rcParams["figure.figsize"] = [15, 12]
    fig = plt.figure(tight_layout=False)

    ax = fig.add_subplot(1, 1, 1, projection=map_projection)
    data = xds.values
    lats = xds['Latitude'].values
    lons = xds['Longitude'].values

    cp = ax.contourf(lons, lats, data, 60,
                     cmap='jet', transform=data_transform)
    
    ax.add_feature(cfeature.COASTLINE)
    ax.add_feature(cfeature.BORDERS)
    ax.add_feature(cfeature.LAKES)
    ax.add_feature(cfeature.RIVERS)
    
    ax.gridlines(draw_labels=True, dms=True, x_inline=False, y_inline=False)

    if mytitle:
        ax.set_title(mytitle)

    cbar = plt.colorbar(cp, orientation='horizontal', ax=ax, pad=0.05, shrink=0.7)
    if units:
        cbar.set_label(units);

__Extract the NO2 DataArray__

In [None]:
mymonth_tropNO2 = ds['Tropospheric_NO2']

In [None]:
mymonth_tropNO2

In [None]:
myunits = f'{mymonth_tropNO2.attrs["long_name"]} [{mymonth_tropNO2.attrs["units"]}]'
myunits

__Plot NO2 against longitude for specific latitude values__

In [None]:
mymonth_tropNO2.sel(Latitude=slice(-4.0, 5.0, 12)).plot(x="Longitude", hue="Latitude");

__Quick global map__

In [None]:
mymonth_tropNO2.plot(cmap="jet");

__Use function to create map__

In [None]:
mytitle = f'Monthly average tropospheric NO2 for {mymonth}'
units = myunits
create_map_contourf(mymonth_tropNO2, mytitle=mytitle, units=units)

__Zoom in over Africa__

```python
LatIndexer, LonIndexer = 'Latitude', 'Longitude'
SliceData = data.sel(
    **{LatIndexer: slice(min_lat, max_lat),
       LonIndexer: slice(min_lon, max_lon)}
)
```

In [None]:
min_lat, max_lat = -35.3, 38.1
min_lon, max_lon = -19.2, 53.0

LatIndexer, LonIndexer = 'Latitude', 'Longitude'

mymonth_tropNO2_africa = mymonth_tropNO2.sel(
    **{LatIndexer: slice(min_lat, max_lat),
       LonIndexer: slice(min_lon, max_lon)}
)
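
The same pattern can be tried on a small synthetic grid without the data files. The sketch below assumes ascending coordinates, because a label-based `slice` follows the order of the index; with a descending latitude axis the bounds would have to be swapped. The helper name `select_box` is hypothetical.

```python
import numpy as np
import xarray as xr

def select_box(da, min_lat, max_lat, min_lon, max_lon):
    """Return the part of `da` inside a lat/lon box (ascending coordinates assumed)."""
    return da.sel(Latitude=slice(min_lat, max_lat),
                  Longitude=slice(min_lon, max_lon))

# Synthetic 1-degree global grid (values are placeholders).
lats = np.arange(-89.5, 90.0, 1.0)
lons = np.arange(-179.5, 180.0, 1.0)
grid = xr.DataArray(
    np.zeros((lats.size, lons.size)),
    coords={"Latitude": lats, "Longitude": lons},
    dims=("Latitude", "Longitude"),
)

africa = select_box(grid, -35.3, 38.1, -19.2, 53.0)
print(africa.sizes)
```

Both slice bounds are inclusive, so every grid point whose coordinate falls inside the box is kept.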

In [None]:
mymonth_tropNO2_africa

In [None]:
mytitle = f'Monthly average tropospheric NO2 over Africa for {mymonth}'
create_map_contourf(mymonth_tropNO2_africa, mytitle=mytitle, units=units)

## <font color="blue">Combine the monthly data</font>

__Loop over the monthly data files to create a single Xarray Dataset__

In [None]:
datasets = list()
for i, path in enumerate(list_files, start=1):
    ds = xr.open_dataset(path)
    # Assign a time coordinate to each dataset
    # Replace this with your actual time extraction logic
    time_val = pd.to_datetime(f'2024-{i:02}-15')
    ds = ds.assign_coords(time=time_val)
    ds = ds.expand_dims('time') # Expand to explicitly create the time dimension
    datasets.append(ds)
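
The loop above derives the month from the position of each file in the list. As one possible way to fill in the "actual time extraction logic", the sketch below parses the `MMYYYY` token that follows `Monthly_` in the file names listed earlier. The helper name `date_from_filename` and the choice of the 15th as a mid-month stamp are assumptions for illustration.

```python
import datetime
import re

def date_from_filename(path: str) -> datetime.date:
    """Extract a mid-month date from a name like '..._Monthly_012024_V2.4_...nc4'."""
    match = re.search(r"_Monthly_(\d{2})(\d{4})_", path)
    if match is None:
        raise ValueError(f"No MMYYYY token found in {path!r}")
    month, year = int(match.group(1)), int(match.group(2))
    return datetime.date(year, month, 15)

name = "HAQ_TROPOMI_NO2_GLOBAL_QA75_L3_Monthly_072024_V2.4_20240810.nc4"
print(date_from_filename(name))
```

The returned date can be passed to `pd.to_datetime` in place of the formatted string, which keeps the time stamps correct even if the list of files is reordered or incomplete.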

Concatenate the datasets along the `'time'` dimension to create a Dataset with 12 records:

In [None]:
ds = xr.concat(datasets, dim='time')
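
To see the effect of `expand_dims` and `xr.concat` without the data files, here is a small sketch using synthetic 2 x 3 grids. The coordinate values and the field values are placeholders.

```python
import numpy as np
import pandas as pd
import xarray as xr

monthly = []
for month in range(1, 13):
    # One synthetic "file": a 2-D field with no time dimension.
    da = xr.DataArray(
        np.full((2, 3), float(month)),
        coords={"Latitude": [0.05, 0.15], "Longitude": [10.05, 10.15, 10.25]},
        dims=("Latitude", "Longitude"),
        name="Tropospheric_NO2",
    )
    stamp = pd.to_datetime(f"2024-{month:02}-15")
    # Attach a scalar time coordinate, then promote it to a length-1 dimension.
    monthly.append(da.assign_coords(time=stamp).expand_dims("time"))

combined = xr.concat(monthly, dim="time")
print(combined.dims, combined.shape)
```

Each 2-D field becomes a slab of length one along `time`, and the concatenation stacks the twelve slabs into a single 3-D array.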

In [None]:
ds

Extract the time series DataArray for NO2:

In [None]:
tropNO2 = ds['Tropospheric_NO2']
tropNO2

__Monthly plots__

In [None]:
tropNO2.plot(x="Longitude", y="Latitude",
                col="time", col_wrap=3);

### <font color="green">Slicing the data</font>

__Select an arbitrary lat/lon location__

In [None]:
longitude = 15.55 
latitude = 5.05

__Plot the selected location on a map__

In [None]:
fig = plt.figure(figsize=(12, 9))
map_projection = ccrs.PlateCarree()
data_transform = ccrs.PlateCarree()

ax = plt.axes(projection=map_projection)
ax.stock_img()

# Plot the selected location 
plt.plot([longitude], [latitude], marker='*',
         transform=data_transform,
         color="purple",
         markersize=10)

ax.set(title=f"Selected location (lat={latitude}, lon={longitude}) used to extract the NO2 time series");

__Interpolate at the selected location to obtain time series data__

In [None]:
one_point = tropNO2.interp(Latitude=latitude, Longitude=longitude)

one_point

- When you interpolate the data at a single point, the output is a one-dimensional array with one value per time step.
- The values represent the tropospheric NO2 column (in the units given by the variable's `units` attribute) over time.
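
Interpolation is not the same as picking the closest grid cell. The sketch below contrasts `interp` with `sel(..., method="nearest")` on a small synthetic field that varies linearly, so the interpolated value can be checked by hand. The grid and the query point are made up for illustration, and `interp` typically relies on SciPy being installed.

```python
import numpy as np
import xarray as xr

lats = np.array([0.0, 1.0, 2.0])
lons = np.array([0.0, 1.0, 2.0])
# Synthetic field that varies linearly: value = lat + 10 * lon.
field = xr.DataArray(
    lats[:, None] + 10.0 * lons[None, :],
    coords={"Latitude": lats, "Longitude": lons},
    dims=("Latitude", "Longitude"),
)

# Linear interpolation between the surrounding grid points.
interpolated = float(field.interp(Latitude=0.25, Longitude=1.4))
# Value of the closest grid cell, with no interpolation.
nearest = float(field.sel(Latitude=0.25, Longitude=1.4, method="nearest"))
print(interpolated, nearest)
```

For a monthly average on a 0.1 degree grid the two results are usually close, but the nearest-cell value is the only one that corresponds to an actual grid value in the file.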

In [None]:
one_point.shape

We can get the values as a NumPy array:

In [None]:
one_point.values

**Time series plot at a single location**

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
one_point.plot.line(ax=ax, marker="o", color="grey",
                    markerfacecolor="purple",
                    markeredgecolor="purple");
ax.set(title=f"Time Series at latitude {latitude:.2f} and longitude {longitude:.2f}");

### <font color='green'> Slice the data by time and location</font>
- We want to slice the data at a selected lat/lon location and for the months of April to June.

In [None]:
beg_date = "2024-04-01"
end_date = "2024-06-30"
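# Select the time window first, then the nearest grid cell
# (`method` cannot be combined with a slice in the same call).
tropNO2_apr_jun = tropNO2.sel(time=slice(beg_date, end_date)).sel(
    Latitude=latitude, Longitude=longitude, method="nearest")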
tropNO2_apr_jun

In [None]:
print(tropNO2_apr_jun.shape)

We can plot the data:

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
tropNO2_apr_jun.plot.line(ax=ax, marker="o", color="grey",
                       markerfacecolor="purple",
                       markeredgecolor="purple")
ax.set(title="April-June Time Series for A Single Location");

### <font color='green'> Time series at specific latitudes and along a longitude line</font>

- We can use line plots to check the variation of tropospheric NO2 over time at three different latitudes along a longitude line:

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
tropNO2.isel(Longitude=10, Latitude=[19, 21, 22]).plot.line(x="time");

### <font color='green'> Perform Correlation Analysis</font>

__Compute the annual mean__

In [None]:
tropNO2_clim = tropNO2.mean(dim='time')

In [None]:
tropNO2_clim

In [None]:
mytitle = "Annual mean of NO2"
create_map_contourf(tropNO2_clim, mytitle=mytitle, units=myunits)

__Compute the anomaly__

In [None]:
tropNO2_anom = tropNO2 - tropNO2_clim
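
The subtraction works because Xarray broadcasts by dimension name: the 2-D mean is subtracted from every time step of the 3-D array. A minimal sketch on synthetic data (random values, placeholder coordinates) shows the shapes involved and checks that the anomaly averages to zero over time.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.to_datetime([f"2024-{m:02}-15" for m in range(1, 13)])
rng = np.random.default_rng(0)
series = xr.DataArray(
    rng.random((12, 2, 3)),
    coords={"time": times,
            "Latitude": [0.05, 0.15],
            "Longitude": [10.05, 10.15, 10.25]},
    dims=("time", "Latitude", "Longitude"),
)

clim = series.mean(dim="time")   # dims: (Latitude, Longitude)
anom = series - clim             # broadcast back to (time, Latitude, Longitude)
print(clim.shape, anom.shape)
print(float(abs(anom.mean(dim="time")).max()))
```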

In [None]:
tropNO2_anom

__Plot the anomaly for the selected month__

In [None]:
mytitle = f"Anomaly of NO2 for the month of {mymonth}"
create_map_contourf(tropNO2_anom[idx], mytitle=mytitle, units=myunits)

__Plot anomaly time series at a specific location__

In [None]:
tropNO2_ref = tropNO2_anom.sel(Longitude=longitude, Latitude=latitude, method='nearest')
tropNO2_ref.plot();

__Compute correlation__

In [None]:
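def covariance(x, y, dims=None):
    # Population covariance along `dims` (normalized by the number of valid samples).
    return xr.dot(x - x.mean(dims), y - y.mean(dims), dims=dims) / x.count(dims)

def correlation(x, y, dims=None):
    # Pearson correlation: covariance divided by the product of the standard deviations.
    return covariance(x, y, dims) / (x.std(dims) * y.std(dims))

In [None]:
tropNO2_cor = correlation(tropNO2_anom, tropNO2_ref, dims='time')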
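
As a cross-check, Xarray also provides `xr.corr`, which computes the Pearson correlation along a dimension. The sketch below compares it with a hand-written version on synthetic data. The function name `pearson` is hypothetical, and it uses a mean of products rather than `xr.dot`, which is a different formulation from the functions above.

```python
import numpy as np
import xarray as xr

def pearson(x, y, dim):
    """Pearson correlation along `dim` (population statistics, ddof=0)."""
    cov = ((x - x.mean(dim)) * (y - y.mean(dim))).mean(dim)
    return cov / (x.std(dim) * y.std(dim))

rng = np.random.default_rng(42)
field = xr.DataArray(rng.random((12, 2, 3)),
                     dims=("time", "Latitude", "Longitude"))
ref = field.isel(Latitude=0, Longitude=1)   # reference time series

manual = pearson(field, ref, "time")
builtin = xr.corr(field, ref, dim="time")
print(manual.shape)
print(float(manual.isel(Latitude=0, Longitude=1)))
```

The correlation of the reference point with itself is 1, which gives a quick sanity check on any correlation map.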

In [None]:
mytitle = f'Correlation between global tropNO2 anomaly and tropNO2 anomaly at lat={latitude:.2f}/lon={longitude:.2f}'
create_map_contourf(tropNO2_cor, mytitle=mytitle, units="Correlation coefficient")

__Compute the time series of spatial means__

In [None]:
tropNO2_anom_avg = tropNO2_anom.mean(dim=['Latitude', 'Longitude'])
tropNO2_anom_avg
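
The mean above gives every grid cell the same weight, although cells on a regular latitude/longitude grid shrink toward the poles. A common refinement, not used in this notebook, is to weight each cell by the cosine of its latitude with `DataArray.weighted`. The sketch below uses a synthetic field to show the difference.

```python
import numpy as np
import xarray as xr

lats = np.array([-60.0, 0.0, 60.0])
lons = np.array([0.0, 90.0, 180.0, 270.0])
# Synthetic field: 1 along the equator, 0 elsewhere.
data = np.zeros((lats.size, lons.size))
data[1, :] = 1.0
field = xr.DataArray(data,
                     coords={"Latitude": lats, "Longitude": lons},
                     dims=("Latitude", "Longitude"))

# Weights proportional to the grid-cell area on a regular lat/lon grid.
weights = np.cos(np.deg2rad(field["Latitude"]))

plain_mean = float(field.mean(dim=["Latitude", "Longitude"]))
weighted_mean = float(field.weighted(weights).mean(dim=["Latitude", "Longitude"]))
print(plain_mean, weighted_mean)
```

The equatorial row counts for more in the weighted mean because its cells cover a larger area than those at 60 degrees.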

In [None]:
tropNO2_anom_avg.plot();

__Interpolation using datetime strings__

In [None]:
inter_data = tropNO2.interp(time=["2024-03-10", "2024-11-26"])
inter_data

In [None]:
inter_data.plot(x="Longitude", y="Latitude", col="time");

---

**Compute seasonal values:**

For seasons `JFM`, `AMJ`, `JAS` and `OND`:

In [None]:
JFM_dst = tropNO2.resample(time='QS-JAN').mean()
JFM_dst

In [None]:
JFM_dst.plot(x="Longitude", y="Latitude", col="time", col_wrap=3)
plt.suptitle("Seasonal Means (JFM, AMJ, JAS, OND)", y = 1.05)

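For seasons `DJF`, `MAM`, `JJA` and `SON` (with a single calendar year of data this yields five bins: the first holds only January and February, the last only December):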

In [None]:
DJF_dst = tropNO2.resample(time='QS-DEC').mean()
DJF_dst

Alternatively, group by season label. Here `DJF` pools January, February and December 2024 into a single value:

In [None]:
DJF_dst2 = tropNO2.groupby('time.season').mean()

In [None]:
DJF_dst.plot(x="Longitude", y="Latitude", col="time", col_wrap=3)
plt.suptitle("Seasonal Means (DJF, MAM, JJA, SON)", y = 1.05)
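
`resample` and `groupby` treat the winter season differently when only one calendar year is available. The sketch below makes this visible with a synthetic monthly series whose value equals the month number. The series is made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.to_datetime([f"2024-{m:02}-15" for m in range(1, 13)])
# Synthetic series whose value equals the month number (1 to 12).
series = xr.DataArray(np.arange(1.0, 13.0), coords={"time": times}, dims="time")

# Consecutive 3-month bins anchored on December.
by_quarter = series.resample(time="QS-DEC").mean()
# All months sharing a season label are pooled together.
by_season = series.groupby("time.season").mean()

print(by_quarter.values)
print(dict(zip(by_season["season"].values, by_season.values)))
```

`resample` keeps January/February and December in separate bins at the two ends of the year, while `groupby` averages the three months together under the single label `DJF`.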

---

## <font color="red">Useful References</font>
- <a href="http://xarray.pydata.org/en/stable/"> xarray</a>
- <a href="http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/xarray.html"> XARRAY TUTORIAL</a>
- <a href="https://openresearchsoftware.metajnl.com/articles/10.5334/jors.148/"> xarray: N-D labeled arrays and datasets in Python</a>
- <a href="https://nbviewer.jupyter.org/github/mccrayc/tutorials/blob/master/2_reanalysis/CFSR_Data_Tutorial.ipynb">Importing and mapping reanalysis data with xarray and cartopy</a>
- <a href="https://cbrownley.wordpress.com/tag/xarray/">Visualizing Global Land Temperatures in Python with scrapy, xarray, and cartopy</a>
- [Xarray Introduction and Tutorial](https://boisestate.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=a38a2efc-1ac6-4c02-af0f-acfc015e9444)