# NetCDF and self-describing datasets

Eviatar Bach ([eviatarbach@protonmail.com](mailto:eviatarbach@protonmail.com)), based heavily on the originals by Rebekah Esmaili ([bekah@umd.edu](mailto:bekah@umd.edu))

University of Maryland, College Park

Code, Data, and Installation Guide: [https://ter.ps/agupy](https://ter.ps/agupy)


## Examining the 2018 California Wildfires from Space

* 6,870 fires had burned over a 6,000 km${^2}$ area. 
* The smoke from the wildfires also affected air quality, both near the fires and across the country.
* We'll look at satellite observations from __Suomi-NPP__ and __GOES-17__ to show the impact of the California wildfires on aerosol optical depth (AOD).

![](img/satellites.png)

We're in the golden age of satellite datasets, which is a blessing and a curse:

* Inundated with datasets, don't know which ones to use
* No single repository for access of the data
* Inconsistent formatting and filetypes

netCDF4 and HDF5 are the dominant formats used in satellite remote sensing (but others do exist)

## netCDF Primer

* Hosted by the Unidata program at the University Corporation for Atmospheric Research (UCAR)
* NetCDF (Network Common Data Form) is a set of software libraries and self-describing, machine-independent data formats
* Supports the creation, access, and sharing of array-oriented scientific data

Advantages: 
* Open source and free
* Provides standard formatting for Earth science data
* Compression helps with long-term file storage
* Includes additional metadata

Disadvantages: 
* There is a steeper learning curve for working with self-describing file formats

## Panoply
* Pronounced: Pan-OH-plee
* A netCDF, HDF, KMZ, and GRIB data viewer
* Free/open source for Linux, Mac, and Windows
* Developed and maintained by Dr. Robert B. Schmunk of NASA/GISS

Other display tools: 

* Free: HDFView, Ncview, QGIS, Explorer series
* Not free: ENVI/IDL, MATLAB, ArcMap


**Inspecting GOES-17 ABI files**

---
__Your legal disclaimer...__
"These GOES-17 data are preliminary, non-operational data and are undergoing testing. Users bear all responsibility for inspecting the data prior to use and for the manner in which the data are utilized."

---

1. Run Panoply

2. Navigate to "data/OR_ABI-L2-AODC-M3_G17_s20182211612186_e20182211614557_c20182211615551.nc"

3. To make a plot:
    * Variables that have “Geo2D” or “2D” types can be made into plots
    * Double click on AOD
    
    * ![](img/pano-2.PNG)
    * A window will pop up. Choose the first option and click “create” 
    * ![](img/pano-3.PNG)
    * Now we have a map of AOD, but we need to adjust it a little bit.
    * Below the plot, there is a menu where you can change the look, projection, and scale of the plot
    * ![](img/pano-4.PNG)

4. Some formatting options:
    * Zoom: Control + left click and hold to highlight region
    * Scale tab:
        * Change Scale to 0-1
        * Change color table to another palette
        * Edit Captions
    
    * Map tab:
        * Change map projection to orthographic 
    * ![](img/pano-5.PNG)

5. Save Image
    * If you're happy with the way it looks, choose File &rarr; Save Image
    * Note: If you want to reuse this template: Plot &rarr; Save Plot Settings as Default 
    * ![](img/pano-6.PNG)
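
The GOES-17 file name opened in step 2 also carries the observation time. The sketch below assumes the usual GOES-R naming convention, in which the `s`, `e`, and `c` fields hold the scan start, scan end, and file creation times as year, day of year, hour, minute, second, and tenths of a second. The helper name `parse_goes_time` is hypothetical and only used for illustration.

```python
from datetime import datetime

fname = "OR_ABI-L2-AODC-M3_G17_s20182211612186_e20182211614557_c20182211615551.nc"


def parse_goes_time(field):
    """Parse a time field such as 's20182211612186' (assumed GOES-R convention)."""
    # Drop the one-letter prefix and the trailing tenths-of-a-second digit,
    # leaving YYYY + day-of-year + HHMMSS
    return datetime.strptime(field[1:-1], "%Y%j%H%M%S")


# Remove the extension and split the name into its underscore-separated fields
parts = fname[:-len(".nc")].split("_")
start = parse_goes_time(parts[3])  # scan start
end = parse_goes_time(parts[4])    # scan end

print("Scan start:", start.isoformat())
print("Scan end:  ", end.isoformat())
```

Under that assumption, day 221 of 2018 is August 9, the same date that appears in the Suomi-NPP file name used later in this notebook.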

## Importing netCDF files using Python
The netCDF4 package is included in Anaconda Python. The main class is Dataset, which opens an existing file for reading:
```
from netCDF4 import Dataset

file_id = Dataset("test.nc", "r", format="NETCDF4")
```
The mode can be 'w' (write), 'r' (read, the default), or 'a' (append).

The formats can be: NETCDF3_CLASSIC, NETCDF3_64BIT_OFFSET, NETCDF3_64BIT_DATA, NETCDF4_CLASSIC, and NETCDF4 (default). The format argument only takes effect when a new file is written.

In [None]:
from netCDF4 import Dataset

In [46]:
# To open the files, call the Dataset constructor
fname='data/JRR-AOD_v1r1_npp_s201808091955538_e201808091957180_c201808092049460.nc'
file_id = Dataset(fname)

In [47]:
# Quickly inspect the contents
list(file_id.variables.keys())

['Latitude',
 'Longitude',
 'StartRow',
 'StartColumn',
 'AOD550',
 'AOD_channel',
 'AngsExp1',
 'AngsExp2',
 'QCPath',
 'AerMdl',
 'FineMdlIdx',
 'CoarseMdlIdx',
 'FineModWgt',
 'SfcRefl',
 'SpaStddev',
 'Residual',
 'AOD550LndMdl',
 'ResLndMdl',
 'MeanAOD',
 'HighQualityPct',
 'RetrievalPct',
 'QCRet',
 'QCExtn',
 'QCTest',
 'QCInput',
 'QCAll']

In [97]:
# Access the AOD variable through .variables
# This returns a netCDF Variable object; the data are not read into memory yet
AOD550 = file_id.variables['AOD550']
print(type(AOD550))

<class 'netCDF4._netCDF4.Variable'>


In [98]:
# Inspect attributes using .ncattrs command
list(file_id.variables['AOD550'].ncattrs())

['long_name', 'coordinates', 'units', '_FillValue', 'valid_range']

In [99]:
# Get some very simple statistics by converting into a NumPy array
import numpy as np

AOD550 = np.array(AOD550)

# Remove missing values
missing = file_id.variables['AOD550']._FillValue
keepRows = AOD550 != missing
AOD550 = AOD550[keepRows]

avgAOD = AOD550.mean()
stdDev = AOD550.std()
nAOD = AOD550.size

print(avgAOD, stdDev, nAOD)

0.41903266 0.38180444 1746384


In [101]:
# Note: reading with [:,:] normally returns a NumPy masked array (automatic masking is on by default)
AOD550 = file_id.variables['AOD550'][:,:]
print(type(AOD550))

<class 'numpy.ndarray'>


In [102]:
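# With a masked array, missing values are excluded from the statistics automatically.
# The output recorded below comes from a plain ndarray, so the fill values are
# still included (note the negative mean).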
avgAOD = AOD550.mean()
stdDev = AOD550.std()
nAOD = AOD550.size

print(avgAOD, stdDev, nAOD)

-289.09647 453.671 2457600
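
The difference between filtering fill values by hand and letting a masked array hide them can be sketched on a small synthetic array. The values and the fill value of -999.0 below are made up for illustration and do not come from the AOD file.

```python
import numpy as np

fill_value = -999.0  # hypothetical fill value for this sketch
raw = np.array([[0.2, fill_value, 0.4],
                [fill_value, 0.6, 0.8]])

# Option 1: boolean filter. The result is flattened to the valid values only.
valid = raw[raw != fill_value]

# Option 2: masked array. The 2D shape is kept and fill values are hidden.
masked = np.ma.masked_equal(raw, fill_value)

print("Plain ndarray mean (fill values included):", raw.mean())
print("Filtered mean:", valid.mean(), "n =", valid.size)
print("Masked mean:  ", masked.mean(), "n =", masked.count())
print("Masked .size counts every element:", masked.size)
```

Note that `.size` on a masked array counts all elements, masked or not, so `.count()` is the one to use for the number of valid retrievals.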


In [103]:
# Close the file when you're done
file_id.close()

<div class="alert alert-block alert-info">

**Exercise 1**

**Import netCDF file**

Using the Dataset class from the netCDF4 package, import:

data/JRR-AOD_v1r1_npp_s201808091955538_e201808091957180_c201808092049460.nc 


**Inspect the list of variables**

Get a list of variables after the file has been opened.

**Inspect the attributes of a given variable**

What are the attributes of the QCAll variable?

</div>

## Importing HDF files using Python

The process is very similar to netCDF. We'll look at the monthly mean AOD for August 2010 (the month given in the file name below) from the Deep Blue AOD retrieval (output from Panoply below).

* ![](img/db.png)

In [104]:
import h5py

In [105]:
# Open the file
fname='data/DeepBlue-SeaWiFS-1.0_L3M_201008_v004-20130604T140615Z.h5'
file_id_DB = h5py.File(fname, 'r')

In [106]:
# Quickly inspect the contents...
list(file_id_DB.keys())

['aerosol_optical_thickness_550_count_land',
 'aerosol_optical_thickness_550_count_land_ocean',
 'aerosol_optical_thickness_550_count_ocean',
 'aerosol_optical_thickness_550_land',
 'aerosol_optical_thickness_550_land_ocean',
 'aerosol_optical_thickness_550_ocean',
 'aerosol_optical_thickness_550_stddev_land',
 'aerosol_optical_thickness_550_stddev_land_ocean',
 'aerosol_optical_thickness_550_stddev_ocean',
 'aerosol_optical_thickness_count_land',
 'aerosol_optical_thickness_count_ocean',
 'aerosol_optical_thickness_land',
 'aerosol_optical_thickness_ocean',
 'aerosol_optical_thickness_stddev_land',
 'aerosol_optical_thickness_stddev_ocean',
 'angstrom_exponent_count_land',
 'angstrom_exponent_count_land_ocean',
 'angstrom_exponent_count_ocean',
 'angstrom_exponent_land',
 'angstrom_exponent_land_ocean',
 'angstrom_exponent_ocean',
 'angstrom_exponent_stddev_land',
 'angstrom_exponent_stddev_land_ocean',
 'angstrom_exponent_stddev_ocean',
 'diagnostics',
 'land_bands',
 'latitude',
 'l

In [107]:
# Import the data...
AOD = file_id_DB['aerosol_optical_thickness_550_land_ocean']

# Check a value...
#AOD[60, 300]

In [108]:
list(AOD.attrs)

['long_name',
 'standard_name',
 'units',
 'comment',
 '_FillValue',
 'valid_range',
 'DIMENSION_LIST']

In [109]:
# To view the attribute
AOD.attrs['long_name']

b'aerosol optical thickness estimated at 550 nm over land and ocean'
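
The `b'...'` prefix in the output above means the attribute came back as a bytes object rather than a str. A minimal sketch of converting it, assuming the text is UTF-8 encoded; the bytes literal is copied from the output above instead of being read from the file, so the snippet runs on its own.

```python
# Value copied from the output above; with the file open this would be AOD.attrs['long_name']
raw = b'aerosol optical thickness estimated at 550 nm over land and ocean'

# Decode only if needed, since some files store attributes as str already
long_name = raw.decode('utf-8') if isinstance(raw, bytes) else raw

print(long_name)
```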

## Other formats:
* GRIB/GRIB2: World Meteorological Organization (WMO) standard format, commonly used for output from weather models such as ECMWF and GFS. Can be opened using [pygrib](https://github.com/jswhit/pygrib)
* BUFR: Another common model format. Open with [python-bufr](https://github.com/pytroll/python-bufr), part of the pytroll project.

## Resources

Searchable satellite data:

* [NOAA CLASS](https://www.avl.class.noaa.gov) Comprehensive Large Array-data Stewardship System (CLASS) is an electronic library of NOAA environmental data.
* [NASA Mirador](https://mirador.gsfc.nasa.gov) is an electronic library of NASA environmental data. Most NASA satellite data are stored here.
* [NASA Langley](https://eosweb.larc.nasa.gov) Atmospheric Science Data Center (ASDC) Distributed Active Archive Center (DAAC). Aerosols, clouds, radiation, and field campaign data.
* [EUMETSAT](https://www.eumetsat.int/website/home/Data/DataDelivery/OnlineDataAccess/index.html)

Other channels:

* [Amazon Web Services](https://registry.opendata.aws/?search=earth%20observation) has GOES-16 radiance, Landsat, MODIS, and more 
* [python-AWIPS](https://python-awips.readthedocs.io/en/latest/) Has a good repository of atmospheric datasets
* [Python Satellite Data Analysis Toolkit (pysat)](https://github.com/rstoneback) Can pull space science related datasets (e.g. COSMIC-1) 