## Basic stats on all datasets, from the manifest

We can iterate over every dataset in the catalog and report its variable count (which can be broken down by category: primary data, secondary data, primary descriptors, secondary descriptors), the first and last day in the manifest, and the number of days between them. Comparing that span with the number of days actually present lets us infer the number of missing days.

In [1]:
from nnja import DataCatalog

In [None]:
import os

os.environ["NNJA_USE_AUTH"] = "true"
catalog = DataCatalog()
print("catalog json:", catalog.catalog_uri)
print("datasets in catalog:")
catalog.list_datasets()

catalog json: gs://bb-nnja-ai-dev/data/v1/catalog.json
datasets in catalog:


['amsua-1bamua-NC021023',
 'atms-atms-NC021203',
 'mhs-1bmhs-NC021027',
 'cris-crisf4-NC021206',
 'iasi-mtiasi-NC021241',
 'geo-ahicsr-NC021044',
 'geo-gsrasr-NC021045',
 'geo-gsrcsr-NC021046',
 'seviri-sevasr-NC021042',
 'conv-adpsfc-NC000001',
 'conv-adpsfc-NC000002',
 'conv-adpsfc-NC000007',
 'conv-adpsfc-NC000101',
 'conv-adpupa-NC002001']

In [None]:
# For each dataset, load the manifest and read the first and last OBS_DATE
# values from its index. Compare the span between those dates with the number
# of distinct days actually present to estimate how many days are missing,
# then report the number of variables.

for dataset in catalog.list_datasets():
    ds = catalog[dataset]
    manifest = ds.manifest
    first_date = manifest.index.min()
    last_date = manifest.index.max()
    # Day difference between the two endpoints. This does not count the last
    # day itself, so add 1 if an inclusive day count is needed.
    expected_days = (last_date - first_date).days
    # Number of distinct dates present in the manifest index.
    actual_days = len(manifest.index.unique())
    print("    first date:", first_date.strftime("%Y-%m-%d"))
    print("    last date:", last_date.strftime("%Y-%m-%d"))
    print(f"    days of data: {expected_days} (missing {expected_days - actual_days} days)")
    print(f"    number of variables: {len(ds.variables)}")

first date: 1998-10-25 00:00:00+00:00
last date: 2025-03-31 00:00:00+00:00
days of data: 9654 (missing 1707 days)
number of variables: 49
Loading manifest for dataset 'atms-atms-NC021203'...
first date: 2012-02-15 00:00:00+00:00
last date: 2025-03-31 00:00:00+00:00
days of data: 4793 (missing 17 days)
number of variables: 199
Loading manifest for dataset 'mhs-1bmhs-NC021027'...
first date: 2007-02-27 00:00:00+00:00
last date: 2025-03-31 00:00:00+00:00
days of data: 6607 (missing 29 days)
number of variables: 29
Loading manifest for dataset 'cris-crisf4-NC021206'...
first date: 2018-01-16 00:00:00+00:00
last date: 2025-03-31 00:00:00+00:00
days of data: 2631 (missing 15 days)
number of variables: 474
Loading manifest for dataset 'iasi-mtiasi-NC021241'...
first date: 2008-01-01 00:00:00+00:00
last date: 2025-03-31 00:00:00+00:00
days of data: 6299 (missing 46 days)
number of variables: 648
Loading manifest for dataset 'geo-ahicsr-NC021044'...
first date: 2019-12-01 00:00:00+00:00
last da
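
The per-dataset printout above can be hard to scan. As a minimal sketch, the same statistics could be collected into one pandas table. The helper `summarize_manifests` and the `demo-*` dataset names are hypothetical and use synthetic dates rather than real manifests; the span here is counted inclusively (both endpoints), which differs by one from the `.days` difference used in the loop.

```python
import pandas as pd


def summarize_manifests(indexes):
    """Build one summary row per dataset from a {name: DatetimeIndex} mapping.

    Hypothetical helper for illustration; it is not part of nnja.
    """
    rows = []
    for name, index in indexes.items():
        # Collapse timestamps to calendar days and drop duplicates.
        days = index.normalize().unique()
        first, last = days.min(), days.max()
        # Inclusive span: both the first and the last day are counted.
        span = (last - first).days + 1
        rows.append(
            {
                "dataset": name,
                "first_date": first.date(),
                "last_date": last.date(),
                "span_days": span,
                "days_present": len(days),
                "missing_days": span - len(days),
            }
        )
    return pd.DataFrame(rows).set_index("dataset")


# Synthetic stand-ins for manifest indexes (assumed to be daily, UTC).
demo = {
    "demo-complete": pd.date_range("2024-01-01", "2024-01-10", freq="D", tz="UTC"),
    "demo-gappy": pd.DatetimeIndex(
        ["2024-01-01", "2024-01-02", "2024-01-05"], tz="UTC"
    ),
}
summary = summarize_manifests(demo)
print(summary)
```

With real data, the mapping could presumably be built from `catalog[name].manifest.index` for each name in `catalog.list_datasets()`, assuming every manifest index is a `DatetimeIndex` like the one shown for `OBS_DATE`.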

In [5]:
ds.manifest.index

DatetimeIndex(['1998-10-25 00:00:00+00:00', '1998-10-26 00:00:00+00:00',
               '1998-10-27 00:00:00+00:00', '1998-10-28 00:00:00+00:00',
               '1998-10-29 00:00:00+00:00', '1998-10-30 00:00:00+00:00',
               '1998-10-31 00:00:00+00:00', '1998-11-01 00:00:00+00:00',
               '1998-11-02 00:00:00+00:00', '1998-11-03 00:00:00+00:00',
               ...
               '2025-03-22 00:00:00+00:00', '2025-03-23 00:00:00+00:00',
               '2025-03-24 00:00:00+00:00', '2025-03-25 00:00:00+00:00',
               '2025-03-26 00:00:00+00:00', '2025-03-27 00:00:00+00:00',
               '2025-03-28 00:00:00+00:00', '2025-03-29 00:00:00+00:00',
               '2025-03-30 00:00:00+00:00', '2025-03-31 00:00:00+00:00'],
              dtype='datetime64[ns, UTC]', name='OBS_DATE', length=7947, freq=None)
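
The index above has 7947 entries, which matches the 9654-day span minus the 1707 missing days reported in the first block of loop output. To see which days are missing rather than only how many, one option is to compare the index against a complete daily range. The sketch below uses a synthetic `OBS_DATE` index; the helper name `missing_days` is hypothetical and not part of nnja.

```python
import pandas as pd


def missing_days(index):
    """Return the calendar days between the first and last entry of
    `index` that do not appear in it."""
    # Collapse timestamps to calendar days and drop duplicates.
    present = index.normalize().unique()
    # Every day from the first to the last date, inclusive.
    expected = pd.date_range(present.min(), present.max(), freq="D")
    return expected.difference(present)


# Synthetic daily index with a two-day gap (2024 is a leap year).
obs_dates = pd.DatetimeIndex(
    ["2024-02-27", "2024-02-28", "2024-03-02", "2024-03-03"],
    tz="UTC",
    name="OBS_DATE",
)
gaps = missing_days(obs_dates)
print(list(gaps.strftime("%Y-%m-%d")))
```

Assuming the manifest index is a timezone-aware `DatetimeIndex` as shown above, the same helper could be applied as `missing_days(ds.manifest.index)`.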