# Using NNJA-AI data locally

Some users may prefer not to read from the official NNJA repositories and instead keep the data closer to their compute, for example as a local copy or in a lower-latency cloud bucket. This is supported: you can copy the dataset wherever you want, then either keep using the NNJA-AI SDK or treat your copy as regular Parquet files. Here we show both approaches with a local copy of a subset of the data.

For this example we download all of the conventional NNJA data (surface stations and rawinsondes) to a local directory. This is a rich dataset but not the highest in volume, at about 90 GiB.

In [None]:
! gcloud storage cp -r gs://nnja-ai/data/v1/conv/ ~/data/nnja-local-demo/
# This will take a long time: it downloads about 90 GiB of data
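
Once the copy finishes, it can be worth confirming that the files actually arrived before pointing any loader at them. The sketch below is not part of the NNJA-AI SDK; `summarize_parquet_tree` is a hypothetical helper that uses only the Python standard library to count the Parquet files under a directory and total their size.

```python
from pathlib import Path


def summarize_parquet_tree(root: str) -> tuple[int, float]:
    """Return (number of .parquet files, total size in GiB) found under root."""
    # Expand "~" so the same string used in the shell command works here
    root_path = Path(root).expanduser()
    # Walk the whole tree; the files sit in nested subdirectories
    files = [p for p in root_path.rglob("*.parquet") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes / 2**30


# Example call (commented out so the block has no side effects):
# n_files, size_gib = summarize_parquet_tree("~/data/nnja-local-demo/conv")
# print(f"{n_files} files, {size_gib:.1f} GiB")
```

If the reported size is far below the roughly 90 GiB quoted above, the copy was probably interrupted and can be re-run.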

We'll show three ways of loading the data for this sample:
1) Using a modified Catalog
2) Skipping the Catalog and directly instantiating a NNJADataset
3) Skipping the NNJA-AI SDK entirely and just loading the parquet files

## 1) Use the Catalog

The catalog is a JSON file that points to the root directories of the different datasets, and the DataCatalog class takes the relative path of that JSON file plus a base path for the entire project. As long as we keep the same relative layout as the existing NNJA-AI cloud buckets, we can download the existing catalog, delete all references to datasets we did not copy, and change the base path.

In [None]:
! gcloud storage cp gs://nnja-ai/data/v1/catalog.json ~/data/nnja-local-demo/
# Prune this file so it only lists the datasets you downloaded; in this case we keep all the "conv" datasets
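
The pruning step can be done by hand in a text editor, or scripted. The sketch below rests on an assumption that is not confirmed here: that the top level of `catalog.json` is a JSON object keyed by dataset group, with one key (assumed to be `conv`) covering the conventional data. Inspect your copy first and adjust the keys if the layout differs. `prune_catalog` and `prune_catalog_file` are hypothetical helpers, not part of the NNJA-AI SDK.

```python
import json
from pathlib import Path


def prune_catalog(catalog: dict, keep: set[str]) -> dict:
    """Keep only the top-level entries whose key is in `keep`.

    ASSUMPTION: top-level keys identify dataset groups (e.g. "conv").
    """
    return {key: value for key, value in catalog.items() if key in keep}


def prune_catalog_file(path: str, keep: set[str]) -> dict:
    """Rewrite the catalog file in place with only the kept entries."""
    catalog_path = Path(path).expanduser()
    catalog = json.loads(catalog_path.read_text())
    pruned = prune_catalog(catalog, keep)
    catalog_path.write_text(json.dumps(pruned, indent=2))
    return pruned


# Example call (commented out; it overwrites the file it is given):
# prune_catalog_file("~/data/nnja-local-demo/catalog.json", {"conv"})
```

Because the file is overwritten, keeping a backup of the original `catalog.json` makes it easy to restore entries if you download more datasets later.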

In [None]:
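import os

import pandas as pd
from nnja_ai import DataCatalog

# Expand "~" up front so every loader below receives an absolute path
local_dir = os.path.expanduser('~/data/nnja-local-demo')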
catalog = DataCatalog(mirror=None, base_path=local_dir, catalog_json='catalog.json')
date = pd.to_datetime("2021-01-01").tz_localize("UTC")
metar_dataset = catalog["conv-adpsfc-NC000001"]
print(metar_dataset)
ds = metar_dataset.sel(time=date).load_dataset(backend="pandas")
print(ds.head(2))

## 2) Skip the Catalog and directly instantiate a NNJADataset

The Catalog adds little if you have already identified and downloaded the datasets of interest. If you still want the other features of the NNJADataset metadata (column descriptors, etc.), you can create a NNJADataset directly from the metadata file in the downloaded directory:

In [None]:
from nnja_ai import NNJADataset
metar_dataset = NNJADataset(json_uri=f'{local_dir}/conv/adpsfc/NC000001/.pmetadata', base_path=local_dir)
print(metar_dataset)
ds = metar_dataset.sel(time=date).load_dataset(backend="pandas")
ds.head(2)

## 3) Skip the NNJA-AI SDK and read the files directly

There is no requirement to use the SDK; the NNJA-AI files are published as Parquet for maximum flexibility. You can read the files yourself using your preferred backend:

In [None]:
# reading a single file with pandas
parquet_file = f'{local_dir}/conv/adpsfc/NC000001/OBS_DATE=2021-01-01/gdas.20210101.NC000001.parquet'
ds = pd.read_parquet([parquet_file], engine="pyarrow")
ds.head(2)


Or scan all of the files lazily with polars and filter on the `OBS_DATE` partition column:

In [None]:
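import datetime
import glob
import os

import polars as pl

# glob does not expand "~", so expand it before building the pattern
parquet_dir = os.path.expanduser(f'{local_dir}/conv/adpsfc/NC000001')
# One subdirectory per day (OBS_DATE=YYYY-MM-DD), each holding parquet files
parquet_files = glob.glob(f'{parquet_dir}/*/*.parquet')

# Lazy scan; hive_partitioning=True exposes OBS_DATE from the directory names as a column
ds = pl.scan_parquet(parquet_files, hive_partitioning=True)
# The filter is applied when collect() runs
ds.filter(pl.col("OBS_DATE") == datetime.date(2021, 1, 1)).head(2).collect().to_pandas()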
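
Both direct-read examples rely on the `OBS_DATE=YYYY-MM-DD` partition directories seen in the file path above. Before filtering on a date, it can help to check which days are present in your copy. The sketch below assumes every partition directory of a dataset follows that naming pattern; `available_dates` is a hypothetical helper built only on the standard library.

```python
import datetime
from pathlib import Path


def available_dates(dataset_dir: str) -> list[datetime.date]:
    """List the dates that have an OBS_DATE=YYYY-MM-DD partition directory."""
    dates = []
    for part in Path(dataset_dir).expanduser().glob("OBS_DATE=*"):
        if part.is_dir():
            # Directory name looks like "OBS_DATE=2021-01-01"; parse the value
            value = part.name.split("=", 1)[1]
            dates.append(datetime.date.fromisoformat(value))
    return sorted(dates)


# Example call (commented out; it needs the local copy to exist):
# dates = available_dates("~/data/nnja-local-demo/conv/adpsfc/NC000001")
# print(dates[0], dates[-1], len(dates))
```

A gap in the returned list is a quick hint that a day is missing from the local copy rather than from the query.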