# ATEK Demo 2: ATEK Data Store
In Demo 1 (TODO: add link here), we showed how to preprocess Aria data sequences into WebDataset (WDS) files, which can later be loaded directly into PyTorch DataLoaders. However, preprocessing large datasets is time- and resource-consuming. ATEK therefore provides an **ATEK Data Store**, where users can directly download preprocessed Aria datasets and skip the preprocessing step entirely.

This demo walks through the steps for accessing datasets on ATEK Data Store, using the preprocessed AriaDigitalTwin (ADT) dataset as an example. See the wiki page for the full list of datasets on ATEK Data Store. (TODO: add link to wiki page)



In [None]:
# Standard library
import faulthandler
import logging
import os
import subprocess
import sys
from logging import StreamHandler
from typing import Dict, List, Optional

# Third-party
import numpy as np
import torch
from omegaconf import OmegaConf
from tqdm import tqdm

# ATEK
from atek.data_loaders.atek_wds_dataloader import (
    create_native_atek_dataloader
)
from atek.util.file_io_utils import load_yaml_and_extract_tar_list
from atek.viz.atek_visualizer_base import NativeAtekSampleVisualizer

faulthandler.enable()

# Configure logging to display the log messages in the notebook
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout)
    ]
)

logger = logging.getLogger()

### Set up data and code paths

In [None]:
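# Folder that holds the downloaded JSON file and the outputs of this demo; change it to a path on your machine
# data_dir = os.path.join(os.path.expanduser("~"), "Documents", "projectaria_tools_adt_data/")
data_dir = "/home/louy/Calibration_data_link/Atek/2024_08_05_DryRun"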
atek_src_path = os.path.join(os.path.expanduser("~"), "atek_on_fbsource")
viz_conf = OmegaConf.load(os.path.join(atek_src_path, "atek", "configs", "obb_viz_new.yaml"))

## Download and process ATEK data json file from ATEK Data Store
First, users can download the `AriaDigitalTwin_ATEK_download_urls.json` file from ATEK Data Store. This JSON file contains the ATEK WDS information for the preprocessed ADT data. (TODO: add link to download guidance)

Users now have 2 options to access the data:
1. Download (a selection of) the WDS files to local storage. This is recommended for training.
2. Stream the WDS files directly via their URLs. Because the URLs expire every 30 days, this is only recommended for small-scale local testing and exploration.

Below we demonstrate both approaches.

In [None]:
# First, download the JSON file from ATEK Data Store
atek_json_path = os.path.join(data_dir, "AriaDigitalTwin_ATEK_download_urls.json")
if not os.path.exists(atek_json_path):
    # Raise an error rather than calling exit(), so the notebook kernel keeps running
    raise FileNotFoundError(
        f"{atek_json_path} not found. "
        "Please download AriaDigitalTwin_ATEK_download_urls.json from ATEK Data Store"
    )
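
Before choosing an option, it can help to peek at the downloaded JSON file. Its schema is not described in this demo, so the sketch below makes no assumption about it and only reports the type of each top-level entry. `summarize_json` is a hypothetical helper written for illustration, not part of ATEK.

```python
import json
from typing import Any, Dict


def summarize_json(json_path: str) -> Dict[str, str]:
    """Map each top-level key of a JSON file to the type name of its value."""
    with open(json_path, "r", encoding="utf-8") as f:
        content: Any = json.load(f)
    if not isinstance(content, dict):
        # The root is not a JSON object: report only the root type.
        return {"<root>": type(content).__name__}
    return {key: type(value).__name__ for key, value in content.items()}
```

Calling `summarize_json(atek_json_path)` in the notebook returns a small dict that can be printed or logged.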

## [Option 1] Download from ATEK Data Store to local
To download, users should run the `dataverse_url_parser.py` script shipped with ATEK with the `--download-wds-to-local` flag. Users can select which preprocessing config to download, the train/validation split, and the number of sequences to download. The output folder will contain the downloaded WDS data along with 3 yaml files: `local_all_tars.yaml`, `local_train_tars.yaml`, and `local_validation_tars.yaml`. Each yaml file lists the relative paths of the downloaded files and can be consumed by the ATEK lib (usage shown later).

In [None]:
# Build the command that invokes ATEK's url parser tool to download files
download_to_local_command = (
    f"python3 {atek_src_path}/tools/dataverse_url_parser.py"
    f" --config-name cubercnn"
    f" --input-json-path {atek_json_path}"
    f" --output-folder-path {data_dir}/downloaded_local_wds_2/"
    f" --max-num-sequences 2"
    f" --download-wds-to-local"
)

# TODO: Try to run the command here, and redirect stdout to the Notebook in real time.
logger.info("Please run the following command in a Terminal window to download WDS files to local: ")
logger.info(f"mamba activate atek; {download_to_local_command}")
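
As a possible way to address the TODO above, the command could be run from the notebook with its output echoed line by line. The following is a minimal sketch using only the Python standard library; `run_and_stream` is a hypothetical helper, not part of ATEK. It assumes the notebook kernel already runs inside the `atek` environment, because `mamba activate` is a shell feature and is not executed here.

```python
import shlex
import subprocess
import sys
from typing import List, Union


def run_and_stream(command: Union[str, List[str]]) -> int:
    """Run a command, echo its output line by line, and return its exit code."""
    # Accept either a ready-made argument list or a single command string.
    args = shlex.split(command) if isinstance(command, str) else command
    process = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered, so lines show up as they are produced
    )
    assert process.stdout is not None
    for line in process.stdout:
        sys.stdout.write(line)
        sys.stdout.flush()
    return process.wait()
```

For example, `run_and_stream(download_to_local_command)` would run the download command built in the cell above; a non-zero return value indicates that the script failed.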

## [Option 2] Create Streamable yaml files
Users can also run the `dataverse_url_parser.py` script **without** the `--download-wds-to-local` flag, which only creates 3 yaml files: `streamable_all_tars.yaml`, `streamable_train_tars.yaml`, and `streamable_validation_tars.yaml`. Each yaml file lists the URLs of the WDS shard files and can be consumed by the ATEK lib (usage shown later).

In [None]:
# Build the command that invokes ATEK's url parser tool to create streamable yaml files
create_streamable_yaml_command = (
    f"python3 {atek_src_path}/tools/dataverse_url_parser.py"
    f" --config-name cubercnn"
    f" --input-json-path {atek_json_path}"
    f" --output-folder-path {data_dir}/streamable_yamls/"
    f" --max-num-sequences 5"
)

# TODO: Try to run the command here, and redirect stdout to the Notebook in real time.
logger.info("Please run the following command in a Terminal window to create streamable yaml files: ")
logger.info(f"mamba activate atek; {create_streamable_yaml_command}")
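
After running the command, a quick check that the expected yaml files exist can catch a failed or partial run before data loading starts. The sketch below assumes the `<prefix>_<split>_tars.yaml` naming pattern, inferred from the `streamable_validation_tars.yaml` file used later in this demo; `missing_yaml_files` is a hypothetical helper, not part of ATEK.

```python
import os
from typing import List


def missing_yaml_files(output_folder: str, prefix: str) -> List[str]:
    """Return the expected yaml file names that are absent from output_folder."""
    # Assumed naming pattern: <prefix>_all_tars.yaml, <prefix>_train_tars.yaml, ...
    expected = [
        f"{prefix}_{split}_tars.yaml" for split in ("all", "train", "validation")
    ]
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(output_folder, name))
    ]
```

For example, `missing_yaml_files(os.path.join(data_dir, "streamable_yamls"), "streamable")` returns an empty list when all 3 files are present.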

## Create PyTorch data loaders from the yaml files
Now users can load a list of tar URLs from either the `local_*.yaml` or the `streamable_*.yaml` files, and then create a PyTorch DataLoader to load the WDS content (see Demo 1 - Example 2). Here, we demonstrate how to visualize the WDS content using the **streamable** yaml files.

In [None]:
# Load Native ATEK WDS data
logger.info("-------------------- streaming ATEK data directly from DataStore --------------- ")

# Read the WDS shard URLs from the streamable yaml file
tar_file_urls = load_yaml_and_extract_tar_list(
    yaml_path=os.path.join(data_dir, "streamable_yamls", "streamable_validation_tars.yaml")
)

# Batch size is None so that no collation is invoked
atek_dataloader = create_native_atek_dataloader(urls=tar_file_urls, batch_size=None, repeat_flag=False)

# Loop over all samples in DataLoader and visualize
atek_visualizer = NativeAtekSampleVisualizer(viz_prefix="dataloading_visualizer", conf=viz_conf)
for atek_sample_dict in atek_dataloader:
    # Visualize the sample, passed in its dict form
    atek_visualizer.plot_atek_sample_as_dict(atek_sample_dict)
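
When streaming, iterating over every sample can take a while. For a quick look it may be enough to visualize only the first few samples. A minimal sketch with `itertools.islice` follows; `take_first` is a hypothetical helper, not part of ATEK, and works with any iterable.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take_first(samples: Iterable[T], max_samples: int) -> Iterator[T]:
    """Yield at most max_samples items from samples, then stop."""
    return islice(samples, max_samples)
```

In the loop above, replacing `atek_dataloader` with `take_first(atek_dataloader, 10)` would stop after 10 samples.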