# CLOUD unit test data

There are two types of data used in unit tests in this repo: local and cloud. This notebook covers only the CLOUD versions of the test data and shows how to regenerate them.

The same steps can also initialize the data on a new cloud provider, as an alternative to copying an existing data set.

## Object catalog: small sky

This is the same "object catalog" used in the HATS and LSDB unit test suites: 131 randomly generated (ra, dec) values inside HEALPix pixel 11 at order 0.

In [15]:
import os
import tempfile
from upath import UPath

import hats_import.pipeline as runner
from hats_import.catalog.arguments import ImportArguments
from hats_import.index.arguments import IndexArguments
from hats_import.margin_cache.margin_cache_arguments import MarginCacheArguments
from dask.distributed import Client
from hats.io.file_io import remove_directory

tmp_path = tempfile.TemporaryDirectory()
tmp_dir = tmp_path.name

storage_options = {
    "account_key": os.environ.get("ABFS_LINCCDATA_ACCOUNT_KEY"),
    "account_name": os.environ.get("ABFS_LINCCDATA_ACCOUNT_NAME"),
}
storage_options


output_path = UPath("../cloud/data")

client = Client(n_workers=1, threads_per_worker=1, local_directory=tmp_dir)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 43863 instead
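
The setup cell reads the Azure credentials with `os.environ.get`, which silently yields `None` when a variable is unset. A minimal sketch of a stricter check is shown below, assuming the same two environment variable names. `load_storage_options` is a hypothetical helper written for this example; it is not part of `hats` or `hats_import`.

```python
import os

# Maps each storage option name to the environment variable that supplies it.
# The variable names are the ones used in the setup cell above.
REQUIRED_VARS = {
    "account_key": "ABFS_LINCCDATA_ACCOUNT_KEY",
    "account_name": "ABFS_LINCCDATA_ACCOUNT_NAME",
}


def load_storage_options(environ=None):
    """Hypothetical helper: build storage options, failing early if any are unset."""
    environ = os.environ if environ is None else environ
    missing = sorted(var for var in REQUIRED_VARS.values() if not environ.get(var))
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {option: environ[var] for option, var in REQUIRED_VARS.items()}
```

Calling `load_storage_options()` with no arguments reads from the real environment; passing a plain dict makes the check easy to exercise without touching actual credentials.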


### small_sky

This catalog was generated with the following snippet:

In [7]:
remove_directory(output_path / "small_sky")
with tempfile.TemporaryDirectory() as pipeline_tmp:
    args = ImportArguments(
        input_path="small_sky_parts",
        highest_healpix_order=1,
        file_reader="csv",
        output_path=output_path,
        output_artifact_name="small_sky",
        tmp_dir=pipeline_tmp,
    )
    runner.pipeline_with_client(args, client)

Planning  :   0%|          | 0/4 [00:00<?, ?it/s]

Mapping   :   0%|          | 0/5 [00:00<?, ?it/s]

Binning   :   0%|          | 0/2 [00:00<?, ?it/s]

Splitting :   0%|          | 0/5 [00:00<?, ?it/s]

Reducing  :   0%|          | 0/1 [00:00<?, ?it/s]

Finishing :   0%|          | 0/5 [00:00<?, ?it/s]

### small_sky_order1

This catalog has the same data points as the other small sky catalogs, but the import is forced to spread them over partitions at order 1 instead of order 0.

This means there are 4 leaf partition files instead of just 1, which makes the catalog useful for confirming reads and writes over multiple leaf partition files.
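
The count of four follows from how HEALPix subdivides pixels: each pixel splits into four children at the next order. A minimal sketch, assuming the nested numbering scheme (in which the children of pixel `p` are `4p` through `4p + 3`); `child_pixels` is a hypothetical helper for illustration, not a `hats` function.

```python
def child_pixels(pixel, order, target_order):
    """Return the nested-scheme HEALPix pixels at target_order contained in `pixel`."""
    if target_order < order:
        raise ValueError("target_order must be >= order")
    # Each step down in order multiplies the number of sub-pixels by 4.
    factor = 4 ** (target_order - order)
    return list(range(pixel * factor, (pixel + 1) * factor))


# The small sky data lives in pixel 11 at order 0.
print(child_pixels(11, 0, 1))  # [44, 45, 46, 47]
```

Under that assumption, the four leaf partitions of this catalog would correspond to order-1 pixels 44 through 47.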

NB: Setting `constant_healpix_order` coerces the import pipeline to create leaf partitions at order 1.

This catalog was generated with the following snippet:

In [8]:
remove_directory(output_path / "small_sky_order1")
with tempfile.TemporaryDirectory() as pipeline_tmp:
    args = ImportArguments(
        input_path="small_sky_parts",
        file_reader="csv",
        constant_healpix_order=1,
        output_path=output_path,
        output_artifact_name="small_sky_order1",
        tmp_dir=pipeline_tmp,
    )
    runner.pipeline_with_client(args, client)

Planning  :   0%|          | 0/4 [00:00<?, ?it/s]

Mapping   :   0%|          | 0/5 [00:00<?, ?it/s]

Binning   :   0%|          | 0/2 [00:00<?, ?it/s]

Splitting :   0%|          | 0/5 [00:00<?, ?it/s]

Reducing  :   0%|          | 0/4 [00:00<?, ?it/s]

Finishing :   0%|          | 0/5 [00:00<?, ?it/s]

### small_sky_order1_margin


In [18]:
remove_directory(output_path / "small_sky_order1_margin")
with tempfile.TemporaryDirectory() as pipeline_tmp:
    args = MarginCacheArguments(
        input_catalog_path="small_sky_order1",
        output_path=output_path,
        output_artifact_name="small_sky_order1_margin",
        margin_threshold=7200,
        tmp_dir=pipeline_tmp,
    )
    runner.pipeline_with_client(args, client)

Planning  :   0%|          | 0/3 [00:00<?, ?it/s]

Mapping   :   0%|          | 0/4 [00:00<?, ?it/s]

Binning   :   0%|          | 0/1 [00:00<?, ?it/s]

Reducing  :   0%|          | 0/15 [00:00<?, ?it/s]

Finishing :   0%|          | 0/4 [00:00<?, ?it/s]

### small_sky_object_index

An index table mapping each `"id"` value in the `small_sky_order1` catalog to the pixel where it can be found.
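
As a rough illustration of what such an index provides, the sketch below builds an in-memory lookup from an id to the (order, pixel) pairs that contain it. This is only a conceptual model, not the on-disk format that `hats_import` writes, and the ids and pixel assignments are made up for the example.

```python
from collections import defaultdict

# Made-up rows for illustration: (id, HEALPix order, HEALPix pixel).
rows = [(700, 1, 44), (701, 1, 45), (702, 1, 45), (703, 1, 47)]


def build_index(rows):
    """Map each id to the set of (order, pixel) partitions that contain it."""
    index = defaultdict(set)
    for object_id, order, pixel in rows:
        index[object_id].add((order, pixel))
    return dict(index)


index = build_index(rows)
print(index[702])  # {(1, 45)}
```

With a lookup like this, a reader only needs to open the partitions listed for an id instead of scanning the whole catalog.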

In [10]:
remove_directory(output_path / "small_sky_object_index")
with tempfile.TemporaryDirectory() as pipeline_tmp:
    args = IndexArguments(
        input_catalog_path="small_sky_order1",
        indexing_column="id",
        output_path=output_path,
        output_artifact_name="small_sky_object_index",
        tmp_dir=pipeline_tmp,
    )
    runner.pipeline_with_client(args, client)

Finishing :   0%|          | 0/3 [00:00<?, ?it/s]

### small_sky_xmatch


In [12]:
remove_directory(output_path / "small_sky_xmatch")
with tempfile.TemporaryDirectory() as pipeline_tmp:
    args = ImportArguments(
        input_file_list=["xmatch/xmatch_catalog_raw.csv"],
        file_reader="csv",
        constant_healpix_order=1,
        output_path=output_path,
        output_artifact_name="small_sky_xmatch",
        pixel_threshold=100,
        tmp_dir=pipeline_tmp,
    )
    runner.pipeline_with_client(args, client)

Planning  :   0%|          | 0/4 [00:00<?, ?it/s]

Mapping   :   0%|          | 0/1 [00:00<?, ?it/s]

Binning   :   0%|          | 0/2 [00:00<?, ?it/s]

Splitting :   0%|          | 0/1 [00:00<?, ?it/s]

Reducing  :   0%|          | 0/3 [00:00<?, ?it/s]

Finishing :   0%|          | 0/5 [00:00<?, ?it/s]

In [13]:
tmp_path.cleanup()
client.close()

2025-03-05 11:03:56,703 - distributed.diskutils - ERROR - Failed to remove '/tmp/tmpard9lijs/dask-scratch-space/worker-06risbre' (failed in <built-in function lstat>): [Errno 2] No such file or directory: '/tmp/tmpard9lijs/dask-scratch-space/worker-06risbre'
