## TIC collection import

Author: Melissa D

Last run: Nov 21, 2025

https://github.com/astronomy-commons/data.lsdb.io/issues/159

New threshold: between 956_376 and 3_264_430

## Getting the raw data again

I'll try to add `tic_file_list.txt` to the `/imports/` directory, alongside this notebook.

```
$ screen
$ wget --content-disposition --trust-server-names -i tic_file_list.txt
```

The first file is 4.5G and copied in 2m10s. The full dataset is ~415G, so the estimated time for the full download is ~3.5 hours.
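
The estimate above can be reproduced with a quick linear extrapolation from the first file. This is only a rough sketch: it assumes the transfer rate stays constant across all files, which it probably won't, so treat the result as a ballpark figure.

```python
# Sizes and timing taken from the note above.
first_file_gb = 4.5
total_gb = 415
first_file_seconds = 2 * 60 + 10  # 2m10s

# Assumption: the rate measured on the first file holds for the whole dataset.
rate_gb_per_second = first_file_gb / first_file_seconds
estimated_hours = total_gb / rate_gb_per_second / 3600

print(f"Estimated download time: {estimated_hours:.1f} hours")
```

The extrapolation lands a little under the ~3.5 hour figure, which leaves some room for slower files.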

In [None]:
# !wget https://hats-import.readthedocs.io/en/latest/_downloads/fa5e0b89632c46d40d232e2dee087664/tic_types.csv

In [None]:
# !wget https://hats-import.readthedocs.io/en/latest/_downloads/cd80f8e28db0e4e9321644b94c461d03/tic_schema.parquet

In [6]:
import os
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyarrow.parquet as pq
from astropy.io import ascii
from dask.distributed import Client

import hats
import hats_import
import lsdb
from hats.pixel_math import HealpixPixel
from hats_import import CollectionArguments, VerificationArguments, pipeline, pipeline_with_client
from hats_import.catalog.file_readers import CsvReader

hats.__version__

'0.7.1'

In [2]:
# Input paths: the gzipped CSV files fetched with wget above.
raw_dir = Path("/epyc/data3/hats/raw/tic/")
file_list = list(raw_dir.glob("*.csv.gz"))
print("found", len(file_list), "files for import")

found 90 files for import
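
Before importing, it can be worth confirming that every entry in the download list actually arrived. The helper below is a hypothetical sketch, not part of hats-import: `missing_downloads` is a name I made up, and it assumes the list has one URL per line and that each file is saved under the URL's basename. With `--content-disposition` the server may pick a different name, so check that assumption against a real download first. The demo uses placeholder URLs and a temporary directory.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def missing_downloads(url_lines, directory):
    """Return expected file names (URL basenames) that are absent from directory."""
    present = {path.name for path in Path(directory).glob("*.csv.gz")}
    expected = [
        Path(urlparse(line.strip()).path).name
        for line in url_lines
        if line.strip()  # skip blank lines
    ]
    return [name for name in expected if name not in present]


# Demo with placeholder URLs (hypothetical, not the real TIC file list).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "part-000.csv.gz").touch()
    demo_urls = [
        "https://example.org/tic/part-000.csv.gz\n",
        "https://example.org/tic/part-001.csv.gz\n",
    ]
    demo_missing = missing_downloads(demo_urls, tmp)

print(demo_missing)
```

For the real data, the call would look something like `missing_downloads(open("tic_file_list.txt"), raw_dir)`, and an empty list would mean nothing is missing.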


In [3]:
# Load the column names and types from a side file.
type_frame = pd.read_csv("tic_types.csv")
type_map = dict(zip(type_frame["name"], type_frame["type"]))
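
To illustrate what the name/type side file buys us, here is a small sketch using plain pandas on in-memory data. The three columns and two rows are made-up stand-ins for `tic_types.csv` and one raw file, and I'm assuming (not asserting) that `CsvReader` hands similar arguments to `pandas.read_csv` under the hood.

```python
import gzip
import io

import pandas as pd

# Hypothetical stand-in for tic_types.csv, with the same "name"/"type" layout.
types_csv = io.StringIO("name,type\nID,int64\nra,float64\ndec,float64\n")
demo_type_frame = pd.read_csv(types_csv)
demo_type_map = dict(zip(demo_type_frame["name"], demo_type_frame["type"]))

# Hypothetical headerless, gzip-compressed rows standing in for one raw file.
raw_bytes = gzip.compress(b"1,10.5,-20.25\n2,11.0,-21.5\n")

# Read in chunks, supplying column names and dtypes from the side file.
chunks = pd.read_csv(
    io.BytesIO(raw_bytes),
    header=None,
    names=demo_type_frame["name"].tolist(),
    dtype=demo_type_map,
    compression="gzip",
    chunksize=1,
)
demo_frame = pd.concat(chunks, ignore_index=True)

print(demo_frame.dtypes)
```

Because the raw files have no header row, the names and dtypes have to come from somewhere else; keeping them in a side file means every chunk is parsed the same way.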

In [4]:
args = (
    CollectionArguments(
        completion_email_address="delucchi@andrew.cmu.edu",
        output_artifact_name="tic",
        output_path="/epyc/data3/hats/catalogs/v06",
        progress_bar=True,
        simple_progress_bar=True,
    )
    .catalog(
        output_artifact_name="tic",
        input_file_list=file_list,
        file_reader=CsvReader(
            header=None,
            column_names=type_frame["name"].values.tolist(),
            type_map=type_map,
            chunksize=250_000,
            compression="gzip",
        ),
        ra_column="ra",
        dec_column="dec",
        use_schema_file="tic_schema.parquet",
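        highest_healpix_order=8,
        skymap_alt_orders=[2, 4, 6],
        # Falls within the threshold range noted at the top (956_376 to 3_264_430).
        pixel_threshold=2_500_000,
        row_group_kwargs={"num_rows": 200_000},
    )
    # 5 arcsecond margin cache (written out as tic_5arcs).
    .add_margin(margin_threshold=5.0, is_default=True)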
)
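
A quick sanity check on the numbers in this cell, as a sketch: it confirms that the chosen `pixel_threshold` sits inside the range noted at the top, and works out how many row groups a completely full partition would have. The second part assumes `num_rows` acts as the maximum number of rows per row group, which is my reading and not something confirmed here.

```python
import math

# Range noted at the top of this notebook.
threshold_low, threshold_high = 956_376, 3_264_430

# Values used in the catalog arguments.
pixel_threshold = 2_500_000
rows_per_group = 200_000

# The chosen threshold should sit inside the recommended range.
assert threshold_low <= pixel_threshold <= threshold_high

# Assumption: num_rows is the cap on rows per row group, so a partition that
# hits the pixel threshold splits into this many row groups.
max_row_groups = math.ceil(pixel_threshold / rows_per_group)

print(f"At most {max_row_groups} row groups per partition")
```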

In [5]:
with Client(
    local_directory="/epyc/data3/hats/tmp/",
    n_workers=20,
    threads_per_worker=1,
) as client:
    pipeline_with_client(args, client)

Planning  : 100%|██████████| 3/3 [00:21<00:00,  7.02s/it]
Mapping   : 100%|██████████| 1578/1578 [12:31<00:00,  2.10it/s]
Binning   :   0%|          | 0/1 [00:00<?, ?it/s]
Reducing  : 100%|██████████| 1578/1578 [00:13<00:00, 117.18it/s]
Finishing : 100%|██████████| 4/4 [00:03<00:00,  1.24it/s]
Finishing : 100%|██████████| 2/2 [00:00<00:00,  5.37it/s]


In [7]:
args = VerificationArguments(
    input_catalog_path="/epyc/data3/hats/catalogs/v06/tic",
    output_path="./verification/tic",
    truth_total_rows=1727987580,
)
pipeline(args)

Loading dataset and schema.

Starting: Test hats.io.validation.is_valid_collection.
Validating collection at path /epyc/data3/hats/catalogs/v06/tic ... 
Validating catalog at path /epyc/data3/hats/catalogs/v06/tic/tic ... 
Found 1578 partitions.
Approximate coverage is 100.00 % of the sky.
Validating catalog at path /epyc/data3/hats/catalogs/v06/tic/tic_5arcs ... 
Found 1578 partitions.
Approximate coverage is 100.00 % of the sky.
Result: PASSED

Starting: Test that files in _metadata match the data files on disk.
Result: PASSED

Starting: Test that number of rows are equal.
	file footers vs catalog properties
	file footers vs _metadata
	file footers vs truth
Result: PASSED

Starting: Test that schemas are equal, excluding metadata.
	_common_metadata vs truth
	_metadata vs truth
	file footers vs truth
Result: PASSED

Verifier results written to verification/tic/verifier_results.csv
Elapsed time (seconds): 36.68
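
As a last back-of-the-envelope check, the verified row total and partition count give the mean partition size, which should sit below the `pixel_threshold` used for the import. This sketch uses only numbers printed in this notebook, and it says nothing about how rows are distributed across individual partitions.

```python
# Figures reported by the import and verification above.
total_rows = 1_727_987_580
num_partitions = 1578
pixel_threshold = 2_500_000

mean_rows = total_rows / num_partitions

print(f"Mean rows per partition: {mean_rows:,.0f}")
print(f"Fraction of pixel_threshold: {mean_rows / pixel_threshold:.2f}")
```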
