# Importing catalogs to HiPSCat format

This notebook presents two ways of importing catalogs to HiPSCat format. The first uses the `lsdb.from_dataframe()` method, which is convenient for loading smaller catalogs from a single dataframe. The second uses the `hipscat-import` pipeline.

In [None]:
import lsdb
import os
import pandas as pd
import tempfile

We will be importing `small_sky_order1` from a single CSV file:

In [None]:
catalog_name = "small_sky_order1"
test_data_dir = os.path.join("../../tests", "data")

Let's define the input and output paths:

In [None]:
# Input paths
catalog_dir = os.path.join(test_data_dir, catalog_name)
catalog_csv_path = os.path.join(catalog_dir, f"{catalog_name}.csv")

# Temporary directory for the intermediate/output files
tmp_dir = tempfile.TemporaryDirectory()

## lsdb.from_dataframe

In [None]:
%%time

# Read simple catalog from its CSV file
catalog = lsdb.from_dataframe(
    pd.read_csv(catalog_csv_path),
    catalog_name="from_dataframe",
    catalog_type="object",
    highest_order=5,
    threshold=100,
)

# Save it to disk in HiPSCat format
catalog.to_hipscat(f"{tmp_dir.name}/from_dataframe")
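
For context on the `highest_order` argument used above (and `highest_healpix_order` in the import pipeline below): a HEALPix tessellation at order `k` divides the sky into `12 * 4**k` equal-area pixels. The sketch below only illustrates that arithmetic. `healpix_pixel_count` is a hypothetical helper written for this note, not part of lsdb or hipscat. Our understanding, stated as an assumption, is that the threshold arguments cap the number of rows stored per pixel, so denser regions end up in higher-order (smaller) pixels up to the maximum order.

```python
def healpix_pixel_count(order: int) -> int:
    """Number of equal-area HEALPix pixels covering the sky at a given order.

    Hypothetical helper for illustration only; not an lsdb/hipscat function.
    """
    if order < 0:
        raise ValueError("order must be non-negative")
    # 12 base pixels at order 0; each pixel splits into 4 children per order
    return 12 * 4**order


# Pixel counts at the base order, at order 1, and at the maximum order used here
for order in (0, 1, 5):
    print(order, healpix_pixel_count(order))
```

With a maximum order of 5, no partition can be smaller than one of the 12288 order-5 pixels.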

## HiPSCat import pipeline

Let's install the latest development version of hipscat-import from the `main` branch:

In [None]:
!pip install git+https://github.com/astronomy-commons/hipscat-import.git@main --quiet

In [None]:
from dask.distributed import Client
from hipscat_import.catalog.arguments import ImportArguments
from hipscat_import.pipeline import pipeline_with_client

In [None]:
%%time

args = ImportArguments(
    ra_column="ra",
    dec_column="dec",
    highest_healpix_order=5,
    pixel_threshold=100,
    file_reader="csv",
    input_file_list=[catalog_csv_path],
    output_artifact_name="from_import_pipeline",
    output_path=tmp_dir.name,
    resume=False,
)

with Client(n_workers=1) as client:
    pipeline_with_client(args, client)

Let's read both catalogs from disk and check that the two methods produced the same output:

In [None]:
from_dataframe_catalog = lsdb.read_hipscat(f"{tmp_dir.name}/from_dataframe")
from_dataframe_catalog

In [None]:
from_import_pipeline_catalog = lsdb.read_hipscat(f"{tmp_dir.name}/from_import_pipeline")
from_import_pipeline_catalog

In [None]:
# Verify that both catalogs contain the same pixels
assert from_dataframe_catalog.get_healpix_pixels() == from_import_pipeline_catalog.get_healpix_pixels()

# Verify that resulting dataframes contain the same data
sorted_from_dataframe = from_dataframe_catalog.compute().sort_index()
sorted_from_import_pipeline = from_import_pipeline_catalog.compute().sort_index()
pd.testing.assert_frame_equal(sorted_from_dataframe, sorted_from_import_pipeline)
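
The `sort_index()` calls matter because `pd.testing.assert_frame_equal` compares rows positionally, and the two import methods are not guaranteed to write rows in the same order. A minimal sketch with toy data (the frames and values below are made up for illustration and are not taken from `small_sky_order1`):

```python
import pandas as pd

# Two frames holding the same rows in a different order (toy data)
left = pd.DataFrame({"ra": [10.0, 20.0, 30.0]}, index=[3, 1, 2])
right = pd.DataFrame({"ra": [20.0, 30.0, 10.0]}, index=[1, 2, 3])

# A direct comparison is order-sensitive and raises AssertionError
try:
    pd.testing.assert_frame_equal(left, right)
    order_sensitive = False
except AssertionError:
    order_sensitive = True

# After sorting both frames by index, the comparison passes
pd.testing.assert_frame_equal(left.sort_index(), right.sort_index())
```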

Finally, tear down the directory used for the intermediate/output files:

In [None]:
tmp_dir.cleanup()
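
As an alternative to calling `cleanup()` explicitly, `tempfile.TemporaryDirectory` can be used as a context manager, which removes the directory on exit even if an error occurs part-way through. This is a sketch of the pattern only. The `from_dataframe` subdirectory here is an empty placeholder, not a real catalog.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_path:
    # Placeholder for a catalog output directory
    output_path = os.path.join(tmp_path, "from_dataframe")
    os.makedirs(output_path)
    existed_inside = os.path.isdir(output_path)

# The directory and everything in it are gone once the block exits
removed_after = not os.path.exists(tmp_path)
```

In a notebook, where the directory has to survive across several cells, the explicit `cleanup()` call used above is the more practical choice.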