# Indexing division hints - GAIA

When generating an index over a catalog column, we use dask's `set_index` method to shuffle the catalog data. This can be a very expensive operation. We can save a lot of time and compute resources if we have some prior information about the distribution of the values in the column we're building an index on.

In this notebook, I build some divisions for the GAIA DR3 data on the `designation` column. Each value is a string with an integer embedded in it (for example, `Gaia DR3 999999988604363776`).

See also:
* https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.set_index.html#dask.dataframe.DataFrame.set_index
* https://docs.dask.org/en/latest/dataframe-design.html#partitions
* https://docs.dask.org/en/stable/dataframe-best-practices.html#avoid-full-data-shuffling

In [1]:
from hipscat.io.parquet_metadata import write_parquet_metadata
from hipscat.io import file_io
import os

## Specify the catalog and column you're making your index over.
input_catalog_path="/data3/epyc/data3/hipscat/test_catalogs/gaia_symbolic"
indexing_column="designation"

We're going to look a lot into the `_metadata` file, which is a parquet file at the root of a hipscat catalog. Among other things, it contains statistics about the min and max values our data takes within each leaf parquet file.

Let's make sure that the indexing column really exists in our data. Then, we can visually check that it's correct by looking at the per-column schema in the parquet files.

In [2]:
## You should not need to change anything after this point.
total_metadata = file_io.read_parquet_metadata(os.path.join(input_catalog_path, "_metadata"))

num_row_groups = total_metadata.num_row_groups

first_row_group = total_metadata.row_group(0)
index_column_idx = -1

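# Find the position of the indexing column within a row group's column list.
# If the column is not found, index_column_idx stays at -1.
for i in range(first_row_group.num_columns):
    column = first_row_group.column(i)
    if column.path_in_schema == indexing_column:
        index_column_idx = i
        break
print("found column at index:", index_column_idx)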

found column at index: 1


In [3]:
total_metadata.schema

<pyarrow._parquet.ParquetSchema object at 0x7fda2fd96740>
required group field_id=-1 schema {
  optional int64 field_id=-1 solution_id;
  optional binary field_id=-1 designation (String);
  optional int64 field_id=-1 source_id;
  optional int64 field_id=-1 random_index;
  optional double field_id=-1 ref_epoch;
  optional double field_id=-1 ra;
  optional double field_id=-1 ra_error;
  optional double field_id=-1 dec;
  optional double field_id=-1 dec_error;
  optional double field_id=-1 parallax;
  optional double field_id=-1 parallax_error;
  optional double field_id=-1 parallax_over_error;
  optional double field_id=-1 pm;
  optional double field_id=-1 pmra;
  optional double field_id=-1 pmra_error;
  optional double field_id=-1 pmdec;
  optional double field_id=-1 pmdec_error;
  optional double field_id=-1 ra_dec_corr;
  optional double field_id=-1 ra_parallax_corr;
  optional double field_id=-1 ra_pmra_corr;
  optional double field_id=-1 ra_pmdec_corr;
  optional double field_id=-1

We're making a guess that the integer inside `designation` is uniformly distributed.

First, find the minimum and maximum values across all of our data. We do this just by looking inside that `_metadata` file - we don't need to do a full catalog scan for these high-level statistics!

But these values are strings, so we can't just split a numeric range into equal steps. Instead, we can build up some prefixes that will be used as division lower bounds.

Think of this like the divisions between volumes in a large encyclopedia set: there's one volume per letter of the alphabet. Sometimes the "S" volume needs to be split up, but we don't know anything about our set of words right now.
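
To make the encyclopedia analogy concrete, here is a minimal sketch in plain Python of how a sorted list of prefixes can act as lower bounds. The prefixes are much coarser than the ones used in this notebook, the sample designations are made up for illustration, and `bucket_for` is a hypothetical helper, not part of hipscat or dask.

```python
from bisect import bisect_right

# Hypothetical, coarse prefixes (the notebook uses a much finer grid).
prefixes = ["Gaia DR3 1", "Gaia DR3 4", "Gaia DR3 7"]

# Made-up designations, for illustration only.
samples = [
    "Gaia DR3 12345",
    "Gaia DR3 4",
    "Gaia DR3 4999",
    "Gaia DR3 70000001",
    "Gaia DR3 999",
]


def bucket_for(value, lower_bounds):
    """Return the last lower bound that sorts at or before `value`."""
    position = bisect_right(lower_bounds, value) - 1
    if position < 0:
        raise ValueError(f"{value!r} sorts before the first lower bound")
    return lower_bounds[position]


# A prefix sorts at or before every string that starts with it,
# which is why it works as the lower bound of a "volume".
prefix_is_lower_bound = all(
    p <= s for p in prefixes for s in samples if s.startswith(p)
)

buckets = {s: bucket_for(s, prefixes) for s in samples}
for designation, bucket in buckets.items():
    print(f"{designation!r} -> {bucket!r}")
```

Each designation lands in the volume whose prefix is the closest one at or below it in string order, regardless of how many digits follow the prefix.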

In [5]:
import numpy as np

global_min = total_metadata.row_group(0).column(index_column_idx).statistics.min
global_max = total_metadata.row_group(0).column(index_column_idx).statistics.max

for index in range(1, num_row_groups):
    global_min = min(global_min, total_metadata.row_group(index).column(index_column_idx).statistics.min)
    global_max = max(global_max, total_metadata.row_group(index).column(index_column_idx).statistics.max)

print("global min", global_min)
print("global max", global_max)
print("num_row_groups", num_row_groups)

global min Gaia DR3 1000000057322000000
global max Gaia DR3 999999988604363776
num_row_groups 3933
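
Notice that the "minimum" has more digits than the "maximum". The values are compared as strings, character by character, not as integers. A minimal sketch of the difference, using the two values printed above (the `embedded_int` helper is a hypothetical name of my own):

```python
global_min = "Gaia DR3 1000000057322000000"
global_max = "Gaia DR3 999999988604363776"


def embedded_int(designation):
    """Parse the integer that follows the last space in a designation."""
    return int(designation.rsplit(" ", 1)[1])


# String comparison: '1' sorts before '9', so global_min < global_max.
string_order_holds = global_min < global_max

# Integer comparison: the "min" has 19 digits and the "max" has 18,
# so numerically the order is reversed.
numeric_order_holds = embedded_int(global_min) < embedded_int(global_max)

print("string order holds: ", string_order_holds)
print("numeric order holds:", numeric_order_holds)
```

This is why the divisions are built from string prefixes rather than from evenly spaced integers.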


You can take a look at what the prefixes look like:

In [12]:
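# Evenly spaced five-digit prefixes; each one is the lower bound of a division.
int_range = np.arange(start=10000, stop=99999, step=25)
divisions = [f"Gaia DR3 {i}" for i in int_range]
# The final entry is the upper bound of the last division.
divisions.append(global_max)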
divisions[-10:]

['Gaia DR3 99775',
 'Gaia DR3 99800',
 'Gaia DR3 99825',
 'Gaia DR3 99850',
 'Gaia DR3 99875',
 'Gaia DR3 99900',
 'Gaia DR3 99925',
 'Gaia DR3 99950',
 'Gaia DR3 99975',
 'Gaia DR3 999999988604363776']
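
Before handing these to `set_index`, it is worth a quick sanity check. My understanding is that dask expects divisions to be sorted, and that `n + 1` division values describe `n` partitions. The sketch below rebuilds the list in plain Python from the values printed in this notebook and checks both properties, along with the assumption that every value in the catalog falls inside the range.

```python
# Values printed earlier in this notebook.
global_min = "Gaia DR3 1000000057322000000"
global_max = "Gaia DR3 999999988604363776"

# Same construction as the cell above, without numpy.
divisions = [f"Gaia DR3 {i}" for i in range(10000, 99999, 25)]
divisions.append(global_max)

# Every division must sort strictly after the previous one (string order).
is_strictly_increasing = all(a < b for a, b in zip(divisions, divisions[1:]))

# n + 1 division values describe n partitions.
num_partitions = len(divisions) - 1

# The smallest and largest values must fall inside the division range.
covers_all_data = divisions[0] <= global_min and global_max <= divisions[-1]

print("strictly increasing:", is_strictly_increasing)
print("number of partitions:", num_partitions)
print("covers all data:", covers_all_data)
```

The prefixes all have five digits, so their string order matches their numeric order, and the check passes with 3600 partitions.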