# Estimate pixel threshold

For best performance, we try to keep catalog parquet files between 200 and 800 MB in size.

**Background**

When creating a new catalog through the hipscat-import process, we try to create partitions with approximately the same number of rows per partition. This isn't perfect, because objects are not spread evenly across the sky, but we still try to create smaller-area pixels in denser areas, and larger-area pixels in less dense areas. We use the argument `pixel_threshold`, and will keep splitting a partition into smaller HEALPix pixels until its number of rows is smaller than `pixel_threshold`.
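To make the splitting rule concrete, here is a minimal sketch in plain Python. It is not the hipscat-import implementation: the function name `split_pixels`, its arguments, and the sample counts are all hypothetical. It only assumes HEALPix nested numbering, where pixel `p` at one order has children `4p` through `4p + 3` at the next order, and it stops splitting at `highest_order` even if a pixel is still over the threshold.

```python
from collections import Counter


def split_pixels(leaf_counts, highest_order, pixel_threshold):
    """Return {(order, pixel): row_count} for the chosen partitions.

    `leaf_counts` maps a nested-scheme pixel index at `highest_order`
    to the number of rows that fall inside it (hypothetical input).
    """
    # Aggregate row counts for every order, from finest to coarsest.
    counts_by_order = {highest_order: Counter(leaf_counts)}
    for order in range(highest_order - 1, -1, -1):
        coarser = Counter()
        for pixel, count in counts_by_order[order + 1].items():
            coarser[pixel // 4] += count  # parent pixel in nested numbering
        counts_by_order[order] = coarser

    partitions = {}

    def visit(order, pixel):
        count = counts_by_order[order].get(pixel, 0)
        if count == 0:
            return  # empty pixels produce no partition
        if count < pixel_threshold or order == highest_order:
            partitions[(order, pixel)] = count
            return
        # Too many rows: split into the four child pixels.
        for child in range(4 * pixel, 4 * pixel + 4):
            visit(order + 1, child)

    for base_pixel in range(12):  # HEALPix has 12 base pixels at order 0
        visit(0, base_pixel)
    return partitions


# Made-up counts at order 2: a crowded region and a sparse one.
leaf_counts = {0: 50, 1: 60, 2: 10, 3: 5, 16: 30}
partitions = split_pixels(leaf_counts, highest_order=2, pixel_threshold=100)
print(partitions)
```

With these made-up counts, the crowded base pixel 0 (125 rows) is split down to order 2, while the sparse base pixel 1 (30 rows) stays whole at order 0.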

We do this to increase parallelization of reads and downstream analysis: if the files are around the same size, and operations on each partition take around the same amount of time, the whole pipeline is less likely to end up waiting on a single slow process.
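The effect of uneven partitions can be illustrated with a toy scheduling model. This is only a sketch under simple assumptions: `makespan` is a hypothetical helper, each worker picks up the next partition as soon as it is free, and the cost numbers are invented.

```python
import heapq


def makespan(partition_costs, num_workers):
    """Wall-clock time to process all partitions with `num_workers` workers."""
    # Each heap entry is the time at which a worker becomes free.
    workers = [0] * num_workers
    heapq.heapify(workers)
    for cost in partition_costs:
        free_at = heapq.heappop(workers)  # earliest available worker
        heapq.heappush(workers, free_at + cost)
    return max(workers)


balanced = [10] * 8      # eight partitions of equal cost
skewed = [1] * 7 + [73]  # same total cost, one oversized partition

balanced_time = makespan(balanced, num_workers=4)
skewed_time = makespan(skewed, num_workers=4)
print(f"balanced: {balanced_time}, skewed: {skewed_time}")
```

Both workloads contain 80 units of work, but with four workers the balanced one finishes in 20 units while the skewed one takes 74, because three workers sit idle while one processes the oversized partition.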

In addition, a single catalog file should not exceed a couple GB - we're going to need to read the whole thing into memory, so it needs to fit!

**Objective**

In this notebook, we'll go over *one* strategy for estimating the `pixel_threshold` argument you can use when importing a new catalog into hipscat format.

This is not guaranteed to give you optimal results, but it could give you some hints toward *better* results.

## Create a sample parquet file

The first step is to read in your survey data in its original form, and convert a sample into parquet. This has a few benefits:
- parquet uses compression in various ways, and by creating the sample, we can get a sense of both the overall and field-level compression with real data
- using the importer `FileReader` interface now sets you up for more success when you get around to importing!

If your data is already in parquet format, just change the `sample_parquet_file` path to an existing file, and don't run the second cell.

In [82]:
sample_parquet_file="/data3/epyc/data3/hipscat/raw/sdss/parquet/calibObj-008162-4-star.parquet"

In [81]:
import pandas as pd
from hipscat_import.catalog.file_readers import FitsReader


MULTIDIMENSIONAL = [
    "ROWC",
    "COLC",
    "M_RR_CC",
    "M_RR_CC_PSF",
    "FLAGS",
    "FLAGS2",
    "PSP_STATUS",
    "PSF_FWHM",
    "EXTINCTION",
    "SKYFLUX",
    "PSFFLUX",
    "PSFFLUX_IVAR",
    "FIBERFLUX",
    "FIBERFLUX_IVAR",
    "FIBER2FLUX",
    "FIBER2FLUX_IVAR",
    "MODELFLUX",
    "MODELFLUX_IVAR",
    "CALIB_STATUS",
    "NMGYPERCOUNT",
    "APERFLUX6",
    "CMODELFLUX_CLEAN",
    "CMODELFLUX_CLEAN_IVAR",
    "CMODELFLUX_CLEAN_VAR",
    "CMODELFLUX_CLEAN_CHI2",
    "CMODEL_CLEAN_NUSE",
    "CMODEL_CLEAN_MJD_MAXDIFF",
    "CMODEL_CLEAN_MJD_VAR",
    "MODELFLUX_CLEAN",
    "MODELFLUX_CLEAN_IVAR",
    "MODELFLUX_CLEAN_VAR",
    "MODELFLUX_CLEAN_CHI2",
    "MODEL_CLEAN_NUSE",
    "MODEL_CLEAN_MJD_MAXDIFF",
    "MODEL_CLEAN_MJD_VAR",
    "PSFFLUX_CLEAN",
    "PSFFLUX_CLEAN_IVAR",
    "PSFFLUX_CLEAN_VAR",
    "PSFFLUX_CLEAN_CHI2",
    "PSF_CLEAN_NUSE",
    "PSF_CLEAN_MJD_MAXDIFF",
    "PSF_CLEAN_MJD_VAR",
]

input_file="/data3/epyc/data3/hipscat/raw/sdss/calibObj-008162-4-star.fits.gz"

## This FITS file contains multidimensional columns, which cannot be converted
## to a pandas DataFrame. Skip them when reading; otherwise the read fails with:
##   ValueError: Cannot convert a table with multidimensional columns to a pandas DataFrame.
file_reader=FitsReader(
    chunksize=50_000,
    skip_column_names=MULTIDIMENSIONAL,
)
data_frame = next(file_reader.read(input_file))
data_frame.to_parquet(sample_parquet_file)

## Inspect parquet file and metadata

Now that we have parsed our survey data into parquet, we can check what the data will look like when it's imported into hipscat format.

If you're just here to get a naive estimate for your pixel threshold, we'll do that first, then take a look at some other parquet characteristics later for the curious.

In [83]:
import os
import pyarrow.parquet as pq
import pyarrow as pa

sample_file_size = os.path.getsize(sample_parquet_file)
parquet_file = pq.ParquetFile(sample_parquet_file)
num_rows = parquet_file.metadata.num_rows

## 100MB
ideal_file_small = 100 * 1024 * 1024
## 800MB
ideal_file_large = 800 * 1024 * 1024

## Scale the sample's row count by the ratio of target size to sample size.
threshold_small = ideal_file_small / sample_file_size * num_rows
threshold_large = ideal_file_large / sample_file_size * num_rows

print(f"threshold between {int(threshold_small):_} and {int(threshold_large):_}")

threshold between 150_144 and 1_201_153
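
The same arithmetic can be wrapped in a small helper, which makes the underlying assumption explicit: the average number of bytes per row in the sample is taken to be representative of the whole catalog. The function name and the sample numbers below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
MIB = 1024 * 1024


def estimate_pixel_threshold(sample_file_size, num_rows, target_file_size):
    """Rows expected to fit in `target_file_size` bytes, based on a sample file."""
    if sample_file_size <= 0 or num_rows <= 0:
        raise ValueError("the sample must contain data")
    # Equivalent to target_file_size / (bytes per row), using integer
    # arithmetic to avoid floating-point rounding.
    return target_file_size * num_rows // sample_file_size


# Hypothetical sample: 50_000 rows stored in a 40 MiB parquet file.
sample_size = 40 * MIB
sample_rows = 50_000

low = estimate_pixel_threshold(sample_size, sample_rows, 100 * MIB)
high = estimate_pixel_threshold(sample_size, sample_rows, 800 * MIB)
print(f"threshold between {low:_} and {high:_}")
```

For this invented sample, a 100 MiB target holds 2.5 times as many rows as the sample (125_000) and an 800 MiB target holds 20 times as many (1_000_000).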


## Want to see more?

I'm so glad you're still here! I have more to show you!

The first step below shows us the file-level metadata, as stored by parquet. The number of columns here SHOULD match your expectations on the number of columns in your survey data.

The `serialized_size` value is just the size of the total metadata, not the size of the file. 

In [79]:
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile(sample_parquet_file)
print(parquet_file.metadata)

<pyarrow._parquet.FileMetaData object at 0x7f44ad38cd10>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 44
  num_rows: 50000
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 20972


The next step is to look at the column-level metadata. You can check that the on-disk type of each column matches your expectation of the data. Note that parquet only has 32-bit and 64-bit physical integer types, so narrower integer columns show up as `int32` with an annotation such as `Int(bitWidth=8)` or `Int(bitWidth=16)`. Parquet's encodings and compression still keep these columns compact on disk.

In [80]:
print(parquet_file.schema)

<pyarrow._parquet.ParquetSchema object at 0x7f44ad182ec0>
required group field_id=-1 schema {
  optional int32 field_id=-1 RUN (Int(bitWidth=16, isSigned=true));
  optional binary field_id=-1 RERUN;
  optional int32 field_id=-1 CAMCOL (Int(bitWidth=8, isSigned=false));
  optional int32 field_id=-1 FIELD (Int(bitWidth=16, isSigned=true));
  optional int32 field_id=-1 ID (Int(bitWidth=16, isSigned=true));
  optional int32 field_id=-1 OBJC_TYPE;
  optional int32 field_id=-1 OBJC_FLAGS;
  optional int32 field_id=-1 OBJC_FLAGS2;
  optional float field_id=-1 OBJC_ROWC;
  optional float field_id=-1 ROWVDEG;
  optional float field_id=-1 ROWVDEGERR;
  optional float field_id=-1 COLVDEG;
  optional float field_id=-1 COLVDEGERR;
  optional double field_id=-1 RA;
  optional double field_id=-1 DEC;
  optional int32 field_id=-1 RESOLVE_STATUS;
  optional int32 field_id=-1 THING_ID;
  optional int32 field_id=-1 IFIELD;
  optional int32 field_id=-1 BALKAN_ID;
  optional int32 field_id=-1 NDETECT;
  ...

Parquet will also perform some column-level compression, so not all columns with the same type will take up the same space on disk.
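
A rough way to see why same-typed columns can differ so much in size is to compress a few synthetic columns yourself. The sketch below uses `zlib` from the standard library as a stand-in, which is an assumption: parquet files typically use other codecs and additional encodings, so the absolute numbers will not match. The column names and values are invented.

```python
import random
import struct
import zlib


def compressed_size(values):
    """Pack values as 64-bit floats and return the zlib-compressed byte count."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return len(zlib.compress(raw))


random.seed(0)  # fixed seed so repeated runs give the same sizes
num_rows = 10_000

# Three hypothetical columns, all stored as 64-bit floats.
constant_flag = [1.0] * num_rows                                 # one repeated value
few_levels = [random.choice([0.0, 0.5, 1.0]) for _ in range(num_rows)]
noisy = [random.random() for _ in range(num_rows)]               # nearly unique values

sizes = {
    "constant_flag": compressed_size(constant_flag),
    "few_levels": compressed_size(few_levels),
    "noisy": compressed_size(noisy),
}
for name, size in sizes.items():
    print(f"{name}: {size:_} bytes compressed ({num_rows * 8:_} bytes raw)")
```

All three columns occupy the same space uncompressed, but the more repetitive a column's values are, the smaller it becomes once compressed.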

Below, we inspect the row group and column chunk metadata to show the compressed size of the fields on disk. The last column, `percent`, shows the percent of total size taken up by the column.

You *can* use this to inform which columns you keep when importing a catalog into hipscat format. For example, if some columns are less useful for your science and take up a lot of space, maybe leave them out!

In [75]:
import numpy as np
import pandas as pd

num_cols = parquet_file.metadata.num_columns
num_row_groups = parquet_file.metadata.num_row_groups
sizes = np.zeros(num_cols)

for rg in range(num_row_groups):
    row_group = parquet_file.metadata.row_group(rg)
    for col in range(num_cols):
        sizes[col] += row_group.column(col).total_compressed_size

## This is just an attempt at pretty formatting
percents = [f"{s/sizes.sum()*100:.1f}" for s in sizes]
pd.DataFrame(
    {"name": parquet_file.schema.names, "size": sizes.astype(int), "percent": percents}
).sort_values("size", ascending=False)

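| | name | size | percent |
|---|---|---|---|
| 139 | Y | 48237 | 2.5 |
| 4 | RA | 48237 | 2.5 |
| 138 | X | 48237 | 2.5 |
| 9 | GLON | 48237 | 2.5 |
| 10 | GLAT | 48237 | 2.5 |
| ... | ... | ... | ... |
| 52 | W2MCOR | 129 | 0.0 |
| 113 | SSO_FLG | 99 | 0.0 |
| 39 | | 99 | 0.0 |
| 2 | SCAN_ID | 77 | 0.0 |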
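
If you do decide to leave out some columns, the per-column sizes can also feed back into the threshold estimate. The sketch below is one hypothetical way to do that: `threshold_after_dropping` and the column sizes are invented for illustration, and it assumes that the compressed column sizes simply add up to the file size, ignoring footer and metadata overhead.

```python
MIB = 1024 * 1024


def threshold_after_dropping(column_sizes, num_rows, target_file_size, drop=()):
    """Estimate rows per `target_file_size` bytes once `drop` columns are removed."""
    unknown = set(drop) - set(column_sizes)
    if unknown:
        raise KeyError(f"unknown columns: {sorted(unknown)}")
    # Sum the compressed bytes of the columns we intend to keep.
    kept_bytes = sum(size for name, size in column_sizes.items() if name not in drop)
    if kept_bytes == 0:
        raise ValueError("no data left after dropping columns")
    return target_file_size * num_rows // kept_bytes


# Invented compressed sizes (bytes) for a 10_000-row sample.
column_sizes = {"RA": 400_000, "DEC": 400_000, "FLUX": 200_000, "NOTES": 1_000_000}
sample_rows = 10_000

all_columns = threshold_after_dropping(column_sizes, sample_rows, 100 * MIB)
without_notes = threshold_after_dropping(
    column_sizes, sample_rows, 100 * MIB, drop=("NOTES",)
)
print(f"all columns: {all_columns:_}, without NOTES: {without_notes:_}")
```

In this made-up case the `NOTES` column accounts for half of the bytes, so dropping it doubles the number of rows that fit in a 100 MiB file, from 524_288 to 1_048_576.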


You can also use this opportunity to create a schema-only parquet file to ensure that your final parquet files use the appropriate field types.

In [76]:
schema_only_file = "/data3/epyc/data3/hipscat/tmp/neowise_schema.parquet"
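## Use the schema stored in the sample file, so this also works when `data_frame`
## was never created (i.e. your sample was already in parquet format).
pq.write_table(parquet_file.schema_arrow.empty_table(), where=schema_only_file)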