# Parquet file inspection

In this notebook, we look at methods to explore the contents of a parquet file.

**Last run (successfully)**: 2023/04/05

**Author**: Melissa DeLucchi (delucchi@andrew.cmu.edu)

First, open your parquet file as a `pyarrow.parquet.ParquetFile` object.

In [14]:
import pyarrow.parquet as pq
import pandas as pd

## You're gonna want to change this file name!!

# file_name = "/data3/epyc/data3/hipscat/catalogs/ztf_mar16/source/Norder=5/Dir=0/Npix=8684.parquet"
# file_name = "/data3/epyc/data3/hipscat/catalogs/ztf_apr13/ztf_object_to_source/Norder=0/Dir=0/Npix=8/join_Norder=5/join_Dir=0/join_Npix=8688.parquet"
# file_name = "/data3/epyc/data3/hipscat/catalogs/ztf_mar16/object_to_source/Norder=1/Dir=0/Npix=33.parquet"
# file_name = "/data3/epyc/data3/hipscat/catalogs/ztf_apr18/ztf_dr14/Norder=0/Dir=0/Npix=7.parquet"
# file_name = "/epyc/data/sdss_parquet/sdss_star_1600.parquet"
# file_name = "/data3/epyc/projects3/ivoa_demo/ztf_dr14/catalog/Norder=6/Dir=20000/Npix=29283.parquet"
# file_name = "/data3/epyc/data3/hipscat/catalogs/ztf_dr14/Norder=6/Dir=20000/Npix=29283.parquet"
# file_name = "/data3/epyc/data3/hipscat/catalogs/ztf_apr18/ztf_dr14/Norder=6/Dir=20000/Npix=29283.parquet"
# file_name = "/astro/users/mmd11/ztf_0065_1990_g.parquet"
# file_name = "/data3/epyc/data3/hipscat/catalogs/ps1/ps1_otmo/_common_metadata"
file_name = "/data3/epyc/data3/hipscat/catalogs/sdss_dr16q/Norder=5/Dir=0/Npix=7599.parquet"
parquet_file = pq.ParquetFile(file_name)

Parquet files contain a lot of internal data that can be read directly.

## Metadata

This is file-level metadata. It gives a quick check on the size of your file: the number of columns, rows, and row groups.

In [16]:
print(parquet_file.metadata)
print(parquet_file.metadata.num_columns)
cols = parquet_file.metadata.num_columns
print(parquet_file.metadata.num_row_groups)
row_groups = parquet_file.metadata.num_row_groups

<pyarrow._parquet.FileMetaData object at 0x7f76c1fafdb0>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 258
  num_rows: 57114
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 130669
258
1


In [11]:
## Size on disk of each column

import numpy as np
import pandas as pd

sizes = np.zeros(cols)

# Sum the compressed size of each column over every row group.
for rg in range(row_groups):
    for col in range(cols):
        sizes[col] += parquet_file.metadata.row_group(rg).column(col).total_compressed_size
        
frame = pd.DataFrame({"name":parquet_file.schema.names, "size":sizes})
frame['percent'] = frame['size'] / frame['size'].sum() *100
frame = frame.sort_values("size", ascending=False)
# frame.to_csv("/data3/epyc/data3/hipscat/raw/pan_starrs/sample_detection_size.csv")
print(frame)

                 name    size   percent
0           SDSS_NAME  2141.0  5.088170
37             LOGMBH  1379.0  3.277247
29            LOGLBOL  1379.0  3.277247
13                EBV  1379.0  3.277247
12          Z_SYS_ERR  1379.0  3.277247
11              Z_SYS  1379.0  3.277247
30        LOGLBOL_ERR  1379.0  3.277247
38         LOGMBH_ERR  1379.0  3.277247
5                 DEC  1379.0  3.277247
4                  RA  1379.0  3.277247
39      LOGLEDD_RATIO  1379.0  3.277247
40  LOGLEDD_RATIO_ERR  1379.0  3.277247
15         FEII_UV_EW  1355.0  3.220210
16     FEII_UV_EW_ERR  1355.0  3.220210
24       LOGL2500_ERR  1289.0  3.063359
23           LOGL2500  1287.0  3.058605
8             Z_DR16Q  1196.0  2.842340
10              Z_FIT  1196.0  2.842340
44     _hipscat_index  1176.0  2.794810
6               OBJID  1166.0  2.771044
14      SN_MEDIAN_ALL  1139.0  2.706878
26       LOGL3000_ERR  1129.0  2.683112
25           LOGL3000  1127.0  2.678359
22       LOGL1700_ERR  1015.0  2.412187


## Schema

Show all the columns and their stored type.

In [17]:
print(parquet_file.schema)

<pyarrow._parquet.ParquetSchema object at 0x7f75782efc00>
required group field_id=-1 schema {
  optional int32 field_id=-1 RUN (Int(bitWidth=16, isSigned=true));
  optional binary field_id=-1 RERUN;
  optional int32 field_id=-1 CAMCOL (Int(bitWidth=8, isSigned=false));
  optional int32 field_id=-1 FIELD (Int(bitWidth=16, isSigned=true));
  optional int32 field_id=-1 ID (Int(bitWidth=16, isSigned=true));
  optional int32 field_id=-1 OBJC_TYPE;
  optional int32 field_id=-1 OBJC_FLAGS;
  optional int32 field_id=-1 OBJC_FLAGS2;
  optional float field_id=-1 OBJC_ROWC;
  optional float field_id=-1 ROWVDEG;
  optional float field_id=-1 ROWVDEGERR;
  optional float field_id=-1 COLVDEG;
  optional float field_id=-1 COLVDEGERR;
  optional float field_id=-1 ROWC_u;
  optional float field_id=-1 ROWC_g;
  optional float field_id=-1 ROWC_r;
  optional float field_id=-1 ROWC_i;
  optional float field_id=-1 ROWC_z;
  optional float field_id=-1 COLC_u;
  optional float field_id=-1 COLC_g;
  optional fl

## Contents

Types are helpful, but nothing beats seeing the data. The snippet below loads the parquet file and prints the first ten rows. For wide tables, the commented-out variants trim to a single row and transpose it, so the columns become rows and the whole thing is easier to skim through.

In [15]:
data_frame = pd.read_parquet(file_name, engine="pyarrow")
# print(data_frame.head(1).transpose())
# print(data_frame.loc[[929796]].transpose())

# data_frame.loc[[929796]].to_parquet("/astro/users/mmd11/data/ztf_row.parquet")

print(data_frame.head(10))
# mags = data_frame[["mag", "band"]]
# print(mags)

                      RUN   RERUN  CAMCOL  FIELD    ID  OBJC_TYPE  OBJC_FLAGS   
_hipscat_index                                                                  
8555713393221173248  4076  b'301'       5    105   159          6   269094932  \
8555713397247705088  4076  b'301'       5    105  5203          6   268570880   
8555713400099831808  4076  b'301'       5    105  2527          6   302121232   
8555713405623730176  4076  b'301'       5    105   164          6   269094932   
8555713410677866496  4076  b'301'       5    105  2485          6   268570896   
8555713411185377280  4076  b'301'       5    105  2504          6   302125072   
8555713412896653312  4076  b'301'       5    105  2518          6   302125328   
8555713413873926144  4076  b'301'       5    105  2460          6   302125072   
8555713415266435072  4076  b'301'       5    105  2461          6   268570640   
8555713417992732672  4076  b'301'       5    105  2503          6   302121232   

                     OBJC_F

In [18]:
# print(len(data_frame['mjd_r'][0]))

for index in range(len(data_frame)):
    if len(data_frame["mjd_r"][index]) and len(data_frame["mjd_g"][index]) and len(data_frame["mjd_i"][index]):
        print(index)
        break
#     print(f'{index}, {len(data_frame["mjd_r"][index])}, {len(data_frame["mjd_g"][index])}, {len(data_frame["mjd_i"][index])}')
else:
    # Only reached when the loop finishes without hitting `break`.
    print("found nothing")

929796


I often want to know which values a field can take when there are only a handful (e.g. for what is essentially an ENUM type).

The following will fetch all the values from a single column, find all the unique values, and spit out some of them. There's definitely a cuter pandas way to do this.

In [29]:
## you might need to change the id column
id_column = 'band'

assert id_column in data_frame.columns
ids = data_frame[id_column].tolist()
print(f'Number of values: {len(ids)}')
set_ids = [*set(ids)]
print(f'Number of unique values: {len(set_ids)}')
print(f'Sample of unique values: {set_ids[0:5]}')

AssertionError: 
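
One possible pandas-native version of the cell above is sketched below. The `band_frame` data and the `enum_column` name are hypothetical; the explicit check replaces the bare `assert` so that a missing column produces a readable message.

```python
import pandas as pd

# Hypothetical demo data standing in for a real catalog column.
band_frame = pd.DataFrame({"band": ["g", "r", "g", "i", "r", "g"]})
enum_column = "band"

# Fail with a readable message if the column is missing.
if enum_column not in band_frame.columns:
    raise KeyError(f"{enum_column!r} not found; available: {list(band_frame.columns)}")

unique_count = band_frame[enum_column].nunique()
value_counts = band_frame[enum_column].value_counts()

print(f"Number of values: {len(band_frame)}")
print(f"Number of unique values: {unique_count}")
print(value_counts.head(5))
```

`value_counts` also shows how often each value occurs, which helps when deciding whether a field really behaves like an enum.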

In [42]:
### but what if it's your index that's duplicated?!?

# data_frame.index.duplicated()
# data_frame.index.is_unique

# data_frame.loc[data_frame.index.duplicated(), :]

data_frame.loc[97503219578759644]

                            objectid           mjd      mag  magerr  fieldid  rcidin      info  flag       objra    objdec
97503219578759644  97503219578759644  58321.363053  13.2432     160      390      40         0     0  321.957855  -8.74228
97503219578759644  97503219578759644  58331.343582  13.3357     377      390      40  67108864     8  321.957855  -8.74236
97503219578759644  97503219578759644  58295.341445  13.2593      98      390      40         0     0  321.957886  -8.74232
97503219578759644  97503219578759644  58303.381645  13.2344     178      390      40         0     0  321.957794  -8.74221
97503219578759644  97503219578759644  58280.381527  13.2350     151      390      40         0     0  321.957855  -8.74227
              ...                ...           ...      ...     ...      ...     ...       ...   ...         ...       ...
97503219578759644  97503219578759644  59429.341118  13.2270      92      390      41         0     0  321.957886  -8.74232
97503219578759644  97503219578759644  59442.262233  13.2334      89      390      41         0     0  321.957886  -8.74238
97503219578759644  97503219578759644  59436.410134  13.2420     133      390      41         0     0  321.957886  -8.74236
97503219578759644  97503219578759644  59446.300061  13.2363     123      390      41         0     0  321.957886  -8.74234
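
The commented-out index checks in the cell above can be combined into a short recipe. This is a sketch on a tiny hypothetical frame; the `dup_frame` name, the index values, and the `mjd` values are invented for illustration.

```python
import pandas as pd

# Hypothetical demo frame: index value 101 appears three times, 202 once.
dup_frame = pd.DataFrame(
    {"mjd": [58000.1, 58000.2, 58000.3, 58001.0]},
    index=pd.Index([101, 101, 101, 202], name="objectid"),
)

# Quick yes/no answer: are all index values distinct?
index_is_unique = dup_frame.index.is_unique
print(f"Index is unique: {index_is_unique}")

# keep=False marks every member of a duplicated group, not just the repeats.
dup_rows = dup_frame.loc[dup_frame.index.duplicated(keep=False)]
print(dup_rows)

# Number of rows per index value, largest first.
rows_per_id = dup_frame.groupby(level=0).size().sort_values(ascending=False)
print(rows_per_id)
```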


In [67]:
data_frame["obsTime"].max()

56998.3081433

In [55]:
data_frame.min(axis=0)

objID             1.065604e+17
uniquePspsP2id    3.059134e+15
detectID          9.756336e+16
filterID          1.000000e+00
obsTime           5.507663e+04
ra                4.008212e+01
dec              -1.193160e+00
raErr            -9.990000e+02
decErr           -9.990000e+02
zp                2.302310e+01
expTime           3.000000e+01
psfFlux          -9.990000e+02
psfFluxErr       -9.990000e+02
apFlux           -9.990000e+02
apFluxErr        -9.990000e+02
kronFlux         -9.990000e+02
kronFluxErr      -9.990000e+02
infoFlag          0.000000e+00
infoFlag2         0.000000e+00
infoFlag3         0.000000e+00
Norder            6.000000e+00
Dir               1.000000e+04
Npix              1.775200e+04
dtype: float64