# Parquet METADATA file inspection

In this notebook, we look at methods to explore the contents of a `_metadata` parquet file.


**Author**: Melissa DeLucchi (delucchi@andrew.cmu.edu)

First, instantiate your parquet file.

In [42]:
import pyarrow.parquet as pq
import pandas as pd

## Change this path to point at your own _metadata file.

file_name = "/data3/epyc/data3/hipscat/catalogs/ztf_apr18/ztf_dr14/_metadata"
parquet_file = pq.ParquetFile(file_name)

In [43]:
print(parquet_file.metadata)
print(parquet_file.metadata.num_columns)
cols = parquet_file.metadata.num_columns
print(parquet_file.metadata.num_row_groups)
row_groups = parquet_file.metadata.num_row_groups

<pyarrow._parquet.FileMetaData object at 0x7f72a1b5bc20>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 16
  num_rows: 1234677579
  num_row_groups: 2360
  format_version: 2.6
  serialized_size: 5395265
16
2360


In [48]:
## Size on disk of each column

import numpy as np
import pandas as pd

sizes = np.zeros(cols)

for rg in range(row_groups):
    # Fetch each row group's metadata once, then walk its column chunks.
    row_group = parquet_file.metadata.row_group(rg)
    for col in range(cols):
        sizes[col] += row_group.column(col).total_compressed_size

frame = pd.DataFrame({"name": parquet_file.schema.names, "size": sizes})
frame["percent"] = frame["size"] / frame["size"].sum() * 100
frame = frame.sort_values("size", ascending=False)
# frame.to_csv("/data3/epyc/data3/hipscat/raw/pan_starrs/sample_detection_size.csv")
print(frame)

               name          size    percent
10       mean_mag_r  9.315607e+09  14.585721
0         ps1_objid  8.908720e+09  13.948646
15   _hipscat_index  7.317717e+09  11.457565
2               dec  6.963166e+09  10.902435
1                ra  6.296127e+09   9.858032
9        mean_mag_g  6.225676e+09   9.747724
11       mean_mag_i  4.663170e+09   7.301263
5   ps1_iMeanPSFMag  3.603279e+09   5.641760
4   ps1_rMeanPSFMag  3.586042e+09   5.614772
3   ps1_gMeanPSFMag  3.256742e+09   5.099177
7            nobs_r  1.672007e+09   2.617910
6            nobs_g  1.406545e+09   2.202268
8            nobs_i  6.526143e+08   1.021817
12           Norder  1.935050e+05   0.000303
13              Dir  1.935050e+05   0.000303
14             Npix  1.935050e+05   0.000303


In [49]:
row = parquet_file.metadata.row_group(1001)
print(row)

<pyarrow._parquet.RowGroupMetaData object at 0x7f72a1b48d60>
  num_columns: 16
  num_rows: 243669
  total_byte_size: 17237119


It is also worth loading the partition info, to check whether the number of partitions matches the number of row groups.

In [47]:
import pandas as pd
file_name = "/data3/epyc/data3/hipscat/catalogs/ztf_apr18/ztf_dr14/partition_info.csv"
partition_info = pd.read_csv(file_name)
print(len(partition_info))


2352


## Schema

Show all the columns and their stored type.

In [1]:
print(parquet_file.schema)

NameError: name 'parquet_file' is not defined

## Contents

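Types are helpful, but nothing beats seeing the data. The snippet below loads a parquet data file into a pandas DataFrame, so `file_name` must point at a data file here rather than at `_metadata` or the partition info CSV. To skim a single record, take the first row and transpose it (the commented-out `head(1).transpose()` line): the columns become rows and the whole thing is easier to read. The active lines print just the `mag` and `magerr` columns.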

In [35]:
data_frame = pd.read_parquet(file_name, engine="pyarrow")
# print(data_frame.head(1).transpose())
# print(data_frame.loc[[929796]].transpose())

# data_frame.loc[[929796]].to_parquet("/astro/users/mmd11/data/ztf_row.parquet")

# print(data_frame.head(10))
mags = data_frame[["mag", "magerr"]]
print(mags)

                          mag  magerr
173533132198219908  16.389799     174
173533132243349711  18.985800     612
173533132500747980  17.901400     313
173533132520209275  20.043100    1380
173533132623167299  20.023899    1359
...                       ...     ...
173993134962710289  20.286100    1340
173993134962710289  20.055901    1348
173993134962710289  20.153500    1894
173993134964711814  20.546499    2647
173993134964711814  20.702600    1870

[301348 rows x 2 columns]


In [18]:
# print(len(data_frame['mjd_r'][0]))

# Find the first row that has observations in all three bands.
for index in range(len(data_frame)):
    if len(data_frame["mjd_r"][index]) and len(data_frame["mjd_g"][index]) and len(data_frame["mjd_i"][index]):
        print(index)
        break
#     print(f'{index}, {len(data_frame["mjd_r"][index])}, {len(data_frame["mjd_g"][index])}, {len(data_frame["mjd_i"][index])}')
else:
    # Only reached when the loop finishes without hitting `break`.
    print("found nothing")

929796


I often want to know which values a field can take when there are only a handful (essentially an ENUM type).

The following fetches all the values from a single column, finds the unique ones, and prints a few of them. There's definitely a cuter pandas way to do this.

In [25]:
## you might need to change the id column
id_column = 'rcidin'

assert id_column in data_frame.columns
ids = data_frame[id_column].tolist()
print(f'Number of values: {len(ids)}')
set_ids = [*set(ids)]
print(f'Number of unique values: {len(set_ids)}')
print(f'Sample of unique values: {set_ids[0:5]}')

Number of values: 301348
Number of unique values: 5
Sample of unique values: [3, 14, 21, 53, 24]
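
One candidate for the "cuter pandas way" is to stay on the Series and use `unique`, `nunique`, and `value_counts`. The sketch below uses a small made-up frame in place of the real `data_frame`; only the column name `rcidin` is taken from the cell above.

```python
import pandas as pd

# Toy stand-in for data_frame (an assumption); values are made up.
toy_frame = pd.DataFrame({"rcidin": [3, 14, 21, 3, 53, 24, 14, 3]})
id_column = "rcidin"

column = toy_frame[id_column]
unique_values = column.unique()  # unique values in order of first appearance
counts = column.value_counts()   # occurrences per value, most common first

print(f"Number of values: {len(column)}")
print(f"Number of unique values: {column.nunique()}")
print(f"Sample of unique values: {unique_values[:5]}")
print(counts)
```

`value_counts` has the added benefit of showing how often each value occurs, which helps spot rare or unexpected codes.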


In [4]:
data_frame.max(axis=0)

index                       131071
_hipscat_id    9778440568160911517
ps1_objid        75240618853871027
ra                       63.263256
dec                     -27.299459
catflags                      4100
fieldID                       1248
mag                      21.929874
maggerr                   0.371297
mjd                    59627.16843
rcID                            47
band                             r
dtype: object

In [5]:
data_frame.min(axis=0)

index                            0
_hipscat_id    9777314802224332800
ps1_objid        72000618713066946
ra                       60.475616
dec                     -29.994318
catflags                    -32768
fieldID                        253
mag                      11.443596
maggerr                   0.003859
mjd                    58363.50343
rcID                             0
band                             g
dtype: object