This repository has been archived by the owner on Jan 12, 2024. It is now read-only.
Fix missing / incomplete Parquet & Intake metadata #7
Labels:

- `epacems`: The EPA's Continuous Emissions Monitoring System hourly dataset
- `inframundo`
- `intake`: Intake data catalogs
- `metadata`: Data about our liberated data
- `parquet`: Apache Parquet is an open columnar data file format
The `source.discover()` method shows some details about the internals of a data source within an Intake catalog. However, some of this information doesn't reflect what's in the Parquet files as well as it could. We should make sure:
- `unitid` and `unit_id_epa` show up as `string`, not `object`.
- The `category` columns `state`, `so2_mass_measurement_code`, `nox_rate_measurement_code`, `nox_mass_measurement_code`, and `co2_mass_measurement_code` show up as `category` instead of `int64` (presumably they're appearing as integers because integers are the keys in the dictionary of categorical values?).
- The `shape` tuple indicates the number of rows in the dataset, rather than `None`, since that information is stored in the Parquet file metadata.

Some of these issues seem to be arising from Intake, and some of them seem to arise from the metadata that's getting written to the Parquet files in the ETL. Looking at the type information for a sample of the data after it's been read back into a pandas dataframe:
The categorical values show up correctly as categories, but the other type issues (nullability, string vs. object) remain. In my experimentation with different ways of writing out the files, I think I did see strings, nullable types, and category types coming through fine in this information in the past, so I think there's something wrong with the Parquet metadata. Reading in one file and looking at the metadata directly, the types all appear to be correct:
However... the `epacems.schema.pandas_metadata` is `None`, so it's relying on the default mapping of PyArrow types to pandas types, which isn't what we want it to do. Why isn't the pandas metadata being embedded in the Parquet file? Is it possible to insert it explicitly? The function that's writing the Parquet files is `pudl.etl._etl_one_year_epacems()`, and it's using `pa.Table.from_pandas()`, so... wtf?