User Story
I want to read data using hb.data.ParquetDataset and pass column names obtained from elsewhere as the fields parameter.
When I call hb.data.ParquetDataset(filenames=[My parquet files], fields=[My custom column names]), it throws f'Field {f} is not found in the parquet file {filename}'.
Detailed requirements
Hive stores the schema of a table in all lowercase, so the names passed in fields may not match the column names in the parquet files.
For example, fields may contain "AGE" while the parquet schema contains "age". In this case, the field names should be transformed automatically, like below:
from pyarrow.parquet import ParquetFile

# Read the column names from the schema of the first file.
schema_ds = ParquetFile(filenames[0])
schema_names = schema_ds.schema_arrow.names

def _fix_name(name):
    # Fall back to the lowercase name when only that form is in the schema.
    if name not in schema_names and name.lower() in schema_names:
        return (name.lower(), 1)
    else:
        return (name, 0)

results_fields = [_fix_name(n) for n in fields]
fixed_fields = [n[0] for n in results_fields]
changed_fields = [n[0] for n in results_fields if n[1] == 1]
print("changed-fields:%s" % changed_fields)
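As a possible variation on the snippet above, the matching could be done case-insensitively in both directions instead of only lowercasing the requested name. The sketch below is not part of hb.data.ParquetDataset; resolve_fields is a hypothetical helper that takes the schema names as a plain list, so it runs without any parquet file. It also assumes that an ambiguous match (two schema columns differing only by case) should raise rather than pick one silently.

```python
def resolve_fields(fields, schema_names):
    """Map requested field names onto schema names, ignoring case when needed.

    Hypothetical helper: returns (fixed_fields, changed), where changed maps
    each requested name that was rewritten to the schema name replacing it.
    """
    # Index schema names by lowercase form, keeping every spelling so that
    # ambiguous matches (e.g. both "Age" and "age" present) can be detected.
    by_lower = {}
    for name in schema_names:
        by_lower.setdefault(name.lower(), []).append(name)

    fixed_fields = []
    changed = {}
    for field in fields:
        if field in schema_names:
            # Exact match: keep the name untouched.
            fixed_fields.append(field)
            continue
        candidates = by_lower.get(field.lower(), [])
        if len(candidates) == 1:
            fixed_fields.append(candidates[0])
            changed[field] = candidates[0]
        elif not candidates:
            raise ValueError(f"Field {field} is not found in the schema")
        else:
            raise ValueError(f"Field {field} is ambiguous: {candidates}")
    return fixed_fields, changed


fixed, changed = resolve_fields(["AGE", "name"], ["age", "name", "city"])
print(fixed)    # ['age', 'name']
print(changed)  # {'AGE': 'age'}
```

Returning the mapping of changed names, rather than only printing it, would let the caller log the rewrite or rename the columns back after reading.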
API Compatibility
hb.data.ParquetDataset