hb.data.ParquetDataset support Case-sensitive-fields #51

karterotte · 2022-05-09T07:59:00Z

User Story

I want to read data using hb.data.ParquetDataset and pass column names obtained from another source as the fields parameter.
When I call hb.data.ParquetDataset(filenames=[my parquet files], fields=[my custom column names]), it raises f'Field {f} is not found in the parquet file {filename}'.

Detailed requirements

Hive stores table schemas in all lowercase, so the names passed in fields may not match the column names in the parquet files.
For example, fields may contain "AGE" while the parquet schema has "age". In this case, the field names should be transformed automatically, as below:

from pyarrow.parquet import ParquetFile

# Read the schema from the first file; all files are assumed to share it.
schema_names = ParquetFile(filenames[0]).schema_arrow.names

def _fix_name(name):
    # Fall back to the lowercase name only when the exact name is missing.
    if name not in schema_names and name.lower() in schema_names:
        return (name.lower(), True)
    return (name, False)

results_fields = [_fix_name(n) for n in fields]
fixed_fields = [name for name, _ in results_fields]
changed_fields = [name for name, changed in results_fields if changed]
print("changed-fields: %s" % changed_fields)
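
A slightly more general variant, as a rough sketch only: it matches names case-insensitively in both directions (so "AGE" would also resolve to "Age"), and raises when a name is missing or matches more than one column. The helper name resolve_fields and its return shape are hypothetical and not part of hb.data.ParquetDataset; it takes the schema names as a plain list, so it can be tried without any parquet files.

```python
def resolve_fields(fields, schema_names):
    """Map requested field names onto schema names, ignoring case.

    Hypothetical helper for illustration; not part of any library API.
    Returns (resolved, renamed), where renamed maps each requested name
    that had to be changed to the schema name it was resolved to.
    """
    exact = set(schema_names)

    # Group schema names by their lowercase form to detect ambiguity,
    # e.g. a schema containing both "age" and "Age".
    by_lower = {}
    for name in schema_names:
        by_lower.setdefault(name.lower(), []).append(name)

    resolved = []
    renamed = {}
    for field in fields:
        # An exact match always wins and is left untouched.
        if field in exact:
            resolved.append(field)
            continue
        candidates = by_lower.get(field.lower(), [])
        if len(candidates) != 1:
            reason = "is ambiguous" if candidates else "is not found"
            raise ValueError(
                f"Field {field} {reason} in schema {list(schema_names)}")
        resolved.append(candidates[0])
        renamed[field] = candidates[0]
    return resolved, renamed


# Example: "AGE" is resolved to the lowercase column stored by Hive.
resolved, renamed = resolve_fields(["AGE", "name"], ["age", "name", "City"])
print("changed-fields: %s" % renamed)
```

Raising on ambiguous names, rather than silently picking one, keeps the behavior predictable when a file really does contain columns that differ only by case.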

API Compatibility

hb.data.ParquetDataset

2sin18 self-assigned this May 10, 2022

2sin18 added the enhancement New feature or request label May 10, 2022

2sin18 mentioned this issue May 30, 2022

[DATA] Improve schema parsing in ParquetDataset #54

Merged

2sin18 linked a pull request May 30, 2022 that will close this issue

[DATA] Improve schema parsing in ParquetDataset #54

Merged

2sin18 closed this as completed in #54 May 30, 2022
