
[Python][Parquet] improve reading of partitioned parquet datasets whose schema changed #25089

Closed
asfimport opened this issue May 27, 2020 · 8 comments


asfimport commented May 27, 2020

Hi there, I'm encountering the following issue when reading from HDFS:

 

My situation:

I have a partitioned parquet dataset in HDFS, whose recent partitions contain parquet files with more columns than the older ones. When I try to read data using pyarrow.dataset.dataset and filter on recent data, I still only get the columns that are also contained in the old parquet files. I'd like to somehow merge the schemas, or use the schema of the parquet files from which the data actually ends up being loaded.

When using:

pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning='hive', filters=my_filter_expression).to_table().to_pandas()

Is there a way to handle schema changes so that the data that is read contains all columns?

Everything works fine when I copy the needed parquet files into a separate folder, but that is a very inconvenient way of working.

 

Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow 0.17.1
Reporter: Ira Saktor

Note: This issue was originally created as ARROW-8964. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
What you are describing should normally already be implemented (so something else must be going wrong for some reason).
When multiple files with different schemas are read in a dataset, the dataset discovery uses a very basic "schema evolution / normalization", which right now only involves adding missing columns as "null" values (so exactly the use case you are describing, I think). In the future we also want to allow some type evolution (e.g. the same column stored as int32 in some files and int64 in others, which right now raises an error).

Can you show an example of just reading one of the old and one of the new files in the directory (you can pass the exact file name instead of the directory to dataset(..))?
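
For example, something along these lines (the file paths are just placeholders):

import pyarrow.dataset as ds

# hypothetical paths to one old and one new file of the partitioned dataset
old_file = "path/to/dataset/date=2019-01-01/part-0.parquet"
new_file = "path/to/dataset/date=2020-05-01/part-0.parquet"

# reading each file as its own dataset shows the schema actually stored in that file
print(ds.dataset(old_file).schema)
print(ds.dataset(new_file).schema)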


Joris Van den Bossche / @jorisvandenbossche:
While pressing the "Add comment" button, I realized we changed the dataset discovery process: by default it only checks the schema of the first file it encounters, and then assumes the full dataset has this schema. So that is probably why you only see the old columns, even for the new files.

There are in principle two solutions for this:

  1. Manually specify the schema (where you yourself ensure it includes all columns of both the old and new files)
  2. Specify that the discovery should inspect all files to infer the dataset schema, and not just the first file. The problem is that this option is not yet exposed in the Python interface.

So for now, only the first is an option.
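
A rough sketch of option 1, with made-up column names, types, and paths for illustration (the exact interplay with the hive partition columns may need some tweaking):

import pyarrow as pa
import pyarrow.dataset as ds

# hypothetical unified schema that covers the old columns plus the newly added one
full_schema = pa.schema([
    ("id", pa.int64()),
    ("value", pa.float64()),
    ("new_column", pa.string()),  # only present in the recent files
])

dataset = ds.dataset("path/to/hdfs/directory", partitioning="hive", schema=full_schema)
table = dataset.to_table(filter=ds.field("id") > 100)  # hypothetical filter expression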


Joris Van den Bossche / @jorisvandenbossche:
Existing issue for exposing this in python: ARROW-8221


Ira Saktor:
Thank you very much for your fast answer. In the meantime, regarding the schema specification, could you please tell me whether there is a way in pyarrow.dataset to read the schema from a specific parquet file? I could then simply pass it one of the recent parquet files to infer the schema from.

I know how to load the schema with pyarrow.parquet, however the non-legacy parquet dataset doesn't yet support schema specification, so I was hoping to manage this with pyarrow.dataset, if that's possible.
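
For reference, with pyarrow.parquet it would be roughly something like this (the path is just a placeholder):

import pyarrow.parquet as pq

# read only the footer metadata of one recent file to get its schema
schema = pq.read_schema("path/to/recent/part-0.parquet")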


Joris Van den Bossche / @jorisvandenbossche:
Yes, if you have the path to a single parquet file, you can easily get the schema from that one.

I think something like this should work:

# infer the schema from a single recent file that already contains the new columns
schema = pyarrow.dataset.dataset(path_to_hdfs_single_parquet_file, partitioning='hive').schema
# then pass that schema explicitly when creating the full dataset
dataset = pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning='hive', schema=schema)
dataset.to_table(filter=my_filter_expression).to_pandas()


Ira Saktor:
Awesome! Thank you, that's what I was looking for. Looking forward to ARROW-8221 being resolved :). Should I close this task, as it's pretty much a duplicate?


Ira Saktor:
On an unrelated question: would you by any chance also know how to write timestamps to parquet files so that I can then create an Impala table with the columns as timestamp type from the given parquet files?


Joris Van den Bossche / @jorisvandenbossche:
I am not familiar with Impala. Looking at https://impala.apache.org/docs/build/html/topics/impala_parquet.html#parquet_data_types, it seems that the INT64-based timestamp types (the default when writing with pyarrow) are only supported in Impala > 3.2. If you need the older INT96 representation for timestamps, there is an option use_deprecated_int96_timestamps=True that you can set in pq.write_table.
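
A minimal sketch of that option (the table contents and file name are made up):

from datetime import datetime
import pyarrow as pa
import pyarrow.parquet as pq

# hypothetical table with a timestamp column
table = pa.table({"event_time": pa.array([datetime(2020, 5, 27, 12, 0), datetime(2020, 5, 27, 13, 30)])})

# write timestamps in the older INT96 representation so Impala can read the column as TIMESTAMP
pq.write_table(table, "events.parquet", use_deprecated_int96_timestamps=True)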

Should i close this task as it's pretty much a duplicate?

Yes, will close it
