
Add dtypes= option to read_parquet #9476

Open
jrbourbeau opened this issue Sep 8, 2022 · 4 comments
Labels
dataframe · enhancement · parquet · needs attention

Comments

@jrbourbeau (Member)

Similar to dd.read_csv, we should consider making it easy for users to optionally specify dtypes when using dd.read_parquet.

@rjzamora and I were running computations on parquet datasets that have some columns stored as category dtypes. However, category dtypes are known to be inefficient in some cases (for example #9392), so in this particular case we wanted to use the string[pyarrow] dtype instead. This is possible today with DataFrame.astype(...), which is what we used, but we thought it would be more straightforward for users to pass in dtypes={"col-name": "string[pyarrow]"} instead. This would also have the added benefit of letting us pass the desired dtype directly to the parquet read calls.
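A sketch of the two call patterns, assuming the dtypes= keyword proposed here (it does not exist in dask today):

```python
import dask.dataframe as dd

# Current workaround: read with the on-disk dtypes, then cast afterwards
df = dd.read_parquet("dataset/").astype({"col-name": "string[pyarrow]"})

# Proposed (hypothetical) keyword, mirroring the dtype= option of dd.read_csv
df = dd.read_parquet("dataset/", dtypes={"col-name": "string[pyarrow]"})
```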

We already read in a parquet schema and pass it down to the I/O call. To support a dtypes= keyword, one approach would be to take the user-provided dtypes= specification, convert it to a pyarrow.Schema object, and then merge that with the Schema object we get from looking at the parquet metadata on disk. That merged schema is the one we would then use when performing the parquet read calls.
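As a rough illustration of that merging step (not dask's implementation; the helper name, file path, and the use of pyarrow types directly in the mapping are assumptions for the sketch):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def merge_user_dtypes(path, user_types):
    """Overlay user-requested pyarrow types onto the schema stored on disk."""
    schema = pq.read_schema(path)  # schema recorded in the parquet footer
    for name, new_type in user_types.items():
        idx = schema.get_field_index(name)
        schema = schema.set(idx, pa.field(name, new_type))
    return schema

# e.g. read a dictionary-encoded column back as plain strings
schema = merge_user_dtypes("dataset/part.0.parquet", {"col-name": pa.string()})
table = pq.read_table("dataset/part.0.parquet", schema=schema)
```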

@jrbourbeau added the dataframe, parquet, and enhancement labels on Sep 8, 2022
@jorloplaz (Contributor)

Wouldn't that conflict with the Schema that was originally used to write the Parquet file? I assume you'd give preference to what the user specifies in the dtype dictionary: for a given column, check whether it's in that dictionary and use that type; if it's not, default to the type the metadata Schema specifies.

But I'd say that if something was written in a given binary format and you try to read it in some other format, things won't always work.

Of course, there's always the option of reading it as-is and then applying a final astype inside read_parquet to whatever the user says in dtype, but that's basically what you already did.

@rjzamora (Member)

Wouldn't that conflict with the Schema that was originally used to write the Parquet file? I assume you'd give preference to what the user specifies in the dtype dictionary: for a given column, check whether it's in that dictionary and use that type; if it's not, default to the type the metadata Schema specifies.

Yes, my understanding is that the dtypes= dictionary should specify what the output dtype should be in the backend dataframe library (pandas). You are correct that the user-specified dtype must "agree" with the underlying pyarrow dtype encoded in the Schema. However, it is important to consider that there is not always a simple 1-to-1 mapping between the binary representation in pyarrow and the pandas dtype (a string column may come back as either an "object" column or a pyarrow string column, depending on the pandas metadata).

For a given binary representation in pyarrow (say, a string column), it seems that the output pandas dtype will typically correspond to the numpy dtype specified in the "pandas" metadata within the same parquet file. Therefore, it should be possible to avoid casting after the read in certain cases, as long as we can update or throw away that custom metadata. With that said, this may be a bit tricky in practice.
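For reference, the "pandas" metadata mentioned above is a JSON blob stored on the Arrow schema; a small sketch of how to inspect it (the file path is illustrative):

```python
import json
import pyarrow.parquet as pq

schema = pq.read_schema("dataset/part.0.parquet")
pandas_meta = json.loads(schema.metadata[b"pandas"])
# Each column entry records how to rebuild the pandas column, e.g.
# {"name": "col-name", "pandas_type": "unicode", "numpy_type": "object", ...}
for col in pandas_meta["columns"]:
    print(col["name"], col["pandas_type"], col["numpy_type"])
```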

@jorloplaz (Contributor)

jorloplaz commented Sep 16, 2022

Yes, I know that the mapping is not always unique. My point is: when can we take advantage of a hypothetical dtypes parameter in read_parquet to do something more efficient than just dd.read_parquet(file).astype(types)? Is there actually any case where we gain something? My understanding is that astype looks at the numpy dtype and doesn't really allocate a new array if it isn't needed: it just changes the view associated with the array, so it's already pretty efficient.

In read_csv it of course makes sense, because everything is stored as plain text: unless all your columns are strings, you don't get what you want in the resulting df, and a type conversion must be done. However, when a binary format such as Parquet is used, you nearly always get what you want straight after reading, so astype is rarely needed, and only for some particular columns.

The only common case I can think of is strings that may be loaded as objects. However, if the Parquet file was already written with pyarrow strings, the latest pandas won't load that as object but as string[pyarrow], as far as I know.

Perhaps what could be done in read_parquet, even if there's no dtypes parameter, is the following:

  1. Check metadata to look for general object columns (I mean, stored as such).
  2. Try to load those as strings, because most object columns are actually strings (e.g., if stored with old versions of pyarrow, or if the user forgot to specify explicitly that they were strings).
  3. If for some column that fails, reload it as object.

That is, assume by default that object columns are actually meant to be strings, unless at runtime it turns out a column holds another kind of object, and leave the few remaining cases for the user to change manually via astype.
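A rough sketch of that fallback on an in-memory frame (not a proposed implementation; it checks the values explicitly rather than relying on the cast to raise, since that behaviour varies across pandas/pyarrow versions):

```python
import pandas as pd

def coerce_object_columns_to_string(df: pd.DataFrame) -> pd.DataFrame:
    """Cast object columns whose non-null values are all strings to string[pyarrow]."""
    for col in df.select_dtypes(include="object").columns:
        non_null = df[col].dropna()
        if non_null.map(type).eq(str).all():  # every non-null value is a str
            df[col] = df[col].astype("string[pyarrow]")
        # otherwise the column really holds arbitrary objects; leave it alone
    return df
```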

@rjzamora (Member)

when can we take advantage of a hypothetical dtypes parameter in read_parquet to do something more efficient than just dd.read_parquet(file).astype(types)? Is there actually any case where we gain something?

Yeah, I don’t know of an obvious situation besides choosing between string and object for string-encoded data, or between category and string/int/etc for dictionary-encoded data. Your general point is a very good one: We definitely don’t want to expand the read_parquet API without clear motivation, and the motivation is weak if the proposed option is intended for an edge case, or if using astype is (almost always) just as efficient.

Side Note: I'm not yet sure if modifying the pandas metadata to use "string" for the "numpy_type" field actually reduces the peak memory usage for the motivating high-cardinality "category"-column case here. I do know that replacing the metadata does change the output pandas dtype as expected, but so does a simple astype operation. I will need to do some tests on high-cardinality data to see if there is any real performance benefit to a dtypes=-like feature.
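A sketch of the metadata-replacement experiment described in this side note (the column name and file path are illustrative, and whether this actually beats astype on peak memory is exactly the open question):

```python
import json
import pyarrow.parquet as pq

table = pq.read_table("dataset/part.0.parquet")
meta = dict(table.schema.metadata)
pandas_meta = json.loads(meta[b"pandas"])
for col in pandas_meta["columns"]:
    if col["name"] == "col-name":
        col["numpy_type"] = "string"  # ask for a string dtype on conversion
meta[b"pandas"] = json.dumps(pandas_meta).encode()

# to_pandas() consults the "pandas" metadata when choosing output dtypes
df = table.replace_schema_metadata(meta).to_pandas()
```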

The github-actions bot added the needs attention label on May 27, 2024.