Add dtypes= option to read_parquet #9476
Comments
Wouldn't that conflict with the […]? But I'd say that if something was written in a given binary format and you try to read it in some other format, things won't always work. Of course, there's always the option of reading it as-is and then adding a final `astype`.
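For reference, a minimal sketch of the read-as-is-then-convert workaround mentioned above (the path and column name are made up, purely for illustration; `string[pyarrow]` needs a reasonably recent pandas with pyarrow installed):

```python
import dask.dataframe as dd

# Read the dataset with whatever dtypes the parquet metadata implies
# (hypothetical path and column name).
ddf = dd.read_parquet("s3://bucket/dataset/")

# Convert afterwards with a final astype, e.g. from category to a
# pyarrow-backed string dtype.
ddf = ddf.astype({"name": "string[pyarrow]"})
```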
Yes, my understanding is that the […]. For a given binary representation in pyarrow (say […]), there can be more than one pandas dtype it could be read back as, so the mapping is not unique.
Yes, I know that the mapping is not always unique. My point is: when can we take advantage of a hypothetical `dtypes=` argument? In […]. The only common case I can think of is strings that may be loaded as objects. However, if the Parquet file was already written with […]. Perhaps what could be done in […]
That is, assuming by default that […]
Yeah, I don't know of an obvious situation besides choosing between string and object for string-encoded data, or between category and string/int/etc. for dictionary-encoded data. Your general point is a very good one: we definitely don't want to expand the […]. Side note: I'm not yet sure if modifying the pandas metadata to use […]
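To make that ambiguity concrete, here is a small sketch in plain pyarrow/pandas (not Dask) showing that the same arrow data can legitimately come back as different pandas dtypes depending on the conversion options; the column names and the particular `types_mapper` choice are only for illustration:

```python
import pyarrow as pa
import pandas as pd

# A dictionary-encoded string column and a plain string column in arrow.
table = pa.table({
    "a": pa.array(["x", "y", "x"]).dictionary_encode(),
    "b": pa.array(["x", "y", "z"]),
})

# Default conversion: dictionary -> pandas category, string -> object.
print(table.to_pandas().dtypes)

# Alternative conversion: map arrow strings to a pyarrow-backed pandas
# string dtype via types_mapper. Column "a" stays category here because
# only pa.string() is remapped, not the dictionary type.
mapped = table.to_pandas(
    types_mapper={pa.string(): pd.StringDtype("pyarrow")}.get
)
print(mapped.dtypes)
```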
Similar to `dd.read_csv`, we should consider making it easy for users to optionally specify `dtype`s when using `dd.read_parquet`.

@rjzamora and I were running computations on parquet datasets that have some columns stored as `category` dtypes. However, it's known that `category` dtypes can be inefficient in some cases (for example #9392), so we wanted to, in this particular case, use the `string[pyarrow]` dtype instead. This is possible today using `DataFrame.astype(...)`, which is what we used, but we thought it would be more straightforward for users to pass in `dtypes={"col-name": "string[pyarrow]"}` instead. This would also have the added benefit of being able to pass the desired `dtype` directly to the parquet read calls.

We currently read in a parquet schema already and then pass that down to the I/O call. To support a `dtypes=` keyword, one approach would be to take the user-provided `dtypes=` specification, convert it to a `pyarrow.Schema` object, and then merge that with the `Schema` object we get from looking at the parquet metadata on disk. That merged schema object is the one we would then use when performing parquet read calls.
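A rough sketch of that schema-merging idea in plain pandas/pyarrow. The helper name `merged_schema`, the paths, and the column name are hypothetical, and Dask's actual read path is more involved; this only illustrates the shape of the approach:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def merged_schema(path, dtypes):
    """Hypothetical helper: combine user-requested dtypes with the on-disk schema.

    ``dtypes`` maps column names to pandas dtype strings, e.g.
    ``{"name": "string[pyarrow]"}``.
    """
    # Schema implied by the parquet metadata on disk.
    disk_schema = pq.read_schema(path)

    # Build an empty pandas DataFrame with the requested dtypes and let
    # pyarrow infer the corresponding arrow schema from it.
    empty = pd.DataFrame({col: pd.Series(dtype=dt) for col, dt in dtypes.items()})
    user_schema = pa.Schema.from_pandas(empty, preserve_index=False)

    # Override matching fields in the on-disk schema with the user's choice.
    fields = [
        user_schema.field(f.name) if f.name in dtypes else f
        for f in disk_schema
    ]
    return pa.schema(fields)


# Hypothetical usage: pass the merged schema to the parquet read call so the
# desired dtype is applied at read time instead of via a trailing astype.
# schema = merged_schema("dataset/part.0.parquet", {"name": "string[pyarrow]"})
# table = pq.read_table("dataset/part.0.parquet", schema=schema)
```

One nice property of merging at the schema level is that the desired dtype is applied by the parquet reader itself, rather than by a second pass over the data with `astype`.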