Pandas to_parquet error : ArrowNotImplementedError: No support for writing chunked arrays yet. #1300

Closed
brahmbhattspandan opened this Issue Nov 10, 2017 · 6 comments

brahmbhattspandan commented Nov 10, 2017

While trying to save a pandas DataFrame as Parquet using the pyarrow engine, I get the error ArrowNotImplementedError: No support for writing chunked arrays yet.

The error seems to occur once the DataFrame grows beyond a certain size.

>>> df.shape
(317739, 35)
>>> df_2 = df[:200000]
>>> df_2.to_parquet('path_to_save', engine='pyarrow', compression='gzip')
# Runs without error.
>>> df_25 = df[:250000]
>>> df_25.to_parquet('path_to_save', engine='pyarrow', compression='gzip')
# ArrowNotImplementedError: No support for writing chunked arrays yet.

The error seems to be thrown from this function: parquet.py#270
I am guessing this is related to ARROW-232. Is there currently a way to overcome this limitation, or is it a completely different issue?
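
For reference, a minimal diagnostic sketch (assuming a pyarrow version where Table.column() returns a ChunkedArray exposing num_chunks) is to convert the slice to an Arrow Table directly, the same step to_parquet performs before writing, and count the chunks per column:

import pyarrow as pa

# Convert the problematic slice the same way DataFrame.to_parquet does
# internally before handing the table to the Parquet writer.
table = pa.Table.from_pandas(df_25)

# Any column reported with more than one chunk is what the writer
# rejects with "No support for writing chunked arrays yet".
for name in table.column_names:
    chunked = table.column(name)
    if chunked.num_chunks > 1:
        print(name, chunked.num_chunks)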

xhochy commented Nov 10, 2017

@brahmbhattspandan you might be able to overcome this by making a copy of the DataFrame: df_25 = df[:250000].copy().

brahmbhattspandan commented Nov 10, 2017

Since the final goal is to save the original DataFrame, I tried the copy() method, but it still throws the same error.

>>> df.shape
(317739, 35)
>>> df_30 = df[:300000].copy()
>>> df_30.to_parquet('path_to_save', engine='pyarrow', compression='gzip')
# ArrowNotImplementedError: No support for writing chunked arrays yet.

xhochy commented Nov 11, 2017

@brahmbhattspandan OK, this is weird. Can you post the output of df_30.info() so that we can see the datatypes and column layout of the DataFrame?


wesm commented Nov 11, 2017

It looks like the problem is that a string column is overflowing the 2GB limit. We should try to add chunked array support in parquet-cpp 1.4.0; perhaps we can do this as soon as the decimal patch lands (it will require a little bit of refactoring in the write path, but good refactoring).
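
Until that support lands, one possible workaround is to write the frame in smaller slices through pyarrow.parquet.ParquetWriter, so that no string column in any single piece approaches the 2GB single-chunk limit. This is only a sketch: the helper name write_in_slices and the 100000-row piece size are hypothetical, and it assumes every slice infers the same Arrow schema (object columns that are entirely null in one slice may need an explicit schema).

import pyarrow as pa
import pyarrow.parquet as pq

def write_in_slices(df, path, rows_per_piece=100000):
    # Append each slice as its own row group in a single Parquet file,
    # keeping every piece small enough that its string columns stay
    # within a single Arrow chunk. rows_per_piece is an arbitrary guess;
    # tune it to the data.
    writer = None
    try:
        for start in range(0, len(df), rows_per_piece):
            piece = df.iloc[start:start + rows_per_piece]
            table = pa.Table.from_pandas(piece, preserve_index=False)
            if writer is None:
                writer = pq.ParquetWriter(path, table.schema, compression='gzip')
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()

# e.g. write_in_slices(df, 'path_to_save')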

brahmbhattspandan commented Nov 11, 2017

@xhochy Below is the output.

>>> df_30.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 35 columns):
id                              300000 non-null int64
attempt_id                      300000 non-null int64
session_id                      300000 non-null object
customer_id                     300000 non-null int64
username_hash                   299990 non-null object
cookie                          300000 non-null object
attempt_date                    300000 non-null object
device_type                     300000 non-null object
device_name                     300000 non-null object
device_os                       300000 non-null object
browser                         300000 non-null object
browser_version                 300000 non-null object
ip                              300000 non-null object
latitude                        300000 non-null object
longitude                       300000 non-null object
city                            300000 non-null object
province                        300000 non-null object
country                         300000 non-null object
threat_score                    300000 non-null float64
threat_debug_values             300000 non-null object
threat_annotation               10783 non-null object
analytics_annotation            300000 non-null object
behavioral_annotation           300000 non-null object
session_overall_time            300000 non-null object
custom_data                     276583 non-null object
misc_debug                      300000 non-null object
sensor_data                     270733 non-null object
user_visible                    300000 non-null object
device_data                     300000 non-null object
page_url                        299998 non-null object
autofill_disable_field_count    300000 non-null object
aj_type                         300000 non-null int64
aj_indx                         300000 non-null int64
start_ts                        300000 non-null int64
additional_details              300000 non-null object
dtypes: float64(1), int64(6), object(28)
memory usage: 80.1+ MB

In the above frame, there are a couple of string columns that contain long text (the max character count is 20285). So, as @wesm said, the issue might be due to a string column overflowing.
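
To pin down which column is responsible, a rough check (a minimal sketch; character counts only approximate UTF-8 byte sizes and are exact only for ASCII) is to sum the string lengths per object column:

# Approximate total string size per object column; anything approaching
# 2**31 characters (~2GB for ASCII data) is a candidate for the overflow
# described above.
for name in df_30.select_dtypes(include=['object']).columns:
    total_chars = df_30[name].dropna().astype(str).str.len().sum()
    print(name, total_chars)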

wesm commented Nov 29, 2017

This is tracked in https://issues.apache.org/jira/browse/ARROW-232; let's follow the issue there.

wesm closed this Nov 29, 2017
