OverflowError: value too large to convert to int - fastparquet.cencoding.write_thrift #823

Closed
sikanrong opened this issue Nov 24, 2022 · 8 comments

Comments

@sikanrong

I'm having an issue storing a large dataset (around 40GB) in a single parquet file.

I'm using the fastparquet library to append pandas.DataFrames to this parquet dataset file, and everything goes fine until the dataset hits 2.3GB, at which point I get the following errors:

OverflowError: value too large to convert to int
Exception ignored in: 'fastparquet.cencoding.write_thrift'

Having debugged my way through the fastparquet code itself, it seems to me that, internally, appending rows requires updating the page headers, and that this is done by creating a Thrift object and writing it to the file:

# fastparquet/writer.py:581

ph = parquet_thrift.PageHeader(type=parquet_thrift.PageType.DATA_PAGE,
    uncompressed_page_size=l0,
    compressed_page_size=l1,
    data_page_header=dph, i32=1)

The problem could be that the uncompressed_page_size attribute is typed as int32, so as the file grows and reaches that limit in bytes, fastparquet begins to throw these errors on write... The fact that this is a Thrift object (where types are rigidly defined) suggests that this typing choice may be an inherent part of the parquet format itself; is this true?
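A quick back-of-the-envelope check makes the numbers line up (my own arithmetic, assuming the overflowing value is a byte count stored in a signed 32-bit field):

# maximum value of a signed 32-bit integer, as used for i32 Thrift fields
int32_max = 2**31 - 1
print(int32_max)        # 2147483647
print(int32_max / 1e9)  # ~2.1 GB, roughly where my writes start to fail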

I'm not sure whether I'm looking at a bug in fastparquet, or whether this is an intended design choice in the parquet format. I've been unable to get clarity on this anywhere else.

@martindurant
Member

Correct, it is not possible to change the data types in the stored thrift object - at least not if you want the file to be readable by anything other than fastparquet. When using write(), you can specify how many rows go into each row group, so I suggest you use a smaller value. It might be reasonable for fastparquet to be able to guess a good number for this.
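Something along these lines - an untested sketch, where fname and df stand for your target file and dataframe, and the row count is chosen arbitrarily (pick a value that keeps each page well under ~2 GiB for your column widths):

import os
import fastparquet

fastparquet.write(
    filename=fname,
    data=df,
    row_group_offsets=500_000,     # approximate number of rows per row group
    append=os.path.exists(fname),
)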

@sikanrong
Author

@martindurant adjusting the row group size doesn't seem to make a difference; the number that ends up exceeding the int32 threshold seems linked to the total size of the parquet file, and not any of the individual write sizes.

@martindurant
Member

The total footer size (not data size) is given by a 4-byte value, so it's not that. I could check the various byte offsets, but I think they are all 64-bit (or var-int). I suppose your guess at the page header is correct, in which case chopping the data into row-groups (each of which contains one page per column) should help. You could also try using "hive" style output, in which each row group becomes a separate file.
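Roughly like this (a sketch; "outdir" is a placeholder - with file_scheme="hive" the filename is treated as a directory):

import os
import fastparquet

fastparquet.write(
    filename="outdir",        # a directory rather than a single file
    data=df,
    file_scheme="hive",       # one part file per row group, plus _metadata
    append=os.path.exists("outdir"),
)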

@sikanrong
Author

sikanrong commented Nov 25, 2022

Thanks so much for the help, @martindurant. I'm aware of 'hive' style output but I'm trying to avoid it, to keep the file easy to read from another, non-Python language that doesn't support that structure. I was surprised by this part of your answer:

in which case chopping the data into row-groups (each of which contains one page per column) should help

...as I thought that's what was already happening under the hood when I call fastparquet.write. Am I wrong in thinking so? In general, the code I'm using to write to the file is as follows:

import os
import fastparquet

fastparquet.write(
  filename=fname,
  data=df,
  append=os.path.exists(fname)  # append if the file already exists, otherwise create it
)

Is there another API that I would have to use such that it will write row groups instead?

@martindurant
Member

The argument row_group_offsets gives you control over how big the row groups are. The default is geared towards a "tall and narrow" table layout of the sort parquet was designed for.
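If the integer form doesn't give the split you expect, row_group_offsets also accepts an explicit list of row indices at which to start new row groups - a sketch, with the 1,000,000-row step chosen arbitrarily:

import os
import fastparquet

# start a new row group every 1,000,000 rows of the frame being written
offsets = list(range(0, len(df), 1_000_000))
fastparquet.write(filename=fname, data=df,
                  row_group_offsets=offsets,
                  append=os.path.exists(fname))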

another non-python language which doesn't support that structure

I'm surprised if there are frameworks that wouldn't be able to read this output.

@sikanrong
Author

@martindurant I am relatively new to parquet and still getting the "lay of the land", so to speak - trying to discern what's part of the format specification and which parts are implementation-specific. Anyway, the other language is Go and the other library is parquet-go - as far as I know, it doesn't support any parquet file hierarchy structures such as hive or drill.

Again, thanks so much for the help; I've toyed around with row_group_offsets but it seems to make little difference - perhaps I just haven't found the right value yet(?). I also thought about instantiating a ParquetFile object and using the write_row_groups instance method, but it seemed to me that that should be much the same as what fastparquet.write does internally.
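(For reference, one way to check whether row_group_offsets is taking effect is to inspect the footer of the written file and count the row groups - a rough sketch:)

import fastparquet

pf = fastparquet.ParquetFile(fname)
print(len(pf.row_groups))                      # number of row groups written
print([rg.num_rows for rg in pf.row_groups])   # rows in each row group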

@martindurant
Member

"hive" style without partitioning amounts to splitting each row group into a file, but no encoding of information into the path structure. It is certainly worth a try, although I know nothing about parquet-go.

Yes, the writing method(s) on ParquetFile are conveniences for append and alter operations on existing datasets, which use the same code beneath.

@sikanrong
Author

@martindurant thanks for all the support and prompt communication; closing this issue, as it is not an issue with the library.
