[Python] combine_chunks fails on column of table, but does not error on table itself #28850

Open
asfimport opened this issue Jun 23, 2021 · 3 comments

asfimport commented Jun 23, 2021

combine_chunks fails on a column of a table, but does not error on the table itself (instead, it creates 3 chunks).

Is there a reason why they are not handled the same?

In [90]: pa.__version__
Out[90]: '4.0.0'

# Get shape
In [85]: pa_table.shape
Out[85]: (102753589, 1)

In [86]: pa_col1_array = pa_table.column(0)

# Get number of chunks
In [87]: pa_col1_array.num_chunks
Out[87]: 4404

# Combining chunks on the pyarrow table with one column works.
In [88]: pa_table.combine_chunks()
Out[88]: 
pyarrow.Table
# id=TEW__014e25__c14e1d__Multiome_RNA_brain_10x_no_perm: string

# Combining chunks on the column itself does not work.
In [89]: pa_col1_array.combine_chunks()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-89-fdd0d0056a8e> in <module>
----> 1 pa_col1_array.combine_chunks()
/software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.ChunkedArray.combine_chunks()
/software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()
/software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: offset overflow while concatenating arrays

# Assign the combined-chunks table to a new table.
In [91]: pa_table_combined = pa_table.combine_chunks()

# Get first column
In [92]: pa_col1_array_from_pa_table_combined = pa_table_combined.column(0)

# Get number of chunks
In [93]: pa_col1_array_from_pa_table_combined.num_chunks
Out[93]: 3

# Try to combine column 1 again.
In [94]: pa_col1_array_from_pa_table_combined.combine_chunks()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-94-e2e323e6519f> in <module>
----> 1 pa_col1_array_from_pa_table_combined.combine_chunks()
/software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.ChunkedArray.combine_chunks()
/software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()
/software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: offset overflow while concatenating arrays

# Get sizes of each chunk.
In [106]: [chunk.nbytes for chunk in pa_col1_array_from_pa_table_combined.chunks]
Out[106]: [2341650593, 2342925682, 241257842]
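
(Editorial note, not part of the original report.) The first two chunks each sit right around the 2 GiB mark, which points at the root cause: pa.string() stores its character data behind signed 32-bit offsets, so a single Array chunk can address at most about 2 GiB of string data. A minimal pre-check sketch, assuming StringArray.total_values_length is available in your pyarrow build (fits_in_one_chunk is a hypothetical helper, not a pyarrow API):

import pyarrow as pa

INT32_MAX = 2**31 - 1  # pa.string() uses signed 32-bit offsets into the character data

def fits_in_one_chunk(chunked_array: pa.ChunkedArray) -> bool:
    # Hypothetical helper: sum only the character data of each chunk
    # (total_values_length excludes the validity and offsets buffers)
    # and compare against the 32-bit offset limit.
    total = sum(chunk.total_values_length for chunk in chunked_array.chunks)
    return total <= INT32_MAX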

Reporter: Gert Hulselmans / @ghuls

Related issues:

Note: This issue was originally created as ARROW-13150. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
Both are implemented differently. The Table version is backed by the C++ arrow::Table::CombineChunks:

Result<std::shared_ptr<Table>> Table::CombineChunks(MemoryPool* pool) const {
  const int ncolumns = num_columns();
  std::vector<std::shared_ptr<ChunkedArray>> compacted_columns(ncolumns);
  for (int i = 0; i < ncolumns; ++i) {
    const auto& col = column(i);
    if (col->num_chunks() <= 1) {
      compacted_columns[i] = col;
      continue;
    }
    if (is_binary_like(col->type()->id())) {
      // ARROW-5744 Allow binary columns to be combined into multiple chunks to avoid
      // buffer overflow
      ArrayVector chunks;
      int chunk_i = 0;
      while (chunk_i < col->num_chunks()) {
        ArrayVector safe_chunks;
        int64_t data_length = 0;
        for (; chunk_i < col->num_chunks(); ++chunk_i) {
          const auto& chunk = col->chunk(chunk_i);
          data_length += checked_cast<const BinaryArray&>(*chunk).total_values_length();
          if (data_length >= kBinaryMemoryLimit) {
            break;
          }
          safe_chunks.push_back(chunk);
        }
        chunks.emplace_back();
        ARROW_ASSIGN_OR_RAISE(chunks.back(), Concatenate(safe_chunks, pool));
      }
      compacted_columns[i] = std::make_shared<ChunkedArray>(std::move(chunks));
    } else {
      ARROW_ASSIGN_OR_RAISE(auto compacted, Concatenate(col->chunks(), pool));
      compacted_columns[i] = std::make_shared<ChunkedArray>(compacted);
    }
  }
  return Table::Make(schema(), std::move(compacted_columns), num_rows_);
}
This implementation has specific handling that allows multiple chunks in the output in order to avoid overflow errors.

The ChunkedArray version is backed by pa.concat_arrays, which uses the C++ arrow::Concatenate function to concatenate arrays:

/// \brief Concatenate arrays
///
/// \param[in] arrays a vector of arrays to be concatenated
/// \param[in] pool memory to store the result will be allocated from this memory pool
/// \return the concatenated array
ARROW_EXPORT
Result<std::shared_ptr<Array>> Concatenate(const ArrayVector& arrays,
                                           MemoryPool* pool = default_memory_pool());
This function doesn't have that handling (Table::CombineChunks actually calls Concatenate on each column and handles the possible overflow itself).
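
(Editorial illustration.) In Python terms, the ChunkedArray path boils down to a single concatenation, which is where the offset overflow surfaces:

import pyarrow as pa

# Roughly what ChunkedArray.combine_chunks() does in pyarrow 4.0: concatenate
# every chunk into one Array, raising ArrowInvalid when the combined string
# data no longer fits in 32-bit offsets.
combined = pa.concat_arrays(pa_col1_array.chunks)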

Now, that explains the difference, but the question is of course whether we want to unify that behaviour, or for example add a keyword to indicate that multiple chunks are also fine for the ChunkedArray case.
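
(Editorial sketch, not part of the original comment.) Until the behaviour is unified, one workaround is to route a ChunkedArray through the Table path so that the overflow-aware code above is used; the result may still contain multiple chunks:

import pyarrow as pa

def combine_via_table(chunked_array: pa.ChunkedArray) -> pa.ChunkedArray:
    # Wrap the ChunkedArray in a one-column table; Table.combine_chunks()
    # splits binary-like columns across several chunks instead of overflowing.
    return pa.table({"col": chunked_array}).combine_chunks().column(0)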

David Li / @lidavidm:
I think ARROW-7245 and possibly ARROW-9003 would be relevant.

Gert Hulselmans / @ghuls:
https://issues.apache.org/jira/browse/ARROW-7245 indeed would solve this issue.

It is actually the solution that polars will use to handle this problem:
pola-rs/polars#862
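
(Editorial sketch, not part of the original comment.) With 64-bit offsets the per-chunk limit effectively disappears, so casting to large_string before combining should no longer raise:

import pyarrow as pa

# Cast the 32-bit-offset string column to large_string (64-bit offsets),
# then combine; the character data is no longer capped at ~2 GiB per chunk.
large = pa_col1_array.cast(pa.large_string())
combined = large.combine_chunks()  # should succeed without ArrowInvalid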
