Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41047: [C#] Address performance issue of reading from StringArray #41048

Merged
merged 3 commits into from
Apr 8, 2024

Conversation

keshen-msft
Copy link
Contributor

@keshen-msft keshen-msft commented Apr 5, 2024

Rationale for this change

The motivation here is to address #41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding.

What changes are included in this PR?

  • Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly.
  • Added test coverage.

Are these changes tested?

Yes

Are there any user-facing changes?

No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance.

Copy link

github-actions bot commented Apr 5, 2024

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@CurtHagenlocher CurtHagenlocher changed the title [C#] Address performance issue of reading from StringArray GH-41047: [C#] Address performance issue of reading from StringArray Apr 7, 2024
Copy link

github-actions bot commented Apr 7, 2024

⚠️ GitHub issue #41047 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 7, 2024
@CurtHagenlocher CurtHagenlocher merged commit fe38d47 into apache:main Apr 8, 2024
9 checks passed
@CurtHagenlocher CurtHagenlocher removed the awaiting committer review Awaiting committer review label Apr 8, 2024
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Apr 8, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit fe38d47.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

raulcd pushed a commit that referenced this pull request Apr 9, 2024
…41048)

### Rationale for this change

The motivation here is to address #41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding.

### What changes are included in this PR?

- Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly.
- Added test coverage.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. 

* GitHub Issue: #41047

Authored-by: Keshuang Shen <keshen@microsoft.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
verma-kartik pushed a commit to verma-kartik/arrow that referenced this pull request Apr 11, 2024
…Array (apache#41048)

### Rationale for this change

The motivation here is to address apache#41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding.

### What changes are included in this PR?

- Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly.
- Added test coverage.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. 

* GitHub Issue: apache#41047

Authored-by: Keshuang Shen <keshen@microsoft.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request Apr 15, 2024
…Array (apache#41048)

### Rationale for this change

The motivation here is to address apache#41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding.

### What changes are included in this PR?

- Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly.
- Added test coverage.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. 

* GitHub Issue: apache#41047

Authored-by: Keshuang Shen <keshen@microsoft.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…Array (apache#41048)

### Rationale for this change

The motivation here is to address apache#41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding.

### What changes are included in this PR?

- Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly.
- Added test coverage.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. 

* GitHub Issue: apache#41047

Authored-by: Keshuang Shen <keshen@microsoft.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…Array (apache#41048)

### Rationale for this change

The motivation here is to address apache#41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding.

### What changes are included in this PR?

- Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly.
- Added test coverage.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. 

* GitHub Issue: apache#41047

Authored-by: Keshuang Shen <keshen@microsoft.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
CurtHagenlocher pushed a commit that referenced this pull request Jul 19, 2024
…#43269)

### Rationale for this change

See #43266. Note that LargeBinary and LargeString are still limited to 2 GiB buffers, and LargeList is limited to offsets that can be represented as int32.

### What changes are included in this PR?

* Add new Array subtypes: LargeBinaryArray, LargeStringArray and LargeListArray
* Support round-tripping these array types via the IPC format
* Support round-tripping these array types via the C Data Interface
* Improve error messages when importing arrays that are too large via IPC or C Data Interface
* Enable integration tests for the new types 
* Update documentation

### Are these changes tested?

Yes, I've added some basic tests specifically for the new array types, and added these to the test data generator so they're covered by the existing tests for round tripping using IPC and C Data Interface.

### Are there any user-facing changes?

Yes, this is a new user facing feature.

### Implementation notes

* I haven't added builders for these new array types. Given they're added to help with interoperability with other libraries, I wouldn't expect .NET users to build arrays of these types as they provide no other benefit over the non-large types until we have proper large memory support. But I'm happy to add this if it would be useful.
* The new array types share a lot of logic with the non-large types. I considered trying to consolidate this logic by adding a new `BinaryArrayBase<TOffset>` class for example, but I think this would require generic math support to work nicely, and would still complicate the code quite a bit and add extra virtual method call overhead. So I think it's fine to keep these new Array subtypes independent from the non-large types.
* I haven't included support for materializing a LargeStringArray (see #41048). I'm not sure whether there would be a use for this, but it could be added later if needed.
* GitHub Issue: #43266

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants