Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C#] Performance issue of reading StringArray #41047

Closed
keshen-msft opened this issue Apr 5, 2024 · 1 comment
Closed

[C#] Performance issue of reading StringArray #41047

keshen-msft opened this issue Apr 5, 2024 · 1 comment

Comments

@keshen-msft
Copy link
Contributor

Describe the enhancement requested

The general principle of Zero Copy does not work well in case of StringArray in the C# library. This is because the value buffer is UTF 8 encoded, while C# uses wide char. So, for reading each value, we need to go through UTF 8 decoding. This is especially bad in the case of reading a DictonaryArray of string value type since the value array is guaranteed to store unique strings, but the StringArray API forces reader code to decode string on encountered offsets repeatedly. In our profiling, we tested dictionary array of string and int columns in the same RecordBatch, and we see dominant CPU used on calling StringArray.GetString() comparing to reading int column.

C++ library on the other hand does not have this issue because Arrow C++ API exposes std::string and std::string_view, which work with UTF 8 natively.

Component(s)

C#

CurtHagenlocher pushed a commit that referenced this issue Apr 8, 2024
…41048)

### Rationale for this change

The motivation here is to address #41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding.

### What changes are included in this PR?

- Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly.
- Added test coverage.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. 

* GitHub Issue: #41047

Authored-by: Keshuang Shen <keshen@microsoft.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
@CurtHagenlocher CurtHagenlocher added this to the 17.0.0 milestone Apr 8, 2024
@CurtHagenlocher
Copy link
Contributor

Issue resolved by pull request 41048
#41048

@raulcd raulcd modified the milestones: 17.0.0, 16.0.0 Apr 9, 2024
raulcd pushed a commit that referenced this issue Apr 9, 2024
…41048)

### Rationale for this change

The motivation here is to address #41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding.

### What changes are included in this PR?

- Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly.
- Added test coverage.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. 

* GitHub Issue: #41047

Authored-by: Keshuang Shen <keshen@microsoft.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
verma-kartik pushed a commit to verma-kartik/arrow that referenced this issue Apr 11, 2024
…Array (apache#41048)

### Rationale for this change

The motivation here is to address apache#41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding.

### What changes are included in this PR?

- Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly.
- Added test coverage.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. 

* GitHub Issue: apache#41047

Authored-by: Keshuang Shen <keshen@microsoft.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue Apr 15, 2024
…Array (apache#41048)

### Rationale for this change

The motivation here is to address apache#41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding.

### What changes are included in this PR?

- Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly.
- Added test coverage.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. 

* GitHub Issue: apache#41047

Authored-by: Keshuang Shen <keshen@microsoft.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
…Array (apache#41048)

### Rationale for this change

The motivation here is to address apache#41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding.

### What changes are included in this PR?

- Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly.
- Added test coverage.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. 

* GitHub Issue: apache#41047

Authored-by: Keshuang Shen <keshen@microsoft.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…Array (apache#41048)

### Rationale for this change

The motivation here is to address apache#41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding.

### What changes are included in this PR?

- Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly.
- Added test coverage.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. 

* GitHub Issue: apache#41047

Authored-by: Keshuang Shen <keshen@microsoft.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants