-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C#] Performance issue of reading StringArray #41047
Comments
CurtHagenlocher
pushed a commit
that referenced
this issue
Apr 8, 2024
…41048) ### Rationale for this change The motivation here is to address #41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding. ### What changes are included in this PR? - Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly. - Added test coverage. ### Are these changes tested? Yes ### Are there any user-facing changes? No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. * GitHub Issue: #41047 Authored-by: Keshuang Shen <keshen@microsoft.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
Issue resolved by pull request 41048 |
raulcd
pushed a commit
that referenced
this issue
Apr 9, 2024
…41048) ### Rationale for this change The motivation here is to address #41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding. ### What changes are included in this PR? - Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly. - Added test coverage. ### Are these changes tested? Yes ### Are there any user-facing changes? No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. * GitHub Issue: #41047 Authored-by: Keshuang Shen <keshen@microsoft.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
verma-kartik
pushed a commit
to verma-kartik/arrow
that referenced
this issue
Apr 11, 2024
…Array (apache#41048) ### Rationale for this change The motivation here is to address apache#41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding. ### What changes are included in this PR? - Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly. - Added test coverage. ### Are these changes tested? Yes ### Are there any user-facing changes? No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. * GitHub Issue: apache#41047 Authored-by: Keshuang Shen <keshen@microsoft.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
vibhatha
pushed a commit
to vibhatha/arrow
that referenced
this issue
Apr 15, 2024
…Array (apache#41048) ### Rationale for this change The motivation here is to address apache#41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding. ### What changes are included in this PR? - Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly. - Added test coverage. ### Are these changes tested? Yes ### Are there any user-facing changes? No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. * GitHub Issue: apache#41047 Authored-by: Keshuang Shen <keshen@microsoft.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
tolleybot
pushed a commit
to tmct/arrow
that referenced
this issue
May 2, 2024
…Array (apache#41048) ### Rationale for this change The motivation here is to address apache#41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding. ### What changes are included in this PR? - Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly. - Added test coverage. ### Are these changes tested? Yes ### Are there any user-facing changes? No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. * GitHub Issue: apache#41047 Authored-by: Keshuang Shen <keshen@microsoft.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
vibhatha
pushed a commit
to vibhatha/arrow
that referenced
this issue
May 25, 2024
…Array (apache#41048) ### Rationale for this change The motivation here is to address apache#41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding. ### What changes are included in this PR? - Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly. - Added test coverage. ### Are these changes tested? Yes ### Are there any user-facing changes? No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. * GitHub Issue: apache#41047 Authored-by: Keshuang Shen <keshen@microsoft.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Describe the enhancement requested
The general principle of Zero Copy does not work well in case of StringArray in the C# library. This is because the value buffer is UTF 8 encoded, while C# uses wide char. So, for reading each value, we need to go through UTF 8 decoding. This is especially bad in the case of reading a DictonaryArray of string value type since the value array is guaranteed to store unique strings, but the StringArray API forces reader code to decode string on encountered offsets repeatedly. In our profiling, we tested dictionary array of string and int columns in the same RecordBatch, and we see dominant CPU used on calling StringArray.GetString() comparing to reading int column.
C++ library on the other hand does not have this issue because Arrow C++ API exposes std::string and std::string_view, which work with UTF 8 natively.
Component(s)
C#
The text was updated successfully, but these errors were encountered: