[SPARK-33206][CORE][3.2] Fix shuffle index cache weight calculation for small index files #35714

attilapiros · 2022-03-02T16:23:30Z

What changes were proposed in this pull request?

Increasing the shuffle index weight with a constant number to avoid underestimating retained memory size caused by the bookkeeping objects: the java.io.File (depending on the path ~ 960 bytes) object and the ShuffleIndexInformation object (~180 bytes).

Why are the changes needed?

Underestimating cache entry size easily can cause OOM in the Yarn NodeManager.
In the following analyses of a prod issue (HPROF file) we can see the leak suspect Guava's LocalCache$Segment objects:

Going further we can see a ShuffleIndexInformation for a small index file (16 bytes) but the retained heap memory is 1192 bytes:

Finally we can see this is very common within this heap dump (using MAT's Object Query Language):

I have even exported the data to a CSV and done some calculations with awk:

$ tail -n+2 export.csv | awk -F, 'BEGIN { numUnderEstimated=0; } { sumOldSize += $1; corrected=$1 + 1176; sumCorrectedSize += corrected; sumRetainedMem += $2; if (corrected < $2) numUnderEstimated+=1; } END { print "sum old size: " sumOldSize / 1024 / 1024   " MB, sum corrected size: " sumCorrectedSize / 1024 / 1024 " MB, sum retained memory:" sumRetainedMem / 1024 / 1024  " MB, num under estimated: " numUnderEstimated }'

It gives the followings:

sum old size: 76.8785 MB, sum corrected size: 1066.93 MB, sum retained memory:1064.47 MB, num under estimated: 0

So using the old calculation we were at 7.6.8 MB way under the default cache limit (100 MB).
Using the correction (applying 1176 as increment to the size) we are at 1066.93 MB (~1GB) which is close to the real retained sum heap: 1064.47 MB (~1GB) and there is no entry which was underestimated.

But we can go further and get rid of java.io.File completely and store the ShuffleIndexInformation for the file path.
This way not only the cache size estimate is improved but the its size is decreased as well.
Here the path size is not counted into the cache size as that string is interned.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

With the calculations above.

attilapiros · 2022-03-02T16:25:34Z

Backport of #35559 to 3.2

…all index files Increasing the shuffle index weight with a constant number to avoid underestimating retained memory size caused by the bookkeeping objects: the `java.io.File` (depending on the path ~ 960 bytes) object and the `ShuffleIndexInformation` object (~180 bytes). Underestimating cache entry size easily can cause OOM in the Yarn NodeManager. In the following analyses of a prod issue (HPROF file) we can see the leak suspect Guava's `LocalCache$Segment` objects: <img width="943" alt="Screenshot 2022-02-17 at 18 55 40" src="https://user-images.githubusercontent.com/2017933/154541995-44014212-2046-41d6-ba7f-99369ca7d739.png"> Going further we can see a `ShuffleIndexInformation` for a small index file (16 bytes) but the retained heap memory is 1192 bytes: <img width="1351" alt="image" src="https://user-images.githubusercontent.com/2017933/154645212-e0318d0f-cefa-4ae3-8a3b-97d2b506757d.png"> Finally we can see this is very common within this heap dump (using MAT's Object Query Language): <img width="1418" alt="image" src="https://user-images.githubusercontent.com/2017933/154547678-44c8af34-1765-4e14-b71a-dc03d1a304aa.png"> I have even exported the data to a CSV and done some calculations with `awk`: ``` $ tail -n+2 export.csv | awk -F, 'BEGIN { numUnderEstimated=0; } { sumOldSize += $1; corrected=$1 + 1176; sumCorrectedSize += corrected; sumRetainedMem += $2; if (corrected < $2) numUnderEstimated+=1; } END { print "sum old size: " sumOldSize / 1024 / 1024 " MB, sum corrected size: " sumCorrectedSize / 1024 / 1024 " MB, sum retained memory:" sumRetainedMem / 1024 / 1024 " MB, num under estimated: " numUnderEstimated }' ``` It gives the followings: ``` sum old size: 76.8785 MB, sum corrected size: 1066.93 MB, sum retained memory:1064.47 MB, num under estimated: 0 ``` So using the old calculation we were at 7.6.8 MB way under the default cache limit (100 MB). Using the correction (applying 1176 as increment to the size) we are at 1066.93 MB (~1GB) which is close to the real retained sum heap: 1064.47 MB (~1GB) and there is no entry which was underestimated. But we can go further and get rid of `java.io.File` completely and store the `ShuffleIndexInformation` for the file path. This way not only the cache size estimate is improved but the its size is decreased as well. Here the path size is not counted into the cache size as that string is interned. No. With the calculations above. Closes apache#35559 from attilapiros/SPARK-33206. Authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 42f118a)

dongjoon-hyun

Thank you for making a backport, @attilapiros .

attilapiros · 2022-03-02T17:53:52Z

Thank you for making a backport, @attilapiros .

@dongjoon-hyun Tomorrow I will open one against branch-3.1.

dongjoon-hyun · 2022-03-02T18:06:41Z

Thanks!

cc @mridulm , @srowen , @Ngone51 , @wypoon too

dongjoon-hyun

+1, LGTM. The linter job error is irrelevant to this PR.

pip._vendor.urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
[95](https://github.com/attilapiros/spark/runs/5395854606?check_suite_focus=true#step:8:95)
WARNING: You are using pip version 21.1.2; however, version 22.0.3 is available.

If you are ready, could you switch into a normal PR from Draft, @attilapiros ?

dongjoon-hyun · 2022-03-02T20:23:59Z

Thank you, @attilapiros and @srowen .
Merged to branch-3.2 for Apache Spark 3.2.2.

…or small index files ### What changes were proposed in this pull request? Increasing the shuffle index weight with a constant number to avoid underestimating retained memory size caused by the bookkeeping objects: the `java.io.File` (depending on the path ~ 960 bytes) object and the `ShuffleIndexInformation` object (~180 bytes). ### Why are the changes needed? Underestimating cache entry size easily can cause OOM in the Yarn NodeManager. In the following analyses of a prod issue (HPROF file) we can see the leak suspect Guava's `LocalCache$Segment` objects: <img width="943" alt="Screenshot 2022-02-17 at 18 55 40" src="https://user-images.githubusercontent.com/2017933/154541995-44014212-2046-41d6-ba7f-99369ca7d739.png"> Going further we can see a `ShuffleIndexInformation` for a small index file (16 bytes) but the retained heap memory is 1192 bytes: <img width="1351" alt="image" src="https://user-images.githubusercontent.com/2017933/154645212-e0318d0f-cefa-4ae3-8a3b-97d2b506757d.png"> Finally we can see this is very common within this heap dump (using MAT's Object Query Language): <img width="1418" alt="image" src="https://user-images.githubusercontent.com/2017933/154547678-44c8af34-1765-4e14-b71a-dc03d1a304aa.png"> I have even exported the data to a CSV and done some calculations with `awk`: ``` $ tail -n+2 export.csv | awk -F, 'BEGIN { numUnderEstimated=0; } { sumOldSize += $1; corrected=$1 + 1176; sumCorrectedSize += corrected; sumRetainedMem += $2; if (corrected < $2) numUnderEstimated+=1; } END { print "sum old size: " sumOldSize / 1024 / 1024 " MB, sum corrected size: " sumCorrectedSize / 1024 / 1024 " MB, sum retained memory:" sumRetainedMem / 1024 / 1024 " MB, num under estimated: " numUnderEstimated }' ``` It gives the followings: ``` sum old size: 76.8785 MB, sum corrected size: 1066.93 MB, sum retained memory:1064.47 MB, num under estimated: 0 ``` So using the old calculation we were at 7.6.8 MB way under the default cache limit (100 MB). Using the correction (applying 1176 as increment to the size) we are at 1066.93 MB (~1GB) which is close to the real retained sum heap: 1064.47 MB (~1GB) and there is no entry which was underestimated. But we can go further and get rid of `java.io.File` completely and store the `ShuffleIndexInformation` for the file path. This way not only the cache size estimate is improved but the its size is decreased as well. Here the path size is not counted into the cache size as that string is interned. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? With the calculations above. Closes #35714 from attilapiros/SPARK-33206-3.2. Authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

mridulm · 2022-03-03T18:24:14Z

Late LGTM, thanks @attilapiros !

…or small index files ### What changes were proposed in this pull request? Increasing the shuffle index weight with a constant number to avoid underestimating retained memory size caused by the bookkeeping objects: the `java.io.File` (depending on the path ~ 960 bytes) object and the `ShuffleIndexInformation` object (~180 bytes). ### Why are the changes needed? Underestimating cache entry size easily can cause OOM in the Yarn NodeManager. In the following analyses of a prod issue (HPROF file) we can see the leak suspect Guava's `LocalCache$Segment` objects: <img width="943" alt="Screenshot 2022-02-17 at 18 55 40" src="https://user-images.githubusercontent.com/2017933/154541995-44014212-2046-41d6-ba7f-99369ca7d739.png"> Going further we can see a `ShuffleIndexInformation` for a small index file (16 bytes) but the retained heap memory is 1192 bytes: <img width="1351" alt="image" src="https://user-images.githubusercontent.com/2017933/154645212-e0318d0f-cefa-4ae3-8a3b-97d2b506757d.png"> Finally we can see this is very common within this heap dump (using MAT's Object Query Language): <img width="1418" alt="image" src="https://user-images.githubusercontent.com/2017933/154547678-44c8af34-1765-4e14-b71a-dc03d1a304aa.png"> I have even exported the data to a CSV and done some calculations with `awk`: ``` $ tail -n+2 export.csv | awk -F, 'BEGIN { numUnderEstimated=0; } { sumOldSize += $1; corrected=$1 + 1176; sumCorrectedSize += corrected; sumRetainedMem += $2; if (corrected < $2) numUnderEstimated+=1; } END { print "sum old size: " sumOldSize / 1024 / 1024 " MB, sum corrected size: " sumCorrectedSize / 1024 / 1024 " MB, sum retained memory:" sumRetainedMem / 1024 / 1024 " MB, num under estimated: " numUnderEstimated }' ``` It gives the followings: ``` sum old size: 76.8785 MB, sum corrected size: 1066.93 MB, sum retained memory:1064.47 MB, num under estimated: 0 ``` So using the old calculation we were at 7.6.8 MB way under the default cache limit (100 MB). Using the correction (applying 1176 as increment to the size) we are at 1066.93 MB (~1GB) which is close to the real retained sum heap: 1064.47 MB (~1GB) and there is no entry which was underestimated. But we can go further and get rid of `java.io.File` completely and store the `ShuffleIndexInformation` for the file path. This way not only the cache size estimate is improved but the its size is decreased as well. Here the path size is not counted into the cache size as that string is interned. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? With the calculations above. Closes apache#35714 from attilapiros/SPARK-33206-3.2. Authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 0f03b54) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

github-actions bot added the CORE label Mar 2, 2022

attilapiros marked this pull request as draft March 2, 2022 16:30

attilapiros force-pushed the SPARK-33206-3.2 branch from 305c0a1 to 3f1644c Compare March 2, 2022 17:42

dongjoon-hyun reviewed Mar 2, 2022

View reviewed changes

srowen approved these changes Mar 2, 2022

View reviewed changes

dongjoon-hyun approved these changes Mar 2, 2022

View reviewed changes

attilapiros marked this pull request as ready for review March 2, 2022 19:59

dongjoon-hyun closed this Mar 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-33206][CORE][3.2] Fix shuffle index cache weight calculation for small index files #35714

[SPARK-33206][CORE][3.2] Fix shuffle index cache weight calculation for small index files #35714

attilapiros commented Mar 2, 2022

attilapiros commented Mar 2, 2022

dongjoon-hyun left a comment

attilapiros commented Mar 2, 2022

dongjoon-hyun commented Mar 2, 2022

dongjoon-hyun left a comment

dongjoon-hyun commented Mar 2, 2022

mridulm commented Mar 3, 2022

[SPARK-33206][CORE][3.2] Fix shuffle index cache weight calculation for small index files #35714

[SPARK-33206][CORE][3.2] Fix shuffle index cache weight calculation for small index files #35714

Conversation

attilapiros commented Mar 2, 2022

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

attilapiros commented Mar 2, 2022

dongjoon-hyun left a comment

Choose a reason for hiding this comment

attilapiros commented Mar 2, 2022

dongjoon-hyun commented Mar 2, 2022

dongjoon-hyun left a comment

Choose a reason for hiding this comment

dongjoon-hyun commented Mar 2, 2022

mridulm commented Mar 3, 2022