xgboost4j matrix memory leak #10300

Closed
huangwei907781034 opened this issue May 20, 2024 · 15 comments · Fixed by #10307

Comments

@huangwei907781034

problem:
I load XGBoost model files in a Java Spring service that provides online model prediction. While the service runs, its memory keeps increasing. Using JVM Native Memory Tracking (NMT), I found that Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromCSR is what causes Internal memory to keep growing.

version:

ml.dmlc:xgboost4j_2.12:1.5.0

code:
dMatrix = new DMatrix(ListUtil.toLongArray(headers), ListUtil.toIntArray(indices), ListUtil.toFloatArray(data), DMatrix.SparseType.CSR, (int) numFeature);

code I tried:
dMatrix = new DMatrix(ListUtil.toLongArray(headers), ListUtil.toIntArray(indices), ListUtil.toFloatArray(data), DMatrix.SparseType.CSR, (int) numFeature);
dMatrix.dispose();

To rule out interference from my other code, I disposed of the matrix immediately after creating it, but the memory was still not released. I don't know whether there is a problem with dispose.

NMT log:
Native Memory Tracking:

Total: reserved=7363MB +31MB, committed=4289MB +31MB

-                 Java Heap (reserved=3072MB, committed=3072MB)
                            (mmap: reserved=3072MB, committed=3072MB)

-                     Class (reserved=1119MB, committed=105MB)
                            (classes #16959)
                            (  instance classes #15884, array classes #1075)
                            (malloc=3MB #46273 +33)
                            (mmap: reserved=1116MB, committed=103MB)
                            (  Metadata:   )
                            (    reserved=92MB, committed=91MB)
                            (    used=89MB)
                            (    free=2MB)
                            (    waste=0MB =0.00%)
                            (  Class space:)
                            (    reserved=1024MB, committed=12MB)
                            (    used=11MB)
                            (    free=1MB)
                            (    waste=0MB =0.00%)

-                    Thread (reserved=2046MB, committed=185MB)
                            (thread #2029)
                            (stack: reserved=2037MB, committed=175MB)
                            (malloc=7MB #10156)
                            (arena=2MB #4056)

-                      Code (reserved=244MB, committed=45MB)
                            (malloc=3MB #12101 +46)
                            (mmap: reserved=242MB, committed=42MB)

-                        GC (reserved=163MB, committed=163MB)
                            (malloc=17MB #40973 +4)
                            (mmap: reserved=146MB, committed=146MB)

-                  Compiler (reserved=3MB, committed=3MB)
                            (malloc=3MB #1913 +8)

-                  Internal (reserved=614MB +30MB, committed=614MB +30MB)
                            (malloc=614MB +30MB #37792 +272)

-                     Other (reserved=70MB, committed=70MB)
                            (malloc=70MB #251)

-                    Symbol (reserved=22MB, committed=22MB)
                            (malloc=19MB #233751 +11)
                            (arena=3MB #1)

-    Native Memory Tracking (reserved=8MB +1MB, committed=8MB +1MB)
                            (malloc=1MB +1MB #19766 +7291)
                            (tracking overhead=6MB)

[0x00007fc52398d029] jni_GetIntArrayElements+0x169
[0x00007fc4c1341db4] Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromCSR+0x74
[0x00007fc50ca77cf9]
                             (malloc=290MB type=Internal +16MB #1606 +88)

[0x00007fc52398daa9] jni_GetFloatArrayElements+0x169
[0x00007fc4c1341dce] Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromCSR+0x8e
[0x00007fc50ca77cf9]
                             (malloc=290MB type=Internal +16MB #1605 +87)

@trivialfis
Member

Hi, is the DMatrix object released after use?

@huangwei907781034
Author

Yes. I wrote a demo that calls dispose immediately after new, but Internal memory still shows growth attributed to the xgboost JNI call. My Java version is 11.20.

@huangwei907781034
Author

My understanding is that since everything runs on the JVM, the behavior should in theory not depend on the operating system. However, when I run the same code on a JVM on macOS, the problem does not occur. Does this give you any hints?

@huangwei907781034
Author

The operating system where the problem occurs is Linux 5.15.0-52.

@trivialfis
Member

trivialfis commented May 20, 2024

Two things account for memory consumption: the first is the DMatrix object itself, which stores the input data; the second is the prediction cache. The prediction cache is NOT freed immediately after a DMatrix is destroyed. Instead, it is checked when a new cache item is requested:

void ClearExpired() {

A cache item is ready to be cleared if a) the DMatrix object is freed, b) the cache is full.

As a result, memory usage might not look consistent while the cache is being built.
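
The lazy eviction described above can be sketched in Java. This is only an analogue meant to illustrate the behavior, not XGBoost's C++ implementation: the class name `LazyPredictionCache`, its methods, and the use of `WeakReference` to stand in for the expiry check are all assumptions made for this example, and the "cache is full" condition is not modelled.

```java
import java.lang.ref.WeakReference;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative analogue of a lazily swept cache (hypothetical, not XGBoost code). */
final class LazyPredictionCache {
    /** One cached prediction buffer plus a weak link to the matrix that owns it. */
    private static final class Entry {
        final WeakReference<Object> owner;
        final float[] predictions;

        Entry(WeakReference<Object> owner, float[] predictions) {
            this.owner = owner;
            this.predictions = predictions;
        }
    }

    private final Map<Integer, Entry> entries = new LinkedHashMap<>();
    private int nextId = 0;

    /** Requests a new cache item; stale items are swept only at this point. */
    int cache(WeakReference<Object> owner, int rows) {
        clearExpired();
        int id = nextId++;
        entries.put(id, new Entry(owner, new float[rows]));
        return id;
    }

    /** Drops every entry whose owner is no longer reachable through its weak reference. */
    void clearExpired() {
        entries.values().removeIf(e -> e.owner.get() == null);
    }

    /** Number of entries currently held, including stale ones not yet swept. */
    int size() {
        return entries.size();
    }
}
```

In this sketch, an entry whose owner has gone away still counts toward `size()` until the next call to `cache(...)`, which is why memory readings taken between requests can look inconsistent. The class is not thread-safe.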

@huangwei907781034
Author

My understanding is that if this were the cache, NMT should not attribute the memory to Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromCSR. If the cache is also reported under that frame, please tell me how to clear it. My service's memory keeps growing because of this until it reaches 100%, and then the service crashes.

@trivialfis
Member

I don't think XGBoost will take up that much memory. As mentioned previously, a cache item is evicted as soon as a new cache item is requested and the previous DMatrix is freed:

if (it->second.ref.expired()) {

We need some investigation to understand why your pipeline is eating up all the memory.

@trivialfis
Member

Are you launching new threads for prediction without finishing the thread afterward?

@huangwei907781034
Author

Yes. I created an xgboost object pool holding 20 booster objects, and I use multiple threads to serve predictions.

@huangwei907781034
Author

But I think that even with 20 booster objects there should be only 20 caches, because predict is a synchronous method.
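
For readers unfamiliar with the setup being described, a fixed-size pool of this kind might look roughly like the sketch below. It is a generic pattern, not the author's code: `FixedPool`, `withItem`, and `idleCount` are hypothetical names, and the pooled type is left generic so the example does not depend on xgboost4j.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

/** Hypothetical fixed-size pool: each item is used by at most one thread at a time. */
final class FixedPool<T> {
    private final BlockingQueue<T> idle;

    /** Fills the pool with the given items; the list must not be empty. */
    FixedPool(List<T> items) {
        this.idle = new ArrayBlockingQueue<>(items.size(), false, items);
    }

    /** Borrows an item, runs the task on it, and always returns the item to the pool. */
    <R> R withItem(Function<T, R> task) {
        T item;
        try {
            item = idle.take(); // blocks while every item is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pooled item", e);
        }
        try {
            return task.apply(item);
        } finally {
            idle.add(item); // cannot overflow: capacity equals the pool size
        }
    }

    /** Number of items currently not borrowed. */
    int idleCount() {
        return idle.size();
    }
}
```

With a pool of 20 items, at most 20 tasks run concurrently, and the `finally` block returns the item even when the task throws.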

@huangwei907781034
Author

Now there is a strange problem. I cannot reproduce this phenomenon of internal memory growing on Mac. It only occurs in the production environment.

@huangwei907781034
Author

[0x00007fc52398d029] jni_GetIntArrayElements+0x169
[0x00007fc4c1341db4] Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromCSR+0x74
[0x00007fc50ca77cf9]
(malloc=290MB type=Internal +16MB #1606 +88)

[0x00007fc52398daa9] jni_GetFloatArrayElements+0x169
[0x00007fc4c1341dce] Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromCSR+0x8e
[0x00007fc50ca77cf9]
(malloc=290MB type=Internal +16MB #1605 +87)

As shown in the log, it is theoretically impossible for malloc memory to be as large as 290MB.

@wbo4958
Contributor

wbo4958 commented May 21, 2024

Hi @huangwei907781034, could you share minimal code, including mimicked data, to reproduce it? Also, by the way, how do you get the Native Memory Tracking output? Thanks.

@huangwei907781034
Author

Hi, I have located the problem. The input data contains inf values, which cause creation of the matrix to fail. The underlying problem, however, is that when the matrix fails to be created, the memory is not released. After I added a check for inf values, my memory usage returned to normal.
exception log:
/workspace/src/data/data.cc:945: Check failed: valid: Input data contains inf or nan
Stack trace:
[bt] (0) /tmp/libxgboost4j5095247005832545355.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x53) [0x7f8261aae843]
[bt] (1) /tmp/libxgboost4j5095247005832545355.so(unsigned long xgboost::SparsePage::Push<xgboost::data::CSRAdapterBatch>(xgboost::data::CSRAdapterBatch const&, float, int)+0x470) [0x7f8261b8dca0]
[bt] (2) /tmp/libxgboost4j5095247005832545355.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::CSRAdapter>(xgboost::data::CSRAdapter*, float, int)+0x29c) [0x7f8261b9e6ac]
[bt] (3) /tmp/libxgboost4j5095247005832545355.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::CSRAdapter>(xgboost::data::CSRAdapter*, float, int, std::string const&, unsigned long)+0x45) [0x7f8261b950f5]
[bt] (4) /tmp/libxgboost4j5095247005832545355.so(XGDMatrixCreateFromCSREx+0x7f) [0x7f8261abbaaf]
[bt] (5) /tmp/libxgboost4j5095247005832545355.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromCSREx+0xa6) [0x7f8261aa98f6]
[bt] (6) [0x7f82b851a662]
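
A check like the one mentioned above could be written as a small pre-flight validation on the CSR value array before it is handed to the DMatrix constructor. The sketch below is an assumption about what such a check might look like, not the author's code: `CsrInputValidator` and its methods are hypothetical names. It rejects only infinities and leaves NaN alone, on the assumption that NaN is used as the missing-value marker; swap in `!Float.isFinite(v)` if NaN should be rejected too.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for CSR values; not part of xgboost4j. */
final class CsrInputValidator {
    private CsrInputValidator() {}

    /** Returns the positions in data that hold +Infinity or -Infinity. */
    static List<Integer> findInfinite(float[] data) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            if (Float.isInfinite(data[i])) {
                bad.add(i);
            }
        }
        return bad;
    }

    /** Fails in plain Java, before any native call, if an infinite value is present. */
    static void requireNoInfinite(float[] data) {
        List<Integer> bad = findInfinite(data);
        if (!bad.isEmpty()) {
            throw new IllegalArgumentException("Input data contains inf at positions " + bad);
        }
    }
}
```

In the code from the original report, this would be called on the result of `ListUtil.toFloatArray(data)` before `new DMatrix(...)`, so that bad input is rejected on the Java side and the failing native path is never entered.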

@trivialfis
Member

Thank you for sharing! That's really helpful.
