Search before asking
Version
Doris 4.0.3-rc03
The problem occurs both in Cloud Mode and non-Cloud Mode
What's Wrong?
The Stream Load API may cause a slow memory leak in LoadManager when it is called with group commit enabled:
http://FE/api/db/table/_stream_load&group_commit=async_mode
LoadManager can accumulate a large number of PENDING jobs:
# Check the count of PENDING tasks in LoadManager
[arthas@2844843]$ ognl '@org.apache.doris.catalog.Env@getCurrentEnv().getLoadManager().idToLoadJob.values().{? #this.getState().name() == "PENDING"}.size()'
@Integer[4048847]
Looking at the job details, many of these PENDING jobs were created several days ago and have not been purged, even though label_keep_max_second is set to 600 seconds.
# Sampling PENDING tasks (showing State--CreateTimestamp:Label#Id)
[arthas@2844843]$ ognl '#list=@org.apache.doris.catalog.Env@getCurrentEnv().getLoadManager().idToLoadJob.values().{? #this.getState().name() == "PENDING"}, #size=#list.size(), #end=#size > 10 ? 10 : #size, new java.util.ArrayList(#list.subList(0, #end)).{#this.getState().name()+"--"+#this.getCreateTimestamp() + ":" + #this.getLabel()+"#"+#this.getId()}'
@ArrayList[
@String[PENDING--1774695394786:group_commit_834ce3a6d7aa549c_0ec5c85be1a6128d#1774602956553],
@String[PENDING--1774695393686:group_commit_1f4c5fa642f4e779_ddcc3cc03a68428a#1774602956552],
@String[PENDING--1774695395590:group_commit_274264d3e3c52a75_c3abcdab7dc1d7bd#1774602956555],
...
]
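To check how old a sampled job is, the CreateTimestamp can be decoded. This is a minimal sketch, assuming the values printed above are epoch milliseconds. The class name JobAge and the helper age are hypothetical and not part of Doris.

```java
import java.time.Duration;
import java.time.Instant;

public class JobAge {
    // Hypothetical helper: elapsed time between a job's create timestamp and "now".
    // Assumes createTimestampMs is epoch milliseconds.
    static Duration age(long createTimestampMs, Instant now) {
        return Duration.between(Instant.ofEpochMilli(createTimestampMs), now);
    }

    public static void main(String[] args) {
        // Create timestamp of the first sampled PENDING job above.
        long createTimestampMs = 1774695394786L;
        Duration age = age(createTimestampMs, Instant.now());
        System.out.println("created at: " + Instant.ofEpochMilli(createTimestampMs));
        System.out.println("age in seconds: " + age.getSeconds());
        // 600 is the label_keep_max_second value mentioned in this report.
        System.out.println("older than 600s: " + (age.getSeconds() > 600));
    }
}
```

Note that the cleaner compares against the finish timestamp, not the create timestamp, so this only shows how long a job has been sitting in the map.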
In the source code, it seems that a PENDING job is never removed:
public void removeOldLoadJob() {
    long currentTimeMs = System.currentTimeMillis();
    removeLoadJobIf(job -> job.isExpired(currentTimeMs));
}

public boolean isExpired(long currentTimeMs) {
    // Logic issue: if the job remains in PENDING state, isCompleted() is false,
    // so the job will NEVER be removed by the cleaner.
    if (!isCompleted()) {
        return false;
    }
    long expireTime = Config.label_keep_max_second;
    if (jobType == EtlJobType.INSERT) {
        expireTime = Config.streaming_label_keep_max_second;
    }
    return (currentTimeMs - getFinishTimestamp()) / 1000 > expireTime;
}

public boolean isCompleted() {
    return state == JobState.FINISHED || state == JobState.CANCELLED || state == JobState.UNKNOWN;
}
I found the place where the job is added to the LoadManager:
public AbstractInsertExecutor(ConnectContext ctx, TableIf table, String labelName, NereidsPlanner planner,
        Optional<InsertCommandContext> insertCtx, boolean emptyInsert, long jobId) {
    this.ctx = ctx;
    this.database = table.getDatabase();
    this.insertLoadJob = new InsertLoadJob(database.getId(), labelName, jobId);
    // Do not add load job if job id is -1.
    if (jobId != -1) {
        ctx.getEnv().getLoadManager().addLoadJob(insertLoadJob);
    }
    this.coordinator = EnvFactory.getInstance().createCoordinator(
            ctx, planner, ctx.getStatsErrorEstimator(), insertLoadJob.getId());
    this.labelName = labelName;
    this.table = table;
    this.insertCtx = insertCtx;
    this.emptyInsert = emptyInsert;
    this.jobId = jobId;
}
I tried skipping addLoadJob when the request is a group commit, but it does not work:
if (jobId != -1 && !ctx.isGroupCommit()) {
    ctx.getEnv().getLoadManager().addLoadJob(insertLoadJob);
}
/*
if (jobId != -1) {
    ctx.getEnv().getLoadManager().addLoadJob(insertLoadJob);
}
*/
Then I added some logging to httpStreamPutImpl:
private void httpStreamPutImpl(TStreamLoadPutRequest request, TStreamLoadPutResult result) {
    // ...
    LOG.info("[DEBUG-request] groupCommitMode={}", request.getGroupCommitMode());
    ctx.getSessionVariable().groupCommit = request.getGroupCommitMode();
    LOG.info("[DEBUG-session] groupCommitMode={}", ctx.getSessionVariable().groupCommit);
    // ...
}
The groupCommitMode here is always null:
[DEBUG-request] groupCommitMode=null
[DEBUG-session] groupCommitMode=null
What You Expected?
- LoadManager should not accumulate old, dead PENDING jobs when group commit is on. They should either be cleared automatically or never be added through addLoadJob in the first place.
- Inside httpStreamPutImpl, groupCommitMode should not be null.
How to Reproduce?
Very easy to reproduce:
- Download and extract apache-doris-4.0.3, keep all default configurations, and start the FE and BE.
- Load data through the HttpStream load API: http://FE/api/db/table/_stream_load&group_commit=async_mode
- Use Arthas to watch the number of PENDING jobs in LoadManager increase steadily:
[arthas@2844843]$ ognl '@org.apache.doris.catalog.Env@getCurrentEnv().getLoadManager().idToLoadJob.values().{? #this.getState().name() == "PENDING"}.size()'
Anything Else?
The problem occurs in both Cloud Mode and non-Cloud Mode.
The problem does not occur in 3.x.x.
Are you willing to submit PR?
Code of Conduct