-
Notifications
You must be signed in to change notification settings - Fork 25.5k
[GPU] Support for performance profiling #136021
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
x-pack/plugin/gpu/src/main/java/org/elasticsearch/xpack/gpu/codec/ES92GpuHnswVectorsWriter.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/gpu/src/main/java/org/elasticsearch/xpack/gpu/codec/CuVSResourceManager.java
Outdated
Show resolved
Hide resolved
qa/vector/src/main/java/org/elasticsearch/test/knn/KnnIndexer.java
Outdated
Show resolved
Hide resolved
if (System.getenv("DO_PROFILING") != null) { | ||
jvmArgs '-XX:StartFlightRecording=dumponexit=true,maxsize=250M,filename=knn.jfr,settings=profile.jfc' | ||
} | ||
def asyncProfilerPath = System.getProperty("asyncProfiler.path", null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am uncertain about this part about async profiler, but I trust your expertise on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ldematte Thanks Lorenzo, makes sense. I've left some comments to address, but otherwise it looks good.
…csearch into knn-index-tester-changes
if (asyncProfilerPath != null) { | ||
if (OS.current().equals(OS.MAC)) { | ||
def asyncProfilerAgent = "${asyncProfilerPath}/lib/libasyncProfiler.dylib" | ||
println "Using async-profiler agent ${asyncProfilerAgent}" | ||
jvmArgs "-agentpath:${asyncProfilerAgent}=start,event=cpu,interval=10ms,file=${layout.buildDirectory.asFile.get()}/tmp/elasticsearch-0_%t_%p.jfr" | ||
} else if (OS.current().equals(OS.LINUX)) { | ||
def asyncProfilerAgent = "${asyncProfilerPath}/lib/libasyncProfiler.so" | ||
println "Using async-profiler agent ${asyncProfilerAgent}" | ||
jvmArgs "-agentpath:${asyncProfilerAgent}=start,event=cpu,interval=10ms,wall=50ms,file=${layout.buildDirectory.asFile.get()}/tmp/elasticsearch-0_%t_%p.jfr" | ||
} else { | ||
println "Ignoring 'asyncProfiler.path': not available on ${OS.current()}"; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am cool with this. However, why don't we add wall to MAC as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried it and had to back off. Looking at the error I got and at the async-profiler code, apparently only Linux has an implementation that uses perf events, which let you record both cpu time and wall time at the same time. On Mac, the engine behind is less flexible/precise, and you can have one or the other.
I'm wondering: maybe I should add an option of that, like adding a -DasyncProfiler.event=
, default to cpu
, but can be changed to wall
so Mac users can have this choice?
iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH); | ||
iwc.setRAMBufferSizeMB(writerBufferSizeInMb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to be careful, we should by default benchmark with ES defaults. Optimizing our benchmarks but not our production code can give a false sense of improvement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setRAMBufferSizeMB
is the same default as before, but now I'm not sure it is the same as ES. I'll check.
As for the number of docs... I'll check that too. I supposed these settings are exclusive, but it might be they are "first met wins" instead.
If that's the case, I'll default to what ES has and add an input option for that too.
} catch (IOException ioe) { | ||
throw new UncheckedIOException(ioe); | ||
} | ||
logger.debug("Index thread times: [{}ms] read, [{}ms] add doc", readTime / 1_000_000, docAddTime / 1_000_000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are going to worry about this nuance, we should simply adjust this to return the time each thread spent indexing and sum those up in the results
Additionally, we would need to separate the call to:
ConcurrentMergeScheduler cms = (ConcurrentMergeScheduler) iwc.getMergeScheduler();
cms.sync();
and maybe have a "exact index time" vs "overall index time" or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this because I was seeing weird variance on the machine I'm using (on AWS) and wanted to see if read/write performance was at the root of the issue.
I'm not sure there is value in keeping this, other than looking at the logs and say "ah, yes, this run has issues with reading, I'll treat it as an outlier" (or not, depending on what you are measuring).
But I do agree that a plain log is not the best tool for this; I'll make the change you suggest, it makes sense and it's not too much of a change.
In order to better understand the performance characteristics of vector indexing with a GPU, this PR introduces 2 changes: