Reintroduction of Cursor code to optimize memory usage of CoExpression Service #6834
As a follow-up to the investigation of crashes on the public portal backend (the source of which has now been identified as requests to the CoExpression service), this PR reintroduces the use of Cursors to limit the amount of heap used by the CoExpression service implementation.
The following screenshots were taken while profiling the CoExpression service as it satisfied requests against the Cancer Cell Line Encyclopedia (Broad, 2019) with a query gene of SCML2.
On multiple occasions, at its peak memory usage, the CoExpression service uses over 7 GB to satisfy a request. This is the result of an accumulation of GeneMolecularAlteration instances and a pileup of calls to string splitting. In this case, we have ~33k GeneMolecularAlteration instances, each of which contains > 1500 alteration measurements, so string splitting produces close to 50 million strings (33,000 × 1,500 ≈ 49.5 million), one for each entity-sample measurement. The GeneMolecularAlteration instances (and the genetic alteration strings that result from string splitting) are all accumulated in memory before any Spearman correlation computations are made.
The following screenshot shows example memory telemetry highlighting the peak memory consumption (7.89 GB) of a CoExpression service call (this was the greatest peak captured during profiling):
With the introduction of cursors, only a single GeneMolecularAlteration instance is in memory at any one moment (well, two, because the one representing the "query" entity is kept in memory for the correlation computation). Once the Spearman correlation has been computed for this GeneMolecularAlteration instance, the instance is discarded and the next one is fetched through the cursor.
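The streaming pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the cursor is modeled as a plain `Iterable`, and `CoExpressionCursorSketch`, `Row`, `parse`, and `correlate` are hypothetical names introduced only for this example. It assumes each row carries its measurements as a comma-separated string and that every row has the same number of values as the query.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CoExpressionCursorSketch {

    /** Hypothetical stand-in for a GeneMolecularAlteration row. */
    static final class Row {
        final String entityId;
        final String values; // comma-separated measurements (assumed format)
        Row(String entityId, String values) {
            this.entityId = entityId;
            this.values = values;
        }
    }

    /** Splits and parses one row; only this row's strings are alive at a time. */
    static double[] parse(String values) {
        return Arrays.stream(values.split(",")).mapToDouble(Double::parseDouble).toArray();
    }

    /** 1-based ranks; tied values share the average of their ranks. */
    static double[] ranks(double[] x) {
        int n = x.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(x[a], x[b]));
        double[] r = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && x[idx[j + 1]] == x[idx[i]]) j++;
            double avg = (i + j) / 2.0 + 1.0;
            for (int k = i; k <= j; k++) r[idx[k]] = avg;
            i = j + 1;
        }
        return r;
    }

    /** Pearson correlation of two equal-length arrays. */
    static double pearson(double[] a, double[] b) {
        double ma = Arrays.stream(a).average().orElse(0.0);
        double mb = Arrays.stream(b).average().orElse(0.0);
        double sab = 0, saa = 0, sbb = 0;
        for (int i = 0; i < a.length; i++) {
            sab += (a[i] - ma) * (b[i] - mb);
            saa += (a[i] - ma) * (a[i] - ma);
            sbb += (b[i] - mb) * (b[i] - mb);
        }
        return sab / Math.sqrt(saa * sbb);
    }

    /** Spearman correlation: Pearson correlation of the ranks. */
    static double spearman(double[] a, double[] b) {
        return pearson(ranks(a), ranks(b));
    }

    /**
     * Walks the cursor one row at a time. Only the query values and the
     * current row are held; each row becomes garbage once its score is stored.
     */
    static Map<String, Double> correlate(double[] query, Iterable<Row> cursor) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Row row : cursor) {
            result.put(row.entityId, spearman(query, parse(row.values)));
        }
        return result;
    }

    public static void main(String[] args) {
        double[] query = {1, 2, 3, 4};
        List<Row> rows = List.of(new Row("GENE_A", "10,20,30,40"), new Row("GENE_B", "4,3,2,1"));
        System.out.println(correlate(query, rows));
    }
}
```

In this sketch the only state that grows with the number of rows is the map of per-entity scores; the parsed values and split strings for each row are eligible for garbage collection as soon as the loop moves on.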
This screenshot shows example memory telemetry highlighting the peak memory consumption (5.73 GB) with the introduction of cursors:
Timings captured indicate that cursors do not add any overhead to satisfying the CoExpression request. In all captured cases, the cursor implementation outperformed the non-cursor implementation.
In this example, the pre-cursor code completes in ~41 seconds:
With the cursor code, the request takes 38 seconds:
While results will probably vary based on the host environment and the datasets evaluated, the introduction of cursors does seem to benefit the CoExpression service implementation.