Skip to content
This repository has been archived by the owner on Jan 11, 2022. It is now read-only.

RNA insert metrics #234

Merged
merged 19 commits into from
Jan 8, 2020
Merged

RNA insert metrics #234

merged 19 commits into from
Jan 8, 2020

Conversation

morsecodist
Copy link
Contributor

@morsecodist morsecodist commented Jan 3, 2020

Description

This change computes insert size metrics for all paired end DNA samples and all paired end RNA samples provided we have a gtf file for the host genome.

Version

Tests

  • I have verified in IDseq staging that the pipeline still completes successfully:
    • for single-end inputs
    • for paired-end inputs
    • for FASTQ inputs
    • for FASTA inputs.
  • I have validated that my change does not introduce any correctness bugs to existing output types.
  • I have validated that my change does not introduce significant performance regressions or I have discussed with the team that the benefits of the change are substantial enough that we're comfortable accepting the size of the measured performance penalty.

@morsecodist
Copy link
Contributor Author

I have been unable to find any fasta files despite searching all of our samples. Am I searching incorrectly?

README.md Outdated
@@ -232,6 +232,9 @@ Version numbers for this repo take the form X.Y.Z.
- We increase X for a paradigm shift in how the pipeline is conceived. Example: adding a de-novo assembly step and then reassigning hits based on the assembled contigs.
Changes to X or Y force recomputation of all results when a sample is rerun using idseq-web. Changes to Z do not force recomputation when the sample is rerun - the pipeline will lazily reuse existing outputs in AWS S3.

- 3.15.5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be 3.15.0, no?

Copy link
Contributor

@gregdingle gregdingle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! Please fix the version number, and I made some requests for clarification.

Also, it's not clear how you tested this in RNA and DNA modes... are both represented in examples/generic_test_dag.json?

idseq_dag/util/s3.py Show resolved Hide resolved
examples/generic_test_dag.json Show resolved Hide resolved
@@ -1,2 +1,2 @@
''' idseq_dag '''
__version__ = "3.14.5"
__version__ = "3.15.5"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3.15.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still hasn't changed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this

idseq_dag/steps/run_star.py Show resolved Hide resolved
idseq_dag/steps/run_star.py Show resolved Hide resolved
Copy link
Contributor

@gregdingle gregdingle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good comments but version number still hasn't fully changed.

@@ -1,2 +1,2 @@
''' idseq_dag '''
__version__ = "3.14.5"
__version__ = "3.15.5"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still hasn't changed

@morsecodist morsecodist changed the title Rna insert metrics RNA insert metrics Jan 8, 2020
@morsecodist morsecodist merged commit 36e7ea8 into master Jan 8, 2020
@morsecodist morsecodist deleted the rna-insert-metrics branch January 8, 2020 17:15
def list_s3_keys(s3_path_prefix):
with botolock:
rate_limit_boto()
return _list_s3_keys(s3_path_prefix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I am late to the party here but I have a number of concerns with the use of boto3 regarding the fact that it tends to implicitly use a global session object that cannot be shared between threads.

Is there a way to rewrite this without using boto3? If not, would you mind creating your own dedicated boto session for every client ant paginator creation?

You could look for instance here https://stackoverflow.com/questions/52820971/is-boto3-client-thread-safe but by no means is that the only place where its thread safety is questioned. The concern that the implicit global boto session should not be used from multiple threads is documented in the official boto docs. All code in s3.py is meant to be thread safe, and this example is even more important because the paginator is a long lived object (a generator).

If you must use boto3, it seems to me the botolock and the rate_limit_boto() functions are placed in the wrong place. Those should protect lines 92 and 93 above, where the first boto3 operations occur, not this line. It's also unclear to me if the paginator

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants