
trivial spark tool for extracting original SAM records based on a file … #3589

Merged

merged 1 commit into master on Sep 24, 2017

Conversation

SHuang-Broad (Contributor):

…containing read names. PrintReadsSpark requires the BAM to be coordinate-sorted; this tool doesn't.

I find myself frequently using this tool to look at the SAM records of particular templates. It may be useful for others as well, so I'm putting it in but hiding it from the help output.

@vruano mind reviewing?
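
For context, the core of the tool, pieced together from the snippets quoted in the review below, is to broadcast the set of read names and keep only the reads whose names are in that set. A minimal sketch follows; the names getUnfilteredReads, writeReads, parseReadNames, readNameFile and outputSAM come from the quoted snippets, while the override signature and everything else is assumed and simplified:

    // Sketch only: assumes the GATKSparkTool plumbing shown in the review snippets below.
    @Override
    protected void runTool(final JavaSparkContext ctx) {
        // Broadcast the (small) set of read names once so every executor can filter locally.
        final Broadcast<Set<String>> namesToLookForBroadcast = ctx.broadcast(parseReadNames());

        // Keep exactly the records whose read name is in the set; no coordinate sort required.
        final JavaRDD<GATKRead> reads =
                getUnfilteredReads()
                        .filter(read -> namesToLookForBroadcast.getValue().contains(read.getName()));

        writeReads(ctx, outputSAM, reads);
    }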

codecov-io commented Sep 19, 2017:

Codecov Report

Merging #3589 into master will increase coverage by 0.017%.
The diff coverage is 88.235%.

@@               Coverage Diff               @@
##              master     #3589       +/-   ##
===============================================
+ Coverage     79.736%   79.753%   +0.017%     
- Complexity     18164     18173        +9     
===============================================
  Files           1218      1219        +1     
  Lines          66671     66688       +17     
  Branches       10430     10431        +1     
===============================================
+ Hits           53161     53186       +25     
+ Misses          9297      9288        -9     
- Partials        4213      4214        +1
Impacted Files | Coverage Δ | Complexity Δ
...ls/ExtractOriginalAlignmentRecordsByNameSpark.java | 88.235% <88.235%> (ø) | 6 <6> (?)
...er/tools/spark/sv/discovery/AlignmentInterval.java | 88.889% <0%> (+0.463%) | 52% <0%> (+1%) ⬆️
...oadinstitute/hellbender/utils/gcs/BucketUtils.java | 78.571% <0%> (+1.948%) | 39% <0%> (ø) ⬇️
...ute/hellbender/tools/spark/sv/utils/FileUtils.java | 24% <0%> (+24%) | 2% <0%> (+2%) ⬆️

vruano self-assigned this Sep 19, 2017

vruano (Contributor) left a comment:

Some minor changes... what about integration testing?

private String outputSAM;

@Argument(doc = "to require RG tag on reads or not [false]", shortName = "rg",
fullName = "requireRG", optional = true)

Reviewer: I would define constants for the full and short names of all these arguments that do not reference the StandardArgumentDefinitions constants.

vruano (Sep 21, 2017): Address the comments above and below.

Author (SHuang-Broad): Added a short name for readNameFile.

final Broadcast<HashSet<String>> namesToLookForBroadcast = ctx.broadcast(parseReadNames());

final JavaRDD<GATKRead> reads =
getUnfilteredReads().repartition(80)

Reviewer: Why 80? Magic number?

Author (SHuang-Broad): That was historically put there for performance tuning, before switching to writeReads. Removed.

final HashSet<String> namesToLookFor = new HashSet<>();
String line;
while ( (line = rdr.readLine()) != null ) {
namesToLookFor.add(line.replace("@", "")

Reviewer: Is it guaranteed that sequences such as "@", "/1", or "/2" cannot appear in the middle of a read name? Otherwise you could use regexes ("^@") and ("/\d$").

Reviewer: Perhaps then you would call replaceAll or replaceFirst instead.
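
A minimal sketch of that suggestion, with anchored regexes so that only a leading "@" and a trailing "/1" or "/2" are stripped; the helper name is hypothetical:

    // Strip only a leading FASTQ-style "@" and a trailing mate suffix such as "/1" or "/2",
    // leaving any such characters in the middle of the name untouched.
    private static String normalizeReadName(final String rawName) {
        return rawName.replaceFirst("^@", "")
                      .replaceAll("/\\d$", "");
    }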

vruano (Sep 19, 2017): Also, you could use BufferedReader's Stream<String> lines():

rdr.lines().map(s -> s.replace(...).replace(...)).collect(Collectors.toSet());

But remember to capture UncheckedIOException.
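
A minimal sketch of that suggestion, combining lines() with the anchored-regex normalization above; parseReadNames, readNameFile and BucketUtils.openFile follow the snippets quoted in this thread, while the exception handling is an assumption:

    private Set<String> parseReadNames() {
        try (final BufferedReader rdr = new BufferedReader(
                new InputStreamReader(BucketUtils.openFile(readNameFile)))) {
            // lines() wraps IOExceptions from the underlying reader in UncheckedIOException,
            // so that has to be caught alongside the checked IOException from close().
            return rdr.lines()
                      .map(line -> line.replaceFirst("^@", "").replaceAll("/\\d$", ""))
                      .collect(Collectors.toSet());
        } catch (final IOException | UncheckedIOException e) {
            throw new GATKException("Failed to read the read name file " + readNameFile, e);
        }
    }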


try ( final BufferedReader rdr =
new BufferedReader(new InputStreamReader(BucketUtils.openFile(readNameFile))) ) {
final HashSet<String> namesToLookFor = new HashSet<>();

Reviewer: Be as unspecific as possible... here namesToLookFor could be declared as Set<String>.
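
That is, declare the local against the interface and keep the concrete type only on the right-hand side:

    final Set<String> namesToLookFor = new HashSet<>();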

Author (SHuang-Broad): Done.

@BetaFeature
public final class ExtractOriginalAlignmentRecordsByNameSpark extends GATKSparkTool {
private static final long serialVersionUID = 1L;
private final Logger localLogger = LogManager.getLogger(ExtractOriginalAlignmentRecordsByNameSpark.class);

Reviewer: I think that logger has already been initialized appropriately; you don't need to declare another logger here unless I'm missing something.

Author (SHuang-Broad): I've been doing this for a while. Seems unnecessary. Removed. Thanks!


@Argument(doc = "to require RG tag on reads or not [false]", shortName = "rg",
fullName = "requireRG", optional = true)
private boolean require = false;

Reviewer: It seems that require is never used.

Author (SHuang-Broad): Removed.

vruano commented Sep 19, 2017:

Done for now, please take a look at the suggested changes.
@SHuang-Broad

vruano assigned SHuang-Broad and unassigned vruano Sep 19, 2017

vruano (Contributor) left a comment:

Please address comments.

SHuang-Broad (Contributor, Author) left a comment:

Made the requested changes and also added an integration test.
Back to you, @vruano. Thanks!

lbergelson (Member): It's weird that PrintReadsSpark requires a coordinate-sorted BAM; we should fix that.

SHuang-Broad (Author): @lbergelson There's an old ticket, #929.

SHuang-Broad assigned vruano and unassigned SHuang-Broad Sep 22, 2017

public void testExtractOriginalAlignmentRecordsByNameSparkRunnableLocal() throws IOException {

final File tempWorkingDir = BaseTest.createTempDir("extractOriginalAlignmentRecordsByNameSparkIntegrationTest");
tempWorkingDir.deleteOnExit();

Reviewer: createTempDir already calls IOUtils.deleteRecursivelyOnExit().

In any case, deleteOnExit() won't work with directories; at best, if the directory is empty, it will be deleted.
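
For illustration only (this is not GATK's IOUtils implementation): File.deleteOnExit() registers a single path and cannot remove a non-empty directory, so a recursive cleanup has to walk the tree deepest-first, for example from a JVM shutdown hook. The helper below is hypothetical:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Comparator;
    import java.util.stream.Stream;

    final class TempDirCleanup {
        // Hypothetical helper: delete a directory tree when the JVM exits.
        static void deleteRecursivelyOnExit(final Path dir) {
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                try (final Stream<Path> paths = Files.walk(dir)) {
                    paths.sorted(Comparator.reverseOrder())   // children before their parents
                         .forEach(p -> p.toFile().delete());  // best-effort delete at exit
                } catch (final IOException e) {
                    // nothing useful to do during shutdown
                }
            }));
        }
    }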

Author (SHuang-Broad): Done.

expectedHeader = readsSource.getHeader();
expectedRecords = Utils.stream(readsSource.iterator()).filter(r -> r.getName().equals("asm013903:tig00002"))
.sorted(Comparator.comparingInt(GATKRead::getAssignedStart)).map(r -> r.convertToSAMRecord(expectedHeader)).collect(Collectors.toList());

Reviewer: Blank line.

Author (SHuang-Broad): Done.

try (final ReadsDataSource readsSource = new ReadsDataSource(IOUtils.getPath(tempWorkingDir+"/names.bam"))) {
Assert.assertEquals(expectedHeader, readsSource.getHeader());
final List<SAMRecord> samRecords = Utils.stream(readsSource.iterator()).map(r -> r.convertToSAMRecord(expectedHeader)).collect(Collectors.toList());
Assert.assertEquals(expectedRecords.stream().map(SAMRecord::getSAMString).collect(Collectors.toList()),

Reviewer: I think the expected and actual arguments to Assert.assertEquals are in the reverse order relative to its signature.
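
For context, assuming these tests use TestNG's org.testng.Assert (whose assertEquals takes the actual value first, the reverse of JUnit's convention), the call in the quoted hunk would read:

    // org.testng.Assert declares assertEquals(actual, expected), so the records read back
    // from the output come first and the expected records second.
    Assert.assertEquals(
            samRecords.stream().map(SAMRecord::getSAMString).collect(Collectors.toList()),
            expectedRecords.stream().map(SAMRecord::getSAMString).collect(Collectors.toList()));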

Author (SHuang-Broad): Done.

.filter(read -> namesToLookForBroadcast.getValue().contains(read.getName())).cache();
localLogger.info("Found these many alignments: " + reads.count());

getUnfilteredReads().filter(read -> namesToLookForBroadcast.getValue().contains(read.getName())).cache();

Reviewer: I wonder what the CPU/wall-time impact of this cache is, just for the sake of outputting the count in the INFO message below. Is it really worth it?

Author (SHuang-Broad): There's a count after the write, so I prefer to cache it.

Reviewer: Yeah... but is the insight you get by outputting the count worth the extra CPU/wall time? In any case, that one was just a comment; no change needed.
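
For context, a minimal sketch of the pattern being discussed (names follow the snippets quoted above; the writeReads call and the inherited logger field are assumptions): caching the filtered RDD means the write and the subsequent count() reuse the same materialized data instead of re-reading and re-filtering the input.

    // Sketch only: cache once, then run two actions (write + count) against the cached RDD.
    final JavaRDD<GATKRead> reads =
            getUnfilteredReads()
                    .filter(read -> namesToLookForBroadcast.getValue().contains(read.getName()))
                    .cache();                                   // materialized on the first action

    writeReads(ctx, outputSAM, reads);                          // first action: write matching records
    logger.info("Found these many alignments: " + reads.count()); // second action hits the cache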

vruano commented Sep 23, 2017:

@SHuang-Broad looks good, you can merge at your discretion.

vruano assigned SHuang-Broad and unassigned vruano Sep 23, 2017
SHuang-Broad merged commit ac89f91 into master Sep 24, 2017
SHuang-Broad deleted the sh_sam_extraction_by_readnames branch September 24, 2017 16:46