R Chart/PDF output filenames escape '%' with '%%' #1671

alanhoyle · 2021-04-27T16:41:07Z

Description

The output filenames for R charts now escape % characters as %%.

It does this by running .replaceAll("%", "%%") on the PDF chart output filename that gets sent to the Rscript.
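For illustration, here is a minimal sketch of that escaping step in isolation. The class name, helper name, and sample filename are hypothetical; the PR applies the replaceAll call directly in each affected tool rather than through a shared helper.

```java
class PercentEscapeSketch {
    // Hypothetical helper wrapping the call described above.
    // "%" has no special meaning in a Java regex, and "%%" contains no
    // "$" or "\", so replaceAll treats both arguments literally here.
    static String escapeForRPdf(final String pdfFilename) {
        return pdfFilename.replaceAll("%", "%%");
    }

    public static void main(String[] args) {
        // Hypothetical filename containing a literal percent sign.
        System.out.println(escapeForRPdf("sample_50%_insert_size.pdf"));
        // prints "sample_50%%_insert_size.pdf"
    }
}
```

Since no regex features are needed, String.replace("%", "%%") would behave the same way.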

This pull fixes #1412 (reported by someone else), and my forum post about the error.

The R pdf() function has an auto-numbering feature (onefile=FALSE) that Picard does not appear to use. However, even when that feature is not enabled, R passes the PDF output filename through a sprintf() call, which treats the % character as a prefix for formatting and variable replacement.

If the filename requested by Picard contains a % character, this results in either Picard failing with a ProcessExecutor invalid 'file' argument error, or (rarely) a malformed filename if the name happens to conform to sprintf() formatting (e.g. a filename of something-%dilution.pdf would result in an output file named something-1ilution.pdf).

This issue can show up in the following Picard tools that use the RExecutor() method:
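The filename mangling described above can be reproduced by analogy in Java, whose String.format handles %d and %% the same way for this simple case. This is only an analogy and an assumption about equivalence for these two patterns; it does not call R, and the class and method names are hypothetical.

```java
class FormatAnalogySketch {
    // Hypothetical stand-in for the page-number substitution R performs on
    // the filename: format the pattern with page number 1.
    static String expand(final String filenamePattern) {
        return String.format(filenamePattern, 1);
    }

    public static void main(String[] args) {
        // Unescaped: "%d" is consumed as an integer placeholder.
        System.out.println(expand("something-%dilution.pdf"));
        // prints "something-1ilution.pdf"

        // Escaped: "%%" collapses back to a single literal "%".
        System.out.println(expand("something-%%dilution.pdf"));
        // prints "something-%dilution.pdf"
    }
}
```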

CollectAlignmentSummaryMetrics
CollectBaseDistributionByCycle
CollectGcBiasMetrics
CollectInsertSizeMetrics
CollectRnaSeqMetrics
CollectRrbsMetrics
CollectWgsMetricsWithNonZeroCoverage
MeanQualityByCycle
QualityScoreDistribution

Another way to solve this would be to make the change in the R scripts in picard/src/main/resources/analysis, by doing something like this before the pdf(outputFile) call:

outputFile <- gsub('%', '%%', outputFile)

It also fixes a typo in a comment in the Dockerfile and removes trailing whitespace on a couple of lines.

Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

Added or modified tests to cover changes and any new functionality
Edited the README / documentation (if applicable)
All tests passing on Travis

Review

Final thumbs-up from reviewer
Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

gbggrant · 2021-04-28T18:40:53Z

Hi @alanhoyle, this looks good. Could you add a test case for this? Perhaps add to CollectInsertSizeMetrics a test case where it generates a PDF with a '%' in the file name?

gbggrant

Please add a test case

alanhoyle · 2021-04-29T02:27:42Z

@gbggrant Please pardon my ignorance: I'm not much of a Java developer. What do I need to do to add a test case?

Taking your suggestion, I assume I need to edit CollectInsertSizeMetricsTest.java. (edit) Creating a new method

How would I confirm that the test passes? Are those things automatically run if I do a Docker build on the source tree?

For the record, I tested my code manually after building a local Docker image; I am just unfamiliar with how to do automated testing in Java.

alanhoyle · 2021-04-29T13:05:51Z

I think I've figured this out and added the method testPercentCharInPdfFilename() to CollectInsertSizeMetricsTest.java. It turns out that my dev environment (VSCode) has some pretty great default extensions that made it easy for me to configure and run tests.

gbggrant

Hi @alanhoyle, looks good. Thanks for adding the test.
It's a pity the substitution couldn't be done in a central place (like RExecutor) to get rid of duplicated code.

gbggrant · 2021-04-29T19:17:58Z

src/test/java/picard/analysis/CollectInsertSizeMetricsTest.java

+            "LEVEL=LIBRARY",
+            "LEVEL=READ_GROUP"
+        };
+        Assert.assertEquals(runPicardCommandLine(args), 0);


Would it make sense to check that the output file is created with the expected name? That is, still with a single %? (Just to future-proof this.)

It would. I'll add that in a new commit.

Added below.
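A check along the lines requested in this thread could be sketched as follows. This is not the test that was committed; the class name, helper name, and file names are hypothetical, and creating an empty file stands in for actually running the tool.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

class PercentPdfCheckSketch {
    // Hypothetical helper: true only if the file exists on disk and its name
    // still contains a single "%" rather than the escaped "%%".
    static boolean existsWithSinglePercent(final File pdf) {
        final String name = pdf.getName();
        return pdf.exists() && name.contains("%") && !name.contains("%%");
    }

    public static void main(String[] args) throws IOException {
        final File dir = Files.createTempDirectory("percent-pdf").toFile();
        final File pdf = new File(dir, "test%.pdf");

        // Stand-in for running the tool: just create an empty file.
        Files.createFile(pdf.toPath());

        System.out.println(existsWithSinglePercent(pdf));                          // true
        System.out.println(existsWithSinglePercent(new File(dir, "test%%.pdf")));  // false

        // Clean up the temporary file and directory.
        pdf.delete();
        dir.delete();
    }
}
```

In a real TestNG test the boolean would presumably be wrapped in an assertion after the runPicardCommandLine(args) call shown in the diff context above.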

gbggrant · 2021-04-29T19:19:52Z

Oh, and the 'Branch' build is currently not working on forks, so if you're all set with this (that is, you don't have anything to add), just say so and I'll do the merge to master.

alanhoyle

I've added an additional test immediately below the selected lines.

alanhoyle · 2021-04-29T19:53:52Z

Added below.

alanhoyle · 2021-04-29T20:07:11Z

I don't think you can put it in RExecutor(), since the parameters passed to the R scripts vary from tool to tool. If you did a blanket substitution in there, you'd end up breaking things, because the R script would no longer be able to find the input files used to generate the figures.

As I said in the original request, it might make slightly more sense to incorporate this directly into the R scripts at the pdf(outfile) calls, but that doesn't solve the duplicated code issue. I note that one of the R scripts (gcBias.R) explicitly calls pdf(outputFile, onefile=TRUE), and the implementation of the onefile feature seems to be what causes the problematic behavior. If the R project ever changes pdf() so that it skips the sprintf() on the filename when onefile=TRUE, Picard would end up doubling the % characters in the PDF filenames, but it would at least still run to completion.
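To make the RExecutor concern concrete, here is a small sketch comparing a blanket substitution over every script argument with escaping only the PDF argument. The argument layout, file names, and class and method names are hypothetical and do not reflect the real RExecutor signature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SelectiveEscapeSketch {
    static String escapePercent(final String s) {
        return s.replace("%", "%%");
    }

    // What a blanket substitution in a central place would do:
    // every argument is rewritten, including input paths.
    static List<String> escapeEverything(final List<String> scriptArgs) {
        final List<String> out = new ArrayList<>();
        for (final String arg : scriptArgs) {
            out.add(escapePercent(arg));
        }
        return out;
    }

    // What each tool can do instead: escape only the argument it knows
    // to be the PDF output filename.
    static List<String> escapePdfOnly(final List<String> scriptArgs, final int pdfIndex) {
        final List<String> out = new ArrayList<>(scriptArgs);
        out.set(pdfIndex, escapePercent(out.get(pdfIndex)));
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical layout: input metrics file first, PDF output second.
        final List<String> scriptArgs = Arrays.asList("run_50%.metrics.txt", "run_50%.chart.pdf");

        System.out.println(escapeEverything(scriptArgs));
        // prints "[run_50%%.metrics.txt, run_50%%.chart.pdf]"; the input path no longer matches the file on disk

        System.out.println(escapePdfOnly(scriptArgs, 1));
        // prints "[run_50%.metrics.txt, run_50%%.chart.pdf]"
    }
}
```

Because the position of the PDF argument differs between tools, only the calling tool knows which index to escape.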

gbggrant · 2021-04-29T20:35:04Z

Yeah, I realized it couldn't go in RExecutor. Thanks for the good work.

alanhoyle · 2021-04-29T21:24:19Z

Before you merge this, @gbggrant: I also have a few changes to the .gitignore that others might find useful if they use a similar build environment (see below).

Should I commit those here, open a separate pull request, or just move them into my global .gitignore?

*.class
.project
org.eclipse*
bin/
.classpath
*.swp

gbggrant · 2021-04-30T17:54:40Z

I think we'd prefer that you not commit your '.gitignore' changes. Those all seem fairly specific to your environment and so might confuse others. At a minimum, they should go in another PR.

R Chart output filenames escape '%' to '%%'

42741ee

gbggrant requested changes Apr 28, 2021

alanhoyle added 3 commits April 28, 2021 22:28

Merge branch 'master' into master

95deab5

Test case for CollectInsertSizeMetrics %.pdf

aaa35af

Merge branch 'master' of github.com:alanhoyle/picard

17c8b09

alanhoyle requested a review from gbggrant April 29, 2021 19:11

gbggrant approved these changes Apr 29, 2021

Specific test for % PDF existence.

5bf7475

alanhoyle commented Apr 29, 2021

gbggrant merged commit 949d7f9 into broadinstitute:master Apr 30, 2021
