Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

R Chart/PDF output filenames escape '%' with '%%' #1671

Merged
merged 5 commits into from Apr 30, 2021

Conversation

alanhoyle
Copy link
Contributor

Description

The output filenames for R charts now escape % characters to %%

It does this by taking running .replaceAll("%", "%%") on the PDF chart output filename that gets sent to the Rscript.

This pull fixes #1412 (reported by someone else), and my forum post about the error.

The R pdf() function has an auto-numbering feature (onefile=false) that Picard does not appear to use. However, even when not enabled, R sends the PDF output filename through a sprintf() function which uses the % character as a prefix for formatting and variable replacement.

If the filename requested by Picard contains a % character, this results in either Picard failing due to a ProcessExecutor invalid 'file' argument error, or (rarely) malformed filenames if the filename happens to conform sprintf() formatting (e.g. if the filename happened to be something-%dilution.pdf it would result in an output file something-1ilution.pdf)

This issue can show up in the following Picard tools that use RExecutor() method

  • CollectAlignmentSummaryMetrics
  • CollectBaseDistributionByCycle
  • CollectGcBiasMetrics
  • CollectInsertSizeMetrics
  • CollectRnaSeqMetrics
  • CollectRrbsMetrics
  • CollectWgsMetricsWithNonZeroCoverage
  • MeanQualityByCycle
  • QualityScoreDistribution

Another way to solve this would be to make the change in the RScripts in picard/src/main/resources/analysis by doing something like this before the pdf(outputFile)

outputFile <- gsub('%', '%%', outputFile)

It also fixes a typo in a comment in the Dockerfile and removes trailing whitespace on a couple of lines.


Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@gbggrant
Copy link
Contributor

Hi @alanhoyle this looks good. Could you add a test case for this? Perhaps add to CollectInsertSizeMetrics a test case where it's generating a pdf with an '%' in the file name?

Copy link
Contributor

@gbggrant gbggrant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test case

@alanhoyle
Copy link
Contributor Author

alanhoyle commented Apr 29, 2021

@gbggrant Please pardon my ignorance: I'm not much of a Java developer. What do I need to do to add a test case?

Taking your suggestion, I assume I need to edit `CollectInsertSizeMetricsTest.java. (edit) Creating a new method

How would I confirm that the test passes? Are those things automatically run if I do a Docker build on the source tree?

For the record, I tested my code manually after building a local docker image, I am just unfamiliar with how to do automated testing in Java.

@alanhoyle
Copy link
Contributor Author

alanhoyle commented Apr 29, 2021

I think I've figured this out and added method testPercentCharInPdfFilename() to CollectInsertSizeMetricsTest.java. It turns out that my dev environment (VSCode) has some pretty great default extensions that made it easy for me to configure and run tests.

Screen Shot 2021-04-29 at 9 04 05 AM

@alanhoyle alanhoyle requested a review from gbggrant April 29, 2021 19:11
Copy link
Contributor

@gbggrant gbggrant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alanhoyle looks good - thanks for adding the test.
It's a pity the substitution couldn't be done in a central place (like RExecutor) to get rid of duplicated code.

"LEVEL=LIBRARY",
"LEVEL=READ_GROUP"
};
Assert.assertEquals(runPicardCommandLine(args), 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to check that the outfile is created with the expected name? That is, still with a single %? (just to future proof this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would. I'll add that in a new commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added below.

@gbggrant
Copy link
Contributor

Oh, and the 'Branch' build is currently not working on forks, so if you're all set with this (that is you don't have anything to add), just say so and I'll do the merge to master.

Copy link
Contributor Author

@alanhoyle alanhoyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added an additional test immediately below the selected lines.

"LEVEL=LIBRARY",
"LEVEL=READ_GROUP"
};
Assert.assertEquals(runPicardCommandLine(args), 0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added below.

@alanhoyle
Copy link
Contributor Author

Hi @alanhoyle looks good - thanks for adding the test.
It's a pity the substitution couldn't be done in a central place (like RExecutor) to get rid of duplicated code.

I don't think you can put it in RExecutor() since the parameters passed to the R scripts vary from tool to tool. If you do a blanket substitution in there, you'd end up breaking things where it wouldn't know how to find the input files used to generate the figures.

As I said in the original request, it might make slightly more sense to incorporate this directly into the R scripts in the pdf(outfile) calls, but that doesn't solve the duplicated code issue. I note that in one of the R scripts (gcBias.R), it explictly says pdf(outputFile, onefile=TRUE); and the onefile feature implementation seems to be the thing that causes the problematic behavior. If RProject changes the behavior to where when onefile=TRUE that it doesn't do the sprintf() on the filename, it would result in Picard doubling the % chars in the PDF filenames, but it would at least still run to completion.

@gbggrant
Copy link
Contributor

Hi @alanhoyle looks good - thanks for adding the test.
It's a pity the substitution couldn't be done in a central place (like RExecutor) to get rid of duplicated code.

I don't think you can put it in RExecutor() since the parameters passed to the R scripts vary from tool to tool. If you do a blanket substitution in there, you'd end up breaking things where it wouldn't know how to find the input files used to generate the figures.

As I said in the original request, it might make slightly more sense to incorporate this directly into the R scripts in the pdf(outfile) calls, but that doesn't solve the duplicated code issue. I note that in one of the R scripts (gcBias.R), it explictly says pdf(outputFile, onefile=TRUE); and the onefile feature implementation seems to be the thing that causes the problematic behavior. If RProject changes the behavior to where when onefile=TRUE that it doesn't do the sprintf() on the filename, it would result in Picard doubling the % chars in the PDF filenames, but it would at least still run to completion.

Yeah, I realized It couldn't go in RExecutor. Thanks for the good work.

@alanhoyle
Copy link
Contributor Author

alanhoyle commented Apr 29, 2021

Before you merge this, I also have a few changes to the .gitignore that others might find useful if they use a similar build environment. (see below) @gbggrant ?

Should I commit/add those, do a separate pull request, or just move those into my global .gitignore?

*.class
.project
org.eclipse*
bin/
.classpath
*.swp

@gbggrant
Copy link
Contributor

I think we'd prefer that you not commit your '.gitignore' - those all seem fairly specific to your environment and so might cause others confusion. At the minimum it should go in another PR.

@gbggrant gbggrant merged commit 949d7f9 into broadinstitute:master Apr 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

paths continaing odd charaters not properly escaped in call to R
2 participants