-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
F-Seq, trimmer choices, peak count #17
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good. a few questions.
pipelines/ATACseq.py
Outdated
peak_output_file = os.path.join(peak_folder, args.sample_name + "_peaks.narrowPeak") | ||
peak_input_file = shift_bed | ||
|
||
if args.peak_caller not in PEAK_CALLERS: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put the failures at the beginning, so it doesn't get partway through and then fail (fail early).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
pipelines/ATACseq.py
Outdated
if args.peak_caller not in PEAK_CALLERS: | ||
raise ValueError("Peak caller not in {}: '{}'".format(PEAK_CALLERS, args.peak_caller)) | ||
|
||
if args.peak_caller == "fseq": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having trouble following the fseq section -- what is fseq_optnames for? it looks like it's trying to grab params right out of param
and not, for example, param.fseq
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoa, good catch! yes, it needs to change to param.fseq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That variable itself allows the options to be specified in the config (ATACseq.yaml
) as either the true one- or two-character option names that are used by fseq
, or as more human-readable names for what they represent. Probably not an issue for someone super familiar with fseq
, but my thinking is nice for others to not need to memorize those or constantly refer back to the documentation to see what represents what if they want to make a change.
pipelines/ATACseq.py
Outdated
import pypiper | ||
|
||
|
||
PEAK_CALLERS = ["fseq", "macs2"] | ||
TRIMMERS = ["trimmomatic", "pyadapt", "skewer"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is great, a perfect use of these variables. Nice!
pipelines/ATACseq.py
Outdated
help='Reference peak set for calculating FRIP') | ||
|
||
parser.add_argument('--pyadapt', action="store_true", | ||
help="Use pyadapter_trim for trimming? [Default: False]") | ||
parser.add_argument("--peak-caller", dest="peak_caller", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call to do this on peak callers as well!
pipelines/ATACseq.py
Outdated
cmd = cmds | ||
|
||
else: # default to trimmomatic | ||
if args.trimmer not in TRIMMERS: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can move this failure up, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just going to get rid of these, I think--they'll always pass so long as the args
attribute each for trimmer
and for peak_caller
isn't modified within main
. argparse
will catch any invalid specs for these via choices
. I get paranoid with multi-branch conditional blocks in which there's a default, but I think it's fine to let it go here ;D
pipelines/ATACseq.py
Outdated
base = os.path.join(tools.scripts_dir, "pyadapter_trim.py") | ||
flags = {"-a": local_input_files[0], "-b": local_input_files[1], "-o": out_fastq_pre} | ||
flag_text = " ".join(["{} {}".format(flag, value) for flag, value in flags]) | ||
cmd = "{} {} -u".format(base, flag_text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one advantage of the odd +=
method employed previously is that it's really easy to comment out individual options. that's the rationale there... we can discuss
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about:
flags = [
("-a", local_input_files[0]),
("-b", "local_input_files[1]")
...
]
Then we'd get both benefits -- easy to comment out the individual options, and any typing errors would still be in Python syntax rather than logical/part of the actual command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it was good that you drew attention to this--right now the iteration should be over flags.items()
in the pyadapt
section in the flag_text
comprehension; I accidentally wrote the skewer
and pyadapt
sections with different builtins 😱
pipelines/ATACseq.py
Outdated
cmd1 += " -o {0}".format(out_fastq_pre) | ||
cmd1 += " {0}".format(local_input_files[0]) | ||
|
||
if args.paired_end: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
local_input_files[1] if args.paired else None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was looking for this and then remembered -- I took out these conditionals because of where args.paired_end
is always set to True
at the start of main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but that is there because we are trying to accept either paired or not. right now it's paired only but this will change (hopefully soon). so if you take it out where it's already set up, we will have to add it back in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK well I already changed all instances of this so I'll have to revert those
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad I shoulda picked up on that's what was going on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no worries, I don't think we had much in there, it was mostly relics from previous stuff...
pipelines/ATACseq.py
Outdated
|
||
parser.add_argument("--prealignments", default=[], type=str, nargs="+", | ||
help="Space-delimited list of reference genomes to align to before primary alignment.") | ||
|
||
# F-seq as peak caller | ||
parser.add_argument("--fragment-size", type=int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parameters to be passed to specific tools we usually do through the pipeline yaml file, rather than as a command-line argument; these can become quite numerous and would cloud the command line space, and are also usually modulated at the project level (not the sample level) meaning they are best included in the config file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK yeah good thinking. I added this when I was playing around with getting fseq
working as the peak caller on the test_project
, but I'll remove it now. Totally agree with trying to control the expansion of the option/parameter universe.
if not args.input: | ||
parser.print_help() | ||
raise SystemExit | ||
|
||
return args | ||
|
||
|
||
|
||
def build_command(chunks): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🏆
pipelines/ATACseq.py
Outdated
# Rename the logfile. | ||
#skewer_filename_pairs.append(("{}-trimmed.log".format(out_fastq_pre), trimLog)) | ||
|
||
trim_cmd_base = tools.skewer #+ " --quiet" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would propose merging cmd_base
with cmd_options
so there's only cmd_chunks
from the beginning... I'm not sure I see the need to define them separately...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
pipelines/ATACseq.py
Outdated
if args.single_or_paired == "paired": | ||
args.paired_end = True | ||
else: | ||
args.paired_end = True | ||
|
||
# Initialize | ||
outfolder = os.path.abspath(os.path.join(args.output_parent, args.sample_name)) | ||
pm = pypiper.PipelineManager(name="ATACseq", outfolder=outfolder, args=args, version=__version__) | ||
pm = pypiper.PipelineManager(name="ATACseq", outfolder=outfolder, args=args, version=__version__, strict_config=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we think of a way to do this without introducing a reliance on a pypiper dev upgrade? What needs the attribute dict behavior change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not super easily, but wouldn't be a giant lift either. I'd rather go ahead with the upgrade, though, since the open PR there is tiny and the way this is written now shouldn't need to change much--if at all--once pypiper
and looper
share the same AttributeDict
. What about a minor pypiper
release so that the dependency is still on master
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, possible -- but what is the reason for the change? what new thing is using this? I have not run into this before, what is it that needs it?
also I really don't want to make a pipeline developer put another argument in to the PipelineManager
. I'd rather make this the default if we really must change pypiper
pipelines/ATACseq.py
Outdated
if not args.paired_end: | ||
cmd2 = "mv {0} {1}".format(out_fastq_pre + "-trimmed.fastq", trimmed_fastq) | ||
cmds.append(cmd2) | ||
skewer_mode = "pe" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not format this command construction in the same way as the others? I think we should be consistent and construct all commands the same way if we can
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the others a bit more consistent; this one I went a different way so that the sections are more well-defined and grouped together. That is, there's only one conditional check on args.paired_end
, and the mode setting, input files, and rename targets are all handled together, based on that single check. If they're split apart, there's more clutter from multiple checks about paired_end
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless, I'm glad you drew attention here; I'd omitted the skewer_input_files
from the command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happened to the if args.paired end "pe" else None
concept? would that help here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't remember the exact way you suggested...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be one of the spots where the args.paired_end
would be checked. Then it'd be checked separately to determine the input filenames and how to rename them.
pipelines/ATACseq.py
Outdated
|
||
parser.add_argument('--skewer', action="store_true", | ||
help="Use skewer for trimming? [Default: False]") | ||
parser.add_argument("--skip-tss", dest="skip_tss", action="store_true", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would someone want to skip-tss from the cli?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've mentioned a desire to allow someone to run the pipeline outside of looper
; I think more hypothetical users will feel comfortable toggling behavior from the command line than by editing a configuration file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on the fragment-size
front since that's much more internal to the behavior, but this feels more amenable to opt-in/-out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, this is a command-line thing -- I just don't think it needs to be configurable at all, not CLI vs config file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, TSS enrichment should not be optional. this gets into issue #11
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You & @rcorces certainly know better than I do here, so I'll remove this.
…re present; missing/extra commas
# TODO: determine if it's safe to handle this requirement with argparse. | ||
# It may be that communication between pypiper and a pipeline via | ||
# the pipeline interface (and/or) looper, and how the partial argument | ||
# parsing is handled, that makes this more favorable. | ||
if not args.input: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah -- args.input
comes from parser = pypiper.add_pypiper_args(parser, all_args = True)
-- this is standard in all our pipelines. open to other suggestions but this works at least
# filter peaks in blacklist | ||
else: | ||
# MACS2 | ||
macs_cmd_chunks = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this. it looks great!
q: 0.01 | ||
shift: 0 | ||
fseq: | ||
of: npf # narrowPeak as output format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crossed my mind; if you really wanted to hard-code a second backup default, you could add it here in comments; so:
l: 600 # feature length; default: 600
I think it's unnecessary personally, people can just add it in here if they change the original and want a record. but this is place I would do it if I were going to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After pondering a bit more, I'm OK with leaving them off. I'm thinking about it something like this:
"Keep our baseline configuration file as simple as possible; not providing the defaults here will still effect the same behavior from fseq
. If a user really wants to fiddle with fseq
, he/she should go read about the tool itself. Assuming the defaults have been suitably determined, I'd rather not invite tinkering when that may disrupt fseq
in such a way that leads a user to mistakenly believe that there's a problem with the ATACseq pipeline."
Ok, looks fine for me if you're happy with this. So we don't forget: eventually we want |
Yeah definitely, that'll tidy this up a bit and make the command builder more broadly accessible. |
OK, going ahead and merging... |
Aims to address two of the items in #6
Repo-internal version of #16