Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

F-Seq, trimmer choices, peak count #17

Merged
merged 49 commits into from
Jul 19, 2017
Merged

F-Seq, trimmer choices, peak count #17

merged 49 commits into from
Jul 19, 2017

Conversation

vreuter
Copy link
Member

@vreuter vreuter commented Jul 14, 2017

Aims to address two of the items in #6
Repo-internal version of #16

Copy link
Member

@nsheff nsheff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good. a few questions.

peak_output_file = os.path.join(peak_folder, args.sample_name + "_peaks.narrowPeak")
peak_input_file = shift_bed

if args.peak_caller not in PEAK_CALLERS:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put the failures at the beginning, so it doesn't get partway through and then fail (fail early).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

if args.peak_caller not in PEAK_CALLERS:
raise ValueError("Peak caller not in {}: '{}'".format(PEAK_CALLERS, args.peak_caller))

if args.peak_caller == "fseq":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having trouble following the fseq section -- what is fseq_optnames for? it looks like it's trying to grab params right out of param and not, for example, param.fseq ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoa, good catch! yes, it needs to change to param.fseq

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That variable itself allows the options to be specified in the config (ATACseq.yaml) as either the true one- or two-character option names that are used by fseq, or as more human-readable names for what they represent. Probably not an issue for someone super familiar with fseq, but my thinking is nice for others to not need to memorize those or constantly refer back to the documentation to see what represents what if they want to make a change.

import pypiper


PEAK_CALLERS = ["fseq", "macs2"]
TRIMMERS = ["trimmomatic", "pyadapt", "skewer"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great, a perfect use of these variables. Nice!

help='Reference peak set for calculating FRIP')

parser.add_argument('--pyadapt', action="store_true",
help="Use pyadapter_trim for trimming? [Default: False]")
parser.add_argument("--peak-caller", dest="peak_caller",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call to do this on peak callers as well!

cmd = cmds

else: # default to trimmomatic
if args.trimmer not in TRIMMERS:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can move this failure up, too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just going to get rid of these, I think--they'll always pass so long as the args attribute each for trimmer and for peak_caller isn't modified within main. argparse will catch any invalid specs for these via choices. I get paranoid with multi-branch conditional blocks in which there's a default, but I think it's fine to let it go here ;D

base = os.path.join(tools.scripts_dir, "pyadapter_trim.py")
flags = {"-a": local_input_files[0], "-b": local_input_files[1], "-o": out_fastq_pre}
flag_text = " ".join(["{} {}".format(flag, value) for flag, value in flags])
cmd = "{} {} -u".format(base, flag_text)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one advantage of the odd += method employed previously is that it's really easy to comment out individual options. that's the rationale there... we can discuss

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about:

flags = [
    ("-a", local_input_files[0]), 
    ("-b", "local_input_files[1]")
    ...
]

Then we'd get both benefits -- easy to comment out the individual options, and any typing errors would still be in Python syntax rather than logical/part of the actual command.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it was good that you drew attention to this--right now the iteration should be over flags.items() in the pyadapt section in the flag_text comprehension; I accidentally wrote the skewer and pyadapt sections with different builtins 😱

cmd1 += " -o {0}".format(out_fastq_pre)
cmd1 += " {0}".format(local_input_files[0])

if args.paired_end:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

local_input_files[1] if args.paired else None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was looking for this and then remembered -- I took out these conditionals because of where args.paired_end is always set to True at the start of main

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that is there because we are trying to accept either paired or not. right now it's paired only but this will change (hopefully soon). so if you take it out where it's already set up, we will have to add it back in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK well I already changed all instances of this so I'll have to revert those

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad I shoulda picked up on that's what was going on

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no worries, I don't think we had much in there, it was mostly relics from previous stuff...

@vreuter vreuter changed the title fseq option for peak calling, and choice model for trimmer options F-Seq, trimmer choices, peak count Jul 18, 2017

parser.add_argument("--prealignments", default=[], type=str, nargs="+",
help="Space-delimited list of reference genomes to align to before primary alignment.")

# F-seq as peak caller
parser.add_argument("--fragment-size", type=int,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parameters to be passed to specific tools we usually do through the pipeline yaml file, rather than as a command-line argument; these can become quite numerous and would cloud the command line space, and are also usually modulated at the project level (not the sample level) meaning they are best included in the config file

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK yeah good thinking. I added this when I was playing around with getting fseq working as the peak caller on the test_project, but I'll remove it now. Totally agree with trying to control the expansion of the option/parameter universe.

if not args.input:
parser.print_help()
raise SystemExit

return args



def build_command(chunks):
Copy link
Member

@nsheff nsheff Jul 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🏆

# Rename the logfile.
#skewer_filename_pairs.append(("{}-trimmed.log".format(out_fastq_pre), trimLog))

trim_cmd_base = tools.skewer #+ " --quiet"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose merging cmd_base with cmd_options so there's only cmd_chunks from the beginning... I'm not sure I see the need to define them separately...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

if args.single_or_paired == "paired":
args.paired_end = True
else:
args.paired_end = True

# Initialize
outfolder = os.path.abspath(os.path.join(args.output_parent, args.sample_name))
pm = pypiper.PipelineManager(name="ATACseq", outfolder=outfolder, args=args, version=__version__)
pm = pypiper.PipelineManager(name="ATACseq", outfolder=outfolder, args=args, version=__version__, strict_config=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we think of a way to do this without introducing a reliance on a pypiper dev upgrade? What needs the attribute dict behavior change?

Copy link
Member Author

@vreuter vreuter Jul 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not super easily, but wouldn't be a giant lift either. I'd rather go ahead with the upgrade, though, since the open PR there is tiny and the way this is written now shouldn't need to change much--if at all--once pypiper and looper share the same AttributeDict. What about a minor pypiper release so that the dependency is still on master?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, possible -- but what is the reason for the change? what new thing is using this? I have not run into this before, what is it that needs it?

also I really don't want to make a pipeline developer put another argument in to the PipelineManager. I'd rather make this the default if we really must change pypiper

if not args.paired_end:
cmd2 = "mv {0} {1}".format(out_fastq_pre + "-trimmed.fastq", trimmed_fastq)
cmds.append(cmd2)
skewer_mode = "pe"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not format this command construction in the same way as the others? I think we should be consistent and construct all commands the same way if we can

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the others a bit more consistent; this one I went a different way so that the sections are more well-defined and grouped together. That is, there's only one conditional check on args.paired_end, and the mode setting, input files, and rename targets are all handled together, based on that single check. If they're split apart, there's more clutter from multiple checks about paired_end.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless, I'm glad you drew attention here; I'd omitted the skewer_input_files from the command.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happened to the if args.paired end "pe" else None concept? would that help here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't remember the exact way you suggested...

Copy link
Member Author

@vreuter vreuter Jul 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be one of the spots where the args.paired_end would be checked. Then it'd be checked separately to determine the input filenames and how to rename them.


parser.add_argument('--skewer', action="store_true",
help="Use skewer for trimming? [Default: False]")
parser.add_argument("--skip-tss", dest="skip_tss", action="store_true",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would someone want to skip-tss from the cli?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've mentioned a desire to allow someone to run the pipeline outside of looper; I think more hypothetical users will feel comfortable toggling behavior from the command line than by editing a configuration file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on the fragment-size front since that's much more internal to the behavior, but this feels more amenable to opt-in/-out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, this is a command-line thing -- I just don't think it needs to be configurable at all, not CLI vs config file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words, TSS enrichment should not be optional. this gets into issue #11

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You & @rcorces certainly know better than I do here, so I'll remove this.

# TODO: determine if it's safe to handle this requirement with argparse.
# It may be that communication between pypiper and a pipeline via
# the pipeline interface (and/or) looper, and how the partial argument
# parsing is handled, that makes this more favorable.
if not args.input:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah -- args.input comes from parser = pypiper.add_pypiper_args(parser, all_args = True) -- this is standard in all our pipelines. open to other suggestions but this works at least

# filter peaks in blacklist
else:
# MACS2
macs_cmd_chunks = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. it looks great!

q: 0.01
shift: 0
fseq:
of: npf # narrowPeak as output format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crossed my mind; if you really wanted to hard-code a second backup default, you could add it here in comments; so:

l: 600 # feature length; default: 600

I think it's unnecessary personally, people can just add it in here if they change the original and want a record. but this is place I would do it if I were going to

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After pondering a bit more, I'm OK with leaving them off. I'm thinking about it something like this:
"Keep our baseline configuration file as simple as possible; not providing the defaults here will still effect the same behavior from fseq. If a user really wants to fiddle with fseq, he/she should go read about the tool itself. Assuming the defaults have been suitably determined, I'd rather not invite tinkering when that may disrupt fseq in such a way that leads a user to mistakenly believe that there's a problem with the ATACseq pipeline."

@nsheff
Copy link
Member

nsheff commented Jul 19, 2017

Ok, looks fine for me if you're happy with this. So we don't forget: eventually we want build_command in pypiper, but for now having it here is fine.

@vreuter
Copy link
Member Author

vreuter commented Jul 19, 2017

Yeah definitely, that'll tidy this up a bit and make the command builder more broadly accessible.

@vreuter
Copy link
Member Author

vreuter commented Jul 19, 2017

OK, going ahead and merging...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants