
feat: Adding somatic neoepitope preparation workflow #495

Open

wants to merge 27 commits into base: main

Conversation

giacuong171
Contributor

Adding a draft workflow for somatic neoepitope preparation, to be used with pvactools later

@giacuong171 giacuong171 linked an issue Apr 4, 2024 that may be closed by this pull request
@giacuong171 giacuong171 changed the title Adding somatic neoepitope preparation workflow feat: Adding somatic neoepitope preparation workflow Apr 4, 2024

github-actions bot commented Apr 4, 2024

  • Please format your Python code with ruff: make fmt
  • Please check your Python code with ruff: make check
  • Please format your Snakemake code with snakefmt: make snakefmt

You can trigger all lints locally by running make lint

@coveralls

coveralls commented Apr 4, 2024

Coverage Status

coverage: 85.894% (+0.1%) from 85.778% when pulling 03c9714 on 472-adding-neoepitope-prediction-pipeline into fbb3de1 on main.

@giacuong171 giacuong171 self-assigned this Apr 10, 2024
Contributor

@ericblanc20 ericblanc20 left a comment


This is very good, thanks a lot for your efforts. I think that there are still a few points to be fixed, but otherwise it is great.

Contributor

@ericblanc20 ericblanc20 left a comment


One last bit: ensure that the app can deal with columns selected by number rather than by name.
Perhaps it would be good to offer the choice of 4 columns (see the sketch after this list):

  1. The gene or transcript id (with or without version)
  2. The TPMs
  3. If there is no value for the TPM column, then the counts, and
  4. If there is no value for the TPM column, then the feature lengths.

If there is no TPM column and no feature length column, then the GTF is used to get the length of transcripts, and an approximate value for the gene lengths.
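
A rough Python sketch of what this fallback logic could look like (the function name and arguments here are illustrative, not part of the PR):

import pandas as pd


def load_expression(path, id_col=1, tpm_col=None, counts_col=None, length_col=None):
    """Read an expression table, selecting columns by 1-based number."""
    df = pd.read_csv(path, sep="\t", header=None)
    # Accept gene/transcript ids with or without a version suffix
    ids = df.iloc[:, id_col - 1].astype(str).str.replace(r"\.\d+$", "", regex=True)
    if tpm_col is not None:
        tpm = df.iloc[:, tpm_col - 1]
    elif counts_col is not None and length_col is not None:
        # No TPM column: derive TPMs from counts and feature lengths
        rate = df.iloc[:, counts_col - 1] / df.iloc[:, length_col - 1]
        tpm = rate / rate.sum() * 1e6
    else:
        raise ValueError("Need either a TPM column, or counts and feature lengths")
    return pd.DataFrame({"id": ids, "TPM": tpm})

(Deriving approximate gene lengths from the GTF, as mentioned above, is left out of the sketch.)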

mapper="star",
library_name=rna_library,
)
ext = {"expression", "bam", "bai"}
Contributor

What is the purpose of this statement?

Contributor

The nesting is also pretty deep; perhaps it can be restructured (by breaking it into smaller chunks / moving common stuff into dedicated functions) to make it a bit more palatable.

"{mapper}.{var_caller}.{anno_caller}.{tumor_library}"
)
# Needs to change to work with many different tools
key_ext = {"vcf": ".full.vcf.gz", "vcf_tbi": ".full.vcf.gz.tbi"}
Contributor

The choice between the full annotation (.full.vcf.gz) and the selected-transcript-only (.vcf.gz) output may be left to the user.
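
For example, a small sketch of how the user's choice could drive the extensions (the use_full_annotation flag is hypothetical, not something the PR defines):

def annotated_vcf_exts(use_full_annotation: bool = True) -> dict:
    # Full annotation vs. selected transcript only
    suffix = ".full.vcf.gz" if use_full_annotation else ".vcf.gz"
    return {"vcf": suffix, "vcf_tbi": suffix + ".tbi"}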

tools_somatic_variant_calling: [] # default to those configured for somatic_variant_calling
max_depth: "4000"
preparation:
format: 'star' # REQUIRED - The file format of the expression file to process. (stringtie,kallisto,cufflinks,custom)
Contributor

As discussed before, snappy now outputs a 2-column GeneCounts.tab file: the gene id in the first column, and the counts in the second. Therefore, I suggest that the defaults should be:

      format: "custom"
      id-column: 1
      expression-column: 2

Outdated review comments (resolved):
snappy_wrappers/wrappers/pvactools/combining/comb_rna.py
snappy_wrappers/wrappers/pvactools/combining/wrapper.py
Contributor

@tedil tedil left a comment


Some change suggestions from a different perspective than Eric's ;)

mapper="star",
library_name=rna_library,
)
ext = {"expression", "bam", "bai"}
Contributor

The nesting is also pretty deep; perhaps it can be restructured (by breaking it into smaller chunks / moving common stuff into dedicated functions) to make it a bit more palatable.

Outdated review comments (resolved):
snappy_wrappers/wrappers/pvactools/combining/comb_rna.py
snappy_wrappers/wrappers/pvactools/combining/wrapper.py
Contributor

@ericblanc20 ericblanc20 left a comment


Many of the comments here are general musings about how to improve snappy. We should discuss them at some point, and involve Till as much as possible.

)
yield "combine_vcf", prepare_tpl
hla_typing = self.parent.sub_workflows["hla_typing"]
hla_tpl = "output/optitype.{ngs_library}/out/optitype.{ngs_library}.txt"
Contributor

Ideally, the HLA caller tools shouldn't be hard-coded.
However, to keep it general, we should consider the possibility that some HLA caller tools require mapped data.
So there is a possible file naming issue there, and the code needs to do something like the pseudo-code below:

hla_caller_prefix = ""
# Test if the selected HLA caller gets its input from the ngs_mapping step
if "path_ngs_mapping" in self.w_config.step_config.hla_typing[self.config.hla_caller].keys():
    hla_caller_prefix += f"{self.config.mapper}."
hla_caller_prefix += f"{self.config.hla_caller}"

This is ugly and likely to fail (I believe that path_ngs_mapping is defined in the hla_typing step model regardless of whether it is needed), and I don't like it.
Ideally, I think that each step should provide a facility to get the naming prefix, or something like that.

But perhaps there are better ideas, or perhaps we have to live with ugly solutions.
Ideas, comments?
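
As a rough illustration of the "each step provides its naming prefix" idea (the class, method and attributes below are hypothetical, not existing snappy API):

class HlaTypingStep:
    """Hypothetical step object exposing its own output-name prefix."""

    def __init__(self, hla_caller: str, mapper: str, needs_mapping: bool):
        self.hla_caller = hla_caller
        self.mapper = mapper
        self.needs_mapping = needs_mapping

    def name_prefix(self) -> str:
        # Callers that read mapped data include the mapper in their file names
        if self.needs_mapping:
            return f"{self.mapper}.{self.hla_caller}"
        return self.hla_caller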

Contributor

I don't understand why the rules for tools must be put under an if statement.

Contributor Author

I have discussed this bug with Till. The rule hla_typing_arcashla_run gets evaluated even though arcashla is not defined, so this is a quick fix for that bug.

key,
ngs_mapping(
(rna_tpl + ext).format(
mapper=self.w_config.step_config["ngs_mapping"].tools.rna[0],
Contributor

In case there are many tools to quantify expression, I think it's better to give the choice of mapper to the user, rather than take the first one.
Ideally, I would have an rna_mapper entry in the configuration, so the user can choose the one she likes. If it's empty, then snappy takes the first RNA mapper defined in the ngs_mapping configuration step.
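
A minimal sketch of that fallback (rna_mapper is the suggested, not yet existing, configuration key):

def pick_rna_mapper(step_config: dict, ngs_mapping_config: dict) -> str:
    # Honour an explicit rna_mapper choice; otherwise fall back to the first
    # RNA mapper configured in the ngs_mapping step
    return step_config.get("rna_mapper") or ngs_mapping_config["tools"]["rna"][0]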

def _get_output_files_prepare(self):
if self.config["preparation"]["mode"] == "gene":
prefix = (
"work/prepare/{mapper}.{var_caller}.{anno_caller}.GX.{tumor_library}/out/"
Contributor

I think it would be better to store the prepared vcf in work/{mapper}.{var_caller}.{anno_caller}.(GX|TX).{tumor_library}/prepare.
It is more like the rest of the pipeline, where the rule is one subdirectory per sample in work.
The only exceptions are input_links and docker containers, R packages, & any files used during the creation of every output.
(Actually, when I think of it, there should not be an input_links directory; it would be better to have work/xxx.{library}/input_links, what do you think?)
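
Spelled out, the prefix for the gene-level case might then look like this (the file-name part under prepare/ is an assumption):

prefix = (
    "work/{mapper}.{var_caller}.{anno_caller}.GX.{tumor_library}/prepare/"
    "{mapper}.{var_caller}.{anno_caller}.GX.{tumor_library}"
)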

@dictify
def _get_output_files_predict(self):
key_ext = {"all_epitopes": ".all_epitopes.tsv", "filtered_epitopes": ".filtered.tsv"}
prefix = "work/predict/{mapper}.{var_caller}.{anno_caller}.{mode}.{tumor_library}.epitopes/out/MHC_Class_I/{tumor_library}"
Contributor

Same here, work/{mapper}.{caller}.{anno_caller}.pvacseq.{tumor_library}/out/{mapper}.{caller}.{anno_caller}.pvacseq.{tumor_library}.{mode}.MHC_Class_I might be a better prefix.
Another comment: I am not too keen on including all tools in directories & filenames. If we want to follow the rule strictly, we should add the HLA typing tool, and the RNA mapper. We might also add adapter trimming, for example.
I think it is unmanageable to add all the tools, and we might need to think about alternative naming schemes.
Ideas, comments?

else ""
)

op_dir = "/".join(snakemake.output.all_epitopes.split("/")[:-2])
Contributor

Please replace "/".join with os.path.join
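
For example, keeping the same behaviour of going two directory levels up:

import os


def output_dir(all_epitopes_path: str) -> str:
    # Equivalent to "/".join(path.split("/")[:-2]), but built with os.path
    return os.path.dirname(os.path.dirname(all_epitopes_path))

# In the wrapper: op_dir = output_dir(snakemake.output.all_epitopes)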


op_dir = "/".join(snakemake.output.all_epitopes.split("/")[:-2])

files_to_bind = {
Contributor

I am not sure that you currently need to bind any directory. My understanding is that the pvacseq command requires only files within the step directory, both for input and output.

__author__ = "Pham Gia Cuong"
__email__ = "pham.gia-cuong@bih-charite.de"

step = snakemake.config["pipeline_step"]["name"]
Contributor

step & config should not be needed anymore, is that correct?

{maximum_transcript_support_level}"
echo 'TMPDIR=/bindings/d2' > $TMPDIR/{snakemake.wildcards.tumor_library}.sh
echo $cmd >> $TMPDIR/{snakemake.wildcards.tumor_library}.sh
apptainer exec --home $PWD -B $TMPDIR:/bindings/d2 {bindings} {config[path_container]} bash /bindings/d2/{snakemake.wildcards.tumor_library}.sh
Contributor

The path to the container should be passed to the wrapper as a parameter.
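
A possible sketch (the params key and the config path shown here are assumptions; snakemake is the object Snakemake injects into the wrapper):

# In the rule definition, expose the container path as a parameter, e.g.
#     params:
#         container=config["step_config"]["somatic_neoepitope_prediction"]["path_container"]
#
# In the wrapper, read it from the parameters instead of the global config:
container = snakemake.params.container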

{NORMAL_VAF}\
{exclude_NAs}\
{maximum_transcript_support_level}"
echo 'TMPDIR=/bindings/d2' > $TMPDIR/{snakemake.wildcards.tumor_library}.sh
Contributor

Why is the command written to a temporary script? Isn't it possible to pass it as a string?
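
For illustration, something along these lines might work (a sketch, assuming cmd, bindings and container are already defined as in the wrapper; quoting is handled with shlex):

import shlex

from snakemake.shell import shell

# Run the assembled pvacseq command directly inside the container,
# instead of writing it to a temporary script first
shell(
    f"apptainer exec --home $PWD {bindings} {container} bash -c {shlex.quote(cmd)}"
)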


Successfully merging this pull request may close these issues.

Adding neoepitope prediction pipeline
4 participants