
consider determining the exact output files from the start for complete compatibility #12

Open
sreichl opened this issue Mar 30, 2024 · 4 comments

sreichl commented Mar 30, 2024

Give the DEA module explicit outputs so it can be used as a module followed by downstream modules (avoiding the missing-input exception), e.g., providing its results as input for enrichment analysis.
This requires loading all metadata files and explicitly defining the final output files using the limma/lmFit variable naming scheme.

Pro:

  • enables smooth usage as a module, with explicit outputs that can serve as inputs to subsequent modules

Con:

  • requires data-dependent configuration/annotation -> considered bad practice / to be avoided
    • read up on why, and how this translates to this use case

If done, do the same in dea_seurat.

sreichl self-assigned this Mar 30, 2024
sreichl added the enhancement (New feature or request) label Mar 30, 2024
sreichl commented May 21, 2024

Idea 1: pre-generate all feature list names

  • Make the feature-list-generating rule a checkpoint with a subsequent aggregation rule that creates a CSV similar to the input annotation of the enrichment analysis module (name, path, background, …) for each analysis. -> This CSV is then required in the target rule instead of the feature_list folder.
  • Thereby the missing-input problem is solved without using the internal data, and the annotation of the enrichment analysis module becomes less cumbersome.
  • -> enables running from A to Z
  • Need to explicitly determine the exact filenames before execution and then instruct the rules accordingly. -> Is this actually possible?! In genome_track I did not manage to make outputs conditional, only inputs via input functions.
  • This requires the function dmatrix from the library patsy, which in turn requires the Global Workflow Dependency functionality of Snakemake 8 (see the sketch below this list).
  • Need to create empty files for groups without DEGs.
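A possible sketch of the name pre-generation, assuming that the non-intercept columns of the patsy design matrix correspond one-to-one to the limma/lmFit coefficients used in the file names (the helper name and naming scheme are assumptions, and the patsy column names may still need sanitizing to match R's coefficient names):

    # hypothetical helper: derive the expected feature-list paths before execution
    import os
    import pandas as pd
    from patsy import dmatrix

    def get_expected_feature_lists(result_path, analysis, metadata_path, formula):
        # load the sample metadata of this analysis (assumed to be a CSV with a sample index)
        metadata = pd.read_csv(metadata_path, index_col=0)
        # build the design matrix from the formula, analogous to what limma/lmFit sees
        design = dmatrix(formula, data=metadata, return_type="dataframe")
        # assumption: every non-intercept column becomes one "group" in the file names
        groups = [col for col in design.columns if col != "Intercept"]
        return [
            os.path.join(result_path, analysis, "feature_lists", f"{group}_{direction}_features.txt")
            for group in groups
            for direction in ("up", "down")
        ]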

Idea 2: use checkpoints
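A minimal sketch of the standard data-dependent pattern (rule names, paths, and the script are placeholders, not the module's actual code): the checkpoint produces the feature_lists directory, and an input function re-globs it once it exists.

    # hypothetical checkpoint: the set of feature lists is only known after it has run
    checkpoint feature_lists:
        input:
            os.path.join(result_path, "{analysis}", "results.csv"),
        output:
            directory(os.path.join(result_path, "{analysis}", "feature_lists")),
        script:
            "scripts/feature_lists.py"

    def get_feature_lists(wildcards):
        # triggers re-evaluation of the DAG once the checkpoint output exists on disk
        ckpt_dir = checkpoints.feature_lists.get(analysis=wildcards.analysis).output[0]
        groups = glob_wildcards(os.path.join(ckpt_dir, "{group}_up_features.txt")).group
        return expand(os.path.join(ckpt_dir, "{group}_up_features.txt"), group=groups)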

Idea 3: use for loops around the rule

  • Check whether for loops around rules are supported. If so, define one rule per analysis with the respective expand for the result files (see the sketch below).
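A minimal sketch of that pattern, assuming a Snakemake version that supports overriding rule names via the name: directive and that the groups of each analysis are listed in the config (both assumptions to verify):

    # hypothetical: one rule per analysis, generated in a for loop
    for analysis in config["analyses"]:
        rule:
            name: f"feature_lists_{analysis}"
            input:
                expand(
                    os.path.join(result_path, analysis, "feature_lists", "{group}_up_features.txt"),
                    group=config["analyses"][analysis]["groups"],  # assumed config key
                ),
            output:
                touch(os.path.join(result_path, analysis, "feature_lists", ".complete")),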

Idea 4: input = output?

  • Can I have a rule that has its input as output?!

Idea 5: adapt enrichment_analysis input

  • Change the enrichment analysis input to a pattern of the output directory of the differential analysis. Think it through before testing and implementing.

Idea 6: Split up the feature list generation per group

  • Con: waste of resources, as the result is loaded over and over
  • Pro: specific outputs supported by Snakemake
  • Request all pre-determined feature lists in the final target rule and use wildcards for each group within each analysis (see the sketch below this list).
  • Solves the problem without checkpoints or other issues (but requires Snakemake 8).
  • To save resources, the explicit rule could take its input from the checkpoint but select only the lists of its analysis and then copy or touch them?
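A sketch of the corresponding target-rule request, reusing the hypothetical get_expected_feature_lists helper from the Idea 1 sketch above (the metadata and formula config keys are assumptions):

    # hypothetical: request all pre-determined feature lists up front;
    # a generic rule with {analysis}/{group} wildcards produces them
    rule all:
        input:
            [
                path
                for analysis in config["analyses"]
                for path in get_expected_feature_lists(
                    result_path,
                    analysis,
                    config["analyses"][analysis]["metadata"],
                    config["analyses"][analysis]["formula"],
                )
            ],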

sreichl commented May 26, 2024

Goal: Run analyses from rAw/reAds to pathwayZ/enrichmentZ, i.e., close the gap between the dea_limma/dea_seurat and enrichment_analysis modules.

If file names are explicitly pre-generated, then Snakemake 8 is required:

  • install Snakemake 8
  • set up & document the SLURM executor for the CeMM HPC
  • change the module to work with Snakemake 8 and the SLURM executor (e.g., move partition from param to resource)
    • change & test all other modules, then switch min_version to 8.X.X
  • add a global workflow dependency, i.e., envs/global.yaml with the library patsy for the function dmatrix (see the sketch below the checklist)
  • develop a function that generates the file names using patsy
  • add it to the target rule all as the final outcome
  • add a rule that touches (or copies?) the respective files per group from the checkpoint, or call a new rule/script for feature list generation per group:
    input:
        get_feature_lists,
    output:
        up = os.path.join(result_path,'{analysis}','feature_lists','{group}_up_features.txt'),
        up_annot = os.path.join(result_path,'{analysis}','feature_lists','{group}_up_features_annot.txt') if config["feature_annotation"]["path"]!="" else [],
        # same for down and featureScores.csv
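For the global workflow dependency item above, a minimal sketch of the Snakefile head (assuming Snakemake >= 8, conda deployment enabled via --software-deployment-method conda, and an envs/global.yaml that lists pandas and patsy):

    # hypothetical Snakefile head: global workflow dependency (Snakemake >= 8)
    conda: "envs/global.yaml"

    # patsy is then importable in the workflow definition itself,
    # e.g., for pre-computing the expected feature-list file names
    from patsy import dmatrix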

sreichl commented Jun 21, 2024

Potential problem with pre-determining result names:
It requires looking into annotation/metadata that is generated upstream, e.g., by spilterlize or scRNA-seq processing… hence it can't be used for a real A-to-Z run… But isn't that a general problem then? Think it through thoroughly before testing, then test cheaply without heavy development.
Which brings me back to checkpoints between modules being the solution?!?!

sreichl commented Jun 30, 2024

Annotations coming from previous outputs should be copied into the respective config folder. Best-practice usage is to work through a project module by module, thereby creating the respective annotation files. Only at the end should a run from A to Z be possible, for rerunning rather than investigation/exploration.
