Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding microbiome analysis workflows to IWC with test data #182

Open
wants to merge 53 commits into
base: main
Choose a base branch
from

Conversation

EngyNasr
Copy link

@EngyNasr EngyNasr commented Mar 7, 2023

I tried to reduce the test-data in this PR, hope it works.

Thanks a lot,
Engy <3

@mvdbeek
Copy link
Member

mvdbeek commented Mar 7, 2023

Thanks, can you add a README, Changelog and dockstore.yml files ? (https://github.com/galaxyproject/iwc/blob/main/workflows/README.md#structure-of-the-directory)

@EngyNasr
Copy link
Author

@mvdbeek did I miss something else ?

Thanks a lot for helping me out :)

@EngyNasr
Copy link
Author

@wm75 Can you help me revising and merging this PR
Thanks a lot <3

@EngyNasr
Copy link
Author

EngyNasr commented Mar 27, 2023

To do as discussed with @wm75 :

1- remove "latest" from workflow names
2- keep only one file to test for all the workflows, except the pre-processing
3- pre-processing tests we can use the txt file to compare content
4- make a readme inside each folder as well
5- make the changelog only inside each folder
6- remove the big docker.yml in the main folder and keep only the one in every folder

@EngyNasr
Copy link
Author

@bebatut I have added the 5 workflows for the single samples run and the 4 workflows for the collection of samples run so in total 9 workflows with their test data, I have chosen the minimum size sample data which contain VFs, Contigs, etc. the maximum file size is 50Mb, but the other files are either in Bytes or Kbs

@wm75
Copy link
Contributor

wm75 commented Jun 13, 2023

@EngyNasr @bebatut I don't see much value in offering the single-sample workflows, when collection-based flavors exist that could be run with 1-element collections. Having the single-sample WFs published would just mean more maintenance and synchronization efforts.
Or is there anything that can be achieved with the single-sample versions, that the collection-based versions don't do?

@wm75
Copy link
Contributor

wm75 commented Jun 13, 2023

@EngyNasr can you please run some json reformatter tool over your workflows. Single-line JSON is just not very nice to review and prevents meaningful diffs. Use, e.g., python3 -m json.tool collection-version/pathogen-detection-nanopore-pre-processing-collection/Pathogen-Detection-Nanopore-Pre-Processing-collection.ga collection-version/pathogen-detection-nanopore-pre-processing-collection/Pathogen-Detection-Nanopore-Pre-Processing-collection_pretty.ga or any other tool you like.

Copy link
Contributor

@wm75 wm75 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments:

  • At least some of your workflows are lacking a release attribute, which you need to add manually.
  • In the Preprocessing workflow, the conversion of fastq.gz to plain fastq, just for the purpose of filtering reads by their ID, is rather unfriendly for a user's quota.
    There is toolshed.g2.bx.psu.edu/repos/iuc/seqtk/seqtk_subseq/1.3.1 which can filter compressed fastqs directly (and which is probably faster in all cases). It's downside is that it only keeps matching IDs, but can't discard them, or write both to separate files (like toolshed.g2.bx.psu.edu/repos/peterjc/seq_filter_by_id/seq_filter_by_id/0.2.7).
    So if you want the non-host reads, you'll have to invert the action of Filter Tabular at the step before.
    If you really need also the host reads as a separate file (which I'm not entirely convinced of), you would have to run Filter Tabular and seqtk subseq twice, but even that might still be better than the current way?

@EngyNasr
Copy link
Author

@EngyNasr @bebatut I don't see much value in offering the single-sample workflows, when collection-based flavors exist that could be run with 1-element collections. Having the single-sample WFs published would just mean more maintenance and synchronization efforts. Or is there anything that can be achieved with the single-sample versions, that the collection-based versions don't do?

it was just the old way we used to do the analysis and we use these workflows in the current training material, thats why we wanted to have both as two versions of the workflow. but definitely they are useless now since the collection version does the same exact job, but it will never take a single file it has to always be a collection

@paulzierep
Copy link

This tool should also work here: toolshed.g2.bx.psu.edu/repos/iuc/krakentools_extract_kraken_reads/krakentools_extract_kraken_reads/1.2+galaxy0
Can use fastq.gz

…sing, to be added once the PR of the tool update is merges
@EngyNasr
Copy link
Author

EngyNasr commented Jun 22, 2023

I need help in tests @wm75 @paulzierep @bebatut :

I noticed that the tool id for these tools is not like the rest of the tools e.g. toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_getfastabed/2.30.0+galaxy1

Is there a way to solve that for these tests to succeed?

…with an update to Krakentool, and also pushing the latest updates to the workflows, still the planemo tests fails for the same reasons, I need help with that
@mvdbeek
Copy link
Member

mvdbeek commented Jun 28, 2023

@EngyNasr once galaxyproject/planemo#1377 is merged and a new version is released it should work.

@EngyNasr
Copy link
Author

@EngyNasr once galaxyproject/planemo#1377 is merged and a new version is released it should work.

thank you so much :)

@EngyNasr
Copy link
Author

EngyNasr commented Jun 28, 2023

@EngyNasr once galaxyproject/planemo#1377 is merged and a new version is released it should work.

@mvdbeek, Is it possible that it is also done for FILTER_EMPTY_DATASETS ?

@bebatut
Copy link
Member

bebatut commented Jun 29, 2023

I think it was closed by mistake

@bebatut bebatut reopened this Jun 29, 2023
@mvdbeek
Copy link
Member

mvdbeek commented Jun 29, 2023

Looks like it worked and you only need to work on your test assertions.

@wm75
Copy link
Contributor

wm75 commented Jun 30, 2023

@EngyNasr two questions on the latest preprocessing version:

  • why is the host ID now not a WF parameter anymore? You could use the "Map parameter value" expression tool to map a user-selected species name to the required taxid.
  • is there a specific reason why you're using fastqc, when you're already running fastp? MultiQC can also visualize fastp JSON output dataset and it doesn't look too different from what two FastQC results (before and after trimming).

@wm75
Copy link
Contributor

wm75 commented Jun 30, 2023

The Pathogen-Detection-Nanopore-All-Samples-Analysis WF needs much better annotations and input dataset labels to make it understandable what it is good for. Right now, understanding the purpose without knowing the tutorial is really hard. Even the README file only talks about VF genes, but many people wouldn't know what that is.

…database size, where we use now a input parameter instead, thank you so much Avatar
@mvdbeek
Copy link
Member

mvdbeek commented Aug 16, 2023

The one error you've got is still the same:

Loading database information...Failed attempt to allocate 65426688000bytes;
you may not have enough free memory to load this database.
If your computer has enough RAM, perhaps reducing memory usage from
other programs could help you load this database?
classify: unable to allocate hash table memory

you'd need a smaller database for testing, I think @wm75 or @paulzierep were interested in doing that ?

@EngyNasr
Copy link
Author

The one error you've got is still the same:

Loading database information...Failed attempt to allocate 65426688000bytes;
you may not have enough free memory to load this database.
If your computer has enough RAM, perhaps reducing memory usage from
other programs could help you load this database?
classify: unable to allocate hash table memory

you'd need a smaller database for testing, I think @wm75 or @paulzierep were interested in doing that ?

I can use the Standard-8 or Standard-16 databases instead for now, I just some administrative help to add to the kraken2 tool on Galaxy EU and Org.

@mvdbeek
Another question, the HTML output of the Github testing, shows an error in the Genebased Pathogen identification workflow too: Expecting value: line 1 column 1 (char 0), which actually blocks the runs of the Pre-processing and the taxonomy profiling workflows.

this error I dont understand, can you please check that, or is it only the database size problem of the taxonomy profiling workflow?

@EngyNasr
Copy link
Author

EngyNasr commented Apr 23, 2024

Dears :) @wm75

This PR is finally ready for review, it would be great if we could merge it this week

Thanks a lot,
Engy

@mvdbeek mvdbeek requested a review from wm75 April 24, 2024 15:21
…es, to make it opened for any microbiome workflows
Copy link
Member

@bebatut bebatut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @EngyNasr
I made some comments mostly for the 1st workflow, but they apply also to other workflows

@EngyNasr
Copy link
Author

EngyNasr commented May 2, 2024

Thanks @EngyNasr I made some comments mostly for the 1st workflow, but they apply also to other workflows

Thank you so much I will follow them all and ping you again, thanks a lot

EngyNasr added 15 commits May 8, 2024 16:18
… and after host removal to the Multiqc output of the preprocessing workflow
…, now the table is included and multiq runs correctly :)
…the collections comming from genebased pathogen identification, which may happen when no contigs are found by metaFlye for some samples
…ollections, to protect the workflow from the carry on error that might occur for samples with fewer number of contigs, VF genes and AMR genes found
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants