usage of nanopype, storage module #5

Closed · ginolhac opened this issue Aug 20, 2019 · 12 comments
Labels: bug (Something isn't working)

@ginolhac

Hello,
your workflow fits my needs perfectly and I really hope I can get it working in our set-up, which is an HPC with slurm.
I could install all the tools, but I would rather use the singularity image, which would make it easier to share the tool with people here.
I don't understand how I am supposed to use this image; I created it by pulling from Docker Hub.
The import of the data went fine:

After booking interactive resources and activating the nanopype virtual env,

python3  ~/install/virtualenvs/nanopype/nanopype/scripts/nanopype_import.py \
  data/raw/ EM_S1/20190806_0812_MN22103_FAK07438_17f539d2/fast5

gives

20.08.2019 09:49:22 [INFO] Logger created
20.08.2019 09:49:22 [INFO] Writing output to /mnt/irisgpfs/projects/lsru/minion/mouse_embryo/data/raw/reads
20.08.2019 09:49:22 [INFO] Inspect existing files and archives
20.08.2019 09:49:22 [INFO] 0 raw files already archived
20.08.2019 09:49:22 [INFO] 41 raw files to be archived
20.08.2019 09:50:45 [INFO] Archived 41 reads in /mnt/irisgpfs/projects/lsru/minion/mouse_embryo/data/raw/reads/0.tar
20.08.2019 09:50:45 [INFO] Mission accomplished

Afterwards, having created a slurm profile, I naively tried this:

snakemake -j 999 --cluster-config cluster.json --cluster "sbatch -c 2 -t 1" --profile slurm \
  --use-singularity --singularity-prefix ~/scratch/nanopype-v0.8.0.simg \
  --snakefile ~/install/virtualenvs/nanopype/nanopype/Snakefile data/raw/reads.fofn 

gives

/mnt/irisgpfs/users/aginolhac/install/virtualenvs/nanopype/lib/python3.7/site-packages/snakemake/workflow.py:85: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  self.globals = globals()
[WARNING] Genome for GRCm38.p3 not found in ~/lsru/minion/references/GRCm38.p3.fasta, skipping entry.
[WARNING] Genome for test not found in references/chr6.fa, skipping entry.
[WARNING] flappie not found as flappie and is only available in singularity rules.
[WARNING] bedtools not found as bedtools and is only available in singularity rules.
[WARNING] graphmap not found as graphmap and is only available in singularity rules.
[WARNING] ngmlr not found as ngmlr and is only available in singularity rules.
[WARNING] sniffles not found as sniffles and is only available in singularity rules.
[WARNING] deepbinner not found as deepbinner-runner.py and is only available in singularity rules.
[WARNING] racon not found as racon and is only available in singularity rules.
[WARNING] cdna_classifier not found as cdna_classifier.py and is only available in singularity rules.
[WARNING] spliced_bam2gff not found as spliced_bam2gff and is only available in singularity rules.
[WARNING] cluster_gff not found as cluster_gff and is only available in singularity rules.
[WARNING] collapse_partials not found as collapse_partials and is only available in singularity rules.
[WARNING] polish_clusters not found as polish_clusters and is only available in singularity rules.
[WARNING] strique not found as STRique.py and is only available in singularity rules.
RuleException in line 76 of /home/users/aginolhac/install/virtualenvs/nanopype/nanopype/rules/demux.smk:
Singularity directive is only allowed with shell, script or wrapper directives (not with run).
  File "/home/users/aginolhac/install/virtualenvs/nanopype/nanopype/Snakefile", line 231, in <module>
  File "/home/users/aginolhac/install/virtualenvs/nanopype/nanopype/rules/demux.smk", line 76, in <module>

I tried to bind the mouse genome folder into the singularity image but I can't get it right. Also, what about all the warnings about missing binaries? Are they not found in the singularity image?

Then it complains about demux.smk, which contains:

    singularity:
        "docker://nanopype/demux:{tag}".format(tag=config['version']['tag'])
    run:
        import os, itertools, collections

I forgot to say that I successfully ran the tests when I was inside the singularity image, but not when running python3 test/test_rules.py test_unit_singularity, which gave the same error as trying to index the fast5 files.

Thanks in advance for your time.

Aurelien

@giesselmann
Owner

Hi Aurelien,
thank you for the interest in our pipeline. Singularity and slurm are a bit tricky, as I can't test both in detail here. I'm currently working on the demux module; the error you report will be fixed in v0.9.0, which is coming soon.
For the moment, in your nanopype repository, could you delete/comment out these two lines (they tell snakemake to use the specified container for this rule, which is just unnecessary here):

singularity:
"docker://nanopype/demux:{tag}".format(tag=config['version']['tag'])

Regarding binaries: If you use singularity, you can ignore the warnings about missing binaries, they're all in the container.
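
If you want to convince yourself, a quick check along these lines should work (just a sketch, assuming the image you built earlier in this thread and that 'which' is available inside it):

singularity exec ~/scratch/nanopype-v0.8.0.simg which flappie graphmap ngmlr sniffles racon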

Regarding the reference genome: Please first make sure the file '~/lsru/minion/references/GRCm38.p3.fasta' exists. In your env.yaml, where you configure the genomes, can you replace the given path with an absolute path? I'm not sure right now whether the ~ is handled properly.
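
For example, to rule out problems with the ~, you could check the file and copy its resolved absolute path into env.yaml (a sketch; the home prefix is visible later in this thread, adjust if your layout differs):

ls -l ~/lsru/minion/references/GRCm38.p3.fasta
realpath ~/lsru/minion/references/GRCm38.p3.fasta

and paste the printed path where the genome is configured in env.yaml.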

I still have open pull requests to the snakemake master regarding singularity and group jobs in the cluster. But I've seen a few fixes on their side too.
If you encounter more errors, please let me know; I have a strong interest in getting your setup up and running smoothly!
Pay

giesselmann added the bug label on Aug 20, 2019
giesselmann self-assigned this on Aug 20, 2019
@ginolhac
Author

Thanks a lot for your fast answer!
OK, I commented out the 2 lines and indeed it solved this part.
Then, I found the correct syntax to bind folders in singularity.

snakemake -j 999 --cluster-config cluster.json --cluster "sbatch -c 2 -t 1" --profile slurm \
  --use-singularity --singularity-args "--bind /mnt/irisgpfs/projects/lsru:/lsru" --singularity-prefix ~/scratch/img \
  --snakefile ~/install/virtualenvs/nanopype/nanopype/Snakefile data/raw/reads.fofn 

gives

/mnt/irisgpfs/users/aginolhac/install/virtualenvs/nanopype/lib/python3.7/site-packages/snakemake/workflow.py:85: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  self.globals = globals()
[WARNING] Genome for GRCm38.p3 not found in /lsru/minion/references/GRCm38.p3.fasta, skipping entry.
[WARNING] Genome for test not found in references/chr6.fa, skipping entry.
[WARNING] flappie not found as flappie and is only available in singularity rules.
[WARNING] bedtools not found as bedtools and is only available in singularity rules.
[WARNING] graphmap not found as graphmap and is only available in singularity rules.
[WARNING] ngmlr not found as ngmlr and is only available in singularity rules.
[WARNING] sniffles not found as sniffles and is only available in singularity rules.
[WARNING] deepbinner not found as deepbinner-runner.py and is only available in singularity rules.
[WARNING] racon not found as racon and is only available in singularity rules.
[WARNING] cdna_classifier not found as cdna_classifier.py and is only available in singularity rules.
[WARNING] spliced_bam2gff not found as spliced_bam2gff and is only available in singularity rules.
[WARNING] cluster_gff not found as cluster_gff and is only available in singularity rules.
[WARNING] collapse_partials not found as collapse_partials and is only available in singularity rules.
[WARNING] polish_clusters not found as polish_clusters and is only available in singularity rules.
[WARNING] strique not found as STRique.py and is only available in singularity rules.
Building DAG of jobs...
MissingRuleException:
No rule to produce data/raw/reads.fofn (if you use input functions make sure that they don't raise unexpected exceptions).

and then I realised the data should also have an absolute path inside singularity, so I updated the nanopype.yaml with:

storage_data_raw : /lsru/minion/mouse_embryo/data/raw/

and the updated command line:

snakemake -j 999 --cluster-config cluster.json --cluster "sbatch -c 2 -t 1" --profile slurm \
  --use-singularity --singularity-args "--bind /mnt/irisgpfs/projects/lsru:/lsru" --singularity-prefix ~/scratch/img \
  --snakefile ~/install/virtualenvs/nanopype/nanopype/Snakefile /lsru/minion/mouse_embryo/data/raw/reads.fofn

and now

RuntimeError in line 68 of /home/users/aginolhac/install/virtualenvs/nanopype/nanopype/Snakefile:
[ERROR] Raw data archive not found.
  File "/home/users/aginolhac/install/virtualenvs/nanopype/nanopype/Snakefile", line 68, in <module>

It seems like the singularity image is not loaded and paths are resolved outside the container. As a demonstration, inside the container it is OK:

singularity shell --bind /mnt/irisgpfs/projects/lsru:/lsru ~/scratch/nanopype-v0.8.0.simg

bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
Singularity nanopype-v0.8.0.simg:/mnt/irisgpfs/projects/lsru/minion/mouse_embryo> ls -l /lsru/minion/mouse_embryo/data/raw/
total 0
drwxr-Sr-- 2 aginolhac lsru 4096 Aug 20 07:49 reads
Singularity nanopype-v0.8.0.simg:/mnt/irisgpfs/projects/lsru/minion/mouse_embryo> head -1 /lsru/minion/references/GRCm38.p3.fasta
>1 GRC primary reference assembly

Sorry for the mess, I guess this is not exactly a trivial set-up. Even so, it would be so awesome to get it working ;)

@giesselmann
Owner

Hey,
let's start with the first error:
'No rule to produce data/raw/reads.fofn'
means that the configured raw data archive doesn't match the requested read index file.
Can you make sure your raw (test) data archive is structured in the following way:

~/data/raw/
    FAH1234_something/
        reads/
            0.tar

That is: an offset 'data/raw/', then one folder per run ('FAH1234_something'), and within it a 'reads' subfolder containing batches as .tar archives or the more recent multi-read fast5 files.

With this you should be able to require the file '~/data/raw/FAH1234_something/reads.fofn' from the pipeline.

Importantly, you do not need to mount any host path into the container, this is all done by snakemake!

For debugging purposes, you should be able to run the raw data indexing outside of the container; it's pure Python code (I hope I have included all packages in the requirements).

Pay

@ginolhac
Author

I think I am lost here: I am not working in my home directory (we have a strict quota on it), so how can the container access the data then? In singularity, as far as I know, only the home directory is bound by default.
For the raw data, I am missing the FAH1234_something level, I will add it.
I was asking only for indexing to reduce the number of steps and to test before running basecalling. That will be the next difficulty, because I need to run it on one of our GPUs to save a lot of time, but that's a slurm profile issue and the next step.

@giesselmann
Owner

No worries, home or not doesn't matter, as long as the path is accessible from any cluster node. I'm only using home in the docs since people are familiar with it.
With the runname level in the raw data it should work.
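
For reference, with the paths from earlier in this thread, moving the already imported archive under a run-name folder would look roughly like this (a sketch, not tested on your system):

cd /mnt/irisgpfs/projects/lsru/minion/mouse_embryo
mkdir -p data/raw/20190806_0812_MN22103_FAK07438_17f539d2/reads
mv data/raw/reads/0.tar data/raw/20190806_0812_MN22103_FAK07438_17f539d2/reads/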

In general, snakemake is mounting the input files of a rule (for indexing the raw read batch) into the working directory within the container.

Basecalling with GPU is not yet tested by me; we have a CPU-only cluster. When you reach this point, can you open a new issue with some specs on how GPUs are set up for you? Are there nodes with one or multiple GPUs, how do you control which process uses which GPU, etc.?

@ginolhac
Author

Yes, I have GPU settings that allow really fast basecalling (<30 min instead of >10 hours), so I am really keen on continuing to basecall on those units. But I am stopping here and will open another issue.

I didn't know snakemake could mount the files into the home of the container, that's handy.

Unfortunately, even when trying easier things (no singularity, just the indexing as in https://nanopype.readthedocs.io/en/stable/rules/storage/), I got this:

snakemake --snakefile ~/install/virtualenvs/nanopype/nanopype/Snakefile data/raw/20190806_0812_MN22103_FAK07438_17f539d2/reads.fofn

Building DAG of jobs...
MissingRuleException:
No rule to produce data/raw/20190806_0812_MN22103_FAK07438_17f539d2/reads.fofn (if you use input functions make sure that they don't raise unexpected exceptions).

@giesselmann
Owner

Okay, I see where this might come from: I'm converting the 'storage_data_raw' setting to an absolute path in the Snakefile. Can you try the entire setup with absolute paths?
So in the 'nanopype.yaml', set storage_data_raw to the absolute path of 'data/raw'.
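
As a concrete sketch with the paths from this thread, run from the working directory that would be something like:

realpath data/raw
# then in nanopype.yaml, e.g.:
# storage_data_raw : /mnt/irisgpfs/projects/lsru/minion/mouse_embryo/data/raw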

@ginolhac
Author

I made some progress: actually, leaving the relative path in nanopype.yaml was fine; what helped was to add the reads folder to the target path (in the doc it is not present):

snakemake --snakefile ~/install/virtualenvs/nanopype/nanopype/Snakefile /mnt/irisgpfs/projects/lsru/minion/mouse_embryo/data/raw/20190806_0812_MN22103_FAK07438_17f539d2/reads/reads.fofn

and then:

Job counts:
        count   jobs
        1       storage_index_run
        1
[Tue Aug 20 16:19:01 2019]
Error in rule storage_index_run:
    jobid: 0
    output: /mnt/irisgpfs/projects/lsru/minion/mouse_embryo/data/raw/20190806_0812_MN22103_FAK07438_17f539d2/reads/reads.fofn

RuleException:
AttributeError in line 71 of /home/users/aginolhac/install/virtualenvs/nanopype/nanopype/rules/storage.smk:
'InputFiles' object has no attribute 'batches'
  File "/home/users/aginolhac/install/virtualenvs/nanopype/nanopype/rules/storage.smk", line 71, in __rule_storage_index_run
  File "/opt/apps/resif/data/devel/default/software/lang/Python/3.7.2-GCCcore-8.2.0/lib/python3.7/concurrent/futures/thread.py", line 57, in run
Exiting because a job execution failed. Look above for error message

@giesselmann
Owner

I think I can now reproduce the issue. In short, could you please report the folder/file structure below data/raw/...? It seems the pipeline is finding the run folder but no raw data batches inside the reads folder. Please see the following two examples.

I followed the first steps in the tutorial to extract the test data, and running e.g.

snakemake --snakefile ~/nanopype/Snakefile data/raw/20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64/reads.fofn -n

gives me:

Building DAG of jobs...
Job counts:
count jobs
2 storage_index_batch
1 storage_index_run
3

[Tue Aug 20 16:56:00 2019]
rule storage_index_batch:
input: /project/minion/src/nanopype/test/data/raw/20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64/reads/0.tar
output: /project/minion/src/nanopype/test/data/raw/20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64/reads/0.fofn
jobid: 1
wildcards: runname=20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64, batch=0
resources: mem_mb=4000, time_min=15

[Tue Aug 20 16:56:00 2019]
rule storage_index_batch:
input: /project/minion/src/nanopype/test/data/raw/20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64/reads/1.tar
output: /project/minion/src/nanopype/test/data/raw/20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64/reads/1.fofn
jobid: 2
wildcards: runname=20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64, batch=1
resources: mem_mb=4000, time_min=15

[Tue Aug 20 16:56:00 2019]
localrule storage_index_run:
input: /project/minion/src/nanopype/test/data/raw/20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64/reads/0.fofn, /project/minion/src/nanopype/test/data/raw/20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64/reads/1.fofn
output: /project/minion/src/nanopype/test/data/raw/20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64/reads.fofn
jobid: 0
wildcards: runname=20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64
Job counts:
count jobs
2 storage_index_batch
1 storage_index_run
3
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution

however, if I create a dummy run with

mkdir -p data/raw/dummy/reads

and ask to index this one:

snakemake --snakefile ~/nanopype/Snakefile data/raw/dummy/reads.fofn -n

I get

Building DAG of jobs...
Job counts:
count jobs
1 storage_index_run
1

[Tue Aug 20 17:04:40 2019]
localrule storage_index_run:
output: /project/minion/src/nanopype/test/data/raw/dummy/reads.fofn
jobid: 0
wildcards: runname=dummy

and finally without -n:

Job counts:
count jobs
1 storage_index_run
1
[Tue Aug 20 17:13:57 2019]
Error in rule storage_index_run:
jobid: 0
output: /project/minion/src/nanopype/test/data/raw/dummy/reads.fofn

RuleException:
AttributeError in line 71 of /project/minion/src/nanopype/rules/storage.smk:
'InputFiles' object has no attribute 'batches'

Which is close to your output. I will think of a warning/error message if no read batches are found; the current error doesn't give a hint about the actual problem.

Please let me know if this was helpful.
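
For reporting the structure, something as simple as this, run from the directory containing data/raw, would be enough (just a sketch):

find data/raw -maxdepth 3 | sort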

@ginolhac
Author

Hello Pay,
thanks for trying to help.
The structure of the data is:

/home/users/aginolhac/lsru/minion/mouse_embryo/data/
/home/users/aginolhac/lsru/minion/mouse_embryo/data/raw
/home/users/aginolhac/lsru/minion/mouse_embryo/data/raw/20190806_0812_MN22103_FAK07438_17f539d2
/home/users/aginolhac/lsru/minion/mouse_embryo/data/raw/20190806_0812_MN22103_FAK07438_17f539d2/reads
/home/users/aginolhac/lsru/minion/mouse_embryo/data/raw/20190806_0812_MN22103_FAK07438_17f539d2/reads/0.tar

Now I tried the tests again and it failed :(
snakemake --snakefile ~/install/virtualenvs/nanopype/nanopype/Snakefile data/raw/20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64/reads.fofn

and I get

MissingRuleException:
No rule to produce data/raw/20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64/reads.fofn (if you use input functions make sure that they don't raise unexpected exceptions).

Of note, the test/nanopype.yaml contains storage_data_raw: /mnt/irisgpfs/users/aginolhac/install/virtualenvs/nanopype/nanopype/test/data/raw, so I guess with a full path it should work, no?

for info

ll /mnt/irisgpfs/users/aginolhac/install/virtualenvs/nanopype/nanopype/test/data/raw
total 0
drwxr----- 3 aginolhac clusterusers 4096 Dec  5  2018 20170725_FAH14126_FLO-MIN107_SQK-LSK308_human_Hues64
drwxr----- 3 aginolhac clusterusers 4096 Dec  5  2018 20170727_FAH14083_FLO-MIN107_SQK-LSK308_human_Hues64
drwxr----- 3 aginolhac clusterusers 4096 Dec  5  2018 20170807_FAH17458_FLO-MIN107_SQK-LSK108_human_Hues64
drwxr----- 3 aginolhac clusterusers 4096 Dec  5  2018 20170809_FAH17667_FLO-MIN107_SQK-LSK108_human_Hues64
drwxr----- 3 aginolhac clusterusers 4096 Dec  5  2018 20170822_FAH18797_FLO-MIN107_SQK-LSK108_human_Hues64
drwxr----- 3 aginolhac clusterusers 4096 Dec  5  2018 20170824_FAH18790_FLO-MIN107_SQK-LSK108_human_Hues64
drwxr----- 3 aginolhac clusterusers 4096 Dec  5  2018 20170830_FAH21992_FLO-MIN107_SQK-LSK108_human_Hues64
drwxr----- 3 aginolhac clusterusers 4096 Dec  5  2018 20170907_FAH21434_FLO-MIN107_SQK-LSK108_human_Hues64
drwxr----- 3 aginolhac clusterusers 4096 Dec  5  2018 20180202_FAH42054_FLO-MIN107_SQK-LSK108_human_Hues64
drwxr----- 3 aginolhac clusterusers 4096 Dec  5  2018 20180202_FAH48560_FLO-MIN107_SQK-LSK108_human_Hues64
drwxr----- 3 aginolhac clusterusers 4096 Dec  5  2018 20180221_FAH45430_FLO-MIN106_SQK-LSK108_human_Hues64
drwxr----- 3 aginolhac clusterusers 4096 Dec  5  2018 20180221_FAH48596_FLO-MIN107_SQK-LSK108_human_Hues64

ginolhac changed the title from "usage of nanopype with singularity image is unclear" to "usage of nanopype, storage module" on Aug 21, 2019
@giesselmann
Owner

Yes, almost there, the configuration is right. If you run

snakemake --snakefile ~/nanopype/Snakefile data/raw/runname/reads.fofn

you get the 'No rule to produce...' error.
If you, however, run from the test directory:

snakemake --snakefile ~/nanopype/Snakefile $(pwd)/data/raw/runname/reads.fofn

it works. This behavior is currently not consistent with the documentation; I will think of a way to either handle both absolute and relative paths or document it better.

To explain the error: the configuration now uses an absolute path, but the snakemake workflow got called with a relative path. Snakemake doesn't detect this and doesn't find a rule to produce the relative output. This is only relevant for the storage module; all other modules work on relative paths in the working directory.
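
A quick sanity check (a sketch, run from the working directory, with the run name from this thread) is to compare the configured prefix with the target you request; following the explanation above, both should be absolute and the target should sit below the configured storage_data_raw:

grep storage_data_raw nanopype.yaml
echo "$(pwd)/data/raw/20190806_0812_MN22103_FAK07438_17f539d2/reads.fofn"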

@ginolhac
Author

I am sorry Pay, but providing the absolute path when requesting the index file still produces 'No rule to produce...'. I will close for now, because I can see that there are many steps that are going to be tricky afterwards, like slurm and singularity, and I need to get the files processed, even by hand. I will start playing with a shorter snakemake pipeline later to better grasp the essence of it. Thanks again for your kind help, I will probably reopen later.
