Bugs in conda mae pipeline #105
Comments
Hi, Any update about this bug? Thanks. |
Hi, that is indeed a bug, thanks for pointing that out, we will be fixing it soon. The issue is that the point/period in the filename causes the output filename to be truncated early. For the moment, you can create a symlink to your fasta file (so you don't have to copy the file) that does not contain any point characters before the
and replacing the corresponding entry in the config to
Hope this solves the problem for now. |
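A minimal sketch of the symlink workaround described above, written in Python. The function name and the choice to replace inner dots with underscores are assumptions made for illustration; the file name is taken from the log in the original report. The resulting link path is what would then go into the config in place of the original FASTA path.

```python
import os
import tempfile
from pathlib import Path


def link_without_inner_dots(fasta: Path, link_dir: Path) -> Path:
    """Symlink `fasta` under a name whose only dot precedes the extension."""
    stem, ext = os.path.splitext(fasta.name)
    if "." not in stem:
        return fasta  # nothing to work around
    # Hypothetical naming scheme: inner dots become underscores.
    link = link_dir / (stem.replace(".", "_") + ext)
    if link.is_symlink():
        link.unlink()  # replace a link left over from an earlier attempt
    link.symlink_to(fasta.resolve())
    return link


# Demonstration on a throwaway file; real paths would come from your config.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "GRCh38.primary_assembly.genome.fa"
    src.write_text(">chr1\nACGT\n")
    link = link_without_inner_dots(src, Path(tmp))
    link_name = link.name
    link_content = link.read_text()
    print(link_name)
```

Because the link points at the original file, nothing is copied, which matters for a multi-gigabyte reference genome.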
I tried, but still getting errors... MissingInputException in line 38 of /gpfs/home/evrong01/.local/lib/python3.6/site-packages/wbuild/wBuild.snakefile: Shutting down, this might take some time. Building DAG of jobs... Building DAG of jobs... |
Make sure that you specify the filename correctly, the fasta file doesn't seem to exist. Also, remove the last dot before the file ending from
to
otherwise the error will persist. If the unlock is giving you problems, try deleting the |
Next bug I'm getting. This is with the conda environment... Error: package or namespace load failed for ‘tMAE’: |
Which R version are you using?
It should be the one in your conda environment |
I'm using the drop conda environment. So if there is an issue, it is because of an issue in the drop conda environment configuration. R version 4.0.2 (2020-06-22) -- "Taking Off Again" |
Hm, seems as though R might access different library paths. Which paths do you get when you call
in R in the environment? You should only have a single path that links to the conda environment. Also, which Bioconductor version do you have?
|
Ok, the libPaths was sourcing my regular R first instead of the conda R. This is a bug in conda. The point of conda is to have a separate environment. However, conda still sources the original R library paths from the root user. See here for details: The solution is: this line: export R_LIBS=DROPCONDA_ENVIRONMENT/lib/R/library |
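For anyone checking the same thing from a script, here is a rough Python equivalent of the `export R_LIBS=...` line above. It is only a sketch: the function names are made up for illustration, and it assumes `Rscript` is on your `PATH` and that the conda environment keeps its packages under `lib/R/library`, as in the export line.

```python
import os
import subprocess


def r_env(conda_prefix: str) -> dict:
    """Copy of the current environment with R_LIBS pointing at the conda library."""
    env = dict(os.environ)
    env["R_LIBS"] = os.path.join(conda_prefix, "lib", "R", "library")
    return env


def show_lib_paths(conda_prefix: str) -> str:
    """Ask R which library paths it sees (requires Rscript on PATH)."""
    result = subprocess.run(
        ["Rscript", "-e", "print(.libPaths())"],
        env=r_env(conda_prefix),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `show_lib_paths(os.environ["CONDA_PREFIX"])` inside the activated environment should list the conda library first; if a path from your home directory still comes first, R is picking up library settings from somewhere else.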
Next bug in mae conda. How do I fix this? Started with deseq |
Thanks for finding the issue. For some reason, I don't get this behaviour on my machine, despite having a system installation of R. I will need some time to figure out what is going on. |
This error is most likely due to a failed GATK run. Try removing all the MAE output ( |
Ok I will try removing failed MAE output. Why doesn't snakemake or drop detect this automatically and rerun? This should be fixed. |
Deleting the mae directory in the root/processed_results folder gives the prior bug/error: Shutting down, this might take some time. |
It also gives this error: Building DAG of jobs... |
In general, mae is a lot more buggy than the other pipelines. There must be something in its foundation that causes it to have so many problems compared to the other drop pipelines. |
This directory that it complains about doesn't even exist... I tried 'drop update' and it still doesn't work. So the suggestion of deleting $ROOT/processed_data/mae broke the pipeline. I'm not sure how to fix it now. |
Sorry, the mae-pipeline folder exists, but the mae-pipeline/.drop folder does not exist: [evrong01@bigpurple-ln1 droptest2]$ ls -a /gpfs/scratch/evrong01/droptest2/.drop/modules/mae-pipeline/ So this error points to files that do not exist. This is definitely a bug in the underlying code. The pipeline has path names that don't exist. This is what causes the unlock commands to have errors every time. I suspect the unlock command has several bugs. MissingInputException in line 62 of /gpfs/scratch/evrong01/droptest2/.drop/modules/mae-pipeline/Snakefile: |
Can you print the contents of
Yes, MAE is a bit buggy, as GATK commands don't error in the same way as other commands and require unlocking, which doesn't work as it should for submodules. We are still working on these issues for the new release. It will take time before the main features are tested more systematically, so that we don't run into the same reproducibility issues as we do now. |
I think the unlock command or drop update has a bug in how it sets the base folder location, because it looks like it has the same folder path twice: /gpfs/scratch/evrong01/droptest2/.drop/modules/mae-pipeline + .drop/modules/mae-pipeline + Scripts/MAE/filterSNVs.sh But that is the wrong path. The correct path is: /gpfs/scratch/evrong01/droptest2/.drop/modules/mae-pipeline/Scripts/MAE/filterSNVs.sh |
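To make the suspected mechanism concrete, here is a small Python illustration of how a module prefix can end up in a path twice. It only reproduces the symptom with the paths quoted above; whether DROP or wBuild builds the path this way is an assumption, not something confirmed in this thread.

```python
import os

project = "/gpfs/scratch/evrong01/droptest2"
module_rel = ".drop/modules/mae-pipeline"
script_rel = "Scripts/MAE/filterSNVs.sh"

# Suspected failure mode: the base directory is already the module
# directory, and the module prefix is appended a second time.
base = os.path.join(project, module_rel)
doubled = os.path.join(base, module_rel, script_rel)

# Resolving the script relative to the project root gives the intended path.
correct = os.path.join(project, module_rel, script_rel)

print(doubled)
print(correct)
```

This would also be consistent with the observation that `mae-pipeline/.drop` does not exist on disk.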
Here is the mae Snakefile:
METHOD = 'MAE'
parser = drop.config(config, METHOD)
FUNCTIONS
def fasta_dict(fasta_file):
def getVcf(rna_id, vcf_id="qc"):
def getQC(format):
def getChrMap(SCRIPT_ROOT, conversion):
def getScript(type, name):
rule all:
rule sampleQC:
rule create_dict:
MAE
rule create_SNVs:
rule allelic_counts:
QC
rule renameChrQC:
rule allelic_counts_qc:
rulegraph_filename = f'{config["htmlOutputPath"]}/{METHOD}_rulegraph'
rule create_graph:
rule unlock: |
The bugs are not from gatk. The bugs are from the mae drop scripts. There is an issue in some path setting and the unlock scripts. There is no feasible way to run mae. There's a bunch of unlock errors that happen due to bugs in the mae scripts themselves. This is before any gatk is ever run: Building DAG of jobs... Shutting down, this might take some time. |
Yes, I know that error, but it's difficult to find out why it sometimes works and breaks at other times. Try creating the empty |
The fact that Scripts_MAE_deseq_mae_R was called before it failed means that gatk was run. However errors in the gatk command don't get recognised by snakemake, so it only occurs when the broken output is read in R. That's why I was trying to get you to rerun the mae pipeline from scratch. |
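As a general illustration of the problem described above (a step that fails without the workflow noticing), here is a hedged Python sketch of a wrapper that treats both a non-zero exit code and a missing or empty output file as failures. The function name and the file name `counts.csv` are hypothetical; this is not how the DROP rules are written.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_checked(cmd: list, expected_output: Path) -> None:
    """Run `cmd`; raise if it exits non-zero or leaves no usable output."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"command failed ({result.returncode}): {result.stderr.strip()}"
        )
    if not expected_output.is_file() or expected_output.stat().st_size == 0:
        raise RuntimeError(f"missing or empty output: {expected_output}")


# Demonstration: a command that exits 0 but never writes its output.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "counts.csv"  # hypothetical output file
    try:
        run_checked([sys.executable, "-c", "pass"], out)
        caught = ""
    except RuntimeError as err:
        caught = str(err)
    print(caught)
```

With a check like this, the failure surfaces at the step that caused it rather than later, when the broken output is read in R.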
Seeing as you are working with the demo and that we have resolved the dependency issues, it might be easier to just redo the setup in a new empty directory, instead of trying to isolate the mae pipeline for now |
Neither of these things fixes the problem. Changing the SCRIPT_ROOT = os.getcwd() now gives this error when running snakemake unlock: |
How do I redo the setup in a new empty directory without having rerun all the other pipelines, which takes 3 days of very significant compute resources? |
It takes 3 days to execute the demo? It should only take about 20 min. And the total input is just about 600 MB, so total memory consumption shouldn't be high. What are your resources, and what system are you working on? |
I ran drop for 110 samples, not the demo. Is there any way to transfer all the results of the other pipelines (except for mae) to the new directory? |
Try running drop on a new instance of the drop demo first, to see if the problem still occurs. Right now I don't have a good explanation for why the code suddenly breaks, as it worked before |
OK, good to know that that's a way we can fix the issue. You should definitely keep a copy of the output you have at the moment, before you have a full pipeline run. Also be careful if you save in a |
Thanks. Which specific directories do I need to copy from the old directory to the new directory? |
Create the new project first, call
and don't touch the |
I made a new directory, then I did drop init, then drop update, then I copied the config.yaml and samples.tsv file, then snakemake -n. It ran fine. However, there is no root directory. I only have these files in the directory... -rw-rw---- 1 evrong01 evrong01 1.8K Aug 15 15:54 config.yaml |
Now I created the root directory manually. Then I copied all the processed_data and processed_results except for mae. Then I ran snakemake --touch and it ran fine. Then I ran snakemake -n and it ran fine. But now I am running snakemake unlock again, and again I'm getting the same error. There must be some bug in the mae code. Building DAG of jobs... Shutting down, this might take some time. |
Also, running 'snakemake aberrantExpression' finishes and says "Nothing to be done.", but it didn't recreate the html_output directory. So just moving the processed_data and processed_results files to the new directory doesn't help, because then it doesn't recreate the html_output directory. I guess I can just copy those over too. But still the problem is I can't get mae to run. Unlock doesn't work, and there is no way around it. |
Also, if I run snakemake mae without doing unlock in the new directory, it also says "Nothing to be done", even though there is no mae results in the folder. I suspect that snakemake --touch makes it think that mae is already finished. |
And if I run snakemake mae without doing unlock in the old directory, I get this error. Maybe this is where the bug is. Subworkflow AE: Nothing to be done. TypeError in line 34 of /gpfs/scratch/evrong01/droptest2/.drop/modules/mae-pipeline/Snakefile: |
In the new directory, I manually deleted the two MAE.done files and now mae seems to run. However, it hit another new error: Error in eval(jsub, SDenv, parent.frame()) : Calls: [ -> [.data.table -> eval -> eval |
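For reference, a short Python sketch for locating leftover `MAE.done` marker files like the ones mentioned above before deciding whether to delete them. The directory layout in the demonstration is invented; only the marker file name comes from this thread.

```python
import os
import tempfile


def find_markers(project_dir: str, name: str = "MAE.done") -> list:
    """Return all files called `name` below `project_dir`, hidden folders included."""
    hits = []
    for root, _dirs, files in os.walk(project_dir):
        if name in files:
            hits.append(os.path.join(root, name))
    return sorted(hits)


# Demonstration with an invented layout (not the real DROP folder structure).
with tempfile.TemporaryDirectory() as tmp:
    for sub in (os.path.join(".hidden", "b"), "a"):
        os.makedirs(os.path.join(tmp, sub))
        open(os.path.join(tmp, sub, "MAE.done"), "w").close()
    found = [os.path.relpath(p, tmp) for p in find_markers(tmp)]
    print(found)
```

Listing the markers first, rather than deleting by pattern, makes it easier to see which steps the workflow currently believes are finished.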
I think things will just get too messy from here. I'd suggest that you try to use the version of drop that I'm still developing on, as it will make reruns so much easier. The output structure of the HTML will be a bit different and it will still miss some features, but the pipeline core should be fully functional. If you already have a local clone of the drop repo set a new remote to
Otherwise just clone the above URL. Then checkout
Next, you need to install drop to your current drop conda environment (as you already have all the dependencies).
Let me know once you have installed the new version and whether you can get the |
I'm getting an installation error: |
Seems as though your pip is not working. It could be that a wrong version is referenced (seeing as it has worked before). Make sure you are using the pip provided by anaconda. |
conda remove wbuild broke pip. I reinstalled pip and then managed to install the alternate drop version. snakemake -n gives this next error: The downloaded source packages are in I went into R and manually installed remotes. Then I get this next error:
|
Have you considered distributing drop as a docker? It might solve all of these installation/configuration issues. |
yes, we have a docker https://github.com/c-mertes/docker_drop, but it uses the old version of drop and we would have to take the same steps to update it. You can try to use that instead but you'll need to update drop, as the bugs you are experiencing are still there. Just note that it will need a considerable amount of storage (15GB compressed). I haven't managed to test it on my machine yet due to limited storage though, so I can only help you with the issues that are not related to docker. As for your error message, it seems as though you aren't accessing the conda environment properly, as you still have other versions of R that you are accessing. In conda, you shouldn't be reinstalling the R packages, as they should already exist. Maybe check your PATH variable You can decide which environment to use. For docker, note that you'll have to mount your data in the specific folder structure as described in the README https://github.com/c-mertes/docker_drop. I'm not sure if using your old output data works, but you can definitely give it a try. |
Are you updating docker_drop regularly at https://github.com/c-mertes/docker_drop? Does it have the most up-to-date version of drop? When I uninstalled wbuild, it messed up the conda environment. It removed and switched many dependencies and for some reason caused R to get uninstalled. I'm not sure why. This is why, in my opinion, Docker images are better. Conda is in general a buggy system, because it is not completely disconnected from the local environment and therefore it is not good at handling complicated dependencies. Even though Docker requires more storage, it will save users the headache of trying to get the pipeline working. |
I fixed the R PATH issue for the conda environment. But snakemake -n for the demo still gave an error: I tried to manually install these, and it looks like the error is because dplyr and associated packages were not installed. It gives an option to install them, so it must be that when DROP runs the demo it doesn't answer 'yes' to installing dependencies such as dplyr. |
I succeeded in getting all the packages installed. Now I'm getting this error for the demo: droptest3]$ snakemake -n |
That's a bit surprising, as you had all the dependencies before and wbuild doesn't depend on any R packages, so removing wbuild shouldn't remove any R dependencies. Are you still using the correct R library path? If you want to use the docker instead, you only need to download the developer version of drop (as you did before on your local machine) and remove wbuild using pip, as described below. Then you can reinstall the newest drop version from the local repo. Don't forget to prepare and mount your input data as described in the README. Updating DROP and wbuild
then
And make sure wbuild is installed. In case that doesn't work, try Then you should have the correct dependencies to get a successful dryrun. |
For some reason now gatk is not being found in my conda environment. It was there before. I think it's too complicated. I'm not sure what to try next. If there is a simpler way to setup the drop environment, I'm happy to try, but conda seems complicated. If you have a docker that is built and ready to go, I'm happy to try that. |
It seems as though you aren't properly accessing your conda environment. You can verify with If you are working with docker, just follow the instructions in https://github.com/c-mertes/docker_drop and then follow my previous instructions on updating to the developer version of drop (git@github.com:mumichae/drop.git, branch |
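One way to check whether tools such as `gatk` or `R` are being picked up from the conda environment is sketched below in Python. The function name and the default tool list are assumptions; the check simply compares where each tool resolves on `PATH` with the environment's `bin` directory.

```python
import os
import shutil


def tools_outside_env(prefix, tools=("R", "gatk", "pip", "snakemake"), path=None):
    """Map each tool to where it resolves if that is not `prefix`/bin.

    A value of None means the tool was not found on the search path at all.
    `path` defaults to the current PATH.
    """
    env_bin = os.path.abspath(os.path.join(prefix, "bin"))
    outside = {}
    for tool in tools:
        found = shutil.which(tool, path=path)
        if found is None or os.path.dirname(os.path.abspath(found)) != env_bin:
            outside[tool] = found
    return outside
```

Inside an activated environment, `tools_outside_env(os.environ["CONDA_PREFIX"])` should return an empty dictionary; any entry it does return points at a tool that is missing or coming from another installation.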
Thanks. To make it more simple for users, do you have a docker that is already built and confirmed to work on docker hub? |
Well yes, we have a prebuilt docker container that is about 15GB in compressed size. You don't need to rebuild it, but just run it as described in the README and it'll be downloaded from the mertes/drop repository (which will take time)
for writing the demo or if you are mounting your own data:
It has the old version of drop, so you'll need to manually update to use any github developer version (what I explained before). Does that answer your question? |
Ok, I think to keep things simple I will just wait until a more stable version of drop is released with the above issues resolved. I would appreciate if you let me know once it is released. Thanks. |
Hi, We have a new RNA-seq sample from a syndromic family that we would like to try on DROP. Are any of the new versions of DROP available yet? Thanks. |
Hi,
Here is the next bug in the mae pipeline I am getting. It looks much more like a bug, because gatk is getting an error due to an existing .dict file. This should not give an error, and the gatk command should be told to overwrite it if it exists.
Please let me know when it is fixed.
There are 3 other bugs in the conda mae pipeline that prevent it from working. But I think it will be easier to fix them one by one. I can post the next one after this is fixed.
[Mon Aug 3 22:51:08 2020]
rule create_dict:
input: /gpfs/data/reference-files/GRCh38_gencode-STAR/GRCh38.primary_assembly.genome.fa
output: /gpfs/data/reference-files/GRCh38_gencode-STAR/GRCh38.dict
jobid: 31
INFO 2020-08-03 22:51:54 CreateSequenceDictionary Output dictionary will be written in /gpfs/data/reference-files/GRCh38_gencode-STAR/GRCh38.primary_assembly.genome.dict
22:51:54.446 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/data/bin/drop_conda/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Aug 03 22:51:54 EDT 2020] CreateSequenceDictionary --REFERENCE /gpfs/data/reference-files/GRCh38_gencode-STAR/GRCh38.primary_assembly.genome.fa --TRUNCATE_NAMES_AT_WHITESPACE true --NUM_SEQUENCES 2147483647 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Mon Aug 03 22:51:56 EDT 2020] Executing as evrong01@cn-0044 on Linux 3.10.0-693.17.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_192-b01; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.8.1
[Mon Aug 03 22:51:57 EDT 2020] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=2667577344
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: /gpfs/data/reference-files/GRCh38_gencode-STAR/GRCh38.primary_assembly.genome.dict already exists. Delete this file and try again, or specify a different output file.
at picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:220)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gpfs/data/bin/drop_conda/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar
Running:
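The mismatch visible in the log above (the rule declares `GRCh38.dict`, while Picard writes `GRCh38.primary_assembly.genome.dict`) can be reproduced with two ways of deriving the dictionary name from the FASTA path. This Python sketch only illustrates the effect of cutting the file name at the first dot versus the last one; that the rule derives its output exactly like this is an assumption.

```python
import os

fasta = "/gpfs/data/reference-files/GRCh38_gencode-STAR/GRCh38.primary_assembly.genome.fa"

# Replace only the final extension: the path Picard reports writing to.
last_dot = os.path.splitext(fasta)[0] + ".dict"

# Cut the file name at the first dot: the path the rule lists as its output.
directory, name = os.path.split(fasta)
first_dot = os.path.join(directory, name.split(".")[0] + ".dict")

print(last_dot)
print(first_dot)
```

When the two names differ, the workflow never sees the file it expects, so the rule is scheduled again and Picard refuses to overwrite the dictionary it wrote earlier.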