Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Structure fails without explanation #464

Closed
alexkrohn opened this issue Sep 22, 2021 · 23 comments
Closed

Structure fails without explanation #464

alexkrohn opened this issue Sep 22, 2021 · 23 comments

Comments

@alexkrohn
Copy link

I'm running Structure from ipyrad analysis on jupyter (v 4.6.1) notebook (v 5.7.10) on Mac OSX (10.14.6). I've gotten Structure to work multiple times in the past on the same machine, but for some reason this dataset is failing.

Basically, instead of taking ~2 hours to run, struct.run(nreps=20, kpop=[2,3,4,5], force = True, auto=True) takes ~10 seconds to run, generates a bunch of temp strfiles, mainparams and extraparams, then ends. No explanation given. It seems obvious the run did not complete, but I'm not sure why.

Here's a link to the HDF5 and the ipynb that I'm running.

When I go to extract the evanno table, it fails with an indexing error (understandably). Other than that and the quick run time, I wouldn't have any idea that the run failed. I've tried running different values of k, restarting the notebook, etc. Other Structure runs on different datasets continue to work.

If it's helpful, the vcf used to generate this HDF5 was made by removing a bunch of individuals from the original ipyrad output vcf using vcftools --remove and a file of individuals to remove.

Any idea what's going on? It might be a useful feature to give an error report when the .run() function fails.

@alexkrohn
Copy link
Author

alexkrohn commented Sep 23, 2021

More info on the code that I'm running. I'm not sure if I could replicate this without my original dataset, mostly becuase this same code works on other datasets.

import ipyrad.analysis as ipa
import toyplot
import pandas as pd

# Load the data
data = "analysis-vcf2hdf5/structure-subset.snps.hdf5"

# Group into populations. 

# Import from csv, then convert to dictionary
df = pd.read_csv("inds-and-pops-subset.txt", sep=',')

imap = df.groupby('pop.for.pca.structure')['ind'].apply(list).to_dict()
                        
# Require that x% of the samples in each pop have data
minmap = {i: 0.001 for i in imap}

# init analysis object with input data and (optional) parameter options. Mincov is the proportion of SNPs shared across all samples
struct = ipa.structure(
    name="mate-subset-60inds",
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.80,
)

#Samples: 60
#Sites before filtering: 227971
#Filtered (bi-allel): 21737
#Filtered (mincov): 227271
#Filtered (minmap): 220599
#Filtered (combined): 227337
#Sites after filtering: 634
#Sites containing missing values: 631 (99.53%)
#Missing values in SNP matrix: 6240 (16.40%) 

struct.mainparams.burnin = 10000
struct.mainparams.numreps = 25000

struct.run(nreps=20, kpop=[2,3,4,5], force = True, auto=True)

#Parallel connection | Alexs-MacBook-Air.local: 4 cores
#[####################] 100% 0:00:15 | running 80 structure jobs

etable = struct.get_evanno_table([2,3,4,5])
etable 

Which leads to the first error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-8-4d8b0eaddda6> in <module>
----> 1 etable = struct.get_evanno_table([2,3,4,5])
      2 etable

//anaconda/envs/py36/lib/python3.6/site-packages/ipyrad/analysis/structure.py in get_evanno_table(self, kvalues, max_var_multiple, quiet)
    569                 raise ValueError('max_variance_multiplier must be >1')
    570 
--> 571         table = _get_evanno_table(self, kvalues, max_var_multiple, quiet)
    572         return table
    573 

//anaconda/envs/py36/lib/python3.6/site-packages/ipyrad/analysis/structure.py in _get_evanno_table(self, kpops, max_var_multiple, quiet)
    930     for kpop in kpops:
    931         ## concat results for k=x
--> 932         reps, excluded = _concat_reps(self, kpop, max_var_multiple, quiet)
    933 
    934         ## report if some results were excluded

//anaconda/envs/py36/lib/python3.6/site-packages/ipyrad/analysis/structure.py in _concat_reps(self, kpop, max_var_multiple, quiet, **kwargs)
    890             min_var_across_reps = np.min([i.var_lnlik for i in reps])
    891         else:
--> 892             min_var_across_reps = reps[0].var_lnlik
    893 
    894         ## iterate over reps

IndexError: list index out of range

@isaacovercast
Copy link
Collaborator

Hi Alex, this will only happen if you have structure installed but it's not set as executable. ipyrad.analysis.structure tests for the presence of the structure binary and raises an error which helpfully suggests how to install this if it doesn't exist. The structure module tests with os.path.exists, so if the binary exists then it passes this check. Later, when you call run(), if the binary exists, but is not set to be executable then the call will fail. I added a check for executability of the binaries (9a59482) that explains how to fix the mode using chmod. This will go out in the next conda package, but for now you'll have to check and fix the file permissions on the structure binary by hand. Good luck.

@alexkrohn
Copy link
Author

Interesting. These same exact commands worked on this machine previously, but not with this dataset. (The dataset later worked on another machine.) I'll keep playing around with it to see if I can figure out why the permissions were wrong during this instance, but worked fine in others.

@isaacovercast
Copy link
Collaborator

Does it work now? Does it work on the exact same computer with a different dataset? The dataset used shouldn't change the behavior in this way if my hypothesis is correct. Are you certain that your jupyter notebook is running in the expected conda env (which has ipyrad and structure installed)? The same command can work differently on the same computer if you are running jupyter inside different conda environments.

@alexkrohn
Copy link
Author

alexkrohn commented Oct 4, 2021 via email

@isaacovercast
Copy link
Collaborator

The vcf file is imported and converted to hdf5 very early in the process of initializing the structure object, and yours seems to be importing fine. Any problems with the VCF format would arise at this stage, so I'm very dubious its a problem with the vcf file. What is the name of the conda environment you're running in? Can you show me conda env list? Can you do this:

conda activate <your_ipyrad_environment>
which structure
ls -l /path/to/your/structure/binary

Of course you'll need to put the name of your conda env there, and also the actual path, not the fake path I put as a standin.

@isaacovercast
Copy link
Collaborator

Can you do this:

ps -ef | grep notebook

@alexkrohn
Copy link
Author

Sure. I installed ipyrad in a Python 3.6 environment on my Mac, running OS 10.14.6
Screen Shot 2021-10-04 at 9 58 30 AM
Screen Shot 2021-10-04 at 9 58 53 AM

@alexkrohn
Copy link
Author

And with the notebook actually runnin
Screen Shot 2021-10-04 at 10 01 03 AM
g

@isaacovercast
Copy link
Collaborator

Mm. How about this: which ipcluster

@alexkrohn
Copy link
Author

Screen Shot 2021-10-04 at 10 09 34 AM

@isaacovercast
Copy link
Collaborator

Well, it certainly all looks like it should work.... The next thing to try is running the structure command by hand outside the notebook. The command we build inside the structure module looks like this:

    cmd = [
        STRUCTURE,
        "-m", mname,
        "-e", ename,
        "-K", str(kpop),
        "-D", str(seed),
        "-N", str(ntaxa),
        "-L", str(nsites),
        "-i", sname,
        "-o", outname,
    ]

So if you can build this command by hand on the command line and run it (inside the py36 environment) then this should tell us the problem. The -m, -e, and -i parameters should indicate the temporary files created inside your structure analysis directory, like this:

structure -m structure-analysis/tmp-watdo-3-1.mainparams.txt -e structure-analysis/tmp-watdo-3-1.extraparams.txt -K 3 -d 1234 -i   structure-analysis/tmp-watdo-3-1.strfile.txt

The rest of the parameters should be straightforward. Run this and let me know what the output is.

@isaacovercast
Copy link
Collaborator

You can also email me your vcf file and your "inds-and-pops-subset.txt" file.

@alexkrohn
Copy link
Author

I emailed the vcf and population assignment file.

Structure fails because there is no NUMIND or NUMLOCI line in the mainparams file.

tmp-mate-subset-60inds-3-1.mainparams.txt
Screen Shot 2021-10-04 at 10 33 22 AM

@isaacovercast
Copy link
Collaborator

I should have been more specific about completing the rest of the command line arguments, my apologies. The NUMIND and NUMLOCI error are because you didn't include the -N and -L command line arguments (these can be written to the file or provided on the command line, which is how ipyrad does it).

@isaacovercast
Copy link
Collaborator

The vcf file you sent works for me:

image

though I wonder about the vcf file because it is quite different in terms of number of snps than what is reported above. Did you send me a subset of the vcf file you originally posted about?

image

@isaacovercast
Copy link
Collaborator

I suspect this is a 'mac' thing, since i'm running on linux. Can you try adding the -N and -L arguments to the structure command as you had it above to see if it gets any further?

The other thing to try is updating to the most recent version of ipyrad (0.9.81) there's a non-zero probability that this problem you are reporting has already been fixed, if you're working with an older version.

@alexkrohn
Copy link
Author

I sent you the full vcf via email, sorry for the wrong file.

I used this command to run Structure:

structure -m tmp-mate-subset-60inds-3-1.mainparams.txt -e tmp-mate-subset-60inds-3-1.extraparams.txt -K 3 -d 1234 -i tmp-mate-subset-60inds-3-1.strfile.txt -N 60 -L 634

634 is the total number of sites remaining after filtering, not the total in the HDF5. Structure gives this error, which I'm not sure how to interpret. Why would it think there should be 1482 sites?

Screen Shot 2021-10-04 at 11 13 34 AM

Would updating ipyrad just involve running conda update ipyrad while in the environment that has ipyrad installed? Or do I have to reinstall ipyrad completely using conda?

@isaacovercast
Copy link
Collaborator

The new file also works for me:
image

conda install -c bioconda ipyrad --force-reinstall

If it isn't an ipyrad version issue then its certainly a mac issue, we don't see many mac users, so bugs tend to hang around longer.

@alexkrohn
Copy link
Author

Interesting. I updated ipyrad and am still having the same issue. I guess it must be a Mac issue. Luckily other datasets still work with this code, so if I find myself in this situation again, maybe I'll have better luck figuring out what causes this behavior. Thanks a lot for trying to work though it with me.

@isaacovercast
Copy link
Collaborator

It's not a mac thing. It must be something goofy with your environment. I did a clean conda install on a mac using the data you sent and it works fine:

cd ~
rm -rf miniconda3
bash Miniconda*
conda create -n ipyrad python=3.8
conda activate ipyrad
conda install -c conda-forge -c bioconda ipyrad
conda install structure clumpp -c ipyrad
import ipyrad.analysis as ipa
import toyplot
import pandas as pd

# Load the data
#data = "subset-and-08maxmissing.recode.vcf"
data = "subset-only-for-ipyrad.recode.vcf"

# Group into populations. 

# Import from csv, then convert to dictionary
df = pd.read_csv("inds-and-pops-subset.txt", sep=',')

imap = df.groupby('pop.for.pca.structure')['ind'].apply(list).to_dict()
                        
# Require that x% of the samples in each pop have data
minmap = {i: 0.001 for i in imap}

# init analysis object with input data and (optional) parameter options. Mincov is the proportion of SNPs shared across all samples
struct = ipa.structure(
    name="mate-subset-60inds",
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.80,
)
Converting vcf to HDF5 using default ld_block_size: 20000
Typical RADSeq data generated by ipyrad/stacks will ignore this value.
You can use the ld_block_size parameter of the structure() constructor to change
this value.

Indexing VCF to HDF5 database file
VCF: 227971 SNPs; 78927 scaffolds
[                    ]   0% 0:00:00 | converting VCF to HDF5 
This appears to be a denovo assembly, ld_block_size arg is being ignored.
[########            ]  43% 0:00:34 | converting VCF to HDF5 
This appears to be a denovo assembly, ld_block_size arg is being ignored.
[#################   ]  87% 0:01:04 | converting VCF to HDF5 
This appears to be a denovo assembly, ld_block_size arg is being ignored.
[####################] 100% 0:01:12 | converting VCF to HDF5 
HDF5: 227971 SNPs; 78928 linkage group
SNP database written to ./analysis-vcf2hdf5/subset-only-for-ipyrad.recode.snps.hdf5
Samples: 60
Sites before filtering: 227971
Filtered (indels): 0
Filtered (bi-allel): 21737
Filtered (mincov): 227271
Filtered (minmap): 220599
Filtered (subsample invariant): 35192
Filtered (minor allele frequency): 0
Filtered (combined): 229829
Sites after filtering: 509
Sites containing missing values: 508 (99.80%)
Missing values in SNP matrix: 5048 (16.53%)
SNPs (total): 509
SNPs (unlinked): 259
struct.mainparams.burnin = 10000
struct.mainparams.numreps = 25000

struct.run(nreps=20, kpop=[2,3,4,5], force = True, auto=True)
[######              ]  30% 0:03:06 | running 80 structure jobs 

Good luck!

@alexkrohn
Copy link
Author

alexkrohn commented Oct 4, 2021 via email

@kindofausername
Copy link

Encountered the same issue due to ipyclient = ipp.Client(cluster_id="ipyrad") running in a different conda environment with a different Python version.

ipyclient was running on python 3.10 and structure was running on python 3.7.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants