
Error in streaming SVD of abundance matrix #8

Open
condomitti opened this issue Nov 19, 2015 · 5 comments

@condomitti

Hello,

I've been trying to run the LSA scripts with my own dataset, but I keep getting a 'float division by zero' error no matter what I do with the input data.
I was able to run the entire pipeline with the test data, but not with my own set (Illumina MiSeq paired-end reads, organized in a single interleaved file as generated by LSFScripts/merge_and_split_pair_files.py).

This is the error LSA is printing out:

Starting streaming SVD of conditioned k-mer abundance matrix
printing end of last log file...
    self.add_documents(corpus)
  File "/usr/lib/python2.7/dist-packages/gensim/models/lsimodel.py", line 387, in add_documents
    update = Projection(self.num_terms, self.num_topics, job, extra_dims=self.extra_samples, power_iters=self.power_iters)
  File "/usr/lib/python2.7/dist-packages/gensim/models/lsimodel.py", line 127, in __init__
    extra_dims=self.extra_dims)
  File "/usr/lib/python2.7/dist-packages/gensim/models/lsimodel.py", line 742, in stochastic_svd
    keep = clip_spectrum(s**2, rank, discard=eps)
  File "/usr/lib/python2.7/dist-packages/gensim/models/lsimodel.py", line 86, in clip_spectrum
    small = 1 + len(numpy.where(rel_spectrum > min(discard, 1.0 / k))[0])
ZeroDivisionError: float division by zero

Is this a bug or am I doing something wrong? The error appears right after the hash counting step finishes.

Thank you in advance.

Best,
Condomitti.

@brian-cleary
Owner

Hi,

Sorry for my slow response.

Are you running the distributed version, or the single instance version?

Do you mind sending me the output of "ls -l" for hashed_reads/ and
cluster_vectors/? I think that will help me to diagnose the issue.


@baravalle

Hi,
I'm trying this on a different dataset but I get stuck on exactly the same error. Did you manage to get past this?

Any suggestions?

I have included below the content of my hashed_reads and cluster_vectors folders.
Andres

ls -l hashed_reads/
total 18967940
-rw-r--r--. 1 root root 2 Feb 16 18:54 hashParts.txt
-rw-r--r--. 1 root root 8388608 Feb 18 05:41 MET0432.count.hash
-rw-r--r--. 1 root root 16777216 Feb 18 05:41 MET0432.count.hash.conditioned
-rw-r--r--. 1 root root 30228672 Feb 18 05:41 MET0432.nonzero.npy
-rw-r--r--. 1 root root 19367708753 Feb 18 02:48 MET0432.prinseqoutput.hashq.gz
-rw-r--r--. 1 root root 50001 Feb 16 18:54 Wheels.txt

ls -l cluster_vectors/
total 16388
-rw-r--r--. 1 root root 16777296 Feb 18 05:41 global_weights.npy

@brian-cleary
Owner

Hi Andres,

Is it the case that you have only a single sample there? The premise of LSA is to use covariance information across multiple samples, and the SVD step in particular needs multiple samples to work. I haven't tested the pipeline with a single sample to see whether it generates this error, but it certainly could be the case.
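For what it's worth, here is a minimal numpy sketch of why a single-sample matrix could end in exactly that ZeroDivisionError. The `clip_spectrum_sketch` function below is a hypothetical simplification of the line shown in the traceback (`min(discard, 1.0 / k)` in gensim's `clip_spectrum`), not gensim's actual implementation:

```python
import numpy as np

# Toy k-mer abundance matrix: rows are k-mers, columns are samples.
# With a single sample there is only one column, so the SVD can recover
# at most one singular value and no covariance structure across samples.
single_sample = np.array([[3.0], [0.0], [7.0]])
u, s, vt = np.linalg.svd(single_sample, full_matrices=False)
print(len(s))  # 1 -> a degenerate one-component spectrum

# Hypothetical simplification of the failing line from the traceback:
# if the retained rank k collapses to 0, the comparison against
# min(discard, 1.0 / k) divides by zero, matching the error above.
def clip_spectrum_sketch(s2, k, discard=0.001):
    rel_spectrum = np.abs(1.0 - np.cumsum(s2 / np.sum(s2)))
    return 1 + len(np.where(rel_spectrum > min(discard, 1.0 / k))[0])
```

Calling `clip_spectrum_sketch(s**2, k=0)` raises `ZeroDivisionError: float division by zero`, which is consistent with the traceback in this issue.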


@condomitti
Author

Hi Andres and Brian,
Sorry for my late response.
I managed to get to the final results by executing the LSA steps separately, rather than calling the single script as shown on the sample page. Other than that, nothing special was necessary.
Take care,
Condomitti.

@baravalle

Hi Brian, Condomitti,
thanks for your answers.

Brian, I'm coming to this from a computing background (not that familiar with LSA right now) as part of a multi-disciplinary team. You appear to be right: the data we used as a test may have come from a single sample.

Will do a new test tomorrow, hopefully with the right data, and will ping back.

Thanks again for the help,

  Andres
