
Error in streaming SVD of abundance matrix #8

Open
condomitti opened this issue Nov 19, 2015 · 5 comments

@condomitti

Hello,

I've been trying to run the LSA scripts with my own dataset, but I keep getting a 'float division by zero' error no matter what I do with the input data.
I was able to run the entire pipeline with the test data, but not with my own set (Illumina MiSeq paired-end reads, organized in a single interleaved file as generated by LSFScripts/merge_and_split_pair_files.py).

This is the error LSA is printing out:

Starting streaming SVD of conditioned k-mer abundance matrix
printing end of last log file...
    self.add_documents(corpus)
  File "/usr/lib/python2.7/dist-packages/gensim/models/lsimodel.py", line 387, in add_documents
    update = Projection(self.num_terms, self.num_topics, job, extra_dims=self.extra_samples, power_iters=self.power_iters)
  File "/usr/lib/python2.7/dist-packages/gensim/models/lsimodel.py", line 127, in __init__
    extra_dims=self.extra_dims)
  File "/usr/lib/python2.7/dist-packages/gensim/models/lsimodel.py", line 742, in stochastic_svd
    keep = clip_spectrum(s**2, rank, discard=eps)
  File "/usr/lib/python2.7/dist-packages/gensim/models/lsimodel.py", line 86, in clip_spectrum
    small = 1 + len(numpy.where(rel_spectrum > min(discard, 1.0 / k))[0])
ZeroDivisionError: float division by zero

Is this a bug or am I doing something wrong? The error appears right after the hash counting step finishes.

Thank you in advance.

Best,
Condomitti.

@brian-cleary
Owner

Hi,

Sorry for my slow response.

Are you running the distributed version, or the single instance version?

Do you mind sending me the output of "ls -l" for hashed_reads/ and
cluster_vectors/? I think that will help me to diagnose the issue.


@baravalle

Hi,
I'm trying this on a different dataset but I get stuck on exactly the same error. Did you manage to get past this?

Any suggestions?

I have included below the content of my hashed_reads and cluster_vectors folders.
Andres

ls -l hashed_reads/
total 18967940
-rw-r--r--. 1 root root 2 Feb 16 18:54 hashParts.txt
-rw-r--r--. 1 root root 8388608 Feb 18 05:41 MET0432.count.hash
-rw-r--r--. 1 root root 16777216 Feb 18 05:41 MET0432.count.hash.conditioned
-rw-r--r--. 1 root root 30228672 Feb 18 05:41 MET0432.nonzero.npy
-rw-r--r--. 1 root root 19367708753 Feb 18 02:48 MET0432.prinseqoutput.hashq.gz
-rw-r--r--. 1 root root 50001 Feb 16 18:54 Wheels.txt

ls -l cluster_vectors/
total 16388
-rw-r--r--. 1 root root 16777296 Feb 18 05:41 global_weights.npy

@brian-cleary
Owner

Hi Andres,

Is it the case that you have only a single sample there? The premise of LSA is to use covariance information across multiple samples, and the SVD step in particular needs multiple samples to work. I haven't tested the pipeline with a single sample to see whether it generates this error, but it certainly could be the case.
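For what it's worth, here is a minimal numpy sketch of why a single-sample matrix could end in exactly that ZeroDivisionError. The `clip_spectrum_sketch` function below is a hypothetical simplification of the line shown in the traceback (`min(discard, 1.0 / k)` in gensim's `clip_spectrum`), not gensim's actual implementation:

```python
import numpy as np

# Toy k-mer abundance matrix: rows are k-mers, columns are samples.
# With a single sample there is only one column, so the SVD can recover
# at most one singular value and no covariance structure across samples.
single_sample = np.array([[3.0], [0.0], [7.0]])
u, s, vt = np.linalg.svd(single_sample, full_matrices=False)
print(len(s))  # 1 -> a degenerate one-component spectrum

# Hypothetical simplification of the failing line from the traceback:
# if the retained rank k collapses to 0, the comparison against
# min(discard, 1.0 / k) divides by zero, matching the error above.
def clip_spectrum_sketch(s2, k, discard=0.001):
    rel_spectrum = np.abs(1.0 - np.cumsum(s2 / np.sum(s2)))
    return 1 + len(np.where(rel_spectrum > min(discard, 1.0 / k))[0])
```

Calling `clip_spectrum_sketch(s**2, k=0)` raises `ZeroDivisionError: float division by zero`, which is consistent with the traceback in this issue.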


@condomitti
Author

Hi Andres and Brian,
Sorry for my late response.
I managed to get to the final results by executing the LSA steps separately, rather than calling the single script as shown on the sample page. Other than that, nothing special was necessary.
Take care,
Condomitti.

@baravalle

Hi Brian, Condomitti,
thanks for your answers.

Brian, I'm coming to this from a computing background (not that familiar with LSA right now) as part of a multi-disciplinary team. You appear to be right: the data we used as a test may have come from a single sample.

Will do a new test tomorrow, hopefully with the right data, and will ping back.

Thanks again for the help,

  Andres
