
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N #38

Closed · dulanafdo opened this issue Oct 8, 2020 · 9 comments

@dulanafdo

I'm using a set of text documents (PDF documents converted into text) for topic modeling. While training the model I'm getting this error.
It would be a great help if someone could help me sort this out.
```
C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\umap\umap_.py:1678: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py:1590: RuntimeWarning: k >= N for N * N square matrix. Attempting to use scipy.linalg.eigh instead.
  warnings.warn("k >= N for N * N square matrix. "
Traceback (most recent call last):
  File "c:/Users/prabo/Desktop/Topic modeling pipeline/test.py", line 27, in <module>
    model = Top2Vec(documents=df.text, speed="learn", workers=8)
  File "C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\top2vec\Top2Vec.py", line 222, in __init__
    umap_model = umap.UMAP(n_neighbors=15,
  File "C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\umap\umap_.py", line 1965, in fit
    self.embedding_ = simplicial_set_embedding(
  File "C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\umap\umap_.py", line 1033, in simplicial_set_embedding
    initialisation = spectral_layout(
  File "C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\umap\spectral.py", line 324, in spectral_layout
    eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
  File "C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py", line 1595, in eigsh
    raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.
```

@ddangelov (Owner)

What is in df.text?

@dulanafdo (Author)

> What is in df.text?

It's a pandas DataFrame column that contains a set of text documents.

@ddangelov (Owner)

Sounds like you have fewer than 15 documents; if so, that is the issue. For best results you need thousands, but you certainly need more than 15, since UMAP looks for the 15 nearest neighbours when doing dimensionality reduction.
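
(For reference, a minimal sketch that reproduces this failure mode. The random vectors here are purely illustrative stand-ins for document embeddings, and the behaviour assumes the UMAP version shown in the traceback; `n_neighbors=15` appears in the traceback itself, and Top2Vec reduces to 5 dimensions, so the spectral initialisation needs more eigenvectors than a tiny graph has.)

```python
# Purely illustrative: random vectors stand in for document embeddings.
import numpy as np
import umap

X = np.random.rand(5, 256)  # a "corpus" of only 5 document vectors

# Roughly the parameters Top2Vec passes to UMAP internally.
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine")

# First warns that n_neighbors exceeds the dataset size, then the
# spectral initialisation asks eigsh for k = n_components + 1 = 6
# eigenvectors of a 5 x 5 graph Laplacian, giving the k >= N TypeError.
embedding = reducer.fit_transform(X)
```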

@dulanafdo (Author)

Then that should be the issue. I gave it only a few documents to test it out.
Thank you very much for your help.

@ddangelov (Owner)

No problem!

@dulanafdo (Author)

With a higher number of documents, the error I previously had is gone, but now I'm getting this error.

File "c:/Users/prabo/Desktop/Topic modeling pipeline/test.py", line 44, in
model = Top2Vec(documents=df.text, speed="learn", workers=5,min_count=1)
File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\top2vec\Top2Vec.py", line 236, in init
self.create_topic_vectors(cluster.labels)
File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\top2vec\Top2Vec.py", line 268, in _create_topic_vectors
self.topic_vectors = np.vstack([self.model.docvecs.vectors_docs[np.where(cluster_labels == label)[0]]
File "<array_function internals>", line 5, in vstack
File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\numpy\core\shape_base.py", line 283, in vstack
return _nx.concatenate(arrs, 0)
File "<array_function internals>", line 5, in concatenate
ValueError: need at least one array to concatenate

@ddangelov (Owner)

Without knowing what your data looks like, I cannot say what is causing this error. Could you try running the code below?

```python
from top2vec import Top2Vec
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)
```
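
(For scale: the `subset='all'` split of 20 newsgroups contains roughly 18,000 posts, comfortably in the "thousands of documents" range where Top2Vec works well.)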

@dulanafdo (Author)

> Without knowing what your data looks like, I cannot say what is causing this error. Could you try running the code below?
>
> ```python
> from top2vec import Top2Vec
> from sklearn.datasets import fetch_20newsgroups
>
> newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
>
> model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)
> ```

This works fine without any trouble.
I'm still using a small corpus with only 20 documents. Is that the problem?

This is what my code looks like:
[screenshot: Capture]

@ddangelov (Owner)

Yes, you need more than 20 documents, usually thousands for best results. Top2Vec has to learn both word and document vectors, which it is probably unable to do with so little data. There may be a pre-trained model in the future, and perhaps an option for other embedding methods that do not require lots of data. For the time being, you will need to use more, and longer, documents so that the model can learn vectors.
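
(A hedged sketch of why the second traceback ends in `ValueError: need at least one array to concatenate`: with too little data, HDBSCAN can label every document as noise (`-1`), so the per-cluster list that Top2Vec stacks with `np.vstack` comes out empty. This mirrors the logic visible in the `_create_topic_vectors` frame of the traceback; the data below is made up for illustration.)

```python
# Illustrative sketch of why np.vstack fails: if HDBSCAN labels every
# document as noise (-1), there are no clusters to build topic vectors from.
import numpy as np

document_vectors = np.random.rand(20, 300)  # stand-in doc2vec vectors
cluster_labels = np.full(20, -1)            # HDBSCAN: everything is noise

unique_labels = set(cluster_labels)
unique_labels.discard(-1)                   # drop the noise label

# unique_labels is now empty, so np.vstack receives an empty list and
# raises "ValueError: need at least one array to concatenate".
topic_vectors = np.vstack(
    [document_vectors[np.where(cluster_labels == label)[0]].mean(axis=0)
     for label in unique_labels]
)
```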
