
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N #38

Closed · dulanafdo opened this issue Oct 8, 2020 · 9 comments

@dulanafdo

I'm using a set of text documents (PDF documents converted into text) for topic modeling. While training the model I'm getting this error.
It would be a great help if someone could help me sort this out.
```
C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\umap\umap_.py:1678: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py:1590: RuntimeWarning: k >= N for N * N square matrix. Attempting to use scipy.linalg.eigh instead.
  warnings.warn("k >= N for N * N square matrix. "
Traceback (most recent call last):
  File "c:/Users/prabo/Desktop/Topic modeling pipeline/test.py", line 27, in <module>
    model = Top2Vec(documents=df.text, speed="learn", workers=8)
  File "C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\top2vec\Top2Vec.py", line 222, in __init__
    umap_model = umap.UMAP(n_neighbors=15,
  File "C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\umap\umap_.py", line 1965, in fit
    self.embedding_ = simplicial_set_embedding(
  File "C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\umap\umap_.py", line 1033, in simplicial_set_embedding
    initialisation = spectral_layout(
  File "C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\umap\spectral.py", line 324, in spectral_layout
    eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
  File "C:\Users\prabo\Desktop\Topic modeling pipeline\.venv\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py", line 1595, in eigsh
    raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.
```

@ddangelov (Owner)

What is in df.text?

@dulanafdo (Author)

> What is in df.text?

It's a pandas DataFrame column that contains a set of text documents.

@ddangelov (Owner)

Sounds like you have fewer than 15 documents; if so, that is the issue. For best results you need thousands, but you certainly need more than 15, since UMAP looks for the 15 nearest neighbours when doing dimensionality reduction.
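
(For reference, a minimal sketch that reproduces this failure mode. The random vectors here are purely illustrative stand-ins for document embeddings, and the behaviour assumes the UMAP version shown in the traceback; `n_neighbors=15` appears in the traceback itself, and Top2Vec reduces to 5 dimensions, so the spectral initialisation needs more eigenvectors than a tiny graph has.)

```python
# Purely illustrative: random vectors stand in for document embeddings.
import numpy as np
import umap

X = np.random.rand(5, 256)  # a "corpus" of only 5 document vectors

# Roughly the parameters Top2Vec passes to UMAP internally.
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine")

# First warns that n_neighbors exceeds the dataset size, then the
# spectral initialisation asks eigsh for k = n_components + 1 = 6
# eigenvectors of a 5 x 5 graph Laplacian, giving the k >= N TypeError.
embedding = reducer.fit_transform(X)
```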

@dulanafdo (Author)

Then that should be the issue. I gave it only a few documents to test it out.
Thank you very much for your help.

@ddangelov (Owner)

No problem!

@dulanafdo (Author)

With a higher number of documents, the error I previously had is gone, but now I'm getting this error.

File "c:/Users/prabo/Desktop/Topic modeling pipeline/test.py", line 44, in
model = Top2Vec(documents=df.text, speed="learn", workers=5,min_count=1)
File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\top2vec\Top2Vec.py", line 236, in init
self.create_topic_vectors(cluster.labels)
File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\top2vec\Top2Vec.py", line 268, in _create_topic_vectors
self.topic_vectors = np.vstack([self.model.docvecs.vectors_docs[np.where(cluster_labels == label)[0]]
File "<array_function internals>", line 5, in vstack
File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\numpy\core\shape_base.py", line 283, in vstack
return _nx.concatenate(arrs, 0)
File "<array_function internals>", line 5, in concatenate
ValueError: need at least one array to concatenate

@ddangelov (Owner)

Without knowing what your data looks like, I cannot say what is causing this error. Could you try running the code below?

```python
from top2vec import Top2Vec
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)
```
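
(For scale: the `subset='all'` split of 20 newsgroups contains roughly 18,000 posts, comfortably in the "thousands of documents" range where Top2Vec works well.)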

@dulanafdo (Author)

> Without knowing what your data looks like, I cannot say what is causing this error. Could you try running the code below?
>
> ```python
> from top2vec import Top2Vec
> from sklearn.datasets import fetch_20newsgroups
>
> newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
>
> model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)
> ```

This works fine without any trouble.
I'm still using a small corpus with only 20 documents. Is that the problem?

This is what my code looks like:
[screenshot: Capture]

@ddangelov (Owner)

Yes, you need more than 20 documents, usually thousands for best results. Top2Vec has to learn both word and document vectors, which it is probably unable to do with so little data. There may be a pre-trained model in the future, and perhaps an option for other embedding methods that do not require lots of data. For the time being, you will need to use more, and longer, documents so that the model can learn vectors.
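
(A hedged sketch of why the second traceback ends in `ValueError: need at least one array to concatenate`: with too little data, HDBSCAN can label every document as noise (`-1`), so the per-cluster list that Top2Vec stacks with `np.vstack` comes out empty. This mirrors the logic visible in the `_create_topic_vectors` frame of the traceback; the data below is made up for illustration.)

```python
# Illustrative sketch of why np.vstack fails: if HDBSCAN labels every
# document as noise (-1), there are no clusters to build topic vectors from.
import numpy as np

document_vectors = np.random.rand(20, 300)  # stand-in doc2vec vectors
cluster_labels = np.full(20, -1)            # HDBSCAN: everything is noise

unique_labels = set(cluster_labels)
unique_labels.discard(-1)                   # drop the noise label

# unique_labels is now empty, so np.vstack receives an empty list and
# raises "ValueError: need at least one array to concatenate".
topic_vectors = np.vstack(
    [document_vectors[np.where(cluster_labels == label)[0]].mean(axis=0)
     for label in unique_labels]
)
```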
