Run out of memory on 1.6m point dataset with 300 dimensions. #56

Closed
kongyq opened this issue Nov 1, 2020 · 11 comments

@kongyq

kongyq commented Nov 1, 2020

Hi, great work on Top2Vec! I am trying to apply it to my dataset, which has 1.6 million instances. I successfully trained the Doc2Vec model inside Top2Vec with the default 300 dimensions, but I run out of memory during the UMAP step within 2 minutes. BTW, I have 32 GB of memory. I also tried low_memory=True; same OOM.

So I wonder: how much memory will UMAP take for 2 million points with 300 dimensions? And as a precaution, how much more memory will HDBSCAN need?

Thank you!
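(For reference, a minimal standalone sketch of the step that is failing, run outside Top2Vec. The model path is hypothetical, and the UMAP parameters mirror what Top2Vec passes by default, i.e. n_neighbors=15, n_components=5, metric='cosine'.)

```python
# Illustrative sketch, not from the thread: load the trained Doc2Vec
# document vectors and reduce them with umap-learn directly.
import umap
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("doc2vec.model")  # hypothetical path
vectors = model.dv.vectors  # gensim >= 4.0; older: model.docvecs.vectors_docs

# low_memory=True makes umap-learn use a slower but leaner nearest-neighbor
# search; this is the flag the reporter tried.
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine",
                    low_memory=True)
embedding = reducer.fit_transform(vectors)  # OOMs here on 1.6M x 300 input
```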

@ddangelov
Copy link
Owner

Thank you!

Which version of UMAP are you using? This should not be happening with the memory you have and the size of the dataset.
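(One quick way to check the installed version, for anyone following along:)

```python
# Report the installed umap-learn version.
import pkg_resources
print(pkg_resources.get_distribution("umap-learn").version)
```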

@kongyq
Author

kongyq commented Nov 1, 2020

It's 0.4.6; it comes bundled when I install Top2Vec with pip.

@ddangelov
Owner

Do you have a screenshot of the error?

@kongyq
Author

kongyq commented Nov 1, 2020

There are no details for the error, but I am pretty sure it's running out of memory. I use PyCharm to run the notebook and monitor memory from the terminal: when UMAP runs, it eats all the memory and the notebook server restarts.

@ddangelov
Owner

You could try UMAP with init='random'.
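(For context: init='random' skips UMAP's default spectral initialization of the low-dimensional embedding, which can itself be costly on a large graph. A sketch, reusing the `vectors` array from the earlier example:)

```python
# Same reduction as before, but with a random initialization instead of the
# default init="spectral". Reuses `vectors` from the earlier sketch.
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine",
                    init="random", low_memory=True)
embedding = reducer.fit_transform(vectors)
```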

@kongyq
Author

kongyq commented Nov 1, 2020

And all of this happens within 2 minutes. I know Doc2Vec takes about 8 GB of memory, but that still leaves 24 GB for UMAP to use. The author of UMAP has mentioned it's memory hungry, but that hungry? OK, I will try init='random'. Thank you, I will let you know the result.

@kongyq
Author

kongyq commented Nov 2, 2020

Hi, I just tried with init='random'. The issue is the same: OOM.

@ddangelov
Owner

I have run Top2Vec on my own laptop on a 1.2 million document dataset with no issues, and it has less RAM than your system. This seems to be a UMAP problem, so unfortunately I cannot help any further.

@kongyq
Author

kongyq commented Nov 2, 2020

Hi, I think I just need more RAM. I did some tests with fewer points: 20k points take 7 GB and 80k points take 21 GB, so 1.6m should take around 42 GB. I think UMAP really needs to optimize its RAM usage in a future version.
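(A hedged sketch of how a scaling test like this can be automated. It assumes Linux, where ru_maxrss is reported in kilobytes, and reuses `vectors` from the earlier sketch; ideally each size runs in a fresh process, since peak RSS never decreases within one process.)

```python
# Measure peak RSS after fitting UMAP on random subsamples of the data.
import resource
import numpy as np
import umap

def peak_gb_after_fit(vectors, n):
    idx = np.random.choice(len(vectors), size=n, replace=False)
    umap.UMAP(n_neighbors=15, n_components=5, metric="cosine",
              low_memory=True).fit(vectors[idx])
    # ru_maxrss is the process's peak resident set size, in KiB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2

for n in (20_000, 80_000):
    print(f"{n} points -> ~{peak_gb_after_fit(vectors, n):.1f} GB peak RSS")
```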

@kongyq
Author

kongyq commented Nov 2, 2020

If you don't mind, could you please share your rig's specs (CPU and RAM)? And how did you install UMAP: through pip install umap-learn, or through pip install top2vec? Thanks a lot!

@ddangelov
Owner

I created a fresh conda environment, followed by pip install top2vec.
