Run out of memory on 1.6m point dataset with 300 dimensions. #56

Closed
kongyq opened this issue Nov 1, 2020 · 11 comments

@kongyq

kongyq commented Nov 1, 2020

Hi, great work on Top2Vec! I am trying to apply it to my dataset, which has 1.6 million instances. I successfully trained the Doc2Vec model inside Top2Vec with the default 300 dimensions, but I run out of memory during the UMAP step within 2 minutes. BTW, I have 32 GB of memory. I also tried low_memory=True; same OOM.

So I wonder: how much memory will UMAP take for 2 million points with 300 dimensions? And as a precaution, how much more memory will HDBSCAN need?

Thank you!
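(For reference, a minimal standalone sketch of the step that is failing, run outside Top2Vec. The model path is hypothetical, and the UMAP parameters mirror what Top2Vec passes by default, i.e. n_neighbors=15, n_components=5, metric='cosine'.)

```python
# Illustrative sketch, not from the thread: load the trained Doc2Vec
# document vectors and reduce them with umap-learn directly.
import umap
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("doc2vec.model")  # hypothetical path
vectors = model.dv.vectors  # gensim >= 4.0; older: model.docvecs.vectors_docs

# low_memory=True makes umap-learn use a slower but leaner nearest-neighbor
# search; this is the flag the reporter tried.
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine",
                    low_memory=True)
embedding = reducer.fit_transform(vectors)  # OOMs here on 1.6M x 300 input
```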

@ddangelov
Copy link
Owner

Thank you!

Which version of UMAP are you using? This should not be happening with the memory you have and the size of the dataset.
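(One quick way to check the installed version, for anyone following along:)

```python
# Report the installed umap-learn version.
import pkg_resources
print(pkg_resources.get_distribution("umap-learn").version)
```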

@kongyq
Author

kongyq commented Nov 1, 2020

It's 0.4.6; it comes bundled when I install Top2Vec with pip.

@ddangelov
Owner

Do you have a screenshot of the error?

@kongyq
Author

kongyq commented Nov 1, 2020

There are no details for the error, but I am pretty sure it's running out of memory. I use PyCharm to run the notebook and monitor memory from the terminal: when UMAP runs, it eats all the memory and the notebook server restarts.

@ddangelov
Owner

You could try UMAP with init='random'.
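(For context: init='random' skips UMAP's default spectral initialization of the low-dimensional embedding, which can itself be costly on a large graph. A sketch, reusing the `vectors` array from the earlier example:)

```python
# Same reduction as before, but with a random initialization instead of the
# default init="spectral". Reuses `vectors` from the earlier sketch.
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine",
                    init="random", low_memory=True)
embedding = reducer.fit_transform(vectors)
```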

@kongyq
Author

kongyq commented Nov 1, 2020

And all of this happens within 2 minutes. I know Doc2Vec takes about 8 GB of memory, but that still leaves 24 GB for UMAP to use. The author of UMAP has mentioned it's memory hungry, but that hungry? OK, I will try init='random'. Thank you, I will let you know the result.

@kongyq
Author

kongyq commented Nov 2, 2020

Hi, I just tried with init='random'. The issue is the same: OOM.

@ddangelov
Owner

I have run Top2Vec on my own laptop on a 1.2 million document dataset with no issues, and it has less RAM than your system. This seems to be a UMAP problem, so unfortunately I cannot help any further.

@kongyq
Author

kongyq commented Nov 2, 2020

Hi, I think I just need more RAM. I did some tests with fewer points: 20k points take 7 GB and 80k points take 21 GB, so 1.6m should take around 42 GB. I think UMAP really needs to optimize its RAM usage in a future version.
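(A hedged sketch of how a scaling test like this can be automated. It assumes Linux, where ru_maxrss is reported in kilobytes, and reuses `vectors` from the earlier sketch; ideally each size runs in a fresh process, since peak RSS never decreases within one process.)

```python
# Measure peak RSS after fitting UMAP on random subsamples of the data.
import resource
import numpy as np
import umap

def peak_gb_after_fit(vectors, n):
    idx = np.random.choice(len(vectors), size=n, replace=False)
    umap.UMAP(n_neighbors=15, n_components=5, metric="cosine",
              low_memory=True).fit(vectors[idx])
    # ru_maxrss is the process's peak resident set size, in KiB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2

for n in (20_000, 80_000):
    print(f"{n} points -> ~{peak_gb_after_fit(vectors, n):.1f} GB peak RSS")
```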

@kongyq
Author

kongyq commented Nov 2, 2020

If you don't mind, could you please share your rig's specs (CPU and RAM)? And how did you install UMAP: through pip install umap-learn, or through pip install top2vec? Thanks a lot!

@ddangelov
Owner

I created a fresh conda environment, followed by pip install top2vec.
