Running pycisTopic on very large datasets [PERFORMANCE] #106
Hi @simozhou, This step can take a long time; however, 4 days is still a lot. Did any intermediate models finish in this time, or is it stuck at running the model with 2 topics? I would also suggest specifying a `save_path`, so finished models do not have to be recalculated. All the best, Seppe
Hi @SeppeDeWinter, Thank you so much for your feedback! All models do run eventually, although very slowly (the 2-topic model runs faster; for obvious reasons, larger models with more topics are slower). I will definitely add a save path to avoid recalculating all models every time. I am providing 450 GB of RAM for this job. Do you believe that a larger amount of RAM may help with the speed of computations? Thanks again and best regards,
Hi @simozhou, 450 GB of RAM should be enough. I'm not sure why it's running so slowly for you... All the best, Seppe
I am also running Mallet with a very large dataset. I have saved intermediate models in case it terminates before completion. How can I combine the topic models from multiple runs into a single mallet.pkl in this case?
Hi @tiffanywc, We store each model as an entry in a list, so models from several runs can be collected like this:

```python
import os
import pickle

models = []
for file in os.listdir(<PATH_TO_DIRECTORY_WITH_MODELS>):
    # check whether the file is a result from topic modelling, e.g. based on its name
    if file.endswith(".pkl"):
        with open(os.path.join(<PATH_TO_DIRECTORY_WITH_MODELS>, file), "rb") as f:
            models.append(pickle.load(f))
```

I hope this helps? All the best, Seppe
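For reference, this pattern can be exercised end to end with plain pickled dictionaries standing in for real model objects; the directory layout, file names, and dictionary contents below are made up purely for illustration:

```python
import os
import pickle
import tempfile

# Write a few pickled stand-ins for finished models into a temporary directory
model_dir = tempfile.mkdtemp()
for n_topics in (2, 5, 10):
    with open(os.path.join(model_dir, f"model_{n_topics}_topics.pkl"), "wb") as f:
        pickle.dump({"n_topics": n_topics}, f)

# Collect every pickled model in the directory into a single list
models = []
for file in sorted(os.listdir(model_dir)):
    if file.endswith(".pkl"):
        with open(os.path.join(model_dir, file), "rb") as f:
            models.append(pickle.load(f))
```

Sorting the file names makes the order of `models` deterministic across runs, which is convenient when comparing the combined list afterwards.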
Hello @simozhou, I'm wondering if you managed to find a resolution, because I'm currently facing a similar challenge:
Despite the seemingly small number of topics and substantial computational resources, the process is taking an unexpectedly long time. Have you encountered any solutions or optimizations that might help in this scenario? Any insights or workarounds you've discovered would be greatly appreciated. Thank you!
Hi @TemiLeke, In short, no, I have not yet solved my runtime problem. There are a few improvements, though, that helped make it at least tolerable.
This is the code I'm currently using:
I would like to point out that computation is still very slow, and it would be good to address this problem. I ran my 1-million-cell dataset and it took 8 days of computation with the aforementioned parameters (which was more or less expected, but it would be ideal to shorten this time for the next iteration if possible :) ). @SeppeDeWinter, is there something we can do to help? I would be happy to contribute and possibly figure out why this is so slow!
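The save-and-resume idea discussed in this thread (adding a save path so that models finished by an earlier, possibly interrupted run are not recomputed) can be sketched as a simple skip-if-exists check. Note that `train_model`, `train_or_load`, and the file layout here are hypothetical placeholders for illustration, not pycisTopic's actual API:

```python
import os
import pickle
import tempfile

def train_model(n_topics):
    # Hypothetical stand-in for an expensive Mallet training run
    return {"n_topics": n_topics}

def train_or_load(n_topics, save_dir):
    """Train a model only if no saved result exists yet; otherwise load it."""
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, f"model_{n_topics}_topics.pkl")
    if os.path.exists(path):
        # A previous (possibly interrupted) run already finished this model
        with open(path, "rb") as f:
            return pickle.load(f)
    model = train_model(n_topics)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model

save_dir = tempfile.mkdtemp()
first = train_or_load(5, save_dir)   # trains and caches the 5-topic model
second = train_or_load(5, save_dir)  # found on disk, so no retraining
```

With one pickle per topic number, a crashed multi-model run only repeats the model that was in progress, not the ones already saved.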
Thanks a lot for the detailed reply @simozhou. I'm currently trying this out. Unfortunately, I only have access to a 40-core system, so it would take even longer. I agree it would be good to address the problem, and I'd be very happy to contribute in any capacity. @SeppeDeWinter
Hi @simozhou and @TemiLeke, I am running (or trying to run) topic modelling on a dataset with almost 1.5 million cells and 600,000 regions. Indeed, these operations require a lot of memory and take a long time, but with the latest command-line version, pre- and postprocessing of the corpus is already twice as fast as it used to be, so you could try that one. I have not yet managed to let the full topic modelling finish, since I am still figuring out exactly how much memory to request (I was requesting too little, giving me out-of-memory errors), but most of the time the Mallet training step completes, while the memory bottleneck is in the loading of the assigned topics. The CLI code I am using now looks like this, and I am only running it for one topic number at a time:
Hi @JulieDeMan, I haven't tried out the CLI version yet, but it would be interesting to see how significantly it speeds up the training. The memory-consuming part of the pipeline indeed has to do with loading the assigned topics back into memory. To resolve this, I developed a custom function that significantly reduces memory usage, albeit at the cost of increased processing time (see below). It involves loading and processing the topics in smaller chunks, such that the full set of assignments never has to be held in memory at once. Here's how I implemented this in pycistopic:
Please note that this is a crude implementation, and there may well be a more efficient approach.
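The chunking idea itself can be shown in isolation: stream a topic-assignment file in fixed-size batches and keep only per-topic totals, so that the whole file is never resident in memory at once. The simple `doc_id topic` line format below is a made-up stand-in for illustration, not Mallet's actual output format:

```python
import itertools
from collections import Counter
from io import StringIO

def topic_counts_chunked(lines, chunk_size=1000):
    """Accumulate per-topic counts from 'doc_id topic' lines,
    holding at most chunk_size lines in memory at a time."""
    counts = Counter()
    it = iter(lines)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        counts.update(int(line.split()[1]) for line in chunk)
    return counts

# Simulated assignment file: 10 documents alternating between topics 0 and 1
fake_file = StringIO("\n".join(f"doc{i} {i % 2}" for i in range(10)))
counts = topic_counts_chunked(fake_file, chunk_size=3)
```

Peak memory here is bounded by `chunk_size` rather than by the file size; the trade-off, as noted above, is extra processing time from iterating in batches.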
What type of problem are you experiencing, and which function is your problem related to?
I am running pycisTopic on a very large dataset (200k cells) and it apparently takes very long. It has approximately 80k regions.
I am running the mallet version of pycisTopic, and the function has these params:
Is there a way I can speed up computations? At the moment it has been running for more than 4 days, and I plan to run it on an even bigger dataset (1M cells). I have the feeling I might be doing something wrong and that maybe I could do something differently (maybe not use Mallet? not sure). Do you have suggestions on this?
The machine it runs on has 64 CPUs and 500GB of RAM available.
Version information
pycisTopic: 1.0.3.dev20+g8955c76