MongoDB may throw an overflow error #37
A recurring mongo-related error pops up in the mcrit server logs from time to time when running an indexing process that submits files to mcrit.

Not sure if this is a mongo issue or an issue in mcrit, but it seems to be related to the ID generation done in mcrit. Can some field in the metadata saved to mongo be bigger than the 8-byte integer limit in BSON?
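A guess at a minimal reproduction, assuming the offending value is a uint64-range integer somewhere in the stored metadata (the field, collection, and database names here are illustrative, not mcrit's actual schema):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["test"]["overflow_repro"]

# BSON's widest integer type is a signed int64, so any value above
# 0x7FFFFFFFFFFFFFFF fails during BSON encoding with an OverflowError
# before the document ever reaches mongod.
collection.insert_one({"base_addr": 0x8000000000000000})
```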
Comments

Hey!

We're at 60k files and more than 50 million functions, pushing this system to its limits, it seems 🦔. Thank you for the quick response, I'll be glad to help with this issue if I can.
Oh, haha, I see. :D I mean, the default configuration given in the docker repo is pretty much tailored to what I expected to be the envisioned use case (up to ~10k files, i.e. manually curated data sets), but I would expect it to still work fine for a low multiple of that. As you seem to aim more than an order of magnitude beyond that, I would still expect that changing some of the parameters should yield tolerable performance.
So, currently our setup is three EC2 instances: one for mongodb alone (4 cores/16gb/disk that should be fast enough), one for the server plus 15 worker replicas (16 cores/64gb), and a pretty small one for nginx/web.

The first immediate thing I noticed is that with a lot (10+) of submitter processes (submitting files to be indexed in mcrit), the default number of threads in waitress (8) was not enough. This made the web UI basically unusable while submitting files; even simple queries would take a long time. Adding […]

The next pretty immediate thing is that some parts of the code don't scale well. For example, in the […]

Also, it seems like the workers don't quite work to their full capacity. The 15 workers we have don't really run 100% in parallel even though there are more than 15 […]
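The truncated "Adding […]" above presumably refers to raising waitress's thread count; a minimal sketch of that, with a hypothetical WSGI app import (the `threads` parameter itself is waitress's own):

```python
from waitress import serve

from myproject.wsgi import app  # placeholder for the actual mcrit WSGI app

# waitress defaults to 8 worker threads; with 10+ concurrent submitter
# processes those are quickly saturated, starving interactive web queries.
# A higher thread count keeps simple queries responsive during bulk submission.
serve(app, host="0.0.0.0", port=8000, threads=32)
```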
Sounds like a nice setup!
Do you mean this one? Line 27 in 2ba6f08
Then I will make […]
Funnily enough, we noticed that at some point as well and had introduced internal counters in the family collection. I just noticed that we never updated the statistics delivery method after this. This was just addressed in mcrit 1.0.19, pushed an hour ago. ;)
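A rough sketch of the counter idea described here, with guessed collection and field names (not mcrit's actual schema):

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mcrit"]  # db name is a guess

# Counting sample documents per family on every statistics request scans
# the collection; an atomic counter on the family document avoids that.
def register_sample(family_id: int) -> None:
    db.families.update_one(
        {"family_id": family_id},
        {"$inc": {"num_samples": 1}},
        upsert=True,
    )

# Statistics delivery then reads the precomputed counter instead of recounting.
def get_num_samples(family_id: int) -> int:
    doc = db.families.find_one({"family_id": family_id})
    return doc["num_samples"] if doc else 0
```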
This may also be related to how the queue works; I would need to replicate that with a bigger setup myself to get some introspection on it. Normally, both minhash calculation and matching should be fully parallelized.
Oh, and the original issue with the mongo error above is indeed related to int64/uint64, or rather to BSON not being able to store uint64 values. My guess is that you have some binaries/smda reports where the base address is above 0x7FFFFFFFFFFFFFFF. I will look into the conversion of such problematic values, at least for the purpose of database storage.

UPDATE: I was able to replicate the issue with a crafted SMDA report. The original issue is now fixed in the just-published mcrit 1.0.20. I am now converting potentially large values (base addresses, function offsets) into two's complement for storage to achieve BSON int64 compatibility.
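A minimal sketch of the conversion described, namely mapping the full uint64 range onto BSON-compatible int64 values via two's complement (function names are mine, not mcrit's):

```python
UINT64_SPAN = 2**64
INT64_MAX = 2**63 - 1

def uint64_to_int64(value: int) -> int:
    # Values above the int64 maximum wrap into the negative range,
    # which is exactly the two's-complement reinterpretation.
    return value - UINT64_SPAN if value > INT64_MAX else value

def int64_to_uint64(value: int) -> int:
    # Inverse mapping when reading back from the database.
    return value + UINT64_SPAN if value < 0 else value

# A base address above 0x7FFFFFFFFFFFFFFF now round-trips losslessly.
addr = 0xFFFF800000000000
assert uint64_to_int64(addr) == -0x800000000000
assert int64_to_uint64(uint64_to_int64(addr)) == addr
```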