Segfault with REE on Linux and Mac OS X #3
Comments
Can you send me a code example that causes this behaviour and/or the core file that is dumped? |
I've not been able to pin down the code that is causing this. I typically have a lot of background jobs where I use Amatch for some string comparisons. Once in a while (maybe after a hundred jobs processed, maybe after more than a thousand) it crashes. I don't know how to get the core dump either. In my console or in the logs, I get a line saying that REE crashed, but nothing more. If you know what I can do, tell me, I'll do it. Thanks. |
@seamusabshere I'm pretty sure that my code is not threaded. It's a Rails 2.3 app, running on REE 1.8.7 |
@flori is there any other debugging information I can get you? |
I've attached a
I don't know if this is relevant to my issue since I know absolutely nothing about I'd be more than happy to help debug this further if I had some instructions. |
It is related to my issue. If I terminate the |
should be using xfree() rather than free() to free any memory alloced with ALLOC() or ALLOC_N(). those ALLOC functions do more than just a malloc(), and xfree() is the correct complement for them (or so it seems, i haven't dug that deeply - i spent enough time worming my way through macro-land in this gem's source) this will likely fix all your segfaults. it fixed mine. please fix this so we can use a working version in production. |
sorry, that was incorrect. it was another change i made in the caller - i was using the JaroWinkler matcher and noticed the segfaults occurred in the xfree call at the end in the FREE_STRINGS macro (which is actually a correct xfree() call). to stop the segfault, i used ignore_case=false and downcased the strings in ruby to avoid the if clause altogether. this likely won't apply for the other posters, but i'll debug the c a bit further tomorrow and post again if there is anything worth mentioning. |
@redbeardenterprises if you write a .patch that I can easily apply, I might be able to test it with my code base. |
the patch wouldn't fix your problem, it didn't end up fixing mine. it is definitely a problem that those ALLOC calls aren't matched with xfree. if you grep gc.c in ruby source for CALC_EXACT_MALLOC_SIZE you'll see that those calls allocate an extra size_t on top of the size requested and tack a header on the allocated block which indicates the size of the allocation. they also track the allocated memory within the GC context. this means that an unmatched free would free the wrong pointer (without doing the decrement), which would normally blow the heap immediately, so the problem wouldn't be sporadic. thus it's unlikely that any of us are actually using ruby interpreters where CALC_EXACT_MALLOC_SIZE was enabled during compilation. when it's not enabled, free will work normally. this does not mean those frees shouldn't be fixed, they should still be changed to xfree() calls. i'm digging into the gem a bit further now. |
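To make the mismatch described above concrete, here is a minimal sketch of a size-prefixed allocator. It is not the code in Ruby's gc.c: `sketch_alloc`, `sketch_free`, and `sketch_live_bytes` are hypothetical names invented for illustration, and the sketch only assumes the behaviour described in the comment above (an extra size_t header plus a running byte count).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the allocator's running count of live bytes. */
static size_t sketch_live_bytes = 0;

/* Allocates `size` bytes preceded by a size_t header holding `size`.
 * The caller receives the address just past the header, not the address
 * that malloc() returned. (Overflow of sizeof(size_t) + size is not
 * checked in this sketch.) */
static void *sketch_alloc(size_t size)
{
    size_t *block = malloc(sizeof(size_t) + size);
    if (block == NULL) {
        return NULL;
    }
    block[0] = size;
    sketch_live_bytes += size;
    return block + 1;
}

/* The matching release: steps back over the header, undoes the
 * accounting, and frees the pointer malloc() actually returned. */
static void sketch_free(void *ptr)
{
    if (ptr == NULL) {
        return;
    }
    size_t *block = (size_t *)ptr - 1;
    assert(sketch_live_bytes >= block[0]);
    sketch_live_bytes -= block[0];
    free(block);
}
```

Passing a pointer from `sketch_alloc` to plain `free()` would hand the C library an address it never returned and would skip the counter decrement, which is why a mismatch under such a scheme would be expected to fail right away rather than sporadically.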
don't have much more time to spend on this, will either use another gem (fuzzy-string-match works for me since i'm using the jaro metric) or roll my own implementation, which would likely be faster than debugging this one. |
That's too bad. I really thought there was something here. Thanks for having tried. |
I'm also getting a segfault, but without REE. I'm just using default ruby-1.9.2-p290 with RVM on Mac. I can send whatever other details are required as well. I had to drop back to version 0.2.5 before I was able to stop my app from faulting. |
In case it could be useful to anyone who understands that gibberish:
|
Is it a coincidence that the only two calls to
If only I had a test case that always blows up... but maybe somebody knows off the top of their head? |
They should be xfrees by now, but it seems I overlooked them in a previous patch. I just released the 0.2.10 version, that should get this right. Maybe enterprise ruby doesn't alias free as xfree and this would explain the REE crashes? Anyway I feel pretty stupid now... |
great, thanks, i'll test this right now! i'm on ruby 1.9.2 (heroku) and was still seeing segfaults on pair_distance_similar. perhaps this issue should be renamed to "Segfaults" |
that appears to have fixed segfaults on Pair Distance for me. thank you! |
I've just had a SEGFAULT on the same setup as before, but with Amatch 0.2.10 :-/ I'll tell you if I get more |
It is strange! I've just written a small stress test (https://gist.github.com/1770127) and I can't crash it. But in my real world app, with a very similar use case, it still crashes with 0.2.10. I'll try to reproduce and let you know. |
Running the stress test 4 times yielded 3 different results...
Here are all the crash reports: https://gist.github.com/1772342 For (2) and (3), I found these lines:
@flori maybe we can ask @wyhaines to look at the memory-handling code? |
Hey again. You piqued my interest. I observed failures using 0.2.10 atop MRI 1.9.2-p290 + OS X 10.7.3 using the stress test. Ran several times, most failed, typically in the jaro routines. Reverted to 0.2.5, could not reproduce. Moved to 0.2.6, reproduced several times, easily. Looked at code, the only interesting thing was the addition of some free calls pulled by Florian from another author. Forked the code. Removing these frees did help with the issue, though there is nothing wrong with them and they are indeed necessary to prevent a big memory leak. Turned to Valgrind. Noticed many invalid read/write errors. Moved macros around so I could debug a bit better and when reading through COMPUTE_JARO, noticed this code: ...
high = (i + max_dist < b_len ? i + max_dist : b_len);
for(j = low; j <= high; j++) {
if (!l[1][j] && a_ptr[i] == b_ptr[j]) {
... This can cause a buffer overflow, since high is capped at b_len but the arrays l[1] and b_ptr both have length b_len. Because C arrays are zero-indexed, the last valid index is b_len - 1, so when j reaches b_len the loop reads one element past the end, with unpredictable results - particularly in the assignment a couple of lines down: l[1][j] = 1; Without knowing the Jaro algorithm well I'd hesitate to say that changing the <= to a strict < is the correct solution, but it should stop the overflows and seems like a reasonable solution for these segfaults. After changing the <= to a strict < I can no longer reproduce the issue. I've forked and patched the issue in my own repository (@redbeardenterprises/amatch). I'm going to keep using my fork and see if any more trouble arises. |
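The bounds problem described above can be sketched in isolation. This is not the gem's COMPUTE_JARO macro and not necessarily the fix that was eventually merged: `count_window_matches` and its parameters are hypothetical, and the sketch assumes the intent is to visit indices up to and including i + max_dist while never touching index b_len. It does that by using an exclusive upper bound clamped to b_len.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts characters of `a` that find a not-yet-used
 * equal character in `b` within `max_dist` positions of the same index.
 * `used` must have b_len entries, all zero on entry. */
static size_t count_window_matches(const char *a, size_t a_len,
                                   const char *b, size_t b_len,
                                   size_t max_dist, char *used)
{
    size_t matches = 0;
    for (size_t i = 0; i < a_len; i++) {
        size_t low = (i > max_dist) ? i - max_dist : 0;
        /* Exclusive bound: one past i + max_dist, clamped to b_len, so the
         * largest index visited is at most b_len - 1. */
        size_t high = (i + max_dist + 1 < b_len) ? i + max_dist + 1 : b_len;
        for (size_t j = low; j < high; j++) {
            assert(j < b_len); /* the invariant the original loop violated */
            if (!used[j] && a[i] == b[j]) {
                used[j] = 1;
                matches++;
                break;
            }
        }
    }
    return matches;
}
```

With an inclusive `j <= high` and `high` clamped to b_len, the same loop would index `b[b_len]` and `used[b_len]` whenever the window reaches the end of `b`, which is the out-of-bounds access Valgrind reported.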
Same issue here, MRI 1.9.2-p290 + OS X 10.7.3. Switching to @redbeardenterprises 's fork resolved the issue for me. Hopefully this is a solution and not just a bandaid :/ |
Did you have issues with either the Jaro or the Jaro-Winkler algorithm before? |
I can't say that my segfaults were specifically due to Jaro or Jarowinkler. |
This is my first usage of the gem; We're only using Jaro-Winkler thus far, it seemed to be the one we wanted to use, though I didn't try others. Some members of the team were getting segfaults similar to above. |
It seems pretty difficult to find a common pattern in all of these reports, so there might be different causes for the segfaults. Also, if a segfault happens during garbage collection, this might have been caused by a bug in some other ruby extension or ruby itself. The actual reason is sometimes difficult to figure out. I will look into the Jaro-Winkler computation and release a new version soon, though. |
We were using Jaro-Winkler and Substring. Substring never gave any problems and Jaro-Winkler did consistently. I haven't noticed any issues since I've patched it, but we're not going to be using it heavily for another few days. |
Shoot, the segfaults are still happening with Pair Distance. https://gist.github.com/8bfe1132e31ba97a3102 As you can see, this is happening on Heroku Celadon Cedar running ruby 1.9.2p290 (2011-07-09 revision 32553) [x86_64-linux] |
Yes, remember I mentioned that unless the interpreter was built with the CALC_EXACT_MALLOC_SIZE, the ALLOC/xfree mismatches wouldn't cause a problem (at least in the version of ruby source I was looking at). Can you reproduce this without your app? I would be willing to use valgrind to investigate what is happening. |
For everyone using Jaro/Jaro-Winkler, I just updated @redbeardenterprises/amatch with the correct solution after reviewing the algorithm. The strict The two obvious ways to fix the overflow while preserving behavior are:
I chose the latter because it seemed more intuitive. The overflow/segfault should still be fixed (tested with @jelcour's stress test and the test case provided in #5). |
As an update, we just used the @redbeardenterprises fork and jaro winkler heavily in our application for hours and observed no segfaults. Previously they presented within minutes - so at least in our case, that overflow appears to have been the main culprit. |
Hi, tonight, I had many SEGFAULTs, but one of them gave me more information than usual :
It might be useful. |
@redbeardenterprises you rock, thanks for the patch, I'm using your branch now :) |
@redbeardenterprises can you open a Pull Request for that? |
👍 for a pull-request and a new release |
Since this morning, I've been running some tasks that use Amatch intensively, and every few minutes I've had a segfault. |
Have a look: https://rubygems.org/gems/amatch/versions/0.2.11 I have just merged the @redbeardenterprises changes and released a new version. |
Thanks guys, I missed the message from @felipecsl over the holiday. Was going to do the pull request but you beat me to it! |
No worries, I am much worse than you in this regard. ;) |
Thanks to everyone involved in this epic issue. I'm really glad we have finally got it fixed. |
Hi,
I've been using AMatch for 2 years, with a regular MRI 1.8.7 in production, but with versions 0.2.5 and 0.2.7, I have SEGFAULTs on a regular basis when running with REE.
I thought it was happening only on 64-bit Debian, but recently I've had some on 32-bit Ubuntu and now also on Mac OS X (10.6.8).
There is a more detailed ticket in the REE issue tracker : http://code.google.com/p/rubyenterpriseedition/issues/detail?id=71
Tell me if there is anything I can do to help solve this very annoying thing.
Besides that, I'm very happy with AMatch. Thanks for maintaining it.