
segmentation fault #26

Closed · Hamper opened this issue Dec 8, 2014 · 20 comments

@Hamper commented Dec 8, 2014

I'm getting segfaults in the client under high load (the bin content is a map object like {key1: timestamp, key2: timestamp, ...}). Backtraces from two crashes:
#0 0x00007ffff6c5ead0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff6c60146 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff6c62c95 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007ffff7783ded in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff6994343 in prepare (args=...) at ../src/main/client/put.cc:88
#5 0x00007ffff699905b in async_invoke (args=..., prepare=0x7ffff69942d0 <prepare(v8::Arguments const&)>, execute=0x7ffff6994140 <execute(uv_work_t*)>, respond=0x7ffff6993e70 <respond(uv_work_t*, int)>) at ../src/main/util/async.cc:37
#6 0x00007ffff6994c1e in AerospikeClient::Put (args=...) at ../src/main/client/put.cc:290
#7 0x00003f02aad1ad99 in ?? ()
#8 0x00007fffffffaad8 in ?? ()
#9 0x00007fffffffaaf0 in ?? ()
#10 0x0000000000000003 in ?? ()
#11 0x0000000000000000 in ?? ()


#0 0x00007ffff6c5ead0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff6c60146 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff6c62c95 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007ffff7783ded in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff6995a98 in prepare (args=...) at ../src/main/client/select.cc:90
#5 0x00007ffff699905b in async_invoke (args=..., prepare=0x7ffff6995a30 <prepare(v8::Arguments const&)>, execute=0x7ffff69958e0 <execute(uv_work_t*)>, respond=0x7ffff6995620 <respond(uv_work_t*, int)>) at ../src/main/util/async.cc:37
#6 0x00007ffff699630e in AerospikeClient::Select (args=...) at ../src/main/client/select.cc:294
#7 0x0000392b8fe53339 in ?? ()
#8 0x00007fffffffde98 in ?? ()
#9 0x00007fffffffdeb0 in ?? ()
#10 0x0000000000000003 in ?? ()
#11 0x0000000000000000 in ?? ()
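
For reference, here is a minimal sketch of the kind of write that reaches the prepare() path in put.cc shown in the first backtrace. The host, namespace, set, and bin names are hypothetical; only the record shape (a single bin holding a {key: timestamp, ...} map) matches the description above.

var aerospike = require('aerospike');
var status = aerospike.status;

var client = aerospike.client({
    hosts: [ { addr: '127.0.0.1', port: 3000 } ]
});

client.connect(function(err, client) {

  //verify connected ok
  if (err.code != status.AEROSPIKE_OK) throw 'failed connection';

  //hypothetical key and record; the bin value is a map of key -> timestamp
  var key = { ns: 'test', set: 'demo', key: 'some_key' };
  var record = {
    timestamps: {
      key1: Date.now(),
      key2: Date.now()
    }
  };

  //the crash was reported when many such writes run concurrently
  client.put(key, record, function(err) {
    if (err.code != status.AEROSPIKE_OK) console.log('put failed: ' + err.message);
  });
});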

@GayathriKaliyamoorthy (Contributor) commented

Hi Hamper,
I am not able to reproduce the issue in-house.
Can you describe the kind of load that generates this segfault, and the configuration of the client machine on which this high load is being run?

Thanks.

@Hamper (Author) commented Dec 23, 2014

This test script can reproduce the error: https://dl.dropboxusercontent.com/u/12199274/github/test.tar.gz

Client machine: Intel® Core™ i7-4770 quad-core Haswell, 32 GB DDR3 RAM, Ubuntu 12.04

@GayathriKaliyamoorthy (Contributor) commented

Hamper,
Thanks for providing us with the scripts to reproduce the problem.
I ran the script on Ubuntu, Debian 7, and CentOS 6 for more than 4 hours but could not reproduce the same segfault. The stack trace you reported points to a memory allocation failure. We have fixed a number of memory leaks and other problems and released the latest version to npm. Could you give it a shot now and see whether you run into the same problem again?

Thanks.

@Hamper (Author) commented Jan 23, 2015

I can still reproduce it with this script (Ubuntu 14.10, Node.js 0.10.35, aerospike client 1.0.26, Aerospike server 3.4.1 with default settings). Running node main.js -P 1 -I 10 -O 100000 from the benchmark also reproduces the error.

@courtneycouch commented

We're also getting sporadic segmentation faults under high loads.

@courtneycouch commented

We set up totally fresh servers on EC2 (r3.xlarge instances) running Ubuntu, with only the bare minimum installed to get the Node.js client and Aerospike running, to verify there wasn't anything odd conflicting.

Both are Ubuntu 14.04.1 LTS with Node.js v0.10.36.

In case it matters, the Aerospike server was initialized with 10M keys, though it doesn't seem like the server would have any impact here since this is a client error.

We're considering moving some of our higher-load data over, but it would be accessed through the Node.js client, so it really needs to be rock solid under heavy, highly concurrent load.

Using the script below we can get the faults pretty consistently. Sometimes it nearly finishes; other times the fault happens right away. The Aerospike server is remote to the client as well.

Obviously this is hugely stripped down; I just wanted the bare minimum to reproduce it.

var async = require('async');
var aerospike = require('aerospike');
var status = aerospike.status;

var client = aerospike.client({
    hosts: [ { addr: '10.10.10.220', port: 3000 } ]
});

var key = {
  ns: 'store_disk_ebs',
  set: 'benchmark',
  key: 'keytest'
};

var iteration_count = 100000;
var concurrency_count = 10;

//connect to server
client.connect(function(err, client) {

  //verify connected ok
  if (err.code != status.AEROSPIKE_OK) throw 'failed connection';

  //loop iteration_count number of times in a series
  async.timesSeries(iteration_count, function(n, iteration) {

    //print iteration count every 1000 iterations
    if (n % 1000 === 0) console.log('iteration: ' + n);

    //for each iteration run this command concurrency_count times in parallel
    async.times(concurrency_count, function(n, concurrent) {

      //get the key
      client.get(key, function() {
        //we don't need to do anything with the result. just complete this command
        concurrent();
      });

    }, iteration); //after the concurrency_count commands are complete, start a new iteration

  });
});

@GayathriKaliyamoorthy (Contributor) commented

Thanks for your patience.

We have identified the issue. There was a problem with the way we were handling buffers in the V8 layer. We fixed it and made an official release to npm. The latest version is 1.0.28. Please use the latest version and give us your feedback.

Thanks.

@vivekkrbajpai commented

Hey Gayathri,

It seems version 1.0.28 also contains this error. I am getting these entries in the Amazon Linux kernel log:

[ 1514.330637] node[21522]: segfault at 4 ip 00007fea72009833 sp 00007fea6bffea88 error 6 in libc-2.17.so[7fea71eda000+19b000]
Feb 9 16:47:15 ip-xx-xxx-xx-xx kernel: [ 1514.375729] node[21531]: segfault at 4 ip 00007fe77003d833 sp 00007fe76e059a88 error 6 in libc-2.17.so[7fe76ff0e000+19b000]
Feb 9 16:47:15 ip-xx-xxx-xx-xx kernel: [ 1514.381353] node[21538]: segfault at 4 ip 00007f9bdc501833 sp 00007f9bd6be2a88 error 6 in libc-2.17.so[7f9bdc3d2000+19b000]
Feb 9 16:47:15 ip-xx-xxx-xx-xx kernel: [ 1514.388211]
Feb 9 16:47:16 ip-xx-xxx-xx-xx kernel: [ 1515.116007] node[21566]: segfault at 4 ip 00007f16308f9833 sp 00007f162affca88 error 6 in libc-2.17.so[7f16307ca000+19b000]
Feb 9 16:47:17 ip-xx-xxx-xx-xx kernel: [ 1516.078188] node[21595]: segfault at 4 ip 00007eff66f56833 sp 00007eff64f72a88 error 6 in libc-2.17.so[7eff66e27000+19b000]
Feb 9 16:47:17 ip-xx-xxx-xx-xx kernel: [ 1516.672639] node[21599]: segfault at 4 ip 00007f13bb000833 sp 00007f13b901ca88 error 6 in libc-2.17.so[7f13baed1000+19b000]
Feb 9 16:47:17 ip-xx-xxx-xx-xx kernel: [ 1516.694516] node[21591]: segfault at 4 ip 00007fc4cfe8e833 sp 00007fc4ccea8a88 error 6 in libc-2.17.so[7fc4cfd5f000+19b000]
Feb 9 16:47:17 ip-xx-xxx-xx-xx kernel: [ 1516.901260] node[21631]: segfault at 4 ip 00007fa381e3c833 sp 00007fa37b7fda88 error 6 in libc-2.17.so[7fa381d0d000+19b000]
Feb 9 16:47:18 ip-xx-xxx-xx-xx kernel: [ 1517.719136] node[21643]: segfault at 4 ip 00007ff011a87833 sp 00007ff00a7fba88 error 6 in libc-2.17.so[7ff011958000+19b000]

Could you please check?

@ryanwitt (Contributor) commented

@vivekkrbajpai are you using batchGet()? I'm getting segfaults in that function on 1.0.28, but not with get().
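
For reference, this is the batchGet() call pattern being discussed -- a rough sketch against the 1.0.x API with hypothetical keys, reusing the client and status objects from the script above; the per-result fields are assumed here, not taken from a confirmed reproduction of the crash.

//hypothetical keys -- only the call shape matters here
var keys = [
  { ns: 'test', set: 'demo', key: 'record-1' },
  { ns: 'test', set: 'demo', key: 'record-2' },
  { ns: 'test', set: 'demo', key: 'record-3' }
];

client.batchGet(keys, function(err, results) {
  if (err.code != status.AEROSPIKE_OK) {
    console.log('batchGet failed: ' + err.message);
    return;
  }
  //assumes each result entry carries a per-key status and record
  results.forEach(function(result) {
    if (result.status == status.AEROSPIKE_OK) {
      console.log(result.record);
    }
  });
});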

@GayathriKaliyamoorthy (Contributor) commented

@ryanwitt Thanks for the input regarding batchGet. I'll try to reproduce the segfault with batchGet and root-cause the issue.

Thanks

@vivekkrbajpai commented

@GayathriKaliyamoorthy @ryanwitt Yeah, I am using batchGet(). Under high concurrent load it produces a segfault.

@GayathriKaliyamoorthy (Contributor) commented

@vivekkrbajpai @ryanwitt I have identified a hole in the batchGet logic: if one of the keys in batchKeys is corrupted or not constructed properly, the driver segfaults. Can you confirm that your application always sends well-constructed batch keys to the Aerospike Node.js driver? A check like the sketch below would catch malformed keys on the application side.
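
Purely as an illustration of what "well constructed" means here, this is a hypothetical application-side pre-flight check (not part of the driver), assuming string keys as in the examples later in this thread:

//hypothetical helper: reject batch keys with missing or non-string
//ns/set/key fields before they ever reach client.batchGet()
function assertWellFormedKeys(keys) {
  keys.forEach(function(k, i) {
    if (!k || typeof k.ns !== 'string' || typeof k.set !== 'string' ||
        typeof k.key !== 'string' || k.key.length === 0) {
      throw new Error('malformed batch key at index ' + i + ': ' + JSON.stringify(k));
    }
  });
}

//usage: validate before every batch call
assertWellFormedKeys(keys);
client.batchGet(keys, function(err, results) { /* handle results as usual */ });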

I also tried reproducing a segfault under heavy load, but could not. Could you give a sample code snippet so that I can work on reproducing the segfault under high load?

Thanks

@vivekkrbajpai commented

Yup, my keys are well formed and not corrupted or empty.

@ryanwitt (Contributor) commented

They're all of the form:

var keys = [
 { ns: 'namespace', set: 'some_set', key: 'some, possibly very long key' },
 ...
];

Is there any limit to the length of the key itself?

@GayathriKaliyamoorthy (Contributor) commented

Could you give an approximate size of each batch request?

Thanks

@vivekkrbajpai commented

It's about 5 to 10 keys in each batch.

@GayathriKaliyamoorthy (Contributor) commented

@Hamper @courtneycouch One of our customers has reported back that, after the fix, the driver ran without any segfault for 19 continuous hours. Could you also confirm this, please?

Thanks

@GayathriKaliyamoorthy (Contributor) commented

@vivekkrbajpai @ryanwitt Could you please open another issue for the segfault in the batchGet API? It will be easier for us to track the issues.

Thanks

@vivekkrbajpai commented

Sure

@Hamper (Author) commented Feb 11, 2015

It works, thanks.

Hamper closed this as completed Feb 11, 2015
jhecking added a commit that referenced this issue Apr 26, 2016