500 InternalServerError when trying to PUT a key into a cluster #1538

Closed
rocketraman opened this issue Oct 30, 2014 · 15 comments

@rocketraman

Using master as of commit 6087e2b, I started a 3-member cluster:

$ bin/goreman start
17:34:16 etcd1 | START
17:34:16 etcd2 | START
17:34:16 etcd3 | START
17:34:16 proxy | START
17:34:16 etcd1 | 2014/10/30 17:34:16 etcd: no data-dir provided, using default data-dir ./node1.etcd
17:34:16 etcd2 | 2014/10/30 17:34:16 etcd: no data-dir provided, using default data-dir ./node2.etcd
17:34:16 etcd2 | 2014/10/30 17:34:16 etcdserver: start node 8a69d5f6b7814500 in cluster c6ab534d07e8fcc4
17:34:16 etcd3 | 2014/10/30 17:34:16 etcd: no data-dir provided, using default data-dir ./node3.etcd
17:34:16 etcd3 | 2014/10/30 17:34:16 etcdserver: start node eca0338f4ea31566 in cluster c6ab534d07e8fcc4
17:34:16 etcd1 | 2014/10/30 17:34:16 etcdserver: start node 2c7d3e0b8627375b in cluster c6ab534d07e8fcc4
17:34:16 etcd2 | 2014/10/30 17:34:16 etcd: listening for peers on http://127.0.0.1:7002
17:34:16 etcd2 | 2014/10/30 17:34:16 etcd: listening for client requests on http://127.0.0.1:4002
17:34:16 etcd3 | 2014/10/30 17:34:16 etcd: listening for peers on http://127.0.0.1:7003
17:34:16 etcd3 | 2014/10/30 17:34:16 etcd: listening for client requests on http://127.0.0.1:4003
17:34:16 etcd1 | 2014/10/30 17:34:16 etcd: listening for peers on http://127.0.0.1:7001
17:34:16 etcd1 | 2014/10/30 17:34:16 etcd: listening for client requests on http://127.0.0.1:4001
17:34:16 proxy | 2014/10/30 17:34:16 etcd: proxy listening for client requests on 127.0.0.1:8080
17:34:18 etcd1 | 2014/10/30 17:34:18 raft: elected leader 2c7d3e0b8627375b at term 1
17:34:18 etcd2 | 2014/10/30 17:34:18 raft: elected leader 2c7d3e0b8627375b at term 1
17:34:18 etcd3 | 2014/10/30 17:34:18 raft: elected leader 2c7d3e0b8627375b at term 1
17:34:18 etcd1 | 2014/10/30 17:34:18 etcdserver: published {Name:node1 ClientURLs:[http://127.0.0.1:4001]} to cluster c6ab534d07e8fcc4
17:34:20 etcd3 | 2014/10/30 17:34:20 etcdserver: published {Name:node3 ClientURLs:[http://127.0.0.1:4003]} to cluster c6ab534d07e8fcc4
17:34:20 etcd2 | 2014/10/30 17:34:20 etcdserver: published {Name:node2 ClientURLs:[http://127.0.0.1:4002]} to cluster c6ab534d07e8fcc4

Attempt to put a key:

$ curl -v -L http://localhost:4001/v2/keys/foo -XPUT -d value="this is awesome"
* Adding handle: conn: 0xf40a20
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0xf40a20) send_pipe: 1, recv_pipe: 0
* About to connect() to localhost port 4001 (#0)
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 4001 (#0)
> PUT /v2/keys/foo HTTP/1.1
> User-Agent: curl/7.32.0
> Host: localhost:4001
> Accept: */*
> Content-Length: 21
> Content-Type: application/x-www-form-urlencoded
> 
* upload completely sent off: 21 out of 21 bytes
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Etcd-Cluster-Id: c6ab534d07e8fcc4
< Date: Thu, 30 Oct 2014 21:34:35 GMT
< Transfer-Encoding: chunked
< 
Internal Server Error
* Connection #0 to host localhost left intact

Despite the 500 Internal Server Error, the key is set:

$ curl -L http://localhost:4001/v2/keys/?recursive=true | jsontidy
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   162    0   162    0     0  23366      0 --:--:-- --:--:-- --:--:-- 27000
{
  "node": {
    "createdIndex": 10,
    "modifiedIndex": 10,
    "nodes": [
      {
        "createdIndex": 10,
        "modifiedIndex": 10,
        "value": "this is awesome",
        "key": "/foo"
      }
    ],
    "dir": true
  },
  "action": "get"
}

The console shows:

17:34:35 etcd1 | 2014/10/30 17:34:35 etcdhttp: unexpected error: etcdserver: request timed out
xiang90 added this to the v0.5.0-alpha.1 milestone Oct 30, 2014
@philips (Contributor) commented Oct 30, 2014

@rocketraman I am unable to reproduce. What platform, version of Go, etc.?

@philips (Contributor) commented Oct 30, 2014

@rocketraman Never mind, another developer, @unihorn, knows about this issue; apparently it is intermittent and depends on the timing of proposals vs. the HTTP response. We will work on it.

yichengq self-assigned this Oct 31, 2014
@yichengq (Contributor)

I think this is expected behavior. We set a timeout for each client request and cancel it when the deadline passes. The current timeout is fixed at 0.5s, which is short IMO.
We also hit this in integration testing: #1479 (comment)
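
For reference, the mechanism is essentially the standard context-deadline pattern; a rough, self-contained Go sketch of the idea (not etcd's actual code; the names are illustrative):

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// errTimeout mirrors the kind of error that surfaces as the 500 above
// (illustrative value; etcd's real error may differ).
var errTimeout = errors.New("etcdserver: request timed out")

// propose stands in for submitting a write to the raft log and waiting
// for it to be committed (hypothetical helper for illustration).
func propose(ctx context.Context, committed <-chan struct{}) error {
    select {
    case <-committed:
        return nil // committed before the deadline
    case <-ctx.Done():
        return errTimeout // deadline hit; the write may still commit later
    }
}

func main() {
    // Each client request gets a fixed 0.5s deadline.
    ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
    defer cancel()

    committed := make(chan struct{})
    go func() {
        time.Sleep(600 * time.Millisecond) // simulate a slow commit (e.g. a slow fsync)
        close(committed)
    }()

    if err := propose(ctx, committed); err != nil {
        fmt.Println("error:", err) // the HTTP layer maps this to a 5xx response
    }
}

The key point is the second select case: once the deadline fires, the client gets an error even though the proposal may still be committed afterwards, which matches the "500 but the key is set" behavior above.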

Some thoughts:

  1. We should print a better error message for this case. This would help users understand what happened.
  2. We could set a longer default timeout. This would improve the user experience.
  3. If we want to report whether the request was eventually committed, the strategy needs more discussion.

@rocketraman @philips @xiangli-cmu could you vote on these?

@rocketraman (Author)

The current timeout is fixed at 0.5s

Ah, that explains it. But why would my cluster latency be so high? Everything is running on localhost on a fast machine. The cluster statistics show that communication is faster than with the pre-0.5 code, but latency is indeed averaging just above the 0.5s mark:

{
  "followers": {
    "eca0338f4ea31566": {
      "counts": {
        "success": 574,
        "fail": 0
      },
      "latency": {
        "maximum": 2.758554,
        "minimum": 0.191394,
        "standardDeviation": 0.28647915770740107,
        "average": 0.5837433815331005,
        "current": 0.411192
      }
    },
    "8a69d5f6b7814500": {
      "counts": {
        "success": 638,
        "fail": 0
      },
      "latency": {
        "maximum": 3.390786,
        "minimum": 0.161651,
        "standardDeviation": 0.28758292287036613,
        "average": 0.5975516410658306,
        "current": 0.486271
      }
    }
  },
  "leader": "2c7d3e0b8627375b"
}
  1. We should print a better error message for this case. This would help users understand what happened.

👍 A 202 Accepted would be an appropriate response. A 504 Gateway Timeout might also be appropriate.
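
As an illustration of what that mapping could look like at the HTTP layer, a minimal sketch (not etcd's actual handler; the error value and routing are illustrative):

package main

import (
    "errors"
    "log"
    "net/http"
)

// errTimeout is an illustrative sentinel for "the proposal did not commit in time".
var errTimeout = errors.New("etcdserver: request timed out")

// writeError maps a proposal timeout to a more descriptive status than a
// blanket 500, since the write may still be committed later.
func writeError(w http.ResponseWriter, err error) {
    if err == errTimeout {
        http.Error(w, err.Error(), http.StatusGatewayTimeout) // or http.StatusAccepted
        return
    }
    http.Error(w, err.Error(), http.StatusInternalServerError)
}

func main() {
    http.HandleFunc("/v2/keys/", func(w http.ResponseWriter, r *http.Request) {
        writeError(w, errTimeout) // this toy handler always reports a timeout
    })
    log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}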

  2. We could set a longer default timeout. This would improve the user experience.

It would increase complexity, but could the timeout be based on the cluster statistics?
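
The latency numbers such an adaptive timeout would need are already exposed by the leader. A small sketch of reading them (assuming the /v2/stats/leader endpoint and the field names shown in the JSON above):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// leaderStats is a minimal view of the follower latency statistics.
type leaderStats struct {
    Followers map[string]struct {
        Latency struct {
            Average float64 `json:"average"`
            Maximum float64 `json:"maximum"`
        } `json:"latency"`
    } `json:"followers"`
}

func main() {
    resp, err := http.Get("http://127.0.0.1:4001/v2/stats/leader")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var s leaderStats
    if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
        panic(err)
    }
    for id, f := range s.Followers {
        // A timeout derived from these values (e.g. a multiple of the maximum)
        // would adapt to the cluster instead of being a fixed 0.5s.
        fmt.Printf("follower %s: avg=%v max=%v\n", id, f.Latency.Average, f.Latency.Maximum)
    }
}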

  3. If we want to report whether the request was eventually committed, the strategy needs more discussion.

The HTTP response could include a processing ID that the client uses to query completion status in a subsequent request.

@yichengq (Contributor)

@rocketraman I am not sure whether this mostly relates to network latency; we will analyze its performance soon.

In 0.4, there is no such timeout; it returns only when it gets the result.

@philips (Contributor) commented Oct 31, 2014

@unihorn In almost all cases the disk sync can be more expensive than the network latency. My guess is that the same is happening here.

@jonboulle (Contributor)

@rocketraman FWIW we have bumped the timeout to 5 mins in 0.5 to restore the same behaviour as 0.4.

xiang90 modified the milestones: v0.5.0, v0.5.0-alpha.1 Oct 31, 2014
@rocketraman (Author)

In almost all cases the disk sync can be more expensive than the network latency. My guess is that the same is happening here.

Is there a way I can help analyze this by running some profiling on my machine? Are other people seeing similar numbers when doing goreman start on localhost?

@yichengq (Contributor)

@rocketraman How often do you hit this case? I run the local cluster on my mid-range Mac Pro and never see it.
etcd uses a sync system call whenever it writes data, to ensure it is flushed to disk. Could this be rather slow in your case?

@rocketraman (Author)

How often do you hit this case? I run the local cluster on my mid-range Mac Pro and never see it.

I get this timeout almost 100% of the time. It's a new test cluster created by goreman start. I have reasonably fast hardware: a 6-core i7 @ 3.2 GHz with 32 GB of memory.

etcd uses a sync system call whenever it writes data, to ensure it is flushed to disk. Could this be rather slow in your case?

Is it really a sync system call or is it fsync? I can imagine sync taking more than 0.5s, but fsync shouldn't take that long.

I'm running on a software RAID-5 array which might slow things down a bit, but it shouldn't be significant on the order of fractions of a second.

Here is a dd-based benchmark using dsync:

# dd if=/dev/zero of=tmp/foo/bar.bin bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 7.85743 s, 137 MB/s

# dd if=/dev/zero of=tmp/foo/bar.bin bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 84.2859 s, 6.1 kB/s

So many small 512 B writes with sync are pretty slow at about 84 ms each (as might be expected on a software RAID-5), but that is still well within the 0.5s timeout window.
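
For completeness, fsync latency can also be measured directly, which is closer to what etcd does per write than dd's dsync path; a rough Go sketch along these lines (the path and sizes are illustrative) could be run against the same array:

package main

import (
    "fmt"
    "os"
    "time"
)

func main() {
    // Test file on the array under investigation; adjust the path as needed.
    f, err := os.Create("tmp/foo/fsync-bench.bin")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    buf := make([]byte, 512) // small write, roughly the size of a log entry
    const n = 100

    start := time.Now()
    for i := 0; i < n; i++ {
        if _, err := f.Write(buf); err != nil {
            panic(err)
        }
        if err := f.Sync(); err != nil { // fsync after every small write
            panic(err)
        }
    }
    fmt.Printf("average fsync'd 512 B write: %v\n", time.Since(start)/n)
}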

Are there any other benchmarks I can provide?

I also noticed that when the cluster is running, but idle, the etcd processes are generating a significant number of small write requests continuously. From atop:

LVM | id10-lv_home | busy    100% | read       1 |              | write   4307 | KiB/r      4 |  KiB/w      6 | MBr/s   0.00 |              | MBw/s   0.35 | avq     2.47 | avio 17.6 ms |
...
  PID                         TID                        RDDSK                        WRDSK                        WCANCL                       ADSK                       CMD
22132                           -                           0K                        1172K                            0K                        28%                       etcd
22130                           -                           0K                        1164K                            0K                        28%                       etcd
22128                           -                           0K                        1152K                            0K                        27%                       etcd

Here is some strace summary output showing the system calls for one of the etcd processes over about 10 seconds:

strace -c -f -p 22130
% time     seconds  usecs/call     calls    errors syscall                                                                                                                                
------ ----------- ----------- --------- --------- ----------------                                                                                                                       
 67.35    0.219278         251       873        95 futex                                                                                                                                  
 19.35    0.063000        1086        58           fsync                                                                                                                                  
 11.72    0.038148         112       341           epoll_wait                                                                                                                             
  1.06    0.003455           5       644           select                                                                                                                                 
  0.31    0.001000        1000         1         1 restart_syscall                                                                                                                        
  0.18    0.000597           4       160           write                                                                                                                                  
  0.03    0.000084           0       250       125 read                                                                                                                                   
  0.00    0.000000           0         2           sched_yield                                                                                                                            
------ ----------- ----------- --------- --------- ----------------                                                                                                                       
100.00    0.325562                  2329       221 total    

@yichengq (Contributor) commented Nov 4, 2014

@rocketraman The continuous small requests are expected; they come from the internal implementation.
The benchmarks look reasonable, and I don't know where the problem is.
For comparison, on my 2.4 GHz i5, 8 GB Mac, goreman takes 1.7s to finish all the publishes, and I never hit a request timeout.
@kelseyhightower any thoughts?

@rocketraman (Author)

The benchmarks look reasonable, and I don't know where the problem is.

For comparison, I set up etcd to use my SSD for storage. I have no problems on the SSD, so the issue appears to be related to my RAID array. The dd benchmark above shows the RAID array is indeed slow at small synced writes, but is it so slow that a 0.5s timeout should be exceeded on a simple PUT?

@rocketraman (Author)

I started an etcd cluster with 3 members (via goreman) and performed a PUT every 0.1s for 30s, while tracing I/O using blktrace, to check if there were any obvious issues with my I/O subsystem.

The blktrace -> blkparse -> btt output is here: https://gist.github.com/rocketraman/a8ea43255674b561c53b

The device major/minor numbers are:
( 8, x) : Physical disks
( 9, 0) : RAID device
(253, 2) : Device mapper for LVM

I'm not seeing any obvious problems at the I/O layer that should be causing timeouts in etcd. I do note that etcd's "Q2Q" (queue-to-queue) time is averaging 65 ms (max of 1.02 s). If there are multiple writes involved in each PUT, and etcd is taking that long between each of them, that would explain the timeouts (eight such 65 ms gaps already add up to 0.52 s, just over the timeout). However, etcd does so many writes continuously that it is hard to distinguish its normal ongoing writes from the ones involved in the PUTs.

@xiang90 (Contributor) commented Dec 11, 2014

@rocketraman Can you please try the current master? We have fixed a bunch of small issues that might have caused this.

@xiang90 xiang90 modified the milestones: v0.5.0-rc, v0.5.0 Dec 13, 2014
@xiang90 (Contributor) commented Dec 14, 2014

@rocketraman I am going to close this issue. Thanks for reporting!
Feel free to open another issue if there is still a problem.
