Write operations can lock the NFS server #7
OS version, JVM version, Hadoop version? Does kill -3 print nothing to the .out or .err file?
OS: Ubuntu 11.10. Looks like I was looking in the wrong place for the kill -3 output:

Full thread dump OpenJDK 64-Bit Server VM (20.0-b11 mixed mode):
"OutputStreamHandler-192.168.1.10:944" daemon prio=10 tid=0x0000000001629000 nid=0x6549 waiting on condition [0x00007f7093599000]
"RPCServer-192.168.1.10:944" prio=10 tid=0x0000000001635000 nid=0x6548 runnable [0x00007f709086c000]
"ResponseProcessor for block blk_-1335885186091250665_38679" daemon prio=10 tid=0x0000000001a28800 nid=0x64d8 runnable [0x00007f7090e72000]
"ResponseProcessor for block blk_1791805563866326323_38679" daemon prio=10 tid=0x0000000001d56000 nid=0x64d2 runnable [0x00007f7091175000]
"WriteOrderHandler-/tmp/foo5.test" daemon prio=10 tid=0x00000000014b8000 nid=0x64c1 waiting on condition [0x00007f709076b000]
"ResponseProcessor for block blk_-1767613122637356870_38679" daemon prio=10 tid=0x00000000019c9000 nid=0x64c0 runnable [0x00007f7091e82000]
"WriteOrderHandler-/tmp/foo4.test" daemon prio=10 tid=0x00000000019c7000 nid=0x64b9 waiting on condition [0x00007f709288c000]
"WriteOrderHandler-/tmp/foo3.test" daemon prio=10 tid=0x00000000018ca800 nid=0x64b5 waiting on condition [0x00007f7091276000]
"ResponseProcessor for block blk_1423889860524175386_38679" daemon prio=10 tid=0x0000000001aa4000 nid=0x64b4 runnable [0x00007f7090c70000]
"WriteOrderHandler-/tmp/foo2.test" daemon prio=10 tid=0x00000000014cd000 nid=0x64af waiting on condition [0x00007f7090a6e000]
"DataStreamer for file /tmp/foo5.test block blk_-1335885186091250665_38679" daemon prio=10 tid=0x0000000001e69800 nid=0x64ab in Object.wait() [0x00007f7091c80000]
"DataStreamer for file /tmp/foo4.test block blk_-1767613122637356870_38679" daemon prio=10 tid=0x0000000001ff5800 nid=0x64a7 in Object.wait() [0x00007f7092286000]
"DataStreamer for file /tmp/foo3.test block blk_1791805563866326323_38679" daemon prio=10 tid=0x0000000001470000 nid=0x64a2 in Object.wait() [0x00007f7091f83000]
"DataStreamer for file /tmp/foo2.test block blk_1423889860524175386_38679" daemon prio=10 tid=0x0000000001604800 nid=0x649f in Object.wait() [0x00007f709278b000]
"LeaseChecker" daemon prio=10 tid=0x0000000001e50800 nid=0x6173 waiting on condition [0x00007f709096d000]
"pool-1-thread-10" prio=10 tid=0x00000000013b2800 nid=0x6116 sleeping[0x00007f7092b8e000]
"pool-1-thread-9" prio=10 tid=0x00000000013b0800 nid=0x6115 sleeping[0x00007f7092c8f000]
"pool-1-thread-8" prio=10 tid=0x00000000013ae800 nid=0x6114 sleeping[0x00007f7092d90000]
"pool-1-thread-7" prio=10 tid=0x00000000013ac800 nid=0x6113 sleeping[0x00007f7092e91000]
"pool-1-thread-6" prio=10 tid=0x0000000001563000 nid=0x6112 sleeping[0x00007f7092f92000]
"pool-1-thread-5" prio=10 tid=0x0000000001560800 nid=0x6111 sleeping[0x00007f7093093000]
"pool-1-thread-4" prio=10 tid=0x0000000001425000 nid=0x6110 sleeping[0x00007f7093194000]
"pool-1-thread-3" prio=10 tid=0x0000000001423000 nid=0x610f sleeping[0x00007f7093295000]
"pool-1-thread-2" prio=10 tid=0x000000000162b800 nid=0x610e sleeping[0x00007f7093396000]
"pool-1-thread-1" prio=10 tid=0x0000000001567800 nid=0x610d sleeping[0x00007f7093497000]
"RPCServer-NFS4Handler-2050" prio=10 tid=0x000000000180a000 nid=0x610a runnable [0x00007f7093bc3000]
"MetricsPrinter" daemon prio=10 tid=0x0000000001625000 nid=0x60fb waiting on condition [0x00007f70942ea000]
"Low Memory Detector" daemon prio=10 tid=0x00000000013f1000 nid=0x60f9 runnable [0x0000000000000000]
"C2 CompilerThread1" daemon prio=10 tid=0x00000000013ed800 nid=0x60f8 waiting on condition [0x0000000000000000]
"C2 CompilerThread0" daemon prio=10 tid=0x00000000013eb800 nid=0x60f7 waiting on condition [0x0000000000000000]
"Signal Dispatcher" daemon prio=10 tid=0x00000000013dd000 nid=0x60f6 waiting on condition [0x0000000000000000]
"Finalizer" daemon prio=10 tid=0x00000000013c5000 nid=0x60f5 in Object.wait() [0x00007f7094faf000]
"Reference Handler" daemon prio=10 tid=0x00000000013c3800 nid=0x60f4 in Object.wait() [0x00007f70950b0000]
"main" prio=10 tid=0x000000000135d800 nid=0x60f2 waiting on condition [0x00007f709b957000]
"VM Thread" prio=10 tid=0x00000000013bc800 nid=0x60f3 runnable
"VM Periodic Task Thread" prio=10 tid=0x00000000013f6000 nid=0x60fa waiting on condition
JNI global references: 1763
Heap
Awesome. VM, EC2, or bare hardware? Can you describe the topology (e.g. single node cluster with the proxy running on the namenode)?
This is a cluster consisting of 12 datanodes with the nfs-proxy running on the namenode. The datanodes/namenode are on two Intel 64-bit Linux machines running Ubuntu 11.10. No virtual machines, EC2 or anything like that. All running on a local 1GigE network and connected into the same switch. I'm writing to the NFS proxy over the network, mounted like this:

192.168.1.2:/ /export/hdfs nfs4 rw,relatime,vers=4,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=2050,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.10,minorversion=0,local_lock=none,addr=192.168.1.2 0 0
@brockn: Out of curiosity, do you use a reordering buffer to ensure that we always write a contiguous range of bytes to HDFS? I vaguely remember reading somewhere that NFS clients can reorder writes (?)
We do re-order the writes, but without question this is the weakest area of the code, as I wasn't sure what kind of data structure to use. Basically, each file that is open has a background thread waiting for write requests. It tracks the current file offset and holds on to any writes received at offsets higher than the current one. Once it finally gets the write at the current offset, it flushes any queued requests that are now sequential, then goes back to waiting for the write at the new current offset. I'd bet money this is where the bug is.
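For anyone following along, here is a minimal in-memory sketch of the per-file re-ordering just described. The class and method names are mine, not the project's actual WriteOrderHandler API, and it assumes writes never partially overlap.

import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: buffer out-of-order writes on the heap until they become contiguous.
public class ReorderingWriteBuffer {
  private final OutputStream out;                       // e.g. the FSDataOutputStream for the file
  private final TreeMap<Long, byte[]> pending = new TreeMap<Long, byte[]>(); // offset -> data
  private long expectedOffset = 0;                       // next byte we can push to HDFS

  public ReorderingWriteBuffer(OutputStream out) {
    this.out = out;
  }

  public synchronized void write(long offset, byte[] data) throws IOException {
    if (offset < expectedOffset) {
      return;                                            // retransmission of data already written
    }
    pending.put(offset, data);
    // Drain every queued write that has become contiguous with the current offset.
    Map.Entry<Long, byte[]> head;
    while ((head = pending.firstEntry()) != null && head.getKey() == expectedOffset) {
      out.write(head.getValue());
      expectedOffset += head.getValue().length;
      pending.remove(head.getKey());
    }
  }
}

The deadlock risk is visible in the structure itself: if the write at expectedOffset never arrives, the pending map only grows and nothing is ever flushed.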
On RHEL 5.8 I am only able to reproduce this when the heap on the NFS proxy is too small, using the following:

for x in $(seq 1 10); do ( dd if=/root/foo.test of=/mnt/hdfs/foo$x.test bs=1048576 count=500 & ); done

I had to set the heap to 5.5GB to get the test to pass. The reason was the NFS write re-ordering: the proxy had to buffer 194MB of writes for this single file before the prerequisite write showed up. See logs below.
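Back-of-the-envelope, just to sanity check that heap size (the 194MB per-file figure is from the logs above; the overhead is my own rough estimate):

194MB of out-of-order data per file x 10 parallel dd streams = about 1.9GB of raw write payload
1.9GB of payload + per-request byte[] copies, RPC buffers and GC headroom = a heap in the 5GB range

So 5.5GB being the point where the test passes is plausible even though no single file buffered more than ~200MB.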
The test passed 150 times overnight on a RHEL 5.8 host. I'll see what I can do with an Ubuntu client host.
On Mon, Oct 1, 2012 at 4:14 PM, Brock Noland notifications@github.com wrote:
Colin —
Without doing synchronous I/O I don't think it's possible to force the clients not to reorder the writes. I have thought about the temporary file solution and I think it's probably the best option.
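In case it helps the discussion, here is a rough sketch of what that temporary-file option could look like: out-of-order writes get spilled to a local scratch file and only their offsets stay on the heap. All names are hypothetical, and overlap handling and cleanup are ignored.

import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.util.TreeMap;

// Sketch only: park out-of-order write payloads on local disk instead of the heap.
public class SpillingWriteBuffer implements java.io.Closeable {
  private final OutputStream out;                        // the HDFS stream
  private final RandomAccessFile spill;                  // local scratch file
  private final TreeMap<Long, long[]> pending = new TreeMap<Long, long[]>(); // hdfsOffset -> {spillPos, length}
  private long expectedOffset = 0;
  private long spillPos = 0;

  public SpillingWriteBuffer(OutputStream out, File scratch) throws IOException {
    this.out = out;
    this.spill = new RandomAccessFile(scratch, "rw");
  }

  public synchronized void write(long offset, byte[] data) throws IOException {
    if (offset < expectedOffset) {
      return;                                            // retransmission of already-written data
    }
    if (offset == expectedOffset) {
      out.write(data);
      expectedOffset += data.length;
      drain();
    } else {
      // Out of order: spill the bytes to disk and remember where they went.
      spill.seek(spillPos);
      spill.write(data);
      pending.put(offset, new long[] { spillPos, data.length });
      spillPos += data.length;
    }
  }

  // Replay any spilled writes that have become contiguous.
  private void drain() throws IOException {
    while (!pending.isEmpty() && pending.firstKey() == expectedOffset) {
      long[] loc = pending.pollFirstEntry().getValue();
      byte[] buf = new byte[(int) loc[1]];
      spill.seek(loc[0]);
      spill.readFully(buf);
      out.write(buf);
      expectedOffset += buf.length;
    }
  }

  @Override
  public void close() throws IOException {
    spill.close();
  }
}

That keeps the heap footprint proportional to the number of pending writes rather than their total size, which is what blew up in the 5.5GB heap case above.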
tl;dr I don't think the heap filling was necessarily your issue. I resolved that on the async-writes branch and will be merging that to master soon. I also resolved the issue we see below. More research needs to be done on this, but for now the approach is working. I think your issue was something rather odd I found. I have only observed this on Ubuntu 12.1, but I am guessing it behaves similarly in 11.1 and in newer RH/SuSE. The situation is as follows:
I am not entirely sure why it flips to sync mode. I think it might be because it feels we are responding too slowly. I do know that if the NFS server responds with an error, it will flip over to sync mode so that the error can be propagated back up to the client during a write(); that is a patch I tested years back. So I am guessing we don't respond in time, the re-transmit trips that condition, and it starts writing with the SYNC flag. Why it doesn't then forward along 0-128 at that point, I don't know. However, the issue is that it will stop sending writes while waiting for a response to the SYNC writes. As such, in this scenario we cannot honor the SYNC flag on the write.
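To make the "cannot honor the SYNC flag" point concrete: a FILE_SYNC write can only be flushed through to HDFS if it is contiguous with what has already been written; an out-of-order write has to be queued regardless of what the client asked for. A tiny sketch of that constraint (the stable_how values are from the NFSv4 spec; everything else here is hypothetical, not the project's code):

// Sketch only: an out-of-order FILE_SYNC4 write cannot be flushed, so it cannot be honored.
public class StableHow {
  // stable_how values from the NFSv4 protocol.
  public static final int UNSTABLE4 = 0;
  public static final int DATA_SYNC4 = 1;
  public static final int FILE_SYNC4 = 2;

  // A FILE_SYNC4 write can only be honored if it starts at or before the next byte
  // the proxy is able to write; otherwise the data in front of it is still missing
  // and all the proxy can do is queue the request.
  public static boolean canHonorSync(int stable, long writeOffset, long expectedOffset) {
    return stable != FILE_SYNC4 || writeOffset <= expectedOffset;
  }
}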
hadoop version 1.0.0
Copying files to Hadoop via the NFS proxy using 'cp' or 'mv' can result in write deadlocks.
I'm finding that I can fairly easily lock the NFS server by running a few parallel write operations:
cd /hdfs/mountpoint/tmp
for x in $(seq 1 5); do ( dd if=/dev/urandom of=foo$x.test bs=1048576 count=500 & ); done
The Pending Write Offsets list never changes
kill -3 provides no output
12/04/23 17:56:30 INFO rpc.RPCServer: class com.cloudera.hadoop.hdfs.nfs.nfs4.NFS4Handler got client 192.168.1.10:944
12/04/23 17:56:30 INFO rpc.ClientWorker: Got SecurityHandler of type com.cloudera.hadoop.hdfs.nfs.security.SecurityHandler
12/04/23 17:56:30 WARN handlers.OperationRequestHandler: 0x7ffffff0 Error for client /192.168.1.10 and OPENResponse
12/04/23 17:56:30 WARN nfs4.NFS4Handler: 0x7ffffff0 Quitting due to 1 on OPENRequest for nickb
12/04/23 17:56:49 INFO handlers.OPENHandler: 0x7ffffff0 Opened /tmp/foo2.test for write org.apache.hadoop.fs.FSDataOutputStream@50703755
12/04/23 17:56:49 INFO handlers.OPENHandler: 0x7ffffff0 Opened /tmp/foo3.test for write org.apache.hadoop.fs.FSDataOutputStream@4dcac775
12/04/23 17:56:49 INFO handlers.OPENHandler: 0x7ffffff0 Opened /tmp/foo4.test for write org.apache.hadoop.fs.FSDataOutputStream@746cd30c
12/04/23 17:56:49 INFO handlers.OPENHandler: 0x7ffffff0 Opened /tmp/foo5.test for write org.apache.hadoop.fs.FSDataOutputStream@3dc81f66
12/04/23 17:57:08 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@3dc81f66 and 34799616
12/04/23 17:57:08 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@50703755 and 38600704
12/04/23 17:57:08 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@746cd30c and 34603008
12/04/23 17:57:08 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@4dcac775 and 49414144
12/04/23 17:57:09 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@3dc81f66 and 105979904
12/04/23 17:57:09 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@50703755 and 133173248
12/04/23 17:57:09 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@746cd30c and 131076096
12/04/23 17:57:09 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@3dc81f66 and 106012672
12/04/23 17:57:09 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@50703755 and 133206016
12/04/23 17:57:09 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@746cd30c and 131108864
12/04/23 17:57:18 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@746cd30c and 34603008
12/04/23 17:57:18 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@3dc81f66 and 34799616
12/04/23 17:57:18 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@50703755 and 38600704
12/04/23 17:57:18 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@4dcac775 and 85655552
12/04/23 17:57:24 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@746cd30c and 161632256
12/04/23 17:57:24 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@50703755 and 38600704
12/04/23 17:57:24 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@3dc81f66 and 90570752
12/04/23 17:57:24 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@4dcac775 and 103546880
12/04/23 17:57:24 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@746cd30c and 34603008
12/04/23 17:57:24 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@746cd30c and 161636352
12/04/23 17:57:24 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@746cd30c and 161640448
12/04/23 17:57:26 INFO nfs4.WriteOrderHandler: Sync for org.apache.hadoop.fs.FSDataOutputStream@50703755 and 163704832
12/04/23 17:57:36 INFO nfs4.WriteOrderHandler: Pending Write Offsets: [38928384, 38993920, 39059456, 39124992, 39190528, 39256064, 39321600, 39387136, 39452672, 39518208, 39583744, 39649280, 39714816, 39780352, 39845888, 39911424, 39976960, 40042496, 40108032, 40173568, 40239104, 40304640, 40370176, 40435712, 40501248, 40566784, 40632320, 40697856, 40763392, 40828928, 40894464, 40960000, 41025536, 41091072, 41156608, 41222144, 41287680, 41353216, 41418752, 41484288, 41549824, 41615360, 41680896, 41746432, 41811968, 41877504, 41943040, 42008576, 42074112, 42139648, 42205184, 42270720, 42336256, 42401792, 42467328, 42532864, 42598400, 42663936, 42729472, 42795008, 42860544, 42926080, 42991616, 43057152, 43122688, 43188224, 43253760, 43319296, 43384832, 43450368, 43515904, 43581440, 43646976, 43712512, 43778048, 43843584, 43909120, 43974656, 44040192, 44105728, 44171264, 44236800, 44302336, 44367872, 44433408, 44498944, 44564480, 44630016, 44695552, 44761088, 44826624, 44892160, 44957696, 45023232, 45088768, 4
... and lots more
Followed by many of these:
12/04/23 18:00:12 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1845449483
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1828672267
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1811895051
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1795117835
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1778340619
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1761563403
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1744786187
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1728008971
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1711231755
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1694454539
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1677677323
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1660900107
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1644122891
12/04/23 18:00:13 INFO rpc.ClientWorker: 0x7fffffe6 ignoring request 1627345675