
High load concurrent queries hang #36

Closed
vmarkovtsev opened this issue Jun 19, 2017 · 10 comments

@vmarkovtsev (Contributor) commented Jun 19, 2017

When I execute many (3000) concurrent queries from 4 threads, the Babelfish server either hangs or drops some requests without answering them. CPU load is 0%. After that, the server becomes completely unresponsive and I have to restart it.
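
For reference, a minimal sketch of how this kind of load can be generated with the Python client; the endpoint, the file list and the BblfshClient usage below are assumptions, not the exact script used for this report.

# Minimal load-generation sketch (assumed API: the bblfsh Python client;
# endpoint and file list are placeholders, not the script used here).
import glob
from concurrent.futures import ThreadPoolExecutor

from bblfsh import BblfshClient

client = BblfshClient("0.0.0.0:9432")  # assumed default server endpoint

def parse_one(path):
    # One gRPC call per file; a hanging server makes this block forever.
    return path, client.parse(path)

files = glob.glob("**/*.py", recursive=True)[:3000]
with ThreadPoolExecutor(max_workers=4) as pool:
    for path, response in pool.map(parse_one, files):
        print(path, "ok" if response is not None else "dropped")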

abeaumont added the bug label Jun 21, 2017
@abeaumont (Contributor) commented:

@vmarkovtsev Can you explain which command you use to reproduce the bug?

juanjux self-assigned this Jun 22, 2017
@juanjux (Contributor) commented Jun 22, 2017

I've found some instances of the python-client hanging randomly while debugging #34. It seems to happen more frequently after a previous request has failed for any reason; @smola and I suspect the driver containers may not be closing correctly after some errors, which could leave all the provisioned containers in a non-responsive state.

I'll look into it as part of that bug, but the fix may well be the same for this one. I will post an update when I have more information.

@juanjux (Contributor) commented Jun 22, 2017

I'll be fixing bblfsh/python-driver/issues/27 first, since it could also be related (the problem seems to be worsened by driver failures).

@juanjux (Contributor) commented Jun 28, 2017

Update: now that I have the Docker setup to test in a controlled environment, I'm seeing different behaviours in different tests:

  • I can bang the server as hard as I want in parallel (even to the point of making my machine swap itself to death) with a Python file that quickly produces a parse error, and the server never fails in this way. I think the fix for Concurrent queries return randomly corrupted UASTs #34 helped with this. I also tried with other kinds of errors produced by the native driver, with the same result.

  • ...but if I send a request that produces the buffer-size error from the SDK (it has to be a really huge file now that the buffer size is 4MB), the server doesn't reply anymore after that first request. I'm investigating this (a sketch of how such a file can be generated follows below).
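
For context, a hedged sketch of how such a torture file can be generated: any syntactically valid Python source whose serialized output exceeds the 4MB buffer mentioned above will do (the line count below is an arbitrary guess, not the file actually used).

# Sketch: write a valid .py file large enough that its serialized AST/UAST
# should exceed a 4MB buffer (size taken from the comment above; the number
# of lines is an arbitrary guess).
with open("torture_huge.py", "w") as out:
    for i in range(200000):
        out.write("var_%d = %d  # padding to inflate the parse output\n" % (i, i))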

@juanjux (Contributor) commented Jun 30, 2017

Update:

This bug has surfaced some existing problems that need to be fixed.

Currently there is a problem that shows up with some files that produce an error in the Go part of the driver (the SDK): the error is correctly returned, but the next request after that one hangs. That's because the driver reads from the driver container's stdin but doesn't write anything to stdout, so the server-to-driver communication hangs (and thus the client-to-server one as well).

This doesn't mean that the whole server hangs; other connections made while the first one is waiting will still work, since the server will instantiate more containers to handle the new connections.

  • Modify the encoding and decoding logic in the SDK so that requests to hung drivers can time out.
  • When a request times out trying to read from or write to a driver container, log the problem and kill the container (see the sketch after this list for the general idea).
  • Discover why a driver hangs on the second request after my torture files (even though the error is correctly returned on the first one) and fix it.
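
Not the SDK code itself, just a Python illustration of the timeout-and-kill idea from the second point above: write the request to a driver-like child process, wait a bounded time for its reply on stdout, and kill it if it never answers.

# Illustration only (Python, not the Go SDK): bound the time spent waiting
# for a driver-like subprocess to reply on stdout, and kill it on timeout so
# the next request doesn't hang on a dead pipe.
import subprocess

def request_with_timeout(cmd, payload, timeout=5.0):
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    try:
        # communicate() writes the request and reads the reply; it raises
        # TimeoutExpired if the child never writes anything back.
        reply, _ = proc.communicate(payload, timeout=timeout)
        return reply
    except subprocess.TimeoutExpired:
        proc.kill()   # log the problem and kill it, as described above
        proc.wait()
        raise RuntimeError("driver did not answer within %.1fs" % timeout)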

@juanjux (Contributor) commented Jul 1, 2017

I've fixed the cause of all the tests I have that resulted in container hangs: bblfsh/sdk/pull/135

We'll test together on Monday and see if we can close it.

The timeout and the container killing will be done in a separate PR since I want to check some things with my team.

@juanjux (Contributor) commented Jul 1, 2017

PS: @bzz

juanjux closed this as completed Jul 1, 2017
juanjux reopened this Jul 1, 2017
@juanjux (Contributor) commented Jul 5, 2017

@EgorBu and I have just checked that this (already merged) fix and the previous ones make the earlier problems unreproducible. We banged the server hard with three processes sending it all the tensorflow .py files, repeated 20 times; the output files are always the same across runs and the server didn't hang.

I'll release new docker images of the server and the Python & Java drivers today or tomorrow with all these fixes incorporated. Closing the issue; if the new docker images don't work, we'll reopen it.

juanjux closed this as completed Jul 5, 2017
@EgorBu commented Jul 5, 2017

Here is the code that we used to test:

  1. Repository that gives different UASTs on different tries: https://github.com/tensorflow/tensorflow
    This happens because of the JSON decoder buffer size: if a file exceeds the limit, the driver is broken and returns errors for all future requests, which makes UAST generation non-reproducible. @juanjux said that it's related to:
    Fix hangs when the reading from the native driver raised an error before all input was consumed sdk#135
    Make JSON decoder buffer size dynamic sdk#130
    Fix/empty code fatal python-driver#26
    and these fixes should be published in a new docker image (today?)
  2. Code to reproduce:
# download scripts https://gist.github.com/EgorBu/dafee1247e91a8faee6328a19e67cad7
git clone https://github.com/tensorflow/tensorflow
cd tensorflow
# it may take a lot of time
bash ../test_34_36_tensorflow.sh
# if it's broken, here is a Python script to find the first difference in files
# (a rough sketch of such a script follows below):
# https://gist.github.com/EgorBu/f17353e5ed5c63f7afb865c8fad7f8c8
python finddiff.py
# don't forget to delete tmp files with uasts
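
The real finddiff.py is in the gist linked above; purely as an illustration (the directory layout and file naming below are assumptions), a script along these lines reports the first dumped UAST file that differs between two runs:

# Rough illustration, not the gist: compare two directories of dumped UASTs
# and print the first file whose contents differ between runs.
import sys
from pathlib import Path

def first_difference(dir_a, dir_b):
    for path_a in sorted(Path(dir_a).glob("*")):
        path_b = Path(dir_b) / path_a.name
        if not path_b.exists() or path_a.read_bytes() != path_b.read_bytes():
            return path_a.name
    return None

if __name__ == "__main__":
    print(first_difference(sys.argv[1], sys.argv[2]) or "no differences found")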

@juanjux (Contributor) commented Jul 6, 2017

New Python, Java and Server docker images have been released with all the latest fixes.
