Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Passthrough service hangs under high load #6474

Closed
xlight05 opened this issue Apr 18, 2024 · 9 comments
Closed

Passthrough service hangs under high load #6474

xlight05 opened this issue Apr 18, 2024 · 9 comments
Assignees

Comments

@xlight05
Copy link

xlight05 commented Apr 18, 2024

Description:
$Subject under load test.

Strand dump - https://gist.github.com/xlight05/af107fdd097dd6825fbe6bd6e13efa1e

Simpler sample -

import ballerina/http;
import ballerina/log;
type Location record {|
    string name;
    string id;
    string address;
|};

type Locations record {|
    *http:Links;
    Location[] locations;
|};

final http:Client snowpeakEP = check new("http://localhost:9090");

service /passthrough on new http:Listener(9091) {
    resource function get .() returns Locations|http:InternalServerError|error {
        log:printInfo("Request received at passthrough service");
        Locations locations = check snowpeakEP->get("/snowpeak/locations");
        return locations;
    }
}

Steps to reproduce:

  1. Checkout to HTTP branch with latest u9 changes
  2. build a pack using ./gradlew clean build -x check -x test
  3. open a new terminal and run the dependant service - docker run -p 9090:9090 ktharmi176/snowpeak:latest
  4. run the snowpeak sample with the created pack <http_path>/target/ballerina-runtime/bin/bal run
  5. execute the service once - curl http://localhost:9091/passthrough
  6. Run a load test against the service - wrk -t12 -c400 -d10s http://localhost:9091/passthrough
  7. once the load test is over, execute the service again - curl http://localhost:9091/passthrough

Service should be hanging at this point.

Affected Versions:
2201.9.0-20240418-164300-a37919c4

OS, DB, other environment details and versions:

Related Issues (optional):

Suggested Labels (optional):

Suggested Assignees (optional):

@xlight05 xlight05 changed the title Snowpeak passthrough service hangs under high load assthrough service hangs under high load Apr 18, 2024
@xlight05 xlight05 changed the title assthrough service hangs under high load Passthrough service hangs under high load Apr 18, 2024
@TharmiganK
Copy link
Contributor

The hanging happens in the client level and some how the service also hangs. This issue seems to be related to the changes done with the status code binding feature: ballerina-platform/module-ballerina-http#1915, specifically when the client call involves data-binding where we call a Ballerina function from Java to bind the payload to the required type: https://github.com/ballerina-platform/module-ballerina-http/blob/8e3226c5f8cc9cb0a5e2cb043ccc140644b5e7ec/native/src/main/java/io/ballerina/stdlib/http/api/nativeimpl/ExternResponseProcessor.java#L519

This should be an issue with the client's concurrent behaviour when data binding is involved.

The client behaviour with this status code binding feature looks like this:

Ballerina - Client call(External function) => Java - Client Execution method => Ballerina - Client call implementation function => Ballerina - Process response function(External function) => Java - Process response method => Java - Perform data-binding method => Ballerina - Perform data-binding method

The last 3 steps are just to build the payload with the required type, and based on the target type, further processing may happen. Without these steps i.e. if we bind it to http:Response this hanging issue is not reproducible

Reduced reproducer:

  1. Client file:
import ballerina/http;
import ballerina/io;

type Location record {|
    string name;
    string id;
    string address;
|};

type Locations record {|
    *http:Links;
    Location[] locations;
|};

final http:Client snowpeakEP = check new ("http://localhost:9090");

public function main() returns error? {
    future<error?> futureResult;
    foreach int i in 0...400 {
        futureResult = start test(i);
    }
    _ = check wait futureResult;
}

function test(int i) returns error? {
    io:println("Request started for ", i);
    Locations locations = check snowpeakEP->get("/snowpeak/locations");
    io:println("Request completed for ", i);
}
  1. Service file:
import ballerina/http;

type Location record {|
    string name;
    string id;
    string address;
|};

type Locations record {|
    *http:Links;
    Location[] locations;
|};

service /snowpeak on new http:Listener(9090) {

    resource function get locations() returns Locations {
        return {
            locations: [
                {name: "Snowpeak", id: "1", address: "1234 Snowpeak"},
                {name: "Snowpeak", id: "2", address: "1234 Snowpeak"},
                {name: "Snowpeak", id: "3", address: "1234 Snowpeak"}
            ],
            _links: {
                self: {href: "/snowpeak/locations"}
            }
        };
    }
}

I will further investigate on this

@hasithaa
Copy link

Can we get a JAVA thread dump and a ballerina strand dump together when this happens?

@gabilang
Copy link
Contributor

gabilang commented Apr 19, 2024

Following interop function async call inside the foreach loop hangs:

import ballerina/io;
import ballerina/jballerina.java;

public function main() returns error? {
    future<error?> futureResult;
    foreach int i in 0...25 {
        futureResult = start test(i);
        // check test(i);
    }
    _ = check wait futureResult;
}

public class ABC {

    public int port;

    public isolated function init(int port) {
        self.port = port;
    }

    public isolated function getPort() returns int {
        return self.port;
    }
}

function test(int i) returns error? {
    io:println("Request started for ", i);
    ABC abc = new ABC(9090);
    io:println(getResource(abc));
    io:println("Request completed for ", i);
}

isolated function getResource(ABC abc) returns anydata = @java:Method {
    'class: "myapp.app.App"
} external;

getResource method in the App.java class

    public static Object getResource(Environment env, BObject client) {

        final Object[] out = new Object[1];
        CountDownLatch countDownLatch = new CountDownLatch(1);

        Callback callback = new Callback() {
            @Override
            public void notifySuccess(Object o) {
                out[0] = o;
                countDownLatch.countDown();
            }

            @Override
            public void notifyFailure(BError bError) {
                out[0] = bError;
                countDownLatch.countDown();
            }
        };

        env.getRuntime().invokeMethodAsyncSequentially(client, "getPort", null,null, callback, null, PredefinedTypes.TYPE_INT);

        try {
            countDownLatch.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return out[0];
    }
Strand Dump:
Ballerina Strand Dump [2024/04/19 16:18:41]
===========================================

Total strand group count : 29
Total strand count : 51
Active strand group count : 28
Active strand count : 51

group 2 [RUNNABLE]: [24]
strand 28 "interop-async-seq" [N/A] [RUNNABLE]

    strand 29 "interop-async-seq" [N/A] [RUNNABLE]

    strand 30 "interop-async-seq" [N/A] [RUNNABLE]

    strand 31 "interop-async-seq" [N/A] [RUNNABLE]

    strand 32 "interop-async-seq" [N/A] [RUNNABLE]

    strand 33 "interop-async-seq" [N/A] [RUNNABLE]

    strand 34 "interop-async-seq" [N/A] [RUNNABLE]

    strand 35 "interop-async-seq" [N/A] [RUNNABLE]

    strand 36 "interop-async-seq" [N/A] [RUNNABLE]

    strand 37 "interop-async-seq" [N/A] [RUNNABLE]

    strand 38 "interop-async-seq" [N/A] [RUNNABLE]

    strand 39 "interop-async-seq" [N/A] [RUNNABLE]

    strand 40 "interop-async-seq" [N/A] [RUNNABLE]

    strand 41 "interop-async-seq" [N/A] [RUNNABLE]

    strand 42 "interop-async-seq" [N/A] [RUNNABLE]

    strand 43 "interop-async-seq" [N/A] [RUNNABLE]

    strand 44 "interop-async-seq" [N/A] [RUNNABLE]

    strand 45 "interop-async-seq" [N/A] [RUNNABLE]

    strand 46 "interop-async-seq" [N/A] [RUNNABLE]

    strand 47 "interop-async-seq" [N/A] [RUNNABLE]

    strand 48 "interop-async-seq" [N/A] [RUNNABLE]

    strand 49 "interop-async-seq" [N/A] [RUNNABLE]

    strand 50 "interop-async-seq" [N/A] [RUNNABLE]

    strand 51 "interop-async-seq" [N/A] [RUNNABLE]

group 3 [QUEUED]: [1]
strand 1 "main" [gabilan.test_u9.0:main] [WAITING]:
at gabilan.test_u9.0.1.0:main(main.bal:201)
$moduleExecute

group 4 [RUNNABLE]: [1]
strand 2 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 5 [RUNNABLE]: [1]
strand 3 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 6 [RUNNABLE]: [1]
strand 4 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 7 [RUNNABLE]: [1]
strand 5 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 8 [RUNNABLE]: [1]
strand 6 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 9 [RUNNABLE]: [1]
strand 7 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 10 [RUNNABLE]: [1]
strand 8 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 11 [RUNNABLE]: [1]
strand 9 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 12 [RUNNABLE]: [1]
strand 10 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 13 [RUNNABLE]: [1]
strand 11 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 14 [RUNNABLE]: [1]
strand 12 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 15 [RUNNABLE]: [1]
strand 13 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 16 [RUNNABLE]: [1]
strand 14 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 17 [RUNNABLE]: [1]
strand 15 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 18 [RUNNABLE]: [1]
strand 16 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 19 [RUNNABLE]: [1]
strand 17 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 20 [RUNNABLE]: [1]
strand 18 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 21 [RUNNABLE]: [1]
strand 19 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 22 [RUNNABLE]: [1]
strand 20 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 23 [RUNNABLE]: [1]
strand 21 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 24 [RUNNABLE]: [1]
strand 22 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 25 [RUNNABLE]: [1]
strand 23 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 26 [RUNNABLE]: [1]
strand 24 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 27 [RUNNABLE]: [1]
strand 25 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 28 [RUNNABLE]: [1]
strand 26 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

group 29 [RUNNABLE]: [1]
strand 27 "futureResult" [gabilan.test_u9.0:main][1] [RUNNABLE]

===========================================

@TharmiganK
Copy link
Contributor

TharmiganK commented Apr 19, 2024

Can we get a JAVA thread dump and a ballerina strand dump together when this happens?

This is for the HTTP client:
Strand Dump: https://docs.google.com/document/d/1Ip23ItLSbKvaumFTtjZ8d9y6npqgzlzIBCUJwGemBs0/edit?usp=sharing
Thread Dump: https://drive.google.com/file/d/1gdjb817zkH1JejsuyJTZkmaA7yxn4z-e/view?usp=sharing

@TharmiganK
Copy link
Contributor

With this PR: ballerina-platform/module-ballerina-http#1954, we have used a different approach to avoid countdown latch and moved the new feature to a new client type.

The root cause should be found from the lang side, so moving this to lang

@TharmiganK TharmiganK transferred this issue from ballerina-platform/ballerina-library Apr 24, 2024
@warunalakshitha
Copy link
Contributor

Following interop function async call inside the foreach loop hangs:

import ballerina/io;
import ballerina/jballerina.java;

public function main() returns error? {
    future<error?> futureResult;
    foreach int i in 0...25 {
        futureResult = start test(i);
        // check test(i);
    }
    _ = check wait futureResult;
}

public class ABC {

    public int port;

    public isolated function init(int port) {
        self.port = port;
    }

    public isolated function getPort() returns int {
        return self.port;
    }
}

function test(int i) returns error? {
    io:println("Request started for ", i);
    ABC abc = new ABC(9090);
    io:println(getResource(abc));
    io:println("Request completed for ", i);
}

isolated function getResource(ABC abc) returns anydata = @java:Method {
    'class: "myapp.app.App"
} external;

getResource method in the App.java class

    public static Object getResource(Environment env, BObject client) {

        final Object[] out = new Object[1];
        CountDownLatch countDownLatch = new CountDownLatch(1);

        Callback callback = new Callback() {
            @Override
            public void notifySuccess(Object o) {
                out[0] = o;
                countDownLatch.countDown();
            }

            @Override
            public void notifyFailure(BError bError) {
                out[0] = bError;
                countDownLatch.countDown();
            }
        };

        env.getRuntime().invokeMethodAsyncSequentially(client, "getPort", null,null, callback, null, PredefinedTypes.TYPE_INT);

        try {
            countDownLatch.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return out[0];
    }

Strand Dump:
But still need to verify that this is the same problem as the original problem.

This is wrong. We cannot block the strand thread using countdown latch or any other concurrency mechanism. We need to use markasync and future.complete methods to block the strand without blocking underlying thread.

Refer [1] for more info.

[1] https://github.com/ballerina-platform/ballerina-lang/tree/master/bvm/ballerina-runtime#calling-a-ballerina-object-method

@gabilang
Copy link
Contributor

gabilang commented May 2, 2024

The hanging happens in the client level and some how the service also hangs. This issue seems to be related to the changes done with the status code binding feature: ballerina-platform/module-ballerina-http#1915, specifically when the client call involves data-binding where we call a Ballerina function from Java to bind the payload to the required type: https://github.com/ballerina-platform/module-ballerina-http/blob/8e3226c5f8cc9cb0a5e2cb043ccc140644b5e7ec/native/src/main/java/io/ballerina/stdlib/http/api/nativeimpl/ExternResponseProcessor.java#L519

This should be an issue with the client's concurrent behaviour when data binding is involved.

The client behaviour with this status code binding feature looks like this:

Ballerina - Client call(External function) => Java - Client Execution method => Ballerina - Client call implementation function => Ballerina - Process response function(External function) => Java - Process response method => Java - Perform data-binding method => Ballerina - Perform data-binding method

The last 3 steps are just to build the payload with the required type, and based on the target type, further processing may happen. Without these steps i.e. if we bind it to http:Response this hanging issue is not reproducible

Reduced reproducer:

  1. Client file:
import ballerina/http;
import ballerina/io;

type Location record {|
    string name;
    string id;
    string address;
|};

type Locations record {|
    *http:Links;
    Location[] locations;
|};

final http:Client snowpeakEP = check new ("http://localhost:9090");

public function main() returns error? {
    future<error?> futureResult;
    foreach int i in 0...400 {
        futureResult = start test(i);
    }
    _ = check wait futureResult;
}

function test(int i) returns error? {
    io:println("Request started for ", i);
    Locations locations = check snowpeakEP->get("/snowpeak/locations");
    io:println("Request completed for ", i);
}
  1. Service file:
import ballerina/http;

type Location record {|
    string name;
    string id;
    string address;
|};

type Locations record {|
    *http:Links;
    Location[] locations;
|};

service /snowpeak on new http:Listener(9090) {

    resource function get locations() returns Locations {
        return {
            locations: [
                {name: "Snowpeak", id: "1", address: "1234 Snowpeak"},
                {name: "Snowpeak", id: "2", address: "1234 Snowpeak"},
                {name: "Snowpeak", id: "3", address: "1234 Snowpeak"}
            ],
            _links: {
                self: {href: "/snowpeak/locations"}
            }
        };
    }
}

I will further investigate on this

The logic used here which uses countdown latch can be replaced with markasync and Ballerina future. The post processing logics can be moved to the notifySuccess() method of the callback.

    public static void getResource(Environment env, BObject obj) {
        Future balFuture = env.markAsync();
        Callback callback = getCallback(balFuture);
        env.getRuntime().invokeMethodAsyncConcurrently(obj, "getPort", "interop-async-seq",null, callback, null, PredefinedTypes.TYPE_INT);
    }

    private static Callback getCallback(Future balFuture) {
        final Object[] out = new Object[1];

        return new Callback() {
            @Override
            public void notifySuccess(Object o) {
                out[0] = ((Long) o) + 2;
                balFuture.complete(out[0]);
            }

            @Override
            public void notifyFailure(BError bError) {
                balFuture.complete(new RuntimeException(bError.toString()));
            }
        };

So, transferred this issue to ballerina-lib.

@gabilang gabilang transferred this issue from ballerina-platform/ballerina-lang May 2, 2024
@TharmiganK
Copy link
Contributor

This is wrong. We cannot block the strand thread using countdown latch or any other concurrency mechanism. We need to use markasync and future.complete methods to block the strand without blocking underlying thread.

Refer [1] for more info.

[1] https://github.com/ballerina-platform/ballerina-lang/tree/master/bvm/ballerina-runtime#calling-a-ballerina-object-method

Ack

The logic used here which uses countdown latch can be replaced with markasync and Ballerina future. The post processing logics can be moved to the notifySuccess() method of the callback.

Ack. Since we have found a workaround in this PR: ballerina-platform/module-ballerina-http#1954, we will use this approach for the moment

@TharmiganK
Copy link
Contributor

Closing this issue since the original implementation is not valid according to the Ballerina runtime implementation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants