Execute is blocking main thread #236

Closed
matkg opened this issue Nov 9, 2021 · 5 comments

@matkg

matkg commented Nov 9, 2021

Hey there, I'm trying to use Barracuda with the YOLOv5s network and I'm running into performance issues. Inside the Update function I'm calling worker.Execute(textureTensor):

    void EvalCameraImage()
    {
        workerbusy = true;
        dh.sw.Start();
        Destroy(tex);
        tex = Utils.renderTextureToTexture2D(mainCameraTexture);
        Tensor textureTensor = new Tensor(tex, channels: 3);
        worker.Execute(textureTensor);
        Tensor output = worker.PeekOutput("output");
        var results = Utils.getBoundingBoxFromTensor(output, 1920, 1088);
        //results.ForEach(res => dh.log(res.ToString()));
        textureTensor.Dispose();
        output.Dispose();
        dh.stopSW("");
        workerbusy = false;
    }

The problem is that the Execute call takes approximately 100 milliseconds to complete, which results in 8 to 9 fps at runtime.

I already tried using a coroutine that calls FlushSchedule each frame and then yields, but then the model takes 3 seconds to execute, which is not ideal.

Here is the coroutine:

    IEnumerator EvalCameraImageCoroutine()
    {
        dh.sw.Start();
        Destroy(tex);
        tex = Utils.renderTextureToTexture2D(mainCameraTexture);
        Tensor textureTensor = new Tensor(tex, channels: 3);
        var it =  worker.StartManualSchedule(textureTensor);
        workerbusy = true;
        int count = 0;
        while (it.MoveNext())
        {
            ++count;
            //Task.Run(() => { worker.FlushSchedule(false); });
            worker.FlushSchedule(true);
            if(count % 1 == 0) yield return null;
        }
        worker.FlushSchedule(true);
        Tensor output = worker.PeekOutput("output");
        var results = Utils.getBoundingBoxFromTensor(output, 1920, 1088);
        results.ForEach(res => dh.log(res.ToString()));
        textureTensor.Dispose();
        output.Dispose();
        workerbusy = false;
        dh.stopSW("");
    }

As you can see, I already tried to run worker.FlushSchedule(false); asynchronously, which resulted in a race condition inside the Barracuda framework. It seems that calling Execute() asynchronously would solve my problems, but that results in an error telling me that Execute can only be called from the main thread.

Also, calling FlushSchedule multiple times per frame made the performance even worse.

Any help is appreciated.

@Aurimasp
Collaborator

Hi @matkg,

You can try profiling and then increasing the number of layers scheduled per frame to balance frame time against total inference time.
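
A minimal sketch of what "layers per frame" could look like, assuming a component that owns the worker. The class name `SlicedInference`, the fields `modelAsset` and `layersPerFrame`, and the output name "output" are illustrative assumptions, not code from this thread. It keeps every Barracuda call on the main thread, since the thread reports a race condition when FlushSchedule is called from a Task.

```csharp
using System.Collections;
using Unity.Barracuda;
using UnityEngine;

// Hypothetical component: spreads one inference over several frames.
public class SlicedInference : MonoBehaviour
{
    public NNModel modelAsset;       // assumption: assigned in the Inspector
    public int layersPerFrame = 20;  // assumption: tune this with the Profiler

    IWorker worker;
    bool busy;

    void Start()
    {
        Model model = ModelLoader.Load(modelAsset);
        worker = WorkerFactory.CreateWorker(WorkerFactory.Type.ComputePrecompiled, model);
    }

    // Start with StartCoroutine(Run(someTexture)).
    public IEnumerator Run(Texture source)
    {
        if (busy) yield break;
        busy = true;

        using (var input = new Tensor(source, channels: 3))
        {
            // Each MoveNext() schedules one layer.
            IEnumerator schedule = worker.StartManualSchedule(input);
            int layer = 0;
            while (schedule.MoveNext())
            {
                if (++layer % layersPerFrame == 0)
                {
                    // Non-blocking flush on the main thread, then give the frame back.
                    worker.FlushSchedule(false);
                    yield return null;
                }
            }

            // Block once at the end so the output is complete before reading it.
            worker.FlushSchedule(true);

            // PeekOutput returns a tensor that the worker still owns.
            Tensor output = worker.PeekOutput("output");
            Debug.Log(output.shape);
        }

        busy = false;
    }

    void OnDestroy()
    {
        worker?.Dispose();
    }
}
```

A larger `layersPerFrame` finishes the inference in fewer frames but makes each of those frames longer; a smaller value does the opposite. The Profiler is the way to find the balance for a given model and device.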

@matkg
Author

matkg commented Nov 11, 2021

Hey @Aurimasp, thanks for the profiler suggestion, I hadn't thought of it!

I re-read the docs and changed my code a little:

    IEnumerator EvalCameraImageCoroutine()
    {
        dh.sw.Start();
        Destroy(tex);
        tex = Utils.renderTextureToTexture2D(mainCameraTexture);
        Tensor textureTensor = new Tensor(tex, channels: 3);
        var it =  worker.StartManualSchedule(textureTensor);
        workerbusy = true;
        int count = 0;

        while (it.MoveNext())
        {
            ++count;
            if (count % 20 == 0)
            {
                Task.Run(() => worker.FlushSchedule(false));
                yield return null;
            }
        }
        worker.FlushSchedule(true);
        Tensor output = worker.PeekOutput("output");
        var results = Utils.getBoundingBoxFromTensor(output, 1920, 1088);
        results.ForEach(res => dh.log(res.ToString()));
        textureTensor.Dispose();
        output.Dispose();
        workerbusy = false;
        dh.stopSW("");
    }

Now I flush only every 20th layer, which keeps my fps around 30 with an inference time of 200 ms. But I noticed that the it.MoveNext() call blocks the main thread for up to 67 ms. I tried to execute it asynchronously as well, but then it doesn't return true.

Here is a picture of the profiler (the spikes are almost all due to MoveNext()):

[image: profiler screenshot]

@FlorentGuinier
Copy link

FlorentGuinier commented Nov 16, 2021

Hi Matkg,

May I ask which backend you are using? I would expect Burst work to happen on threads and compute work to happen on the GPU. If you are using one of those backends, this might indeed be a bug. In that case, could you share a small repro please?

Thanks!
Florent

@matkg
Author

matkg commented Nov 16, 2021

Hey there, thanks for the reply.

I am using the ComputePrecompiled setting, which should run on the GPU. Anyway, I've managed to cut the execution time by reducing the input size of my yolov5n model (the model input was 1920x1088 and is now 192x192). That seems to have been the real issue, which I overlooked. Inference now takes about 10 ms, which gives 120 fps.

Thanks for the replies, but I think the issue was on my side.
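
To put the input-size change in numbers, here is a small worked calculation. The helper name `PixelCount` is made up for illustration. That inference cost scales roughly with pixel count is an assumption for intuition only, not something measured in this thread.

```csharp
using System;

public static class InputSizeMath
{
    // Number of pixels the network processes per inference.
    public static long PixelCount(int width, int height) => (long)width * height;

    public static void Main()
    {
        long before = PixelCount(1920, 1088); // 2,088,960 pixels
        long after = PixelCount(192, 192);    // 36,864 pixels
        double ratio = (double)before / after;

        Console.WriteLine($"{before} -> {after} pixels, ratio {ratio}");
    }
}
```

The smaller input has roughly 57 times fewer pixels, while the reported inference time dropped about tenfold (100 ms to 10 ms). One possible explanation is that fixed per-layer overhead does not shrink with the input.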

@FlorentGuinier

Hey!

OK, thanks for the feedback! I still find it quite unexpected that the ComputePrecompiled backend would take 65 ms of CPU time even at a high input resolution. Something might be fishy somewhere :) I'm closing the bug for now; however, please feel free to reopen it as needed.

Florent
