Execute is blocking main thread #236

matkg · 2021-11-09T22:09:35Z

Hey there, I'm trying to use Barracuda with the YOLOv5s network and I'm running into performance issues. Inside the Update function I'm calling worker.Execute(textureTensor):

    void EvalCameraImage()
    {
        workerbusy = true;
        dh.sw.Start();
        Destroy(tex);
        tex = Utils.renderTextureToTexture2D(mainCameraTexture);
        Tensor textureTensor = new Tensor(tex, channels: 3);
        worker.Execute(textureTensor);
        Tensor output = worker.PeekOutput("output");
        var results = Utils.getBoundingBoxFromTensor(output, 1920, 1088);
        //results.ForEach(res => dh.log(res.ToString()));
        textureTensor.Dispose();
        output.Dispose();
        dh.stopSW("");
        workerbusy = false;
    }

The problem is that the Execute call takes approximately 100 milliseconds to complete, which results in 8 to 9 fps at runtime.

I already tried using a coroutine that calls FlushSchedule each frame and then yields, but then the model takes 3 seconds to execute, which is not ideal.

Here is the coroutine:

    IEnumerator EvalCameraImageCoroutine()
    {
        dh.sw.Start();
        Destroy(tex);
        tex = Utils.renderTextureToTexture2D(mainCameraTexture);
        Tensor textureTensor = new Tensor(tex, channels: 3);
        var it =  worker.StartManualSchedule(textureTensor);
        workerbusy = true;
        int count = 0;
        while (it.MoveNext())
        {
            ++count;
            //Task.Run(() => { worker.FlushSchedule(false); });
            worker.FlushSchedule(true);
            if(count % 1 == 0) yield return null;
        }
        worker.FlushSchedule(true);
        Tensor output = worker.PeekOutput("output");
        var results = Utils.getBoundingBoxFromTensor(output, 1920, 1088);
        results.ForEach(res => dh.log(res.ToString()));
        textureTensor.Dispose();
        output.Dispose();
        workerbusy = false;
        dh.stopSW("");
    }

As you can see, I already tried to run worker.FlushSchedule(false) asynchronously, which resulted in a race condition inside the Barracuda framework. It seems that calling Execute() asynchronously would solve my problem, but that results in an error telling me that Execute can only be called from the main thread.

Also, calling FlushSchedule multiple times per frame made the performance even worse.

Any help is appreciated.

Aurimasp · 2021-11-11T10:59:53Z

Hi @matkg ,

You can try to profile and increase the number of layers per frame to balance the time.
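
As a rough illustration of balancing layers per frame, here is a minimal sketch that steps a layer-by-layer schedule until a per-frame time budget is used up and only then yields. `BudgetedSchedule`, `budgetMs`, and `flush` are hypothetical names introduced for this example and are not part of Barracuda. The stopwatch only measures main-thread time spent scheduling, not the time the GPU needs to finish the work.

```csharp
using System;
using System.Collections;
using System.Diagnostics;

// Hypothetical helper (not a Barracuda API): advances a schedule enumerator
// until the per-frame time budget is spent, then flushes and yields.
public static class BudgetedSchedule
{
    public static IEnumerator Run(IEnumerator schedule, double budgetMs, Action flush)
    {
        var timer = Stopwatch.StartNew();
        while (schedule.MoveNext())
        {
            if (timer.Elapsed.TotalMilliseconds < budgetMs)
                continue;          // still within this frame's budget, schedule another layer

            flush();               // submit the layers scheduled so far
            yield return null;     // give control back until the next frame
            timer.Restart();       // start a fresh budget for the new frame
        }
    }
}
```

In a Unity coroutine, one way to use this would be `yield return StartCoroutine(BudgetedSchedule.Run(worker.StartManualSchedule(textureTensor), 4.0, () => worker.FlushSchedule(false)));`, followed by the blocking `worker.FlushSchedule(true)` and `worker.PeekOutput("output")` from the code above. The 4 ms budget is an arbitrary starting point; the profiler should tell you what your frame can actually afford.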

matkg · 2021-11-11T15:44:54Z

Hey @Aurimasp, thanks for the profiler suggestion, I hadn't thought of it!

I re-read the docs and changed my code a little:

    IEnumerator EvalCameraImageCoroutine()
    {
        dh.sw.Start();
        Destroy(tex);
        tex = Utils.renderTextureToTexture2D(mainCameraTexture);
        Tensor textureTensor = new Tensor(tex, channels: 3);
        var it =  worker.StartManualSchedule(textureTensor);
        workerbusy = true;
        int count = 0;

        while (it.MoveNext())
        {
            ++count;
            if (count % 20 == 0)
            {
                Task.Run(() => worker.FlushSchedule(false));
                yield return null;
            }
        }
        worker.FlushSchedule(true);
        Tensor output = worker.PeekOutput("output");
        var results = Utils.getBoundingBoxFromTensor(output, 1920, 1088);
        results.ForEach(res => dh.log(res.ToString()));
        textureTensor.Dispose();
        output.Dispose();
        workerbusy = false;
        dh.stopSW("");
    }

Now I flush only after every 20th layer, which keeps my fps around 30 with an inference time of 200ms. But I noticed that the it.MoveNext() call blocks the main thread for up to 67ms. I tried to execute it asynchronously as well, but then it doesn't return true.

In the profiler (screenshot not preserved here), the spikes are almost all due to MoveNext().

FlorentGuinier · 2021-11-16T08:15:00Z

Hi Matkg,

May I ask which backend you are using? I would expect Burst work to happen on threads and compute work to happen on the GPU; if you are using one of those backends, this might indeed be a bug. In that case, could you share a small repro please?

Thanks!
Florent

matkg · 2021-11-16T08:26:59Z

Hey there, thanks for the reply.

I am using the ComputePrecompiled setting, which should run on the GPU. Anyway, I've managed to shrink the execution time by reducing the input size of my yolov5n model (the model input was 1920x1088 and is now 192x192). That seems to have been the real issue, which I overlooked. It now runs with an inference time of about 10ms, which gives about 120 fps.

Thanks for the replies, but I think the issue was on my side.
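
To put the input-size change in numbers, here is a small sketch that compares the pixel counts of the two model inputs mentioned above. `InputSize` is a hypothetical helper written for this example. Pixel count is only a rough proxy for inference cost; the actual speedup depends on the model and backend.

```csharp
using System;

// Hypothetical helper for comparing model input sizes.
public static class InputSize
{
    // Number of pixels in a width x height input.
    public static long Pixels(int width, int height) => (long)width * height;

    // How many times more pixels the first input has than the second.
    public static double Ratio(int w0, int h0, int w1, int h1) =>
        (double)Pixels(w0, h0) / Pixels(w1, h1);
}
```

`InputSize.Pixels(1920, 1088)` is 2,088,960 and `InputSize.Pixels(192, 192)` is 36,864, so the smaller input has roughly 57 times fewer pixels to process per inference.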

FlorentGuinier · 2021-11-16T08:39:22Z

Hey!

OK, thanks for the feedback! I still find it quite unexpected that the ComputePrecompiled backend would take 65ms of CPU time even at high input resolution. Something might be fishy somewhere :) I'm closing the bug for now; however, please feel free to reopen as needed.

Florent

FlorentGuinier closed this as completed Nov 16, 2021
