
[1/n] Move subscription to web worker #9795

Closed

Conversation

salazarm
Contributor

@salazarm salazarm commented Sep 26, 2022

Summary & Motivation

This PR recreates #8734

Organizations have runs that can output thousands of log lines, upwards of 500 MB of logs. This causes extreme slowdown in the UI as the main thread struggles to keep up with the flood of logs coming from the backend, both while the run is active and especially once the run has completed and all of the logs are loaded upfront. The main source of lag in the UI is deserializing the JSON data. To reduce this time, we're planning to send only the logs/data for things that are actually visible in the UI.

https://elementl.quip.com/To6cAXcqGIbi/rfc-Improving-Job-Run-Performance

Right now the worker is only enabled if the query parameter `worker=1` is in the URL.
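A minimal sketch of that opt-in gating, under the assumption it reads the page's query string (the `shouldUseWorker` name is illustrative, not from the PR):

```typescript
// Illustrative sketch of the ?worker=1 opt-in described above.
// `shouldUseWorker` is a hypothetical name, not the PR's actual code.
export function shouldUseWorker(search: string): boolean {
  return new URLSearchParams(search).get('worker') === '1';
}

// In the browser this would be called as shouldUseWorker(window.location.search)
```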

How I Tested These Changes

Loaded a huge run with dagit-debug and saw everything loaded fine.

Ran a few runs with and without the worker, and the worker consistently beat the non-worker version:

many_events.py:

No worker:

  • 115,898.3 ms
  • 114,627.2 ms
  • 115,695.5 ms
  • 116,075.1 ms
    Average: 115.5 seconds

Worker:

  • 97,091.0 ms
  • 96,878.7 ms
  • 97,609.2 ms
  • 97,558.4 ms
    Average: 97.2 seconds

~16% improvement

Adtriba run debug file:

No Worker:

  • 332,907.6 ms
  • 328,559.4 ms
  • 311,353.5 ms
  • 324,224.7 ms

Average: 324.26 seconds

Worker:

  • 96,900.2 ms
  • 102,775.2 ms
  • 99,352.2 ms
  • 100,879.4 ms

Average: 99.98 seconds

~69% improvement


Member

@alangenfeld alangenfeld left a comment


I see 1/N in the title, but the PR summary does not make it clear what the expected scope of this is. From what I can tell, this on its own doesn't improve anything; it just adds a web worker hop to parsing the same volume of data.

```ts
worker.addEventListener('message', (event) => {
  try {
    onLogs(JSON.parse(arrayBufferToString(event.data)));
  } catch (_) {}
});
```
Member

we should make some noise somewhere on failure - this would be really gnarly to debug

Contributor Author

Yeah, I'll remove the try/catch
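A sketch of what surfacing the failure could look like, assuming the PR's UTF-16 `arrayBufferToString` helper; the `parseLogsPayload` name and listener wiring are illustrative, not the actual fix:

```typescript
// Mirrors the UTF-16 decoding helper from this PR.
function arrayBufferToString(buf: ArrayBuffer): string {
  return String.fromCharCode(...Array.from(new Uint16Array(buf)));
}

// Decode a worker payload; JSON.parse errors propagate instead of being
// silently swallowed. `parseLogsPayload` is a hypothetical name.
export function parseLogsPayload(data: ArrayBuffer): unknown {
  return JSON.parse(arrayBufferToString(data));
}

// In the browser, the listener could then make noise on failure:
// worker.addEventListener('message', (event) => {
//   try {
//     onLogs(parseLogsPayload(event.data));
//   } catch (err) {
//     console.error('Failed to decode worker message', err);
//     throw err;
//   }
// });
```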

@salazarm
Contributor Author

salazarm commented Sep 26, 2022

@alangenfeld Yeah, this is just setup for what we want to do, which is to send only the data that would be visible from the web worker to the main thread, so that the main thread does less work overall.

The linked Linear task has more information, and here's the quip describing the plan: https://elementl.quip.com/To6cAXcqGIbi/rfc-Improving-Job-Run-Performance

By the way, I did four trial runs each with and without the worker, and the worker version was consistently faster at loading a run with 200k events.

@salazarm
Contributor Author

salazarm commented Sep 27, 2022

Just tested this on Adtriba's debug file from a while ago and the improvements are huge there:
No Worker:

  • 332,907.6 ms
  • 328,559.4 ms
  • 311,353.5 ms
  • 324,224.7 ms

Worker:

  • 96,900.2 ms
  • 102,775.2 ms
  • 99,352.2 ms
  • 100,879.4 ms

I don't see a lot more room for improvement on the frontend, but my MacBook's Python server is pretty slow, so it's the bottleneck for me. On a faster machine it's possible it could overwhelm the frontend. I'll see if I can set up Dagster on my personal workstation PC.

@alangenfeld
Member

What exactly is that time measuring? May be useful to include profiler screenshots.

It's not obvious to me why things are faster with this change if the volume of data parsed is the same in both approaches, unless

the main source of lag in the UI is deserializing the JSON data

isn't right or needs some more specific detail.

@salazarm
Contributor Author

What exactly is that time measuring? May be useful to include profiler screenshots.

I can put up a commit with the test setup. It just measures from page load until the last log message from the webserver is handled (I used a requestAnimationFrame to get the time when rendering of the last message is done).
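Roughly, the measurement approach described above could be sketched like this; the names and wiring are assumptions, not the actual test commit, and the averaging mirrors how the four-trial numbers in this thread were reported:

```typescript
// Average several trial timings (in ms) into seconds, the way the
// figures quoted in this thread were produced.
export function averageSeconds(trialsMs: number[]): number {
  const totalMs = trialsMs.reduce((sum, ms) => sum + ms, 0);
  return totalMs / trialsMs.length / 1000;
}

// Hypothetical browser wiring:
// const startMs = performance.now();
// function onLastMessageHandled() {
//   // requestAnimationFrame fires after the final batch has rendered,
//   // so the last render is included in the measurement.
//   requestAnimationFrame(() => trials.push(performance.now() - startMs));
// }
```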

It's not obvious to me why things are faster with this change if the volume of data parsed is the same in both approaches, unless
the main source of lag in the UI is deserializing the JSON data
isn't right or needs some more specific detail.

Yeah, I'm going to be honest, it did catch me by surprise. I can do some more digging to figure out exactly where the time is coming from. I set up my workstation, so hopefully a beefier setup will make it easier to profile (my MacBook hangs every time I try to profile the non-worker version).

Member

@hellendag hellendag left a comment


The perf benefit looks really cool. Thanks for putting this together!

Comment on lines +1 to +13
```ts
import {gql} from '@apollo/client';

export const PYTHON_ERROR_FRAGMENT = gql`
  fragment PythonErrorFragment on PythonError {
    __typename
    message
    stack
    cause {
      message
      stack
    }
  }
`;
```
Member

This fragment is already defined in PythonErrorInfo, can you reuse that?

Contributor Author

Yeah, I didn't want the initial PR to have a ton of changes so that it's easier to review, but I'll do what I did in the original PR: move the dependencies around and change all the imports so that there's only one definition.


```ts
let apolloClient: ApolloClient<any> | undefined = undefined;

export function setup(data: any) {
```
Member

Can this type be tightened up?

```ts
});

apolloClient = new ApolloClient({
  cache: new InMemoryCache(),
```
Member

Worth using createAppCache?

Comment on lines +1 to +24
```ts
// https://developer.chrome.com/blog/how-to-convert-arraybuffer-to-and-from-string/

export function arrayBufferToString(buf: ArrayBuffer): string {
  return String.fromCharCode.apply(null, new Uint16Array(buf) as any);
}

// Chunk the data into 16,000-character pieces to avoid hitting max-argument
// errors when converting back to a string with String.fromCharCode.apply
const MAX_BUFFER_SIZE = 16000;
export function stringToArrayBuffers(str: string): ArrayBuffer[] {
  const buffers: ArrayBuffer[] = [];
  let buf = new ArrayBuffer(0);
  let view = new Uint16Array(buf);
  for (let i = 0, strLen = str.length; i < strLen; i++) {
    const index = i % MAX_BUFFER_SIZE;
    if (index === 0) {
      buf = new ArrayBuffer(Math.min(MAX_BUFFER_SIZE, str.length - i) * 2); // 2 bytes per char
      view = new Uint16Array(buf);
      buffers.push(buf);
    }
    view[index] = str.charCodeAt(i);
  }
  return buffers;
}
```
Member

The commented URL looks fairly old and has an update note -- is there a more recent native API we can use for this?

Contributor Author

This is using native APIs underneath; it's just chunking to avoid going over the max number of arguments allowed for a function (since we call String.fromCharCode.apply on the buffer, the number of arguments is the size of the array).
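On the question of a more recent native API: `TextEncoder`/`TextDecoder` could plausibly replace the chunked conversion, since they handle arbitrary lengths without hitting argument limits. Note they produce UTF-8 bytes rather than the UTF-16 code units used here, so both the worker and main-thread sides would need to switch together. A sketch, not the PR's code:

```typescript
// UTF-8 encode: no chunking needed, no max-argument limit.
export function stringToBuffer(str: string): ArrayBuffer {
  const bytes = new TextEncoder().encode(str);
  // Slice to the exact byte range in case the view doesn't own its whole buffer.
  return bytes.buffer.slice(bytes.byteOffset, bytes.byteOffset + bytes.byteLength);
}

export function bufferToString(buf: ArrayBuffer): string {
  return new TextDecoder().decode(buf);
}
```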

```ts
import {PipelineRunLogsSubscription} from '../../runs/types/PipelineRunLogsSubscription';

export type Message = INITIALIZE;
type INITIALIZE = {
```
Member

type InitializeMessage? It's a little confusing on the eyes for a type to be ALL_CAPS since it looks like a constant value at a glance.

Comment on lines 3 to 6
```ts
type SHUTDOWN = {
  type: 'SHUTDOWN';
  staticPathRoot?: undefined;
};
```
Member

Define INITIALIZE and SHUTDOWN types in the same location in a union so that you can have data be a single union type? Just to tidy things up a bit.
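The suggested tidy-up might look something like this; the field names beyond `type` are guesses from the snippets above, not the final code:

```typescript
// Both message shapes defined together as a discriminated union,
// so handlers can narrow on the `type` field.
type InitializeMessage = {
  type: 'INITIALIZE';
  staticPathRoot: string;
};

type ShutdownMessage = {
  type: 'SHUTDOWN';
};

export type Message = InitializeMessage | ShutdownMessage;

export function handle(msg: Message): string {
  switch (msg.type) {
    case 'INITIALIZE':
      // Narrowed to InitializeMessage here, so staticPathRoot is available.
      return `initialize: ${msg.staticPathRoot}`;
    case 'SHUTDOWN':
      return 'shutdown';
  }
}
```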

@salazarm
Contributor Author

salazarm commented Sep 28, 2022

I profiled both the worker and non-worker versions. Getting a Chrome profile is difficult because the buffer fills up pretty fast and the load takes a long time, so instead I counted the number of GanttChart re-renders, the number of calls to throttleSetNodes, and the number of times throttleSetNodes actually ran. This is with the 500 MB debug file. I ran these runs multiple times and the results were very consistent.

[screenshot: non-worker vs. worker render counts]
You can see the worker version did more batching because it had fewer runs of setNodes than there were calls to throttleSetNodes. This is because the GanttChart re-render is slow, so the worker ends up queuing multiple messages for the main thread to process. I'm not entirely sure why this behavior doesn't happen without the worker. This means the non-worker version scales linearly with the number of batches of messages, while the worker version takes only as long as the webserver takes, plus the time of one render for the last batch of messages.

In this profile I turned off the GanttChart, which is the slowest part of the page to re-render. With the GanttChart disabled, there were no batched throttleSetNodes calls in either the worker or non-worker version, because the rendering was done before the webserver sent the next message.

[screenshot: non-worker vs. worker, Gantt chart disabled]

With the worker version, the main bottleneck now is the Python webserver, so I think we can skip all the other virtualization work since we wouldn't get much benefit. I suspect there are a lot of unnecessary React re-renders in the Gantt chart, but I'm not entirely sure. It's probably not worthwhile speeding this up too much because the app still feels usable, though the last render is brutal and can last 10 seconds.

@salazarm
Contributor Author

salazarm commented Sep 28, 2022

I was able to take a full profile for both versions with the Gantt chart off.

Non-worker:
[profiler screenshots]

Worker:
[profiler screenshots]

One interesting thing is that the non-worker version has double the "system" time, though it's not much. In both cases the main thread spends a lot of time idle just waiting for the next message, but in the non-worker case there is a lot more idle time. I'm not entirely sure why; there are a lot of possibilities here.

That last render is very long

I should diff the bottom-up performance data to figure out what that extra scripting time is in the non-worker version. I was expecting both of them to do the same amount of work, since it's the same number of renders for each one.

@salazarm salazarm changed the title [1/n] Virtualize run data subscription - Move subscription to web worker [1/n] Move subscription to web worker Sep 28, 2022
@salazarm
Contributor Author

@hellendag I removed the branching based on the URL and made the worker the default since it performs better. Also, thanks for the detailed feedback! I know the PR was really rough; I was mostly trying to get something quick out there for feedback and testing.

@salazarm salazarm force-pushed the marco/brr-289-run-page-performance-w-large-jobs-1 branch from d28ab99 to 704767a on September 29, 2022 at 20:20 (UTC)
Member

@alangenfeld alangenfeld left a comment


Copying this over from Slack DMs; context was investigating the mysterious overall speed-up of the worker version:

So I was playing around with this change:

```ts
if (!logs.hasMorePastEvents) {
  requestAnimationFrame(throttledSetNodes);
}
```

just against master, and noticed in the React profiler that even though we try not to set nodes, the Apollo subscription update still causes RunWithData to re-render every time we receive a websocket payload.

[React profiler screenshot]

which upon googling appears to be a common source of performance issues

https://stackoverflow.com/questions/61876931/how-to-prevent-re-rendering-with-usesubscription
https://medium.com/@seniv/improve-performance-of-your-react-apollo-application-440692e37026
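One mitigation pattern those links describe, sketched here generically as an assumption rather than what this PR does: buffer subscription payloads in a mutable structure instead of triggering a state update (and thus a re-render) on every websocket message, and only re-render periodically.

```typescript
// Hypothetical sketch: accumulate subscription payloads in a plain array and
// invoke the render callback once per `flushEvery` payloads, rather than once
// per payload. `makeBufferedHandler` is an illustrative name.
export function makeBufferedHandler<T>(render: () => void, flushEvery: number) {
  const buffer: T[] = [];
  let sinceRender = 0;
  return {
    buffer,
    onPayload(payload: T) {
      buffer.push(payload); // no re-render here
      sinceRender++;
      if (sinceRender >= flushEvery) {
        sinceRender = 0;
        render(); // one re-render covers many payloads
      }
    },
  };
}
```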
