
Reland "Migrate darwin common "framework_shared" target to ARC #37049" #37883

Merged
merged 1 commit into from Dec 15, 2022

Conversation

cyanglaz
Contributor

@cyanglaz cyanglaz commented Nov 23, 2022

Relands #37219, which was reverted in #37320 due to a benchmark regression: flutter/flutter#114697

There are no changes in this PR compared to #37219.

The benchmark regression is caused by ARC, specifically here: https://github.com/flutter/engine/blob/main/shell/platform/darwin/common/framework/Source/FlutterStandardCodec.mm#L458-L464 .
This regression is inherent to ARC and has been discussed in several places online: https://jimmymandersson.medium.com/understanding-arcs-effect-on-your-apps-performance-377e39076a88

I created two simple sample test cases comparing ARC and MRC performance; the results (from my machine) show that the ARC version of the test took 8ms and the MRC version took 6ms: https://github.com/cyanglaz/MrcVSArc

ARC has many benefits, so I think we should still migrate to it. Please advise if there are any ways to optimize the ARC code in this PR to reduce the regression.

I also tested the performance with all of the "copy"s reverted to "retain", but the improvement was minimal. I'd suggest keeping them as copy so the code is safer.

part of flutter/flutter#112232

cc @zanderso @stuartmorgan @gaaclarke for thoughts.

Pre-launch Checklist

  • I read the Contributor Guide and followed the process outlined there for submitting PRs.
  • I read the Tree Hygiene wiki page, which explains my responsibilities.
  • I read and followed the Flutter Style Guide and the C++, Objective-C, Java style guides.
  • I listed at least one issue that this PR fixes in the description above.
  • I added new tests to check the change I am making or feature I am adding, or Hixie said the PR is test-exempt. See testing the engine for instructions on writing and running engine tests.
  • I updated/added relevant documentation (doc comments with ///).
  • I signed the CLA.
  • All existing and new tests are passing.

If you need help, consider asking for advice on the #hackers-new channel on Discord.

@stuartmorgan
Contributor

The benchmark regression is caused by ARC, specifically here: https://github.com/flutter/engine/blob/main/shell/platform/darwin/common/framework/Source/FlutterStandardCodec.mm#L458-L464 .

I don't have any specific suggestions for optimizing that code under ARC. That said, I'm also not super concerned about the exact perf of that piece of code; anyone passing huge amounts of data over method channels as a List<Object> is going to have a bad time regardless, and the regression wasn't huge. (15% isn't ideal of course, but it's not like it was doubling or something.)

ARC has many benefits so I think we should still migrate to it.

Agreed, I would not recommend holding back ARC migration on this benchmark regression. I would expect a non-trivial reduction in the number of crashes in Flutter applications over time by migrating to ARC, in addition to easier maintenance (which frees eng resources for other tasks--like going back and optimizing critical codepaths), both of which are very significant benefits.

I also have tested the performance if all the "copy"s are reverted to "retain", but the improvement is very minimal.

That seems expected to me; the primary foundation types all use CoW under the hood, I believe.

@gaaclarke
Member

The benchmark regression is caused by ARC, specifically here: https://github.com/flutter/engine/blob/main/shell/platform/darwin/common/framework/Source/FlutterStandardCodec.mm#L458-L464 .

Why not just pull this logic into its own function in a file that is compiled with MRC? You can even add the inline keyword and hope it gets inlined.

NSArray* FlutterStandardReaderReadArray(FlutterStandardReader* reader) {
  UInt32 length = [reader readSize];
  NSMutableArray* array = [NSMutableArray arrayWithCapacity:length];
  for (UInt32 i = 0; i < length; i++) {
    id value = [reader readValue];
    [array addObject:(value == nil ? [NSNull null] : value)];
  }
  return array;
}

I'd have to look at the assembly output to understand why it is slower; my guess is that the extra retain and release calls for value inside the loop create extra indirection and overhead versus just waiting for the autorelease pool to dump it. I'll see if there is some magic ARC annotation that will give you similar behavior.

@gaaclarke
Member

gaaclarke commented Nov 29, 2022

Here's the loop with ARC:

LBB2_2:                                 ;   in Loop: Header=BB2_4 Depth=1
	ldr	x0, [x24, _OBJC_CLASSLIST_REFERENCES_$_.1@PAGEOFF]
	bl	_objc_msgSend$null
	mov	x29, x29
	bl	_objc_retainAutoreleasedReturnValue
	mov	x23, x0
	mov	x0, x21
	mov	x2, x23
	bl	"_objc_msgSend$addObject:"
	mov	x0, x23
	bl	_objc_release
LBB2_3:                                 ;   in Loop: Header=BB2_4 Depth=1
	mov	x0, x22
	bl	_objc_release
	subs	w20, w20, #1
	b.eq	LBB2_6
LBB2_4:                                 ; =>This Inner Loop Header: Depth=1
	mov	x0, x19
	bl	_objc_msgSend$readValue
	mov	x29, x29
	bl	_objc_retainAutoreleasedReturnValue
	mov	x22, x0
	cbz	x0, LBB2_2
; %bb.5:                                ;   in Loop: Header=BB2_4 Depth=1
	mov	x0, x21
	mov	x2, x22
	bl	"_objc_msgSend$addObject:"
	b	LBB2_3

In order to get rid of all the extra retains and releases I used the __unsafe_unretained annotation, which should give you better performance.

-(NSArray*)readArray {
  UInt32 length = [self readSize];
  NSMutableArray* array = [NSMutableArray arrayWithCapacity:length];
  for (UInt32 i = 0; i < length; i++) {
    __unsafe_unretained id value = [self readValue];
    [array addObject:(value == nil ? [NSNull null] : value)];
  }
  return array;
}

LBB2_2:                                 ;   in Loop: Header=BB2_3 Depth=1
	ldr	x0, [x23, _OBJC_CLASSLIST_REFERENCES_$_.1@PAGEOFF]
	bl	_objc_msgSend$null
	mov	x29, x29
	bl	_objc_retainAutoreleasedReturnValue
	mov	x22, x0
	mov	x0, x21
	mov	x2, x22
	bl	"_objc_msgSend$addObject:"
	mov	x0, x22
	bl	_objc_release
	subs	w20, w20, #1
	b.eq	LBB2_5
LBB2_3:                                 ; =>This Inner Loop Header: Depth=1
	mov	x0, x19
	bl	_objc_msgSend$readValue
	mov	x29, x29
	bl	_objc_unsafeClaimAutoreleasedReturnValue
	cbz	x0, LBB2_2
; %bb.4:                                ;   in Loop: Header=BB2_3 Depth=1
	mov	x2, x0
	mov	x0, x21
	bl	"_objc_msgSend$addObject:"
	subs	w20, w20, #1
	b.ne	LBB2_3

I didn't benchmark them, but it should be faster, although maybe not as fast as MRC because of the _objc_unsafeClaimAutoreleasedReturnValue call.

edit: the "unsafe" prefix is scary but in this case you are fine since the autorelease pool has a retain on the object.

@gaaclarke
Member

To be clear, after investigating the problem I don't feel there is any legitimate reason to accept this loss of performance; this is something we have worked hard to optimize. You can either pull the loop out into an MRC file or use the __unsafe_unretained annotation in conjunction with enabling ARC.

@cyanglaz
Contributor Author

cyanglaz commented Nov 29, 2022

I tried __unsafe_unretained, which crashed because ARC released the object returned by [self readValue] right away, leaving value pointing to an invalid address.

Keeping this method in MRC to avoid the performance regression is OK.

However, there will be many performance regressions like this as we migrate to ARC (some might be caught by benchmarks, some might not). If we are going to live with a partially MRC embedder for performance reasons, we have to define some rules to draw a line between what we migrate and what we don't.

@gaaclarke
Member

gaaclarke commented Nov 29, 2022

I tried __unsafe_unretained, which crashed because ARC released the object returned by [self readValue] right away, leaving value pointing to an invalid address.

Probably not worth debugging since functions like objc_unsafeClaimAutoreleasedReturnValue aren't really documented and we have a workaround, but I'd be curious to see whether that is the call that is clearing out the variable (and to see the assembly of your readValue method).

we have to define some rules to draw a line between what we migrate and what we don't.

The line here is this is a particularly sensitive benchmark that potential customers evaluate when making decisions about adopting Flutter or not.

@cyanglaz
Contributor Author

The line here is this is a particularly sensitive benchmark that potential customers evaluate when making decisions about adopting Flutter or not.

I guess the rule for the migration, for now, would be: if we break a benchmark, we evaluate it case by case and decide whether some code should remain in MRC.

@stuartmorgan
Contributor

The line here is this is a particularly sensitive benchmark that potential customers evaluate when making decisions about adopting Flutter or not.

What's the real-world use case for a huge number of objects in a list over platform channels?

@gaaclarke
Member

What's the real-world use case for a huge number of objects in a list over platform channels?

Where is the guarantee that it isn't a use-case? Platform channel performance is of particular importance to customers evaluating Flutter for add-to-app. That isn't a hypothetical. I'm happy to discuss privately. I'll also add that the list doesn't have to be large; chatty, smaller payloads would experience the same overhead. Decoding messages happens on the UI thread, where we are working with a budget of about 16ms to maintain smooth animations.

@stuartmorgan
Contributor

Where is the guarantee that it isn't a use-case?

Nowhere. My position is based on the widely held industry view that premature optimization is generally a bad thing, so the bar for making code more complex and more difficult to maintain is not "do it unless you can prove that performance will never be a problem", but rather "only do it if there's concrete evidence that it is".

I'll also add that that the list doesn't have to be large, chatty smaller payloads would experience the same overhead.

What is the impact (either in ms or percentage of overall message time) of just this retain/release in a list with, say, 10 objects? Maybe I'm misunderstanding the benchmark; I thought it was visible here only because of having a very high element count.

Platform channel performance is of particular importance to customers evaluating Flutter for add-to-app. That isn't a hypothetical.

Does this benchmark reflect real-world-scenario performance though? Or is it accidentally and artificially highlighting one part that would be much smaller in most use cases?

@gaaclarke
Member

Yea, this code has been production code for many years and we've got direct feedback from large clients about platform channel performance, specifically when evaluating for add-to-app scenarios where there is more communication over channels. I think we are beyond "premature optimization" =)

The benchmark isn't absurd. The list size is 1000 elements of various small types. The performance regression is 13% slower. Here's where the test payload is created: https://github.com/flutter/flutter/blob/71139099c5c4069f13dba4387b98af6158735f62/dev/benchmarks/platform_channels_benchmarks/lib/main.dart#L199

@stuartmorgan
Contributor

stuartmorgan commented Nov 30, 2022

we've got direct feedback from large clients about platform channel performance

Was it about Lists of 1000s of objects? I'm well aware that platform channel performance has been an issue before. I've been consistently saying that I don't see any reason to believe that the performance of this particular case is likely to be important, not that no platform channel performance matters.

specifically when evaluating for add-to-app scenarios where there is more communication over channels.

There is a difference between "more communication" and "communication with these specific properties".

I think we are beyond "premature optimization" =)

I have asked several questions that would help me evaluate whether that's actually true:

  • "What's the real-world use case for a huge number of objects in a list over platform channels?"
  • "What is the impact (either in ms or percentage of overall message time) of just this retain/release in a list with, say, 10 objects?"

I haven't seen answers to those questions, just assertions that platform channel performance matters in some general way—which, again, I am not, and was not, disputing.

The benchmark isn't absurd. The list size is 1000 elements of various small types.

I didn't say that it was absurd, just that it doesn't seem like it would reflect typical use cases. Just saying it isn't absurd doesn't address my questions, it just dismisses them.

@gaaclarke
Member

Just saying it isn't absurd doesn't address my questions, it just dismisses them.

I think looking at it that way is an unfair characterization.

Ultimately what you are asking for is analytics about real-world usage of platform channels. Then we could know exactly which cases matter most. We don't have that, can't have that, and even if we did, we'd still have to generalize a solution that balances the needs of all usages. The best we can do is extrapolate based on what users have told us. That information is confidential. I'm happy to discuss privately. Xiao is another good resource if you want.

I think you are getting a bit bogged down on maintenance being easier with ARC, but if you look at the blame for FlutterCodecs.mm, the majority of it was written 6 years ago and hasn't changed in any meaningful way. In fact, it's very sensitive to any change, probably more than any other part of the engine. So what really is the benefit of converting a file to ARC that doesn't change and has years of production testing, at the cost of performance? Even if, for argument's sake, that performance drop is rare?

@stuartmorgan
Contributor

Ultimately what you are asking for is analytics about real-world usage of platform channels. Then we can know exactly what cases matter the most. We don't have that, can't have that and even if we did have it we'd still have to generalize a solution that balances the needs of all usages.

I am not asking for a complete accounting for every single current and future use of platform channels, I'm asking for very specific information about what we already know.

The best we can do is extrapolate based on what users have told us.

Agreed, which is why I'm asking what we already know about use cases based on what we've been told. Unlike your request above that I prove an impossible negative, I'm just asking whether we have actual evidence that this benchmark corresponds to a real-world use case (since it doesn't at all match up with anything I've personally seen, and I'm struggling to imagine a non-contrived use case). Either we do have such evidence, in which case this is an easy question to answer and thus resolve my concern, or we don't, in which case I continue to be skeptical about that benchmark reflecting real-world outcomes rather than being a potentially misleading microbenchmark (and in which case I'd like to know the regression for usage that looks like everything I've seen in the wild, which is lists several orders of magnitude smaller than this).

That information is confidential. I'm happy to discuss privately. Xiao is another good resource if you want.

I can follow up privately if needed, but let me rephrase my question to avoid issues of confidentiality first: have you seen a real-world use case that looks anything like a single message containing a 1000-object list?

So what really is the benefit of changing a file to ARC that doesn't change and has years of production testing at the cost of performance?

Having pockets of MRC code in an ARC codebase is a significant liability, because it is extremely easy for people to change it and/or review it without realizing that it's MRC:

  • Leaky MRC code looks exactly like correct ARC code, and compiles just fine.
  • It's very easy to have changes/diffs where nothing in the surrounding context would indicate to someone that it is MRC code (and thus needs to be reviewed as such).

When I say that it's extremely easy, I say this as someone with years of experience bouncing back and forth between ARC and non-ARC reviews (in the context of readability reviews in particular), and even when I knew to be vigilant because it was new code, I still sometimes got it wrong. If we fast-forward, say, a year, at which point this could easily be the only MRC Obj-C code in the engine, mistakes would become even more likely because nobody would be expecting to see MRC code in a review. Having the file be rarely edited is a double-edged sword: while it reduces the opportunity for mistakes, it also means people are even less likely to be on the lookout for them when those rare changes do come up.

@gaaclarke
Member

Three more important datapoints:

  1. b/229707726 has CPU traces from, and conversations with, customer:money. We had a multi-month effort to claw back performance in the startup time of their app, where platform channels were a significant portion. It would be a mistake to dismiss any portion of time in this code as inconsequential when we have a real-world use case and a large investment in customer:money that proves otherwise.
  2. At some point I introduced platform channel analytics that can be turned on locally, and customer:money did use it. The link is in that b/ conversation. You can see there are channels that are sending payloads of 27kb. The payload of the microbenchmark is 14kb.
  3. The algorithm for encoding and decoding is O(n), so thinking this only affects large payloads isn't the whole story. Based on our two datapoints we can extrapolate to any payload size; for example, a 0.14kb payload should have a 1% regression.

@stuartmorgan
Contributor

It would be an mistake to dismiss any portion of time in this code as inconsequential

Can you point me to the part of the conversation about readValue? I only see discussion of writeValue.

when we have a real world use case and a large investment in customer:money that proves otherwise.

This continues to conflate "real world use case of platform channel performance being important" and "real world use case of this particular piece of platform channels being important".

  2. At some point I introduced platform channel analytics that can be turned on locally and customer:money did use it. The link is in that b/ conversation. You can see there are channels that are sending payloads of 27kb. The payload of the microbenchmark is 14kb.

I'm not sure where in the thread that is, perhaps you can point me to the specific part offline. But payload size doesn't tell us anything relevant to this benchmark regression without knowing what the composition of the payload is. Adding fixed-size-per-item overhead to list decoding will have ~zero effect on a list of two 15kb objects, and a lot of effect on a list of 30,000 1-byte objects.

  3. The algorithm for encoding and decoding is O(n), so thinking this only affects large payloads isn't the whole story. Based on our two datapoints we can extrapolate to any payload size, for example a 0.14kb payload should have a 1% regression.

The specific piece that decodes per-list-item entries is O(n), yes, but unless the benchmark only tests that one exact piece of code (in which case it would definitely be a really unrealistic microbenchmark), the overall benchmark won't scale linearly.
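To make the non-linear scaling point concrete, here is a toy cost model. All constants are made up for illustration (none are measured from the engine): a message is assumed to cost a fixed overhead plus a per-element decode cost, with ARC adding a small extra amount per element.

```python
# Toy model (illustrative numbers, not measurements): total decode time is a
# fixed per-message overhead plus a per-element cost; ARC adds a small fixed
# cost per element. The percentage regression then depends on element count.
def regression_pct(n, fixed_us=50.0, per_item_us=1.0, arc_extra_us=0.15):
    base = fixed_us + per_item_us * n          # hypothetical MRC cost
    arc = base + arc_extra_us * n              # hypothetical ARC cost
    return 100.0 * (arc - base) / base

for n in (10, 100, 1000, 10000):
    print(n, round(regression_pct(n), 2))
```

Under these assumptions the regression for a 10-element list is a small fraction of the headline number, and the percentage only approaches its ceiling for very large lists, which is why the 1000-element benchmark can't simply be extrapolated linearly to other payload shapes.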

@stuartmorgan
Contributor

stuartmorgan commented Dec 6, 2022

  1. We had a multi-month effort to claw back performance in the startup time of their app, where platform channels were a significant portion. It would be a mistake to dismiss any portion of time in this code as inconsequential when we have a real-world use case and a large investment in customer:money that proves otherwise.

Also, wasn't this the investigation that concluded that specifically optimizing platform channel encode/decode performance, even though it had showed up in CPU profiles, actually had almost no measurable real-world impact on startup time? Because if so, that seems to directly support the argument that microoptimization of this particular piece of code is not so high priority that it should block ARC migration.

@gaaclarke
Member

Can you point me to the part of the conversation about readValue? I only see discussion of writeValue.

You can see readValue's cost in the CPU traces. Keep in mind these were recorded on Android and, at least in the first CPU trace, only the UI thread is captured. The reading cost there is equal to the writing cost (it's helpful to pivot on StandardMessageCodec).

This continues to conflate "real world use case of platform channel performance being important" and "real world use case of this particular piece of platform channels being important".

Right, we cannot include every use case as a benchmark, so it requires a bit of extrapolation.

I'm not sure where in the thread that is, perhaps you can point me to the specific part offline. But payload size doesn't tell us anything relevant to this benchmark regression without knowing what the composition of the payload is. Adding fixed-size-per-item overhead to list decoding will have ~zero effect on a list of two 15kb objects, and a lot of effect on a list of 30,000 1-byte objects.

We don't have analytics about the actual composition of the payload. What we do know is that they are using the StandardMessageCodec, so it is unlikely to be one item. The idea that the contributing factor to the regression is list size is a theory, one that was called into question when Chris extracted that function and compiled it with MRC but didn't see an improvement. We know the channel that is generating these large payloads, and we have contacts on the team who may know the payloads if we want to dig deeper.

The specific piece that decodes per-list-item entries is O(n), yes, but unless the benchmark only tests that one exact piece of code (in which case it would definitely be a really unrealistic microbenchmark), then the overall benchmark won't scale linearly.

I don't think I understand this point. That particular benchmark was designed for test coverage, to make sure we are testing all data types, and the size was chosen so that the probability of it being seen in the wild is greater than 1%. We can always create another one if we want.


Stepping back, here are the takeaways you should have from looking at the conversation with customer:money:

  1. On the question of whether this regression represents the needs of any users: the probability of that should be increased in your mind, since we've shown the payload size and technique are in fact used in real-world applications.
  2. On the question of whether the performance of platform channels matters: looking at the traces of customer:money's app startup, the timescales of what we are trying to optimize, and their investment in the problem, the probability that this regression matters should have also increased in your mind.

Now the question is: what is the probability that this regression will not negatively affect any users? I suspect that is very low given our limited data. So hopefully we can dismiss the notion that this regression wouldn't affect any user, or that this microbenchmark is unrepresentative, on account of that being unlikely.

So, now the calculus becomes simply: "Is the cost of maintaining one MRC file that rarely changes and has 6 years of testing greater than the cost of negatively affecting users' performance?"

@stuartmorgan
Contributor

Now the question is: what is the probability that this regression will not negatively affect any users?

That is not the only question. "How many users will be affected" and "how much will users be affected" are also very important questions.

So, now the calculus becomes simply: "Is the cost of maintaining one MRC file that rarely changes and has 6 years of testing greater than the cost of negatively affecting users performance?"

There are several implicit value judgements baked into that "simple" distillation of the problem. I could just as accurately say that the calculus is "Is the cost of permanently maintaining a mixed MRC/ARC codebase, thus increasing the probability of future crash and leak bugs, greater than the cost of regressing a benchmark whose direct applicability to real-world performance has not been clearly established?"

My opinion is that we should move forward with the ARC migration without a carve-out for the code covered by this benchmark, giving ourselves a solid foundation for future engine work. If in the future we have further evidence that platform channel encoding/decoding in particular is a sufficiently high priority for revisiting for optimization, we can investigate the best way to optimize it (which, per comments in the related doc, would probably not be via MRC ObjC, but some other solution).

@jmagman
Member

jmagman commented Dec 6, 2022

My opinion is that we should move forward with the ARC migration without a carve-out for the code covered by this benchmark, giving ourselves a solid foundation for future engine work. If in the future we have further evidence that platform channel encoding/decoding in particular is a sufficiently high priority for revisiting for optimization, we can investigate the best way to optimize it (which, per comments in the related doc, would probably not be via MRC ObjC, but some other solution).

I completely agree.

@gaaclarke
Member

Okay, I'll try to distill all the variables used for making the decision. If we can agree on those, at least I can feel confident that you are making this judgement with all the available information. Making a decision without all the available information is the worst outcome here. It's hard for me to believe that if I conveyed everything I know about the work I've done on platform channels, you'd still think migrating this one file from MRC was worth it. So humor me in trying to lay out everything I know again.

Since we are dealing with uncertain variables, I'll give them rough probabilities backed up with the limited data we have:

  1. The probability that the regression will affect a user: P(>99.99%). We've seen payloads and methods that match or exceed this size in customer:money.
  2. The probability that a user will care in the next year: P(10%). The datapoint we have here is that customer:money only cared when they finally took the time to audit their startup time. It's a tragedy of the commons, where everything contributes a little to destroying something like startup time. Most users will probably lack the resources to track down and prioritize it.
  3. The probability that the code will be modified after having been converted to ARC in such a way that ARC makes a difference (e.g. a retain would have been added): P(0.1%). Looking at the history of the file, it has hardly changed, and when it does, it isn't changing things like adding ivars where ARC would be most useful. Almost no one works on platform channels today but me, and I have zero plans for it.
  4. The probability that we'll introduce some sort of crash on stable because the code is MRC or manual memory management: P(0.00001%). We have automated testing for this, we have a staged release process, and the code doesn't change often. So I don't think it's very probable.
  5. The probability that we'll introduce some sort of memory leak in stable as a result of having an MRC file: P(0.01%). The reason this is so low is that this variable is tied to the low chance of a change to the file that actually affects memory management. Furthermore, memory leaks in platform channels are very noticeable because so much data passes through them. The risk of a leak happening at a rate so low that it wouldn't create OOMs or some other noticeable effect before hitting stable is very low. This could be mitigated with memory leak tests too; you can still leak with ARC, so those would be valuable regardless.

Hopefully that better explains how I see this choice and is a complete list. I understand ARC is better than MRC and we should adopt it. Stuart, you know me, I tried to migrate all of the engine to ARC months ago, it was a huge push. All the same, deciding to migrate this particular file in the face of these benchmarks is not pragmatic or supported by what we know about the code's history or our user's feedback.

That's all ignoring what sort of message this decision communicates about our priorities, especially when these benchmarks have been called out by potential customers as key datapoints when deciding whether to even invest in Flutter.

@stuartmorgan
Contributor

Making a decision without having all the available information is the worst outcome here.

Please answer my question above then.

@gaaclarke
Member

Also, wasn't this the investigation that concluded that specifically optimizing platform channel encode/decode performance, even though it had showed up in CPU profiles, actually had almost no measurable real-world impact on startup time? Because if so, that seems to directly support the argument that microoptimization of this particular piece of code is not so high priority that it should block ARC migration.

My understanding from working on that team is that there were no easy solutions and that multiple fronts had to be addressed to reach the overall goal. In the buganizer issue linked above you see a customer mentioning that it's hard to measure real-world performance because it's very noisy. They had to look at P95 numbers to detect a 2% decrease in startup time as a result of avoiding platform channel payload copies. While that is hard to detect, a 2% decrease is actually not insignificant, as you'll see below.

Here's another example, which is tied to the performance of the encoder: We noticed that customer:money was bottlenecked on the platform thread, which includes the code in question. In order to address this for customer:money we created background platform channels, which allow the encoding (and execution of the handlers) to happen on a background thread. Migrating one plugin over resulted in a 3.5% (47ms) decrease in startup time (b/237035671). I can't remember how much time we were trying to shave off, but let's say it was 2 seconds; that was 1/40th of our goal. We had multiple similar small gains and a few bigger ones, but that's how we reached the goal.

Note that while background platform channels provides a workaround for this regression, it isn't always possible to use them.

So on the question of whether encoding changes were measurable: what we do know is that the variance of real-world benchmarks can hide significant changes. That's why we must consider the microbenchmarks, real-world benchmarks, and profiles together when drawing conclusions; no single measurement paints the whole picture. If you want to privately link me to the results you are looking at, I can respond to them better.

@stuartmorgan
Contributor

In the buganizer issue linked above you see a customer mentioning that it's hard to measure real world performance because it's very noisy. They had to look at P95 numbers to detect a 2% decrease in startup time as a result of avoiding platform channel payload copies.

That is not what it says though. It says that with a benchmark resolution of 2% at P95 the benchmark showed no improvement.

While that is hard to detect, a 2% decrease is actually not insignificant as you'll see below.

"2%" and "0% +/-2%" are not the same thing.
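To illustrate that distinction with hypothetical numbers: a measurement whose resolution is ±2% cannot distinguish a true 2% effect from no change at all, while a larger effect (like the 3.5% one discussed below) would register. `is_measurable` is an illustrative helper, not anything from the benchmark harness:

```c
#include <math.h>

/* A true effect no larger than the benchmark's resolution reads as
 * "no measurable change" — which is different from "no change".
 * Hypothetical helper for illustration only. */
static int is_measurable(double true_effect_pct, double resolution_pct) {
    return fabs(true_effect_pct) > resolution_pct;
}
```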

Here's another example which is tied to the performance of the encoder: We noticed that customer:money was bottlenecked on the platform thread

It also has a lot to do with real-world thread contention, which this benchmark does not measure. The fact that encoder/decoder benchmark performance and real-world end-to-end performance are not at all the same thing is the core of my position.

(Edited to add one other point I forgot to include:

2. At some point I introduced platform channel analytics that can be turned on locally and customer:money did use it. The link is in that b/ conversation. You can see there are channels that are sending payloads of 27kb.

I looked at that spreadsheet, and it's showing total bytes, not individual messages, and it doesn't say how many messages there are. So what we know is that 27kb was sent in total. Which again gets back to my concern about the huge potential difference between one huge message vs. lots of small ones here.)

we created the background platform channels which allows the encoding (and execution of the handlers) to happen on a background thread.

You are continuing to argue against a straw man that I have been very clear is not my position. I have, at no point, suggested that end-to-end, real-world platform channel performance does not matter.

Nobody in this review has advocated for the removal of background platform channel support. To the extent that the background channel data is relevant at all to the PR at hand, I believe it supports my position: in the real-world example you are drawing all of this data from, the outcome was that

  • optimizing serialization had no measurable impact on startup time, while
  • moving the end-to-end process to another thread had a significant effect on startup time.

So the evidence is that encoder/decoder performance by itself, which is what this PR discussion is about, was actually not a significant driver of real-world perf outcomes based on the data we have, and that other parts of the system play a significantly larger role.

I don't think continued discussion involving me will be productive here, as we are both now repeating things we've already said. I am comfortable with the tradeoff of proceeding here for the reasons I've articulated at length above (and summarized at the end of this comment), so I'm approving this from my standpoint.

@cyanglaz given the disagreement here, please don't land based on my approval alone. At least @jmagman and @zanderso should approve (or not) before proceeding, and potentially Ian as well if this needs to be escalated.

@gaaclarke (Member) commented:

Yea, I understand the need to move forward, and at this point I feel my energy would be better spent fixing what is, in my estimation, a mistake instead of trying to convince you that it is a mistake. I would like you to at least bring this to the attention of @xster, since he has close connections to our users and is aware of this problem, in case he has any more information that would be relevant, and to at least put it on his radar since he fields many questions about this topic.

@xster quick recap of the discussion: This PR migrates StandardMessageCodec to ARC, which regresses a microbenchmark where the payload is 14k. We have seen customer:money have payloads this big. Proponents of the refactor believe that this can avert bugs in the future and don't believe there is sufficient evidence that the regression matters. My take is that while we don't have direct data from users that this is a bad regression, we have evidence that suggests it will probably affect users negatively, and the benefit of ARC is minimal because of the age and history of the code. Also of concern, and something you may have more data about, is some of the feedback we've gotten from customers who use platform channel microbenchmarks as a proxy for evaluating Flutter's fitness for add-to-app. Thanks.

@gaaclarke (Member) left a comment:

We converted some retains to copies, which may be contributing to some of the regression. The original PR didn't raise the question of whether copying the data is the correct course here.

@gaaclarke (Member) left a comment:

Requesting changes not because of the larger issue discussed, but because I want to make sure we don't land this without an explanation of why some retains were changed to copies =)

@gaaclarke (Member) commented:

I ran the profiler on this patch locally and verified I got a 10% regression. Here are the traces (isolating this test and cranking the number of iterations to 25000):

encode time

Screen Shot 2022-12-07 at 4 49 53 PM

decode time

Screen Shot 2022-12-07 at 4 50 26 PM

bottom up

Screen Shot 2022-12-07 at 4 55 49 PM

So the extra retain/releases ARC inserts are the problem, like we suspected, but it's not strictly list size that drives the regression. It would be more correct to say the number of recursive calls, i.e. the total number of items in the payload. You can see the costs of objc_autoreleaseReturnValue, objc_retainAutoreleasedReturnValue, and objc_release are multiplied by that count. I'm not sure if Jenn's suggestion elsewhere to use CoreFoundation + toll-free bridging could get us back to where we started. It may only give us the performance of ARC unless we can duplicate NSAutoreleasePool's behavior somehow. If we do get that working, though, it sort of obviates the argument about making the code easier to maintain, since it basically throws the benefit of ARC out the window.
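To make the "total number of items" point concrete, here is a toy cost model, sketched in C. It assumes one retain/release pair per value returned up the decode stack; `arc_rc_ops_for_list` is a hypothetical helper for illustration, not engine code:

```c
#include <stddef.h>

/* Toy model of ARC overhead in a recursive decoder: each value returned
 * up the call stack costs one objc_retainAutoreleasedReturnValue /
 * objc_release pair, so the extra work grows with the TOTAL number of
 * decoded items, not just the nesting depth. Hypothetical helper. */
static size_t arc_rc_ops_for_list(size_t items_per_list, size_t depth) {
    if (depth == 0) {
        return 2; /* one retain + one release for a leaf value */
    }
    /* one pair for the list itself, plus the pairs for every child */
    return 2 + items_per_list * arc_rc_ops_for_list(items_per_list, depth - 1);
}
```

Under this model a flat list of 3 leaves costs 8 pairs, while nesting one level deeper triples the child cost — the overhead tracks item count, which matches the profile above.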

I tried to duplicate the MRC code we had previously using a combination of __attribute__((ns_returns_autoreleased)), __autoreleasing, and __unsafe_unretained, so that the reader looks something like this:

- (nullable id)readValue __attribute__((ns_returns_autoreleased)) {
  __autoreleasing id result = [self readValueOfType:[self readByte]];
  return result;
}

- (nullable id)readValueOfType:(UInt8)type __attribute__((ns_returns_autoreleased)) {
  FlutterStandardField field = (FlutterStandardField)type;
  __autoreleasing id result;
  switch (field) {
    // ...
    case FlutterStandardFieldList: {
      UInt32 length = [self readSize];
      NSMutableArray* array = [NSMutableArray arrayWithCapacity:length];
      for (UInt32 i = 0; i < length; i++) {
        __unsafe_unretained id value = [self readValue];
        [array addObject:(value == nil ? [NSNull null] : value)];
      }
      result = array;
      break;
    }
  }
  return result;
}

I couldn't get the right combination to make it duplicate the performance of the MRC code. Suffice it to say, there are 2 things I know now:

  1. It won't be simple to match the MRC performance with ARC
  2. If we do match the MRC performance with ARC, the code will actually be harder to maintain since it will have goofy annotations to make sure we are doing exactly what the MRC code is doing.

TLDR: I tried to fix the regression but came away convinced it will be more difficult than I thought to get MRC performance with ARC enabled, and if we do achieve it, the result will be harder to maintain than the MRC code.

@gaaclarke (Member) commented:

Okay, I tested moving the code to CoreFoundation. I had forgotten about CFAutorelease, so we can probably duplicate MRC that way. The simple test I did was faster with CoreFoundation; it wasn't quite at the levels we were seeing with MRC, but the assembly looks close and I think we can probably duplicate it.

- (nullable id)readValue {
  return (__bridge id)ReadValue(self);
}

static CFTypeRef ReadValue(FlutterStandardReader* codec) {
  return ReadValueOfType(codec, [codec readByte]);
}

- (nullable id)readValueOfType:(UInt8)type {
  return (__bridge id)ReadValueOfType(self, type);
}

static CFTypeRef ReadValueOfType(FlutterStandardReader* codec, UInt8 type) {
  FlutterStandardField field = (FlutterStandardField)type;
  switch (field) {
    case FlutterStandardFieldNil:
      return nil;
    case FlutterStandardFieldTrue:
      return (__bridge CFBooleanRef) @YES;
    case FlutterStandardFieldFalse:
      return (__bridge CFBooleanRef) @NO;
    case FlutterStandardFieldInt32: {
      SInt32 value;
      [codec readBytes:&value length:4];
      return (__bridge CFNumberRef) @(value);
    }
    case FlutterStandardFieldInt64: {
      SInt64 value;
      [codec readBytes:&value length:8];
      return (__bridge CFNumberRef) @(value);
    }
    case FlutterStandardFieldFloat64: {
      Float64 value;
      [codec readAlignment:8];
      [codec readBytes:&value length:8];
      return (__bridge CFNumberRef)[NSNumber numberWithDouble:value];
    }
    case FlutterStandardFieldIntHex:
    case FlutterStandardFieldString:
      return (__bridge CFStringRef)[codec readUTF8];
    case FlutterStandardFieldUInt8Data:
    case FlutterStandardFieldInt32Data:
    case FlutterStandardFieldInt64Data:
    case FlutterStandardFieldFloat32Data:
    case FlutterStandardFieldFloat64Data:
      return (__bridge CFTypeRef)[codec readTypedDataOfType:FlutterStandardDataTypeForField(field)];
    case FlutterStandardFieldList: {
      UInt32 length = [codec readSize];
      CFMutableArrayRef array = CFArrayCreateMutable(kCFAllocatorDefault, length, &kCFTypeArrayCallBacks);
      for (UInt32 i = 0; i < length; i++) {
        CFTypeRef value = ReadValue(codec);
        CFArrayAppendValue(array, (value == nil ? kCFNull : value));
      }
      return CFAutorelease(array);
    }
    case FlutterStandardFieldMap: {
      UInt32 size = [codec readSize];
      CFMutableDictionaryRef dict =
          CFDictionaryCreateMutable(kCFAllocatorDefault, size, &kCFTypeDictionaryKeyCallBacks,
                                    &kCFTypeDictionaryValueCallBacks);
      for (UInt32 i = 0; i < size; i++) {
        CFTypeRef key = ReadValue(codec);
        CFTypeRef val = ReadValue(codec);
        CFDictionaryAddValue(dict, (key == nil ? kCFNull : key), (val == nil ? kCFNull : val));
      }
      return CFAutorelease(dict);
    }
    default:
      NSCAssert(NO, @"Corrupted standard message");
      return nil;  // Unreachable when assertions are enabled; avoids a missing-return warning.
  }
}

Here is the profile of the test (it's 4.72s above):
Screen Shot 2022-12-07 at 9 23 32 PM

Here's the assembly output of the FlutterStandardFieldInt32 branch for ARC and MRC with this change:
Screen Shot 2022-12-07 at 9 26 13 PM

I'm comfortable with this change if we migrate the rest of the code to CoreFoundation. I can finish up the conversion quickly tomorrow now that we have a nice proof of concept. I have to double-check that this doesn't choke on FlutterStandardTypedData.

@gaaclarke (Member) left a comment:

The copy I mentioned looked fine. If we just migrate the codec to CoreFoundation this sounds good to me.

@cyanglaz (Contributor, Author) commented:

cyanglaz commented Dec 8, 2022

Talked to @gaaclarke offline about the CoreFoundation approach. Since it would be a good practice for me, I will prototype the migration to CoreFoundation to see the benchmark results.

@cyanglaz (Contributor, Author) commented:

After investigation, detailed in this doc, we have determined that migrating to static C functions and CoreFoundation is not trivial. I discussed this with @gaaclarke offline, and we think we should move on, accept the regression for now, and revisit platform channel performance later.

@gaaclarke (Member) left a comment:

Thanks @cyanglaz for looking into this. I strongly believe this migration to ARC should be limited to logic that does not affect benchmarks, on the grounds that customer needs trump developer comfort, especially when the gravity of the benchmark is weighed against the utility of having this one file be ARC. But assuming the team has considered that and does agree with the tradeoff, this code is correct and a fair effort was made to improve it.

@cyanglaz cyanglaz added the autosubmit Merge PR when tree becomes green via auto submit App label Dec 14, 2022
@auto-submit auto-submit bot merged commit 4881fe2 into flutter:main Dec 15, 2022
@cyanglaz cyanglaz deleted the reland_arc branch December 15, 2022 05:30
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request Dec 15, 2022
auto-submit bot pushed a commit to flutter/flutter that referenced this pull request Dec 15, 2022
…117148)

* df430c4fd Revert "[Impeller] Speculatively attempt to fix Metal PSO construction errors on host targets." (flutter/engine#38292)

* 41dd4f5e1 Revert "Revert "[Impeller] Speculatively attempt to fix Metal PSO construction errors on host targets." (#38292)" (flutter/engine#38301)

* 8ce9a3f71 Roll Skia from 3171deabd88a to b368746d696a (13 revisions) (flutter/engine#38294)

* 4881fe25e Revert "Revert "reland "Migrate darwin common "framework_shared" target to ARC #37049" (#37219)" (#37320)" (flutter/engine#37883)

* 3b2302c8d [Impeller] Remove validation log when the pipeline library is collected before pipeline is setup. (flutter/engine#38306)

* a04997c81 [Impeller] Disable impeller_unittests. (flutter/engine#38307)

* fc71faad0 License script improvements (flutter/engine#38148)

* 6a2560c35 [Windows] Synthesize modifier keys events on pointer events (flutter/engine#38138)

* b1d407563 Roll Skia from b368746d696a to 3f81f95176ce (11 revisions) (flutter/engine#38312)

* b25fcf748 Roll Skia from 3f81f95176ce to 46e8f2a18a3d (3 revisions) (flutter/engine#38314)

* 948699bba Collapse bounds calculations into DisplayListBuilder (flutter/engine#34365)

* 38367de84 Roll Fuchsia Mac SDK from u-tC0QEGUT4xQ4KOo... to VEOIaacOA75U7PYyz... (flutter/engine#38316)

* 29196519c Roll Skia from 46e8f2a18a3d to 9f728d78f10d (1 revision) (flutter/engine#38317)
loic-sharma pushed a commit to loic-sharma/flutter-engine that referenced this pull request Dec 16, 2022
loic-sharma pushed a commit to loic-sharma/flutter-engine that referenced this pull request Jan 3, 2023
gspencergoog pushed a commit to gspencergoog/flutter that referenced this pull request Jan 19, 2023
Labels: autosubmit (Merge PR when tree becomes green via auto submit App), platform-ios, platform-macos