Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ios: Battery/CPU usage audit #113

Closed
rebello95 opened this issue Jun 17, 2019 · 19 comments · Fixed by #217
Closed

ios: Battery/CPU usage audit #113

rebello95 opened this issue Jun 17, 2019 · 19 comments · Fixed by #217
Assignees
Labels
perf/cpu perf platform/ios Issues related to iOS
Milestone

Comments

@rebello95
Copy link
Contributor

rebello95 commented Jun 17, 2019

  • Analyze how Envoy Mobile affects battery usage on both iOS/Android while running and determine if this is a notable/concerning change that needs to be investigated further
  • Note: @mattklein123 is familiar with configs that we can disable (such as DNS polling). Sync with him prior to starting the investigation. @Reflejo also has some ideas for battery auditing
  • Audit CPU usage as well

Document reference here.
Android issue: #114

@rebello95 rebello95 added platform/ios Issues related to iOS perf labels Jun 17, 2019
@rebello95 rebello95 added this to the v0.2 "Primo" milestone Jun 17, 2019
@rebello95 rebello95 self-assigned this Jun 17, 2019
@rebello95
Copy link
Contributor Author

To complete this, I'm using the Energy Diagnostics tool from Xcode Instruments.

I'll be pushing to the same branch @buildbreaker is using for #114 here: #159.

This will be done using two apps:

  • Control - An app that makes a request every 0.2s to an endpoint without Envoy compiled with the app
  • Envoy - App that makes the same request at the same interval, but routed through an instance of Envoy

The apps will be run (one at a time) on a physical device (iPhone 6s iOS 12.2.x) for 1 hour with the battery/developer logging enabled (see Apple docs above), after which the CPU and battery usage will be measured.

@rebello95
Copy link
Contributor Author

Battery analysis (test apps)

Ran the above experiment, and found the following results.

  • Each app executed ~18,000 requests over the course of the 1 hour
  • The device was disconnected from USB
  • The device was connected to wifi, and kept the app foregrounded the entire time with medium brightness and Bluetooth enabled

According to Apple's documentation, battery consumption is scored at a level from 0 to 20:

The Energy Usage instrument indicates a level from 0 to 20, indicating how much energy your app is using at any given time. These numbers are subjective. If your app’s energy usage level is occasionally high, it doesn’t necessarily mean that your app has a problem. Your app may simply require more energy for some of the tasks it performs. For example, it may use the GPS while performing complex network operations. This is valid energy use. What you should look for are spikes or areas of high energy use that are unexpected or that could be performed at more optimal times.

The logs from Envoy showed a score of 5/20:

EnvoyObjc_Battery

This is compared with control, which had a score of 4/20:

Control_Battery

Overall, the device battery dropped by 5% over the hour that Envoy was running on the device. The app's battery usage was low without Envoy, and stayed low even with Envoy proxying requests, changing from 4/20 to 5/20.

@Reflejo
Copy link
Contributor

Reflejo commented Jun 23, 2019

@rebello95 The shape of that test result looks weird; your control has way less cellular transmit power draw and envoy has way more foreground activity that doesn't really correlate to network.

This might actually be an indication that something strange is happening or just the fact this is an unscientific test.

  • I'd try it a few times
  • Can you tap the network to see if traffic is the same on one and the other?
  • Overlapping a CPU analysis could be helpful to understand what's drawing power

@rebello95
Copy link
Contributor Author

rebello95 commented Jun 24, 2019

@Reflejo thanks for taking a look. Yep, I'm planning on running it again, and will do a CPU analysis in parallel. I may also run this with the Lyft passenger app as a more real-world environment.

@rebello95
Copy link
Contributor Author

rebello95 commented Jun 26, 2019

Did quite a bit more testing, and have some interesting findings.

Caching

The first pass above was using URLSession for performing requests, but had a cache policy set up by default. This explains what @Reflejo pointed out above regarding sporadic network traffic - we were mostly getting cache hits after the initial set of requests. Thus, the above data is invalid because outbound network requests weren't actually being performed throughout the whole test.

New approach

Instead of running a test with Envoy actually within the Lyft passenger app, I examined how many requests are executed on the "request ride" screen of the app (~1.5req/sec) and decided to simulate something a little more intense than that.

In this test, I continued sending a request every 0.2s, and sent 1,000 requests with the URLSession cache disabled (by setting it to nil). As shown in the screenshots below, network traffic is now much more consistent.

Inconsistent Envoy I/O

Once the above caching issue was identified with control, I did some validation against the build with Envoy installed, and found that the network I/O data still appeared sporadic in Xcode Instruments without the caching - even though the responses were coming back successfully consistently. As a quick test, I randomized URL request parameters, headers, and body data in the outbound requests, but still saw the same sporadic results.

After speaking with @goaway and @junr03, we decided it was possible that the data going in and out of the app simply wasn't being caught by Instruments since Envoy uses pure sockets to send it. To validate this theory, we tried installing Charles Proxy on the test device to see if traffic appeared there. For the same reasons, it did not.

To try something lower level, we took the following approach:

  1. Connected a laptop to ethernet, then opened a shared wifi network from the laptop
  2. Connected the test device to the laptop's wifi network
  3. Ran Wireshark on the laptop to sniff traffic - including traffic coming from the phone
  4. Profiled the app with Envoy installed on the test device, and checked to see if the requests we expect were actually being sent

Sure enough, traffic appeared consistently (every ~0.2s) within Wireshark. (Note: We had to disable TLS in the Envoy config file in order to make our lives easier identifying which traffic was coming from the app). Thus, we can ignore the seemingly sporadic network I/O shown in Instruments for the Envoy tests:

Wireshark

Battery

Battery for control averaged a score of 1/20:

Control-Battery

Comparatively, the app with Envoy compiled/running showed an average of 12/20, a massive increase from control:

Envoy-Battery

CPU

CPU usage in the control app didn't go much past 12%:

Control-CPU

Envoy, however, had magnitudes more CPU usage, around 185%:

Envoy-CPU

Memory

Memory usage in both apps was relatively reasonable, with control using ~6MB of persistent memory:

Control-Memory

And Envoy using ~12MB (a high relative increase, but still not too bad):

Envoy-Memory

Summary

  • Battery usage appears to have increased quite a bit
  • CPU usage also appears to be very high
  • Memory usage is likely at an acceptable level

I wanted to post these results to get feedback before continuing. Potential steps forward:

  • Run the tests several more times
  • Run the tests inside of another app like Lyft to see if we observe the same trends
  • Investigate why we see these increases

@mattklein123
Copy link
Member

Nice, awesome work getting this setup.

  1. Are you doing the profiling on an optimized build?
  2. Can you get a CPU trace/flamegraph of the CPU usage in both control and Envoy? There are some default things that Envoy does like flushing stats every 5s, resolving DNS every 5s, etc. that need to get altered for mobile, but it would be easier if we can see a profile.

@rebello95
Copy link
Contributor Author

  1. I'm using release configurations within Xcode for both, but am not specifying any of the flags being considered for the release configuration of Envoy.framework in bazel: add release configuration #155

  2. Sure, here's where I'm seeing the majority of CPU usage:

Envoy-CPU-Drilldown

@rebello95
Copy link
Contributor Author

And here's the second largest contributor:

Envoy-CPU-Drilldown-2

@mattklein123
Copy link
Member

Let's sync up either tomorrow afternoon or Monday (I'm out Thu/Fri), but it would be useful if there is a way I can inspect the profile myself vs. screenshots. At a high level the event loop is waking up too often, and this is probably a result of the watchdog timers being too aggressive, and probably other default things that need to be tuned (I'm pretty sure everything should be configurable).

Note also that once we move over to the native socket, iOS will handle all of the eventing for us which is likely to also be more efficient.

@mattklein123
Copy link
Member

One of them that I know needs to be configured is the miss timers: https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/bootstrap/v2/bootstrap.proto#config-bootstrap-v2-watchdog. We need to completely remove that for Mobile (let's file an issue) but for now I would configure that to a very high value for the miss timers so they never wake up.

@rebello95
Copy link
Contributor Author

Sounds good, we have some time scheduled on Monday to sync up on this. In the meantime, if you'd like to load up the traces in Xcode Instruments, here are the files you can import from the runs above:

Control: Control.trace.zip
Envoy/treatment: Envoy.trace.zip

Let's chat regarding increasing the miss timers / disabling them as well.

@rebello95
Copy link
Contributor Author

Also, I ran the test again with the demo apps, and am indeed seeing the same results as before.

@mattklein123
Copy link
Member

Per offline convo, here are the things I would play around with next to get more info:

  1. Adjust miss/mega miss timeout to 1 minute each to avoid them waking up the event loop: https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/bootstrap/v2/bootstrap.proto#config-bootstrap-v2-watchdog
  2. Adjust the DNS re-resolution timer to 1 minute on the cluster: https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/cds.proto (dns_refresh_rate)
  3. Adjust stat flush interval to 1 minute: https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/bootstrap/v2/bootstrap.proto#config-bootstrap-v2-bootstrap (stats_flush_interval)
  4. Let's run at trace logging to see if we see any output from Envoy other than when we are making requests and/or resolving DNS.
  5. It is conceivable that the iOS kqueue implementation is somehow waking up more often than the Android epoll implementation. We can take a look at the libevent code or more profile data to try to uncover if this is the case, however given the trace screenshots above it seems like we are calling into Envoy code in each case so I doubt that is the case.

If we want to actually log every time we wake up the Envoy dispatcher, I realize that there isn't going to be a trivial way to do that in one place. I can provide more info as a next step if we want to take a look at that, so I would recommend that we start with ^ and see what we see.

@mattklein123
Copy link
Member

Hmm, @rebello95 sorry I just looked at this again and I think I know a major problem. libevent appears to be using the poll dispatcher vs. the kqueue dispatcher (the trace shows poll_dispatch). This is what we need to look into. iOS should definitely support kqueue, but it's possible we need to compile libevent with some extra flag to use it. cc @goaway @keith.

@rebello95
Copy link
Contributor Author

Thanks for posting those details @mattklein123.

Per those suggestions and our offline conversation, I ran the test using the configurations outlined in items 1-3 above (as seen in this commit), and am seeing basically no change in the CPU performance graphs.

For suggestion 4, I set Envoy's log level to trace and re-ran the test (including the above configs). The only logs being emitted are those detailing the requests I'm sending.

These findings coincide with your last suggestion that poll_dispatch / lack of using kqueue is very likely the culprit of the high resource usage.

After talking with @mattklein123 offline, I think the best next steps are to investigate how to use kqueue instead of polling. @goaway @keith would one/both of you be interested in helping with this?

@rebello95
Copy link
Contributor Author

Was able to verify that polling vs kqueue was the problem! When forcing kqueue usage, CPU dropped to just a few percent. Tracking in this issue so that we can fix the problem: #215

Screen Shot 2019-07-02 at 10 37 09 AM

rebello95 added a commit that referenced this issue Jul 2, 2019
- Adds iOS documentation on the CPU/battery usage based on the investigation done in #113
- Combines the existing documentation for this into a single file for CPU/battery

Resolves #113.
rebello95 added a commit that referenced this issue Jul 2, 2019
- Adds iOS documentation on the CPU/battery usage based on the investigation done in #113
- Combines the existing documentation for this into a single file for CPU/battery

Resolves #113.

Signed-off-by: Michael Rebello <mrebello@lyft.com>
rebello95 added a commit that referenced this issue Jul 2, 2019
There is a set of configurations that we can slow down on mobile from their defaults upstream because they aren't as relevant to mobile clients.

This change updates our example configurations to use new values based on [this discussion](#113 (comment)).

When we switch to typed configurations, these should also be set automatically for production clients: #169

Signed-off-by: Michael Rebello <mrebello@lyft.com>
rebello95 added a commit that referenced this issue Jul 2, 2019
There is a set of configurations that we can slow down on mobile from their defaults upstream because they aren't as relevant to mobile clients.

This change updates our example configurations to use new values based on [this discussion](#113 (comment)).

When we switch to typed configurations, these should also be set automatically for production clients: #169

Signed-off-by: Michael Rebello <mrebello@lyft.com>
rebello95 added a commit that referenced this issue Jul 2, 2019
- Adds iOS documentation on the CPU/battery usage based on the investigation done in #113
- Combines the existing documentation for this into a single file for CPU/battery

Resolves #113.

Signed-off-by: Michael Rebello <mrebello@lyft.com>
@rebello95
Copy link
Contributor Author

Re-ran against master right now and the issue is resolved: ~4% CPU usage and basically no change to battery consumption. Will update docs accordingly!

Screen Shot 2019-07-03 at 3 46 11 PM

Screen Shot 2019-07-03 at 3 46 30 PM

@mattklein123
Copy link
Member

Yay! Awesome!

@rebello95
Copy link
Contributor Author

#227

jpsim pushed a commit to envoyproxy/envoy that referenced this issue Nov 28, 2022
There is a set of configurations that we can slow down on mobile from their defaults upstream because they aren't as relevant to mobile clients.

This change updates our example configurations to use new values based on [this discussion](envoyproxy/envoy-mobile#113 (comment)).

When we switch to typed configurations, these should also be set automatically for production clients: envoyproxy/envoy-mobile#169

Signed-off-by: Michael Rebello <mrebello@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit to envoyproxy/envoy that referenced this issue Nov 28, 2022
- Adds iOS documentation on the CPU/battery usage based on the investigation done in envoyproxy/envoy-mobile#113
- Combines the existing documentation for this into a single file for CPU/battery

Resolves envoyproxy/envoy-mobile#113.

Signed-off-by: Michael Rebello <mrebello@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit to envoyproxy/envoy that referenced this issue Nov 29, 2022
There is a set of configurations that we can slow down on mobile from their defaults upstream because they aren't as relevant to mobile clients.

This change updates our example configurations to use new values based on [this discussion](envoyproxy/envoy-mobile#113 (comment)).

When we switch to typed configurations, these should also be set automatically for production clients: envoyproxy/envoy-mobile#169

Signed-off-by: Michael Rebello <mrebello@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit to envoyproxy/envoy that referenced this issue Nov 29, 2022
- Adds iOS documentation on the CPU/battery usage based on the investigation done in envoyproxy/envoy-mobile#113
- Combines the existing documentation for this into a single file for CPU/battery

Resolves envoyproxy/envoy-mobile#113.

Signed-off-by: Michael Rebello <mrebello@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
perf/cpu perf platform/ios Issues related to iOS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants