Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak - 700mb RAM #394

Closed
x0a opened this issue Dec 17, 2018 · 54 comments
Closed

Memory leak - 700mb RAM #394

x0a opened this issue Dec 17, 2018 · 54 comments
Labels
bug An issue that is confirmed as a bug help wanted An issue that needs contributors

Comments

@x0a
Copy link

x0a commented Dec 17, 2018

Describe the bug
Accumulates hundreds of MB of ram over the course of several hours.

To Reproduce
Not entirely sure how to reproduce. But it happens pretty consistently. I've not changed any of the defaults, I've only connected one device, my tablet.

Screenshots
screenshot from 2018-12-16 18-52-29
screenshot from 2018-12-16 12-34-18

Debug output
Heap attached. Doesn't seem to match what the terminal shows.

g_object_unref: assertion 'G_IS_OBJECT (object)' failed
g_object_unref: assertion 'G_IS_OBJECT (object)' failed
g_object_unref: assertion 'G_IS_OBJECT (object)' failed
g_object_unref: assertion 'G_IS_OBJECT (object)' failed
g_object_unref: assertion 'G_IS_OBJECT (object)' failed
g_object_unref: assertion 'G_IS_OBJECT (object)' failed
g_object_unref: assertion 'G_IS_OBJECT (object)' failed
g_object_unref: assertion 'G_IS_OBJECT (object)' failed
g_object_unref: assertion 'G_IS_OBJECT (object)' failed
g_object_unref: assertion 'G_IS_OBJECT (object)' failed

gsconnect.heap.zip

System Details (please complete the following information):

  • GSConnect version: 17
    • Installed from: extensions.gnome.org
  • GNOME/Shell version: 3.30.1
  • Distro/Release: Ubuntu 18.10

GSConnect environment (if applicable):

  • Paired Device(s): Samsung T-230NU Tablet
  • KDE Connect app version: 1.10.1
@andyholmes andyholmes added the bug An issue that is confirmed as a bug label Dec 17, 2018
@andyholmes
Copy link
Collaborator

Could you describe your general usage; SMS, sshfs, etc?

Heap attached. Doesn't seem to match what the terminal shows.

Yes, these are not useful for this type of debugging. This will be removed from user view for the next release. You'll have to enable debugging as described here: Debugging.

@x0a
Copy link
Author

x0a commented Dec 17, 2018

That's the thing, on days where I found high RAM usage, I hadn't used it at all. Maybe once or twice I opened the collapsible gnome settings panel to see what my tablet's battery level was. I haven't used it to send files, texts, or anything other than occasionally pausing a movie from bed from multimedia controls or remote input.

I've restarted the process from the debugging window and am now at a healthy 26mb. If/when it rises again, would a gbd memory dump of the process be more useful? In the meantime I've enabled "Debug logging" and am keeping an eye on the terminal and process.

@andyholmes
Copy link
Collaborator

andyholmes commented Dec 17, 2018

A gdb dump probably wouldn't be much help without a lot of debug symbols, and even then I'd really hope it was my mistake and not GJS's since that would be a lot easier to fix 😝

Fortunately restarting the service should be relatively easy, if you can't find any obvious debug issues you can restart the service with gapplication action org.gnome.Shell.Extensions.GSConnect quit. That should at least avoid big problems, and the service reconnects quite easily.

There weren't any reports about this in v16, so hopefully I can track down what might have caused this.

EDIT one thing that might help; does this only happen when the screen is off? I know there are some extension problems in the default Ubuntu session, maybe they are also affecting GSConnect.

@x0a
Copy link
Author

x0a commented Dec 17, 2018

Actually, I do shut my monitor off and leave my desktop logged in throughout the night. And with the (minor) testing I could do in 10 minutes, messing around with remote input while the screen is off did make memory go up by about 15 MB. Also, opening the "Send files" dialog multiple times causes RAM to increase by 1-2MB and it never goes back down, even after triggering garbage collection. So it could definitely be a gjs issue.

For now I'll refrain from restarting the service and let it run throughout the night, report back if I can confirm that the screen being off is the cause of the memory leak. Also, thanks for the great work on this wonderful extension!

@andyholmes
Copy link
Collaborator

Hmm, thanks those are some good places for me to start looking.

It is strange that your help dump is so small, even though the memory usage is so high, not that there is explicitly useful information inside it.

@andyholmes
Copy link
Collaborator

andyholmes commented Dec 18, 2018

I believe there is a particular instance that is causing this. It seems like there is a leak that occurs only when destroying all rows in a GtkListBox, which is something that can happen pretty often. This doesn't seem to be a collection issue, because I've reproduced it in the absence of toggle refs, so the temporary solution I've used is pretty aggressive (7d416d7).

I'm waiting for some advice from the GJS maintainer, but in the mean time if you'd like to give this Zip a try, I'd be interested to hear if it fixes or at least keeps the problem under control:

gsconnect@andyholmes.github.io.zip

You can install it using the instructions in the Wiki, or alternatively just build straight from git.

EDIT nevermind, I think this is just a second bad leak I found 😞. I just had this memory leak appear after only a few minutes of use. Seems like a bad one :(

@x0a
Copy link
Author

x0a commented Dec 18, 2018

It's funny, I left it all night with the monitor off, like I said. In the morning it was up to 68mb which wasn't too bad. I almost forgot about it today. Haven't used KDE Connect at all. Then all of the sudden I see it had jumped up to 488mb sometime in the span of an hour? I've got the GJS log which might indicate something (very likely truncated because of the limited terminal scrollback)
gjs.log

One thing I did notice is the icon kept disappearing from gnome tray throughout the day. My android probably kept killing KDEConnect because of memory, and I also jump between different hotspots throughout the day so perhaps constant disconnections/reconnections could occasionally cause a memory leak somehow? Pure speculation. As I was typing this, the icon appeared and disappeared about 5 times.. sadly it didn't correspond to a memory increase. I'll try your build for a day and see how that affects things.

@andyholmes
Copy link
Collaborator

...I also jump between different hotspots throughout the day so perhaps constant disconnections/reconnections could occasionally cause a memory leak somehow? Pure speculation.

Ah ha, good speculation! I think that causes it, since I've duplicated it a few times now by walking my phone to where another router picks up the signal.

I'm not sure how it exactly it happens or how to fix it, but this is a very good start to bad bug (for me it's almost always ~800mb 😨)

@x0a
Copy link
Author

x0a commented Dec 18, 2018

Great to hear you can reliably duplicate it! The build I tried yesterday didn't do much difference as I'm now back at 453MB ram. If you can track it down, let me know if you have any other builds you would like me to try.

@andyholmes
Copy link
Collaborator

Great to hear you can reliably duplicate it!

Well, not reliably :) It is a weird situation and I'm not entirely sure why it happens (this kind of race condition should already be addressed), or where the giant memory leak comes from. I noticed output like this:

[/service/lan.js:_onIncomingChannel:88]: "remote host 192.168.1.68"
[/service/lan.js:_onIncomingChannel:88]: "remote host 192.168.1.68"
[/service/lan.js:_onIncomingChannel:88]: "remote host 192.168.1.68"
[/service/lan.js:_onIncomingChannel:88]: "remote host 192.168.1.68"
[/service/device.js:_setConnected:309]: "Connected to Nexus 4 (14a7714d768b2003)"

So I guess there are a bunch if incoming connections and the wrong one resolves first, resulting in a dead connection. ~800MB is probably more than a couple months worth of packet data, so I guess maybe it's a dead socket reading tons of null bytes into nowhere or something?

In any case, here's an attempt at a fix, so if you could test this and see if it reliably fixes the problem I'll try and do the same. The approach is just to keep track of connections coming from a single host (eg. 192.168.1.68) and always favour the newest incoming connection, closing any current one. This might not work if your IP is changing between networks/hotspots, but it's unclear whether the connection lasts long enough to determine what device it's coming from or if there's only a chance to get the IP.

gsconnect@andyholmes.github.io.zip

@x0a
Copy link
Author

x0a commented Dec 19, 2018

Very interesting, and it definitely looks like you might be correct as this build has been stable after about 10 hours! I'd give it an extra 24 before I call it and see about artificially inducing these disconnects/reconnects to really test this build, but it's looking pretty good so far.
screenshot from 2018-12-19 17-27-28

@x0a
Copy link
Author

x0a commented Dec 20, 2018

Sadly, I may have spoken too soon as I'm now back at 800mb, completely on its own before I had the chance to stress test it. This one might take a while 😣
I've set up a script to detect and display a notification when gjs hits 100mb, to see if I can maybe pinpoint it to specific events or at least figure out how fast this memory leak is happening.

@ferdnyc
Copy link
Member

ferdnyc commented Dec 20, 2018

You're both on Ubuntu, right?

I wonder if that common thread is somehow meaningful? Because, I've never seen anything even remotely like this, running v17 release. My Fedora 29 machine's two gjs processes, one started December 14 and the other started December 16, are sitting at 49MB and 10.6MB in resident memory, respectively. And I've never seen them waver from those svelte numbers.

Granted, my PC:phone connection might be a bit more stable than average, but there are times when I'm at home with the phone in my pocket (not charging), in which case the WiFi would disconnect every time I lock it and reconnect every time it wakes up again, and still no hints of this.

(I was also wondering if Android version might be involved, but @x0a 's tablet is likely running an older release than my Android N phone, and both are Samsung so that's probably not significant.)

@x0a
Copy link
Author

x0a commented Dec 20, 2018

@ferdnyc I tried something that I hadn't tried before, restarted my tablet. I hadn't figured it necessary since the service/app had been killed by Android dozens of times throughout the day. But since I've restarted last night I haven't seen memory rise above 35mb after 6 hours.
screenshot from 2018-12-20 17-01-50.

I am running Android 4.4.2. Could be an obscure issue with the kdeconnect client. But I think it also lends credence to the idea that constant reconnections are responsible for the memory leak as restarting Android also kills any misbehaving services that could have been responsible for Android constantly having to kill the kdeconnect service.

@andyholmes
Copy link
Collaborator

I think multiple connections is a good bet, but really GSConnect should be able to handle any of these events as well as KDE Connect. The real question is, what is this leaked memory made of and where is it being leaked to? ~800MB is a massive amount of memory, especially in this application, but on top of that why isn't being freed later anyways?

It's a head scratcher, for sure. I might have to try and figure out a way to manufacture this situation so I can really track down what's happening :/

@ferdnyc
Copy link
Member

ferdnyc commented Dec 21, 2018

I wonder if running gjs in profiling mode would help? I mean, sysprof isn't valgrind by any means, but burning up that much memory seems like it might show up as unexpected activity down some code path that shouldn't be getting worked that hard.

I ran gjs --profile=/tmp/gsc-daemon.syscap $GSC_EXT_DIR/daemon.js in a terminal — so it took over for replaced the daemon process I'd previously crashed while experimenting with other ways to launch it under profiling 👼 — and the capture data started updating pretty much immediately, and on fairly short intervals. After a minute or two I loaded it into sysprof:

image

..About as uninformative as you'd expect, under normal operation. Still, *shrug*

@x0a
Copy link
Author

x0a commented Dec 23, 2018

@ferdnyc I've been eager to try this and have been waiting for the bug to manifest itself again, but I haven't been able to reproduce the memory leak since I restarted my tablet (nearly 36 hours ago). Great news for me obviously since I don't have to keep killing GJS anymore, but bad news for debugging. I suppose if it doesn't show up again after a month we can chalk it up to a bad instance of KDEConnect and call it done?

@andyholmes
Copy link
Collaborator

This might also be related to a bug fixed in kdeconnect-android (d4d4849) where sockets were not being properly closed. GSConnect on the other hand, goes way over board when closing a channel, but that might still mean there's an obscure bug lurking somewhere.

I've haven't hit this bug in awhile myself, but I've also been running nightly builds on and off so that may have something to do with it. In any case, I'll probably release v18 soon regardless, since it's been some time since v17 and there are a lot of bug fixes. I'll probably just commit the "fix" I added anyways, since it doesn't seem to harm anything.

@andyholmes andyholmes added the help wanted An issue that needs contributors label Dec 24, 2018
@andyholmes
Copy link
Collaborator

How has this one been going? Personally this hasn't reoccurred for a very long time, safe to close it now?

@x0a
Copy link
Author

x0a commented Jan 16, 2019

Yeah I think we can call this one. Zero re-occurrences Thanks for the hard work!

@x0a x0a closed this as completed Jan 16, 2019
@ivandotv
Copy link

ivandotv commented Jan 16, 2019

This is literally happening to me right now
GNOME Shell 3.30.2
gjs points to daemon.js file from the extension.
image

@andyholmes
Copy link
Collaborator

Hi, could you describe your usage when the leak happened or how to reporoduce?

@andyholmes andyholmes reopened this Jan 17, 2019
@ivandotv
Copy link

I don't really know why it happens, maybe it has something to do with restarting the phone... I'm not really sure. I can't figure out if there is some kind of pattern to it.

@driscollw
Copy link

driscollw commented Jan 23, 2019

This has happened to me as well a few times. Each time is was right after I closed my laptop lid with power attached. Disconnect power and take my device to another room and open the lid. Consistantly I hear my CPU fan ramp up. 700MB+ used. Haven't touched my phone in hours....but I do have two ubuntu systems, with gsconnect enabled, connected to my phone.

@ferdnyc
Copy link
Member

ferdnyc commented Jan 23, 2019

I have actually, now, seen this exactly once — my occurrence looked exactly the same as @ivandotv (also on Fedora 29) — gjs was around 450MB in size.

I have no idea what triggered it, and was unable to find any causes. (Interestingly, though, I'm on a desktop computer with no lid, no suspend beyond display sleep, etc.) But it's only ever been that one time.

I've since restarted, and the gjs process running on my machine for the past 5 days is still < 80MB in size.

@ivandotv
Copy link

I'm using desktop computer

@driscollw
Copy link

@driscollw

Just out of suspend it's 700MB+ with 25% CPU usage.

The gjs process, specifically, was using 25% of the CPU? (I assume that's on a 4-core processor, so one core pegged?) See, that's something else I've never seen — not from gjs, anyway. (I can get Firefox to do it by just looking at it funny.) Do you know how long the CPU spike lasted?

Even when the memory-consumption bug manifested for me, as in @ivandotv 's screenshot the CPU usage was still minimal. Heck, that 5-day-old gjs on my current machine only has a total of 5m14s of CPU time racked up over its entire lifetime... I don't recall it being significantly more even that one time it ballooned to 450MB. (Though, if the CPU spike only lasted briefly, it might not have really registered as a significant increase in accumulated time.)

Correct on CPU pegged just 1 core. The spike never ended, I just killed it. I can tell when it's doing by my CPU temp and my fan starts ramping up. It was the gjs process.

@driscollw
Copy link

driscollw commented Jan 24, 2019

OK so today my VM Ubuntu 18.04 is doing it. See Photo
Memory went up a little. Anything I can do/grab to help with this issue? Photo
Restarted my phone and did not change anything.

Sorry for all the edits. One thing I am noticing is that there are two old notifications in my notifications panel from 2 days ago. Unknown Contact (Ongoing Call) and Smarthings battery low. These notifications were cleared off my phone yesterday and the VM has been up for 2 days. I also got the Ubuntu update prompt this morning, which I did with no reboot.

2019-01-24_15-01-15
2019-01-24_14-53-09

I also noticed that my waiting channel is blank vs the screenshot @x0a put up. May not be anything just sharing what I have found.

2019-01-24_15-09-33

Ping from phone to desktop work. The battery power is incorrect on my workstation. It says 97% but it's actually 51%. 97% would have been when I charged it last night around 8PM EST or so. What is really strange about that is I remember unplugging my phone from the charger at exactly 97%. When I ring my workstation it does ring, the gsconnect app comes up on my panel, but the gui never opens. I had to do a quit from the menu and gjs went from 40% cpu back to the 25%

No matter what I do I cannot get the GSConnect window to open. It comes up on my side bar, but no window.

I am wondering if this because I use Gnome Tweaks and hide all my desktop icons....I see that GSConnect mounts the phone network share.

@ferdnyc
Copy link
Member

ferdnyc commented Jan 24, 2019

@driscollw 's report inspired me to look at my own gjs process, and I was surprised to discover that it was up to 470 MB1. This is still the same (now) 6-day-old process that was only 80MB last night, so my best guess is that the size must've increased some time in the past 12-18 hours.

That's a helpful narrowing of the window, since in my case the size increase still isn't associated with any sort of increased CPU usage... total CPU time for that gjs process is now just 7m24s, not significantly up from 5m14s over 24 hours ago.

Anyway, I first launched the GUI and ran "Generate Support Log". This is the entirety of what that produced:

GSConnect Version: 19
GSConnect Install: user
GJS: 15403
XDG_SESSION_TYPE: x11
GDMSESSION: gnome-xorg
--------------------------------------------------------------------------------
-- Logs begin at Wed 2018-10-17 21:23:43 EDT, end at Thu 2019-01-24 16:13:39 EST. --
Jan 24 16:13:39 gjs[2834]: g_menu_item_set_attribute_value: assertion 'valid_attribute_name (attribute)' failed

Um. Seems a little... spare. So I ran journalctl --user -b _PID=2834 > /tmp/gjs-journal.txt, and the full output is attached: gjs-journal.txt (I've checked it for personally identifying data, nothing jumped out at me and I'm not particularly concerned.)

The lines that look interesting to me, because (a) I've never seen them before, and (b) they fall nicely within the time period in question, having been logged around 8 hours ago, are these. (Line-wrapped so you don't have to scroll horizontally.)

Jan 24 07:28:04 teevey gjs[2834]: Attempting to run a JS callback during garbage collection.
This is most likely caused by destroying a Clutter actor or GTK widget with ::destroy signal
connected, or using the destroy(), dispose(), or remove() vfuncs. Because it would crash the
application, it has been blocked.
Jan 24 07:28:04 teevey gjs[2834]: The offending callback was AsyncReadyCallback().
Jan 24 07:28:04 teevey gjs[2834]: Attempting to run a JS callback during garbage collection.
This is most likely caused by destroying a Clutter actor or GTK widget with ::destroy signal
connected, or using the destroy(), dispose(), or remove() vfuncs. Because it would crash the
application, it has been blocked.
Jan 24 07:28:04 teevey gjs[2834]: The offending callback was SourceFunc().

Other than that, everything that's visible in the log seems to be business as usual, mostly just hundreds and hundreds of lines of "g_object_unref: assertion 'G_IS_OBJECT (object)' failed".

Notes

  1. (Actually, it's grown a bit more in just the past 5 minutes, now at 487 MB. It doesn't seem to be growing regularly, watching it in top most refreshes show no size change. In fact, the growth from 470 MB to 487 MB could be normal/meaningless, as it might've just been a consequence of me using "Generate Support Log".)

@driscollw
Copy link

On my T420 as SOON as I opened my lid the CPU went strait to 25% 220MB to 525MB in a matter of 10 seconds. It was fine for about 10secs as I still had system monitor up from when I suspended it last night. GSConnect menu showing incorrect battery level.

gjs was running fine last night when I suspended it.

@andyholmes
Copy link
Collaborator

andyholmes commented Jan 24, 2019

Um. Seems a little... spare.

Yeah, after the third person posted a 10MB+ heap dump and several complaints about GSConnect being insecure because debug logging was left on by accident, it became obvious the debug tool was confusing users. The developer tool is also hidden behind a GAction now. So the usage is:

  1. Open Generate Support Log dialog (debug logging is enabled)
  2. Reproduce bug or action that doesn't work
  3. Close Generate Support Log dialog (debug logging is disabled, log opens in text editor)

You can still turn on debug logging, but it must be done in dconf and whenever the Generate Support Log dialog closes, it disables it.

JS ERROR: TypeError: packet.body.incomingCapabilities is undefined
                     _handleIdentity@/home/ferd/.local/share/gnome-shell/extensions/gsconnect@andyholmes.github.io/service/device.js:270:1
                     attach@/home/ferd/.local/share/gnome-shell/extensions/gsconnect@andyholmes.github.io/service/core.js:334:9
                     _onIncomingIdentity@/home/ferd/.local/share/gnome-shell/extensions/gsconnect@andyholmes.github.io/service/lan.js:244:13
                     Async*@/home/ferd/.local/share/gnome-shell/extensions/gsconnect@andyholmes.github.io/service/daemon.js:778:2
JS WARNING: [/home/ferd/.local/share/gnome-shell/extensions/gsconnect@andyholmes.github.io/service/lan.js 249]: reference to undefined property "code"

These lines here are probably the most helpful yet. It indicates that the Android app is sending a broken identity packet, which is only failing once Channel.attach() is called.

Attempting to run a JS callback during garbage collection. This is most likely caused by destroying a Clutter actor or GTK widget with ::destroy signal connected, or using the destroy(), dispose(), or remove() vfuncs. Because it would crash the application, it has been blocked.
The offending callback was AsyncReadyCallback().
Attempting to run a JS callback during garbage collection. This is most likely caused by destroying a Clutter actor or GTK widget with ::destroy signal connected, or using the destroy(), dispose(), or remove() vfuncs. Because it would crash the application, it has been blocked.
The offending callback was SourceFunc().

Probably this is an unresolved async function/Promise that eventually gets culled by GJS. Promises are queued in GJS in the GMainLoop with a custom GSource, but there's not much information here about which it is. Most likely it's some async function in Lan.Channel that never resolves or rejects because of the uncaught error in Lan.Channel.attach() and when it's eventually removed GSConnect stops leaking memory.

I just pushed 71337e9 which catches errors in attach() and closes the channel, but I'm not sure if that will fix the memory leak since I still can't reliably reproduce this problem (or at all anymore actually).

Updated ZIP

gsconnect@andyholmes.github.io.zip

@driscollw
Copy link

driscollw commented Jan 24, 2019

It's happened twice in the past hour for me. One thing I will add is that I am on a 5GHz wifi channel with an 6MB/s connection speed. Thinking more about this I often see this the most when my signal is poor. Out of suspend I sometimes have a ? where my wifi signal should be. Also my phone is on 5Ghz and has a poor signal in my Den. This morning when I stated that the power level was incorrect on my Ubuntu VM (Excellent Signal, right at AP) at 97% I was charging my phone in the Den (The night before).

@ferdnyc
Copy link
Member

ferdnyc commented Jan 25, 2019

Yeah, after the third person posted a 10MB+ heap dump and several complaints about GSConnect being insecure because debug logging was left on by accident, it became obvious the debug tool was confusing users.
Oh, yeah, getting rid of that switch makes total sense to me.

...So, here's another interesting wrinkle: That very same gjs process on my machine — the one that's been running for > 7 days and as of now has racked up a grand total of 8m45s CPU time — is back down to only 90MB resident. Without restarting it, without any new interesting log messages being output, without any conscious action of my own that could account for the reduction in memory footprint.

So now I'm wondering if those Attempting to run a JS callback during garbage collection messages could be significant after all, or at least this could perhaps have something to do with the garbage collector. Perhaps they're signs of a GC run being interrupted/deferred, so that the memory in question does eventually get released in a subsequent sweep. (Weird still that it's so much memory, but...)

@driscollw
Copy link

Last night I sat my phone right next to my T420 for about 2 hours. I watched the memory usage go from about 22MB to 225MB in the course of two hours. I did not use either device. I just moved my mouse once in a while to see what is was currently using.

@fabiscafe
Copy link

It's the same for me on Arch Linux. The only thing I do with GSConnect is to sync notification from my phone to my desktop and yet I end at 400-800MiB at the end of the day


GSConnect Version: 19
GSConnect Install: user
GJS: 15403
XDG_SESSION_TYPE: wayland
GDMSESSION: gnome
--------------------------------------------------------------------------------
-- Logs begin at Thu 2019-02-07 00:17:10 CET, end at Tue 2019-02-12 23:10:01 CET. --
Feb 12 23:09:18 org.gnome.Shell.desktop[10758]: Memory pressure relief: Total: res = 251236352/250851328/-385024, res+swap = 246235136/246235136/0
Feb 12 23:09:19 org.gnome.Shell.desktop[10758]: Memory pressure relief: Total: res = 122286080/122286080/0, res+swap = 122343424/122343424/0
Feb 12 23:09:19 org.gnome.Shell.desktop[10758]: Memory pressure relief: Total: res = 449200128/449400832/200704, res+swap = 444035072/444035072/0
Feb 12 23:09:20 org.gnome.Geary.desktop[10758]: Memory pressure relief: Total: res = 18378752/18378752/0, res+swap = 13615104/13615104/0
Feb 12 23:09:29 org.gnome.Shell.desktop[10758]: Memory pressure relief: Total: res = 291098624/291098624/0, res+swap = 286150656/286150656/0
Feb 12 23:09:44 org.gnome.Shell.desktop[10758]: Window manager warning: Buggy client sent a _NET_ACTIVE_WINDOW message with a timestamp of 0 for 0x1600034 (GSConnect)
Feb 12 23:09:48 org.gnome.Shell.desktop[10758]: Memory pressure relief: Total: res = 255680512/255410176/-270336, res+swap = 250839040/250839040/0
Feb 12 23:09:49 org.gnome.Shell.desktop[10758]: Memory pressure relief: Total: res = 122277888/122277888/0, res+swap = 122339328/122339328/0
Feb 12 23:09:50 org.gnome.Geary.desktop[10758]: Memory pressure relief: Total: res = 18362368/18362368/0, res+swap = 13602816/13602816/0
Feb 12 23:09:51 org.gnome.Shell.desktop[10758]: Memory pressure relief: Total: res = 449937408/449998848/61440, res+swap = 444633088/444633088/0
Feb 12 23:10:00 org.gnome.Shell.desktop[10758]: Memory pressure relief: Total: res = 290807808/291045376/237568, res+swap = 286105600/286105600/0
Feb 12 23:10:01 org.gnome.Shell.Extensions.GSConnect[11230]: [/service/lan.js:broadcast:273]: "Identifying to 192.168.2.101"

@andyholmes
Copy link
Collaborator

Hi, since I still can't reproduce this locally I rely on people who can to test the fixes I apply. Here is a Zip built from master which includes 71337e9 from the Zip above:

gsconnect@andyholmes.github.io.zip

See Install from Zip in the Wiki for instructions.

andyholmes added a commit that referenced this issue Feb 13, 2019
g_data_input_stream_read_line() will return NULL when there is no data
to read, considering it the "end of stream". In practice this seems to
happen occasionally with KDE Connect sockets even when the connection is
not closed, however ignoring the NULL byte can result in the receive
loop spinning out.

We now close the connection when we read a NULL byte, which *may* solve
the mysterious high CPU usage and large memory leak, but comes with the
risk that clients won't be able to connect at all depending on why this
happens for KDE Connect packet streams.

addresses #394
@andyholmes
Copy link
Collaborator

I've pushed 5401bc2 with an additional attempt to plug this leak. As detailed in the commit message, this comes with the risk that some clients may be unable to connect at all, so this won't be in a stable release unless I get lots of testing and confirmation. Here is an updated Zip:

gsconnect@andyholmes.github.io.zip

@fabiscafe
Copy link

Thanks!
Just to ask that: Whats a normal amount of ram gjs/gsconnect should use?

@andyholmes
Copy link
Collaborator

Normal memory usage is about 30-40MB (resident - shared).

@fabiscafe
Copy link

Looks good. yesterdays normal usage at work and at home. gjs still chills at 24-36MiB. Usually I hit the bug every day once or twice at some point, so I'm really happy with what I see now. I'll go on with the testing for some longer time and report back

@marosoft
Copy link

@andyholmes I found something interesting as well. However I did not test it yet as I would need to clone and build the plugin.

There is a missing 'await' in front of 'this.__cache_write()'
as you can see here: https://github.com/andyholmes/gnome-shell-extension-gsconnect/blob/master/src/service/plugins/base.js#L196

__cache_write method returns a promise and because the code does not wait for it to be resolved it creates a loop. The data to be cached is not being released from the memory:
https://github.com/andyholmes/gnome-shell-extension-gsconnect/blob/master/src/service/__init__.js#L184

This is just a hypothetical scenario as I could not test it yet.

@andyholmes
Copy link
Collaborator

I may be misunderstanding you, because that code is a bit convoluted (mixing task threads in a single-threaded engine, while maintaining file IO thread safety is a bit weird), but here's my take:

There is a missing 'await' in front of 'this.__cache_write()' as you can see here:

__cache_write() itself has no "resolve" value, catches it's own errors and does it's own queuing, so this is basically intentional. The idea is to make it a "fire and forget" async function, because await can only be used in an async function, so either every function in the stack must be an async function (which is effectively impossible in a GLib application) or at some point you'll have to call a Promise without awaiting it.

__cache_write method returns a promise and because the code does not wait for it to be resolved it creates a loop. The data to be cached is not being released from the memory

Hmm, I think there is a recursive loop problem here, although I'm not sure if it's the one you're thinking. I believe this conditional:

https://github.com/andyholmes/gnome-shell-extension-gsconnect/blob/4010b69df5ee3fc59ea58f0c33a85e9a6af3596c/src/service/plugins/base.js#L173-L177

should be outside the try-catch-finally since finally is always invoked even if return is called (inside a try block). Otherwise I believe the loop should only happen once.

There's two "mutex" related variables: __cache_lock to prevent later data from failing to write while the older data is being written, and __cache_queue. Whenever the first finishes, the later data will be written if __cache_queue is true, however __cache_queue is just a boolean (not a counter) and is reset before the function recurses so it should only result in a single-loop.

As an example, assuming the cache data has changed before each call to __cache_write():

// (1) First attempt is made and __cache_lock is false...
//      so __cache_lock is set to true and IO is started.
foo.__cache_write();
// (2) Second attempt is made and __cache_lock is true, so __cache_queue is set to true and
//     so __cache_queue is set to true and the function returns
foo.__cache_write();
// (3) Third attempt is made and __cache_lock is true, so __cache_queue is set to true and
//     the function returns
foo.__cache_write();
// (4) First attempt completes, __cache_lock is set to false
//     __cache_queue is true, so __cache_queue is set to false and the function recurses

Did I explain and/or understand you correctly?

andyholmes added a commit that referenced this issue Mar 12, 2019
Because finally will be invoked when returning a try block, the mutex
check in the __cache_write() should be outside the try block for the
function to recurse in proper order.

addresses #394
@edwinfoss
Copy link

I found a 100 thousand lines of this in my journalctl log after I noticed that gjs locks in one core at 100% cpu usage raising my cpu temperature:
JS ERROR: Error: 0x40000020 is not a valid value for flags GSocketMsgFlags_onIncomingIdentity@/home/user/.local/share/gnome-shell/extensions/gsconnect@andyholmes.github.io/service/lan.js:171:24

@andyholmes
Copy link
Collaborator

I found a 100 thousand lines of this in my journalctl log after I noticed that gjs locks in one core at 100% cpu usage raising my cpu temperature:

This is not related to high memory usage. Most likely you are using a distribution which has already updated to glib2 >= 2.60.0. You should update to v21-rc2 or wait until the appropriate fix is backported to glib2.

@andyholmes
Copy link
Collaborator

Since there have been no further reports of the original high-memory usage bug, I'm going to close this issue. I believe the original issue was fixed in 5401bc2. Thanks for everyone help with logs and testing on this bug!

@vberthet
Copy link

I've the same issue with GSConnect 24 on Ubuntu 18.04.

Killed process 11014 (gjs) total-vm:27882468kB, anon-rss:19246980kB, file-rss:0kB, shmem-rss:40kB

Out of memory: Kill process 11014 (gjs) score 723 or sacrifice child

I'll try to send an heap file next time, the process was killed by OOM before a can get on :(

@andyholmes
Copy link
Collaborator

@vberthet this is probably a different bug than the one fixed in this issue. Can you open a new issue with logs for this?

@Crucizer
Copy link

I have the same issue!

@pegasusearl
Copy link

Hello, I'm not sure if this one is related but I experienced giant memory usage as well.
11_10-10-03_381

In my case it's caused by a software that send massive amount of notifications. Perhaps the more notification in a session, the bigger ram usage is.

Here is the video of my reproduction:
https://www.youtube.com/watch?v=pYR2W7_bsjQ&list=PLhcbq4A0YXYCYHE-XN5-nQcBam4Vq_DKj&index=1

@ferdnyc
Copy link
Member

ferdnyc commented Mar 15, 2022

@pegasusearl

Hm, that's interesting! Some differential diagnosis:

  1. Are the notifications originating on the Linux side, or the Android side? (I'm not familiar with that application.)
  2. Whichever side they're on, do you have that device relaying notifications to peer devices?
  3. If it's on the Android side, what happens if you block relaying of that app's notifications?
  4. If it's on the Linux side and you're sharing notifications,
    1. Does the memory leak still happen if you disable notification sharing?
    2. Does it still happen if you disable the notification plugin completely?
  5. If it's on the Linux side and you're NOT sharing, does it still happen if you enable sharing? (I'm thinking maybe there's a leak where notifications that aren't being sent across the wire aren't properly disposed of.)

@pegasusearl
Copy link

pegasusearl commented Mar 15, 2022

  1. Linux side
  2. no

4~5. there is no notification shared. At the time I didn't connect to any android devices. I only use GSconnect occasionally to copy files. I never do anything else with it. I'm not sure how do I do that.
I already connected to my android device, in GSconnect I already enabled notification in Advanced Setting and make sure "Share Notification" is checked. And then in KDEconnect I made sure Notification Sync is enabled, and then made sure notification plugin does not require more permission.
But notification is not sent when I call notify-send on linux. notify-send in GSconnect setting is also checked.

About that app:
It was picosnitch's bug that sends massive amount of notification at short period. And the program that trigger said bug is a windows game, Nova Ragnarok. And it happened while there is problem with my internet.
Very rare case that no one else will ever encounter. Not even me!

This can also be reproduced by calling notify-send every 0.05 second for few seconds. But doing that might make gnome unresponsive when opening notification window.

I'm not sure if there is any case where user would receive a lot of notification. One thing that I can think of is if I enable all notification instead of just mentions for Discord and run it for long period of time without rebooting.

Doing the discord thing appears to have similar effect. Memory usage slowly increase, sometimes it went down but overall it kept increasing. In my case it's still not a big problem.

Perhaps this is different than what people discussed above.

@ferdnyc
Copy link
Member

ferdnyc commented Mar 16, 2022

@pegasusearl

Ah, sorry, I was on my phone when I sent that previous reply and couldn't look into things too deeply.

On the Android side, to receive notifications from a Linux peer you'd want the "Receive notifications" plugin enabled for that peer ­— "Notification sync" enables sharing in the other direction, for notifications that originate locally on the Android device.

On the Linux side, the Notifications plugin handles sharing in both directions, unlike on Android where it's two separate plugins. It can be enabled/disabled under "Advanced". (It's normally enabled). When enabled, the Notifications settings panel lets you enable/disable global notification sending, and then if that's enabled, also disable sharing for specific Linux application sources.

One thing I notice is that the DBus notification handler unpacks the notification contents anytime a notification comes in, because it handles managing the applications list unconditionally. Only then does it check whether notification sharing is enabled. If not, it bails, otherwise it attempts to generate a message over the wire to the connected peer.

_onNotificationAdded(listener, notification) {
try {
const notif = notification.full_unpack();
// An unconfigured application
if (notif.appName && !this._applications[notif.appName]) {
this._applications[notif.appName] = {
iconName: 'system-run-symbolic',
enabled: true,
};
// Store the themed icons for the device preferences window
if (notif.icon === undefined) {
// Keep default
} else if (typeof notif.icon === 'string') {
this._applications[notif.appName].iconName = notif.icon;
} else if (notif.icon instanceof Gio.ThemedIcon) {
const iconName = notif.icon.get_names()[0];
this._applications[notif.appName].iconName = iconName;
}
this._applicationsChangedSkip = true;
this.settings.set_string(
'applications',
JSON.stringify(this._applications)
);
this._applicationsChangedSkip = false;
}
// Sending notifications forbidden
if (!this.settings.get_boolean('send-notifications'))
return;

So, it's possible there's some sort of leak in GLib.Variant.deepUnpack() or Gio.Icon.deserialize(), which are used by notification.full_unpack() to unpack the notification payload. (That's another thing, this may only occur for notifications with a certain type of associated icon — you can simulate that by running notify-send with a -i argument and providing either a stock-icon name, or the path to an icon file. The code path for each is different.)

The other thing, though, is that AFAICT all of this plugin code runs on a per-peer basis. So if you didn't have any peers connected to the Linux machine at the time, none of this should even have been running. If that's the case, I'm really at a loss as to how the gjs process could've racked up so much memory usage. Unless perhaps we're not correctly dropping our DBus listener connections when peers disconnect, so that we'd still be bombarded with unhandled notifications.

You don't happen to see any messages like these in your user journal, referencing the GSConnect session bus connection, do you?

The maximum number of pending replies for "(some DBus address)" (connection details) has been reached

(I know what that message would look like because, in an unrelated problem, I've been getting flooded with them for goa-identity-service for some time now.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An issue that is confirmed as a bug help wanted An issue that needs contributors
Projects
None yet
Development

No branches or pull requests