Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port XQF to G_IO #4

Closed
illwieckz opened this issue Oct 8, 2014 · 14 comments
Closed

Port XQF to G_IO #4

illwieckz opened this issue Oct 8, 2014 · 14 comments

Comments

@illwieckz
Copy link
Member

XQF uses old and deprecated functions like gdk_input_remove or gdk_input_add.

The problem is that:

  1. it is deprecated
  2. the glib changed the way it handles source, so without modification, outdated codes are broken since the glib 2.39.0 release.

You can read here:
https://developer.gnome.org/gdk2/stable/gdk2-Input.html#gdk-input-remove

gdk_input_remove is deprecated and should not be used in newly-written code.

Fallback for gdk_input_remove seems to be an overlay of g_source_remove.

You can read here the glib 2.39.0 releases notes:
https://download.gnome.org/sources/glib/2.39/glib-2.39.0.news

  • g_source_remove() will not throw a critical in the case that you
    try to remove a non-existent source. We expect that there is some
    code in the wild that will fall afoul of this new critical but
    considering that we now reuse source IDs, this code is already
    broken and should probably be fixed.

So, without modification, when XQF updates server list, it prints many error messages like that:

(xqf:29892): GLib-CRITICAL *: Source ID 656 was not found when attempting to remove it
(xqf:29892): GLib-CRITICAL *
: Source ID 662 was not found when attempting to remove it
(xqf:29892): GLib-CRITICAL *: Source ID 674 was not found when attempting to remove it
(xqf:29892): GLib-CRITICAL *
: Source ID 678 was not found when attempting to remove it
(xqf:29892): GLib-CRITICAL *: Source ID 678 was not found when attempting to remove it
(xqf:29892): GLib-CRITICAL *
: Source ID 702 was not found when attempting to remove it

and sometimes crashes like that:

*** Error in `/home/Archive/dev/build/xqf/src/xqf': munmap_chunk(): invalid pointer: 0x000000000095ef70 ***

Program received signal SIGABRT, Aborted.
0x00007ffff5c32bb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: Aucun fichier ou dossier de ce type.
(gdb) bt
#0  0x00007ffff5c32bb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff5c35fc8 in __GI_abort () at abort.c:89
#2  0x00007ffff5c6fe14 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff5d7e668 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff5c7ab77 in malloc_printerr (action=<optimized out>, str=0x7ffff5d7e9e8 "munmap_chunk(): invalid pointer", ptr=<optimized out>)
    at malloc.c:4996
#4  0x00000000004430f7 in server_free_info (s=0x863c00) at server.c:205
#5  0x0000000000450dcf in parse_server (token=0x7fffffffdee0, n=8, refreshed=1412806096, saved=0) at stat.c:573
#6  0x0000000000451633 in parse_qstat_record (conn=0xadb250) at stat.c:761
#7  0x0000000000451d21 in stat_servers_input_callback (conn=0xadb250, fd=13, condition=GDK_INPUT_READ) at stat.c:920
#8  0x00007ffff72e141f in ?? () from /usr/lib/x86_64-linux-gnu/libgdk-x11-2.0.so.0
#9  0x00007ffff697ace5 in g_main_dispatch (context=0x6ee150) at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:3064
#10 g_main_context_dispatch (context=context@entry=0x6ee150) at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:3663
#11 0x00007ffff697b048 in g_main_context_iterate (context=0x6ee150, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:3734
#12 0x00007ffff697b30a in g_main_loop_run (loop=0x7fb7e0) at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:3928
#13 0x00007ffff769a447 in gtk_main () from /usr/lib/x86_64-linux-gnu/libgtk-x11-2.0.so.0
#14 0x0000000000465ddb in main (argc=3, argv=0x7fffffffe278) at xqf.c:4079

To see XQF crashing you must update many servers and well populated master servers (like Nexuiz), sometimes many times. If you are unlucky, it could be fast to have a crash, or not. Sometime I update 20+ master servers without problem, sometime I update one master server and it crashes (prefer Nexuiz master servers for testing ;-) ).

@illwieckz
Copy link
Member Author

First, we can replace gdk_input_remove with g_source_remove like that:

diff --git a/src/stat.c b/src/stat.c
index 1ed1fb9..5f730b0 100644
--- a/src/stat.c
+++ b/src/stat.c
@@ -915,7 +915,7 @@ static void stat_servers_input_callback (struct stat_conn *conn, int fd,

       if (conn->buf[conn->lastnl] == '\0') {
        blocked = TRUE;
-       gdk_input_remove (conn->tag);
+       g_source_remove (conn->tag);

        parse_qstat_record (conn);

but it will not remove the GLib-CRITICAL **: Source ID ### was not found when attempting to remove it message since the gdk_input_add is used.

It seems that using both g_source_remove and gdk_input_addincreases the risk to have a crash but it’s not proven and you can use XQF and update servers without problem, there is a bit randomness to have a crash.

Replacing gdk_input_add is more complicated than replacing gdk_source_remove. You must replace it with g_io_add_watch than did not take a fd as argument but a channel (GioChannel), and input_callback function must be rewritten.

@illwieckz
Copy link
Member Author

I created the branch g_io to work on the migration:
https://github.com/XQF/xqf/tree/g_io

@illwieckz
Copy link
Member Author

The commit cf2b506 is a first attempt.

It does not work because I get a problem like this one:
http://stackoverflow.com/questions/2417681/need-help-implementing-simple-socket-server-using-gioservice-glib-glib-gio?answertab=oldest#tab-top

It's not documented in the GSocketService docs (I had to go through the GLib sources to find it), but the routine that calls the callback (new_connection in this case) does a g_object_unref() on the connection object after it returns. This effectively closes the connection immediately new_connection() returns to it.

This is what happens with my code.

Tracing with gdb, I see that when I update a master server (like etmaster.idsoftware.com for Wolf:ET), it crashes like that:

Program received signal SIGSEGV, Segmentation fault.
0x0000000000450663 in stat_master_input_callback (conn=0xa8c920, fd=1, condition=(G_IO_HUP | unknown: 10304064)) at stat.c:397
397   if(conn->master->type == SAS_SERVER && conn->master->master_type == MASTER_HTTP
(gdb) bt
#0  0x0000000000450663 in stat_master_input_callback (conn=0xa8c920, fd=1, condition=(G_IO_HUP | unknown: 10304064)) at stat.c:397
#1  0x00007ffff697ace5 in g_main_dispatch (context=0x6ee100) at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:3064
#2  g_main_context_dispatch (context=context@entry=0x6ee100) at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:3663
#3  0x00007ffff697b048 in g_main_context_iterate (context=0x6ee100, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:3734
#4  0x00007ffff697b30a in g_main_loop_run (loop=0x7fb920) at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:3928
#5  0x00007ffff769a447 in gtk_main () from /usr/lib/x86_64-linux-gnu/libgtk-x11-2.0.so.0
#6  0x0000000000465e72 in main (argc=3, argv=0x7fffffffe288) at xqf.c:4079

Tracing with nemiver/gdb I see that:

  1. stat_update_masters calls stat_update_master_qstat with a master m with address 0x847420 (for example) that contains the string hostname with the value etmaster.idsoftware.com.
  2. stat_update_master_qstat receives m with the same address 0x847420 and the same content and calls star_qstat.
  3. start_qstat receives m with the same address 0x847420 and the same content and store it as conn->master in a conn structure with address 0x807710 (for example) and calls g_io_add_watch with a pointer to stat_master_input_callback (0x450631 for example) and a pointer to 0x807710 (the conn structure that have a pointer 0x847420 to m).
  4. g_io_add_watch receives a pointer to a function addressed by 0x450631 (stat_master_input_callback), and a pointer to a user_data addressed by 0x807710 (conn), with condition with value (G_IO_IN | G_IO_HUP | G_IO_ERR | G_IO_PRI).
  5. g_io_add_watch calls g_io_add_watch_full with 0x450631 (stat_master_input_callback) and 0x807710 (conn).
  6. g_io_add_watch_full receives 0x450631 (stat_master_input_callback) and 0x807710 (conn). conn (0x807710) has still a pointer tomaster(0x847420) with a string`etmaster.idsoftware.com``.
  7. g_io_add_watch_full calls g_io_unix_create_watch but it does not need conn and others, it works on channels etc.
  8. g_io_add_watch_full exits. conn (0x807710) has still a pointer to master (0x847420) with a string ``etmaster.idsoftware.com`.
  9. g_io_add_watch exits. conn (0x807710) has still a pointer to master (0x847420) with a string ``etmaster.idsoftware.com`.
  10. start_qstats exits. conn and master are ok.
  11. stat_update_master_qstat exits. conn and master are ok.
  12. stat_update_masters exits. conn (0x807710) has still a pointer to master (0x847420) with a string etmaster.idsoftware.com.
  13. stat_master_resolved_callback exits (the one that passes the id named etmaster.idsoftware.com).
  14. helper_parse_callback exits.
  15. dns_input_callback exits.
  16. a unknown function exits.
  17. g_main_dispatch continue its execution.
  18. g_main_dispatch calls g_source_callback_unref.
  19. g_main_dispatch calls stat_master_input_callback with bad conn 0xa90460 that have a NULL master, and a with a bad condition (G_IO_HUP | unknown: 8419072).
  20. stat_master_input_callback attempts to read conn->master and XQF get SIGSEV.

@illwieckz
Copy link
Member Author

We must find a way to prevent g_main_dispatch to forget the objects conn and conditions before it calls stat_master_input_callback.

@illwieckz
Copy link
Member Author

This bug affect stat.c and stat.h:

The current changes are here: cf2b506

@illwieckz
Copy link
Member Author

As Chris Vine said:

If stat_master_input_callback() is a call back for a GIOChannel watch,
then by returning void the callback is definitely defective. That may
well explain your observations.

So, I rewrite the callbacks to return gboolean : 0351e93

The problem is not fixed, but it will be better like that.

@illwieckz
Copy link
Member Author

We can read how g_io_add_watch is declared and how it calls g_io_add_watch_full to see how we can pass a callback function as reference.

https://git.gnome.org/browse/glib/tree/glib/giochannel.c#n700

This part seems OK.

@illwieckz
Copy link
Member Author

One thing to change, stat_master_input_callback() read fd with read() instead of reading the new chan with g_io_channel_read_chars().

Changing that will probably not fix the segfault because the segfault appears before the function attempt to read fd. But we MUST rewrite that.

There is probably a collateral damage due to the fact that stat_master_input_callback passed to g_io_add_watch has a wrong type for the first parameter.

For information, the read() prototype is:

ssize_t read(int fd, void *buf, size_t count);

The g_io_channel_read_chars() prototype is:

GIOStatus
g_io_channel_read_chars (GIOChannel *channel,
                         gchar *buf,
                         gsize count,
                         gsize *bytes_read,
                         GError **error);

cf. https://developer.gnome.org/glib/stable/glib-IO-Channels.html#g-io-channel-read-chars

Fix in 89dafe5 .

@illwieckz
Copy link
Member Author

Another problem:

stat.c:202:27: warning: pointer targets in initialization differ in signedness [-Wpointer-sign]
       unsigned char* ip = conn->buf+off;

in stat.h:

char    *buf;

But because g_io_channel_read_chars() argument 2 is signed, ip must be signed.

I fix that.

@illwieckz
Copy link
Member Author

As seen on https://git.gnome.org/browse/glib/tree/glib/giounix.c#n165

old callback takes (gpointer *user_data, int fd, GdkInputCondition condition).
new callback takes (GIOChannel *chan, GIOCondition condition, gpointer *user_data)!

see f084f66

@illwieckz
Copy link
Member Author

Works, but there is always
GLib-CRITICAL **: Source ID 614 was not found when attempting to remove it, and sometimes segfault.

When source was blocked, there was two g_source_remove (conn->tag); and the first can be called more than one time.

Drop the g_source_remove in the while…if something; blocked==TRUE; g_source_remove part and keep only the one in the endwhile; if blocked==TRUE; g_source_remove part.

cf. 4553122

@illwieckz
Copy link
Member Author

champagne ! 💃

I got it !

The branch g_io was just merged into master, huhu.

I can reload dozens of master server plenty of servers without segfault IN THIS PART.

(oh yes, I get a segfault elsewhere, but in another part, will be another ticket, the #9).

@illwieckz
Copy link
Member Author

Many many thanks for Chris Vine for his precious help!

@illwieckz
Copy link
Member Author

Notice : the wrong freeing was there in the first CVS commit in year 2000, so it is another “pre y2k bug”.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant