New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make geany -g tags output reproducible #1989

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
5 participants
@bmwiedemann

bmwiedemann commented Nov 6, 2018

Without this patch, openSUSE's glfw package would always differ in /usr/share/geany/tags/glfw.c.tags
because inode numbers differ between builds (that happen in disposable VMs)

This was previously discussed in https://bugzilla.opensuse.org/show_bug.cgi?id=1049382

Make geany -g tags output reproducible
without this patch, openSUSE's glfw package would always differ in
/usr/share/geany/tags/glfw.c.tags
because inode numbers differ between builds (that happen in
disposable VMs)

@bmwiedemann bmwiedemann force-pushed the bmwiedemann:inode branch from 6788e59 to 4a134ac Nov 6, 2018

@elextr

This comment has been minimized.

Member

elextr commented Nov 6, 2018

You don't say what the actual problem is, but I presume your problem is that the order of tags changes because the order of files in the list lookup_includes() returns changes as the inode changes. If you want reproducible builds, instead of relying on application internals and globbing orders you should list the files explicitly instead of using globs.

This change will also fail to de-dup paths that are linked to the same file, but hopefully that will be rare.

If we can live with that, performance wise this should be better (stated without benchmarking of course :) because it no longer stats the filesystem to make a hash key.

@bmwiedemann

This comment has been minimized.

bmwiedemann commented Nov 7, 2018

@elextr we did list files explicitly in https://build.opensuse.org/package/view_file/openSUSE:Factory/glfw/glfw.spec line 88 (still missing a | sort in that version), but even then, the hashing influenced entries in the output.

@elextr

This comment has been minimized.

Member

elextr commented Nov 7, 2018

the hashing influenced entries in the output.

Geany either:

  1. if -P creates a temporary file full of #includes one for each file in the list and uses the C preprocessor to combine them, or

  2. if no -P just copys the content of all the files in the list together into a temporary file and parses that.

Both of these do it in the hash table list order.

So, maybe to ensure reproducible output you can pre-combine the files you want in the order you want by running the C preprocessor first and then pass the result of that to Geany.

If you can get reproducible output without depending on implementation details that might change to support other languages besides C, that would be good.

@bmwiedemann

This comment has been minimized.

bmwiedemann commented Nov 7, 2018

@elextr I do not understand 'This change will also fail to de-dup paths that are linked to the same file'
I thought, the old code did not do that either, but just hashed it into the same bucket.

Diffs looked like http://rb.zq1.de/compare.factory-20180730/glfw-compare.out

--- old//usr/share/geany/tags/glfw.c.tags	2017-02-12 12:00:00.000000000 +0000
+++ new//usr/share/geany/tags/glfw.c.tags	2017-02-12 12:00:00.000000000 +0000
@@ -525,23 +525,23 @@
 DN_MULTISHOTÌ65536Ö0
 DN_RENAMEÌ65536Ö0
 DTTOIFÌ131072Í(dirtype)Ö0
-DT_BLKÌ4Îanon_enum_26Ö0
+DT_BLKÌ4Îanon_enum_29Ö0
 DT_BLKÌ65536Ö0
-DT_CHRÌ4Îanon_enum_26Ö0
+DT_CHRÌ4Îanon_enum_29Ö0
 DT_CHRÌ65536Ö0
 [...]
-KHR_xlib_surfaceÌ64Î_GLFWlibrary::anon_struct_20Ö0ÏGLFWbool
+KHR_xlib_surfaceÌ64Î_GLFWlibrary::anon_struct_4Ö0ÏGLFWbool

So is there any reason against merging this change?

@b4n

This comment has been minimized.

Member

b4n commented Nov 7, 2018

@elextr I do not understand 'This change will also fail to de-dup paths that are linked to the same file'
I thought, the old code did not do that either, but just hashed it into the same bucket.

Yeah, it looks like that, which seems absurd. My guess would be that the goal was to do what @elextr expects, that is using the inode itself as a key, which indeed would have the nice property of avoid re-computing the same file even under a different path, but as is it actually leads to never match any entry (as g_direct_equal() on a different string pointer will always deem those different).
So your changes are better than the previous behavior, as at least identical paths will match.

However, this function seems a bit absurd. I don't see why it doesn't simply builds the list directly, because traversing a GHashTable is not very fast, and IIUC it wouldn't change anything. So we could probably change it to just do that, which would be simpler and fix everyone's concerns.

Also, relying on the order of traversal of a GHashTable seems fragile, I don't think it is guaranteed to be stable across GLib versions, architectures or whatnot.

@elextr

This comment has been minimized.

Member

elextr commented Nov 7, 2018

@b4n, no the current code works just fine, the hash key is the inode, which is an int stored in the pointer, so g_direct_equal() is correct for the current system. But it has to be changed to g_str_equal() if the key becomes the filename string.

Originally a list was used but then it was de-duped, so likely thats why it got changed to a hash but that change is copied from Anjuta, so who knows.

Not sure why duplicates are a problem, TM should handle multiple definitions of the same tagname? @b4n TM-spurtese needed.

@bmwiedemann did you try running cpp yourself? Or you could just cat the files together if you don't want recursive inclusion, thats all Geany is doing internally.

@b4n

This comment has been minimized.

Member

b4n commented Nov 7, 2018

@b4n, no the current code works just fine, the hash key is the inode

Not it's not, it's hashed as an inode, but the key is a filename, so the equal function will compare the string pointers.

Originally a list was used but then it was de-duped

Well, if we used a hash table as a mean of deduping and a list as a mean of well, building the list, it would not be a problem.

@elextr

This comment has been minimized.

Member

elextr commented Nov 7, 2018

Not it's not, it's hashed as an inode, but the key is a filename, so the equal function will compare the string pointers.

ahhh, ok. So a) it doesn't work now because its wrong, and b) comparing strings won't actually work either if both paths resolve to the same inode, the hash table will see the hash collide but the strings being different and add it again.

So using inode as the hash is just a slow way of making a hash of a file path. In fact Anjuta seems to have made a right hash of the whole thing [pun intended]. Maybe thats why it switched to sqlite for its tags later.

Well, if we used a hash table as a mean of deduping and a list as a mean of well, building the list, it would not be a problem.

Yeah, since 2.40 g_hash_table_insert() returns if it inserted or not, so the filename could be added to the list on the fly. But Glib 2.40 didn't exist back when this was originally written.

@b4n

This comment has been minimized.

Member

b4n commented Nov 7, 2018

So using inode as the hash is just a slow way of making a hash of a file path.

Well, the inode gets a goodish hash, but the equal check makes it worse than a string hash and comparison on the file path yeah.
If really the inode thing was useful, it should be used as the key directly rather than a bad convoluted hash that's guaranteed to have a worse stat() call count anyway.

Yeah, since 2.40 g_hash_table_insert() returns if it inserted or not, so the filename could be added to the list on the fly. But Glib 2.40 didn't exist back when this was originally written.

The code already checks if the hash table contains the element before inserting it, so it should be OK no matter which version is used.

@elextr

This comment has been minimized.

Member

elextr commented Nov 7, 2018

The code already checks if the hash table contains the element before inserting it, so it should be OK no matter which version is used.

Except that it will always fail to find it due to the bug, so it gets added anyway.

If really the inode thing was useful, it should be used as the key directly rather than a bad convoluted hash that's guaranteed to have a worse stat() call count anyway.

Well, as I said above, everything gets added because a hash collision won't stop different paths being added, and due to the bug all paths are seen as different at the moment anyway, so duplicates so far havn't caused any "trouble" (the word from the original de-dup comment). @b4n do you know what "trouble" they might cause?

So clearly the inode stuff is not useful, and we should simply use g_str_hash() or totally forget de-duping since so far the fact that its failing hasn't caused problems (except perhaps some tags files are silently bigger than they need to be, but nobody has noticed and complained).

@kugel-

This comment has been minimized.

Member

kugel- commented Nov 8, 2018

Isn't the inode only unique within the same file system? We can't assume all files are on the same fs.

@elextr

This comment has been minimized.

Member

elextr commented Nov 8, 2018

@kugel- true.

But it doesn't matter, its a hash not the key, they can collide, it just costs time if they do. Nothing should break. Also what do you get for "inode" for files served via Samba?

But the fact that the key comparisons are wrong in the current code means it just creates a list of all the files using a g_hash in pseudo random order, then copies that list to a g_list, no de-duping, nothing, nada, useless waste of time.

So unless somebody can identify the "trouble" having multiple copies of the same file will cause (apart from some space and time) then I think we should just expand the globs straight into the g_list and thats it.

@kugel-

This comment has been minimized.

Member

kugel- commented Nov 9, 2018

@elextr

This comment has been minimized.

Member

elextr commented Nov 10, 2018

Also waiting to see if the OP had success cating the files themselves, or running cpp themselves.

Even if we go the g_list route, reproducible output will still not be guaranteed, just an implementation side effect. That may be acceptable of course.

@bmwiedemann

This comment has been minimized.

bmwiedemann commented Nov 10, 2018

yes, cat'ing files on the caller side also helps:

-geany -c geany_config -g glfw.c.tags $(find src \( ! -name CMakeFiles \) -type f \( -iname "*.c" -o -iname "*.h" \) \( ! -iname "win32*" \) \( ! -iname "cocoa*" \) | sort
-) include/GLFW/glfw3.h
+cat $(find src \( ! -name CMakeFiles \) -type f \( -iname "*.c" -o -iname "*.h" \) \( ! -iname "win32*" \) \( ! -iname "cocoa*" \) | sort) include/GLFW/glfw3.h > tags.in.h
+geany -c geany_config -g glfw.c.tags tags.in.h

but having tools themselves more deterministic is still a plus, because it avoids having to patch all callers.

@elextr

This comment has been minimized.

Member

elextr commented Nov 10, 2018

but having tools themselves more deterministic is still a plus, because it avoids having to patch all callers.

Sure, but its a workaround until Geany is changed.

Also in many cases the callers should care which order files are included, for context sensitive languages like C/C++ knowing that an identifier is a type name is important to correct parsing. So you want to control order so that the declaration of the type name is before the use. Essentially you want things to be seen in the order that the compiler will see them.

So the files need to be explicitly specified in the correct order.

That means Geany should keep the order the same as the command line, except that where a command line item is a glob, it is expanded to all matching items.

This behaviour is deterministic so meets both requirements.

@b4n

This comment has been minimized.

Member

b4n commented Nov 10, 2018

See #1991

@b4n b4n closed this in fb39f67 Nov 13, 2018

@b4n b4n added this to the 1.34 milestone Nov 30, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment