SIGSEGV at startup of sci-mathematics/sage-4.6-r1 #40

Closed
gagern opened this Issue Jan 7, 2011 · 51 comments

@gagern
Contributor
gagern commented Jan 7, 2011

Simply starting sage will cause it to crash with a backtrace:

$ sage
----------------------------------------------------------------------
| Sage Version 4.6, Release Date: 2010-10-30                         |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
** Message: pygobject_register_sinkfunc is deprecated (GtkWindow)
** Message: pygobject_register_sinkfunc is deprecated (GtkInvisible)
** Message: pygobject_register_sinkfunc is deprecated (GtkObject)


------------------------------------------------------------
Unhandled SIGSEGV: A segmentation fault occurred in Sage.
This probably occurred because a *compiled* component
of Sage has a bug in it (typically accessing invalid memory)
or is not properly wrapped with _sig_on, _sig_off.
You might want to run Sage under gdb with 'sage -gdb' to debug this.
Sage will now terminate (sorry).
------------------------------------------------------------

Using sage -gdb will give a backtrace like this:

0  __cxxabiv1::__cxa_allocate_exception (thrown_size=144)
   at /var/tmp/portage/sys-devel/gcc-4.4.5/work/gcc-4.4.5/libstdc++-v3/libsupc++/eh_alloc.cc:133
1  0x00007fffd613182f in GiNaC::function::find_function (name=, nparams=)
   at /var/tmp/portage/sci-libs/pynac-0.2.1/work/pynac-0.2.1/src/ginac/function.cpp:1446
2  0x00007fffd594883d in __pyx_f_4sage_8symbolic_8function_15BuiltinFunction__is_registered (__pyx_v_self=0x3031280)
   at sage/symbolic/function.cpp:7185
3  0x00007fffd59476a5 in __pyx_pf_4sage_8symbolic_8function_8Function___init__ (__pyx_v_self=0x3031280, ...)
   at sage/symbolic/function.cpp:2462
4  0x000000359ae9bbec in wrap_init (self=0x7ffff7fa4010, args=0x0, wrapped=0x10, kwds=0x2b36900)
   at Objects/typeobject.c:4707
5  0x000000359ae48684 in PyObject_Call (func=0x302f290, arg=0x0, kw=0x3)
   at Objects/abstract.c:2529
6  0x000000359aedcbc2 in PyEval_CallObjectWithKeywords (func=0x302f290, arg=0x3001590, kw=0x3)
   at Python/ceval.c:3881
7  0x000000359ae604b1 in wrapperdescr_call (descr=, args=0x3001590, kwds=0x0)
   at Objects/descrobject.c:304
8  0x000000359ae48684 in PyObject_Call (func=0x202dc80, arg=0x0, kw=0x3)
   at Objects/abstract.c:2529
9  0x00007fffd593dd7f in __pyx_pf_4sage_8symbolic_8function_15BuiltinFunction___init__ (__pyx_v_self=0x3031280, ...)
   at sage/symbolic/function.cpp:7101

The complete backtrace is 275 frames long; I can try to attach it here if you want it.

I have no clue (yet) what this is about. Ideas what I might try to investigate this?

@kiwifb
Collaborator
kiwifb commented Jan 7, 2011

There have been two reports on the gentoo-science mailing list of something that looks very similar. So far we have no clues. Look at this thread:
http://archives.gentoo.org/gentoo-science/msg_f8b16ae6a1f48dca0792e75e7e3b702d.xml

Is your machine an amd64?

@kiwifb
Collaborator
kiwifb commented Jan 7, 2011

Actually, do you have a working "vanilla" sage install, that is, one not built from the ebuild?
If you do, could you move the "vanilla" pynac shared library somewhere else and create a link to the Gentoo-installed pynac lib in its place?
In short, move $SAGE_LOCAL/lib/libpynac-0.2.so.1.0.0 and create a link to /usr/lib64/libpynac-0.2.so.1.0.0 in its place.
If it is a pynac problem, it will now show up in your "vanilla" sage install.
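
The swap could be scripted roughly as follows. This is only a minimal sketch and not part of any Sage tooling: the helper name `swap_in_system_lib` and the `.vanilla` backup suffix are made up for illustration, and the two library paths are simply the ones quoted above.

```python
import os
from pathlib import Path


def swap_in_system_lib(vanilla_lib, system_lib, backup_suffix=".vanilla"):
    """Move vanilla_lib aside and leave a symlink to system_lib in its place.

    Hypothetical helper: the name and the backup suffix are illustrative only.
    Returns the path of the backup so the swap can be undone.
    """
    vanilla_lib = Path(vanilla_lib)
    system_lib = Path(system_lib)
    backup = vanilla_lib.with_name(vanilla_lib.name + backup_suffix)
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    # Keep the original library so it can be restored later.
    vanilla_lib.rename(backup)
    # The vanilla install now resolves the library name to the system build.
    os.symlink(system_lib, vanilla_lib)
    return backup
```

It would be called with the expanded `$SAGE_LOCAL/lib/libpynac-0.2.so.1.0.0` as the first argument and `/usr/lib64/libpynac-0.2.so.1.0.0` as the second; renaming the backup over the symlink restores the vanilla library.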

@gagern
Contributor
gagern commented Jan 7, 2011

Yes, this is an amd64 machine. This backtrace from the thread you referenced looks pretty much like mine.

Probably this thing is related to Gentoo bug 338513 as well.

I don't have a vanilla sage here, and I fail to find a binary package that doesn't have a distribution name in its file name. I'll check whether the Ubuntu one runs here as well, and if not, how difficult a compile from source appears to be.

I'm also compiling gcc / libstdc++ from source, with -O0, in order to get more details at the place where the segfault actually happens.

@kiwifb
Collaborator
kiwifb commented Jan 7, 2011

The bug you mention is probably related to the deprecation warnings you get, but I doubt it has anything to do with the SIGSEGV (I could be wrong).

If you install sage from source and have ATLAS installed you can use SAGE_ATLAS to use your system install (you need to eselect atlas for blas/cblas/lapack as well and create a link libf77blas to libblas). That speeds things up a little bit.
SAGE_ATLAS=/usr/lib64

@cschwan
Owner
cschwan commented Jan 7, 2011

I have two boxes running amd64 gentoo, but I could not reproduce this error. I tried installing matplotlib with USE=gtk, but it does not break sage.

@cschwan
Owner
cschwan commented Jan 7, 2011

Sorry for closing, hit the wrong button ...

@gagern
Contributor
gagern commented Jan 7, 2011

You are wrong: comment 3 from that bug has a SIGSEGV in the same libstdc++ function we have.

I have first results from debugging with an unoptimized libstdc++:
Dump of assembler code for function __cxxabiv1::__cxa_allocate_exception(size_t):
<+31>: callq 0x7ffff2cc6710 __cxa_get_globals@plt
=> <+36>: addl $0x1,0x8(%rax)
This corresponds to these lines in eh_alloc.cc:132:
__cxa_eh_globals *globals = __cxa_get_globals ();
globals->uncaughtExceptions += 1;
The function this calls is from eh_globals.cc:
Dump of assembler code for function __cxxabiv1::__cxa_get_globals():
<+0>: lea 0x2328f1(%rip),%rdi # 0x7ffff2f69528
<+7>: callq 0x7ffff2cc5fb0 __tls_get_addr@plt
<+12>: add $0x10,%rax
<+18>: retq
I guess this should be the version with the __thread keyword on line 52. As a result of all this code, we get %rax=0x10, which is reason enough for a segfault. It seems that __tls_get_addr, which should return the address of the thread-local storage area, returns a NULL pointer instead. Reason unknown.
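
To make the arithmetic explicit, here is a small Python model of the two instructions involved. It is only a sketch: the structure layout is an assumption based on the `0x8(%rax)` operand above (a 64-bit pointer followed by an unsigned int), and the function names are illustrative stand-ins, not libstdc++ code.

```python
import ctypes


class CxaEhGlobals(ctypes.Structure):
    """Assumed amd64 layout of __cxa_eh_globals (pointer modelled as 64-bit)."""
    _fields_ = [
        ("caughtExceptions", ctypes.c_uint64),    # pointer on amd64
        ("uncaughtExceptions", ctypes.c_uint32),  # unsigned int
    ]


def cxa_get_globals(tls_base):
    # Mirrors "add $0x10,%rax": the result is the returned TLS address plus 0x10.
    return tls_base + 0x10


def faulting_address(tls_base):
    # Mirrors "addl $0x1,0x8(%rax)": the increment targets uncaughtExceptions.
    return cxa_get_globals(tls_base) + CxaEhGlobals.uncaughtExceptions.offset


print(hex(cxa_get_globals(0)))   # 0x10
print(hex(faulting_address(0)))  # 0x18
```

With a NULL result from __tls_get_addr, the increment lands on address 0x18, which is not mapped, hence the SIGSEGV.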

@kiwifb
Collaborator
kiwifb commented Jan 7, 2011

Does that mean that you can avoid the problem using the workaround from that bug?

@gagern
Contributor
gagern commented Jan 7, 2011

Nope, the solution of editing /etc/matplotlib/matplotlibrc is specific to matplotlib and doesn't apply to pynac. What the two have in common is some C++ code, invoked from python, trying to throw an exception. The code throwing the exception is different. The reason why TLS is broken should be the same, though.

I'm currently stepping through __tls_get_addr, but as the call to that doesn't explicitly appear in the code, I still haven't fully understood how this is supposed to work.

@cschwan
Owner
cschwan commented Jan 7, 2011

Could you please post us your emerge --info ?

@gagern
Contributor
gagern commented Jan 7, 2011

emerge --info follows. By the way, I'm currently on #gentoo-science@freenode, so if you want to discuss this live, feel free to talk to MvG.

$ emerge --info
Portage 2.2.0_alpha14 (default/linux/amd64/10.0/desktop/gnome, gcc-4.4.5, glibc-2.12.1-r3, 2.6.36-gentoo-r5 x86_64)
=================================================================
System uname: Linux-2.6.36-gentoo-r5-x86_64-AMD_Phenom-tm-_II_X4_945_Processor-with-gentoo-2.0.1
Timestamp of tree: Fri, 07 Jan 2011 09:15:01 +0000
distcc 3.1 x86_64-pc-linux-gnu [disabled]
ccache version 3.1.3 [disabled]
app-shells/bash:     4.1_p9
dev-java/java-config: 2.1.11-r3
dev-lang/python:     2.4.6, 2.5.4-r4, 2.6.6-r1::sage-on-gentoo, 2.7.1::sage-on-gentoo, 3.1.3
dev-util/ccache:     3.1.3
dev-util/cmake:      2.8.3-r1
sys-apps/baselayout: 2.0.1-r1
sys-apps/openrc:     0.6.8
sys-apps/sandbox:    2.4
sys-devel/autoconf:  2.13, 2.68
sys-devel/automake:  1.4_p6-r1, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r3, 1.10.3, 1.11.1
sys-devel/binutils:  2.21
sys-devel/gcc:       3.3.6-r1, 3.4.6-r2, 4.1.2, 4.2.4-r1, 4.4.5, 4.5.2
sys-devel/gcc-config: 1.4.1
sys-devel/libtool:   2.4-r1
sys-devel/make:      3.82
virtual/os-headers:  2.6.36.1 (sys-kernel/linux-headers)
Repositories: gentoo generated mvgLocal mvg-java-experimental sunrise-enabled bugfix bump kde-sunset sage-on-gentoo
ACCEPT_KEYWORDS="amd64 ~amd64"
ACCEPT_LICENSE="* -@EULA dlj-1.1 sun-bcla-java-vm skype-eula googleearth AdobeFlash-10 AdobeFlash-10.1"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=amdfam10 -O2 -ggdb -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/config /usr/share/maven-bin-3.0/conf /usr/share/openvpn/easy-rsa /var/lib/hsqldb"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/env.d/java/ /etc/eselect/postgresql /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/php/apache2-php5.3/ext-active/ /etc/php/cgi-php5.3/ext-active/ /etc/php/cli-php5.3/ext-active/ /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"
CXXFLAGS="-march=amdfam10 -O2 -ggdb -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="assume-digests binpkg-logs collision-protect distlocks fixlafiles fixpackages news parallel-fetch preserve-libs protect-owned sandbox sfperms splitdebug strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"
GENTOO_MIRRORS="http://mirror.cambrium.nl/pub/os/linux/gentoo/ http://mirror.leaseweb.com/gentoo/ ftp://mirror.netcologne.de/gentoo/"
LANG="de_DE.utf8"
LDFLAGS="-Wl,--as-needed"
LINGUAS="en de en_US en_GB"
MAKEOPTS="-j5"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/portage/local/generated /usr/portage/local/mvg /usr/portage/local/mvg-java /usr/portage/local/sunrise-enabled /usr/portage/local/bugfix /usr/portage/local/bump /usr/portage/local/kde-sunset /usr/portage/local/layman/sage-on-gentoo"
SYNC="rsync://rsync.de.gentoo.org/gentoo-portage"
USE="X a52 aac acl acpi alsa amd64 apache2 audiofile avahi bash-completion bcmath berkdb bluray branding bzip2 c++ cairo cdda cdparanoia cdr chroot cli cracklib crypt css cups curl cxx dba dbus dhcp doc dri dts dv dvd dvdr eds emacs emboss encode escreen evo exif fam fastcgi ffmpeg fftw firefox flac flatfile fortran ftp gcc-libffi gd gdbm gdu gif gimp gnome gnome-keyring gnutls gphoto2 gpm graphviz gs gstreamer gtk hal hbci html iconv idn imagemagick ipv6 iso14755 ithreads jabber jack java jpeg jpeg2k kde kerberos kpathsea kvm ladspa latex lcms ldap leim libnotify lirc lm_sensors logrotate lzo mad maildir mhash mikmod mime mjpeg mmx mng modules mozxmlterm mp3 mp4 mpeg mpeg2 mplayer mudflap multilib mysql nautilus ncurses network nls nptl nptlonly nsplugin objc odbc ofx ogg openexr opengl openmp pam pango pcre pdf perl php plotutils png policykit povray ppds pppd procmail python qt3support qt4 quicktime rdesktop readline recode sasl scanner sdl session smime sndfile snmp sockets socks5 sox speex spell sqlite sse sse2 ssl startup-notification subversion svg sysfs tcl tcpd threads thunderbird tiff tokenizer transcode translator truetype type1 udev unicode usb userlocales v4l v4l2 vhosts vorbis wmf x264 xanim xcb xcomposite xine xinerama xinetd xml xorg xprint xscreensaver xulrunner xv xvid xvmc zlib" ALSA_CARDS="bt87x emu10k1x hda-intel usb-audio via82xx via82xx-modem" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias asis auth_basic auth_digest authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter filter headers include info log_config logio mem_cache mime mime_magic negotiation proxy proxy_connect proxy_ftp proxy_http rewrite setenvif 
speling status unique_id userdir usertrack vhost_alias" APACHE2_MPMS="prefork" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" FRITZCAPI_CARDS="fcpci" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="evdev joystick keyboard mouse wacom" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en de en_US en_GB" LIRC_DEVICES="hauppauge" MISDN_CARDS="avmfritz" PHP_TARGETS="php5-3" RUBY_TARGETS="ruby18 jruby" USERLAND="GNU" VIDEO_CARDS="nvidia nouveau nv intel fbdev v4l vesa vga" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account" 
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, FFLAGS, INSTALL_MASK, LC_ALL, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
@gagern
Contributor
gagern commented Jan 8, 2011

Random collection of intermediate results:

  1. Plain vanilla sage runs fine, even with pynac replaced.
  2. Plain vanilla throws an exception internally, causing the same call to __cxa_allocate_exception, so that part is to be expected as well. It's not the exception itself that causes problems, but the state of the thread-local storage setup at that moment.
  3. There is only a single thread, so this is no race condition.
  4. I'm using python 2.7, which might be a core reason for this problem.
  5. /usr/bin/sage-sage doesn't select any particular python version. Maybe it should.
  6. My pynac was compiled against python 2.6 headers, but recompiling against 2.7 doesn't help. gagern@07493f4 might provide a sane way for python_updater to handle pynac, if this makes sense, but it won't solve this issue here.

Some more deep code analysis:
__tls_get_addr from libc is the place where the NULL pointer crops up. In the gentoo sage build, dtv[GET_ADDR_MODULE].pointer.val == NULL and dtv[0].counter == GL(dl_tls_generation). In the plain vanilla build, counter and generation differ at the first call of that function, leading to a call to _dl_update_slotinfo, which seems to fix things. In the gentoo build, _dl_update_slotinfo gets called when modules are dlopened, but for some reason that's not the same. The order of loaded TLS-enabled libs is different, so things are somewhat hard to compare.
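
The difference between the two builds can be sketched with a toy Python model of the lookup. This is a simplified reading of the behaviour described above, not glibc source: the dictionary-based dtv, the callback parameters and the module id 4 are all assumptions made for illustration, and the sentinel stands in for what I understand to be glibc's TLS_DTV_UNALLOCATED value.

```python
TLS_DTV_UNALLOCATED = -1  # stands in for glibc's ((void *) -1) sentinel


def tls_get_addr(dtv, module, offset, global_generation,
                 update_slotinfo, allocate):
    """Toy model of the lookup: dtv[0] is the generation counter, dtv[m] is
    module m's block address, TLS_DTV_UNALLOCATED, or None for NULL."""
    if dtv[0] != global_generation:
        # Stale dtv: bring it up to date before looking at the slot.
        update_slotinfo(dtv, module)
    block = dtv[module]
    if block == TLS_DTV_UNALLOCATED:
        block = dtv[module] = allocate(module)
    # NULL is not the "unallocated" sentinel, so it is used as-is.
    base = 0 if block is None else block
    return base + offset


def no_update(dtv, module):
    raise AssertionError("not reached when the generations match")


# Gentoo build as observed: counters match and the slot holds NULL.
dtv = {0: 2, 4: None}  # module id 4 is a hypothetical example
print(tls_get_addr(dtv, 4, 0, 2, no_update, lambda m: 0x7000))  # 0
```

In the vanilla case the counters differ, so the update callback runs first and gets a chance to replace the NULL entry before it is dereferenced.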

@kiwifb
Collaborator
kiwifb commented Jan 8, 2011

The first reports we had were with python-2.6.6, so if it is python related it has to be
anything over 2.6.5. I am wondering about the version of glibc used by other people.

@gagern
Contributor
gagern commented Jan 8, 2011

cschwan mentioned on IRC that one Vladimir is experiencing the same issue with glibc 2.11.2-r3, the same glibc version cschwan himself runs. So the glibc version in itself isn't enough either.

@kiwifb
Collaborator
kiwifb commented Jan 15, 2011

Any chance there is a relation to this:
http://trac.sagemath.org/sage_trac/ticket/9880

@gagern
Contributor
gagern commented Jan 15, 2011

I see no relation to Sage bug 9880 except for pynac and sigsegv being common to both. A backtrace could easily clarify things, though. Haven't found one yet. Can you reproduce that, and generate a backtrace? Otherwise I'd ask one of the people on that report for one.

@kiwifb
Collaborator
kiwifb commented Jan 20, 2011

Sorry for not getting back to you. No, I cannot reproduce it. But yes, it would be interesting if there were a trace available on trac.

@gagern
Contributor
gagern commented Jan 20, 2011

The reporter of that bug considers it unrelated, and the backtrace supports that opinion.

@gagern
Contributor
gagern commented Jan 20, 2011

Can reproduce this issue here with python 2.6 instead of 2.7, after I got a consistent 2.6 setup from gagern@e110b07 and gagern@4e8ffce.

@kiwifb
Collaborator
kiwifb commented Jan 21, 2011

I am wondering if some optimization flag plays a role here. Could you try with the sage ebuild compiled with -O3 like vanilla?

@kiwifb
Collaborator
kiwifb commented Jan 22, 2011

Another possibility: this bug first emerged at the same time as the use of python-2.6.6.
I know it is a pain, but python-2.6.5 should be tested.

@gagern
Contributor
gagern commented Jan 22, 2011

Negative on both counts: Downgraded python:2.6 to 2.6.5-r3 and reemerged sage with CFLAGS=CXXFLAGS="-O3 -ggdb". Problem persists.

@kiwifb
Collaborator
kiwifb commented Jan 22, 2011

Well it had to be tried. I looked at hg commits around sage-4.5.3 (Vladimir's first report)
to see if there were any code changes in symbolic/function.pyx but couldn't find anything.
It may come in from something called before that. Or it could be a change external to sage in the main tree - but you just ruled out python.
What happens if you switch symbolic/function.so between vanilla and s-o-g versions?
What is called in sage before function.cpp? I can see it go through some python code
before landing there but what was before that?

@gagern
Contributor
gagern commented Jan 22, 2011

The function.cpp location in the code is just where the error manifests, not where it's caused, I'd say. The problem is with accessing the thread local storage of a dynamically loaded library, so I presume the problem occurs somewhere when the library is dynloaded. To understand the details, I'd have to figure out how dl-tls.c in glibc is supposed to work, and how that could go wrong. So far this has eluded me.

@kiwifb
Collaborator
kiwifb commented Jan 23, 2011

I get what you say. On the other hand, vanilla works, and I would like to use it to gather more clues. If I am not mistaken from reading the code, libs/function.cpp is used to produce a catalogue of functions. It would be nice to know which function is involved when things break; were there any others going through before?
Your backtrace comparison with vanilla suggests we may not be looking at the same function, or possibly not going in the same order.
So I'd still like to know what was called before.
There is also the fact that not all amd64 installs suffer from this. There must be a common factor, but which one?

@kiwifb
Collaborator
kiwifb commented Jan 23, 2011

Vladimir's emerge --info:

Portage 2.1.9.34 (default/linux/amd64/10.0, gcc-4.5.2, glibc-2.12.2-r0, 2.6.37-gentoo x86_64)
=================================================================
System uname: Linux-2.6.37-gentoo-x86_64-Mobile_AMD_Sempron-tm-_Processor_3800+-with-gentoo-2.0.1
Timestamp of tree: Fri, 21 Jan 2011 00:30:01 +0000
distcc 3.1 x86_64-pc-linux-gnu [disabled]
app-shells/bash:     4.1_p9
dev-lang/python:     2.6.6-r1::sage-on-gentoo, 3.1.3
dev-util/cmake:      2.8.3-r1
sys-apps/baselayout: 2.0.1-r1
sys-apps/openrc:     0.7.0
sys-apps/sandbox:    2.4
sys-devel/autoconf:  2.13, 2.68
sys-devel/automake:  1.8.5-r3, 1.9.6-r3, 1.10.3, 1.11.1
sys-devel/binutils:  2.21
sys-devel/gcc:       4.4.5, 4.5.2
sys-devel/gcc-config: 1.4.1
sys-devel/libtool:   2.4-r1
sys-devel/make:      3.82
virtual/os-headers:  2.6.36.1 (sys-kernel/linux-headers)
ACCEPT_KEYWORDS="amd64 x86 ~amd64 ~x86"
ACCEPT_LICENSE="* -@EULA PUEL skype-eula"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=k8 -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/splash /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"
CXXFLAGS="-march=k8 -O2 -pipe"
DISTDIR="/home/Install/GNU-Linux/distfiles/"
FEATURES="assume-digests binpkg-logs distlocks fixlafiles fixpackages news parallel-fetch protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"
FFLAGS=""
GENTOO_MIRRORS=" http://gentoo.kiev.ua/ftp/ ftp://gentoo.kiev.ua/"
LANG="uk_UA.UTF-8"
LC_ALL="uk_UA.UTF-8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
LINGUAS="ru"
MAKEOPTS="-j3"
PKGDIR="/home/Install/GNU-Linux/binpkg/"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/var/lib/layman/science /var/lib/layman/sage-on-gentoo /usr/local/overlays"
SYNC="rsync://rsync.gentoo.org/gentoo-portage" USE="3dnow 3dnowext
64bit 7zip X a52 aac aalib acpi alsa amd64 amrnb amrwb apm arts ass atm
audiofile bash-completion bcmath bzip2 cairo calendar cdb cddb cdr cgi
clamav cli cracklib crypt ctype curl curlwrappers cxx dbm dbus dbx dga
djvu dri dssi dts dvd dvdr dvdread encode evo exif expat faac faad
fastcgi fbcon fbcondecor festival ffmpeg fftw firefox flac flatfile
freetds ftp fuse gd gdbm geoip gif gimp ginac git glut gmp gnuplot
gnustep gnutls gsl hal hddtemp htmlhandbook icc iconv icq idn imlib
inifile innodb irc jabber jack javascript jbig jikes jpeg krb4 lame
laptop lash latex ldap leim libcaca libnotify libsamplerate libwww
lm_sensors lua lzo mad maildir matroska matrox mcal mhash mikmod milter
mime mmap mmx mmxext mng modplug modules motiff mozilla mp3 mpeg
mplayer msn mudflap mule multilib musepack musicbrainz mysql mysqli nas
ncurses nforce2 nls nptl nptlonly nsplugin nvidia ogg openal
opencore-amr opengl openmp osc pam pcntl pdf plotutils pmu png posix
pppd prelude profile python qt3support qt4 quicktime radius readline
recode rss rtc samba sasl sdl session sharedmem shorten simplexml skins
slang slp sndfile snmp soap sockets socks5 sox speex spell sqlite
sqlite3 sse sse2 ssl startup-notification svg symlink sysfs syslog
systray sysvipc szip taglib tcl tcpd tetex theora threads tidy tiff
timidity truetype unicode usb vcd vhosts vorbis wavpack wddx webkit
x264 xattr xcb xcomposite xface xine xinerama xml xml-rpc xorg xosd xpm
xsl xvid zeroconf zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem
bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801
hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx
via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix
dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat
linear meter mmap_emul mulaw multi null plug rate route share shm
softvol" APACHE2_MODULES="actions alias auth_basic authn_alias
authn_anon authn_dbm authn_default authn_file authz_dbm authz_default
authz_groupfile authz_host authz_owner authz_user autoindex cache cgi
cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter
file_cache filter headers include info log_config logio mem_cache mime
mime_magic negotiation rewrite setenvif speling status unique_id
userdir usertrack vhost_alias" COLLECTD_PLUGINS="df interface irq load
memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm
earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea
ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf
superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="keyboard mouse
evdev synaptics" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633
glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="ru"
PHP_TARGETS="php5-3" QEMU_SOFTMMU_TARGETS="i386 x86_64"
QEMU_USER_TARGETS="i386 x86_64" RUBY_TARGETS="ruby18" USERLAND="GNU"
VIDEO_CARDS="vesa nouveau" XTABLES_ADDONS="quota2 psd pknock lscan
length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit
sysrq steal rawnat logmark ipmark dhcpmac delude chaos account" Unset:
CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK,
PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS,
PORTAGE_RSYNC_EXTRA_OPTS

I also got jari's emerge --info and I will post it a little bit later. Jari has a phenom II cpu and I was hoping that Vladimir had one too and that it could be the incriminating element. No such luck, although it is still AMD.

@kiwifb
Collaborator
kiwifb commented Jan 24, 2011

Jari's emerge --info

 Portage 2.1.9.35 (default/linux/amd64/10.0, gcc-4.5.2, glibc-2.12.2-r0, 2.6.37-gentoo x86_64)
=================================================================
System uname: Linux-2.6.37-gentoo-x86_64-AMD_Phenom-tm-_II_X4_B55_Processor-with-gentoo-2.0.1
Timestamp of tree: Sun, 23 Jan 2011 12:45:01 +0000
ccache version 3.1.4 [enabled]
app-shells/bash:     4.1_p9
dev-java/java-config: 2.1.11-r3
dev-lang/python:     2.6.6-r1::sage-on-gentoo, 2.7.1, 3.1.3
dev-util/ccache:     3.1.4
dev-util/cmake:      2.8.3-r1
sys-apps/baselayout: 2.0.1-r1
sys-apps/openrc:     0.7.0
sys-apps/sandbox:    2.4
sys-devel/autoconf:  2.13, 2.68
sys-devel/automake:  1.4_p6-r1, 1.9.6-r3, 1.10.3, 1.11.1
sys-devel/binutils:  2.21
sys-devel/gcc:       4.5.2
sys-devel/gcc-config: 1.4.1
sys-devel/libtool:   2.4-r1
sys-devel/make:      3.82
virtual/os-headers:  2.6.36.1 (sys-kernel/linux-headers)
ACCEPT_KEYWORDS="amd64 ~amd64"
ACCEPT_LICENSE="*"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=amdfam10 -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/config /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/splash /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"
CXXFLAGS="-march=amdfam10 -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="assume-digests binpkg-logs ccache distlocks fail-clean fixlafiles fixpackages news parallel-fetch protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"
FFLAGS=""
GENTOO_MIRRORS="http://gentoo.virginmedia.com/ http://de-mirror.org/distro/gentoo/ http://gentoo.tiscali.nl/ http://gentoo.mneisen.org/ http://gentoo.supp.name/"
LANG="en_GB"
LC_ALL="en_GB"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
LINGUAS="en en_GB"
MAKEOPTS="-j5"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/dev/shm"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/var/lib/layman/kde /var/lib/layman/sunrise /var/lib/layman/sage-on-gentoo /var/lib/layman/kuroo /usr/local/overlays"
SYNC="rsync://rsync1.uk.gentoo.org/gentoo-portage"
USE="3dnow 3dnowext X a52 aac acl acpi alsa amd64 amr bash-completion berkdb bzip2 cdaudio cdda cdr cli consolekit cracklib crypt cups cxx dri dvb dvd dvdr dvdread encode ffmpeg firefox flac fortran gdbm gimp glitz gpm gstreamer hal iconv ipv6 jpeg kde lzma mad midi mmx mmxext modules mp3 mp4 mudflap multilib ncurses nls nptl nptlonly nvidia opengl openmp pam pcre pdf perl png pppd python qt qt3 qt4 readline samba semantic-desktop session spell sse sse2 ssl svg sysfs tcpd truetype unicode visualization vorbis xorg xulrunner zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic auth_digest authn_anon authn_dbd authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock dbd deflate dir disk_cache env expires ext_filter file_cache filter headers ident imagemap include info log_config logio mem_cache mime mime_magic negotiation proxy proxy_ajp proxy_balancer proxy_connect proxy_http rewrite setenvif so speling status unique_id userdir usertrack vhost_alias" CAMERAS="canon" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en en_GB" PHP_TARGETS="php5-3" RUBY_TARGETS="ruby18" USERLAND="GNU" VIDEO_CARDS="nvidia" XTABLES_ADDONS="quota2 psd pknock lscan 
length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account" 
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

And he also gave a backtrace that confirms it is the same issue.

@kiwifb
Collaborator
kiwifb commented Jan 24, 2011

Some news from Paulo César Pereira de Andrade from Mandriva:

> But we are fairly stump with select cases of failure:
> https://github.com/cschwan/sage-on-gentoo/issues#issue/40
> If you have any ideas.

  After a quick look, the first thing that come to my mind was,
guess what, pari :-) Here is the specific diff:
http://svn.mandriva.com/cgi-bin/viewvc.cgi/packages/cooker/pari/current/SPECS/pari.spec?r1=560893&r2=595896
I disabled the new tls support, because things can break easily, e.g.
the build itself links with the external pari library if available, and
also disabled runpath because it would also cause several issues.
$ grep -r ENABLE_TLS /usr/include/pari/  ## or sage specific dir
should help if it is the issue.

It looks like an interesting lead. Certainly better than what we have now.

@cschwan
Owner
cschwan commented Jan 24, 2011

I do not think it will solve the issue but you should try it nevertheless:

mv ~/.sage ~/sage-backup

@gagern
Contributor
gagern commented Jan 25, 2011

Adding --disable-tls to the pari ebuild, as Mandriva does, doesn't resolve this issue.

@kiwifb
Collaborator
kiwifb commented Jan 25, 2011

We were unfortunately thinking that would be the case (me and Christopher) because Vladimir started to suffer from it while sage was still using pari-2.3.5. But it was an interesting suggestion considering that pari is linked to almost everything.

@kiwifb
Collaborator
kiwifb commented Jan 25, 2011

OK, so it is driving us nuts. Christopher, when did you re-enable shared libraries for polybori, and could Vladimir have had them in time for sage-4.5.3?

@kiwifb
Collaborator
kiwifb commented Jan 25, 2011

Easier to find than I thought. 21st of August 2010. Plenty of time. So next thing to try is removing polybori shared object again and keep the static ones. polybori is a notable troublemaker.

@kiwifb
Collaborator
kiwifb commented Jan 26, 2011

Martin, you didn't experience the issue before 4.6-r1? In particular not in 4.5.3?
Or you never tested 4.5.3?

@gagern
Contributor
gagern commented Jan 27, 2011

I didn't experience this in 4.5.3. But if I remember correctly, I didn't experience this with 4.6 at first either. So whatever caused this wasn't a simple update of sage, but a modification to some other part of my system, I'd say. Unfortunately I hadn't been using sage very regularly, so I can't give a reasonably narrow set of updates that might have caused this.

@kiwifb
Collaborator
kiwifb commented Jan 27, 2011

Ok so your window of updates between the two is not that narrow. How many lines of /var/log/emerge.log between 4.6 and 4.6-r1 would you say? I may try to see if the other two have something more narrow.

@gagern
Contributor
gagern commented Jan 27, 2011

Never had 4.6 installed. Unfortunately I can't remember when I last successfully used the command line version of sage. I've been using the notebook more often, and sage -n currently gives a python backtrace instead of a segfault, so I'm not sure whether the notebook might have worked even after this bug was already present.

@kiwifb
Collaborator
kiwifb commented Jan 27, 2011

OK, I will try to see what I can get from jari and vladimir. It may be coming from one of the non-mathematical components of sage at the back end. That could explain why a vanilla sage still works, but it may be a part of the main tree rather than the overlay.
Anything interesting in the python backtrace?

@gagern
Contributor
gagern commented Jan 27, 2011

With the modifications from #48 in place, sage-notebook runs without exhibiting this problem here, so my notebook sessions will be no help in narrowing down the time window when this was introduced.

@gagern
Contributor
gagern commented Jan 30, 2011

OK, I had yet another debugging session. Now I think the issue is in fact a glibc problem, and have reported Gentoo bug 353224 about it. I'll copy the majority of its description, mainly because links will format nicer here in github.

The problem occurs when the python runtime loads the _gtk.so module, along with all its dependencies. Some of the dependencies make use of thread local storage: libpixman.so, libEGL.so, libstdc++.so, libnvidia-tls.so and libuuid.so. They are assigned module ids 2 through 6, as 1 is for libc.so. _dl_next_tls_modid is a suitable breakpoint here.

Next the modules are loaded to the global slot database using the function _dl_add_to_slotinfo, one after the other. The global generation counter, GL(dl_tls_generation), stays at value 1 the whole time, as it is only incremented after the whole dl_open call for the complete set of libraries is done (dl-open.c line 458). So all new libraries are marked as belonging to the next generation, generation 2. This information is stored in their slots.

However, some of the libraries need some special kind of tls initialization. For libEGL.so and libnvidia-tls.so, imap->l_need_tls_init in dl-open.c line 428 will evaluate true, causing an immediate call to _dl_update_slotinfo for module ids 3 and 5, intermixed with the _dl_add_to_slotinfo calls.

This is where things go wrong: when running _dl_update_slotinfo for module 3, this function finds that module 3 is in generation 2. It then updates the dtv (the thread-local vector of thread-local data blocks for the modules) with the data for all generation 2 modules that it knows about at that point, i.e. modules 2 and 3. It then marks the dtv to be up to date with generation 2.

This mark becomes incorrect when subsequent calls to _dl_add_to_slotinfo add more slots to generation 2. _dl_update_slotinfo is executed again for module 4, but in this case the check in line 571 of dl-tls.c skips the actual update, as the dtv seems to be up to date already, as judged by its generation counter.

Later on, when some code in module 4 attempts to access its thread local storage at runtime, __tls_get_addr determines in line 758 of dl-tls.c that no update to the dtv is required, as its generation still seems up to date. Therefore an uninitialized part of the dtv will be returned, which will usually be a NULL pointer. That's what's causing the application to SIGSEGV.
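
To make the sequence above easier to follow, here is a toy model of the generation bookkeeping in Python. It is only a sketch of the logic as described in this comment, not glibc code: the class `ToyLoader`, its method names and the `"block-N"` strings are made up for illustration, and the real data structures in dl-tls.c are more involved.

```python
class ToyLoader:
    """Toy model of the TLS generation bookkeeping; not glibc's real structures."""

    def __init__(self):
        self.global_generation = 1   # stands in for GL(dl_tls_generation)
        self.slotinfo = {1: 1}       # module id -> generation of its slot
        self.dtv_generation = 1      # generation the dtv claims to be current for
        self.dtv = {1: "block-1"}    # module id -> TLS block (module 1 is libc)

    def add_to_slotinfo(self, modid):
        # New modules are tagged with the *next* generation.
        self.slotinfo[modid] = self.global_generation + 1

    def update_slotinfo(self, modid):
        wanted = self.slotinfo[modid]
        if self.dtv_generation >= wanted:
            return False             # dtv looks current, so the update is skipped
        for known, generation in self.slotinfo.items():
            if generation <= wanted:
                self.dtv.setdefault(known, "block-%d" % known)
        self.dtv_generation = wanted
        return True

    def dlopen(self, modids, need_tls_init):
        for modid in modids:
            self.add_to_slotinfo(modid)
            if modid in need_tls_init:
                # Immediate update, interleaved with the remaining additions.
                self.update_slotinfo(modid)
        self.global_generation += 1  # bumped only once, after all modules

    def tls_get_addr(self, modid):
        if self.dtv_generation < self.global_generation:
            self.update_slotinfo(modid)
        return self.dtv.get(modid)   # None models the uninitialized entry


buggy = ToyLoader()
buggy.dlopen([2, 3, 4, 5, 6], need_tls_init={3, 5})
print(buggy.tls_get_addr(4))   # None

clean = ToyLoader()
clean.dlopen([2, 3, 4, 5, 6], need_tls_init=set())
print(clean.tls_get_addr(4))   # block-4
```

In this model the early update for module 3 stamps the dtv as generation 2 while modules 4 through 6 are still missing, so every later check considers it current. Without any early update, the first access after the load triggers a full update and all blocks are present.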

I have tried to reproduce this in a small demo setup, but so far I couldn't get my own code to reproduce the case where l_need_tls_init is set. The corresponding code is line 111 in dl-reloc.c. Comments in functions leading up to this indicate that it has something to do with discouraged practices, so I'm not sure how to willfully enter that branch.

@gagern
Contributor
gagern commented Jan 30, 2011

Finally solved this by patching glibc. Also wrote a report for this in the glibc issue tracker.

As far as I'm concerned, we can close this as it's not a sage bug.

@kiwifb
Collaborator
kiwifb commented Jan 30, 2011

Do you have an ebuild with that fix? We could put it on the overlay (with a package.mask) with instructions for people who get the failure.

@gagern
Contributor
gagern commented Jan 30, 2011

Don't have an ebuild, and I doubt this would work: package.mask is per atom, not restricted to the overlay, right? So we'd end up masking our own ebuild, but at the same time, also mask some future ebuild that portage might soon provide. And if we choose our revision number so high that a future collision is unlikely, people might not "downgrade" when main portage has a newer glibc with perhaps important security fixes.

So I'd give people instructions without an ebuild:

ebuild $(equery w sys-libs/glibc) clean unpack
wget -O- 'https://bugs.gentoo.org/attachment.cgi?id=261104' | patch /var/tmp/portage/sys-libs/glibc-*/work/glibc-*/elf/dl-open.c
ebuild $(equery w sys-libs/glibc) merge
@kiwifb
Collaborator
kiwifb commented Jan 31, 2011

I think some of the latest portage-2.2_alpha versions support overlay masking now.
But requiring people to use that in the first place is too much to ask.
I'll point Jari and Vladimir to your instructions so that they can have a go.
It is best to leave the issue open until the fix is in the tree.
At least that's the way I like to proceed.

@kiwifb
Collaborator
kiwifb commented Feb 13, 2011

Fun, I got bit this morning on my intel 64 box!

#0  0x00007fffea52b364 in __cxa_allocate_exception ()
   from /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.2/libstdc++.so.6
#1  0x00007fffcb2978a6 in GiNaC::function::find_function(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int) () from /usr/lib/libpynac-0.2.so.1
#2  0x00007fffcaa9d4f5 in ?? () from /usr/lib64/python2.6/site-packages/sage/symbolic/function.so
#3  0x00007fffcaaa6066 in ?? () from /usr/lib64/python2.6/site-packages/sage/symbolic/function.so
#4  0x00007ffff7ac7dec in ?? () from /usr/lib/libpython2.6.so.1.0
#5  0x00007ffff7a7bb53 in PyObject_Call () from /usr/lib/libpython2.6.so.1.0
@kiwifb
Collaborator
kiwifb commented Feb 13, 2011

Fix worked, so that's definitely it. Not sure what triggered it: I made a bunch of xorg upgrades last night and a few python things like dbus-python.

@kiwifb
Collaborator
kiwifb commented Feb 20, 2011

Now seen on x86. It coincides with an X11 upgrade again, and possibly also linux-headers.

@drkirkby
drkirkby commented Apr 3, 2011

I'm a bit suspicious this is the problem. I'm seeing a very similar issue on Sage sage-4.7.alpha3 when building a 64-bit version of it on OpenSolaris, where I'm 99% sure the Sun C library will be used and not the GNU C library. See:

http://trac.sagemath.org/sage_trac/ticket/11116

Note, I do not see this issue with a 32-bit build on OpenSolaris on the same machine. Note I also use the Sun linker, not the GNU one.

Burcin, who wrote the code, thinks he has found the problem:

------- Comment from Burcin ---
It seems that cones.py looks for posets.py, which needs the graphs module, which initializes the graph_editor. The graph editor tries to see if it's in the notebook or the command line, but sagenb imports SR and Expression from sage.symbolic.all (line 563 of sagenb/misc/support.py). This tries to initialize the functions (integrate in this case) before pynac is initialized...

We need a better solution for making sure modules are initialized properly before anything is imported from them. I thought putting an `__init__.py` file in sage/symbolic/ with "import pynac" would solve the problem. However, it seems that python just ignores that file.
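
For reference, and assuming the file meant above is the package's `__init__.py`: in plain CPython a package's `__init__.py` runs before any of its submodules is imported, which is the behaviour the proposed fix relies on. The sketch below only demonstrates that ordering on a throwaway package. The package name `toy_symbolic`, the helper `import_order_demo` and the `sys._toy_order` hook are made up for illustration, and it says nothing about why the file appeared to be ignored in Sage.

```python
import importlib
import os
import sys
import tempfile


def import_order_demo():
    """Create a throwaway package and return the order in which its files run."""
    order = []
    sys._toy_order = order  # hypothetical hook so the generated files can report in
    with tempfile.TemporaryDirectory() as root:
        package_dir = os.path.join(root, "toy_symbolic")  # made-up package name
        os.mkdir(package_dir)
        sources = {
            "__init__.py": "import sys\nsys._toy_order.append('__init__')\n",
            "function.py": "import sys\nsys._toy_order.append('function')\n",
        }
        for filename, text in sources.items():
            with open(os.path.join(package_dir, filename), "w") as handle:
                handle.write(text)
        sys.path.insert(0, root)
        importlib.invalidate_caches()
        try:
            # Importing the submodule first forces the package itself to load.
            importlib.import_module("toy_symbolic.function")
        finally:
            sys.path.remove(root)
            sys.modules.pop("toy_symbolic.function", None)
            sys.modules.pop("toy_symbolic", None)
            del sys._toy_order
    return order


print(import_order_demo())  # ['__init__', 'function']
```

So an initialization import placed in the package's `__init__.py` would normally run first. If it seems to be skipped, the file name or the way the package is installed may be worth checking.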

@gagern
Contributor
gagern commented Apr 3, 2011

The exception throwing mechanism should not cause a segmentation fault. The reason why it does is addressed by the descriptions above: the problem can be reproduced independently from sage, and sage does work when the issue is fixed. However, there might be something wrong in sage causing the exception to be thrown in the first place. I guess that's what Burcin is aiming for. Fixing that would make the issue vanish as well. Doesn't mean glibc shouldn't be fixed, though. And I'd say the Sun C library might be broken in some (probably different) way as well, and might require a fix just the same. Not sure without the code, though.

@kiwifb
Collaborator
kiwifb commented Jul 7, 2011

Newsflash: with glibc-2.12.2 and pynac-0.2.3 (sage-4.7.1_alpha4) I get to the sage prompt without a segfault on an amd64 machine, without patching glibc myself. Not sure if the patch itself finally made it in or if the newer pynac solved the problem.

@gagern
Contributor
gagern commented Dec 31, 2011

As the upstream bug reports in Gentoo and glibc have both been closed as fixed, and 2.13-r4 is stable and should include the fix, let's close this here.

As kiwifb wrote: "It is best to leave the issue open until the fix is in the tree." Now it is.

@gagern gagern closed this Dec 31, 2011