New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Troubles with GS register #607
Comments
Hi @gweodoo! Thanks again for continuing to work on this. It's been a while since I looked at this part of the code, so it'll take However, I do have one question: why did we not see this issue earlier? In |
To be short, we discovered it because we need to use GS as a per-CPU register. We implement a user-level thread scheduler relying over POSIX threads. In such case, we can potentially have multiple user-level threads scheduled on the same Unix thread. The FS register is used by the NPTL to store POSIX thread storage while the GS is used to provide the "per user-level" storage (a "nested one"). One typical example is a simple MPI program setting its rank as a global variable. As this data changes at MPI task level, the storage model for this variable has to change too. Thus, we reproduce, somehow, what the libc does with the TLS mechanism at our scale. For 64bit mode, GS register is not used, whereas in 32bit mode, DMTCP uses a macro For a proper fix, would not be better to replace these explicit field with a "JIT-determined" type, depending on the architecture ? Something like: typedef enum { FS, GS, NB } reg_type_t;
#ifdef __i386__
#define TLS_TYPE_T struct user_desc
#define tls_get_thread_area(st, which) get_thread_area(st)
#define tls_set_thread_area(st, which) set_thread_area(st)
#elif __x86_64__
#define TLS_TYPE_T unsigned long
#define tls_get_thread_area(st, which) arch_prctl(ARCH_GET_#which, st)
#define tls_set_thread_area(st, which) arch_prctl(ARCH_SET_#which, st)
#elif __arm__
//...
#endif
TLS_TYPE_T reg_arrays[NB]; and then, only referring a register by calling |
Hmmm... Before I try to come up with one, do you have a simple program that I can try locally (to reproduce this issue and poke around)? (I don't think I need access to your entire stack; just a program that simulates this bug by using the GS register?) |
Sure, the following code should reproduce the pattern: create multiple threads bound to the same core, then each thread sets its own buffer for the GS register: #define _GNU_SOURCE
#include <pthread.h>
#include <asm/prctl.h>
#include <sys/prctl.h>
#include <assert.h>
#include <stdio.h>
#include <unistd.h>
#ifndef NB_TH
#define NB_TH 4
#endif
void * func (void * arg)
{
int id = *(int*)arg;
void* buf = &id; /* this whill be our GS buffer */
cpu_set_t set;
/* check the effective affinity */
CPU_ZERO(&set);
pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
assert(CPU_ISSET(0, &set));
assert(CPU_COUNT(&set) == 1);
/* set the GS register for the current thread */
arch_prctl(ARCH_SET_GS, (unsigned long)buf);
fprintf(stderr, "PRE: Th#%d - buf = %d (%p)\n", id, *(int*)buf, buf );
/* Do the checkpoint here */
sleep(4);
/* check again the affinity (need fix PR#606) */
/* But the bug should be reproductible without this check */
CPU_ZERO(&set);
pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
assert(CPU_ISSET(0, &set));
assert(CPU_COUNT(&set) == 1);
/* Get the GS value back */
arch_prctl(ARCH_GET_GS, (unsigned long*)&buf);
fprintf(stderr, "POST: Th#%d - buf = %d (%p)\n", id, *(int*)buf, buf );
assert(buf == &id);
return NULL;
}
int main(int argc, char *argv[])
{
pthread_t th[NB_TH];
int i;
cpu_set_t set;
/* attach all threads to 0, creating race over GS register */
for(i = 0; i < NB_TH; ++i)
{
pthread_attr_t attr;
CPU_ZERO(&set);
CPU_SET(0, &set);
pthread_attr_init(&attr);
pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &set);
pthread_create(th+i, &attr, func, &i);
}
for (i = 0; i < NB_TH; ++i)
{
pthread_join(th[i], NULL);
}
return 0;
} |
@gweodoo: Thanks for sharing the code. Please go ahead and generate a pull request with your patch. I will approve and merge it. Also, we should add your test case to |
Do you want a PR with the workaround or the more generic solution ? Because the first one is implemented, the second one was just prospective. Maybe, as a first step, the hot fix could be used to fix GS specifically. Yes absolutely, feel free to use this test for your validation system. The affinity is not mandatory but it was allowing me to check that the checkpoint was done in a per-thread basis (=each thread effectively saves its own value) and not a per-CPU (someone iterating over CPUs to back registers up). But indeed, to demonstrate the issue described here, the thread pinning is not necessary. Note: as |
Hi guys,
I'm still working on DMTCP integration as a C/R system for a custom MPI implementation and I'm focused to the DMTCP TLS mechanism.
TL;DR
The GS register does not seem to be handled in a per-thread storage, leading to inconsistent values at restart time.
Explanation
I saw that TLS save/restore are handled through a
ThreadTLSInfo
, located in a per-thread structure, andTLSInfo_RestoreTLSState()/TLSInfo_SaveTLSState()
are responsible of updating thistlsInfo
.%fs
register seems to be stored in a field namedgdtentrytls
, which is a 1-entry array:This field is updated by
tls_get_thread_area()/tls_set_thread_area()
, the "internals" thatTLSInfo_*
functions are wrapping. Why%gs
is not handled accordingly ? It seems that this register is stored in a per-process storage:myinfo_gs
. However,%gs
being a per-CPU register too, it should be stored the same way as%fs
. The fact is that we use GS (supposed unused in 64-bit mode) to reproduce a TLS mechanism for our own user-level thread model (like the NPTL does with FS).Is there any reason for FS and GS to be handled differently ? According to me, these registers could just be saved/restored through
arch_prctl
calls, manipulating 64-bit addresses. Then, DMTCP just needs to update PID/TID in the TCB accordingl (at least for x86_64).Furthermore, I saw that the
struct user_desc
was used to store the FS register in thebase_addr
member (probably to transparently switch betweenarch_prctl()
and{g,s}et_thread_area()
depending on the architecture). But, on 64-bit systems, this member is anint
, to stay compliant with thelong
definition in 32-bit mode. This leadsarch_prctl
to erase the next struct member, namedlimit
. As these two fields are depicted as "not used" for 64-bit mode, it is not a problem but just to say (actually a bit confusing when debugging).This post is based on my knowledge of TLS mechanism for x86_64 architecture and complexity introduced by DMTCP should be related to the need to support more than one (like ARM). So, maybe it is non-trivial to do it.
For now, my work-around (reachable at gweodoo/dmtcp@07b4957) consists in replacing the memeber with a 2-entry array, to store both the register states and an update of two "internals" accordingly (still on the 2.5 branch). But i think a proper solution would be better.
Let me know if some parts of my post are unclear, I will be pleased to provide some examples.
Thanks for reading and your help.
The text was updated successfully, but these errors were encountered: